Site Scraping Framework
Project description
Grab is a Python site scraping framework. It provides a powerful interface to two libraries: lxml and pycurl. There are two ways to use Grab:
1) Use Grab to configure network requests and to process fetched documents. In this mode you control the flow of your program manually.
2) Use Grab::Spider to build asynchronous site scrapers. This is how Scrapy works.
Example of Grab usage:
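    from grab import Grab

    g = Grab()
    # Load the login page, fill in the form fields and submit the form.
    g.go('https://github.com/login')
    g.set_input('login', 'lorien')
    g.set_input('password', '***')
    g.submit()
    # Print the text and link target of every repository in the listing.
    for elem in g.doc.select('//ul[@id="repo_listing"]/li/a'):
        print('%s: %s' % (elem.text(), elem.attr('href')))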
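The XPath expression in the example above picks every link inside the list whose id is "repo_listing". The sketch below shows the same kind of selection on a small, invented piece of markup. It is only an illustration: it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) as a stand-in for Grab's lxml-based g.doc.select, and the PAGE content and the list_links helper are hypothetical names, not part of Grab.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
PAGE = """
<html><body>
  <ul id="repo_listing">
    <li><a href="/lorien/grab">grab</a></li>
    <li><a href="/lorien/weblib">weblib</a></li>
  </ul>
  <ul id="other"><li><a href="/ignored">ignored</a></li></ul>
</body></html>
""".strip()


def list_links(markup, list_id):
    """Return (text, href) pairs for links inside the <ul> with the given id."""
    root = ET.fromstring(markup)
    # ElementTree paths are relative to the root element, hence the leading dot.
    path = ".//ul[@id='%s']/li/a" % list_id
    return [(a.text, a.get('href')) for a in root.findall(path)]


for text, href in list_links(PAGE, 'repo_listing'):
    print('%s: %s' % (text, href))
```

Real pages are rarely well-formed XML, which is one reason Grab relies on lxml's more forgiving HTML parser rather than a strict XML parser like the one used here.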
Example of Grab::Spider usage:
    from grab.spider import Spider, Task
    import logging


    class ExampleSpider(Spider):
        def task_generator(self):
            # Produce one 'search' task per language.
            for lang in ('python', 'ruby', 'perl'):
                url = 'https://www.google.com/search?q=%s' % lang
                yield Task('search', url=url)

        def task_search(self, grab, task):
            # Handles the result of every task named 'search'.
            print(grab.doc.select('//div[@class="s"]//cite').text())


    logging.basicConfig(level=logging.DEBUG)
    bot = ExampleSpider()
    bot.run()
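In the spider example, a task created with the name 'search' ends up in the method called task_search. The sketch below illustrates that naming convention with a toy dispatcher. It is not Grab's implementation: MiniSpider and this Task class are hypothetical stand-ins, the loop is synchronous, and nothing is fetched over the network, so the handler receives only the task.

```python
from collections import deque


class Task(object):
    """Illustrative named unit of work; not Grab's Task class."""

    def __init__(self, name, url):
        self.name = name
        self.url = url


class MiniSpider(object):
    """Toy dispatcher: routes each task to the method named 'task_<name>'."""

    def task_generator(self):
        return iter(())

    def run(self):
        queue = deque(self.task_generator())
        while queue:
            task = queue.popleft()
            handler = getattr(self, 'task_' + task.name)
            # A real spider would download task.url before calling the handler.
            result = handler(task)
            if result is not None:
                # Handlers may yield follow-up tasks.
                queue.extend(result)


class ExampleSpider(MiniSpider):
    def __init__(self):
        self.seen = []

    def task_generator(self):
        for lang in ('python', 'ruby', 'perl'):
            yield Task('search', url='https://www.google.com/search?q=%s' % lang)

    def task_search(self, task):
        self.seen.append(task.url)


bot = ExampleSpider()
bot.run()
print(bot.seen)
```

Looking handlers up by name keeps the scraper declarative: adding a new kind of page means adding one method rather than editing a central routing table.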
Installation
Pip is the recommended way to install Grab and its dependencies:
    $ pip install lxml
    $ pip install pycurl
    $ pip install grab
Documentation
Russian docs: http://docs.grablib.org
English docs are in progress.
Discussion group (Russian or English): http://groups.google.com/group/python-grab/
Contribution
If you find a bug or want a new feature, please create a new issue on GitHub.
Source Distribution
File details
Details for the file grab-0.4.13.tar.gz.
File metadata
- Download URL: grab-0.4.13.tar.gz
- Upload date:
- Size: 149.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 361e3b274e595a35e6f26761d994357cd063c3e8d47e0aa2f31e6961da342e03
MD5 | 0ae887d8dd7fe16183d34cd385b2c3ad
BLAKE2b-256 | ddcbedd06ab1415bf67942e088b345ca904c04f715c86f5a296f0621c5ae3791
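The digests above can be used to check a downloaded archive. The following is a minimal sketch using only the standard library's hashlib. The helper names sha256_of_file and matches_published_digest are hypothetical, and the sketch assumes the archive has already been saved locally.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        # Read fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_digest('grab-0.4.13.tar.gz', '361e3b274e595a35e6f26761d994357cd063c3e8d47e0aa2f31e6961da342e03') should return True for an intact copy of the archive, assuming it was saved under that name in the current directory.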