Site Scraping Framework

Grab is a Python site scraping framework. It provides many helpful methods for scraping web sites and working with the scraped content:

  • Automatic cookies (session) support

  • HTTP and SOCKS proxy with and without authorization

  • Keep-Alive support

  • IDN support

  • Tools to work with web forms

  • Easy multipart file uploading

  • Flexible customization of HTTP requests

  • Automatic charset detection

  • Powerful API for extracting data from HTML documents with XPath queries

  • Asynchronous API for making thousands of simultaneous queries. This part of the library is called Spider, and it is too big to list all of its features in this README.

  • Python 3 ready

  • And much, much more

  • Grab was written by a developer who has been doing site scraping since 2005
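
The proxy support listed above could be configured roughly as follows. This is a minimal sketch, not the library's reference usage: proxy_options and make_grab are hypothetical helper names, and the option keys (proxy, proxy_type, proxy_userpwd) are assumed to match the options accepted by Grab's setup() method, so check them against the version you have installed.

```python
def proxy_options(host, port, proxy_type="http", user=None, password=None):
    """Build keyword options describing one proxy (hypothetical helper).

    Assumption: the keys below match the options accepted by Grab.setup().
    """
    if proxy_type not in ("http", "socks4", "socks5"):
        raise ValueError("unsupported proxy type: %r" % proxy_type)
    options = {"proxy": "%s:%d" % (host, port), "proxy_type": proxy_type}
    # Credentials are optional; they are passed as a "user:password" string.
    if user is not None:
        options["proxy_userpwd"] = "%s:%s" % (user, password or "")
    return options


def make_grab(options):
    """Create a Grab instance with the given options (requires Grab)."""
    from grab import Grab  # imported lazily so proxy_options works without Grab

    g = Grab()
    g.setup(**options)
    return g
```

With Grab installed, make_grab(proxy_options("127.0.0.1", 1080, "socks5")) would return an instance whose requests go through that proxy, provided the option names above are correct for your version.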

Check out the docs: https://github.com/lorien/grab/tree/master/docs2/source

As of Sep 2013, I am working hard to complete the English documentation.

Example of Grab usage:

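from grab import Grab

g = Grab()
# Open the login page, fill in the form fields, and submit the form.
g.go('https://github.com/login')
g.set_input('login', 'lorien')
g.set_input('password', '***')
g.submit()
# Print the text and link target of each repository in the listing.
for elem in g.doc.select('//ul[@id="repo_listing"]/li/a'):
    print('%s: %s' % (elem.text(), elem.attr('href')))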
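
To see what the XPath query in the example above selects without logging in anywhere, it can be tried offline. The sketch below does not use Grab at all: it relies on the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a fetched page.
SAMPLE = """
<html><body>
  <ul id="repo_listing">
    <li><a href="/lorien/grab">grab</a></li>
    <li><a href="/lorien/example">example</a></li>
  </ul>
</body></html>
"""


def list_links(markup):
    """Return (text, href) pairs for links inside the repo_listing list."""
    root = ET.fromstring(markup.strip())
    # ElementTree requires a leading "." for a search over all descendants.
    links = root.findall(".//ul[@id='repo_listing']/li/a")
    return [(a.text, a.get("href")) for a in links]


for text, href in list_links(SAMPLE):
    print("%s: %s" % (text, href))
```

Real pages are rarely well-formed XML, which is one reason a lenient HTML parser such as lxml is used for actual scraping.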

Example of Grab::Spider usage:

from grab.spider import Spider, Task
import logging

class ExampleSpider(Spider):
    def task_generator(self):
        for lang in ('python', 'ruby', 'perl'):
            url = 'https://www.google.com/search?q=%s' % lang
            yield Task('search', url=url)

    def task_search(self, grab, task):
        # Handles the response for each task named 'search'.
        print(grab.doc.select('//div[@class="s"]//cite').text())


logging.basicConfig(level=logging.DEBUG)
bot = ExampleSpider()
bot.run()
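
The task generator above builds URLs by plain string interpolation, which works for single words such as 'python' but not for queries containing spaces or reserved characters. A minimal sketch of building the same kind of URL with proper escaping follows; it uses only the Python 3 standard library, and search_url is a hypothetical helper name, not part of Grab.

```python
from urllib.parse import urlencode


def search_url(query, base="https://www.google.com/search"):
    """Return a search URL with the query safely percent-encoded."""
    return "%s?%s" % (base, urlencode({"q": query}))


for term in ("python", "site scraping", "c++"):
    print(search_url(term))
```

Inside task_generator, the result of search_url(lang) could be passed as the url argument of Task in place of the interpolated string.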

Installation

Pip is the recommended way to install Grab and its dependencies:

$ pip install lxml
$ pip install pycurl
$ pip install grab

See the installation guide for details: https://github.com/lorien/grab/blob/master/docs2/source/grab_installation.rst
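
After running the pip commands above, a quick way to confirm that the packages can be found by the interpreter is sketched below. It uses only the Python 3 standard library; missing_modules is a hypothetical helper name, and the check assumes each package is imported under the same name it is installed with.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["lxml", "pycurl", "grab"])
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("Grab and its dependencies are importable.")
```

This only checks that the modules can be located; it does not verify that pycurl or lxml were built against working system libraries.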

Documentation

Russian docs: http://docs.grablib.org

English docs in progress: https://github.com/lorien/grab/tree/master/docs2/source

Mailing List (Ru/En languages): http://groups.google.com/group/python-grab/

Contribution

If you have found a bug or want a new feature, please open a new issue on GitHub.

This version

0.5.0

Download files

Source Distribution

grab-0.5.0.tar.gz (160.9 kB)

File details

Details for the file grab-0.5.0.tar.gz.

File metadata

  • Download URL: grab-0.5.0.tar.gz
  • Upload date:
  • Size: 160.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for grab-0.5.0.tar.gz

Algorithm    Hash digest
SHA256       0fe91bcbc62520ee490c1327c6ce47e753d0123c2df0ec67558294cbb0475942
MD5          6d1eb36b9f029f7280dc6e7eb70444a2
BLAKE2b-256  772cce2cbb29752ba13490dc24e9c0e392d37c59235b70454ea62835bd2649e4
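
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published for grab-0.5.0.tar.gz (copied from the table above).
EXPECTED_SHA256 = "0fe91bcbc62520ee490c1327c6ce47e753d0123c2df0ec67558294cbb0475942"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("grab-0.5.0.tar.gz") should return True for an intact download saved in the current directory.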
