a library for scraping things
Project description
A Python library for scraping things.
Features include:
- HTTP, HTTPS, and FTP requests via an identical API
- HTTP caching, compression, and cookies
- redirect following
- request throttling
- robots.txt compliance (optional)
- robust error handling
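To make the request throttling feature concrete, here is a minimal standard-library sketch of spacing requests to a per-minute budget. It is not scrapelib's implementation: the `Throttle` class, its `wait` method, and the injectable `clock` and `sleep` parameters are all hypothetical names chosen for illustration.

```python
import time


class Throttle:
    """Hypothetical helper: allow at most `requests_per_minute` calls per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be positive")
        # Minimum number of seconds that must separate two requests.
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until the minimum interval since the previous request has passed."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

Under these assumptions, a caller would invoke `wait()` immediately before each request; with `requests_per_minute=10`, consecutive requests end up at least 6 seconds apart.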
scrapelib is a project of Sunlight Labs (c) 2011. All code is released under a BSD-style license; see LICENSE for details.
Written by Michael Stephens <mstephens@sunlightfoundation.com> and James Turk <jturk@sunlightfoundation.com>.
Requirements
- Python >= 2.6
- httplib2 (optional, but highly recommended)
Installation
scrapelib is available on PyPI and can be installed with pip:

    pip install scrapelib
PyPI package: http://pypi.python.org/pypi/scrapelib
Source: http://github.com/sunlightlabs/scrapelib
Documentation: http://scrapelib.readthedocs.org/en/latest/
Example Usage
    import scrapelib

    s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True, follow_robots=True)

    # Grab Google front page
    s.urlopen('http://google.com')

    # Will raise RobotExclusionError
    s.urlopen('http://google.com/search')

    # Will be throttled to 10 HTTP requests per minute
    while True:
        s.urlopen('http://example.com')
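The kind of robots.txt check that leads to a refused request can be sketched with Python 3's standard `urllib.robotparser` module. This is only an illustration of the idea, not how scrapelib performs the check: the robots.txt content below is hypothetical, and `build_parser` and `is_allowed` are made-up helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "http://example.com/"))        # allowed by the rules above
print(is_allowed(parser, "http://example.com/search"))  # disallowed by the rules above
```

A scraper that honors robots.txt would run a check like this before each request and raise an error, rather than fetch the page, when the result is `False`.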
Download files
Source Distribution
File details
Details for the file scrapelib-0.5.5.tar.gz.
File metadata
- Download URL: scrapelib-0.5.5.tar.gz
- Upload date:
- Size: 11.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7ae539017454fde408c15a3a1e7d18a9ecdb60eb378561b29eca9d649bf5ab71
MD5 | 33cb5578bd6545b6b0f4dddf230fd7a8
BLAKE2b-256 | 6fdbae224b99199d32cae8b58c227dd09e0beab91f00c4c3f7e6d827d90d8407
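A downloaded archive can be checked against the published SHA256 digest using Python's standard `hashlib` module. The sketch below assumes the file has already been saved locally; `sha256_of_file` and `verify_download` are hypothetical helper names, and the local path passed to them is up to the caller.

```python
import hashlib

# SHA256 digest listed above for scrapelib-0.5.5.tar.gz.
EXPECTED_SHA256 = "7ae539017454fde408c15a3a1e7d18a9ecdb60eb378561b29eca9d649bf5ab71"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `verify_download("scrapelib-0.5.5.tar.gz")` returns `True` only if the local file's digest matches the published value.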