Scrapy middleware which lets you crawl only new content
Project description
scrapy-crawl-once
This package provides a Scrapy middleware which avoids re-crawling pages that were already downloaded in previous crawls.
License is MIT.
Installation
pip install scrapy-crawl-once
Usage
To enable it, modify your settings.py:
    SPIDER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
        # ...
    }

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
        # ...
    }
By default it does nothing. To avoid crawling a particular page multiple times, set request.meta['crawl_once'] = True. When a response is received and the callback succeeds, the fingerprint of the request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is already in the database, and drops the request if it is (see the example after the list below).
Other request.meta keys:
crawl_once_value - a value to store in the database; by default, a timestamp is stored.
crawl_once_key - a unique request id; by default, request_fingerprint is used.
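A minimal sketch of how these meta keys can be used in a spider; the spider name, URLs, and CSS selectors below are hypothetical:

```python
import scrapy


class NewsSpider(scrapy.Spider):
    """Hypothetical spider used only to illustrate the crawl_once meta keys."""
    name = 'news'
    start_urls = ['https://example.com/archive']

    def parse(self, response):
        for href in response.css('a.article::attr(href)').getall():
            yield response.follow(
                href,
                callback=self.parse_article,
                meta={
                    'crawl_once': True,            # skip this URL in future runs
                    # Optional overrides:
                    # 'crawl_once_key': href,      # custom unique id instead of the request fingerprint
                    # 'crawl_once_value': '2017',  # custom value stored instead of a timestamp
                },
            )

    def parse_article(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```

Note that the start_urls requests themselves are not marked, so the archive page is re-fetched on every run while individual articles are fetched only once.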
Settings
CRAWL_ONCE_ENABLED - set it to False to disable the middleware. Default is True.
CRAWL_ONCE_PATH - a path to the folder with the databases of crawled requests. By default, .scrapy/crawl_once/ inside the project directory is used; this folder contains <spider_name>.sqlite files with the databases of seen requests.
CRAWL_ONCE_DEFAULT - the default value of the crawl_once meta key (False by default). When True, all requests are handled by this middleware unless explicitly disabled with request.meta['crawl_once'] = False (see the sketch below).
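A sketch of a settings.py that enables the middleware for all requests and stores the databases in a custom location; the path is only an illustration:

```python
# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

CRAWL_ONCE_ENABLED = True             # default; set to False to turn the middleware off
CRAWL_ONCE_PATH = '/data/crawl_once'  # hypothetical path; default is .scrapy/crawl_once/ inside the project dir
CRAWL_ONCE_DEFAULT = True             # treat every request as crawl_once=True unless set to False in request.meta
```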
Alternatives
https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package which does almost the same thing. Differences:
scrapy-deltafetch chooses whether to discard a request based on yielded items; scrapy-crawl-once uses an explicit request.meta['crawl_once'] flag.
scrapy-deltafetch uses bsddb3, while scrapy-crawl-once uses SQLite.
Another alternative is the built-in Scrapy HTTP cache (see the sketch after this list). Differences:
the Scrapy cache stores all pages on disk, while scrapy-crawl-once only keeps request fingerprints;
the Scrapy cache allows more fine-grained invalidation, consistent with how browsers work;
with the Scrapy cache, all pages are still processed (though not all pages are downloaded).
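For comparison, the built-in cache is enabled through Scrapy's own HTTPCACHE_* settings rather than through this middleware; a minimal sketch:

```python
# settings.py -- Scrapy's built-in HTTP cache, shown only for comparison
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # stored under the project's .scrapy/ directory by default
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire
```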
Contributing
source code: https://github.com/TeamHG-Memex/scrapy-crawl-once
bug tracker: https://github.com/TeamHG-Memex/scrapy-crawl-once/issues
To run tests, install tox and run tox from the source checkout.
CHANGES
0.1.1 (2017-03-04)
new 'crawl_once/initial' value in Scrapy stats; it contains the initial size (number of records) of the crawl_once database.
0.1 (2017-03-03)
Initial release.
Download files
File details
Details for the file scrapy-crawl-once-0.1.1.tar.gz.
File metadata
- Download URL: scrapy-crawl-once-0.1.1.tar.gz
- Upload date:
- Size: 5.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
--- | ---
SHA256 | 8ab832ab5c4073ba2aa498a8c6bb2a792117ecd7deadca41d7fd1cdee534caf4
MD5 | d3a838f8f0f01ab8555cc54e4d060d09
BLAKE2b-256 | fce3e196d13482add6f506976e92fca549f3bfdeb5a015a5dc5146cfacd30d32
File details
Details for the file scrapy_crawl_once-0.1.1-py2.py3-none-any.whl.
File metadata
- Download URL: scrapy_crawl_once-0.1.1-py2.py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
--- | ---
SHA256 | 60ea4e7529f99ad1ec6cacbad53828fbfa5959cc4dddfe8047557e7c189e920c
MD5 | 7cbd808e48d307faf08a88e23e87d7b3
BLAKE2b-256 | 49978684f7a85d6be3a52f50cce2411eaaaf6c4e0d6c1598fa7b4e99578ba2cb