Scrapy middleware which allows crawling only new content

scrapy-crawl-once

This package provides a Scrapy middleware that avoids re-crawling pages which were already downloaded in previous crawls.

License is MIT.

Installation

pip install scrapy-crawl-once

Usage

To enable it, modify your settings.py:

SPIDER_MIDDLEWARES = {
    # ...
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    # ...
}

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
    # ...
}

By default it does nothing. To avoid crawling a particular page multiple times, set request.meta['crawl_once'] = True. When a response is received and the callback succeeds, the fingerprint of that request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is already in the database and drops the request if it is.
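The store-then-check behaviour described above can be sketched with the standard library alone. This is an illustration of the idea, not the package's actual code: the `SeenRequests` class, its method names, the table layout, and the simplified fingerprint (a SHA-1 of method and URL) are all assumptions made for the example.

```python
import hashlib
import sqlite3
import time


class SeenRequests:
    """Hypothetical stand-in for the middleware's database of seen requests."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def fingerprint(method, url):
        # Simplified assumption: hash only the method and the raw URL.
        return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()

    def should_drop(self, method, url, crawl_once=False):
        """Return True if an opted-in request was already crawled."""
        if not crawl_once:
            return False  # requests without the flag are never dropped
        key = self.fingerprint(method, url)
        row = self.db.execute(
            "SELECT 1 FROM seen WHERE key = ?", (key,)
        ).fetchone()
        return row is not None

    def mark_done(self, method, url, crawl_once=False, value=None):
        """Record an opted-in request after its callback succeeded."""
        if not crawl_once:
            return
        key = self.fingerprint(method, url)
        # Store a timestamp unless the caller supplies its own value.
        value = str(time.time()) if value is None else value
        self.db.execute(
            "INSERT OR REPLACE INTO seen (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.db.commit()
```

In this sketch a request is recorded only through `mark_done`, which mirrors the rule that a fingerprint is stored after a successful callback rather than when the request is scheduled.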

Other request.meta keys:

  • crawl_once_value - a value to store in the DB. By default, a timestamp is stored.

  • crawl_once_key - a unique id for the request; by default request_fingerprint is used.
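As a rough illustration of how the three meta keys fit together, the helper below builds a meta dict for a request. The function name `crawl_once_meta` and the sample key and value are hypothetical; only the meta key names come from the text above.

```python
def crawl_once_meta(key=None, value=None):
    """Build a request.meta dict that opts a request into the middleware."""
    meta = {"crawl_once": True}
    if key is not None:
        # Used as the request's unique id instead of the default fingerprint.
        meta["crawl_once_key"] = key
    if value is not None:
        # Stored in the DB instead of the default timestamp.
        meta["crawl_once_value"] = value
    return meta


# Hypothetical example: identify a page by a stable id rather than its URL.
meta = crawl_once_meta(key="product-1234", value="v1")
```

The resulting dict would be passed to a request in the usual Scrapy way, for example `scrapy.Request(url, meta=meta)`. A custom key may be useful when the same page is reachable under several URLs.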

Settings

  • CRAWL_ONCE_ENABLED - set it to False to disable the middleware. Default is True.

  • CRAWL_ONCE_PATH - path to the folder that holds the database of crawled requests. By default the .scrapy/crawl_once/ path inside the project dir is used; this folder contains <spider_name>.sqlite files with databases of seen requests.

  • CRAWL_ONCE_DEFAULT - default value for crawl_once meta key (False by default). When True, all requests are handled by this middleware unless disabled explicitly using request.meta['crawl_once'] = False.
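For instance, a settings.py fragment that applies the middleware to every request could look like the sketch below. The setting names are the ones listed above; the folder path is a made-up example.

```python
# settings.py (fragment)

# Keep the middleware active (this is already the default).
CRAWL_ONCE_ENABLED = True

# Hypothetical folder for the <spider_name>.sqlite files; omit this setting
# to use the default .scrapy/crawl_once/ path inside the project dir.
CRAWL_ONCE_PATH = "/var/data/crawl_once"

# Treat every request as if it had request.meta['crawl_once'] = True.
CRAWL_ONCE_DEFAULT = True
```

With CRAWL_ONCE_DEFAULT set to True, individual requests can still opt out by setting request.meta['crawl_once'] = False.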

Alternatives

https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package; it does almost the same thing. Differences:

  • scrapy-deltafetch chooses whether to discard a request or not based on yielded items; scrapy-crawl-once uses an explicit request.meta['crawl_once'] flag.

  • scrapy-deltafetch uses bsddb3, scrapy-crawl-once uses sqlite.

Another alternative is the built-in Scrapy HTTP cache. Differences:

  • the Scrapy cache stores all pages on disk, while scrapy-crawl-once only keeps request fingerprints;

  • the Scrapy cache allows more fine-grained invalidation, consistent with how browsers work;

  • with the Scrapy cache all pages are still processed (though not all pages are downloaded).

Contributing

To run tests, install tox and run tox from the source checkout.

CHANGES

0.1 (2017-03-03)

Initial release.

