Scrapy middleware to ignore previously crawled pages

Project description

This is a Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a “delta crawl” containing only new items.

This also speeds up the crawl by reducing the number of requests that need to be fetched and processed (item requests are typically the most CPU-intensive).

Requirements

DeltaFetch middleware depends on Python’s bsddb3 package.

On Ubuntu/Debian, you may need to install libdb-dev if it’s not installed already.

Installation

Install scrapy-deltafetch using pip:

$ pip install scrapy-deltafetch

Configuration

  1. Add DeltaFetch middleware by including it in SPIDER_MIDDLEWARES in your settings.py file:

    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }

    Here, priority 100 is just an example. Set its value depending on other middlewares you may have enabled already.

  2. Enable the middleware using DELTAFETCH_ENABLED in your settings.py:

    DELTAFETCH_ENABLED = True

Usage

The following options control the behavior of the DeltaFetch middleware.

Supported Scrapy settings

  • DELTAFETCH_ENABLED — to enable (or disable) this extension

  • DELTAFETCH_DIR — directory where to store state

  • DELTAFETCH_RESET — reset the state, clearing out all seen requests

These usually go in your Scrapy project’s settings.py.
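For example, a settings.py might combine these settings as follows (the values shown are illustrative, and the DELTAFETCH_DIR path is a hypothetical example, not a default):

```python
# settings.py -- illustrative DeltaFetch configuration

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,  # priority is an example; adjust to fit your stack
}

DELTAFETCH_ENABLED = True

# Hypothetical directory for the seen-requests database;
# pick a location appropriate for your project.
DELTAFETCH_DIR = '/var/cache/myproject/deltafetch'

# Set to True (or pass -a deltafetch_reset=1) to forget all previously seen requests.
DELTAFETCH_RESET = False
```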

Supported Scrapy spider arguments

  • deltafetch_reset — same effect as DELTAFETCH_RESET setting

Example:

$ scrapy crawl example -a deltafetch_reset=1

Supported Scrapy request meta keys

  • deltafetch_key — defines the lookup key for that request. By default the key is Scrapy’s default Request fingerprint, but it can be changed to contain an item ID, for example. This requires support from the spider, but makes the extension more efficient for sites that have many URLs for the same item.
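The effect of deltafetch_key can be sketched in plain Python. This is a minimal illustration of the lookup logic, not the middleware’s actual implementation: lookup_key, should_skip, and the set-based store are stand-ins for Scrapy’s request fingerprint function and DeltaFetch’s Berkeley DB store.

```python
import hashlib

# Stand-in for the middleware's on-disk database of seen keys.
seen = set()

def lookup_key(request_url, meta):
    """Return the lookup key for a request: the explicit deltafetch_key
    from request meta if present, otherwise a URL hash standing in for
    Scrapy's default request fingerprint."""
    if 'deltafetch_key' in meta:
        return meta['deltafetch_key']
    return hashlib.sha1(request_url.encode()).hexdigest()

def should_skip(request_url, meta):
    """Skip the request if its key was seen before; otherwise record it."""
    key = lookup_key(request_url, meta)
    if key in seen:
        return True
    seen.add(key)
    return False

# Two different URLs sharing one deltafetch_key: the second is skipped,
# even though its URL (and hence its fingerprint) differs.
should_skip('http://example.com/item?id=1', {'deltafetch_key': 'item-1'})  # False
should_skip('http://example.com/product/1', {'deltafetch_key': 'item-1'})  # True
```

Without deltafetch_key, those two URLs would produce different fingerprints and both would be crawled; the shared key lets the spider tell the middleware they refer to the same item.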

