Scrapy extension to store info in storage service

Project description

A Scrapy extension that stores request and response information in a storage service.

Installation

You can install scrapy-pagestorage using pip:

pip install scrapy-pagestorage

You can then enable the middleware in your settings.py:

SPIDER_MIDDLEWARES = {
    ...
    'scrapy_pagestorage.PageStorageMiddleware': 900
}

How to use it

Enable the extension through settings.py:

PAGE_STORAGE_ENABLED = True
PAGE_STORAGE_ON_ERROR_ENABLED = True

Configure the extension through settings.py:

PAGE_STORAGE_MODE = "VERSIONED_CACHE"
PAGE_STORAGE_LIMIT = 100
PAGE_STORAGE_ON_ERROR_LIMIT = 100
PAGE_STORAGE_TRIM_HTML = True

The extension is auto-enabled for Portia spiders (SHUB_SPIDER_TYPE=portia).
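The enablement rules above can be sketched as a small pure-Python check. This is only an illustration, not the extension's actual code: the helper name `page_storage_enabled` is hypothetical, and it assumes that `SHUB_SPIDER_TYPE` is read from the process environment.

```python
import os
from typing import Mapping, Optional


def page_storage_enabled(settings: Mapping[str, object],
                         environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: decide whether page storage would be switched on.

    Assumption: SHUB_SPIDER_TYPE is an environment variable; the real
    extension may resolve it differently.
    """
    env = os.environ if environ is None else environ
    # Explicit opt-in through settings.py.
    explicitly_enabled = bool(settings.get("PAGE_STORAGE_ENABLED", False))
    # Auto-enable for Portia spiders.
    is_portia = env.get("SHUB_SPIDER_TYPE") == "portia"
    return explicitly_enabled or is_portia
```

Passing the environment as an argument keeps the check easy to exercise without touching the real `os.environ`.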

Settings

PAGE_STORAGE_MODE

Default: None

A string that specifies whether the extension stores information in the cache store or in the versioned cache store (set PAGE_STORAGE_MODE = "VERSIONED_CACHE" to use the versioned one).

PAGE_STORAGE_LIMIT

An integer that limits the number of visited pages to store.

PAGE_STORAGE_ON_ERROR_LIMIT

An integer that limits the number of error pages to store.
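To illustrate how the two limits behave, here is a minimal sketch of a counter that stops accepting pages once its limit is reached. The class `StorageBudget` is hypothetical and is not part of scrapy-pagestorage; it only assumes that visited pages and error pages are counted separately against their own limits.

```python
class StorageBudget:
    """Hypothetical counter that allows at most `limit` stored pages."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.stored = 0

    def try_store(self) -> bool:
        """Return True and count the page if the limit is not yet reached."""
        if self.stored >= self.limit:
            return False
        self.stored += 1
        return True


# One budget per limit, mirroring the two settings above.
visited_pages = StorageBudget(limit=100)  # PAGE_STORAGE_LIMIT
error_pages = StorageBudget(limit=100)    # PAGE_STORAGE_ON_ERROR_LIMIT
```

With this model, the 101st visited page would be skipped while error pages could still be stored until their own limit is used up.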

PAGE_STORAGE_TRIM_HTML

Default: False

Remove whitespace from the start and end of the HTML to reduce file size.
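As a rough illustration of the trimming described above, the following sketch strips leading and trailing whitespace from an HTML string. The function name `trim_html` is hypothetical, and the extension's own trimming may differ in detail.

```python
def trim_html(body: str) -> str:
    """Hypothetical helper: drop whitespace at the start and end of the HTML."""
    return body.strip()


raw = "\n\n   <html><body>Hello</body></html>   \n"
trimmed = trim_html(raw)
# Whitespace inside the document is left untouched; only the edges shrink.
saved_characters = len(raw) - len(trimmed)
```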
