Scrapinghub AutoExtract API integration for Scrapy
This library integrates Scrapinghub's AI-enabled automatic data extraction into a Scrapy spider through a downloader middleware. The middleware adds the AutoExtract result to response.meta['autoextract'] for consumption by the spider.
Installation
```
pip install scrapy-autoextract
```
scrapy-autoextract requires Python 3.5+
Configuration
Add the AutoExtract downloader middleware in the settings file:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
```
Note that this should be the last downloader middleware to be executed.
Usage
The middleware is opt-in and must be explicitly enabled per request via the {'autoextract': {'enabled': True}} request meta. All of the options below can be set either in the project settings file or, for specific spiders, in the custom_settings dict.
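For example, a single request might opt in like this (a minimal sketch; the spider name and URL are illustrative):

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'example'  # illustrative spider name

    def start_requests(self):
        # Opt this request in to AutoExtract processing.
        yield scrapy.Request(
            'http://example.com/some-product',  # illustrative URL
            meta={'autoextract': {'enabled': True}},
        )
```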
Available settings:
AUTOEXTRACT_USER [mandatory] your AutoExtract API key.
AUTOEXTRACT_URL [optional] the AutoExtract service URL. Defaults to autoextract.scrapinghub.com.
AUTOEXTRACT_TIMEOUT [optional] the response timeout for AutoExtract requests. Defaults to 660 seconds. Can also be set via download_timeout in the request meta.
AUTOEXTRACT_PAGE_TYPE [mandatory] the kind of document to be extracted. The currently available options are "product" and "article". Can also be set on spider.page_type, or via the {'autoextract': {'pageType': '...'}} request meta. The AutoExtract classifier requires this to know what kind of page to extract; a per-spider example follows this list.
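For instance, these settings might be declared per spider like this (a minimal sketch; the spider name and API key placeholder are illustrative):

```python
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'  # illustrative spider name
    custom_settings = {
        'AUTOEXTRACT_USER': '<your API key>',  # placeholder
        'AUTOEXTRACT_PAGE_TYPE': 'article',
        'AUTOEXTRACT_TIMEOUT': 300,  # optional override, in seconds
    }
```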
Within the spider, consuming the AutoExtract result is as easy as:
```python
def parse(self, response):
    yield response.meta['autoextract']
```
Limitations
When using the AutoExtract middleware, there are some limitations.
The incoming spider request is rendered by AutoExtract rather than just downloaded by Scrapy, which can change the result: the IP is different, the headers are different, and so on.
Only GET requests are supported
Custom headers and cookies are not supported (i.e. Scrapy features to set them don’t work)
Proxies are not supported (they would work incorrectly, sitting between Scrapy and AutoExtract, instead of AutoExtract and website)
The AutoThrottle extension can work incorrectly for AutoExtract requests, because AutoExtract response times can be much larger than the time required to download a page, so it's best to set AUTOTHROTTLE_ENABLED=False in the settings.
Redirects are handled by AutoExtract, not by Scrapy, so Scrapy's redirect middlewares may have no effect.
Retries should be disabled, because AutoExtract handles them internally (use RETRY_ENABLED=False in the settings). There is one exception: if too many requests are sent in a short amount of time, AutoExtract returns HTTP code 429, and in that case it's best to restrict retries to that code with RETRY_HTTP_CODES=[429]. A settings sketch follows this list.
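Taken together, the adjustments above might look like this in the settings file (a sketch based only on the notes in this list):

```python
# settings.py: adjustments suggested by the limitations above
AUTOTHROTTLE_ENABLED = False  # AutoExtract timing skews throttling

# Either disable retries entirely (AutoExtract retries internally)...
RETRY_ENABLED = False
# ...or, if rate limiting (HTTP 429) is a concern, keep retries on
# and restrict them to that status code instead:
# RETRY_ENABLED = True
# RETRY_HTTP_CODES = [429]
```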