Scrapinghub AutoExtract API integration for Scrapy
This library integrates Scrapinghub's AI-enabled automatic data extraction into a Scrapy spider through a downloader middleware. The middleware adds the AutoExtract result to response.meta['autoextract'] for consumption by the spider.
Installation
```
pip install scrapy-autoextract
```
scrapy-autoextract requires Python 3.5+
Configuration
Add the AutoExtract downloader middleware in the settings file:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
```
Note that this should be the last downloader middleware to be executed.
Usage
The middleware is opt-in and must be explicitly enabled per request via the {'autoextract': {'enabled': True}} request meta. All of the options below can be set either in the project settings file or, for specific spiders, in the custom_settings dict.
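For example, a single request might opt in like this (a minimal sketch; the spider name and URL are illustrative):

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'example'  # illustrative spider name

    def start_requests(self):
        # Opt this request in to AutoExtract processing.
        yield scrapy.Request(
            'http://example.com/some-product',  # illustrative URL
            meta={'autoextract': {'enabled': True}},
        )
```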
Available settings:
AUTOEXTRACT_USER [mandatory] your AutoExtract API key.
AUTOEXTRACT_URL [optional] the AutoExtract service URL. Defaults to autoextract.scrapinghub.com.
AUTOEXTRACT_TIMEOUT [optional] the response timeout for AutoExtract requests. Defaults to 660 seconds. Can also be set via download_timeout in the request meta.
AUTOEXTRACT_PAGE_TYPE [mandatory] the kind of document to be extracted. The currently available options are "product" and "article". Can also be set on spider.page_type, or via the {'autoextract': {'pageType': '...'}} request meta. The AutoExtract classifier requires this to know what kind of page to extract; a per-spider example follows this list.
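For instance, these settings might be declared per spider like this (a minimal sketch; the spider name and API key placeholder are illustrative):

```python
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'  # illustrative spider name
    custom_settings = {
        'AUTOEXTRACT_USER': '<your API key>',  # placeholder
        'AUTOEXTRACT_PAGE_TYPE': 'article',
        'AUTOEXTRACT_TIMEOUT': 300,  # optional override, in seconds
    }
```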
Within the spider, consuming the AutoExtract result is as easy as:
```python
def parse(self, response):
    yield response.meta['autoextract']
```
Limitations
When using the AutoExtract middleware, there are some limitations.
The incoming spider request is rendered by AutoExtract rather than just downloaded by Scrapy, which can change the result: the IP is different, the headers are different, and so on.
Only GET requests are supported
Custom headers and cookies are not supported (i.e. Scrapy features to set them don’t work)
Proxies are not supported (they would work incorrectly, sitting between Scrapy and AutoExtract, instead of AutoExtract and website)
The AutoThrottle extension can work incorrectly for AutoExtract requests, because AutoExtract response times can be much larger than the time required to download a page, so it's best to set AUTOTHROTTLE_ENABLED=False in the settings.
Redirects are handled by AutoExtract, not by Scrapy, so Scrapy's redirect middlewares may have no effect.
Retries should be disabled, because AutoExtract handles them internally (use RETRY_ENABLED=False in the settings). There is one exception: if too many requests are sent in a short amount of time, AutoExtract returns HTTP code 429, and in that case it's best to restrict retries to that code with RETRY_HTTP_CODES=[429]. A settings sketch follows this list.
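Taken together, the adjustments above might look like this in the settings file (a sketch based only on the notes in this list):

```python
# settings.py: adjustments suggested by the limitations above
AUTOTHROTTLE_ENABLED = False  # AutoExtract timing skews throttling

# Either disable retries entirely (AutoExtract retries internally)...
RETRY_ENABLED = False
# ...or, if rate limiting (HTTP 429) is a concern, keep retries on
# and restrict them to that status code instead:
# RETRY_ENABLED = True
# RETRY_HTTP_CODES = [429]
```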