Python interface to Scrapinghub Automatic Extraction API
Project description
Python client libraries for the Scrapinghub AutoExtract API. They allow you to extract product and article information from any website.
Both synchronous and asyncio wrappers are provided by this package.
License is BSD 3-clause.
Installation
pip install scrapinghub-autoextract
scrapinghub-autoextract requires Python 3.6+ for the CLI tool and the asyncio API; the basic, synchronous API works with Python 3.5.
Usage
First, make sure you have an API key. To avoid passing it in the api_key argument with every call, you can set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to the key.
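For example, in a POSIX shell (the value is a placeholder for your own key):
export SCRAPINGHUB_AUTOEXTRACT_KEY=<your API key>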
Command-line interface
The most basic way to use the client is from the command line. First, create a file with URLs, one URL per line (e.g. urls.txt). Second, set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to your AutoExtract API key (you can also pass the API key via the --api-key script argument).
Then run the script to get the results:
python -m autoextract urls.txt --page-type article > res.jl
Run python -m autoextract --help to get a description of all supported options.
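The output file can then be read back with the standard json module; a minimal sketch, assuming the output is JSON Lines (one JSON document per line, as the .jl extension suggests):

import json

# load the CLI output; each non-empty line is one JSON document
with open('res.jl') as f:
    results = [json.loads(line) for line in f if line.strip()]
print(len(results), 'results loaded')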
Synchronous API
The synchronous API provides an easy way to try AutoExtract in a script. For production usage the asyncio API is strongly recommended.
You can send requests as described in the API docs:
from autoextract.sync import request_raw

query = [{'url': 'http://example.com.foo', 'pageType': 'article'}]
results = request_raw(query)
Note that if there are several URLs in the query, results can be returned in arbitrary order.
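If you need to tie results back to URLs yourself, one option is to index them by the query echoed in each result; a minimal sketch, assuming each result includes the original query under 'query' → 'userQuery' as described in the AutoExtract API docs:

# index results by the URL they were requested for;
# the 'query' / 'userQuery' keys are assumed from the API response format
results_by_url = {
    res['query']['userQuery']['url']: res
    for res in results
}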
There is also an autoextract.sync.request_batch helper, which accepts URLs and a page type, and ensures results are in the same order as the requested URLs:
from autoextract.sync import request_batch

urls = ['http://example.com/foo', 'http://example.com/bar']
results = request_batch(urls, page_type='article')
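Because the order is preserved, URLs and results can be paired directly; a short sketch, assuming the 'article' and 'headline' fields from the article schema in the API docs:

# results come back in the same order as urls, so zip pairs them correctly
for url, res in zip(urls, results):
    article = res.get('article', {})  # extracted article data, if any
    print(url, article.get('headline'))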
asyncio API
Basic usage is similar to the sync API (request_raw), but the asyncio event loop is used:
from autoextract.aio import request_raw

async def foo():
    results1 = await request_raw(query)
    # ...
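A coroutine like this still has to be driven by an event loop; a minimal, self-contained sketch (the URL is a placeholder, and SCRAPINGHUB_AUTOEXTRACT_KEY is expected to be set):

import asyncio
from autoextract.aio import request_raw

async def main():
    query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
    return await request_raw(query)

# Python 3.7+; on Python 3.6 use asyncio.get_event_loop().run_until_complete(main())
results = asyncio.run(main())
print(results)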
There is also a request_parallel function, which allows processing many URLs in parallel, using both batching and multiple connections:
import sys
from autoextract.aio import request_parallel, create_session, ApiError  # ApiError assumed exported here

async def foo():
    async with create_session() as session:
        res_iter = request_parallel(urls, page_type='article',
                                    n_conn=10, batch_size=3,
                                    session=session)
        for f in res_iter:
            try:
                batch_result = await f
                for res in batch_result:
                    # do something with a result
                    ...
            except ApiError as e:
                print(e, file=sys.stderr)
                raise
The request_parallel and request_raw functions handle throttling (HTTP 429 errors) and network errors, retrying the request in these cases.
The CLI implementation (autoextract/__main__.py) can serve as a usage example.
Contributing
Source code: https://github.com/scrapinghub/scrapinghub-autoextract
Issue tracker: https://github.com/scrapinghub/scrapinghub-autoextract/issues
Use tox to run tests with different Python versions:
tox
The command above also runs type checks; we use mypy.
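To run a single environment during development (environment names depend on tox.ini, so treat this as an example):
tox -e py36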
Changes
TBA
Initial release.
Download files
Download the file for your platform.
Source Distribution: scrapinghub-autoextract-0.1.tar.gz (11.0 kB)
Built Distribution: scrapinghub_autoextract-0.1-py3-none-any.whl (12.0 kB)
File details
Details for the file scrapinghub-autoextract-0.1.tar.gz.
File metadata
- Download URL: scrapinghub-autoextract-0.1.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 672e67b9443aa5ab78345de212b273f92031c95688474b58b0b3fe46ba2d13fa
MD5 | 18ad64552554031e4bc6b67efc4d3677
BLAKE2b-256 | 811c826a9aa957870fc84f1306ecc3b7d71a9eb4a57b254eb31b5e0813985d1c
File details
Details for the file scrapinghub_autoextract-0.1-py3-none-any.whl.
File metadata
- Download URL: scrapinghub_autoextract-0.1-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | f0a9e69c49e5f1e3d1cdfa6069c322b1d9fa8d10c59a422295aa34cf74c14672
MD5 | 731b2061c99f5b9ef99cf61bed9d2b19
BLAKE2b-256 | 30ef71ab8223947762163e062a0c79ce5019cce474831e77cf19d1fafd97e2d2