Analyze Scrapy Cloud data

Arche

Join the chat at https://gitter.im/scrapinghub/arche

pip install arche

Arche (pronounced "Arkey") helps verify data against a set of defined rules, for example:

  • Validation with JSON schema
  • Coverage
  • Duplicates
  • Garbage symbols
  • Comparison of two jobs

We use it at Scrapinghub to ensure the quality of scraped data.
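
To make two of the rules above more concrete, here is a minimal standard-library sketch of what a coverage check and a garbage-symbols check could look like on a list of item dicts. It does not use Arche and is not how Arche implements these rules: the sample items, the `field_coverage` and `garbage_fields` helpers, and the `GARBAGE` pattern (leftover HTML tags or entities) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample items; in practice these would come from a scraping job.
items = [
    {"title": "Blue mug", "price": "4.99", "category": "kitchen"},
    {"title": "Red mug&nbsp;", "price": None, "category": "kitchen"},
    {"title": "<b>Green mug</b>", "price": "5.49"},
]


def field_coverage(items):
    """Return the share of items (0..1) that have a non-empty value per field."""
    if not items:
        return {}
    counts = Counter()
    for item in items:
        for field, value in item.items():
            if value not in (None, ""):
                counts[field] += 1
    return {field: count / len(items) for field, count in counts.items()}


# Assumed definition of "garbage": leftover HTML tags or HTML entities.
GARBAGE = re.compile(r"<[^>]+>|&[a-zA-Z]+;|&#\d+;")


def garbage_fields(items):
    """Return (item index, field) pairs whose string value matches GARBAGE."""
    return [
        (index, field)
        for index, item in enumerate(items)
        for field, value in item.items()
        if isinstance(value, str) and GARBAGE.search(value)
    ]


coverage = field_coverage(items)
garbage = garbage_fields(items)
print(coverage)
print(garbage)
```

Low coverage for a field that should always be present, or any hit from the garbage check, is the kind of signal such rules are meant to surface.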

Installation

Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.

For JupyterLab, you also need to install the plotly extensions.

Then just pip install arche

Use case

  • You need to check the quality of data from Scrapy Cloud jobs continuously.

    Say you scraped a website and have the data ready in the cloud. A typical approach would be:

  • You want to use it in your application to verify Scrapy Cloud data

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome!

  • Fork or create a new branch
  • Make desired changes
  • Open a pull request

Changes

Most recent releases are shown at the top. Each release shows:

  • Added: New classes, methods, functions, etc
  • Changed: Additional parameters, changes to inputs or outputs, etc
  • Fixed: Bug fixes that don't change documented behaviour

Note that the top-most release lists changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub ID of the contributor of that change.

This changelog follows Keep a Changelog and Semantic Versioning.

[0.3.6dev] (Work In Progress)

Added

Changed

Fixed

Removed

[0.3.5] (2019-05-14)

Added

  • Arche() supports any iterables with item dicts, fixing jsonschema consistency, #83
  • Items.from_array to read raw data from iterables, #83

Changed

Fixed

Removed

[0.3.4] (2019-05-06)

Fixed

  • basic_json_schema() fails with long 1.0 types, #80

[0.3.3] (2019-05-03)

Added

  • Accept dataframes as source or target, #69

Changed

  • data_quality_report plots the same "Fields Coverage" instead of green "Scraped Fields Coverage"
  • Plot theme changed from ggplot2 to seaborn, #62
  • Same target and source raise an error, was a warning before
  • Passed rules marked with green PASSED.

Fixed

Removed

  • Deprecated Arche.basic_json_schema(), use basic_json_schema()
  • Removed Quickstart.md as redundant - documentation lives in notebooks

[0.3.2] (2019-04-18)

Added

  • Allow reading private raw schemas directly from bitbucket, #58

Changed

  • Progress widgets are removed before printing graphs
  • New plotly v4 API

Fixed

  • Failing Compare Prices For Same Urls when url is nan, #67
  • Empty graphs in Jupyter Notebook, #63

Removed

  • Scraped Items History graphs

[0.3.1] (2019-04-12)

Fixed

  • Empty graphs due to lack of plotlyjs, #61

[0.3.0] (2019-04-12)

Fixed

  • Big notebook size, replaced cufflinks with plotly and ipython, #39

Changed

  • Fields Coverage is now printed as a bar plot, #9
  • Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
    • "Coverage from job stats fields counts", which reflects the coverage of each field for both jobs
    • "Coverage difference more than 5%", which shows differences of more than 5% between the coverages (was ratio difference before)
  • Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
    • "Coverage for field", which reflects the value counts (categories) coverage of the field for both jobs
    • "Coverage difference more than 10% for field", which shows differences of more than 10% between the category coverages
  • Boolean Fields plots a "Coverage for boolean fields" graph, which reflects normalized value counts of boolean fields for both jobs, #53
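
As a rough illustration of the "Coverage difference more than 5%" idea described above, the sketch below compares per-field coverage of two jobs in percentage points and keeps only the fields that differ by more than a threshold. It is written under assumptions and is not Arche's implementation: `coverage_difference`, its parameters, the rounding to two decimals, and the sample counts are all hypothetical.

```python
def coverage_difference(source_counts, source_total, target_counts, target_total,
                        threshold=5.0):
    """Return fields whose coverage differs by more than `threshold` points.

    Coverage is the percentage of items in a job that contain the field.
    The result maps each field to (source coverage - target coverage).
    """
    differences = {}
    for field in sorted(set(source_counts) | set(target_counts)):
        source_pct = 100 * source_counts.get(field, 0) / source_total
        target_pct = 100 * target_counts.get(field, 0) / target_total
        diff = round(source_pct - target_pct, 2)
        if abs(diff) > threshold:
            differences[field] = diff
    return differences


# Hypothetical per-field counts for two jobs with 200 and 180 items.
source = {"title": 200, "price": 190, "image": 120}
target = {"title": 180, "price": 150, "image": 110}

result = coverage_difference(source, 200, target, 180)
print(result)  # {'price': 11.67}
```

Here `price` is covered in 95% of the source items but only about 83.33% of the target items, so it is the only field reported. `image` differs by about 1.11 points and stays below the threshold.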

Removed

  • cufflinks dependency
  • Deprecated category_field tag

[2019.03.25]

Added

  • CHANGES.md
  • new arche.rules.duplicates.find_by() to find duplicates by chosen columns, e.g.

    import arche
    from arche.readers.items import JobItems
    df = JobItems(0, "235801/1/15").df
    arche.rules.duplicates.find_by(df, ["title", "category"]).show()

  • basic_json_schema().json() prints a schema in JSON format
  • Result.show() to print a rule result, e.g.

    from arche.rules.garbage_symbols import garbage_symbols
    from arche.readers.items import JobItems
    items = JobItems(0, "235801/1/15")
    garbage_symbols(items).show()

  • notebooks to documentation
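
For readers who want to see what "duplicates by chosen columns" means without access to a Scrapy Cloud job, the following standard-library sketch groups item dicts by the values of selected keys. It only illustrates the idea and says nothing about how arche.rules.duplicates.find_by() is implemented: the `find_duplicates_by` helper and the sample items are hypothetical.

```python
from collections import defaultdict


def find_duplicates_by(items, keys):
    """Group item indexes by the values of `keys`; keep groups with 2+ items."""
    groups = defaultdict(list)
    for index, item in enumerate(items):
        # Assumption of this sketch: items missing any chosen key are skipped.
        if all(key in item for key in keys):
            groups[tuple(item[key] for key in keys)].append(index)
    return {values: indexes for values, indexes in groups.items() if len(indexes) > 1}


# Hypothetical items; two of them share the same title and category.
items = [
    {"title": "Blue mug", "category": "kitchen", "price": "4.99"},
    {"title": "Red mug", "category": "kitchen", "price": "5.49"},
    {"title": "Blue mug", "category": "kitchen", "price": "4.49"},
    {"title": "Blue mug", "category": "garden", "price": "4.99"},
]

duplicates = find_duplicates_by(items, ["title", "category"])
print(duplicates)  # {('Blue mug', 'kitchen'): [0, 2]}
```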

Changed

  • Tags rule returns unused tags, #2
  • basic_json_schema() prints a schema as a python dict

Deprecated

  • Arche().basic_json_schema() deprecated in favor of arche.basic_json_schema()

Removed

Fixed

  • Arche().basic_json_schema() not using items_numbers argument

[2019.03.18]

  • Last release without CHANGES updates
