Arche

Analyze Scrapy Cloud data

pip install arche
Arche (pronounced Arkey) helps to verify scraped data using a set of defined rules, for example:
- Validation with JSON schema
- Coverage (items, fields, categorical data, including booleans and enums)
- Duplicates
- Garbage symbols
- Comparison of two jobs
We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.
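For example, a minimal session in a notebook might look like the following sketch; the job key and schema here are illustrative, not from a real project:

from arche import Arche

# an example JSON schema; in practice you would tailor it to your items
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}

# run all rules: schema validation, coverage, duplicates, garbage symbols
Arche(source="235801/1/15", schema=schema).report_all()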
Installation
Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.
For JupyterLab, you will need to properly install the plotly extensions (see the sketch below).
Then just pip install arche
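As a rough sketch, assuming plotly v4 (which Arche uses as of 0.3.2 below); consult the plotly documentation for the exact extension name and version:

jupyter labextension install jupyterlab-plotly  # enables plotly figure rendering in JupyterLab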
Why
To check the quality of scraped data continuously. For example, if you scraped a website, a typical approach would be to validate the data with Arche. You can also create a schema and then set up Spidermon to monitor future scrapes.
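A sketch of that workflow, assuming an illustrative job key:

from arche import basic_json_schema

# infer a starting schema from an existing job
schema = basic_json_schema("235801/1/15")
# print it in JSON format, ready to reuse in Spidermon
schema.json()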
Developer Setup
pipenv install --dev  # install Arche with development dependencies
pipenv shell          # activate the project virtualenv
tox                   # run the test suite
Contribution
Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.
Changes
Most recent releases are shown at the top. Each release shows:
- Added: new classes, methods, functions, etc.
- Changed: additional parameters, changes to inputs or outputs, etc.
- Fixed: bug fixes that don't change documented behaviour
Note that the top-most release contains changes from the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub id of the contributor of that change.
This changelog follows Keep a Changelog and Semantic Versioning.
[0.3.6] (2019-07-12)
Added
- Categories rule with a plot showing unique values and counts per field. By default, report_all() only includes fields which have 10 or fewer unique values. See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100
- Category documentation
Changed
- Arche.report_all() does not shorten the report by default; added a short parameter (see the sketch after this list)
- Data is consistent with Dash and Spidermon: _type and _key fields are dropped from the dataframe, raw data and basic schema, #104, #106
- df.index now stores _key instead
- basic_json_schema() works with deleted jobs
- start is supported for Collections, #112
- enum is counted as a category tag, #18
- Garbage Symbols searches in the str representation of nested fields instead of the expanded df, #130
- Show real coverage difference (negative/positive) instead of absolute, #114
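A minimal sketch of the new short parameter; the job key and schema are illustrative:

from arche import Arche

a = Arche(source="235801/1/15", schema={"type": "object"})
a.report_all(short=True)  # opt in to the shortened report; the full report is now the default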
Fixed
- Arche.glance(), #88
- Item links in Schema validation errors, #89
- Empty NAN bars on category graphs, #93
- data_quality_report(), #95
- Wrong number of Collection items if the collection contains item 0, #112
Removed
- Responses Per Item Ratio rule
- Deprecated the expand parameter and removed flat_df, since the Garbage Symbols rule deals with nested data itself, #133
[0.3.5] (2019-05-14)
Added
- Arche() supports any iterables with item dicts, fixing jsonschema consistency, #83
- Items.from_array to read raw data from iterables, #83 (see the sketch below)
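A sketch of the iterable support; the items and schema are illustrative:

from arche import Arche

items = [
    {"title": "Foo", "price": 10},
    {"title": "Bar", "price": 20},
]
schema = {"type": "object", "properties": {"title": {"type": "string"}}}
Arche(source=items, schema=schema).report_all()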
Changed
- If reading from a pandas df directly, raw data is stored in a numpy array. See the pandas gotchas: http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na
[0.3.4] (2019-05-06)
Fixed
- basic_json_schema() fails with long 1.0 types, #80
[0.3.3] (2019-05-03)
Added
- Accept dataframes as source or target, #69
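A sketch of passing a dataframe as the source; the column names and schema are illustrative:

import pandas as pd
from arche import Arche

df = pd.DataFrame({"title": ["Foo", "Bar"], "price": [10, 20]})
Arche(source=df, schema={"type": "object"}).report_all()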
Changed
- data_quality_report plots the same "Fields Coverage" graph instead of the green "Scraped Fields Coverage" one
- Plot theme changed from ggplot2 to seaborn, #62
- Same target and source raise an error, was a warning before
- Passed rules marked with green PASSED.
Fixed
- Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41
- Error colours are back in report_all()
Removed
- Deprecated Arche.basic_json_schema(), use basic_json_schema() instead
- Removed Quickstart.md as redundant: documentation lives in notebooks
[0.3.2] (2019-04-18)
Added
- Allow reading private raw schemas directly from bitbucket, #58
Changed
- Progress widgets are removed before printing graphs
- New plotly v4 API
Fixed
- Failing Compare Prices For Same Urls when url is nan, #67
- Empty graphs in Jupyter Notebook, #63
Removed
- Scraped Items History graphs
[0.3.1] (2019-04-12)
Fixed
- Empty graphs due to lack of plotlyjs, #61
[0.3.0] (2019-04-12)
Fixed
- Big notebook size, replaced cufflinks with plotly and ipython, #39
Changed
- Fields Coverage is now printed as a bar plot, #9
- Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
  - Coverage from job stats fields counts, which reflects coverage for each field for both jobs
  - Coverage difference more than 5%, which prints >5% difference between the coverages (was ratio difference before)
- Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
  - Coverage for field, which reflects value counts (categories) coverage for the field for both jobs
  - Coverage difference more than 10% for field, which shows >10% differences between the category coverages
- Boolean Fields plots Coverage for boolean fields graph which reflects normalized value counts for boolean fields for both jobs, #53
Removed
- cufflinks dependency
- Deprecated category_field tag
[2019.03.25]
Added
- CHANGES.md
- New arche.rules.duplicates.find_by() to find duplicates by chosen columns, e.g.
import arche
from arche.readers.items import JobItems

# fetch items from a Scrapy Cloud job as a pandas dataframe
df = JobItems(0, "235801/1/15").df
# find items sharing the same "title" and "category" values
arche.rules.duplicates.find_by(df, ["title", "category"]).show()
- basic_json_schema().json() prints a schema in JSON format
- Result.show() to print a rule result, e.g.
from arche.rules.garbage_symbols import garbage_symbols
from arche.readers.items import JobItems

# read items from a Scrapy Cloud job and check them for garbage symbols
items = JobItems(0, "235801/1/15")
garbage_symbols(items).show()
- notebooks to documentation
Changed
- Tags rule returns unused tags, #2
- basic_json_schema() prints a schema as a python dict
Deprecated
- Arche().basic_json_schema() deprecated in favor of arche.basic_json_schema()
Fixed
- Arche().basic_json_schema() not using the items_numbers argument
[2019.03.18]
- Last release without CHANGES updates