Goodtables is a framework to inspect tabular data.

These details have been verified by PyPI

Maintainers

These details have not been verified by PyPI

Project links

Homepage

Project description

goodtables-py

Goodtables is a framework to validate tabular data. It can check the structure of your data (e.g. all rows have the same number of columns), and its contents (e.g. all dates are valid).

Features

Structural checks: Ensure that there are no empty rows, no blank headers, etc.
Content checks: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
Support for multiple tabular formats: CSV, Excel files, LibreOffice, Data Package, etc.
Parallelized validations for multi-table datasets
Command line interface

Features
Getting Started
Validation
Developer documentation
- Semantic versioning
- Validate
FAQ
- How can I add a new custom check?
- How can I add support for a new tabular file type?
Changelog

Getting Started

Installing

pip install goodtables
pip install goodtables[ods]  # If you need LibreOffice's ODS file support

Running on CLI

goodtables data.csv

Use goodtables --help to see the different options.

Running on Python

from goodtables import validate

report = validate('invalid.csv')
report['valid'] # false
report['table-count'] # 1
report['error-count'] # 3
report['tables'][0]['valid'] # false
report['tables'][0]['source'] # 'invalid.csv'
report['tables'][0]['errors'][0]['code'] # 'blank-header'

You can read a more in depth explanation on using goodtables with Python on the developer documentation section. Check also the examples folder for other examples.

Validation

Basic checks

The basic checks can't be disabled, as they deal with goodtables being able to read the files.

check	description
io-error	Data reading error because of IO error.
http-error	Data reading error because of HTTP error.
source-error	Data reading error because of not supported or inconsistent contents.
scheme-error	Data reading error because of incorrect scheme.
format-error	Data reading error because of incorrect format.
encoding-error	Data reading error because of an encoding problem.

Structural checks

These checks validate that the structure of the file are valid.

check	description
blank-header	There is a blank header name. All cells in the header row must have a value.
duplicate-header	There are multiple columns with the same name. All column names must be unique.
blank-row	Rows must have at least one non-blank cell.
duplicate-row	Rows can't be duplicated.
extra-value	A row has more columns than the header.
missing-value	A row has less columns than the header.

Content checks

These checks validate the contents of the file. To use them, you need to pass a Table Schema. If you don't have a schema, goodtables can infer it if you use the infer_schema option.

If your schema only covers part of the data, you can use the infer_fields to infer the remaining fields.

Lastly, if the order of the fields in the data is different than in your schema, enable the order_fields option.

check	description
schema-error	Schema is not valid.
non-matching-header	The header's name in the schema is different from what's in the data.
extra-header	The data contains a header not defined in the schema.
missing-header	The data doesn't contain a header defined in the schema.
type-or-format-error	The value can’t be cast based on the schema type and format for this field.
required-constraint	This field is a required field, but it contains no value.
pattern-constraint	This field value's should conform to the defined pattern.
unique-constraint	This field is a unique field but it contains a value that has been used in another row.
enumerable-constraint	This field value should be equal to one of the values in the enumeration constraint.
minimum-constraint	This field value should be greater or equal than constraint value.
maximum-constraint	This field value should be less or equal than constraint value.
minimum-length-constraint	A length of this field value should be greater or equal than schema constraint value.
maximum-length-constraint	A length of this field value should be less or equal than schema constraint value.

Advanced checks

check	description
blacklisted-value	Ensure there are no cells with the blacklisted values.
deviated-value	Ensure numbers are within a number of standard deviations from the average.
sequential-value	Ensure numbers are sequential.
truncated-value	Detect values that were potentially truncated.
custom-constraint	Defines a constraint based on the values of other columns (e.g. `value * quantity == total`).

blacklisted-value

Sometimes we have to check for some values we don't want to have in out dataset. It accepts following options:

option	type	description
column	int/str	Column number or name
blacklist	list of str	List of blacklisted values

Consider the following CSV file:

id,name
1,John
2,bug
3,bad
5,Alex

Let's check that the name column doesn't contain rows with bug or bad:

from goodtables import validate

report = validate('data.csv', checks=[
    {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
])
# error on row 3 with code "blacklisted-value"
# error on row 4 with code "blacklisted-value"

deviated-value

This check helps to find outlines in a column containing positive numbers. It accepts following options:

option	type	description
column	int/str	Column number or name
average	str	Average type, either "mean", "median" or "mode"
interval	int	Values must be inside range `average ± standard deviation * interval`

Consider the following CSV file:

temperature
1
-2
7
0
1
2
5
-4
100
8
3

We use median to get an average of the column values and allow interval of 3 standard deviations. For our case median is 2.0 and standard deviation is 29.73 so all valid values must be inside the [-87.19, 91.19] interval.

report = validate('data.csv', checks=[
    {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
])
# error on row 10 with code "deviated-value"

sequential-value

This checks is for pretty common case when a column should have integers that sequentially increment. It accepts following options:

option	type	description
column	int/str	Column number or name

Consider the following CSV file:

id,name
1,one
2,two
3,three
5,five

Let's check if the id column contains sequential integers:

from goodtables import validate

report = validate('data.csv', checks=[
    {'sequential-value': {'column': 'id'}},
])
# error on row 5 with code "sequential-value"

truncated-value

Some database or spreadsheet software (like MySQL or Excel) could cutoff values on saving. There are some well-known heuristics to find this bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.

Consider the following CSV file:

id,amount,comment
1,14000000,good
2,2147483647,bad
3,32767,bad
4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad

To detect all probably truncated values we could use truncated-value check:

report = validate('data.csv', checks=[
    'truncated-value',
])
# error on row 3 with code "truncated-value"
# error on row 4 with code "truncated-value"
# error on row 5 with code "truncated-value"

custom-constraint

With Table Schema we could create constraints for an individual field but sometimes it's not enough. With a custom constraint check every row could be checked against given limited python expression in which variable names resolve to column values. See list of available operators. It accepts following options:

constraint (str): Constraint definition (e.g. col1 + col2 == col3)

Consider csv file like this:

id,name,salary,bonus
1,Alex,1000,200
2,Sam,2500,500
3,Ray,1350,500
4,John,5000,1000

Let's say our business rule is to be shy on bonuses:

report = validate('data.csv', checks=[
    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
])
# error on row 4 with code "custom-constraint"

Validation report format

The validation report follows the JSON Schema defined on goodtables/schemas/report.json. As an example, this is the report generated by running goodtables --row-limit 5 --json on the file data/datapackages/invalid/datapackage.json:

{
    "time": 0.015,
    "valid": false,
    "error-count": 2,
    "table-count": 2,
    "tables": [
        {
            "datapackage": "data/datapackages/invalid/datapackage.json",
            "time": 0.005,
            "valid": false,
            "error-count": 1,
            "row-count": 4,
            "source": "data/datapackages/invalid/data.csv",
            "headers": [
                "id",
                "name",
                "description",
                "amount"
            ],
            "format": "inline",
            "schema": "table-schema",
            "errors": [
                {
                    "code": "blank-row",
                    "row-number": 3,
                    "message": "Row 3 is completely blank"
                }
            ]
        },
        {
            "datapackage": "data/datapackages/invalid/datapackage.json",
            "time": 0.004,
            "valid": false,
            "error-count": 1,
            "row-count": 5,
            "source": "data/datapackages/invalid/data2.csv",
            "headers": [
                "parent",
                "comment"
            ],
            "format": "inline",
            "schema": "table-schema",
            "errors": [
                {
                    "code": "blank-row",
                    "row-number": 4,
                    "message": "Row 4 is completely blank"
                }
            ]
        }
    ],
    "warnings": [
        "Table \"data/datapackages/invalid/data2.csv\" inspection has reached 5 row(s) limit"
    ],
    "preset": "datapackage"
}

Developer documentation

Semantic versioning

We follow the Semantic Versioning specification to define our version numbers. This means that we'll increase the major version number when there's a breaking change. Because of this, we recommend you to explicitly specify the goodtables version on your dependency list (e.g. setup.py or requirements.txt).

Validate

Goodtables validates your tabular dataset to find structural and content errors. Consider you have a file named invalid.csv. Let's validate it:

report = validate('invalid.csv')

We could also pass a remote URI instead of a local path. It supports CSV, XLS, XLSX, ODS, JSON, and all other formats supported by the tabulator library.

Validation report

The output of the validate() method is a report dictionary. It includes information if the data was valid, count of errors, list of table reports, which individual checks failed, etc.

Resulting report will be looking like this:

{
    "time": 0.009,
    "error-count": 1,
    "warnings": [
        "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
    ],
    "preset": "table",
    "valid": false,
    "tables": [
        {
            "errors": [
                {
                    "row-number": null,
                    "message": "Header in column 3 is blank",
                    "row": null,
                    "column-number": 3,
                    "code": "blank-header"
                }
            ],
            "error-count": 1,
            "headers": [
                "id",
                "name",
                "",
                "name"
            ],
            "scheme": "file",
            "row-count": 2,
            "valid": false,
            "encoding": "utf-8",
            "time": 0.007,
            "schema": null,
            "format": "csv",
            "source": "data/invalid"
        }
    ],
    "table-count": 1
}

Rase report errors are standardized and described in Data Quality Spec. The errors are divided in one of the following categories:

source - data can't be loaded or parsed
structure - general tabular errors like duplicate headers
schema - error of checks against Table Schema
custom - custom checks errors

Checks

Check is a main validation actor in goodtables. The list of enabled checks can be changed using checks and skip_checks arguments. Let's explore the options on an example:

report = validate('data.csv') # by default structure and schema (if available) checks
report = validate('data.csv', checks=['structure']) # only structure checks
report = validate('data.csv', checks=['schema']) # only schema (if available) checks
report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'

By default a dataset will be validated against all available Data Quality Spec errors. Some checks can be unavailable for validation. For example, if the schema isn't provided, only the structure checks will be done.

Presets

Goodtables support different formats of tabular datasets. They're called presets. A tabular dataset is some data that can be split in a list of data tables, as:

Dataset

We can change the preset using the preset argument for validate(). By default, it'll be inferred from the source, falling back to table. To validate a data package, we can do:

report = validate('datapackage.json') # implicit preset
report = validate('datapackage.json', preset='datapackage') # explicit preset

This will validate all tabular resources in the datapackage.

It's also possible to validate a list of files using the "nested" preset. To do so, the first argument to validate() should be a list of dictionaries, where each key in the dictionary is named after a parameter on validate(). For example:

report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset

Is similar to:

report_data1 = validate('data1.csv')
report_data2 = validate('data2.csv')

The difference is that goodtables validates multiple tables in parallel, so calling using the "nested" preset should run faster.

Contributing

This project follows the Open Knowledge International coding standards.

We recommend you to use virtualenv to isolate goodtables from the rest of the packages in your machine.

To install goodtables and the development dependencies, run:

$ make install

To run the tests, use:

$ make test

FAQ

How can I add a new custom check?

To create a custom check user could use a check decorator. This way the builtin check could be overridden (use the spec error code like duplicate-row) or could be added a check for a custom error (use type, context and position arguments):

from goodtables import validate, check, Error

@check('custom-check', type='custom', context='body')
def custom_check(cells):
    errors = []
    for cell in cells:
        message = 'Custom error on column {column_number} and row {row_number}'
        error = Error(
            'custom-error',
            cell,
            message
        )
        errors.append(error)
    return errors

report = validate('data.csv', checks=['custom-check'])

For now this documentation section is incomplete. Please see builtin checks to learn more about checking protocol.

How can I add support for a new tabular file type?

To create a custom preset user could use a preset decorator. This way the builtin preset could be overridden or could be added a custom preset.

from tabulator import Stream
from tableschema import Schema
from goodtables import validate

@preset('custom-preset')
def custom_preset(source, **options):
    warnings = []
    tables = []
    for table in source:
        try:
            tables.append({
                'source':  str(source),
                'stream':  Stream(...),
                'schema': Schema(...),
                'extra': {...},
            })
        except Exception:
            warnings.append('Warning message')
    return warnings, tables

report = validate(source, preset='custom-preset')

For now this documentation section is incomplete. Please see builtin presets to learn more about the dataset extraction protocol.

Changelog

v2.1

New features:

A new key added to the error.to_dict return: message-data

v2.0

Breaking changes:

Checks method signature now only receives the current row's cells list
Checks raise errors by returning an array of Error objects
Cells have the row number in the row-number key
Files with ZIP extension are presumed to be datapackages, so goodtables mydatapackage.zip works
Improvements to goodtables CLI (#233)
New goodtables init <data paths> command to create a new datapackage.json with the files passed as parameters and their inferred schemas.

Bug fixes:

Fix bug with truncated-values check on date fields (#250)

v1.5

New API added:

Validation source now could be a pathlib.Path

v1.4

Improved behaviour:

rebased on Data Quality Spec v1
rebased on Data Package Spec v1
rebased on Table Schema Spec v1
treat primary key as required/unique field

v1.3

New advanced checks added:

blacklisted-value
custom-constraint
deviated-value
sequential-value
truncated-value

v1.2

New API added:

report.preset
report.tables[].schema

v1.1

New API added:

report.tables[].scheme
report.tables[].format
report.tables[].encoding

v1.0

This version includes various big changes. A migration guide is under development and will be published here.

v0.6

First version of goodtables.

Project details

These details have been verified by PyPI

Maintainers

akariv amercader okfn pwalsh roll

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

2.5.4

Feb 24, 2021

2.5.2

Oct 12, 2020

2.5.1

Sep 21, 2020

2.5.0

Jun 12, 2020

2.4.16

Apr 29, 2020

2.4.15

Apr 27, 2020

2.4.14

Apr 24, 2020

2.4.13

Apr 23, 2020

2.4.12

Mar 24, 2020

2.4.11

Mar 24, 2020

2.4.10

Mar 23, 2020

2.4.9

Mar 18, 2020

2.4.8

Mar 16, 2020

2.4.7

Mar 9, 2020

2.4.6

Mar 9, 2020

2.4.5

Mar 9, 2020

2.4.4

Mar 2, 2020

2.4.3

Feb 27, 2020

2.4.2

Dec 18, 2019

2.4.1

Nov 26, 2019

2.4.0

Oct 31, 2019

2.3.1

Oct 30, 2019

2.3.0

Oct 29, 2019

2.2.1

Jul 8, 2019

2.2.0

Jun 24, 2019

2.1.6

Jun 10, 2019

2.1.5

Jun 3, 2019

2.1.4

Apr 17, 2019

This version

2.1.3

Apr 11, 2019

2.1.2

Apr 11, 2019

2.1.1

Oct 1, 2018

2.1.0

Sep 20, 2018

2.0.2

Apr 12, 2018

2.0.0

Mar 22, 2018

1.5.1

Oct 11, 2017

1.5.0

Oct 5, 2017

1.4.5

Oct 5, 2017

1.4.4

Oct 3, 2017

1.4.3

Sep 20, 2017

1.4.2

Sep 19, 2017

1.4.1

Sep 10, 2017

1.4.0

Sep 8, 2017

1.3.0

Aug 9, 2017

1.2.0

Aug 9, 2017

1.1.0

Aug 8, 2017

1.0.0

Jul 19, 2017

1.0.0a16 pre-release

Jun 16, 2017

1.0.0a15 pre-release

Jun 9, 2017

1.0.0a13 pre-release

May 31, 2017

1.0.0a12 pre-release

May 26, 2017

1.0.0a11 pre-release

May 25, 2017

1.0.0a10 pre-release

May 25, 2017

1.0.0a9 pre-release

May 18, 2017

1.0.0a8 pre-release

Mar 3, 2017

1.0.0a7 pre-release

Mar 3, 2017

1.0.0a6 pre-release

Mar 3, 2017

1.0.0a5 pre-release

Feb 6, 2017

1.0.0a4 pre-release

Nov 18, 2016

1.0.0a3 pre-release

Nov 3, 2016

1.0.0a2 pre-release

Nov 2, 2016

0.7.6

Sep 12, 2016

0.7.5

Jun 6, 2016

0.7.4

May 19, 2016

0.7.3

May 19, 2016

0.7.2

May 11, 2016

0.7.1

May 1, 2016

0.7.0

Apr 22, 2016

0.6.5

Apr 4, 2016

0.6.4

Mar 31, 2016

0.6.3

Mar 30, 2016

0.6.2

Mar 28, 2016

0.6.1

Mar 17, 2016

0.6.0

Feb 18, 2016

0.5.7

Feb 18, 2016

0.5.6

Jun 7, 2015

0.5.5

Jun 7, 2015

0.5.4

Jun 7, 2015

0.5.3

Apr 28, 2015

0.5.2

Apr 14, 2015

0.5.1

Apr 14, 2015

0.5.0

Mar 30, 2015

0.4.9

Mar 30, 2015

0.4.8

Mar 30, 2015

0.4.7

Mar 30, 2015

0.4.6

Mar 25, 2015

0.4.5

Mar 22, 2015

0.4.4

Mar 19, 2015

0.4.3

Mar 18, 2015

0.4.2

Mar 18, 2015

0.4.1

Mar 18, 2015

0.4.0

Mar 18, 2015

0.3.9

Mar 17, 2015

0.3.8

Mar 16, 2015

0.3.7

Mar 16, 2015

0.3.6

Mar 11, 2015

0.3.5

Mar 10, 2015

0.3.4

Mar 9, 2015

0.3.3

Mar 8, 2015

0.3.2

Mar 5, 2015

0.3.1

Mar 5, 2015

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

goodtables-2.1.3.tar.gz (159.2 kB view hashes)

Uploaded Apr 11, 2019 Source

Built Distribution

goodtables-2.1.3-py2.py3-none-any.whl (49.0 kB view hashes)

Uploaded Apr 11, 2019 Python 2 Python 3

Hashes for goodtables-2.1.3.tar.gz

Hashes for goodtables-2.1.3.tar.gz
Algorithm	Hash digest
SHA256	`7e8bc644b32270b1e7e146da9c3512f8b299d85c613674de77b6a0580e648695`
MD5	`aa67c3de91628a596c62089023bd0093`
BLAKE2b-256	`9dd673f19f6fb5345f3f5f45bd3aecad7bc063ff273a0814a12f7d9772458344`

Hashes for goodtables-2.1.3-py2.py3-none-any.whl

Hashes for goodtables-2.1.3-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`65092d388229a45c97fe3552c2a18a3b7c0aeb10170418bf23b4c4f664105bda`
MD5	`c22fccb39c99e2ca23ca429cbefe7a7d`
BLAKE2b-256	`e8f452892a03bd81cb97381cb78a4ef9f7b08b4d12d61b216bebe5e828fa4597`

goodtables 2.1.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

goodtables-py

Features

Table of Contents

Getting Started

Installing

Running on CLI

Running on Python

Validation

Basic checks

Structural checks

Content checks

Advanced checks

blacklisted-value

deviated-value

sequential-value

truncated-value

custom-constraint

Validation report format

Developer documentation

Semantic versioning

Validate

Validation report

Checks

Presets

Contributing

FAQ

How can I add a new custom check?

How can I add support for a new tabular file type?

Changelog

v2.1

v2.0

v1.5

v1.4

v1.3

v1.2

v1.1

v1.0

v0.6

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution