goodtables-py
=============
| |Travis|
| |Coveralls|
| |PyPi|
| |Gitter|
Goodtables is a framework to validate tabular data. It can check the
structure of your data (e.g. all rows have the same number of columns)
and its contents (e.g. all dates are valid).
Features
--------
- **Structural checks**: Ensure that there are no empty rows, no blank
  headers, etc.
- **Content checks**: Ensure that the values have the correct types
  ("string", "number", "date", etc.), that their format is valid
  ("string must be an e-mail"), and that they respect the constraints
  ("age must be a number greater than 18").
- **Support for multiple tabular formats**: CSV, Excel files,
  LibreOffice, Data Package, etc.
- **Parallelized validations for multi-table datasets**
- **Command line interface**
Table of Contents
-----------------

- `Features <#features>`__
- `Getting Started <#getting-started>`__

  - `Installing <#installing>`__
  - `Running on CLI <#running-on-cli>`__
  - `Running on Python <#running-on-python>`__

- `Validation <#validation>`__

  - `Basic checks <#basic-checks>`__
  - `Structural checks <#structural-checks>`__
  - `Content checks <#content-checks>`__
  - `Advanced checks <#advanced-checks>`__
  - `Validation report format <#validation-report-format>`__

- `Developer documentation <#developer-documentation>`__

  - `Semantic versioning <#semantic-versioning>`__
  - `Validate <#validate>`__

- `FAQ <#faq>`__

  - `How can I add a new custom check? <#how-can-i-add-a-new-custom-check>`__
  - `How can I add support for a new tabular file type? <#how-can-i-add-support-for-a-new-tabular-file-type>`__

- `Changelog <#changelog>`__
Getting Started
---------------
Installing
~~~~~~~~~~
::

    pip install goodtables
    pip install goodtables[ods] # If you need LibreOffice's ODS file support
Running on CLI
~~~~~~~~~~~~~~
::

    goodtables data.csv
Use ``goodtables --help`` to see the different options.
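For example, you can combine the ``--json`` and ``--row-limit`` options
used later in this README (a sketch; ``report.json`` is just an
arbitrary output file name) to write a JSON report while inspecting at
most 1000 rows:

::

    goodtables --json --row-limit 1000 data.csv > report.json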
Running on Python
~~~~~~~~~~~~~~~~~
.. code:: python

    from goodtables import validate

    report = validate('invalid.csv')
    report['valid'] # false
    report['table-count'] # 1
    report['error-count'] # 3
    report['tables'][0]['valid'] # false
    report['tables'][0]['source'] # 'invalid.csv'
    report['tables'][0]['errors'][0]['code'] # 'blank-header'
You can read a more in-depth explanation of using goodtables with
Python in the `developer documentation <#developer-documentation>`__
section. Also check the `examples <examples>`__ folder for more
examples.
Validation
----------
Basic checks
~~~~~~~~~~~~
The basic checks can't be disabled, as they deal with goodtables being
able to read the files.
+------------------+-------------------------------------------------------------------------+
| check | description |
+==================+=========================================================================+
| io-error | Data reading error because of IO error. |
+------------------+-------------------------------------------------------------------------+
| http-error | Data reading error because of HTTP error. |
+------------------+-------------------------------------------------------------------------+
| source-error | Data reading error because of not supported or inconsistent contents. |
+------------------+-------------------------------------------------------------------------+
| scheme-error | Data reading error because of incorrect scheme. |
+------------------+-------------------------------------------------------------------------+
| format-error | Data reading error because of incorrect format. |
+------------------+-------------------------------------------------------------------------+
| encoding-error | Data reading error because of an encoding problem. |
+------------------+-------------------------------------------------------------------------+
Structural checks
~~~~~~~~~~~~~~~~~
These checks validate that the structure of the file is valid.
+------------------+----------------------------------------------------------------------------------+
| check            | description                                                                      |
+==================+==================================================================================+
| blank-header     | There is a blank header name. All cells in the header row must have a value.    |
+------------------+----------------------------------------------------------------------------------+
| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
+------------------+----------------------------------------------------------------------------------+
| blank-row        | Rows must have at least one non-blank cell.                                      |
+------------------+----------------------------------------------------------------------------------+
| duplicate-row    | Rows can't be duplicated.                                                        |
+------------------+----------------------------------------------------------------------------------+
| extra-value      | A row has more columns than the header.                                          |
+------------------+----------------------------------------------------------------------------------+
| missing-value    | A row has fewer columns than the header.                                         |
+------------------+----------------------------------------------------------------------------------+
Content checks
~~~~~~~~~~~~~~
These checks validate the contents of the file. To use them, you need to
pass a `Table
Schema <https://specs.frictionlessdata.io/table-schema/>`__. If you
don't have a schema, goodtables can infer it if you use the
``infer_schema`` option.
If your schema only covers part of the data, you can use the
``infer_fields`` option to infer the remaining fields.
Lastly, if the order of the fields in the data is different than in your
schema, enable the ``order_fields`` option.
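As a sketch of how these options fit together (``schema.json`` is a
hypothetical Table Schema file; the option names are the ones listed
above, passed as keyword arguments):

.. code:: python

    from goodtables import validate

    # Validate against a known schema, tolerating a different field order
    report = validate('data.csv', schema='schema.json', order_fields=True)

    # No schema at hand: let goodtables infer one from the data
    report = validate('data.csv', infer_schema=True)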
+---------------------------+--------------------------------------------------------------------------------------------------+
| check                     | description                                                                                      |
+===========================+==================================================================================================+
| schema-error              | Schema is not valid.                                                                             |
+---------------------------+--------------------------------------------------------------------------------------------------+
| non-matching-header       | The header's name in the schema is different from what's in the data.                           |
+---------------------------+--------------------------------------------------------------------------------------------------+
| extra-header              | The data contains a header not defined in the schema.                                           |
+---------------------------+--------------------------------------------------------------------------------------------------+
| missing-header            | The data doesn't contain a header defined in the schema.                                        |
+---------------------------+--------------------------------------------------------------------------------------------------+
| type-or-format-error      | The value can't be cast based on the schema type and format for this field.                     |
+---------------------------+--------------------------------------------------------------------------------------------------+
| required-constraint       | This field is required, but it contains no value.                                               |
+---------------------------+--------------------------------------------------------------------------------------------------+
| pattern-constraint        | This field's value should conform to the defined pattern.                                       |
+---------------------------+--------------------------------------------------------------------------------------------------+
| unique-constraint         | This field is a unique field but it contains a value that has been used in another row.         |
+---------------------------+--------------------------------------------------------------------------------------------------+
| enumerable-constraint     | This field's value should be equal to one of the values in the enumeration constraint.          |
+---------------------------+--------------------------------------------------------------------------------------------------+
| minimum-constraint        | This field's value should be greater than or equal to the constraint value.                     |
+---------------------------+--------------------------------------------------------------------------------------------------+
| maximum-constraint        | This field's value should be less than or equal to the constraint value.                        |
+---------------------------+--------------------------------------------------------------------------------------------------+
| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
+---------------------------+--------------------------------------------------------------------------------------------------+
| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value.   |
+---------------------------+--------------------------------------------------------------------------------------------------+
Advanced checks
~~~~~~~~~~~~~~~
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
| check | description |
+==============================================+===================================================================================================+
| `blacklisted-value <#blacklisted-value>`__ | Ensure there are no cells with the blacklisted values. |
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
| `deviated-value <#deviated-value>`__ | Ensure numbers are within a number of standard deviations from the average. |
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
| `sequential-value <#sequential-value>`__ | Ensure numbers are sequential. |
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
| `truncated-value <#truncated-value>`__ | Detect values that were potentially truncated. |
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
| `custom-constraint <#custom-constraint>`__ | Defines a constraint based on the values of other columns (e.g. ``value * quantity == total``). |
+----------------------------------------------+---------------------------------------------------------------------------------------------------+
blacklisted-value
^^^^^^^^^^^^^^^^^
Sometimes we need to check that certain unwanted values don't appear in
our dataset. This check accepts the following options:
+-------------+---------------+------------------------------+
| option | type | description |
+=============+===============+==============================+
| column | int/str | Column number or name |
+-------------+---------------+------------------------------+
| blacklist | list of str | List of blacklisted values |
+-------------+---------------+------------------------------+
Consider the following CSV file:
.. code:: csv

    id,name
    1,John
    2,bug
    3,bad
    5,Alex
Let's check that the ``name`` column doesn't contain rows with ``bug``
or ``bad``:
.. code:: python

    from goodtables import validate

    report = validate('data.csv', checks=[
        {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
    ])
    # error on row 3 with code "blacklisted-value"
    # error on row 4 with code "blacklisted-value"
deviated-value
^^^^^^^^^^^^^^
This check helps to find outliers in a column of numbers. It accepts
the following options:
+------------+-----------+---------------------------------------------------------------------------+
| option | type | description |
+============+===========+===========================================================================+
| column | int/str | Column number or name |
+------------+-----------+---------------------------------------------------------------------------+
| average | str | Average type, either "mean", "median" or "mode" |
+------------+-----------+---------------------------------------------------------------------------+
| interval | int | Values must be inside range ``average ± standard deviation * interval`` |
+------------+-----------+---------------------------------------------------------------------------+
Consider the following CSV file:
.. code:: csv

    temperature
    1
    -2
    7
    0
    1
    2
    5
    -4
    100
    8
    3
We use the ``median`` as the average of the column values and allow an
interval of 3 standard deviations. In this case the median is ``2.0``
and the standard deviation is ``29.73``, so all valid values must fall
inside the ``[-87.19, 91.19]`` interval.
.. code:: python

    from goodtables import validate

    report = validate('data.csv', checks=[
        {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
    ])
    # error on row 10 with code "deviated-value"
sequential-value
^^^^^^^^^^^^^^^^
This check covers the common case of a column that should contain
sequentially incrementing integers. It accepts the following options:
+----------+-----------+-------------------------+
| option | type | description |
+==========+===========+=========================+
| column | int/str | Column number or name |
+----------+-----------+-------------------------+
Consider the following CSV file:
.. code:: csv

    id,name
    1,one
    2,two
    3,three
    5,five
Let's check if the ``id`` column contains sequential integers:
.. code:: python

    from goodtables import validate

    report = validate('data.csv', checks=[
        {'sequential-value': {'column': 'id'}},
    ])
    # error on row 5 with code "sequential-value"
truncated-value
^^^^^^^^^^^^^^^
Some database and spreadsheet software (e.g. MySQL or Excel) can
truncate values when saving. There are well-known heuristics for
finding these bad values. See
https://github.com/propublica/guides/blob/master/data-bulletproofing.md
for more detailed information.
Consider the following CSV file:
.. code:: csv

    id,amount,comment
    1,14000000,good
    2,2147483647,bad
    3,32767,bad
    4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
To detect all potentially truncated values, we can use the
``truncated-value`` check:
.. code:: python

    from goodtables import validate

    report = validate('data.csv', checks=[
        'truncated-value',
    ])
    # error on row 3 with code "truncated-value"
    # error on row 4 with code "truncated-value"
    # error on row 5 with code "truncated-value"
custom-constraint
^^^^^^^^^^^^^^^^^
Table Schema lets us define constraints for an individual field, but
sometimes that's not enough. With the custom constraint check, every
row can be checked against a restricted Python expression in which
variable names resolve to column values. See the list of `available
operators <https://github.com/danthedeckie/simpleeval#operators>`__. It
accepts the following option:

+------------+------+------------------------------------------------------+
| option     | type | description                                          |
+============+======+======================================================+
| constraint | str  | Constraint definition (e.g. ``col1 + col2 == col3``) |
+------------+------+------------------------------------------------------+
Consider a CSV file like this:
.. code:: csv

    id,name,salary,bonus
    1,Alex,1000,200
    2,Sam,2500,500
    3,Ray,1350,500
    4,John,5000,1000
Let's say our business rule is that a salary must be more than four
times the bonus:
.. code:: python

    from goodtables import validate

    report = validate('data.csv', checks=[
        {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
    ])
    # error on row 4 with code "custom-constraint"
Validation report format
~~~~~~~~~~~~~~~~~~~~~~~~
The validation report follows the JSON Schema defined in
`goodtables/schemas/report.json <goodtables/schemas/report.json>`__.
As an example, this is the report generated by running
``goodtables --row-limit 5 --json`` on the file
`data/datapackages/invalid/datapackage.json <data/datapackages/invalid/datapackage.json>`__:
.. code:: json

    {
      "time": 0.015,
      "valid": false,
      "error-count": 2,
      "table-count": 2,
      "tables": [
        {
          "datapackage": "data/datapackages/invalid/datapackage.json",
          "time": 0.005,
          "valid": false,
          "error-count": 1,
          "row-count": 4,
          "source": "data/datapackages/invalid/data.csv",
          "headers": [
            "id",
            "name",
            "description",
            "amount"
          ],
          "format": "inline",
          "schema": "table-schema",
          "errors": [
            {
              "code": "blank-row",
              "row-number": 3,
              "message": "Row 3 is completely blank"
            }
          ]
        },
        {
          "datapackage": "data/datapackages/invalid/datapackage.json",
          "time": 0.004,
          "valid": false,
          "error-count": 1,
          "row-count": 5,
          "source": "data/datapackages/invalid/data2.csv",
          "headers": [
            "parent",
            "comment"
          ],
          "format": "inline",
          "schema": "table-schema",
          "errors": [
            {
              "code": "blank-row",
              "row-number": 4,
              "message": "Row 4 is completely blank"
            }
          ]
        }
      ],
      "warnings": [
        "Table \"data/datapackages/invalid/data2.csv\" inspection has reached 5 row(s) limit"
      ],
      "preset": "datapackage"
    }
Developer documentation
-----------------------
Semantic versioning
~~~~~~~~~~~~~~~~~~~
We follow the `Semantic Versioning <https://semver.org/>`__
specification to define our version numbers. This means that we'll
increase the major version number whenever there's a breaking change.
Because of this, we recommend explicitly pinning the goodtables version
in your dependency list (e.g. in ``setup.py`` or ``requirements.txt``).
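For example, a ``requirements.txt`` entry that accepts bug fixes and
new features but rejects breaking changes could look like this (a
sketch; adjust the bounds to the release you target):

::

    goodtables>=2.0,<3.0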
Validate
~~~~~~~~
Goodtables validates your tabular dataset to find structural and
content errors. Suppose you have a file named ``invalid.csv``. Let's
validate it:
.. code:: python

    from goodtables import validate

    report = validate('invalid.csv')
We could also pass a remote URI instead of a local path. It supports
CSV, XLS, XLSX, ODS, JSON, and all other formats supported by the
`tabulator <https://github.com/frictionlessdata/tabulator-py/>`__
library.
Validation report
^^^^^^^^^^^^^^^^^
The output of the ``validate()`` method is a report dictionary. It
includes information about whether the data was valid, the error count,
the list of table reports, which individual checks failed, and so on.
The resulting report looks like this:
.. code:: json

    {
      "time": 0.009,
      "error-count": 1,
      "warnings": [
        "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
      ],
      "preset": "table",
      "valid": false,
      "tables": [
        {
          "errors": [
            {
              "row-number": null,
              "message": "Header in column 3 is blank",
              "row": null,
              "column-number": 3,
              "code": "blank-header"
            }
          ],
          "error-count": 1,
          "headers": [
            "id",
            "name",
            "",
            "name"
          ],
          "scheme": "file",
          "row-count": 2,
          "valid": false,
          "encoding": "utf-8",
          "time": 0.007,
          "schema": null,
          "format": "csv",
          "source": "data/invalid"
        }
      ],
      "table-count": 1
    }
The report errors are standardized and described in the
`Data Quality Spec <https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json>`__.
The errors are divided into the following categories:

- ``source`` - data can't be loaded or parsed
- ``structure`` - general tabular errors like duplicate headers
- ``schema`` - errors from checks against a `Table Schema <http://specs.frictionlessdata.io/table-schema/>`__
- ``custom`` - custom check errors
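As a small sketch (using only the report keys shown above), you can
walk a report and print every error with its code:

.. code:: python

    from goodtables import validate

    report = validate('invalid.csv')
    for table in report['tables']:
        for error in table['errors']:
            # Every error carries at least a 'code' and a 'message'
            print(error['code'], '-', error['message'])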
Checks
^^^^^^
Checks are the main validation actors in goodtables. The list of
enabled checks can be changed using the ``checks`` and ``skip_checks``
arguments. Let's explore the options with an example:
.. code:: python

    report = validate('data.csv') # by default structure and schema (if available) checks
    report = validate('data.csv', checks=['structure']) # only structure checks
    report = validate('data.csv', checks=['schema']) # only schema (if available) checks
    report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
    report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
By default, a dataset is validated against all available Data Quality
Spec errors. Some checks may be unavailable for validation; for
example, if no schema is provided, only the ``structure`` checks are
run.
Presets
^^^^^^^
Goodtables supports different formats of tabular datasets, called
presets. A tabular dataset is some data that can be split into a list
of data tables, as:
|Dataset|
We can change the preset using the ``preset`` argument of
``validate()``. By default, it'll be inferred from the source, falling
back to ``table``. To validate a
`data package <https://specs.frictionlessdata.io/data-package/>`__, we
can do:
.. code:: python

    report = validate('datapackage.json') # implicit preset
    report = validate('datapackage.json', preset='datapackage') # explicit preset
This will validate all tabular resources in the datapackage.
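The same should work from the command line, since the preset is
inferred from the source there as well (a sketch):

::

    goodtables datapackage.json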
It's also possible to validate a list of files using the "nested"
preset. To do so, the first argument to ``validate()`` should be a list
of dictionaries, where each key in the dictionary is named after a
parameter of ``validate()``. For example:
.. code:: python

    report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
    report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
This is similar to:
.. code:: python

    report_data1 = validate('data1.csv')
    report_data2 = validate('data2.csv')
The difference is that goodtables validates multiple tables in
parallel, so the "nested" preset should run faster.
Contributing
------------
This project follows the `Open Knowledge International coding
standards <https://github.com/okfn/coding-standards>`__.
We recommend using ``virtualenv`` to isolate goodtables from the rest
of the packages on your machine.
To install goodtables and the development dependencies, run:
::

    $ make install
To run the tests, use:
.. code:: bash

    $ make test
FAQ
---
How can I add a new custom check?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To create a custom check, use the ``check`` decorator. This way a
built-in check can be overridden (use a spec error code like
``duplicate-row``), or a check for a custom error can be added (use the
``type``, ``context`` and ``position`` arguments):
.. code:: python

    from goodtables import validate, check, Error

    @check('custom-check', type='custom', context='body')
    def custom_check(cells):
        errors = []
        for cell in cells:
            message = 'Custom error on column {column_number} and row {row_number}'
            error = Error(
                'custom-error',
                cell,
                message=message
            )
            errors.append(error)
        return errors

    report = validate('data.csv', checks=['custom-check'])
For now this documentation section is incomplete. Please see the
built-in checks to learn more about the checking protocol.
How can I add support for a new tabular file type?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To create a custom preset, use the ``preset`` decorator. This way a
built-in preset can be overridden, or a custom preset can be added.
.. code:: python

    from tabulator import Stream
    from tableschema import Schema
    from goodtables import validate, preset

    @preset('custom-preset')
    def custom_preset(source, **options):
        warnings = []
        tables = []
        for table in source:
            try:
                tables.append({
                    'source': str(source),
                    'stream': Stream(...),
                    'schema': Schema(...),
                    'extra': {...},
                })
            except Exception:
                warnings.append('Warning message')
        return warnings, tables

    report = validate(source, preset='custom-preset')
For now this documentation section is incomplete. Please see the
built-in presets to learn more about the dataset extraction protocol.
Changelog
---------
v2.0
~~~~
Breaking changes:
- The check method signature now receives only the current row's
  ``cells`` list
- Checks raise errors by returning an array of ``Error`` objects
- Cells have the row number in the ``row-number`` key
- Files with a ZIP extension are presumed to be data packages, so
  ``goodtables mydatapackage.zip`` works
- Improvements to the goodtables CLI
  (`#233 <https://github.com/frictionlessdata/goodtables-py/issues/233>`__)
- New ``goodtables init <data paths>`` command to create a new
  ``datapackage.json`` with the files passed as parameters and their
  inferred schemas.
Bug fixes:
- Fix bug with the ``truncated-value`` check on date fields
  (`#250 <https://github.com/frictionlessdata/goodtables-py/issues/250>`__)
v1.5
~~~~
New API added:
- Validation ``source`` can now be a ``pathlib.Path``
v1.4
~~~~
Improved behaviour:
- rebased on Data Quality Spec v1
- rebased on Data Package Spec v1
- rebased on Table Schema Spec v1
- treat primary key as required/unique field
v1.3
~~~~
New advanced checks added:
- ``blacklisted-value``
- ``custom-constraint``
- ``deviated-value``
- ``sequential-value``
- ``truncated-value``
v1.2
~~~~
New API added:
- ``report.preset``
- ``report.tables[].schema``
v1.1
~~~~
New API added:
- ``report.tables[].scheme``
- ``report.tables[].format``
- ``report.tables[].encoding``
v1.0
~~~~
This version includes various big changes. A migration guide is under
development and will be published here.
v0.6
~~~~
First version of ``goodtables``.
.. |Travis| image:: https://img.shields.io/travis/frictionlessdata/goodtables-py/master.svg
   :target: https://travis-ci.org/frictionlessdata/goodtables-py
.. |Coveralls| image:: http://img.shields.io/coveralls/frictionlessdata/goodtables-py.svg?branch=master
   :target: https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master
.. |PyPi| image:: https://img.shields.io/pypi/v/goodtables.svg
   :target: https://pypi-hypernode.com/pypi/goodtables
.. |Gitter| image:: https://img.shields.io/gitter/room/frictionlessdata/chat.svg
   :target: https://gitter.im/frictionlessdata/chat
.. |Dataset| image:: data/dataset.png