
goodtables-py
=============

| |Travis|
| |Coveralls|
| |PyPi|
| |Gitter|

Goodtables is a framework to validate tabular data.

Version v1.0 includes various important changes. Please read the
`migration guide <#v10>`__.

Features
--------

- tabular data validation
- general, structure and schema checks
- support for different input data presets
- support for various source schemes and formats
- parallel computation for multi-table datasets
- builtin command-line interface

Getting Started
---------------

Installation
~~~~~~~~~~~~

The package uses semantic versioning, which means that major versions
could include breaking changes. It's highly recommended to specify a
``goodtables`` version range in your ``setup.py`` or
``requirements.txt`` file, e.g. ``goodtables<2.0``.

.. code:: bash

$ pip install goodtables
$ pip install goodtables[ods] # With ods format support

Example
~~~~~~~

Let's start with a simple example. We just run the ``validate`` function
against our data table and get a ``goodtables`` report as a result.

.. code:: python

from goodtables import validate

report = validate('invalid.csv')
report['valid'] # False
report['table-count'] # 1
report['error-count'] # 3
report['tables'][0]['valid'] # False
report['tables'][0]['source'] # 'invalid.csv'
report['tables'][0]['errors'][0]['code'] # 'blank-header'

There is an
`examples <https://github.com/frictionlessdata/goodtables-py/tree/master/examples>`__
directory containing other code listings.

Documentation
-------------

The whole public API of this package is described here and follows
semantic versioning rules. Everything outside of this readme is private
API and could be changed without notice in any new version.

Validate
~~~~~~~~

Goodtables validates your tabular dataset to find source, structure and
schema errors. Suppose you have a file named ``invalid.csv``. Let's
validate it:

.. code:: py

report = validate('invalid.csv')

We can validate not only a local file but also a remote link, a
file-like object, inline data and more. And the source can be not only
CSV but also XLS, XLSX, ODS, JSON and many other formats. Under the
hood ``goodtables`` uses the powerful
`tabulator <https://github.com/frictionlessdata/tabulator-py>`__
library. All schemes and formats supported by ``tabulator`` are
supported by ``goodtables``.
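
For instance (a minimal sketch; the remote URL below is hypothetical):

.. code:: py

# remote source - the scheme and format are inferred by tabulator
report = validate('https://example.com/data.xlsx')

# file-like object - the format can't be inferred, so pass it explicitly
report = validate(open('invalid.csv', 'rb'), format='csv')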

Report
^^^^^^

As a result of validation goodtables returns a report dictionary. It
includes a valid flag, error counts, a list of per-table reports
including errors, and so on. The resulting report looks like this:

|Report|

Base report errors are standardized and described in the `Data Quality
Spec <https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json>`__.
All errors fall into three base categories and one additional category:

- ``source`` - data can't be loaded or parsed
- ``structure`` - general tabular errors like duplicate headers
- ``schema`` - errors from checks against `Table
Schema <http://specs.frictionlessdata.io/table-schema/>`__
- ``custom`` - custom checks errors
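
As a rough sketch, the report for the ``invalid.csv`` example above has
the following shape (the error entry shown is illustrative):

.. code:: py

report = {
    'valid': False,
    'table-count': 1,
    'error-count': 3,
    'tables': [
        {
            'valid': False,
            'source': 'invalid.csv',
            'errors': [
                {
                    'code': 'blank-header',
                    'message': 'Header in column 3 is blank',
                    'column-number': 3,
                },
                # ...more errors
            ],
        },
    ],
}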

Presets
^^^^^^^

Different kinds of tabular datasets can be validated with
``goodtables``. A tabular dataset is anything that can be split into a
list of data tables:

|Dataset|

To work with different kinds of datasets we use the ``preset`` argument
of the ``validate`` function. By default it is inferred, with ``table``
as a fallback value. Let's validate a `data
package <http://specs.frictionlessdata.io/data-package/>`__. We get a
report of the same form, but it will contain more than one table if
there is more than one resource in the data package:

.. code:: py

report = validate('datapackage.json') # implicit preset
report = validate('datapackage.json', preset='datapackage') # explicit preset

To validate a list of files we use the ``nested`` preset. For the
nested preset the first argument should be a list of dictionaries with
keys named after the ``validate`` argument names. The first argument is
``source``; the other arguments are discussed in the next sections.
Technically ``goodtables`` validates the list of tables in parallel, so
validating many tables in one run should be efficient:

.. code:: py

report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset

Checks
^^^^^^

Checks are the main validation actors in goodtables. Every check is
associated with a Data Quality Spec error. The list of checks can be
customized using the ``checks`` argument. Let's explore the options
with an example:

.. code:: python

report = validate('data.csv') # by default all spec checks (if a schema is provided)
report = validate('data.csv', checks='structure') # only spec structure checks
report = validate('data.csv', checks='schema') # only spec schema checks (if a schema is provided)
report = validate('data.csv', checks={'spec': True, 'bad-headers': False}) # spec checks excluding 'bad-headers'
report = validate('data.csv', checks={'bad-headers': True}) # check only 'bad-headers'

By default a dataset is validated against all available Data Quality
Spec errors. Some checks may not be available for validation, e.g. if a
schema is not provided, only ``structure`` checks will be done.

``validate(source, **options)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **[Arguments - for ``table`` preset]**
- ``source (path/url/dict/file-like)`` - validation source containing
data table
- ``preset (str)`` - dataset type; could be ``table`` (default),
``datapackage``, ``nested`` or custom. In most cases the preset will
be inferred from the source.
- ``schema (path/url/dict/file-like)`` - Table Schema to validate the
data source against
- ``headers (list/int)`` - a list of headers, or the source row number
containing the headers. If a number is given, for a plain source the
headers row and all rows before it will be removed; for a keyed source
no rows will be removed.
- ``scheme (str)`` - source scheme with ``file`` as default. In most
cases the scheme will be inferred from the source. See the `list of
supported
schemes <https://github.com/frictionlessdata/tabulator-py#schemes>`__.
- ``format (str)`` - source format with ``None`` (detect) as default.
In most cases the format will be inferred from the source. See the
`list of supported
formats <https://github.com/frictionlessdata/tabulator-py#formats>`__.
- ``encoding (str)`` - source encoding with ``None`` (detect) as
default.
- ``skip_rows (int/str[])`` - list of rows to skip, by row number or
row comment. Example: ``skip_rows=[1, 2, '#', '//']`` - rows 1, 2 and
all rows starting with ``#`` or ``//`` will be skipped.
- ``<name> (<type>)`` - additional options supported by particular
schemes and formats. See the `list of scheme
options <https://github.com/frictionlessdata/tabulator-py#schemes>`__
and the `list of format
options <https://github.com/frictionlessdata/tabulator-py#formats>`__.
- **[Arguments - for ``datapackage`` preset]**
- ``source (path/url/dict/file-like)`` - validation source containing
data package descriptor
- ``preset (str)`` - dataset type; could be ``table`` (default),
``datapackage``, ``nested`` or custom. In most cases the preset will
be inferred from the source.
- ``<name> (<type>)`` - options to pass to Data Package constructor
- **[Arguments - for ``nested`` preset]**
- ``source (dict[])`` - list of dictionaries with keys named after
arguments for corresponding preset
- ``preset (str)`` - dataset type; could be ``table`` (default),
``datapackage``, ``nested`` or custom. In most cases the preset will
be inferred from the source.
- **[Arguments]**
- ``checks (str/dict)`` - checks configuration
- ``infer_schema (bool)`` - infer schema if not passed
- ``infer_fields (bool)`` - infer schema for columns not present in the
schema
- ``order_fields (bool)`` - order source columns based on schema fields
order
- ``error_limit (int)`` - error limit per table
- ``table_limit (int)`` - table limit for dataset
- ``row_limit (int)`` - row limit per table
- **[Raises]**
- ``(exceptions.GoodtablesException)`` - raise on any non-tabular error
- **[Returns]**
- ``(dict)`` - returns a ``goodtables`` report
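
Putting a few of these options together (the file names below are
illustrative):

.. code:: py

report = validate('data.csv', schema='schema.json', order_fields=True, error_limit=5)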

Validation against schema
~~~~~~~~~~~~~~~~~~~~~~~~~

If we run a simple table validation, no schema checks are involved:

.. code:: py

report = validate('invalid.csv') # only structure checks

That's because there is no `Table
Schema <http://specs.frictionlessdata.io/table-schema/>`__ to check
against. We have two options to fix this:

- provide ``schema`` argument containing Table Schema descriptor
- use ``infer_schema`` option to infer Table Schema from data source
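
For example (``schema.json`` below is an illustrative local descriptor):

.. code:: py

# option 1: validate against an explicit Table Schema descriptor
report = validate('invalid.csv', schema='schema.json')

# option 2: infer a Table Schema from the data source itself
report = validate('invalid.csv', infer_schema=True)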

Sometimes a schema covers a data table only partially, e.g. the table
has headers ``name, age, position`` but the schema has only ``name``
and ``age`` fields. In this case we use the ``infer_fields`` option:

.. code:: py

# schema will be complemented by `position` field
report = validate('data.csv', schema='schema.json', infer_fields=True)

Another possible discrepancy arises when your schema fields are in a
different order than the data table columns. The ``order_fields``
option comes to the rescue:

.. code:: py

# sync source/schema fields order
report = validate('data.csv', schema='schema.json', order_fields=True)

Validation limits
~~~~~~~~~~~~~~~~~

If we need to save time or resources we can limit validation. By
default the limits have reasonable values, but they can be set to any
value by the user. These are the available limits:

- errors per table limit
- tables per dataset limit
- rows per table limit

The most common case is stopping on the first error found:

.. code:: py

report = validate('data.csv', error_limit=1)
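
The other limits work the same way (the values below are illustrative):

.. code:: py

report = validate('datapackage.json', table_limit=2) # validate at most 2 tables
report = validate('data.csv', row_limit=1000) # check only the first 1000 rows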

Custom presets
~~~~~~~~~~~~~~

It's a provisional API. If you use it as part of another program,
please pin a concrete ``goodtables`` version in your requirements
file.

To create a custom preset, use the ``preset`` decorator. This way a
builtin preset can be overridden or a new custom preset can be added.

.. code:: python

from tabulator import Stream
from tableschema import Schema
from goodtables import validate, preset

@preset('custom-preset')
def custom_preset(source, **options):
    # a preset maps the source to a (warnings, tables) pair
    warnings = []
    tables = []
    for table in source:
        try:
            tables.append({
                'source': str(table),
                'stream': Stream(...),
                'schema': Schema(...),
                'extra': {...},
            })
        except Exception:
            warnings.append('Warning message')
    return warnings, tables

report = validate(source, preset='custom-preset', custom_presets=[custom_preset])

See builtin presets to learn more about the dataset extraction protocol.

Custom checks
~~~~~~~~~~~~~

It's a provisional API. If you use it as part of another program,
please pin a concrete ``goodtables`` version in your requirements
file.

To create a custom check, use the ``check`` decorator. This way a
builtin check can be overridden (use a spec error code like
``duplicate-row``) or a check can be added for a custom error (use the
``type``, ``context`` and ``after/before`` arguments):

.. code:: python

from goodtables import validate, check

@check('custom-error', type='structure', context='body', after='blank-row')
def custom_check(errors, columns, row_number, state=None):
    # iterate over a copy because columns are removed inside the loop
    for column in list(columns):
        errors.append({
            'code': 'custom-error',
            'message': 'Custom error',
            'row-number': row_number,
            'column-number': column['number'],
        })
        columns.remove(column)

report = validate('data.csv', custom_checks=[custom_check])

See builtin checks to learn more about checking protocol.

Spec
~~~~

The Data Quality Spec is shipped with the library:

.. code:: py

from goodtables import spec

spec['version'] # spec version
spec['errors'] # list of errors

``spec``
^^^^^^^^

- ``(dict)`` - returns Data Quality Spec

Exceptions
~~~~~~~~~~

``exceptions.GoodtablesException``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Base class for all ``goodtables`` exceptions.
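
A minimal sketch of catching it; the failing call below is
illustrative, and this assumes the ``exceptions`` module is importable
from the package as the reference above suggests:

.. code:: py

from goodtables import validate, exceptions

try:
    report = validate('data.csv', preset='non-existent-preset')
except exceptions.GoodtablesException as exception:
    print(exception)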

CLI
~~~

It's a provisional API. If you use it as part of another program,
please pin a concrete ``goodtables`` version in your requirements
file.

All common goodtables tasks can be done using the command-line
interface. For example, write the following command in the shell to
inspect a data table or a data package:

::

$ goodtables data.csv
$ goodtables datapackage.json

The ``goodtables`` report will then be printed to standard output in a
nicely formatted way.

``$ goodtables``
^^^^^^^^^^^^^^^^

::

Usage: cli.py [OPTIONS] SOURCE

https://github.com/frictionlessdata/goodtables-py#cli

Options:
--preset TEXT
--schema TEXT
--checks TEXT
--infer-schema
--infer-fields
--order-fields
--error-limit INTEGER
--table-limit INTEGER
--row-limit INTEGER
--json
--version Show the version and exit.
--help Show this message and exit.
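
The options mirror the ``validate`` arguments described above. For
example, to stop on the first error found and print the report as JSON:

::

$ goodtables --json --error-limit 1 data.csv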

Inspector
~~~~~~~~~

This API could be deprecated in the future. It's recommended to use the
``validate`` counterpart instead.

``Inspector(**settings)``
^^^^^^^^^^^^^^^^^^^^^^^^^

``inspector.inspect(source, **source_options)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
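
A minimal usage sketch, assuming the ``Inspector`` settings and source
options mirror the ``validate`` arguments described above:

.. code:: py

from goodtables import Inspector

inspector = Inspector(error_limit=1)
report = inspector.inspect('data.csv')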

Contributing
------------

The project follows the `Open Knowledge International coding
standards <https://github.com/okfn/coding-standards>`__.

The recommended way to get started is to create and activate a project
virtual environment. To install the package and development
dependencies into the active environment:

::

$ make install

To run tests with linting and coverage:

.. code:: bash

$ make test

For linting, ``pylama`` configured in ``pylama.ini`` is used. At this
stage it's already installed into your environment and can be used
separately, with more fine-grained control, as described in its
documentation - https://pylama.readthedocs.io/en/latest/.

For example to sort results by error type:

.. code:: bash

$ pylama --sort <path>

For testing, ``tox`` configured in ``tox.ini`` is used. It's already
installed into your environment and can be used separately with more
fine-grained control as described in its documentation -
https://testrun.org/tox/latest/.

For example, to check a subset of tests against a Python 2 environment
with increased verbosity (all positional arguments and options after
``--`` will be passed to ``py.test``):

.. code:: bash

$ tox -e py27 -- -v tests/<path>

Under the hood ``tox`` uses ``pytest`` configured in ``pytest.ini``,
plus the ``coverage`` and ``mock`` packages. These packages are
available only in the tox environments.

Changelog
---------

Only breaking and the most important changes are described here. The
full changelog and documentation for all released versions can be found
in the nicely formatted `commit
history <https://github.com/frictionlessdata/goodtables-py/commits/master>`__.

v1.2
~~~~

New API added:

- ``report.preset``
- ``report.tables[].schema``

v1.1
~~~~

New API added:

- ``report.tables[].scheme``
- ``report.tables[].format``
- ``report.tables[].encoding``

v1.0
~~~~

This version includes various big changes. A migration guide is under
development and will be published here.

v0.6
~~~~

First version of ``goodtables``.

.. |Travis| image:: https://img.shields.io/travis/frictionlessdata/goodtables-py/master.svg
:target: https://travis-ci.org/frictionlessdata/goodtables-py
.. |Coveralls| image:: http://img.shields.io/coveralls/frictionlessdata/goodtables-py.svg?branch=master
:target: https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master
.. |PyPi| image:: https://img.shields.io/pypi/v/goodtables.svg
:target: https://pypi-hypernode.com/pypi/goodtables
.. |Gitter| image:: https://img.shields.io/gitter/room/frictionlessdata/chat.svg
:target: https://gitter.im/frictionlessdata/chat
.. |Report| image:: http://i.imgur.com/fZkc2OI.png
.. |Dataset| image:: https://raw.githubusercontent.com/frictionlessdata/goodtables-py/master/data/dataset.png
