A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
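
The workflow in the list above (statistics, schema inference, anomaly detection) can be sketched with TFDV's Python API. This is a minimal sketch rather than an excerpt from the project: the helper name `check_eval_against_train` and the CSV paths passed to it are hypothetical, and it assumes TFDV is installed and that both files have a header row.

```python
def check_eval_against_train(train_csv, eval_csv):
    """Infer a schema from training data and validate evaluation data against it.

    Hypothetical helper for illustration. The import is deferred so that
    defining the function does not itself require TFDV to be installed.
    """
    import tensorflow_data_validation as tfdv

    # Summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Schema inferred from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Anomalies found when the evaluation statistics are checked against the schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for these objects.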

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install docker and docker-compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37}.
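
If it is unclear which value to pass, the tag can be derived from the interpreter you intend to target. The snippet below is a small sketch; the names `SUPPORTED_PYTHON_VERSIONS`, `python_version_tag`, and `is_supported` are illustrative and not part of TFDV or its build scripts.

```python
import sys

# Values accepted for PYTHON_VERSION according to the text above.
SUPPORTED_PYTHON_VERSIONS = ("35", "36", "37")


def python_version_tag(version_info=sys.version_info):
    """Return the major and minor digits joined together, e.g. (3, 7) -> '37'."""
    return "{}{}".format(version_info[0], version_info[1])


def is_supported(tag):
    """Return True if the tag is one of the values the Docker build accepts."""
    return tag in SUPPORTED_PYTHON_VERSIONS
```

For example, `python_version_tag((3, 6, 9))` yields `"36"`, which can then be exported as `PYTHON_VERSION` before running the docker-compose commands.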

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the one of the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) are built with a GCC older than 5.1, so the build passes the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.
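
Before running the Bazel build in this step, it can help to confirm which interpreter `python` resolves to and whether NumPy is importable from it. The following is a minimal sketch using only the standard library; the function name `build_environment_report` is hypothetical.

```python
import importlib.util
import sys


def build_environment_report(required_module="numpy"):
    """Describe the running interpreter and whether a module can be imported from it."""
    return {
        # Path of the interpreter actually executing this script.
        "executable": sys.executable,
        # Major.minor version, to compare against the target version.
        "version": "{}.{}".format(sys.version_info[0], sys.version_info[1]),
        # find_spec returns None when a top-level module is not installed.
        "has_module": importlib.util.find_spec(required_module) is not None,
    }
```

Running it with the `python` on your `$PATH` and printing the result shows whether the interpreter matches the target version and has NumPy available.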

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation  tensorflow         apache-beam[gcp]  pyarrow
GitHub master               nightly (1.x/2.x)  2.20.0            0.16.0
0.22.1                      1.15 / 2.2         2.20.0            0.16.0
0.22.0                      1.15 / 2.2         2.20.0            0.16.0
0.21.5                      1.15 / 2.1         2.17.0            0.15.0
0.21.4                      1.15 / 2.1         2.17.0            0.15.0
0.21.2                      1.15 / 2.1         2.17.0            0.15.0
0.21.1                      1.15 / 2.1         2.17.0            0.15.0
0.21.0                      1.15 / 2.1         2.17.0            0.15.0
0.15.0                      1.15 / 2.0         2.16.0            0.14.0
0.14.1                      1.14               2.14.0            0.14.0
0.14.0                      1.14               2.14.0            0.14.0
0.13.1                      1.13               2.11.0            n/a
0.13.0                      1.13               2.11.0            n/a
0.12.0                      1.12               2.10.0            n/a
0.11.0                      1.11               2.8.0             n/a
0.9.0                       1.9                2.6.0             n/a
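
One way to use the table is to turn a row into pinned requirement strings. The sketch below encodes only a subset of the rows. The names `COMPATIBLE` and `pinned_requirements` are hypothetical, and pinning to exactly the tested versions is a conservative assumption, since the text notes that untested combinations may also work.

```python
# Subset of the compatibility table, keyed by TFDV version:
# (tested tensorflow versions, apache-beam[gcp] version, pyarrow version or None).
COMPATIBLE = {
    "0.22.1": (("1.15", "2.2"), "2.20.0", "0.16.0"),
    "0.21.5": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.15.0": (("1.15", "2.0"), "2.16.0", "0.14.0"),
    "0.14.1": (("1.14",), "2.14.0", "0.14.0"),
    "0.13.1": (("1.13",), "2.11.0", None),
}


def pinned_requirements(tfdv_version, tensorflow_version=None):
    """Return pip requirement strings matching one row of the table."""
    tf_versions, beam, pyarrow = COMPATIBLE[tfdv_version]
    # Default to the newest tested TensorFlow version for that row.
    tf = tensorflow_version or tf_versions[-1]
    if tf not in tf_versions:
        raise ValueError("tensorflow {} is not listed for TFDV {}".format(tf, tfdv_version))
    requirements = [
        "tensorflow-data-validation=={}".format(tfdv_version),
        # The table lists only major.minor, so any patch release is allowed.
        "tensorflow=={}.*".format(tf),
        "apache-beam[gcp]=={}".format(beam),
    ]
    if pyarrow is not None:  # rows marked n/a have no pyarrow requirement
        requirements.append("pyarrow=={}".format(pyarrow))
    return requirements
```

The returned strings can be written to a requirements file or passed to `pip install` directly.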

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.22.1-cp37-cp37m-win_amd64.whl (1.8 MB): CPython 3.7m, Windows x86-64
  • tensorflow_data_validation-0.22.1-cp37-cp37m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.7m, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.22.1-cp37-cp37m-macosx_10_9_x86_64.whl (3.0 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.22.1-cp36-cp36m-win_amd64.whl (1.8 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.22.1-cp36-cp36m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.6m, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.22.1-cp36-cp36m-macosx_10_9_x86_64.whl (3.0 MB): CPython 3.6m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.22.1-cp35-cp35m-win_amd64.whl (1.8 MB): CPython 3.5m, Windows x86-64
  • tensorflow_data_validation-0.22.1-cp35-cp35m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.5m, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.22.1-cp35-cp35m-macosx_10_6_intel.whl (3.0 MB): CPython 3.5m, macOS 10.6+ intel
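
The wheel filenames above follow the standard wheel naming layout of distribution, version, Python tag, ABI tag, and platform tag, separated by hyphens. As a rough sketch of how to read them, the parser below handles only filenames without the optional build tag; `WheelTags` and `parse_wheel_filename` are illustrative names, not part of pip or TFDV.

```python
from collections import namedtuple

WheelTags = namedtuple("WheelTags", "distribution version python abi platform")


def parse_wheel_filename(filename):
    """Split a wheel filename (without a build tag) into its five components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel filename: {}".format(filename))
    parts = filename[: -len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError("expected 5 hyphen-separated components, got {}".format(len(parts)))
    return WheelTags(*parts)
```

For instance, the `cp37` and `manylinux2010_x86_64` components indicate a wheel built for CPython 3.7 on 64-bit Linux.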

File details

Details for the file tensorflow_data_validation-0.22.1-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.22.1-cp37-cp37m-win_amd64.whl
  • Size: 1.8 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.7

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 ffc6ba4b7ac01c9554a09b41ea36dc87a7208a83b8c9ed24b7b7c81b3ca88be2
MD5 016062a0cbe3831c133f32127b492599
BLAKE2b-256 5b4505cdb560c06c04e99263c9b796db1db4f52958dd3fa5cbb99d391619f9b7

See more details on using hashes here.
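
A downloaded wheel can be checked against the published SHA256 digest with the standard library. The following is a minimal sketch; the function names `sha256_of_file` and `matches_published_hash` are hypothetical, and the path you pass is wherever you saved the file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call `matches_published_hash` with the local wheel path and the SHA256 value listed for that file; a `False` result means the download should not be installed.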

File details

Details for the file tensorflow_data_validation-0.22.1-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 d43ddb04dbade9569f1e0004b98fce62bc903b22e68a4589ac5d31beebf1650d
MD5 d85caa6794f2b06ffb01c297688ffd33
BLAKE2b-256 26a4f16e8951230e113773f3c4cdf79e7cf52dffec557573e660194cf297af5f

File details

Details for the file tensorflow_data_validation-0.22.1-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bae31c4949fcdbb1da27837a2a594f7a219e4bf17c201c1729b2c6d50b4f088b
MD5 943e6d287a7d08d0ccd2fabd535d995a
BLAKE2b-256 8422168251e29ed48a2ae5a312db5d7f65398b09e8cd7714cf4f82367540b783

File details

Details for the file tensorflow_data_validation-0.22.1-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.22.1-cp36-cp36m-win_amd64.whl
  • Size: 1.8 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.7

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 67260831ec3f887faeb57a64750c6ba9b8bbf70825fea33a1700750304bd3a01
MD5 11a18fa8d59f23d04a116cb868797cb9
BLAKE2b-256 5cddce6cf5b7337b329959748d99e5398d73d2089c71995964bb0d6dcf98ff5e

File details

Details for the file tensorflow_data_validation-0.22.1-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 679998dd1b0adea2364b6d2e6b7930a0055684be0322881123945daad4194855
MD5 26082da1aeed85150f0395929b7de58b
BLAKE2b-256 a535af73731cd084273901759c1f28ef9608f95e02d441d23e7090e8d6041c17

File details

Details for the file tensorflow_data_validation-0.22.1-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 76bb2af79ec247d4f642cf0dd3d07875263977ddf776810a2e821935a534c9ed
MD5 8c2a1f25a781ccba5a055d2cd5ba1cc0
BLAKE2b-256 46b44fe708bc803875dd57281b10c05b08bd6406566a1a91920d862ef8064211

File details

Details for the file tensorflow_data_validation-0.22.1-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.22.1-cp35-cp35m-win_amd64.whl
  • Size: 1.8 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.7

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 c080628ab0e2d15e076bf9619c6e101ecf36d41bc62cabd7af655115dc4e56fa
MD5 6f31c1c9f11841b5ecc0d302a4c12319
BLAKE2b-256 8dba438cfa9756343157d4f6d4fc119307d5760e0a4ad0718919c8a4d9808cb9

File details

Details for the file tensorflow_data_validation-0.22.1-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 913d7b7147538cd13708a03542a81d65dd52f895ada3189149c46f6c018bfbc0
MD5 d25ab63ee182ccd1ec41f4c2223ce220
BLAKE2b-256 25a63d42671ae42081953ea10049c903eb52302691da03904d0c5b653c7cfc70

File details

Details for the file tensorflow_data_validation-0.22.1-cp35-cp35m-macosx_10_6_intel.whl.

File hashes

Hashes for tensorflow_data_validation-0.22.1-cp35-cp35m-macosx_10_6_intel.whl
Algorithm Hash digest
SHA256 271bb43b6e3852dd97f7fab7189845e510beaa9dc1ee888c01fabca1714be7d8
MD5 ff0c8386b0a38c6bfeef2b925cf3c2c3
BLAKE2b-256 801857060faee97ac8877d78d4a53123af46f7727c544d4e85dfd7d4b8c5516d
