TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer (Facets) for data distributions and statistics, as well as faceted comparison of pairs of features.
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer that shows which features have anomalies and provides details to help you correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
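The capabilities listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch, assuming TFDV is installed and that the two arguments are paths to CSV files with a header row. The function name `validate_eval_against_train` is illustrative and not part of TFDV; the `tfdv.*` calls are the library's public entry points.

```python
def validate_eval_against_train(train_csv, eval_csv):
    """Return {feature_name: short_description} for anomalies found in eval_csv."""
    # Imported lazily so this module can be loaded even where TFDV is absent.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics for both datasets.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # 2. Infer a schema (types, domains, presence) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Check the evaluation statistics against that schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # anomaly_info maps each offending feature name to details about the problem.
    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for the objects produced in steps 1 to 3.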

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
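After installing, it can be useful to confirm that the package is importable and which version was picked up. A minimal sketch, assuming the package exposes a top-level `__version__` attribute; the helper name `module_version` is illustrative:

```python
import importlib


def module_version(module_name="tensorflow_data_validation"):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Fall back to "unknown" if the module defines no __version__ attribute.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = module_version()
    if version is None:
        print("tensorflow_data_validation is not installed")
    else:
        print("tensorflow_data_validation", version)
```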

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions build the latest master branch of TensorFlow Data Validation. To build a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37}.

A wheel will be produced under dist/.
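The PYTHON_VERSION value can be derived from the interpreter you intend to install into. A small sketch, assuming the compact form means major and minor digits joined together (so 37 stands for Python 3.7, in line with the cp37 wheel tags); the helper names are illustrative:

```python
import sys

# Values accepted by the Docker build described above.
SUPPORTED_TAGS = {"35", "36", "37"}


def python_version_tag(version_info=None):
    """Format (major, minor, ...) as a compact tag, e.g. (3, 7) -> "37"."""
    if version_info is None:
        version_info = sys.version_info
    return "{}{}".format(version_info[0], version_info[1])


def is_supported(tag):
    """True if the tag is one of the values the Docker build accepts."""
    return tag in SUPPORTED_TAGS


if __name__ == "__main__":
    tag = python_version_tag()
    status = "supported" if is_supported(tag) else "not supported by this build"
    print("PYTHON_VERSION={} ({})".format(tag, status))
```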

4. Install the pip package

pip install dist/*.whl
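If the Docker build was run for more than one PYTHON_VERSION, dist/ will hold several wheels and the glob above matches all of them. The sketch below selects only the wheels whose Python tag matches the running interpreter. It assumes CPython wheels named according to the standard wheel file name layout (name-version[-build]-python-abi-platform.whl); the function names are illustrative.

```python
import glob
import os
import sys


def wheel_tags(filename):
    """Split a wheel file name into (python_tag, abi_tag, platform_tag)."""
    base = os.path.basename(filename)
    if not base.endswith(".whl"):
        raise ValueError("not a wheel file: {}".format(base))
    parts = base[: -len(".whl")].split("-")
    # name, version, optional build tag, then the three compatibility tags.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel file name: {}".format(base))
    return tuple(parts[-3:])


def wheels_for_interpreter(dist_dir="dist", version_info=None):
    """List wheels in dist_dir built for the given (or current) CPython version."""
    if version_info is None:
        version_info = sys.version_info
    python_tag = "cp{}{}".format(version_info[0], version_info[1])
    pattern = os.path.join(dist_dir, "*.whl")
    return [
        path for path in sorted(glob.glob(pattern))
        if wheel_tags(path)[0] == python_tag
    ]


if __name__ == "__main__":
    for path in wheels_for_interpreter():
        print(path)  # pass one of these paths to `pip install`
```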

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions build the latest master branch of TensorFlow Data Validation. To build a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before running the following command, make sure the python on your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) were built with a GCC older than 5.1; the flag -D_GLIBCXX_USE_CXX11_ABI=0 keeps the build compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV relies on it for efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
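Switching between local and distributed execution is a matter of which Beam pipeline options are passed in. The following is a rough sketch, not a complete Dataflow setup: `beam_flags` and `generate_statistics` are illustrative helper names, and requiring project, region, and temp_location for Dataflow is a conservative choice made by this sketch. The flags themselves (`--runner`, `--project`, `--region`, `--temp_location`) are standard Beam options.

```python
def beam_flags(runner="DirectRunner", project=None, region=None, temp_location=None):
    """Build a list of Beam command-line flags for local or Dataflow execution."""
    flags = ["--runner={}".format(runner)]
    if runner == "DataflowRunner":
        settings = (
            ("project", project),
            ("region", region),
            ("temp_location", temp_location),
        )
        missing = [name for name, value in settings if not value]
        if missing:
            raise ValueError("missing Dataflow settings: " + ", ".join(missing))
        flags += ["--{}={}".format(name, value) for name, value in settings]
    return flags


def generate_statistics(csv_path, flags):
    """Compute TFDV statistics for a CSV file with the given Beam flags."""
    # Imported lazily so the flag helper works without Beam or TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_csv(
        data_location=csv_path,
        pipeline_options=PipelineOptions(flags),
    )
```

With no arguments, `beam_flags()` selects the local DirectRunner, which matches the default behavior described above.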

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | tensorflow        | apache-beam[gcp] | pyarrow |
|----------------------------|-------------------|------------------|---------|
| GitHub master              | nightly (1.x/2.x) | 2.22.0           | 0.16.0  |
| 0.22.2                     | 1.15 / 2.2        | 2.20.0           | 0.16.0  |
| 0.22.1                     | 1.15 / 2.2        | 2.20.0           | 0.16.0  |
| 0.22.0                     | 1.15 / 2.2        | 2.20.0           | 0.16.0  |
| 0.21.5                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.4                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.2                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.1                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.0                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.15.0                     | 1.15 / 2.0        | 2.16.0           | 0.14.0  |
| 0.14.1                     | 1.14              | 2.14.0           | 0.14.0  |
| 0.14.0                     | 1.14              | 2.14.0           | 0.14.0  |
| 0.13.1                     | 1.13              | 2.11.0           | n/a     |
| 0.13.0                     | 1.13              | 2.11.0           | n/a     |
| 0.12.0                     | 1.12              | 2.10.0           | n/a     |
| 0.11.0                     | 1.11              | 2.8.0            | n/a     |
| 0.9.0                      | 1.9               | 2.6.0            | n/a     |
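The table can be turned into pip requirement pins for a reproducible environment. The sketch below encodes the released rows (the GitHub master row is left out because it tracks nightly builds) and returns the tested combination for a given TFDV version. The names `COMPATIBLE` and `pip_pins` are illustrative, and pinning exactly to the tested versions is a choice made by this sketch; as noted above, other combinations may also work.

```python
# tfdv version -> (tested tensorflow series, apache-beam[gcp], pyarrow or None)
COMPATIBLE = {
    "0.22.2": (("1.15", "2.2"), "2.20.0", "0.16.0"),
    "0.22.1": (("1.15", "2.2"), "2.20.0", "0.16.0"),
    "0.22.0": (("1.15", "2.2"), "2.20.0", "0.16.0"),
    "0.21.5": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.21.4": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.21.2": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.21.1": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.21.0": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.15.0": (("1.15", "2.0"), "2.16.0", "0.14.0"),
    "0.14.1": (("1.14",), "2.14.0", "0.14.0"),
    "0.14.0": (("1.14",), "2.14.0", "0.14.0"),
    "0.13.1": (("1.13",), "2.11.0", None),
    "0.13.0": (("1.13",), "2.11.0", None),
    "0.12.0": (("1.12",), "2.10.0", None),
    "0.11.0": (("1.11",), "2.8.0", None),
    "0.9.0": (("1.9",), "2.6.0", None),
}


def pip_pins(tfdv_version, tensorflow_series=None):
    """Return pip requirement strings for the tested combination of a release."""
    # Raises KeyError for versions that do not appear in the table.
    tf_options, beam, pyarrow = COMPATIBLE[tfdv_version]
    # Default to the newest tested TensorFlow series for that release.
    series = tensorflow_series or tf_options[-1]
    if series not in tf_options:
        raise ValueError(
            "tensorflow {} was not tested with TFDV {}".format(series, tfdv_version)
        )
    pins = [
        "tensorflow-data-validation=={}".format(tfdv_version),
        "tensorflow=={}.*".format(series),
        "apache-beam[gcp]=={}".format(beam),
    ]
    if pyarrow is not None:  # older releases list no pyarrow version (n/a)
        pins.append("pyarrow=={}".format(pyarrow))
    return pins
```

For example, `pip_pins("0.22.2")` yields pins for TensorFlow 2.2, Apache Beam 2.20.0, and PyArrow 0.16.0, while `pip_pins("0.22.2", "1.15")` selects the TensorFlow 1.15 series instead.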

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
