TensorFlow Data Validation
TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).
TF Data Validation includes:
- Scalable calculation of summary statistics of training and test data.
- Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
- Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
- A schema viewer to help you inspect the schema.
- Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
- An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
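The features listed above can be chained together roughly as follows. This is a minimal sketch, not the canonical workflow: the function name and its CSV path arguments are hypothetical, and it assumes both datasets are CSV files with a header row.

```python
def validate_against_training_data(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it.

    The function name and both path arguments are illustrative only.
    """
    # Imported lazily so this sketch can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Compute summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Infer a schema that describes the training data.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Check the evaluation statistics against the inferred schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the returned objects can then be passed to the viewers, for example `tfdv.display_schema(schema)` and `tfdv.display_anomalies(anomalies)`.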
For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
Caution: TFDV may be backwards incompatible before version 1.0.
Installing from PyPI
The recommended way to install TFDV is using the PyPI package:
pip install tensorflow-data-validation
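To confirm that the package is visible to the interpreter you intend to use, a short check such as the following may help. It is a sketch that assumes Python 3.8 or later for `importlib.metadata`, which is newer than the interpreters listed elsewhere in this document; the helper name is hypothetical.

```python
from importlib import metadata


def installed_tfdv_version():
    """Return the installed tensorflow-data-validation version, or None."""
    try:
        return metadata.version("tensorflow-data-validation")
    except metadata.PackageNotFoundError:
        # The distribution is not installed for this interpreter.
        return None


print(installed_tfdv_version())
```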
Build with Docker
This is the recommended way to build TFDV under Linux, and is continuously tested at Google.
1. Install Docker
Please first install docker and docker-compose by following their respective installation directions.
2. Clone the TFDV repository
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.
When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:
pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/
3. Build the pip package
Then, run the following at the project root:
sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010
where PYTHON_VERSION is one of {27, 35, 36, 37}.
A wheel will be produced under dist/.
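If you want to build for the interpreter you are currently running, the PYTHON_VERSION value can be derived from it. The sketch below assumes that the value is simply the major and minor version joined together (so 36 means Python 3.6); the helper names are hypothetical.

```python
import sys

# Values accepted for PYTHON_VERSION, as listed above.
SUPPORTED_TAGS = {"27", "35", "36", "37"}


def python_version_tag(version_info=sys.version_info):
    """Join the major and minor version numbers, e.g. (3, 6) becomes "36"."""
    return "{}{}".format(version_info[0], version_info[1])


def is_supported(tag):
    """Return True if the tag is one of the listed PYTHON_VERSION values."""
    return tag in SUPPORTED_TAGS


tag = python_version_tag()
print(tag, "supported" if is_supported(tag) else "not in the listed set")
```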
4. Install the pip package
pip install dist/*.whl
Build from source
1. Prerequisites
To compile and use TFDV, you need to set up some prerequisites.
Install NumPy
If NumPy is not installed on your system, install it now by following these directions.
Install Bazel
If Bazel is not installed on your system, install it now by following these directions.
Install PyArrow
TFDV needs to be built with specific PyArrow versions (as indicated in third_party/pyarrow.version). Install pyarrow by following these directions.
When installing, make sure to specify a compatible pyarrow version. For example:
pip install "pyarrow>=0.14.0,<0.15.0"
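Before building, it can be useful to check that an already installed pyarrow falls inside that range. The following is a minimal sketch with hypothetical helper names; it handles only plain numeric release strings such as 0.14.1 and would need extending for pre-release or development versions.

```python
def parse_version(text):
    """Turn a release string such as "0.14.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def in_range(version, lower="0.14.0", upper="0.15.0"):
    """Return True if lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


print(in_range("0.14.1"))
```

With pyarrow installed, the same check can be applied to `pyarrow.__version__`.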
2. Clone the TFDV repository
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.
When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:
pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/
3. Build the pip package
TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target version and has NumPy and PyArrow installed.
./configure.sh
bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package
Note that this assumes dependent packages (e.g. PyArrow) are built with a GCC older than 5.1; the flag -D_GLIBCXX_USE_CXX11_ABI=0 keeps the build compatible with the old std::string ABI.
You can find the generated .whl file in the dist subdirectory.
4. Install the pip package
pip install dist/*.whl
Supported platforms
TFDV is tested on the following 64-bit operating systems:
- macOS 10.12.6 (Sierra) or later.
- Ubuntu 16.04 or later.
- Windows 7 or later.
Dependencies
TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.
Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.
Compatible versions
The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.
tensorflow-data-validation | tensorflow | apache-beam[gcp] | pyarrow |
---|---|---|---|
GitHub master | nightly (1.x) | 2.14.0 | 0.14.0 |
0.14.1 | 1.14 | 2.14.0 | 0.14.0 |
0.14.0 | 1.14 | 2.14.0 | 0.14.0 |
0.13.1 | 1.13 | 2.11.0 | n/a |
0.13.0 | 1.13 | 2.11.0 | n/a |
0.12.0 | 1.12 | 2.10.0 | n/a |
0.11.0 | 1.11 | 2.8.0 | n/a |
0.9.0 | 1.9 | 2.6.0 | n/a |
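For scripted environment checks, the released rows of the table above can be transcribed into a lookup. This is only a sketch: the dictionary and function names are hypothetical, "n/a" entries are represented as None, and the GitHub master row is omitted because it tracks a nightly build.

```python
# Released rows of the compatibility table, keyed by TFDV version.
COMPATIBLE_VERSIONS = {
    "0.14.1": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.14.0": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.13.1": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.13.0": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.12.0": {"tensorflow": "1.12", "apache-beam[gcp]": "2.10.0", "pyarrow": None},
    "0.11.0": {"tensorflow": "1.11", "apache-beam[gcp]": "2.8.0", "pyarrow": None},
    "0.9.0": {"tensorflow": "1.9", "apache-beam[gcp]": "2.6.0", "pyarrow": None},
}


def compatible_with(tfdv_version):
    """Return the tested dependency versions for a TFDV release, or None."""
    return COMPATIBLE_VERSIONS.get(tfdv_version)


print(compatible_with("0.14.1"))
```

A result of None for a whole release means only that the combination is not listed in the table, not that it is known to fail.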
Questions
Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.