T-Digest data structure
Project description
# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)
This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest data structure is designed around computing accurate estimates from either streaming data, or distributed data. These estimates are percentiles, quantiles, trimmed means, etc. Two t-digests can be added, making the data structure ideal for map-reduce settings, and can be serialized into much less than 10kB (instead of storing the entire list of data).
See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
### Installation
*tdigest* is compatible with both Python 2 and Python 3.
```
pip install tdigest
```
### Usage
#### Update the digest sequentially
```
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for x in range(5000):
digest.update(random())
print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```
#### Update the digest in batches
```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```
#### Sum two digests to create a new digest
```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```
#### To dict or serializing a digest with JSON
You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.
```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```
Or you can get only a list of Centroids with `centroids_to_list()`.
```
digest.centroids_to_list()
```
Similarly, you can restore a Python dict of digest values with `update_from_dict()`. Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a python dictionary.
```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```
K and delta values are optional, or you can provide only a list of centroids with `update_centroids_from_list()`.
```
digest = TDigest()
digest.update_centroids([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```
If you want to serialize with other tools like JSON, you can first convert to_dict().
```
json.dumps(digest.to_dict())
```
Alternatively, make a custom encoder function to provide as default to the standard json module.
```
def encoder(digest_obj):
return digest_obj.to_dict()
```
Then pass the encoder function as the default parameter.
```
json.dumps(digest, default=encoder)
```
### API
`TDigest.`
- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with values in array `x` and weight `w`.
- `compress()`: perform a compression on the underlying data structure that will shrink the memory footprint of it, without hurting accuracy. Good to perform after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the CDF the value `x` is at.
- `trimmed_mean(p1, p2)`: return the mean of data set without the values below and above the `p1` and `p2` percentile respectively.
- `to_dict()`: return a Python dictionary of the TDigest and internal Centroid values.
- `update_from_dict(dict_values)`: update from serialized dictionary values into the TDigest object.
- `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
- `update_centroids_from_list(list_values)`: update Centroids from a python list.
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)
This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest data structure is designed around computing accurate estimates from either streaming data, or distributed data. These estimates are percentiles, quantiles, trimmed means, etc. Two t-digests can be added, making the data structure ideal for map-reduce settings, and can be serialized into much less than 10kB (instead of storing the entire list of data).
See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
### Installation
*tdigest* is compatible with both Python 2 and Python 3.
```
pip install tdigest
```
### Usage
#### Update the digest sequentially
```
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for x in range(5000):
digest.update(random())
print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```
#### Update the digest in batches
```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```
#### Sum two digests to create a new digest
```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```
#### To dict or serializing a digest with JSON
You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.
```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```
Or you can get only a list of Centroids with `centroids_to_list()`.
```
digest.centroids_to_list()
```
Similarly, you can restore a Python dict of digest values with `update_from_dict()`. Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a python dictionary.
```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```
K and delta values are optional, or you can provide only a list of centroids with `update_centroids_from_list()`.
```
digest = TDigest()
digest.update_centroids([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```
If you want to serialize with other tools like JSON, you can first convert to_dict().
```
json.dumps(digest.to_dict())
```
Alternatively, make a custom encoder function to provide as default to the standard json module.
```
def encoder(digest_obj):
return digest_obj.to_dict()
```
Then pass the encoder function as the default parameter.
```
json.dumps(digest, default=encoder)
```
### API
`TDigest.`
- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with values in array `x` and weight `w`.
- `compress()`: perform a compression on the underlying data structure that will shrink the memory footprint of it, without hurting accuracy. Good to perform after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the CDF the value `x` is at.
- `trimmed_mean(p1, p2)`: return the mean of data set without the values below and above the `p1` and `p2` percentile respectively.
- `to_dict()`: return a Python dictionary of the TDigest and internal Centroid values.
- `update_from_dict(dict_values)`: update from serialized dictionary values into the TDigest object.
- `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
- `update_centroids_from_list(list_values)`: update Centroids from a python list.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tdigest-0.5.2.1.tar.gz
(7.1 kB
view details)
Built Distribution
File details
Details for the file tdigest-0.5.2.1.tar.gz
.
File metadata
- Download URL: tdigest-0.5.2.1.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8b1e554356dcc176a2e15265b2d2c39f67a388524b5dd121bd03e929b1c52e7b |
|
MD5 | e6943c24fc8385aee7972a8e0f23c8be |
|
BLAKE2b-256 | fb6130ce52974601199c1f5d38c03989aa6c913d27892b5e7b4f035166f561c5 |
File details
Details for the file tdigest-0.5.2.1-py2.py3-none-any.whl
.
File metadata
- Download URL: tdigest-0.5.2.1-py2.py3-none-any.whl
- Upload date:
- Size: 18.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e0febed55737e1b5cf67e530a38beaaadad68c4e0c37229d3d8f662b99c914d7 |
|
MD5 | 8b988c655b3989e54696dbb4d53626a6 |
|
BLAKE2b-256 | 2741b714941a6dba3760ddf2c2604daabbb578bcd6063f57ecdbe2c1d8ce4a79 |