T-Digest data structure

# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)


This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed for computing accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB instead of storing the entire list of data.

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)


### Installation
*tdigest* is compatible with both Python 2 and Python 3.

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```
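When the data arrives as a stream rather than as one array, one option is to buffer it into fixed-size chunks and pass each chunk to `batch_update()`. The sketch below uses only the standard library. `chunked` is a hypothetical helper, not part of tdigest, and the chunk size is arbitrary.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Intended use with a digest (requires tdigest to be installed):
#   for chunk in chunked(stream, 1000):
#       digest.batch_update(chunk)
```

Buffering keeps memory bounded by the chunk size while avoiding a separate `update()` call per value in user code.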

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```
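Because digests support `+`, any number of per-worker digests can be folded into one. This is a minimal sketch, assuming each element of `partials` supports `+` the way `TDigest` does above. `combine` is a hypothetical helper name.

```python
from functools import reduce
import operator

def combine(partials):
    """Fold a non-empty sequence of mergeable summaries into one with `+`."""
    return reduce(operator.add, partials)

# Intended use (requires tdigest to be installed):
#   total = combine([digest, another_digest])
#   total.percentile(50)
```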

#### Convert a digest to a dict or serialize it with JSON

You can use the `to_dict()` method to turn a TDigest object into a standard Python dictionary.
```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```
Or you can get only a list of Centroids with `centroids_to_list()`.
```
digest.centroids_to_list()
```

Similarly, you can restore a digest from a Python dict of digest values with `update_from_dict()`. Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a Python dictionary.
```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```

The `K` and `delta` values are optional. Alternatively, you can provide only a list of centroids with `update_centroids_from_list()`.
```
digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```
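The centroid list is plain data, so it can be inspected without the library. The sketch below assumes that `m` holds a centroid's mean and `c` its count (weight). That reading fits the values shown above but is not stated in this README, and `summarize` is a hypothetical helper.

```python
def summarize(centroids):
    """Return (total_count, weighted_mean) for a list of {'c', 'm'} dicts.

    Assumes 'c' is the centroid's count and 'm' its mean.
    """
    total = sum(c['c'] for c in centroids)
    if total == 0:
        raise ValueError("no data in centroid list")
    mean = sum(c['c'] * c['m'] for c in centroids) / total
    return total, mean

centroids = [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]
print(summarize(centroids))  # (3.0, 2.0)
```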

If you want to serialize with other tools like JSON, you can first convert the digest with `to_dict()`.
```
import json
json.dumps(digest.to_dict())
```

Alternatively, define a custom encoder function for the standard `json` module.
```
def encoder(digest_obj):
    return digest_obj.to_dict()
```
Then pass the encoder function as the default parameter.
```
json.dumps(digest, default=encoder)
```
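To restore a digest on the receiving side, parse the JSON back into a dictionary and hand it to `update_from_dict()`. The sketch below shows only the JSON round trip with the standard library. It uses the dictionary literal from the earlier example as a stand-in for the output of `digest.to_dict()`.

```python
import json

# Stand-in for digest.to_dict(); same shape as the dictionary shown earlier.
state = {
    'K': 25,
    'delta': 0.01,
    'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}],
}

payload = json.dumps(state)     # a str that can be stored or sent
restored = json.loads(payload)  # back to a plain Python dict

assert restored == state

# With tdigest installed, the dict can then be loaded into a fresh digest:
#   new_digest = TDigest()
#   new_digest.update_from_dict(restored)
```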


### API

Methods of `TDigest`:

- `update(x, w=1)`: update the digest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the digest with the values in array `x`, each with weight `w`.
- `compress()`: compress the underlying data structure to shrink its memory footprint without hurting accuracy. Good to perform after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the value of the CDF at `x`.
- `trimmed_mean(p1, p2)`: return the mean of the data set, excluding values below the `p1` percentile and above the `p2` percentile.
- `to_dict()`: return a Python dictionary of the TDigest and internal Centroid values.
- `update_from_dict(dict_values)`: update the TDigest object from serialized dictionary values.
- `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
- `update_centroids_from_list(list_values)`: update Centroids from a Python list.
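For small data sets it can help to compare the digest's estimates against exact values. The following are standard-library reference computations, not the library's algorithm. The function names are hypothetical, and the conventions (nearest-rank percentile, inclusive CDF, trimming by rank) are assumptions that may differ slightly from how tdigest interpolates.

```python
import math
from bisect import bisect_right

def exact_percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of data <= it."""
    xs = sorted(data)
    if not xs:
        raise ValueError("empty data")
    rank = max(1, math.ceil(len(xs) * p / 100))
    return xs[rank - 1]

def exact_cdf(data, x):
    """Fraction of values less than or equal to x."""
    xs = sorted(data)
    if not xs:
        raise ValueError("empty data")
    return bisect_right(xs, x) / len(xs)

def exact_trimmed_mean(data, p1, p2):
    """Mean of the sorted values ranked between the p1 and p2 percentiles."""
    xs = sorted(data)
    lo = int(len(xs) * p1 / 100)
    hi = int(len(xs) * p2 / 100)
    kept = xs[lo:hi]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)

data = list(range(1, 101))               # 1, 2, ..., 100
print(exact_percentile(data, 50))        # 50
print(exact_cdf(data, 25))               # 0.25
print(exact_trimmed_mean(data, 10, 90))  # 50.5
```

On large inputs the digest's answers should be close to these, but not necessarily identical.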






