
# tdigest
### Efficient percentile estimation of streaming or distributed data
[![Latest Version](https://pypip.in/v/tdigest/badge.png)](https://pypi-hypernode.com/pypi/tdigest/)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)


This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed to compute accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure a natural fit for map-reduce settings, and a digest serializes to much less than 10 kB instead of requiring the entire list of data to be stored.

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
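
For distributed workloads, each worker can build a digest over its own partition of the data and the partial digests can then be combined. The sketch below uses only the API described in this README, except for the `to_dict()`/`update_from_dict()` serialization helpers, which are assumed here and may not be available in every release:

```
from tdigest import TDigest
from numpy.random import random

# Each "worker" builds a digest over its own partition of the data.
worker_a = TDigest()
worker_a.batch_update(random(10000))

worker_b = TDigest()
worker_b.batch_update(random(10000))

# Reduce step: adding two digests yields a digest over the combined data.
combined = worker_a + worker_b
print(combined.percentile(50))  # about 0.5 for Uniform(0, 1) data

# to_dict()/update_from_dict() are assumed helpers; if your release lacks
# them, pass the digest object itself between processes instead.
state = combined.to_dict()
restored = TDigest()
restored.update_from_dict(state)
print(restored.percentile(50))  # about 0.5, matching the combined digest
```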


### Installation
*tdigest* is compatible with both Python 2 and Python 3.

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
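# random(5000) draws 5,000 samples from Uniform(0, 1) as a NumPy array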
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```

### API

`TDigest.`

- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with each value in the array `x`, applying weight `w` to each value.
- `compress()`: compress the underlying data structure, shrinking its memory footprint without hurting accuracy. Useful after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the value of the CDF at `x`, i.e. the fraction of the data at or below `x`.
- `trimmed_mean(p1, p2)`: return the mean of the data, excluding values below the `p1`th percentile and above the `p2`th percentile.
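
A minimal sketch exercising these methods on Uniform(0, 1) data (printed values are approximate):

```
from tdigest import TDigest
from numpy.random import random

t = TDigest()
t.batch_update(random(10000))
t.compress()                  # shrink the digest after a large batch of updates

print(t.percentile(50))       # roughly 0.5, the median of Uniform(0, 1)
print(t.cdf(0.25))            # roughly 0.25, the fraction of data at or below 0.25
print(t.trimmed_mean(5, 95))  # roughly 0.5, the mean of the middle 90% of the data
```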






