T-Digest data structure

# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)


This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed to compute accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB instead of storing the entire list of data.

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)


### Installation
*tdigest* is compatible with both Python 2 and Python 3.

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```
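To see why adding digests is useful in a map-reduce setting, it helps to look at what goes wrong without a mergeable summary. The sketch below does not use `tdigest` at all; it only uses the standard library, and the two partitions (`part_a`, `part_b`) are made-up data for illustration. It shows that combining per-partition medians does not, in general, recover the median of the pooled data, which is the gap that a summable digest is meant to close.

```python
from statistics import median

# Two hypothetical partitions of one data set, e.g. held by two workers.
part_a = [1, 2, 3, 4, 5]
part_b = [6, 100, 200]

# Naive approach: take the median on each worker, then combine the medians.
median_of_medians = median([median(part_a), median(part_b)])

# Ground truth: the median of all the data pooled together.
true_median = median(part_a + part_b)

print(median_of_medians)  # 51.5
print(true_median)        # 4.5
```

With digests, each worker would instead build its own `TDigest` and the results would be combined with `+`, as in the example above, before asking for a percentile.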

### API

`TDigest.`

- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with each value in the array `x`, each with weight `w`.
- `compress()`: compress the underlying data structure to shrink its memory footprint without hurting accuracy. Worth calling after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the estimated value of the CDF at `x`, i.e. the fraction of the data at or below `x`.
- `trimmed_mean(p1, p2)`: return the mean of the data set, excluding values below the `p1`th percentile and above the `p2`th percentile.
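
As a rough guide to what `percentile`, `cdf`, and `trimmed_mean` estimate, here is a minimal sketch of the exact versions of those quantities computed on the full data with the standard library only. This is not the library's algorithm: the function names (`exact_percentile`, `exact_cdf`, `exact_trimmed_mean`) are hypothetical, and the linear interpolation and inclusive bounds are assumptions chosen for illustration, so a digest's answers may differ slightly at the edges.

```python
import bisect


def exact_percentile(data, p):
    """p-th percentile (0 <= p <= 100), linearly interpolated between sorted values."""
    xs = sorted(data)
    if not xs:
        raise ValueError("data must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    pos = (len(xs) - 1) * p / 100.0   # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def exact_cdf(data, x):
    """Fraction of values less than or equal to x."""
    xs = sorted(data)
    return bisect.bisect_right(xs, x) / len(xs)


def exact_trimmed_mean(data, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles (inclusive)."""
    lower = exact_percentile(data, p1)
    upper = exact_percentile(data, p2)
    kept = [v for v in data if lower <= v <= upper]
    return sum(kept) / len(kept)


data = list(range(1, 101))  # 1, 2, ..., 100

print(exact_percentile(data, 50))        # 50.5
print(exact_cdf(data, 25))               # 0.25
print(exact_trimmed_mean(data, 10, 90))  # 50.5
```

These exact versions need the whole data set in memory and a sort, which is the cost the digest avoids by keeping only a compact summary.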






