tdigest

T-Digest data structure

Development Status
- 4 - Beta
License
- OSI Approved :: MIT License
Programming Language
Topic
- Scientific/Engineering

Project description

# tdigest
### Efficient percentile estimation of streaming or distributed data
[![Latest Version](https://pypip.in/v/tdigest/badge.png)](https://pypi-hypernode.com/pypi/tdigest/)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)

This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed for computing accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)

### Installation
*tdigest* is compatible with both Python 2 and Python 3.

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest
print(sum_digest.percentile(30)) # about 0.3
```
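
Because digests can be added, the same idea extends to any number of partitions. The following is a minimal sketch of that map-reduce pattern, not part of the library: `digest_partitions` and its `make_digest` parameter are hypothetical names introduced here, and the code relies only on `batch_update` and `+` as shown above.

```python
from functools import reduce
import operator


def digest_partitions(partitions, make_digest):
    """Build one digest per partition (map), then add them all together (reduce).

    `make_digest` is any zero-argument callable returning an empty digest,
    for example the `TDigest` class itself (an assumption of this sketch).
    """
    def build(chunk):
        digest = make_digest()
        digest.batch_update(chunk)  # map step: summarize one partition
        return digest

    # Reduce step: `+` merges two digests into a new one.
    # Raises TypeError if `partitions` is empty, since there is nothing to merge.
    return reduce(operator.add, map(build, partitions))
```

With the library installed, `digest_partitions(chunks, TDigest).percentile(50)` would estimate the median across all chunks, where `chunks` is any iterable of lists or arrays.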

### API

Methods of `TDigest`:

- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with the values in array `x`, each with weight `w`.
- `compress()`: compress the underlying data structure to shrink its memory footprint without hurting accuracy. Good to perform after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `quantile(q)`: return the value of the CDF at `q`, that is, the estimated fraction of the data at or below `q`.
- `trimmed_mean(p1, p2)`: return the mean of the data after discarding values below the `p1`th percentile and above the `p2`th percentile.
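
To make the meaning of `percentile`, `quantile`, and `trimmed_mean` concrete, the sketch below computes the exact counterparts on a small in-memory list using only the standard library. The function names are hypothetical and the nearest-rank percentile convention is an assumption chosen for simplicity; the digest returns estimates and need not follow the same convention, so small differences are expected.

```python
import math


def exact_percentile(data, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty sequence."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def exact_cdf(data, x):
    """Fraction of values less than or equal to x (what `quantile` estimates)."""
    return sum(1 for v in data if v <= x) / len(data)


def exact_trimmed_mean(data, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles, inclusive."""
    lo, hi = exact_percentile(data, p1), exact_percentile(data, p2)
    kept = [v for v in data if lo <= v <= hi]
    return sum(kept) / len(kept)


data = list(range(1, 101))  # 1, 2, ..., 100
print(exact_percentile(data, 50))        # 50
print(exact_cdf(data, 30))               # 0.3
print(exact_trimmed_mean(data, 10, 90))  # 50.0
```

Functions like these are only practical when the full data set fits in memory, but they can serve as a reference when checking digest estimates on test data.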

Release history

- 0.5.2.2: May 7, 2019
- 0.5.2.1: May 5, 2018
- 0.5.2.0: Mar 12, 2018
- 0.5.1.0: Feb 9, 2018
- 0.5.0.0: Dec 27, 2017
- 0.4.1.0: Aug 27, 2016
- 0.4.0.2: Jul 21, 2016
- 0.4.0.1: Oct 10, 2015 (this version)
- 0.4.0: Jul 20, 2015
- 0.3.0: Jul 1, 2015
- 0.2.0: Jun 9, 2015
- 0.1.2: May 31, 2015
- 0.1.1: May 10, 2015
- 0.1.0: May 10, 2015

Download files


Source Distribution

tdigest-0.4.0.1.tar.gz (4.9 kB view details)

Uploaded Oct 10, 2015 Source

File details

Details for the file tdigest-0.4.0.1.tar.gz.

File metadata

Download URL: tdigest-0.4.0.1.tar.gz
Upload date: Oct 10, 2015
Size: 4.9 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for tdigest-0.4.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c8c29fb7c98f07f52b420a0bd92dadc582b9731b75b4e02aa53ef0900fd24699`
MD5	`df54f358a007c9659d9291766da5ad7b`
BLAKE2b-256	`21c976a30b19aecddadf5f1929c435e89fc5848307103a491bb3f459e8619fb7`