T-Digest data structure
# tdigest
### Efficient percentile estimation of streaming or distributed data
[![Latest Version](https://pypip.in/v/tdigest/badge.png)](https://pypi-hypernode.com/pypi/tdigest/)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)
This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed for computing accurate estimates from streaming or distributed data, including percentiles, quantiles, and trimmed means. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB instead of storing the entire list of data.
See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
### Installation
*tdigest* is compatible with both Python 2 and Python 3.
```
pip install tdigest
```
### Usage
#### Update the digest sequentially
```
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for _ in range(5000):
    digest.update(random())

print(digest.percentile(0.15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```
#### Update the digest in batches
```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(0.15))  # also about 0.15
```
#### Sum two digests to create a new digest
```
sum_digest = digest + another_digest
print(sum_digest.percentile(0.3))  # about 0.3
```
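Because digests support `+`, the map-reduce pattern mentioned above can be sketched as a fold over per-partition digests. The example below is only an illustration: `ListDigest` is a hypothetical stand-in that keeps every value and does no compression, so the snippet runs without the library installed. With real `TDigest` objects, the same `reduce(add, ...)` call is assumed to work, given that `digest + another_digest` is supported.

```python
from functools import reduce
from operator import add


class ListDigest(object):
    """Hypothetical stand-in for TDigest: stores every value, no compression."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def batch_update(self, xs):
        # Add a batch of values and keep the list sorted.
        self.values = sorted(self.values + list(xs))

    def __add__(self, other):
        # Merging two summaries yields a new summary of the combined data.
        return ListDigest(self.values + other.values)

    def percentile(self, q):
        # Nearest-rank lookup; q is a fraction between 0 and 1.
        idx = int(round(q * (len(self.values) - 1)))
        return self.values[idx]


# "Map" step: build one summary per partition of the data.
partitions = [range(0, 100), range(100, 200), range(200, 300)]
partials = []
for part in partitions:
    d = ListDigest()
    d.batch_update(part)
    partials.append(d)

# "Reduce" step: fold the partial summaries into a single one with +.
merged = reduce(add, partials)
print(merged.percentile(0.5))
```

The point of the design is that each worker only ships a small summary rather than its raw data, and the order in which summaries are combined does not matter for this stand-in.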
### API
`TDigest` methods:
- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with each value in array `x`, using weight `w`.
- `compress()`: compress the underlying data structure to shrink its memory footprint without hurting accuracy. Worth calling after adding many values.
- `percentile(q)`: return the estimated value at the `q`th percentile, where `q` is a fraction between 0 and 1. Example: `q=.50` is the median.
- `quantile(q)`: return the estimated percentile at which the value `q` sits (the inverse of `percentile`).
- `trimmed_mean(q1, q2)`: return the mean of the data set after discarding values below the `q1` percentile and above the `q2` percentile.
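To make the meaning of the three query methods concrete, here is a minimal sketch of their exact counterparts computed on a plain list. This is not how the library computes them (a t-digest works from a compressed summary, not the full data). The function names `exact_percentile`, `exact_quantile`, and `exact_trimmed_mean` are made up for this illustration, and the linear interpolation rule is an assumption, one of several common conventions.

```python
from bisect import bisect_right


def exact_percentile(data, q):
    """Value at fraction q (0 <= q <= 1) of the sorted data, linearly interpolated."""
    xs = sorted(data)
    if not xs:
        raise ValueError("data must not be empty")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])


def exact_quantile(data, x):
    """Fraction of the data that is less than or equal to x."""
    xs = sorted(data)
    return bisect_right(xs, x) / float(len(xs))


def exact_trimmed_mean(data, q1, q2):
    """Mean of the values lying between the q1 and q2 percentiles (inclusive)."""
    lo = exact_percentile(data, q1)
    hi = exact_percentile(data, q2)
    kept = [x for x in data if lo <= x <= hi]
    return sum(kept) / float(len(kept))


data = list(range(101))  # the integers 0, 1, ..., 100

print(exact_percentile(data, 0.5))         # the median of the data
print(exact_quantile(data, 50))            # fraction of values <= 50
print(exact_trimmed_mean(data, 0.1, 0.9))  # mean after trimming both tails
```

Comparing a digest's answers against exact functions like these on a small sample is one way to sanity-check the estimates before relying on them for data too large to hold in memory.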