Distributed quantile sketches
Project description
ddsketch
This repo contains the Python implementation of the distributed quantile sketch
algorithm DDSketch [1]. DDSketch has relative-error guarantees for any quantile
q in [0, 1]. That is if the true value of the qth-quantile is x
then DDSketch
returns a value y
such that |x-y| / x < e
where e
is the relative error
parameter. (The default here is set to 0.01.) DDSketch is also fully mergeable,
meaning that multiple sketches from distributed systems can be combined in a
central node.
Our default implementation, DDSketch
, is guaranteed [1] to not grow too large
in size for any data that can be described by a distribution whose tails are
sub-exponential.
We also provide implementations (LogCollapsingLowestDenseDDSketch
and
LogCollapsingHighestDenseDDSketch
) where the q-quantile will be accurate up to
the specified relative error for q that is not too small (or large). Concretely,
the q-quantile will be accurate up to the specified relative error as long as it
belongs to one of the m
bins kept by the sketch. If the data is time in
seconds, the default of m = 2048
covers 80 microseconds to 1 year.
Installation
To install this package, run pip install ddsketch
, or clone the repo and run
python setup.py install
. This package depends on numpy
and protobuf
. (The
protobuf dependency can be removed if it's not applicable.)
Usage
from ddsketch import DDSketch
sketch = DDSketch()
Add values to the sketch
import numpy as np
values = np.random.normal(size=500)
for v in values:
sketch.add(v)
Find the quantiles of values
to within the relative error.
quantiles = [sketch.get_quantile_value(q) for q in [0.5, 0.75, 0.9, 1]]
Merge another DDSketch
into sketch
.
another_sketch = DDSketch()
other_values = np.random.normal(size=500)
for v in other_values:
another_sketch.add(v)
sketch.merge(another_sketch)
The quantiles of values
concatenated with other_values
are still accurate to within the relative error.
Development
To work on ddsketch a Python interpreter must be installed. It is recommended to use the provided development container (requires docker) which includes all the required Python interpreters.
docker-compose run dev
Or, if developing outside of docker then it is recommended to use a virtual environment:
pip install virtualenv
virtualenv --python=3 .venv
source .venv/bin/activate
Testing
To run the tests install riot
:
pip install riot
Replace the Python version with the interpreter(s) available.
# Run tests with Python 3.9
riot run -p3.9 test
Release notes
New features, bug fixes, deprecations and other breaking changes must have release notes included.
To generate a release note for the change:
riot run reno new <short-description-of-change-no-spaces>
Edit the generated file to include notes on the changes made in the commit/PR and add commit it.
Formatting
Format code with
riot run fmt
Type-checking
Type checking is done with mypy:
riot run mypy
Type-checking
Lint the code with flake8:
riot run flake8
Protobuf
The protobuf is stored in the go repository: https://github.com/DataDog/sketches-go/blob/master/ddsketch/pb/ddsketch.proto
Install the minimum required protoc and generate the Python code:
docker run -v $PWD:/code -it ubuntu:18.04 /bin/bash
apt update && apt install protobuf-compiler # default is 3.0.0
protoc --proto_path=ddsketch/pb/ --python_out=ddsketch/pb/ ddsketch/pb/ddsketch.proto
Releasing
- Generate the release notes and use
pandoc
to format them for Github:
git checkout master && git pull
riot run -s reno report --no-show-source | pandoc -f rst -t gfm --wrap=none
Copy the output into a new release: https://github.com/DataDog/sketches-py/releases/new.
- Enter a tag for the release (following
semver
) (eg.v1.1.3
,v1.0.3
,v1.2.0
). - Use the tag without the
v
as the title. - Save the release as a draft and pass the link to someone else to give a quick review.
- If all looks good hit publish
References
[1] Charles Masson and Jee E Rim and Homin K. Lee. DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. PVLDB, 12(12): 2195-2205, 2019. (The code referenced in the paper, including our implementation of the the Greenwald-Khanna (GK) algorithm, can be found at: https://github.com/DataDog/sketches-py/releases/tag/v0.1 )
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.