Numbagg: Fast N-dimensional aggregation functions with Numba

Fast, flexible N-dimensional array functions written with Numba and NumPy's generalized ufuncs.

Currently accelerated functions:

  • Array functions: allnan, anynan, count, nanargmax, nanargmin, nanmax, nanmean, nanstd, nanvar, nanmin, nansum
  • Moving window functions: move_exp_nanmean, move_exp_nansum, move_exp_nanvar, move_mean, move_sum

Note: Only functions listed here (exposed in Numbagg's top level namespace) are supported as part of Numbagg's public API.
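
As a minimal usage sketch, the array functions are called like their NumPy counterparts, with an optional axis argument (the same call shape used in the benchmarks further down). The small input array is made up for illustration, and the fallback to np.nanmean is an assumption added only so the snippet runs when Numbagg is not installed.

```python
import numpy as np

try:
    import numbagg
    nanmean = numbagg.nanmean
except ImportError:
    # Assumption for illustration: fall back to NumPy so the snippet still runs.
    nanmean = np.nanmean

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])

total = nanmean(x)               # mean of all non-NaN entries
per_column = nanmean(x, axis=0)  # reduce over rows
per_row = nanmean(x, axis=1)     # reduce over columns

print(total, per_column, per_row)
```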

Easy to extend

Numbagg makes it easy to write, in pure Python/NumPy, flexible aggregation functions accelerated by Numba. All the hard work is done by Numba's JIT compiler and NumPy's gufunc machinery (as wrapped by Numba).

For example, here is how we wrote nansum:

import numpy as np
from numbagg.decorators import ndreduce

@ndreduce
def nansum(a):
    asum = 0.0
    for ai in a.flat:
        if not np.isnan(ai):
            asum += ai
    return asum

You are welcome to experiment with Numbagg's decorator functions, but these are not public APIs (yet): we reserve the right to change them at any time.

We'd rather get your pull requests to add new functions into Numbagg directly!
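
Because the kernel is plain Python, one way to sanity-check a new function before decorating it is to run the undecorated loop on a small array and compare it with NumPy. The sketch below does that for the nansum loop shown above; nansum_kernel is a hypothetical name, and running it without the decorator is only a testing aid, not how Numbagg executes it.

```python
import numpy as np

def nansum_kernel(a):
    # Same loop body as the nansum example, without the @ndreduce decorator.
    asum = 0.0
    for ai in a.flat:
        if not np.isnan(ai):
            asum += ai
    return asum

x = np.array([[1.0, np.nan],
              [2.5, 4.0]])

result = nansum_kernel(x)
reference = np.nansum(x)  # NumPy's own NaN-skipping sum
print(result, reference)
```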

Advantages over Bottleneck

  • Way less code. Easier to add new functions. No ad-hoc templating system. No Cython!
  • Fast functions still work for >3 dimensions.
  • axis argument handles tuples of integers.
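
A short sketch of the last two points: reducing a 4-dimensional array over a tuple of axes. The array shape and NaN position are made up for illustration, and the fallback to np.nansum is an assumption added only so the snippet runs without Numbagg installed.

```python
import numpy as np

try:
    import numbagg
    nansum = numbagg.nansum
except ImportError:
    # Assumption for illustration: fall back to NumPy so the snippet still runs.
    nansum = np.nansum

# A 4-D array with one missing value.
x = np.arange(2 * 3 * 4 * 5, dtype=np.float64).reshape(2, 3, 4, 5)
x[1, 2, 3, 4] = np.nan

# Reduce over the first and third axes at once.
reduced = nansum(x, axis=(0, 2))
print(reduced.shape)
```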

Most of the functions in Numbagg (including our test suite) are adapted from Bottleneck's battle-hardened implementations. Still, Numbagg is experimental, and probably not yet ready for production.

Benchmarks

Initial benchmarks are quite encouraging. Numbagg/Numba performs comparably to, and in these runs slightly better than, Bottleneck's hand-written C:

import numbagg
import numpy as np
import bottleneck

x = np.random.RandomState(42).randn(1000, 1000)
x[x < -1] = np.nan

# timings with numba=0.41.0 and bottleneck=1.2.1

In [2]: %timeit numbagg.nanmean(x)
1.8 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit numbagg.nanmean(x, axis=0)
3.63 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit numbagg.nanmean(x, axis=1)
1.81 ms ± 41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit bottleneck.nanmean(x)
2.22 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit bottleneck.nanmean(x, axis=0)
4.45 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit bottleneck.nanmean(x, axis=1)
2.19 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Benchmarks vs. pandas

Here are the current benchmark results relative to pandas for the rolling exponential functions:

Function            n           numbagg    pandas    %change
move_exp_nanmean    1000        77.6μs     360μs     -78%
move_exp_nanmean    100000      6.85ms     18.8ms    -63%
move_exp_nanmean    10000000    793ms      1.96s     -59%
move_exp_nansum     1000        92.3μs     335μs     -72%
move_exp_nansum     100000      10.9ms     11.3ms    -3%
move_exp_nansum     10000000    1.02s      1.22s     -16%
move_exp_nanvar     1000        74.3μs     360μs     -79%
move_exp_nanvar     100000      6.63ms     15.3ms    -56%
move_exp_nanvar     10000000    1.06s      1.86s     -43%

Benchmarks were run on a Mac M1 in September 2023 on numbagg's HEAD and pandas 2.1.1.
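
For readers unfamiliar with what a rolling exponential mean computes, here is a rough pure-NumPy sketch. It assumes an "adjusted" weighting in which every step decays earlier observations by (1 - alpha) and NaN entries add no weight. That definition is an assumption chosen for illustration, not a copy of Numbagg's or pandas' implementation, and ewm_nanmean_sketch is a hypothetical name.

```python
import numpy as np

def ewm_nanmean_sketch(a, alpha):
    # Illustrative only: exponentially weighted mean that skips NaN values.
    out = np.full(len(a), np.nan)
    numer = 0.0  # running weighted sum of observed values
    denom = 0.0  # running sum of weights
    decay = 1.0 - alpha
    for i, value in enumerate(a):
        # Older observations lose weight at every step, NaN or not.
        numer *= decay
        denom *= decay
        if not np.isnan(value):
            numer += value
            denom += 1.0
        if denom > 0.0:
            out[i] = numer / denom
    return out

values = np.array([1.0, np.nan, 3.0])
smoothed = ewm_nanmean_sketch(values, alpha=0.5)
print(smoothed)
```

With alpha=0.5, the NaN at the second position leaves the running mean unchanged, while the third value is weighted four times as heavily as the first, which is by then two steps old.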

Our approach

Numbagg includes somewhat awkward workarounds for features missing from NumPy/Numba:

I hope that the need for most of these will eventually go away. In the meantime, expect Numbagg to be tightly coupled to Numba and NumPy release cycles.

License

3-clause BSD. Includes portions of Bottleneck, which is distributed under a Simplified BSD license.
