Manipulate arrays of complex data structures as easily as Numpy.

awkward-array

awkward-array is a pure Python+Numpy library for manipulating complex data structures as you would Numpy arrays. Even if your data structures

  • contain variable-length lists (jagged or ragged),

  • are deeply nested (record structure),

  • have different data types in the same list (heterogeneous),

  • are masked, bit-masked, or index-mapped (nullable),

  • contain cross-references or even cyclic references,

  • need to be Python class instances on demand,

  • are not defined at every point (sparse),

  • are not contiguous in memory,

  • should not be loaded into memory all at once (lazy),

this library can access them with the efficiency of Numpy arrays. They may be converted from JSON or Python data, loaded from “awkd” files, HDF5, Parquet, or ROOT files, or they may be views into memory buffers like Arrow.

Consider this monstrosity:

import awkward
array = awkward.fromiter([[1.1, 2.2, None, 3.3, None],
                          [4.4, [5.5]],
                          [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]
                         ])

It’s a list of lists; the first contains numbers and None, the second contains a sub-sub-list, and the third defines nested records. If we print this out, we see that it is called a JaggedArray:

array
# returns <JaggedArray [[1.1 2.2 None 3.3 None] [4.4 [5.5]] [<Row 0> None <Row 1>]] at 79093e598f98>

and we get the full Python structure back by calling array.tolist():

array.tolist()
# returns [[1.1, 2.2, None, 3.3, None],
#          [4.4, [5.5]],
#          [{'x': 6, 'y': {'z': 7}}, None, {'x': 8, 'y': {'z': 9}}]]

But we can also manipulate it as though it were a Numpy array. We can, for instance, take the first two elements of each sub-list (slicing the second dimension):

array[:, :2]
# returns <JaggedArray [[1.1 2.2] [4.4 [5.5]] [<Row 0> None]] at 79093e5ab080>

or the last two:

array[:, -2:]
# returns <JaggedArray [[3.3 None] [4.4 [5.5]] [None <Row 1>]] at 79093e5ab3c8>

Internally, the data has been rearranged into a columnar form, with all values at a given level of hierarchy in the same array. Numpy-like slicing, masking, and fancy indexing are translated into Numpy operations on these internal arrays: they are not implemented with Python for loops!
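The columnar idea above can be sketched for the simplest case, a jagged array of plain numbers. This is an illustration only, not the library's implementation: the helper names `to_columnar`, `to_lists`, and `first_n` are hypothetical, and plain Python lists stand in for the Numpy arrays the library uses internally, so the sketch runs without any dependencies.

```python
# Sketch of a jagged array stored as columns: one flat list of values plus
# offsets marking where each sub-list starts and stops.
# All names here are hypothetical; plain lists stand in for Numpy arrays.

def to_columnar(lists):
    """Flatten a list of lists into (offsets, content)."""
    offsets = [0]
    content = []
    for sub in lists:
        content.extend(sub)
        offsets.append(len(content))
    return offsets, content

def to_lists(offsets, content):
    """Rebuild the nested Python lists from (offsets, content)."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]

def first_n(offsets, content, n):
    """Rough equivalent of array[:, :n]: only the stop positions change."""
    starts = offsets[:-1]
    stops = [min(start + n, stop) for start, stop in zip(starts, offsets[1:])]
    return [content[start:stop] for start, stop in zip(starts, stops)]

offsets, content = to_columnar([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(offsets)                       # [0, 3, 3, 5]
print(content)                       # [1.1, 2.2, 3.3, 4.4, 5.5]
print(first_n(offsets, content, 2))  # [[1.1, 2.2], [], [4.4, 5.5]]
```

The point of the sketch is that slicing the second dimension never touches the values themselves: it only computes new start and stop positions, which is the kind of arithmetic that can be done on whole arrays at once.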

To see some of this structure, ask for the content of the array:

array.content
# returns <IndexedMaskedArray [1.1 2.2 None ... <Row 0> None <Row 1>] at 79093e598ef0>

Notice that the boundaries between sub-lists are gone: they exist only at the JaggedArray level. This IndexedMaskedArray level handles the None values in the data. If we dig further, we’ll find a UnionArray to handle the mixture of sub-lists and sub-sub-lists and record structures. If we dig deeply enough, we’ll find the numerical data:

array.content.content.contents[0]
# returns array([1.1, 2.2, 3.3, 4.4])
array.content.content.contents[1].content
# returns array([5.5])
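To make the layering concrete, here is a rough sketch of how the first two sub-lists of the example could be laid out: offsets for the jagged level, an index with -1 standing for None at the masked level, and tags plus positions at the union level. The variable and function names are hypothetical and the exact internal representation is an assumption; only the two flat lists of numbers are taken from the output shown above.

```python
# Hypothetical layout for [[1.1, 2.2, None, 3.3, None], [4.4, [5.5]]].
# Names are illustrative, not the library's attributes.

offsets = [0, 5, 7]                     # jagged level: sub-list boundaries
mask_index = [0, 1, -1, 2, -1, 3, 4]    # masked level: -1 marks None
tags = [0, 0, 0, 0, 1]                  # union level: which contents list to use
index = [0, 1, 2, 3, 0]                 # union level: position in that list
contents = [
    [1.1, 2.2, 3.3, 4.4],               # all plain numbers, stored together
    [[5.5]],                            # all sub-sub-lists (kept nested here for brevity)
]

def union_get(i):
    """Look up element i of the union by its tag and position."""
    return contents[tags[i]][index[i]]

def masked_get(i):
    """Return None for masked positions, otherwise follow the index."""
    j = mask_index[i]
    return None if j < 0 else union_get(j)

def tolist():
    """Reassemble the nested Python structure from the columns."""
    return [[masked_get(i) for i in range(start, stop)]
            for start, stop in zip(offsets[:-1], offsets[1:])]

print(tolist())  # [[1.1, 2.2, None, 3.3, None], [4.4, [5.5]]]
```

Each level answers one question (where do sub-lists end, which entries are missing, which type is this entry) and leaves the actual numbers in flat, homogeneous lists.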

Perhaps most importantly, Numpy’s universal functions (operations that apply to every element in an array) can be used on our array. This, too, goes straight to the columnar data and preserves structure.

array + 100
# returns <JaggedArray [[101.1 102.2 None 103.3 None]
#                       [104.4 [105.5]]
#                       [<Row 0> None <Row 1>]] at 724509ffe2e8>

(array + 100).tolist()
# returns [[101.1, 102.2, None, 103.3, None],
#          [104.4, [105.5]],
#          [{'x': 106, 'y': {'z': 107}}, None, {'x': 108, 'y': {'z': 109}}]]

import numpy
numpy.sin(array)
# returns <JaggedArray [[0.8912073600614354 0.8084964038195901 None -0.1577456941432482 None]
#                       [-0.951602073889516 [-0.70554033]]
#                       [<Row 0> None <Row 1>]] at 70a40c3a61d0>
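A minimal sketch of why an elementwise function preserves structure, assuming the simple offsets-plus-content layout for a jagged array of numbers: the function is applied once to the flat content, and the offsets are reused unchanged. The helper `map_content` is hypothetical, and a list comprehension stands in for what would be a single vectorized Numpy call in the library, so this says nothing about performance.

```python
import math

# Assumed layout for [[1.1, 2.2, 3.3], [], [4.4, 5.5]].
offsets = [0, 3, 3, 5]
content = [1.1, 2.2, 3.3, 4.4, 5.5]

def map_content(func, offsets, content):
    """Apply func to every value; the sub-list boundaries are untouched."""
    return offsets, [func(x) for x in content]

def to_lists(offsets, content):
    """Rebuild nested lists so the result can be inspected."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]

new_offsets, new_content = map_content(math.sin, offsets, content)
print(new_offsets)                         # [0, 3, 3, 5], same as before
print(to_lists(new_offsets, new_content))  # same shape, transformed values
```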

Rather than merely matching the speed of compiled code, this can exceed it (when the compiled code operates on non-columnar data), because the operation may be vectorized over awkward-array’s underlying columnar arrays.

(To do: performance example to substantiate that claim.)

Installation

Install awkward like any other Python package:

pip install awkward                       # maybe with sudo or --user, or in virtualenv
pip install awkward-numba                 # optional: some methods accelerated by Numba

or install with conda:

conda config --add channels conda-forge   # if you haven't added conda-forge already
conda install awkward
conda install awkward-numba               # optional: some methods accelerated by Numba

The base awkward package requires only Numpy (1.13.1+), but awkward-numba additionally requires Numba.
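After installing, one way to confirm which of these packages are importable in the current environment is a small check like the following. It is a sketch using only the standard library; the helper name `installed` is hypothetical, and the check looks only at whether a module can be found, not at its version.

```python
import importlib.util

def installed(module_names):
    """Map each top-level module name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}

status = installed(["numpy", "awkward", "numba"])
for name in sorted(status):
    print("{}: {}".format(name, "found" if status[name] else "missing"))
```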

Download files

Source Distribution

awkward-0.10.1.tar.gz (213.8 kB), source

Built Distribution

awkward-0.10.1-py2.py3-none-any.whl (75.1 kB), Python 2 and Python 3
