Ragged array library, complying with the Python Array API specification.

Ragged

Introduction

Ragged is a library for manipulating ragged arrays as though they were NumPy or CuPy arrays, following the Array API specification.

For example, this is a ragged/jagged array:

>>> import ragged
>>> a = ragged.array([[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]])
>>> a
ragged.array([
    [[1.1, 2.2, 3.3], []],
    [[4.4]],
    [],
    [[5.5, 6.6, 7.7, 8.8], [9.9]]
])

The values are all floating-point numbers, so a.dtype is float64,

>>> a.dtype
dtype('float64')

but a.shape uses None in place of an integer for each dimension whose list lengths are non-uniform:

>>> a.shape
(4, None, None)

In general, a ragged.array can have any mixture of regular and irregular dimensions, though shape[0] (the length) is always an integer. This convention follows the Array API's specification for array.shape, which must be a tuple of int or None:

array.shape: Tuple[Optional[int], ...]

(Our use of None to indicate a dimension without a single-valued size differs from the Array API's intention of specifying dimensions of unknown size, but it follows the technical specification. Array API-consuming libraries can try using Ragged to find out if they are ragged-ready.)
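
To make the int-or-None convention concrete, here is a minimal sketch that derives such a shape from plain nested Python lists. The function names `ragged_shape` and `is_regular` are hypothetical, and this is only an illustration of the convention: Ragged itself may decide which dimensions are regular differently (for example, from the underlying array type rather than by scanning values).

```python
from typing import Optional, Tuple

def ragged_shape(data, ndim: int) -> Tuple[Optional[int], ...]:
    """Return a shape tuple for nested lists with `ndim` levels of nesting.

    A dimension is an int when every list at that depth has the same length,
    and None otherwise (including when there are no lists at that depth).
    """
    shape = [len(data)]   # shape[0] (the length) is always an integer
    level = [data]        # all lists found at the current depth
    for _ in range(ndim - 1):
        # step one level down: gather every sublist at the next depth
        level = [sub for lst in level for sub in lst]
        lengths = {len(sub) for sub in level}
        shape.append(lengths.pop() if len(lengths) == 1 else None)
    return tuple(shape)

def is_regular(shape: Tuple[Optional[int], ...]) -> bool:
    """True when no dimension is None, i.e. the array is rectangular."""
    return all(dim is not None for dim in shape)

a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(ragged_shape(a, 3))                                 # (4, None, None)
print(ragged_shape([[[1, 2], [3, 4]], [[5, 6]]], 3))      # (2, None, 2)
print(is_regular(ragged_shape([[1, 2], [3, 4]], 2)))      # True
```

A library that consumes `array.shape` can use a check like `is_regular` to decide whether it needs a ragged-aware code path.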

All of the normal elementwise and reducing functions apply, as well as slices:

>>> ragged.sqrt(a)
ragged.array([
    [[1.05, 1.48, 1.82], []],
    [[2.1]],
    [],
    [[2.35, 2.57, 2.77, 2.97], [3.15]]
])

>>> ragged.sum(a, axis=0)
ragged.array([
    [11, 8.8, 11, 8.8],
    [9.9]
])

>>> ragged.sum(a, axis=-1)
ragged.array([
    [6.6, 0],
    [4.4],
    [],
    [28.6, 9.9]
])

>>> a[-1, 0, 2]
ragged.array(7.7)

>>> a[a * 10 % 2 == 0]
ragged.array([
    [[2.2], []],
    [[4.4]],
    [],
    [[6.6, 8.8], []]
])
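
The results above can be reproduced on plain nested Python lists, which may help clarify what each operation means for ragged data. The sketch below is pure Python with hypothetical helper names (`sum_last_axis`, `sum_first_axis`, `mask_innermost`); it mirrors the outputs shown above but is not how Ragged computes them.

```python
from itertools import zip_longest

a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]

def sum_last_axis(data, ndim):
    """Like axis=-1: each innermost list collapses to a single number."""
    if ndim == 1:
        return sum(data)  # an empty innermost list sums to 0
    return [sum_last_axis(sub, ndim - 1) for sub in data]

_MISSING = object()  # sentinel for "no item at this position"

def add_aligned(x, y, depth):
    """Add two nested lists position by position; unmatched items pass through."""
    if depth == 0:
        return x + y
    out = []
    for xi, yi in zip_longest(x, y, fillvalue=_MISSING):
        if xi is _MISSING:
            out.append(yi)
        elif yi is _MISSING:
            out.append(xi)
        else:
            out.append(add_aligned(xi, yi, depth - 1))
    return out

def sum_first_axis(data, ndim):
    """Like axis=0: add the outermost items together, aligned from the left."""
    total = []
    for item in data:
        total = add_aligned(total, item, ndim - 1)
    return total

def mask_innermost(data, keep, ndim):
    """Like a[boolean_array]: drop values but keep the list structure around them."""
    if ndim == 1:
        return [x for x in data if keep(x)]
    return [mask_innermost(sub, keep, ndim - 1) for sub in data]

print(sum_last_axis(a, 3))    # matches ragged.sum(a, axis=-1) up to rounding
print(sum_first_axis(a, 3))   # matches ragged.sum(a, axis=0) up to rounding
print(mask_innermost(a, lambda x: x * 10 % 2 == 0, 3))
```

Note that the axis=0 sum aligns lists from the left, so shorter lists contribute only to the leading positions, and that masking removes values without removing the (possibly empty) lists that contained them.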

All of the methods, attributes, and functions in the Array API will be implemented for Ragged, as well as conveniences that are not required by the Array API. See open issues marked "todo" for Array API functions that still need to be written (out of 120 in total).

Ragged has two device values, "cpu" (backed by NumPy) and "cuda" (backed by CuPy). Eventually, all operations will be identical for CPU and GPU.

Implementation

Ragged is implemented using Awkward Array (code, docs), which is an array library for arbitrary tree-like (JSON-like) data. Because of its generality, Awkward Array cannot follow the Array API—in fact, its array objects can't have separate dtype and shape attributes (the array type can't be factorized). Ragged is therefore

  • a specialization of Awkward Array for numeric data in fixed-length and variable-length lists, and
  • a formalization to adhere to the Array API and its fully typed protocols.

See "Why does this library exist?" under the Discussions tab for more details.

Ragged is a thin wrapper around Awkward Array, restricting it to ragged arrays and transforming its function arguments and return values to fit the specification.

Awkward Array, in turn, is time- and memory-efficient, ready for big datasets. Consider the following:

import gc      # control for garbage collection
import psutil  # measure process memory
import time    # measure time

import math
import ragged

this_process = psutil.Process()

def measure_memory(task):
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    out = task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    print(f"memory: {(stop_memory - start_memory) * 1e-9:.3f} GB")
    return out

def measure_time(task):
    gc.disable()
    start_time = time.perf_counter()
    out = task()
    stop_time = time.perf_counter()
    gc.enable()
    print(f"time: {stop_time - start_time:.3f} sec")
    return out

def make_big_python_object():
    out = []
    for i in range(10000000):
        out.append([j * 1.1 for j in range(i % 10)])
    return out

def make_ragged_array():
    return ragged.array(pyobj)

def compute_on_python_object():
    out = []
    for row in pyobj:
        out.append([math.sqrt(x) for x in row])
    return out

def compute_on_ragged_array():
    return ragged.sqrt(arr)

The ragged.array is about 3 times smaller:

>>> pyobj = measure_memory(make_big_python_object)
memory: 2.687 GB

>>> arr = measure_memory(make_ragged_array)
memory: 0.877 GB

and a sample calculation on it (square root of each value) is about 50 times faster:

>>> result = measure_time(compute_on_python_object)
time: 4.180 sec

>>> result = measure_time(compute_on_ragged_array)
time: 0.082 sec
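
The quoted factors follow directly from the printed measurements. As a quick check, using only the numbers shown above (which will vary by machine and library version):

```python
# Measurements copied from the output above
python_gb, ragged_gb = 2.687, 0.877
python_sec, ragged_sec = 4.180, 0.082

memory_ratio = python_gb / ragged_gb
time_ratio = python_sec / ragged_sec

print(f"{memory_ratio:.1f}x smaller, {time_ratio:.0f}x faster")
# 3.1x smaller, 51x faster
```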

Awkward Array and Ragged are generally smaller and faster than their Python equivalents for the same reasons that NumPy is smaller and faster than Python lists. See Awkward Array papers and presentations for more.
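
One way to see why a columnar layout helps is to store a ragged array as a single flat buffer of values plus an array of offsets marking where each list starts and stops. The sketch below uses only the standard library; the function names are hypothetical, and it illustrates the general idea rather than Awkward Array's actual code.

```python
import math
from array import array

def to_offsets_and_content(rows):
    """Flatten a list of lists into (offsets, content).

    Row i is content[offsets[i]:offsets[i + 1]], so the nesting costs one
    integer per row instead of one Python list object per row.
    """
    offsets = array("q", [0])   # 64-bit integers
    content = array("d")        # 64-bit floats, stored contiguously
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content

def get_row(offsets, content, i):
    """Recover row i as a Python list."""
    return list(content[offsets[i]:offsets[i + 1]])

rows = [[], [0.0], [0.0, 1.1], [0.0, 1.1, 2.2]]
offsets, content = to_offsets_and_content(rows)
print(offsets.tolist())   # [0, 0, 1, 3, 6]

# An elementwise operation only needs to touch the flat buffer;
# the offsets are reused unchanged for the result.
roots = array("d", map(math.sqrt, content))
print(get_row(offsets, roots, 3))
```

Because the values sit in one contiguous buffer, an elementwise function can run over them in a single pass without visiting each nested list, which is the same reason NumPy operations are faster than loops over Python lists.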

Installation

Ragged is on PyPI:

pip install ragged

and will someday be on conda-forge.

ragged is a pure-Python library that only depends on awkward (which, in turn, only depends on numpy and a compiled extension). In principle (i.e. eventually), ragged can be loaded into Pyodide and JupyterLite.

Acknowledgements

Support for this work was provided by NSF grant OAC-2103945 and the gracious help of Awkward Array contributors.


