Skip to main content

Tools for analyzing hierarchically nested, columnar data without deserialization.

Project description

Data analysts are often faced with a choice between speed and flexibility. Tabular data, such as SQL tables, can be processed rapidly enough for a truly interactive analysis session, but hierarchically nested data, such as JSON, is better at representing relationships within complex data models. In some domains (such as particle physics), we want to perform calculations on JSON-like structures at the speed of SQL.

The key to high throughput for large datasets, particularly ones with many more attributes than are typically accessed in a single pass, is laying out the data in “columns.” All values of an attribute should be contiguous in disk or memory because data are paged from one cache to the next in locally contiguous blocks. The ROOT and Parquet file formats represent hierarchically nested data in a columnar form on disk, and Apache Arrow is an emerging standard for sharing hierarchically nested data in memory. However, data from these formats are usually deserialized into conventional data structures before processing, which limits performance (see this talk and this paper).

OAMap is a toolset for computing arbitrary functions on hierarchical, columnar data without deserialization. The name stands for Object Array Mapping, in analogy with the Object Relational Mapping (ORM) interface to some databases. Users (data analysts) write functions on JSON-like objects that OAMap compiles into operations on the underlying arrays, similar to the way that ORM converts object-oriented code into SQL. The difference is that mapping objects to non-relational arrays permits bare-metal performance (giving up some traditional database features).

OAMap is a Python library on top of which high-level analysis software may be built. It focuses on mapping an object-oriented view of data onto columnar arrays. Numpy is OAMap’s only strict dependency, though OAMap objects are also implemented as Numba extensions, so they may be used in Numba’s JIT-compiled functions at speeds that match or exceed hand-written C code. OAMap is unopinionated about the source of its columnar arrays, allowing for a variety of backends. See the walkthrough for more.

Also, a similar object array mapping could be implemented in any language— Python was chosen only for its popularity among data analysts.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

oamap-0.2.6.tar.gz (46.4 kB view details)

Uploaded Source

File details

Details for the file oamap-0.2.6.tar.gz.

File metadata

  • Download URL: oamap-0.2.6.tar.gz
  • Upload date:
  • Size: 46.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for oamap-0.2.6.tar.gz
Algorithm Hash digest
SHA256 b29983a1cd50c49a2c4790307fd5d9485f882ac567cc6b657ccc8ef10ec30ae2
MD5 0b0b61f3eac763c5edf3a653d970855e
BLAKE2b-256 0c183b517764622ce7857cb6041d6b3d68035c128ab5b10c188b43d2d31ef5af

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page