Arrow bindings for casacore

Project description

Rationale

  • The structure of Apache Arrow Tables is highly similar to that of CASA Tables

  • It’s easy to convert Arrow Tables between many different languages

  • Once in Apache Arrow format, it is easy to store data in modern, cloud-native disk formats such as Parquet and ORC.

  • Converting CASA Tables to Arrow in the C++ layer avoids the GIL

  • Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread

  • It also allows us to write astrometric routines in C++, potentially side-stepping thread-safety and GIL issues with the CASA Measures server.

Build Wheel Locally

In the user environment or, even better, in a virtual environment:

$ pip install -U pip cibuildwheel
$ bash scripts/run_cbuildwheel.sh -p 3.10

Local Development

In the directory containing the source, set up your development environment as follows:

$ pip install -U pip virtualenv
$ virtualenv -p python3.10 /venv/arcaedev
$ . /venv/arcaedev/bin/activate
(arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi1-dbg
(arcaedev) pip install -e .[test]
(arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
(arcaedev) py.test -s -vvv --pyargs arcae

Usage

Example Usage:

import json
from pprint import pprint

import arcae
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Obtain (partial) Apache Arrow Table from a CASA Table
casa_table = arcae.table("/path/to/measurementset.ms")
arrow_table = casa_table.to_arrow()        # read entire table
arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
assert isinstance(arrow_table, pa.Table)

# Print JSON-encoded Table and Column keywords
pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

# Extract Arrow Table columns into numpy arrays
time = arrow_table.column("TIME").to_numpy()
data = arrow_table.column("DATA").to_numpy()   # currently, arrays of object arrays, overly slow and memory hungry
df = arrow_table.to_pandas()                   # currently slow, memory hungry due to arrays of object arrays

# Write Arrow Table to parquet file
pq.write_table(arrow_table, "measurementset.parquet")

See the test cases for further use cases.

Exporting Measurement Sets to Arrow Parquet Datasets

An export script is available:

$ arcae export /path/to/the.ms --nrow 50000
$ tree output.arrow/
output.arrow/
├── ANTENNA
│   └── data0.parquet
├── DATA_DESCRIPTION
│   └── data0.parquet
├── FEED
│   └── data0.parquet
├── FIELD
│   └── data0.parquet
├── MAIN
│   └── FIELD_ID=0
│       └── PROCESSOR_ID=0
│           ├── DATA_DESC_ID=0
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=1
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=2
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           └── DATA_DESC_ID=3
│               ├── data0.parquet
│               ├── data1.parquet
│               ├── data2.parquet
│               └── data3.parquet
├── OBSERVATION
│   └── data0.parquet

This data can be loaded into an Arrow Dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as pad
>>> main_ds = pad.dataset("output.arrow/MAIN")
>>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")

Limitations

Some edge cases have not yet been implemented, but could be with some thought.

  • Not yet able to handle columns with unconstrained rank (ndim == -1). Probably simplest to convert these rows to json and store as a string.

  • Not yet able to handle TpRecord columns. Probably simplest to convert these rows to json and store as a string.

  • Not yet able to handle TpQuantity columns. Possible to represent as a run-time parametric Arrow DataType.

  • to_numpy() conversion of nested lists produces nested numpy arrays, instead of tensors. This is possible but requires some changes to how C++ Extension Types are exposed in Python.

Etymology

Noun: arca f (genitive arcae), first declension: a chest, box, coffer or safe (a safe place for storing items, or anything of a similar shape).

Pronounced: ar-ki.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arcae-0.2.1.tar.gz (47.8 kB view details)

Uploaded Source

Built Distributions

arcae-0.2.1-cp311-cp311-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.28+ x86-64

arcae-0.2.1-cp310-cp310-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.28+ x86-64

arcae-0.2.1-cp39-cp39-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.28+ x86-64

arcae-0.2.1-cp38-cp38-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.28+ x86-64

File details

Details for the file arcae-0.2.1.tar.gz.

File metadata

  • Download URL: arcae-0.2.1.tar.gz
  • Upload date:
  • Size: 47.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for arcae-0.2.1.tar.gz
Algorithm Hash digest
SHA256 c0ba710e41ed0315fbe6a11ec84515a204387bc982979dbbdec0b9af65b21485
MD5 7edb7a77352a5a9f9ac492c5857928f9
BLAKE2b-256 81c57582e7d4b91a2ee95beea50adf5227a0fc7ada60c39b5c8d485cf7f68c01

See more details on using hashes here.

File details

Details for the file arcae-0.2.1-cp311-cp311-manylinux_2_28_x86_64.whl.

File hashes

Hashes for arcae-0.2.1-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 4041bc76328c37951af2df5f5cad3b8db1d2f78f01b660fb4f8d36d19be874ba
MD5 364365801250c7c6567a68a416818043
BLAKE2b-256 cc08e78b525947d567a971a7a390960593e9299b87484b61aa2dd056a2384b89

File details

Details for the file arcae-0.2.1-cp310-cp310-manylinux_2_28_x86_64.whl.

File hashes

Hashes for arcae-0.2.1-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 631caed72ec41f70f639d232755a241be2d7316ba841eea9a528c42f373b8e61
MD5 d9c8c25caaada570bd560cbf518e27d1
BLAKE2b-256 80be83500ab3b3cfa3398a7ed3eaca7f1a347f79b7a5555a6487f068e574678e

File details

Details for the file arcae-0.2.1-cp39-cp39-manylinux_2_28_x86_64.whl.

File hashes

Hashes for arcae-0.2.1-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 200917e743bfb9b1f23fed12b29526b8b99c3c03cd19c188555288f4a011e95e
MD5 3a3ca99fd9d4c559211682fff7fe3d27
BLAKE2b-256 62ebc312139b1d6b75d36257e9e1433a38d0b4bb23323776f1d2870b67786b63

File details

Details for the file arcae-0.2.1-cp38-cp38-manylinux_2_28_x86_64.whl.

File hashes

Hashes for arcae-0.2.1-cp38-cp38-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b9ee2de19d620729bbc7483298c9baacd2e6cb30b55a658b2052fa91a346a1c7
MD5 5c074c813129eeb0b861d9d0a16b9a12
BLAKE2b-256 96be737a3f76c411aa366f160e305094d44431f464d82879bb2a1f0041b8ce9f
