
Arrow bindings for casacore


Rationale

  • The structure of Apache Arrow Tables is highly similar to that of CASA Tables

  • It’s easy to convert Arrow Tables between many different languages

  • Once in Apache Arrow format, it is easy to store data in modern, cloud-native disk formats such as Parquet and ORC.

  • Converting CASA Tables to Arrow in the C++ layer avoids the GIL

  • Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread

  • It also allows us to write astrometric routines in C++, potentially side-stepping thread-safety and GIL issues with the CASA Measures server.
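The single-thread constraint mentioned above can be pictured with Python's standard library (arcae enforces this in its C++ layer; the `read_rows` helper and the list standing in for a table are purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# A pool with exactly one worker serialises every operation onto one
# thread, which is how access to a non-thread-safe resource such as a
# CASA Table can be made safe from multiple callers.
_table_pool = ThreadPoolExecutor(max_workers=1)

def read_rows(table, startrow, nrow):
    """Run a (hypothetical) table read on the dedicated table thread."""
    return _table_pool.submit(lambda: table[startrow:startrow + nrow]).result()

# Demonstrate with a plain list standing in for a table
result = read_rows(list(range(100)), 10, 5)
_table_pool.shutdown()
```

Callers may submit from any thread; the pool guarantees the reads themselves never run concurrently.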

Build Wheel Locally

In your user environment or, even better, a virtual environment:

$ pip install -U pip cibuildwheel
$ bash scripts/run_cbuildwheel.sh -p 3.10

Local Development

In the directory containing the source, set up your development environment as follows:

$ pip install -U pip virtualenv
$ virtualenv -p python3.10 /venv/arcaedev
$ . /venv/arcaedev/bin/activate
(arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi1-dbg
(arcaedev) pip install -e .[test]
(arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
(arcaedev) py.test -s -vvv --pyargs arcae

Usage

Example Usage:

import json
from pprint import pprint

import arcae
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Obtain (partial) Apache Arrow Table from a CASA Table
casa_table = arcae.table("/path/to/measurementset.ms")
arrow_table = casa_table.to_arrow()        # read entire table
arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
assert isinstance(arrow_table, pa.Table)

# Print JSON-encoded Table and Column keywords
pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

# Extract Arrow Table columns into numpy arrays
time = arrow_table.column("TIME").to_numpy()
data = arrow_table.column("DATA").to_numpy()   # currently arrays of object arrays; slow and memory-hungry
df = arrow_table.to_pandas()                   # currently slow and memory-hungry due to arrays of object arrays

# Write Arrow Table to parquet file
pq.write_table(arrow_table, "measurementset.parquet")
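Because to_arrow accepts startrow and nrow, very large tables can also be read in fixed-size chunks rather than all at once. A small stdlib-only helper for generating the row ranges (the total row count would come from the CASA table itself):

```python
def row_chunks(total_rows, chunk_size):
    """Yield (startrow, nrow) pairs covering rows [0, total_rows)."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(chunk_size, total_rows - start)

# Each pair could then be passed to casa_table.to_arrow(startrow, nrow)
chunks = list(row_chunks(10, 4))
```

The final pair is clipped so the ranges exactly tile the table without overshooting.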

See the test cases for further examples.

Exporting Measurement Sets to Arrow Parquet Datasets

An export script is available:

$ arcae export /path/to/the.ms --nrow 50000
$ tree output.arrow/
output.arrow/
├── ANTENNA
│   └── data0.parquet
├── DATA_DESCRIPTION
│   └── data0.parquet
├── FEED
│   └── data0.parquet
├── FIELD
│   └── data0.parquet
├── MAIN
│   └── FIELD_ID=0
│       └── PROCESSOR_ID=0
│           ├── DATA_DESC_ID=0
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=1
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=2
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           └── DATA_DESC_ID=3
│               ├── data0.parquet
│               ├── data1.parquet
│               ├── data2.parquet
│               └── data3.parquet
├── OBSERVATION
│   └── data0.parquet

This data can be loaded into an Arrow Dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as pad
>>> main_ds = pad.dataset("output.arrow/MAIN")
>>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")
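The MAIN table above is laid out with Hive-style key=value directories, which Arrow Datasets discover as partition columns automatically. The same keys can also be recovered directly from a file path with the standard library; a minimal sketch (the path below mirrors the tree shown earlier):

```python
from pathlib import PurePosixPath

def partition_keys(path):
    """Extract Hive-style key=value partition segments from a parquet path."""
    keys = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            # Partition values in this layout are integer ids
            keys[key] = int(value) if value.isdigit() else value
    return keys

keys = partition_keys(
    "output.arrow/MAIN/FIELD_ID=0/PROCESSOR_ID=0/DATA_DESC_ID=2/data1.parquet"
)
```

This is handy for selecting a subset of files without opening the dataset at all.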

Limitations

Some edge cases have not yet been implemented, but could be with some thought.

  • Columns with unconstrained rank (ndim == -1) whose rows, in practice, have differing dimensions. Unconstrained rank columns whose rows actually have the same rank are catered for.

  • Not yet able to handle TpRecord columns. It is probably simplest to convert these rows to JSON and store them as strings.

  • Not yet able to handle TpQuantity columns. Possible to represent as a run-time parametric Arrow DataType.

  • to_numpy() conversion of nested lists produces nested numpy arrays, instead of tensors. This is possible but requires some changes to how C++ Extension Types are exposed in Python.
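As a stopgap for the tensor limitation above, equal-shaped nested lists (as produced by converting such a column to Python lists) can be densified by hand. A stdlib-only sketch, assuming every row genuinely shares the same shape:

```python
def flatten(rows):
    """Flatten equal-shaped nested lists into (flat_values, shape)."""
    # Walk down the first element at each level to recover the shape
    shape = [len(rows)]
    probe = rows
    while probe and isinstance(probe[0], list):
        shape.append(len(probe[0]))
        probe = probe[0]
    # Repeatedly splice one nesting level until the values are flat
    flat = rows
    while flat and isinstance(flat[0], list):
        flat = [v for row in flat for v in row]
    return flat, tuple(shape)

flat, shape = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

The flat values and shape could then be handed to numpy (e.g. reshaping the flat list) to obtain the dense tensor the limitation describes.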

Etymology

Noun: arca f (genitive arcae); first declension. A chest, box, coffer, safe (a safe place for storing items, or anything of a similar shape).

Pronounced: ar-ki.



