
Arrow bindings for casacore

Project description

Rationale

  • The structure of Apache Arrow Tables is highly similar to that of CASA Tables

  • It’s easy to convert Arrow Tables between many different languages

  • Once in Apache Arrow format, it is easy to store data in modern, cloud-native disk formats such as parquet and orc.

  • Converting CASA Tables to Arrow in the C++ layer avoids the GIL

  • Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread

  • It also allows us to write astrometric routines in C++, potentially side-stepping thread-safety and GIL issues with the CASA Measures server.
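The single-threaded access pattern mentioned above can be sketched in pure Python with a one-worker executor. This is only an illustration of the pattern, not arcae's actual C++ implementation; `SerialisedTable` and the dict-backed table are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

# All access to a non-thread-safe resource is funnelled through a
# pool containing exactly one thread, serialising every call to it.
class SerialisedTable:
    def __init__(self, table):
        self._table = table
        self._pool = ThreadPoolExecutor(max_workers=1)

    def getcol(self, name):
        # Executes on the pool's single thread; callers may be on any thread
        return self._pool.submit(lambda: self._table[name]).result()

# A plain dict stands in for a CASA table
table = SerialisedTable({"TIME": [0.0, 1.0, 2.0]})
print(table.getcol("TIME"))
```

Callers on any number of threads can then share the wrapper safely, since the underlying table is only ever touched by the pool's single worker.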

Build Wheel Locally

In the user environment or, even better, in a virtual environment:

$ pip install -U pip cibuildwheel
$ bash scripts/run_cbuildwheel.sh -p 3.10

Local Development

In the directory containing the source, set up your development environment as follows:

$ pip install -U pip virtualenv
$ virtualenv -p python3.10 /venv/arcaedev
$ . /venv/arcaedev/bin/activate
(arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi1-dbg
(arcaedev) pip install -e .[test]
(arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
(arcaedev) py.test -s -vvv --pyargs arcae

Usage

Example Usage:

import json
from pprint import pprint

import arcae
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Obtain (partial) Apache Arrow Table from a CASA Table
casa_table = arcae.table("/path/to/measurementset.ms")
arrow_table = casa_table.to_arrow()        # read entire table
arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
assert isinstance(arrow_table, pa.Table)

# Print JSON-encoded Table and Column keywords
pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

# Extract Arrow Table columns into numpy arrays
time = arrow_table.column("TIME").to_numpy()
data = arrow_table.column("DATA").to_numpy()   # currently, arrays of object arrays, overly slow and memory hungry
df = arrow_table.to_pandas()                   # currently slow, memory hungry due to arrays of object arrays

# Write Arrow Table to parquet file
pq.write_table(arrow_table, "measurementset.parquet")

See the test cases for further use cases.

Exporting Measurement Sets to Arrow Parquet Datasets

An export script is available:

$ arcae export /path/to/the.ms --nrow 50000
$ tree output.arrow/
output.arrow/
├── ANTENNA
│   └── data0.parquet
├── DATA_DESCRIPTION
│   └── data0.parquet
├── FEED
│   └── data0.parquet
├── FIELD
│   └── data0.parquet
├── MAIN
│   └── FIELD_ID=0
│       └── PROCESSOR_ID=0
│           ├── DATA_DESC_ID=0
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=1
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=2
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           └── DATA_DESC_ID=3
│               ├── data0.parquet
│               ├── data1.parquet
│               ├── data2.parquet
│               └── data3.parquet
├── OBSERVATION
│   └── data0.parquet

This data can be loaded into an Arrow Dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as pad
>>> main_ds = pad.dataset("output.arrow/MAIN")
>>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")

Limitations

Some edge cases have not yet been implemented, but could be with some thought.

  • Not yet able to handle columns with unconstrained rank (ndim == -1). Probably simplest to convert these rows to JSON and store them as strings.

  • Not yet able to handle TpRecord columns. Probably simplest to convert these rows to JSON and store them as strings.

  • Not yet able to handle TpQuantity columns. Possible to represent as a run-time parametric Arrow DataType.

  • to_numpy() conversion of nested lists produces nested numpy arrays, instead of tensors. This is possible but requires some changes to how C++ Extension Types are exposed in Python.

Etymology

Noun: arca f (genitive arcae); first declension. A chest, box, coffer, safe (a safe place for storing items, or anything of a similar shape).

Pronounced: ar-ki.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arcae-0.2.0.tar.gz (46.4 kB view details)

Uploaded Source

Built Distributions

arcae-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.28+ x86-64

arcae-0.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.28+ x86-64

arcae-0.2.0-cp39-cp39-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.28+ x86-64

arcae-0.2.0-cp38-cp38-manylinux_2_28_x86_64.whl (24.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.28+ x86-64

File details

Details for the file arcae-0.2.0.tar.gz.

File metadata

  • Download URL: arcae-0.2.0.tar.gz
  • Upload date:
  • Size: 46.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for arcae-0.2.0.tar.gz
Algorithm Hash digest
SHA256 f7577cd408ebe48714793afd10d91289fef272f36695430a13a49571296b1010
MD5 8476123ceee059a65f7df89847e01204
BLAKE2b-256 d08814c506b8e1d74956ad509421beff4aa34935e18d7d8f04717e3625ac3e18

See more details on using hashes here.

File details

Details for the file arcae-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 7f602b4732217db345d1b69fe7a6b464612243c1afae5be1bf5b5a1823a55cbb
MD5 2f5a4148fec93adc8f1abd4d83d1a7ca
BLAKE2b-256 967d1619d7578f1e84cf78f1fc0ef74bf7f8ff33d15ff9040b271ecd13b0cf21


File details

Details for the file arcae-0.2.0-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.2.0-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b8f440e9a6fa4ffa1bbcef0518ab762719b0a8332ea0334e89f88a11da2571e5
MD5 8691f9a3f810de3d0388c58c5ce8e027
BLAKE2b-256 bf671ff9b865c5d1838c094f9caaf9fbdc980c9ca8f6e389ceb1cc710e0e16c1


File details

Details for the file arcae-0.2.0-cp39-cp39-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.2.0-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5eea46ec771ddf6578e2bb4cd038b1ff37330c4a48e91e32947c252384a9673a
MD5 2ce6c7c1cd4295a5a09157f47708a636
BLAKE2b-256 629309be57a6eae4d364d90dbbf475ffff06a2e32939b7688978768d552179e8


File details

Details for the file arcae-0.2.0-cp38-cp38-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.2.0-cp38-cp38-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 56b857c957a36c65a2e74e1b4634c061e6f28a0e827a0022f644d65728c5f0b3
MD5 c3838dbf2060f5ebe216357b7380ef37
BLAKE2b-256 4276836d28c7b6caff4007f65c4f4a0fb24e132031c072fe5ffa476ed2899bdc

