Arrow bindings for casacore
Rationale
The structure of Apache Arrow Tables is highly similar to that of CASA Tables.
Arrow Tables are easy to exchange between many different languages.
Once in Apache Arrow format, it is easy to store data in modern, cloud-native disk formats such as Parquet and ORC.
Converting CASA Tables to Arrow in the C++ layer avoids Python's Global Interpreter Lock (GIL).
Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread.
Working in the C++ layer also allows us to write astrometric routines in C++, potentially side-stepping thread-safety and GIL issues with the CASA Measures server.
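The single-thread constraint mentioned above can be illustrated in pure Python. This is only a sketch of the general pattern, not arcae's C++ implementation: `FakeTable` and `SerialisedTable` are hypothetical names, and the standard-library `ThreadPoolExecutor` stands in for the C++ thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeTable:
    """Hypothetical stand-in for a non-thread-safe table handle."""

    def __init__(self, nrow):
        self._nrow = nrow
        self.seen_threads = set()

    def read_rows(self, startrow, nrow):
        # Record which thread actually touched the table
        self.seen_threads.add(threading.get_ident())
        stop = min(startrow + nrow, self._nrow)
        return list(range(startrow, stop))


class SerialisedTable:
    """Funnels every call to the wrapped table through one worker thread."""

    def __init__(self, table):
        self._table = table
        self._pool = ThreadPoolExecutor(max_workers=1)

    def read_rows(self, startrow, nrow):
        # Returns a Future; the read itself runs on the single worker
        return self._pool.submit(self._table.read_rows, startrow, nrow)

    def close(self):
        self._pool.shutdown(wait=True)


table = FakeTable(nrow=100)
serial = SerialisedTable(table)

futures = []
futures_lock = threading.Lock()


def caller(startrow):
    # Each caller thread only submits work; it never touches the table
    future = serial.read_rows(startrow, 10)
    with futures_lock:
        futures.append(future)


callers = [threading.Thread(target=caller, args=(s,)) for s in range(0, 40, 10)]
for thread in callers:
    thread.start()
for thread in callers:
    thread.join()

results = sorted(future.result() for future in futures)
serial.close()
```

Although four caller threads request reads concurrently, `table.seen_threads` ends up holding a single thread identifier, because only the pool's one worker ever calls into the table.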
Build Wheel Locally
In your user environment or, even better, a virtual environment:
$ pip install -U pip cibuildwheel
$ bash scripts/run_cbuildwheel.sh -p 3.10
Local Development
In the directory containing the source, set up your development environment as follows:
$ pip install -U pip virtualenv
$ virtualenv -p python3.10 /venv/arcaedev
$ . /venv/arcaedev/bin/activate
(arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi1-dbg
(arcaedev) pip install -e .[test]
(arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
(arcaedev) py.test -s -vvv --pyargs arcae
Usage
Example Usage:
import json
from pprint import pprint

import arcae
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Obtain (partial) Apache Arrow Table from a CASA Table
casa_table = arcae.table("/path/to/measurementset.ms")
arrow_table = casa_table.to_arrow()        # read entire table
arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
assert isinstance(arrow_table, pa.Table)

# Print JSON-encoded Table and Column keywords
pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

# Extract Arrow Table columns into numpy arrays
time = arrow_table.column("TIME").to_numpy()
data = arrow_table.column("DATA").to_numpy()  # currently, arrays of object arrays, overly slow and memory hungry
df = arrow_table.to_pandas()                  # currently slow, memory hungry due to arrays of object arrays

# Write Arrow Table to parquet file
pq.write_table(arrow_table, "measurementset.parquet")
See the test cases for further use cases.
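Because `to_arrow` accepts a `(startrow, nrow)` pair, a large table can be read in pieces rather than all at once. The helper below is a minimal sketch of how such pairs might be generated. `row_chunks` is a hypothetical name, not part of arcae, and obtaining the total row count is left to the caller.

```python
def row_chunks(total_rows, chunk_rows):
    """Yield (startrow, nrow) pairs that cover total_rows without overlap."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for startrow in range(0, total_rows, chunk_rows):
        # The final chunk may be shorter than chunk_rows
        yield startrow, min(chunk_rows, total_rows - startrow)


chunks = list(row_chunks(total_rows=25, chunk_rows=10))
print(chunks)  # [(0, 10), (10, 10), (20, 5)]

# Hypothetical use with the objects from the example above:
# for startrow, nrow in row_chunks(total_rows, 50_000):
#     part = casa_table.to_arrow(startrow, nrow)
#     pq.write_table(part, f"part{startrow}.parquet")
```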
Exporting Measurement Sets to Arrow Parquet Datasets
An export script is available:
$ arcae export /path/to/the.ms --nrow 50000
$ tree output.arrow/
output.arrow/
├── ANTENNA
│   └── data0.parquet
├── DATA_DESCRIPTION
│   └── data0.parquet
├── FEED
│   └── data0.parquet
├── FIELD
│   └── data0.parquet
├── MAIN
│   └── FIELD_ID=0
│       └── PROCESSOR_ID=0
│           ├── DATA_DESC_ID=0
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=1
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=2
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           └── DATA_DESC_ID=3
│               ├── data0.parquet
│               ├── data1.parquet
│               ├── data2.parquet
│               └── data3.parquet
├── OBSERVATION
│   └── data0.parquet
This data can be loaded into an Arrow Dataset:
>>> import pyarrow as pa
>>> import pyarrow.dataset as pad
>>> main_ds = pad.dataset("output.arrow/MAIN")
>>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")
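The directories under MAIN in the listing above use `KEY=value` names, which resemble hive-style partitioning. The sketch below shows how those path components map to partition values using only the standard library. `partition_keys` is a hypothetical helper, and treating the values as integers is an assumption based on the listing.

```python
from pathlib import PurePosixPath


def partition_keys(path):
    """Return the KEY=value directory components of a path as a dict."""
    keys = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name
        key, sep, value = part.partition("=")
        if sep:
            # Assumption: partition values are integers, as in the listing
            keys[key] = int(value) if value.isdigit() else value
    return keys


main_file = "output.arrow/MAIN/FIELD_ID=0/PROCESSOR_ID=0/DATA_DESC_ID=2/data1.parquet"
print(partition_keys(main_file))
# {'FIELD_ID': 0, 'PROCESSOR_ID': 0, 'DATA_DESC_ID': 2}
```

`pyarrow.dataset.dataset` accepts a `partitioning="hive"` argument that performs comparable parsing and exposes the keys as columns. Whether the export script relies on that is not stated here.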
Limitations
Some edge cases have not yet been implemented, but could be with some thought.
Columns with unconstrained rank (ndim == -1) whose rows, in practice, have differing dimensions are not supported. Unconstrained-rank columns whose rows all have the same rank are catered for.
TpRecord columns are not yet handled. It is probably simplest to convert these rows to JSON and store them as strings.
TpQuantity columns are not yet handled. They could be represented as a run-time parametric Arrow DataType.
to_numpy() conversion of nested lists produces nested numpy arrays instead of tensors. Producing tensors is possible but requires some changes to how C++ Extension Types are exposed in Python.
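The JSON approach suggested for TpRecord columns could look roughly like the following. This is a sketch of the idea only, not something arcae currently does: `records_to_json` is a hypothetical helper, and records are assumed to arrive as plain Python dicts with JSON-serialisable values.

```python
import json


def records_to_json(rows):
    """Encode each record (a dict) as a compact JSON string; None stays None."""
    return [
        None if row is None else json.dumps(row, sort_keys=True, separators=(",", ":"))
        for row in rows
    ]


rows = [{"unit": "Hz", "value": 1.4e9}, None, {"nested": {"a": [1, 2]}}]
encoded = records_to_json(rows)

# Round-trip check: decoding recovers the original records
decoded = [None if s is None else json.loads(s) for s in encoded]
```

The resulting list of strings (with `None` for missing rows) is the kind of data that fits in an ordinary Arrow string column.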
Etymology
Noun: arca f (genitive arcae); first declension
A chest, box, coffer, safe (safe place for storing items, or anything of a similar shape)
Pronounced: ar-ki.