

Project description

Rationale

  • The structure of Apache Arrow Tables is highly similar to that of CASA Tables

  • It’s easy to convert Arrow Tables between many different languages

  • Once in Apache Arrow format, it is easy to store data in modern, cloud-native disk formats such as Parquet and ORC.

  • Converting CASA Tables to Arrow in the C++ layer avoids the GIL

  • Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread

  • It also allows us to write astrometric routines in C++, potentially side-stepping thread-safety and GIL issues with the CASA Measures server.

Build Wheel Locally

In your user environment or, even better, in a virtual environment:

$ pip install -U pip cibuildwheel
$ bash scripts/run_cbuildwheel.sh -p 3.8

Local Development

In the directory containing the source, set up your development environment as follows:

$ pip install -U pip virtualenv
$ virtualenv -p python3.8 /venv/arcaedev
$ . /venv/arcaedev/bin/activate
(arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi0-dbg  # change the -dbg suffix to -rel for a release build
(arcaedev) pip install -e .[test]
(arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
(arcaedev) py.test -s -vvv --pyargs arcae

Usage

Example Usage:

import json
from pprint import pprint

import arcae
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Obtain (partial) Apache Arrow Table from a CASA Table
casa_table = arcae.table("/path/to/measurementset.ms")
arrow_table = casa_table.to_arrow()        # read entire table
arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
assert isinstance(arrow_table, pa.Table)

# Print JSON-encoded Table and Column keywords
pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

# Extract Arrow Table columns into numpy arrays
time = arrow_table.column("TIME").to_numpy()
data = arrow_table.column("DATA").to_numpy()   # currently arrays of object arrays: slow and memory-hungry
df = arrow_table.to_pandas()                   # currently slow and memory-hungry for the same reason

# Write Arrow Table to parquet file
pq.write_table(arrow_table, "measurementset.parquet")

See the test cases for further use cases.

Exporting Measurement Sets to Arrow Parquet Datasets

An export script is available:

$ arcae export /path/to/the.ms --nrow 50000
$ tree output.arrow/
output.arrow/
├── ANTENNA
│   └── data0.parquet
├── DATA_DESCRIPTION
│   └── data0.parquet
├── FEED
│   └── data0.parquet
├── FIELD
│   └── data0.parquet
├── MAIN
│   └── FIELD_ID=0
│       └── PROCESSOR_ID=0
│           ├── DATA_DESC_ID=0
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=1
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           ├── DATA_DESC_ID=2
│           │   ├── data0.parquet
│           │   ├── data1.parquet
│           │   ├── data2.parquet
│           │   └── data3.parquet
│           └── DATA_DESC_ID=3
│               ├── data0.parquet
│               ├── data1.parquet
│               ├── data2.parquet
│               └── data3.parquet
├── OBSERVATION
│   └── data0.parquet

This data can be loaded into an Arrow Dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as pad
>>> main_ds = pad.dataset("output.arrow/MAIN")
>>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")

Limitations

Some edge cases have not yet been implemented, but could be with some thought.

  • Not yet able to handle columns with unconstrained rank (ndim == -1). Probably simplest to convert these rows to JSON and store them as strings.

  • Not yet able to handle TpRecord columns. Probably simplest to convert these rows to JSON and store them as strings.

  • Not yet able to handle TpQuantity columns. Possible to represent as a run-time parametric Arrow DataType.

  • to_numpy() conversion of nested lists produces nested numpy arrays, instead of tensors. This is possible but requires some changes to how C++ Extension Types are exposed in Python.

Etymology

Noun: arca f (genitive arcae); first declension. A chest, box, coffer, safe (a safe place for storing items, or anything of a similar shape).

Pronounced: ar-ki.

Project details


Download files

Download the file for your platform.

Source Distribution

arcae-0.1.0.tar.gz (13.8 kB)

Uploaded: Source

Built Distributions

arcae-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.7 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.17+, x86-64

arcae-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.7 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.17+, x86-64

arcae-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.7 MB)

Uploaded: CPython 3.9, manylinux: glibc 2.17+, x86-64

arcae-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.7 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.17+, x86-64

File details

Details for the file arcae-0.1.0.tar.gz.

File metadata

  • Download URL: arcae-0.1.0.tar.gz
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for arcae-0.1.0.tar.gz

  • SHA256: 1b6b95bbfe0fa171d793412d31df7c2f44a0aaa37af02b9af709f0700cba1f9a
  • MD5: 391a14aa9c2282e473e719e719c94e1c
  • BLAKE2b-256: e828fdaf7ed54d5cd6c3b45c4ae379be5990a26285b3d6ae1103c72351d5eac8

See more details on using hashes here.

File details

Details for the file arcae-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: d3db27ea43de0605c178bf072ed9c05c8b8a60eaa71e7c924a4bfe7faa375499
  • MD5: 708d598509776d9b3772de1752df1810
  • BLAKE2b-256: e95a025ea340b6b2e0190b12b07a18a1f7145f3bd6eb5249dc7086db0199a9c7


File details

Details for the file arcae-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: d1576eb23f65961266a70b6504fbf8600e1b67edd495b6d9f54262f56876cfd1
  • MD5: bf5ccdb4b8529d95fd5acbe41848cc47
  • BLAKE2b-256: 3f25dbb0ceb3315bdbbee709bbf3886c93c6753bf8246009809f344033b3629f


File details

Details for the file arcae-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 946eaea6d00159b6ab4a3c1bec2d2b9d4ad2158469517634b59217c9060c4d03
  • MD5: 7871b24b9d1a0722d1d4ef10970526fb
  • BLAKE2b-256: af2ee6e8556466affd97069a6d4459c88f00a0241eeda911bb9963025b9c9dbf


File details

Details for the file arcae-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arcae-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 44bbcce87cdaa7d4f5e655cca5a91b57f0c876c17d4c83756bbfc81c8753d358
  • MD5: 74237785dc30113b267e762d658f8bb7
  • BLAKE2b-256: a62e547e6d299e7c5152eced029585012f9de151dd4ccb304f7a11a21e034e17

