Bitemporal HDF5
Project description
bitemporal-h5
A generic bitemporal model built on HDF5 (h5py)
Model
The basic model for a bitemporal is an HDF5 dataset that is extensible
along a single dimension with named columns and different dtypes for
each column. In-memory, this will be represented by a numpy structured array.
We will call this structure a Table
, for purposes here.
Note that HDF5 has its own Table data structure in the high-level
interface (hdf5-hl). We will not be using the high-level table here for
a couple of reasons. The first is that h5py
does not support HDF5's
high-level constructs. The second is that we plan on eventually swapping out
the value column with a deduplicated cache. Relying on low-level HDF5 constructs
grants us this flexibility in the future.
The columns present in the table are as follows:
transaction_id (uint64)
: This is a monotonic integer that represents the precise write action that caused this row to be written. Multiple rows may be written at the same time, so this value is not unique among rows, though presumably all rows with a given transaction id are contiguous in the table. This value is zero-indexed. The current largest transaction id should be written to the table's attributes asmax_transaction_id
(also uint64). Write operations should bump themax_transaction_id
by one.transaction_time (datetime64)
: This is a timestamp (sec since epoch). Any metadata about the timezones should be stored as a string attribute of the dataset astransaction_time_zone
. This represents the time at which the data was recorded by the write operation. All rows with the sametransaction_id
should have the same value here.valid_time (datetime64)
: This is a timestamp (sec since epoch). Any metadata about the timetzones should be stored as a string attribute of the dataset asvalid_time_zone
. This is the primary axis of the time series. It represents the data stored in thevalue
column.value ((I,J,K,...)<scalar-type>|)
: This column represents the actual values of a time series. This may be an N-dimensional array of any valid dtype. It is likely sufficient to restrict ourselves to floats and ints, but the model should be general enough to accept any scalar dtypes. Additionally, the typical usecase will be for this column to be a scalar float value.
Therefor an example numpy dtype with float values and a shape of (1, 2, 3)
is:
np.dtype([
('transaction_id', '<uint64'),
('transaction_time', '<M8'),
('valid_time', '<M8'),
('value', '<f8', (1, 2, 3))
])
Quickstart API
The interface for writing to the bitemporal HDF5 storage is as follows:
import bth5
with bth5.open("/path/to/file.h5", "/path/to/group", "a+") as ds:
# all writes are staged into a single transaction and then
# written when the context exits. Transaction times and IDs
# are automatically applied. The first write call determines
# the dtype & shape of value, if the data set does not already
# exist.
ds.write(t1, p1)
ds.write(valid_time=t2, value=p2)
ds.write(valid_time=[t3, t4, t5], value=[p3, p4, p5])
ds.write((t6, p6))
ds.write([(t7, p7), (t8, p8)])
Reading from the data set should follow the normal numpy indexing:
# opened in read-only mode
>>> ds = bth5.open("/path/to/file.h5", "/path/to/group")
>>> in_mem = ds[:]
>>> in_mem.dtype
np.dtype([
('valid_time', '<M8'),
('value', '<f8', (1, 2, 3))
])
This again should return the latest valid time and value which is present in the
reduced dataset. Furthermore, there is an escape valve to reach the actual
dataset as represented on-disk. This is the raw
attribute, that returns
a reference to the h5py dataset.
>>> ds = bth5.open("/path/to/file.h5", "/path/to/group")
>>> ds.raw
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.