tiledb_netcdf
An adapter to convert NetCDF files to TileDB or Zarr arrays.
Usage Examples
Here we'll demonstrate using this library to convert NetCDF files to TileDB and Zarr, and read the results using Iris and Xarray.
Converting to TileDB
TileDB supports direct interaction with blob storage as provided by public cloud platforms. Currently this library only supports Azure Blob Storage containers, but we will add AWS support in a future release. If you'd like that support right now, do raise an issue 👆!
1. Create a common data model
Use the data model class NCDataModel to create a data model:
from nctotdb import NCDataModel
data_model = NCDataModel('/path/to/my/file.nc')
data_model.populate()
Manually modifying classification
Classifying NetCDF variables is far from an exact science, and occasionally the process may fail to classify a variable correctly. In such a case you can manually modify the classification process by using the following instead of calling data_model.populate():
data_model = NCDataModel('/path/to/my/file.nc')
my_bespoke_data_var_name = 'foobarbaz'
with data_model.open_netcdf():
    data_model.classify_variables()
    data_model.data_var_names = [my_bespoke_data_var_name]
    data_model.get_metadata()
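To see why classification can go wrong, here is a toy classifier over plain dictionaries. It is only an illustration and is not the algorithm NCDataModel uses; the function name, the category names and the variable names are all made up for this sketch. It shows how a coordinate that nothing points to ends up looking like a data variable, which is the kind of case where you would set the data variable names by hand.

```python
# Illustrative only: a toy classifier, NOT the algorithm used by NCDataModel.

def classify(variables, dimensions, referenced=()):
    """Split variable names into rough categories.

    variables:  dict of variable name -> tuple of dimension names
    dimensions: names of the dimensions in the file
    referenced: names that other variables point to through their attributes
                (for example auxiliary coordinates or bounds)
    """
    dimensions = set(dimensions)
    referenced = set(referenced)
    result = {"dim_coords": [], "aux": [], "data_vars": [], "scalars": []}
    for name, dims in variables.items():
        if name in dimensions and tuple(dims) == (name,):
            result["dim_coords"].append(name)  # 1D variable named after its dimension
        elif name in referenced:
            result["aux"].append(name)         # another variable points to it
        elif not dims:
            result["scalars"].append(name)     # no dimensions at all
        else:
            result["data_vars"].append(name)   # everything else is assumed to be data
    return result


variables = {
    "time": ("time",),
    "latitude": ("latitude",),
    "forecast_period": ("time",),
    "air_temperature": ("time", "latitude"),
}

# With complete metadata, forecast_period is recognised as auxiliary.
good = classify(variables, ["time", "latitude"], referenced=["forecast_period"])
print(good["data_vars"])  # ['air_temperature']

# If nothing references forecast_period, it is mistaken for a data variable.
bad = classify(variables, ["time", "latitude"])
print(bad["data_vars"])   # ['forecast_period', 'air_temperature']
```

In the second case the fix is the same in spirit as the manual override shown above: replace the list of data variable names with the ones you know to be correct.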
2. Write the data model
With a data model created, we can convert the contents of the NetCDF file to a supported cloud-ready file format (one of TileDB or Zarr). We'll demonstrate writing to TileDB here, with Zarr covered in a later section.
As mentioned earlier, we can write either to posix-like filepaths or directly to an Azure Storage account blob container. Let's cover each of these in turn.
2a. Filepaths
from nctotdb import TileDBWriter
# TileDB.
tiledb_save_path = '/path/to/my_tdb'
tiledb_name = 'my_tiledb'
unlimited_dims = 'z' # Useful if you know you're going to need to append to the `z` dimension
writer = TileDBWriter(data_model,
                      array_filepath=tiledb_save_path,
                      array_name=tiledb_name,
                      unlimited_dims=unlimited_dims)
writer.create_domains()
2b. Blob container
Some more setup is needed to interface with an Azure Storage account Blob container. We need to provide authentication to the Azure Storage account and configure TileDB operations to work with the Blob container we wish to write to:
import tiledb
# Azure blob storage definitions.
storage_account_name = 'my_azure_storage_account_name'
container = 'my_blob_container_name'
uri = f'azure://{container}'
access_key = 'my_azure_access_key'
# TileDB configuration for Azure Blob.
cfg = tiledb.Config()
cfg['vfs.azure.storage_account_name'] = storage_account_name
cfg['vfs.azure.storage_account_key'] = access_key
cfg["vfs.s3.use_multipart_upload"] = 'false'
ctx = tiledb.Ctx(config=cfg)
Important! Do not share or publish your Azure Storage account key! You can also set an environment variable that TileDB will use instead of pasting your account key into your code.
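One way to keep the key out of your source is to read it from the environment in your own script and fill in the configuration from there. This is a slightly different route from letting TileDB pick up an environment variable itself. The sketch below is a minimal example under stated assumptions: the helper name and the two environment variable names are choices made for this example, so change them to whatever your environment uses. Only the two vfs.azure.* setting names come from the configuration shown above.

```python
import os


def azure_settings_from_env(env=None,
                            account_var="AZURE_STORAGE_ACCOUNT",
                            key_var="AZURE_STORAGE_KEY"):
    """Build the Azure settings for TileDB from environment variables.

    The default variable names are assumptions made for this example.
    """
    env = os.environ if env is None else env
    missing = [name for name in (account_var, key_var) if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "vfs.azure.storage_account_name": env[account_var],
        "vfs.azure.storage_account_key": env[key_var],
    }


# Demonstration with a stand-in mapping rather than the real environment.
settings = azure_settings_from_env({"AZURE_STORAGE_ACCOUNT": "my_account",
                                    "AZURE_STORAGE_KEY": "not-a-real-key"})
print(sorted(settings))
```

Each entry in the returned dictionary can then be copied into the TileDB config in the same way as the assignments above, for example cfg[name] = value for every name and value in settings.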
Now we can write to our TileDB array. This is much the same as with posix-like paths, other than that we must also pass the TileDB Ctx (context) object and specify a container rather than a filepath to save to:
writer = TileDBWriter(data_model,
                      container=container,
                      array_name=tiledb_name,
                      unlimited_dims=unlimited_dims,
                      ctx=ctx)
writer.create_domains()
3. Append
We can also append the contents of one or more extra NetCDF files along a named dimension. The extra NetCDF files can be specified either as a list of filepaths or as a list of data model objects. If filepaths are specified they will be automatically converted to data model objects.
append_files = ['file1.nc', 'file2.nc', 'file3.nc']
data_array_name = 'data' # The name of the data arrays in the TileDB array, typically `data`.
writer.append(append_files, unlimited_dims, data_array_name)
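When there are many files, it may be easier to build the list of filepaths from a directory than to type it out. The helper below is a small sketch using only the standard library; its name and arguments are made up for this example and it is not part of this library. The exclude argument is there so you can leave out the file that was used to create the array.

```python
import tempfile
from pathlib import Path


def find_append_files(directory, pattern="*.nc", exclude=()):
    """Return the matching filepaths in `directory` as a sorted list of strings."""
    excluded = {Path(name).name for name in exclude}
    return [str(path) for path in sorted(Path(directory).glob(pattern))
            if path.name not in excluded]


# Demonstration in a temporary directory holding empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("file3.nc", "file1.nc", "file2.nc", "notes.txt"):
        (Path(tmp) / name).touch()
    append_files = find_append_files(tmp, exclude=["file1.nc"])
    print([Path(p).name for p in append_files])  # ['file2.nc', 'file3.nc']
```

The sort is alphabetical, so file10.nc would come before file2.nc. Zero-pad the numbers in your filenames if you rely on the order.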
You can track the progress of append operations by enabling verbose mode:
writer.append(append_files, unlimited_dims, data_array_name,
              verbose=True)
If you have a large number of files to append or you simply want the append to complete faster, tiledb_netcdf can use dask to parallelise the append operation on a per-file basis. Assuming you have already set up a dask cluster, my_cluster:
import dask.distributed

# `my_cluster` is the dask cluster you have already set up.
client = dask.distributed.Client(my_cluster)
logfile = "append.log"
writer.append(append_files, unlimited_dims, data_array_name,
              parallel=True, logfile=logfile)
Note: it is recommended you also log parallel appends for error tracking, should anything go wrong during the append process.
3a. Scalar Append
One case of appending needs to be handled differently. This is the case where the datasets to be appended are scalar along the append dimension. For example, you may wish to append along the time dimension, but the base dataset and all files to be appended only contain a single (that is, scalar) time point. In this case a scalar append needs to be carried out.
Typically the append algorithm uses the separation between points along the append dimension to calculate the offsets of all datasets to be appended. With only a single point along the append dimension this is not possible, so instead you need to also supply a file to the append call that allows the offset between files to be calculated. To ensure the correct offset is calculated, this file must describe the next step in the append dimension from the file originally used to create the TileDB array.
The file used to calculate the offset is passed into the append operation using the baseline keyword argument. For example:
append_files = ['file1.nc', 'file2.nc', 'file3.nc', 'file4.nc', 'file5.nc']
data_array_name = 'data'
writer.append(append_files, unlimited_dims, data_array_name,
              baseline=append_files[0])
Note: The baseline file is only used to calculate the offset; passing it as baseline does not append it. If its contents should be appended too, you need to include it in the list of append files as well, as in the example above.
Note: All such appends with a scalar append dimension must be supplied with a baseline file to calculate the offset, even if an append has already successfully been carried out. If you try to perform an append along a scalar dimension without providing a baseline file to calculate the offset, you will encounter an error message:
ValueError: Cannot determine scalar step without a baseline dataset.
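The offset calculation described in this section can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the library's implementation: the function name and arguments are made up for this example, it assumes evenly spaced points along the append dimension, and it reuses the error message above only to show when a baseline is needed.

```python
def append_offsets(base_points, append_points, baseline_points=None):
    """Return the start index of each appended dataset along the append dimension.

    base_points:     points of the dataset used to create the array
    append_points:   one list of points per dataset to append
    baseline_points: points of the dataset one step after the base dataset;
                     only needed when the base dataset holds a single point
    """
    if len(base_points) > 1:
        # Usual case: the step is the separation between neighbouring points.
        step = base_points[1] - base_points[0]
    elif baseline_points is not None:
        # Scalar case: the step is the gap between the base and baseline datasets.
        step = baseline_points[0] - base_points[0]
    else:
        raise ValueError("Cannot determine scalar step without a baseline dataset.")
    if step == 0:
        raise ValueError("The step along the append dimension must be non-zero.")
    return [round((points[0] - base_points[0]) / step) for points in append_points]


# Usual append: three time points per file, six hours apart.
print(append_offsets([0, 6, 12], [[18, 24, 30], [36, 42, 48]]))  # [3, 6]

# Scalar append: one time point per file, so a baseline is required.
files = [[6], [12], [18]]
print(append_offsets([0], files, baseline_points=files[0]))      # [1, 2, 3]
```

In the scalar example the baseline is the first file to append, which mirrors passing baseline=append_files[0] in the call above.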
4. Read Converted Arrays
We can use the Reader classes to read our TileDB array with Iris or Xarray:
from nctotdb import TileDBReader
# From a posix-like filepath:
tiledb_reader = TileDBReader(tiledb_name, array_filepath=tiledb_save_path)
# Or directly from Azure Blob:
tiledb_reader = TileDBReader(tiledb_name, container=container, ctx=ctx)
# TileDB to Iris.
cubes = tiledb_reader.to_iris() # Convert all TileDB arrays to Iris Cubes.
cube = tiledb_reader.to_iris('array_name') # Convert a named variable to an Iris Cube.
# TileDB to Xarray.
dss = tiledb_reader.to_xarray() # Convert all TileDB arrays to Xarray.
ds = tiledb_reader.to_xarray('array_name') # Convert a named variable to an Xarray dataset.
Converting to Zarr
We can also convert NetCDF files to Zarr using this library, and read these Zarrs back into Iris and Xarray. A similar set of APIs is provided for Zarr as was provided for TileDB.
1. Create a common data model
This is exactly the same as for TileDB. The differentiation comes at the next step when we choose the data representation format we want to use to store the contents of the NetCDF file represented by the data model.
from nctotdb import NCDataModel
my_nc_filepath = '/path/to/my/file.nc'
data_model = NCDataModel(my_nc_filepath)
data_model.populate()
2. Write to Zarr
With a data model created we can write the contents of the NetCDF file as exposed via the data model. Here we write the contents to Zarr:
from nctotdb import ZarrWriter
zarr_writer = ZarrWriter(data_model, '/path/to/my_zarr',
                         array_name='my_zarr')
zarr_writer.create_zarr()
3. Append
We can also add the contents of other NetCDF files to the Zarr we created, and extend one of the Zarr's dimensions:
my_other_nc_filepath = '/path/to/my/other_file.nc'
other_data_model = NCDataModel(my_other_nc_filepath)
append_var_name = 'array_name'
append_dim = 'dimension_name'
zarr_writer.append(other_data_model, append_var_name, append_dim)
4. Read Zarr
Finally we can read the Zarr we created into Iris and Xarray:
from nctotdb import ZarrReader
zarr_reader = ZarrReader('/path/to/my_zarr')
cubes = zarr_reader.to_iris()
ds = zarr_reader.to_xarray()