Tools for converting OME-Zarr data within the ome2024-ngff-challenge (see https://forum.image.sc/t/ome2024-ngff-challenge/97363)
ome2024-ngff-challenge
Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
Challenge overview
The high-level goal of the challenge is to generate OME-Zarr data according to a development version of the specification to drive forward the implementation work and establish a baseline for the conversion costs that members of the community can expect to incur.
Data generated within the challenge will have:
- all v2 arrays converted to v3, optionally sharding the data
- all .zattrs metadata migrated to zarr.json["attributes"]["ome"]
- a top-level ro-crate-metadata.json file with minimal metadata (specimen and imaging modality)
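The layout described above can be spot-checked on a local copy of a converted dataset. The following is a minimal sketch, not part of this package: the function name check_challenge_layout is hypothetical, and it only looks at the top-level zarr.json and ro-crate-metadata.json files, assuming the dataset has been downloaded to a local directory.

```python
import json
from pathlib import Path


def check_challenge_layout(root):
    """Return a list of problems found in a local dataset copy (empty if none).

    Hypothetical helper: it only inspects the two top-level files and does
    not validate the arrays or the OME metadata itself.
    """
    root = Path(root)
    problems = []

    meta_path = root / "zarr.json"
    if not meta_path.is_file():
        problems.append("missing top-level zarr.json")
    else:
        meta = json.loads(meta_path.read_text())
        # Zarr v3 metadata documents carry "zarr_format": 3.
        if meta.get("zarr_format") != 3:
            problems.append("top-level zarr.json is not Zarr v3")
        # The migrated .zattrs content is expected under attributes -> ome.
        if "ome" not in meta.get("attributes", {}):
            problems.append('zarr.json has no ["attributes"]["ome"] entry')

    if not (root / "ro-crate-metadata.json").is_file():
        problems.append("missing ro-crate-metadata.json")

    return problems


# Example path only; a missing directory is simply reported as problems.
for problem in check_challenge_layout("/tmp/6001240.zarr"):
    print(problem)
```

An empty result only means the expected files are present; the OME-NGFF Validator remains the tool for checking the metadata content.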
You can examine the contents of a sample dataset by using the minio client:
$ mc config host add uk1anon https://uk1s3.embassy.ebi.ac.uk "" ""
Added `uk1anon` successfully.
$ mc ls -r uk1anon/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr/
[2024-08-01 14:24:35 CEST] 24MiB STANDARD 0/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 598B STANDARD 0/zarr.json
[2024-08-01 14:24:32 CEST] 6.0MiB STANDARD 1/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 598B STANDARD 1/zarr.json
[2024-08-01 14:24:29 CEST] 1.6MiB STANDARD 2/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 592B STANDARD 2/zarr.json
[2024-08-01 14:24:28 CEST] 1.2KiB STANDARD ro-crate-metadata.json
[2024-08-01 14:24:28 CEST] 2.7KiB STANDARD zarr.json
The dataset can be inspected using a development version of the OME-NGFF Validator available at https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/ome2024-ngff-challenge/0.0.4/6001240.zarr
Converting your data
Getting started
The ome2024-ngff-challenge script can be used to convert an OME-Zarr 0.4 dataset that is based on Zarr v2:
ome2024-ngff-challenge input.zarr output.zarr
If you would like to re-run the script with different parameters, you can additionally set --output-overwrite to ignore a previous conversion:
ome2024-ngff-challenge input.zarr output.zarr --output-overwrite
Reading/writing remotely
If you would like to avoid downloading and/or uploading the Zarr datasets, you can set S3 parameters on the command-line which will then treat the input and/or output datasets as a prefix within an S3 bucket:
ome2024-ngff-challenge \
--input-bucket=BUCKET \
--input-endpoint=HOST \
--input-anon \
input.zarr \
output.zarr
A small example you can try yourself:
ome2024-ngff-challenge \
--input-bucket=idr \
--input-endpoint=https://uk1s3.embassy.ebi.ac.uk \
--input-anon \
zarr/v0.4/idr0062A/6001240.zarr \
/tmp/6001240.zarr
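If you need to launch many such conversions, the command line can be assembled programmatically. This is a minimal sketch under the assumption that the ome2024-ngff-challenge script is on your PATH; the helper name build_command is hypothetical, and it only uses the input flags shown above.

```python
import shlex


def build_command(source, target, bucket=None, endpoint=None, anon=False):
    """Assemble the argument list for one conversion (hypothetical helper)."""
    argv = ["ome2024-ngff-challenge"]
    if bucket:
        argv.append(f"--input-bucket={bucket}")
    if endpoint:
        argv.append(f"--input-endpoint={endpoint}")
    if anon:
        argv.append("--input-anon")
    # Positional arguments: input dataset (or prefix) and output dataset.
    argv += [source, target]
    return argv


argv = build_command(
    "zarr/v0.4/idr0062A/6001240.zarr",
    "/tmp/6001240.zarr",
    bucket="idr",
    endpoint="https://uk1s3.embassy.ebi.ac.uk",
    anon=True,
)
# Print a shell-quoted version for review before running anything.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) to run the conversion.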
Reading/writing via a script
Another R/W option is to have resave.py generate a script which you can execute later. If you pass --output-script, then rather than generating the arrays immediately, a file named convert.sh will be created for later execution.
For example, running:
ome2024-ngff-challenge dev2/input.zarr /tmp/scripts.zarr --output-script
produces a dataset with one zarr.json file and 3 convert.sh scripts:
/tmp/scripts.zarr/0/convert.sh
/tmp/scripts.zarr/1/convert.sh
/tmp/scripts.zarr/2/convert.sh
Each of the scripts contains a statement of the form:
zarrs_reencode --chunk-shape 1,1,275,271 --shard-shape 2,236,275,271 --dimension-names c,z,y,x --validate dev2/input.zarr /tmp/scripts.zarr
Running this script will require having installed zarrs_tools with:
cargo install zarrs_tools
export PATH=$PATH:$HOME/.cargo/bin
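Once zarrs_tools is installed, the generated scripts can be collected and executed in one pass. The following is a minimal sketch: the function names are hypothetical, the scripts are assumed to be plain shell scripts runnable with sh, and they are run from the current working directory because the generated statements may contain relative paths such as dev2/input.zarr.

```python
import subprocess
from pathlib import Path


def find_convert_scripts(root):
    """Return every convert.sh below root, in sorted order."""
    return sorted(Path(root).rglob("convert.sh"))


def run_convert_scripts(root, shell="sh"):
    """Run each generated script in turn, stopping at the first failure.

    Call this from the directory where the original command was run,
    since the scripts may refer to the input dataset by a relative path.
    """
    for script in find_convert_scripts(root):
        print(f"running {script}")
        subprocess.run([shell, str(script)], check=True)
```

For the example above, run_convert_scripts("/tmp/scripts.zarr") would execute the three scripts for resolutions 0, 1 and 2 one after another.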
Optimizing chunks and shards
Finally, there is not yet a single heuristic for determining the chunk and shard sizes that will work for all data. Pass the --output-chunks and --output-shards flags in order to set the size of chunks and shards for all resolutions:
ome2024-ngff-challenge input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
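Before starting a long conversion it can help to check that a candidate shard shape is a whole multiple of the chunk shape in every dimension, as the Zarr v3 sharding codec expects, and to see how many chunks each shard will hold. This is a minimal sketch with hypothetical helper names; it does not reproduce any check performed by the script itself.

```python
from math import prod


def parse_shape(text):
    """Turn a comma-separated flag value such as "1,1,1,256,256" into a tuple."""
    return tuple(int(part) for part in text.split(","))


def chunks_per_shard(chunks, shards):
    """Return how many chunks fit in one shard, or raise if the shapes disagree."""
    if len(chunks) != len(shards):
        raise ValueError("chunks and shards must have the same number of dimensions")
    counts = []
    for chunk, shard in zip(chunks, shards):
        if chunk <= 0 or shard % chunk != 0:
            raise ValueError(f"shard size {shard} is not a multiple of chunk size {chunk}")
        counts.append(shard // chunk)
    return prod(counts)


# The values from the command above: 8 x 8 chunks in the y/x plane per shard.
print(chunks_per_shard(parse_shape("1,1,1,256,256"), parse_shape("1,1,1,2048,2048")))  # 64
```

The same check applied to the zarrs_reencode statement shown earlier (chunks 1,1,275,271 and shards 2,236,275,271) gives 472 chunks per shard.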
Alternatively, you can use a JSON file to review and manually optimize the chunking and sharding parameters on a per-resolution basis:
ome2024-ngff-challenge input.zarr parameters.json --output-write-details
This will write a JSON file of the form:
[{"shape": [...], "chunks": [...], "shards": [...]}, ...
where the order of the dictionaries matches the order of the "datasets" field in the "multiscales". Edits to this file can be read back in using the --output-read-details flag:
ome2024-ngff-challenge input.zarr output.zarr --output-read-details=parameters.json
Note: Changes to the shape are ignored.
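The parameters file can be edited by hand, or with a few lines of Python. The following is a minimal sketch that assumes the file is a JSON list of dictionaries with "shape", "chunks" and "shards" keys, as in the form shown above; the function name update_resolution is hypothetical. The "shape" entry is left untouched, since changes to it are ignored.

```python
import json
from pathlib import Path


def update_resolution(details_path, index, chunks=None, shards=None):
    """Set new chunks and/or shards for one resolution and rewrite the file.

    index follows the order of the "datasets" field in the "multiscales".
    """
    path = Path(details_path)
    details = json.loads(path.read_text())
    entry = details[index]
    ndim = len(entry["shape"])

    for key, value in (("chunks", chunks), ("shards", shards)):
        if value is None:
            continue
        if len(value) != ndim:
            raise ValueError(f"{key} must have {ndim} dimensions")
        entry[key] = list(value)

    path.write_text(json.dumps(details))
    return details


# Example (requires an existing parameters.json):
# update_resolution("parameters.json", 0, chunks=[1, 1, 1, 256, 256],
#                   shards=[1, 1, 1, 2048, 2048])
```

The edited file is then passed back to the script with --output-read-details=parameters.json.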
Related work
The following additional PRs are required to work with the data created by the scripts in this repository:
- https://github.com/ome/ome-ngff-validator/pull/36
- https://github.com/ome/ome-zarr-py/pull/383
- https://github.com/hms-dbmi/vizarr/pull/172
- https://github.com/LDeakin/zarrs_tools/issues/8
Slightly less related but important at the moment: