shardedstore

Provides a sharded Zarr store.

Features

  • For large Zarr stores, avoid both an excessive number of objects and extremely large objects, which works around filesystem inode limits and object store limitations.
  • Performance-sensitive implementation.
  • Use existing Zarr v2 stores.
  • Mix and match shard store types.
  • Serialize and deserialize the ShardedStore in JSON.
  • Shard groups or array chunks.
  • Easily run transformations on store shards.
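
To make the idea behind these features concrete, here is a minimal, dependency-free sketch of key routing: keys under a configured path prefix go to a separate mapping, and everything else goes to a base mapping. `ToyShardedMapping` and its prefix-stripping rule are assumptions made for illustration only; this is not shardedstore's implementation.

```python
from collections.abc import MutableMapping


class ToyShardedMapping(MutableMapping):
    """Hypothetical illustration: route keys to per-prefix mappings."""

    def __init__(self, base, shards):
        self.base = base            # mapping for keys outside every shard
        self.shards = dict(shards)  # path prefix -> mapping

    def _route(self, key):
        # Try the longest prefix first so nested prefixes take priority.
        for prefix in sorted(self.shards, key=len, reverse=True):
            if key.startswith(prefix + "/"):
                return self.shards[prefix], key[len(prefix) + 1:]
        return self.base, key

    def __getitem__(self, key):
        store, subkey = self._route(key)
        return store[subkey]

    def __setitem__(self, key, value):
        store, subkey = self._route(key)
        store[subkey] = value

    def __delitem__(self, key):
        store, subkey = self._route(key)
        del store[subkey]

    def __iter__(self):
        # Report shard keys with their prefix restored.
        yield from self.base
        for prefix, store in self.shards.items():
            for subkey in store:
                yield f"{prefix}/{subkey}"

    def __len__(self):
        return len(self.base) + sum(len(s) for s in self.shards.values())


base, people = {}, {}
toy = ToyShardedMapping(base, {"people": people})
toy[".zgroup"] = b"{}"
toy["people/.zarray"] = b"{}"
toy["people/0"] = b"chunk bytes"
```

After these writes, `base` holds only `.zgroup`, while `people` holds `.zarray` and `0`, so each underlying mapping stays small even though `toy` presents one flat key space.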

Installation

pip install shardedstore

Example

from shardedstore import ShardedStore, array_shard_directory_store, to_zip_store_with_prefix

from zarr.storage import DirectoryStore

# xarray example, but works with zarr in general
import xarray as xr
from datatree import DataTree, open_datatree
import json
import numpy as np
import os

Create component shard stores

base_store = DirectoryStore("base.zarr")
shard1 = DirectoryStore("shard1.zarr")
shard2 = DirectoryStore("shard2.zarr")
array_shards1 = array_shard_directory_store("array_shards1")
array_shards2 = array_shard_directory_store("array_shards2")

Generate data for the example

# xarray-datatree Quick Overview
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
# Dimensions that an array is sharded over must have a chunk shape of 1.
data = data.chunk({'x': 1, 'y': 2})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})
ds2 = ds2.chunk({'x':1, 'y':2})
ds3 = xr.Dataset(
    dict(people=["alice", "bob"], heights=("people", [1.57, 1.82])),
    coords={"species": "human"},
)
dt = DataTree.from_dict({"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3})

A monolithic store

single_store = DirectoryStore("single.zarr")
dt.to_zarr(single_store)

A sharded store demonstrating sharding on groups and arrays.

Arrays are sharded over 1 dimension.

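sharded_store = ShardedStore(
    base_store,
    # Paths whose keys are kept in their own shard stores.
    {'people': shard1, 'species': shard2},
    # Arrays sharded over 1 dimension, each paired with its array shard store.
    {'simulation/coarse/foo': (1, array_shards1), 'simulation/fine/foo': (1, array_shards2)})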
dt.to_zarr(sharded_store)
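
As a rough illustration of what sharding an array over 1 dimension can mean, the sketch below splits Zarr v2 style chunk keys such as `0.1` into a shard identifier (the leading chunk index) and a key within that shard. The helper `split_chunk_key` and this exact key layout are assumptions for illustration; shardedstore's internal layout may differ.

```python
def split_chunk_key(chunk_key, sharded_dims, sep="."):
    """Hypothetical helper: split "0.1" into (shard id, within-shard key)."""
    indices = chunk_key.split(sep)
    if not 1 <= sharded_dims <= len(indices):
        raise ValueError("sharded_dims must be between 1 and the number of dimensions")
    shard_id = sep.join(indices[:sharded_dims])
    inner_key = sep.join(indices[sharded_dims:])
    return shard_id, inner_key


# "foo" has shape (2, 3) and chunks (1, 2), which gives a 2 x 2 chunk grid.
chunk_keys = ["0.0", "0.1", "1.0", "1.1"]

# Group the chunks by the shard they would land in under this sketch.
groups = {}
for key in chunk_keys:
    shard_id, inner_key = split_chunk_key(key, sharded_dims=1)
    groups.setdefault(shard_id, []).append(inner_key)
```

With a chunk shape of 1 along `x`, each `x` index has its own row of chunks, and in this sketch each row maps to its own shard.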

Serialize / deserialize

config = sharded_store.get_config()
config_str = json.dumps(config)
config = json.loads(config_str)
sharded_store = ShardedStore.from_config(config)
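
The round trip above relies on the configuration containing only JSON-native types. The following sketch shows that pattern with a toy class; `ToyStore` and its `path`/`shards` config schema are hypothetical and do not reflect the actual structure returned by `ShardedStore.get_config()`.

```python
import json


class ToyStore:
    """Hypothetical store described by a path and optional nested shards."""

    def __init__(self, path, shards=None):
        self.path = path
        self.shards = shards or {}

    def get_config(self):
        # Use only dicts and strings so json.dumps works without a custom encoder.
        return {
            "path": self.path,
            "shards": {k: s.get_config() for k, s in self.shards.items()},
        }

    @classmethod
    def from_config(cls, config):
        # Rebuild nested shards recursively from their own configs.
        shards = {k: cls.from_config(c) for k, c in config["shards"].items()}
        return cls(config["path"], shards)


store = ToyStore("base.zarr", {"people": ToyStore("shard1.zarr")})
config_str = json.dumps(store.get_config())
restored = ToyStore.from_config(json.loads(config_str))
```

Because the config is plain JSON, it can be written next to the data or passed between processes, and the receiving side can rebuild an equivalent object.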

Validate

from_single = open_datatree(single_store, engine='zarr').compute()
from_sharded = open_datatree(sharded_store, engine='zarr').compute()
assert from_single.identical(from_sharded)

Run transformations over component shards with map_shards

to_zip_stores = to_zip_store_with_prefix("zip_stores")
zip_sharded_stores = sharded_store.map_shards(to_zip_stores)
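
Conceptually, mapping over shards means applying one function to every component store and collecting the results into a new structure. The sketch below shows that idea with plain dictionaries; `map_toy_shards` and `freeze` are hypothetical names, and the callback signature here is an assumption, not the signature `map_shards` expects.

```python
from types import MappingProxyType


def map_toy_shards(shards, func):
    """Hypothetical helper: build a new shard table by applying func to each shard."""
    return {prefix: func(prefix, shard) for prefix, shard in shards.items()}


def freeze(prefix, shard):
    # Example transformation: copy the shard into a read-only snapshot.
    return MappingProxyType(dict(shard))


shards = {"people": {"0": b"a"}, "species": {"0": b"b"}}
frozen = map_toy_shards(shards, freeze)
```

The original shards are left untouched; the transformation produces a parallel set of stores, much as the example above produces zip-backed stores from the existing shards.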

Development

Contributions are welcome and appreciated.

git clone https://github.com/thewtex/shardedstore
cd shardedstore
pip install -e ".[test]"
pytest
