shardedstore

Provides a sharded Zarr store.

Features

  • For large Zarr stores, avoid both an excessive number of objects and extremely large objects, which eases filesystem inode pressure and helps stay within object store limits.
  • Performance-sensitive implementation.
  • Use existing Zarr v2 stores.
  • Mix and match shard store types.
  • Serialize and deserialize the ShardedStore in JSON.
  • Shard groups or array chunks.
  • Easily run transformations on store shards.
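To make the idea of sharding groups concrete, here is a minimal sketch of routing store keys to separate mappings by group prefix. It is an illustration of the general technique only, not shardedstore's implementation: the class name PrefixRoutedStore and its _route method are hypothetical, and plain dicts stand in for real Zarr stores.

```python
from collections.abc import MutableMapping


class PrefixRoutedStore(MutableMapping):
    """Hypothetical illustration: route keys to shard mappings by group prefix."""

    def __init__(self, base, shards):
        self.base = base
        # Check longer prefixes first so nested groups match their own shard.
        self.shards = dict(sorted(shards.items(), key=lambda kv: -len(kv[0])))

    def _route(self, key):
        for prefix, shard in self.shards.items():
            if key.startswith(prefix + "/"):
                # Strip the prefix so each shard is a self-contained mapping.
                return shard, key[len(prefix) + 1:]
        return self.base, key

    def __getitem__(self, key):
        store, subkey = self._route(key)
        return store[subkey]

    def __setitem__(self, key, value):
        store, subkey = self._route(key)
        store[subkey] = value

    def __delitem__(self, key):
        store, subkey = self._route(key)
        del store[subkey]

    def __iter__(self):
        yield from self.base
        for prefix, shard in self.shards.items():
            for subkey in shard:
                yield f"{prefix}/{subkey}"

    def __len__(self):
        return len(self.base) + sum(len(s) for s in self.shards.values())


# Plain dicts stand in for the base store and two group shards.
base, people, species = {}, {}, {}
store = PrefixRoutedStore(base, {"people": people, "species": species})
store[".zgroup"] = b"{}"
store["people/.zarray"] = b"{}"
store["people/0"] = b"\x00"
store["species/0"] = b"\x01"
```

Because each shard only sees keys with the prefix removed, any mapping type can back any shard, which is one way the "mix and match" feature could be realized.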

Installation

pip install shardedstore

Example

from zarr.storage import DirectoryStore
from shardedstore import ShardedStore, array_shard_directory_store, to_zip_store_with_prefix

# xarray example, but works with zarr in general
import xarray as xr
from datatree import DataTree, open_datatree
import json
import numpy as np
import os

base_store = DirectoryStore("base.zarr")
shard1 = DirectoryStore("shard1.zarr")
shard2 = DirectoryStore("shard2.zarr")
array_shards1 = array_shard_directory_store("array_shards1")
array_shards2 = array_shard_directory_store("array_shards2")

# xarray-datatree Quick Overview
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
# Sharded array dimensions must have a chunk shape of 1; here "x" is chunked by 1 and "y" by 2.
data = data.chunk([1, 2])
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})
ds2 = ds2.chunk({'x':1, 'y':2})
ds3 = xr.Dataset(
    dict(people=["alice", "bob"], heights=("people", [1.57, 1.82])),
    coords={"species": "human"},
    )
dt = DataTree.from_dict({"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3})

# A monolithic store
single_store = DirectoryStore("single.zarr")
dt.to_zarr(single_store)

# A sharded store demonstrating sharding on groups and arrays. The arrays are sharded over 1 dimension.
sharded_store = ShardedStore(base_store,
    {'people': shard1, 'species': shard2},
    {'simulation/coarse/foo': (1, array_shards1), 'simulation/fine/foo': (1, array_shards2)})
dt.to_zarr(sharded_store)

# Serialize / deserialize
config = sharded_store.get_config()
config_str = json.dumps(config)
config = json.loads(config_str)
sharded_store = ShardedStore.from_config(config)

# `open_datatree` was imported directly above; the `datatree` module name itself is not in scope.
from_single = open_datatree(single_store, engine='zarr').compute()
from_sharded = open_datatree(sharded_store, engine='zarr').compute()

assert from_single.identical(from_sharded)

# Run transformations over component shards with `map_shards`
to_zip_stores = to_zip_store_with_prefix("zip_stores")
zip_sharded_stores = sharded_store.map_shards(to_zip_stores)
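The array sharding in the example above can be pictured in terms of Zarr v2 chunk keys, which join chunk indices with a separator ("." by default), for example simulation/coarse/foo/1.0. The sketch below assumes that the leading chunk indices select the shard and the remaining indices form the key inside it. That mapping is an assumption made for illustration, and split_chunk_key is a hypothetical helper, not part of shardedstore.

```python
import math
from collections import defaultdict
from itertools import product


def split_chunk_key(key, array_path, shard_dims, sep="."):
    """Split a Zarr v2 chunk key into (shard id, key inside the shard).

    Hypothetical helper. Returns None for keys that are not chunks of
    array_path, such as .zarray or .zattrs metadata.
    """
    prefix = array_path + "/"
    if not key.startswith(prefix):
        return None
    indices = key[len(prefix):].split(sep)
    if len(indices) < shard_dims or not all(i.isdigit() for i in indices):
        return None
    return sep.join(indices[:shard_dims]), sep.join(indices[shard_dims:])


# The "foo" array from the example: shape (2, 3) with chunk shape (1, 2).
shape, chunks = (2, 3), (1, 2)
grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]

# Group every chunk key by the shard it would land in when sharding 1 dimension.
array_path = "simulation/coarse/foo"
shards = defaultdict(list)
for idx in product(*(range(n) for n in grid)):
    key = array_path + "/" + ".".join(map(str, idx))
    shard_id, inner_key = split_chunk_key(key, array_path, shard_dims=1)
    shards[shard_id].append(inner_key)
```

Under these assumptions the 2 x 2 chunk grid is spread across two shards holding two chunks each, so no single shard needs to hold every object of the array.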

Development

Contributions are welcome and appreciated.

git clone https://github.com/thewtex/shardedstore
cd shardedstore
pip install -e ".[test]"
pytest
