# dynamic_chunks

Determine appropriate chunk sizes for a given xarray dataset based on a target chunk size and a 'chunk aspect ratio'.

The chunk aspect ratio describes the relative number of chunks along each dimension. Take a dataset with two dimensions (`a` and `b`): a chunk aspect ratio of `{'a': 2, 'b': 1}` means that the total number of chunks along `a` is twice that along `b`. This concept was inspired by a discussion with [Rich Signell](https://github.com/rsignell-usgs) and a follow-up sprint with [@amsnyder](https://github.com/amsnyder), [@thodson-usgs](https://github.com/thodson-usgs), [@alaws-USGS](https://github.com/alaws-USGS), and [@kjdoore](https://github.com/kjdoore) at SciPy '23. The idea is that one might want to optimize the chunking of a dataset so that, e.g., an operation along time is n times slower than an operation along the spatial dimensions.
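To make the definition concrete, here is a minimal sketch in plain Python (it does not use dynamic_chunks) that counts the chunks along each dimension and reduces the counts to a ratio. The dimension lengths, the chunk sizes, and the helper `chunk_counts` are all made up for illustration.

```python
import math


def chunk_counts(dim_sizes, chunk_sizes):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return {dim: math.ceil(dim_sizes[dim] / chunk_sizes[dim]) for dim in dim_sizes}


# Hypothetical array: 100 elements along a, 60 elements along b
dim_sizes = {"a": 100, "b": 60}
# Hypothetical chunking: 10 elements per chunk along a, 12 along b
chunk_sizes = {"a": 10, "b": 12}

counts = chunk_counts(dim_sizes, chunk_sizes)
# Express the counts relative to the dimension with the fewest chunks
smallest = min(counts.values())
ratio = {dim: n / smallest for dim, n in counts.items()}

print(counts)  # {'a': 10, 'b': 5}
print(ratio)   # {'a': 2.0, 'b': 1.0}
```

With 10 chunks along `a` and 5 along `b`, this chunking realizes the aspect ratio `{'a': 2, 'b': 1}`.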

## Usage

dynamic_chunks implements several algorithms with the input signature:

```python
chunk_dict = chosen_algorithm(ds, desired_chunksize, target_aspect_ratio, size_tolerance)
```

Let's demonstrate this with the xarray example dataset:

```python
import xarray as xr
from dynamic_chunks.algorithms import even_divisor_algo

ds = xr.tutorial.open_dataset("rasm").load()
```

### Rechunking with even divisors along dimensions

`dynamic_chunks.algorithms.even_divisor_algo` restricts the possible rechunking options to even divisors of each chunked dimension, i.e. chunk sizes that divide the dimension length without remainder.
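As a rough illustration of what this restriction means, the sketch below lists the candidate chunk sizes per dimension. It is plain Python and not the library's code; `divisors` is a hypothetical helper, and the dimension lengths (time=36, y=205, x=275) are the ones reflected in the example outputs further down.

```python
def divisors(n):
    """Return all positive integers that divide n without remainder."""
    return [d for d in range(1, n + 1) if n % d == 0]


# Dimension lengths of the example dataset, as implied by the outputs in this README
dims = {"time": 36, "y": 205, "x": 275}

# Every candidate chunk size along a dimension must be one of its divisors
candidates = {name: divisors(size) for name, size in dims.items()}
for name, options in candidates.items():
    print(name, options)
# time [1, 2, 3, 4, 6, 9, 12, 18, 36]
# y [1, 5, 41, 205]
# x [1, 5, 11, 25, 55, 275]
```

Dimensions with few divisors (like `y` here) leave only a handful of options, which is why the achievable chunk sizes and ratios can be coarse.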

The simplest case is rechunking a dataset along a single dimension. In that case `target_aspect_ratio` contains only that dimension, and its value does not matter:

```python
# Aim for 1MB chunks along time only with a tolerance of 0.5
# (so we will accept chunks from 0.5-1.5 MB)
even_divisor_algo(ds, '1MB', {'time':1}, 0.5)
```

gives

```
{'time': 2, 'y': 205, 'x': 275}
```

Let's test that really quick:

```python
ds.chunk({'time': 2, 'y': 205, 'x': 275})
```

<img width="646" alt="image" src="https://github.com/jbusecke/dynamic_chunks/assets/14314623/75c88672-965c-4748-af5b-84517e04776c">

Nice, this gave us ~800KB chunks.

### Chunk along multiple dimensions

Let's chunk along all dimensions of the dataset and aim to have the same number of chunks along each dimension:

```python
chunks = even_divisor_algo(ds, '100KiB', {'x':1, 'y':1, 'time':1}, 0.3)
ds.chunk(chunks)
```

<img width="648" alt="image" src="https://github.com/jbusecke/dynamic_chunks/assets/14314623/eb866582-f41b-4984-b88f-1091b007ee26">

OK, nice: we have chunks of the desired chunksize (~100KB).

> Note that I chose a very small chunksize here for demonstration purposes. In practice you should adjust the chunksize to your specific use case (more reading [here](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)).

Finally, let's check the total number of chunks along each dimension:

```python
{k:len(v) for k,v in ds.chunk(chunks).chunks.items()}
```

```
{'time': 6, 'y': 5, 'x': 5}
```

Not bad! Please note that due to the requirement of even divisors, the resulting ratio of chunks can differ significantly from the target you provided. You should experiment with different values for size and tolerance to arrive at a chunking scheme that works for you.
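To show how size tolerance, even divisors, and the target ratio interact, here is a toy brute-force search in plain Python. It is a sketch under stated assumptions, not the implementation used by `even_divisor_algo`: the function `toy_divisor_search` and its scoring rule are hypothetical, values are assumed to be 8 bytes each (float64), and the dimension lengths are the ones implied by the outputs above.

```python
import math
from itertools import product


def divisors(n):
    """All positive integers that divide n without remainder."""
    return [d for d in range(1, n + 1) if n % d == 0]


def toy_divisor_search(dims, itemsize, target_bytes, target_ratio, tolerance):
    """Brute-force search over divisor combinations (illustration only)."""
    names = list(dims)
    ratio_total = sum(target_ratio[name] for name in names)
    best, best_score = None, math.inf
    for combo in product(*(divisors(dims[name]) for name in names)):
        nbytes = itemsize * math.prod(combo)
        # Discard chunkings whose size is outside target * (1 +/- tolerance)
        if abs(nbytes - target_bytes) > tolerance * target_bytes:
            continue
        n_chunks = [dims[name] // size for name, size in zip(names, combo)]
        total = sum(n_chunks)
        # Hypothetical score: distance between the normalized chunk counts
        # and the normalized target ratio (smaller is better)
        score = sum(
            abs(n / total - target_ratio[name] / ratio_total)
            for name, n in zip(names, n_chunks)
        )
        if score < best_score:
            best, best_score = dict(zip(names, combo)), score
    return best


dims = {"time": 36, "y": 205, "x": 275}  # lengths implied by the outputs above
chunks = toy_divisor_search(
    dims,
    itemsize=8,               # assumption: 8-byte (float64) values
    target_bytes=100 * 1024,  # 100 KiB
    target_ratio={"x": 1, "y": 1, "time": 1},
    tolerance=0.3,
)
n_chunks = {name: dims[name] // size for name, size in chunks.items()}
print(chunks)    # {'time': 6, 'y': 41, 'x': 55}
print(n_chunks)  # {'time': 6, 'y': 5, 'x': 5}
```

Under these assumptions the toy search lands on the same chunk counts as shown above. That agreement is not guaranteed in general, since the library's selection logic may differ from this scoring rule.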

### Keeping one dimension unchunked

In many cases you simply do not want to chunk along a given dimension at all. You can use the sentinel value `-1` in `target_aspect_ratio` to keep specific dimensions unchunked:

```python
chunks = even_divisor_algo(ds, '100KiB', {'x':1, 'y':1, 'time':-1}, 0.3)
chunks
```

gives

```
{'time': 36, 'y': 41, 'x': 11}
```

which means the time dimension consists of only a single chunk.
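A quick way to confirm this without loading any data is to divide each dimension length by its chunk size. The sketch below is plain Python; the dimension lengths are the ones implied by the outputs above, not read from the dataset.

```python
# Dimension lengths of the example dataset, as implied by the outputs above
dims = {"time": 36, "y": 205, "x": 275}
# Chunk sizes returned for the ratio {'x': 1, 'y': 1, 'time': -1}
chunks = {"time": 36, "y": 41, "x": 11}

# Chunk sizes are exact divisors here, so integer division gives the chunk count
n_chunks = {name: dims[name] // chunks[name] for name in dims}
unchunked = [name for name, n in n_chunks.items() if n == 1]

print(n_chunks)   # {'time': 1, 'y': 5, 'x': 25}
print(unchunked)  # ['time']
```

Note that the counts along `y` and `x` (5 and 25) are far from the requested 1:1 ratio, which is another instance of the even-divisor restriction limiting the achievable ratios.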

## Developer Guide

Set up your development environment with conda:

```
conda create --name dynamic_rechunking python=3.10 pip
conda activate dynamic_rechunking
pip install -e ".[test]"
```
