xarray-einstats.
Project description
xarray-einstats
Stats, linear algebra and einops for xarray
⚠️ Caution: This project is still in an early development stage
Installation
To install, run
(.venv) $ pip install xarray-einstats
Overview
As stated in their website:
xarray makes working with multi-dimensional labeled arrays simple, efficient and fun!
The code is often more verbose, but it is generally because it is clearer and thus less error prone and more intuitive. Here are some examples of such trade-off where we believe the increased clarity is worth the extra characters:
numpy | xarray |
---|---|
a[2, 5] |
da.sel(drug="paracetamol", subject=5) |
a.mean(axis=(0, 1)) |
da.mean(dim=("chain", "draw")) |
a.reshape((-1, 10)) |
da.stack(sample=("chain", "draw")) |
a.transpose(2, 0, 1) |
da.transpose("drug", "chain", "draw") |
In some other cases however, using xarray can result in overly verbose code
that often also becomes less clear. xarray_einstats
provides wrappers
around some numpy and scipy functions (mostly numpy.linalg
and scipy.stats
)
and around einops with an api and features adapted to xarray.
% ⚠️ Attention: A nicer rendering of the content below is available at our documentation
Data for examples
The examples in this overview page use the DataArray
s from the Dataset
below
(stored as ds
variable) to illustrate xarray_einstats
features:
<xarray.Dataset>
Dimensions: (dim_plot: 50, chain: 4, draw: 500, team: 6)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
Data variables:
x_plot (dim_plot) float64 0.0 0.2041 0.4082 0.6122 ... 9.592 9.796 10.0
atts (chain, draw, team) float64 0.1063 -0.01913 ... -0.2911 0.2029
sd_att (draw) float64 0.272 0.2685 0.2593 0.2612 ... 0.4112 0.2117 0.3401
Stats
{mod}xarray_einstats.stats
provides two wrapper classes {class}xarray_einstats.stats.XrContinuousRV
and {class}xarray_einstats.stats.XrDiscreteRV
that can be used to wrap any distribution
in {mod}scipy.stats
so they accept {class}~xarray.DataArray
as inputs,
and some wrappers for other functions in the scipy.stats
module
so you can use dims
(supporting both string and iterable of strings)
instead of axis
and keep the labels from the input DataArrays.
The distribution wrappers perform broadcasting and alignment of all the inputs automatically. You can evaluate the logpdf using inputs that wouldn't align if using numpy in a couple lines:
norm_dist = xarray_einstats.stats.XrContinuousRV(scipy.stats.norm)
# shapes: (50,) (4, 500, 6) (500,)
norm_dist.logpdf(ds["x_plot"], ds["atts"], ds["sd_att"])
which returns:
<xarray.DataArray (dim_plot: 50, chain: 4, draw: 500, team: 6)>
array([[[[ 3.06470249e-01, 3.80373065e-01, 2.56575936e-01,
...
-4.41658154e+02, -4.57599982e+02, -4.14709280e+02]]]])
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
More examples available at {ref}stats_tutorial
.
Linear Algebra
There is no one size fits all solution, but knowing the function
we are wrapping we can easily make the code more concise and clear.
Without xarray_einstats
, to invert a batch of matrices stored in a 4d
array you have to do:
inv = xarray.apply_ufunc( # output is a 4d labeled array
numpy.linalg.inv,
batch_of_matrices, # input is a 4d labeled array
input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
output_core_dims=[["matrix_dim", "matrix_dim_bis"]]
)
to calculate it's norm instead, it becomes:
norm = xarray.apply_ufunc( # output is a 2d labeled array
numpy.linalg.norm,
batch_of_matrices, # input is a 4d labeled array
input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
)
With {mod}xarray_einstats.linalg
, those operations become:
inv = xarray_einstats.inv(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
norm = xarray_einstats.norm(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
Moreover, if you use some internal conventions to label the dimensions
that correspond to matrices, so that they can always be identified
if given the list of all dimensions in the input, you can configure
xarray_einstats
to follow that convention.
Take a look at {func}~xarray_einstats.linalg.get_default_dims
And if you still need more reasons for xarray_einstats
, to complement
the einops
wrappers, it also provides {func}xarray_einstats.einsum
!
More examples available, also using einsum
at {ref}linalg_tutorial
.
einops
repeat wrapper still missing
einops uses a convenient notation inspired in
Einstein notation to specify operations on multidimensional arrays.
It uses spaces as a delimiter between dimensions, parenthesis to
indicate splitting or stacking of dimensions and ->
to separate
between input and output dim specification.
{mod}xarray_einstats.einops
uses an adapted notation to take advantage of xarray,
where dimensions are already labeled,
and adapts to dimension names with spaces or parenthesis in them.
It then translates the expression and calls einops via {func}xarray.apply_ufunc
so you need to have einops installed for the functions in this
module to work.
xarray_einstats
uses two separate arguments, one for the input pattern (optional) and
another for the output pattern. Each is a list of dimensions (strings)
or dimension (lists or dictionaries).
:::{tip}
If you are willing to impose some extra constraints to your dimension names,
you can also use the raw_
einops wrappers, with a syntax more concise and
much closer to the einops library.
:::
Combine the chain and draw dimensions
::::{tab-set} :::{tab-item} rearrange :sync: full
We can combine the chain and draw dimensions and name the resulting dimension sample
using a list with a single dictionary.
rearrange(ds.atts, [{"sample": ("chain", "draw")}])
::: :::{tab-item} raw_rearrange :sync: raw
As you would do in einops, we indicate we want to combine the chain and draw dimensions
by putting the two inside a parenthesis. With xarray_einstats
in addition,
you can add an =new_name
to label this combined dimension, otherwise it gets
a default name.
Moreover, as dimensions are already labeled in the input, we can skip the
left side of the expression. If no ->
symbol is present in the pattern,
xarray_einstats
generates the left side automatically.
raw_rearrange(ds.atts, "(chain draw)=sample")
::: ::::
The team
dimension is not present in the pattern and is not modified.
As here dimensions are named already in the input object, we don't need
ellipsis nor adding dimensions in both input and output to indicate they
are left as is. You can see how the team dimension has not been modified
in the output below:
<xarray.DataArray 'atts' (team: 6, sample: 2000)>
array([[ 0.10632395, 0.1538294 , 0.17806237, ..., 0.16744257,
0.14927569, 0.21803568],
...,
[ 0.30447644, 0.22650416, 0.25523419, ..., 0.28405435,
0.29232681, 0.20286656]])
Coordinates:
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
Note that following xarray convention, new dimensions and dimensions on which we operated
are moved to the end. This only matters when you access the underlying array with .values
or .data
and you can always transpose using {meth}xarray.Dataset.transpose
, but
it can matter. You can change the pattern to enforce the output dimension order:
::::{tab-set} :::{tab-item} rearrange :sync: full
rearrange(ds.atts, [{"sample": ("chain", "draw")}, "team"])
::: :::{tab-item} raw_rearrange :sync: raw
raw_rearrange(ds.atts, "(chain draw)=sample team")
::: ::::
Out:
<xarray.DataArray 'atts' (sample: 2000, team: 6)>
array([[ 0.10632395, -0.01912607, 0.13671159, -0.06754783, -0.46083807,
0.30447644],
...,
[ 0.21803568, -0.11394285, 0.09447937, -0.11032643, -0.29111234,
0.20286656]])
Coordinates:
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
Decompose and combine two dimensions in a different order
Now to a more complicated pattern. We will split the chain and team dimension, then combine those split dimensions between them.
::::{tab-set} :::{tab-item} rearrange :sync: full
Use a list of dictionaries to choose which dimensions to decompose, note that lists with dimensions to decompose are not valid, you need to indicate which dimension is the one to be decomposed.
rearrange(
ds.atts,
in_dims=[{"chain": ("chain1", "chain2")}, {"team": ("team1", "team2")}],
# combine split chain and team dims between them
# here we don't use a dict so the new dimensions get a default name
out_dims=[("chain1", "team1"), ("team2", "chain2")],
# set the lengths of split dimensions as kwargs
chain1=2, chain2=2, team1=2, team2=3
)
::: :::{tab-item} raw_rearrange :sync: raw
We use ()=
on the left side because we need to indicate which dimensions
to decompose, but we can skip it if we want on the right side and xarray_einstats
uses a default name for them.
raw_rearrange(
ds.atts,
"(chain1 chain2)=chain (team1 team2)=team -> (chain1 team1) (team2 chain2)",
# set the lengths of split dimensions as kwargs
chain1=2, chain2=2, team1=2, team2=3
)
::: ::::
Out:
<xarray.DataArray 'atts' (draw: 500, chain1,team1: 4, team2,chain2: 6)>
array([[[ 1.06323952e-01, 2.47005252e-01, -1.91260714e-02,
-2.55769582e-02, 1.36711590e-01, 1.23165119e-01],
...
[-2.76616968e-02, -1.10326428e-01, -3.99582340e-01,
-2.91112341e-01, 1.90714405e-01, 2.02866563e-01]]])
Coordinates:
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
Dimensions without coordinates: chain1,team1, team2,chain2
More einops examples with both rearrange
and reduce
at {ref}einops_tutorial
Other features
xarray_einstats
also includes some functions that are not direct wrappers of other
libraries. {func}~xarray_einstats.numba.histogram
for example combines numba,
numpy and xarray to provide a vectorized version of numpy.histogram
that works
on DataArrays.
Similar projects
Here we list some similar projects we know of. Note that all of them are complementary and don't overlap:
Cite xarray-einstats
If you use this software, please cite it using the following template and the version specific DOI provided by Zenodo. Click on the badge to go to the Zenodo page and select the DOI corresponding to the version you used
- Oriol Abril-Pla. (2022). arviz-devs/xarray-einstats
<version>
. Zenodo.<version_doi>
or in bibtex format:
@software{xarray_einstats2022,
author = {Abril-Pla, Oriol},
title = {{xarray-einstats}},
year = 2022,
url = {https://github.com/arviz-devs/xarray-einstats}
publisher = {Zenodo},
version = {<version>},
doi = {<version_doi>},
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file xarray-einstats-0.2.0.tar.gz
.
File metadata
- Download URL: xarray-einstats-0.2.0.tar.gz
- Upload date:
- Size: 29.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5c4e00d09172daf665cbd5e533a897896b84f37feb5cab0fd75829477fbcda7d |
|
MD5 | 36dd7a9228bacedbbb1710ede33cc2f4 |
|
BLAKE2b-256 | 4c674992b4eb7f6c484d4f33a5b1a6387d347ff4ea230ccb22a9995624f21e00 |
File details
Details for the file xarray_einstats-0.2.0-py2.py3-none-any.whl
.
File metadata
- Download URL: xarray_einstats-0.2.0-py2.py3-none-any.whl
- Upload date:
- Size: 29.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b3fa8383b9729c48be40e00596ca6f641f74c9d7ca5a087197d1f9fbf21dc457 |
|
MD5 | 57971f9a7185cc9de58c4762523e261c |
|
BLAKE2b-256 | 5c95c23eaa0ea0b9cbb4e4df54df992e119fa417b5a4c4bf61a81f8073cc72a1 |