Skip to main content

High Level Expressions for Dask

Project description

Dask Expressions

Dask DataFrames with query optimization.

This is a proof-of-concept rewrite of Dask DataFrame that includes query optimization and generally improved organization.

More in our blog posts:

Example

import dask_expr as dx

df = dx.datasets.timeseries()
df.head()

df.groupby("name").x.mean().compute()

Query Representation

Dask-expr encodes user code in an expression tree:

>>> df.x.mean().pprint()

Mean:
  Projection: columns='x'
    Timeseries: seed=1896674884

This expression tree will be optimized and modified before execution:

>>> df.x.mean().optimize().pprint()

Div:
  Sum:
    Fused(375f9):
    | Projection: columns='x'
    |   Timeseries: dtypes={'x': <class 'float'>} seed=1896674884
  Count:
    Fused(375f9):
    | Projection: columns='x'
    |   Timeseries: dtypes={'x': <class 'float'>} seed=1896674884

Stability

This project is a work in progress and will be changed without notice or deprecation warning. Please provide feedback, but it's best to avoid use in production settings.

API Coverage

dask_expr.DataFrame

  • abs
  • add
  • add_prefix
  • add_sufix
  • align
  • all
  • any
  • apply
  • assign
  • astype
  • bfill
  • clip
  • combine_first
  • copy
  • count
  • cummax
  • cummin
  • cumprod
  • cumsum
  • dask
  • div
  • divide
  • drop
  • drop_duplicates
  • dropna
  • dtypes
  • eval
  • explode
  • ffill
  • fillna
  • floordiv
  • groupby
  • head
  • idxmax
  • idxmin
  • ìloc
  • index
  • isin
  • isna
  • join
  • map
  • map_overlap
  • map_partitions
  • mask
  • max
  • mean
  • memory_usage
  • memory_usage_per_partition
  • merge
  • min
  • min
  • mod
  • mode
  • mul
  • nlargest
  • nsmallest
  • nunique_approx
  • partitions
  • pivot_table
  • pow
  • prod
  • query
  • radd
  • rdiv
  • rename
  • rename_axis
  • repartition
  • replace
  • reset_index
  • rfloordiv
  • rmod
  • rmul
  • round
  • rpow
  • rsub
  • rtruediv
  • sample
  • select_dtypes
  • set_index
  • shift
  • shuffle
  • sort_values
  • std
  • sub
  • sum
  • tail
  • to_parquet
  • to_timestamp
  • truediv
  • var
  • visualize
  • where

dask_expr.Series

  • abs
  • add
  • align
  • all
  • any
  • apply
  • astype
  • between
  • bfill
  • clip
  • combine_first
  • copy
  • count
  • cummax
  • cummin
  • cumprod
  • cumsum
  • dask
  • div
  • divide
  • drop_duplicates
  • dropna
  • dtype
  • explode
  • ffill
  • fillna
  • floordiv
  • groupby
  • head
  • idxmax
  • idxmin
  • index
  • isin
  • isna
  • map
  • map_partitions
  • mask
  • max
  • mean
  • memory_usage
  • memory_usage_per_partition
  • min
  • min
  • mod
  • mode
  • mul
  • nlargest
  • nsmallest
  • nunique_approx
  • partitions
  • pow
  • prod
  • product
  • radd
  • rdiv
  • rename
  • rename_axis
  • repartition
  • replace
  • reset_index
  • rfloordiv
  • rmod
  • rmul
  • round
  • rpow
  • rsub
  • rtruediv
  • shift
  • shuffle
  • std
  • sub
  • sum
  • tail
  • to_frame
  • to_timestamp
  • truediv
  • unique
  • value_counts
  • var
  • visualize
  • where

dask_expr.Index

  • abs
  • align
  • all
  • any
  • apply
  • astype
  • clip
  • combine_first
  • copy
  • count
  • dask
  • dtype
  • fillna
  • groupby
  • head
  • idxmax
  • idxmin
  • index
  • isin
  • isna
  • map_partitions
  • max
  • memory_usage
  • min
  • min
  • mode
  • nunique_approx
  • partitions
  • prod
  • rename
  • rename_axis
  • repartition
  • replace
  • reset_index
  • round
  • shuffle
  • std
  • sum
  • tail
  • to_frame
  • to_timestamp
  • var
  • visualize

dask_expr._groupby.GroupBy

  • agg
  • aggregate
  • apply
  • `bfill
  • count
  • ffill
  • first
  • last
  • max
  • mean
  • median
  • min
  • nunique
  • prod
  • shift
  • size
  • std
  • sum
  • transform
  • value_counts
  • var

Support for SeriesGroupBy and DataFrameGroupBy.

dask_expr._resample.Resampler

  • agg
  • count
  • first
  • last
  • max
  • mean
  • median
  • min
  • nunique
  • ohlc
  • prod
  • quantile
  • sem
  • size
  • std
  • sum
  • var

dask_expr._rolling.Rolling

  • agg
  • apply
  • count
  • max
  • mean
  • median
  • min
  • quantile
  • std
  • sum
  • var
  • skew
  • kurt

Binary operators (DataFrame, Series, and Index):

  • __add__
  • __radd__
  • __sub__
  • __rsub__
  • __mul__
  • __pow__
  • __rmul__
  • __truediv__
  • __rtruediv__
  • __lt__
  • __rlt__
  • __gt__
  • __rgt__
  • __le__
  • __rle__
  • __ge__
  • __rge__
  • __eq__
  • __ne__
  • __and__
  • __rand__
  • __or__
  • __ror__
  • __xor__
  • __rxor__

Unary operators (DataFrame, Series, and Index):

  • __invert__
  • __neg__
  • __pos__

Accessors:

  • CategoricalAccessor
  • DatetimeAccessor
  • StringAccessor

Function

  • concat
  • from_pandas
  • merge
  • pivot_table
  • read_csv
  • read_parquet
  • repartition
  • to_datetime
  • to_numeric
  • to_timedelta
  • to_parquet

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask-expr-0.3.0.tar.gz (116.8 kB view details)

Uploaded Source

Built Distribution

dask_expr-0.3.0-py3-none-any.whl (108.8 kB view details)

Uploaded Python 3

File details

Details for the file dask-expr-0.3.0.tar.gz.

File metadata

  • Download URL: dask-expr-0.3.0.tar.gz
  • Upload date:
  • Size: 116.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for dask-expr-0.3.0.tar.gz
Algorithm Hash digest
SHA256 fe5d62b076d92d6741ba3532fa9199ef1f815946573423171f36c0f02ea9e8cf
MD5 6521f6a130390de12520f843337120c9
BLAKE2b-256 297d1eccf099448a722d054d3714c13e72a9fe846801a4cf617446221739db5c

See more details on using hashes here.

Provenance

File details

Details for the file dask_expr-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: dask_expr-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 108.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for dask_expr-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ad8ff3260b623246cd44bc12c05f025fbe3bf68543bf442cc1da95785293dd06
MD5 066e60d80b44b6169de01eb03dc79fdc
BLAKE2b-256 73fe92dd2d5dfb99e235003709e87c6cbffc6c4ea954a8b2486242d9bb918356

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page