SQL query layer for Dask

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and to scale up the computation easily when you need to.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need them to - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer).
  • Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook and your normal Python modules, or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

Example

For this example, we load some data from disk and query it with a SQL command from our Python code. Any pandas or Dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This gives the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

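# Now execute a SQL query. With return_futures=False the query is computed
# immediately and the data itself is returned; by default, c.sql returns
# a lazy Dask dataframe instead.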
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look at the documentation or start the example notebook on Binder.

dask-sql is currently under development and does not yet understand all SQL commands (but a large fraction of them). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use an existing one:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want the newest (unreleased) dask-sql version or plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

GPU-specific tests can then be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a starting point and is still missing important features, such as authentication.

You can test the presto-compatible SQL server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can run simple SQL queries (simple ones only, as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.
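
Because the server speaks the presto wire protocol, it can in principle also be queried over plain HTTP. The following is a rough sketch using only the Python standard library. It assumes the usual presto REST conventions (a POST to /v1/statement, followed by polling the nextUri field of each response) and a server on localhost:8080. The helper names and the X-Presto-User value are hypothetical, and the exact response fields may differ between versions:

```python
import json
import urllib.request


def build_statement_request(sql, server="http://localhost:8080"):
    """Build the POST request that submits a SQL statement (assumed endpoint)."""
    return urllib.request.Request(
        url=f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        # presto servers usually expect a user header; the value is arbitrary here
        headers={"X-Presto-User": "dask-sql-example"},
        method="POST",
    )


def rows_from_response(payload):
    """Turn the "columns" and "data" fields of one response into a list of dicts."""
    names = [column["name"] for column in payload.get("columns") or []]
    return [dict(zip(names, row)) for row in payload.get("data") or []]


def run_query(sql, server="http://localhost:8080"):
    """Submit a query and follow nextUri links until the server stops sending them."""
    request = build_statement_request(sql, server)
    rows = []
    while request is not None:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        if payload.get("error"):
            raise RuntimeError(payload["error"])
        rows.extend(rows_from_response(payload))
        next_uri = payload.get("nextUri")
        request = urllib.request.Request(next_uri) if next_uri else None
    return rows


# With a running server (see above), this could be used as:
# print(run_query("SELECT 1 + 1"))
```

For anything beyond quick experiments, an existing presto client library is likely the better choice, since it already handles retries, errors and type conversion.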

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query into relational algebra using DataFusion; the result is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into Dask API calls (and execute them) - returning a Dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the Dask dataframes; therefore, some Rust code to store this information for Dask dataframes is defined in dask_planner. After the translation to relational algebra is done (using DaskSQLContext.logical_relational_algebra), the Python methods defined in dask_sql.physical turn this into a physical Dask execution plan by converting each piece of the relational algebra one by one.
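
To illustrate the second step, here is a deliberately simplified, standard-library-only sketch of converting a logical plan node by node. It is not dask-sql's actual code: the node classes (TableScan, Filter, Projection) and the convert function are hypothetical, and plain lists of dicts stand in for Dask dataframes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


# Hypothetical logical plan nodes; the real plan comes from DataFusion
@dataclass
class TableScan:
    table: str


@dataclass
class Filter:
    input: object
    predicate: Callable[[Row], bool]


@dataclass
class Projection:
    input: object
    columns: List[str]


def convert(node, tables):
    """Convert one plan node at a time, recursing into its input first."""
    if isinstance(node, TableScan):
        return list(tables[node.table])
    if isinstance(node, Filter):
        rows = convert(node.input, tables)
        return [row for row in rows if node.predicate(row)]
    if isinstance(node, Projection):
        rows = convert(node.input, tables)
        return [{column: row[column] for column in node.columns} for row in rows]
    raise NotImplementedError(type(node).__name__)


# Registered tables, standing in for the dataframes held by a context
tables = {
    "my_data": [
        {"name": "a", "x": 1},
        {"name": "b", "x": 2},
        {"name": "c", "x": 3},
    ]
}

# Logical plan for: SELECT name FROM my_data WHERE x > 1
plan = Projection(
    input=Filter(input=TableScan("my_data"), predicate=lambda row: row["x"] > 1),
    columns=["name"],
)

result = convert(plan, tables)
print(result)  # [{'name': 'b'}, {'name': 'c'}]
```

In this sketch every node type has exactly one conversion rule, so supporting a new SQL operation would only mean adding one more rule.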
