Skip to main content

SQL query layer for Dask

Project description

Conda PyPI GitHub Workflow Status Read the Docs Codecov GitHub Binder

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code and also scale up the calculation easily if you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can be easily connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

dask-sql GIF

Example

For this example, we use some data loaded from disk and query them with a SQL command from our python code. Any pandas or dask dataframe can be used as input and dask-sql understands a large amount of formats (csv, parquet, json,...) and locations (s3, hdfs, gcs,...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name, that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look into the documentation or start the example notebook on binder.

dask-sql is currently under development and does so far not understand all SQL commands (but a large fraction). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use your already present environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from sources.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a start of the development and missing important concepts, such as authentication.

You can test the sql presto server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the dask dataframes, therefore some Rust code to store this information for dask dataframes are defined in dask_planner. After the translation to a relational algebra is done (using DaskSQLContext.logical_relational_algebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-2022.10.1rc1.tar.gz (165.2 kB view details)

Uploaded Source

Built Distributions

dask_sql-2022.10.1rc1-cp310-cp310-win_amd64.whl (4.5 MB view details)

Uploaded CPython 3.10 Windows x86-64

dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1rc1-cp310-cp310-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

dask_sql-2022.10.1rc1-cp310-cp310-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

dask_sql-2022.10.1rc1-cp39-cp39-win_amd64.whl (4.5 MB view details)

Uploaded CPython 3.9 Windows x86-64

dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1rc1-cp39-cp39-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

dask_sql-2022.10.1rc1-cp39-cp39-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

dask_sql-2022.10.1rc1-cp38-cp38-win_amd64.whl (4.5 MB view details)

Uploaded CPython 3.8 Windows x86-64

dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1rc1-cp38-cp38-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

dask_sql-2022.10.1rc1-cp38-cp38-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file dask_sql-2022.10.1rc1.tar.gz.

File metadata

  • Download URL: dask_sql-2022.10.1rc1.tar.gz
  • Upload date:
  • Size: 165.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for dask_sql-2022.10.1rc1.tar.gz
Algorithm Hash digest
SHA256 add72aa0b3b5354d4d30f1aeeaf2f8593bc9f8c1da7d6cf0e43a8df4d7e5d79c
MD5 87f3c94c6860484f00bcad3ccb6a683b
BLAKE2b-256 a2500534f7f93a168c228517b7e7c144656fcc516c1c116f7ba41bb6e3a7938f

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 160c6b32c9752baae19fd0b5eeb369a88d55e337835dcaf51643a1b974e67e22
MD5 ad9fcf5150f55a650a5ca5bf9226f717
BLAKE2b-256 669134ebb1094444aacb8812396f43dc55cd035fad9bff4e7ac27f459554743d

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d477f53f6794e5cf6bd2ae7040278119b677c5c9d44b7d198e882e7f69f2f2d8
MD5 b45c8ec32a93e99ed335fc251f3dd1b1
BLAKE2b-256 8a3bf344dc02c92b84de95380a3fa532547287d05cc0d5fc9994b57cf934d74a

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 78cb0d7711f564800e2b82178b8d3676acec88155179a5de3e14934b169b2f26
MD5 e4939d8ec27d722bf2aad20a15194b5f
BLAKE2b-256 7c176b586190bc79ff2a840144146defd02bc8a20bc62e0019b9bf57ec6dab1c

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4be60f1a15ae8ed8cd87dd257a8908bf733e2e0375711618bd059afd25ee13bd
MD5 7ce16a092175c3bbc628487c5cdbc19e
BLAKE2b-256 7b29fb73daf5589096b18aef870ad957d78c6440edcb19377a6b590b1882625b

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f462bd4c5a22ea7d4f8fa8054045daa0568a896c0b0d74faf64645b4f69edab9
MD5 f635bc71e3ef6ebc632d877dca955526
BLAKE2b-256 b175f85196d1ccc8be5193f0016907feb9ff88d4dd357b1b8dc49fab18d779dc

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 83320066af7af7e1aa3b8135f2d409af70e99f9e03807b58767bcdca829ff080
MD5 75ac0588be86e7ceb682440d42171c3c
BLAKE2b-256 6707e2dcc104b4db81822803c52c87d28121ca6ab7cdc1d6198b3e03bc78e34b

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1ff4fac67e70acd82e3de859b37c30662b9cbb377581416e7405df82ade0cb9c
MD5 2eebb07ea56aa644263f0051c933c988
BLAKE2b-256 7efb0505fb715fb583d9485d3b495903b89bcf6e34af78eac597e9eadf2aa983

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d07817706639ae5cf01083e02e06cab4757cf85e1f1a7f9741f0481cc89e7250
MD5 8c7b0d45699bed9fd167510b27dee423
BLAKE2b-256 1787dd936bba453f169bdc9bee51539ff0e9e11c711060df8e7ff4823ea8f8c6

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ea7172528af955155ee17cd7941e81e95b48534b8568219f0ce7a6c1c69d5445
MD5 a23309ce888e74e4d477706578be644f
BLAKE2b-256 30d87c1c8b8d6d2343cfdca88f350747fbd7aaaabf10b14ab9d563664572e502

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 67c8a3c075fb5089a942204dfad0a125877107e6277318304f51a1751aa472b8
MD5 97f15c182b1a9e16177ab40a006273ca
BLAKE2b-256 7e74ecd5c1618e265c0c6f3f6d9d09dfb2767631e1e7822a27bfa07f4eeda92b

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 0d3806f9e9da65d681604f0475d6a2c6c145c2d084aba4b7721c702479af6a58
MD5 e8c33bf328aa12b473103d0f443851a9
BLAKE2b-256 30a60250afd668666d300509892423babfdbca6db3d91fe78beeaad04153f443

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2c33402f0a5f873de31854d2b8a130c196eb70eddad1db49ae027529fdfa505d
MD5 0a207e04277fe03a6ccf4710b87b402c
BLAKE2b-256 3b36b63d4b764c186fc76711feae0a7d44ab9f39282f4f69b0d0e575f2f1198a

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2f412a705c7959c47eaea6eaf3ad8495e603dfdd5c4188d9c7b15975ba13f8f4
MD5 a70f2ff647529033024522207176a436
BLAKE2b-256 f0a0b15560b7728367b2040b592e0cd5ec2f1e76f8692753b99103ae239c3bb2

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1851d7922182c3ef467b8c29fe661b23438bd89e7359abaabdc3ca2cf24b3b0e
MD5 e4e5c20a9dffacc69ee0e3b4d5471d96
BLAKE2b-256 29e3a9a56dcf325b2a3cf0c2ea2295b2b0fcaa73c0242a9857c4c3ecc7bdd8b7

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1rc1-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1rc1-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 adb99cdb6e156aef515f9ad1066b88b8b9dd007981e96fbc4ea3378ac9719c23
MD5 d4e30dbd4274a68d0d0e539699080e54
BLAKE2b-256 5022cb6a88e53ba375a0f52eb7b3793ace2626169e8bdb3d5e6e8d17e0f608bb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page