Skip to main content

SQL query layer for Dask

Project description

Conda PyPI GitHub Workflow Status Read the Docs Codecov GitHub Binder

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code and also scale up the calculation easily if you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can be easily connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

dask-sql GIF

Example

For this example, we use some data loaded from disk and query them with a SQL command from our python code. Any pandas or dask dataframe can be used as input and dask-sql understands a large amount of formats (csv, parquet, json,...) and locations (s3, hdfs, gcs,...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name, that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look into the documentation or start the example notebook on binder.

dask-sql is currently under development and does so far not understand all SQL commands (but a large fraction). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use your already present environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from sources.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a start of the development and missing important concepts, such as authentication.

You can test the sql presto server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the dask dataframes, therefore some Rust code to store this information for dask dataframes are defined in dask_planner. After the translation to a relational algebra is done (using DaskSQLContext.logical_relational_algebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-2022.12.0.tar.gz (178.5 kB view details)

Uploaded Source

Built Distributions

dask_sql-2022.12.0-cp310-cp310-win_amd64.whl (5.2 MB view details)

Uploaded CPython 3.10 Windows x86-64

dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.7 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

dask_sql-2022.12.0-cp310-cp310-macosx_11_0_arm64.whl (5.2 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

dask_sql-2022.12.0-cp310-cp310-macosx_10_9_x86_64.whl (5.5 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

dask_sql-2022.12.0-cp39-cp39-win_amd64.whl (5.2 MB view details)

Uploaded CPython 3.9 Windows x86-64

dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.7 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

dask_sql-2022.12.0-cp39-cp39-macosx_11_0_arm64.whl (5.2 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

dask_sql-2022.12.0-cp39-cp39-macosx_10_9_x86_64.whl (5.5 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

dask_sql-2022.12.0-cp38-cp38-win_amd64.whl (5.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.0 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.7 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

dask_sql-2022.12.0-cp38-cp38-macosx_11_0_arm64.whl (5.2 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

dask_sql-2022.12.0-cp38-cp38-macosx_10_9_x86_64.whl (5.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file dask_sql-2022.12.0.tar.gz.

File metadata

  • Download URL: dask_sql-2022.12.0.tar.gz
  • Upload date:
  • Size: 178.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.15

File hashes

Hashes for dask_sql-2022.12.0.tar.gz
Algorithm Hash digest
SHA256 460cf0c0cad6c7fcbac30341e8607ed6ae032d262b30fe7d686daede98737774
MD5 58789173565039d7ada13be619bb6963
BLAKE2b-256 93a34e33a25cfbeee5594dc0c7480dc7318e72a7a6d01edcc03496c93675682e

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 ecefa437e8fede1b5cc656effdc956b204e6558dbd431e9ba9d32dd11f91f0e6
MD5 893b0e1bbed325e63feebee5974c2f77
BLAKE2b-256 4f68334ea07e478465a6e3e464583a3cf64fa8d40d8794b59b4a601651f14810

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a613fcf9440bee54114e6eec32a9fc1f5a9b4b9a636f3c1310f1830abfff0c88
MD5 1c3189cbd86b1d2858470007fdf023bb
BLAKE2b-256 cd6828eaad482c2e02eb46ed39a95c525e118d823a04854d1d06b6c763d946e9

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b2f4c97bc83c391ed611a6f87cfeff0d2b45fc06dd42117eb8ac77ea328c4a76
MD5 41c0a31169965650a77885743f6c2bf8
BLAKE2b-256 64caddf16bd0d529c2586279e148d0409db77dabb815f6f1dbe4065fcd8e1883

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b76438c31561efcbbc51788e7f17dad96d5cec8fd79f0de47e232fe6c471b3f2
MD5 7a479ce39eed1cee0fba8f0c9751ba73
BLAKE2b-256 6236fdaf5f0824cb25b6bddc736e62fc8afecd8f6344ff2a46585e96f92196aa

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0efd8277cf40601754ca9a27f9edef40b44e962efb891e749fdd5fbd291c0bd3
MD5 2f6ed73bd7f3c91030b214c3aa91b5d3
BLAKE2b-256 2cec7f56f601ea82dd61659cc23567335e4b848c7989b0a94a8d3818f577478a

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 d806ff9bffe06e209017df3ffb32c9cc0c01a39272811f9048981ff4d88d8d30
MD5 5b83d40b44e8e0c8b531d608f0fe312c
BLAKE2b-256 cc1ea71927b8c13f13710a76ce5fbad006b7fa3c4f7841fb1469b05f02feaa84

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6ce1bd1c368c1b1729153df2e397cbe2efed9ea9b77b0dc6e34283b5efeac7cd
MD5 355ad639e38a6e2cf81022341774def1
BLAKE2b-256 4eab7187f6eac0796572a48f657ba825589776843f0d10680e960966d796d66f

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 84377e43fe1677ba96ca1b675dbf68e96714a3cec9ce5903a960a407f8474207
MD5 f6b55075e075af4ccac47b713acebb8b
BLAKE2b-256 d9b46f63d2dbea4c26f4e33f98d7b74d83df3e0b336a4cc3463d7c6c2cb58b99

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3516856f347fceb0be7439b3f12867f74e731ccb5ba0ef2f4cb4e3ce44cbc6c6
MD5 81037d9e236c13cf956b0f22ce5d86d8
BLAKE2b-256 172ce484dcb92b0ca7335b233bbf027b8e648f618f9c958207dc46f95532976b

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bb070b60ff2796e74aefcf3cbe0d5f8e00ca0157b20b01fe3e11c7cd8041543b
MD5 ae4f32aba8d494b81abcbccd56b9eb9e
BLAKE2b-256 d3173cfd22e5c75a742d28154a295a56df7883334f42d126fc2751cfa30c3d23

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 24590e5318506f3e3af2622f387a9cbd00f05eac98fce971ae63886689cb286e
MD5 24bf398e5c7e716091361222e484811b
BLAKE2b-256 9e12fafa41ec24454b1b089f1710ab097fd4de641bcd88055c2801da24dd7981

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 593ae9683f0ae462d7bf2c087c6373800dcff55389bb0da7a26598e7c2594a8e
MD5 c5bc8d9feae14e81d5b84ba62111f4c2
BLAKE2b-256 3b902466f290875a73f73efc4e521aedffecfc2d0972b140b6ac35e574ba4321

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6e65d8cd863c10c9b3339cac7b0ac78b6dcf64b28f6f416eca8557df7524a763
MD5 74e91323fe23dc1e955bb0d7b14d68b5
BLAKE2b-256 ab76f7915dbcb7b5b4eea768f52537cf1fa647887f96b156c9a480d1fb7d4de7

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9b9976d0fc62305682509255fb4d9270a40e36134f0fbe5873c94eedd88ae0ea
MD5 5ae1846d7f3797244408164c47288301
BLAKE2b-256 c05266cc76b335dc5f9e7b83c5f1c2d8fd10007a77ce516d629a2cdbb9d4db42

See more details on using hashes here.

File details

Details for the file dask_sql-2022.12.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.12.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 517eeed963de7f3f208e28bf584177bab4752e8b9c213f2d69fd88776b78c135
MD5 a6b81983ad2104765b7cff9c2cadf385
BLAKE2b-256 2d4c392b612baa96a8b781d4edb0831fd651295221d06e2e446fba4a0611a4e2

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page