SQL query layer for Dask

SQL + Python

dask-sql is a distributed SQL query engine in Python. It lets you query and transform your data with a mixture of common SQL operations and Python code, and scale the computation up easily when you need to.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so does dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the many available Python libraries, e.g. for machine learning, complicated input formats, or complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql runs out of the box on your machine and can easily be connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook and your regular Python modules, and can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

Example

In this example, we load some data from disk and query it with a SQL command from our Python code. Any pandas or dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context.
# This gives the table a name that we can use in queries.
df = dd.read_csv("...")
c.create_table("my_data", df)

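# Now execute a SQL query. With return_futures=False the result is computed
# right away and returned as a pandas dataframe; by default, c.sql returns a
# lazy dask dataframe instead.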
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look at the documentation or start the example notebook on Binder.

dask-sql is currently under development and does not yet understand all SQL commands (but a large fraction of them). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use an existing one:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with pip:

pip install dask-sql

For development

If you want the newest (unreleased) dask-sql version, or if you plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode:

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If you change the Rust source in dask_planner/, run another build/install to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, run:

pre-commit install

Testing

You can run the tests (after installation) with:

pytest tests

GPU-specific tests require additional dependencies, which are specified in continuous_integration/gpuci/environment.yaml. You can add them to the development environment by running:

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

The GPU-specific tests can then be run with:

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of rebuilding a full ODBC driver, it reuses the Presto wire protocol. So far this is only a starting point, and it is missing important concepts such as authentication.

You can test the Presto-compatible SQL server by running (after installation)

dask-sql-server

or by using the prebuilt Docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This spins up a server on port 8080 (by default) that looks like a normal Presto database to any Presto client.

You can test this, for example, with the default Presto client:

presto --server localhost:8080

Now you can run SQL queries. They have to be simple ones, as no data is loaded by default:

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.
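
Because the server speaks the Presto wire protocol, it can in principle also be reached with a plain HTTP client. The following is a minimal sketch rather than an official client. It assumes the usual Presto REST conventions: the SQL text is POSTed to /v1/statement, and a nextUri field in each JSON response points to the next page until the result is complete. The helper names build_statement_request and run_query, the user name "demo" and the X-Presto-User header value are illustrative assumptions, not part of dask-sql.

```python
import json
import time
import urllib.request


def build_statement_request(sql, server="http://localhost:8080", user="demo"):
    """Build (but do not send) the initial POST of the Presto client protocol."""
    return urllib.request.Request(
        url=f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},  # assumed; the test server has no auth
        method="POST",
    )


def run_query(sql, server="http://localhost:8080"):
    """Send the query and follow `nextUri` links until no more pages remain."""
    rows = []
    request = build_statement_request(sql, server)
    while request is not None:
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        if "error" in page:
            raise RuntimeError(page["error"])
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        request = urllib.request.Request(next_uri) if next_uri else None
        if request is not None:
            time.sleep(0.1)  # avoid hammering the server while it computes
    return rows
```

With dask-sql-server running locally, run_query("SELECT 1 + 1") should return the rows of the result shown above. Nothing is sent over the network until run_query is called.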

CLI

You can also use the dask-sql CLI to try out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • use DataFusion to translate the SQL query into relational algebra, represented as a logical query plan, as many other SQL engines (Hive, Flink, ...) do
  • convert this description of the query into dask API calls (and execute them), returning a dask dataframe.

For the first step, Arrow DataFusion needs to know the columns and types of the dask dataframes, so dask_planner defines some Rust code that stores this information. After the translation to relational algebra is done (using DaskSQLContext.logical_relational_algebra), the Python methods defined in dask_sql.physical turn it into a physical dask execution plan by converting each piece of the relational algebra one by one.
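
The "one piece at a time" conversion can be illustrated with a toy model. This is only a sketch of the general idea, not dask-sql's implementation. The plan is built by hand instead of by DataFusion, the node classes TableScan, Filter and Aggregate are hypothetical, and plain Python lists of dicts stand in for dask dataframes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class TableScan:
    table: str  # name of a registered table


@dataclass
class Filter:
    source: object  # the plan node that produces the input rows
    predicate: Callable[[dict], bool]


@dataclass
class Aggregate:
    source: object
    group_by: str
    sum_column: str


def execute(node, tables):
    """Convert a logical plan node by node into operations on rows."""
    if isinstance(node, TableScan):
        return list(tables[node.table])
    if isinstance(node, Filter):
        return [row for row in execute(node.source, tables) if node.predicate(row)]
    if isinstance(node, Aggregate):
        totals = defaultdict(int)
        for row in execute(node.source, tables):
            totals[row[node.group_by]] += row[node.sum_column]
        return [{node.group_by: key, "sum": total} for key, total in totals.items()]
    raise TypeError(f"unsupported plan node: {node!r}")


# Hand-written plan for:
#   SELECT name, SUM(x) FROM my_data WHERE x > 0 GROUP BY name
plan = Aggregate(
    source=Filter(source=TableScan("my_data"), predicate=lambda row: row["x"] > 0),
    group_by="name",
    sum_column="x",
)
tables = {
    "my_data": [
        {"name": "a", "x": 1},
        {"name": "b", "x": 2},
        {"name": "a", "x": 3},
        {"name": "b", "x": -5},
    ]
}
print(execute(plan, tables))
# prints [{'name': 'a', 'sum': 4}, {'name': 'b', 'sum': 2}]
```

Each node type is handled by its own branch and recurses into its input. The design is the same in spirit as having one converter per relational-algebra operation.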
