SQL query layer for Dask

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and to scale up the calculation easily when you need to.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need them to - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can be easily connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook and your normal Python module, or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

Example

For this example, we load some data from disk and query it with a SQL command from our Python code. Any pandas or Dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This gives the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

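# Now execute a SQL query. With return_futures=False the query is computed
# immediately and returned as a pandas dataframe (the default, return_futures=True,
# returns a lazy dask dataframe instead).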
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look at the documentation or start the example notebook on Binder.

dask-sql is under active development and does not yet understand all SQL commands (though it covers a large fraction). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use an existing one:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want the newest (unreleased) dask-sql version, or if you plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of rebuilding a full ODBC driver, we reuse the presto wire protocol. So far it is only a starting point and is missing important concepts, such as authentication.

You can test the presto SQL server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks like a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.
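
Because the presto wire protocol runs over plain HTTP, the server can in principle also be queried without a dedicated client. The following is a minimal sketch using only the Python standard library. It assumes the usual presto flow: the statement is POSTed to `/v1/statement`, and result pages are followed via `nextUri`. The helper names, the `demo` user and the default server URL are illustrative and not part of dask-sql.

```python
import json
import urllib.request


def build_statement_request(sql, server="http://localhost:8080", user="demo"):
    """Build the initial POST request that submits a SQL statement."""
    return urllib.request.Request(
        url=f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},  # presto clients identify themselves this way
        method="POST",
    )


def fetch_json(request_or_url):
    """Perform the HTTP call and decode the JSON answer."""
    with urllib.request.urlopen(request_or_url) as response:
        return json.load(response)


def collect_rows(payload, fetch):
    """Gather the rows of all result pages, following `nextUri` links."""
    rows = []
    while True:
        if "error" in payload:
            raise RuntimeError(payload["error"])
        rows.extend(payload.get("data", []))
        next_uri = payload.get("nextUri")
        if next_uri is None:
            return rows
        payload = fetch(next_uri)


def run_query(sql, server="http://localhost:8080"):
    """Submit a query to a running server and return all result rows."""
    first_page = fetch_json(build_statement_request(sql, server))
    return collect_rows(first_page, fetch_json)


# With a server running on localhost:8080, this could be used as:
# print(run_query("SELECT 1 + 1"))
```

Keeping the paging logic in `collect_rows` separate from the HTTP call makes it possible to exercise it with canned responses when no server is available.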

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know the columns and types of the dask dataframes; the Rust code that stores this information is defined in dask_planner. After the translation to relational algebra is done (using DaskSQLContext.logical_relational_algebra), the Python methods defined in dask_sql.physical turn it into a physical dask execution plan by converting each piece of the relational algebra one by one.
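
The idea of converting a logical plan piece by piece can be illustrated with a toy sketch in plain Python. This is not the dask-sql implementation: the node classes and the `execute` function below are hypothetical, and rows are plain dictionaries rather than dask dataframes. It only shows how a tree of relational operators can be walked recursively, with each node translated into one operation on the data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TableScan:
    table: str  # name of a registered table


@dataclass
class Filter:
    input: object  # child plan node
    predicate: Callable[[dict], bool]


@dataclass
class Projection:
    input: object  # child plan node
    columns: list


def execute(node, tables):
    """Convert one plan node at a time into an operation on rows."""
    if isinstance(node, TableScan):
        return list(tables[node.table])
    if isinstance(node, Filter):
        return [row for row in execute(node.input, tables) if node.predicate(row)]
    if isinstance(node, Projection):
        return [
            {column: row[column] for column in node.columns}
            for row in execute(node.input, tables)
        ]
    raise NotImplementedError(type(node).__name__)


# Hand-written plan for: SELECT name FROM my_data WHERE x > 1
plan = Projection(Filter(TableScan("my_data"), lambda row: row["x"] > 1), ["name"])
tables = {"my_data": [{"name": "a", "x": 1}, {"name": "b", "x": 2}]}
print(execute(plan, tables))
```

In dask-sql the plan comes from DataFusion rather than being written by hand, and each node is turned into dask API calls instead of list comprehensions.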

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-2023.4.0.tar.gz (194.0 kB view details)

Uploaded Source

Built Distributions

dask_sql-2023.4.0-cp310-cp310-win_amd64.whl (6.5 MB view details)

Uploaded CPython 3.10 Windows x86-64

dask_sql-2023.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.5 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

dask_sql-2023.4.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (8.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

dask_sql-2023.4.0-cp310-cp310-macosx_11_0_arm64.whl (6.3 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

dask_sql-2023.4.0-cp310-cp310-macosx_10_9_x86_64.whl (7.1 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

dask_sql-2023.4.0-cp39-cp39-win_amd64.whl (6.5 MB view details)

Uploaded CPython 3.9 Windows x86-64

dask_sql-2023.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.5 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

dask_sql-2023.4.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (8.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

dask_sql-2023.4.0-cp39-cp39-macosx_11_0_arm64.whl (6.3 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

dask_sql-2023.4.0-cp39-cp39-macosx_10_9_x86_64.whl (7.1 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

dask_sql-2023.4.0-cp38-cp38-win_amd64.whl (6.5 MB view details)

Uploaded CPython 3.8 Windows x86-64

dask_sql-2023.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.5 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

dask_sql-2023.4.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (8.0 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

dask_sql-2023.4.0-cp38-cp38-macosx_11_0_arm64.whl (6.3 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

dask_sql-2023.4.0-cp38-cp38-macosx_10_9_x86_64.whl (7.1 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

