Skip to main content

SQL query layer for Dask

Project description

Conda PyPI GitHub Workflow Status Read the Docs Codecov GitHub Binder

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code and also scale up the calculation easily if you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer).
  • Use SQL from wherever you like: dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

dask-sql GIF

Example

For this example, we use some data loaded from disk and query them with a SQL command from our python code. Any pandas or dask dataframe can be used as input and dask-sql understands a large amount of formats (csv, parquet, json,...) and locations (s3, hdfs, gcs,...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name, that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look into the documentation or start the example notebook on binder.

dask-sql is currently under development and does so far not understand all SQL commands (but a large fraction). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use your already present environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from sources.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a start of the development and missing important concepts, such as authentication.

You can test the sql presto server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the dask dataframes, therefore some Rust code to store this information for dask dataframes are defined in dask_planner. After the translation to a relational algebra is done (using DaskSQLContext.logical_relational_algebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-2023.8.0.tar.gz (215.8 kB view details)

Uploaded Source

Built Distributions

dask_sql-2023.8.0-cp310-cp310-win_amd64.whl (17.1 MB view details)

Uploaded CPython 3.10 Windows x86-64

dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.4 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.9 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

dask_sql-2023.8.0-cp310-cp310-macosx_11_0_arm64.whl (17.4 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

dask_sql-2023.8.0-cp310-cp310-macosx_10_9_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

dask_sql-2023.8.0-cp39-cp39-win_amd64.whl (17.1 MB view details)

Uploaded CPython 3.9 Windows x86-64

dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.9 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

dask_sql-2023.8.0-cp39-cp39-macosx_11_0_arm64.whl (17.4 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

dask_sql-2023.8.0-cp39-cp39-macosx_10_9_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

dask_sql-2023.8.0-cp38-cp38-win_amd64.whl (17.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.9 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

dask_sql-2023.8.0-cp38-cp38-macosx_11_0_arm64.whl (17.4 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

dask_sql-2023.8.0-cp38-cp38-macosx_10_9_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file dask_sql-2023.8.0.tar.gz.

File metadata

  • Download URL: dask_sql-2023.8.0.tar.gz
  • Upload date:
  • Size: 215.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for dask_sql-2023.8.0.tar.gz
Algorithm Hash digest
SHA256 f559c5f9a1408408ae9f6c877b555422d590e93f67f46f670ac2f453572efdd7
MD5 b4cd11d1adea048baaff11a723b31b31
BLAKE2b-256 17fd4eef566f85bc7efd95f59246dd840931bf942850449f42b41bb3dca24b85

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 37a059f7e45a5a92b4d1bfd2ebf555dfeda367bba2ce2d7d1aa2ed31daaaccd3
MD5 c657f2491bdc46e103aa3dc91110061a
BLAKE2b-256 7e90f703d13255456aa570a67828e9a4bd774b7c6d414a7255fceaa2d1caef69

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 af9d842290da5a54c02c03277e6d7ddddb65cc23cbea8cba8e05d7a568199eab
MD5 4e9ad0abbc53252fef5c461b2875b0f4
BLAKE2b-256 43185c43af7058b1060afb67c141f767f2c1a31ac4f55a47eea6b7bd6d221198

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 53031befe19c92d6c3e88d0d52fba4832629b01979f41ef4124c08903ce3b732
MD5 ad1f321f0659b7b466273465b9a47fa6
BLAKE2b-256 75915c6f444553510f16712dfba95dec513c572995d1b26084ec47d98d685812

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4400771558bdc2359d18e3cbbae402c93bfc9d05d2ee6703ebed9039aca55e71
MD5 4829dab95b85af4711bbb226f9049538
BLAKE2b-256 2bc48c31e880dee718ac4f6e95934f71692fecd661d697095ba71146155fbf8a

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 749502825ba86b6ac4c5c9dcf991a10309809e3cc55b33a3c442676d3abf5792
MD5 b860a78e3d716bfcb9d97223fd7b5efd
BLAKE2b-256 0a783b3d6ace93d82d9a83a652e06756a6d2a6f39ee2eb8ded550f6a0c43d88a

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 e3ff639de1f157f55176536904a26394db9db2422960a32a70723687e7aaeb3b
MD5 4a49e9f82d52b681c71c0fe81ef454ec
BLAKE2b-256 678e4fe2dc48a921e224bcb9d2b66ad2441d051afcfc1a44d7d29196da79dc36

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bcf482476ab10a978007db94fbcc4117d5f0ed0ec0920341c143231fc3c758c5
MD5 468b0725ea7ead2a9fa457b61ebfac6c
BLAKE2b-256 00004c9d42aea288d90dff3d523b1f2a24f35c4097a1b7aeb4db9fff7b442440

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 59b6851da3a24fbc8ecb37481c3c9f7b909892514c250e2c15273baf4e5f4b9a
MD5 2f15d36afff3ed11cbeb53066412afe5
BLAKE2b-256 3c69f257db917cd5da576aaf1a8ecdb0cd129bdfca4ce0935e8030b32ad52a60

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8258083d2b3643705ace1ba2d5a8c1d33e3153d7f21023dca84244d0f39f81f9
MD5 89886ce757a330d8d611f69ce141d18f
BLAKE2b-256 4aff2ab66ba31b53fc45b3dbade73668c6b42b7e0cbd1fb989a94eda8dd587a2

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3d1441e6197d40935ddcb6fe0c57eccb30642387c91eb2a6630da455510791c3
MD5 e5b0603dfa88245e08033dbe372d85a0
BLAKE2b-256 8c5aa746a829b4127da98ea951275e74ed86108458dee84431a0a87549bb747b

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 403909b009bb7f4fbef00bdbb2c01a19922ab20bd039f57ab507a955c3d06b73
MD5 ec7dba976d49e539a094f631bb073df3
BLAKE2b-256 f5dba5f47e445e402c69703b282e631b933e6996acecb8bd81df20139e70db84

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9de64885e2ffaa6668ae2f29765aa1b7b325483034aa57f9a377c424268066bf
MD5 201f21dffad19c2aa2ae57189fc20f70
BLAKE2b-256 f2315d316f1e43525e28bb2daa284e6522aa2cada3044f95ab8a439ebdd8fe29

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7d013a7eb64957d8f7bbf5d656b7966aa8365b3a0b0256664f082328091645f7
MD5 bfacc9391db63060e734d705edc0ad3e
BLAKE2b-256 8ce512c9645bf029218d8534a889691c091ce825c6dc77147f22eacb3ad1dd20

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8815407031bd6f053f3b2d903c05108ef7dafef0c33c4e3f21c788991f618450
MD5 ebd513f3024c780d675ee44cb8c52636
BLAKE2b-256 b7476a9225a03601fcbf5dc5c06408bf271054bbf97c4a29637f13751d7b5538

See more details on using hashes here.

File details

Details for the file dask_sql-2023.8.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2023.8.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 1190b04e5394c61a2c2718a820ddfee9c1af5027f128744e58febbf14a9dde26
MD5 5e6e32d47c8ac8c0045f3e0122f73986
BLAKE2b-256 607b5bd894ab626e10e040c62e151cbb34984c84324cb2d19250ae98653d4290

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page