Skip to main content

SQL query layer for Dask

Project description

Conda PyPI GitHub Workflow Status Read the Docs Codecov GitHub Binder

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code and also scale up the calculation easily if you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can be easily connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.

Read more in the documentation.

dask-sql GIF

Example

For this example, we use some data loaded from disk and query them with a SQL command from our python code. Any pandas or dask dataframe can be used as input and dask-sql understands a large amount of formats (csv, parquet, json,...) and locations (s3, hdfs, gcs,...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name, that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look into the documentation or start the example notebook on binder.

dask-sql is currently under development and does so far not understand all SQL commands (but a large fraction). We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use your already present environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from sources.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a start of the development and missing important concepts, such as authentication.

You can test the sql presto server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the dask dataframes, therefore some Rust code to store this information for dask dataframes are defined in dask_planner. After the translation to a relational algebra is done (using DaskSQLContext.logical_relational_algebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-2022.10.1.tar.gz (168.7 kB view details)

Uploaded Source

Built Distributions

dask_sql-2022.10.1-cp310-cp310-win_amd64.whl (4.6 MB view details)

Uploaded CPython 3.10 Windows x86-64

dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1-cp310-cp310-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

dask_sql-2022.10.1-cp310-cp310-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

dask_sql-2022.10.1-cp39-cp39-win_amd64.whl (4.6 MB view details)

Uploaded CPython 3.9 Windows x86-64

dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1-cp39-cp39-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

dask_sql-2022.10.1-cp39-cp39-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

dask_sql-2022.10.1-cp38-cp38-win_amd64.whl (4.6 MB view details)

Uploaded CPython 3.8 Windows x86-64

dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

dask_sql-2022.10.1-cp38-cp38-macosx_11_0_arm64.whl (4.4 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

dask_sql-2022.10.1-cp38-cp38-macosx_10_9_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file dask_sql-2022.10.1.tar.gz.

File metadata

  • Download URL: dask_sql-2022.10.1.tar.gz
  • Upload date:
  • Size: 168.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for dask_sql-2022.10.1.tar.gz
Algorithm Hash digest
SHA256 cce36ce9d2cfb3628f5d5ecb0f418a6e9533a0291b58b387e6aa0c288ea57662
MD5 1df9f8818a7c07a99a7cac2de7a51de9
BLAKE2b-256 5dcce67ffbcc0c6a4a4c7406c25b0839703a35e03fa7bbf53e0152dbeb51b317

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 d8da6a6060146879c04f6c491ebc8fc2cb5f942503c48260c1fa24f22bfdfd93
MD5 a48bcf13b10b0ae84758c767a2e1b741
BLAKE2b-256 db123fffb6de4924aceffb4a8b810823b920214f94173c049125c642078ef853

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c35a2b46fd2626e129f81a527eed34ba3c5a9e672435c2b261097a69dd4a9692
MD5 135958f579e5275326413b3f88b98d13
BLAKE2b-256 a79da36ff34c8c4070fd2f42c2ef643d9d3b3b0faf69d107bfd77cc4091e3da2

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6b5a79b971eb4d27d37a5b38862dd705ea0e36bcc0b63003dd971ae1171fa71a
MD5 ab7ae128cc41f40a34d07dbc76d26d01
BLAKE2b-256 f724561a55c796e77d7f40f2ad3ef231f917cd3f52b269439037b1eea6bfba78

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 571484c940d3c9c58e61f230bd27c381d8be2fd1938bd1912ac2a22d89805e55
MD5 c3520a760a49682460aa4323c05d248f
BLAKE2b-256 24a8c12eae1106025d9472e4556dc8c3015008d78ea4ae32dc57cadd3adf05fd

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 a87aa67de70bd357301c2c9165781777ddfc54af4c05381df25caddbaaacf662
MD5 c1523302cb31e4eca48b862723e45b85
BLAKE2b-256 fdaf1d536b4b04adc0c60fd3ac3748e0c392f6e50b7f12626e674f2bf734066c

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 63c30fea31951535eab2c53dda3e7ea698547cbb32eb58ad434222c9c8eb0eef
MD5 da5ed109cd26e8a12be2f3bb81f542ad
BLAKE2b-256 0c2b7e879a5988440d0490b90665d909387ac778c57e8a1e50073946b5d0edbf

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 62dd9e7d1317efd8a9e9377da95b358d5ef31be5fefe88e930eb3d63752300a2
MD5 4d5c2e7a454d14a406feeffd3ed17077
BLAKE2b-256 385511750e994dc1ba4fcbf164123e46a46bc1395674e12830e92130ce32604a

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8bd116add82fd3c828d627556b71e3e11b6e5bef6c29f00fca3f12c249049038
MD5 f0353ccda48e107be3fd58a9f18b8f80
BLAKE2b-256 997315657de08075cb7a20f7feb3605a4c012189c2e226481cef9a7f81682489

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 29f124fb74722e42cc1ce3f9b9b9e0be18af8681004d511bf76e18af972f169d
MD5 7081fc233200b49cad6af1f2aa5e5551
BLAKE2b-256 33da97ad7299053b4be390d1a7dbb706c4826ee914f541d55214ca887a10aaf7

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 6d3d1de329eb0ee4e1c1d26c1bbd45295645a184aad0cff5aedce26083b5f763
MD5 90cc229da7c38d8f127f4ed922b32fc1
BLAKE2b-256 393d845379120970a45d9b59dcdef24a0593b8f4e1d3f282bc48dfb391d54839

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 3d3ffb05583e330c1eed3078a94e3abe90e471bf99e05bf7894ca84833f8bb41
MD5 4980a933ea97530e28ea5c02c2f48ee3
BLAKE2b-256 8189c38d0c0c7507feb3cf0d86e97465150b54d7890e52d2249d5d1bff5686e3

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4c17c9119caa37befc194882e135ad29fda3bcd1c6380080b58345dc66ade90d
MD5 75f9ad6ce81e447bbe47ecefd756923d
BLAKE2b-256 28fb521d9ad8148efa829c9c222152c0ef51e5d4b9c9cb819998185fbd255169

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b621802e4a5aea189f1990e6aeb670bf39678b16938e21bbe1ec880489563a4b
MD5 20c23e771eb901fa74ab800ee7c58bae
BLAKE2b-256 e87fbdfc47f5c9ba26139669f213b410e6fd31065c4e5046b63e6b266011b4ea

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 49b99898d6fb8e524b844fdc64501ca57534103e46b5673afeccc68dc0bc75c4
MD5 0194e6c522c2e4e39140ea023e0044c5
BLAKE2b-256 0da653017da90d2618c7bbcf9fd705ab168a4da544d5437cbbc3a7e106860ba8

See more details on using hashes here.

File details

Details for the file dask_sql-2022.10.1-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dask_sql-2022.10.1-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 02707d4bfdefcf8c2977ac6488ddabafcf7631b1b9d8d81cd45734b192697077
MD5 a53248d4cf0b33cbb9dc9e4b435c7ea4
BLAKE2b-256 1b1d691c22920b37d550e656a4c60e69f7706c88a9f25f28f1e402d576601520

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page