SQL query layer for Dask

Project description

SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and to scale up the computation easily when you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
  • Your data - your queries: use Python user-defined functions (UDFs) in SQL without any performance penalty and extend your SQL queries with the vast number of Python libraries, e.g. for machine learning, complicated input formats, or complex statistics (see the UDF sketch after this list).
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can easily be connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook, your normal Python module, or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
  • GPU Support: dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.
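
To illustrate the UDF bullet above, here is a minimal sketch of registering a plain Python function and calling it from SQL. It assumes the Context.register_function signature from the dask-sql documentation; the table and function names are made up for this illustration.

import numpy as np
import dask.datasets
from dask_sql import Context

c = Context()

# Register a demo table (dask.datasets.timeseries ships with Dask
# and has a float64 column "x")
c.create_table("timeseries", dask.datasets.timeseries())

# A plain Python function that operates on a column
def my_square(x):
    return x ** 2

# Make it callable from SQL: name, (parameter, type) pairs, return type
c.register_function(my_square, "my_square", [("x", np.float64)], np.float64)

# Use the UDF like any built-in SQL function
result = c.sql("SELECT my_square(x) AS x_squared FROM timeseries")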

Read more in the documentation.

Example

For this example, we use some data loaded from disk and query it with a SQL command from our Python code. Any pandas or Dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again a Dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)
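
Tables can also be registered without leaving SQL. A minimal sketch, assuming the CREATE TABLE ... WITH syntax from the dask-sql documentation (the location is a placeholder path):

# Register a parquet file as a table directly from SQL
c.sql("""
    CREATE TABLE my_parquet_data WITH (
        location = '/path/to/data.parquet',
        format = 'parquet'
    )
""")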

Quickstart

Have a look at the documentation or start the example notebook on Binder.

dask-sql is currently under development and does not yet understand all SQL commands (but a large fraction of them). We are actively looking for feedback, improvements, and contributors!

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use an existing one:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9-dev.yaml

It is not recommended to use pip instead of conda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of the pip install. If changes are made to the Rust source in dask_planner/, another build/install must be run to recompile the bindings:

python setup.py build install

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified in continuous_integration/gpuci/environment.yaml. These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

GPU-specific tests can then be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of reimplementing a full ODBC driver, we reuse the Presto wire protocol. So far this is only a starting point of the development and is missing important concepts, such as authentication.

You can test the Presto SQL server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks like a normal Presto database to any Presto client.

You can test this, for example, with the default Presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)
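
In principle, any Presto-compatible client should work as well. A minimal sketch using the prestodb package (presto-python-client); the user, catalog, and schema values below are placeholders:

import prestodb  # pip install presto-python-client

# Connect to the dask-sql server started above
conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="test",        # placeholder
    catalog="catalog",  # placeholder
    schema="schema",    # placeholder
)
cur = conn.cursor()
cur.execute("SELECT 1 + 1")
print(cur.fetchall())  # [[2]]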

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using DataFusion into relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query into Dask API calls (and execute them) - returning a Dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the Dask dataframes; therefore, some Rust code that stores this information for Dask dataframes is defined in dask_planner. After the translation to relational algebra is done (using DaskSQLContext.logical_relational_algebra), the Python methods defined in dask_sql.physical turn this into a physical Dask execution plan by converting each piece of the relational algebra one by one.
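
To see the first step in action, you can inspect the logical plan of a query before executing it. A minimal sketch, assuming the Context.explain method described in the dask-sql documentation:

from dask_sql import Context
import dask.datasets

c = Context()
c.create_table("timeseries", dask.datasets.timeseries())

# Print the relational algebra (logical query plan) without executing the query
print(c.explain("SELECT name, SUM(x) FROM timeseries GROUP BY name"))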


Download files

Download the file for your platform.

Source Distribution

  • dask_sql-2022.10.1rc0.tar.gz (164.2 kB, Source)

Built Distributions

  • dask_sql-2022.10.1rc0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)
  • dask_sql-2022.10.1rc0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB, CPython 3.10, manylinux: glibc 2.17+ ARM64)
  • dask_sql-2022.10.1rc0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)
  • dask_sql-2022.10.1rc0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB, CPython 3.9, manylinux: glibc 2.17+ ARM64)
  • dask_sql-2022.10.1rc0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB, CPython 3.8, manylinux: glibc 2.17+ x86-64)
  • dask_sql-2022.10.1rc0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.0 MB, CPython 3.8, manylinux: glibc 2.17+ ARM64)

File details

Details for the file dask_sql-2022.10.1rc0.tar.gz.

File metadata

  • Download URL: dask_sql-2022.10.1rc0.tar.gz
  • Size: 164.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for dask_sql-2022.10.1rc0.tar.gz

  • SHA256: 4e839465e0095ff4e00e26e975b702d4dd82765da7acce6fb04db3a8b044e541
  • MD5: a583890581b55004578ea8294364f937
  • BLAKE2b-256: 59975e3dcf49fb158bdcc2d7199cb3b3ddd4c4d5026aa0abc748a64569afdbef

Hashes for dask_sql-2022.10.1rc0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 48b375f9cf43162e030906e305dbcc3a3db142384b59f9353bf353dc68a404cd
  • MD5: f5998c4e364a964fd2250eb765a102bd
  • BLAKE2b-256: 5dc6bf9457474400061eb9f88e8a5dea9f658e9bde88c047352055a7f324011b

Hashes for dask_sql-2022.10.1rc0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 8465bbfb246fcd2a0960132274009b2047b016677780ff5abef33a59adce9e7f
  • MD5: ca5fa95a3752f7ca01947674a8e1c152
  • BLAKE2b-256: 9f718c7c157a83d065e7e7f55a412562769ee40df3a662f09c712eff38089b45

Hashes for dask_sql-2022.10.1rc0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 2aed02faaebf4112b0356c51c4812ac2ff252905790596be17bf87b332691f38
  • MD5: ba2cbad277811bd57d48a767e25cd733
  • BLAKE2b-256: fd9b56bebee63079b4c2a5d37806169bfa0c9b1a27362103e40fa30c62daabf2

Hashes for dask_sql-2022.10.1rc0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 4e19893579778353950321b581bbfee3bf8c26dd0ed7011ad2c8da0d3992d28a
  • MD5: 9cf991ab90e766009e7af314faa80ed1
  • BLAKE2b-256: d6720bb594ea7cfe2bbf6bb12c8b715c09e8ed65cb24b4466f37c0012884215c

Hashes for dask_sql-2022.10.1rc0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 605e96154620a3fd6f2a037c3277532cf2a70717963db205077b3781c20a7618
  • MD5: 18e47568289efee6a596f298802aad2f
  • BLAKE2b-256: 9eec2c91ae22fbdb2c0270fcdde81007fe7fe81c86759020df89067d176d4493

Hashes for dask_sql-2022.10.1rc0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 5a9fc943be341d79e260516be60b514de1f597a1f76b92dd7573fd3d1fdb17e8
  • MD5: cab3238bb3ce303e9735bfb77135b0fe
  • BLAKE2b-256: b6c8b1f7622863d571bcae04f4d4354a03d53b3dc9afc7931fb0800f73d1fc81
