Skip to main content

Dask SQL

Project description

dask-sql

Conda PyPI GitHub Workflow Status Read the Docs Codecov GitHub

dask-sql adds a SQL query layer on top of dask. This allows you to query and transform your dask dataframes using common SQL operations.

The queries will run as normal dask operations, which can be distributed within your dask cluster. The goal of this project is therefore similar to what Spark SQL/Hive/Drill/... is for the Hadoop world - but with much less features (so far...). Some ideas for this project are coming from the very great blazingSQL project.

Read more in the documentation.

You can try out dask-sql quickly by using the docker command

docker run --rm -it -p 8080:8080 nils-braun/dask-sql

See information in the SQL server at the end of this page.


NOTE

dask-sql is currently under development and does so far not understand all SQL commands. We are actively looking for feedback, improvements and contributors!


Example

We use the timeseries random data from dask.datasets as an example:

from dask_sql import Context
from dask.datasets import timeseries

# Create a context to hold the registered tables
c = Context()

# If you have a cluster of dask workers,
# initialize it now

# Load the data and register it in the context
# This will give the table a name
df = timeseries()
c.register_dask_table(df, "timeseries")

# Now execute an SQL query. The result is a dask dataframe
# The query looks for the id with the highest x for each name
# (this is just random test data, but you could think of looking
# for outliers in the sensor data)
result = c.sql("""
    SELECT
        lhs.name,
        lhs.id,
        lhs.x
    FROM
        timeseries AS lhs
    JOIN
        (
            SELECT
                name AS max_name,
                MAX(x) AS max_x
            FROM timeseries
            GROUP BY name
        ) AS rhs
    ON
        lhs.name = rhs.max_name AND
        lhs.x = rhs.max_x
""")

# Show the result...
print(result.compute())

# ... or use it for any other dask calculation
# (just an example, could also be done via SQL)
print(result.x.mean().compute())

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use your already present environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

dask-sql needs Java for the parsing of the SQL queries. Make sure you have a running java installation with version >= 8.

To test if you have Java properly installed and set up, run

$ java -version
openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

After installing Java, you can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from sources.

git clone https://github.com/nils-braun/dask-sql.git

Create a new conda environment and install the development environment:

conda create -n dask-sql --file conda.yaml -c conda-forge

After that, you can install the package in development mode

pip install -e .

This will also compile the Java classes. If there were changes to the Java code, you need to rerun this compilation with

python setup.py java

Testing

You can run the tests (after installation) with

pytest tests

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using Apache Calcite into a relational algebra, which is specified as a tree of java objects - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query from java objects into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Apache Calcite needs to know about the columns and types of the dask dataframes, therefore some java classes to store this information for dask dataframes are defined in planner. After the translation to a relational algebra is done (using RelationalAlgebraGenerator.getRelationalAlgebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol. It is - so far - only a start of the development and missing important concepts, such as authentication.

You can test the sql presto server by running

python dask_sql/server/app.py

or by using the created docker image

docker run --rm -it -p 8080:8080 nils-braun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.

You can test this for example with the default presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

You can find more information in the documentation.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-0.1.2.tar.gz (19.3 MB view details)

Uploaded Source

Built Distribution

dask_sql-0.1.2-py3-none-any.whl (19.2 MB view details)

Uploaded Python 3

File details

Details for the file dask_sql-0.1.2.tar.gz.

File metadata

  • Download URL: dask_sql-0.1.2.tar.gz
  • Upload date:
  • Size: 19.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.6

File hashes

Hashes for dask_sql-0.1.2.tar.gz
Algorithm Hash digest
SHA256 fe5e25fab736d7c13303d5e96a377574dbb186dffba04e338b7daf33b43bfe8a
MD5 4f11879d671f9c005854a55b8b2aa58d
BLAKE2b-256 053e5ff2750577834a73392f0783d140cd30f26db86d67a075e444e2834e77ae

See more details on using hashes here.

File details

Details for the file dask_sql-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: dask_sql-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 19.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.6

File hashes

Hashes for dask_sql-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 f640bfebb81f9219a6fa49d4c8a75c1b2f515a7075461c97df84f5aea38f0f2a
MD5 47028f585dde464e4399b566bf87d435
BLAKE2b-256 bde751aa0fa0a30e3fec989bcc26e7afac5168195376b64d58ab8ce4c82cb2ab

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page