Skip to main content

Dask SQL

Project description

dask-sql

dask-sql adds a SQL query layer on top of dask.

It lets you register your dask dataframes and then run SQL queries against them. The queries will run as normal dask operations, which can be distributed within your dask cluster. The goal is therefore similar to what Spark SQL/Hive/Drill/... is for the Hadoop world - but with much less features (so far...)

It uses Apache Calcite to convert SQL queries into a query plan (so called relational algebra) - similar to many other SQL engines (Hive, Flink, ...). This plan is then converted into normal dask API calls. Some ideas for this project are coming from the blazingSQL project.

dask-sql is currently under development. Any contributions are highly welcome! It is mainly a proof of concept - many things are still missing.

Example

We use the NYC flight data from the dask tutorial, as an example:

from dask_sql import Context
import dask.dataframe as dd

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name
df = dd.read_csv(...)
c.register_dask_table(df, "nycflights")

# Now execute an SQL query. The result is a dask dataframe
result = c.sql("""
    SELECT
        Origin, MAX(DepDelay)
    FROM nycflights
    GROUP BY Origin
""")

# Show the result (or use it for any other dask calculation)
print(result.compute())

dask-sql also comes with a very simple test implementation of a SQL server speaking the postgreSQL wire protocol. It is - so far - just a proof of concept. See below on how to start it.

Installation

So far, the project can only be installed via the sources. A packaged installation will come soon!

With conda

Create a new conda environment using the supplied conda.yaml:

conda create -n dask-sql --file conda.yaml -c conda-forge
conda activate dask-sql

Finally, you can install the python module

python setup.py install

This will trigger also the compilation of the java library.

With pip

Make sure you have a running java installation with version >= 11. Currently, for the compilation the JDK is needed.

After installing Java, you can install the package with

python setup.py install

For development

You can also build the java library separately:

python setup.py java

After that, you can install the package in development mode

pip install -e .

Make sure to re-run the java build whenever you do any changes to the java code.

Testing

You can run the tests (after installation) with

pytest tests

SQL Server

dask-sql comes with a small test implementation for a SQL server. Instead of rebuilding a full ODBC driver, we re-use the postgreSQL wire protocol.

You can test the sql postgres server by running

python dask_sql/server/handler.py

in one terminal. This will spin up a server on port 9876 that looks similar to a normal postgres database to any postgres client (except that you can only do queries, no database creation etc.)

You can test this for example with the default postgres client:

psql -h localhost -p 9876

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using Apache Calcite into a relational algebra, which is specified as a tree of java objects.
  • convert this description of the query from java objects into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Apache Calcite needs to know about the columns and types of the dask dataframes, therefore some java classes to store this information for dask dataframes are defined in planner. After the translation to a relational algebra is done (using RelationalAlgebraGenerator.getRelationalAlgebra), the python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_sql-0.1.0rc2.tar.gz (17.4 MB view details)

Uploaded Source

Built Distribution

dask_sql-0.1.0rc2-py3-none-any.whl (17.4 MB view details)

Uploaded Python 3

File details

Details for the file dask_sql-0.1.0rc2.tar.gz.

File metadata

  • Download URL: dask_sql-0.1.0rc2.tar.gz
  • Upload date:
  • Size: 17.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for dask_sql-0.1.0rc2.tar.gz
Algorithm Hash digest
SHA256 683e4c9759ef0a0d475aadc42d976125a04f9cc7d32065d11f48bd27282c46a1
MD5 d9cee7ec2f6cdaac7d199626791f1cb4
BLAKE2b-256 ad4c04ef5c73f7354640d20490e27a5f24fc4fc4a28eedca1707044c714e2f66

See more details on using hashes here.

File details

Details for the file dask_sql-0.1.0rc2-py3-none-any.whl.

File metadata

  • Download URL: dask_sql-0.1.0rc2-py3-none-any.whl
  • Upload date:
  • Size: 17.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for dask_sql-0.1.0rc2-py3-none-any.whl
Algorithm Hash digest
SHA256 7a4dc4cbc139d41822aa3949b186b807757f6b7dc488417902c499e552c930f3
MD5 4e4c1c13fe2afbeab2c7da347ac9d00d
BLAKE2b-256 d7eeb4d0074e921e8f31377af3f2e43e2b0f0c9a55a20b138b6cdd245493c44d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page