dask-sql

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and to scale up the computation easily if you need to.

Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
Infinite Scaling: building on the Dask ecosystem, your computations can scale as you need - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so does dask-sql.
Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. for machine learning, complicated input formats or complex statistics.
Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql runs out of the box on your machine and can easily be connected to your computing cluster.
Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook and your normal Python module, or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.

Example

For this example, we use some data loaded from disk and query it with a SQL command from our Python code. Any pandas or dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

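# Load the data and register it in the context
# This will give the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. With return_futures=False, the query is
# computed immediately and the result is returned as a pandas dataframe
# (the default, return_futures=True, returns a lazy dask dataframe).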
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)

Quickstart

Have a look at the documentation or start the example notebook on binder.

dask-sql is currently under development and does not yet understand all SQL commands (but a large fraction of them). We are actively looking for feedback, improvements and contributors!

If you would like to utilize GPUs for your SQL queries, have a look at the blazingSQL project.

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With `conda`

Create a new conda environment or use an existing one:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With `pip`

dask-sql needs Java for parsing the SQL queries. Make sure you have a working Java installation with version >= 8.

To test whether Java is properly installed and set up, run

$ java -version
openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
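
The same check can be scripted, for example before starting a longer setup. The following is only a minimal sketch, assuming that `java -version` prints a quoted version string like the one above; the helper names `java_major_version` and `installed_java_major` are made up for this illustration and are not part of dask-sql.

```python
import re
import shutil
import subprocess
from typing import Optional


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_152" means Java 8.
    # Newer scheme: "11.0.2" means Java 11.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr, not stdout.
    proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(proc.stderr or proc.stdout)
```

Calling `installed_java_major()` returns `None` if no `java` executable is found; otherwise the result can be compared against the required minimum of 8.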

After installing Java, you can install the package with

pip install dask-sql

For development

If you want the newest (unreleased) dask-sql version or plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.8-jdk11-dev.yaml

Using pip instead of conda for the environment setup is not recommended. If you need to do so anyway, make sure Java (jdk >= 8) and maven are installed and correctly set up before continuing. Have a look at environment-3.8-jdk11-dev.yaml for the rest of the development environment.

After that, you can install the package in development mode:

pip install -e ".[dev]"

To compile the Java classes (at the beginning or after changes), run

python setup.py java

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of building a full ODBC driver, it reuses the presto wire protocol. So far it is only a starting point and lacks important features such as authentication.

You can test the presto-compatible SQL server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

Run either command in one terminal. This will spin up a server on port 8080 (by default) that looks like a normal presto database to any presto client.

You can test this, for example, with the default presto client:

presto --server localhost:8080

Now you can run simple SQL queries (simple because no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)
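
To illustrate what a presto client does under the hood, here is a rough sketch using only the Python standard library. It assumes the usual presto REST flow: the query text is POSTed to `/v1/statement`, and result pages carry `data` rows plus an optional `nextUri` to follow. The helper names and the `X-Presto-User` value are placeholders chosen for this example, and the sketch has not been verified against the dask-sql test server.

```python
import json
import urllib.request
from typing import Callable, Iterator


def build_statement_request(
    sql: str, server: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build the initial POST request that submits a query."""
    return urllib.request.Request(
        url=f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": "example-user"},  # placeholder user name
        method="POST",
    )


def http_fetch(request) -> dict:
    """Send a Request (or GET a URL string) and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_rows(
    sql: str,
    fetch: Callable = http_fetch,
    server: str = "http://localhost:8080",
) -> Iterator[list]:
    """Yield result rows, following `nextUri` links until none is left."""
    page = fetch(build_statement_request(sql, server))
    while True:
        if "error" in page:
            raise RuntimeError(f"query failed: {page['error']}")
        yield from page.get("data", [])
        next_uri = page.get("nextUri")
        if next_uri is None:
            break
        page = fetch(next_uri)
```

With the server from above running, `list(iter_rows("SELECT 1 + 1"))` would collect the rows of the query shown before. The `fetch` parameter exists only so that the paging logic can be exercised without a live server.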

You can find more information in the documentation.

CLI

You can also run the dask-sql CLI to try out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

1. Translate the SQL query using Apache Calcite into relational algebra, which is represented as a tree of Java objects - similar to many other SQL engines (Hive, Flink, ...).
2. Convert this description of the query from Java objects into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Apache Calcite needs to know the columns and types of the dask dataframes. Some Java classes that store this information for dask dataframes are therefore defined in planner. Once the query has been translated into relational algebra (using RelationalAlgebraGenerator.getRelationalAlgebra), the Python methods defined in dask_sql.physical turn it into a physical dask execution plan by converting each piece of the relational algebra one by one.
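
The second step can be pictured with a toy sketch in plain Python. This is not dask-sql code: the node classes `TableScan`, `Filter` and `Project` and the `convert` function are hypothetical stand-ins for the Calcite tree and the converters in dask_sql.physical, and the sketch works on lists of dictionaries instead of dask dataframes. It only shows the idea of walking a relational algebra tree and converting each node in turn.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, object]


@dataclass
class TableScan:
    """Leaf node: read all rows of a registered table."""
    table: str


@dataclass
class Filter:
    """Keep only the rows of the input for which the predicate holds."""
    input: "Node"
    predicate: Callable[[Row], bool]


@dataclass
class Project:
    """Keep only the listed columns of the input."""
    input: "Node"
    columns: List[str]


Node = Union[TableScan, Filter, Project]


def convert(node: Node, tables: Dict[str, List[Row]]) -> List[Row]:
    """Convert one node of the tree, recursing into its input first."""
    if isinstance(node, TableScan):
        return list(tables[node.table])
    if isinstance(node, Filter):
        rows = convert(node.input, tables)
        return [row for row in rows if node.predicate(row)]
    if isinstance(node, Project):
        rows = convert(node.input, tables)
        return [{col: row[col] for col in node.columns} for row in rows]
    raise TypeError(f"unsupported node type: {type(node).__name__}")


# Hand-built plan for: SELECT name FROM my_data WHERE x > 1
plan = Project(Filter(TableScan("my_data"), lambda row: row["x"] > 1), ["name"])
tables = {"my_data": [{"name": "a", "x": 1}, {"name": "b", "x": 2}]}
result = convert(plan, tables)
```

Here the plan is built by hand; in dask-sql the tree comes from Apache Calcite, and each node is mapped to operations on dask dataframes rather than evaluated eagerly.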

Project details

Release history

2024.5.0 (May 28, 2024)
2024.3.0 (Mar 19, 2024)
2024.1.0 (Jan 31, 2024)
2024.1.0rc0 (pre-release, Jan 25, 2024)
2023.11.0 (Nov 20, 2023)
2023.11.0rc1 (pre-release, Nov 13, 2023)
2023.10.1 (Oct 17, 2023)
2023.10.0 (Oct 6, 2023)
2023.8.0 (Aug 3, 2023)
2023.6.0 (Jun 8, 2023)
2023.4.0 (Apr 6, 2023)
2023.2.0 (Feb 6, 2023)
2022.12.0 (Dec 2, 2022)
2022.10.1 (Oct 25, 2022)
2022.10.1rc1 (pre-release, Oct 24, 2022)
2022.10.1rc0 (pre-release, Oct 19, 2022)
2022.8.0 (Aug 16, 2022)
2022.6.0 (Jun 3, 2022)
2022.4.1 (Apr 8, 2022)
2022.4.0 (Apr 7, 2022)
2022.1.0 (Jan 24, 2022)
2021.12.0 (Dec 13, 2021)
2021.11.0 (Nov 10, 2021)
0.4.0 (Nov 2, 2021, this version)
0.3.9 (Aug 18, 2021)
0.3.8 (Aug 17, 2021)
0.3.7 (Aug 10, 2021)
0.3.6 (May 16, 2021)
0.3.5 (May 15, 2021)
0.3.4 (May 13, 2021)
0.3.3 (Apr 30, 2021)
0.3.2 (Apr 13, 2021)
0.3.1 (Feb 7, 2021)
0.3.0 (Jan 21, 2021)
0.2.2 (Nov 28, 2020)
0.2.0 (Nov 5, 2020)
0.1.2 (Oct 14, 2020)
0.1.1 (Oct 13, 2020)
0.1.0 (Oct 13, 2020)
0.1.0rc4 (pre-release, Oct 13, 2020)
0.1.0rc2 (pre-release, Sep 9, 2020)
0.1.0rc1 (pre-release, Sep 7, 2020)

Download files

Download the file for your platform.

Source Distribution

dask_sql-0.4.0.tar.gz (20.0 MB)

Uploaded Nov 2, 2021 Source

Built Distribution

dask_sql-0.4.0-py3-none-any.whl (19.4 MB)

Uploaded Nov 2, 2021 Python 3

File details

Details for the file dask_sql-0.4.0.tar.gz.

File metadata

Download URL: dask_sql-0.4.0.tar.gz
Upload date: Nov 2, 2021
Size: 20.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for dask_sql-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`edb59cf82699f2fd260335641fd56b9b60932db39ee6a3cff3b0a720ca58578a`
MD5	`edf00828a10892bc67209ce3ae946222`
BLAKE2b-256	`7e36568c5cd26033ba1e489491e1d3d3e6c5df0ad0fb8d97c2dcfeaace73b013`

File details

Details for the file dask_sql-0.4.0-py3-none-any.whl.

File metadata

Download URL: dask_sql-0.4.0-py3-none-any.whl
Upload date: Nov 2, 2021
Size: 19.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for dask_sql-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`10045a4bb59c8b5f2376c687a7be890980e541791e9ee45861892cf8607d5e03`
MD5	`8edd5a342f010f51f47fbc06a5e33938`
BLAKE2b-256	`26363bc7ca793f0b19686bb1f9a15948343f168d07fec517474de11fc7da0a6e`
