deltadask

A connector for reading Delta Lake tables into Dask DataFrames.

Install with pip install deltadask.

Read a Delta table into a Dask DataFrame as follows:

import deltadask

ddf = deltadask.read_delta("path/to/delta/table")

Basic usage

Suppose you have a Delta table with the following three versions.

[Figure: Delta table with versions]

Here's how to read the latest version of the Delta table:

deltadask.read_delta("path/to/delta/table").compute()
   id
0   7
1   8
2   9

And here's how to read version 1 of the Delta table:

deltadask.read_delta("path/to/delta/table", version=1).compute()
   id
0   0
1   1
2   2
3   4
4   5

Delta Lake makes it easy to time travel between different versions of a Delta table with Dask.
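
Time travel works because each version of a Delta table is a commit in the transaction log, and replaying the commits up to a given version tells a reader which Parquet files are live at that point. The following is a minimal sketch of that idea using only the standard library. The file names and the three toy commits are made up for illustration, and `active_files` is a hypothetical helper, not part of deltadask or delta-rs.

```python
import json

# Toy transaction log: one entry per version, each holding JSON-lines actions
# like the commit files under _delta_log/. File names are invented.
COMMITS = [
    # version 0: initial write
    '{"add": {"path": "part-0.parquet"}}',
    # version 1: append
    '{"add": {"path": "part-1.parquet"}}',
    # version 2: overwrite (old files removed, new file added)
    "\n".join(
        [
            '{"remove": {"path": "part-0.parquet"}}',
            '{"remove": {"path": "part-1.parquet"}}',
            '{"add": {"path": "part-2.parquet"}}',
        ]
    ),
]


def active_files(commits, version=None):
    """Replay commits 0..version and return the data files live at that version."""
    if version is None:
        version = len(commits) - 1  # default to the latest version
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for commit in commits[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


print(active_files(COMMITS))             # ['part-2.parquet']
print(active_files(COMMITS, version=1))  # ['part-0.parquet', 'part-1.parquet']
```

Reading an older version never rewrites data. It only changes which files the reader opens.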

See this notebook for a full working example with an environment so you can replicate this on your machine.

Why Delta Lake is better than Parquet for Dask

A Delta table stores data in Parquet files and metadata in a transaction log. The metadata includes the schema and the location of the files.

[Figure: Delta table architecture]

A Dask Parquet data lake can be stored in two different ways.

  1. Parquet files with a single metadata file
  2. Parquet files without a metadata file

Parquet files with a single metadata file are limited because that one file has to describe every data file in the lake, so it grows with the dataset and becomes a bottleneck to write and to read.

Parquet files without a metadata file are limited because they require a relatively expensive file listing operation, followed by reads of the individual file footers to build the overall metadata statistics for the data lake.

Delta Lake is better because the transaction log scales with the table and can be queried a lot faster than an expensive file listing operation.

Here's an example of how to query a Delta table with Dask and take advantage of column pruning and predicate pushdown filtering:

ddf = deltadask.read_delta(
    "path/to/delta/table",
    columns=["col1"],
    filters=[[("col1", "==", 0)]],
)
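
The nested list passed to filters follows the same convention as the filters argument of Dask's read_parquet: the outer list is an OR over groups, and each inner list is an AND over (column, operator, value) tuples. The sketch below applies that convention to plain Python dicts to make the semantics concrete. The `matches` and `select` helpers and the sample rows are hypothetical. A real reader applies the predicate before loading data rather than after.

```python
import operator

# Comparison operators supported by this toy evaluator.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row, filters):
    """True if the row satisfies the filters: OR across groups, AND within one."""
    return any(
        all(OPS[op](row[col], value) for col, op, value in group)
        for group in filters
    )


def select(rows, columns, filters):
    """Keep matching rows (predicate) and only the requested columns (pruning)."""
    return [{c: row[c] for c in columns} for row in rows if matches(row, filters)]


rows = [
    {"col1": 0, "col2": "a"},
    {"col1": 1, "col2": "b"},
    {"col1": 0, "col2": "c"},
]

result = select(rows, columns=["col1"], filters=[[("col1", "==", 0)]])
print(result)  # [{'col1': 0}, {'col1': 0}]
```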

Why this library is really easy to build

Reading a Delta Lake into a Dask DataFrame is ridiculously easy, thanks to delta-rs.
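
To show why delta-rs makes this easy, here is a minimal sketch of how such a reader could be put together. It is not necessarily how deltadask is implemented. It assumes the deltalake Python package (the delta-rs bindings) and dask are installed, and the name `read_delta_sketch` is made up for illustration. The sketch ignores concerns such as partition columns and storage options.

```python
def read_delta_sketch(path, version=None, columns=None, filters=None):
    """Hypothetical reader: resolve files via delta-rs, then hand them to Dask."""
    # Imported lazily so the function can be defined without the optional
    # dependencies installed; both packages are needed to actually call it.
    import dask.dataframe as dd
    from deltalake import DeltaTable

    # delta-rs reads the transaction log and resolves the requested version.
    table = DeltaTable(path, version=version)

    # The log already knows which Parquet files are live, so no directory
    # listing is needed.
    files = table.file_uris()

    # Dask reads only those files, with column pruning and filters applied.
    return dd.read_parquet(files, columns=columns, filters=filters)
```

The function only connects two existing pieces: delta-rs supplies the file list for a version, and Dask does the parallel Parquet reading.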

Reading Delta tables is also fast and efficient. The list of data files comes from the transaction log, which is a lot faster than a file listing operation.

You can also skip entire files based on column metadata stored in the transaction log. Skipping data allows for huge performance improvements.
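
File skipping can be sketched with a few lines of Python. Each add action in the transaction log can carry a JSON-encoded stats field with per-column minimum and maximum values, and a reader can drop any file whose range cannot contain the value being searched for. The log entries, file names, and the `files_that_may_contain` helper below are invented for illustration.

```python
import json


def make_add(path, num_records, col_min, col_max):
    """Build a toy add action with min/max statistics for column col1."""
    stats = {
        "numRecords": num_records,
        "minValues": {"col1": col_min},
        "maxValues": {"col1": col_max},
    }
    # In the log, stats is stored as a JSON string inside the add action.
    return {"path": path, "stats": json.dumps(stats)}


LOG_ADDS = [
    make_add("part-0.parquet", 3, 0, 2),
    make_add("part-1.parquet", 2, 4, 5),
    make_add("part-2.parquet", 3, 7, 9),
]


def files_that_may_contain(adds, column, value):
    """Return the files whose [min, max] range for `column` includes `value`."""
    keep = []
    for add in adds:
        stats = json.loads(add["stats"])
        lo = stats["minValues"].get(column)
        hi = stats["maxValues"].get(column)
        # Without statistics the file cannot be ruled out, so it is kept.
        if lo is None or hi is None or lo <= value <= hi:
            keep.append(add["path"])
    return keep


print(files_that_may_contain(LOG_ADDS, "col1", 0))  # ['part-0.parquet']
```

For the filter col1 == 0, two of the three toy files are never opened. A file that survives this check may still hold no matching rows, so the row-level filter is applied afterwards.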

Reading a Delta table into a Dask DataFrame with this library takes a single call to deltadask.read_delta, as in the first example above.
