Dask + Delta Table

Dask-DeltaTable

Read from and write to Delta Lake using the Dask engine.

Installation

To install the package:

pip install dask-deltatable

Features:

  1. Read Parquet files from Delta Lake and parallelize with Dask
  2. Write Dask DataFrames to Delta Lake (limited support)
  3. Support for multiple filesystems (s3, azurefs, gcsfs)
  4. Subset of Delta Lake features:
    • Time Travel
    • Schema evolution
    • Parquet filters
      • row filter
      • partition filter
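
As a minimal sketch of the two filter kinds listed above: the example assumes that read_deltalake accepts a `filter` keyword taking a list of (column, operator, value) tuples, in the style pyarrow uses for Parquet filters. The column names "x" and "year" and the helper read_filtered are hypothetical; check the signature of your installed version before relying on this.

```python
# Filters are written as lists of (column, operator, value) tuples.
# "x" and "year" are hypothetical column names used for illustration.
ROW_FILTER = [("x", ">", 400)]             # applied to values inside the files
PARTITION_FILTER = [("year", "==", 2021)]  # applied to a partition column


def read_filtered(path, filters):
    """Read a Delta table, keeping only rows that match ``filters``.

    Hypothetical helper. The import is done lazily so that this sketch
    can be loaded even when dask-deltatable is not installed.
    """
    import dask_deltatable as ddt

    # Assumption: read_deltalake forwards a `filter` keyword to the reader.
    return ddt.read_deltalake(path, filter=filters)
```

A call such as read_filtered("delta_path", PARTITION_FILTER) would then return a Dask DataFrame restricted to the matching partition, provided the assumption about the `filter` keyword holds.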

Not supported

  1. Writing to Delta Lake is still in development, so only limited support is available.
  2. The optimize API, which runs a bin-packing operation on a Delta table.

Reading from Delta Lake

import dask_deltatable as ddt

# read delta table
ddt.read_deltalake("delta_path")

# with specific version
ddt.read_deltalake("delta_path", version=3)

# with specific datetime
ddt.read_deltalake("delta_path", datetime="2018-12-19T16:39:57-08:00")
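
The datetime argument above is an ISO 8601 timestamp with a UTC offset. A small sketch of how such a string could be built from a timezone-aware Python datetime follows; the helper name as_delta_timestamp is hypothetical and not part of dask-deltatable.

```python
from datetime import datetime, timedelta, timezone


def as_delta_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 string with offset."""
    if moment.tzinfo is None:
        # A naive datetime has no offset, so the result would be ambiguous.
        raise ValueError("expected a timezone-aware datetime")
    # Drop sub-second precision to match the format used in the example above.
    return moment.replace(microsecond=0).isoformat()


# A fixed UTC-08:00 offset, matching the timestamp shown earlier.
utc_minus_8 = timezone(timedelta(hours=-8))
stamp = as_delta_timestamp(datetime(2018, 12, 19, 16, 39, 57, tzinfo=utc_minus_8))
print(stamp)  # 2018-12-19T16:39:57-08:00
```

The resulting string can be passed as the datetime argument, for example ddt.read_deltalake("delta_path", datetime=stamp).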

Accessing remote file systems

To read from S3, Azure, GCS, and other remote filesystems, ensure that credentials are properly configured in environment variables or config files. For AWS, you may need ~/.aws/credentials; for gcsfs, GOOGLE_APPLICATION_CREDENTIALS. Refer to your cloud provider's documentation to configure these.

ddt.read_deltalake("s3://bucket_name/delta_path", version=3)

Accessing AWS Glue catalog

dask-deltatable can connect to the AWS Glue catalog to read a Delta table. It looks for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and, if those are not available, falls back to ~/.aws/credentials.

Example:

ddt.read_deltalake(catalog="glue", database_name="science", table_name="physics")
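
The lookup order described above (environment variables first, then ~/.aws/credentials) can be illustrated with a short standalone sketch. This is not the library's implementation, only a way to check beforehand which source would supply credentials. The function name resolve_aws_credentials and the profile parameter are hypothetical.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_aws_credentials(
    env: Optional[Mapping[str, str]] = None,
    credentials_file: Optional[Path] = None,
    profile: str = "default",  # hypothetical parameter, for illustration only
) -> Optional[Tuple[str, str]]:
    """Return (access_key_id, secret_access_key), or None if neither source has them."""
    env = os.environ if env is None else env
    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"

    # 1. Environment variables take precedence.
    key_id = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key_id and secret:
        return key_id, secret

    # 2. Fall back to the INI-style credentials file.
    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file is silently skipped
    if parser.has_section(profile):
        section = parser[profile]
        key_id = section.get("aws_access_key_id")
        secret = section.get("aws_secret_access_key")
        if key_id and secret:
            return key_id, secret
    return None
```

If resolve_aws_credentials() returns None, neither source described above is configured, and a Glue catalog read is unlikely to authenticate.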

Writing to Delta Lake

To write a Dask DataFrame to Delta Lake, use the to_deltalake method.

import dask.dataframe as dd
import dask_deltatable as ddt

df = dd.read_csv("s3://bucket_name/data.csv")
# do some processing on the dataframe...
ddt.to_deltalake(df, "s3://bucket_name/delta_path")

Writing to Delta Lake is still in development, so be aware that some features may not work.
