Dask + Delta Table

Dask-DeltaTable

Read from and write to Delta Lake using the Dask engine.

Installation

To install the package:

pip install dask-deltatable

Features:

  1. Read the Parquet files of a Delta table in parallel with Dask
  2. Write Dask dataframes to Delta Lake (limited support)
  3. Supports multiple filesystems (s3, azurefs, gcsfs)
  4. Subset of Delta Lake features:
    • Time Travel
    • Schema evolution
    • Parquet filters
      • row filter
      • partition filter

Not supported

  1. Full write support: writing to Delta Lake is still in development.
  2. The optimize API, which runs a bin-packing operation on a Delta table.

Reading from Delta Lake

import dask_deltatable as ddt

# read delta table
ddt.read_deltalake("delta_path")

# with specific version
ddt.read_deltalake("delta_path", version=3)

# with specific datetime
ddt.read_deltalake("delta_path", datetime="2018-12-19T16:39:57-08:00")
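
The datetime value in the example above is an ISO 8601 string with a UTC offset. As a minimal sketch, assuming you start from a Python datetime object, the string can be built with the standard library. The helper names to_delta_timestamp and read_as_of are hypothetical and not part of dask-deltatable.

```python
from datetime import datetime, timedelta, timezone


def to_delta_timestamp(when: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 string with a UTC offset."""
    if when.tzinfo is None or when.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    # Drop microseconds so the result has the same precision as the example above.
    return when.replace(microsecond=0).isoformat()


def read_as_of(path: str, when: datetime):
    """Hypothetical wrapper: read the table as it was at `when`."""
    # Imported lazily so to_delta_timestamp can be used without the package installed.
    import dask_deltatable as ddt

    return ddt.read_deltalake(path, datetime=to_delta_timestamp(when))


pacific = timezone(timedelta(hours=-8))
stamp = to_delta_timestamp(datetime(2018, 12, 19, 16, 39, 57, tzinfo=pacific))
print(stamp)  # 2018-12-19T16:39:57-08:00
```

Rejecting naive datetimes is a design choice of this sketch: a string without an offset leaves the time zone open to interpretation.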

Accessing remote file systems

To read from S3, Azure, GCS, and other remote filesystems, ensure the credentials are properly configured in environment variables or config files. For AWS, you may need ~/.aws/credentials; for gcsfs, GOOGLE_APPLICATION_CREDENTIALS. Refer to your cloud provider's documentation to configure these.

ddt.read_deltalake("s3://bucket_name/delta_path", version=3)
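
Before reading from a remote path, it can help to check that the settings mentioned above are present. The following is a rough pre-flight sketch using only the standard library. The function missing_credentials is hypothetical, it covers only S3 and GCS URIs, and both providers accept other authentication methods that it does not detect, so treat its output as a hint.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def missing_credentials(uri, env=None, home=None):
    """Return a list of likely credential problems for `uri` (empty if none found)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    scheme = urlparse(uri).scheme
    problems = []
    if scheme in ("s3", "s3a"):
        has_env = bool(env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"))
        has_file = (home / ".aws" / "credentials").is_file()
        if not (has_env or has_file):
            problems.append("no AWS environment variables and no ~/.aws/credentials file")
    elif scheme in ("gs", "gcs"):
        key_file = env.get("GOOGLE_APPLICATION_CREDENTIALS")
        if not key_file:
            problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
        elif not Path(key_file).is_file():
            problems.append("GOOGLE_APPLICATION_CREDENTIALS does not point to a file")
    # Local paths and other schemes are not checked in this sketch.
    return problems
```

For example, missing_credentials("s3://bucket_name/delta_path") returns an empty list when either source of AWS credentials is found.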

Accessing AWS Glue catalog

dask-deltatable can connect to the AWS Glue catalog to read a Delta table. The method looks for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and, if those are not available, falls back to ~/.aws/credentials.

Example:

ddt.read_deltalake(catalog="glue", database_name="science", table_name="physics")
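
The lookup order described above can be mirrored in a few lines to find out which source would be used on a given machine. This is an illustrative sketch, not the package's own implementation. The function aws_credential_source is hypothetical, and it assumes that only the [default] profile of the credentials file matters.

```python
import configparser
import os
from pathlib import Path


def aws_credential_source(env=None, credentials_file=None):
    """Return "environment", "file", or None, following the order described above."""
    env = os.environ if env is None else env
    # Step 1: both environment variables must be set and non-empty.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    # Step 2: fall back to the shared credentials file.
    path = Path(credentials_file) if credentials_file else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    # Assumption: only the [default] profile is inspected.
    if parser.has_section("default"):
        profile = parser["default"]
        if profile.get("aws_access_key_id") and profile.get("aws_secret_access_key"):
            return "file"
    return None
```

A return value of None suggests that the Glue example above is likely to fail with an authentication error.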

Writing to Delta Lake

To write a Dask dataframe to Delta Lake, use the to_deltalake method.

import dask.dataframe as dd
import dask_deltatable as ddt

df = dd.read_csv("s3://bucket_name/data.csv")
# do some processing on the dataframe...
ddt.to_deltalake(df, "s3://bucket_name/delta_path")

Writing to Delta Lake is still in development, so be aware that some features may not work.
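
Because write support is still evolving, it may be worth checking what was actually committed. A Delta table keeps its transaction log in a _delta_log directory, with one JSON file per version named by a zero-padded 20-digit number. The sketch below lists the committed versions of a table on the local filesystem. The function committed_versions is hypothetical, and remote paths such as s3:// are outside its scope.

```python
import re
from pathlib import Path

# Commit files look like 00000000000000000003.json; checkpoint files are ignored.
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def committed_versions(table_path):
    """Return the sorted version numbers found in a local table's _delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []  # not a Delta table, or nothing has been written yet
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_FILE.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)
```

Any number in the returned list should be usable as the version argument of read_deltalake.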

