
A loose implementation of the deltalake spec focused on extensibility and distributed data.

Project description

xdlake

A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.

This is mostly inspired by the deltalake package, and is (much) less battle tested. However, it is more flexible given its Pythonic design. If you're interested, give it a shot and maybe even help make it better.

Install

pip install xdlake

Usage

Instantiation

Instantiate a table! The table can be local or remote. For remote tables, you may need to install the relevant fsspec implementation, for instance s3fs, gcsfs, or adlfs for AWS S3, Google Cloud Storage, and Azure Storage, respectively.

dt = xdlake.DeltaTable("path/to/my/cool/local/table")
dt = xdlake.DeltaTable("s3://path/to/my/cool/table")
dt = xdlake.DeltaTable("az://path/to/my/cool/table", storage_options=dict_of_azure_creds)
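
The storage_options dict is presumably handed to the relevant fsspec implementation, so its keys are whatever that backend accepts. A purely illustrative sketch with adlfs-style credentials (these key names are not defined by xdlake):

# Illustrative only: key names follow adlfs conventions; use whatever your backend accepts.
dict_of_azure_creds = {"account_name": "myaccount", "account_key": "..."}
dt = xdlake.DeltaTable("az://path/to/my/cool/table", storage_options=dict_of_azure_creds)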

Reads

Read the data. For fancy filtering and predicate pushdown, use to_pyarrow_dataset and learn how to filter pyarrow datasets.

ds = dt.to_pyarrow_dataset()
t = dt.to_pyarrow_table()
df = dt.to_pyarrow_table().to_pandas()
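
For instance, a dataset can be filtered with a pyarrow compute expression before materializing it. A minimal sketch, assuming the table has a "cats" column:

import pyarrow.compute as pc

ds = dt.to_pyarrow_dataset()
# Materialize only the rows where the example "cats" column equals "A".
t = ds.to_table(filter=pc.field("cats") == pc.scalar("A"))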

Writes

Instances of DeltaTable are immutable, so any method that performs a table operation will return a new DeltaTable instance.
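
A small sketch of what that means in practice:

dt = xdlake.DeltaTable("path/to/my/cool/local/table")
new_dt = dt.write(my_cool_arrow_table)
# dt still refers to the previous table version; new_dt reflects the write.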

Write in-memory data

Write data from memory. Data can be pyarrow tables, datasets, or record batches, or iterables of those things. If you want to write pandas data, you need to convert it to an Arrow format first. Fortunately this is easy and zero-copy: pyarrow.Table.from_pandas(my_pandas_df).

dt = dt.write(my_cool_arrow_table)
dt = dt.write(my_cool_arrow_dataset)
dt = dt.write(my_cool_arrow_record_batches)
dt = dt.write(pyarrow.Table.from_pandas(df))
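
If you don't already have Arrow data handy, a tiny in-memory table is enough to try this out. The column names below are made up to match the later delete example:

import pyarrow

# A minimal example table; the columns are arbitrary.
my_cool_arrow_table = pyarrow.table({
    "cats": ["A", "B", "A"],
    "float64": [0.1, 0.95, 0.5],
})
dt = dt.write(my_cool_arrow_table)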

Import foreign data

This is the distributed part: import references to foreign data without copying it (that's why it's a "reference"). Locations can be heterogeneous, for instance s3, gs, az, and local. The foreign data can even be partitioned differently than the DeltaTable itself. Go hog wild!

See Credentials if you need different creds for different storage locations.

Import data from various locations in one go. This only works for non-partitioned data.

dt = dt.import_refs(["s3://some/aws/data", "gs://some/gcp/data", "az://some/azure/data" ])
dt = dt.import_refs(my_pyarrow_filesystem_dataset)
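
As a sketch of what my_pyarrow_filesystem_dataset might be, you can build a pyarrow dataset over remote files yourself, reusing xdlake.storage.get_filesystem the same way the partitioned example below does (the bucket path is a placeholder):

import pyarrow.dataset

fs = xdlake.storage.get_filesystem("s3://some/aws/data")
# With an explicit filesystem object, the path omits the s3:// scheme.
my_pyarrow_filesystem_dataset = pyarrow.dataset.dataset(
    "some/aws/data",
    format="parquet",
    filesystem=fs,
)
dt = dt.import_refs(my_pyarrow_filesystem_dataset)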

Partitioned data needs to be handled differently. First, you'll need to read up on pyarrow partitioning to do it. Second, you can only import one dataset at a time.

# Describe how the foreign data is partitioned, then import its files by reference.
foreign_partitioning = pyarrow.dataset.partitioning(...)
ds = pyarrow.dataset.dataset(
    list_of_files,
    partitioning=foreign_partitioning,
    partition_base_dir=partition_base_dir,
    filesystem=xdlake.storage.get_filesystem(foreign_refs_loc),
)
dt = dt.import_refs(ds, partition_by=my_partition_cols)

Deletes

Delete rows from a DeltaTable using pyarrow expressions:

import pyarrow.compute as pc
expr = (
    (pc.field("cats") == pc.scalar("A"))
    | (pc.field("float64") > pc.scalar(0.9))
)
dt = dt.delete(expr)

Deletion Vectors

I really want to support deletion vectors, but pyarrow can't filter parquet files by row indices (pretty basic if you ask me). If you also would like xdlake to support deletion vectors, let the arrow folks know by chiming in here.

Clone

You can clone a DeltaTable. This is a soft clone: no data is copied, and the new table just references the existing data. The entire version history is preserved. New writes go to the clone's location.

cloned_dt = dt.clone("the/location/of/the/clone")
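
Since table operations return new instances, writing to a clone might look like this (more_arrow_data is a placeholder for anything write accepts):

cloned_dt = dt.clone("the/location/of/the/clone")
# New files land under the clone's location; the source table is untouched.
cloned_dt = cloned_dt.write(more_arrow_data)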

Credentials

DeltaTables that reference distributed data may need credentials for various cloud locations.

To register default credentials for s3, gs, etc.:

xdlake.storage.register_default_filesystem_for_protocol("s3", s3_creds)
xdlake.storage.register_default_filesystem_for_protocol("gs", gs_creds)
xdlake.storage.register_default_filesystem_for_protocol("az", az_creds)

To register specific credentials for various prefixes:

xdlake.storage.register_filesystem("s3://bucket-doom/foo/bar", s3_creds)
xdlake.storage.register_filesystem("s3://bucket-zoom/biz/baz", other_s3_creds)
xdlake.storage.register_filesystem("az://container-blah/whiz/whaz", az_creds)
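
The shape of the credential objects passed above depends on the fsspec implementation behind each protocol. As a purely illustrative sketch, s3fs-style dicts might look like this (the key names are not defined by xdlake):

# Illustrative only: s3fs accepts key/secret or a named profile, among other options.
s3_creds = {"key": "AKIA...", "secret": "..."}
other_s3_creds = {"profile": "other-profile"}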

Links

Project home page: GitHub

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.

Gitpod Workspace

launch gitpod workspace

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xdlake-0.0.1.tar.gz (28.4 kB)

Uploaded Source

Built Distribution

xdlake-0.0.1-py3-none-any.whl (17.5 kB)

Uploaded Python 3

File details

Details for the file xdlake-0.0.1.tar.gz.

File metadata

  • Download URL: xdlake-0.0.1.tar.gz
  • Upload date:
  • Size: 28.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for xdlake-0.0.1.tar.gz

  • SHA256: e67e21e308b3afffe2ca6ab39fb14bb27729af7dd4fd510aeace55f6c45cb985
  • MD5: 939de420222f80d72c0a46a36168281e
  • BLAKE2b-256: e9ad46baf5e7bb66e5e25ae7a9efebee27dda48779214a30ede8f91b1da29695

See more details on using hashes here.

File details

Details for the file xdlake-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: xdlake-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for xdlake-0.0.1-py3-none-any.whl

  • SHA256: 83b2438c6ef3649f368b31d6eef9117836f53c357a28632b48f006228a1673f4
  • MD5: 579541d30f1189aff8274bdd06791e3e
  • BLAKE2b-256: b66bedbec6ffd1acb6ff4eda4ea733c91ecf6d83100d47b5bf8d888bd1239c9a

See more details on using hashes here.
