

Daft dataframes can load any data, such as PDF documents, images, Protobuf messages, CSV files, Parquet files, and audio files, into a tabular dataframe structure for easy querying.


Website | Docs | Installation | 10-minute tour of Daft | Community and Support

Daft: Distributed dataframes for multimodal data

Daft is a distributed query engine for large-scale data processing in Python and is implemented in Rust.

  • Familiar interactive API: Lazy Python Dataframe for rapid and interactive iteration

  • Focus on the what: Powerful Query Optimizer that rewrites queries to be as efficient as possible

  • Data Catalog integrations: Full integration with data catalogs such as Apache Iceberg

  • Rich multimodal type-system: Supports multimodal types such as Images, URLs, Tensors and more

  • Seamless Interchange: Built on the Apache Arrow In-Memory Format

  • Built for the cloud: Record-setting I/O performance for integrations with S3 cloud storage


About Daft

Daft was designed with the following principles in mind:

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft.

  2. Interactive Computing: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching and query optimization accelerate your experimentation and data exploration.

  3. Distributed Computing: Some workloads can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
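For the third principle, switching Daft from its local runner to a Ray cluster is a small configuration change. A sketch (the address below is a placeholder, and the Ray extra must be installed):

```python
import daft

# Point Daft at an existing Ray cluster's head node (placeholder address);
# calling set_runner_ray() with no address starts a local Ray instance instead.
daft.context.set_runner_ray(address="ray://127.0.0.1:10001")

# Dataframes created after this point execute on the Ray cluster.
```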

Getting Started

Installation

Install Daft with pip install getdaft.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
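For example (the extra names below are taken as assumptions; check the Installation Guide for the current list):

```shell
# Basic installation
pip install getdaft

# With optional extras, e.g. Ray and AWS support (assumed extra names)
pip install "getdaft[ray,aws]"
```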

Quickstart

Check out our 10-minute quickstart!

In this example, we list file URLs from an AWS S3 bucket, download the images, and resize each one in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image to 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)


Benchmarks

Benchmark results for TPC-H at scale factor 100 (SF100).

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities including dataloading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more.

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft

Contributing

To start contributing to Daft, please read CONTRIBUTING.md

Here’s a list of good first issues to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
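The variable can also be set from Python, as long as it happens before Daft is imported:

```python
import os

# Opt out of Daft's anonymous analytics; this must be set before
# `import daft` for it to take effect in the process.
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"
```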

The data that we collect is:

  1. Non-identifiable: events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: we do not collect any of our users’ proprietary code or data

  3. For development only: we do not buy or sell any user data

Please see our documentation for more details.

License

Daft has an Apache 2.0 license - please see the LICENSE file.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getdaft-0.2.23.tar.gz (1.6 MB, Source)

Built Distributions

getdaft-0.2.23-cp37-abi3-win_amd64.whl (17.5 MB, CPython 3.7+, Windows x86-64)

getdaft-0.2.23-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.9 MB, CPython 3.7+, manylinux: glibc 2.17+ x86-64)

getdaft-0.2.23-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (19.0 MB, CPython 3.7+, manylinux: glibc 2.17+ ARM64)

getdaft-0.2.23-cp37-abi3-macosx_11_0_arm64.whl (15.9 MB, CPython 3.7+, macOS 11.0+ ARM64)

getdaft-0.2.23-cp37-abi3-macosx_10_12_x86_64.whl (17.2 MB, CPython 3.7+, macOS 10.12+ x86-64)

File details

getdaft-0.2.23.tar.gz (Source, 1.6 MB; uploaded via twine/5.0.0 on CPython/3.11.9; Trusted Publishing: No)

Algorithm   Hash digest
SHA256      c2d66e6a4ce75aeb4cedbe2c04c18fa8f3f7dcfe2799f66211f36c7be2f835a5
MD5         592f98958bbc8dfba3dde9cfa93d1e26
BLAKE2b-256 3de6242dab31ab28fa86514984c7e5c381d78ad9c37c7291ea4594defe724fec

getdaft-0.2.23-cp37-abi3-win_amd64.whl (17.5 MB; uploaded via twine/5.0.0 on CPython/3.11.9; Trusted Publishing: No)

Algorithm   Hash digest
SHA256      533b78abefa738cac97a6823ef2b8f2df3300bf2d4bda4e8336371fc2780bbb9
MD5         aa72d57d6ff642fb5ec3e181b9b52696
BLAKE2b-256 e4423372bf612083b24ae1206679af1637c3e5ce0d96bdbd665e8aefda56b109

getdaft-0.2.23-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm   Hash digest
SHA256      9bfa567569a8b53e9b0a7ab3eb0044afe8d5499d995bfeb40bd867661bfa2aa7
MD5         e65bbe227c348f03cd7815c48c82a745
BLAKE2b-256 942ae7a4b0ed0d1256604c92962b5faa33e149a08f751d919580a2b2b91379ee

getdaft-0.2.23-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

Algorithm   Hash digest
SHA256      0d6f4dbb7f3b5d62f8df1006bf55cc657148c2a3962766e62fbd3c2df337fa32
MD5         bc9ed7a645fc0a525e82b7eff62d9f85
BLAKE2b-256 85208b89181a2a18b980233ba45bc2adbf89bcbcb6837bccb19e08ec4cb3c643

getdaft-0.2.23-cp37-abi3-macosx_11_0_arm64.whl

Algorithm   Hash digest
SHA256      dfaf492bb453675999d70626a8fdb6d4ecaecafbf4a0548e68105757a7a4025a
MD5         b810cc1c33a9088aef438c0eaaddee32
BLAKE2b-256 111ba81b9a0b1a5cb9490762092c68b919ae5fb8284341fa186ae1f25273d234

getdaft-0.2.23-cp37-abi3-macosx_10_12_x86_64.whl

Algorithm   Hash digest
SHA256      a59f6084ca865528b26ed478d584f98c102500005314dbc7fc44b7c4b3e18d49
MD5         7f4b4c1da57f2c8b82aa8b07e230df22
BLAKE2b-256 52f59c4ad1331a6195167fbcae54f08f41be4739095b3176bc2fc9b0787f0438

See more details on using hashes here.
