Distributed Dataframes for Multimodal Data

Project description

Daft dataframes can load any data, such as PDF documents, images, protobufs, CSV and Parquet files, and audio, into a tabular dataframe structure for easy querying.


Website | Docs | Installation | 10-minute tour of Daft | Community and Support

Daft: Distributed dataframes for multimodal data

Daft is a distributed query engine for large-scale data processing in Python and is implemented in Rust.

  • Familiar interactive API: Lazy Python Dataframe for rapid and interactive iteration (see the sketch after this list)

  • Focus on the what: Powerful Query Optimizer that rewrites queries to be as efficient as possible

  • Data Catalog integrations: Full integration with data catalogs such as Apache Iceberg

  • Rich multimodal type-system: Supports multimodal types such as Images, URLs, Tensors and more

  • Seamless Interchange: Built on the Apache Arrow In-Memory Format

  • Built for the cloud: Record-setting I/O performance for integrations with S3 cloud storage
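
As a small illustration of the first two points, here is a minimal sketch (the S3 path is hypothetical): each call only extends a logical plan, explain() prints the plan the optimizer produces, and no data is read until collect():

import daft

# Nothing is read yet: each call only extends the logical plan
df = daft.read_parquet("s3://my-bucket/data.parquet")  # hypothetical path
df = df.where(df["score"] > 0.5).select("id", "score")

df.explain()  # inspect the optimized query plan
df.collect()  # execution happens only here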

About Daft

Daft was designed with the following principles in mind:

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft (see the sketch after this list).

  2. Interactive Computing: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching and query optimizations accelerate your experimentation and data exploration.

  3. Distributed Computing: Some workloads can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
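
To make points 1 and 3 concrete, here is a minimal sketch: a column of arbitrary Python objects created with daft.from_pydict, plus the switch to the Ray runner (shown commented out, and assuming Ray is installed):

import daft

df = daft.from_pydict({
    "id": [1, 2, 3],
    # Heterogeneous Python objects next to primitive columns; assumed to be
    # inferred as a Python-object column backed by Daft's type system
    "payload": [{"a": 1}, [1, 2, 3], ("x", "y")],
})

# To run the same dataframe code on a Ray cluster instead of locally:
# daft.context.set_runner_ray()  # optionally pass a Ray cluster address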

Getting Started

Installation

Install Daft with pip install getdaft.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
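
For example, extras exist for the Ray runner and AWS utilities; the extra names below follow the Installation Guide (treat them as assumptions if your version differs):

pip install "getdaft[ray]"  # Ray runner for distributed execution
pip install "getdaft[aws]"  # AWS utilities such as S3 integrations
pip install "getdaft[all]"  # all optional dependencies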

Quickstart

Check out our 10-minute quickstart!

In this example, we load images from URLs in an AWS S3 bucket and resize each image in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image to 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)
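
To persist the result rather than just display it, the dataframe can be written back out, for example as Parquet (a sketch; the output bucket is hypothetical):

# Write the dataframe, including the resized images, out as Parquet
df.write_parquet("s3://my-bucket/thumbnails/")  # hypothetical output location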


Benchmarks

[Figure: benchmark results for TPC-H at scale factor 100 (SF100)]

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities including data loading from URLs, joins, user-defined functions (UDFs), groupby, aggregations and more (a brief UDF sketch follows this list).

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft
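
As a small taste of UDFs before diving into those resources, here is a minimal sketch using the @daft.udf decorator (assuming the Series-in, column-out calling convention; see the 10-minute tour for the full treatment):

import daft

@daft.udf(return_dtype=daft.DataType.int64())
def plus_one(col: daft.Series):
    # UDFs receive whole columns as daft.Series and return a column of results
    return [v + 1 for v in col.to_pylist()]

df = daft.from_pydict({"x": [1, 2, 3]})
df = df.with_column("y", plus_one(df["x"]))
df.show()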

Contributing

To start contributing to Daft, please read CONTRIBUTING.md.

Here’s a list of good first issues to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
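
For example, the variable can be exported in the shell before running Python, or set programmatically; it must be in place before Daft is imported (an assumption consistent with the session ID being generated at import time, as noted below):

import os

os.environ["DAFT_ANALYTICS_ENABLED"] = "0"  # must be set before the import below

import daft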

The data that we collect is:

  1. Non-identifiable: events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: we do not collect any of our users’ proprietary code or data

  3. For development only: we do not buy or sell any user data

Please see our documentation for more details.


License

Daft has an Apache 2.0 license - please see the LICENSE file.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • getdaft-0.2.33.tar.gz (3.4 MB, Source)

Built Distributions

  • getdaft-0.2.33-cp38-abi3-win_amd64.whl (25.5 MB, CPython 3.8+, Windows x86-64)

  • getdaft-0.2.33-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB, CPython 3.8+, manylinux: glibc 2.17+ x86-64)

  • getdaft-0.2.33-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (26.9 MB, CPython 3.8+, manylinux: glibc 2.17+ ARM64)

  • getdaft-0.2.33-cp38-abi3-macosx_11_0_arm64.whl (23.4 MB, CPython 3.8+, macOS 11.0+ ARM64)

  • getdaft-0.2.33-cp38-abi3-macosx_10_12_x86_64.whl (25.0 MB, CPython 3.8+, macOS 10.12+ x86-64)

File details

getdaft-0.2.33.tar.gz

  • Size: 3.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9
  • SHA256: e397eb010b2166855b51e07e76a7fe835f7f73c1a9abca207b145700df814ef3
  • MD5: 2762dd11b2c11c6605bad898d72bbcec
  • BLAKE2b-256: ccc33a938c7416568ab3d816e57094e736cdfb1e8cf3d3817283d1a1ef92b753

getdaft-0.2.33-cp38-abi3-win_amd64.whl

  • Size: 25.5 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9
  • SHA256: f3310fe32bb044575cebb25c13ebeb5b3cfbebb55d625bc3c1f6bf0f76dec003
  • MD5: 7e3eb911cb6476b9ffea274a58b08e89
  • BLAKE2b-256: 9ac64ba88e558ff385a3fccd8eb863a8580e5b224116b618c8bad45d031b57e8

getdaft-0.2.33-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 35ab39a362578bfa3009173e8811c39f636cdda91aaccd9893088104a2dcdfd7
  • MD5: cb4564c90e55629ec8dd55e141dca09c
  • BLAKE2b-256: fca9340695d57a8daca6076a0bbde81c4a406e7afd04978cb62ba58b7ea1fc73

getdaft-0.2.33-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: d8619a695c4dacc035aa70be0231c43e7b48604ec98cce451fd5c529077b4f8b
  • MD5: fc3049a681789da7aebb4d5ced41f45d
  • BLAKE2b-256: fa326b16f2b7057df558c8ccca70486116d7436dcd7cd39449e10b6a25aa52c2

getdaft-0.2.33-cp38-abi3-macosx_11_0_arm64.whl

  • SHA256: 3b9e38df4b58b56503f6f2180dffa54bb30a01cf1f1d47650f06240e24416996
  • MD5: 0e2edbf09b3352a54b4e67b9633deb5c
  • BLAKE2b-256: 1ce95af156157d0b0d30eb13d46d020a56ad26a9dfb5e8e6c239b8aa58fa89d7

getdaft-0.2.33-cp38-abi3-macosx_10_12_x86_64.whl

  • SHA256: 3798e2dae7fa9920a12609736d98587e192091b0d9ed8659343afb360fffe732
  • MD5: 946f92f40ef13e3b3bda0e81a83b7070
  • BLAKE2b-256: 6736a0c43a8f5076926d44e82fefb0e59bcb8afe20488ef0f56fb910a3e461d0
