A distributed DataFrame library for large-scale complex data processing.

Project description

Daft dataframes can load any data, such as PDF documents, images, protobufs, CSV, Parquet and audio files, into a tabular dataframe structure for easy querying.

Website • Docs • Installation • 10-minute tour of Daft • Community and Support

Daft: the distributed Python dataframe for complex data

Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.

Daft is currently in its Beta release phase, so please expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forums.

About Daft

The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex multimodal data such as Images, Embeddings and Python objects. Ingestion and basic transformations of complex data are easy and performant in Daft.

  2. Notebook Computing: Daft is built for the interactive developer experience on a notebook - intelligent caching and query optimizations accelerate your experimentation and data exploration.

  3. Distributed Computing: Rich complex formats such as images can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started

Installation

Install Daft with pip install getdaft.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
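Note that the PyPI distribution is named getdaft while the module you import is daft. As a minimal sketch (the helper name installed_version is hypothetical and not part of Daft), the following uses only the standard library to check whether the distribution is present in the current environment:

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "getdaft" is the distribution name on PyPI; the import name is "daft".
version = installed_version("getdaft")
if version is None:
    print("getdaft is not installed; run: pip install getdaft")
else:
    print(f"getdaft {version} is installed")
```

This only inspects package metadata, so it does not import Daft itself.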

Quickstart

Check out our 10-minute quickstart!

In this example, we load images from an AWS S3 bucket’s URLs and resize each image in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image into 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)

Dataframe code to load a folder of images from AWS S3 and create thumbnails

Benchmarks

Benchmarks for SF100 TPCH

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities including data loading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more.

  • User Guide - take a deep-dive into each topic within Daft.

  • API Reference - API reference for public classes/functions of Daft.

Contributing

To start contributing to Daft, please read CONTRIBUTING.md.

Telemetry

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0

The data that we collect is:

  1. Non-identifiable: events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: we do not collect any of our users’ proprietary code or data

  3. For development only: we do not buy or sell any user data

Please see our documentation for more details.
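The opt-out described above can also be applied from Python rather than from the shell. This is a sketch under one assumption: that Daft reads DAFT_ANALYTICS_ENABLED when it is imported, which is suggested by the session ID being generated on import, so the variable is set before the import statement.

```python
import os

# Set the opt-out before importing Daft. That Daft reads this variable at
# import time is an assumption; setting it first is the safe ordering.
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

# import daft  # uncomment once getdaft is installed; import only after the line above

print("DAFT_ANALYTICS_ENABLED =", os.environ["DAFT_ANALYTICS_ENABLED"])
```

Setting the variable in the shell or in your deployment configuration has the same effect and avoids depending on import order.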

License

Daft has an Apache 2.0 license - please see the LICENSE file.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • getdaft-0.2.4.tar.gz (804.5 kB): Source

Built Distributions

  • getdaft-0.2.4-cp37-abi3-win_amd64.whl (16.1 MB): CPython 3.7+, Windows x86-64
  • getdaft-0.2.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.5 MB): CPython 3.7+, manylinux: glibc 2.17+ x86-64
  • getdaft-0.2.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (19.0 MB): CPython 3.7+, manylinux: glibc 2.17+ ARM64
  • getdaft-0.2.4-cp37-abi3-macosx_11_0_arm64.whl (15.3 MB): CPython 3.7+, macOS 11.0+ ARM64
  • getdaft-0.2.4-cp37-abi3-macosx_10_7_x86_64.whl (16.7 MB): CPython 3.7+, macOS 10.7+ x86-64
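The labels next to each built distribution are derived from the wheel filename, which follows the standard layout distribution-version-python tag-abi tag-platform tag. Here cp37 combined with abi3 (the CPython stable ABI) is what yields "CPython 3.7+". A minimal sketch of splitting a filename into those parts is shown below; the names WheelTags and parse_wheel_filename are hypothetical helpers, not part of Daft or pip.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag may sit between the version and the python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelTags(*parts)


tags = parse_wheel_filename(
    "getdaft-0.2.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(tags.python_tag, tags.abi_tag)
# A platform tag can list several compatible platforms separated by dots.
print(tags.platform_tag.split("."))
```

In practice pip selects a compatible wheel automatically, so this is mainly useful for understanding the listing when downloading a file by hand.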

File details

Details for the file getdaft-0.2.4.tar.gz.

File metadata

  • Download URL: getdaft-0.2.4.tar.gz
  • Upload date:
  • Size: 804.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for getdaft-0.2.4.tar.gz
Algorithm Hash digest
SHA256 4cfb4ab0eca5cc176194df18b7519226fd8e4f82a190aac5895b4fd60c918c2c
MD5 b2f61e59debc114245248a480c1e80f0
BLAKE2b-256 ddabf85ab88dd1a6b9fd753b8e64ce379e6716a649873c251b73f540aadfd6d5

See more details on using hashes here.
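As an illustration of how the digests listed above can be used, the sketch below computes the SHA256 of a downloaded file and compares it with the published value for getdaft-0.2.4.tar.gz. The local path and the helper names sha256_of and matches_digest are assumptions made for this example.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest published above for getdaft-0.2.4.tar.gz.
EXPECTED_SHA256 = "4cfb4ab0eca5cc176194df18b7519226fd8e4f82a190aac5895b4fd60c918c2c"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Assumed download location: the current working directory.
archive = Path("getdaft-0.2.4.tar.gz")
if archive.exists():
    print("OK" if matches_digest(archive, EXPECTED_SHA256) else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

The same check applies to any of the wheels by substituting the filename and its SHA256 value from the corresponding section.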

File details

Details for the file getdaft-0.2.4-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: getdaft-0.2.4-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 16.1 MB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for getdaft-0.2.4-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 71ee3a6f6970d009f257b54e95e0ea4582fbf64fad258d8333be607baccfc6e7
MD5 f43c9646f19dcd20842a8cf3bbd4e108
BLAKE2b-256 b4e47c283cbf8972f1578718dd6fe3c1e60eb751b7330cb1059c80d1e0b65c52

File details

Details for the file getdaft-0.2.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for getdaft-0.2.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 34d1d21c669365226ba7b42b793d5dc22fd08bdf2b75d65588db58c9e64b6721
MD5 698224d5e2846d6b85486740b7284adb
BLAKE2b-256 8f74168fe1d88aff44efdafbc865c9caff584afbad347846cec750805066fa4b

File details

Details for the file getdaft-0.2.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for getdaft-0.2.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 a317e162ebf1ac20d889dbcc89e4e96809a19afc3d7f443f91a7c15de71d1182
MD5 9e5a052934ecdd86f02f77cc36f87f53
BLAKE2b-256 de70d8ebe2d12d54671b99a951c3ac86d95cf0dc72fd3c610f3df6cfbf3b131d

File details

Details for the file getdaft-0.2.4-cp37-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.2.4-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cca2b79faa7041e8404d196ca6147401a58d984b013023ce445726188bcb57cc
MD5 7ec5a87509ca13658e09843aa6d4f30c
BLAKE2b-256 3317a01c1b74149b029a31edbae71da9b36a90bfba9997f2ed5e48666360c22b

File details

Details for the file getdaft-0.2.4-cp37-abi3-macosx_10_7_x86_64.whl.

File hashes

Hashes for getdaft-0.2.4-cp37-abi3-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 0c109dddff72b0e849753163805a53ae1495808e59b56619b990ea8affedf448
MD5 672a641dd02c2967dfca346783924a97
BLAKE2b-256 2264b50e9bbf8b67fb8b132772274cfa263b47a48d0c8885c93abd1b9eabc6d4
