
A Distributed DataFrame library for large scale complex data processing.

Project description

Daft dataframes can load any data, such as PDF documents, images, protobufs, CSV, Parquet, and audio files, into a tabular dataframe structure for easy querying.


Website | Docs | Installation | 10-minute tour of Daft | Community and Support

Daft: the distributed Python dataframe for complex data

Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.

Daft is currently in its Alpha release phase - expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forum.


About Daft

The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.

  1. Any Data: Columns can contain any Python objects, which means that the Python libraries you already use for running machine learning or custom data processing will work natively with Daft!

  2. Notebook Computing: Daft is built for an interactive developer experience in notebooks - intelligent caching and query optimization accelerate your experimentation and data exploration.

  3. Distributed Computing: Rich complex formats such as images can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started

Installation

Install Daft with pip install getdaft.

Quickstart

Check out our full quickstart tutorial!

In this example, we load images from an AWS S3 bucket and run a simple function to generate thumbnails for each image:

import io

from PIL import Image

from daft import DataFrame, lit

def get_thumbnail(img: Image.Image) -> Image.Image:
    """Simple function to make an image thumbnail"""
    imgcopy = img.copy()
    imgcopy.thumbnail((48, 48))
    return imgcopy

# Load a dataframe from files in an S3 bucket
df = DataFrame.from_files("s3://daft-public-data/laion-sample-images/*")

# Get the AWS S3 url of each image
df = df.select(lit("s3://").str.concat(df["name"]).alias("s3_url"))

# Download images and load as a PIL Image object
df = df.with_column("image", df["s3_url"].url.download().apply(lambda data: Image.open(io.BytesIO(data))))

# Generate thumbnails from images
df = df.with_column("thumbnail", df["image"].apply(get_thumbnail))

df.show(3)

Dataframe code to load a folder of images from AWS S3 and create thumbnails
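The thumbnail step above relies on PIL's Image.thumbnail, which shrinks an image in place so it fits within the given bounding box while preserving its aspect ratio. The size arithmetic involved can be sketched in plain Python (thumbnail_size is an illustrative helper, not part of Daft or PIL, and its rounding may differ slightly from PIL's):

```python
def thumbnail_size(width: int, height: int, max_w: int = 48, max_h: int = 48) -> tuple:
    """Compute target dimensions that fit inside a max_w x max_h box,
    preserving aspect ratio and never upscaling."""
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))

# A 640x480 image scaled into a 48x48 box keeps its 4:3 aspect ratio.
print(thumbnail_size(640, 480))  # -> (48, 36)
# Images already smaller than the box are left untouched.
print(thumbnail_size(32, 20))    # -> (32, 20)
```

Because both dimensions are scaled by the same factor, the smaller of the two width/height ratios decides the output size, which is why the 640x480 example is bounded by its width.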

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities, including data loading from URLs, joins, user-defined functions (UDFs), groupby, aggregations and more.

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft

License

Daft has an Apache 2.0 license - please see the LICENSE file.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getdaft-0.0.20.tar.gz (594.3 kB)

Uploaded Source

Built Distributions

getdaft-0.0.20-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ x86-64

getdaft-0.0.20-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.5 MB)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ARM64

getdaft-0.0.20-cp37-abi3-macosx_11_0_arm64.whl (582.6 kB)

Uploaded CPython 3.7+ macOS 11.0+ ARM64

getdaft-0.0.20-cp37-abi3-macosx_10_7_x86_64.whl (603.7 kB)

Uploaded CPython 3.7+ macOS 10.7+ x86-64

File details

Details for the file getdaft-0.0.20.tar.gz.

File metadata

  • Download URL: getdaft-0.0.20.tar.gz
  • Upload date:
  • Size: 594.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.15

File hashes

Hashes for getdaft-0.0.20.tar.gz

  • SHA256: dae9e7360c5faf4a7fa99ee686671fa1bed03362834380437725ba55dc34c378
  • MD5: 07e0748b7b7df2d5ccbfc20913af8c22
  • BLAKE2b-256: bd6cd3e0edc844532f022ecea3e33f170f8b08a4048db5acdf0e83b2cb302e2b

See more details on using hashes here.
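A downloaded artifact can be checked against the digests listed above before installing it. A minimal sketch using Python's standard hashlib module (sha256_of is an illustrative helper, not part of any packaging tool):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks
    so large archives never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the result against the SHA256 published for the release, e.g.:
# sha256_of("getdaft-0.0.20.tar.gz") should equal the digest listed above.
```

pip's hash-checking mode (hashes pinned with --hash entries in a requirements file) performs the same comparison automatically at install time.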

File details

Details for the file getdaft-0.0.20-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.


File hashes

Hashes for getdaft-0.0.20-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: df2cce724e8bf2eddaffd11f67f1e0f90dc55d04139fee10a9f294bb8784558b
  • MD5: c7e0884b513ff8bfc59ad9e033cca920
  • BLAKE2b-256: a1f9f75b7cca4cda460ed743049024c11fc2d266641a6e2e1679b956193ba812


File details

Details for the file getdaft-0.0.20-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.


File hashes

Hashes for getdaft-0.0.20-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: cc778efb4028374957143515d4f7c9c36d5f1ee710e0dd6beb99266d3b4ca91b
  • MD5: e608bcb6722a4f6bc8b9797e8db6c8ee
  • BLAKE2b-256: b7a5ac885ac3371621550e2a525f9087eda578ece8e567bd5a6635bbe0a39e36


File details

Details for the file getdaft-0.0.20-cp37-abi3-macosx_11_0_arm64.whl.


File hashes

Hashes for getdaft-0.0.20-cp37-abi3-macosx_11_0_arm64.whl

  • SHA256: eb5d91e82342e1983e0472537e3a7bd4f9f0b5d063f9d5a3893c39df27111cf6
  • MD5: 48665993d9093033b4fbc8370f8b7516
  • BLAKE2b-256: 890fe1fc986dd973a9a6d4beca93100b65e87a3096f4310c01243bd78d55296f


File details

Details for the file getdaft-0.0.20-cp37-abi3-macosx_10_7_x86_64.whl.


File hashes

Hashes for getdaft-0.0.20-cp37-abi3-macosx_10_7_x86_64.whl

  • SHA256: 3e3f6a4515f262947b217b8ea5ca912457c166e0ce9daa8068eebbb008d78ccb
  • MD5: 6c62f4f3d3bc5fbc5d1e6670efc06604
  • BLAKE2b-256: 453b5cb07ae0bbd95eef372f7fa79f1d0b2437ed8ba686c6da0910801c3547ed

