A Distributed DataFrame library for large scale complex data processing.

Project description

Daft dataframes can load any data, such as PDF documents, images, protobufs, CSV, Parquet and audio files, into a tabular dataframe structure for easy querying.

Daft: the distributed Python dataframe for media data

Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.

Daft is currently in its Alpha release phase. Please expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forums.

About Daft

The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich media data types such as images, audio, video and more.

  1. Any Data: Columns can contain any Python objects, which means that the Python libraries you already use for running machine learning or custom data processing will work natively with Daft!

  2. Notebook Computing: Daft is built for the interactive developer experience on a notebook - intelligent caching and query optimizations accelerate your experimentation and data exploration.

  3. Distributed Computing: Rich media formats such as images can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started

Installation

Install Daft with pip install getdaft.

Quickstart

Check out our full quickstart tutorial!

In this example, we load images from an AWS S3 bucket and run a simple function to generate thumbnails for each image:

from daft import DataFrame, lit

import io
from PIL import Image

def get_thumbnail(img: Image.Image) -> Image.Image:
    """Simple function to make an image thumbnail"""
    imgcopy = img.copy()
    imgcopy.thumbnail((48, 48))
    return imgcopy

# Load a dataframe from files in an S3 bucket
df = DataFrame.from_files("s3://daft-public-data/laion-sample-images/*")

# Get the AWS S3 url of each image
df = df.select(lit("s3://").str.concat(df["name"]).alias("s3_url"))

# Download images and load as a PIL Image object
df = df.with_column("image", df["s3_url"].url.download().apply(lambda data: Image.open(io.BytesIO(data))))

# Generate thumbnails from images
df = df.with_column("thumbnail", df["image"].apply(get_thumbnail))

df.show(3)

Dataframe code to load a folder of images from AWS S3 and create thumbnails
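The thumbnail step can be checked locally before running it against S3. The sketch below reuses get_thumbnail from the example above and mimics the download-then-decode step by round-tripping an in-memory image through PNG bytes. The 640x480 test image and the decode_image helper are illustrative assumptions, not part of Daft, and no Daft API is used here.

```python
import io

from PIL import Image


def get_thumbnail(img: Image.Image) -> Image.Image:
    """Return a copy of the image shrunk to fit within 48x48 pixels."""
    imgcopy = img.copy()
    imgcopy.thumbnail((48, 48))  # resizes in place, preserving aspect ratio
    return imgcopy


def decode_image(data: bytes) -> Image.Image:
    """Hypothetical helper: decode raw image bytes into a PIL Image."""
    return Image.open(io.BytesIO(data))


# Stand-in for a downloaded file: a 640x480 image encoded as PNG bytes.
buffer = io.BytesIO()
Image.new("RGB", (640, 480), color=(200, 30, 30)).save(buffer, format="PNG")
raw_bytes = buffer.getvalue()

image = decode_image(raw_bytes)
thumbnail = get_thumbnail(image)

print(image.size)      # (640, 480): the original is left untouched
print(thumbnail.size)  # both dimensions are at most 48
```

Because get_thumbnail works on a copy, the same "image" column can feed other derived columns without being modified.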

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities, including data loading from URLs, joins, user-defined functions (UDFs), groupby, aggregations and more.

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft

License

Daft has an Apache 2.0 license - please see the LICENSE file.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • getdaft-0.0.18.tar.gz (143.9 kB): Source

Built Distributions

  • getdaft-0.0.18-cp310-cp310-manylinux_2_17_x86_64.whl (1.7 MB): CPython 3.10, manylinux: glibc 2.17+ x86-64

  • getdaft-0.0.18-cp310-cp310-macosx_11_0_x86_64.whl (284.2 kB): CPython 3.10, macOS 11.0+ x86-64

  • getdaft-0.0.18-cp310-cp310-macosx_11_0_arm64.whl (270.7 kB): CPython 3.10, macOS 11.0+ ARM64

  • getdaft-0.0.18-cp39-cp39-manylinux_2_17_x86_64.whl (1.7 MB): CPython 3.9, manylinux: glibc 2.17+ x86-64

  • getdaft-0.0.18-cp39-cp39-macosx_11_0_x86_64.whl (284.6 kB): CPython 3.9, macOS 11.0+ x86-64

  • getdaft-0.0.18-cp39-cp39-macosx_11_0_arm64.whl (271.1 kB): CPython 3.9, macOS 11.0+ ARM64

  • getdaft-0.0.18-cp38-cp38-manylinux_2_17_x86_64.whl (1.7 MB): CPython 3.8, manylinux: glibc 2.17+ x86-64

  • getdaft-0.0.18-cp38-cp38-macosx_11_0_arm64.whl (271.0 kB): CPython 3.8, macOS 11.0+ ARM64

  • getdaft-0.0.18-cp38-cp38-macosx_10_16_x86_64.whl (284.7 kB): CPython 3.8, macOS 10.16+ x86-64

  • getdaft-0.0.18-cp37-cp37m-manylinux_2_17_x86_64.whl (1.7 MB): CPython 3.7m, manylinux: glibc 2.17+ x86-64

  • getdaft-0.0.18-cp37-cp37m-macosx_10_16_x86_64.whl (284.1 kB): CPython 3.7m, macOS 10.16+ x86-64
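The wheel filenames above follow the standard wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag), which is what determines the right file for a given platform. As a minimal sketch, the tags can be split out as follows. The WheelTags type and parse_wheel_filename function are hypothetical names introduced for illustration; in practice pip performs this selection automatically.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        # No build tag present.
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelTags(name, version, build, python_tag, abi_tag, platform_tag)


tags = parse_wheel_filename("getdaft-0.0.18-cp310-cp310-manylinux_2_17_x86_64.whl")
print(tags.python_tag, tags.abi_tag, tags.platform_tag)
# cp310 cp310 manylinux_2_17_x86_64
```

Splitting on dashes works here because the platform tag uses underscores internally, so it stays in one piece.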

File details

Details for the file getdaft-0.0.18.tar.gz.

File metadata

  • Download URL: getdaft-0.0.18.tar.gz
  • Upload date:
  • Size: 143.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.8

File hashes

Hashes for getdaft-0.0.18.tar.gz
Algorithm Hash digest
SHA256 1225c49f99c085f171eb0f0ce46d9a8eadf95b14b37de7181c4e54def62a897e
MD5 4bcb4425f6561525658b680a9fbdd7ab
BLAKE2b-256 829fcf60da7686323f02edcd3faf511c058b808dd3a531f5c8a420c8c6917057

See more details on using hashes here.
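A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. The sketch below assumes the archive has been saved to the current directory under its published name. The function names sha256_of_file and matches_published_hash are illustrative, not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for the source distribution.
EXPECTED_SHA256 = "1225c49f99c085f171eb0f0ce46d9a8eadf95b14b37de7181c4e54def62a897e"

# Assumed local path of the downloaded archive.
archive = Path("getdaft-0.0.18.tar.gz")
if archive.exists():
    print("hash OK" if matches_published_hash(archive, EXPECTED_SHA256) else "hash MISMATCH")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 143.9 kB archive but makes the helper reusable for larger wheels.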

File details

Details for the file getdaft-0.0.18-cp310-cp310-manylinux_2_17_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp310-cp310-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 bfb4ec616c7e307ad11bcbe9dbe1738c5db74b44bb9080ba4fbdeb0321f0d8d0
MD5 dac1665545388a2419eca50ecbcf4245
BLAKE2b-256 7e24d861d858fdb51bdd8e817d78c67435aa5ddc9e1db9ab80e3ddf887d72136

File details

Details for the file getdaft-0.0.18-cp310-cp310-macosx_11_0_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 3218b97dbe289c9bfaf0de0455f3ebc05d4414dcd747c0dd7be6949ff4834341
MD5 23141ce287e34c51c541c90f6c5200b9
BLAKE2b-256 cd1d1149ab1b11f25cc27f9472db81e97e966809131cae2f9377089043afd754

File details

Details for the file getdaft-0.0.18-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.0.18-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6c308dc12a470b4db1be654c6607ed59ff24cb4e7961c41d70663ace78369511
MD5 fc0411ba4bafa91f4e03d36f7a204b2c
BLAKE2b-256 64ac5304a4408207a64d8ea49a5dbbd99daec55587e9ce220010fd92398dd82d

File details

Details for the file getdaft-0.0.18-cp39-cp39-manylinux_2_17_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp39-cp39-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 838d2c1c19271389a11dee2eaea429eb0af12b3e0a971c8651d64448738857b6
MD5 707be7c0857c6856558eb3fc868e904a
BLAKE2b-256 57ca4057a7dfcda1ca3f6f3fd028bf2ff7b046846a4a7add4410fe3770e9aece

File details

Details for the file getdaft-0.0.18-cp39-cp39-macosx_11_0_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp39-cp39-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 41251a685c997f504172de8500ce7879040cd7430e1a9ec1214f17caef5ed56f
MD5 7ef2555f92c3aea0baf3f49c20bd21aa
BLAKE2b-256 244d73755dae374c83f524f5d1d25ca556d43ab95273f43376cfa4c2536bded3

File details

Details for the file getdaft-0.0.18-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.0.18-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 862003710e6bff64caa1924d0b902f53a887605807483554207a87b2e376dd9c
MD5 40e41db28e9e457abdfbb265a6feffea
BLAKE2b-256 cf0e5403556454efd14470c488861e0d314733156a8e48c22020d4a62dcf379a

File details

Details for the file getdaft-0.0.18-cp38-cp38-manylinux_2_17_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp38-cp38-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 38b9f609234e3ad136d32454e478fba6c1d311beca7c876010d33844401ffd69
MD5 13577aa6f8f9f5ad2bd96c5381d2479a
BLAKE2b-256 953ef8684583c0aa19cf9084a1757efa3be2a0a730fc3c01e711448902ec0cda

File details

Details for the file getdaft-0.0.18-cp38-cp38-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.0.18-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f1b29be379080aefc35b411b61d24fc1ea456d95e7bd38d116192a278b3e349d
MD5 59d75c5a7a27bc150a763edc4aae886d
BLAKE2b-256 e5bbf7dcaaa3135b4ad6999b1b8a2d266134ecffa86cd121a0941f3971c7cc0b

File details

Details for the file getdaft-0.0.18-cp38-cp38-macosx_10_16_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp38-cp38-macosx_10_16_x86_64.whl
Algorithm Hash digest
SHA256 908d25dffabd3b34f131c56cee6f98dd27207eb519602fe5e5c9b7d7c3aa3d56
MD5 3b72390a519ef6c0780caa76526ffc5c
BLAKE2b-256 645632557d41c194565b9f7f1114b7c5a38faa9b8f45c8ab9822764d67fc2134

File details

Details for the file getdaft-0.0.18-cp37-cp37m-manylinux_2_17_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp37-cp37m-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 641d71fe9b8a3a3576bfa1fd1c934dd096b63b8d6753e04cb2f0562e2b3c991e
MD5 ac6097ab5317c4e5e447e4c9b5ba797c
BLAKE2b-256 4de25175494b2587aa3491500d22f9ab50fa9bed881fc6e3f780cbfffe9d9b8e

File details

Details for the file getdaft-0.0.18-cp37-cp37m-macosx_10_16_x86_64.whl.

File hashes

Hashes for getdaft-0.0.18-cp37-cp37m-macosx_10_16_x86_64.whl
Algorithm Hash digest
SHA256 92b745c025b125e5606054ea68428c7035acd2d5fb19fc81700378a7dc38f64f
MD5 a43017552a4afb99938b1e64b3c62e4d
BLAKE2b-256 ca42b490836bc94d3ba94e36967c4fbe75b4cc55418ae3af8d8c3ab1550f88be
