DeltaTorch allows loading training data from DeltaLake tables for training Deep Learning models using PyTorch

deltatorch

Concept

deltatorch allows users to directly use DeltaLake tables as a data source for training using PyTorch. Using deltatorch, users can create a PyTorch DataLoader to load the training data. We support distributed training using PyTorch DDP as well.

Why yet another data-loading framework?

  • Many Deep Learning projects are struggling with efficient data loading, especially with tabular datasets or datasets containing many small images
  • Classical Big Data formats like Parquet can help with this issue, but are hard to operate:
    • Writers might block readers
    • A failed write can make the whole dataset unreadable
    • More complicated projects might ingest data all the time, even during training

The Delta Lake storage format addresses these issues, but PyTorch has no direct support for DeltaLake datasets. deltatorch adds that support, so DeltaLake tables can be used to train Deep Learning models with PyTorch.

Usage

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install git+https://github.com/delta-incubator/deltatorch

Create PyTorch DataLoader to read our DeltaLake table

To use deltatorch, we first need a DeltaLake table containing the data for training our PyTorch deep learning model. The table must have an autoincrement ID field, which deltatorch uses for sharding and for parallelizing loading. We can then use the create_pytorch_dataloader function to create a PyTorch DataLoader that can be used directly during training. The example below creates a DataLoader for the following table schema:

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 
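
To illustrate why an autoincrement ID field helps with sharding, here is a minimal sketch of splitting an ID range into contiguous shards, one per reader. It is not deltatorch's actual implementation; the function name split_id_range and the half-open shard layout are assumptions made for illustration only.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive id range [min_id, max_id] into contiguous shards.

    Hypothetical helper, not part of deltatorch. Each shard is returned as a
    half-open interval (start, end), so a reader would load rows with
    start <= id < end.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    total = max_id - min_id + 1
    base, extra = divmod(total, num_shards)

    shards = []
    start = min_id
    for i in range(num_shards):
        # The first `extra` shards take one additional id each.
        size = base + (1 if i < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


# Example: ids 0..9 shared by 4 readers (e.g. 2 processes with 2 workers each).
shards = split_id_range(0, 9, num_shards=4)
```

With dense, gap-free ids, every shard holds roughly the same number of rows, so readers finish at about the same time. Large gaps in the id column would make some shards much smaller than others.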

Once the table is ready, we can call create_pytorch_dataloader to create the PyTorch DataLoader:

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec

def create_data_loader(path: str, batch_size: int, transform):
    # `transform` is the PyTorch transform applied to each loaded image
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True, 
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers 
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size        
        batch_size=batch_size,
    )
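
The returned DataLoader can then be iterated during training. The sketch below shows one possible shape of such a loop. It assumes that each batch is a mapping keyed by the field names given in FieldSpec ("image" and "label"); this is an assumption rather than something stated above, so check it against the batches your loader actually yields. The names run_epoch and train_step are hypothetical.

```python
from typing import Any, Callable, Iterable, Mapping


def run_epoch(
    loader: Iterable[Mapping[str, Any]],
    train_step: Callable[[Any, Any], float],
) -> float:
    """Run train_step on every batch and return the mean loss.

    Hypothetical helper: `loader` is any iterable of batches, such as the
    DataLoader created above. `train_step` takes (images, labels), performs
    one optimization step, and returns the loss as a float.
    """
    total_loss = 0.0
    num_batches = 0
    for batch in loader:
        # Assumption: batches are mappings keyed by the FieldSpec names.
        images = batch["image"]
        labels = batch["label"]
        total_loss += train_step(images, labels)
        num_batches += 1
    # Avoid division by zero when the loader yields nothing.
    return total_loss / num_batches if num_batches else 0.0
```

Keeping the loop independent of the loader type means it can be exercised with a plain list of dictionaries before pointing it at a real DeltaLake table.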
