DeltaTorch allows loading training data from DeltaLake tables for training Deep Learning models using PyTorch

deltatorch

Concept

deltatorch lets users use DeltaLake tables directly as a data source for PyTorch training. With deltatorch, users can create a PyTorch DataLoader that loads the training data from a table. We also support distributed training with PyTorch DDP.

Usage

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install git+https://github.com/mshtelma/deltatorch

Create PyTorch DataLoader to read our DeltaLake table

To use deltatorch, we first need a DeltaLake table containing the data we would like to use for training our PyTorch deep learning model. There is one requirement: this table must have an autoincrement ID field, which deltatorch uses for sharding and for parallelizing data loading. Given such a table, the create_pytorch_dataloader function creates a PyTorch DataLoader that can be used directly during training. Below is an example of creating a DataLoader for the following table schema:

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 
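
Because the loader relies on the id column for sharding, it can be worth checking that column before training. The helper below is a minimal sketch and not part of deltatorch: check_autoincrement_ids is a hypothetical name, and it assumes the ids should be unique, consecutive integers, which is one reading of "autoincrement" rather than a documented requirement.

```python
from typing import Iterable, Tuple


def check_autoincrement_ids(ids: Iterable[int]) -> Tuple[int, int]:
    """Return (first_id, last_id) if the ids are unique and consecutive.

    Hypothetical helper, not a deltatorch function. Assumes the id
    column is meant to hold consecutive integers with no gaps.
    """
    values = sorted(ids)
    if not values:
        raise ValueError("id column is empty")
    first, last = values[0], values[-1]
    if len(set(values)) != len(values):
        raise ValueError("id column contains duplicates")
    if last - first + 1 != len(values):
        raise ValueError("id column has gaps")
    return first, last
```

For a large table it is usually cheaper to compare the minimum, maximum, row count, and distinct count of the id column in the query engine than to collect every id into Python.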

After the table is ready, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader:

from deltatorch import create_pytorch_dataloader

def create_data_loader(path: str, length: int, batch_size: int):
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Length of the table. Can be easily pre-calculated using spark.read.load(path).count()
        length,
        # Field used as a source (X)
        src_field="image",
        # Target field (Y)
        target_field="label",
        # Autoincrement ID field
        id_field="id",
        # Load image using Pillow
        load_pil=True,
        # Number of readers 
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size        
        batch_size=batch_size,
    )
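
To illustrate why the id field and the table length matter when several readers are used, here is a sketch of one way to split an id range into contiguous shards, one per reader. It only illustrates the general idea: shard_id_range is a hypothetical function, the ids are assumed to run from 0 to length - 1, and deltatorch's actual sharding logic may differ.

```python
from typing import List, Tuple


def shard_id_range(length: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split ids 0..length-1 into num_shards half-open [start, end) ranges.

    Hypothetical illustration, not deltatorch's implementation.
    Earlier shards get one extra id when length is not evenly divisible.
    """
    if length < 0 or num_shards <= 0:
        raise ValueError("length must be >= 0 and num_shards must be > 0")
    base, extra = divmod(length, num_shards)
    ranges = []
    start = 0
    for shard in range(num_shards):
        size = base + (1 if shard < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Two readers over a 10-row table, as with num_workers=2 above.
print(shard_id_range(10, 2))  # [(0, 5), (5, 10)]
```

Under the same assumptions, a DDP setup could treat every reader in every process as its own shard, for example by using world size times readers per process as the shard count.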
