DeltaTorch allows loading training data from DeltaLake tables for training Deep Learning models using PyTorch
deltatorch
Concept
deltatorch allows users to directly use DeltaLake tables as a data source for training using PyTorch.
Using deltatorch, users can create a PyTorch DataLoader to load the training data.
We support distributed training using PyTorch DDP as well.
Usage
Requirements
- Python Version > 3.8
- pip or conda
Installation
- with pip:
  pip install git+https://github.com/mshtelma/deltatorch
Create PyTorch DataLoader to read our DeltaLake table
To use deltatorch, we first need a DeltaLake table containing the training data for our PyTorch deep learning model.
This table must have an autoincrement ID field, which deltatorch uses for sharding and for parallelizing the loading.
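To illustrate why an autoincrement ID field makes sharding straightforward, here is a minimal sketch of splitting a table's ID range across distributed ranks and reader workers. It is not deltatorch's actual implementation: the helper names `split_id_range` and `ranges_for_rank` are hypothetical, and the sketch assumes IDs are contiguous and start at 0.

```python
from typing import List, Tuple


def split_id_range(length: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split IDs 0..length-1 into contiguous half-open ranges [start, end).

    Hypothetical helper for illustration; assumes contiguous IDs starting at 0.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(length, num_shards)
    ranges = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional row each.
        size = base + (1 if shard < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def ranges_for_rank(
    length: int, world_size: int, rank: int, num_workers: int
) -> List[Tuple[int, int]]:
    """Give one rank its slice of the table, then split it among its workers."""
    rank_start, rank_end = split_id_range(length, world_size)[rank]
    worker_ranges = split_id_range(rank_end - rank_start, num_workers)
    # Shift the worker ranges so they are relative to the whole table.
    return [(rank_start + s, rank_start + e) for s, e in worker_ranges]


if __name__ == "__main__":
    # 10 rows, 2 training processes, 2 readers per process.
    for rank in range(2):
        print(rank, ranges_for_rank(10, world_size=2, rank=rank, num_workers=2))
```

Because every reader gets a disjoint ID range, each one could filter the table on the ID field independently (for example `id >= start AND id < end`) without coordinating with the others.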
After that, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader, which can be used directly during training.
Below is an example of creating a DataLoader for the following table schema:
CREATE TABLE TRAINING_DATA
(
    image BINARY,
    label BIGINT,
    id INT
)
USING delta LOCATION 'path'
Once the table is ready, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader:
from deltatorch import create_pytorch_dataloader


def create_data_loader(path: str, length: int, batch_size: int):
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Length of the table. Can be easily pre-calculated using spark.read.load(path).count()
        length,
        # Field used as a source (X)
        src_field="image",
        # Target field (Y)
        target_field="label",
        # Autoincrement ID field
        id_field="id",
        # Load image using Pillow
        load_pil=True,
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )