
Composable data loading modules for PyTorch


TorchData (see note below on current status)

Why TorchData? | Install guide | What are DataPipes? | Beta Usage and Feedback | Contributing | Future Plans

:warning: As of July 2023, we have paused active development on TorchData and new releases. We have learnt a lot from building it and hearing from users, but we also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

torchdata is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

This library introduces composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. These built-in DataPipes provide the functionality needed to reproduce many different datasets in TorchVision and TorchText, including loading files (locally or from the cloud), parsing, caching, transforming, filtering, and many other utilities. To understand the basic structure of DataPipes, please see What are DataPipes? below, and to see how DataPipes can be practically composed together into datasets, please see our examples.

On top of DataPipes, this library provides a new DataLoader2 that allows the execution of these data pipelines in various settings and execution backends (ReadingService). You can learn more in the full DataLoader2 documentation. Additional features, such as checkpointing and advanced control of randomness and determinism, are a work in progress.

Note that because many features of the original DataLoader have been modularized into DataPipes, their source code lives as standard DataPipes in pytorch/pytorch rather than torchdata, to preserve backward compatibility and functional parity within torch. Regardless, you can still access them by importing them from torchdata.
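As a small sketch of that equivalence (IterableWrapper is just one example of a built-in DataPipe; the aliasing is only for illustration):

from torch.utils.data.datapipes.iter import IterableWrapper as TorchIterableWrapper
from torchdata.datapipes.iter import IterableWrapper

# Both import paths resolve to the same built-in DataPipe; importing from
# torchdata is simply the recommended entry point when using this library.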

Why composable data loading?

Over many years of feedback and organic community usage of the PyTorch DataLoader and Dataset, we've found that:

  1. The original DataLoader bundled too many features together, making them difficult to extend, manipulate, or replace. This has created a proliferation of use-case specific DataLoader variants in the community rather than an ecosystem of interoperable elements.
  2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these commonly used elements.

These reasons inspired the creation of DataPipe and DataLoader2, with a goal to make data loading components more flexible and reusable.

Installation

Version Compatibility

The following table lists the corresponding torch and torchdata versions and the supported Python versions.

torch             torchdata        python
master / nightly  main / nightly   >=3.8, <=3.11
2.0.0             0.6.0            >=3.8, <=3.11
1.13.1            0.5.1            >=3.7, <=3.10
1.12.1            0.4.1            >=3.7, <=3.10
1.12.0            0.4.0            >=3.7, <=3.10
1.11.0            0.3.0            >=3.7, <=3.10

Colab

Follow the instructions in this Colab notebook. The notebook also contains a simple usage example.

Local pip or conda

First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Install torchdata:

Using pip:

pip install torchdata

Using conda:

conda install -c pytorch torchdata

You can then proceed to run our examples, such as the IMDb one.

From source

pip install .

If you'd like to include the S3 IO datapipes and aws-sdk-cpp, you may also follow the instructions here.

In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.

From nightly

The nightly version of TorchData is also provided and updated daily from the main branch.

Using pip:

pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Using conda:

conda install torchdata -c pytorch-nightly

What are DataPipes?

Early on, we observed widespread confusion between PyTorch Datasets that represented reusable loading tooling (e.g. TorchVision's ImageFolder) and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

import json

from torchdata.datapipes.iter import IterDataPipe


class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        # Consume (file_name, stream) pairs from the upstream DataPipe and
        # yield (file_name, deserialized JSON) pairs downstream.
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        return len(self.source_datapipe)

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
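As a rough sketch of such a chain (the "data" directory is hypothetical; parse_json_files is the functional form registered for the JsonParser above):

from torchdata.datapipes.iter import FileLister, FileOpener

# List JSON files under a (hypothetical) "data" directory, open each as a binary
# stream, and parse the streams with the JsonParser shown above. Each step wraps
# the previous DataPipe, so the whole chain is evaluated lazily during iteration.
datapipe = FileLister(root="data", masks="*.json")
datapipe = FileOpener(datapipe, mode="b")
datapipe = datapipe.parse_json_files()

for file_name, parsed in datapipe:
    print(file_name, type(parsed))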

Under this naming convention, Dataset simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes. Note that the vast majority of built-in features are implemented as IterDataPipes; we encourage using the built-in IterDataPipes as much as possible and converting them to MapDataPipes only when necessary.

DataLoader2

A new, light-weight DataLoader2 is introduced to decouple the overloaded data-manipulation functionality from torch.utils.data.DataLoader and move it into DataPipe operations. In addition, certain features can only be achieved with DataLoader2, such as checkpointing/snapshotting and switching backend services to perform high-performance operations.
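A minimal sketch of handing a DataPipe graph to DataLoader2 with a multiprocessing ReadingService (the pipeline itself is a toy example):

from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

# Toy pipeline: shuffle, shard across workers, then batch.
datapipe = IterableWrapper(range(10)).shuffle().sharding_filter().batch(2)

# The ReadingService decides how the graph is executed; this one spawns worker processes.
rs = MultiProcessingReadingService(num_workers=2)
dl = DataLoader2(datapipe, reading_service=rs)

for batch in dl:
    print(batch)

dl.shutdown()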

Please read the full documentation here.

Tutorial

A tutorial of this library is available here on the documentation site. It covers four topics: using DataPipes, working with DataLoader, implementing DataPipes, and working with Cloud Storage Providers.

There is also a tutorial available on how to work with the new DataLoader2.

Usage Examples

We provide a simple usage example in this Colab notebook. It can also be downloaded and executed locally as a Jupyter notebook.

In addition, there are several data loading implementations of popular datasets across different research domains that use DataPipes. You can find a few selected examples here.

Frequently Asked Questions (FAQ)

What should I do if the existing set of DataPipes does not do what I need?

You can implement your own custom DataPipe. If you believe your use case is common enough such that the community can benefit from having your custom DataPipe added to this library, feel free to open a GitHub issue. We will be happy to discuss!

What happens when the Shuffler DataPipe is used with DataLoader?

In order to enable shuffling, you need to add a Shuffler to your DataPipe line. Then, by default, shuffling will happen at the point where you specify it, as long as you do not set shuffle=False within DataLoader.
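For example, a minimal sketch (the pipeline is illustrative only):

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Shuffling happens inside the pipeline at the point where .shuffle() is applied;
# DataLoader's shuffle argument only enables or disables it (shuffle=False turns it off).
datapipe = IterableWrapper(range(100)).shuffle().sharding_filter()
dl = DataLoader(datapipe, batch_size=4, num_workers=2)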

What happens when the Batcher DataPipe is used with DataLoader?

If you choose to use Batcher while setting batch_size > 1 for DataLoader, your samples will be batched more than once. You should choose one or the other.
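A sketch of the two mutually exclusive options, using toy data:

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

datapipe = IterableWrapper(range(8))

# Option 1: batch inside the pipeline and disable DataLoader's automatic batching.
dl_pipeline_batched = DataLoader(datapipe.batch(4), batch_size=None)

# Option 2: leave the pipeline unbatched and let DataLoader do the batching.
dl_loader_batched = DataLoader(datapipe, batch_size=4)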

Why are there fewer built-in MapDataPipes than IterDataPipes?

By design, there are fewer MapDataPipes than IterDataPipes to avoid duplicating the same functionality across both. We encourage users to rely on the built-in IterDataPipes for most functionality and convert them to MapDataPipes as needed.
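For instance, assuming the source IterDataPipe yields (key, value) pairs, a rough sketch of the conversion:

from torchdata.datapipes.iter import IterableWrapper

# Build with IterDataPipes first, then convert to a MapDataPipe only when
# index-based access is actually needed (to_map_datapipe expects key/value pairs).
source = IterableWrapper([(i, i * i) for i in range(5)])
map_dp = source.to_map_datapipe()
print(map_dp[3])  # 9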

How is multiprocessing handled with DataPipes?

Multi-process data loading is still handled by the DataLoader; see the DataLoader documentation for more details. As of PyTorch version >= 1.12.0 (TorchData version >= 0.4.0), data sharding is automatically done for DataPipes within the DataLoader as long as a ShardingFilter DataPipe exists in your pipeline. Please see the tutorial for an example.
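A minimal sketch of a pipeline that shards across DataLoader workers (the transformation is illustrative; a named function is used so it pickles cleanly):

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def double(x):
    return x * 2

# With sharding_filter (ShardingFilter) in the graph, each DataLoader worker
# automatically receives a disjoint shard of the data instead of a full copy.
datapipe = IterableWrapper(range(100)).shuffle().sharding_filter().map(double)
dl = DataLoader(datapipe, batch_size=10, num_workers=2)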

What is the upcoming plan for DataLoader?

DataLoader2 is in the prototype phase and more features are actively being developed. Please see the README file in torchdata/dataloader2. If you would like to experiment with it (or other prototype features), we encourage you to install the nightly version of this library.

Why is there an error saying the specified DLL could not be found when importing portalocker?

This only happens when running torchdata on Windows and is a known issue with pywin32. You can find the reason and the solution in the link.

Contributing

We welcome PRs! See the CONTRIBUTING file.

Beta Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

License

TorchData is BSD licensed, as found in the LICENSE file.



