Text utilities and datasets for PyTorch

torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • torchtext.datasets: Pre-built loaders for common NLP datasets

Note: we are currently redesigning the torchtext library to make it more compatible with PyTorch (e.g. torch.utils.data). Several datasets have been written with the new abstractions in the torchtext.experimental folder. We also created an issue to discuss the new abstractions, and users are welcome to leave feedback there. These prototype building blocks and datasets in the experimental folder are available in the nightly release only. The nightly packages are accessible via pip and conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command:

pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

For more detailed instructions, please refer to Install PyTorch. Note that the new building blocks are still under development, and their APIs are not yet stable.

Installation

We recommend Anaconda as the Python package management system. Please refer to pytorch.org for details of PyTorch installation. The following table lists the corresponding torchtext versions and supported Python versions.

Version Compatibility

PyTorch version    torchtext version    Supported Python version
nightly build      master               3.6+
1.7                0.8                  3.6+
1.6                0.7                  3.6+
1.5                0.6                  3.5+
1.4                0.5                  2.7, 3.5+
0.4 and below      0.2.3                2.7, 3.5+

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext
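
After installing, it can be useful to confirm that the installed PyTorch and torchtext releases form one of the pairs in the compatibility table. The following is a minimal sketch, not part of torchtext: the helper names (`installed_version`, `major_minor`, `versions_match`) are hypothetical, the mapping covers only the numbered 1.4 to 1.7 rows of the table, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

# PyTorch release -> matching torchtext release, copied from the
# compatibility table (numbered 1.4-1.7 rows only).
COMPATIBLE = {"1.7": "0.8", "1.6": "0.7", "1.5": "0.6", "1.4": "0.5"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Reduce a version such as '1.7.0+cpu' to its 'major.minor' prefix."""
    return ".".join(version.split(".")[:2])


def versions_match(torch_version, torchtext_version):
    """Return True if the pair appears in the COMPATIBLE mapping."""
    expected = COMPATIBLE.get(major_minor(torch_version))
    return expected is not None and major_minor(torchtext_version) == expected


if __name__ == "__main__":
    torch_v = installed_version("torch")
    text_v = installed_version("torchtext")
    if torch_v is None or text_v is None:
        print("torch and/or torchtext is not installed")
    else:
        status = "ok" if versions_match(torch_v, text_v) else "mismatch"
        print(torch_v, text_v, status)
```

Nightly builds and releases older than 1.4 are reported as a mismatch by this sketch, because they are not in the mapping.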

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses

For torchtext 0.5 and below, install sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive

# Linux
python setup.py clean install

# OSX
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py clean install

# or `python setup.py develop` if you are making modifications.

Note

When building from source, make sure that you use the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using a nightly build of PyTorch, check out the environment in which it was built, here (conda) and here (pip).

Documentation

Find the documentation here.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> from torchtext import data, datasets
    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
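
To make the preprocessing, vocabulary, numericalizing, and padding steps above more concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under stated assumptions, not torchtext's implementation: `my_custom_tokenizer` is a hypothetical whitespace tokenizer standing in for the name used above, `build_vocab` and `numericalize_and_pad` are hypothetical helpers, and the choice of id 0 for `<unk>` and id 1 for `<pad>` is an assumption made for this example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def my_custom_tokenizer(text):
    """Hypothetical tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, max_size=None):
    """Map the most frequent tokens to integer ids.

    Ids 0 and 1 are reserved for the unknown and padding tokens; max_size
    limits the number of regular tokens and does not count the two specials.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [UNK, PAD] + [tok for tok, _ in ordered[:max_size]]
    return {tok: i for i, tok in enumerate(itos)}


def numericalize_and_pad(token_lists, stoi):
    """Turn token lists into equal-length lists of ids, padding on the right."""
    width = max(len(toks) for toks in token_lists)
    batch = []
    for toks in token_lists:
        # Tokens missing from the vocabulary fall back to the unknown id.
        ids = [stoi.get(tok, stoi[UNK]) for tok in toks]
        batch.append(ids + [stoi[PAD]] * (width - len(ids)))
    return batch


sentences = ["The cat sat", "the dog sat on the mat"]
tokens = [my_custom_tokenizer(s) for s in sentences]
stoi = build_vocab(tokens, max_size=3)
batch = numericalize_and_pad(tokens, stoi)
print(batch)
```

With `max_size=3`, only the three most frequent tokens keep their own ids, and every other token is mapped to the unknown id, which mirrors the purpose of the `max_size` argument passed to `build_vocab` above.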

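The `sort_key` passed to the bucketing iterator above groups examples of similar length so that each batch needs little padding. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not torchtext's code: `interleave_bits` and `bucket_batches` are hypothetical names, the 16-bit width is an assumption, and the whole dataset is sorted at once, which a real iterator may not do.

```python
import random


def interleave_bits(a, b, width=16):
    """Interleave the bits of a and b into one integer.

    Sorting by the result keeps examples whose (a, b) length pairs are
    similar close together in the sorted order.
    """
    bits_a = format(a, f"0{width}b")
    bits_b = format(b, f"0{width}b")
    return int("".join(x + y for x, y in zip(bits_a, bits_b)), 2)


def bucket_batches(pairs, batch_size, seed=0):
    """Sort (src, trg) token-list pairs by the interleaved length key, cut
    the sorted list into batches, then shuffle the order of the batches."""
    ordered = sorted(pairs, key=lambda p: interleave_bits(len(p[0]), len(p[1])))
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


# Toy parallel corpus: only the lengths of the token lists matter here.
lengths = [(5, 6), (1, 1), (20, 22), (2, 1), (19, 21), (6, 5)]
pairs = [(["w"] * n, ["v"] * m) for n, m in lengths]
batches = bucket_batches(pairs, batch_size=2)
for b in batches:
    print([(len(src), len(trg)) for src, trg in b])
```

Each batch ends up holding pairs of similar length, so the amount of padding per batch stays small, while shuffling the batch order keeps training from always seeing short examples first.
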
Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI, MultiNLI

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking

  • Question answering: 20 QA bAbI tasks

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.

Experimental Code

We have re-written several datasets under torchtext.experimental.datasets:

  • Sentiment analysis: IMDb

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

A new pattern is introduced in Release v0.5.0. Several other datasets are also in the new pattern:

  • Unsupervised learning dataset: Enwik9

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.

If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • torchtext-0.8.0-cp38-cp38-manylinux1_x86_64.whl (7.0 MB): CPython 3.8
  • torchtext-0.8.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.8, macOS 10.9+ x86-64
  • torchtext-0.8.0-cp37-cp37m-manylinux1_x86_64.whl (6.9 MB): CPython 3.7m
  • torchtext-0.8.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.7m, macOS 10.9+ x86-64
  • torchtext-0.8.0-cp36-cp36m-manylinux1_x86_64.whl (6.9 MB): CPython 3.6m
  • torchtext-0.8.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.6m, macOS 10.9+ x86-64

File details

Details for the file torchtext-0.8.0-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 7.0 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 30586bc67bd0651a3b86a17a980147dcd84cad2fd3e39bfaa8fa6fd673a07c35
MD5 c16dc31510cbac8ac8893b17acae8d76
BLAKE2b-256 a18190701a8c230bbecace97b9572fa2eaf2b71081930407ab932eda2e53f20b

File details

Details for the file torchtext-0.8.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp38-cp38-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: CPython 3.8, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 33b041e2fb6c063c0d5540eb35922ad4f06d822a9018fc2ebbb65510c5ec69ad
MD5 64870faa8c82f15d5ec39af3a9592c32
BLAKE2b-256 080c86a32f0dd35f8ce595498a8d8f7c53240499e9be779e3a3308c5d60db10b

File details

Details for the file torchtext-0.8.0-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 6.9 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 be710a4ff82c3762deb2e755186cf3815d60007620ccee27faa3eb4e57e5bc21
MD5 3c8a69ae6f2224d87946553139ac47cc
BLAKE2b-256 268ae09b9b82d4dd676f17aa681003a7533765346744391966dec0d5dba03ee4

File details

Details for the file torchtext-0.8.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp37-cp37m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 36373bf8c99efc6a6707298e64315caf54a564dc50436882734580b360e90014
MD5 9f7185367a76fffce52f93a2024cc16b
BLAKE2b-256 93d071cfe283aa0370026ce487d23d4239d30d6666b97ecf41361d2511ccb051

File details

Details for the file torchtext-0.8.0-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 6.9 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 6530f37774cce4307f6d674d5434aaa6988de51428e3dc920cfa43a0ac3b95e3
MD5 6112198646651ed045c2e27ef2e5164a
BLAKE2b-256 23238499af6d9c22b29b01f66a2c11d38ce71cd1cafa2655913c29818ed4a00f

File details

Details for the file torchtext-0.8.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: torchtext-0.8.0-cp36-cp36m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: CPython 3.6m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for torchtext-0.8.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 60b20caebcf998f2b5089583309a9871b080e7449b8f6b819ada09ae5fe6892a
MD5 0de510b6066b391ebe47239f4ee80e24
BLAKE2b-256 1d3a6cf5ca87bd5cff3ffac7c642b7deed9bd45617c721b27209230da8c9d300
