
Text utilities and datasets for PyTorch

Project description


torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • torchtext.datasets: Pre-built loaders for common NLP datasets

Note: we are currently redesigning the torchtext library to make it more compatible with PyTorch (e.g. torch.utils.data). Several datasets have been rewritten with the new abstractions in the torchtext.experimental folder. We have also opened an issue to discuss the new abstractions, and users are welcome to leave feedback there.

Installation

We recommend Anaconda as the Python package management system. Please refer to pytorch.org for details on installing PyTorch. The table below lists each PyTorch version with its corresponding torchtext version and the supported Python versions.

Version Compatibility

PyTorch version    torchtext version    Supported Python version
nightly build      master               3.6+
1.5                0.5                  3.5+
1.4                0.4                  2.7, 3.5+
0.4 and below      0.2.3                2.7, 3.5+

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext
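After installing, it can help to confirm which versions actually ended up in the environment and compare them against the compatibility table. The following is a minimal sketch, not part of torchtext: the helper name installed_version is hypothetical, and it assumes Python 3.8+ for the standard-library importlib.metadata module.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent.

    Hypothetical helper for this sketch; it is not a torchtext function.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two packages the compatibility table refers to.
for name in ("torch", "torchtext"):
    print(name, installed_version(name) or "not installed")
```

If either package is reported as not installed, or the pair does not match a row of the table, reinstalling with one of the commands above is usually the simplest fix.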

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses

For torchtext 0.5 and below, you also need to install sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive
python setup.py clean install
# or `python setup.py develop` if you are making modifications.

Note

When building from source, make sure that you use the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using a nightly build of PyTorch, check out the environment it was built with here (conda) and here (pip).

Documentation

Find the documentation here.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> from torchtext import data
    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> from torchtext import datasets
    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
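To make the steps above more concrete, here is a plain-Python sketch of what tokenizing, building a size-limited vocabulary, numericalizing, and padding amount to conceptually. It does not use torchtext and is not how torchtext implements these steps. The tokenizer body, the helper names build_vocab and numericalize_and_pad, the special tokens, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens for this sketch


def my_custom_tokenizer(text):
    """Hypothetical tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, max_size=None):
    """Map the most frequent tokens to integer ids, after the specials."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # max_size limits the regular tokens; the specials are always kept.
    itos = [UNK, PAD] + [tok for tok, _ in ranked[:max_size]]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize_and_pad(token_lists, stoi):
    """Turn token lists into equal-length lists of integer ids."""
    width = max(len(toks) for toks in token_lists)
    batch = []
    for toks in token_lists:
        # Tokens outside the vocabulary fall back to the UNK id.
        ids = [stoi.get(tok, stoi[UNK]) for tok in toks]
        batch.append(ids + [stoi[PAD]] * (width - len(ids)))
    return batch


sentences = ["The cat sat", "The dog sat down", "A bird"]
tokens = [my_custom_tokenizer(s) for s in sentences]
stoi = build_vocab(tokens, max_size=3)

# Sorting by length puts similar-length examples next to each other,
# which is the idea behind bucketing: less padding per batch.
batch = numericalize_and_pad(sorted(tokens, key=len), stoi)
print(stoi)
print(batch)
```

With max_size=3, only the three most frequent tokens receive their own ids; every other token is mapped to the unknown-token id, and shorter examples are filled out with the padding id.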

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI, MultiNLI

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking

  • Question answering: 20 QA bAbI tasks

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.

Experimental Code

We have re-written several datasets under torchtext.experimental.datasets:

  • Sentiment analysis: IMDb

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

Release v0.5.0 introduced a new dataset pattern. Several other datasets also follow this pattern:

  • Unsupervised learning dataset: Enwik9

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.

If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!


Built Distributions

  • torchtext-0.7.0-cp38-cp38-manylinux1_x86_64.whl (4.5 MB): CPython 3.8
  • torchtext-0.7.0-cp38-cp38-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.8, macOS 10.9+ x86-64
  • torchtext-0.7.0-cp37-cp37m-manylinux1_x86_64.whl (4.5 MB): CPython 3.7m
  • torchtext-0.7.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.7m, macOS 10.9+ x86-64
  • torchtext-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (4.5 MB): CPython 3.6m
  • torchtext-0.7.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.6m, macOS 10.9+ x86-64
