
Text utilities and datasets for PyTorch


torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • torchtext.datasets: Pre-built loaders for common NLP datasets

Note: we are currently re-designing the torchtext library to make it more compatible with PyTorch (e.g. torch.utils.data). Several datasets have been written with the new abstractions in the torchtext.experimental folder. We also created an issue to discuss the new abstractions, and users are welcome to leave feedback there. These prototype building blocks and datasets in the experimental folder are available in the nightly release only. The nightly packages are accessible via pip and conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command:

pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

For more detailed instructions, please refer to Install PyTorch. Note that the new building blocks are still under development, and their APIs are not yet stable.

Installation

We recommend Anaconda as the Python package management system. Please refer to pytorch.org for details on installing PyTorch. The following table lists the corresponding torchtext versions and supported Python versions.

Version Compatibility

PyTorch version    torchtext version    Supported Python version
nightly build      master               3.6+
1.7                0.8                  3.6+
1.6                0.7                  3.6+
1.5                0.6                  3.5+
1.4                0.5                  2.7, 3.5+
0.4 and below      0.2.3                2.7, 3.5+
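
The compatibility rows above can also be read as a simple lookup. The following is a minimal sketch that hard-codes those rows; the dictionary `TORCHTEXT_FOR_PYTORCH` and the helper `matching_torchtext` are illustrative names and are not part of torchtext. The nightly row is left out because it does not correspond to a fixed version number.

```python
# Rows copied from the compatibility table above; the dictionary and helper
# names are illustrative and not part of torchtext.
TORCHTEXT_FOR_PYTORCH = {
    "1.7": "0.8",
    "1.6": "0.7",
    "1.5": "0.6",
    "1.4": "0.5",
}


def matching_torchtext(pytorch_version):
    """Return the torchtext release listed for a PyTorch version string."""
    # Drop local build tags such as "+cpu", then keep only "major.minor".
    public = pytorch_version.split("+")[0]
    major, minor = public.split(".")[:2]
    key = f"{major}.{minor}"
    if key in TORCHTEXT_FOR_PYTORCH:
        return TORCHTEXT_FOR_PYTORCH[key]
    # The table groups every release up to 0.4 into a single row.
    if (int(major), int(minor)) <= (0, 4):
        return "0.2.3"
    raise ValueError(f"PyTorch {pytorch_version} is not listed in the table")
```

For example, `matching_torchtext("1.7.1")` gives `"0.8"`, and a version that the table does not list raises `ValueError` instead of guessing.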

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext
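
After installing, it can help to confirm which versions ended up in the environment. The sketch below uses only the standard library; it assumes Python 3.8 or newer, where `importlib.metadata` is available, and the helper name `installed_version` is illustrative.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report both packages so the pair can be compared against the version table.
for name in ("torch", "torchtext"):
    print(name, installed_version(name) or "not installed")
```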

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses
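
Because both tokenizers are optional, a script may want to check which of them can be imported before choosing one. The following is a minimal sketch using only the standard library; `available_backends` and `whitespace_tokenize` are hypothetical helpers, not torchtext functions. A plain callable such as `whitespace_tokenize` is the kind of object the Data section passes as `tokenize=` to `data.Field`.

```python
from importlib import util

# Optional tokenizer packages named in this section.
OPTIONAL_BACKENDS = ("spacy", "sacremoses")


def available_backends(candidates=OPTIONAL_BACKENDS):
    """Return the candidate packages that can be imported."""
    return [name for name in candidates if util.find_spec(name) is not None]


def whitespace_tokenize(text):
    """Fallback tokenizer: split on runs of whitespace."""
    return text.split()


print("optional tokenizers found:", available_backends())
```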

For torchtext 0.5 and below, you also need to install sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive

# Linux
python setup.py clean install

# OSX
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py clean install

# or ``python setup.py develop`` if you are making modifications.
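
Before starting a build, the prerequisites listed above can be checked from Python. This is a minimal sketch; the helper `missing_build_tools` is hypothetical, and the compiler names it looks for (`g++`, `clang++`) are an assumption based on the commands shown above. It only checks that the tools are on `PATH`, not that the compiler supports C++11.

```python
import shutil


def missing_build_tools(compilers=("g++", "clang++")):
    """Return the names of required build tools not found on PATH."""
    missing = [tool for tool in ("git", "cmake") if shutil.which(tool) is None]
    # Any one C++ compiler is enough; the candidate names are an assumption.
    if not any(shutil.which(cxx) for cxx in compilers):
        missing.append("C++ compiler")
    return missing


print("missing build tools:", missing_build_tools() or "none")
```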

Note

When building from source, make sure that you use the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using a nightly build of PyTorch, check the environment it was built in, for the conda packages or for the pip packages as appropriate.

Documentation

Find the documentation here.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> from torchtext import data, datasets
    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
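
To make the vocabulary, numericalizing, padding, and bucketing steps above concrete, here is a plain-Python sketch of the same ideas. It does not use torchtext and is not how the library implements them; all function names are illustrative, and the real `Field` and `BucketIterator` classes do more than this. Tokens are mapped to integer ids by frequency, examples are sorted by length so that each batch needs little padding, and short sequences are right-padded.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(token_lists, max_size=None):
    """Map tokens to integer ids, most frequent first, after the specials."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = [UNK, PAD] + ordered[:max_size]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi):
    """Replace each token with its id, using the <unk> id for unseen tokens."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def bucket_batches(examples, batch_size, sort_key=len):
    """Sort examples by length and cut them into consecutive batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


corpus = [["the", "cat", "sat"], ["the", "dog"], ["a", "cat", "sat", "down"], ["the"]]
stoi = build_vocab(corpus)
batches = [
    pad_batch([numericalize(tokens, stoi) for tokens in batch], stoi[PAD])
    for batch in bucket_batches(corpus, batch_size=2)
]
print(batches)  # [[[2, 1], [2, 6]], [[2, 3, 4, 1], [5, 3, 4, 7]]]
```

Sorting by length before batching is what keeps the padding small: the two shortest examples share one batch and need a single pad id, where a random pairing could need up to three.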

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI, MultiNLI

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking

  • Question answering: 20 QA bAbI tasks

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.

Experimental Code

We have re-written several datasets under torchtext.experimental.datasets:

  • Sentiment analysis: IMDb

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

A new pattern is introduced in Release v0.5.0. Several other datasets are also in the new pattern:

  • Unsupervised learning dataset: Enwik9

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.

If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

