
Text utilities and datasets for PyTorch


torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • torchtext.datasets: Pre-built loaders for common NLP datasets

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install torchtext using pip:

pip install torchtext

For PyTorch versions before 0.4.0, please use pip install torchtext==0.2.3.
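The version rule above can be written down as a small helper. This is only an illustrative sketch: `torchtext_requirement` is a hypothetical name, not part of torchtext, and it assumes the PyTorch version string starts with `major.minor`.

```python
def torchtext_requirement(pytorch_version):
    """Return the pip requirement suggested above for a given PyTorch version.

    Hypothetical helper for illustration; not part of torchtext.
    """
    # Only major and minor matter for the 0.4.0 cutoff.
    major, minor = (int(part) for part in pytorch_version.split(".")[:2])
    if (major, minor) >= (0, 4):
        return "torchtext"
    return "torchtext==0.2.3"


print(torchtext_requirement("0.3.1"))
print(torchtext_requirement("1.4.0"))
```

The returned string is what you would pass to `pip install`.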

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you can use the Moses tokenizer port in SacreMoses (split from NLTK), which requires installing SacreMoses:

pip install sacremoses
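To see which of these optional tokenizer packages are present in the current environment, a check along the following lines may help. It is a sketch using only the standard library; `available_tokenizer_backends` is a hypothetical helper name, and it only tests whether the packages are importable, not whether the SpaCy English model has been downloaded.

```python
import importlib.util


def available_tokenizer_backends(candidates=("spacy", "sacremoses")):
    """Map each package name to True/False without importing the package.

    Hypothetical helper for illustration; not part of torchtext.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in candidates
    }


backends = available_tokenizer_backends()
print(backends)
```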

Documentation

Find the documentation here.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
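To make the tokenizing, vocabulary building, numericalizing, padding, and bucketing steps listed above more concrete, here is a simplified pure-Python sketch of the ideas. It is not torchtext's implementation: every function below is a hypothetical stand-in (for example, `tokenize` plays the role of `my_custom_tokenizer`), the special tokens are assumed to be `<unk>` and `<pad>`, and the bucketing simply sorts by length without any shuffling.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens for this sketch


def tokenize(text):
    """Stand-in for a custom tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, max_size=None):
    """Assign ids to the max_size most frequent tokens, after the specials."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Descending frequency, ties broken alphabetically, so ids are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [UNK, PAD] + [tok for tok, _ in ranked[:max_size]]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi):
    """Replace each token with its id; unknown tokens map to the UNK id."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def bucket_batches(examples, batch_size):
    """Group examples of similar length so each batch needs little padding."""
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


corpus = [
    "the cat sat",
    "the dog sat on the mat",
    "a cat",
    "the dog barked loudly",
]
tokenized = [tokenize(line) for line in corpus]
stoi = build_vocab(tokenized, max_size=5)
batches = [
    pad_batch([numericalize(toks, stoi) for toks in bucket], stoi[PAD])
    for bucket in bucket_batches(tokenized, batch_size=2)
]
for batch in batches:
    print(batch)
```

Because examples are sorted by length before batching, the short sentences end up together and the first batch needs only one padding id, which is the motivation for bucketing.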

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI, MultiNLI

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking

  • Question answering: 20 QA bAbI tasks

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.

Experimental Code

We have re-written several datasets under `torchtext.experimental.datasets`:

  • Sentiment analysis: IMDb

  • Language modeling: abstract class + WikiText-2, WikiText103, PennTreebank

A new dataset pattern was introduced in release v0.5.0. Several other datasets also follow the new pattern:

  • Unsupervised learning dataset: Enwik9

  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.

If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
