Text utilities and datasets for PyTorch

Project description

Build: https://travis-ci.org/pytorch/text · Coverage: https://codecov.io/gh/pytorch/text · Docs: http://readthedocs.org/projects/torchtext

torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • torchtext.datasets: Pre-built loaders for common NLP datasets

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install torchtext using pip:

pip install torchtext

For PyTorch versions before 0.4.0, please use pip install torchtext==0.2.3.

Optional requirements

If you want to use the English tokenizer from spaCy, you need to install spaCy and download its English model:

pip install spacy
python -m spacy download en
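
With the model in place, spaCy tokenization can be selected by name when constructing a Field (a hedged sketch; the string shortcut is resolved by torchtext's get_tokenizer helper, and the Field API itself is described under Data below):

    >>> from torchtext import data
    >>> # 'spacy' dispatches to spaCy's English tokenizer
    >>> TEXT = data.Field(tokenize='spacy')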

Alternatively, you might want to use the Moses tokenizer from NLTK. In that case, install NLTK and download the data it needs:

pip install nltk
python -m nltk.downloader perluniprops nonbreaking_prefixes
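
As with spaCy, the Moses tokenizer can then be requested by name (again a hedged sketch using the same tokenize shortcut):

    >>> from torchtext import data
    >>> # 'moses' dispatches to NLTK's Moses tokenizer
    >>> TEXT = data.Field(tokenize='moses')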

Documentation

Find the documentation at https://torchtext.readthedocs.io/.

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline (my_custom_tokenizer can be any callable from string to a list of tokens; see the sketch after this list):

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
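
To make the snippets above self-contained: my_custom_tokenizer stands for any callable mapping a string to a list of tokens, datasets index into Example objects, vocabularies expose stoi/itos lookups, and iterating an iterator yields Batch objects whose attributes are named after the fields. A minimal sketch (the whitespace tokenizer and the loop skeleton are illustrative, not part of the library):

    >>> def my_custom_tokenizer(text):
    ...     # any str -> list-of-str callable works here
    ...     return text.split()
    >>>
    >>> pos[0].text, pos[0].labels   # datasets hold Example objects
    >>> len(TEXT.vocab)              # vocabulary size after build_vocab
    >>> TEXT.vocab.stoi['the']       # token -> integer id
    >>> TEXT.vocab.itos[2]           # integer id -> token
    >>>
    >>> for batch in train_iter:     # each Batch carries one tensor per field
    ...     x, y = batch.text, batch.labels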

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI, MultiNLI

  • Language modeling: abstract class + WikiText-2, WikiText-103, Penn Treebank

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking

  • Question answering: the 20 bAbI QA tasks

Others are planned or in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.
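
For example, a built-in dataset can typically be loaded through its splits classmethod, mirroring TabularDataset.splits above (a hedged sketch using the IMDb loader):

    >>> from torchtext import data, datasets
    >>> TEXT = data.Field()
    >>> LABEL = data.Field(sequential=False)
    >>> train, test = datasets.IMDB.splits(TEXT, LABEL)  # downloads on first use
    >>> TEXT.build_vocab(train, max_size=25000)
    >>> LABEL.build_vocab(train)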

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

torchtext-0.3.1.tar.gz (50.1 kB)

Built Distributions

torchtext-0.3.1-py3-none-any.whl (62.4 kB)

torchtext-0.3.1-py2-none-any.whl (62.4 kB)

File details

Details for the file torchtext-0.3.1.tar.gz.

File metadata

  • Download URL: torchtext-0.3.1.tar.gz
  • Upload date:
  • Size: 50.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.19.6 CPython/3.6.4

File hashes

Hashes for torchtext-0.3.1.tar.gz
SHA256: 869e0860917b5a8660ebaa468f3cd3104a7acf3941a1f86e8e9a8ea61e78113d
MD5: 471324b9c8ebf92d5cb8005e9ad59e7f
BLAKE2b-256: ff67b1b27c3318772cf75f3bf204bdb3a1b2008ae35564852d18a43a1605ae6e

See more details on using hashes here.

File details

Details for the file torchtext-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: torchtext-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 62.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.19.6 CPython/3.6.4

File hashes

Hashes for torchtext-0.3.1-py3-none-any.whl
SHA256: 7b5bc7af67d9c3892bdf6f4895734768f2836c13156a783c96597168176ce2d5
MD5: 042343d90f8c1319f18b37c3a6d45f42
BLAKE2b-256: c6bcb28b9efb4653c03e597ed207264eea45862b5260f48e9f010b5068d64db1

See more details on using hashes here.

File details

Details for the file torchtext-0.3.1-py2-none-any.whl.

File metadata

  • Download URL: torchtext-0.3.1-py2-none-any.whl
  • Upload date:
  • Size: 62.4 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.19.6 CPython/3.6.4

File hashes

Hashes for torchtext-0.3.1-py2-none-any.whl
SHA256: 963160f97cf449edad1183e95d2dd0b4694225b7060a1a8b23e71bccb08022e0
MD5: a383dc6aab13276f3559b4a6badec30f
BLAKE2b-256: 26b52022b596796eceba0143df5a18be2c17c9ecda95bdeab133225e0d46fae8

See more details on using hashes here.
