Neural Machine Translation Framework in PyTorch
`nmtpytorch` allows training of various end-to-end neural architectures, including but not limited to neural machine translation, image captioning and automatic speech recognition systems. The initial codebase was in Theano and was inspired by the famous dl4mt-tutorial codebase.

`nmtpytorch` is mainly developed by the Language and Speech Team of Le Mans University, but receives valuable contributions from the Grounded Sequence-to-sequence Transduction Team of the Frederick Jelinek Memorial Summer Workshop 2018: Loic Barrault, Ozan Caglayan, Amanda Duarte, Desmond Elliott, Spandana Gella, Nils Holzenberger, Chirag Lala, Jasmine (Sun Jae) Lee, Jindřich Libovický, Pranava Madhyastha, Florian Metze, Karl Mulligan, Alissa Ostapenko, Shruti Palaskar, Ramon Sanabria, Lucia Specia and Josiah Wang.
If you use nmtpytorch, you may want to cite the following paper:
```
@article{nmtpy2017,
  author    = {Ozan Caglayan and
               Mercedes Garc\'{i}a-Mart\'{i}nez and
               Adrien Bardet and
               Walid Aransa and
               Fethi Bougares and
               Lo\"{i}c Barrault},
  title     = {NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems},
  journal   = {Prague Bull. Math. Linguistics},
  volume    = {109},
  pages     = {15--28},
  year      = {2017},
  url       = {https://ufal.mff.cuni.cz/pbml/109/art-caglayan-et-al.pdf},
  doi       = {10.1515/pralin-2017-0035},
  timestamp = {Tue, 12 Sep 2017 10:01:08 +0100}
}
```
Installation
`nmtpytorch` currently requires `python>=3.6`, `torch==0.3.1` and a GPU to work. We do not plan to support Python 2.x, but the codebase will be updated to work with newer versions of `torch` and to run on CPU as well.
pip
You can install `nmtpytorch` from PyPI using `pip` (or `pip3`, depending on your operating system and environment):

```
$ pip install nmtpytorch
```

This will automatically fetch and install the dependencies as well. For the `torch` dependency, it will specifically install the `torch 0.3.1` package from PyPI, which ships CUDA 8.0 within. If you instead want to use a newer version of CUDA, you can uninstall the `torch` package manually afterwards and install another 0.3.1 package from here.
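For example, a minimal sketch of that swap; the wheel file name below is only a placeholder for whichever CUDA 9.x build of `torch 0.3.1` you download from the link above:

```
$ pip uninstall torch
$ pip install torch-0.3.1-cp36-cp36m-linux_x86_64.whl   # placeholder file name
```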
conda
We provide an `environment.yml` file in the repository that you can use to create a ready-to-use Anaconda environment for `nmtpytorch`:

```
$ conda update --all
$ git clone https://github.com/lium-lst/nmtpytorch.git
$ conda env create -f nmtpytorch/environment.yml
```

Unlike the `pip` method, this environment explicitly installs the CUDA 9.0 version of `torch 0.3.1` and enables editable mode, similar to the development mode explained below.
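Once created, activate the environment before use; this assumes `environment.yml` names the environment `nmtpytorch`:

```
$ conda activate nmtpytorch   # or `source activate nmtpytorch` on older conda versions
```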
Development Mode
For continuous development and testing, it is sufficient to run `python setup.py develop` in the root folder of your Git checkout. From then on, all modifications to the source tree are taken into account directly, without requiring reinstallation.
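A typical development setup, reusing the repository URL from the conda instructions above:

```
$ git clone https://github.com/lium-lst/nmtpytorch.git
$ cd nmtpytorch
$ python setup.py develop
```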
METEOR Installation
After the above installation steps, you finally need to run `nmtpy-install-extra` in order to fetch and store METEOR-related files in your `${HOME}/.nmtpy` folder. This step is only required once.
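That is, after installing the package:

```
$ nmtpy-install-extra
```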
Documentation
We currently only provide some preliminary documentation in our wiki.
Release Notes
v2.0.0 (26/09/2018)
- Ability to install through `pip`.
- Advanced layers are now organized into subfolders.
- New basic layers: Convolution over sequence, MaxMargin.
- New attention layers: Co-attention, multi-head attention, hierarchical attention.
- New encoders: Arbitrary sequence-of-vectors encoder, BiLSTMp speech feature encoder.
- New decoders: Multi-source decoder, switching decoder, vector decoder.
- New datasets: Kaldi dataset (.ark/.scp reader), Shelve dataset, Numpy sequence dataset.
- Added learning rate annealing: see `lr_decay*` options in `config.py`.
- Removed subword-nmt and METEOR files from the repository. We now depend on the pip package for subword-nmt. For METEOR, `nmtpy-install-extra` should be launched after installation.
- More multi-task and multi-input/output `translate` and `training` regimes.
- New early-stopping metrics: character and word error rate (`cer`, `wer`) and ROUGE (`rouge`).
- Curriculum learning option for the `BucketBatchSampler`, i.e. length-ordered batches.
- New models:
  - ASR: Listen-attend-and-spell-like automatic speech recognition
  - Multitask*: experimental multi-tasking & scheduling between many inputs/outputs.
v1.4.0 (09/05/2018)
- Add `environment.yml` for easy installation using `conda`. You can now create a ready-to-use `conda` environment by just calling `conda env create -f environment.yml`.
- Make `NumpyDataset` memory efficient by keeping `float16` arrays as they are until batch creation time.
- Rename `Multi30kRawDataset` to `Multi30kDataset`, which now supports both raw image files and pre-extracted visual features files stored as `.npy`.
- Add CNN feature extraction script under `scripts/`.
- Add doubly stochastic attention to `ShowAttendAndTell` and multimodal NMT.
- New model `MNMTDecinit` to initialize the decoder with auxiliary features.
- New model `AMNMTFeatures`, which is the attentive MMT but with a features file instead of the end-to-end feature extraction, which was memory hungry.
v1.3.2 (02/05/2018)
- Updates to the `ShowAttendAndTell` model.
v1.3.1 (01/05/2018)
- Removed old `Multi30kDataset`.
- Sort batches by source sequence length instead of target.
- Fix `ShowAttendAndTell` model. It should now work.
v1.3 (30/04/2018)
- Added `Multi30kRawDataset` for training end-to-end systems from raw images as input.
- Added `NumpyDataset` to read `.npy/.npz` tensor files as input features.
- You can now pass `-S` to `nmtpy train` to produce shorter experiment files that do not encode all the hyperparameters in the file name.
- New post-processing filter option `de-spm` for Google SentencePiece (SPM) processed files.
- `sacrebleu` is now a dependency, as it is now accepted as an early-stopping metric. It only makes sense to use it with SPM-processed files since they are detokenized once post-processed.
- Added `sklearn` as a dependency for some metrics.
- Added `momentum` and `nesterov` parameters to the `[train]` section for SGD.
- `ImageEncoder` layer is improved in many ways. Please see the code for further details.
- Added unmerged upstream PR for `ModuleDict()` support.
- METEOR will now fall back to English if the language cannot be detected from file suffixes.
- `-f` now produces a separate numpy file for token frequencies when building vocabulary files with `nmtpy-build-vocab`.
- Added new command `nmtpy test` for non-beam-search inference modes.
- Removed the `nmtpy resume` command and added a `pretrained_file` option for `[train]` to initialize model weights from a checkpoint.
- Added a `freeze_layers` option for `[train]` to give a comma-separated list of layer name prefixes to freeze.
- Improved seeding: the seed is now printed in order to reproduce the results.
- Added IPython notebook for attention visualization.
- Layers:
  - New shallow `SimpleGRUDecoder` layer.
  - `TextEncoder`: ability to set `maxnorm` and `gradscale` of embeddings and to work with or without sorted-length batches.
  - `ConditionalDecoder`: make it work with GRU/LSTM, allow setting `maxnorm/gradscale` for embeddings.
  - `ConditionalMMDecoder`: same as above.
- `nmtpy translate`:
  - `--avoid-double` and `--avoid-unk` removed for now.
  - Added Google's length penalty normalization switch `--lp-alpha` (see the formula below).
  - Added ensembling, which is enabled automatically if you give more than one model checkpoint.
- New machine learning metric wrappers in `utils/ml_metrics.py`:
  - Label-ranking average precision (`lrap`)
  - Coverage error
  - Mean reciprocal rank
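For reference, `--lp-alpha` enables the length normalization of Wu et al. (2016). As a sketch of the usual definition (the exact constants in the implementation may differ), each hypothesis' log-probability is divided by

```latex
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}
```

so that with `--lp-alpha 0` the penalty is 1 and scores are left unnormalized; values of alpha between 0.6 and 0.7 were reported to work best in the original paper.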
v1.2 (20/02/2018)
- You can now use `$HOME` and `$USER` in your configuration files.
in your configuration files. - Fixed an overflow error that would cause NMT with more than 255 tokens to fail.
- METEOR worker process is now correctly killed after validations.
- Many runs of an experiment are now suffixed with a unique random string instead of incremental integers to avoid race conditions in cluster setups.
- Replaced `utils.nn.get_network_topology()` with a new `Topology` class that parses the `direction` string of the model in a smarter way.
- If `CUDA_VISIBLE_DEVICES` is set, the `GPUManager` will always honor it.
- Dropped creation of temporary/advisory lock files under `/tmp` for GPU reservation.
- Time measurements during training are now structured into batch overhead, training and evaluation timings.
- Datasets:
  - Added `TextDataset` for standalone text file reading.
  - Added `OneHotDataset`, a variant of `TextDataset` where the sequences are not prefixed/suffixed with `<bos>` and `<eos>` respectively.
  - Added experimental `MultiParallelDataset`, which merges an arbitrary number of parallel datasets together.
- `nmtpy translate`:
  - `.nodbl` and `.nounk` suffixes are now added to output files for the `--avoid-double` and `--avoid-unk` arguments respectively.
  - A model-agnostic `beam_search()` is now separated out into its own file, `nmtpytorch/search.py`.
  - `max_len` default is increased to 200.
v1.1 (25/01/2018)
- New experimental `Multi30kDataset` and `ImageFolderDataset` classes.
- `torchvision` dependency added for CNN support.
- `nmtpy-coco-metrics` now computes one METEOR without `norm=True`.
- Mainloop mechanism is completely refactored with backward-incompatible configuration option changes for the `[train]` section (see the sample snippet after this list):
  - `patience_delta` option is removed.
  - Added `eval_batch_size` to define the batch size for GPU beam-search during training.
  - `eval_freq` default is now `3000`, which means every `3000` minibatches.
  - `eval_metrics` now defaults to `loss`. As before, you can provide a list of metrics like `bleu,meteor,loss` to compute all of them and early-stop based on the first.
  - Added `eval_zero` (default: `False`), which tells the trainer to evaluate the model once on the dev set right before training starts. Useful for sanity checking if you fine-tune a model initialized with pre-trained weights.
  - Removed `save_best_n`: we no longer save the best `N` models on the dev set w.r.t. the early-stopping metric.
  - Added `save_best_metrics` (default: `True`), which will save the best models on the dev set w.r.t. each metric provided in `eval_metrics`. This somewhat remedies the removal of `save_best_n`.
  - `checkpoint_freq` now defaults to `5000`, which means every `5000` minibatches.
  - Added `n_checkpoints` (default: `5`) to define the number of last checkpoints that will be kept if `checkpoint_freq > 0`, i.e. checkpointing enabled.
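A sketch of how these options could look in a `[train]` section; all values are the stated defaults except `eval_batch_size` and `eval_metrics`, which are illustrative:

```ini
[train]
eval_batch_size: 32              # illustrative; batch size for GPU beam-search during training
eval_freq: 3000                  # evaluate every 3000 minibatches (new default)
eval_metrics: bleu,meteor,loss   # early-stop on the first metric, here bleu
eval_zero: False                 # evaluate once on dev right before training starts
save_best_metrics: True          # save best dev model per metric in eval_metrics
checkpoint_freq: 5000            # checkpoint every 5000 minibatches (new default)
n_checkpoints: 5                 # number of last checkpoints to keep
```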
- Added `ExtendedInterpolation` support to configuration files (a small illustration follows this list):
  - You can now define intermediate variables in `.conf` files to avoid typing the same paths again and again. A variable can be referenced from within its section using the `tensorboard_dir: ${save_path}/tb` notation.
  - Cross-section references are also possible: `${data:root}` will be replaced by the value of the `root` variable defined in the `[data]` section.
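For example; the paths and all option names other than `tensorboard_dir` are hypothetical:

```ini
[data]
root: /data/multi30k             # hypothetical path
train_src: ${root}/train.en      # same-section reference

[train]
save_path: /tmp/experiments      # hypothetical path
tensorboard_dir: ${save_path}/tb
# A cross-section reference such as ${data:root} would expand
# to /data/multi30k here.
```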
- Added `-p/--pretrained` to `nmtpy train` to initialize the weights of the model using another checkpoint `.ckpt`.
- Improved input/output handling for `nmtpy translate` (see the sketch after this list):
  - `-s` accepts comma-separated test sets defined in the configuration file of the experiment to translate them at once. Example: `-s val,newstest2016,newstest2017`
  - The mutually exclusive counterpart of `-s` is `-S`, which receives a single input file of source sentences.
  - For both cases, an output prefix should now be provided with `-o`. In the case of multiple test sets, the output prefix will be appended with the name of the test set and the beam size. If you just provide a single file with `-S`, the final output name will only reflect the beam size information.
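A sketch of both invocation styles; the checkpoint and file names are hypothetical:

```
# Translate several test sets defined in the experiment's configuration
$ nmtpy translate -s val,newstest2016,newstest2017 -o out model.best.ckpt

# Translate a single raw source file
$ nmtpy translate -S source.tok.en -o out model.best.ckpt
```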
- Two new arguments for `nmtpy-build-vocab`:
  - `-f`: stores frequency counts as well inside the final `json` vocabulary.
  - `-x`: does not add the special markers `<eos>,<bos>,<unk>,<pad>` into the vocabulary.
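For instance, to build a vocabulary that also stores frequency counts and omits the special markers (the corpus file name is hypothetical):

```
$ nmtpy-build-vocab -f -x train.tok.en
```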
Layers/Architectures
- Added `Fusion()` layer to `concat,sum,mul` an arbitrary number of inputs.
- Added experimental `ImageEncoder()` layer to seamlessly plug a VGG or ResNet CNN using `torchvision` pretrained models.
- `Attention` layer arguments improved. You can now select the bottleneck dimensionality for MLP attention with `att_bottleneck`. The `dot` attention is still not tested and probably broken.
New layers/architectures:
- Added `AttentiveMNMT`, which implements modality-specific multimodal attention from the paper "Multimodal Attention for Neural Machine Translation".
- Added `ShowAttendAndTell` model.
Changes in NMT:
- `dec_init` defaults to `mean_ctx`, i.e. the decoder will be initialized with the mean context computed from the source encoder.
- `enc_lnorm`, which was just a placeholder, is now removed since we do not provide layer normalization for now.
- Beam search is completely moved to GPU.