# Reader-Translator-Generator (RTG)
Reader-Translator-Generator (RTG) is a Neural Machine Translation toolkit based on pytorch.
Refer to https://isi-nlp.github.io/rtg/ for the docs.
## Features
- Reproducible experiments: one `conf.yml` that has everything -- data paths, params, and
  hyperparameters -- required to reproduce an experiment.
- Pre-processing options: [sentencepiece](https://github.com/google/sentencepiece) or [nlcodec](https://github.com/isi-nlp/nlcodec) (or add your own)
- word/char/BPE vocabulary types
- shared or separate source/target vocabulary
- one-way, two-way, or three-way tied embeddings
- [Transformer model from "Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (fully tested; competitive with [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor))
- Automatically detects and parallelizes across multiple GPUs (note: all GPUs must be on the same node)
- Several Transformer variants: width-varying, skip Transformer, etc.
- [RNN-based encoder-decoder](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) with [attention](https://nlp.stanford.edu/pubs/emnlp15_attn.pdf). (No longer actively used, but kept around for experimentation.)
- Language Modeling: RNN, Transformer
- And more ..
+ Easy, interpretable code (for those who read code as much as papers)
+ Object-oriented design (not too many levels of functions and function factories, unlike Tensor2Tensor)
+ Experiments and reproducibility are the main focus. To control an experiment, you edit a YAML file that lives inside the experiment directory.
+ Wherever possible, we prefer [convention over configuration](https://www.wikiwand.com/en/Convention_over_configuration). See [examples/transformer.test.yml](examples/transformer.test.yml) for an example experiment config.
## Setup
Add the root of this repo to `PYTHONPATH`, or install it via `pip install --editable .`:
```bash
git clone https://github.com/isi-nlp/rtg-xt.git  # use rtg.git if you have access
cd rtg-xt                       # go to the code
conda create -n rtg python=3.7  # create a conda env named rtg
conda activate rtg              # activate it
# install this as a local editable pip package
pip install --editable .
# All requirements are in setup.py
```
## Usage
Refer to `scripts/rtg-pipeline.sh` bash script and `examples/transformer.base.yml` file for specific examples.
The pipeline takes source (`.src`) and target (`.tgt`) files. The sources are in one language and the targets in another. At a minimum, supply a training source, training target, validation source, and validation target. It is best to use `.tok` files for training. (`.tok` means tokenized.)
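To make the shape of such an experiment config concrete, here is a minimal, illustrative sketch of the data-preparation portion of a `conf.yml`. The key names below are assumptions for illustration only; treat `examples/transformer.base.yml` in the repo as the authoritative reference.

```yaml
# Illustrative sketch only -- key names are assumptions;
# see examples/transformer.base.yml for the real schema.
prep:
  train_src: data/train.src.tok   # tokenized training source
  train_tgt: data/train.tgt.tok   # tokenized training target
  valid_src: data/valid.src.tok   # validation source
  valid_tgt: data/valid.tgt.tok   # validation target
  pieces: bpe                     # subword scheme: bpe / word / char
  shared_vocab: true              # one vocabulary for both sides
  max_types: 8000                 # vocabulary size
```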
Example of training and running a model:
```bash
# disable gpu use (force cpu)
export CUDA_VISIBLE_DEVICES=
# call as python module
rtg-pipe experiments/sample-exp/
# OR, submit the job to Slurm/SGE via the shell script
scripts/rtg-pipeline.sh -d experiments/sample-exp/ -c experiments/sample-exp/conf.yml
# Note: use the examples/transformer.base.yml config to set up a transformer-base model
# Then to use the model to translate something:
# (VERY poor translation due to small training data)
echo "Chacun voit midi à sa porte." | rtg-decode experiments/sample-exp/
```
The `001-tfm` directory that hosts an experiment looks like this:
```
001-tfm
├── _PREPARED <-- Flag file indicating experiment is prepared
├── _TRAINED <-- Flag file indicating experiment is trained
├── conf.yml <-- Where all the params and hyper params are! You should look into this
├── data
│ ├── samples.tsv.gz <-- samples logged after each checkpoint during training
│ ├── sentpiece.shared.model <-- as the name says, sentence piece model, shared
│ ├── sentpiece.shared.vocab <-- as the name says
│ ├── train.db <-- all the prepared training data in a sqlite db
│ └── valid.tsv.gz <-- and the validation data
├── githead <-- the git HEAD hash at the time this experiment was started
├── job.sh.bak <-- backup of the job script used to submit this to the grid, just in case
├── models <-- all checkpoints go here
│ ├── model_400_5.265583_4.977106.pkl
│ ├── model_800_4.478784_4.606745.pkl
│ ├── ...
│ └── scores.tsv <-- train and validation losses, in case you don't want tensorboard
├── rtg.log <-- the python logs are redirected here
├── rtg.zip <-- the source code used for this run; `export PYTHONPATH=rtg.zip` to reuse it
├── scripts -> /Users/tg/work/me/rtg/scripts <-- link to some perl scripts for detok+BLEU
├── tensorboard <-- Tensorboard stuff for visualizations
│ ├── events.out.tfevents.1552850552.hackb0x2
│ └── ....
└── test_step2000_beam4_ens5 <-- Tests after the end of training, BLEU scores
├── valid.ref -> /Users/tg/work/me/rtg/data/valid.ref
├── valid.src -> /Users/tg/work/me/rtg/data/valid.src
├── valid.out.tsv
├── valid.out.tsv.detok.tc.bleu
└── valid.out.tsv.detok.lc.bleu
```
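The checkpoint filenames under `models/` appear to encode the training step, training loss, and validation loss (`model_<step>_<train_loss>_<valid_loss>.pkl`; this pattern is inferred from the listing above, not documented here). If that holds, a plain shell one-liner can pick the checkpoint with the lowest validation loss, demonstrated here on stand-in filenames:

```shell
# Assumed naming pattern: model_<step>_<train_loss>_<valid_loss>.pkl
# (inferred from the directory listing above).
mkdir -p /tmp/rtgdemo/models && cd /tmp/rtgdemo/models
touch model_400_5.265583_4.977106.pkl model_800_4.478784_4.606745.pkl

# Sort numerically on the 4th underscore-separated field
# (the validation loss) and take the smallest.
ls model_*.pkl | sort -t_ -k4 -g | head -n 1
# -> model_800_4.478784_4.606745.pkl
```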
---------
### Authors:
[See Here](https://github.com/isi-nlp/rtg-xt/graphs/contributors)
### Credits / Thanks
+ OpenNMT and the Harvard NLP team for the [Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html); I learned a lot from their work
+ [My team at USC ISI](https://www.isi.edu/research_groups/nlg/people) for everything else