Reader-Translator-Generator (RTG), a Neural Machine Translation (NMT) toolkit based on PyTorch

Reader-Translator-Generator (RTG)

Reader-Translator-Generator (RTG) is a Neural Machine Translation toolkit based on PyTorch. Refer to https://isi-nlp.github.io/rtg/ for the docs.

Features

  • Reproducible experiments: one conf.yml that has everything -- data paths, params, and hyper params -- required to reproduce experiments.
  • Pre-processing options: sentencepiece or nlcodec (or add your own)
    • word/char/bpe etc types
    • shared vocabulary, separate vocabulary
    • one-way, two-way, three-way tied embeddings
  • Transformer model from "Attention is all you need" (fully tested and competes with Tensor2Tensor)
    • Automatically detects and parallelizes across multiple GPUs. (Note: all GPUs must be on the same node, though!)
    • Many transformer variants: width-varying, skip transformer, etc.
  • RNN-based Encoder-Decoder with Attention. (We no longer use it, but it's there for experimentation)
  • Language Modeling: RNN, Transformer
  • And more ..
    • Easy and interpretable code (for those who read code as much as papers)
    • Object-oriented design. (Not too many levels of functions and function factories like Tensor2Tensor)
    • Experiments and reproducibility are the main focus. To control an experiment, you edit a YAML file that is inside the experiment directory.
    • Wherever possible, prefer convention over configuration. For an example, have a look at examples/transformer.test.yml.

Quick Start

Use this Google Colab Notebook for learning how to train your NMT model with RTG: https://colab.research.google.com/drive/198KbkUcCGXJXnWiM7IyEiO1Mq2hdVq8T?usp=sharing

Setup

Add the root of this repo to PYTHONPATH, or install it with pip install --editable.

git clone https://github.com/isi-nlp/rtg-xt.git # use rtg.git if you have access
cd rtg-xt             # go to the code (cd rtg if you cloned rtg.git)


conda create -n rtg python=3.7   # adds a conda env named rtg
conda activate rtg  # activate it

# install this as a local editable pip package
pip install --editable .   
# All requirements are in setup.py

Usage

Refer to the scripts/rtg-pipeline.sh bash script and the examples/transformer.base.yml file for specific examples.

The pipeline takes source (.src) and target (.tgt) files. The sources are in one language and the targets in another. At a minimum, supply a training source, training target, validation source, and validation target. It is best to use .tok files for training. (.tok means tokenized.)
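Before starting a run, it can help to confirm that each source file lines up with its target file. The following is a minimal sketch, not part of RTG: it assumes the usual parallel-corpus convention that line i of the source pairs with line i of the target, and the helper names `count_lines` and `check_parallel` are hypothetical.

```python
def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the number of sentence pairs, or raise ValueError on a mismatch.

    Assumption: the two files are line-aligned, one sentence per line.
    """
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}"
        )
    return n_src
```

Running `check_parallel` once on the training pair and once on the validation pair catches truncated or misaligned files before any time is spent on preparation or training.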

Example of training and running a model:

# disable gpu use (force cpu)
export CUDA_VISIBLE_DEVICES=
# run the pipeline on the experiment directory
rtg-pipe experiments/sample-exp/

# OR, you can call a shell script to submit a job to slurm/SGE
scripts/rtg-pipeline.sh -d experiments/sample-exp/ -c experiments/sample-exp/conf.yml
# Note: use the examples/transformer.base.yml config to set up transformer base

# Then to use the model to translate something:
# (VERY poor translation due to small training data)
echo "Chacun voit midi à sa porte." | rtg-decode experiments/sample-exp/

The 001-tfm directory that hosts an experiment looks like this:

001-tfm
├── _PREPARED    <-- Flag file indicating experiment is prepared 
├── _TRAINED     <-- Flag file indicating experiment is trained
├── conf.yml     <-- Where all the params and hyper params are! You should look into this
├── data        
│   ├── samples.tsv.gz          <-- samples to log after each check point during training
│   ├── sentpiece.shared.model  <-- as the name says, sentence piece model, shared
│   ├── sentpiece.shared.vocab  <-- as the name says
│   ├── train.db                <-- all the prepared training data in a SQLite db
│   └── valid.tsv.gz            <-- and the validation data
├── githead       <-- the git HEAD hash when this experiment was started
├── job.sh.bak    <-- job script used to submit this to grid. Just in case
├── models        <-- All checkpoints go inside this
│   ├── model_400_5.265583_4.977106.pkl
│   ├── model_800_4.478784_4.606745.pkl
│   ├── ...
│   └── scores.tsv <-- train and validation losses, in case you don't want to open TensorBoard
├── rtg.log   <-- the python logs are redirected here
├── rtg.zip   <-- the source code used to run. just `export PYTHONPATH=rtg.zip` to 
├── scripts -> /Users/tg/work/me/rtg/scripts  <-- link to some perl scripts for detok+BLEU
├── tensorboard    <-- Tensorboard stuff for visualizations
│   ├── events.out.tfevents.1552850552.hackb0x2
│   └── ....
└── test_step2000_beam4_ens5   <-- Tests after the end of training, BLEU scores
    ├── valid.ref -> /Users/tg/work/me/rtg/data/valid.ref
    ├── valid.src -> /Users/tg/work/me/rtg/data/valid.src
    ├── valid.out.tsv
    ├── valid.out.tsv.detok.tc.bleu
    └── valid.out.tsv.detok.lc.bleu
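The layout above can also be inspected from a script. The sketch below is not part of RTG and rests on two assumptions: that the _PREPARED and _TRAINED flag files mean what their comments say, and that checkpoint names follow the pattern model_<step>_<train loss>_<validation loss>.pkl. The order of the two losses is a guess based on the scores.tsv comment; check it against scores.tsv before relying on it. All function names here are hypothetical.

```python
import re
from pathlib import Path

# Assumed file name pattern: model_<step>_<train_loss>_<valid_loss>.pkl
CKPT_RE = re.compile(r"^model_(\d+)_(\d+\.\d+)_(\d+\.\d+)\.pkl$")


def experiment_status(exp_dir):
    """Report which flag files exist in an experiment directory."""
    exp_dir = Path(exp_dir)
    return {
        "prepared": (exp_dir / "_PREPARED").exists(),
        "trained": (exp_dir / "_TRAINED").exists(),
    }


def list_checkpoints(models_dir):
    """Return (step, train_loss, valid_loss, path) tuples sorted by step."""
    ckpts = []
    for path in Path(models_dir).iterdir():
        match = CKPT_RE.match(path.name)
        if match:  # skips scores.tsv and anything else that does not match
            step, train_loss, valid_loss = match.groups()
            ckpts.append((int(step), float(train_loss), float(valid_loss), path))
    return sorted(ckpts, key=lambda c: c[0])


def best_checkpoint(models_dir):
    """Return the checkpoint tuple with the lowest assumed validation loss."""
    ckpts = list_checkpoints(models_dir)
    if not ckpts:
        return None
    return min(ckpts, key=lambda c: c[2])
```

For the tree shown above, `best_checkpoint("001-tfm/models")` would select model_800_4.478784_4.606745.pkl among the two listed checkpoints, under the stated naming assumption.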


Authors:

See Here

Credits / Thanks
