Scheduled training for machine translation systems
Trainer
The purpose of the trainer is to provide the user with a flexible way of scheduling various sources of input data, as well as augmenting the training data with title casing, all caps, etc. This is particularly useful when you have multiple data sources and want to pretrain the model first on backtranslated data, gradually add other sources of data, and finally fine-tune, all in one go.
This tool is also well suited to training multilingual models, as it provides an easy way to define the desired mixture of datasets from different language sources.
Configuration file
Define your training process via a configuration file. You define the datasets at the top, then the stages, and for each stage the mixing ratios and a termination criterion. An example configuration file is provided below. The trainer path points to any neural network trainer that accepts its training input on stdin.
# Datasets are already TSV files
datasets:
  clean: test/data/clean
  medium: test/data/medium
  dirty: test/data/dirty

stages:
  - start
  - mid
  - end

start:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean 2 # Until two epochs of clean

mid:
  - clean 0.6
  - medium 0.3
  - dirty 0.1
  - until medium 1

end:
  - clean 0.4
  - medium 0.3
  - dirty 0.3
  - until dirty 5 # use `inf` to mean until forever

modifiers:
  - uppercase 0.05 # Apply uppercase randomly to 5% of sentences. Use 0 to disable
  - titlecase 0.05 # Apply titlecase randomly to 5% of sentences. Use 0 to disable

seed: 1111
trainer: /path/to/trainer/run.py
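To make the stage semantics concrete, here is a minimal Python sketch of the mixing idea, not OpusTrainer's actual implementation: mix_stage, its EOF handling, and the file layout are illustrative assumptions. Each stage draws lines from its datasets with probability proportional to the configured ratios.

import random

def mix_stage(readers, ratios, rng):
    """Yield lines from `readers` (name -> line iterator), choosing a
    source at each step with probability proportional to `ratios`."""
    names = [name for name, ratio in ratios.items() if ratio > 0]
    weights = [ratios[name] for name in names]
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            yield next(readers[name])
        except StopIteration:
            # The real trainer re-shuffles and reopens a finished dataset
            # and checks the stage's `until` termination criterion instead.
            return

rng = random.Random(1111)  # the `seed` key from the configuration
readers = {name: open(f'test/data/{name}') for name in ('clean', 'medium', 'dirty')}
# The `start` stage: 80% clean, 20% medium, dirty disabled.
for line in mix_stage(readers, {'clean': 0.8, 'medium': 0.2, 'dirty': 0}, rng):
    print(line, end='')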
Usage
% ./trainer.py --help
usage: trainer.py [-h] --config CONFIG [--temporary-directory TEMPORARY_DIR] [--state STATE_FILE] [--do-not-resume] [--sync] [trainer-command [arguments]]
Feeds marian tsv data for training.
options:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        YML configuration input.
  --temporary-directory TEMPORARY_DIR, -t TEMPORARY_DIR
                        Temporary dir, used for shuffling and tracking state
  --state STATE_FILE    Path to trainer state file which stores how much of
                        each dataset has been read. Defaults to ${CONFIG}.state
  --sync                Do not shuffle in the background
  --do-not-resume, -d   Do not resume from the previous training state
Once you fix the paths in the configuration file, train_config.yml, you can run a test case by doing:
./trainer.py -c train_config.yml
You can check the resulting mixed file in /tmp/test. If your neural network trainer doesn't support training from stdin, you can use this tool to generate a training dataset and then disable data reordering or shuffling in your trainer implementation, as the training input will already be balanced.
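For example, assuming the launched trainer command inherits standard output, a pass-through command such as cat can stand in for a real trainer so the mixed stream is captured in a file (the output name here is arbitrary):
./trainer.py -c train_config.yml cat > mixed_train.tsv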
At the start of training, all datasets are shuffled. Each time a dataset's end is reached, it is re-shuffled. Shuffling happens in the system temporary directory by default, but the location can be changed with --temporary-directory or the TMPDIR environment variable. By default, the training state is kept in the same place as the configuration file. If training is interrupted, re-running the trainer should resume from where it left off (depending on how much your neural network trainer has buffered, that part will be skipped).
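For example, to put the shuffling scratch space on a bigger or faster disk (the path is illustrative):
TMPDIR=/fast/scratch ./trainer.py -c train_config.yml
or, equivalently:
./trainer.py -c train_config.yml -t /fast/scratch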
Generating vocabulary and placeholders before training
To use the placeholder code to augment your training data with placeholders before training, take a look at this example script:
#!/usr/bin/env bash
# Get the placeholders
../placeholders/placeholders.py -c train_config_bgen.yml --dump_placeholders > my_placeholders
# Train the vocabulary
spm_train --bos_id=-1 --eos_id=0 --unk_id=1 --user_defined_symbols_file my_placeholders \
  --model_prefix="test/vocab.bgen" --vocab_size=12000 \
  --input="/home/dheart/uni_stuff/postdoc/empty-train/trainer/test/data/clean.bgen" \
  --shuffle_input_sentence=true --character_coverage 1
# Move the vocabulary to its new location
mv test/vocab.bgen.model test/vocab.bgen.spm
# Apply placeholders to all datasets
for myfile in test/data/*.bgen; do
  ../placeholders/placeholders.py -n --strict --encode -c train_config_bgen.yml < ${myfile} > ${myfile}.pls
done
You need to augment the training configuration with additional placeholder configuration settings:
vocab: /home/dheart/uni_stuff/postdoc/empty-train/trainer/test/vocab.bgen.spm
placeholder-symbol: "<PLACEHOLDER>"
num-placeholders: 4
regexes:
- (https?:\/\/www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})
- (www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})
- ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
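To illustrate what the encoding step does, here is a rough Python sketch, not the actual placeholders.py implementation: the encode function and its greedy per-regex budget are assumptions for illustration. Spans matching the configured regexes are replaced by the placeholder symbol, so the vocabulary never needs to model raw URLs or email addresses.

import re

PLACEHOLDER = '<PLACEHOLDER>'  # the `placeholder-symbol` setting
REGEXES = [
    re.compile(r'(https?:\/\/www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})'),
    re.compile(r'(www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})'),
    re.compile(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'),
]

def encode(sentence: str, max_placeholders: int = 4) -> str:
    """Replace up to `max_placeholders` regex matches (the
    `num-placeholders` setting) with the placeholder symbol."""
    budget = max_placeholders
    for regex in REGEXES:
        if budget <= 0:
            break
        sentence, n = regex.subn(PLACEHOLDER, sentence, count=budget)
        budget -= n
    return sentence

print(encode('Write to john.doe@example.com or visit www.example.com/docs'))
# -> Write to <PLACEHOLDER> or visit <PLACEHOLDER>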
After the vocabulary is trained and the data is preprocessed, proceed with a normal training run.
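For example, reusing the configuration from the script above, assuming it now contains the placeholder settings:
./trainer.py -c train_config_bgen.yml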
Future work
- Terminology support (using a dictionary). We should augment the training data with terminology (possibly stemmed on the source side) so that it can be used in real-world models.
- A one-click training run.
Acknowledgements
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].