Bicleaner AI
Parallel corpus classifier that indicates the likelihood of a pair of sentences being mutual translations or not (neural version).
Bicleaner AI (bicleaner-ai-classify) is a Python tool that detects noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.
Although a training tool (bicleaner-ai-train) is provided, you may want to use the ready-to-use language packages. Please visit https://github.com/bitextor/bicleaner-data/releases/latest or use ./utils/download-pack.sh to download the latest language packages.
Visit our Wiki for a detailed example of Bicleaner AI training.
What is New?
Bicleaner AI is a Bicleaner fork that uses neural networks. It comes with two types of models: lite models for fast scoring and full models for high performance. Lite models use a decomposable attention architecture ("A Decomposable Attention Model for Natural Language Inference", Parikh et al.). Full models use fine-tuned XLMRoberta ("Unsupervised Cross-lingual Representation Learning at Scale").
The use of XLMRoberta and a 1:10 positive-to-negative ratio were inspired by the winner of the WMT20 Parallel Corpus Filtering Task ("Filtering noisy parallel corpus using transformers with proxy task learning").
Installation & Requirements
Bicleaner AI is written in Python and can be installed using pip:
pip install bicleaner-ai
Bicleaner AI requires the KenLM Python bindings with support for 7-gram language models. You can easily install them by running the following commands:
git clone https://github.com/kpu/kenlm
cd kenlm
# Build and install the Python bindings with 7-gram support
pip install . --install-option="--max_order 7"
# Build and install the KenLM command-line tools (e.g. lmplz, used when training LMs)
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install
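To quickly verify that the bindings were installed correctly (a minimal sanity check, not part of the upstream instructions):
python -c "import kenlm; print('KenLM bindings OK')"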
The remaining extra modules required by Bicleaner AI will be automatically downloaded and installed/upgraded (if required) by the pip install bicleaner-ai command above.
After installation, three binary files (bicleaner-ai-train, bicleaner-ai-classify and bicleaner-ai-classify-lite) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.
TensorFlow
TensorFlow 2 will be installed as a dependency, and GPU support is required for training. pip will install the latest TensorFlow, but older versions >=2.3.2 are supported and can be installed if your machine does not meet the TensorFlow CUDA requirements. See this table for CUDA and TensorFlow version compatibility.
In case you want a different TensorFlow version, you can downgrade it using:
pip install tensorflow==2.3.2
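Since training needs a GPU, you can check that TensorFlow detects one using the standard TensorFlow 2 API (a quick sanity check, not part of the original instructions):
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"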
TensorFlow logging messages are suppressed by default; if you want to see them, you have to explicitly set the TF_CPP_MIN_LOG_LEVEL environment variable.
For example:
TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-classify
Cleaning
bicleaner-ai-classify aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.
By default, the input file (the parallel corpus to be classified) must contain at least four columns:
- col1: URL 1
- col2: URL 2
- col3: Source sentence
- col4: Target sentence
but the source and target sentence column indexes can be customized using the --scol and --tcol flags.
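For example, assuming a two-column corpus with no URL columns (hypothetical file and model names), you could run:
bicleaner-ai-classify --scol 1 --tcol 2 \
corpus.en-fr.tsv \
corpus.en-fr.classified \
models/en-fr/metadata.yaml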
The generated output file will contain the same lines and columns as the original input file, plus an extra column containing the Bicleaner AI classifier score.
This tool can be run with
bicleaner-ai-classify [-h]
[-S SOURCE_TOKENIZER_COMMAND]
[-T TARGET_TOKENIZER_COMMAND]
[--scol SCOL]
[--tcol TCOL]
[-b BLOCK_SIZE]
[-p PROCESSES]
[--batch_size BATCH_SIZE]
[--tmp_dir TMP_DIR]
[-d DISCARDED_TUS]
[--score_only]
[--calibrated]
[--raw_output]
[--disable_hardrules]
[--disable_lm_filter]
[--disable_porn_removal]
[--disable_minimal_length]
[-q]
[--debug]
[--logfile LOGFILE]
[-v]
input [output] metadata
Parameters
- positional arguments:
input: Tab-separated file to be classified (default line format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [EXTRA_COLUMNS], tab-separated). When the input is -, reads standard input.
output: Output of the classification (default: standard output). When the output is -, writes to standard output.
metadata: Training metadata (YAML file), generated by bicleaner-ai-train or downloaded as part of a language pack. You just need to untar the language pack for the pair of languages of the file you want to clean; the tar file contains the YAML metadata file. There's a script that can download and unpack it for you. Use:
$ ./utils/download-pack.sh en cs ./models
to download the English-Czech language pack to the ./models directory and unpack it.
- optional arguments:
-h, --help: show this help message and exit
- Optional:
-S SOURCE_TOKENIZER_COMMAND: Source language tokenizer full command (including flags if needed). If not given, the Sacremoses tokenizer is used (with the escape=False option).
-T TARGET_TOKENIZER_COMMAND: Target language tokenizer full command (including flags if needed). If not given, the Sacremoses tokenizer is used (with the escape=False option).
--scol SCOL: Source sentence column (starting at 1) (default: 3)
--tcol TCOL: Target sentence column (starting at 1) (default: 4)
--tmp_dir TMP_DIR: Temporary directory where the temporary files of this program are created (default: system temp dir, defined by the TMPDIR environment variable on Unix)
-b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)
-p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
-d DISCARDED_TUS, --discarded_tus DISCARDED_TUS: TSV file with discarded TUs. TUs discarded by the classifier are written to this file in TSV format. (default: None)
--lm_threshold LM_THRESHOLD: Threshold for language model fluency scoring. All sentence pairs whose LM fluency score falls below the threshold are removed (classifier score set to 0), unless the option --keep_lm_result is set. (default: 0.5)
--score_only: Only output one column, the bicleaner score (default: False)
--calibrated: Output calibrated scores (default: False)
--raw_output: Return raw output without computing the positive class probability (default: False)
--disable_hardrules: Disables the bicleaner_hardrules filtering (only bicleaner_classify is applied) (default: False)
--disable_lm_filter: Disables LM filtering.
--disable_porn_removal: Disables porn removal.
--disable_minimal_length: Don't apply the minimal length rule (default: False).
- Logging:
-q, --quiet: Silent logging mode (default: False)
--debug: Debug logging mode (default: False)
--logfile LOGFILE: Store log to a file (default: standard error)
-v, --version: show version of this script and exit
Example
bicleaner-ai-classify \
corpus.en-es.raw \
corpus.en-es.classified \
models/en-es/metadata.yaml
This will read the corpus.en-es.raw file, classify it with the classifier indicated in the models/en-es/metadata.yaml metadata file, and write the result of the classification to the corpus.en-es.classified file. Each line of the new file will contain the same content as the input file, plus a column with the score given by the Bicleaner AI classifier.
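If you only need the scores, --score_only combined with a score threshold gives a simple filtering pipeline. The following is a sketch with hypothetical file names; the 0.5 threshold is an arbitrary choice, not an official recommendation:
paste corpus.en corpus.es > corpus.en-es.tsv
bicleaner-ai-classify --scol 1 --tcol 2 --score_only \
corpus.en-es.tsv scores.txt models/en-es/metadata.yaml
# keep only the pairs scoring at or above the threshold
paste corpus.en-es.tsv scores.txt | awk -F'\t' '$3 >= 0.5 {print $1"\t"$2}' > corpus.en-es.filtered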
Training classifiers
In case you need to train a new classifier (e.g. because it is not available in the language packs provided at bicleaner-ai-data), you can use bicleaner-ai-train.
bicleaner-ai-train is a Python tool that allows you to train a classifier which predicts whether a pair of sentences are mutual translations or not and discards too-noisy sentence pairs. Visit our Wiki for a detailed example of Bicleaner AI training.
Requirements
In order to train a new classifier, you must provide:
- A clean parallel corpus (500k sentence pairs is the recommended size).
- A monolingual corpus for the source and the target language (not necessary for the xlmr classifier).
- Gzipped lists of monolingual word frequencies. You can check their format by downloading any of the available language packs.
- The SL list of word frequencies, with one entry per line. Each entry must contain the following 2 fields, split by space, in this order: word frequency (number of times a word appears in the text), SL word.
- The TL list of word frequencies, with one entry per line. Each entry must contain the following 2 fields, split by space, in this order: word frequency (number of times a word appears in the text), TL word.
- These lists can easily be obtained from a monolingual corpus and a command line in bash:
$ cat monolingual.SL \
| sacremoses -l SL tokenize -x \
| awk '{print tolower($0)}' \
| tr ' ' '\n' \
| LC_ALL=C sort | uniq -c \
| LC_ALL=C sort -nr \
| grep -v '^[[:space:]]*1[[:space:]]' \
| gzip > wordfreq-SL.gz
$ cat monolingual.TL \
| sacremoses -l TL tokenize -x \
| awk '{print tolower($0)}' \
| tr ' ' '\n' \
| LC_ALL=C sort | uniq -c \
| LC_ALL=C sort -nr \
| grep -v '^[[:space:]]*1[[:space:]]' \
| gzip > wordfreq-TL.gz
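The resulting file should contain one entry per line, frequency first (illustrative output; actual words and counts depend on your corpus, and uniq -c may pad the counts with leading spaces):
$ zcat wordfreq-SL.gz | head -3
1021 the
904 of
887 and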
Optionally, if you want the classifier to include a porn filter, you must also provide:
- A file with the training dataset for the porn removal classifier. Each sentence must begin with __label__negative or __label__positive, according to the FastText convention. It should be lowercased and tokenized.
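For illustration, a couple of made-up lines in the expected format:
__label__negative the cat sat on the mat .
__label__positive <an explicit sentence , lowercased and tokenized>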
Parameters
It can be used as follows.
bicleaner-ai-train [-h]
-m MODEL_DIR
-s SOURCE_LANG
-t TARGET_LANG
[--mono_train MONO_TRAIN]
--parallel_train PARALLEL_TRAIN
--parallel_dev PARALLEL_DEV
[-S SOURCE_TOKENIZER_COMMAND]
[-T TARGET_TOKENIZER_COMMAND]
[-F TARGET_WORD_FREQS]
[--block_size BLOCK_SIZE]
[-p PROCESSES]
[-g GPU]
[--mixed_precision]
[--save_train_data SAVE_TRAIN_DATA]
[--distilled]
[--seed SEED]
[--classifier_type {dec_attention,transformer,xlmr}]
[--batch_size BATCH_SIZE]
[--steps_per_epoch STEPS_PER_EPOCH]
[--epochs EPOCHS]
[--patience PATIENCE]
[--pos_ratio POS_RATIO]
[--rand_ratio RAND_RATIO]
[--womit_ratio WOMIT_RATIO]
[--freq_ratio FREQ_RATIO]
[--fuzzy_ratio FUZZY_RATIO]
[--neighbour_mix NEIGHBOUR_MIX]
[--porn_removal_train PORN_REMOVAL_TRAIN]
[--porn_removal_test PORN_REMOVAL_TEST]
[--porn_removal_file PORN_REMOVAL_FILE]
[--porn_removal_side {sl,tl}]
[--noisy_examples_file_sl NOISY_EXAMPLES_FILE_SL]
[--noisy_examples_file_tl NOISY_EXAMPLES_FILE_TL]
[--lm_dev_size LM_DEV_SIZE]
[--lm_file_sl LM_FILE_SL]
[--lm_file_tl LM_FILE_TL]
[--lm_training_file_sl LM_TRAINING_FILE_SL]
[--lm_training_file_tl LM_TRAINING_FILE_TL]
[--lm_clean_examples_file_sl LM_CLEAN_EXAMPLES_FILE_SL]
[--lm_clean_examples_file_tl LM_CLEAN_EXAMPLES_FILE_TL]
[-q]
[--debug]
[--logfile LOGFILE]
- positional arguments:
input: Tab-separated bilingual input file (default: standard input) (line format: SOURCE_SENTENCE TARGET_SENTENCE, tab-separated)
- optional arguments:
-h, --help: show this help message and exit
- Mandatory:
-m MODEL_DIR, --model_dir MODEL_DIR: Model directory; metadata, classifier and SentencePiece models will be saved in the same directory (default: None)
-s SOURCE_LANG, --source_lang SOURCE_LANG: Source language (default: None)
-t TARGET_LANG, --target_lang TARGET_LANG: Target language (default: None)
--mono_train MONO_TRAIN: File containing monolingual sentences of both languages shuffled together, used to train the SentencePiece embeddings. Not required for XLMR. (default: None)
--parallel_train PARALLEL_TRAIN: TSV file containing parallel sentences to train the classifier (default: None)
--parallel_dev PARALLEL_DEV: TSV file containing parallel sentences for development (default: None)
- Options:
-S SOURCE_TOKENIZER_COMMAND, --source_tokenizer_command SOURCE_TOKENIZER_COMMAND: Source language tokenizer full command (default: None)
-T TARGET_TOKENIZER_COMMAND, --target_tokenizer_command TARGET_TOKENIZER_COMMAND: Target language tokenizer full command (default: None)
-F TARGET_WORD_FREQS, --target_word_freqs TARGET_WORD_FREQS: Target language gzipped list of word frequencies (needed for frequency-based noise) (default: None)
--block_size BLOCK_SIZE: Sentence pairs per block when applying multiprocessing in the noise function (default: 10000)
-p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
-g GPU, --gpu GPU: Which GPU to use, starting from 0. Will set CUDA_VISIBLE_DEVICES. (default: None)
--mixed_precision: Use mixed-precision float16 for training (default: False)
--save_train_data SAVE_TRAIN_DATA: Save the generated dataset to a file. If the file already exists, the training dataset will be loaded from there. (default: None)
--distilled: Enable knowledge distillation training. It needs a pre-built training set with raw scores from a teacher model. (default: False)
--seed SEED: Seed for random number generation. By default, no seed is used. (default: None)
--classifier_type {dec_attention,transformer,xlmr}: Neural network architecture of the classifier (default: dec_attention)
--batch_size BATCH_SIZE: Batch size during classifier training. If None, the default architecture value will be used. (default: None)
--steps_per_epoch STEPS_PER_EPOCH: Number of batch updates per epoch during training. If None, the default architecture value or the full dataset size will be used. (default: None)
--epochs EPOCHS: Number of epochs for training. If None, the default architecture value will be used. (default: None)
--patience PATIENCE: Stop training when validation has stopped improving after PATIENCE epochs (default: None)
--pos_ratio POS_RATIO: Ratio of positive samples used to oversample the validation and test sets (default: 1)
--rand_ratio RAND_RATIO: Ratio of negative samples misaligned randomly (default: 3)
--womit_ratio WOMIT_RATIO: Ratio of negative samples misaligned by randomly omitting words (default: 3)
--freq_ratio FREQ_RATIO: Ratio of negative samples misaligned by replacing words by frequency (needs --target_word_freqs) (default: 3)
--fuzzy_ratio FUZZY_RATIO: Ratio of negative samples misaligned by fuzzy matching (default: 0)
--neighbour_mix NEIGHBOUR_MIX: Whether to use negative samples misaligned by neighbourhood (default: False)
--porn_removal_train PORN_REMOVAL_TRAIN: File with the training dataset for the FastText classifier. Each sentence must begin with '__label__negative' or '__label__positive', according to the FastText convention. It should be lowercased and tokenized. (default: None)
--porn_removal_test PORN_REMOVAL_TEST: Test set to compute precision and accuracy of the porn removal classifier (default: None)
--porn_removal_file PORN_REMOVAL_FILE: Porn removal classifier output file (default: porn_removal.bin)
--porn_removal_side {sl,tl}: Whether the porn removal should be applied to the source or the target language. (default: sl)
--noisy_examples_file_sl NOISY_EXAMPLES_FILE_SL: File with noisy text in the SL, used to estimate the perplexity of noisy text. (default: None)
--noisy_examples_file_tl NOISY_EXAMPLES_FILE_TL: File with noisy text in the TL, used to estimate the perplexity of noisy text. (default: None)
--lm_dev_size LM_DEV_SIZE: Number of sentences to be removed from the clean text before training the LMs. These are used to estimate the perplexity of clean text. (default: 2000)
--lm_file_sl LM_FILE_SL: SL language model output file. (default: None)
--lm_file_tl LM_FILE_TL: TL language model output file. (default: None)
--lm_training_file_sl LM_TRAINING_FILE_SL: SL text from which the SL LM is trained. If this parameter is not specified, the SL LM is trained from the SL side of the input file, after removing --lm_dev_size sentences. (default: None)
--lm_training_file_tl LM_TRAINING_FILE_TL: TL text from which the TL LM is trained. If this parameter is not specified, the TL LM is trained from the TL side of the input file, after removing --lm_dev_size sentences. (default: None)
--lm_clean_examples_file_sl LM_CLEAN_EXAMPLES_FILE_SL: File with clean text in the SL, used to estimate the perplexity of clean text. This option must be used together with --lm_training_file_sl, and both files must not have common sentences. This option replaces --lm_dev_size. (default: None)
--lm_clean_examples_file_tl LM_CLEAN_EXAMPLES_FILE_TL: File with clean text in the TL, used to estimate the perplexity of clean text. This option must be used together with --lm_training_file_tl, and both files must not have common sentences. This option replaces --lm_dev_size. (default: None)
- Logging:
-q, --quiet: Silent logging mode (default: False)
--debug: Debug logging mode (default: False)
--logfile LOGFILE: Store log to a file (default: standard error)
Example
bicleaner-ai-train \
--parallel_train corpus.en-cs.train \
--parallel_dev corpus.en-cs.dev \
--mono_train mono.en-cs \
-m models/en-cs \
-s en \
-t cs \
-F wordfreqs-cs.gz \
--lm_file_sl models/en-cs/lm.en --lm_file_tl models/en-cs/lm.cs \
--porn_removal_train porn-removal.txt.en --porn_removal_file models/en-cs/porn-model.en
This will train a lite classifier for English-Czech using the corpus corpus.en-cs.train, with corpus.en-cs.dev as the development set and the monolingual corpus mono.en-cs to train the vocabulary embeddings. All the model files created during training, the language model files, the porn removal file, and the metadata.yaml will be stored in the model directory models/en-cs.
To train full models, you need to use --classifier_type xlmr; in that case --mono_train is not needed.
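For example, a full-model training run might look like this (a sketch reusing the files from the example above; the model directory name is hypothetical):
bicleaner-ai-train \
--classifier_type xlmr \
--parallel_train corpus.en-cs.train \
--parallel_dev corpus.en-cs.dev \
-m models/en-cs-full \
-s en \
-t cs \
-F wordfreqs-cs.gz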
Synthetic noise
By default, the training will use the rand_ratio, womit_ratio and freq_ratio options with a value of 3. Both womit_ratio and freq_ratio will use the Sacremoses tokenizer by default. So, for languages that are not supported (or are poorly supported) by this tokenizer, source_tokenizer_command and/or target_tokenizer_command should be provided. Also note that, if a tokenizer command is used, the word frequencies need to be tokenized in the same way for frequency-based noise to work correctly.
If no tokenization is available for your languages, you can disable the noise options that use tokenization and use fuzzy matching noise instead: --womit_ratio 0 --freq_ratio 0 --fuzzy_ratio 6, as in the sketch below.
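A training invocation along those lines might look like this (hypothetical file names and language codes xx/yy):
bicleaner-ai-train \
--parallel_train corpus.xx-yy.train \
--parallel_dev corpus.xx-yy.dev \
--mono_train mono.xx-yy \
-m models/xx-yy \
-s xx \
-t yy \
--womit_ratio 0 --freq_ratio 0 --fuzzy_ratio 6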
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.