Parallel corpus classifier, indicating the likelihood of a pair of sentences being mutual translations or not (neural version)
Bicleaner AI
Bicleaner AI (bicleaner-ai-classify) is a Python tool that detects noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0).
Sentence pairs considered very noisy are scored with 0.
Although a training tool (bicleaner-ai-train) is provided, you may want to use the available ready-to-use language packages.
Please use bicleaner-ai-download to download the latest language packages, or visit the GitHub releases for lite models and the Hugging Face Hub for full models (since v2.0).
Visit our docs for a detailed example on Bicleaner training.
If you find Bicleaner AI useful, please consider citing us.
What is New?
v3.0.0 Improving Multilinguality!
New improved multilingual models for zero-shot classification.
Previous news
v2.0.0, March 10, 2023
Model accuracy improvements and HF integration! See CHANGELOG.
v1.0.0, June 6 2021
Bicleaner AI is a Bicleaner fork that uses neural networks. It comes with two types of models, lite models for fast scoring and full models for high performance. Lite models use A Decomposable Attention Model for Natural Language Inference (Parikh et al.). Full models use fine-tuned XLMRoberta (Unsupervised Cross-lingual Representation Learning at Scale).
The use of XLMRoberta and the 1:10 positive-to-negative ratio were inspired by the paper of the winner of the WMT20 Parallel Corpus Filtering Task (Filtering noisy parallel corpus using transformers with proxy task learning).
Installation & Requirements
- Python >= 3.8
- PIP >= 23.0
- CUDA >=11.2 (for training and inference with full models)
Bicleaner AI is written in Python and can be installed using pip.
It also requires the KenLM Python bindings with support for 7-gram language models.
Hardrules uses FastSpell, which requires cyhunspell to be installed manually.
You can easily install all the requirements by running the following commands:
pip install bicleaner-ai git+https://github.com/MSeal/cython_hunspell@2.0.3
pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
After installation, three executables (bicleaner-ai-train, bicleaner-ai-classify, bicleaner-ai-download) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.
TensorFlow
TensorFlow 2 will be installed as a dependency and GPU support is required for training.
pip will install the latest supported TensorFlow version, but older versions (>=2.6.5) are supported and can be installed if your machine does not meet the TensorFlow CUDA requirements.
See this table for the CUDA and TensorFlow versions compatibility.
In case you want a different TensorFlow version, you can downgrade using:
pip install tensorflow==2.6.5
TensorFlow logging messages are suppressed by default. To see them, explicitly set the TF_CPP_MIN_LOG_LEVEL environment variable.
For example:
TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-classify
WARNING: If you are experiencing slowdowns because Bicleaner AI is not running on the GPU, check those logs to see whether TensorFlow is loading all the libraries correctly.
Optional requirements
For Serbo-Croatian languages, models work better with transliteration. To be able to score transliterated text, install the optional dependency:
pip install bicleaner-ai[transliterate]
Note that this won't transliterate the output text, it will be used only for scoring.
Cleaning
Getting started
bicleaner-ai-classify detects noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.
By default, the input file (the parallel corpus to be classified) is expected to have at least four columns:
- col1: URL 1
- col2: URL 2
- col3: Source sentence
- col4: Target sentence
The source and target sentence column indexes can be customized with the --scol and --tcol flags. URLs are not mandatory.
The generated output file will contain the same lines and columns that the original input file had, adding an extra column containing the Bicleaner AI classifier score.
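Since the score is appended as the last column, filtering a scored file comes down to reading that column and comparing it with a cutoff. The following is a minimal sketch, not part of Bicleaner AI: the helper name filter_scored_lines, the sample rows, and the 0.5 threshold are all assumptions chosen for illustration, and the right threshold depends on your data.

```python
def filter_scored_lines(lines, threshold=0.5):
    """Yield lines whose last tab-separated column (the score) is >= threshold.

    The threshold default is an illustrative assumption, not a recommended value.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # The classifier score is the last column of each output line.
        score = float(line.rsplit("\t", 1)[-1])
        if score >= threshold:
            yield line


# Hypothetical scored output: URL 1, URL 2, source, target, score.
scored = [
    "http://a.example/en\thttp://a.example/fr\tHello world\tBonjour le monde\t0.983",
    "http://b.example/en\thttp://b.example/fr\tHello world\tAcheter maintenant\t0.012",
    "http://c.example/en\thttp://c.example/fr\t!!!\t???\t0",
]

kept = list(filter_scored_lines(scored, threshold=0.5))
print(len(kept))
```

Lines are split on tabs by hand rather than with a CSV parser, so quote characters inside sentences are left untouched.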
Download a model
Bicleaner AI has two types of models, full and lite. Full models are recommended, as they provide much higher quality. If speed is a hard constraint for you, lite models could be an option (take a look at the speed comparison).
See available full models here and available lite models here.
You can download the model with:
bicleaner-ai-download en fr full
This will download the bitextor/bicleaner-ai-full-en-fr model from HuggingFace and store it in the cache directory.
Or you can download a lite model with:
bicleaner-ai-download en fr lite ./bicleaner-models
This will download the en-fr lite model and store it at ./bicleaner-models/en-fr.
Since version 2.3.0, full models can also be downloaded to a local path instead of the HF cache directory. In that case, to use the model, provide the local path instead of the HF identifier.
For more information about how the HF cache works, please read the official documentation.
Classifying
To classify a tab-separated file containing English sentences in the first column and French sentences in the second column, use:
bicleaner-ai-classify \
--scol 1 --tcol 2 \
corpus.en-fr.tsv \
corpus.en-fr.classified.tsv \
bitextor/bicleaner-ai-full-en-fr
where --scol and --tcol indicate the columns of the source and target sentences, corpus.en-fr.tsv is the input file, corpus.en-fr.classified.tsv is the output file, and bitextor/bicleaner-ai-full-en-fr is the HuggingFace model name.
Each line of the new file will contain the same content as the input file, adding a column with the score given by the Bicleaner AI classifier.
Note that, to use a lite model, you need to provide the model path in your local file system instead of a HuggingFace model name.
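If you call the classifier from a Python pipeline, the same invocation can be assembled as an argument list. This is a sketch only: build_classify_command is a hypothetical helper (Bicleaner AI does not ship it), and it uses just the documented --scol and --tcol flags plus the three positional arguments.

```python
import shlex


def build_classify_command(input_path, output_path, model, scol=3, tcol=4):
    """Return the argv list for bicleaner-ai-classify (hypothetical helper).

    `model` is either a HuggingFace identifier (full models) or a local
    directory (lite models, or full models downloaded to a local path).
    """
    return [
        "bicleaner-ai-classify",
        "--scol", str(scol),
        "--tcol", str(tcol),
        input_path,
        output_path,
        model,
    ]


cmd = build_classify_command(
    "corpus.en-fr.tsv",
    "corpus.en-fr.classified.tsv",
    "bitextor/bicleaner-ai-full-en-fr",
    scol=1,
    tcol=2,
)
# Show the equivalent shell command line.
print(shlex.join(cmd))

# To actually run it (requires Bicleaner AI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file names contain spaces.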
Multilingual models
There are multilingual full models available. They can potentially work with any language that XLMR supports (currently only paired with English). For a further explanation of how to train a multilingual model or how our models perform, take a look here and here.
WARNING: multilingual models will disable hardrules that expect a language parameter.
You can, however, overwrite the language code in the model configuration with the -s/--source_lang or -t/--target_lang options when classifying. For example, when scoring English-Icelandic data, use:
bicleaner-ai-classify \
--scol 1 --tcol 2 \
-t is \
corpus.en-is.tsv \
corpus.en-is.classified.tsv \
bitextor/bicleaner-ai-full-en-xx
Usage
Full description of the command-line parameters:
usage: bicleaner-ai-classify [-h] [-s SOURCE_LANG] [-t TARGET_LANG] [-S SOURCE_TOKENIZER_COMMAND] [-T TARGET_TOKENIZER_COMMAND] [--header] [--scol SCOL] [--tcol TCOL] [-b BLOCK_SIZE] [-p PROCESSES] [--batch_size BATCH_SIZE]
[--tmp_dir TMP_DIR] [--score_only] [--calibrated] [--raw_output] [--lm_threshold LM_THRESHOLD] [--disable_hardrules] [--disable_lm_filter] [--disable_porn_removal] [--disable_minimal_length]
[--run_all_rules] [--rules_config RULES_CONFIG] [--offline] [--auth_token AUTH_TOKEN] [-q] [--debug] [--logfile LOGFILE] [-v]
input [output] model
positional arguments:
input Tab-separated files to be classified
output Output of the classification (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)
model Path to model directory or HuggingFace Hub model identifier (such as 'bitextor/bicleaner-ai-full-en-fr')
options:
-h, --help show this help message and exit
Optional:
-s SOURCE_LANG, --source_lang SOURCE_LANG
Overwrite model config source language (default: None)
-t TARGET_LANG, --target_lang TARGET_LANG
Overwrite model config target language (default: None)
-S SOURCE_TOKENIZER_COMMAND, --source_tokenizer_command SOURCE_TOKENIZER_COMMAND
Source language (SL) tokenizer full command (default: None)
-T TARGET_TOKENIZER_COMMAND, --target_tokenizer_command TARGET_TOKENIZER_COMMAND
Target language (TL) tokenizer full command (default: None)
--header Input file will be expected to have a header, and the output will have a header as well (default: False)
--scol SCOL Source sentence column (starting in 1). The name of the field is expected instead of the position if --header is set (default: 3)
--tcol TCOL Target sentence column (starting in 1). The name of the field is expected instead of the position if --header is set (default: 4)
-b BLOCK_SIZE, --block_size BLOCK_SIZE
Sentence pairs per block (default: 10000)
-p PROCESSES, --processes PROCESSES
Option no longer available, please set BICLEANER_AI_THREADS environment variable (default: None)
--batch_size BATCH_SIZE
Sentence pairs per block (default: 32)
--tmp_dir TMP_DIR Temporary directory where creating the temporary files of this program (default: /tmp)
--score_only Only output one column which is the bicleaner score (default: False)
--calibrated Output calibrated scores (default: False)
--raw_output Return raw output without computing positive class probability. (default: False)
--lm_threshold LM_THRESHOLD
Threshold for language model fluency scoring. All TUs whose LM fluency score falls below the threshold will are removed (classifier score set to 0), unless the option --keep_lm_result set. (default: 0.5)
--disable_hardrules Disables the bicleaner_hardrules filtering (only bicleaner_classify is applied) (default: False)
--disable_lm_filter Disables LM filtering (default: False)
--disable_porn_removal
Don't apply porn removal (default: False)
--disable_minimal_length
Don't apply minimal length rule (default: False)
--run_all_rules Run all rules of Hardrules instead of stopping at first discard (default: False)
--rules_config RULES_CONFIG
Hardrules configuration file (default: None)
--offline Don't try to download the model, instead try directly to load from local storage (default: False)
--auth_token AUTH_TOKEN
Auth token for the Hugging Face Hub (default: None)
Logging:
-q, --quiet Silent logging mode (default: False)
--debug Debug logging mode (default: False)
--logfile LOGFILE Store log to a file (default: <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>)
-v, --version Show version of the package and exit
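With --score_only, the output contains just the score column, so you may want to join the scores back to the corpus yourself. The sketch below assumes (the help text does not state this) that the score-only output has exactly one line per input line, in the same order. attach_scores is a hypothetical helper and the sample rows are made up.

```python
def attach_scores(corpus_lines, score_lines):
    """Append each score to its corpus line as a new tab-separated column.

    Assumes both inputs have the same number of lines in the same order.
    """
    corpus = [line.rstrip("\n") for line in corpus_lines]
    scores = [line.strip() for line in score_lines]
    if len(corpus) != len(scores):
        raise ValueError(
            f"line count mismatch: {len(corpus)} corpus lines, {len(scores)} scores"
        )
    return [f"{row}\t{score}" for row, score in zip(corpus, scores)]


# Hypothetical two-column corpus and matching score-only output.
corpus = ["Hello world\tBonjour le monde\n", "Good night\tAcheter maintenant\n"]
scores = ["0.983\n", "0.012\n"]

for row in attach_scores(corpus, scores):
    print(row)
```

The explicit length check makes a truncated or misaligned score file fail loudly instead of silently pairing scores with the wrong sentences.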
Training models
Bicleaner AI provides a command-line tool to train your own model, in case available models do not fit your needs. Please go to our training documentation for a quick start and further details.
Setting the number of threads
The --processes option is no longer available. To set the maximum number of threads/processes used during training or classifying, set the BICLEANER_AI_THREADS environment variable to the desired value.
For example:
BICLEANER_AI_THREADS=12 bicleaner-ai-classify ...
If the variable is not set, the program will use all the available CPU cores.
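When launching Bicleaner AI from Python, the variable can be set on a copy of the environment rather than globally. A minimal sketch, where threaded_env is a hypothetical helper and the PATH value is only a placeholder:

```python
import os


def threaded_env(threads, base_env=None):
    """Return a copy of the environment with BICLEANER_AI_THREADS set.

    Copies os.environ unless an explicit base_env mapping is given, so the
    current process environment is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["BICLEANER_AI_THREADS"] = str(threads)
    return env


env = threaded_env(12, base_env={"PATH": "/usr/bin"})
print(env["BICLEANER_AI_THREADS"])

# The resulting mapping can be passed to a child process, for example:
# import subprocess
# subprocess.run(["bicleaner-ai-classify", "--help"], env=env, check=True)
```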
Speed
A comparison of speed, in rows per second, between the different types of models and hardware:
model | speed CPUx1 | speed GPUx1 |
---|---|---|
full | 1.78 rows/sec | 200 rows/sec |
lite | 600 rows/sec | 10,000 rows/sec |
- CPU: Intel Core i9-9960X single core (lite model batch 16, full model batch 1)
- GPU: Nvidia V100 (lite model batch 2048, full model batch 16)
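To get a feel for what these figures mean for a real corpus, you can divide the number of rows by the throughput. The sketch below uses the numbers from the table as-is; actual speed depends on your hardware, batch size and sentence lengths, and estimated_hours is a hypothetical helper. For one million rows it gives roughly 156 hours for a full model on a single CPU core and about 1.4 hours on a V100.

```python
# Throughput in rows per second, copied from the table above.
ROWS_PER_SECOND = {
    ("full", "cpu"): 1.78,
    ("full", "gpu"): 200.0,
    ("lite", "cpu"): 600.0,
    ("lite", "gpu"): 10_000.0,
}


def estimated_hours(rows, model, device):
    """Estimate scoring time in hours for `rows` sentence pairs."""
    return rows / ROWS_PER_SECOND[(model, device)] / 3600.0


for model, device in ROWS_PER_SECOND:
    hours = estimated_hours(1_000_000, model, device)
    print(f"{model:4s} {device}: {hours:8.2f} h")
```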
Citation
J. Zaragoza-Bernabeu, M. Bañón, G. Ramírez-Sánchez, S. Ortiz-Rojas,
"Bicleaner AI: Bicleaner Goes Neural",
in Proceedings of the 13th Language Resources and Evaluation Conference.
Marseille, France: Language Resources and Evaluation Conference, June 2022
@inproceedings{zaragoza-bernabeu-etal-2022-bicleaner,
title = {Bicleaner {AI}: Bicleaner Goes Neural},
author = {Zaragoza-Bernabeu, Jaume and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Ba{\~n}{\'o}n, Marta and
Ortiz Rojas, Sergio},
booktitle = {Proceedings of the Thirteenth Language Resources and Evaluation Conference},
month = jun,
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
url = {https://aclanthology.org/2022.lrec-1.87},
pages = {824--831},
abstract = {This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees.},
}
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.