Monolingual corpus fluency filter
monocleaner
Monocleaner is a Python tool that aims to detect disfluent sentences in a monolingual corpus. Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency. In addition to a continuous score, several handwritten rules assign a score of 0 to obviously poor sentences.
Although a training tool (monocleaner-train) is provided, you may want to use the available ready-to-use language packages.
Please visit https://github.com/bitextor/monocleaner-data/releases/latest or use monocleaner-download to download the latest language packages.
Citation
If you find Monocleaner useful, please consider citing the following papers:
V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez,
"Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task",
in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers.
Brussels, Belgium: Association for Computational Linguistics, October 2018
@InProceedings{prompsit:2018:WMT,
author = { V\'{i}ctor M. S\'{a}nchez-Cartagena and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas and Gema Ram\'{i}rez-S\'{a}nchez},
title = {Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task},
booktitle = {Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics}
}
Installation & Requirements
Monocleaner uses FastSpell, which requires python-dev and libhunspell-dev:
sudo apt install python-dev libhunspell-dev
Monocleaner can be installed using pip:
python3.7 -m pip install monocleaner
Monocleaner requires the KenLM Python bindings with support for 7-gram language models. You can install them by running the following commands:
git clone https://github.com/kpu/kenlm
cd kenlm
pip install --config-settings="--build-option=--max_order=7" .
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install
The remaining modules required by Monocleaner are downloaded and installed or upgraded automatically, if needed, by the pip install monocleaner command above.
After installation, two binary files (monocleaner-train and monocleaner) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.
Scoring
monocleaner aims to detect disfluent sentences in a monolingual corpus.
Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency.
In addition to a continuous score, several handwritten hardrules assign a score of 0 to obviously poor sentences.
The input file (monolingual corpus) must contain one sentence per line. The generated output file will contain the same lines, with an added column containing the Monocleaner fluency score.
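As a minimal sketch of how the scored output might be consumed, the following Python snippet keeps only the lines whose score meets a threshold. It assumes the score is the last tab-separated column of each line, as described above. The function name filter_by_score, the 0.5 threshold, and the sample scores are illustrative choices and are not part of Monocleaner.

```python
from typing import Iterable, Iterator


def filter_by_score(lines: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield the text of each scored line whose fluency score is >= threshold.

    Assumes each line is the original text followed by a tab and the score.
    The 0.5 default is an arbitrary illustration, not a recommended value.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so that tabs inside the text are preserved.
        text, _, score = line.rpartition("\t")
        if float(score) >= threshold:
            yield text


if __name__ == "__main__":
    # Made-up lines and scores, for illustration only.
    sample = ["A fluent sentence.\t0.91\n", "asdf qwer zxcv\t0.0\n"]
    for sentence in filter_by_score(sample):
        print(sentence)
```

A suitable threshold depends on the language and the corpus, so it is worth inspecting a sample of scored lines before choosing one.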
This tool can be run with:
monocleaner [-h]
[--disable_minimal_length]
[--disable_hardrules]
[--score_only]
[--annotated_output]
[--add_lang_ident]
[--debug]
[-q]
model_dir [input] [output]
If input and output are omitted, it will read from stdin and write to stdout.
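The stdin/stdout mode can be driven from another program. The sketch below, in Python, sends sentences to a command on standard input and collects its standard output. The helper name score_via_pipe is hypothetical, and the command is passed in as an argument rather than hard-coded, so nothing here is Monocleaner's own API.

```python
import subprocess
from typing import List, Sequence


def score_via_pipe(sentences: Sequence[str], cmd: Sequence[str]) -> List[str]:
    """Write one sentence per line to cmd's stdin and return its stdout lines.

    cmd is an argument list such as ["monocleaner", "<model_dir>"]; the
    command must be installed and must read stdin and write stdout.
    """
    payload = "".join(sentence + "\n" for sentence in sentences)
    result = subprocess.run(
        list(cmd),
        input=payload,
        capture_output=True,
        encoding="utf-8",  # exchange text with the child process as UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()
```

With Monocleaner installed and a model directory available, a call such as score_via_pipe(["Hola, ¿qué tal?"], ["monocleaner", "models/es"]) would be expected to return the input lines with the score column appended. The models/es path is only a placeholder.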
Parameters
- Positional arguments:
  - model_dir: Directory where the model is stored.
  - input: Input text file, one sentence per line. When omitted jointly with output, it will read from stdin.
  - output: Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
- Optional arguments:
  - --score_only: Only output one column which is the monocleaner score (default: False)
  - --add_lang_ident: Add another column with the identified language if it's not disabled.
  - --disable_hardrules: Disables the hardrules filtering (only monocleaner fluency scoring is applied) (default: False)
  - --disable_minimal_length: Don't apply minimal length rule (default: False).
- Logging:
  - -q, --quiet: Silent logging mode (default: False)
  - --debug: Debug logging mode (default: False)
  - -v, --version: show version of this script and exit
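To illustrate how the positional arguments and flags combine, here is a small Python sketch that assembles an argument list for the monocleaner command. The helper build_monocleaner_cmd and its keyword names are hypothetical; only the flags themselves come from the list above.

```python
from typing import List, Optional


def build_monocleaner_cmd(
    model_dir: str,
    input_path: Optional[str] = None,
    output_path: Optional[str] = None,
    *,
    score_only: bool = False,
    add_lang_ident: bool = False,
    disable_hardrules: bool = False,
    disable_minimal_length: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Return an argv list for monocleaner built from the documented options."""
    if output_path is not None and input_path is None:
        # Both are positional, so an output cannot be given without an input.
        raise ValueError("output_path requires input_path")
    flags = {
        "--score_only": score_only,
        "--add_lang_ident": add_lang_ident,
        "--disable_hardrules": disable_hardrules,
        "--disable_minimal_length": disable_minimal_length,
        "--quiet": quiet,
    }
    cmd = ["monocleaner"]
    cmd += [flag for flag, enabled in flags.items() if enabled]
    cmd.append(model_dir)
    # Omitted paths mean stdin/stdout, so they are simply left out.
    cmd += [path for path in (input_path, output_path) if path is not None]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; prints the argument list without running anything.
    print(build_monocleaner_cmd("models/es", "in.txt", "out.txt", score_only=True))
```

The resulting list can be passed to subprocess.run; building a list rather than a shell string avoids quoting problems with paths that contain spaces.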
Example
monocleaner models/es mono.es.txt mono.es.scored.txt
This will use the Spanish model located at models/es, read the mono.es.txt file, and write the sentences to mono.es.scored.txt, adding the monocleaner score column.
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.