Monolingual corpus fluency filter

monocleaner

Monocleaner is a Python tool that aims to detect disfluent sentences in a monolingual corpus. Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency. In addition to a continuous score, several handwritten rules assign a score of 0 to obviously poor sentences.

Although a training tool (monocleaner-train) is provided, you may prefer the ready-to-use language packages. Please visit https://github.com/bitextor/monocleaner-data/releases/latest or use monocleaner-download to download the latest language packages.

Citation

If you find Monocleaner useful, please consider citing the following paper:

V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez,
"Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task",
in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers.
Brussels, Belgium: Association for Computational Linguistics, October 2018

@InProceedings{prompsit:2018:WMT,
  author    = { V\'{i}ctor M. S\'{a}nchez-Cartagena and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas and Gema Ram\'{i}rez-S\'{a}nchez},
  title     = {Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task},
  booktitle = {Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {October},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics}
}

Installation & Requirements

Monocleaner uses FastSpell, which requires python-dev:

sudo apt install python-dev

Monocleaner can be installed using pip:

python3 -m pip install monocleaner

Monocleaner requires the KenLM Python bindings with support for 7-gram language models. You can install them by running the following commands:

git clone https://github.com/kpu/kenlm
cd kenlm
pip install --config-settings="--build-option=--max_order=7" .
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install
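Once the commands above finish, a quick sanity check can confirm that the bindings are visible to the interpreter you plan to run Monocleaner with. The sketch below only checks that a module named kenlm can be found; it does not verify that the build supports 7-gram models. The helper name kenlm_available is hypothetical and not part of Monocleaner or KenLM.

```python
import importlib.util


def kenlm_available():
    """Return True if a module named 'kenlm' can be found by this interpreter."""
    # find_spec returns None when the top-level module is not installed.
    return importlib.util.find_spec("kenlm") is not None


print("kenlm importable:", kenlm_available())
```

If this prints False, the pip step was probably run with a different Python environment than the current one.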

The remaining extra modules required by Monocleaner will be automatically downloaded and installed or upgraded, if required, by the pip install monocleaner command above.

After installation, two binary files (monocleaner-train and monocleaner) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.

Scoring

monocleaner aims to detect disfluent sentences in a monolingual corpus. Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency. In addition to a continuous score, several handwritten hardrules assign a score of 0 to obviously poor sentences.

The input file (monolingual corpus) must be a text file with one sentence per line. The generated output file will contain the same lines with an added column containing the Monocleaner fluency score.
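Since the score is appended as an extra column, the scored output can be filtered with a few lines of Python. The following is a minimal sketch, assuming tab-separated output with the score in the last column. The function name filter_by_score, the 0.5 threshold, and the sample lines and scores are illustrative assumptions, not values recommended or produced by Monocleaner.

```python
import io


def filter_by_score(lines, threshold=0.5):
    """Yield the text of each line whose last tab-separated column is a score >= threshold."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Everything before the last tab is the original line; the rest is the score.
        text, _, score = line.rpartition("\t")
        try:
            value = float(score)
        except ValueError:
            # Skip lines whose last column is not a number.
            continue
        if value >= threshold:
            yield text


# Handwritten sample standing in for a scored file (scores are made up).
scored = io.StringIO(
    "Esta es una frase fluida.\t0.912\n"
    "frase comprar barato ya ya\t0.104\n"
    "!!! ###\t0\n"
)
kept = list(filter_by_score(scored, threshold=0.5))
print(kept)
```

A suitable threshold depends on the language and the corpus, so it is worth inspecting a sample of scored sentences before fixing one.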

This tool can be run with:

monocleaner [-h]
            [--scol SCOL]
            [--disable_lang_ident] 
            [--disable_hardrules]
            [--disable_minimal_length]
            [--disable_hbs]
            [--score_only]
            [--annotated_output]
            [--add_lang_ident]
            [--detect_script]
            [--debug]
            [-q]
            [-v]
            model_dir [input] [output]

If input and output are omitted, it will read from stdin and write to stdout.

Parameters

  • Positional arguments:
    • model_dir: Directory where the model is stored.
    • input: Input text file, one sentence per line. When omitted jointly with output, it will read from stdin.
    • output: Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
  • Optional arguments:
    • --scol: Sentence column (starting in 1) (default: 1)
    • --disable_lang_ident: Disables language identification in hardrules. (default: False)
    • --disable_hardrules: Disables the hardrules filtering (only monocleaner fluency scoring is applied) (default: False)
    • --disable_minimal_length: Don't apply minimal length rule. (default: False)
    • --disable_hbs: Don't group Serbo-Croatian under 'hbs' tag. (default: False)
    • --score_only: Only output one column which is the monocleaner score (default: False)
    • --annotated_output: Add hardrules annotation for each sentence. (default: False)
    • --add_lang_ident: Add another column with the identified language if it's not disabled. (default: False)
    • --detect_script: Detect writing script with FastSpell (only Serbo-Croatian is supported) (default: False)
  • Logging:
    • --debug: Debug logging mode (default: False)
    • -q, --quiet: Silent logging mode (default: False)
    • -v, --version: show version of this script and exit

Example

monocleaner models/es mono.es.txt mono.es.scored.txt

This will use the Spanish model located at models/es, read mono.es.txt file and write the sentences to mono.es.scored.txt adding the monocleaner score column.
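When Monocleaner is driven from a Python script, the same call can be assembled as an argument list. The sketch below uses only flags and positional arguments shown in the usage listing above; the helper name build_monocleaner_command is hypothetical, and the actual call is left as a comment because it requires Monocleaner and a model to be installed.

```python
import shlex


def build_monocleaner_command(model_dir, input_path=None, output_path=None,
                              score_only=False, disable_hardrules=False):
    """Build the argument list for a monocleaner call (hypothetical helper)."""
    if output_path is not None and input_path is None:
        # Arguments are positional, so an output file needs an input file before it.
        raise ValueError("output_path requires input_path")
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    cmd.append(model_dir)
    if input_path is not None:
        cmd.append(input_path)
    if output_path is not None:
        cmd.append(output_path)
    return cmd


cmd = build_monocleaner_command("models/es", "mono.es.txt", "mono.es.scored.txt")
print(shlex.join(cmd))

# To actually run it (needs monocleaner and the model installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run, rather than a single shell string, avoids quoting problems with paths that contain spaces.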

Monocleaner hard-rules

monocleaner-hardrules is an optional pre-filtering step that flags obvious noise using handwritten rules and sentences whose language, as identified by FastSpell, is not the expected one. It can be used as part of the monocleaner command or on its own.

Cleaning

monocleaner-hardrules aims to detect obviously noisy sentences in a monolingual corpus. Sentences that are considered noisy are tagged with a 0 and the rest are tagged with a 1. The input monolingual file must contain at least one column with the sentences to be cleaned. If more columns are present, the index of the column holding the sentences can be set with the --scol parameter.

By default, the generated output file contains the same lines and columns as the original input file, plus an extra column containing the Monocleaner hard-rules score. The number of newly inserted columns varies depending on which parameters are enabled.

This tool can be run with:

monocleaner-hardrules [-h]
            [--scol SCOL]
            [--disable_lang_ident]
            [--disable_minimal_length]
            [--disable_hbs]
            [--score_only]
            [--add_lang_ident]
            [--detect_script]
            [--annotated_output]
            [--debug]
            [-q]
            [-v]
            language [input] [output]

Parameters

  • Positional arguments:
    • language: Language code of corpus in ISO 639-1 format (2-char code).
    • input: Input text file, one sentence per line. When omitted jointly with output, it will read from stdin.
    • output: Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
  • Optional arguments:
    • --scol: Sentence column (starting in 1) (default: 1)
    • --disable_lang_ident: Disables language identification in hardrules. (default: False)
    • --disable_minimal_length: Don't apply minimal length rule. (default: False)
    • --disable_hbs: Don't group Serbo-Croatian under 'hbs' tag. (default: False)
    • --score_only: Only output one column which is the monocleaner score (default: False)
    • --add_lang_ident: Add another column with the identified language if it's not disabled. (default: False)
    • --detect_script: Detect writing script with FastSpell (only Serbo-Croatian is supported) (default: False)
    • --annotated_output: Add hardrules annotation for each sentence. (default: False)
  • Logging:
    • --debug: Debug logging mode (default: False)
    • -q, --quiet: Silent logging mode (default: False)
    • -v, --version: show version of this script and exit

Example

monocleaner-hardrules en mono.en.txt mono.en.scored.txt

Understanding annotated output

When the --annotated_output flag is used, an extra column with each sentence's evaluation is added to the output. The keep tag (score column: 1) means that the sentence is considered good and passed all filters. Any other tag (score column: 0) means that the sentence should be rejected. The rejection reasons and their meanings are listed below, in the order in which the hard-rules are applied:

no_empty	Sentence is empty
no_titles	All words in the sentence are uppercased or in titlecase
not_too_long	Sentence is more than 1024 characters long
not_too_short	Sentence is fewer than 3 words long
no_bad_encoding	Sentence contains mojibake
no_only_symbols	The ratio of non-alphabetic characters in the sentence is more than 90%
no_only_numbers	The ratio of numeric characters in the sentence is too high
no_urls	There are URLs (disabled by default)
no_breadcrumbs	There are more than 2 breadcrumb characters in the sentence
no_unicode_noise	Too many characters from unwanted unicode in the sentence
no_space_noise	Too many consecutive single characters separated by spaces in the sentence (excludes digits)
no_paren	Too many parentheses or brackets in the sentence
no_literals	Unwanted literals: "Re:", "{{", "%s", "}}", "+++", "***", '=\"'
no_escaped_unicode	There are escaped unicode characters in the sentence
no_glued_words	There are words in the sentence containing too many uppercased characters between lowercased characters
no_repeated_words	There is more than 1 consecutive repeated word
no_wrong_language	Sentence is not in the desired language specified in the cleaning command
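To see which rules reject the most sentences in a corpus, the annotation column can be tallied. This is a minimal sketch, assuming tab-separated annotated output in which the annotation tag is the last column; that layout is an assumption, so check it against your own output first. The helper name tally_annotations is hypothetical and the sample lines are handwritten, not real tool output.

```python
from collections import Counter


def tally_annotations(lines):
    """Count how often each annotation tag appears in the last column."""
    counts = Counter()
    for line in lines:
        # Strip only the newline so that empty leading columns are preserved.
        line = line.rstrip("\n")
        if not line:
            continue
        tag = line.split("\t")[-1]
        counts[tag] += 1
    return counts


# Handwritten sample: sentence, score, annotation (assumed column order).
sample = [
    "A perfectly ordinary sentence here.\t1\tkeep\n",
    "Hi\t0\tnot_too_short\n",
    "Re: your order {{name}}\t0\tno_literals\n",
    "Ok\t0\tnot_too_short\n",
]
counts = tally_annotations(sample)
for tag, n in counts.most_common():
    print(tag, n)
```

Because the hard-rules are applied in order, each rejected sentence is counted once, under the first rule it failed.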

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
