Junk Not-Junk Detector
This tool is built to do just one simple task: tell junk from not-junk text in a variety of languages. It is like the famous hotdog/not-hotdog classifier, but applied to natural language text. It can be very useful for testing tools that extract, decompress, and/or decrypt natural language text.
Setup
Uses fairseq
# Optionally create a brand new conda environment for this
#conda create -n junkdetect python=3.7 pip
#conda activate junkdetect
# Installing directly from github
pip install git+https://github.com/thammegowda/junkdetect
# Installing after cloning this repo
git clone https://github.com/thammegowda/junkdetect
cd junkdetect
pip install .
How to use
Once you install it via pip, junkdetect or python -m junkdetect can be used to invoke it from the command line:
printf "This is a good sentence. \nT6785*&^T is 747658 you T&*^\n" | junkdetect
0.999824 This is a good sentence.
0.0747487 T6785*&^T is 747658 you T&*^
The output is one line per input, with two columns separated by \t.
The first column has the perplexity score: a lower value (i.e., close to 0.0) means junk and a higher value (close to 1.0) means not-junk. If you don't want the input sentences back in the output, cut them out: just use junkdetect | cut -f1 > scores.txt
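If you would rather post-process the output in Python than with cut, the two-column format described above is easy to parse. The following is a minimal sketch, not part of junkdetect itself: the function names parse_scores and keep_not_junk are hypothetical, and the 0.5 threshold is an arbitrary assumption that you should tune on your own data.

```python
from typing import Iterable, List, Tuple


def parse_scores(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Parse 'score<TAB>text' lines into (score, text) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text survive.
        score, _, text = line.partition("\t")
        pairs.append((float(score), text))
    return pairs


def keep_not_junk(pairs: List[Tuple[float, str]],
                  threshold: float = 0.5) -> List[str]:
    """Return texts whose score is at or above the (assumed) threshold."""
    return [text for score, text in pairs if score >= threshold]


# Sample lines copied from the example output above.
sample = [
    "0.999824\tThis is a good sentence. \n",
    "0.0747487\tT6785*&^T is 747658 you T&*^\n",
]
for text in keep_not_junk(parse_scores(sample)):
    print(text)
```

To use this on real data, save the junkdetect output to a file and pass the open file object to parse_scores.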
How does this work
junkdetect looks like only a few lines of Python code, but under the hood it hides a great deal of complexity.
It uses perplexity from neural (masked/auto-regressive) language models that were trained on terabytes of web text in hundreds of languages.
Specifically, it uses Facebookresearch's XLM-R, retrieved from torch.hub.
Quoting the original developers of XLM-R and their paper (see Table 6):
XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
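To make the perplexity idea concrete, here is a minimal sketch of how per-token log-probabilities can be turned into a perplexity and then into a score between 0 and 1. This README does not show junkdetect's exact formula, so treat the inverse-perplexity mapping below as an assumption chosen only because it is consistent with "close to 1.0 means not-junk". The token probabilities are made up for illustration; with a real model they would come from the language model (for a masked model, typically by masking each token in turn).

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def inverse_perplexity(token_log_probs: Sequence[float]) -> float:
    """1 / perplexity, i.e. the geometric mean of the token probabilities.

    Always in (0, 1]; higher means the model found the text more predictable.
    Using this as the junk score is an assumption, not junkdetect's
    confirmed method.
    """
    return 1.0 / perplexity(token_log_probs)


# Made-up probabilities: a model is confident about fluent text
# and surprised by random characters.
fluent = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]
garbled = [math.log(p) for p in (0.05, 0.02, 0.1, 0.03)]

print("fluent :", inverse_perplexity(fluent))
print("garbled:", inverse_perplexity(garbled))
```

Averaging in log space keeps the score independent of sentence length, which is why a per-token mean is used rather than the raw product of probabilities.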
Back Story and Acknowledgements:
- This idea came out of a discussion with Tim Allison. He said it was hard to tell whether text was correctly extracted from files like PDFs using Apache Tika. Thanks to him for making me think of something like this.
- I had read Facebook's very nice XLM-R paper by Conneau et al., and it was top of my mind.
Although the XLM folks didn't help me get perplexity, and I had to dig it out of their code by myself,
I would still like to thank them for making such useful pretrained models available and easy to use via torch.hub.
Developers:
- Thamme Gowda (wrote version 0.1)