Junk Not-Junk Detector

This tool does one simple task: it tells junk from not-junk text in a variety of languages. Think of the famous hotdog/not-hotdog classifier, but applied to natural language text. It is useful for testing tools that extract, decompress, and/or decrypt natural language text.

Setup

Uses fairseq

# Optionally create a brand new conda environment for this
#conda create -n junkdetect python=3.7 pip 
#conda activate junkdetect

# Installing directly from github
pip install git+https://github.com/thammegowda/junkdetect

# Installing after cloning this repo
git clone https://github.com/thammegowda/junkdetect
cd junkdetect
pip install .

How to use

Once installed via pip, the tool can be invoked from the command line as junkdetect or python -m junkdetect:

printf "This is a good sentence. \nT6785*&^T is 747658 you T&*^\n" | junkdetect
0.999824	This is a good sentence.
0.0747487	T6785*&^T is 747658 you T&*^

The output has one line per input line, with two columns separated by a tab (\t). The first column is a perplexity-based score: a lower value (i.e., close to 0.0) means junk, and a higher value (close to 1.0) means not-junk. The second column echoes the input sentence. If you don't want the input sentences in the output, cut them out: junkdetect | cut -f1 > scores.txt
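If the scores are consumed from Python rather than with cut, the two-column output described above can be parsed with a few lines of standard-library code. This is only a sketch: parse_scores and keep_not_junk are hypothetical helper names, not part of junkdetect, and the 0.5 threshold is an arbitrary assumption that would need tuning on real data.

```python
from typing import Iterable, List, Tuple


def parse_scores(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Parse lines of the form '<score>\\t<text>' into (score, text) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        score, _, text = line.partition("\t")
        rows.append((float(score), text))
    return rows


def keep_not_junk(rows: List[Tuple[float, str]], threshold: float = 0.5) -> List[str]:
    """Return the texts whose score is at or above the (assumed) threshold."""
    return [text for score, text in rows if score >= threshold]


# Sample lines copied from the example output shown earlier.
sample = [
    "0.999824\tThis is a good sentence.",
    "0.0747487\tT6785*&^T is 747658 you T&*^",
]
print(keep_not_junk(parse_scores(sample)))
```

With the sample above, only the first sentence passes the threshold.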

How does this work

junkdetect looks like only a few lines of Python code, but it hides a great deal of complexity under the hood.
It uses perplexity from neural (masked/auto-regressive) language models that were trained on terabytes of web text from about 100 languages.
Specifically, it uses Facebook Research's XLM-R, retrieved from torch.hub. Quoting the original developers of XLM-R and their paper (see Table 6):

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese (Zawgyi font), Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
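To make the idea of a perplexity-based score concrete, here is a minimal sketch of how per-token probabilities from a language model could be turned into a value between 0 and 1. It does not load XLM-R and is not taken from the junkdetect source: the token probabilities are made-up numbers, the function names are hypothetical, and using inverse perplexity (the geometric mean of token probabilities) as the score is an assumption about one plausible normalization, not a description of what the tool actually computes.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token.

    Every probability must be > 0, otherwise math.log raises ValueError.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def not_junk_score(token_probs: Sequence[float]) -> float:
    """Inverse perplexity, i.e. the geometric mean of the token probabilities.

    ASSUMPTION: this mapping into (0, 1] is illustrative only.
    """
    return 1.0 / perplexity(token_probs)


# Hypothetical probabilities a model might assign to each token.
fluent = [0.9, 0.95, 0.99, 0.97]   # model finds every token likely
garbled = [0.05, 0.1, 0.02, 0.2]   # model finds every token surprising

print(round(not_junk_score(fluent), 3), round(not_junk_score(garbled), 3))
```

Under this sketch, text the model finds predictable gets low perplexity and a score near 1.0, while text it finds surprising gets high perplexity and a score near 0.0, which matches the direction of the scores in the usage example.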

Back Story and Acknowledgements:

Developers:
