Targeted language identifier, based on FastText and Hunspell.

FastSpell

How it works

FastSpell will try to determine the language of a sentence by using FastText.

If the detected language is very similar to the target language (e.g. FastText detected Spanish while the target language is Galician), extra checks are performed with Hunspell to determine the language more precisely.
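
The two-stage flow described above can be sketched roughly as follows. This is only an illustration, not FastSpell's actual code: `identify`, `fasttext_guess` and `hunspell_refine` are hypothetical names, and the two callables stand in for the real FastText and Hunspell steps.

```python
from typing import Callable, Dict, List


def identify(sentence: str,
             target: str,
             similar: Dict[str, List[str]],
             fasttext_guess: Callable[[str], str],
             hunspell_refine: Callable[[str, List[str]], str]) -> str:
    """Hypothetical two-stage identification: coarse guess, then refinement."""
    guess = fasttext_guess(sentence)      # stage 1: coarse prediction
    candidates = similar.get(target, [])  # languages close to the target
    if guess in candidates:
        # Stage 2: the guess is close to the target, so ask the
        # spell-checking step to choose among the similar languages.
        return hunspell_refine(sentence, candidates)
    return guess                          # unrelated language: keep the guess
```

Passing the two steps in as callables keeps the sketch runnable without FastText or Hunspell installed.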

Requirements & Installation

FastSpell can be installed from PyPI:

python3.7 -m pip install fastspell

or directly from source:

python3.7 setup.py install

Note that Hunspell requires python-dev and libhunspell-dev:

sudo apt-get install python-dev libhunspell-dev

Also note that Hunspell language packages must be installed by hand, e.g.:

sudo apt-get install hunspell-es

or downloaded from an external source, such as https://github.com/wooorm/dictionaries/tree/main/dictionaries

You can also provide the path to the Hunspell dictionary directory by using the dictpath attribute in /config/hunspell.yaml. The default is /usr/share/hunspell.

Configuration

A few configuration files are provided under the /config directory.

tokenizers.yaml

By default, MosesTokenizer(lang) is used. When there are no specific rules for lang, MosesTokenizer falls back to English. For some languages, we know that using another language works better (for example, using Spanish for Galician instead of English). Tokenizers for these languages can be customized in this file.
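
The override idea can be illustrated with a small sketch. It assumes, without confirmation from the file itself, that the overrides boil down to a mapping from a language code to the code handed to the tokenizer; `tokenizer_language` and the `overrides` dictionary are hypothetical.

```python
from typing import Dict


def tokenizer_language(lang: str, overrides: Dict[str, str]) -> str:
    """Return the language code to pass to the tokenizer.

    Languages with an override use the configured code; all others are
    passed through unchanged, leaving any fallback to the tokenizer.
    """
    return overrides.get(lang, lang)


# Assumed shape of the overrides: Galician is tokenized with Spanish rules.
overrides = {"gl": "es"}
print(tokenizer_language("gl", overrides))  # es
print(tokenizer_language("en", overrides))  # en
```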

similar.yaml

This dictionary-like file stores similar languages. These are the languages that will be "double-checked" with Hunspell after being identified with FastText. For example, see the line gl: [es, pt, gl]. This means that when the target language is Galician and FastText identifies a given sentence as Spanish, Portuguese or Galician, extra checks are performed with Hunspell to confirm which of the three similar languages best fits the sentence.

Please note that you need Hunspell dictionaries for all languages in this file. This file can be modified to remove a language you are not interested in, or a language for which you don't have Hunspell dictionaries, or to add new similar or target languages.
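
One way to check this requirement before running FastSpell is to compare the similar-language groups against the configured dictionaries. The sketch below uses plain Python dictionaries in place of the parsed YAML and assumes the Hunspell configuration maps each language code to a dictionary name. `missing_dictionaries` is a hypothetical helper, and the dictionary names `gl_ES` and `es_ES` are illustrative values only.

```python
from typing import Dict, List


def missing_dictionaries(similar: Dict[str, List[str]],
                         hunspell_codes: Dict[str, str]) -> List[str]:
    """List every language in `similar` that has no dictionary entry."""
    needed = set()
    for target, group in similar.items():
        needed.add(target)    # the target language itself
        needed.update(group)  # every language it is compared against
    return sorted(needed - set(hunspell_codes))


similar = {"gl": ["es", "pt", "gl"]}
hunspell_codes = {"gl": "gl_ES", "es": "es_ES"}  # illustrative names
print(missing_dictionaries(similar, hunspell_codes))  # ['pt']
```

An empty result means every language in the groups has a dictionary entry configured.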

hunspell.yaml

This file stores both the path to the Hunspell dictionary files (default: dictpath: /usr/share/hunspell/) and the names of those dictionaries. All similar languages must be in this list for FastSpell to work properly.

For example, the first entry in hunspell_codes is ca: ca_ES, and the dictionary path is /usr/share/hunspell/. That means that the Hunspell files for Catalan are /usr/share/hunspell/ca_ES.dic and /usr/share/hunspell/ca_ES.aff.
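
The path construction in this example can be written out as a short sketch. `hunspell_files` is a hypothetical helper, not part of FastSpell; it only joins the configured directory and dictionary name.

```python
import os
from typing import Tuple


def hunspell_files(dictpath: str, dict_name: str) -> Tuple[str, str]:
    """Return the (.dic, .aff) file paths for one Hunspell dictionary."""
    base = os.path.join(dictpath, dict_name)
    return base + ".dic", base + ".aff"


dic_file, aff_file = hunspell_files("/usr/share/hunspell/", "ca_ES")
print(dic_file)
print(aff_file)
```

Checking both paths with `os.path.exists` before starting a long run is a cheap way to catch a missing language package.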

Usage

Module:

To use FastSpell as a Python module, just install and import it:

from fastspell import FastSpell

Build a FastSpell object, like:

fsobj = FastSpell.FastSpell("en", mode="cons")

(learn more about modes in the section below)

And then use the getlang function with the sentences you want to identify, for example:

fsobj.getlang("Hello, world")
#'en'
fsobj.getlang("Hola, mundo")
#'es'
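
For more than a handful of sentences, a small wrapper around getlang can be convenient. The helper below is a hypothetical sketch: `tag_sentences` is not part of FastSpell, and it only assumes that the object passed in has a getlang method like the one shown above.

```python
from typing import Iterable, List, Tuple


def tag_sentences(identifier, sentences: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each sentence with the language code returned by getlang()."""
    return [(sentence, identifier.getlang(sentence)) for sentence in sentences]


# With FastSpell installed, usage would look like this (not run here):
#   from fastspell import FastSpell
#   fsobj = FastSpell.FastSpell("en", mode="cons")
#   tag_sentences(fsobj, ["Hello, world", "Hola, mundo"])
```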

CLI:

usage: fastspell.py [-h] [--aggr] [--cons] [-q] [--debug] [--logfile LOGFILE]
                    [-v]
                    lang [input] [output]

positional arguments:
  lang
  input              Input sentences. (default: <_io.TextIOWrapper
                     name='<stdin>' encoding='UTF-8'>)
  output             Output of the language identification. (default:
                     <_io.TextIOWrapper name='<stdout>' mode='w'
                     encoding='UTF-8'>)

optional arguments:
  -h, --help         show this help message and exit
  --aggr             Aggressive strategy (more positives) (default: False)
  --cons             Conservative strategy (less positives) (default: False)

Logging:
  -q, --quiet        Silent logging mode (default: False)
  --debug            Debug logging mode (default: False)
  --logfile LOGFILE  Store log to a file (default: <_io.TextIOWrapper
                     name='<stderr>' mode='w' encoding='UTF-8'>)
  -v, --version      show version of this script and exit

Aggressive vs Conservative

FastSpell comes in two flavours: Aggressive and Conservative.

The Aggressive mode is less hesitant to tag a sentence with the target language and never reports doubt. The Conservative mode, on the other hand, is more reluctant to tag a sentence with the target language and uses the unk (unknown) tag in case of doubt (for example, when there is a tie between the target language and another language).
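
The difference between the two modes on a tie can be illustrated with a toy resolver. This is a sketch under assumptions, not FastSpell's real decision logic: `resolve` is hypothetical, the per-language scores are invented, and the mode name "aggr" is assumed by analogy with the --aggr flag ("cons" appears in the module example).

```python
from typing import Dict


def resolve(scores: Dict[str, float], target: str, mode: str) -> str:
    """Pick a language from per-language scores (must be non-empty).

    Hypothetical tie handling: "aggr" prefers the target language,
    "cons" answers "unk" whenever the best score is shared.
    """
    best = max(scores.values())
    winners = sorted(lang for lang, s in scores.items() if s == best)
    if len(winners) == 1:
        return winners[0]  # clear winner in either mode
    if mode == "cons":
        return "unk"       # conservative: a tie counts as doubt
    # aggressive: favour the target if it is among the tied languages
    return target if target in winners else winners[0]


print(resolve({"gl": 0.8, "es": 0.8}, "gl", "aggr"))  # gl
print(resolve({"gl": 0.8, "es": 0.8}, "gl", "cons"))  # unk
```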

Benchmark

comparative.png

Usage example

Input text:

19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.
Mago da luz / Maga da luz
Celebrada a homenaxe a Xosé Manuel Seivane Rivas
A instalación eléctrica en teletraballo
Saltar á navegación Navegación INICIO
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo).
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam
Quen pode solicitar o dito financiamento?

Command:

python3.7 fastspell.py $L --aggr inputtext
python3.7 fastspell.py $L --cons inputtext

Aggressive output:

19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR     gl
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.   gl
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        gl
Mago da luz / Maga da luz       gl
Celebrada a homenaxe a Xosé Manuel Seivane Rivas        gl
A instalación eléctrica en teletraballo gl
Saltar á navegación Navegación INICIO   gl
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam  gl
Quen pode solicitar o dito financiamento?       gl

Conservative output:

19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR     unk
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.   gl
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        gl
Mago da luz / Maga da luz       unk
Celebrada a homenaxe a Xosé Manuel Seivane Rivas        gl
A instalación eléctrica en teletraballo unk
Saltar á navegación Navegación INICIO   gl
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam  gl
Quen pode solicitar o dito financiamento?       gl

Getting stats:

cat inputtext | python3.7 fastspell.py $L --aggr | cut -f2 | sort | uniq -c | sort -nr
cat inputtext | python3.7 fastspell.py $L --cons | cut -f2 | sort | uniq -c | sort -nr

Aggressive:

9 gl
1 es

Conservative:

6 gl
3 unk
1 es
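
The same counts can be computed in Python from the tab-separated output. The helper below is a hypothetical sketch that mirrors the shell pipeline above; unlike `cut -f2`, it skips lines that contain no tab.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def language_counts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count the second tab-separated field, most frequent first."""
    tags = (line.rstrip("\n").split("\t")[1] for line in lines if "\t" in line)
    return Counter(tags).most_common()


sample = ["Mago da luz / Maga da luz\tgl\n",
          "Quen pode solicitar o dito financiamento?\tgl\n",
          "Hola, mundo\tes\n"]
print(language_counts(sample))  # [('gl', 2), ('es', 1)]
```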
