Annotator combining different NLP pipelines

Project description

Automated annotation of natural languages using selected toolchains

License: MIT. Code style: black.

This project just had its first version release and is still under development.

Description

The nlpannotator package serves as a modular toolchain that combines different natural language processing (NLP) tools to annotate texts (sentencizing, tokenization, part-of-speech (POS) tagging and lemmatization).

Tools that can be combined are:

  • spaCy (sentencize, tokenize, POS, lemma)
  • stanza (sentencize, tokenize, POS, lemma)
  • SoMaJo (sentencize, tokenize)
  • Flair (POS)
  • Treetagger (tokenize, POS, lemma)

These tools can be combined in any desired fashion to target either maximum efficiency or maximum accuracy.

Installation

Install the project and its dependencies from PyPI:

pip install nlpannotator

The language models need to be installed separately. You can use the convenience script here, which installs the language models for all languages that have been implemented for spaCy and stanza.
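Before installing models, it can help to check which of the optional tools are importable in the current environment. The following is a minimal sketch using only the standard library. It is not part of nlpannotator, and the import names (in particular `treetaggerwrapper` for Treetagger) are assumptions that may differ from what the package actually requires.

```python
from importlib.util import find_spec

# Assumed import names for the supported tools; adjust as needed.
TOOL_MODULES = {
    "spacy": "spacy",
    "stanza": "stanza",
    "somajo": "somajo",
    "flair": "flair",
    "treetagger": "treetaggerwrapper",  # assumption, not confirmed by the README
}


def available_tools(modules=TOOL_MODULES):
    """Map each tool keyword to True if its module can be imported."""
    return {tool: find_spec(name) is not None for tool, name in modules.items()}


if __name__ == "__main__":
    for tool, ok in available_tools().items():
        print(f"{tool:<12}{'installed' if ok else 'missing'}")
```

A tool reported as installed may still lack the language model it needs, so this check only covers the Python packages themselves.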

Options

All input options are provided in an input dictionary. Two pre-set toolchains can be used: fast, which uses spaCy for all annotations, and accurate, which uses SoMaJo for sentencizing and tokenization and stanza for POS and lemma. A third option, manual, allows any combination of spaCy, stanza, SoMaJo, Flair and Treetagger, provided the tool supports the selected annotation and language.
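The way the processing options map to tools can be sketched as follows. This is an illustration of the description above, not nlpannotator's own code; the names `PRESETS` and `resolve_tools` are hypothetical.

```python
PROCESSORS = ["sentencize", "tokenize", "pos", "lemma"]

# Presets as described in the text: one tool per processor, in order.
PRESETS = {
    "fast": ["spacy", "spacy", "spacy", "spacy"],
    "accurate": ["somajo", "somajo", "stanza", "stanza"],
}


def resolve_tools(processing_option, manual_tools=None):
    """Return a {processor: tool} mapping for the chosen option (hypothetical helper)."""
    if processing_option in PRESETS:
        tools = PRESETS[processing_option]
    elif processing_option == "manual":
        if manual_tools is None or len(manual_tools) != len(PROCESSORS):
            raise ValueError("manual mode needs exactly one tool per processor")
        tools = list(manual_tools)
    else:
        raise ValueError(f"unknown processing_option: {processing_option!r}")
    return dict(zip(PROCESSORS, tools))


print(resolve_tools("accurate"))
print(resolve_tools("manual", ["somajo", "somajo", "flair", "spacy"]))
```

In manual mode the caller is still responsible for choosing tools that support the requested annotation and language.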

| Keyword | Default setting | Possible options | Description |
| --- | --- | --- | --- |
| input | example_en.txt | | Name of the text file containing the raw text for annotation |
| corpus_name | test | | Name of the corpus that is generated |
| language | en | see below | Language of the text to annotate |
| processing_option | manual | fast, accurate, manual | Select the tool pipeline; fast and accurate provide you with good default options for English |
| processing_type | sentencize, tokenize, pos, lemma | see below | |
| tool | spacy, spacy, spacy, spacy | see below | Tool to use for each of the four annotation types |
| output_format | xml | xml, vrt | Format of the generated annotated text file |
| encoding | yes | yes, no | Directly encode the annotated text file into cwb |
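As an illustration, the defaults above could be collected in a plain Python dictionary and checked before a run. This is a sketch only: the helper `build_input_dict` is hypothetical, and writing list-like values as comma-separated strings and `encoding` as `"yes"`/`"no"` follows the table rather than the package's actual input schema.

```python
# Defaults copied from the options table. Representing list-like values as
# comma-separated strings is an assumption of this sketch.
DEFAULTS = {
    "input": "example_en.txt",
    "corpus_name": "test",
    "language": "en",
    "processing_option": "manual",
    "processing_type": "sentencize, tokenize, pos, lemma",
    "tool": "spacy, spacy, spacy, spacy",
    "output_format": "xml",
    "encoding": "yes",
}

ALLOWED = {
    "processing_option": {"fast", "accurate", "manual"},
    "output_format": {"xml", "vrt"},
    "encoding": {"yes", "no"},
}


def split_list(value):
    """Split a comma-separated option string into a list of stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_input_dict(**overrides):
    """Merge overrides into the defaults and run basic consistency checks."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown keywords: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    for key, allowed in ALLOWED.items():
        if options[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    if len(split_list(options["processing_type"])) != len(split_list(options["tool"])):
        raise ValueError("need exactly one tool per processing type")
    return options


print(build_input_dict(language="de", output_format="vrt"))
```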

Tools

The available annotation tools are selected using the keywords spacy, stanza, somajo, flair and treetagger.

Processors

The available processors depend on the selected tool. This is a summary of the possible options:

| Tool | Available processors |
| --- | --- |
| spacy | sentencize, tokenize, pos, lemma |
| stanza | sentencize, tokenize, pos, lemma |
| somajo | sentencize, tokenize |
| flair | pos |
| treetagger | tokenize, pos, lemma |

Some of the processors depend on each other. For example, pos and lemma are only possible after sentencize and tokenize. tokenize depends on sentencize.
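The tool capabilities and processor dependencies described above can be checked before running a pipeline. The sketch below encodes the summary table and the stated dependencies; `check_pipeline` is a hypothetical helper and not part of the nlpannotator API.

```python
# Processors offered by each tool, as listed in the summary table.
TOOL_PROCESSORS = {
    "spacy": {"sentencize", "tokenize", "pos", "lemma"},
    "stanza": {"sentencize", "tokenize", "pos", "lemma"},
    "somajo": {"sentencize", "tokenize"},
    "flair": {"pos"},
    "treetagger": {"tokenize", "pos", "lemma"},
}

# Dependencies as stated in the text.
DEPENDS_ON = {
    "sentencize": set(),
    "tokenize": {"sentencize"},
    "pos": {"sentencize", "tokenize"},
    "lemma": {"sentencize", "tokenize"},
}


def check_pipeline(steps):
    """Check an ordered list of (processor, tool) pairs; return a list of problems."""
    problems = []
    done = set()
    for processor, tool in steps:
        if processor not in TOOL_PROCESSORS.get(tool, set()):
            problems.append(f"{tool} does not provide {processor}")
        missing = DEPENDS_ON.get(processor, set()) - done
        if missing:
            problems.append(f"{processor} requires {', '.join(sorted(missing))} first")
        done.add(processor)
    return problems


pipeline = [("sentencize", "somajo"), ("tokenize", "somajo"),
            ("pos", "flair"), ("lemma", "spacy")]
print(check_pipeline(pipeline) or "pipeline looks consistent")
```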

Languages

The available languages depend on the selected tool. So far, the following languages have been added to the pipeline. Additional language models may be available for the respective tool but have not been added to this package; for stanza, the pipeline will still run and load the model on demand.

| Tool | Available languages |
| --- | --- |
| spacy | en, de, fr, it, ja, pt, ru, es |
| stanza | load on demand from available stanza models |
| somajo | en, de |
| flair | en, de |
| treetagger | en, de, fr, es (both tokenization and pos/lemma) |
| treetagger | bg, nl, et, fi, gl, it, kr, la, mn, pl, ru, sk, sw (only pos/lemma) |
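A language check based on the table above might look like the following sketch. The function `is_supported` is hypothetical; the language codes are copied as listed, and stanza is treated as always available because its models are loaded on demand, which this sketch does not verify.

```python
SPACY_LANGS = {"en", "de", "fr", "it", "ja", "pt", "ru", "es"}
SOMAJO_LANGS = {"en", "de"}
FLAIR_LANGS = {"en", "de"}
TREETAGGER_FULL = {"en", "de", "fr", "es"}  # tokenization and pos/lemma
TREETAGGER_TAG_ONLY = {"bg", "nl", "et", "fi", "gl", "it", "kr",
                       "la", "mn", "pl", "ru", "sk", "sw"}  # only pos/lemma


def is_supported(tool, language, processor):
    """Return True if the table lists `language` for `tool` and `processor`."""
    if tool == "spacy":
        return language in SPACY_LANGS
    if tool == "stanza":
        # Loaded on demand; whether a stanza model exists is not checked here.
        return True
    if tool == "somajo":
        return language in SOMAJO_LANGS
    if tool == "flair":
        return language in FLAIR_LANGS
    if tool == "treetagger":
        if language in TREETAGGER_FULL:
            return processor in {"tokenize", "pos", "lemma"}
        return language in TREETAGGER_TAG_ONLY and processor in {"pos", "lemma"}
    raise ValueError(f"unknown tool: {tool!r}")


print(is_supported("treetagger", "nl", "tokenize"))
print(is_supported("treetagger", "nl", "lemma"))
```

Apart from the Treetagger rows, this only checks the language; whether a tool offers a given processor at all is a separate question.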

Input/Output

nlpannotator expects a raw text file as an input, together with an input dictionary that specifies the selected options. The input dictionary is also printed out when a run is initiated, so that the selected options are stored and can be looked up at a later time. Both of these can be provided through a Jupyter interface as in the Demo Notebook.

The generated output is in either vrt format (for cwb) or xml. Both output formats can be encoded directly into cwb.
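For readers unfamiliar with vrt (verticalized text), the sketch below writes one token per line with tab-separated annotation columns inside sentence tags, which is the general layout cwb expects. The column order (word, POS, lemma), the enclosing `<text>` element and the function name `to_vrt` are assumptions of this example; the files that nlpannotator actually produces may be structured differently.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(sentences, corpus_name="test"):
    """Serialize sentences to a vrt string.

    `sentences` is a list of sentences, each a list of (word, pos, lemma) tuples.
    """
    lines = [f"<text id={quoteattr(corpus_name)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, pos, lemma in sentence:
            # One token per line; escape XML special characters in each column.
            lines.append("\t".join(escape(col) for col in (word, pos, lemma)))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


example = [[("This", "DT", "this"), ("works", "VBZ", "work"), (".", ".", ".")]]
print(to_vrt(example), end="")
```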

Demo notebook

Take a look at the DemoNotebook or run it on Binder.

Questions and bug reports

Please ask questions / submit bug reports using our issue tracker.

Contribute

Contributions are welcome. Please fork the nlpannotator repo and open a pull request for any changes to the code. These will be reviewed and merged by our team. Make sure that your contributions are clean and properly formatted, and that any new modules follow the general design principle.

Take a look at the source code documentation.

Additions must have at least 80% test coverage.

Releases

A summary of the releases and their release notes is available here.