
Annotation tool for NER tasks on Jupyter


PyLighter: Annotation tool for NER tasks

PyLighter is a tool that allows data scientists to annotate a corpus of documents directly on Jupyter for NER (Named Entity Recognition) tasks.

pylighter_gif


Installation

From PyPI: https://pypi-hypernode.com/project/pylighter/

pip install pylighter
jupyter nbextension enable --py widgetsnbextension

From Github: https://github.com/PayLead/PyLighter

git clone git@github.com:PayLead/PyLighter.git
cd PyLighter
python setup.py install
jupyter nbextension enable --py widgetsnbextension

Demos

The demo folder contains working examples of PyLighter in use. To view them, open any of the ipynb files in Jupyter.

Basic usage

PyLighter is meant for annotating a corpus directly in Jupyter. Let's first define a corpus for this example:

corpus = [
    "PyLighter is an annotation tool for NER tasks directly on Jupyter. "
    + "It aims on helping data scientists easily and quickly annotate datasets. "
    + "This tool was developed by Paylead.",
    "PayLead is a fintech company specializing in transaction data analysis. "
    + "Paylead brings retail and banking together, so customers get rewarded when they buy. "
    + "Welcome to the data-for-value economy."
]

Now let's start annotating!

from pylighter import Annotation

annotation = Annotation(corpus)

Running that cell gives you the following output:

screenshot_basic_usage.png

You can now start annotating entities using the predefined labels l1, l2, etc.

When your annotation is finished, you can either click on the save button or retrieve the results in the current Notebook.

  • The save button saves the results in a CSV file named annotation.csv with two columns: the documents and the labels.
  • You can access the labels of your annotations in annotation.labels.

Note: The given labels are in IOB2 format.
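
To make the IOB2 note concrete, here is a minimal sketch of how such labels could be turned back into entity spans. It assumes that annotation.labels holds one IOB2 tag per character of each document (for example B-l1, I-l1, O); that layout is an assumption, not something stated above. extract_entities is a hypothetical helper and is not part of PyLighter.

```python
def extract_entities(document, labels):
    """Group IOB2 tags back into (text, label) pairs.

    Assumption: `labels` holds one IOB2 tag per character of `document`.
    """
    entities = []
    start, current = None, None
    for i, tag in enumerate(labels):
        # A span continues only on an I- tag that carries the same label.
        if current is not None and tag == "I-" + current:
            continue
        # Any other tag closes the span that is currently open, if any.
        if current is not None:
            entities.append((document[start:i], current))
            start, current = None, None
        # A B- tag opens a new span; "O" and stray I- tags are ignored.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        entities.append((document[start:len(labels)], current))
    return entities


document = "Made by Paylead"
labels = ["O"] * 8 + ["B-l1"] + ["I-l1"] * 6
print(extract_entities(document, labels))
```

In a notebook, the same helper could be applied to each pair of corpus[i] and annotation.labels[i], provided the one-tag-per-character assumption holds for your version.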

Advanced usage

The above example works just fine but PyLighter can be customized to best fit your specific use case.

Using an already annotated corpus

In most cases, you want to start from an already annotated corpus or simply continue a previous annotation session.

To do this, pass the existing labels of the corpus with the labels argument. Moreover, if you stopped at the ith document, you can get back to where you stopped with start_index=i.

screenshot_pre_annotated

You can see more on that with this demo.
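
As a rough sketch of resuming a session, the snippet below picks a start_index from existing labels and passes both to Annotation. The labels and start_index arguments come from the text above; first_unannotated and resume_annotation are hypothetical helpers, and the one-tag-per-character label layout is an assumption. A document that genuinely contains no entity will also be treated as not annotated yet.

```python
def first_unannotated(labels):
    """Return the index of the first document tagged only with "O", or None."""
    for index, document_labels in enumerate(labels):
        if all(tag == "O" for tag in document_labels):
            return index
    return None


def resume_annotation(corpus, labels):
    # Imported lazily so the helper above can be used without PyLighter.
    from pylighter import Annotation

    start = first_unannotated(labels)
    # Fall back to the first document when everything is already annotated.
    return Annotation(corpus, labels=labels, start_index=start or 0)


corpus = ["Made by Paylead", "Hello world"]
labels = [
    ["O"] * 8 + ["B-l1"] + ["I-l1"] * 6,  # already annotated
    ["O"] * 11,  # not annotated yet
]
print(first_unannotated(labels))
```

In a notebook cell, annotation = resume_annotation(corpus, labels) would then reopen the widget on the second document.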

Changing label names

PyLighter uses l1, l2, ..., l7 as default label names, but in most cases you want explicit labels such as Noun, Verb, etc.

You can define your own label names with the argument labels_names. You can also define your own colors for your labels with the argument labels_colors, in HEX format.

screenshot_labels_changed

You can see more on that with this demo.
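
The snippet below is a small sketch of preparing labels_names and labels_colors before handing them to Annotation. check_label_settings is a hypothetical helper; the rule of one color per label and the restriction to six-digit HEX values are conservative assumptions, not documented requirements.

```python
import re

# Six-digit HEX colors such as "#ffb3ba" (assumed to be the expected form).
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def check_label_settings(labels_names, labels_colors):
    """Validate the two lists and return them as keyword arguments."""
    if len(labels_names) != len(labels_colors):
        raise ValueError("expected one color per label name")
    for color in labels_colors:
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a HEX color: {color!r}")
    return {"labels_names": labels_names, "labels_colors": labels_colors}


settings = check_label_settings(["Noun", "Verb"], ["#ffb3ba", "#bae1ff"])
print(settings)
```

In a notebook, the result can be unpacked directly: annotation = Annotation(corpus, **settings).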

Document styling

You can adjust the font size, the minimal distance between two characters and the size of spaces with the argument char_params.

Default value for char_params is:

# Each field expects a CSS value as a string (e.g. "10px", "1em", "large", etc.)
char_params = {
    "font_size": "medium",
    "width_white_space": "10px",
    "min_width_between_chars": "4px",
}
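
If you only want to change one or two fields, a sketch like the following builds a complete char_params dictionary from the defaults listed above. make_char_params is a hypothetical helper. Whether PyLighter accepts a partial dictionary is not stated here, so the sketch always passes every field.

```python
# Defaults copied from the listing above.
DEFAULT_CHAR_PARAMS = {
    "font_size": "medium",
    "width_white_space": "10px",
    "min_width_between_chars": "4px",
}


def make_char_params(**overrides):
    """Return the defaults with the given fields replaced."""
    unknown = set(overrides) - set(DEFAULT_CHAR_PARAMS)
    if unknown:
        raise KeyError(f"unknown char_params fields: {sorted(unknown)}")
    return {**DEFAULT_CHAR_PARAMS, **overrides}


char_params = make_char_params(font_size="large", width_white_space="1em")
print(char_params)
```

In a notebook: annotation = Annotation(corpus, char_params=char_params).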

Adding additional information

In some cases, you may want to see additional information about the current document, such as its source.

To do this, you can use the argument additional_infos. This argument must be a pandas DataFrame of shape (size of the corpus, number of additional information). The ith row of the DataFrame will be associated with the ith element of the corpus.

The elements of the given DataFrame need to have a proper string representation to be correctly displayed.

For instance, to add the source to each element of the corpus:

import pandas as pd

# One row per document: the corpus defined above contains 2 documents
additional_infos = pd.DataFrame({"source":["Github", "Paylead.fr"]})
annotation = Annotation(corpus, additional_infos=additional_infos)

The result will be:

screenshot_additional_information

You can see more on that with this demo.

Adding additional outputs

In some cases, you want to flag a document as difficult to annotate, mark it as wrong, or record how confident you are in your annotation. In short, you need to return additional information.

To do this, you can use the argument additional_outputs_elements. This argument expects a list of pylighter.AdditionalOutputElement objects.

A pylighter.AdditionalOutputElement is defined like this:

from pylighter import AdditionalOutputElement

AdditionalOutputElement(
    name="name_of_my_element",
    display_type="type_of_display",  # checkbox, int_text, float_text, text, text_area
    description="Description of the element to display",
    default_value="Default value for the element"
)

Here is an example:

screenshot_additional_outputs

Note: Additional outputs are added to the save file, and you can also retrieve them with annotation.additional_outputs_values. To reuse previously returned values, pass them with the argument additional_outputs_values, in the same way as labels.

You can see more on that with this demo.
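
As an illustration, the sketch below describes three additional outputs (a difficulty flag, a confidence score and a free comment) and checks each display_type against the values listed in the definition above. element_kwargs and build_elements are hypothetical helpers, and the Python types chosen for default_value (a bool for checkbox, a float for float_text) are assumptions.

```python
# Display types listed in the AdditionalOutputElement definition above.
ALLOWED_DISPLAY_TYPES = {"checkbox", "int_text", "float_text", "text", "text_area"}


def element_kwargs(name, display_type, description, default_value):
    """Collect the keyword arguments for one AdditionalOutputElement."""
    if display_type not in ALLOWED_DISPLAY_TYPES:
        raise ValueError(f"unsupported display_type: {display_type!r}")
    return {
        "name": name,
        "display_type": display_type,
        "description": description,
        "default_value": default_value,
    }


def build_elements(kwargs_list):
    # Imported lazily so the checks above can run without PyLighter.
    from pylighter import AdditionalOutputElement

    return [AdditionalOutputElement(**kwargs) for kwargs in kwargs_list]


elements_kwargs = [
    element_kwargs("is_difficult", "checkbox", "Difficult to annotate", False),
    element_kwargs("confidence", "float_text", "Confidence in the annotation", 1.0),
    element_kwargs("comment", "text_area", "Free comment", ""),
]
print([kwargs["name"] for kwargs in elements_kwargs])
```

In a notebook: annotation = Annotation(corpus, additional_outputs_elements=build_elements(elements_kwargs)).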

Using keyboard shortcuts

Annotation tasks are repetitive, so you may want to use keyboard shortcuts to move between documents or to select another label.

By default, there are only a few shortcuts defined:

  • next: Alt + n
  • previous: Alt + p
  • skip: Alt + s
  • save: Shift + Alt + s

However, you can fully customize them with the arguments standard_shortcuts and labels_shortcuts. The standard_shortcuts argument is used to redefine shortcuts for the standard buttons, such as the next button, whereas the labels_shortcuts argument is used to define shortcuts for the labels.

A shortcut is defined like this:

from pylighter import Shortcut

Shortcut(
    name="skip",  # Name of the button to bind on (ex: "next", "skip") or name of the label (ex: "l1", "l2", or one you defined)
    key="Ò",  # Usually represents the character that is displayed.
    code="KeyS",  # Usually represents the key that is pressed.
    shift_key=False,  # Whether the shift key is pressed
    alt_key=True,
    ctrl_key=False
)

It is hard to know in advance the right values for key and code, because they depend on many factors such as your keyboard and your browser.

For this reason, you can use the ShortcutHelper to pick the right shortcut. Here is an example:

from pylighter import ShortcutHelper

ShortcutHelper()

screenshot_shortcut_helper

You can see more on that with this demo.
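
The following sketch generates one Alt + digit shortcut per label, using the Shortcut fields shown above. label_shortcut_kwargs and build_shortcuts are hypothetical helpers. The code values follow the standard "Digit1" to "Digit9" naming of keyboard events, while the key values are only placeholders: the character actually produced with Alt held down depends on your keyboard and browser, so confirm it with ShortcutHelper.

```python
def label_shortcut_kwargs(labels_names):
    """Map the nth label (starting at 1) to Alt + <digit n>."""
    if len(labels_names) > 9:
        raise ValueError("only digits 1 to 9 are available")
    return [
        {
            "name": label,
            "key": str(n),  # placeholder, check the real value with ShortcutHelper
            "code": f"Digit{n}",
            "shift_key": False,
            "alt_key": True,
            "ctrl_key": False,
        }
        for n, label in enumerate(labels_names, start=1)
    ]


def build_shortcuts(kwargs_list):
    # Imported lazily so the mapping above can be built without PyLighter.
    from pylighter import Shortcut

    return [Shortcut(**kwargs) for kwargs in kwargs_list]


shortcuts_kwargs = label_shortcut_kwargs(["Noun", "Verb"])
print([(kwargs["name"], kwargs["code"]) for kwargs in shortcuts_kwargs])
```

In a notebook, the list returned by build_shortcuts(shortcuts_kwargs) can be passed to Annotation through the label shortcuts argument described above.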

Contributing

Testing

PyLighter uses pytest. Thus, tests can be run with:

make test

PyLighter uses flake8, isort and check-manifest to control the quality of the code. You can test the quality of the code with:

make test-quality

If you wish to test everything including the packaging, you can run:

make test-all

License

MIT License

