Annotation tool for NER tasks on Jupyter
PyLighter: Annotation tool for NER tasks
PyLighter is a tool that allows data scientists to annotate a corpus of documents directly on Jupyter for NER (Named Entity Recognition) tasks.
Installation
From PyPI: https://pypi-hypernode.com/project/pylighter/
pip install pylighter
jupyter nbextension enable --py widgetsnbextension
From GitHub: https://github.com/PayLead/PyLighter
git clone git@github.com:PayLead/PyLighter.git
cd PyLighter
python setup.py install
jupyter nbextension enable --py widgetsnbextension
Demos
The demo folder contains working examples of PyLighter in use. To view them, open any of the ipynb files in Jupyter.
Basic usage
PyLighter is meant to make annotating a corpus in Jupyter easy. Let's first define a corpus for this example:
corpus = [
    "PyLighter is an annotation tool for NER tasks directly on Jupyter. "
    + "It aims on helping data scientists easily and quickly annotate datasets. "
    + "This tool was developed by Paylead.",
    "PayLead is a fintech company specializing in transaction data analysis. "
    + "Paylead brings retail and banking together, so customers get rewarded when they buy. "
    + "Welcome to the data-for-value economy."
]
Now let's start annotating!
from pylighter import Annotation
annotation = Annotation(corpus)
Running that cell gives you the following output:
You can now start annotating entities using the predefined labels l1, l2, etc.
When your annotation is finished, you can either click on the save button or retrieve the results in the current Notebook.
- The save button will save the results in a csv file named annotation.csv with two columns: the documents and the labels.
- You can access the labels of your annotations in annotation.labels.
Note: The given labels are in IOB2 format.
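Since the labels come back in IOB2 format, you will often want to turn them into entity spans. The helper below is a minimal sketch and not part of PyLighter: iob2_to_spans is a hypothetical name, and it assumes that the tags of one document are a list of strings such as "B-l1", "I-l1" and "O". It works the same way whether there is one tag per character or one per token.

```python
def iob2_to_spans(tags):
    """Return (start, end, label) triples, with end exclusive, from IOB2 tags."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag carrying the same label.
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.append((start, i, current))
            start, current = None, None
        # "B-" always opens an entity; a stray "I-" is leniently treated as "B-".
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


tags = ["B-l1", "I-l1", "O", "B-l2"]
print(iob2_to_spans(tags))  # [(0, 2, 'l1'), (3, 4, 'l2')]
```

If annotation.labels holds one such list per document (an assumption here), calling iob2_to_spans(annotation.labels[0]) would give the spans of the first document.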
Advanced usage
The above example works just fine but PyLighter can be customized to best fit your specific use case.
Using an already annotated corpus
In most cases, you will want to load an already annotated corpus or simply continue an earlier annotation.
To do this, pass the existing labels of the corpus through the labels argument. Moreover, if you stopped at the ith document, you can go straight back to it with start_index=i.
You can see more on that with this demo.
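As a rough sketch of resuming a session, the helper below guesses where you stopped by looking for the first document whose tags are all "O". Both function names are hypothetical, and the heuristic is an assumption: a document that genuinely contains no entity looks unannotated too, so you may prefer to record the index yourself. Only the labels and start_index arguments come from PyLighter.

```python
def first_unannotated_index(labels):
    """Index of the first document tagged only with "O", or None if there is none."""
    for i, doc_tags in enumerate(labels):
        if all(tag == "O" for tag in doc_tags):
            return i
    return None


def resume_annotation(corpus, labels):
    # Imported lazily so the helper above can also be used outside Jupyter.
    from pylighter import Annotation

    start = first_unannotated_index(labels)
    # Fall back to the first document when everything is already annotated.
    return Annotation(corpus, labels=labels, start_index=0 if start is None else start)
```

In a notebook, annotation = resume_annotation(corpus, previous_labels) would then reopen the widget, where previous_labels stands for whatever you saved from annotation.labels.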
Changing labels names
PyLighter uses l1, l2, ..., l7 as default label names, but in most cases you will want explicit labels such as Noun, Verb, etc.
You can define your own label names with the labels_names argument. You can also define your own label colors, in HEX format, with the labels_colors argument.
You can see more on that with this demo.
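A small sanity check before opening the widget can catch malformed colors early. This is only a sketch: check_label_config is a hypothetical helper, and it assumes one color per label name and six-digit HEX values, which is stricter than PyLighter may actually require.

```python
import re

# Six-digit HEX color such as "#ffb3ba" (deliberately strict).
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def check_label_config(labels_names, labels_colors):
    """Raise ValueError if the names and colors do not line up."""
    if len(labels_names) != len(labels_colors):
        raise ValueError("expected one color per label name")
    bad = [color for color in labels_colors if not HEX_COLOR.fullmatch(color)]
    if bad:
        raise ValueError(f"not HEX colors: {bad}")
    return labels_names, labels_colors


labels_names, labels_colors = check_label_config(
    ["Noun", "Verb"], ["#ffb3ba", "#bae1ff"]
)

# In a notebook, with corpus defined as in Basic usage:
# from pylighter import Annotation
# annotation = Annotation(corpus, labels_names=labels_names, labels_colors=labels_colors)
```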
Document styling
You can adjust the font size, the minimum distance between two characters, and the width of spaces with the char_params argument.
The default value of char_params is:
# Each field expects a CSS value as a string (ex: "10px", "1em", "large", etc.)
char_params = {
    "font_size": "medium",
    "width_white_space": "10px",
    "min_width_between_chars": "4px",
}
Adding additional information
In some cases, you may want to see additional information about the current document, such as its source.
To do this, use the additional_infos argument. It must be a pandas DataFrame of shape (size of the corpus, number of additional fields). The ith row of the DataFrame is associated with the ith element of the corpus.
The elements of the DataFrame need a proper string representation to be displayed correctly.
For instance, to add the source to each element of the corpus:
import pandas as pd
# define corpus of size 2
additional_infos = pd.DataFrame({"source":["Github", "Paylead.fr"]})
annotation = Annotation(corpus, additional_infos=additional_infos)
The result will be:
You can see more on that with this demo.
Adding additional outputs
In some cases, you may want to flag a document as difficult to annotate, mark it as wrong, or give a value that estimates your confidence in your annotation. In short, you need to return additional information.
To do this, use the additional_outputs_elements argument, which expects a list of pylighter.AdditionalOutputElement objects.
A pylighter.AdditionalOutputElement is defined like this:
from pylighter import AdditionalOutputElement
AdditionalOutputElement(
    name="name_of_my_element",
    display_type="type_of_display",  # checkbox, int_text, float_text, text, text_area
    description="Description of the element to display",
    default_value="Default value for the element",
)
Here is an example:
Note: Additional outputs are added to the save file, but you can also retrieve them with annotation.additional_outputs_values. You can also pass previously returned values back in with the additional_outputs_values argument (in the same way as labels).
You can see more on that with this demo.
Using keyboard shortcuts
Annotation tasks are pretty boring. Thus, you may want to use keyboard shortcuts to change documents easily or to select another label.
By default, there are only a few shortcuts defined:
- next: Alt + n
- previous: Alt + p
- skip: Alt + s
- save: Shift + Alt + s
However, you can fully customize them with the standard_shortcuts and labels_shorcuts arguments. The standard_shortcuts argument redefines the shortcuts for the standard buttons, such as the next button, whereas the labels_shorcuts argument defines the shortcuts for selecting labels.
A shortcut is defined like this:
from pylighter import Shortcut
Shortcut(
    name="skip",  # Name of the button to bind on (ex: "next", "skip") or name of the label (ex: "l1", "l2", or one you defined)
    key="Ò",  # Usually represents the character that is displayed.
    code="KeyS",  # Usually represents the key that is pressed.
    shift_key=False,  # Whether the shift key is pressed
    alt_key=True,
    ctrl_key=False
)
It is pretty hard to know the right values for key and code. They depend on many factors, such as your keyboard and your browser.
Thus, you can use the ShortcutHelper to pick the right shortcut. Here is an example:
from pylighter import ShortcutHelper
ShortcutHelper()
You can see more on that with this demo.
Contributing
Testing
PyLighter uses pytest. Thus, tests can be run with:
make test
PyLighter uses flake8, isort and check-manifest to control the quality of the code. You can test the quality of the code with:
make test-quality
If you wish to test everything including the packaging, you can run:
make test-all
License
MIT License