Named Entity Recognition using Span Markers

SpanMarker for Named Entity Recognition

SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and ELECTRA. Built on top of the 🤗 Transformers library, SpanMarker inherits a wide range of functionality, such as easy model loading and saving, hyperparameter optimization, automatic logging to various tools, checkpointing, callbacks, mixed precision training, 8-bit inference, and more.

Based on the PL-Marker paper, SpanMarker breaks the mold through its accessibility and ease of use. Crucially, SpanMarker works out of the box with many common encoders such as bert-base-cased and roberta-large, and automatically handles datasets annotated with the IOB, IOB2, BIOES or BILOU scheme, as well as datasets without any label annotation scheme.
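
To make the annotation schemes concrete, here is a minimal sketch of how IOB2 word-level tags map to the labeled spans that a span-based model works with. The helper `iob2_to_spans` and the example sentence are illustrative assumptions, not part of the SpanMarker API; SpanMarker takes care of the supported schemes for you.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags to (label, start, end) word spans, end exclusive."""
    spans = []
    start, label = None, None
    for index, tag in enumerate(tags + ["O"]):  # Sentinel "O" flushes the last span
        # Continue the open span only on a matching I- tag
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one
        if label is not None:
            spans.append((label, start, index))
            start, label = None, None
        # B- opens a new span; a stray I- is treated leniently as a start
        if tag.startswith(("B-", "I-")):
            start, label = index, tag[2:]
    return spans


words = ["Amelia", "Earhart", "flew", "to", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
for label, start, end in iob2_to_spans(tags):
    print(label, " ".join(words[start:end]))
```

This prints one line per entity: `PER Amelia Earhart` and `LOC Paris`.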

Additionally, the SpanMarker library has been integrated with the Hugging Face Hub and the Hugging Face Inference API. See the SpanMarker documentation on Hugging Face or see all SpanMarker models on the Hugging Face Hub. Through the Inference API integration, users can test any SpanMarker model on the Hugging Face Hub for free using a widget on the model page. Furthermore, each public SpanMarker model offers a free API for fast prototyping and can be deployed to production using Hugging Face Inference Endpoints.

Documentation

Feel free to have a look at the documentation.

Installation

You may install the span_marker Python module via pip like so:

pip install span_marker
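
To confirm that the installation succeeded, you can query the installed version from Python. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, not something the package provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None
print(installed_version("span_marker"))
```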

Quick Start

Training

Please have a look at our Getting Started notebook for details on how SpanMarker is commonly used. It explains the following snippet in more detail. Alternatively, have a look at the training scripts that have been successfully used in the past.

from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer


def main() -> None:
    # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
    dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
    dataset = dataset.remove_columns("ner_tags")
    dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
    labels = dataset["train"].features["ner_tags"].feature.names

    # Initialize a SpanMarker model using a pretrained BERT-style encoder
    model_name = "bert-base-cased"
    model = SpanMarkerModel.from_pretrained(
        model_name,
        labels=labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=8,
    )

    # Prepare the 🤗 transformers training arguments
    args = TrainingArguments(
        output_dir="models/span_marker_bert_base_cased_fewnerd_fine_super",
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=3000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )

    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()
    trainer.save_model("models/span_marker_bert_base_cased_fewnerd_fine_super/checkpoint-final")

    # Compute & save the metrics on the test set
    metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
    trainer.save_metrics("test", metrics)


if __name__ == "__main__":
    main()
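
The `entity_max_length` hyperparameter in the snippet above bounds how long a single entity may be. As a rough illustration of why such a bound matters for a span-based approach, the sketch below enumerates every candidate word span up to a maximum length. The function `candidate_spans` is hypothetical, and both the assumption that the bound counts words and the exhaustive enumeration reflect span-based NER in general rather than SpanMarker's exact internals.

```python
from typing import List, Tuple


def candidate_spans(num_words: int, entity_max_length: int) -> List[Tuple[int, int]]:
    """List all (start, end) word spans, end exclusive, of at most entity_max_length words."""
    return [
        (start, start + length)
        for start in range(num_words)
        for length in range(1, entity_max_length + 1)
        if start + length <= num_words
    ]


# Candidate counts for a 20-word sentence, with and without an effective bound
print(len(candidate_spans(20, entity_max_length=8)))   # 132
print(len(candidate_spans(20, entity_max_length=20)))  # 210
```

A lower bound leaves fewer candidates to score per sentence, at the cost of never predicting entities longer than the bound.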

Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities)
"""
[{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7659597396850586, 'char_start_index': 0, 'char_end_index': 14},
 {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9725785851478577, 'char_start_index': 38, 'char_end_index': 54},
 {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7587679028511047, 'char_start_index': 66, 'char_end_index': 74},
 {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
"""
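
Each prediction is a plain dictionary, so the result can be post-processed with ordinary Python. The sketch below hardcodes the predictions from the example above (scores rounded) rather than calling the model, checks that the character offsets slice the entity text out of the input, and filters by score. The 0.9 threshold is an arbitrary assumption for illustration.

```python
text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."

# Predictions copied from the example output, with scores rounded for brevity
entities = [
    {"span": "Amelia Earhart", "label": "person-other", "score": 0.766, "char_start_index": 0, "char_end_index": 14},
    {"span": "Lockheed Vega 5B", "label": "product-airplane", "score": 0.973, "char_start_index": 38, "char_end_index": 54},
    {"span": "Atlantic", "label": "location-bodiesofwater", "score": 0.759, "char_start_index": 66, "char_end_index": 74},
    {"span": "Paris", "label": "location-GPE", "score": 0.989, "char_start_index": 78, "char_end_index": 83},
]

# The character offsets index directly into the input text (end index exclusive)
for entity in entities:
    assert text[entity["char_start_index"]:entity["char_end_index"]] == entity["span"]

# Keep only the predictions above the confidence threshold
confident = [entity["span"] for entity in entities if entity["score"] >= 0.9]
print(confident)  # ['Lockheed Vega 5B', 'Paris']
```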

Pretrained Models

All models in this list contain train.py files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the training_scripts directory. These trained models have Hosted Inference API widgets that you can use to experiment with the models on their Hugging Face model pages. Additionally, Hugging Face provides each model with a free API (Deploy > Inference API on the model page).

These models are further elaborated on in my thesis.

FewNERD

- tomaarsen/span-marker-bert-base-fewnerd-fine-super is a model that I have trained in 2 hours on the finegrained, supervised Few-NERD dataset. It reached a 0.7053 Test F1, competitive in the all-time Few-NERD leaderboard using bert-base. My training script resembles the one that you can see above.
  - Try the model out online using this 🤗 Space.
- tomaarsen/span-marker-roberta-large-fewnerd-fine-super was trained in 6 hours on the finegrained, supervised Few-NERD dataset using roberta-large. It reached a 0.7103 Test F1, reaching a new state of the art in the all-time Few-NERD leaderboard.
- tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super is a multilingual model that I have trained in 1.5 hours on the finegrained, supervised Few-NERD dataset. It reached a 0.686 Test F1 on English, and works well on other languages like Spanish, French, German, Russian, Dutch, Polish, Icelandic, Greek and many more.
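
For readers unfamiliar with the metric, here is a minimal sketch of an entity-level F1 score, assuming the usual exact-match definition in which a predicted entity counts only if both its boundaries and its label match a gold entity. The function `span_f1` and the toy data are illustrative and are not SpanMarker's evaluation code.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start word index, end word index)


def span_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1: a prediction is correct only if label and boundaries both match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("PER", 0, 2), ("LOC", 4, 5), ("ORG", 7, 9)}
# One boundary error on the ORG entity and one spurious LOC entity
predicted = {("PER", 0, 2), ("LOC", 4, 5), ("ORG", 7, 8), ("LOC", 10, 11)}
print(round(span_f1(gold, predicted), 4))  # 0.5714
```

Here precision is 2/4 and recall is 2/3, which gives an F1 of 4/7. Note that the ORG prediction with a wrong end boundary earns no partial credit.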

OntoNotes v5.0

tomaarsen/span-marker-roberta-large-ontonotes5 was trained in 3 hours on the OntoNotes v5.0 dataset, reaching a performance of 0.9154 F1. For reference, the current strongest spaCy model (en_core_web_trf) reaches 0.898. This SpanMarker model uses a roberta-large encoder under the hood.

CoNLL03

- tomaarsen/span-marker-xlm-roberta-large-conll03 is a SpanMarker model using xlm-roberta-large that was trained in 45 minutes. It reaches a state of the art 0.931 F1 on CoNLL03 without using document-level context. For reference, the current strongest spaCy model (en_core_web_trf) reaches 0.916.
- tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context is another SpanMarker model using the xlm-roberta-large encoder. It uses document-level context to reach a state of the art 0.944 F1. For the best performance, inference should be performed using document-level context (docs). This model was trained in 1 hour.

CoNLL++

tomaarsen/span-marker-xlm-roberta-large-conllpp-doc-context was trained in an hour using the xlm-roberta-large encoder on the CoNLL++ dataset. Using document-level context, it reaches a very competitive 0.955 F1. For the best performance, inference should be performed using document-level context (docs).
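
Document-level context means that a sentence is predicted together with information about the document it came from. The sketch below shows one way to prepare such input, assuming that each sentence needs `document_id` and `sentence_id` columns alongside `tokens`; check the documentation for the exact requirements. The helper `with_document_context` is hypothetical, and a plain dictionary of columns stands in for a 🤗 `Dataset`.

```python
from typing import Dict, List


def with_document_context(documents: List[List[List[str]]]) -> Dict[str, list]:
    """Flatten documents (lists of tokenized sentences) into column-wise rows."""
    columns: Dict[str, list] = {"tokens": [], "document_id": [], "sentence_id": []}
    for document_id, sentences in enumerate(documents):
        for sentence_id, tokens in enumerate(sentences):
            columns["tokens"].append(tokens)
            columns["document_id"].append(document_id)  # Which document the sentence is from
            columns["sentence_id"].append(sentence_id)  # Position of the sentence in that document
    return columns


documents = [
    [["Cleopatra", "ruled", "Egypt", "."], ["She", "was", "born", "in", "69", "BCE", "."]],
    [["Paris", "is", "in", "France", "."]],
]
columns = with_document_context(documents)
print(columns["document_id"])  # [0, 0, 1]
print(columns["sentence_id"])  # [0, 1, 0]
```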

Using pretrained SpanMarker models with spaCy

All SpanMarker models on the Hugging Face Hub can also be used in spaCy. It takes a single line to add the span_marker pipeline component. See the Documentation or API Reference for more information.

import spacy

# Load the spaCy model with the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Cleopatra VII, "PERSON"), (Cleopatra the Great, "PERSON"), (the Ptolemaic Kingdom of Egypt, "GPE"),
(69 BCE, "DATE"), (Egypt, "GPE"), (51 BCE, "DATE"), (30 BCE, "DATE")]
"""

Context

I have developed this library as a part of my thesis work at Argilla. Feel free to read my finished thesis here in this repository!

Changelog

See CHANGELOG.md for news on all SpanMarker versions.

License

See LICENSE for the current license.
