A multilingual approach to AllenNLP coreference resolution, along with a wrapper for spaCy.

Crosslingual Coreference

Coreference resolution is powerful, but the data required to train a model is scarce. In our case, the training data available for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore relies on the assumption that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Chunking/batching to resolve out-of-memory (OOM) errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
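With chunking enabled, long texts are split into pieces of at most chunk_size, with chunk_overlap carried over between pieces so mentions near a boundary keep some context. The library's internal chunking may differ (the exact units of chunk_size and chunk_overlap are assumptions here); this is just a minimal sketch of the general sliding-window idea:

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size,
    carrying chunk_overlap tokens of context between windows."""
    assert chunk_size > chunk_overlap >= 0
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

print(chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window shares its first element with the end of the previous one, which is why a small chunk_overlap is enough to avoid splitting a coreference chain exactly at a chunk boundary.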

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
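Each cluster in doc._.coref_clusters is a list of [start, end] pairs, which appear to be inclusive token offsets into the spaCy Doc. A minimal sketch of mapping such spans back to mention text (the whitespace tokenisation below is illustrative, not spaCy's exact output):

```python
# Illustrative tokenisation of the example text (spaCy may tokenise slightly differently).
tokens = ("Do not forget about Momofuku Ando ! He created instant noodles "
          "in Osaka . At that location , Nissin was founded .").split()

# First two clusters from the output above.
clusters = [[[4, 5], [7, 7]], [[12, 12], [15, 16]]]

def mention_text(tokens, start, end):
    # Spans are treated as inclusive token offsets.
    return " ".join(tokens[start:end + 1])

for cluster in clusters:
    print([mention_text(tokens, s, e) for s, e in cluster])
# ['Momofuku Ando', 'He']
# ['Osaka', 'that location']
```

With a real Doc you would slice doc[start:end + 1] instead of joining strings, which also preserves the original whitespace.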

Available models

As of now, there are three models available: "info_xlm", "xlm_roberta", and "minilm", which scored 77, 74, and 74 on OntoNotes Release 5.0 English data, respectively.

More Examples
