Skip to main content

A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.

Project description

Crosslingual Coreference

Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also proved to be poorly annotated. Crosslingual Coreference, therefore, uses the assumption a trained model with English data and cross-lingual embeddings should work for languages with similar sentence structures.

Current Release Version pypi Version PyPi downloads Code style: black

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
print(predictor.pipe([text])[0]["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are two models available "spanbert", "info_xlm", "xlm_roberta", "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.

Chunking/batching to resolve memory OOM errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}

Visualize spacy pipeline

This only works with spacy >= 3.3.

import spacy
import crosslingual_coreference
from spacy.tokens import Span
from spacy import displacy

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref", config={"model_name": "minilm"})
doc = nlp(text)
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
    for span in cluster:
        spans.append(
            Span(doc, span[0], span[1]+1, str(idx).upper())
        )

doc.spans["custom"] = spans

displacy.render(doc, style="span", options={"spans_key": "custom"})

More Examples

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crosslingual_coreference-0.3.tar.gz (10.8 kB view details)

Uploaded Source

Built Distribution

crosslingual_coreference-0.3-py3-none-any.whl (12.4 kB view details)

Uploaded Python 3

File details

Details for the file crosslingual_coreference-0.3.tar.gz.

File metadata

File hashes

Hashes for crosslingual_coreference-0.3.tar.gz
Algorithm Hash digest
SHA256 3056d85d9b05ca7481e9f6def609625b913cdabaf4949e93768a23ce974280ce
MD5 4d9b5137724f00f4b693d465bfb3e4f5
BLAKE2b-256 d9502bc3c2f5a5a717e2e22e278d89600460051fb347675a2db604490226dcee

See more details on using hashes here.

File details

Details for the file crosslingual_coreference-0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for crosslingual_coreference-0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 cbf91e3882e0aa5d0fdbcc9f81e4e396aad8748a3a9af74920aa0bc66235f3fa
MD5 27012c7586f4005ff63375a9caa8ff45
BLAKE2b-256 cdba990a3369d991f1c0359348717490028e17326c3776733a8d24aeae6bc9da

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page