A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
Crosslingual Coreference
Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore assumes that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.
Install
pip install crosslingual-coreference
Quickstart
from crosslingual_coreference import Predictor
text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)
# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)
print(predictor.predict(text)["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
Models
As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.
- The "minilm" model offers the best quality/speed trade-off for both multi-lingual and English texts.
- The "info_xlm" model produces the best quality for multi-lingual texts.
- The AllenNLP "spanbert" model produces the best quality for English texts.
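The recommendations above can be captured in a small lookup. This is only a minimal sketch: `ONTONOTES_SCORES` and `pick_model` are hypothetical names used for illustration and are not part of the package; the scores are the ones quoted above.

```python
# Scores quoted above (OntoNotes Release 5.0, English data).
ONTONOTES_SCORES = {
    "spanbert": 83,
    "info_xlm": 77,
    "xlm_roberta": 74,
    "minilm": 74,
}


def pick_model(multilingual: bool, prefer_speed: bool) -> str:
    """Return a model name following the recommendations listed above.

    Hypothetical helper, not part of crosslingual_coreference.
    """
    if prefer_speed:
        # Recommended quality/speed trade-off for any language.
        return "minilm"
    # Otherwise favour quality: info_xlm for multi-lingual, spanbert for English.
    return "info_xlm" if multilingual else "spanbert"


print(pick_model(multilingual=True, prefer_speed=False))  # info_xlm
```

The returned string can then be passed as `model_name` to `Predictor`.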
Chunking/batching to resolve out-of-memory (OOM) errors
from crosslingual_coreference import Predictor
predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
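To give an intuition for what these two parameters control, here is a minimal sketch of chunking with overlap. It assumes the text has already been split into units (for example sentences) and that `chunk_overlap` is the number of units shared by consecutive chunks. The text above does not state the units, and `chunk_with_overlap` is a hypothetical helper, not the library's implementation.

```python
from typing import List, Sequence


def chunk_with_overlap(
    units: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split `units` into chunks of at most `chunk_size` items.

    Consecutive chunks share `chunk_overlap` items, so mentions near a
    chunk boundary still see some preceding context.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(list(units[start : start + chunk_size]))
        if start + chunk_size >= len(units):
            break  # the last chunk already reaches the end
    return chunks


sentences = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
for chunk in chunk_with_overlap(sentences, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Smaller chunks lower peak memory, while a larger overlap gives pronouns at the start of a chunk a better chance of seeing their antecedent.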
Use spaCy pipeline
import spacy
import crosslingual_coreference
text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)
doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
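The indices in `coref_clusters` read as inclusive `[start, end]` token offsets. The sketch below illustrates this without loading any model: the token list is hand-written to mimic spaCy's tokenization of the example text (an assumption), and `spans_to_text` is a hypothetical helper, not part of the package.

```python
from typing import List

# Hand-written tokens, assumed to match spaCy's tokenization of the example.
tokens = (
    "Do not forget about Momofuku Ando ! He created instant noodles in Osaka . "
    "At that location , Nissin was founded . Many students survived by eating "
    "these noodles , but they do n't even know him ."
).split()

# Cluster output copied from the coref_clusters example above.
clusters = [
    [[4, 5], [7, 7], [27, 27], [36, 36]],
    [[12, 12], [15, 16]],
    [[9, 10], [27, 28]],
    [[22, 23], [31, 31]],
]


def spans_to_text(tokens: List[str], cluster: List[List[int]]) -> List[str]:
    """Join the tokens of each inclusive [start, end] span into a string."""
    return [" ".join(tokens[start : end + 1]) for start, end in cluster]


for cluster in clusters:
    print(spans_to_text(tokens, cluster))
```

With a real spaCy `doc`, the equivalent lookup would presumably be `doc[start : end + 1].text` for each span.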