Coreference resolution with e2e for Dutch
e2e-Dutch
Code for an end-to-end (e2e) coreference resolution model for Dutch. The code is based on the original e2e model for English and modified to work for Dutch. If you make use of this code, please cite it and also cite the original e2e paper.
Installation
Requirements:
- Python 3.6 or 3.7
- pip
In this repository, run:
pip install -r requirements.txt
./scripts/setup_all.sh
pip install .
The setup_all script downloads the word vector files to the data directories. It also builds the application-specific tensorflow kernels.
Quick start - Stanza
e2edutch can be used as part of a Stanza pipeline.
Coreferences are added similarly to Stanza's entities:
- a Document has an attribute clusters that is a List of coreference clusters;
- a coreference cluster is a List of Stanza Spans.
import stanza
import e2edutch.stanza
nlp = stanza.Pipeline(lang='nl', processors='tokenize,coref')
doc = nlp('Dit is een test document. Dit document bevat coreferenties.')
print([[span.text for span in cluster] for cluster in doc.clusters])
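Because the clusters are described above as a list of lists of spans, post-processing comes down to ordinary list handling. The following is a minimal sketch, not part of e2edutch: `FakeSpan`, `cluster_texts` and `multi_mention` are hypothetical names, and the only thing assumed about a real span is the `text` attribute used in the example above. The stand-in class lets the sketch run without Stanza or the pretrained model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a Stanza Span; only `text` is assumed."""
    text: str


def cluster_texts(clusters: Iterable[Iterable[FakeSpan]]) -> List[List[str]]:
    """Return the mention strings of every cluster, keeping cluster order."""
    return [[span.text for span in cluster] for cluster in clusters]


def multi_mention(clusters: Iterable[Iterable[FakeSpan]]) -> List[List[str]]:
    """Keep only the clusters that contain at least two mentions."""
    return [texts for texts in cluster_texts(clusters) if len(texts) > 1]


# Made-up clusters for illustration only; this is not model output.
example = [
    [FakeSpan("een test document"), FakeSpan("Dit document")],
    [FakeSpan("coreferenties")],
]
print(cluster_texts(example))
print(multi_mention(example))
```

With a real pipeline, `doc.clusters` could be passed in place of `example`, provided the spans expose `text` as shown in the quick start.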
Quick start
A pretrained model is available to download:
python -m e2edutch.download
This downloads the model files. The default location is the data directory inside the python package location.
The location can also be set manually by specifying the environment variable E2E_HOME or through the config file (see below).
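The lookup order described above (an explicit E2E_HOME, otherwise the data directory inside the package) can be sketched as follows. This is an illustration under stated assumptions rather than the package's actual code: `resolve_data_dir` is a hypothetical helper, the `package_dir` argument stands in for the installed package location, and the config-file option is left out.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(package_dir: str,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the model directory: E2E_HOME if set, else <package_dir>/data.

    Hypothetical helper; the real lookup inside e2edutch may differ.
    """
    env = os.environ if env is None else env
    override = env.get("E2E_HOME")
    if override:
        # An explicit environment setting wins over the default location.
        return Path(override)
    return Path(package_dir) / "data"


# Both paths below are made up for illustration.
print(resolve_data_dir("/opt/site-packages/e2edutch", {}))
print(resolve_data_dir("/opt/site-packages/e2edutch", {"E2E_HOME": "/srv/e2e"}))
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.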
The pretrained model can be used to predict coreferences on conll 2012 files, jsonlines files, NAF files or plain text files (in the latter case, the nltk package will be used for tokenization).
python -m e2edutch.predict [-h] [-o OUTPUT_FILE] [-f {conll,jsonlines,naf}]
[-c WORD_COL] [--cfg_file CFG_FILE] [-v]
config input_filename
positional arguments:
config: name of the model to use for prediction ('final' for the pretrained)
input_filename
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_FILE, --output_file OUTPUT_FILE
-f {conll,jsonlines,naf}, --format_out {conll,jsonlines,naf}
-c WORD_COL, --word_col WORD_COL
--cfg_file CFG_FILE config file
-v, --verbose
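When the prediction step is driven from another Python program, the command line above can be assembled as an argument list. The sketch below only uses the flags listed in the usage text; `build_predict_command` is a hypothetical helper, and the file names `input.txt` and `output.conll` are placeholders.

```python
import sys
from typing import List, Optional

# Output formats listed for -f / --format_out in the usage text.
FORMATS = {"conll", "jsonlines", "naf"}


def build_predict_command(config: str,
                          input_filename: str,
                          output_file: Optional[str] = None,
                          format_out: Optional[str] = None,
                          verbose: bool = False) -> List[str]:
    """Assemble the argument list for `python -m e2edutch.predict`."""
    if format_out is not None and format_out not in FORMATS:
        raise ValueError(f"format_out must be one of {sorted(FORMATS)}")
    cmd = [sys.executable, "-m", "e2edutch.predict"]
    if output_file is not None:
        cmd += ["-o", output_file]
    if format_out is not None:
        cmd += ["-f", format_out]
    if verbose:
        cmd.append("-v")
    # The two positional arguments come last: model name, then input file.
    cmd += [config, input_filename]
    return cmd


# 'final' is the name of the pretrained model; the file names are placeholders.
cmd = build_predict_command("final", "input.txt",
                            output_file="output.conll", format_out="conll")
print(" ".join(cmd))
# To actually run it (requires e2edutch and the downloaded model):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single shell string avoids quoting problems when file names contain spaces.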
The user-specific configurations (such as data directory, data files, etc.) can be provided in a separate config file; the defaults are specified in cfg/defaults.conf.
Train your own model
To train a new model:
- Make sure the model config file (default: e2edutch/cfg/models.conf) describes the model you wish to train.
- Make sure your config file (default: e2edutch/cfg/defaults.conf) includes the data files you want to use for training.
- Run scripts/setup_train.sh e2edutch/cfg/defaults.conf. This script converts the conll2012 data to jsonlines files, and caches the word and contextualized embeddings.
- If you want to enable the use of a GPU, set the environment variable:
export GPU=0
- Run the training script:
python -m e2edutch.train <model-name>
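The training steps above can be collected into a small dry-run plan, which is handy for scripting repeated runs. This is a sketch under assumptions: `training_plan` is a hypothetical helper, `my_model` is a placeholder model name that would have to exist in the model config file, and applying the GPU variable to both steps is a choice made here, not something the steps above prescribe.

```python
import sys
from typing import Dict, List, Optional, Tuple


def training_plan(model_name: str,
                  cfg_file: str = "e2edutch/cfg/defaults.conf",
                  gpu: Optional[int] = None
                  ) -> Tuple[List[List[str]], Dict[str, str]]:
    """Return the commands to run, in order, plus extra environment variables."""
    commands = [
        # Converts the conll2012 data to jsonlines and caches the embeddings.
        ["scripts/setup_train.sh", cfg_file],
        # Trains the model described in the model config file.
        [sys.executable, "-m", "e2edutch.train", model_name],
    ]
    extra_env = {} if gpu is None else {"GPU": str(gpu)}
    return commands, extra_env


# 'my_model' is a placeholder; nothing is executed here, the plan is only printed.
commands, extra_env = training_plan("my_model", gpu=0)
for command in commands:
    print(" ".join(command))
print(extra_env)
# To execute the plan from the repository root:
#   import os, subprocess
#   for command in commands:
#       subprocess.run(command, check=True, env={**os.environ, **extra_env})
```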
Citing this code
If you use this code in your research, please cite it as follows:
@misc{YourReferenceHere,
author = {
Dafne van Kuppevelt and
Jisk Attema
},
title = {e2e-Dutch},
doi = {10.5281/zenodo.4146960},
url = {https://github.com/Filter-Bubble/e2e-Dutch}
}
As the code is largely based on the original e2e model for English, please make sure to also cite the original e2e paper.