Coreference resolution with e2e for Dutch

e2e-Dutch

Code for the e2e coreference model for Dutch. The code is based on the original e2e model for English and has been modified to work for Dutch. If you use this code, please cite it and also cite the original e2e paper.

This code can be used with a pretrained model for Dutch, trained on the SoNaR-1 dataset. The model file and documentation are available on Zenodo (DOI: 10.5281/zenodo.5153574).

Installation

Requirements:

  • Python 3.6 or 3.7
  • pip
  • tensorflow v2.0.0 or higher

In this repository, run:

pip install -r requirements.txt
pip install .

Alternatively, you can install directly from PyPI:

pip install tensorflow
pip install e2e-Dutch

Quick start - Stanza

e2edutch can be used as part of a Stanza pipeline.

Coreferences are added similarly to Stanza's entities:

  • a Document has an attribute clusters that is a List of coreference clusters;
  • a coreference cluster is a List of Stanza Spans.

import stanza
import e2edutch.stanza

nlp = stanza.Pipeline(lang='nl', processors='tokenize,coref')

doc = nlp('Dit is een test document. Dit document bevat coreferenties.')
print([[span.text for span in cluster] for cluster in doc.clusters])

Note that you first need to download the stanza models with stanza.download('nl'). The e2e-Dutch model files are automatically downloaded to the stanza resources directory when loading the pipeline.
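
To make the cluster structure described above concrete, here is a minimal sketch that flattens clusters into lists of strings. The helper name cluster_texts is hypothetical, and the stand-in objects only mimic the clusters and text attributes; they are not real Stanza objects and the cluster shown is made up, not model output. With a real pipeline you would pass the doc returned by nlp(...).

```python
from types import SimpleNamespace


def cluster_texts(doc):
    """Return the text of every mention, grouped per coreference cluster."""
    # doc.clusters is a list of clusters; each cluster is a list of spans
    # that expose their surface form through a .text attribute.
    return [[span.text for span in cluster] for cluster in doc.clusters]


# Stand-in document (NOT a real Stanza Document); the cluster is invented
# purely to show the shape of the data.
fake_doc = SimpleNamespace(
    clusters=[
        [
            SimpleNamespace(text="een test document"),
            SimpleNamespace(text="Dit document"),
        ],
    ]
)

print(cluster_texts(fake_doc))
```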

Quick start

A pretrained model is available to download:

python -m e2edutch.download [-d DATAPATH]

This downloads the model files. The default location is the data directory inside the Python package location. It can also be set manually with the DATAPATH argument or by setting the environment variable E2E_HOME.
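
The three ways of choosing the download location can be sketched as follows. This is only an illustration: the function name resolve_data_dir and the package_dir parameter are hypothetical, and the priority order (explicit argument, then E2E_HOME, then the package data directory) is an assumption rather than confirmed behaviour of e2edutch.

```python
import os
from pathlib import Path


def resolve_data_dir(datapath=None, package_dir="."):
    """Pick the directory for the model files.

    Assumed priority (not confirmed by the package): explicit argument,
    then the E2E_HOME environment variable, then a 'data' directory
    inside the package location.
    """
    if datapath:
        return Path(datapath)
    env_home = os.environ.get("E2E_HOME")
    if env_home:
        return Path(env_home)
    return Path(package_dir) / "data"


# Shows which directory this sketch would select on the current machine.
print(resolve_data_dir())
```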

The pretrained model can be used to predict coreferences on CoNLL 2012 files, jsonlines files, NAF files or plain text files (in the latter case, the stanza package is used for tokenization).

python -m e2edutch.predict [-h] [-o OUTPUT_FILE] [-f {conll,jsonlines,naf}] [-m MODEL] [-c WORD_COL] [--cfg_file CFG_FILE] [--model_cfg_file MODEL_CFG_FILE] [-v] input_filename

positional arguments:
  input_filename

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
  -f {conll,jsonlines,naf}, --format_out {conll,jsonlines,naf}
  -m MODEL, --model MODEL
                        model name
  -c WORD_COL, --word_col WORD_COL
  --cfg_file CFG_FILE   config file
  --model_cfg_file MODEL_CFG_FILE
                        model config file
  -v, --verbose

User-specific configuration (such as the data directory and data files) can be provided in a separate config file; the defaults are specified in cfg/defaults.conf. The options model_cfg_file and model are relevant when you want to predict with a user-specified model instead of the pretrained model (see the section below on how to train a model).
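
If you want to call the prediction script from Python, the command-line options listed above can be assembled into an argument list. The sketch below only builds the command and does not run it. The helper name build_predict_command and the file names are hypothetical, and it assumes the module is invoked as e2edutch.predict.

```python
import sys

# Output formats listed in the usage text above.
OUTPUT_FORMATS = ("conll", "jsonlines", "naf")


def build_predict_command(input_filename, output_file=None, format_out=None,
                          model=None, word_col=None, cfg_file=None,
                          model_cfg_file=None, verbose=False):
    """Assemble the argument list for `python -m e2edutch.predict`."""
    cmd = [sys.executable, "-m", "e2edutch.predict"]
    if output_file:
        cmd += ["-o", output_file]
    if format_out:
        if format_out not in OUTPUT_FORMATS:
            raise ValueError(f"format_out must be one of {OUTPUT_FORMATS}")
        cmd += ["-f", format_out]
    if model:
        cmd += ["-m", model]
    if word_col is not None:
        cmd += ["-c", str(word_col)]
    if cfg_file:
        cmd += ["--cfg_file", cfg_file]
    if model_cfg_file:
        cmd += ["--model_cfg_file", model_cfg_file]
    if verbose:
        cmd.append("-v")
    # The positional input file goes last.
    cmd.append(input_filename)
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_predict_command("input.txt", output_file="output.jsonlines",
                            format_out="jsonlines", verbose=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the package and model files are installed.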

Train your own model

To train a new model:

  • Make sure the model config file (default: e2edutch/cfg/models.conf) describes the model you wish to train
  • Make sure your config file (default: e2edutch/cfg/defaults.conf) includes the data files you want to use for training
  • Run scripts/setup_train.sh e2edutch/cfg/defaults.conf. This script converts the conll2012 data to jsonlines files, and caches the word and contextualized embeddings.
  • If you want to enable the use of a GPU, set the environment variable:
export GPU=0
  • Run the training script:
python -m e2edutch.train <model-name>
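
The last three steps above can also be scripted. The following is a minimal sketch that only prepares the commands and the environment. The helper name training_plan and the model name my_model are hypothetical; the script path, config path and GPU variable are taken from the steps above.

```python
import os
import sys


def training_plan(model_name, cfg_file="e2edutch/cfg/defaults.conf", gpu=None):
    """Return the commands and environment for a training run."""
    env = dict(os.environ)
    if gpu is not None:
        # Mirrors `export GPU=0` from the steps above.
        env["GPU"] = str(gpu)
    commands = [
        # Converts the data to jsonlines and caches the embeddings.
        ["scripts/setup_train.sh", cfg_file],
        # Starts training the model described in the model config file.
        [sys.executable, "-m", "e2edutch.train", model_name],
    ]
    return commands, env


commands, env = training_plan("my_model", gpu=0)
for command in commands:
    print(" ".join(command))
```

To actually run the steps from the repository root, each command could be passed to subprocess.run(command, env=env, check=True).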

Citing this code

If you use this code in your research, please cite it as follows:

@misc{YourReferenceHere,
author = {
            Dafne van Kuppevelt and
            Jisk Attema
         },
title  = {e2e-Dutch},
doi    = {10.5281/zenodo.4146960},
url    = {https://github.com/Filter-Bubble/e2e-Dutch}
}

As the code is largely based on the original e2e model for English, please make sure to also cite the original e2e paper.
