Stochastic Edit Distance aligner for string transduction

Maxwell 👹

Maxwell is a Python library for learning the stochastic edit distance (SED) between source and target alphabets for string transduction.

Given a corpus of source and target string pairs, it uses expectation-maximization to learn the log-probability weights of edit actions (copy, substitution, deletion, insertion) that minimize the number of edits between source and target strings. These weights can then be used to compute edits for unseen strings through Viterbi decoding.

Install

First install dependencies:

pip install -r requirements.txt

Then install:

python setup.py install

Or:

python setup.py develop

The latter creates a Python module in your environment that updates as you update the code. It can then be imported like a regular Python module:

import maxwell

Usage

SED training can be run either from the command line or from Python by importing the library.

For command-line use, run:

maxwell-train \
    --train /path/to/train/data \
    --output /path/to/output/file \
    --epochs "${NUM_EPOCHS}"

When using Maxwell as a library, you can train on any iterable of source-target pairs by passing it to the fit_from_data method of the StochasticEditDistance class. Learned edit weights can then be saved with the write_params method:

from maxwell import sed


aligner = sed.StochasticEditDistance.fit_from_data(
    training_samples, NUM_EPOCHS
)
aligner.params.write_params("/path/to/output/file")

After training, parameters can be loaded from file to calculate optimal edits between strings with the action_sequence method, which returns a tuple of the learned optimal sequence and the weight given to the sequence:

from maxwell import sed


params = sed.ParamsDict.read_params("/path/to/learned/parameters")
aligner = sed.StochasticEditDistance(params)
optimal_sequence, optimal_cost = aligner.action_sequence(source, target)

If only the weight is required and not the actions, action_sequence_cost can be called instead:

optimal_cost = aligner.action_sequence_cost(source, target)

Individual actions can also be evaluated with the action_cost method:

action_cost = aligner.action_cost(action)
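
To illustrate what finding an optimal edit sequence involves, here is a minimal, self-contained sketch of Viterbi decoding over edit actions. It does not use Maxwell and is not the library's implementation: the function viterbi_edits, its arguments, the action tuples, and the weights are all hypothetical, and the weights are made up rather than learned or normalized.

```python
import math
from typing import Dict, List, Tuple

Action = Tuple[str, ...]


def viterbi_edits(
    source: str,
    target: str,
    sub: Dict[Tuple[str, str], float],
    dele: Dict[str, float],
    ins: Dict[str, float],
    copy: float,
    eos: float,
) -> Tuple[float, List[Action]]:
    """Returns the best log-probability and the edit actions achieving it."""
    n, m = len(source), len(target)
    # best[i][j] is the best log-probability of rewriting source[:i] as
    # target[:j]; back[i][j] is the last action on that best path.
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Action]] = [[()] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates: List[Tuple[float, Action]] = []
            if i > 0:  # Delete a source symbol.
                s = source[i - 1]
                score = best[i - 1][j] + dele.get(s, -math.inf)
                candidates.append((score, ("del", s)))
            if j > 0:  # Insert a target symbol.
                t = target[j - 1]
                score = best[i][j - 1] + ins.get(t, -math.inf)
                candidates.append((score, ("ins", t)))
            if i > 0 and j > 0:  # Copy or substitute.
                s, t = source[i - 1], target[j - 1]
                if s == t:
                    candidates.append((best[i - 1][j - 1] + copy, ("copy", s)))
                else:
                    score = best[i - 1][j - 1] + sub.get((s, t), -math.inf)
                    candidates.append((score, ("sub", s, t)))
            if candidates:
                best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    if best[n][m] == -math.inf:
        raise ValueError("No edit sequence has nonzero probability")
    # Follows the back-pointers from the final cell to the origin.
    actions: List[Action] = []
    i, j = n, m
    while i > 0 or j > 0:
        action = back[i][j]
        actions.append(action)
        if action[0] != "ins":
            i -= 1
        if action[0] != "del":
            j -= 1
    actions.reverse()
    # The end-of-string weight is added once, when the sequence terminates.
    return best[n][m] + eos, actions


# Made-up log-probability weights over a tiny alphabet.
sub = {(s, t): math.log(0.05) for s in "abc" for t in "abd" if s != t}
dele = {s: math.log(0.02) for s in "abc"}
ins = {t: math.log(0.02) for t in "abd"}
score, actions = viterbi_edits(
    "abc", "abd", sub, dele, ins, copy=math.log(0.6), eos=math.log(0.1)
)
print(actions)  # [('copy', 'a'), ('copy', 'b'), ('sub', 'c', 'd')]
```

With these weights, copying "a" and "b" and substituting "c" with "d" scores higher than any path that deletes and re-inserts a symbol, so the substitution path is returned.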

Details

Data

The default data format is based on the SIGMORPHON 2017 shared tasks:

source   target    ...

That is, the first column is the source (a lemma) and the second is the target.

If the data are formatted differently, use the --source-col and --target-col flags. For instance, for the SIGMORPHON 2016 shared task data format:

source   ...    target

one would instead use the flag --target-col 3 to use the third column as target strings (note the use of 1-based indexing).
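
When training from Python rather than the command line, the same column selection can be done while reading the file. The following is a rough sketch, assuming tab-separated columns; read_pairs is a hypothetical helper (not part of Maxwell), and the sample rows are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(
    handle: TextIO, source_col: int = 1, target_col: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yields (source, target) pairs from tab-separated text.

    Column numbers are 1-based, mirroring the command-line flags.
    """
    for line in handle:
        if not line.strip():
            continue  # Skips blank lines.
        row = line.rstrip("\n").split("\t")
        yield row[source_col - 1], row[target_col - 1]


# Three-column rows with the target in the third column (made-up data).
sample = io.StringIO("walk\tV;PST\twalked\nsing\tV;PST\tsang\n")
pairs = list(read_pairs(sample, target_col=3))
print(pairs)  # [('walk', 'walked'), ('sing', 'sang')]
```

The resulting list is an iterable of source-target pairs of the kind described under Usage.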

Edit actions

Edit weights are maintained as a ParamsDict object, a dataclass comprising three dictionaries and one float. The fields, and the indexing of the dictionaries, are as follows:

  1. delta_sub Keys: Tuples from source alphabet × target alphabet. Values: Substitution weight for each non-equivalent source-target pair. If source symbol == target symbol, a separate copy probability is used.
  2. delta_del Keys: All symbols in source string alphabet. Represents deletion from string. Values: Deletion weight for removal of source symbol from string.
  3. delta_ins Keys: All symbols in target string alphabet. Represents insertion into string. Values: Insertion weight for introduction of target symbol into string.
  4. delta_eos A float value representing the probability of terminating the string.

In Python, these values may be accessed through a StochasticEditDistance object's params attribute.
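
To make the layout concrete, here is a small stand-in dataclass with the four fields described above. ParamsSketch is hypothetical and is not Maxwell's ParamsDict; the weights are made up, and the sketch omits however the library stores its copy probability.

```python
import dataclasses
import math
from typing import Dict, Tuple


@dataclasses.dataclass
class ParamsSketch:
    """Hypothetical stand-in mirroring the four fields described above."""

    delta_sub: Dict[Tuple[str, str], float]  # (source, target) -> weight
    delta_del: Dict[str, float]  # source symbol -> weight
    delta_ins: Dict[str, float]  # target symbol -> weight
    delta_eos: float  # end-of-string weight


# Made-up log-probability weights over a two-symbol alphabet.
params = ParamsSketch(
    delta_sub={("a", "b"): math.log(0.1), ("b", "a"): math.log(0.1)},
    delta_del={"a": math.log(0.05), "b": math.log(0.05)},
    delta_ins={"a": math.log(0.05), "b": math.log(0.05)},
    delta_eos=math.log(0.2),
)

# Substitution weights are looked up by (source, target) tuple;
# deletion and insertion weights by a single symbol.
sub_weight = params.delta_sub[("a", "b")]
del_weight = params.delta_del["a"]
```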

Further reading

For further reading, please see:

Dempster, A., Laird, N., and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1-38.

Ristad, E. S. and Yianilos, P. N. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5): 522-532.
