Word segmentation models
wordseg
wordseg is a Python package of word segmentation models.
Installation
wordseg is available through pip:
pip install wordseg
To install wordseg from the GitHub source:
git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"
Usage
wordseg implements each word segmentation model as a Python class.
An instantiated model object has the following methods
(emulating the scikit-learn-style API for machine learning):
fit: Train the model with segmented sentences.
predict: Predict the segmented sentences from unsegmented sentences.
The implemented model classes are as follows:
RandomSegmenter: Segmentation is predicted at random, independently at each potential word boundary, with a given probability. No training is required.
LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
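To make the random strategy concrete, here is a minimal standalone sketch of the idea. It does not use wordseg itself: the function name random_segment and its parameters prob and rng are hypothetical and are not taken from the package's API.

```python
import random


def random_segment(sentence, prob=0.5, rng=None):
    """Split `sentence` at each internal position independently with probability `prob`.

    Hypothetical helper for illustration only; not part of wordseg.
    """
    if rng is None:
        rng = random.Random()
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        # Each gap between two characters is a potential word boundary.
        if rng.random() < prob:
            words.append(char)   # boundary: start a new word
        else:
            words[-1] += char    # no boundary: extend the current word
    return words


# The two extremes are deterministic.
print(random_segment("abc", prob=0.0))  # ['abc']
print(random_segment("abc", prob=1.0))  # ['a', 'b', 'c']
```

Whatever the probability, joining the predicted words back together gives the original sentence, since only boundaries are inserted.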
Sample code snippet:
from wordseg import LongestStringMatching
# Initialize a model.
model = LongestStringMatching(max_word_length=4)
# Train the model.
# `fit` takes an iterable of segmented sentences (each a tuple or list of strings).
model.fit(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ]
)
# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
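The behavior in the snippet above can be reproduced with a short standalone sketch of greedy longest matching. This is an illustration under assumptions, not the package's actual implementation: the function longest_match_segment is hypothetical, and it assumes that a character with no matching word is emitted as a single-character word.

```python
def longest_match_segment(sentence, vocab, max_word_length):
    """Greedy left-to-right longest matching (hypothetical helper, not wordseg's code)."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest allowed candidate first, then shrink it.
        longest = min(max_word_length, len(sentence) - i)
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # Assumption: fall back to a single character when nothing matches.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
# The vocabulary is every word seen in the segmented training sentences.
vocab = {word for sentence in training for word in sentence}

print(longest_match_segment("thatisadog", vocab, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']
```

Because "dog" never appears in the vocabulary, no candidate of length 3 or 2 matches at that position, so the sketch falls back to single characters, as in the output shown above.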
License
MIT License. Please see LICENSE.txt.
Changelog
Please see CHANGELOG.md.
Contributing
Please see CONTRIBUTING.md.
Citation
Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433
@software{leengrams,
  author = {Jackson L. Lee},
  title  = {wordseg: Word segmentation models in Python},
  year   = 2023,
  doi    = {10.5281/zenodo.4077433},
  url    = {https://doi.org/10.5281/zenodo.4077433}
}