Word segmentation models
wordseg
wordseg is a Python package of word segmentation models.
Installation
wordseg is available through pip:
pip install wordseg
To install wordseg from the GitHub source:
git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -r dev-requirements.txt # For running the linter and tests
pip install -e .
Usage
wordseg implements each word segmentation model as a Python class.
A model instance has the following methods (emulating the scikit-learn-style API for machine learning):
- fit: Train the model with segmented sentences.
- predict: Predict the segmented sentences from unsegmented sentences.
The implemented model classes are as follows:
- RandomSegmenter: A segmentation is predicted at random, independently at each potential word boundary, with a given probability. No training is required.
- LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
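The idea behind random segmentation can be illustrated with a minimal pure-Python sketch. This is not the package's implementation; the function name `random_segment` and its `prob` and `seed` parameters are hypothetical and chosen only for this illustration.

```python
import random


def random_segment(sentence, prob=0.5, seed=None):
    """Split `sentence` at each internal character boundary with probability `prob`.

    Hypothetical helper for illustration only; not part of wordseg.
    """
    rng = random.Random(seed)
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        if rng.random() < prob:
            # A boundary is drawn here, so this character starts a new word.
            words.append(char)
        else:
            # No boundary: extend the current word.
            words[-1] += char
    return words


# Each boundary decision is independent, so the result varies with the seed.
segments = random_segment("thatisadog", prob=0.3, seed=0)
```

Whatever boundaries are drawn, joining the predicted words always gives back the original sentence, and `prob=0` or `prob=1` yields the two extremes (one word, or one word per character).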
Sample code snippet:
from wordseg import LongestStringMatching
# Initialize a model.
model = LongestStringMatching(max_word_length=4)
# Train the model.
# `fit` takes an iterable of segmented sentences (each a tuple or list of strings).
model.fit(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ]
)
# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
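To see why the output above falls back to single characters for unseen words, here is a rough pure-Python sketch of greedy left-to-right longest matching. It is an illustration under assumptions, not the package's actual code: the helper names `fit_vocabulary` and `longest_match_segment` are hypothetical, and the fallback to a single character when nothing matches is inferred from the sample output.

```python
def fit_vocabulary(segmented_sentences, max_word_length):
    """Collect the training words that are short enough to ever be matched."""
    vocab = set()
    for sentence in segmented_sentences:
        for word in sentence:
            if len(word) <= max_word_length:
                vocab.add(word)
    return vocab


def longest_match_segment(sentence, vocab, max_word_length):
    """Greedily take the longest known word at each position, left to right."""
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try candidates from longest to shortest (down to length 2).
        for length in range(longest, 1, -1):
            candidate = sentence[i:i + length]
            if candidate in vocab:
                break
        else:
            # Assumed fallback: no known multi-character word starts here,
            # so emit a single character.
            candidate = sentence[i]
        words.append(candidate)
        i += len(candidate)
    return words


vocab = fit_vocabulary(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ],
    max_word_length=4,
)
result = [longest_match_segment(s, vocab, 4) for s in ["thatisadog", "thisisnotacat"]]
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
```

In this sketch, "sentence" never enters the vocabulary because it is longer than the maximum word length of 4, so it could not be predicted even if it appeared in the input.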
Citation
Lee, Jackson L. 2020. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433
@software{leengrams,
author = {Jackson L. Lee},
title = {wordseg: Word segmentation models in Python},
year = 2020,
doi = {10.5281/zenodo.4077433},
url = {https://doi.org/10.5281/zenodo.4077433}
}
License
MIT License. Please see LICENSE.txt.
Changelog
Please see CHANGELOG.md.