Word segmentation models

wordseg

wordseg is a Python package of word segmentation models.

Installation

wordseg is available through pip:

pip install wordseg

To install wordseg from the GitHub source:

git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -r dev-requirements.txt  # For running the linter and tests
pip install -e .

Usage

wordseg implements each word segmentation model as a Python class. An instance of a model class exposes the methods below, in the style of the scikit-learn API for machine learning:

  • fit: Train the model with segmented sentences.
  • predict: Predict the segmented sentences from unsegmented sentences.

The implemented model classes are as follows:

  • RandomSegmenter: A word boundary is predicted independently at each potential boundary position with a given probability. No training is required.
  • LongestStringMatching: This model builds predicted words by moving from left to right along an unsegmented sentence and, at each position, taking the longest word that matches the training data, constrained by a maximum word length parameter.
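
As a rough illustration of the two strategies above, here is a minimal plain-Python sketch. It is not wordseg's actual implementation: the function names `random_segment` and `longest_match_segment`, their parameters, and the fallback to single characters for unmatched text are assumptions made for this example (the fallback is inferred from the sample output under "Sample code snippet").

```python
import random


def random_segment(sentence, prob, rng=None):
    """Hypothetical sketch: place a boundary before each character
    (except the first) independently with probability `prob`."""
    rng = rng or random.Random()
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        if rng.random() < prob:
            words.append(char)   # boundary: start a new word
        else:
            words[-1] += char    # no boundary: extend the current word
    return words


def longest_match_segment(sentence, vocabulary, max_word_length):
    """Hypothetical sketch: scan left to right, taking the longest
    substring (up to `max_word_length`) found in `vocabulary`.
    Assumption: unmatched text falls back to single characters."""
    if max_word_length < 1:
        raise ValueError("max_word_length must be at least 1")
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Build a vocabulary from segmented training sentences.
training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
vocabulary = {word for sentence in training for word in sentence}

print(longest_match_segment("thatisadog", vocabulary, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']
print(random_segment("thatisadog", prob=0.0))
# ['thatisadog']
```

In this sketch, a `prob` of 0 never inserts a boundary and a `prob` of 1 splits every character; values in between give a different segmentation on each run unless a seeded `random.Random` is passed as `rng`.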

Sample code snippet:

from wordseg import LongestStringMatching

# Initialize a model.
model = LongestStringMatching(max_word_length=4)

# Train the model.
# `fit` takes an iterable of segmented sentences, each a tuple or list of strings.
model.fit(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ]
)

# Make some predictions. `predict` returns a generator, which list() materializes here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.

Citation

Lee, Jackson L. 2020. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433

@software{leengrams,
  author       = {Jackson L. Lee},
  title        = {wordseg: Word segmentation models in Python},
  year         = 2020,
  doi          = {10.5281/zenodo.4077433},
  url          = {https://doi.org/10.5281/zenodo.4077433}
}

License

MIT License. Please see LICENSE.txt.

Changelog

Please see CHANGELOG.md.
