Instant Segment: fast English word segmentation in Rust

Instant Segment is a fast, Apache-2.0-licensed library for English word segmentation. It is based on the Python wordsegment project written by Grant Jenks, which is in turn based on code from Peter Norvig's chapter Natural Language Corpus Data in the book Beautiful Data (Segaran and Hammerbacher, 2009).

In the microbenchmark included in this repository, Instant Segment is ~100x faster than the Python implementation. The API is designed so that multiple segmentations can share the underlying state, which allows parallel usage.

How it works

Instant Segment splits a string into words by selecting the splits with the highest probability, given a corpus of words and their occurrence counts.

For instance, provided that choose and spain occur more frequently than chooses and pain, and that the pair choose spain occurs more frequently than chooses pain, Instant Segment can help identify the domain choosespain.com as ChooseSpain.com, which more likely matches user intent.
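To make the idea concrete, here is a minimal pure-Python sketch of probability-based segmentation in the style of Norvig's chapter. It is not Instant Segment's actual implementation: the counts in `UNIGRAMS` and `BIGRAMS` are made up for illustration, and the names `score` and `segment` as well as the unknown-word penalty are assumptions chosen for this example.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, invented for illustration only.
UNIGRAMS = {"choose": 80, "spain": 60, "chooses": 20, "pain": 40}
BIGRAMS = {("choose", "spain"): 10, ("chooses", "pain"): 1}
TOTAL = sum(UNIGRAMS.values())
MAX_WORD_LEN = max(len(word) for word in UNIGRAMS)


def score(word, previous):
    """Return the log10 probability of `word`, conditioned on `previous` if possible."""
    if previous in UNIGRAMS and (previous, word) in BIGRAMS:
        # Conditional probability P(word | previous) estimated from the counts.
        return math.log10(BIGRAMS[(previous, word)] / UNIGRAMS[previous])
    if word in UNIGRAMS:
        return math.log10(UNIGRAMS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length (assumed heuristic).
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the highest-scoring split of `text` into words."""

    @lru_cache(maxsize=None)
    def best(start, previous):
        # Best (score, words) for text[start:], given the word that came before.
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end, word)
            candidates.append((score(word, previous) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0, None)[1])


print(segment("choosespain"))  # → ['choose', 'spain']
```

With these toy counts, the split choose + spain scores about -1.30 while chooses + pain scores about -2.30, so the first one wins. Memoizing on the start position and the previous word keeps the search from re-scoring the same suffix repeatedly.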

Read about how we built and improved Instant Segment for use in production at Instant Domain Search to help our users find relevant domains they can register.

Using the library

Python (>= 3.9)

pip install instant-segment

Rust

[dependencies]
instant-segment = "0.8.1"

Examples

The following examples expect unigrams and bigrams to exist. See the examples (Rust, Python) to learn how to construct these objects.

import instant_segment

segmenter = instant_segment.Segmenter(unigrams, bigrams)
search = instant_segment.Search()
segmenter.segment("instantdomainsearch", search)
print([word for word in search])

--> ['instant', 'domain', 'search']

use instant_segment::{Search, Segmenter};
use std::collections::HashMap;

let segmenter = Segmenter::new(unigrams, bigrams);
let mut search = Search::default();
let words = segmenter
    .segment("instantdomainsearch", &mut search)
    .unwrap();
println!("{:?}", words.collect::<Vec<&str>>());

--> ["instant", "domain", "search"]

Check out the tests for more thorough examples: Rust, Python
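The examples above take unigrams and bigrams as given. As a rough sketch of how such data might be prepared in Python, the helpers below parse tab-separated count lines into pairs. Both the file layout and the shape of the pairs, (word, count) and ((first, second), count), are assumptions made for this illustration; check the linked examples for the inputs that Segmenter actually expects. The names `read_unigrams` and `read_bigrams` are hypothetical.

```python
import io


def read_unigrams(lines):
    """Yield (word, count) pairs from lines shaped like 'word<TAB>count'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        yield word, float(count)


def read_bigrams(lines):
    """Yield ((first, second), count) pairs from 'first second<TAB>count' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pair, count = line.split("\t", 1)
        first, second = pair.split(" ", 1)
        yield (first, second), float(count)


# Toy in-memory data standing in for real corpus files (made-up counts).
unigram_data = io.StringIO("instant\t1000\ndomain\t800\nsearch\t900\n")
bigram_data = io.StringIO("instant domain\t50\ndomain search\t40\n")

unigrams = list(read_unigrams(unigram_data))
bigrams = list(read_bigrams(bigram_data))
```

If the assumption about input shapes holds, these values could then be passed to `instant_segment.Segmenter(unigrams, bigrams)` as in the Python example above.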

Testing

To run the tests, run:

cargo t -p instant-segment --all-features

You can also test the Python bindings with:

make test-python
