c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a long-standing favourite in NLP research.
The main reason to use this package over the original Perl implementation is portability: because the core is C++ source code, the library can be wrapped and called from most other languages.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
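To check speedup figures like these on your own data, a small timing harness is enough. The sketch below is not the benchmark from ./bench; `best_time`, `speedup`, and the two stand-in tokenizers are hypothetical names chosen for illustration so that the snippet runs without any extra packages.

```python
import time


def best_time(tokenize, lines, repeats=3):
    """Return the fastest wall-clock time (seconds) to tokenize all lines."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(candidate, baseline, lines, repeats=3):
    """Return how many times faster `candidate` is than `baseline`."""
    return best_time(baseline, lines, repeats) / best_time(candidate, lines, repeats)


# Stand-in tokenizers (assumptions, not real Moses tokenizers).
def whitespace_tokenize(line):
    return line.split()


def filtered_tokenize(line):
    return [token for token in line.split(" ") if token]


sample = ["The quick brown fox jumps over the lazy dog ."] * 1000
print(f"{speedup(whitespace_tokenize, filtered_tokenize, sample):.2f}x")
```

To compare real tokenizers, pass their tokenize callables instead of the stand-ins, for example `MosesTokenizer('en').tokenize` from this package. Timings vary by machine and input, so treat any single number as indicative only.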
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
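After installing, one quick way to confirm that the package is visible to the current interpreter is to look up its import name. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, and `mosestokenizer` is the import name used in the Python usage section.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("mosestokenizer"):
    print("mosestokenizer is importable")
else:
    print("mosestokenizer not found; try: pip install fast-mosestokenizer")
```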
Usage (Command-line tool)
# Piping is the standard way to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
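Because the command-line tool reads from standard input and writes to standard output, it can also be driven from another program. The sketch below wraps the `mosestokenizer en` invocation shown above with Python's subprocess module. `tokenize_with_cli` is a hypothetical helper, and it assumes the tool emits one tokenized line per input line.

```python
import shutil
import subprocess


def tokenize_with_cli(text, lang="en", executable="mosestokenizer"):
    """Pipe `text` through the CLI and return its output lines."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        [executable, lang],   # same form as: mosestokenizer en < infile
        input=text,           # sent to the tool's standard input
        capture_output=True,  # collect standard output and error
        encoding="utf-8",
        check=True,           # raise if the tool exits with an error
    )
    return result.stdout.splitlines()


# Only runs when the command-line tool is actually installed.
if shutil.which("mosestokenizer") is not None:
    print(tokenize_with_cli("Hello, world!\n"))
```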
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
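The same line-by-line behaviour as the command-line tool can be reproduced in Python by applying a tokenizer to each line of a stream. In this sketch `tokenize_stream` is a hypothetical helper, and `str.split` stands in for a real tokenizer so that the example runs without the package installed.

```python
import io


def tokenize_stream(tokenize, infile, outfile):
    """Tokenize each input line and write the tokens joined by spaces.

    Returns the number of lines processed.
    """
    count = 0
    for line in infile:
        tokens = tokenize(line.rstrip("\n"))
        outfile.write(" ".join(tokens) + "\n")
        count += 1
    return count


# str.split is a stand-in (assumption); it only splits on whitespace.
source = io.StringIO("Hello world\nlion city\n")
target = io.StringIO()
count = tokenize_stream(str.split, source, target)
print(count)
print(target.getvalue(), end="")
```

With the package installed, pass `MosesTokenizer('en').tokenize` as the first argument and open real files in place of the in-memory streams.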