c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be bound from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
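
To check throughput on your own data, a small timing harness can help. The sketch below is not the benchmark from ./bench; `lines_per_second` is a hypothetical helper that times any callable taking a string and returning a list of tokens, and `str.split` is used only as a stand-in baseline.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    corpus = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in corpus:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(corpus) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["Hello, world! This is a test."] * 10_000
    # str.split is only a stand-in baseline, not a Moses-compatible tokenizer.
    print(f"str.split: {lines_per_second(str.split, sample):,.0f} lines/s")
```

To time this package, pass `MosesTokenizer('en').tokenize` as the `tokenize` argument. Results depend heavily on hardware and input text, so the figures will not necessarily match the ones quoted above.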

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Use shell redirection or pipes to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
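
Because the command-line tool reads from standard input and writes to standard output, it can also be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on `PATH` and that the only argument needed is the language code shown above. `pipe_through` and `tokenize_with_cli` are hypothetical helper names, not part of this package.

```python
import subprocess
from typing import Sequence


def pipe_through(cmd: Sequence[str], text: str) -> str:
    """Send `text` to the command's stdin and return what it writes to stdout."""
    result = subprocess.run(
        list(cmd),
        input=text,
        stdout=subprocess.PIPE,
        encoding="utf-8",  # encode stdin and decode stdout as UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def tokenize_with_cli(text: str, lang: str = "en") -> str:
    # Assumed to be equivalent to: mosestokenizer <lang> < infile > outfile
    return pipe_through(["mosestokenizer", lang], text)
```

Starting a process for every call is expensive, so this approach suits whole files or large batches. For per-sentence calls, the Python binding is the better fit.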

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
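
To tokenize a whole file from Python, the `tokenize` call shown above can be applied line by line. This is a sketch under stated assumptions: `tokenize_lines` and `tokenize_file` are hypothetical helpers, and joining tokens with a single space is a choice made here for illustration, not a documented guarantee of matching the command-line tool's output.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield each input line as space-separated tokens."""
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


def tokenize_file(src_path: str, dst_path: str, lang: str = "en") -> None:
    """Tokenize src_path line by line and write the result to dst_path."""
    # Imported here so tokenize_lines stays usable without the package installed.
    from mosestokenizer import MosesTokenizer

    tokenizer = MosesTokenizer(lang)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tokenized in tokenize_lines(src, tokenizer.tokenize):
            dst.write(tokenized + "\n")
```

Passing the tokenizer in as a callable keeps the line-handling logic independent of the package, so the same helper works with sacremoses or a test stub.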

