
c++ mosestokenizer


fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, the library can be used from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
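
The same pipe can also be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on `PATH`. `run_tokenizer` is a hypothetical helper that is not part of the package; only the `mosestokenizer en` invocation shown above comes from this README.

```python
import subprocess
from typing import Sequence


def run_tokenizer(text: str, cmd: Sequence[str] = ("mosestokenizer", "en")) -> str:
    """Feed `text` to the command's stdin and return what it writes to stdout.

    Hypothetical helper: the default command mirrors the
    `mosestokenizer en < infile > outfile` invocation shown above.
    """
    result = subprocess.run(
        list(cmd),
        input=text,            # sent to the child's standard input
        capture_output=True,   # collect stdout and stderr
        encoding="utf-8",      # exchange text rather than bytes
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the command as a parameter keeps the helper usable with other language codes, for example `run_tokenizer(text, cmd=("mosestokenizer", "de"))`, assuming that language is supported by the installed build.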

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
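
The command-line tool tokenizes its input one line at a time, and the same behaviour can be approximated with the Python binding. The sketch below is an illustration under stated assumptions rather than part of the package: `tokenize_lines` is a hypothetical helper that accepts any object with a `tokenize(text)` method returning a list of strings, such as the `MosesTokenizer` instance created above.

```python
from typing import Iterable, Iterator


def tokenize_lines(tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined line of tokens per input line.

    `tokenizer` is any object with a `tokenize(text)` method that returns
    a list of strings (for example, the MosesTokenizer instance above).
    This helper is hypothetical and not part of fast-mosestokenizer.
    """
    for line in lines:
        text = line.strip()
        if not text:
            # Keep blank lines so input and output stay line-aligned.
            yield ""
            continue
        yield " ".join(tokenizer.tokenize(text))


# Example use with the package installed (file names are placeholders):
#
#     from mosestokenizer import MosesTokenizer
#     tokenizer = MosesTokenizer('en')
#     with open('infile', encoding='utf-8') as src:
#         for tokenized in tokenize_lines(tokenizer, src):
#             print(tokenized)
```

Writing the helper as a generator means a large file is processed one line at a time without loading it all into memory.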


Built Distributions

fast_mosestokenizer-0.0.6-cp38-cp38-macosx_10_15_x86_64.whl (714.8 kB): CPython 3.8, macOS 10.15+ x86-64

fast_mosestokenizer-0.0.6-cp37-cp37m-macosx_10_15_x86_64.whl (714.1 kB): CPython 3.7m, macOS 10.15+ x86-64

fast_mosestokenizer-0.0.6-cp36-cp36m-macosx_10_15_x86_64.whl (714.0 kB): CPython 3.6m, macOS 10.15+ x86-64

