c++ mosestokenizer

Project description

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the core is written in C++, the library can be called from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is roughly 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
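
To check throughput on your own data, a minimal timing sketch is shown below. It is not the project's benchmark script (that lives under ./bench), and the helper name `lines_per_second` is hypothetical. It only times Python callables, so it can compare this package with sacremoses but not with tokenizer.perl.

```python
import time
from typing import Callable, List, Sequence


def lines_per_second(
    tokenize: Callable[[str], List[str]],
    lines: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput (lines per second) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse clock on tiny inputs.
    return len(lines) / best if best > 0 else float("inf")
```

Pass the `tokenize` method of each tokenizer you want to compare, together with the same list of lines, and compare the returned rates. Absolute numbers will vary with hardware and input.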

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Input and output are configured with standard stream redirection.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
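
The same invocation can be driven from a script. The sketch below assumes the `mosestokenizer` executable is on PATH and reads stdin and writes stdout as in the example above. The helper names `run_line_filter` and `tokenize_with_cli` are hypothetical and not part of the package.

```python
import subprocess
from typing import List, Sequence


def run_line_filter(command: Sequence[str], lines: Sequence[str]) -> List[str]:
    """Send `lines` to `command` on stdin and return its stdout split into lines."""
    result = subprocess.run(
        list(command),
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


def tokenize_with_cli(lines: Sequence[str], lang: str = "en") -> List[str]:
    """Assumption: equivalent to `mosestokenizer <lang> < infile > outfile`."""
    return run_line_filter(["mosestokenizer", lang], lines)
```

Starting one process per batch of lines, rather than one per line, keeps the process start-up cost from dominating.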

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
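
For tokenizing a whole file from Python, one possible pattern is to apply `tokenize` line by line and join the tokens with spaces, mirroring the command-line tool's one-line-in, one-line-out behaviour. This is a sketch: `tokenize_lines` and `make_moses_tokenize` are hypothetical helpers, and only `MosesTokenizer(lang)` and `.tokenize(text)` are taken from the example above.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield one space-joined token string per input line.

    Blank lines are passed through as empty strings so that the output
    stays aligned with the input line by line.
    """
    for line in lines:
        text = line.strip()
        if not text:
            yield ""
            continue
        yield " ".join(tokenize(text))


def make_moses_tokenize(lang: str = "en") -> Callable[[str], List[str]]:
    """Build a tokenize callable backed by fast-mosestokenizer."""
    # Imported lazily so tokenize_lines can be used and tested without the package.
    from mosestokenizer import MosesTokenizer

    return MosesTokenizer(lang).tokenize
```

With the package installed, iterating over `tokenize_lines(open("infile", encoding="utf-8"), make_moses_tokenize("en"))` should give one tokenized line per input line.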
