c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, it can be wrapped and called from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
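
To get a rough feel for such a comparison on your own data, a small timing harness is enough. The sketch below is not the project's benchmark script; the helper names `best_time` and `speedup` are made up for illustration, and the numbers you get will depend on your corpus and machine.

```python
import time
from typing import Callable, List, Sequence


def best_time(tokenize: Callable[[str], List[str]],
              lines: Sequence[str],
              repeats: int = 3) -> float:
    """Return the best wall-clock time (seconds) over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Any callable that maps a string to a list of tokens can be timed this way, for example the `tokenize` method of this package's `MosesTokenizer('en')` against the `tokenize` method of sacremoses's `MosesTokenizer(lang='en')`. Both packages export a class with the same name, so import them under different aliases.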

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Input and output are configured by piping or redirecting the standard streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
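
The same redirection can be driven from a script. The following is a minimal sketch that assumes the `mosestokenizer` executable is on your PATH; `build_command` and `tokenize_file_with_cli` are hypothetical helper names, and only the `mosestokenizer <lang>` invocation shown above is used.

```python
import subprocess
from typing import List


def build_command(lang: str) -> List[str]:
    """Build the argument list equivalent to `mosestokenizer <lang>`."""
    return ["mosestokenizer", lang]


def tokenize_file_with_cli(in_path: str, out_path: str, lang: str = "en") -> None:
    """Equivalent of `mosestokenizer <lang> < in_path > out_path`."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(build_command(lang), stdin=src, stdout=dst, check=True)
```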

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
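
To process a whole file from Python, one line at a time as the command-line tool does, the `tokenize` call can be wrapped in a small generator. This is a sketch under assumptions: `tokenize_lines` and `tokenize_file` are hypothetical helpers, and joining tokens with single spaces is a convention chosen here, not necessarily the exact output format of the command-line tool.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield one space-joined token string per input line."""
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


def tokenize_file(in_path: str, out_path: str, lang: str = "en") -> None:
    """Tokenize in_path line by line and write the result to out_path."""
    # Imported lazily so tokenize_lines works without the package installed.
    from mosestokenizer import MosesTokenizer

    tokenizer = MosesTokenizer(lang)
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for tokenized in tokenize_lines(src, tokenizer.tokenize):
            dst.write(tokenized + "\n")
```

Passing the tokenizer as a callable keeps the line-handling logic independent of the library, so it can be exercised with any function that splits a string into tokens.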

Source Distributions

No source distribution files available for this release.

Built Distributions

  • fast_mosestokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl (713.9 kB): CPython 3.8, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl (713.2 kB): CPython 3.7m, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl (713.1 kB): CPython 3.6m, macOS 10.15+ x86-64

File details

Details for the file fast_mosestokenizer-0.0.2-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 816.3 kB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 6d9b3d814f1aec5e4d4d4a2e3f7ddcebaefd11a3c11456b2abc6c02f5eb21af0
MD5 ee6433685f5a9957ca785b24f0e76923
BLAKE2b-256 fbb47e0380d050b410c45083602e9394480e581c7ae11a5ccbe24bbba03a396d

File details

Details for the file fast_mosestokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 713.9 kB
  • Tags: CPython 3.8, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.1.0.post20200710 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 188d7023c3cf89928daa8b25b5ba31bcd42382c8e9b85551332ca1af06956d9d
MD5 29d58e4cbb4c783429b0de96d731c229
BLAKE2b-256 96f635f5361d87ed29c71761a5f5bc07bfc9bffa0e442bedcb9fa6335ea229e2

File details

Details for the file fast_mosestokenizer-0.0.2-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 814.9 kB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 f22a29f84847dc4a8b343bbc187b7c1a621f9593487501269e84b7dacc0d6c08
MD5 1b1aa834254db9119ffa1853d8fa655c
BLAKE2b-256 ca43ae53bbaab902ce91b094ab3b6a0ae0b31810cd87551b21d1c0f33ade5b5a

File details

Details for the file fast_mosestokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 713.2 kB
  • Tags: CPython 3.7m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.1.0.post20200710 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 4ff58388c9f62794b3a05faa0e5bd5115978e60d41d5c38ff9f89bd4cd0cd345
MD5 46f262aa3d3e74f0ea8c611e7f891b5e
BLAKE2b-256 fa1cabe6e492ac9abdaa66b245a01edbbe11d25965495afee3bd7add67d6940c

File details

Details for the file fast_mosestokenizer-0.0.2-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 814.8 kB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 f962733f76f06ef6199fe852d69d67564fe3d23cf79b718e82899ba5a9cefda9
MD5 32cb9214de65afa84811edeeb69410bf
BLAKE2b-256 7a68db7011219aaf99ca10e88e950ef0c95ed5a98ac20b9020a807439b754766

File details

Details for the file fast_mosestokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 713.1 kB
  • Tags: CPython 3.6m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.1.0.post20200710 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 f2fde82b3cfcd1858cfaa62679a73b620fb4931c4db90103805ca3870be263ed
MD5 c3e3a4517aa654aac0d563325f2a6659
BLAKE2b-256 49248b0d93d227c6bcdfcf9382d59c35eb14a3ac5435501be14dd57779ca5e1f
