c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is a favourite in NLP research.

The main reason to use this package over the original Perl implementation is portability. With the C++ source code, the library can be used from practically any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
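
To check a comparison of this kind on your own data, a small timing harness is usually enough. The sketch below is not the project's benchmark script (see ./bench/README.md for that). The helper names `lines_per_second` and `speedup` are hypothetical, and any callable that takes one line of text can be measured.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best observed throughput of `tokenize` over `lines`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer on tiny inputs.
    return len(lines) / max(best, 1e-9)


def speedup(candidate, baseline, lines, repeats=3):
    """Return how many times faster `candidate` is than `baseline`."""
    return lines_per_second(candidate, lines, repeats) / lines_per_second(
        baseline, lines, repeats
    )
```

With the package installed, `MosesTokenizer('en').tokenize` could be passed as `candidate` and another tokenizer's function as `baseline`. Taking the best of several runs reduces noise from other processes.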

Installation

Python users on Linux and macOS 10.15 or later can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
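
Because the tool reads lines from standard input and writes to standard output, it can also be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on PATH and emits one output line per input line. `run_line_filter` and `tokenize_with_cli` are hypothetical helper names, not part of the package.

```python
import subprocess
import sys


def run_line_filter(argv, lines):
    """Send `lines` to a line-oriented command on stdin; return its stdout lines."""
    text = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        argv,
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


def tokenize_with_cli(lines, lang="en"):
    # Assumption: `mosestokenizer` is installed and reachable on PATH.
    return run_line_filter(["mosestokenizer", lang], lines)


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the tokenizer installed.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_line_filter(upper, ["hello world"]))
```

Keeping the command as a parameter makes it easy to swap in a different filter, or to test the wrapper without the binary.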

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
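
A common next step is to tokenize a whole corpus and write one space-separated line per input line. The sketch below assumes only that the tokenizer is a callable returning a list of strings, as in the example above. `tokenize_corpus` is a hypothetical helper, not part of the package.

```python
def tokenize_corpus(lines, tokenize):
    """Return one space-joined token string per input line.

    Blank lines are kept as empty strings so that the output stays
    aligned line by line with the input (useful for parallel corpora).
    """
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            out.append("")  # do not call the tokenizer on empty input
            continue
        out.append(" ".join(tokenize(stripped)))
    return out


if __name__ == "__main__":
    # str.split is a stand-in; it only splits on whitespace.
    for row in tokenize_corpus(["lion city", "", "pura means city"], str.split):
        print(row)
```

With the package installed, `MosesTokenizer('en').tokenize` can be passed in place of `str.split`.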

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

fast_mosestokenizer-0.0.4-cp38-cp38-macosx_10_15_x86_64.whl (714.5 kB)

Uploaded CPython 3.8 macOS 10.15+ x86-64

fast_mosestokenizer-0.0.4-cp37-cp37m-macosx_10_15_x86_64.whl (713.8 kB)

Uploaded CPython 3.7m macOS 10.15+ x86-64

fast_mosestokenizer-0.0.4-cp36-cp36m-macosx_10_15_x86_64.whl (713.7 kB)

Uploaded CPython 3.6m macOS 10.15+ x86-64
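
The wheel filenames encode which interpreter and platform each file targets. The sketch below splits a filename into its fields, assuming the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag, separated by hyphens). `WheelName` and `parse_wheel_filename` are hypothetical names used for illustration.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its hyphen-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


if __name__ == "__main__":
    print(parse_wheel_filename(
        "fast_mosestokenizer-0.0.4-cp38-cp38-macosx_10_15_x86_64.whl"
    ))
```

For the first file listed above, this gives the Python tag `cp38`, the ABI tag `cp38`, and the platform tag `macosx_10_15_x86_64`.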

File details

Details for the file fast_mosestokenizer-0.0.4-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 813.2 kB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.7

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 3efed81c9660106b6240814fd5eb396e60dcbe62371f2eec2608fb7256421109
MD5 9e06f21dad36b443133c69c293950ef6
BLAKE2b-256 11d4d397a214d536eb04bd713bcf41015e2ac750e9d279ccb543cd7dc93567a3

See more details on using hashes here.
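
The digests listed for each file can be checked locally after a download. The following is a minimal sketch using only the standard library. `sha256_of_file` and `matches_sha256` are hypothetical helper names, and the path passed in is whatever location the wheel was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_sha256` with the downloaded wheel's path and the SHA256 value listed for that file should return `True` for an intact download.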

File details

Details for the file fast_mosestokenizer-0.0.4-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp38-cp38-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 714.5 kB
  • Tags: CPython 3.8, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 af549bb1e38fc3e13b892a96c43826d7c9d92396ce326f43fa7485ed588715c1
MD5 3942fd35c8d531977076459bb009868e
BLAKE2b-256 efa0ecf34c1f339b646cc3ce7d07d18c48795e3876e217a75132e979572f0825

File details

Details for the file fast_mosestokenizer-0.0.4-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 812.5 kB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.7

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4d6f227be0c7cd42062f6d6a75958d4b43b88345c579d0f4af6786fc02e9eed9
MD5 3e83ec7182ab56bb5e78649ab1c030b7
BLAKE2b-256 71d9bfb3bac954a62f80d0da4dc6dedfd3da978615d6d70de19fa3d269f0ca14

File details

Details for the file fast_mosestokenizer-0.0.4-cp37-cp37m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp37-cp37m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 713.8 kB
  • Tags: CPython 3.7m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 fc2cdef0ef08edb1cf589cfbe133a86db75bd79e2127621ad7d4f3e90673ec07
MD5 af77a75c7727f1a8be8439aa19039ba4
BLAKE2b-256 80bafcf637e333552581f338a73b83702e9a5bc6a9333accc3ea82bc123be27d

File details

Details for the file fast_mosestokenizer-0.0.4-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 812.5 kB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.7

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 b663ec75eb81738574e7d70db1177e732bdfbf49eaf27f2e8f7716831d464839
MD5 ca269176b3a63f66c904fd0e5a07beca
BLAKE2b-256 a261ae829b92273fd70af290beae9b781bd9b9b0c12091aed430f9109e7fbd0e

File details

Details for the file fast_mosestokenizer-0.0.4-cp36-cp36m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.4-cp36-cp36m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 713.7 kB
  • Tags: CPython 3.6m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3

File hashes

Hashes for fast_mosestokenizer-0.0.4-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 5ec216b4af7e081b86665b40b7ee4f0121d68446e1c9fccd2687add20b430b36
MD5 67f19d2e128d69483b39b2f347143ecc
BLAKE2b-256 2da9c5e255b3230b9516c1fdfbd10d31b3cd9d39728536d1e3df302c6b5b137d
