c++ mosestokenizer

Project description

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.

The main reason to use this package over the original Perl implementation is portability. Because the core is written in C++, the library can be used from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
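
To check speed claims like these on your own data, a rough throughput measurement is usually enough. The sketch below is not the project's benchmark (that lives in ./bench); `lines_per_second` is a hypothetical helper that times any callable mapping a string to a list of tokens, and `str.split` is only a stand-in for a real tokenizer.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    corpus = list(lines)  # materialise once so every run sees the same input
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in corpus:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(corpus) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    # str.split is a placeholder; pass the tokenizers you want to compare.
    print(f"{lines_per_second(str.split, sample):,.0f} lines/s")
```

Taking the best of several runs reduces noise from other processes. For a fair comparison, feed every tokenizer the same corpus.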

Installation

Python users on Linux and macOS (>= 10.15) can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
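
The same redirection can be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on your PATH and accepts the language code as its only argument, as in the command above. `run_line_filter` is a hypothetical helper name, not part of the package.

```python
import subprocess
from typing import Sequence


def run_line_filter(infile: str, outfile: str,
                    command: Sequence[str] = ("mosestokenizer", "en")) -> None:
    """Run `command` with `infile` on stdin and `outfile` on stdout.

    With the default command this mirrors: mosestokenizer en < infile > outfile
    """
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(command), stdin=src, stdout=dst, check=True)
```

Opening both files in binary mode leaves all encoding handling to the tool itself.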

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
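
To process a whole file from Python rather than a single string, one option is to call `tokenize` once per line. This sketch assumes only what the example above shows: an object with a `tokenize(text)` method that returns a list of strings. `tokenize_lines` and the `Tokenizer` protocol are hypothetical names introduced for illustration.

```python
from typing import Iterable, Iterator, List, Protocol


class Tokenizer(Protocol):
    """Anything with a tokenize(text) -> list of tokens method."""

    def tokenize(self, text: str) -> List[str]: ...


def tokenize_lines(tokenizer: Tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line.

    Blank lines are passed through as empty strings so that output line
    numbers stay aligned with the input (useful for parallel corpora).
    """
    for line in lines:
        line = line.strip()
        yield " ".join(tokenizer.tokenize(line)) if line else ""


# Possible usage, assuming the package is installed:
#   from mosestokenizer import MosesTokenizer
#   with open("infile", encoding="utf-8") as f:
#       for out in tokenize_lines(MosesTokenizer("en"), f):
#           print(out)
```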

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

fast_mosestokenizer-0.0.7-cp38-cp38-macosx_10_15_x86_64.whl (714.7 kB)

CPython 3.8, macOS 10.15+ x86-64

fast_mosestokenizer-0.0.7-cp37-cp37m-macosx_10_15_x86_64.whl (714.0 kB)

CPython 3.7m, macOS 10.15+ x86-64

fast_mosestokenizer-0.0.7-cp36-cp36m-macosx_10_15_x86_64.whl (713.9 kB)

CPython 3.6m, macOS 10.15+ x86-64
