c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, it can be wrapped for use from most programming languages.

The C++ code is adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
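To make a figure such as "6x faster" concrete, here is a minimal timing sketch. It is not the project's benchmark script (that lives under ./bench); the helper names `time_tokenizer` and `speedup` are hypothetical, and `str.split` is only a stand-in so the example runs without any extra packages. To compare real tokenizers, pass their tokenize functions instead.

```python
import time
from typing import Callable, Iterable, List


def time_tokenizer(tokenize: Callable[[str], List[str]],
                   lines: Iterable[str],
                   repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` passes."""
    lines = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    sample = ["Hello, world! This is a test."] * 10_000
    # str.split is a placeholder tokenizer (assumption), not a Moses tokenizer.
    elapsed = time_tokenizer(str.split, sample)
    print(f"{len(sample) / elapsed:,.0f} lines/s")
```

Taking the best of several passes reduces noise from warm-up and other processes; absolute numbers will still vary by machine and input.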

Installation

Python users on Linux and macOS (10.15 or later) can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.
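After installing, a quick way to confirm that the package is visible to the current interpreter is to query its metadata. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` is hypothetical. Note that the distribution is named fast-mosestokenizer while the import name is mosestokenizer.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "fast-mosestokenizer") -> Optional[str]:
    """Return the installed version of `dist_name`, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("fast-mosestokenizer is not installed in this environment")
    else:
        print(f"fast-mosestokenizer {version} is installed")
```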

Usage (Command-line tool)

# Use shell redirection (or pipes) to connect the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
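The same command-line call can be driven from a script. The sketch below assumes the `mosestokenizer` executable is on PATH and uses only the invocation shown above (a language code as the single argument, text on standard input, tokens on standard output). The helper names `build_command` and `tokenize_file` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_command(lang: str = "en",
                  executable: str = "mosestokenizer") -> List[str]:
    """Build the argument list equivalent to `mosestokenizer <lang>`."""
    return [executable, lang]


def tokenize_file(infile: str, outfile: str, lang: str = "en") -> None:
    """Equivalent of `mosestokenizer <lang> < infile > outfile`."""
    if shutil.which("mosestokenizer") is None:
        raise FileNotFoundError("mosestokenizer executable not found on PATH")
    # Binary mode passes bytes through untouched, leaving encoding to the tool.
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        subprocess.run(build_command(lang), stdin=src, stdout=dst, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if the tool exits with a non-zero status.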

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
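For corpus preprocessing, a common pattern is to tokenize line by line and write the tokens back out separated by spaces. The sketch below relies only on the `tokenize` method shown above; `tokenize_lines` and `WhitespaceTokenizer` are hypothetical names, and the whitespace stand-in exists only so the example runs when the package is not installed.

```python
from typing import Iterable, Iterator, List


class WhitespaceTokenizer:
    """Stand-in with the same `tokenize` method shape (assumption, not Moses)."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


def tokenize_lines(tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line.

    Empty lines yield empty strings so the output keeps the same
    line count as the input, which matters for parallel corpora.
    """
    for line in lines:
        yield " ".join(tokenizer.tokenize(line.rstrip("\n")))


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
        tokenizer = MosesTokenizer("en")
    except ImportError:
        # Fall back to the stand-in when the package is unavailable.
        tokenizer = WhitespaceTokenizer()
    for out in tokenize_lines(tokenizer, ["Hello, world!\n", "\n", "Done.\n"]):
        print(out)
```

Because `tokenize_lines` is a generator, it can process a large file handle without loading the whole file into memory.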
