
c++ mosestokenizer

Project description

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, it can be wrapped for use from most programming languages.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
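
To get a rough feel for relative speed on your own data, a timing harness along the following lines may help. This is only a sketch and is not the benchmark in ./bench: the helper name `time_tokenizer` and the sample text are hypothetical, and `str.split` stands in for a real tokenizer so that the snippet runs without any package installed. Any callable that maps a string to a list of tokens can be passed instead.

```python
import time
from typing import Callable, Iterable, List


def time_tokenizer(tokenize: Callable[[str], List[str]],
                   lines: Iterable[str],
                   repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` passes."""
    lines = list(lines)  # materialise once so every pass sees the same input
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in tokenizer (plain whitespace splitting), used only for illustration.
sample = ["The quick brown fox jumps over the lazy dog."] * 1000
baseline = time_tokenizer(str.split, sample)
print(f"str.split: {baseline:.4f} s for {len(sample)} lines")
```

Taking the best of several passes reduces noise from other processes; absolute numbers will still vary between machines.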

Installation

Python users on Linux and macOS (10.15 or later) can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Input and output are configured by piping or redirecting the standard streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
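
Because the tool reads standard input and writes standard output, it can also be driven from a script. The sketch below is one possible way to do that from Python; the helper name `tokenize_lines` is hypothetical, and the default command simply mirrors the `mosestokenizer en` invocation shown above. The demonstration uses a stand-in command that echoes its input, so it runs even where mosestokenizer is not installed.

```python
import subprocess
import sys
from typing import List, Sequence


def tokenize_lines(lines: Sequence[str],
                   command: Sequence[str] = ("mosestokenizer", "en")) -> List[str]:
    """Pipe `lines` through `command` and return its output lines."""
    # One input line per element, each terminated by exactly one newline.
    text = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


# Stand-in command that copies stdin to stdout (an assumption for the demo).
echo = (sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())")
print(tokenize_lines(["Hello, world!", "Second line."], command=echo))
```

With the real tool on the PATH, calling `tokenize_lines(lines)` without the `command` argument would run `mosestokenizer en` instead.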

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
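
When tokenizing a whole file, it is often convenient to process it line by line and skip blank lines. The following is a minimal sketch of that pattern. It assumes only the `tokenize(text)` method shown above; the helper `tokenize_nonempty` and the `WhitespaceTokenizer` stand-in are hypothetical names used so that the snippet runs without the package installed.

```python
from typing import Iterable, Iterator, List


def tokenize_nonempty(tokenizer, lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per non-blank line.

    `tokenizer` is any object with a `tokenize(str) -> list` method.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield tokenizer.tokenize(line)


class WhitespaceTokenizer:
    """Stand-in with the same method name; splits on whitespace only."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


lines = ["Hello world\n", "\n", "  Second line  \n"]
for tokens in tokenize_nonempty(WhitespaceTokenizer(), lines):
    print(tokens)
```

Passing `MosesTokenizer('en')` in place of the stand-in should give Moses-style tokens, for example with punctuation split off as in the session above.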



Download files


Source Distributions

No source distribution files are available for this release.

Built Distributions

fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl (736.9 kB)

Uploaded CPython 3.8 macOS 10.15+ x86-64

fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl (736.0 kB)

Uploaded CPython 3.7m macOS 10.15+ x86-64

fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl (736.0 kB)

Uploaded CPython 3.6m macOS 10.15+ x86-64
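
The wheel file names encode the supported interpreter, ABI and platform, following the standard wheel naming scheme `{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl`. As an illustration, the sketch below splits a file name into those fields; `WheelTags` and `parse_wheel_filename` are hypothetical helper names, not part of this package.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its name, version and tag fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        del parts[2]  # drop the optional build tag
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelTags(*parts)


tags = parse_wheel_filename(
    "fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl"
)
print(tags.python_tag, tags.abi_tag, tags.platform_tag)
```

Here `cp38` denotes CPython 3.8 and `macosx_10_15_x86_64` denotes macOS 10.15 or later on x86-64, matching the descriptions listed above.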

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl.

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 7ea4c4c9c1ee7b14b2371507e4708df93d8cbcf6b04b3d5bfe60ad7914c51d15
MD5 00e9c2bfab1860d4e77fab17d347bec9
BLAKE2b-256 2e2391b0e910e30ee0ed831bf05494e03231139cd192e3baf26e8841b28ec53c
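
A downloaded wheel can be checked against a listed SHA256 digest before installing it. The sketch below shows one way to do this with the Python standard library; the helper names `sha256_of_file` and `sha256_matches` are hypothetical, and the local path of the downloaded file is up to the reader.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with `expected_hex`, ignoring case."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For an intact download of the file above, `sha256_matches("fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl", "7ea4c4c9c1ee7b14b2371507e4708df93d8cbcf6b04b3d5bfe60ad7914c51d15")` should return `True`.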

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 736.9 kB
  • Tags: CPython 3.8, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 fc92dd8ea19a1ddef60c112539bf34852934e1aee4ded109346496f0f5a6d68f
MD5 46acc85c2334e0c25912184d5d3380e2
BLAKE2b-256 0f03ec5938a59c53cad61aa2136de414abcc4c752be745fbd568fa01f686d45b

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp37-cp37m-manylinux1_x86_64.whl.

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 b1a5ffb9281495b16c23bdbbc7705af66888b8896e3fd55d9dd6977f28a3d23f
MD5 294935cf4be0f997272af82cb0326de4
BLAKE2b-256 82d3933ca0cd41340b9dd55a47a7a56820f5307815be107f266d2c00384dd0ff

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 736.0 kB
  • Tags: CPython 3.7m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 e3e883e9840880ca3d78b64114b75331acb1267e65f6ada444a05a34b89d6b9b
MD5 e66a2a3ee0e6cfe57779570ae4fc35da
BLAKE2b-256 2d73f34347b4efebd81c41982c5a180eed3483c28fd0aae8362bb8cee18f2751

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp36-cp36m-manylinux1_x86_64.whl.

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 bced5df75d6c4677be3d09d0151c4b2972135a5435c9d512518da4657faa03ce
MD5 992865ca62368da9760d9f6240c384de
BLAKE2b-256 5dff2f7058d6229798dae8e0365833baf7a8d3355ae5f8fd30ba0dd232b8247d

File details

Details for the file fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 736.0 kB
  • Tags: CPython 3.6m, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 cacfc9b89b61cbd29664f23ab56753011ab5a50fa983578d35dc9e360ae9d616
MD5 3015d5356714a4146e69e0042074df90
BLAKE2b-256 96d70df5ac6a769d57c3d14eb5d2b4f3af594467d18e193816722aa73726da72
