c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be used from practically any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and about 15x faster than sacremoses.

See ./bench/README.md for more information.
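
To give a rough idea of how such a comparison could be measured, here is a minimal timing sketch. It is not the benchmark from ./bench; the helper names `time_tokenizer` and `speedup` are hypothetical, and any callable that maps a string to a list of tokens can be passed in.

```python
import time
from typing import Callable, Iterable, List


def time_tokenizer(
    tokenize: Callable[[str], List[str]],
    lines: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` passes.

    Hypothetical helper for illustration; not part of fast-mosestokenizer.
    """
    lines = list(lines)  # materialise once so every pass sees the same input
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

With the package installed, `time_tokenizer(MosesTokenizer('en').tokenize, lines)` could be compared against the same call for another tokenizer. Results will vary with hardware and input.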

Installation

Python users on Linux and macOS 10.15 or later can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Use shell redirection (or pipes) to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
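
The same redirection can also be driven from Python. The sketch below assumes the `mosestokenizer` executable is on `PATH`; the helper names `build_command` and `tokenize_file` are hypothetical and not part of the package.

```python
import shutil
import subprocess
from typing import List


def build_command(lang: str) -> List[str]:
    """Build the argument list for `mosestokenizer <lang>`."""
    return ["mosestokenizer", lang]


def tokenize_file(lang: str, infile: str, outfile: str) -> None:
    """Equivalent of `mosestokenizer <lang> < infile > outfile`.

    Assumes the command-line tool is installed and on PATH.
    """
    if shutil.which("mosestokenizer") is None:
        raise FileNotFoundError("mosestokenizer executable not found on PATH")
    # Wire the files to stdin/stdout, as the shell redirection does.
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        subprocess.run(build_command(lang), stdin=src, stdout=dst, check=True)
```

For example, `tokenize_file("en", "infile", "outfile")` mirrors the shell command shown above.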

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
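
To process a whole file the way the command-line tool does, one line at a time, a small wrapper is enough. This is a sketch: `tokenize_lines` is a hypothetical helper, and it accepts any callable that maps a string to a list of tokens, so it runs even without the package installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield each input line as space-joined tokens.

    Hypothetical helper for illustration; not part of fast-mosestokenizer.
    """
    for line in lines:
        # Drop the trailing newline before tokenizing, then re-join tokens.
        yield " ".join(tokenize(line.rstrip("\n")))
```

With the package installed, the callable would be `MosesTokenizer('en').tokenize`, for example `for out in tokenize_lines(open("infile", encoding="utf-8"), tokenizer.tokenize): print(out)`.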



