c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the core is written in C++, the library can be bound to and used from most other languages.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
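
To get a rough feel for such comparisons on your own data, a small timing helper along the following lines may be enough. This is only a sketch and not the project's benchmark script: the helper name lines_per_second and its repeats parameter are made up for illustration, and it accepts any callable that tokenizes a single string.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best observed throughput (lines/second) of tokenize over lines.

    `tokenize` is any callable taking one string; the helper name and the
    `repeats` parameter are illustrative and not part of fast-mosestokenizer.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(lines) / best if best > 0 else float("inf")


# str.split stands in for a real tokenizer so the sketch runs without extra
# packages; with the package installed you could pass
# MosesTokenizer('en').tokenize instead.
sample = ["The quick brown fox jumps over the lazy dog."] * 10000
print(f"{lines_per_second(str.split, sample):.0f} lines/s")
```

Taking the best of several repeats reduces noise from other processes; absolute numbers will depend on the machine and the input text.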

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Input and output are configured by redirecting or piping the standard streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
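
Because the tool reads standard input and writes standard output, it can also be driven from another program. The sketch below shows one way to do that from Python. The helper name run_line_filter is hypothetical, and the call shown in the comments assumes that the mosestokenizer executable is on your PATH and that "en" is passed as in the command above.

```python
import subprocess


def run_line_filter(command, lines):
    """Send `lines` to `command` on stdin and return its stdout as a list of lines.

    `command` is an argument list such as ["mosestokenizer", "en"].
    The helper itself is illustrative and not part of the package.
    """
    # Make sure every input line ends with exactly one newline.
    text = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


# Example call, assuming the executable is installed:
#   tokenized = run_line_filter(["mosestokenizer", "en"], ["Hello, world!"])
#   print(tokenized)
```

Passing the whole input in one call keeps the process start-up cost to a single launch, which matters when there are many short lines.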

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
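
When tokenizing many lines, for example a corpus file, the same tokenizer object can be reused for each line. The following is a minimal sketch: tokenize_lines and WhitespaceTokenizer are hypothetical names introduced here, and the stand-in class only mimics the tokenize(str) -> list shape shown above so that the snippet runs without the package installed.

```python
def tokenize_lines(tokenizer, lines):
    """Yield one token list per non-empty line in `lines`.

    `tokenizer` is any object with a tokenize(str) -> list method.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield tokenizer.tokenize(line)


class WhitespaceTokenizer:
    """Stand-in for illustration only; it just splits on whitespace."""

    def tokenize(self, text):
        return text.split()


# With the package installed, the stand-in would be replaced by:
#   from mosestokenizer import MosesTokenizer
#   tokenizer = MosesTokenizer('en')
tokenizer = WhitespaceTokenizer()
for tokens in tokenize_lines(tokenizer, ["Hello world", "", "lion city"]):
    print(" ".join(tokens))
```

Writing the helper as a generator means a large file can be passed in as an open file object and processed one line at a time.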

Built Distributions

  • fast_mosestokenizer-0.0.3.1-cp38-cp38-macosx_10_15_x86_64.whl (714.1 kB): CPython 3.8, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.3.1-cp37-cp37m-macosx_10_15_x86_64.whl (713.3 kB): CPython 3.7m, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.3.1-cp36-cp36m-macosx_10_15_x86_64.whl (713.3 kB): CPython 3.6m, macOS 10.15+ x86-64
