c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, the library can be used from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
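To check throughput on your own data, a rough timing harness along the following lines may help. It is a minimal sketch and not the project's benchmark script (that lives under ./bench); the function name `lines_per_second` is hypothetical, and `str.split` is only a stand-in baseline so the block runs without any tokenizer installed.

```python
import time
from typing import Callable, List, Sequence


def lines_per_second(
    tokenize: Callable[[str], List[str]],
    lines: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best throughput (lines per second) over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse timer.
    return len(lines) / max(best, 1e-9)


# Stand-in baseline: plain whitespace splitting, not a Moses tokenizer.
sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
print(f"{lines_per_second(str.split, sample):,.0f} lines/s (whitespace baseline)")
```

Passing the `tokenize` method of each tokenizer under comparison to the same function, with the same `lines`, gives figures that can be divided to estimate a speed-up ratio.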

Installation

Python users on Linux and macOS (10.15 or later) can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Use redirection or pipes to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
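The same redirection can be driven from a script. The sketch below wraps the `mosestokenizer en < infile > outfile` invocation shown above using Python's standard `subprocess` module; the helper name `tokenize_file` and its `command` parameter are assumptions made for illustration, not part of the package.

```python
import subprocess
from typing import Sequence


def tokenize_file(
    lang: str,
    infile: str,
    outfile: str,
    command: Sequence[str] = ("mosestokenizer",),
) -> None:
    """Run `<command> <lang> < infile > outfile` without invoking a shell."""
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run([*command, lang], stdin=src, stdout=dst, check=True)
```

With the command-line tool on the PATH, `tokenize_file("en", "infile", "outfile")` should be equivalent to the shell line above. The `command` parameter only exists so that a different executable path can be substituted.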

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
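For line-oriented corpora, the `tokenize` call above can be applied one line at a time, mirroring what the command-line tool does. The following is a minimal sketch: `tokenize_lines` is a hypothetical helper, and the demo at the bottom uses `str.split` as a stand-in so that it runs without the package installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield one space-joined token string per input line."""
    for line in lines:
        text = line.strip()
        if not text:
            # Emit an empty string so output stays line-aligned with input.
            yield ""
            continue
        yield " ".join(tokenize(text))


# Stand-in tokenizer: whitespace splitting, not Moses tokenization.
for out in tokenize_lines(["first line here\n", "\n", "second line\n"], str.split):
    print(repr(out))
```

With the package installed, passing `MosesTokenizer('en').tokenize` as the second argument in place of `str.split` would apply the real tokenizer to each line.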


Source Distributions

No source distribution files available for this release.

Built Distributions

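  • fast_mosestokenizer-0.0.7.2-cp38-cp38-macosx_10_15_x86_64.whl (714.7 kB): CPython 3.8, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.7.2-cp37-cp37m-macosx_10_15_x86_64.whl (714.0 kB): CPython 3.7m, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.7.2-cp36-cp36m-macosx_10_15_x86_64.whl (713.9 kB): CPython 3.6m, macOS 10.15+ x86-64
  • fast_mosestokenizer-0.0.7.2-cp38-cp38-manylinux1_x86_64.whl (801.8 kB): CPython 3.8
  • fast_mosestokenizer-0.0.7.2-cp37-cp37m-manylinux1_x86_64.whl (801.0 kB): CPython 3.7m
  • fast_mosestokenizer-0.0.7.2-cp36-cp36m-manylinux1_x86_64.whl (800.9 kB): CPython 3.6m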

