c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, it can be wrapped and used from most other languages.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
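
To get a rough throughput figure on your own data, a small timing harness along the following lines may help. This is only a sketch and not the methodology used in ./bench: the function name lines_per_second is hypothetical, and it accepts any callable that maps a string to a list of tokens, so the same harness can be pointed at different tokenizers.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(
    tokenize: Callable[[str], List[str]],
    lines: Iterable[str],
    repeat: int = 3,
) -> float:
    """Return the best observed throughput (lines/second) over `repeat` runs.

    Hypothetical helper for illustration; not part of fast-mosestokenizer.
    """
    data = list(lines)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for line in data:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(data) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["Hello, world! This is a test."] * 10_000
    # str.split is only a whitespace baseline; pass another tokenizer's
    # callable (for example a tokenize method) to compare against it.
    rate = lines_per_second(str.split, sample)
    print(f"str.split baseline: {rate:,.0f} lines/s")
```

Taking the best of several runs reduces noise from other processes, but the result still depends heavily on the machine and the input text.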

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Use shell redirection (or pipes) to supply the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
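
Because the command-line tool reads lines from standard input and writes to standard output, it can also be driven from a script. The sketch below assumes the mosestokenizer executable is on PATH and is invoked as shown above (mosestokenizer en); the helper name tokenize_with_cli is hypothetical, and the command is a parameter so that another line-based filter can be substituted.

```python
import shutil
import subprocess
import sys
from typing import List, Sequence


def tokenize_with_cli(
    lines: Sequence[str],
    command: Sequence[str] = ("mosestokenizer", "en"),
) -> List[str]:
    """Send `lines` to a line-based filter on stdin and return its output lines.

    Hypothetical helper for illustration; the default command mirrors the
    invocation shown above.
    """
    # Make sure every input line ends with exactly one newline.
    text = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    if shutil.which("mosestokenizer") is None:
        print("mosestokenizer not found on PATH", file=sys.stderr)
    else:
        for tokenized in tokenize_with_cli(["Hello, world!"]):
            print(tokenized)
```

Sending all lines in a single call avoids starting one process per line, which would likely dominate the running time.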

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
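
For line-oriented data, the tokenize call shown above can be applied to each line in turn. The following is a minimal sketch rather than part of the package: tokenize_lines is a hypothetical helper that takes any callable mapping a string to a list of tokens, and the demo falls back to plain whitespace splitting when mosestokenizer cannot be imported.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    tokenize: Callable[[str], List[str]],
    lines: Iterable[str],
) -> Iterator[str]:
    """Yield each input line as a single string of space-separated tokens.

    Hypothetical helper for illustration; not part of fast-mosestokenizer.
    """
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


if __name__ == "__main__":
    try:
        # Same import and constructor as in the usage example above.
        from mosestokenizer import MosesTokenizer

        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        # Stand-in so the sketch still runs without the package installed.
        tokenize = str.split

    for tokenized in tokenize_lines(tokenize, ["Hello, world!", "Second line."]):
        print(tokenized)
```

Writing the helper as a generator keeps memory use flat, so it can be fed an open file object directly.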

