c++ mosestokenizer

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.

The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, it can be bound to and called from practically any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
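
For readers who want to check the speed-up on their own data, the following is a minimal timing sketch, not the harness used in ./bench. The helper names `lines_per_second` and `speedup` are hypothetical, and the sketch assumes each tokenizer is exposed as a callable that takes one string and returns a list of tokens.

```python
import time
from typing import Callable, List, Sequence

Tokenize = Callable[[str], List[str]]


def lines_per_second(tokenize: Tokenize, lines: Sequence[str], repeats: int = 3) -> float:
    """Return the best observed throughput (lines per second) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(lines) / max(best, 1e-9)


def speedup(fast: Tokenize, slow: Tokenize, lines: Sequence[str], repeats: int = 3) -> float:
    """Return how many times faster `fast` is than `slow` on the same lines."""
    return lines_per_second(fast, lines, repeats) / lines_per_second(slow, lines, repeats)
```

To compare against sacremoses, one would pass the `tokenize` method of each library's `MosesTokenizer` object as `fast` and `slow`. The ratio depends on the corpus and the machine, so it may differ from the figures quoted above.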

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
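
The same piping pattern can be driven from a script. The sketch below is one possible wrapper, not part of the package: the function name `tokenize_with_cli` is hypothetical, and it assumes the `mosestokenizer` executable is on PATH and writes one output line per input line.

```python
import subprocess
from typing import List, Sequence


def tokenize_with_cli(lines: Sequence[str],
                      command: Sequence[str] = ("mosestokenizer", "en")) -> List[str]:
    """Pipe `lines` through `command` via stdin and return its stdout lines."""
    # Make sure every input line ends with exactly one newline.
    payload = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the command exits non-zero
    )
    return result.stdout.splitlines()
```

The `command` parameter is left open so that any options reported by `mosestokenizer -h` can be appended without changing the function.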

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
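
Building on the call shown above, a corpus can be tokenized line by line. This is a minimal sketch under stated assumptions: `tokenize_lines` and `moses_tokenize` are hypothetical helper names, and only `MosesTokenizer(lang).tokenize(text)` is taken from the example above.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield one space-joined token string per input line."""
    for line in lines:
        text = line.strip()
        # Blank lines are passed through as empty strings so that the
        # output stays aligned, line for line, with the input.
        yield " ".join(tokenize(text)) if text else ""


def moses_tokenize(lang: str = "en") -> Callable[[str], List[str]]:
    """Build a tokenize callable from this package (imported lazily)."""
    from mosestokenizer import MosesTokenizer  # requires fast-mosestokenizer
    return MosesTokenizer(lang).tokenize
```

With the package installed, `tokenize_lines(open("infile", encoding="utf-8"), moses_tokenize("en"))` would yield the tokenized lines one at a time. Keeping blank lines in place matters when the file is one side of a parallel corpus.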
