Fast Unicode-based tokenizer for machine translation (MT), written in C++

FastTokenizer

FastTokenizer is a tokenizer that performs language-agnostic tokenization using Unicode information.

While the initial goal is a tokenizer for machine translation, it is generic enough to be adapted to a wide range of NLP tasks, thanks to its ability to handle a wide range of languages and writing systems.

Some notable features of FastTokenizer:

  • Provides just the right amount of tokenization.
    • Segmentation is designed to be intuitive and rule-based. The output format is well suited to downstream NLP components such as subword modelling, RNNs or transformers.
    • It is also designed not to be overly aggressive. This keeps the number of tokens down, allowing models to run faster.
  • Works for any language or writing system.
  • Cross-language: usable from C++ and Python.
  • Performs format-retaining Unicode normalization.
  • Performance matches or exceeds moses-tokenizer on tasks such as WMT and GLUE.
  • Tokenization can be reversed.
    • However, a custom desegmenter should be used to achieve the desired formatting, as desegmentation is highly use-case driven.
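To make the idea of rule-based segmentation from Unicode information concrete, here is a minimal sketch in Python. It is not FastTokenizer's actual algorithm: the function name `split_on_punctuation` is hypothetical, and the only rule it applies is "isolate every character whose Unicode general category is punctuation", using the standard `unicodedata` module.

```python
import unicodedata
from typing import List


def split_on_punctuation(text: str) -> List[str]:
    """Hypothetical helper: emit each Unicode punctuation character as its own token."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        if ch.isspace() or is_punct:
            # A boundary: flush the run of characters collected so far.
            if current:
                tokens.append("".join(current))
                current = []
            if is_punct:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_on_punctuation("他的表现遭到《天空体育》评论员内维尔的批评。"))
```

Because the rule relies on Unicode categories rather than a per-language punctuation list, the same code handles CJK brackets and ASCII punctuation alike. A rule this naive would also split the apostrophe in "master's" and the period in "1.5", which is exactly the kind of over-aggressive behaviour a real tokenizer needs extra rules to avoid.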

Comparison with other tokenizers from the web

Source:          他的表现遭到《天空体育》评论员内维尔的批评。
Segmenter:       ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
Moses:           ['他的表现遭到《天空体育》评论员内维尔的批评。']
Spacy Tokenizer: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Tweet Tokenizer: ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
NLTK Tokenizer:  ['他的表现遭到《天空体育》评论员内维尔的批评。']

Source:          AirPods耳機套
Segmenter:       ['AirPods', '耳機套']
Moses:           ['AirPods耳機套']
Spacy Tokenizer: ['AirPods耳機套']
Tweet Tokenizer: ['AirPods耳機套']
NLTK Tokenizer:  ['AirPods耳機套']

Source:          A typical master's programme has a duration of 1-1.5 years.
Segmenter:       ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Moses:           ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Spacy Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years.']
Tweet Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1', '.', '5', 'years', '.']
NLTK Tokenizer:  ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years', '.']
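The '@-@' token in the Segmenter and Moses outputs above marks a hyphen that was split off from its neighbours, which is what allows the split to be undone later. As a rough illustration of what a custom desegmenter might look like, the sketch below rejoins such tokens. The function name `simple_desegment` and its punctuation rule are assumptions made for this example, not part of FastTokenizer.

```python
import re
from typing import List


def simple_desegment(tokens: List[str]) -> str:
    """Hypothetical desegmenter: undo '@-@' splits and reattach closing punctuation."""
    text = " ".join(tokens)
    # A protected dash appears as the standalone token '@-@'; restore the plain hyphen.
    text = text.replace(" @-@ ", "-")
    # Remove the space that segmentation introduced before common closing punctuation.
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


tokens = ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration',
          'of', '1', '@-@', '1.5', 'years', '.']
print(simple_desegment(tokens))
```

The punctuation rule here is tuned for English-style text only; other use cases (for example, quotes or CJK punctuation) would need their own rules, which is why desegmentation is left to the caller.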

Installation

C++

Coming soon.

Python

pip install fasttokenizer

Usage

C++

#include <string>
#include <fasttokenizer/segmenter.h>

// `args` is assumed to hold the caller's own parsed options.
Segmenter segmenter = Segmenter(args.protected_dash_split);

std::string text = "Hello World!";
std::string output;

// Normalize
output = segmenter.normalize(text);

// Segment
output = segmenter.segment(text);

// Normalize and segment in one call.
// This reduces the std::string to icu::UnicodeString conversion overhead.
output = segmenter.normalize_and_segment(text);

// Desegment
output = segmenter.desegment(text);

Python

import fasttokenizer

segmenter = fasttokenizer.Segmenter()

text = "Hello World!"

# Normalize
output: str = segmenter.normalize(text)

# Segment
output: str = segmenter.segment(text)

# Normalize and segment
output: str = segmenter.normalize_and_segment(text)

# Output of segment is str.
# To get tokens, you can split by whitespace.
tokens = output.split()

# Desegment
output: str = segmenter.desegment(text)
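Since the segmented output is a whitespace-delimited string, tokenizing a whole file comes down to calling the segmenter per line and splitting the result. The sketch below shows one way to wrap that. The helper `tokenize_lines` is hypothetical, and `fake_segment` is only a stand-in so the example runs without the package installed; in real use you would pass `segmenter.normalize_and_segment` instead.

```python
from typing import Callable, Iterable, List


def tokenize_lines(lines: Iterable[str],
                   segment_fn: Callable[[str], str]) -> List[List[str]]:
    """Apply a str -> str segmenter to each line and split the result into tokens."""
    return [segment_fn(line.rstrip("\n")).split() for line in lines]


def fake_segment(text: str) -> str:
    # Stand-in for segmenter.normalize_and_segment: only separates '!' from the word.
    return text.replace("!", " !")


print(tokenize_lines(["Hello World!\n", "Hi!\n"], fake_segment))
```

Passing the segmenter in as a callable keeps the helper independent of how the `Segmenter` object is constructed.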

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

fasttokenizer-0.0.1-cp38-cp38-manylinux1_x86_64.whl (11.9 MB)

Uploaded CPython 3.8

fasttokenizer-0.0.1-cp38-cp38-macosx_10_9_x86_64.whl (11.9 MB)

Uploaded CPython 3.8 macOS 10.9+ x86-64

fasttokenizer-0.0.1-cp37-cp37m-manylinux1_x86_64.whl (11.9 MB)

Uploaded CPython 3.7m

fasttokenizer-0.0.1-cp37-cp37m-macosx_10_9_x86_64.whl (11.9 MB)

Uploaded CPython 3.7m macOS 10.9+ x86-64

fasttokenizer-0.0.1-cp36-cp36m-manylinux1_x86_64.whl (11.9 MB)

Uploaded CPython 3.6m

fasttokenizer-0.0.1-cp36-cp36m-macosx_10_9_x86_64.whl (11.9 MB)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file fasttokenizer-0.0.1-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 f6c9660a98543519c343acf3b46f29d180bcbfd258793de1cab45eb3f75ac1e8
MD5 a4357f0bc29f068033dfc27bcbb9fd45
BLAKE2b-256 60564a71da9c5b6f8472ee241f638fd47d10b9f304cbfad3342ba663172b58be

See more details on using hashes here.

File details

Details for the file fasttokenizer-0.0.1-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp38-cp38-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.8, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f94a0f39bd8857f4a90c5713a64cfaf8f7cf596cb3f93fec5dcd71fa2a49bb9a
MD5 4c117176815567de7be0eaee7526f855
BLAKE2b-256 a1d68e7fd6ca0aa7dffedbfd4ecf00116326ff16aea64790a630dd432e0b675a

File details

Details for the file fasttokenizer-0.0.1-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 ff558ae2c85a686ebef683e519734c654768dffe906238dc94434996168271be
MD5 9cf4f8dff3930262d33e706866a3bbbf
BLAKE2b-256 0de802fb463d3b92d3a2bfc1fcea527616a31409598c5eda9127b71dd55bd0ea

File details

Details for the file fasttokenizer-0.0.1-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp37-cp37m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 9d2fe4a56b7804733be6d9226fb4145598aac946c9df7324ece85c19c6bc88a7
MD5 3bfc778dceda18e6e65e15d6e2e7215b
BLAKE2b-256 87de632a6ac7c7b21dbc50326d3b3ac59c62fb125c7ca0a20873ffc2900f631c

File details

Details for the file fasttokenizer-0.0.1-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 81c9ee41b573436c685129c8e6cb913a46cca696878c50e7cd6d8b0d4cb583f6
MD5 289f7914485dec4b530ecdebfc6d070a
BLAKE2b-256 d279ac77db7063cf9c5907b52f243b828e007e06508d08daf67648af11282a56

File details

Details for the file fasttokenizer-0.0.1-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: fasttokenizer-0.0.1-cp36-cp36m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 11.9 MB
  • Tags: CPython 3.6m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for fasttokenizer-0.0.1-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 aa4c3b6c2db7e42a932fcc49352cdd859295cc956e5cb4964e913d890cad6897
MD5 e787a7627c6243ddebd1a5e3ac099ff4
BLAKE2b-256 1ed3c727e35c794095121f2af0ebc2305c4606f623569afa9e96ae9a57f634ed
