Fast Unicode-based tokenizer for machine translation (MT), written in C++

FastTokenizer

FastTokenizer performs language-agnostic tokenization using Unicode information.

Although it was initially designed for machine translation, the tokenizer is generic enough to be adapted to a wide range of NLP tasks because it handles a wide range of languages and writing systems.

Some of the notable features of FastTokenizer are:

  • Provides just the right amount of tokenization.
    • Segmentation is designed to be intuitive and rule-based. The output format suits downstream NLP models such as subword models, RNNs, or transformers.
    • Segmentation is deliberately not too aggressive. This keeps the number of tokens down, which allows models to run faster.
  • Works for any language or writing system.
  • Usable from multiple programming languages.
  • Performs format-retaining Unicode normalization.
  • Performance matches or exceeds moses-tokenizer on tasks such as WMT and GLUE.
  • Tokenization can be reversed.
    • However, a custom desegmenter should be used to achieve the desired formatting, as desegmentation is highly use-case driven.
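
The description above does not say which normalization form FastTokenizer applies, or exactly what "format retaining" covers. As a point of reference only, the following sketch uses Python's standard `unicodedata` module to show the general difference between canonical normalization (NFC), which leaves compatibility characters such as full-width letters intact, and compatibility normalization (NFKC), which folds them. The sample string is made up for illustration, and this is not FastTokenizer's implementation.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then a full-width "A" (U+FF21).
sample = "e\u0301\uff21"

nfc = unicodedata.normalize("NFC", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFC composes the accent into U+00E9 but keeps the full-width letter.
print([hex(ord(ch)) for ch in nfc])
# NFKC composes the accent and also folds the full-width letter to ASCII "A".
print([hex(ord(ch)) for ch in nfkc])
```

Which of these behaviors (if either) matches FastTokenizer's `normalize` is best checked by running it on your own data.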

Comparison with other tokenizers from the web

Source:          他的表现遭到《天空体育》评论员内维尔的批评。
Segmenter:       ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
Moses:           ['他的表现遭到《天空体育》评论员内维尔的批评。']
Spacy Tokenizer: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Tweet Tokenizer: ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
NLTK Tokenizer:  ['他的表现遭到《天空体育》评论员内维尔的批评。']

Source:          AirPods耳機套
Segmenter:       ['AirPods', '耳機套']
Moses:           ['AirPods耳機套']
Spacy Tokenizer: ['AirPods耳機套']
Tweet Tokenizer: ['AirPods耳機套']
NLTK Tokenizer:  ['AirPods耳機套']

Source:          A typical master's programme has a duration of 1-1.5 years.
Segmenter:       ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Moses:           ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Spacy Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years.']
Tweet Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1', '.', '5', 'years', '.']
NLTK Tokenizer:  ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years', '.']
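
To give a feel for how Unicode properties alone can produce splits like the first two examples, here is a minimal sketch built on Python's standard `unicodedata` module. The names `char_class` and `toy_segment` are hypothetical, and the rules are a deliberate simplification: this is not FastTokenizer's rule set. In particular, it does not reproduce the handling of apostrophes, decimal numbers, or the `@-@` marker seen in the third example.

```python
import unicodedata
from typing import List


def char_class(ch: str) -> str:
    """Return a coarse class for one character: space, punct, han, or word."""
    if ch.isspace():
        return "space"
    # Unicode general categories starting with P (punctuation) or S (symbol).
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "punct"
    # Han ideographs have Unicode names that start with "CJK".
    if unicodedata.name(ch, "").startswith("CJK"):
        return "han"
    return "word"


def toy_segment(text: str) -> List[str]:
    """Split on whitespace, on every punctuation mark, and on class changes."""
    tokens: List[str] = []
    prev = "space"
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            pass  # whitespace only separates tokens
        elif cls == "punct" or cls != prev:
            tokens.append(ch)  # start a new token
        else:
            tokens[-1] += ch  # extend the current token
        prev = cls
    return tokens


print(toy_segment("AirPods耳機套"))
print(toy_segment("他的表现遭到《天空体育》评论员内维尔的批评。"))
```

For these two inputs the sketch yields the same tokens as the Segmenter rows above; for the English sentence it would split `master's` and `1.5` at the punctuation, which is exactly where a real rule set needs more care.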

Installation

C++

Coming soon.

Python

pip install fasttokenizer

Usage

C++

#include <string>

#include <fasttokenizer/segmenter.h>

// `args` is assumed to hold the caller's parsed options.
Segmenter segmenter = Segmenter(args.protected_dash_split);

std::string text = "Hello World!";
std::string output;

// Normalize
output = segmenter.normalize(text);

// Segment
output = segmenter.segment(text);

// Normalize and segment in one call
// (reduces std::string to icu::UnicodeString conversion overhead)
output = segmenter.normalize_and_segment(text);

// Desegment the segmented output
output = segmenter.desegment(output);

Python

import fasttokenizer

segmenter = fasttokenizer.Segmenter()

text = "Hello World!"

# Normalize
output: str = segmenter.normalize(text)

# Segment
output = segmenter.segment(text)

# Normalize and segment in one call
output = segmenter.normalize_and_segment(text)

# The output of segment is a str.
# To get tokens, split it on whitespace.
tokens = output.split()

# Desegment the segmented output
output = segmenter.desegment(output)
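
Because desegmentation is use-case driven, a custom desegmenter can be as small as a couple of regular expressions applied to the whitespace-separated output. The sketch below is one possible starting point, not part of the fasttokenizer API: `toy_desegment` is a hypothetical name, and treating `@-@` as a hyphen that originally had no surrounding spaces is an assumption based on the comparison output above and on the Moses convention.

```python
import re


def toy_desegment(segmented: str) -> str:
    """Undo two common splits in a whitespace-separated token string."""
    # Assumption: "@-@" marks a hyphen that had no spaces around it.
    text = re.sub(r" ?@-@ ?", "-", segmented)
    # Re-attach common trailing punctuation to the preceding token.
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


print(toy_desegment("a duration of 1 @-@ 1.5 years ."))
# a duration of 1-1.5 years.
```

Rules for quotes, brackets, or CJK punctuation would be added in the same way, depending on the formatting the application needs.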

Built Distributions

Version 0.0.2 is available as prebuilt wheels only (11.9 MB each); no source distribution is available for this release.

  • fasttokenizer-0.0.2-cp38-cp38-manylinux1_x86_64.whl (CPython 3.8)
  • fasttokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl (CPython 3.8, macOS 10.15+ x86-64)
  • fasttokenizer-0.0.2-cp37-cp37m-manylinux1_x86_64.whl (CPython 3.7m)
  • fasttokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl (CPython 3.7m, macOS 10.15+ x86-64)
  • fasttokenizer-0.0.2-cp36-cp36m-manylinux1_x86_64.whl (CPython 3.6m)
  • fasttokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl (CPython 3.6m, macOS 10.15+ x86-64)
