Fast Unicode-based tokenizer for MT, written in C++
FastTokenizer
FastTokenizer is a tokenizer that performs language-agnostic tokenization using Unicode information.
While the initial goal is a tokenizer for machine translation, it is generic enough to be adapted to a wide range of NLP tasks because it handles a wide range of languages and writing systems.
Some of the notable features of FastTokenizer are:
- Provides just the right amount of tokenization.
- Segmentation is designed to be intuitive and rule-based. The output format is well suited to downstream NLP models such as subword models, RNNs, or transformers.
- Designed not to be overly aggressive. This keeps the number of tokens down, allowing models to run faster.
- Works for any language or writing system.
- Works across programming languages.
- Performs format-retaining Unicode normalization.
- Performance matches or exceeds moses-tokenizer on tasks such as WMT and GLUE.
- Tokenization can be reversed.
- However, a custom desegmenter should be used to achieve the desired formatting, as desegmentation is highly use-case driven.
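To illustrate what a use-case-specific desegmenter might look like, here is a minimal sketch. It is not part of FastTokenizer: `simple_desegment` is a hypothetical helper, and it assumes that `@-@` marks a hyphen that was split out of a token (as in the comparison output shown in this README) and that English-style punctuation attaches to the preceding word.

```python
import re


def simple_desegment(tokens):
    """Join tokens back into text with a few English-oriented rules.

    Hypothetical helper for illustration only; real use cases will
    usually need their own formatting rules.
    """
    text = " ".join(tokens)
    # Assumption: "@-@" stands for a hyphen that joined two tokens.
    text = text.replace(" @-@ ", "-")
    # Attach sentence punctuation to the token before it.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    return text


tokens = ["A", "typical", "master's", "programme", "has", "a", "duration",
          "of", "1", "@-@", "1.5", "years", "."]
print(simple_desegment(tokens))
# A typical master's programme has a duration of 1-1.5 years.
```

Rules like these are language-dependent (for example, they would not suit text written without spaces between words), which is why the desegmenter is best written per use case.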
Comparison with other tokenizers from the web
Source: 他的表现遭到《天空体育》评论员内维尔的批评。
Segmenter: ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
Moses: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Spacy Tokenizer: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Tweet Tokenizer: ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
NLTK Tokenizer: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Source: AirPods耳機套
Segmenter: ['AirPods', '耳機套']
Moses: ['AirPods耳機套']
Spacy Tokenizer: ['AirPods耳機套']
Tweet Tokenizer: ['AirPods耳機套']
NLTK Tokenizer: ['AirPods耳機套']
Source: A typical master's programme has a duration of 1-1.5 years.
Segmenter: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Moses: ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Spacy Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years.']
Tweet Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1', '.', '5', 'years', '.']
NLTK Tokenizer: ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years', '.']
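The first two comparisons show splits at punctuation and at changes of writing system. The sketch below shows how such splits can be derived from Unicode data alone. It is an illustration under stated assumptions, not FastTokenizer's actual algorithm: `char_class` and `split_on_class_change` are hypothetical names, and the first word of each character's Unicode name is used as a rough stand-in for its script, since Python's standard library does not expose the script property.

```python
import unicodedata


def char_class(ch):
    """Return a coarse class for one character, derived from Unicode data."""
    if ch.isspace():
        return "space"
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    # Assumption: the first word of the character name ("LATIN", "CJK", ...)
    # approximates the script of the character.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "other"


def split_on_class_change(text):
    """Split text at whitespace, punctuation, and changes of character class."""
    tokens, current, prev = [], "", None
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            if current:
                tokens.append(current)
            current, prev = "", None
            continue
        # Punctuation always stands alone; otherwise split when the class changes.
        if current and (cls != prev or cls == "punct"):
            tokens.append(current)
            current = ""
        current += ch
        prev = cls
    if current:
        tokens.append(current)
    return tokens


print(split_on_class_change("AirPods耳機套"))
# ['AirPods', '耳機套']
```

This sketch deliberately ignores cases such as apostrophes inside words and decimal points inside numbers, which need additional rules.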
Installation
C++
Coming soon.
Python
pip install fasttokenizer
Usage
C++
#include <string>
#include <fasttokenizer/segmenter.h>
// `args` is assumed to hold the caller's parsed options.
Segmenter segmenter = Segmenter(args.protected_dash_split);
std::string text = "Hello World!";
std::string output;
// Normalize
output = segmenter.normalize(text);
// Segment
output = segmenter.segment(text);
// Normalize and segment in one call.
// This reduces std::string to icu::UnicodeString conversion overhead.
output = segmenter.normalize_and_segment(text);
// Desegment
output = segmenter.desegment(text);
Python
import fasttokenizer
segmenter = fasttokenizer.Segmenter()
text = "Hello World!"
# Normalize
output: str = segmenter.normalize(text)
# Segment
output: str = segmenter.segment(text)
# Normalize and segment
output: str = segmenter.normalize_and_segment(text)
# The output of segment is a str.
# To get tokens, split it on whitespace.
tokens = output.split()
# Desegment
output: str = segmenter.desegment(text)
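Since the segmenter returns a whitespace-separated string, a small wrapper can turn several lines into token lists in one go. This is a sketch only: `tokenize_lines` is a hypothetical helper, not part of the package, and it relies solely on the `normalize_and_segment` method shown above.

```python
def tokenize_lines(segmenter, lines):
    """Return one list of tokens per input line.

    `segmenter` is any object with a normalize_and_segment(str) -> str
    method, such as the Segmenter shown above.
    """
    return [segmenter.normalize_and_segment(line).split() for line in lines]


# Example use with the real package (requires `pip install fasttokenizer`):
#   import fasttokenizer
#   tokens = tokenize_lines(fasttokenizer.Segmenter(), ["Hello World!"])
```

Passing the segmenter in as an argument keeps the helper easy to test with a stand-in object.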