
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to recover the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
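
The alignment-tracking idea can be illustrated with a pure-Python sketch (an illustration only, not the library's Rust implementation): each token carries the (start, end) offsets of the span it came from, similar in spirit to the offsets exposed on an encoding.

```python
def tokenize_with_offsets(text):
    """Split on whitespace, recording each token's (start, end) span
    in the original string."""
    tokens, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:], (start, len(text))))
    return tokens

text = "I can feel the magic"
for token, (start, end) in tokenize_with_offsets(text):
    # The offsets always point back into the original sentence
    assert text[start:end] == token
```

With offsets preserved through every normalization and pre-tokenization step, any token can be traced back to the exact slice of raw input that produced it.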

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
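
To build intuition for what the BPE-based tokenizers above do at encode time, here is a toy pure-Python version (a sketch only; the real work happens in optimized Rust). Given an ordered merges list, as read from a merges.txt file, encoding a word repeatedly applies the highest-priority adjacent merge:

```python
def bpe_encode(word, merges):
    """Apply BPE merges to a single word. `merges` is an ordered list
    of symbol pairs, highest priority first."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies anymore
        # Merge every occurrence of the best pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_encode("lower", [("l", "o"), ("lo", "w"), ("e", "r")]))
# → ['low', 'er']
```

The four provided tokenizers differ mainly in what counts as a symbol (characters, raw bytes, or SentencePiece-style pieces) and in how the text is pre-tokenized, not in this core merging loop.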

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
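
The ByteLevel components are what let a byte-level BPE handle arbitrary text: every UTF-8 byte is mapped to a visible unicode character before tokenization, so no input ever needs an unknown token. Here is a pure-Python sketch of such a table, modeled on the well-known GPT-2 scheme (the library's exact table may differ):

```python
def bytes_to_unicode():
    """Byte-to-unicode table: printable bytes map to themselves, the
    rest are shifted up past 255 so every byte gets a visible character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

table = bytes_to_unicode()
# A leading space becomes a visible marker character in byte-level tokens:
print("".join(table[b] for b in " magic".encode("utf-8")))  # → Ġmagic
```

This is also why `add_prefix_space=True` matters: the leading space is part of the first token's byte sequence, so "magic" at the start of a sentence and " magic" mid-sentence map to the same token.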

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
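
Under the hood, BPE training is conceptually simple: count adjacent symbol pairs across the corpus and repeatedly merge the most frequent one. A toy pure-Python trainer shows the algorithm (a sketch only, nothing like the parallelized Rust trainer):

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer. `word_freqs` maps each word to its corpus
    frequency; returns the learned merges in order."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = new_vocab.get(tuple(merged), 0) + freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6}, 2)
print(merges)  # → [('w', 'e'), ('l', 'o')]
```

Options like `vocab_size` and `min_frequency` in the real trainer simply bound this loop: stop once the vocabulary is large enough, and ignore pairs seen too rarely.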


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.5.0.tar.gz (64.4 kB, Source)

Built Distributions

  • tokenizers-0.5.0-cp38-cp38-win_amd64.whl (1.1 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.5.0-cp38-cp38-manylinux1_x86_64.whl (7.6 MB, CPython 3.8, manylinux1 x86-64)
  • tokenizers-0.5.0-cp38-cp38-macosx_10_15_x86_64.whl (1.2 MB, CPython 3.8, macOS 10.15+ x86-64)
  • tokenizers-0.5.0-cp37-cp37m-win_amd64.whl (1.1 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.5.0-cp37-cp37m-manylinux1_x86_64.whl (5.7 MB, CPython 3.7m, manylinux1 x86-64)
  • tokenizers-0.5.0-cp37-cp37m-macosx_10_15_x86_64.whl (1.2 MB, CPython 3.7m, macOS 10.15+ x86-64)
  • tokenizers-0.5.0-cp36-cp36m-win_amd64.whl (1.1 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.5.0-cp36-cp36m-manylinux1_x86_64.whl (3.8 MB, CPython 3.6m, manylinux1 x86-64)
  • tokenizers-0.5.0-cp36-cp36m-macosx_10_15_x86_64.whl (1.2 MB, CPython 3.6m, macOS 10.15+ x86-64)
  • tokenizers-0.5.0-cp35-cp35m-win_amd64.whl (1.1 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.5.0-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB, CPython 3.5m, manylinux1 x86-64)
  • tokenizers-0.5.0-cp35-cp35m-macosx_10_15_x86_64.whl (1.2 MB, CPython 3.5m, macOS 10.15+ x86-64)

File details

Details for the file tokenizers-0.5.0.tar.gz.

File metadata

  • Download URL: tokenizers-0.5.0.tar.gz
  • Upload date:
  • Size: 64.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for tokenizers-0.5.0.tar.gz
Algorithm Hash digest
SHA256 24f60da7f382543cd0839437d231cb462e65fbc1461fc3622db55f6df34959d0
MD5 affe6a5dd37ca9f71511991c91440650
BLAKE2b-256 4b8ebba2f969451a0f1989e57527bf2745e5455484c73905f3dfbab2f062ad92

See more details on using hashes here.

