Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can go check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
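A small, self-contained sketch of what the alignment tracking above makes possible: each token carries (start, end) character offsets into the original sentence (the real library exposes these on the `Encoding` object; the offsets here are hand-built for illustration, not produced by the library):

```python
sentence = "I can feel the magic"
# Hypothetical (start, end) offsets, one pair per token
offsets = [(0, 1), (2, 5), (6, 10), (11, 14), (15, 20)]

def original_span(text, offset):
    """Recover the slice of the original text covered by one token."""
    start, end = offset
    return text[start:end]

spans = [original_span(sentence, o) for o in offsets]
print(spans)  # ['I', 'can', 'feel', 'the', 'magic']
```

Because normalization keeps these offsets in sync, the mapping survives lowercasing, accent stripping, and similar transformations.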

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use an existing one as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
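Saving writes a vocabulary file and a merges file named after the prefix you pass. The `{name}-vocab.json` / `{name}-merges.txt` naming assumed in this sketch matches the convention of this release, but check the files actually produced by your version:

```python
import os

def saved_files(directory, name):
    """Paths tokenizer.save(directory, name) is assumed to produce."""
    return (
        os.path.join(directory, name + "-vocab.json"),
        os.path.join(directory, name + "-merges.txt"),
    )

vocab, merges = saved_files("./path/to/directory", "my-bpe")
print(vocab)
print(merges)
```

The resulting paths can then be passed back to `CharBPETokenizer(vocab, merges)` to reload the trained tokenizer.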

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
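To give an intuition for how BertWordPieceTokenizer differs from the BPE variants, here is a conceptual sketch of WordPiece's greedy longest-match-first lookup. The toy vocabulary and `##` continuation prefix illustrate the idea only; the real tokenizer uses a trained vocabulary and additional normalization:

```python
# Toy vocabulary; "##" marks a subword that continues a word
vocab = {"magic", "mag", "##ic", "feel", "fee", "##l", "[UNK]"}

def wordpiece(word, vocab):
    """Split one word into the longest matching subwords, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no subword fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("magic", vocab))  # ['magic']
print(wordpiece("feel", vocab))   # ['feel']
```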

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
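Batched encodings usually need padding before they can be stacked into a model input. The library can apply this step itself (look for padding/truncation options such as `enable_padding` in your version); conceptually, with toy id sequences and a hypothetical pad id of 0, it amounts to:

```python
batch = [[5, 12, 9], [7, 3], [4]]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(pad_batch(batch))  # [[5, 12, 9], [7, 3, 0], [4, 0, 0]]
```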

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
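What the BpeTrainer repeats under the hood, until the target vocab_size is reached, is essentially: count adjacent symbol pairs across the corpus and merge the most frequent one. A minimal sketch of a single such counting step, on a toy word-frequency corpus (not the real trainer):

```python
from collections import Counter

# Toy corpus: each word as a tuple of symbols, with its frequency
words = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5}

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

print(most_frequent_pair(words))  # ('u', 'g') — seen 20 times
```

Merging the winning pair into a new symbol and repeating is what grows the vocabulary from characters up to subwords.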
