
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main Rust library.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs (see the sketch after this list).
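
For example, alignment tracking and pre-processing are exposed directly on the tokenizer and its encodings. Here is a minimal sketch, assuming an existing WordPiece vocabulary file (the path is a placeholder, and BertWordPieceTokenizer is used purely as an illustration):

from tokenizers import BertWordPieceTokenizer

# Load a WordPiece tokenizer from an existing vocabulary file (placeholder path)
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")

# Truncate long sequences and pad each batch to its longest member
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding()

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)   # tokens, including the special tokens added for the model
print(encoded.offsets)  # (start, end) character spans back into the original sentence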

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
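
Once the build finishes, a quick sanity check (run from the same virtual env) is to import the package and print its version:

import tokenizers

# Should print the installed version, e.g. 0.8.1, if the extension built correctly
print(tokenizers.__version__)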

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
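
The same tokenizer can also encode whole batches of sentences at once, and decode ids back into text. A short sketch continuing the example above (the sentence contents are placeholders):

# Encode several sentences in one call
encodings = tokenizer.encode_batch(["I can feel the magic, can you?", "So can I!"])
print([e.tokens for e in encodings])

# And decode ids back into a string
print(tokenizer.decode(encoded.ids))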

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
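
For instance, the byte-level variant follows the exact same pattern; a brief sketch, with training parameters chosen purely for illustration:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE the same way as the character-level one above
tokenizer = ByteLevelBPETokenizer()
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ], vocab_size=20000, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic, can you?")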

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
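
Normalization can be composed in the same way when building your own tokenizer. A minimal sketch, using NFKC and lowercasing only as an example of what can be plugged in:

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Build a tokenizer whose inputs are NFKC-normalized and lowercased before pre-tokenization
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.Sequence([
	normalizers.NFKC(),
	normalizers.Lowercase(),
])
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)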


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.8.1.tar.gz (97.0 kB): Source

Built Distributions

  • tokenizers-0.8.1-cp38-cp38-win_amd64.whl (1.9 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.8.1-cp38-cp38-win32.whl (1.7 MB): CPython 3.8, Windows x86
  • tokenizers-0.8.1-cp38-cp38-manylinux1_x86_64.whl (3.0 MB): CPython 3.8
  • tokenizers-0.8.1-cp38-cp38-macosx_10_11_x86_64.whl (2.1 MB): CPython 3.8, macOS 10.11+ x86-64
  • tokenizers-0.8.1-cp37-cp37m-win_amd64.whl (1.9 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.8.1-cp37-cp37m-win32.whl (1.7 MB): CPython 3.7m, Windows x86
  • tokenizers-0.8.1-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB): CPython 3.7m
  • tokenizers-0.8.1-cp37-cp37m-macosx_10_11_x86_64.whl (2.1 MB): CPython 3.7m, macOS 10.11+ x86-64
  • tokenizers-0.8.1-cp36-cp36m-win_amd64.whl (1.9 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.8.1-cp36-cp36m-win32.whl (1.7 MB): CPython 3.6m, Windows x86
  • tokenizers-0.8.1-cp36-cp36m-manylinux1_x86_64.whl (3.0 MB): CPython 3.6m
  • tokenizers-0.8.1-cp36-cp36m-macosx_10_11_x86_64.whl (2.1 MB): CPython 3.6m, macOS 10.11+ x86-64
  • tokenizers-0.8.1-cp35-cp35m-win_amd64.whl (1.9 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.8.1-cp35-cp35m-win32.whl (1.7 MB): CPython 3.5m, Windows x86
  • tokenizers-0.8.1-cp35-cp35m-manylinux1_x86_64.whl (3.0 MB): CPython 3.5m
  • tokenizers-0.8.1-cp35-cp35m-macosx_10_11_x86_64.whl (2.1 MB): CPython 3.5m, macOS 10.11+ x86-64

File hashes

All files were uploaded with twine/3.2.0 from CPython 3.8.3 (not using Trusted Publishing). SHA256 digests:

  • tokenizers-0.8.1.tar.gz: e228ec9dcdced445124419219477ac4b4c4b0dc57b95b196a9ef37097d382559
  • tokenizers-0.8.1-cp38-cp38-win_amd64.whl: e9283a41bbdfb0808e019247e52ae75bba71740c9b20418bdef117eacdcb49db
  • tokenizers-0.8.1-cp38-cp38-win32.whl: a3b92e3ecb32ef9f1ec06a2c947086dc47c3735257f8b83d99249e0bef652d9c
  • tokenizers-0.8.1-cp38-cp38-manylinux1_x86_64.whl: e16834fa50d7bab02759cff82a9d4c16914dd298cef1c3f26e2752596a26b148
  • tokenizers-0.8.1-cp38-cp38-macosx_10_11_x86_64.whl: 71784fa3b05849ce45041e7138dcebe8bbe1cdd99d1244deb4fff1a1409606aa
  • tokenizers-0.8.1-cp37-cp37m-win_amd64.whl: 48bf8661318dbe14da89ebc4ce97cc72ad4a3ee46cbfe4ec6eded0d1f83d6bc8
  • tokenizers-0.8.1-cp37-cp37m-win32.whl: 46d6a918f0f485f13c5be99e99016d8ea427638c726be48ea5e8ac2e83d079ea
  • tokenizers-0.8.1-cp37-cp37m-manylinux1_x86_64.whl: 0f0861e9584448cf984b9cb9a6df3a8b6e2a89cafbfc320437a954772cc610c0
  • tokenizers-0.8.1-cp37-cp37m-macosx_10_11_x86_64.whl: e4c1baaad75e0ba14db37d67517250efd6f99b35485d86d5ce0ec701365f7951
  • tokenizers-0.8.1-cp36-cp36m-win_amd64.whl: 8ca93df772027de407136cb7cdf956e676e4164764e9608dc657949123bf4b90
  • tokenizers-0.8.1-cp36-cp36m-win32.whl: 49c1b04a87728326d0ae3e4c9a5d05550b88ef290c56e163fc88b83cb78ca266
  • tokenizers-0.8.1-cp36-cp36m-manylinux1_x86_64.whl: 8eae686600042926272398d6ce7de77445830a01525bf032782bea61afc51951
  • tokenizers-0.8.1-cp36-cp36m-macosx_10_11_x86_64.whl: e3d82ebf67fab020dcaf74bc7507558a0098e660eb05aa7bf2e31a80cee1283d
  • tokenizers-0.8.1-cp35-cp35m-win_amd64.whl: f63c1b490f0764f8ff34ee069fae90999aa511129e8fae1cbfc68f545b757bb3
  • tokenizers-0.8.1-cp35-cp35m-win32.whl: ec2d6c49272dfc8aa737790e25007ef2cb85921448b7e60134a4556974166f24
  • tokenizers-0.8.1-cp35-cp35m-manylinux1_x86_64.whl: 9bdf80b537a1e68b91b608d176cc091b01a3ae370f4120be051aa4b19c9ea8f3
  • tokenizers-0.8.1-cp35-cp35m-macosx_10_11_x86_64.whl: a39211bc81c5bd005156df87f4a1f557fbcebd2d81fa3dc635370e0de48e5c7c
