Fast and Customizable Tokenizers

Project description

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize, using four pre-made tokenizers (BERT's WordPiece and the three most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to recover the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
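To make the alignment idea above concrete, here is a toy sketch in plain Python (illustrative only, not the library's implementation) of how a whitespace tokenizer can record character offsets so every token maps back to a slice of the original sentence:

```python
def tokenize_with_offsets(text):
    """Toy whitespace tokenizer that records (start, end) character
    offsets, so every token can be traced back to the original text."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:], (start, len(text))))
    return tokens

sent = "I can feel the magic"
for tok, (s, e) in tokenize_with_offsets(sent):
    assert sent[s:e] == tok  # the offsets always recover the original slice
```

The library tracks such offsets through the whole pipeline, including normalization, which is the hard part this toy version does not attempt.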

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them from vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
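Under the hood, BPE training repeatedly finds the most frequent adjacent pair of symbols in the corpus and merges it into a new symbol. A minimal, self-contained sketch of one such merge step (illustrative only; the actual Rust implementation is far more elaborate and optimized):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences
    (word -> frequency) and return the most frequent pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Apply one merge: replace every occurrence of `pair` with a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus: word -> frequency, each word initially split into characters
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 3}
pair = most_frequent_pair(words)  # ('l', 'o'); ties broken by first-seen order
words = merge_pair(words, pair)   # 'l' + 'o' is now the single symbol 'lo'
```

Repeating this until the vocabulary reaches `vocab_size` yields the merges.txt file the pre-built tokenizers load.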

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!
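The byte-level variant sidesteps unknown characters entirely: it operates on the UTF-8 bytes of the input, so the base alphabet has at most 256 symbols and any string can be represented. A quick illustration of the idea (not the library's actual byte-to-unicode mapping):

```python
def to_byte_symbols(text):
    """Represent any string as a sequence drawn from a fixed 256-symbol
    alphabet: its UTF-8 bytes. No character is ever out-of-vocabulary."""
    return list(text.encode("utf-8"))

assert max(to_byte_symbols("héllo ✨")) < 256  # even emoji fit the alphabet
```

BPE merges are then learned over these byte symbols instead of characters.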

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.9.3.tar.gz (172.0 kB)

Built Distributions

  • tokenizers-0.9.3-cp38-cp38-win_amd64.whl (1.9 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.9.3-cp38-cp38-win32.whl (1.7 MB): CPython 3.8, Windows x86
  • tokenizers-0.9.3-cp38-cp38-manylinux1_x86_64.whl (2.9 MB): CPython 3.8, manylinux1 x86-64
  • tokenizers-0.9.3-cp38-cp38-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.8, macOS 10.11+ x86-64
  • tokenizers-0.9.3-cp37-cp37m-win_amd64.whl (1.9 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.9.3-cp37-cp37m-win32.whl (1.7 MB): CPython 3.7m, Windows x86
  • tokenizers-0.9.3-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB): CPython 3.7m, manylinux1 x86-64
  • tokenizers-0.9.3-cp37-cp37m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.7m, macOS 10.11+ x86-64
  • tokenizers-0.9.3-cp36-cp36m-win_amd64.whl (1.9 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.9.3-cp36-cp36m-win32.whl (1.7 MB): CPython 3.6m, Windows x86
  • tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9 MB): CPython 3.6m, manylinux1 x86-64
  • tokenizers-0.9.3-cp36-cp36m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.6m, macOS 10.11+ x86-64
  • tokenizers-0.9.3-cp35-cp35m-win_amd64.whl (1.9 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.9.3-cp35-cp35m-win32.whl (1.7 MB): CPython 3.5m, Windows x86
  • tokenizers-0.9.3-cp35-cp35m-manylinux1_x86_64.whl (2.9 MB): CPython 3.5m, manylinux1 x86-64
  • tokenizers-0.9.3-cp35-cp35m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.5m, macOS 10.11+ x86-64

