Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
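
To make the alignment tracking point above concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name `tokenize_with_offsets` is hypothetical, and lowercasing plus whitespace splitting stand in for real normalization and tokenization.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Toy tokenizer: lowercase each whitespace-separated word and keep its
    (start, end) character offsets in the ORIGINAL, un-normalized text."""
    result = []
    for match in re.finditer(r"\S+", text):
        normalized = match.group().lower()  # stand-in for normalization
        result.append((normalized, match.span()))  # offsets refer to the input
    return result


original = "I can FEEL the Magic"
tokens = tokenize_with_offsets(original)

for token, (start, end) in tokens:
    # Slicing the original text with the offsets recovers the source span.
    print(f"{token!r} <- {original[start:end]!r}")
```

Because the offsets index into the original string, a normalized token such as `feel` can still be mapped back to `FEEL` in the input.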

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
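
After either installation method, a quick way to confirm that the package is visible to the current interpreter is sketched below. It uses only the standard library; the helper name `installed_version` is hypothetical, and `importlib.metadata` assumes Python 3.8 or later.

```python
import importlib.util
from typing import Optional


def installed_version(package: str = "tokenizers") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    if importlib.util.find_spec(package) is None:
        return None
    # importlib.metadata is in the standard library from Python 3.8 onwards.
    from importlib.metadata import PackageNotFoundError, version
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version())
```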

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
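
For readers who have not seen these two files before, the sketch below shows their usual layout using only the standard library. It assumes the common BPE convention: vocab.json maps each token to an integer id, and merges.txt lists one merge rule per line, optionally after a `#version` header. The file contents are made up for illustration and are not a real vocabulary.

```python
import json
import tempfile
from pathlib import Path

# Tiny made-up files, only to show the expected layout.
workdir = Path(tempfile.mkdtemp())
vocab_path = workdir / "vocab.json"
merges_path = workdir / "merges.txt"

vocab_path.write_text(
    json.dumps({"<unk>": 0, "l": 1, "o": 2, "w": 3, "lo": 4, "low": 5}),
    encoding="utf-8",
)
merges_path.write_text("#version: 0.2\nl o\nlo w\n", encoding="utf-8")

# vocab.json: token -> id
vocab = json.loads(vocab_path.read_text(encoding="utf-8"))

# merges.txt: one merge rule per line, in priority order; skip the header.
merges = [
    tuple(line.split(" "))
    for line in merges_path.read_text(encoding="utf-8").splitlines()
    if line and not line.startswith("#version")
]

print(vocab["low"])
print(merges)
```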

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
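
To give an intuition for what training a BPE vocabulary involves, here is a toy pure-Python sketch: it repeatedly merges the most frequent pair of adjacent symbols. It is far slower than the Rust implementation and is not meant to reproduce its exact behavior; the name `train_toy_bpe` and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_toy_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters.
    corpus: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Most frequent pair wins; ties are broken alphabetically so the
        # result is deterministic (an assumption of this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus

    return merges


toy_words = ["low", "low", "lower", "newest", "newest", "newest"]
merges = train_toy_bpe(toy_words, num_merges=3)
print(merges)
```

The learned merge list plays the same role as a merges.txt file: applying the rules in order to new text yields its subword tokens.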

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
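
WordPiece differs from the BPE variants mainly in how a word is split at encoding time. The sketch below shows the greedy longest-match-first strategy commonly described for BERT's WordPiece, with `##` marking continuation pieces. It is a simplified illustration, not the code behind BertWordPieceTokenizer; the function name and the toy vocabulary are hypothetical.

```python
from typing import List, Set


def wordpiece_encode(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "##a", "play", "##ing"}
print(wordpiece_encode("unaffable", toy_vocab))
print(wordpiece_encode("playing", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```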

Build your own

When the provided tokenizers don't give you enough freedom, you can build your own by putting together the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
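
The "byte-level" part of this tokenizer can be illustrated without the library. Working on UTF-8 bytes means the base alphabet has at most 256 symbols, so any input string can be represented. The sketch below shows only that idea; the helper names are hypothetical, and the real ByteLevel components additionally map bytes to printable characters, which is not reproduced here.

```python
from typing import List


def to_byte_symbols(text: str) -> List[int]:
    """Split text into its UTF-8 bytes, the base symbols of a byte-level model."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols: List[int]) -> str:
    """Inverse of to_byte_symbols: rebuild the original string."""
    return bytes(symbols).decode("utf-8")


text = "café"
symbols = to_byte_symbols(text)

# The accented character takes two bytes, so 4 characters become 5 symbols.
print(symbols)
print(len(text), "characters ->", len(symbols), "bytes")
print(from_byte_symbols(symbols))
```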

Now, when you want to use this tokenizer, this is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")

This version

0.9.0

Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.9.0.tar.gz (170.4 kB): Source

Built Distributions

  • tokenizers-0.9.0-cp38-cp38-win_amd64.whl (1.9 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.9.0-cp38-cp38-win32.whl (1.7 MB): CPython 3.8, Windows x86
  • tokenizers-0.9.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB): CPython 3.8
  • tokenizers-0.9.0-cp38-cp38-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.8, macOS 10.11+ x86-64
  • tokenizers-0.9.0-cp37-cp37m-win_amd64.whl (1.9 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.9.0-cp37-cp37m-win32.whl (1.7 MB): CPython 3.7m, Windows x86
  • tokenizers-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB): CPython 3.7m
  • tokenizers-0.9.0-cp37-cp37m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.7m, macOS 10.11+ x86-64
  • tokenizers-0.9.0-cp36-cp36m-win_amd64.whl (1.9 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.9.0-cp36-cp36m-win32.whl (1.7 MB): CPython 3.6m, Windows x86
  • tokenizers-0.9.0-cp36-cp36m-manylinux1_x86_64.whl (2.9 MB): CPython 3.6m
  • tokenizers-0.9.0-cp36-cp36m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.6m, macOS 10.11+ x86-64
  • tokenizers-0.9.0-cp35-cp35m-win_amd64.whl (1.9 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.9.0-cp35-cp35m-win32.whl (1.7 MB): CPython 3.5m, Windows x86
  • tokenizers-0.9.0-cp35-cp35m-manylinux1_x86_64.whl (2.9 MB): CPython 3.5m
  • tokenizers-0.9.0-cp35-cp35m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.5m, macOS 10.11+ x86-64
