
Fast and Customizable Tokenizers


Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
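
To make the alignment-tracking point above more concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function `tokenize_with_offsets` and its regex-based splitting are hypothetical and only illustrate how a normalized token can keep a (start, end) span pointing back into the original text. In the library itself, the comparable information is exposed on the encoding returned by `encode` (its `offsets` attribute).

```python
import re


def tokenize_with_offsets(text):
    """Split text into words and punctuation, lowercase each piece,
    and remember where it came from in the original string.

    Hypothetical helper for illustration only.
    """
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group().lower())  # normalized form of the token
        offsets.append(match.span())          # span in the ORIGINAL text
    return tokens, offsets


text = "I can feel the MAGIC, can you?"
tokens, offsets = tokenize_with_offsets(text)

# Each normalized token can be mapped back to the exact original substring.
for token, (start, end) in zip(tokens, offsets):
    print(token, "->", repr(text[start:end]))
```

Because the spans always index into the untouched input, normalization (here, lowercasing) never loses track of the source characters.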

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
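
For intuition about what training a BPE vocabulary involves, the following is a toy pure-Python sketch of the classic merge-learning loop. It is an assumption-laden illustration, not the Rust implementation behind these tokenizers: `train_toy_bpe` is a hypothetical name, it works on whole words split into characters, and it ignores details such as end-of-word suffixes, minimum frequency thresholds, and vocabulary-size limits.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words.

    Hypothetical helper for illustration only.
    """
    # Every distinct word starts as a tuple of single characters,
    # weighted by how often the word occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Pick the most frequent pair (ties: the first one seen wins).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

With this tiny corpus, the two learned merges first join "l" and "o", then "lo" and "w", so the frequent word "low" ends up as a single symbol.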

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, this is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
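
As a rough illustration of what "byte-level" means in the example above, here is a pure-Python sketch of the byte-to-character table popularized by GPT-2. It assumes the ByteLevel components follow the same scheme, and that `add_prefix_space` prepends a space only when the text does not already start with one. Both are assumptions, and `bytes_to_unicode` and `byte_level` are hypothetical helper names, not part of the library's API.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(0x21, 0x7F))    # '!' .. '~'
        + list(range(0xA1, 0xAD))  # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100)) # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (controls, space, ...) are shifted to code points >= 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def byte_level(text, add_prefix_space=True):
    """Render the UTF-8 bytes of `text` as printable characters."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


table = bytes_to_unicode()
print(len(table))           # one entry per possible byte value
print(byte_level("magic"))  # the leading space becomes a visible marker character
```

Since every possible byte has an entry in the table, any input string can be represented without an unknown token, which is the main appeal of working at the byte level.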
