
Fast and Customizable Tokenizers



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
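
To make the alignment-tracking feature concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function names, the normalization steps (lowercasing and accent stripping), and the splitting regex are all assumptions chosen for illustration. Each normalized character remembers the index of the original character it came from, so every token can be mapped back to a span of the input.

```python
import re
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents; alignment[i] is the source index of output char i."""
    out_chars, alignment = [], []
    for i, ch in enumerate(text):
        # NFD splits an accented letter into base letter + combining marks.
        for piece in unicodedata.normalize("NFD", ch):
            if unicodedata.category(piece) == "Mn":
                continue  # drop the combining mark (the accent)
            for low in piece.lower():
                out_chars.append(low)
                alignment.append(i)
    return "".join(out_chars), alignment


def tokenize_with_offsets(text):
    """Split the normalized text, reporting offsets into the ORIGINAL text."""
    normalized, alignment = normalize_with_alignment(text)
    results = []
    for match in re.finditer(r"\w+|[^\w\s]", normalized):
        start = alignment[match.start()]
        end = alignment[match.end() - 1] + 1
        results.append((match.group(), (start, end)))
    return results


sentence = "H\u00e9llo, W\u00f6rld!"
tokens = tokenize_with_offsets(sentence)
for token, (start, end) in tokens:
    # The token is normalized, but the slice shows the untouched original text.
    print(token, (start, end), repr(sentence[start:end]))
```

The tokens come out normalized ("hello", "world"), while the offsets still slice the accented, capitalized originals out of the input sentence.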

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
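
For readers wondering what the two files contribute, here is a rough sketch of how a BPE encoder could use them. It assumes the common layout where vocab.json maps token strings to ids and merges.txt lists one merge rule per line in priority order. The file contents, the "</w>" end-of-word suffix, the "<unk>" token, and the helper name apply_merges are illustrative assumptions, not taken from the library.

```python
import json

# Hypothetical file contents, inlined as strings for the sake of the example.
vocab_json = '{"<unk>": 0, "magic</w>": 1, "c": 2, "a": 3, "n</w>": 4}'
merges_txt = "m a\ng i\nma gi\nmagi c</w>\n"

vocab = json.loads(vocab_json)
merges = [tuple(line.split()) for line in merges_txt.splitlines() if line]


def apply_merges(word, merges, end_of_word="</w>"):
    """Split a word into characters, then apply merge rules by priority."""
    if not word:
        return []
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word[:-1]) + [word[-1] + end_of_word]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable rule left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


for word in ["magic", "can"]:
    pieces = apply_merges(word, merges)
    ids = [vocab.get(piece, vocab["<unk>"]) for piece in pieces]
    print(word, pieces, ids)
```

With these toy rules, "magic" collapses into a single token, while "can" stays split into three pieces because no merge rule covers it.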

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
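
As background on what training does conceptually, the sketch below implements the textbook BPE learning loop in pure Python: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is a simplified illustration, not the Rust implementation. The function name train_bpe, the "</w>" suffix, and the tie-breaking rule are assumptions made here.

```python
from collections import Counter


def train_bpe(corpus, num_merges, end_of_word="</w>"):
    """Learn up to num_merges merge rules from a whitespace-separated corpus."""
    word_freqs = Counter(corpus.split())
    # Represent each word as a tuple of symbols, marking the last character.
    words = {
        tuple(word[:-1]) + (word[-1] + end_of_word,): freq
        for word, freq in word_freqs.items()
    }
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return merges


corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
learned = train_bpe(corpus, num_merges=3)
print(learned)
```

The learned rules are ordered, which is the same ordering a merges file would need to preserve for encoding to reproduce the training-time segmentation.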

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
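
WordPiece differs from BPE at encoding time: instead of replaying merge rules, it greedily takes the longest vocabulary entry that matches at the current position. The following is a minimal sketch of that greedy longest-match-first scheme as popularized by BERT. The toy vocabulary and the function name are made up for illustration, and the "##" continuation prefix and "[UNK]" token follow the usual BERT conventions rather than anything stated above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(current)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "##a", "magic", "[UNK]"}
for word in ["unaffable", "magic", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Note that a single unmatched position makes the whole word unknown in this scheme, which is one reason WordPiece vocabularies usually include every individual character.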

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
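
To give a feel for what "byte-level" means in the pipeline above, here is a sketch of the byte-to-character mapping popularized by GPT-2, in which every one of the 256 byte values gets a printable stand-in character. That the ByteLevel pre-tokenizer uses exactly this table is an assumption here, and the real component also splits the text into words, which this sketch omits. The helper names are illustrative.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a distinct printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control chars, space, ...) are shifted past 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


BYTE_TO_CHAR = bytes_to_unicode()


def byte_level(text, add_prefix_space=True):
    """Encode text as UTF-8 bytes and render each byte as its stand-in char."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(byte_level("magic"))
print(byte_level("\u00e9"))
```

Because the alphabet is the 256 byte values, any input string can be represented without an unknown token; the leading stand-in character in the output is the encoded prefix space.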

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")


Download files


Source Distribution

  • tokenizers-0.9.2.tar.gz (170.8 kB): Source

Built Distributions

  • tokenizers-0.9.2-cp38-cp38-win_amd64.whl (1.9 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.9.2-cp38-cp38-win32.whl (1.7 MB): CPython 3.8, Windows x86
  • tokenizers-0.9.2-cp38-cp38-manylinux1_x86_64.whl (2.9 MB): CPython 3.8
  • tokenizers-0.9.2-cp38-cp38-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.8, macOS 10.11+ x86-64
  • tokenizers-0.9.2-cp37-cp37m-win_amd64.whl (1.9 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.9.2-cp37-cp37m-win32.whl (1.7 MB): CPython 3.7m, Windows x86
  • tokenizers-0.9.2-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB): CPython 3.7m
  • tokenizers-0.9.2-cp37-cp37m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.7m, macOS 10.11+ x86-64
  • tokenizers-0.9.2-cp36-cp36m-win_amd64.whl (1.9 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.9.2-cp36-cp36m-win32.whl (1.7 MB): CPython 3.6m, Windows x86
  • tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9 MB): CPython 3.6m
  • tokenizers-0.9.2-cp36-cp36m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.6m, macOS 10.11+ x86-64
  • tokenizers-0.9.2-cp35-cp35m-win_amd64.whl (1.9 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.9.2-cp35-cp35m-win32.whl (1.7 MB): CPython 3.5m, Windows x86
  • tokenizers-0.9.2-cp35-cp35m-manylinux1_x86_64.whl (2.9 MB): CPython 3.5m
  • tokenizers-0.9.2-cp35-cp35m-macosx_10_11_x86_64.whl (2.0 MB): CPython 3.5m, macOS 10.11+ x86-64

