
Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can find it described there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
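
To make the alignment-tracking feature concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the helper names normalize_with_alignment and original_span are hypothetical, and the normalization (lowercasing plus accent stripping) is only an assumed example. Each normalized character remembers the index of the original character it came from, so a span found in the normalized text can be mapped back to the original sentence.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for every output character
    the index of the original character it was derived from."""
    out_chars = []
    alignment = []  # alignment[i] = index in `text` behind out_chars[i]
    for index, char in enumerate(text):
        # NFD splits an accented letter into base letter + combining marks.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # drop the accent itself
            for lowered in piece.lower():
                out_chars.append(lowered)
                alignment.append(index)
    return "".join(out_chars), alignment


def original_span(text, alignment, start, end):
    """Map the [start, end) span of the normalized string back to `text`."""
    return text[alignment[start]:alignment[end - 1] + 1]


original = "H\u00e9llo WORLD"  # "\u00e9" is an accented "e"
normalized, alignment = normalize_with_alignment(original)

# A token located in the normalized text...
start = normalized.index("hello")
end = start + len("hello")

# ...can be traced back to the matching slice of the original text.
print(normalized)
print(original_span(original, alignment, start, end))
```

The same bookkeeping idea extends to normalizations that insert or delete characters, as long as every surviving character keeps a pointer to its source position.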

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
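
For readers new to BPE, the following toy sketch shows what training a BPE vocabulary means in principle: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. This is an illustration in plain Python, not the library's Rust implementation. The function names (train_bpe, merge_pair, apply_bpe), the tiny corpus, and the tie-breaking rule are all assumptions made for the example, and it omits details such as end-of-word markers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Assumed tie-break: highest count, then the lexicographically largest pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=3)
print(merges)
print(apply_bpe("lowest", merges))
```

Note that "lowest" never appears in the toy corpus, yet it is still split into known sub-word units. That is the property that lets BPE-style tokenizers handle unseen words.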

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
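
As background on what "byte-level" means here, the sketch below uses only the Python standard library and is not the library's code. The helper names to_byte_symbols and from_byte_symbols are hypothetical. The point it illustrates is that any string can be expressed as UTF-8 bytes, so a byte-level model starts from a fixed alphabet of at most 256 base symbols and never meets a character it cannot represent.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 byte values, the base symbols of a byte-level model."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Rebuild the original text from its byte values."""
    return bytes(symbols).decode("utf-8")


text = "magic \u2728"  # "\u2728" is a non-ASCII character (sparkles)
symbols = to_byte_symbols(text)

# The non-ASCII character expands to several bytes.
print(len(text), len(symbols))
# Every symbol fits in the 0..255 base alphabet.
print(all(0 <= s < 256 for s in symbols))
# The mapping is lossless.
print(from_byte_symbols(symbols) == text)
```

BPE merges are then learned on top of these byte symbols, in the same way a character-level BPE learns merges on top of characters.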
