Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch right after this list).
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
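
As a quick illustration of the last two points, here is a minimal sketch using the Bert tokenizer. It assumes you already have a BERT vocab file at the (hypothetical) path below; the exact truncation/padding option names may vary slightly between releases:

from tokenizers import BertWordPieceTokenizer

# Hypothetical path: point this at an existing BERT vocab file
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt")

# Pre-processing handled by the tokenizer itself: truncation and padding
tokenizer.enable_truncation(max_length=16)
tokenizer.enable_padding(pad_token="[PAD]")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Alignment tracking: each token keeps the (start, end) character offsets
# of the span it was produced from in the original sentence
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])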

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
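
Once the build finishes, a quick sanity check is to import the freshly installed package from the virtual env (just an illustrative snippet):

# Should print the installed version if the bindings were built correctly
import tokenizers
print(tokenizers.__version__)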

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
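
Going back from ids to text works through the same object; a short sketch:

# Turn the ids back into a readable string
decoded = tokenizer.decode(encoded.ids)
print(decoded)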

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
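
Training also accepts the usual options such as the target vocabulary size, minimum pair frequency, and special tokens. Keyword names may differ slightly between releases, so treat this as a sketch:

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()
tokenizer.train(
    ["./path/to/files/1.txt", "./path/to/files/2.txt"],
    vocab_size=30000,          # size of the final vocabulary
    min_frequency=2,           # ignore pairs seen fewer than 2 times
    special_tokens=["<unk>"],  # tokens that must always be in the vocab
)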

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
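
For example, switching to the byte-level variant only changes the class you instantiate; the train/encode/save workflow stays the same (a sketch, with hypothetical file paths):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(["./path/to/files/1.txt"], vocab_size=30000, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic, can you?")
tokenizer.save("./path/to/directory/my-byte-level-bpe.tokenizer.json")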

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it's as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
