Fast and Customizable Tokenizers

Project description



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the sketch right after this list).
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
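For example, the alignment tracking above is exposed through each encoding's offsets. A minimal sketch, assuming you already have vocab.json and merges.txt files for a CharBPETokenizer (the paths below are placeholders):

from tokenizers import CharBPETokenizer

# Placeholder paths: point these at your own vocabulary and merges files
tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each token comes with (start, end) character offsets into the original sentence,
# so the exact span of text it was produced from can always be recovered
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", repr(sentence[start:end]))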

Installation

With pip:

pip install tokenizers
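You can then run a quick sanity check from Python (this assumes the package exposes a version string, as recent releases do):

# Quick sanity check that the package imports correctly
import tokenizers

print(tokenizers.__version__)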

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using its vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
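Going back from ids to text is just as easy with decode. A small self-contained sketch, using the same placeholder paths as above:

from tokenizers import CharBPETokenizer

# Same placeholder vocabulary and merges files as above
tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")

# decode re-joins the sub-word pieces into a plain string
print(tokenizer.decode(encoded.ids))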

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
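As an example, BertWordPieceTokenizer loads from a single BERT-style vocab.txt file instead of vocab.json/merges.txt. A minimal sketch (the path is a placeholder, and the lowercase option is an assumption meant to match uncased models):

from tokenizers import BertWordPieceTokenizer

# Placeholder path to a BERT-style vocab.txt
tokenizer = BertWordPieceTokenizer("./path/to/bert/vocab.txt", lowercase=True)

encoded = tokenizer.encode("I can feel the magic, can you?")

# With the default post-processor, the output typically includes [CLS] and [SEP]
print(encoded.tokens)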

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
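Since truncation and padding were listed among the main features, here is a short sketch of what they look like on a loaded Tokenizer. The max length and padding token below are arbitrary illustration values, and the keyword names should be checked against your installed version:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# Truncate every sequence to at most 512 tokens and pad each batch to a common length
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_token="<pad>", pad_id=0)

encodings = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "A much shorter sentence",
])
for encoding in encodings:
    print(encoding.ids)
    print(encoding.attention_mask)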


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokenizers-0.8.0.dev2.tar.gz (89.6 kB view details)

Uploaded Source

Built Distributions

tokenizers-0.8.0.dev2-cp38-cp38-win_amd64.whl (1.8 MB view details)

Uploaded CPython 3.8 Windows x86-64

tokenizers-0.8.0.dev2-cp38-cp38-win32.whl (1.6 MB view details)

Uploaded CPython 3.8 Windows x86

tokenizers-0.8.0.dev2-cp38-cp38-manylinux1_x86_64.whl (12.0 MB view details)

Uploaded CPython 3.8

tokenizers-0.8.0.dev2-cp38-cp38-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.8 macOS 10.10+ x86-64

tokenizers-0.8.0.dev2-cp37-cp37m-win_amd64.whl (1.8 MB view details)

Uploaded CPython 3.7m Windows x86-64

tokenizers-0.8.0.dev2-cp37-cp37m-win32.whl (1.6 MB view details)

Uploaded CPython 3.7m Windows x86

tokenizers-0.8.0.dev2-cp37-cp37m-manylinux1_x86_64.whl (9.0 MB view details)

Uploaded CPython 3.7m

tokenizers-0.8.0.dev2-cp37-cp37m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.7m macOS 10.10+ x86-64

tokenizers-0.8.0.dev2-cp36-cp36m-win_amd64.whl (1.8 MB view details)

Uploaded CPython 3.6m Windows x86-64

tokenizers-0.8.0.dev2-cp36-cp36m-win32.whl (1.6 MB view details)

Uploaded CPython 3.6m Windows x86

tokenizers-0.8.0.dev2-cp36-cp36m-manylinux1_x86_64.whl (6.0 MB view details)

Uploaded CPython 3.6m

tokenizers-0.8.0.dev2-cp36-cp36m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.6m macOS 10.10+ x86-64

tokenizers-0.8.0.dev2-cp35-cp35m-win_amd64.whl (1.8 MB view details)

Uploaded CPython 3.5m Windows x86-64

tokenizers-0.8.0.dev2-cp35-cp35m-win32.whl (1.6 MB view details)

Uploaded CPython 3.5m Windows x86

tokenizers-0.8.0.dev2-cp35-cp35m-manylinux1_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.5m

tokenizers-0.8.0.dev2-cp35-cp35m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.5m macOS 10.10+ x86-64

File details

Details for the file tokenizers-0.8.0.dev2.tar.gz.

File metadata

  • Download URL: tokenizers-0.8.0.dev2.tar.gz
  • Upload date:
  • Size: 89.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2.tar.gz
Algorithm Hash digest
SHA256 fab2afcc32e60a1b0e91db8f718c8a1b937f12095f35c772be530f8106b2b131
MD5 692d948f5abf81a0c94391fa5a5041e6
BLAKE2b-256 d2b535e4253780bccb1340cd7e7cb9ceb53c9e3f03c2dc90a63c097571f543f0

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 9ea58ef7ca2f1f69dd984775e91ef88d7eec5ee167e159e93abff8d89b185ed8
MD5 ccc83e54690cdacdc03751e073cafc4e
BLAKE2b-256 893dc06bd9792d915a3b092c7c13e9612b08b85bab060ba0e87dd043b1d1da66

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp38-cp38-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp38-cp38-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 18fdd05c1f0bd6efe0c791d35ee6baaad6c270f45f28c78b816f934579045e2f
MD5 a4ec49efc1b2b44038871c9ed1fcf4dd
BLAKE2b-256 937b091353f5f532863f399811ddf144e4c08e088997e23c4675918d6e8cdf13

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 12.0 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 c963a56218768d80771c12d0ad9358333a35e5ec19540828f1558f99c8eb5194
MD5 510bd6313778e2990914bbf90f585818
BLAKE2b-256 27870f1e1a49914e13ce83aed3169687f18df1d3ce98dd57da39b0c0328af6f2

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp38-cp38-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp38-cp38-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.8, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp38-cp38-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 ea0c38b98aeace3f1e304df58e19181bef7d1dd8fdd7ed3e05b9b524a23a901b
MD5 355680c037b64534ffeb1be693822e38
BLAKE2b-256 daee5ee716371c8223d3ea245eb021bf61306af41b903d86b4ed1a6832d26689

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 6799cc54f4d039909ad4663f9dc6fe3cc75dce245ebf7eeb53b72de3b42e0b2a
MD5 a0710076329ddd34165a5ee2db2a8103
BLAKE2b-256 ffb912842c66cde4aaeeae86ab51d0bfb5da0a878930a537bc2f44aeabac4b19

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 c5b0e59ebe3afd52ad213ee2c0d6c1dca411190cc0d7416d7d71ae0949d60b56
MD5 5152fad8211f922f326efc849f43ce65
BLAKE2b-256 b230d727a7b1f2c16fc7c1d82b9cc1682909633ed7d612c50dfc43c63a0b1671

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 9.0 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 73d511bb11b4cfa9e74767fcd2957f496ebb15610199ea7cf1c56602cd4c7758
MD5 316765aa8e6cc3f40a5c8222b7ade0ca
BLAKE2b-256 36fe2002ab485a2f55ffc500792480280d69879862c1adc71ceadb684cb70f7b

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp37-cp37m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp37-cp37m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.7m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp37-cp37m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 58a8be8f3746d149cab22a98f3dffcd1b78b8bb3a38d38a0e081e3f2f51d5c23
MD5 c666d7b2f9c901525d419c0183490813
BLAKE2b-256 7d3aa378dc37374fc9bd20920b682f3c5d54f75076c87c62e6f44e8fae4b7dcc

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 33497a6dea44a92b66bad358a30c2fca644d6c08dcaf90d4bf00b02e642eb1e8
MD5 ac2cacca48fb045369626468812254e6
BLAKE2b-256 7ae25c9de0048dd9eed60134795198bd56e9aa97cb91fced96084b257ae2b5b4

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 4f8b16d1440a07e1e95a76c1c96edd872f1d1dee1c1b91ab0573efd9533ad868
MD5 dd2a0d46a50afac4f4dd01c278bb8d5c
BLAKE2b-256 4529facdc0ab38472badd007b388dbf5f797746135e7f9f65850b174993d17e9

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 6.0 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 e84efa5c79bc6ebd0b27befe59392ae94c38d97cff3a9261bc207e092d13b1d0
MD5 c4367b4790c0e7636338c8d427f832fe
BLAKE2b-256 c849a60274ffc52ec341f908033db3c15e9750ca9e38df0bf8b9bc71d6c8f0d8

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp36-cp36m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp36-cp36m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.6m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp36-cp36m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 bcdb493100c1b590dbadeb3d4e053b0e23cb60dd5d49ec1ab320ceba1d54589c
MD5 1e5fdf2fe95cb7253fdc88b9e16e76f6
BLAKE2b-256 e1798f604bd2ee67a5a6e0b84d1dc89f519fed91a443a71d796a28105a756bd2

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 b85beebcbdd9c62ed8d60a3178871d113c8ad37454f2ea2d9638612a8c97cfa8
MD5 8caf1a20ad70d13c899210bcaaca7a3c
BLAKE2b-256 aadeaeeb943735153faa29e3ddd0a2460afab8450070b09708ed3b1e54596bbe

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp35-cp35m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp35-cp35m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.5m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp35-cp35m-win32.whl
Algorithm Hash digest
SHA256 effa6756e19208d55eb0732c0777f8bffaf5897cb59202494c62741f728a3713
MD5 fc844d90f447cb957cc2f9897f67bc91
BLAKE2b-256 f137a5475615c6206f2217afd7d9983e48accda53cd90df4f409af32c5771f57

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 fc52a43acc20a66912d22fb49085cc4142c1be9de4b547ba64402829d1cdcc80
MD5 272d404c7e33273d465c1f5a87f96d93
BLAKE2b-256 afdfa2cc358beb24c4f433e35a5f5496bef3c6a48cf25f2e107ee7ea835c52db

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0.dev2-cp35-cp35m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0.dev2-cp35-cp35m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.5m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.dev2-cp35-cp35m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 ce707a6f09bf1d12ea23480a30718f574e09c442c254cd681b0f9201a139fd4f
MD5 ed41141f5de2d97c58322886d5fe59d2
BLAKE2b-256 002f6482a53525285665a075eeed14af82d630f5c5f3b2c8a61dc98afd0ab520

See more details on using hashes here.
