
Fast and Customizable Tokenizers



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
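
To make the alignment-tracking feature above concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name `tokenize_with_offsets`, the regular expression, and the lowercasing step are all assumptions chosen for illustration. In the library itself, the comparable information is exposed on an encoding as `encoded.offsets`.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer (hypothetical helper, not part of `tokenizers`).

    Splits the text into words and punctuation marks, lowercases each
    piece as a stand-in for normalization, and records where each piece
    came from in the original string.
    """
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group().lower())  # normalized token
        offsets.append(match.span())          # (start, end) in the original text
    return tokens, offsets


sentence = "I can feel the MAGIC, can you?"
tokens, offsets = tokenize_with_offsets(sentence)

# Map every normalized token back to the exact original substring.
originals = [sentence[start:end] for start, end in offsets]
for token, original in zip(tokens, originals):
    print(f"{token!r} <- {original!r}")
```

Each offset pair indexes into the original string, so the original casing stays recoverable even though the tokens themselves were lowercased.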

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
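
For readers new to BPE, the training idea behind the BPE-based tokenizers listed above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the library's Rust implementation: the function name `learn_bpe_merges`, the word-list input, and the tie-breaking rule are all hypothetical, and details such as end-of-word suffixes, byte-level alphabets, and minimum frequencies are left out.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice here).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

The learned merge list plays the same role as the merges.txt file mentioned earlier: applying the merges in order turns a sequence of characters into progressively larger subword units.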

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")


