
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can find it in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch just after this list).
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
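
As a quick illustration of alignment tracking, here is a minimal sketch. It assumes the returned Encoding exposes an offsets attribute of (start, end) character spans, and uses hypothetical vocabulary paths:

from tokenizers import BPETokenizer

# Hypothetical paths: point these at your own vocabulary files.
tokenizer = BPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each offset is the (start, end) character span a token was produced from,
# so we can always recover the original slice of the sentence.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])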

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
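
You can then check that the toolchain is available on your PATH:

rustc --version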

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (or use an existing one)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
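
As a quick sanity check that the build succeeded, try importing the package in the same virtual env:

python -c "import tokenizers"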

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
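
A saved tokenizer can later be reloaded from the files it produced. The sketch below assumes save() wrote my-bpe-vocab.json and my-bpe-merges.txt into the target directory; adjust the names to whatever was actually written:

from tokenizers import BPETokenizer

# Assumed output file names from the save() call above.
tokenizer = BPETokenizer(
    "./path/to/directory/my-bpe-vocab.json",
    "./path/to/directory/my-bpe-merges.txt",
)
encoded = tokenizer.encode("I can feel the magic, can you?")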

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte-level version of BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!
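
For example, a minimal sketch with the WordPiece variant, assuming it is constructed from a BERT-style vocab.txt file (the path here is hypothetical):

from tokenizers import BertWordPieceTokenizer

# Hypothetical path to a BERT vocabulary file.
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)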

Build your own

You can also easily build your own tokenizers by putting together all the different parts you need:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
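
encode_batch returns one Encoding per input sentence, so the batch can be inspected the same way as a single result:

# Each element is an Encoding, just like the one `encode` returns.
for enc in encoded:
    print(enc.ids)
    print(enc.tokens)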

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
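
And, as a hedged sketch assuming the Tokenizer exposes a decode() method over a list of ids, the result can be mapped back to text using the ByteLevel decoder configured above:

# Decode back to a string using the decoder set earlier.
print(tokenizer.decode(encoded.ids))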


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.1.0.tar.gz (58.3 kB, Source)

Built Distributions

  • tokenizers-0.1.0-cp38-cp38-win_amd64.whl (1.0 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.1.0-cp38-cp38-manylinux1_x86_64.whl (7.3 MB, CPython 3.8)
  • tokenizers-0.1.0-cp38-cp38-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.13+ x86-64)
  • tokenizers-0.1.0-cp37-cp37m-win_amd64.whl (1.0 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.1.0-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB, CPython 3.7m)
  • tokenizers-0.1.0-cp37-cp37m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.13+ x86-64)
  • tokenizers-0.1.0-cp36-cp36m-win_amd64.whl (1.0 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.1.0-cp36-cp36m-manylinux1_x86_64.whl (3.6 MB, CPython 3.6m)
  • tokenizers-0.1.0-cp36-cp36m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.13+ x86-64)
  • tokenizers-0.1.0-cp35-cp35m-win_amd64.whl (1.0 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.1.0-cp35-cp35m-manylinux1_x86_64.whl (1.8 MB, CPython 3.5m)
  • tokenizers-0.1.0-cp35-cp35m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.13+ x86-64)

File details

All files in this release were uploaded via twine/3.1.1 (pkginfo/1.5.0.1, requests/2.22.0, setuptools/41.2.0, requests-toolbelt/0.9.1, tqdm/4.41.1, CPython/3.8.0); none were uploaded using Trusted Publishing. Per-file hashes follow.

File hashes

tokenizers-0.1.0.tar.gz
  • SHA256: 06bfb361303be52f17806921dce4c3afc9a88ab16f59cf6338a9318c31fe0b9c
  • MD5: e8b1691621da17b5964c573673cb00ba
  • BLAKE2b-256: 2e0a67ddf86dccb82fd7e7a770be64aace897fa803eeba1a6e7cc49483f0c32c

tokenizers-0.1.0-cp38-cp38-win_amd64.whl
  • SHA256: 2682ef07db963ceeea50a4d47d19b340ddf7c438a389938efe3d9e50ad853fd8
  • MD5: a006597e7da433db910c9440533b81c8
  • BLAKE2b-256: 22e91112dd0896634e8f79083ddf13702b81c3d7bb87c630399c51e34c7e7d23

tokenizers-0.1.0-cp38-cp38-manylinux1_x86_64.whl
  • SHA256: 591959fa49e0dfc981ceab21d088cb08d8c4ca850f254d9492ac59cab3cbc65e
  • MD5: 3e9cbcec0922a83e72c0cc9e35b53394
  • BLAKE2b-256: 9655a1b88016ec53eab41e12082b4f2f1f2c640a9b7082e14310bdc73a5676a4

tokenizers-0.1.0-cp38-cp38-macosx_10_13_x86_64.whl
  • SHA256: 52847810291b75b00d985f2ab9fa19e7800c2955a3cdfd2192395a8772e6a148
  • MD5: b4e5765fe9fc8011067476e43fc68fc4
  • BLAKE2b-256: 4ce7afd55ff9523a57616513bcd6fa2c93c8bb513bed2e827a57f0b90614c698

tokenizers-0.1.0-cp37-cp37m-win_amd64.whl
  • SHA256: 0a7285a82428390df6d147d6112caa44d212b91a8fc1b1ee57573f256a5a5276
  • MD5: 37b6576933c4685db6e073cfc4ebfdc6
  • BLAKE2b-256: ff18ad239570852265751f5232e24ce5f807198f5c71d39f3bc35b1d72c426c9

tokenizers-0.1.0-cp37-cp37m-manylinux1_x86_64.whl
  • SHA256: b2cfe8732fc8b2e387bf6f2fda1e12427e9c144ef88008c15ad679dae0c6a642
  • MD5: 2ee358b1381ddc5589707db8109cd0b1
  • BLAKE2b-256: f5b57cbff6028c0858214c5cd285088e5bbc920d47adcb48cd65e0e22ce95b2f

tokenizers-0.1.0-cp37-cp37m-macosx_10_13_x86_64.whl
  • SHA256: 933e722a7607019f1d62731b466cbbfd14b3227a0d111cbfd110baac1ba694ae
  • MD5: 607905d28fe70463d0af63eee9cc7a9b
  • BLAKE2b-256: 03a43aea304eb322fb9e142b7471fa4928a3ef17b1e1b0ebb21c668225c9f708

tokenizers-0.1.0-cp36-cp36m-win_amd64.whl
  • SHA256: 81fd6dd570985f7fc8dfeac40e3b81dea325f10142dc13df21099e52ca1bd934
  • MD5: aa22f44b644ddcd4ebf867e40d77d965
  • BLAKE2b-256: d91f9c1ebf423f50dfd70a57fd9b31d168d241788da27d105579459ca4f84a39

tokenizers-0.1.0-cp36-cp36m-manylinux1_x86_64.whl
  • SHA256: 822b5d78746aa238e264c632b36f650e8566eb92aad003e29a17109f3f1a1c80
  • MD5: 9427fc0904a67927b9533bab3443d6ce
  • BLAKE2b-256: 563f448125cca8f32a8d57efc00fb1e26326b1a22a442be2665d80acdd09978c

tokenizers-0.1.0-cp36-cp36m-macosx_10_13_x86_64.whl
  • SHA256: 034afbe99028f7a2dc31ac43ffe88d668b179eaf8c28c13298f8ac14a5a97111
  • MD5: d0c54695abf072ff706fd5b64180bb85
  • BLAKE2b-256: 63e76f4e9eb9bdfcf509d42624f822a2dfea40b9b3d4905ecb79415e607b849c

tokenizers-0.1.0-cp35-cp35m-win_amd64.whl
  • SHA256: 57de491ae26e6228a07049c827744e586c779a9974eba8473e14b0603658e898
  • MD5: 0c30a20481f77c402643f99db09461f1
  • BLAKE2b-256: 41165cb84d031760105491811202aa0e18cfc195317375046a4d730513561c1d

tokenizers-0.1.0-cp35-cp35m-manylinux1_x86_64.whl
  • SHA256: d5af1978103fe71f0ecb1dd0498dfb55abee39588d1f204abc8f3a761b8921c8
  • MD5: ad2943746086475d2486ac9fcfc6a799
  • BLAKE2b-256: fa71207684d2468cff1627168fe6f02e5b0013ee2bdab71b654e927b6c62f1f6

tokenizers-0.1.0-cp35-cp35m-macosx_10_13_x86_64.whl
  • SHA256: b464d12620606dce07a4a4a46728c5993b4159a555fe2d9265ce5f2913efd5ab
  • MD5: d01160d114b2b4a8b10706863f09e179
  • BLAKE2b-256: a6bc29a9167554cd85f8be0344d18c488da460aec46113320166ddd65f427b0e
