
Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch right after this list).
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
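
The alignment tracking and pre-processing mentioned above are exposed on the Encoding object returned by encode. Here is a minimal sketch, assuming placeholder vocab.json and merges.txt files; enable_truncation is the high-level helper used here to illustrate truncation:

from tokenizers import CharBPETokenizer

# Placeholder paths: point these at your own vocab.json / merges.txt
tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

# Optional pre-processing: truncate long inputs to a fixed length
tokenizer.enable_truncation(max_length=512)

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Thanks to alignment tracking, each token keeps the (start, end) character
# offsets of the slice of the original sentence it was produced from
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])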

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
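
Once the build completes, a quick sanity check (just an illustrative snippet, not part of the official instructions) is to import the freshly installed package and print its version:

import tokenizers
print(tokenizers.__version__)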

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them using the corresponding vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
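
For instance, here is a sketch of loading the BERT WordPiece tokenizer from an existing vocab.txt file. The path is a placeholder, and the lowercase keyword argument is shown as an assumption about the constructor options:

from tokenizers import BertWordPieceTokenizer

# Placeholder path: a WordPiece vocabulary file, e.g. from a pre-trained BERT
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt", lowercase=True)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)  # the tokens, wrapped with BERT's special tokens
print(encoded.ids)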

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
