Fast and Customizable Tokenizers


Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
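
To make the alignment-tracking idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function names, the lowercase-and-strip-accents normalizer, and the whitespace splitting are all assumptions chosen for illustration. It only shows how a token produced from normalized text can still point back to a span of the original text.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, keeping for every normalized character
    the [start, end) span of the original text it came from."""
    chars = []  # normalized characters
    spans = []  # spans[k] is the original [start, end) span of chars[k]
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                # The accent is dropped: fold its position into the
                # previous kept character so no original text is lost.
                if spans:
                    spans[-1][1] = i + 1
                continue
            chars.append(piece)
            spans.append([i, i + 1])
    return "".join(chars), spans


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and report, for each token,
    its offsets in the ORIGINAL text."""
    normalized, spans = normalize_with_alignment(text)
    tokens = []
    start = None
    # A trailing space acts as a sentinel that closes the last token.
    for pos, ch in enumerate(normalized + " "):
        if ch.isspace():
            if start is not None:
                offsets = (spans[start][0], spans[pos - 1][1])
                tokens.append((normalized[start:pos], offsets))
                start = None
        elif start is None:
            start = pos
    return tokens


# "e" followed by a combining accent, then a precomposed "È".
text = "Cafe\u0301 CR\u00c8ME"
for token, (begin, end) in tokenize_with_offsets(text):
    print(token, (begin, end), text[begin:end])
```

In this sketch the token "cafe" maps to the original span (0, 5) and "creme" to (6, 11), even though normalization changed both the case and the length of the text.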

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
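
For readers new to BPE, the following toy sketch shows the core training loop shared by the BPE variants above: repeatedly count adjacent symbol pairs and merge the most frequent one. It is a simplified illustration in pure Python, not the library's Rust implementation. The function name, the word-count input format, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Every word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair. Ties are broken by comparing the
        # pairs themselves, only to keep this example deterministic.
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
print(vocab)
```

With these made-up counts, the three learned merges are ('s', 't'), ('e', 'st') and ('o', 'w'). A real trainer also handles options such as a minimum frequency and a target vocabulary size, which this sketch leaves out.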

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
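
If you wonder what "byte-level" means for the tokenizer built above, the sketch below may help. It reproduces the GPT-2 style trick of mapping each of the 256 possible byte values to a visible character, so that any UTF-8 text can be expressed with a fixed base alphabet. This is pure Python written for illustration: the function names are hypothetical, and it is an assumption that the library's ByteLevel components use an equivalent table.

```python
def byte_to_unicode_table():
    """Map every byte value (0-255) to a distinct visible character."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, ...) are shifted to
    # code points starting at 256 so they become visible too.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


TABLE = byte_to_unicode_table()
REVERSE = {ch: b for b, ch in TABLE.items()}


def to_byte_level(text, add_prefix_space=True):
    """Encode text as UTF-8 and render each byte as its visible character."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(TABLE[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Invert to_byte_level (the optional prefix space is kept)."""
    return bytes(REVERSE[ch] for ch in symbols).decode("utf-8")


symbols = to_byte_level("h\u00e9llo")
print(symbols)
print(repr(from_byte_level(symbols)))
```

In this mapping the space byte becomes "Ġ" (U+0120), and a character such as "é" is represented by the two symbols for its two UTF-8 bytes. Because every byte has a symbol, no input ever falls outside the base alphabet.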
