Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch after this list).
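
For instance, truncation, padding and offset tracking are all exposed directly on the tokenizer objects. Here is a minimal sketch (the vocabulary path is a placeholder, and the default padding settings simply pad each batch to its longest sequence):

from tokenizers import BertWordPieceTokenizer

# Load a pre-trained WordPiece vocabulary (placeholder path)
tokenizer = BertWordPieceTokenizer("./path/to/bert-base-uncased-vocab.txt")

# Truncate long inputs, and pad every batch to its longest sequence
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding()

batch = tokenizer.encode_batch(["Hello, y'all!", "How are you doing today?"])
for encoded in batch:
    print(encoded.tokens)   # includes the special tokens added for the model
    print(encoded.offsets)  # (start, end) character spans in the original sentence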

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using its vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
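
The resulting Encoding also lets you map the ids back to text, and you can encode several sentences at once. A short sketch continuing the example above:

# Decode the ids back into a string
print(tokenizer.decode(encoded.ids))

# Encode a whole batch of sentences in one call
batch = tokenizer.encode_batch(["I can feel the magic, can you?", "So can I!"])
print([e.tokens for e in batch])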

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
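
The training call also accepts the usual hyper-parameters. A hedged sketch (the keyword names below come from the implementation classes and may differ slightly between versions):

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()

# Train with an explicit vocabulary size, frequency cutoff and special tokens
tokenizer.train(
    ["./path/to/files/1.txt", "./path/to/files/2.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<unk>"],
)

print(tokenizer.get_vocab_size())
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")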

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
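
They all share the same interface, so switching between them is just a matter of changing the class. For example, a quick sketch with ByteLevelBPETokenizer (paths are placeholders):

from tokenizers import ByteLevelBPETokenizer

# Load from existing files, or call .train(...) exactly as shown above
tokenizer = ByteLevelBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)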

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
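
Since the saved file contains the whole pipeline, the byte-level decoder configured earlier is restored as well, so you can decode right away. A small sketch:

print(encoded.tokens)

# The ids decode back to the original text thanks to the byte-level decoder
print(tokenizer.decode(encoded.ids))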

Project details


Release history

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokenizers-0.8.0.tar.gz (97.0 kB view details)

Uploaded Source

Built Distributions

tokenizers-0.8.0-cp38-cp38-win_amd64.whl (1.9 MB view details)

Uploaded CPython 3.8 Windows x86-64

tokenizers-0.8.0-cp38-cp38-win32.whl (1.7 MB view details)

Uploaded CPython 3.8 Windows x86

tokenizers-0.8.0-cp38-cp38-manylinux1_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.8

tokenizers-0.8.0-cp38-cp38-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.8 macOS 10.10+ x86-64

tokenizers-0.8.0-cp37-cp37m-win_amd64.whl (1.9 MB view details)

Uploaded CPython 3.7m Windows x86-64

tokenizers-0.8.0-cp37-cp37m-win32.whl (1.7 MB view details)

Uploaded CPython 3.7m Windows x86

tokenizers-0.8.0-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.7m

tokenizers-0.8.0-cp37-cp37m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.7m macOS 10.10+ x86-64

tokenizers-0.8.0-cp36-cp36m-win_amd64.whl (1.9 MB view details)

Uploaded CPython 3.6m Windows x86-64

tokenizers-0.8.0-cp36-cp36m-win32.whl (1.7 MB view details)

Uploaded CPython 3.6m Windows x86

tokenizers-0.8.0-cp36-cp36m-manylinux1_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.6m

tokenizers-0.8.0-cp36-cp36m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.6m macOS 10.10+ x86-64

tokenizers-0.8.0-cp35-cp35m-win_amd64.whl (1.9 MB view details)

Uploaded CPython 3.5m Windows x86-64

tokenizers-0.8.0-cp35-cp35m-win32.whl (1.7 MB view details)

Uploaded CPython 3.5m Windows x86

tokenizers-0.8.0-cp35-cp35m-manylinux1_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.5m

tokenizers-0.8.0-cp35-cp35m-macosx_10_10_x86_64.whl (2.1 MB view details)

Uploaded CPython 3.5m macOS 10.10+ x86-64

File details

Details for the file tokenizers-0.8.0.tar.gz.

File metadata

  • Download URL: tokenizers-0.8.0.tar.gz
  • Upload date:
  • Size: 97.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0.tar.gz
Algorithm Hash digest
SHA256 703101ffc1cce87e39a8fa9754126a5c29590b03817a73727e3268474dc716e6
MD5 92ac54fb5af3d015c9f29d41d0a09380
BLAKE2b-256 cbf41b23a4408ef7ec8ff39697420aa611566ab0d7b69b67a1c2b1a10118ea35

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 99c484aa34a065fce0b01c55b28cfddca20f7a954873be1c01f19adccd418588
MD5 7d0cd05a07f925114465ac26d1ebfff8
BLAKE2b-256 e0e4c896e7a531220d1b3f31dfb094f400df4bad048e620ea8bec67e835e6471

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp38-cp38-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp38-cp38-win32.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 aa1f341f9121133a59938b92a1c7487127e5fc4310c377839c4628f5f09fb301
MD5 22de2cf8284d21f20beb3e7f8b107097
BLAKE2b-256 6f547b803f18c9fe69886c8748c8e5c80f0acecdf1aa20db557cc8c5ca323da8

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 037a3b119447cefb67da0aa31dbaee0312fc8014e94a4d1468d3b601448c02f8
MD5 d6eb7f71882062bdb4082b524280ecb6
BLAKE2b-256 3b0f0045413f564383a13df3f745e7c9d7a277c8a5cd5dbe3a103e6a6db0ecb6

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp38-cp38-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp38-cp38-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.8, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp38-cp38-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 a083c67c6f3188d6aa180d056a7936dd1cb35666d785f28abffe1afd8246d324
MD5 50d73eaff216e8d9d0deb783cc0b13b6
BLAKE2b-256 69f141ce7578afca7eb771c62872755e5e51a7d58c91d3ff6abe8a191f5fc44c

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 687ad2dcf9c2aec22580df0360ebefddc6fc984e2b03df1d886e1e5f1faa31b0
MD5 cab720f07556d769462e6bd0cae0226e
BLAKE2b-256 889a140ee62a19b92992264a27608601e4d73e84cd4b16f4134b741680e4e289

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 cece3b6c02a48cc4943c3de5b9e9f5b3c837928a5e13a9480d44067bcc8b914a
MD5 93040444eba192a17b5bfcdee311269b
BLAKE2b-256 73cb05abe6ef3665623e20cd8dc33f73556ea38870abc2f746a2a56d9611bc17

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 33cb883f2c08a35e38eda1adcc1d6fa35474cc60bf5330832c909e5dab506ef5
MD5 4d42403e97d0ef9f9bb6da100e9242cb
BLAKE2b-256 00936dd269303b5060f2afc6238e14efe378523b0ead75db7455a313d7e38e49

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp37-cp37m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp37-cp37m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.7m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp37-cp37m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 1883a8411771aa6ef3bccc2b8426ca0677b542f90ad69c09a2877834d5f67dfb
MD5 62a99ec38feab618660e08c015b91294
BLAKE2b-256 5beaeaf99dd7fcab5bdc4c595194e15ffa6c015f40b8ae7e28fbbf2708dd251d

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 d2fbc8e5fcc23fe65eec894dbf44e35677761f4fd37a3b9d681944b9907b2d1c
MD5 8b40c5569fc45cc31a295748c51b7730
BLAKE2b-256 1994925defb07bf7471f969ec1a9b6d52e5ce93b4fae7b1cbe2af411c721b811

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 bd9b2e123435e3f343578b75a4618b2e4ad9282993e6b302092d5c824ddb1602
MD5 893f9d355a28227f6cca9ba3e88e8f04
BLAKE2b-256 85b56e0369897adfbde0bb9c0661a544f12f72ae7b3f0c3578ca1f6df1de5d90

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4edc9f76ef2d419a5989bf0b524004e1fb8b4386c72c58249b2197479e83136e
MD5 917fc82e24946f323b36c29b4b9212e4
BLAKE2b-256 6b151c026f3aeafd26db30cb633d9915aae666a415179afa5943263e5dbd55a6

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp36-cp36m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp36-cp36m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.6m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp36-cp36m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 741fbf31d5fec7ff560c69ff20ecddbef779a890d1fac08e4bb0e0a71a30563d
MD5 6bad5756001e33e279be7483b5151a25
BLAKE2b-256 f197bcb9e63893aac19ed8be994af6a6e53fcd3709191d8bbbcbcc7e93679a5b

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 e962a98287c09fba1900a7661daa0819ce03025fce4aa2644a7dc633ebf34a5b
MD5 d0b3e35da93d6256b29c18291aa64fe9
BLAKE2b-256 926f891bad90a515a353e7e6e097d5ff5876001ea53e13dcd44063d909291499

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp35-cp35m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp35-cp35m-win32.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp35-cp35m-win32.whl
Algorithm Hash digest
SHA256 1e3d96d33e1a8d0d6520744af4d20a8c0d99fbd02994ef8d973b60346a270509
MD5 62950d0057b37644e9748eefd45e15fb
BLAKE2b-256 ef10a80b0a613f26a548465addaf1d2a55811b500b2f5d466e4a60f00f9c14be

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 5877c7b8f9802454cc8226e906214c2da9dbc3ba7a170fd968c6633d191fa625
MD5 082d4544d86d248c0c7545f715b349c7
BLAKE2b-256 23b57c100108a0e8cf108edd6d866ead74fc2652ab885262e8d41ce485ad1f04

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0-cp35-cp35m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0-cp35-cp35m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.5m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0-cp35-cp35m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 1accbd656487656b17bb1fa2fbd01a70fee9d1ac36c34424cc6a014c7e3d6d08
MD5 2787e8f0909336ef7867645431f7cf8a
BLAKE2b-256 10890a092963047cfe0a4f449537b6c585a89a6e8ae5b44fa227139f59dcf0e1

See more details on using hashes here.
