
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository at https://github.com/huggingface/tokenizers.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize, using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the offsets example below).
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.

Installation

With pip:

pip install tokenizers
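
To quickly check that the installation worked, you can import the package and print its version (it should match the release you installed, 0.10.3 at the time of writing):

import tokenizers

print(tokenizers.__version__)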

From sources:

To use this method, you need to have Rust installed:

# Install Rust with rustup:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by running the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can also use an existing one)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these from its vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
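
The returned Encoding also carries the alignment tracking mentioned in the feature list above: each token comes with its character offsets into the original input. A minimal sketch (the vocab and merges paths are placeholders, as above):

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# encoded.offsets holds a (start, end) character span for every token,
# so the slice of the original sentence behind any token can be recovered.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])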

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte-level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!
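
For instance, here is a minimal sketch of loading and using the BERT WordPiece tokenizer; the vocab.txt path is a placeholder for an existing BERT vocabulary file:

from tokenizers import BertWordPieceTokenizer

# "./path/to/vocab.txt" is a placeholder: point it at a real BERT vocabulary file.
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
# The tokens include the [CLS] and [SEP] markers added by the BERT post-processor.
print(encoded.tokens)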

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train([
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
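
The pre-processing mentioned in the feature list (truncation, padding, special tokens) is handled by the Tokenizer itself. A minimal sketch, reusing the file saved above; the "<pad>" token and its id 0 are assumptions, so substitute whatever padding token exists in your own vocabulary:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# Truncate every sequence to at most 12 tokens and pad shorter ones.
# pad_id=0 and "<pad>" are assumptions: use the padding token of your own vocabulary.
tokenizer.enable_truncation(max_length=12)
tokenizer.enable_padding(pad_id=0, pad_token="<pad>")

batch = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "A much longer sentence that will be truncated down to the maximum length.",
])
for encoding in batch:
    print(encoding.ids)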


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.10.3.tar.gz (212.7 kB, Source)

Built Distributions

  • tokenizers-0.10.3-cp39-cp39-win_amd64.whl (2.0 MB, CPython 3.9, Windows x86-64)
  • tokenizers-0.10.3-cp39-cp39-win32.whl (1.8 MB, CPython 3.9, Windows x86)
  • tokenizers-0.10.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB, CPython 3.9, manylinux: glibc 2.5+ / 2.12+ x86-64)
  • tokenizers-0.10.3-cp39-cp39-macosx_10_11_x86_64.whl (2.2 MB, CPython 3.9, macOS 10.11+ x86-64)
  • tokenizers-0.10.3-cp38-cp38-win_amd64.whl (2.0 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.10.3-cp38-cp38-win32.whl (1.8 MB, CPython 3.8, Windows x86)
  • tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB, CPython 3.8, manylinux: glibc 2.5+ / 2.12+ x86-64)
  • tokenizers-0.10.3-cp38-cp38-macosx_10_11_x86_64.whl (2.2 MB, CPython 3.8, macOS 10.11+ x86-64)
  • tokenizers-0.10.3-cp37-cp37m-win_amd64.whl (2.0 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.10.3-cp37-cp37m-win32.whl (1.8 MB, CPython 3.7m, Windows x86)
  • tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB, CPython 3.7m, manylinux: glibc 2.5+ / 2.12+ x86-64)
  • tokenizers-0.10.3-cp37-cp37m-macosx_10_11_x86_64.whl (2.2 MB, CPython 3.7m, macOS 10.11+ x86-64)
  • tokenizers-0.10.3-cp36-cp36m-win_amd64.whl (2.0 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.10.3-cp36-cp36m-win32.whl (1.8 MB, CPython 3.6m, Windows x86)
  • tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB, CPython 3.6m, manylinux: glibc 2.5+ / 2.12+ x86-64)
  • tokenizers-0.10.3-cp36-cp36m-macosx_10_11_x86_64.whl (2.2 MB, CPython 3.6m, macOS 10.11+ x86-64)

File details

Details for the file tokenizers-0.10.3.tar.gz.

File metadata

  • Download URL: tokenizers-0.10.3.tar.gz
  • Upload date:
  • Size: 212.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3.tar.gz
Algorithm Hash digest
SHA256 1a5d3b596c6d3a237e1ad7f46c472d467b0246be7fd1a364f12576eb8db8f7e6
MD5 944d79415988f5609fbac26000294c6b
BLAKE2b-256 482bb99184cacb1a743edc18290e9127d1b0e658c0c46f2ab5290b27fe865ff4

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 2.0 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 e9d147e545cdfeca560646c7a703bf287afe45645da426506ccd5eb78aab5ef5
MD5 60b1cd3c7a13260229bec7620d3e54f0
BLAKE2b-256 64c5ae6008631f67085c7189d1407abea468c80000657778af4d4039de0d893b

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp39-cp39-win32.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp39-cp39-win32.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 ad700fd9da518884fd58bf89f0b6dfeecef9b4e2d2db8765ef259f66d6c14980
MD5 c60d11c9f1eeb508c3691210a11187b4
BLAKE2b-256 6f5c931177b1e715c4bf4cd632d94aa391315ac7d51b905e6eccfcc3830e2954

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.10.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 18c495e700f4588b9a00e58b4c41dc459c36daaa7c39a27faf880eb8f5533ce1
MD5 893d91ed410f14a4a397e2bbb172395c
BLAKE2b-256 a84fca8bc50358c3aaf50f298860a5ce1822e0c0ff97543e32265d1353760555

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp39-cp39-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp39-cp39-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: CPython 3.9, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp39-cp39-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 1d8867db210d75d97312360ae23b92aeb6a6b5bc65e15c1cd9d204b3fa3fc262
MD5 c12b74a6de1e4cba231ac82530585961
BLAKE2b-256 64f6f71ddb3124e2912127f26ef49898158184306357f9273f87db07a88fa8e8

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 2.0 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 91a8c045980594c7c437a52c3da5276eb3c530a662b4ef628ff32d81fb22b543
MD5 2f1fc14807415b94d1f469a13c5c572e
BLAKE2b-256 39a05fd360d623f1907d4ccd40312d3de2139143e99c5ce18b25629e72dcb4ec

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp38-cp38-win32.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp38-cp38-win32.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 a7ce051aafc53c564c9edbc09df300c2bd4f6ce87460fc22a276fed405d1892a
MD5 e6346129e9d4c494fd2a8763c0df65b4
BLAKE2b-256 8dd78549600b32ac1e88b9e48233c801d165691eac5d42443b94ea0105018ac6

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 ae7e40d9c8a77c5a4109731ac3e21633b0c609c56a8b58be6b863da61fa54636
MD5 75694b6caa36c0630804e9b7a8fc6f25
BLAKE2b-256 e4bd10c052faa46f4effb18651b66f01010872f8eddb5f4034d72c08818129bd

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp38-cp38-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp38-cp38-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: CPython 3.8, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp38-cp38-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 a7ce0c2f27f7c92aa3f895231de90319acdf960ce2e42ba591edc651fda7d3c9
MD5 3bc49ca65f8d10f376ef6d44db80a875
BLAKE2b-256 165f2cf6a503af5dacb74f4a6b700500e003729ef75c7236a7952ad4207ee5c1

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 2.0 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 7b11b373705d082d43657c08883b79b5330f1952f0668d17488b6b889c4d7feb
MD5 a0c0048b1986444a0736b83c0955c968
BLAKE2b-256 9858b092e16beb8cc360025f8cd26e2f4deb1492e43a22de0cb499793d71ea30

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 edd8cb85c16b4b65e87ea5ef9d400be9fdd53c4152adbaca8817e16dd3aa480b
MD5 fa5ed779759a147a4b534da71560da67
BLAKE2b-256 04be01fa4f7146d5ccb5c76eb5723d3aa7176cea3730444e825ecac7e033ef17

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 4cc194104c8e427ffc4f54c7866488b42f2b1f6351a6cad0d045ca5ab8108e42
MD5 f4688023a2b0ef661f4ad4ccf9793f65
BLAKE2b-256 d4e2df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp37-cp37m-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp37-cp37m-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: CPython 3.7m, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp37-cp37m-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 2f8c5fefef0d0a03be613547e613fbda06b9e6ee0891236649524964c3e54d80
MD5 836dafd6185a1991088f855a8da99dd3
BLAKE2b-256 6bf5e77aa6e8a95ce260d83f99fc88f57844e24d52dc35e7db287debe08b4149

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 2.0 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 2a9ee3ee574d4aa740e099b0ad6ef8e63f52f48cde359bb31801146a5aa614dc
MD5 e9929d0a6490a27d6b784c1ef1f5c838
BLAKE2b-256 8832ace6b9ebea234ca645812b84118b0756bbc5a7b4efd4a476f7e974a7d914

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 6b84673997990b3c260ae2f7c57fdf1f835e316820eff14aca46dc68be3c0c74
MD5 607859f396c7d8ff58159fab17f40b2b
BLAKE2b-256 59632c194dfbb29d0aba2b82a26c127fe10a61a113f0eaa39da02e053beac9b1

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 c26dbc3b2a3d71d3d40c50975ec62145932f05aea73f03ea35c48ebd3a717611
MD5 8e009d63f219bef4fe05441de007d6ed
BLAKE2b-256 bf203605db440db4f96d5ffd66b231a043ae451ec7e5e4d1a2fb6f20608006c4

See more details on using hashes here.

File details

Details for the file tokenizers-0.10.3-cp36-cp36m-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.10.3-cp36-cp36m-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: CPython 3.6m, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for tokenizers-0.10.3-cp36-cp36m-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 4ab688daf4692a6c31dfe42f1f3a4a8c22050705eb69d58d3efde9d55f434586
MD5 7c0b2068acf82b39e494ec28ed5af767
BLAKE2b-256 b35c365657a61efab00c60c2f28a217ebce844900b1eb71800467d14bf68ff3b

See more details on using hashes here.
