
Fast and Customizable Tokenizers



Tokenizers

An implementation of today's most widely used tokenizers, with a focus on both performance and versatility.

These are the Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize, using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: it takes less than 20 seconds to tokenize a gigabyte of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed both for research and production.
  • Normalization comes with alignment tracking: it's always possible to recover the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
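To make the alignment idea concrete, here is a minimal pure-Python sketch (an illustration only, not the library's internals): each token carries (start, end) offsets into the original string, so even after normalization (here, lowercasing) the exact source span can be recovered. The library exposes comparable information on its encoding objects.

```python
# Toy sketch of alignment tracking: tokens keep (start, end) offsets into the
# ORIGINAL text, so normalization never loses the mapping back to the source.
def tokenize_with_offsets(text):
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i].lower(), (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:].lower(), (start, len(text))))
    return tokens

text = "Hello   World"
for tok, (s, e) in tokenize_with_offsets(text):
    # Each normalized token maps back to its original span:
    assert text[s:e].lower() == tok
```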

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install Rust with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by running the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use an existing one as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And training your own is just as simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece

All of these can be used and trained as explained above!

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
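Conceptually, what a BPE trainer does at each step is count adjacent symbol pairs across the corpus and merge the most frequent one, ignoring pairs rarer than min_frequency. A toy sketch of one such step (hypothetical word counts, not the library's actual algorithm):

```python
from collections import Counter

# Toy sketch of one BPE training step: count adjacent symbol pairs, weighted
# by word frequency, and return the best pair to merge (or None if no pair
# reaches min_frequency).
def best_pair(word_counts, min_frequency=2):
    counts = Counter()
    for word, freq in word_counts.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    frequent = {p: c for p, c in counts.items() if c >= min_frequency}
    return max(frequent, key=frequent.get) if frequent else None

# Hypothetical corpus statistics, symbols separated by spaces:
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
print(best_pair(words))   # ('w', 'e'), seen 8 times
```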
