
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
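
Once installed with either method, a quick import is an easy way to check that the bindings are available (this assumes the package exposes a `__version__` attribute):

# Sanity check: the bindings should now be importable
import tokenizers
print(tokenizers.__version__)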

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
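
Because normalization tracks alignments, every token can be mapped back to its span in the original input. A minimal sketch, assuming the returned Encoding exposes an `offsets` attribute:

# Character (start, end) offsets of each token in the original input
print(encoded.offsets)

# For instance, recover the original text behind the second token
start, end = encoded.offsets[1]
print("I can feel the magic, can you?"[start:end])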

And you can train yours just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
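
The save call above writes the vocabulary and merges files using the given name as a prefix. Assuming the default file names (my-bpe-vocab.json and my-bpe-merges.txt), you can reload it just like the pre-trained example:

# Reload the trained tokenizer (file names assume the "my-bpe" prefix used above)
tokenizer = CharBPETokenizer(
    "./path/to/directory/my-bpe-vocab.json",
    "./path/to/directory/my-bpe-merges.txt"
)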

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
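
For example, the byte-level variant follows the same interface as the character-level BPE shown above (a sketch; the default constructor arguments are assumed to behave like CharBPETokenizer's):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE from raw text files, then encode
tokenizer = ByteLevelBPETokenizer()
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)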

Build your own

You can also easily build your own tokenizer by putting together all the different parts you need:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
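
encode_batch returns one Encoding per input sentence, so each result can be inspected on its own, for example:

# Print the tokens of each sentence in the batch
for enc in encoded:
    print(enc.tokens)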

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
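
The same Tokenizer also handles truncation, padding, and special tokens (see the feature list above). A minimal sketch, assuming the enable_truncation and enable_padding helpers and a "<pad>" token with id 0:

# Truncate long inputs and pad shorter ones when encoding a batch
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_id=0, pad_token="<pad>")

encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print([e.tokens for e in encoded])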

Project details


Release history

This version

0.6.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.6.0.tar.gz (67.6 kB, Source)

Built Distributions

  • tokenizers-0.6.0-cp38-cp38-win_amd64.whl (1.0 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.6.0-cp38-cp38-win32.whl (930.9 kB, CPython 3.8, Windows x86)
  • tokenizers-0.6.0-cp38-cp38-manylinux1_x86_64.whl (7.6 MB, CPython 3.8)
  • tokenizers-0.6.0-cp38-cp38-macosx_10_10_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.10+ x86-64)
  • tokenizers-0.6.0-cp37-cp37m-win_amd64.whl (1.0 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.6.0-cp37-cp37m-win32.whl (929.2 kB, CPython 3.7m, Windows x86)
  • tokenizers-0.6.0-cp37-cp37m-manylinux1_x86_64.whl (5.7 MB, CPython 3.7m)
  • tokenizers-0.6.0-cp37-cp37m-macosx_10_10_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.10+ x86-64)
  • tokenizers-0.6.0-cp36-cp36m-win_amd64.whl (1.0 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.6.0-cp36-cp36m-win32.whl (929.6 kB, CPython 3.6m, Windows x86)
  • tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl (3.8 MB, CPython 3.6m)
  • tokenizers-0.6.0-cp36-cp36m-macosx_10_10_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.10+ x86-64)
  • tokenizers-0.6.0-cp35-cp35m-win_amd64.whl (1.0 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.6.0-cp35-cp35m-win32.whl (929.6 kB, CPython 3.5m, Windows x86)
  • tokenizers-0.6.0-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB, CPython 3.5m)
  • tokenizers-0.6.0-cp35-cp35m-macosx_10_10_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.10+ x86-64)

File details

Details for the file tokenizers-0.6.0.tar.gz.

File metadata

  • Download URL: tokenizers-0.6.0.tar.gz
  • Upload date:
  • Size: 67.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0.tar.gz
Algorithm Hash digest
SHA256 1da11fbfb4f73be695bed0d655576097d09a137a16dceab2f66399716afaffac
MD5 58f2bdb6880313bba212bcb4954ccab2
BLAKE2b-256 428271bbc4eff999a3e397373b9ccb43f82dad7d6d0865f2ce858d09add2dca6


File details

Details for the file tokenizers-0.6.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 535b074704eb63294f7bad124c3f6a65cff21c003ed5b1dc098cd5fd29bd503c
MD5 d1a141d0fbc670d054d0bc2f2396c335
BLAKE2b-256 31349ce0fb07ea7b687c58edfff4e90eadda5c39aec2b8818db419e9978de271


File details

Details for the file tokenizers-0.6.0-cp38-cp38-win32.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp38-cp38-win32.whl
  • Upload date:
  • Size: 930.9 kB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 9f3feb9e12408af66d8cb1f6d6598e96f91b43116f2f9c4ac8412246c3f062d5
MD5 9b4d291c535020ac31b65e70f5112f71
BLAKE2b-256 d808372c9b30725dfb76b289a830b063b234194d91cdad81a89cef8bfd1ae4bd


File details

Details for the file tokenizers-0.6.0-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 7.6 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 35fa117e38c55b9209f05e2a9de6b772a34a9681748ea22b5767ce5663f2d8b9
MD5 9f22c0d6ccbe7876b05b9437c905c4e5
BLAKE2b-256 2ac8c5e38543ab26a44a4782d9c0593a1f6d385aa27e176a4b37b72228eecb0f


File details

Details for the file tokenizers-0.6.0-cp38-cp38-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp38-cp38-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp38-cp38-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 4f2844f535f7397779ba1278af895a98dcfb85beb4faf45d792eec364f8bda56
MD5 0065dedd28d4633fb20e9009100b6bbc
BLAKE2b-256 7401699e1a2315eb8b83f5a473e5cdc69d48649aa0192c9f91ccaeee6a60167e


File details

Details for the file tokenizers-0.6.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 de0cd4a87edaf4464de29a65834d8a1f25818d41578cb7fb15851bfca1e362bf
MD5 ef0d246673836ef3541d930bd6c6dbba
BLAKE2b-256 32efdee152f1f3876a9abe3dd3284e6de39a1ffe4aee8fcc1312bfad95f22b6c


File details

Details for the file tokenizers-0.6.0-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 929.2 kB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 385e6e6e0f7cd6773a8ada2fbf75396389da85a28897d0427f3b30590f9413ea
MD5 b9c4e84eadd4cb235b3216d3b528423d
BLAKE2b-256 257371574061608856c653faf96539c689b68c190c7f0083890b9785078d89e6


File details

Details for the file tokenizers-0.6.0-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 f32a9972796671c42659365060409aaaf9b24ee573dda89d06e37f2ca1ac1f18
MD5 0beb38698a30e8c84d1f840da942467f
BLAKE2b-256 074125c8548112e0a7da7c8caca4c7c50fdba9928e70f7ad1a1c8fad006073d6


File details

Details for the file tokenizers-0.6.0-cp37-cp37m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp37-cp37m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp37-cp37m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 0ab6028c2e06e0389cf3cbb0070182dd1575cb7529fefea030fff47e4ec66d5a
MD5 90a8581cd4b5ed97eea8367eda84f44f
BLAKE2b-256 8eaa91f601b9f0f766c28b56368fff16495941170dd38e6f8e889b99fe28e72c


File details

Details for the file tokenizers-0.6.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 9085213e2220f312ef59bacf7e404a65382ee1dbdafe51e5a00ee9a4e6452b41
MD5 cc61a7dceb420011a0148f726cef1822
BLAKE2b-256 b657171a6f35f6d76212d7d1c293024db41b3e9e5eaed8113526f13dab245c70


File details

Details for the file tokenizers-0.6.0-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 929.6 kB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 2de36ca2c1b959bae08cb9d622564b4a4ed0e30f41c805e161f8a731e167388a
MD5 fa81bce31cf00b30366cd8b32363c93f
BLAKE2b-256 3b58fcb5a0440b4d3ab4ac1745592a3e45ea706f31806baebd6ddafadfe17ca9


File details

Details for the file tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.8 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 ca06f5a8811f99afc0d61984019e20f382deb5a2fc7d6b620733e60e2c66d83e
MD5 c626f9d842ee6d1922988668df495cad
BLAKE2b-256 73deec55e2d5a8720557b25100dd7dd4a63108a44b6b303978ce2587666931cf


File details

Details for the file tokenizers-0.6.0-cp36-cp36m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp36-cp36m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp36-cp36m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 5c6b510e567f81231cf57f0c9740cf936abaa83821fe95dbca685bbdf557da2c
MD5 0511f44be67112f2f6bff954763c2f41
BLAKE2b-256 788209e9ee8105f85c1f53888c0a2beca5c8b06a80837e3716aa795735db0643


File details

Details for the file tokenizers-0.6.0-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 e49412deacb7323b683d5ea1e55684778a6603d40ce854965c4d8f47112283f3
MD5 290781e20fde587d1434acafcd516a73
BLAKE2b-256 76d70bf27daaeaf436d05a2b10365b083ac93785bb72ed1eb55d8ff73c1735ce


File details

Details for the file tokenizers-0.6.0-cp35-cp35m-win32.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp35-cp35m-win32.whl
  • Upload date:
  • Size: 929.6 kB
  • Tags: CPython 3.5m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp35-cp35m-win32.whl
Algorithm Hash digest
SHA256 f0a9f60ebb82cdccaefd991b5c5a619886c6ff45862caf16018465240e1768c2
MD5 38dc5fd4d8998492d14f2205ce6a27a5
BLAKE2b-256 d9deda8055611e19a9fccf06f3971ca7394d281e62e7e02b27b74b541df66449


File details

Details for the file tokenizers-0.6.0-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 37bc9a03b44c00281a4d8e380660154b98164c7e9c8f420d81f3ffa937ed8359
MD5 3db483fdee1eb568b95f0890e1de60cf
BLAKE2b-256 1b70080f36ae65acae70b44ed9cb6c19658bc294fd7adac5605cfa1a6b883ac8


File details

Details for the file tokenizers-0.6.0-cp35-cp35m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.6.0-cp35-cp35m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.5m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1

File hashes

Hashes for tokenizers-0.6.0-cp35-cp35m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 db72c2c48bc70437ddf0529b6ea27149072ece81cdfe3b358c73746a931b5ab7
MD5 abedb622ae89a55064a7cdbf16712cd0
BLAKE2b-256 b6362a2696f1eb76eb624a4236e0670ac43c3edf71b31090edffb0c8e85923b6

