
Fast and Customizable Tokenizers






Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
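The alignment tracking mentioned above can be pictured with plain Python: each token carries a (start, end) offset pair into the original string, so slicing recovers the exact source text. The offsets below are hypothetical values of the kind `encoded.offsets` contains:

```python
# Illustrative sketch: mapping tokens back to the original sentence
# via (start, end) character offsets. Hypothetical offsets for this
# sentence, not output from the library.
sentence = "Hello, world!"
offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]  # one (start, end) per token

# Recover the original text span behind each token:
spans = [sentence[start:end] for start, end in offsets]
# spans == ["Hello", ",", "world", "!"]
```

Note that the offsets always index into the original, un-normalized text, which is what makes it possible to highlight source spans even after lowercasing or unicode normalization.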

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (or use an existing one)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
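To picture the relationship between `encoded.ids` and `encoded.tokens`: every token maps to an integer id through the vocabulary, and decoding is just the inverse lookup. A minimal pure-Python sketch with a made-up vocabulary (not the library's actual vocab format):

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = {"i": 0, "can": 1, "feel": 2, "the": 3, "magic": 4}

tokens = ["i", "can", "feel", "the", "magic"]
ids = [vocab[t] for t in tokens]       # tokens -> ids

inv = {i: t for t, i in vocab.items()}
decoded = [inv[i] for i in ids]        # ids -> tokens again
```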

And you can train yours just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
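To give an intuition for what BPE training does under the hood, here is a toy pure-Python sketch of the core idea behind CharBPETokenizer: repeatedly merge the most frequent adjacent pair of symbols. This is only an illustration of the algorithm; the real implementation is in Rust and trains over word frequencies across a whole corpus, not a single string.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def bpe_train(word, num_merges):
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the chosen merge left-to-right over the symbol sequence
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

tokens, merges = bpe_train("aaabdaaabac", 3)
# tokens == ["aaab", "d", "aaab", "a", "c"]
```

The byte-level variant (ByteLevelBPETokenizer) applies the same merging idea over raw bytes instead of unicode characters, which guarantees it can represent any input without unknown tokens.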

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
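The truncation and padding listed in the main features amount to the following kind of transformation on a batch of id sequences (a hand-rolled sketch to show the shape of the operation; the function name and defaults here are hypothetical, not the tokenizers API):

```python
# Sketch: make every sequence in a batch exactly max_length ids long,
# truncating the long ones and padding the short ones with pad_id.
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    out = []
    for ids in batch_ids:
        ids = ids[:max_length]                          # truncate
        ids = ids + [pad_id] * (max_length - len(ids))  # pad
        out.append(ids)
    return out

padded = pad_and_truncate([[5, 8, 2], [7, 1, 4, 9, 3]], max_length=4)
# padded == [[5, 8, 2, 0], [7, 1, 4, 9]]
```

The library performs this for you (together with adding special tokens), so every encoding in a batch comes back with a uniform length ready to feed to a model.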

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
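To clarify the trainer parameters used above: vocab_size caps how many entries the final vocabulary may contain, and min_frequency drops candidates seen fewer than that many times in the training files. A simplified pure-Python illustration of that frequency filtering:

```python
# Sketch of min_frequency: only items seen at least min_frequency
# times make it into the vocabulary. Simplified illustration only.
from collections import Counter

def build_vocab(corpus_words, min_frequency):
    counts = Counter(corpus_words)
    return sorted(w for w, c in counts.items() if c >= min_frequency)

vocab = build_vocab(["the", "cat", "the", "dog", "the", "cat"], min_frequency=2)
# vocab == ["cat", "the"]  ("dog" appears only once and is dropped)
```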


