
Fast and Customizable Tokenizers


Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
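
To make the alignment-tracking idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalization (lowercasing plus accent stripping) and the regex-based splitting are assumptions chosen only for illustration. The point it shows is that each normalized character remembers which original character it came from, so every token can be mapped back to a span of the original text.

```python
import re
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording the source index of each output char."""
    chars, alignment = [], []
    for index, char in enumerate(text):
        for decomposed in unicodedata.normalize("NFD", char):
            if unicodedata.combining(decomposed):
                continue  # drop combining accent marks
            for lowered in decomposed.lower():
                chars.append(lowered)
                alignment.append(index)  # remember where this char came from
    return "".join(chars), alignment


def tokenize_with_offsets(text):
    """Split the normalized text and return (token, (start, end)) in ORIGINAL coordinates."""
    normalized, alignment = normalize_with_alignment(text)
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", normalized):
        start = alignment[match.start()]
        end = alignment[match.end() - 1] + 1
        tokens.append((match.group(), (start, end)))
    return tokens


text = "H\u00e9llo, WORLD!"
for token, (start, end) in tokenize_with_offsets(text):
    # The slice of the original text is recovered from the offsets alone.
    print(token, (start, end), repr(text[start:end]))
```

Here the token `world` maps back to `WORLD` in the input even though the normalized form differs from the original.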

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
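
For readers new to BPE, the following toy sketch illustrates the general idea behind training: repeatedly merge the most frequent adjacent pair of symbols. It is a simplified pure-Python illustration, not the code used by the library. The functions `learn_bpe`, `merge_pair` and `apply_bpe` are hypothetical names, and the `min_frequency` argument is an assumption that only loosely mirrors the idea of a minimum pair frequency.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    vocab = Counter(tuple(word) for word in words)  # word as symbols -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best, frequency = pairs.most_common(1)[0]
        if frequency < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[merge_pair(symbols, best)] += count
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a single word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = learn_bpe(corpus, num_merges=3)
print(merges)                     # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(apply_bpe("hugs", merges))  # ['hug', 's']
```

The real implementations add many details on top of this (pre-tokenization, special tokens, end-of-word markers, vocabulary size limits), so treat this only as a mental model.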

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, this is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
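
As a rough intuition for what "byte-level" means, the sketch below shows the underlying idea in plain Python: any string can be turned into UTF-8 bytes, so a base alphabet of 256 symbols is enough to represent arbitrary text, and the bytes can be decoded back losslessly. The helpers `to_byte_symbols` and `from_byte_symbols` are hypothetical names used only for this illustration. The library's byte-level components do more than this, so this is a conceptual sketch rather than a description of their internals.

```python
def to_byte_symbols(text):
    """Turn a string into a list of integers in range(256), one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Rebuild the original string from its byte-level symbols."""
    return bytes(symbols).decode("utf-8")


text = "magic \u2728"  # contains a non-ASCII character
symbols = to_byte_symbols(text)

# Every symbol fits in the 256-entry base alphabet, whatever the input is.
print(all(0 <= symbol < 256 for symbol in symbols))

# The round trip is lossless.
print(from_byte_symbols(symbols) == text)
```

A BPE model trained on top of such symbols can then merge frequent byte sequences into larger tokens, while still being able to fall back to single bytes for text it has never seen.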


