
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
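The alignment tracking mentioned above is worth a small illustration. Here is a toy, pure-Python whitespace tokenizer that records a (start, end) offset into the original sentence for every token. This is only a sketch of the concept, not the library's implementation:

```python
def tokenize_with_offsets(sentence):
    """Toy whitespace tokenizer that records a (start, end) offset
    into the original sentence for every token it produces."""
    tokens = []
    start = None
    for i, ch in enumerate(sentence):
        if ch.isspace():
            if start is not None:
                tokens.append((sentence[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((sentence[start:], (start, len(sentence))))
    return tokens

sentence = "I can feel the magic"
for token, (start, end) in tokenize_with_offsets(sentence):
    # Each offset pair points back to the exact span in the original text
    assert sentence[start:end] == token
```

Because every token carries its offsets, downstream steps can always be traced back to the exact characters of the original input, even after normalization.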

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
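To give an intuition for what the merges.txt file used by the BPE tokenizers encodes, here is a toy, pure-Python sketch of applying an ordered list of merges to a word. This is a simplification for illustration only; the library's actual algorithm is the optimized Rust implementation:

```python
def apply_bpe(word, merges):
    """Apply an ordered list of BPE merges to a word.
    `merges` is a list of (left, right) pairs, highest priority first."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                # The pair matches this merge rule: fuse it into one symbol
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("l", "o"), ("lo", "w")]
print(apply_bpe("lower", merges))  # → ['low', 'e', 'r']
```

Each line of a merges file corresponds to one such (left, right) rule, and the rules are applied in order of priority.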

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
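The ByteLevel pre-tokenizer used above works on raw bytes rather than unicode characters, so any input string can be represented without unknown tokens. The core trick is a reversible map from each of the 256 byte values to a printable character. Here is a simplified pure-Python sketch of that idea (the actual table used by the library may differ):

```python
def bytes_to_printable():
    """Build a reversible map from each of the 256 byte values to a
    printable unicode character (a simplification of the byte-level scheme)."""
    printable = list(range(ord("!"), ord("~") + 1)) + \
                list(range(0x00A1, 0x00FF + 1))
    table = {}
    offset = 0x0100  # stand-in characters for non-printable bytes
    for b in range(256):
        if b in printable:
            table[b] = chr(b)
        else:
            table[b] = chr(offset)
            offset += 1
    return table

table = bytes_to_printable()
decode = {v: k for k, v in table.items()}

text = "magic ✨"
encoded = "".join(table[b] for b in text.encode("utf-8"))
decoded = bytes(decode[c] for c in encoded).decode("utf-8")
assert decoded == text  # the mapping is fully reversible
```

Since the mapping covers every possible byte and is injective, the decoder can always reconstruct the original text exactly.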

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
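Conceptually, what the trainer does is repeatedly count adjacent symbol pairs over the corpus and merge the most frequent one until the budget of merges is spent. A toy pure-Python version of that loop (illustrative only; it ignores word frequencies, min_frequency, and special tokens, all of which the real trainer handles):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn `num_merges` BPE merges from a list of words (toy version)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the winning pair with the merged symbol
        for idx, symbols in enumerate(corpus):
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            corpus[idx] = merged
    return merges

print(train_bpe(["low", "lower", "lowest"], 2))  # → [('l', 'o'), ('lo', 'w')]
```

The learned merges are exactly what ends up in a merges.txt file; applying them in order reproduces the training-time segmentation at encode time.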

Project details



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.4.2.tar.gz (62.7 kB, Source)

Built Distributions

  • tokenizers-0.4.2-cp38-cp38-win_amd64.whl (1.1 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.4.2-cp38-cp38-manylinux1_x86_64.whl (7.5 MB, CPython 3.8, manylinux1 x86-64)
  • tokenizers-0.4.2-cp38-cp38-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.15+ x86-64)
  • tokenizers-0.4.2-cp37-cp37m-win_amd64.whl (1.1 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.4.2-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB, CPython 3.7m, manylinux1 x86-64)
  • tokenizers-0.4.2-cp37-cp37m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.15+ x86-64)
  • tokenizers-0.4.2-cp36-cp36m-win_amd64.whl (1.1 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.4.2-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB, CPython 3.6m, manylinux1 x86-64)
  • tokenizers-0.4.2-cp36-cp36m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.15+ x86-64)
  • tokenizers-0.4.2-cp35-cp35m-win_amd64.whl (1.1 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.4.2-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB, CPython 3.5m, manylinux1 x86-64)
  • tokenizers-0.4.2-cp35-cp35m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.15+ x86-64)

File details

Details for the file tokenizers-0.4.2.tar.gz.

File metadata

  • Download URL: tokenizers-0.4.2.tar.gz
  • Upload date:
  • Size: 62.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for tokenizers-0.4.2.tar.gz:

  • SHA256: e72068044dc2406f8d5cc995ff35d3ff8eaff7e79e9ff5544bf3e29d37b301d5
  • MD5: d53ce34579ddda53e467820d65ff9c2b
  • BLAKE2b-256: 9b61f6fad680f4356d63a463a9a00f581ed8e2ee523539bfe8880d09096c9ba0

See more details on using hashes here.

