Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
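To make the alignment-tracking idea concrete, here is a toy sketch in plain Python (not this library's actual implementation): a whitespace tokenizer that records, for each token, the (start, end) span of the original sentence it came from.

```python
def tokenize_with_offsets(text):
    """Toy whitespace tokenizer that keeps (start, end) offsets per token."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append((text[start:i], (start, i)))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((text[start:], (start, len(text))))
    return tokens

sentence = "I can feel the magic"
pairs = tokenize_with_offsets(sentence)
# Each offset pair points back into the original sentence:
for token, (s, e) in pairs:
    assert sentence[s:e] == token
```

The real library tracks offsets through normalization as well (e.g. lowercasing or unicode normalization), so the spans always refer to the original, un-normalized text.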

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
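For intuition about what BPE training does under the hood, here is a toy sketch in plain Python (not this library's Rust implementation): starting from characters, repeatedly merge the most frequent adjacent pair of symbols.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: learn `num_merges` merges from a list of words."""
    words = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pair_counts = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge everywhere
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, segmented = bpe_merges(["low", "lower", "lowest"], 2)
# After two merges, the shared prefix "low" becomes a single symbol.
```

The provided tokenizers do this (and much more: byte-level alphabets, WordPiece scoring, frequency cutoffs) in Rust, which is where the speed comes from.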

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.1.1.tar.gz (58.4 kB, Source)

Built Distributions

  • tokenizers-0.1.1-cp38-cp38-win_amd64.whl (1.0 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.1.1-cp38-cp38-manylinux1_x86_64.whl (7.3 MB, CPython 3.8)
  • tokenizers-0.1.1-cp38-cp38-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.13+ x86-64)
  • tokenizers-0.1.1-cp37-cp37m-win_amd64.whl (1.0 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.1.1-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB, CPython 3.7m)
  • tokenizers-0.1.1-cp37-cp37m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.13+ x86-64)
  • tokenizers-0.1.1-cp36-cp36m-win_amd64.whl (1.0 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.1.1-cp36-cp36m-manylinux1_x86_64.whl (3.6 MB, CPython 3.6m)
  • tokenizers-0.1.1-cp36-cp36m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.13+ x86-64)
  • tokenizers-0.1.1-cp35-cp35m-win_amd64.whl (1.0 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.1.1-cp35-cp35m-manylinux1_x86_64.whl (1.8 MB, CPython 3.5m)
  • tokenizers-0.1.1-cp35-cp35m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.13+ x86-64)

File details

Details for the file tokenizers-0.1.1.tar.gz.

File metadata

  • Download URL: tokenizers-0.1.1.tar.gz
  • Upload date:
  • Size: 58.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.1.1.tar.gz
Algorithm Hash digest
SHA256 167b8e80f04e1bc83c1edaf9de6bb79f885bdf5825d5c494c90bb468e94f3e39
MD5 074750029929f2c8e7d7f909a9b48210
BLAKE2b-256 500dc3b7e9f13797546ea03e1caea79ee8384cdb20e2afdabeecce3eff64cb32

See more details on using hashes here.

