
Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
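
To make the alignment-tracking idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper `tokenize_with_offsets` is hypothetical, and it assumes a toy pipeline that splits on whitespace and lowercases each word. The point is only that every normalized token keeps the character span it came from in the original text.

```python
import re

def tokenize_with_offsets(text):
    """Toy tokenizer (hypothetical helper, not part of `tokenizers`).

    Splits on whitespace, lowercases each piece, and records the
    (start, end) character span of each piece in the ORIGINAL text.
    """
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group().lower())  # normalized form
        offsets.append(match.span())          # span in the original text
    return tokens, offsets

text = "I can FEEL the Magic"
tokens, offsets = tokenize_with_offsets(text)

# Each normalized token can be mapped back to the exact original substring.
for token, (start, end) in zip(tokens, offsets):
    print(token, "->", repr(text[start:end]))
```

Because the spans index into the untouched input, the original casing is recoverable even though the tokens themselves were lowercased.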

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
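
For readers new to BPE, the following is a rough sketch of what training learns: a list of merges, chosen greedily by pair frequency. It is an illustration under simplifying assumptions, not the code behind CharBPETokenizer: the function `learn_bpe_merges` and the toy corpus are hypothetical, words are split into plain characters, and end-of-word markers are left out.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Hypothetical toy corpus with word frequencies expressed by repetition.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

The learned merges play the same role as the entries of a merges.txt file: applied in order, they turn a sequence of characters into progressively larger subword units.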

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
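
As background on what "byte-level" means in the example above, here is a small standalone sketch. It does not use the library: `to_byte_symbols` and `from_byte_symbols` are hypothetical helpers that only show the underlying idea, namely that any string can be expressed over a base alphabet of 256 byte values and recovered exactly.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of integers in [0, 255]."""
    return list(text.encode("utf-8"))

def from_byte_symbols(byte_ids):
    """Inverse of `to_byte_symbols`: rebuild the string from its bytes."""
    return bytes(byte_ids).decode("utf-8")

text = "café"
byte_ids = to_byte_symbols(text)

# Every symbol fits in the 256-entry base alphabet, whatever the input script.
assert all(0 <= b < 256 for b in byte_ids)
# The round trip is lossless.
assert from_byte_symbols(byte_ids) == text

print(len(text), "characters ->", len(byte_ids), "bytes")
```

A BPE model trained on top of such byte symbols starts from a fixed, small alphabet, which is why non-ASCII characters such as "é" simply expand to several base symbols instead of falling outside the vocabulary.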


