
Fast and Customizable Tokenizers



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
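To make the alignment tracking idea concrete, here is a minimal standard-library sketch. It is not how the library implements it; the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalization (lowercasing plus accent stripping) and whitespace splitting are assumptions chosen for illustration. The point is only that each normalized character remembers which original character it came from, so every token can be mapped back to a span of the input.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for each output character
    the index of the original character it came from."""
    chars, alignment = [], []
    for index, char in enumerate(text):
        # NFD splits an accented letter into base letter + combining marks.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # drop the accent itself
            for lowered in piece.lower():
                chars.append(lowered)
                alignment.append(index)
    return "".join(chars), alignment


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and return
    (token, (start, end)) pairs whose offsets index the ORIGINAL text."""
    normalized, alignment = normalize_with_alignment(text)
    tokens, start = [], None
    # The trailing space is a sentinel that flushes the last token.
    for position, char in enumerate(normalized + " "):
        if char.isspace():
            if start is not None:
                span = (alignment[start], alignment[position - 1] + 1)
                tokens.append((normalized[start:position], span))
                start = None
        elif start is None:
            start = position
    return tokens


original = "H\u00e9llo  WORLD"
result = tokenize_with_offsets(original)
for token, (begin, end) in result:
    # The token is normalized, but the slice shows the untouched input.
    print(token, "->", original[begin:end])
```

The tokens are normalized, while the offsets still point into the untouched input string.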

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
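If you are curious what training a BPE vocabulary means, the following toy sketch learns merge rules from a handful of words using only the standard library. It is an illustration of the general BPE idea, not the library's Rust implementation; the names `learn_bpe_merges`, `apply_merge` and `encode_word` are hypothetical, and the tie-breaking rule is an assumption made to keep the result deterministic.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, frequency in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += frequency
        if not pair_counts:
            break  # nothing left to merge
        # Pick the most frequent pair; ties go to the lexicographically
        # largest pair (an arbitrary choice, only for determinism).
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        new_corpus = Counter()
        for symbols, frequency in corpus.items():
            new_corpus[apply_merge(symbols, best)] += frequency
        corpus = new_corpus
        merges.append(best)
    return merges


def encode_word(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["hug", "hug", "hug", "pug", "pun"], num_merges=2)
print(merges)
print(encode_word("hug", merges), encode_word("bug", merges))
```

Words seen often during training collapse into a single token, while unseen words fall back to smaller pieces.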

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, this is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
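As background on the "byte-level" part: the base alphabet is the 256 possible byte values rather than Unicode characters, so any input string can be represented without an unknown token. The sketch below shows only that idea using the standard library. The helper names `to_byte_symbols` and `from_byte_symbols` are hypothetical, and real byte-level tokenizers typically also map each byte to a printable character before applying merges, which is omitted here.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 bytes, the base symbols of a byte-level BPE."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Rebuild the original text from a sequence of byte values."""
    return bytes(symbols).decode("utf-8")


sample = "magic \u2728"  # ASCII text followed by a non-ASCII symbol
symbols = to_byte_symbols(sample)

# Every symbol is in range(256), whatever the input characters are.
print(symbols)
print(from_byte_symbols(symbols) == sample)
```

Characters outside ASCII simply take several base symbols, which the learned merges can later combine into a single token.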


