
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out over there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch right after this list).
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
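As a quick illustration of the last two points, here is a hedged sketch. It assumes a tokenizer you have already loaded or trained, as described in the sections below; the paths are placeholders:

from tokenizers import CharBPETokenizer

# Placeholder paths; any of the tokenizers described below works the same way
tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

# Pre-processing: truncate overly long inputs, pad batches to their longest member
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding()

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Alignment tracking: each token knows which span of the original sentence it covers
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])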

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use an existing one as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
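As a quick sanity check (this is only a suggestion, not part of the official steps), you can confirm from inside the virtual env that the freshly built extension imports correctly:

# Run from within the activated virtual env
import tokenizers
print(tokenizers.__version__)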

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
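Going the other way is just as direct. Here is a small sketch that decodes the ids produced above back into text:

# Map the ids back to a string (special tokens are skipped by default)
decoded = tokenizer.decode(encoded.ids)
print(decoded)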

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
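The constructors differ slightly in which files they expect. For instance, the Bert tokenizer loads a single WordPiece vocab.txt instead of the vocab.json/merges.txt pair used above; a hedged sketch with a placeholder path:

from tokenizers import BertWordPieceTokenizer

# Hypothetical path to a WordPiece vocabulary file
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt", lowercase=True)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)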

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting all the different parts you need together. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
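The same pattern extends to the other components. For example, a normalizer could be attached in exactly the same way before training (a hedged sketch, not part of the example above):

from tokenizers import normalizers

# Lowercase and apply NFKC unicode normalization before pre-tokenization
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])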

Now, when you want to use this tokenizer, it's as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
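And since this is a plain Tokenizer object, batch encoding works too; a minimal sketch:

# Encode several sentences in one call; returns a list of Encoding objects
encodings = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog.",
])
print([e.tokens for e in encodings])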


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.8.0rc3.tar.gz (95.1 kB): Source

Built Distributions

  • tokenizers-0.8.0rc3-cp38-cp38-win_amd64.whl (1.9 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.8.0rc3-cp38-cp38-win32.whl (1.6 MB): CPython 3.8, Windows x86
  • tokenizers-0.8.0rc3-cp38-cp38-manylinux1_x86_64.whl (3.0 MB): CPython 3.8
  • tokenizers-0.8.0rc3-cp38-cp38-macosx_10_10_x86_64.whl (2.1 MB): CPython 3.8, macOS 10.10+ x86-64
  • tokenizers-0.8.0rc3-cp37-cp37m-win_amd64.whl (1.9 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.8.0rc3-cp37-cp37m-win32.whl (1.6 MB): CPython 3.7m, Windows x86
  • tokenizers-0.8.0rc3-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB): CPython 3.7m
  • tokenizers-0.8.0rc3-cp37-cp37m-macosx_10_10_x86_64.whl (2.1 MB): CPython 3.7m, macOS 10.10+ x86-64
  • tokenizers-0.8.0rc3-cp36-cp36m-win_amd64.whl (1.9 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.8.0rc3-cp36-cp36m-win32.whl (1.6 MB): CPython 3.6m, Windows x86
  • tokenizers-0.8.0rc3-cp36-cp36m-manylinux1_x86_64.whl (3.0 MB): CPython 3.6m
  • tokenizers-0.8.0rc3-cp36-cp36m-macosx_10_10_x86_64.whl (2.1 MB): CPython 3.6m, macOS 10.10+ x86-64
  • tokenizers-0.8.0rc3-cp35-cp35m-win_amd64.whl (1.9 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.8.0rc3-cp35-cp35m-win32.whl (1.6 MB): CPython 3.5m, Windows x86
  • tokenizers-0.8.0rc3-cp35-cp35m-manylinux1_x86_64.whl (3.0 MB): CPython 3.5m
  • tokenizers-0.8.0rc3-cp35-cp35m-macosx_10_10_x86_64.whl (2.1 MB): CPython 3.5m, macOS 10.10+ x86-64

File details

Details for the file tokenizers-0.8.0rc3.tar.gz.

File metadata

  • Download URL: tokenizers-0.8.0rc3.tar.gz
  • Upload date:
  • Size: 95.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3.tar.gz
Algorithm Hash digest
SHA256 d34720e3d1f099ad7c03acc5fdac1b8daa0bbbf5622376e5ee7c6c8717c37b4a
MD5 55c4a9736c8fb6e9e409218a30caf41c
BLAKE2b-256 d3cebcbf6f7b63abda8853c6f79eacd06de574cff02ebd348a2bbd54c20f3029

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 601e9022a2f193f6ef4d4735b4cd33220ae8721b87583bdcea32c1a9af529799
MD5 eea244f814bff37c209b2c0b0a78f809
BLAKE2b-256 bc389c4fa44750adb36b88b44f19fa1e89499f2a1572152caab806593f9972bd

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp38-cp38-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp38-cp38-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 0f68ef16139595654f45195680735f0bba9ae569a68798f3fe514978c6f21ec3
MD5 af3ebf4f41c9c91e029ec17c9ec2fcbf
BLAKE2b-256 530f1b48f1ce889fab8839c37b4784d6bc67156dec5e269cdb4dd90e6dc7c0f3

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 56ef89a3aa0d50c19e3e653686afded31da8607e1884cb21fb7acd981afb74aa
MD5 86db150e11957f1ca10105175dd683bd
BLAKE2b-256 06929c283746407f759091b2848ca453343f07449036fce0f0973a11f144eded

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp38-cp38-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp38-cp38-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.8, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp38-cp38-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 f91da4bfee1fdacc43bec9afd756135d9c3938745874e4cc5bddbb4407e15368
MD5 d1147d200194e65d31c44777264c29bb
BLAKE2b-256 02186716a83bc550f47f5d89a8ce60179ccfe9109a272748f19f09e67d3edde9

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 5befaa17ad4d40e977c10f959a28837954d48eda39781ceee31b997efc6cfafd
MD5 5037e489ccfb2a44ac9649a7aaf26a79
BLAKE2b-256 c9180f654a88b494288668a7c6063d4462092ed0ef1c827d8623a3202433a646

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 bf7b952afe74b14549274f4fb62027eb391809aa394379ca2b13f502581b14d9
MD5 af2569800e8acae603188fa6d19558e7
BLAKE2b-256 43d0ababbaec33ac72d05f5360112322af595d9067160cf5b00576313c298b2c

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 8c9a4b2fd6768d46ea34b45d3429d29e28d50df441e2383b1ce4d233f7f8936c
MD5 48bd9528e8cbbc065ed84197ddb9b928
BLAKE2b-256 e633bd7eaeea1b84ca219a7140544e37f75efe6de0c883a6f155fb3b23053a76

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp37-cp37m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp37-cp37m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.7m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp37-cp37m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 349f753cbca7b7937ec1a09c59ebf3de468858ec981e785425a2037c36eb8259
MD5 130cc6c343285d28d57003f595669565
BLAKE2b-256 9cf293bbdc0a37d509b54d2d54665f76ee0832fc0e191ebfbe7b38dedbc6ebf6

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 611cf501252737cdd5033094b9f4b418b6918a68d0cd0999cdc99a7984119615
MD5 20c5017e627465c7a75f17abee3196b1
BLAKE2b-256 9d0229ef71fb0cea5fd98e28bd6544ba0e10ae162c5663a3e099799a83cbd9d9

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 b346bd28c99c2334b056b9d16f6215cc9707be67502d2554552e91320e1158b2
MD5 52a702034ae173152fe36b013456fe2f
BLAKE2b-256 081908961c9322081b0b8d8bac49f48bfdc0ab5b9fdbf5e1f5980fbefe22cb13

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4586bf1663c0f7968d056aa43efab978d2333be5cc13755761750c6ad1f29f1a
MD5 914e1d4ffea80860e9ce2d4c0684c974
BLAKE2b-256 c324fc3b869878ad596d9b3acfea7f1f163958893579f7b0c278cc45eb885dfd

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp36-cp36m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp36-cp36m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.6m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp36-cp36m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 d100f215537ae8329e4b7cefd00f07c7b1481786fcbf9bd8a313c8913eb4649b
MD5 72482e8040fc76e3675571e89f14e306
BLAKE2b-256 07c9a559efb1ab6e9e74d92e4ae82899d75414ad60ddd0dbff5be76d0a738370

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.9 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 d364b08417238feb168a8cdb360921b69a1ecf90bff47225455bfd9020d42367
MD5 290bd586e3fd6c165ec6e1fe99a55cc1
BLAKE2b-256 550e44b3700cb2cf1ae607af7ad7b404aba62d4c32924f4251ca74e797adb668

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp35-cp35m-win32.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp35-cp35m-win32.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.5m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp35-cp35m-win32.whl
Algorithm Hash digest
SHA256 9b31b2c7ed6c7700491cfcce209a09c38f8e4710d09d3a30f4b02dfab3d9e291
MD5 5bfa97fdfa3955c39f7c9bea60b03eea
BLAKE2b-256 472338ab1547fd90a776e10d34aab994510053fcc3cda252cf518c875d415abb

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.0 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 953be87d0fcb635894b37ae4ca63450eeca2269010df727465e8821bcf9ddd19
MD5 5c73abea6454287c817006af5bb9c6c8
BLAKE2b-256 d0eb1fe7ad583ea19412dcb66abc5a08a199d42de26ded487990b3153f0f1ac1

See more details on using hashes here.

File details

Details for the file tokenizers-0.8.0rc3-cp35-cp35m-macosx_10_10_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.8.0rc3-cp35-cp35m-macosx_10_10_x86_64.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: CPython 3.5m, macOS 10.10+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for tokenizers-0.8.0rc3-cp35-cp35m-macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 0fe573ed196ee93d9b900b07a371356ae76c48c509ce412c40cf2eeec0751ae6
MD5 f297b6739576c22d6da4a5266fa43a57
BLAKE2b-256 abab7687eada43db4aca9f98584eca038e79eab432e406fac73e66fb8517d82b

See more details on using hashes here.
