Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can find it described there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Handles all the pre-processing: truncating, padding, and adding the special tokens your model needs.
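
To illustrate what alignment tracking means, here is a minimal pure-Python sketch. It is not the library's implementation: it assumes a toy normalizer that only lowercases text and a plain whitespace splitter, and the helper name `tokenize_with_offsets` is hypothetical. Each token is returned with the (start, end) span of the original text it came from.

```python
import re

def tokenize_with_offsets(text):
    """Toy tokenizer: lowercase each whitespace-separated word and
    remember where it sits in the original (un-normalized) text."""
    results = []
    for match in re.finditer(r"\S+", text):
        token = match.group().lower()          # toy normalization step
        results.append((token, match.span()))  # span indexes the original text
    return results

original = "I can feel the MAGIC"
for token, (start, end) in tokenize_with_offsets(original):
    # Slicing the original string with the offsets recovers the source text.
    print(token, (start, end), repr(original[start:end]))
```

Lowercasing ASCII text keeps string lengths unchanged, so the bookkeeping here is trivial. A normalizer that adds or removes characters would need to carry the offsets through every transformation.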

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
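
As a rough illustration of how a merge list and a vocabulary work together, the sketch below applies ordered merges to a single word and then looks up the resulting tokens. It is simplified pure Python, not the library's code: the tiny `merges` and `vocab` contents are made up, the helper `bpe_encode_word` is hypothetical, and the `</w>` end-of-word marker is an assumption borrowed from the original BPE formulation.

```python
def bpe_encode_word(word, merges):
    """Apply BPE merges to one word, earliest-learned merge first."""
    if not word:
        return []
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    symbols[-1] += "</w>"  # mark the end of the word
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Made-up data standing in for the contents of merges.txt and vocab.json.
merges = [("l", "o"), ("lo", "w</w>"), ("e", "r</w>")]
vocab = {"low</w>": 0, "lo": 1, "w": 2, "er</w>": 3, "<unk>": 4}

tokens = bpe_encode_word("lower", merges)
ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
print(tokens)
print(ids)
```

The order of the merge list matters: a pair that was learned earlier always wins over one learned later, which is why the merges are stored as an ordered list rather than a set.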

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
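
Conceptually, training a BPE vocabulary means repeatedly counting adjacent symbol pairs and merging the most frequent one. The sketch below shows that loop in plain Python under simplifying assumptions: it works on a made-up word-count table instead of files, uses a `</w>` end-of-word marker, and the helper name `learn_bpe_merges` is hypothetical. It is not how the Rust implementation is written.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping
    (words must be non-empty)."""
    # Each word becomes a tuple of symbols; the last one carries "</w>".
    corpus = {tuple(word[:-1]) + (word[-1] + "</w>",): count
              for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges

toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
for pair in learn_bpe_merges(toy_counts, 5):
    print(pair)
```

The number of merges plays the role of a vocabulary-size budget: each merge adds exactly one new symbol to the base alphabet.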

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
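
To show how WordPiece differs from BPE when splitting a word, here is a small sketch of greedy longest-match-first lookup, the strategy commonly described for BERT-style WordPiece tokenization. The `##` continuation prefix follows the convention of BERT vocabularies; the toy vocabulary is made up and the helper `wordpiece_tokenize` is hypothetical, so treat this as an illustration rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk_token]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing", "[UNK]"}
print(wordpiece_tokenize("unaffable", vocab))
print(wordpiece_tokenize("playing", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

Unlike BPE, this lookup needs no merge list at encoding time: the vocabulary alone determines the split.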

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
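
The idea behind the byte-level pieces used above can be sketched in plain Python: text is turned into UTF-8 bytes, and each of the 256 byte values is mapped to a printable character, so a fixed base alphabet covers any input and decoding simply reverses the mapping. The table below follows the byte-to-unicode scheme popularized by GPT-2; that the library uses exactly this table is an assumption here, and the helper names are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    printable = (list(range(0x21, 0x7F))     # "!" to "~"
                 + list(range(0xA1, 0xAD))   # "¡" to "¬"
                 + list(range(0xAE, 0x100))) # "®" to "ÿ"
    byte_values, code_points = printable[:], printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Bytes without a printable form are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_encode(text):
    """Replace every UTF-8 byte of `text` with its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def byte_level_decode(symbols):
    """Reverse the mapping and decode the bytes back to text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")

encoded = byte_level_encode(" magic é")
print(encoded)
print(byte_level_decode(encoded) == " magic é")
```

In this scheme a space becomes a visible character, which is why a leading space on the first word makes it look like every other word in the sentence to the model.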

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")


