
Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
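
To make the alignment-tracking idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalization (lowercasing plus accent stripping) is only an assumed example. It shows how remembering the source index of every normalized character lets each token be mapped back to a span of the original text. In the library itself, the per-token spans are exposed on the encoding object as `encoded.offsets`.

```python
import re
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording the original index of each output char."""
    chars, origins = [], []
    for index, char in enumerate(text):
        # NFD splits an accented letter into base letter + combining mark.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # drop the accent itself
            for low in piece.lower():
                chars.append(low)
                origins.append(index)
    return "".join(chars), origins


def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs; offsets index into the ORIGINAL text."""
    normalized, origins = normalize_with_alignment(text)
    result = []
    # Toy pre-tokenization: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", normalized):
        start = origins[match.start()]
        end = origins[match.end() - 1] + 1
        result.append((match.group(), (start, end)))
    return result


original = "H\u00e9llo, WORLD!"
tokens = tokenize_with_offsets(original)
for token, (start, end) in tokens:
    # Each normalized token points back to the untouched original slice.
    print(token, (start, end), repr(original[start:end]))
```

The token `hello` maps back to the slice `Héllo` of the input, even though the normalized form no longer contains the accent or the capital letter.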

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
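
For readers new to BPE, the following toy sketch illustrates the general idea behind training: repeatedly merge the most frequent adjacent pair of symbols. It is a simplified pure-Python illustration, not the library's Rust implementation. The function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


training_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe_merges(training_words, num_merges=3)
print(merges)
print(apply_merges("lowest", merges))
```

With this tiny corpus, the three learned merges build `lo`, then `low`, then `es`, so `lowest` is split into `low`, `es`, `t`. The provided tokenizers differ mainly in what the initial symbols are (characters, bytes, or SentencePiece-style pieces) and in how they are pre-tokenized.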

Build your own

When the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
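
The tokens produced by a byte-level BPE can look odd at first (for example, a leading `Ġ` where a space used to be). The sketch below shows the GPT-2 style byte-to-character mapping that byte-level tokenizers are generally built on. It is assumed here to be an illustration of what `pre_tokenizers.ByteLevel` and `decoders.ByteLevel` do conceptually, not their actual code, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character (GPT-2 style)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    # Remaining bytes (control chars, space, ...) are shifted above U+00FF.
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text, add_prefix_space=True):
    """Encode text as UTF-8 and render every byte as one printable symbol."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Invert to_byte_level: symbols back to bytes, then decode as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


visible = to_byte_level("can you?")
print(visible)
print(repr(from_byte_level(visible)))
```

Because every possible byte has a symbol, any input string can be represented without an unknown token, and decoding recovers the text exactly (including the prefix space that was added).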
