Fast and Customizable Tokenizers

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize, using 4 pre-made tokenizers (BERT's WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the sketch just after this list).
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
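
For instance, here is what alignment tracking looks like in practice. A minimal sketch, assuming the encoding exposes per-token (start, end) character offsets via .offsets, as it does in later releases:

from tokenizers import BPETokenizer

# Initialize from an existing vocabulary and merges file
tokenizer = BPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Map each token back to the slice of the original sentence it covers
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "=>", sentence[start:end])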

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (or use your own)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install
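
A quick way to check that the build succeeded is to import the package from the activated environment:

# If the Rust extension compiled and installed correctly, this succeeds
import tokenizers
print("tokenizers is ready to use")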

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
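
Continuing from the snippet above, the ids can also be turned back into text. A short sketch, assuming a decode method on the tokenizer, as exposed in later releases:

# Round-trip: token ids back to a string
decoded = tokenizer.decode(encoded.ids)
print(decoded)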

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
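
Assuming save writes my-bpe-vocab.json and my-bpe-merges.txt into the given directory (the filenames here are an assumption, mirroring the vocab/merges convention above), the trained tokenizer can then be reloaded like any pre-trained one:

from tokenizers import BPETokenizer

# Reload the files written by `save` above (assumed filenames)
tokenizer = BPETokenizer(
    "./path/to/directory/my-bpe-vocab.json",
    "./path/to/directory/my-bpe-merges.txt",
)
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)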

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
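
For example, the WordPiece variant only needs a BERT vocabulary file (the path below is a placeholder):

from tokenizers import BertWordPieceTokenizer

# A standard BERT vocab.txt, e.g. the one shipped with bert-base-uncased
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)  # WordPiece subwords, continuation pieces prefixed with ##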

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
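
encode_batch returns one encoding per input sentence, in input order, so the batch result unpacks naturally:

# Each element of the batch is its own encoding
for enc in encoded:
    print(enc.ids)
    print(enc.tokens)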

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
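
Once trained, the underlying model can be persisted and loaded back through the same models.BPE entry point used earlier. Both the .model attribute and the save(directory, name) signature below are assumptions, mirroring the high-level save shown before:

# Persist the trained BPE model (assumed to write
# custom-bpe-vocab.json and custom-bpe-merges.txt)
tokenizer.model.save("./path/to/output", "custom-bpe")

# Later, reload it exactly as in the pre-trained example
bpe = models.BPE.from_files(
    "./path/to/output/custom-bpe-vocab.json",
    "./path/to/output/custom-bpe-merges.txt",
)
tokenizer = Tokenizer(bpe)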

