
Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository (https://github.com/huggingface/tokenizers).

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the example after this list).
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
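
As a quick illustration of the last two points, here is a minimal sketch (the model name is only an example; any tokenizer from the Hub or a local tokenizer file works) of offset tracking and the built-in truncation and padding:

from tokenizers import Tokenizer

# Load a pretrained tokenizer (example model name; a local file works too)
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Built-in pre-processing: truncate and pad every encoding to 12 tokens
tokenizer.enable_truncation(max_length=12)
tokenizer.enable_padding(length=12)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
# Each token keeps its (start, end) character offsets into the original sentence
print(encoded.offsets)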

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
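
Once the build finishes, a quick sanity check (assuming the virtual env is still active) is to import the freshly built package and print its version:

python -c "import tokenizers; print(tokenizers.__version__)"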

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
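
The loaded tokenizer can be used right away; here is a short sketch (the sentence is just an example):

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
print(encoded.ids)

# And back to text
print(tokenizer.decode(encoded.ids))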

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using the corresponding vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
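
For example, here is a hedged sketch of training a BertWordPieceTokenizer, following the same pattern as the CharBPETokenizer example above (the file paths are placeholders):

from tokenizers import BertWordPieceTokenizer

# Initialize, train on plain text files, and save to a single JSON file
tokenizer = BertWordPieceTokenizer()
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
tokenizer.save("./path/to/directory/my-wordpiece.tokenizer.json")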

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.20.4rc0.tar.gz (344.8 kB): Source

Built Distributions

  • tokenizers-0.20.4rc0-cp39-abi3-win_amd64.whl (2.4 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.20.4rc0-cp39-abi3-win32.whl (2.2 MB): CPython 3.9+, Windows x86
  • tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_x86_64.whl (9.4 MB): CPython 3.9+, musllinux: musl 1.2+, x86-64
  • tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_i686.whl (9.2 MB): CPython 3.9+, musllinux: musl 1.2+, i686
  • tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_armv7l.whl (8.9 MB): CPython 3.9+, musllinux: musl 1.2+, ARMv7l
  • tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_aarch64.whl (9.0 MB): CPython 3.9+, musllinux: musl 1.2+, ARM64
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB): CPython 3.9+, manylinux: glibc 2.17+, x86-64
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB): CPython 3.9+, manylinux: glibc 2.17+, s390x
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+, ppc64le
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+, i686
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.8 MB): CPython 3.9+, manylinux: glibc 2.17+, ARMv7l
  • tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB): CPython 3.9+, manylinux: glibc 2.17+, ARM64
  • tokenizers-0.20.4rc0-cp39-abi3-macosx_11_0_arm64.whl (2.6 MB): CPython 3.9+, macOS 11.0+, ARM64
  • tokenizers-0.20.4rc0-cp39-abi3-macosx_10_12_x86_64.whl (2.7 MB): CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file tokenizers-0.20.4rc0.tar.gz.

File metadata

  • Download URL: tokenizers-0.20.4rc0.tar.gz
  • Upload date:
  • Size: 344.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.5

File hashes

Hashes for tokenizers-0.20.4rc0.tar.gz
Algorithm Hash digest
SHA256 d836aa6087064d622cc53548d378a9c85c741e61ab652f6804421395cb490c74
MD5 8f791b9687c49fc251970679ce26623d
BLAKE2b-256 2a35da6cf9aa8b41db5b46da2f09ed2fb67082e5c004250c8ba062135c5294f7

See more details on using hashes here.
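
For instance, a downloaded sdist can be checked against the SHA256 digest above with a few lines of Python (a hedged sketch; the filename assumes the archive sits in the current directory):

import hashlib

expected = "d836aa6087064d622cc53548d378a9c85c741e61ab652f6804421395cb490c74"
with open("tokenizers-0.20.4rc0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest == expected)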

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-win_amd64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b322b7124a6a5059d729a5e70049308cfce71b2503d808af8436fd2b4758f48e
MD5 f04a53ff3693e48dac4c76bfe8ac3363
BLAKE2b-256 00e66b078d5d346c727e9d364992270b4f84c981e2ae36d15026ff08ce586660

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-win32.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-win32.whl
Algorithm Hash digest
SHA256 5b211d340d810e761217eb5faec01eb856f5faa7bed6d996e16107af31e65dd3
MD5 2d2aca8f0d5ce0e856e6b3ee616cd6ee
BLAKE2b-256 83af5847c6800fedab04c283fed0234fe8f24f4925fa3f6751263cff86d8543d

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 1934779df8e73e067182ad57e114ae3be31605e9a27730110e504f71c402eb0f
MD5 bacfb21d63aa3f778328142c32017490
BLAKE2b-256 e20061ee625b45b088adb6e9fc7d836711de9b76056f1a10e85af837a6c20db6

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 fe123aa04844adf7204c2569d1a4f531223a004da857316e39e506c1181568ce
MD5 8355dac38301fb34b32029a66466da08
BLAKE2b-256 b7555249e6100791d637537d253a79ef12ca8f928b1ba7dac9a4b3e66633d25e

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 9d5dcca3c7997e04709b5a230de4a905b4648134c00205171b6b381f3052d9e0
MD5 9290dbce46db181590875f692536ecd1
BLAKE2b-256 1f81a195a9e9acc10bde3011f5d733bc7c4474a46ad01406955626ec2f324c2f

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 a9e37aa9cfa67a82cb8c2ddb07131f157325c77d6db7f2ac62b45a5ddf970516
MD5 2e875ec4f16511f61881d68da3906693
BLAKE2b-256 02d396b6139fcf5befa96c714ea7fa122ffe6099617a85687f39584f7a7518d2

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a02993fe0ec555cf11ca4561066a2c9f6cd199576b757488de3163b69a108745
MD5 cd86388ecc34b3a7dd5c8dc42271e958
BLAKE2b-256 7105e8289411bf6e5ce9a0614452c4270f43787264aec3f879a96c7116bd83d0

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 33297d5fcc1e57c7048945a4e4ac9074963076f4ba93f9d0f3ccc440ed819859
MD5 d24638ecdd72cf692192238c8635bfc3
BLAKE2b-256 fca01ad1d7ae5c9850c9673f2bc2927d689459d9d52c010577d395a6d323feca

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 c6cd443ec4ada587894bc879a25df83acf976453d4ef90b63eea70ef59074e0c
MD5 aadb059e44f98321ea8db048253b1ec4
BLAKE2b-256 3cb71ba6b989786cf15928da1abbaa27b1ee898f9dc6b4570dec119b947dcb32

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 49caa15567424d46ed687f0d1378c31f9cad6cea1b1954fd7ee1ad2ef7844539
MD5 22c96ad8d5acba1fc159174a903cd12f
BLAKE2b-256 c62eeb894001ea20edb70d8a6198535d6209ac1130c549825170dfa06f111140

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 bc617a01f98a81a1a3cf7a67fdecece0511ee6e32731a798592a7944ade50d15
MD5 d2d3997647a61beaa2fdb209eb885305
BLAKE2b-256 e3b58cfe49a357dea9a545f7a724eb38fbff74b9ede798efb889def7b6c04d35

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6f6697c0d5548509f875a94eb66d93323905301346828ca06b17d7b7bec6db95
MD5 9a1243ead07b2fe07be0661315ab5ca9
BLAKE2b-256 f209c9fd71f76417281bae4e0f9014e56f2d223c4ec8a82667386e2cf65148dd

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7f9683e499517277ccfcb9adca840b708a6e4e7cc51fd5dc36bed1c3ae58efc7
MD5 7e982ed67000d47e681f5383df1dbb71
BLAKE2b-256 c6fa2836dfb26c77c9f65e1b4406feecfc48621b88e3ef93a9a77c07b0ca3179

See more details on using hashes here.

File details

Details for the file tokenizers-0.20.4rc0-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for tokenizers-0.20.4rc0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 c69faa8e0465d7aaa98868c6577ec1a9df34cc31f83e7847d11efae5326fca3b
MD5 6dcd9e51cceca240c19dc2004ad32544
BLAKE2b-256 2e663db124d78d9e40fe5266355b65ab07723b719b2e392de9d4a4575f020063

See more details on using hashes here.
