
Reason this release was yanked:

Removing support for Python 3.7 and 3.8 is a breaking change and has been moved to v0.21.0.

Project description



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch after this list).
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
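
For example, the alignment tracking mentioned above means every token carries a (start, end) character span into the original input. A minimal sketch, assuming a tokenizer pulled from the Hub as shown in the sections below:

from tokenizers import Tokenizer

# Assumes network access to the Hub; any pretrained tokenizer works here
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# `offsets` holds (start, end) character spans into the original sentence;
# special tokens like [CLS] map to empty spans
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])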

Installation

With pip:

pip install tokenizers
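
Note that because this particular release was yanked, pip will not select it during normal resolution; it is only installed when the exact version is pinned:

pip install tokenizers==0.20.4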

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
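
A quick way to confirm the build succeeded (assuming the virtual env is still active):

python -c "import tokenizers; print(tokenizers.__version__)"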

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
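
Once loaded, the tokenizer can handle the pre-processing described in the features above. A small sketch of truncation and padding (the max_length value here is only an illustration):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

# Truncate long inputs and pad the batch to a uniform length
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")

batch = tokenizer.encode_batch(["Short sentence.", "A slightly longer sentence."])
print([e.tokens for e in batch])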

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
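
As an illustration, the BERT tokenizer loads the same way, given an existing WordPiece vocabulary file (the path below is a placeholder):

from tokenizers import BertWordPieceTokenizer

# Load from an existing BERT vocab file (placeholder path)
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)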

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
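
If your training data lives in memory rather than in files, the same trainer can also be fed from any iterator of strings; a minimal sketch:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)

# Any iterator of strings works, e.g. an in-memory list or a generator
data = ["I can feel the magic, can you?", "The quick brown fox."]
tokenizer.train_from_iterator(data, trainer=trainer)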

Now, when you want to use this tokenizer, it's as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
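
And decoding back to text goes through the ByteLevel decoder that was saved with the tokenizer:

# Turn the ids back into a string
print(tokenizer.decode(encoded.ids))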

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.20.4.tar.gz (343.0 kB): Source

Built Distributions

  • tokenizers-0.20.4-cp39-abi3-win_amd64.whl (2.4 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.20.4-cp39-abi3-win32.whl (2.2 MB): CPython 3.9+, Windows x86
  • tokenizers-0.20.4-cp39-abi3-musllinux_1_2_x86_64.whl (9.4 MB): CPython 3.9+, musllinux: musl 1.2+ x86-64
  • tokenizers-0.20.4-cp39-abi3-musllinux_1_2_i686.whl (9.2 MB): CPython 3.9+, musllinux: musl 1.2+ i686
  • tokenizers-0.20.4-cp39-abi3-musllinux_1_2_armv7l.whl (8.9 MB): CPython 3.9+, musllinux: musl 1.2+ ARMv7l
  • tokenizers-0.20.4-cp39-abi3-musllinux_1_2_aarch64.whl (9.0 MB): CPython 3.9+, musllinux: musl 1.2+ ARM64
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB): CPython 3.9+, manylinux: glibc 2.17+ x86-64
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB): CPython 3.9+, manylinux: glibc 2.17+ s390x
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+ ppc64le
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+ i686
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.8 MB): CPython 3.9+, manylinux: glibc 2.17+ ARMv7l
  • tokenizers-0.20.4-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB): CPython 3.9+, manylinux: glibc 2.17+ ARM64
  • tokenizers-0.20.4-cp39-abi3-macosx_11_0_arm64.whl (2.6 MB): CPython 3.9+, macOS 11.0+ ARM64
  • tokenizers-0.20.4-cp39-abi3-macosx_10_12_x86_64.whl (2.6 MB): CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file tokenizers-0.20.4.tar.gz.

File metadata

  • Download URL: tokenizers-0.20.4.tar.gz
  • Upload date:
  • Size: 343.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.5

File hashes

Hashes for tokenizers-0.20.4.tar.gz
  SHA256:      db50ac15e92981227f499268541306824f49e97dbeec05d118ebdc7c2d22322c
  MD5:         8b1d31c2d90e962e5bcd216532ff9a4b
  BLAKE2b-256: 1a980df883ea6201e35e286a97f5fb2a601bfb5b52e4165f7688a76e4553eeec
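
To check a downloaded file against the digests above, one option is Python's standard hashlib; a minimal sketch (the filename assumes the sdist listed here):

import hashlib

# Compare the local file's SHA256 against the published digest
with open("tokenizers-0.20.4.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest == "db50ac15e92981227f499268541306824f49e97dbeec05d118ebdc7c2d22322c")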

Hashes for the built distributions follow; each wheel lists the same three digests.

Hashes for tokenizers-0.20.4-cp39-abi3-win_amd64.whl
  SHA256:      6cba92b87969ddf5a7e2f2293577c30129d8c22c6f68e8c626d3e76b8d52412c
  MD5:         b5436ecc16a1fcf7ca12ce5eac9b58bd
  BLAKE2b-256: e0e3c7e4adf727ddcdefc1831945bf95fe07dd6b4b879613a62b5719be539ce4

Hashes for tokenizers-0.20.4-cp39-abi3-win32.whl
  SHA256:      60ea37c885a9bb8efa53b7542ea83561cd00eb3ffb47a77f5ae622d9f7f66ffe
  MD5:         a42d0ae88f60c0c003663864e159229e
  BLAKE2b-256: a11c0c9ebd40b4e6bb51b178cff6b148e985a728fc899d12f4a6ebe49044d4e2

Hashes for tokenizers-0.20.4-cp39-abi3-musllinux_1_2_x86_64.whl
  SHA256:      a6d392a20ca70692aaba8a636677b57f6c67655879773ba2b6be8cb4a19ce6b8
  MD5:         4cf4d109526dfa26ff7ec08132e78b5d
  BLAKE2b-256: 9e6241dc66bda0b1d45f36f324d01a3c4389caac0526d4b4ccc2d2006c7722b1

Hashes for tokenizers-0.20.4-cp39-abi3-musllinux_1_2_i686.whl
  SHA256:      84bf8b4a7bbf1c6bb78775ae309a5c69d08dadf7b88125d6d19ccb4738a87350
  MD5:         3dcacb8f3ed9cd0ff479150aee692640
  BLAKE2b-256: 94082d541e0e437e87e5cae903c55fc832fd032812c85e095bb7833d5fe65dfd

Hashes for tokenizers-0.20.4-cp39-abi3-musllinux_1_2_armv7l.whl
  SHA256:      e59a405459ed31b73426b364752c2e7c73f4a94210a63fd7acd161a774af7bd2
  MD5:         340b4aafe900ef5810bde1373a367a34
  BLAKE2b-256: 245661b8c0cd217a5be05c4206263afdfa937848796696fee7cc14fab6c94a49

Hashes for tokenizers-0.20.4-cp39-abi3-musllinux_1_2_aarch64.whl
  SHA256:      3e960ad5c467a95e5665e518151ed9024e7aa111d2c54ff1938162cc7c2b8959
  MD5:         f85ab29a15443688a40f7aedac8d07fd
  BLAKE2b-256: efbf135de1cf1c10f53a10c8db5f067e7cf71adebd79ed2b5cf5729eded209af

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      05c2bab579c1f31292b48bb79b6334b5346c1ec87dac81089e6098b8a20b2fd4
  MD5:         3bf8386c6572c5e8b6bd684171660744
  BLAKE2b-256: b91ccc1ca0c4e4c5763b2d872935de7149633765d158f0d1a46b48d73f3a9c58

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
  SHA256:      735ffc9bba65d20f8ab5f82dfbab262bb066afc7dee3684c5e5435e7a5da445d
  MD5:         033aafdfaa639c7b80afcf9293283ba4
  BLAKE2b-256: f6c934c6a304529a0b3375533e0ccc4ffd4bcfa401cb57cdab5997efaf9836fc

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
  SHA256:      eee647ccba9cbd36b5ec4e8e73d25dbd586ec06de7a43ff83a3dad9fec466a29
  MD5:         b07f0e8d997774bda07d769692ddbfba
  BLAKE2b-256: 6549e4647d5e496e64f2f639aeba9df134f59b648822a90d63ceb7b4b309ba4a

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
  SHA256:      aa392bae7f0a36e4c97ad43100390ad84f2a1bfff6742604774210f7d7a4fa13
  MD5:         04e25d398d821e8b523f07ddb24b13e8
  BLAKE2b-256: d0298dff1d57d1d1bcae5f0d701759ca3f503ca31ecd443f75571d6ef3083043

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
  SHA256:      075635cd7e6936cc4b3a13901c1a05690d5b533ce3d0f035dee21117dd4f04ae
  MD5:         26c7be01f2c1527bac1a7fff2c445e94
  BLAKE2b-256: b7d1e70b4497da0e97111ef75eba32526a98bf04268ce2a7388d9759ec89acdb

Hashes for tokenizers-0.20.4-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256:      7786004e180fab72e6e873e982ccd18b3cfa31521d397b6c024cc19175abf91b
  MD5:         787ca7e29a3cb69b98913ae04bb7fc36
  BLAKE2b-256: 66aa70de5a7c96621c1962e8399245efac6d6ab3bbe873457b9beb26c406c4f0

Hashes for tokenizers-0.20.4-cp39-abi3-macosx_11_0_arm64.whl
  SHA256:      f41df992797ad0ff9472e8a2c7a3ef7178667935d984639b73da7d19b33ea4e2
  MD5:         2347d99ef0ce3d53ac53be79efd3d412
  BLAKE2b-256: bb26a4b8dfe37c92205a4ba267d35ed7a6ca0ae1cbb2a1ed8acf84813cf48704

Hashes for tokenizers-0.20.4-cp39-abi3-macosx_10_12_x86_64.whl
  SHA256:      25f59ebc5b79e7bbafe86bfec62696468016627157d8a9ceba5092486796a156
  MD5:         ed0cebeae485c25ec830c98fc0163538
  BLAKE2b-256: b7a1c8285e835dd185307f40bc4a308224dc357ee4d25f3fd83ef013101a5060

See more details on using hashes here.
