
Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can go check it there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs (see the sketch right after this list).
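
As an illustration of those last two points, here is a minimal sketch: it trains a throwaway BPE in memory (the corpus, pre-tokenizer, and special tokens are arbitrary choices for the demo) and then inspects the offsets and the truncation/padding settings.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny throwaway BPE in memory, just to have something to encode with.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(["I can feel the magic, can you?"], trainer=trainer)

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Alignment tracking: each token carries the (start, end) character offsets
# of the slice of the original sentence it was produced from.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", repr(sentence[start:end]))

# Pre-processing: truncation and padding are each a single call.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(length=8, pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))
print(tokenizer.encode(sentence).tokens)  # always exactly 8 tokens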

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
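
If the download succeeds, the returned Tokenizer is immediately usable; a quick check might look like this:

# Quick sanity check on the tokenizer loaded just above.
print(tokenizer.get_vocab_size())

encoded = tokenizer.encode("Welcome to the Tokenizers library.")
print(encoded.tokens)
print(encoded.ids)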

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them from vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
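
These implementation tokenizers also expose decoding, so you can map ids back to a string; for example, reusing the encoded result from above:

# Map the ids back to a string (special tokens are skipped by default).
decoded = tokenizer.decode(encoded.ids)
print(decoded)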

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
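
The train call also takes the usual hyperparameters; the exact options vary per tokenizer, but a sketch with typical ones (the values here are illustrative only) looks like this:

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()

# Typical knobs: target vocabulary size, minimum pair frequency, and the
# special tokens to reserve ids for (values are illustrative only).
tokenizer.train(
    ["./path/to/files/1.txt", "./path/to/files/2.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<unk>"],
)

tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")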

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
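
For instance, BertWordPieceTokenizer follows the same pattern, except that it loads a single vocab.txt file instead of vocab.json/merges.txt (the path below is a placeholder):

from tokenizers import BertWordPieceTokenizer

# Load from an existing WordPiece vocabulary file (placeholder path).
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)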

Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Using this tokenizer later is then as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
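
Since this is a plain Tokenizer, padding and batch encoding work just like in the earlier sketch; a small example (the pad token and id are assumptions and should match whatever is in your trained vocabulary):

# Pad every sequence in the batch to the length of the longest one.
# The pad id/token below are assumptions; use values from your own vocabulary.
tokenizer.enable_padding(pad_id=0, pad_token="<pad>")

encodings = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "So magic!",
])
for encoding in encodings:
    print(encoding.tokens)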

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.21.0.tar.gz (343.0 kB): Source

Built Distributions

  • tokenizers-0.21.0-cp39-abi3-win_amd64.whl (2.4 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.21.0-cp39-abi3-win32.whl (2.2 MB): CPython 3.9+, Windows x86
  • tokenizers-0.21.0-cp39-abi3-musllinux_1_2_x86_64.whl (9.4 MB): CPython 3.9+, musllinux: musl 1.2+ x86-64
  • tokenizers-0.21.0-cp39-abi3-musllinux_1_2_i686.whl (9.2 MB): CPython 3.9+, musllinux: musl 1.2+ i686
  • tokenizers-0.21.0-cp39-abi3-musllinux_1_2_armv7l.whl (8.9 MB): CPython 3.9+, musllinux: musl 1.2+ ARMv7l
  • tokenizers-0.21.0-cp39-abi3-musllinux_1_2_aarch64.whl (9.0 MB): CPython 3.9+, musllinux: musl 1.2+ ARM64
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB): CPython 3.9+, manylinux: glibc 2.17+ x86-64
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB): CPython 3.9+, manylinux: glibc 2.17+ s390x
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+ ppc64le
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+ i686
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.8 MB): CPython 3.9+, manylinux: glibc 2.17+ ARMv7l
  • tokenizers-0.21.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB): CPython 3.9+, manylinux: glibc 2.17+ ARM64
  • tokenizers-0.21.0-cp39-abi3-macosx_11_0_arm64.whl (2.6 MB): CPython 3.9+, macOS 11.0+ ARM64
  • tokenizers-0.21.0-cp39-abi3-macosx_10_12_x86_64.whl (2.6 MB): CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file tokenizers-0.21.0.tar.gz.

File metadata

  • Download URL: tokenizers-0.21.0.tar.gz
  • Upload date:
  • Size: 343.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.5

File hashes

Hashes for tokenizers-0.21.0.tar.gz
Algorithm Hash digest
SHA256 ee0894bf311b75b0c03079f33859ae4b2334d675d4e93f5a4132e1eae2834fe4
MD5 d03aa5c857cb696ab19545505b9f92dc
BLAKE2b-256 2041c2be10975ca37f6ec40d7abd7e98a5213bb04f284b869c1a24e6504fd94d


File details

Details for the file tokenizers-0.21.0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 87841da5a25a3a5f70c102de371db120f41873b854ba65e52bccd57df5a3780c
MD5 3624df9173b0fa2ad62013a39f72c73a
BLAKE2b-256 4469d21eb253fa91622da25585d362a874fa4710be600f0ea9446d8d0217cec1


File details

Details for the file tokenizers-0.21.0-cp39-abi3-win32.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-win32.whl
Algorithm Hash digest
SHA256 eb1702c2f27d25d9dd5b389cc1f2f51813e99f8ca30d9e25348db6585a97e24a
MD5 9188b8fdcbb843e36220f99c4e8b81b0
BLAKE2b-256 15b0dc4572ca61555fc482ebc933f26cb407c6aceb3dc19c301c68184f8cad03


File details

Details for the file tokenizers-0.21.0-cp39-abi3-musllinux_1_2_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 4145505a973116f91bc3ac45988a92e618a6f83eb458f49ea0790df94ee243ff
MD5 dfbbe50dec074a64ef32f1d644aa843d
BLAKE2b-256 18073e88e65c0ed28fa93aa0c4d264988428eef3df2764c3126dc83e243cb36f


File details

Details for the file tokenizers-0.21.0-cp39-abi3-musllinux_1_2_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 c87ca3dc48b9b1222d984b6b7490355a6fdb411a2d810f6f05977258400ddb74
MD5 ea41a5ca79575874d5e6ab799429544c
BLAKE2b-256 d8eece83d5ec8b6844ad4c3ecfe3333d58ecc1adc61f0878b323a15355bcab24


File details

Details for the file tokenizers-0.21.0-cp39-abi3-musllinux_1_2_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 089d56db6782a73a27fd8abf3ba21779f5b85d4a9f35e3b493c7bbcbbf0d539b
MD5 287254038b0ced1830e8f95d72d0e7cc
BLAKE2b-256 f7f3b776061e4f3ebf2905ba1a25d90380aafd10c02d406437a8ba22d1724d76


File details

Details for the file tokenizers-0.21.0-cp39-abi3-musllinux_1_2_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 eb7202d231b273c34ec67767378cd04c767e967fda12d4a9e36208a34e2f137e
MD5 35862cf2955d2487be9b81966de2d273
BLAKE2b-256 c86954a0aee4d576045b49a0eb8bffdc495634309c823bf886042e6f46b80058


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e84ca973b3a96894d1707e189c14a774b701596d579ffc7e69debfc036a61a04
MD5 805817b4adbc8e61b9484362ac488ab5
BLAKE2b-256 220669d7ce374747edaf1695a4f61b83570d91cc8bbfc51ccfecf76f56ab4aac


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 400832c0904f77ce87c40f1a8a27493071282f785724ae62144324f171377273
MD5 180b35cd8597b69c8676038401b67add
BLAKE2b-256 814207600892d48950c5e80505b81411044a2d969368cdc0d929b1c847bf6697


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 d8b09dbeb7a8d73ee204a70f94fc06ea0f17dcf0844f16102b9f414f0b7463ba
MD5 c550654c9b548a47bf1b8a29a1b7076c
BLAKE2b-256 4df65ed6711093dc2c04a4e03f6461798b12669bc5a17c8be7cce1240e0b5ce8


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 9aeb255802be90acfd363626753fda0064a8df06031012fe7d52fd9a905eb00e
MD5 491d4fc72dc54271d07054d3f9dd3d9e
BLAKE2b-256 578b7da5e6f89736c2ade02816b4733983fca1c226b0c42980b1ae9dc8fcf5cc


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 6b43779a269f4629bebb114e19c3fca0223296ae9fea8bb9a7a6c6fb0657ff8e
MD5 97855b6b0e6e1656845e3b5ce395c0ad
BLAKE2b-256 7edb3433eab42347e0dc5452d8fcc8da03f638c9accffefe5a7c78146666964a


File details

Details for the file tokenizers-0.21.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6b177fb54c4702ef611de0c069d9169f0004233890e0c4c5bd5508ae05abf193
MD5 3d93aac1a8f5a5f8b470c05273fc258b
BLAKE2b-256 f71483429177c19364df27d22bc096d4c2e431e0ba43e56c525434f1f9b0fd00


File details

Details for the file tokenizers-0.21.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f53ea537c925422a2e0e92a24cce96f6bc5046bbef24a1652a5edc8ba975f62e
MD5 05aca62284c1e246bebd64105bc3c4ce
BLAKE2b-256 227a88e58bb297c22633ed1c9d16029316e5b5ac5ee44012164c2edede599a5e


File details

Details for the file tokenizers-0.21.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3c4c93eae637e7d2aaae3d376f06085164e1660f89304c0ab2b1d08a406636b2
MD5 1a9c7b080e453bb6e3e0fe0820d4fc61
BLAKE2b-256 b05c8b09607b37e996dc47e70d6a7b6f4bdd4e4d5ab22fe49d7374565c7fefaf

