Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the short sketch right after this list).
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
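
As a quick illustration of the alignment tracking mentioned above, here is a minimal sketch that prints, for each token, the slice of the original sentence it came from. It assumes the `bert-base-cased` tokenizer loaded from the Hub, as shown further below:

from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hub (requires network access)
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each token carries (start, end) character offsets into the original sentence;
# special tokens like [CLS] and [SEP] map to the empty (0, 0) span.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", repr(sentence[start:end]))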

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can also use an existing one)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
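
To check that the build and installation succeeded, you can print the installed version from inside the virtual env (a minimal sanity check, nothing more):

# Should print the version of the freshly built package
import tokenizers
print(tokenizers.__version__)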

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
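
Once loaded, the same object can also take care of padding and truncation when encoding batches. Here is a hedged sketch; the parameter values below are illustrative choices, not required defaults:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

# Pad every sequence in a batch to the longest one, and cap the length at 128 tokens
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=128)

encodings = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
print([e.ids for e in encodings])
print([e.attention_mask for e in encodings])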

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
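
If you want the text back, the same tokenizer can decode the ids again (a short sketch, reusing the `encoded` result from above):

# Decode the ids back to a string (special tokens are skipped by default)
print(tokenizer.decode(encoded.ids))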

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
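
The saved file is a plain JSON serialization of the full tokenizer, so it can later be reloaded with the generic Tokenizer class (a minimal sketch, using the path from the example above):

from tokenizers import Tokenizer

# Reload the trained tokenizer from the JSON file saved above
restored = Tokenizer.from_file("./path/to/directory/my-bpe.tokenizer.json")
print(restored.encode("I can feel the magic, can you?").tokens)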

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
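
For instance, here is a hedged sketch of training the BertWordPieceTokenizer from scratch; the file paths are placeholders:

from tokenizers import BertWordPieceTokenizer

# Initialize an empty WordPiece tokenizer with BERT-style defaults
tokenizer = BertWordPieceTokenizer()

# Train it on your own text files
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# And save the result
tokenizer.save("./bert-wordpiece.tokenizer.json")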

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and easily adapt them to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, using this tokenizer is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
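
As a quick check, you can print the tokens and decode the ids back; with the ByteLevel decoder configured above, you should get back a string very close to the original input (the add_prefix_space option may introduce a leading space):

print(encoded.tokens)

# The ByteLevel decoder turns the ids back into readable text
print(tokenizer.decode(encoded.ids))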

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.21.0rc0.tar.gz (343.1 kB): Source

Built Distributions

  • tokenizers-0.21.0rc0-cp39-abi3-win_amd64.whl (2.4 MB): CPython 3.9+, Windows x86-64
  • tokenizers-0.21.0rc0-cp39-abi3-win32.whl (2.2 MB): CPython 3.9+, Windows x86
  • tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_x86_64.whl (9.4 MB): CPython 3.9+, musllinux: musl 1.2+, x86-64
  • tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_i686.whl (9.2 MB): CPython 3.9+, musllinux: musl 1.2+, i686
  • tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_armv7l.whl (8.9 MB): CPython 3.9+, musllinux: musl 1.2+, ARMv7l
  • tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_aarch64.whl (9.0 MB): CPython 3.9+, musllinux: musl 1.2+, ARM64
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB): CPython 3.9+, manylinux: glibc 2.17+, x86-64
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.4 MB): CPython 3.9+, manylinux: glibc 2.17+, s390x
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+, ppc64le
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl (3.1 MB): CPython 3.9+, manylinux: glibc 2.17+, i686
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (2.8 MB): CPython 3.9+, manylinux: glibc 2.17+, ARMv7l
  • tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.9 MB): CPython 3.9+, manylinux: glibc 2.17+, ARM64
  • tokenizers-0.21.0rc0-cp39-abi3-macosx_11_0_arm64.whl (2.6 MB): CPython 3.9+, macOS 11.0+, ARM64
  • tokenizers-0.21.0rc0-cp39-abi3-macosx_10_12_x86_64.whl (2.6 MB): CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file tokenizers-0.21.0rc0.tar.gz.

File metadata

  • Download URL: tokenizers-0.21.0rc0.tar.gz
  • Upload date:
  • Size: 343.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.5

File hashes

Hashes for tokenizers-0.21.0rc0.tar.gz
Algorithm Hash digest
SHA256 8c696e87870035ea60209d348892c17b31359321ffa636033792cf31a154e274
MD5 6e4d5ed307a75df6c7c565c6ab7a5c24
BLAKE2b-256 1f4f46c4299ad19d83e8638843e60af535d575b2c1f57204e3a4c474df7d5e4c


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 6a3c1a395740dadf48fced64ff115eb25dc09fa1688cd812d7f06e1e97612757
MD5 1e5cb9294f049f3c0c9c25e3b3c8945f
BLAKE2b-256 a778b1733f1f14ffa2adb42ba02ee1c177ef06280d1c26171b0110141729a2ac


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-win32.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-win32.whl
Algorithm Hash digest
SHA256 2596f179b4568e026b00b230c3e6fa15393ad5cb9351bd48c4db89f323dd04e7
MD5 0e8cfffdf604ee2c1010664c1ed1da7b
BLAKE2b-256 e0d2c30551fe2c98e93188027b332d127cdc675703bfd002c373fe3ad8c2f3ca


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 8b194cf7123eda46f30d1e1ba72357c5cee61f6183abaeea0aa490af92b0ee6b
MD5 ad5bf59271e65642e4a75aded2703036
BLAKE2b-256 65a5e1bdd590b9890012c08cf824a11631ed7719fa9fbb239958fc5deba0d4e8


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 4ea67ce1b93cd4443790e7984f6f020e6a98cd9656a512532fca2b57c505efc6
MD5 0b3290ca253f81de2d78af257185b1aa
BLAKE2b-256 ead33d643892e817c53652ec63228a72f555c8712e34f873e547b3cf9b5ab237


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 f1e849a154ecb71f9c79309dd625c440d6aeb5c061d9c4fae8e4879b2d6f1c0d
MD5 bae5b2210b6eef58d4ff222a4d605f51
BLAKE2b-256 a8ec1028747e4f1f1d208b45a177cd797fa2716ec619dc8ab37878a5d74ddd9c


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 b381d1b259a21d37670d7375fd0d8aa354ccbefc67db8898852548b1a69446c5
MD5 cac85ab0aec07177854551c1ca9e92fc
BLAKE2b-256 6df73d19db0f8793c6d049e96ed470e8fafeba52ee2d5fd2e42fb3935605db00


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 12cba85ea7bef58f1f77d69387dc3a55a0f38229511c080b43c52a7f8f2a7ae8
MD5 9f139a4c1fda5af9823fdde4d9293ab6
BLAKE2b-256 f146ba94ea2634ff41d77ee9cd52e47233d58a24bd28aa6d12b147bc0d3259c8


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 4674668e269ea02b9afe06c3867dc568ce5e40f50046adc0878edf212b46c26a
MD5 a0d5911678295b5b14d352fd3dc09587
BLAKE2b-256 83ed10f0d258450c37e5a83db1d266aa2535da8bf6ead56118cce55d95894da0


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 22f61f9d615d822aa21919430c9cd949b4fbf15d10d59629e72d290ac032bde0
MD5 d364ac5cc444962777c33e36c7b5ecb1
BLAKE2b-256 13ca1e7f12d99cb7d36cae6dd22e22e1e362be9f49e9615fe16614996290af4d


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 17161d996ed86740a9fa488c7314b077b19336bc63abb4ba4bfdeb29cf3492f8
MD5 d5d1ea8e5ad0ccd47a167a1c00b77bfd
BLAKE2b-256 dae78272f9f506c2e9ef3ea9e4bee3dde35827331910920b99515bf5a5aaa863


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 48b4c0579bb32a1a3091392b493d9daa090301c937e99bdda3808068cc5a07c8
MD5 acdc44d7a8c18fd83b1de1e61eed78bc
BLAKE2b-256 77f775ed0053b64023099582a975af27db173421979569d68dd72dfcbe684f0d


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 28ce6cf26f3a264f5264d9663205f9a5430d0de3cc7131dde0d12b26f42546bc
MD5 0cf694a765070fd276fb6dbfbb3466f8
BLAKE2b-256 498b8b5a865513a3c824c56222b64d789673fc4ce8a8477d0425aca446488a65


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f918620e59ff6a9617d69de5fbe33aee614269e6cc5e6fe406e5d510e41deed3
MD5 c64b46816078007be247aeb5739e6077
BLAKE2b-256 01b2bc102d3ac17e796ba128db6834fd6088e34ce6aaf894a98d49f92547dbfd


File details

Details for the file tokenizers-0.21.0rc0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tokenizers-0.21.0rc0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 45c19bae62ed4e7517aac0778e19b95b98e8ba3e712ca16ee8bd2c0132468b41
MD5 f43044157fb1472aa6f5883d1ba7f06b
BLAKE2b-256 44cef7feec2e5ce6b86efcc99a868b86410b7d9f79bd5c5057b0ae0b48104255

