
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
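To illustrate what alignment tracking makes possible, here is a minimal pure-Python sketch (not the library's implementation) of a whitespace pre-tokenizer that records character offsets, so every token can be mapped back to its span in the original sentence:

```python
import re

def pretokenize_with_offsets(text):
    """Split on whitespace, keeping each token's (start, end) character offsets."""
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\S+", text)]

sentence = "I can feel the magic, can you?"
for token, (start, end) in pretokenize_with_offsets(sentence):
    # The offsets always point back into the original sentence.
    assert sentence[start:end] == token
```

The library does the same bookkeeping through normalization and tokenization, which is why an encoding can always report where each token came from.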

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
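To give an idea of what the BPE variants above have in common, here is a toy pure-Python sketch (assumed names, not the library's API) of BPE encoding: a word starts out as individual characters, and learned merge rules are applied in priority order until none apply:

```python
def bpe_encode(word, merges):
    """Apply BPE merges (ordered by priority) to a word, returning its subwords."""
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [(ranks.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float("inf"):
            break  # no learned merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Merges as they might appear in a merges.txt file, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # -> ['low', 'er']
```

The byte-level and SentencePiece variants differ mainly in how the input is pre-processed into symbols before this merge loop runs.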

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
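To clarify what the trainer's min_frequency parameter controls, here is a simplified pure-Python sketch (hypothetical helper, not the library's internals) of one BPE training step: count adjacent symbol pairs across the corpus, weighted by word frequency, and pick the most frequent pair to merge, provided it occurs often enough:

```python
from collections import Counter

def most_frequent_pair(words, min_frequency=2):
    """One BPE training step: count adjacent symbol pairs across words
    (given as {symbol_tuple: frequency}) and return the best pair to merge,
    or None if no pair meets min_frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    pair, count = counts.most_common(1)[0]
    return pair if count >= min_frequency else None

# Word frequencies, with words pre-split into characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 1}
print(most_frequent_pair(words))  # -> ('l', 'o')
```

Training repeats this step, merging the chosen pair everywhere and re-counting, until vocab_size merges have been learned or no pair reaches min_frequency.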

Project details


Release history

This version

0.3.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.3.0.tar.gz (62.3 kB, Source)

Built Distributions

  • tokenizers-0.3.0-cp38-cp38-win_amd64.whl (1.0 MB, CPython 3.8, Windows x86-64)
  • tokenizers-0.3.0-cp38-cp38-manylinux1_x86_64.whl (7.4 MB, CPython 3.8, manylinux1 x86-64)
  • tokenizers-0.3.0-cp38-cp38-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.15+ x86-64)
  • tokenizers-0.3.0-cp37-cp37m-win_amd64.whl (1.0 MB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.3.0-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB, CPython 3.7m, manylinux1 x86-64)
  • tokenizers-0.3.0-cp37-cp37m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.15+ x86-64)
  • tokenizers-0.3.0-cp36-cp36m-win_amd64.whl (1.0 MB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.3.0-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB, CPython 3.6m, manylinux1 x86-64)
  • tokenizers-0.3.0-cp36-cp36m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.15+ x86-64)
  • tokenizers-0.3.0-cp35-cp35m-win_amd64.whl (1.0 MB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.3.0-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB, CPython 3.5m, manylinux1 x86-64)
  • tokenizers-0.3.0-cp35-cp35m-macosx_10_15_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.15+ x86-64)

File details

All files below were uploaded via twine/3.1.1 (pkginfo/1.5.0.1, requests/2.22.0, setuptools/41.2.0, requests-toolbelt/0.9.1, tqdm/4.42.1) running on CPython 3.8.1, without Trusted Publishing.

File hashes

Hashes for tokenizers-0.3.0.tar.gz
  • SHA256: d8d9077bc4863252c3bc61b508dd7ed1ad82b5e9d0974416d0d8be8e7c339e6c
  • MD5: 4808864ff90ede2769cd0b6323677cea
  • BLAKE2b-256: 9e0bb01fa1dac8037ac18246c991f262763665de35bcb9e75d5426578ba2d0c2

Hashes for tokenizers-0.3.0-cp38-cp38-win_amd64.whl
  • SHA256: 2a9cdb2680640937c749ef21ec7286833d2f8d8661eed113f96078df6e56a1d2
  • MD5: 1139b8a9f18c67c16e2f3e227fd3d68d
  • BLAKE2b-256: d8821f048a36afbe32210fb0c66e7d5ec6af8ce73299e0e8a5641a6b90dab561

Hashes for tokenizers-0.3.0-cp38-cp38-manylinux1_x86_64.whl
  • SHA256: ec5227948d4f903a8173af13956aed909748b35928e2ed306b59da3ab3d0ef67
  • MD5: 1376281f3f8ef50788d68c910818d9ac
  • BLAKE2b-256: eb2e0365a12dee40eb2aaa4b2acbe1e5ec5481deddb8d5f9d9cc8fa81a9a8cec

Hashes for tokenizers-0.3.0-cp38-cp38-macosx_10_15_x86_64.whl
  • SHA256: 386e68dde18b438947344f7f6911fc1d7f9d19b0fd1033cd54c02d7d353745d4
  • MD5: 2c7707c2e239f0ab6b2f47418d483de1
  • BLAKE2b-256: 6b7f478fafd319bd68d110743a6e96e492d5e862026b58935d78d19a45812e0a

Hashes for tokenizers-0.3.0-cp37-cp37m-win_amd64.whl
  • SHA256: b40e1c8f29419062f39696ce6742e81824b6f3ddbab14b4305810fb32e910075
  • MD5: 35140417191abd4ed6696dcd45758c13
  • BLAKE2b-256: c4904dadb624963367459088941c17ee8d7dc2c8cbc8502e01628b8013372759

Hashes for tokenizers-0.3.0-cp37-cp37m-manylinux1_x86_64.whl
  • SHA256: b12d63b6f941bf0d5091ce1e19789765d3e9052401dd2f01ddcb918f86627883
  • MD5: 8f47dd94ef8379ab1af83be3238712c4
  • BLAKE2b-256: 777de736fd090c676c237b0c2544f797ffa92f2e99951cc6d8dd084595ec0717

Hashes for tokenizers-0.3.0-cp37-cp37m-macosx_10_15_x86_64.whl
  • SHA256: c50e718386dd6c9bc8f08315a69eec0f29cca870e9717bf89cda9b9e4f0b4dca
  • MD5: 09b37ca1fe0a30d307bbf73d36c102fc
  • BLAKE2b-256: 6af845c26b94fa1e0697b8c346b294533c1be27b2d08e71f2e929c82e2d83060

Hashes for tokenizers-0.3.0-cp36-cp36m-win_amd64.whl
  • SHA256: c97b8f71a2a13e62d38f8bd63cb246a2078c54f57b2ec38fe6c587272ba579db
  • MD5: 3efef2968fa19ad47799f49a1cbf9de1
  • BLAKE2b-256: 5eaa96de292e2f9025c666397cc331569a13f8b3064dd24cc6b581a13a2c7082

Hashes for tokenizers-0.3.0-cp36-cp36m-manylinux1_x86_64.whl
  • SHA256: d672e0661a9f85d577d4c046902c6e8e7e6930e1ad4e7baf5b5e47b4c33f9e3b
  • MD5: f1a229b430bce396bad23e56dc86675e
  • BLAKE2b-256: 765ffe07dde4b4523fb361ccb7ff00fccb051321dd127dcd355ed9aff9e3641c

Hashes for tokenizers-0.3.0-cp36-cp36m-macosx_10_15_x86_64.whl
  • SHA256: 9279bcd7650e2ff91ac9554b013d8db73585815d2604c67e0f59265005f1516c
  • MD5: c7d7ac0c53ce45dffd0a01e40a32fd14
  • BLAKE2b-256: 95cf2e30124c9202aad6f7500ed5ff59e4a11659291721d6cd9dd0330b318da2

Hashes for tokenizers-0.3.0-cp35-cp35m-win_amd64.whl
  • SHA256: 6453b8ad81fa961231ed5ad05ceba35a2c10facb0f0edd3615f0c5da8e2c7dc9
  • MD5: 323f4b3f43406439a04a98275a700e1e
  • BLAKE2b-256: 861f477e67f48dd32b85d1cdd1e24ab45edce458f79b14c425d789d3009c77f2

Hashes for tokenizers-0.3.0-cp35-cp35m-manylinux1_x86_64.whl
  • SHA256: 5bc9f70baeae7137bbaf7d7e3cc15583709b3937997023b09477be52be72bc2c
  • MD5: 097273932b82946aef39f86f316d20fa
  • BLAKE2b-256: ffc0acb84c8957c7c00c109039d349806e62b43cb27bd3ecac69947225bc6d0b

Hashes for tokenizers-0.3.0-cp35-cp35m-macosx_10_15_x86_64.whl
  • SHA256: cba08973c95961376c4b48063b6b349cf51e38514ac96525bbda7ce7d1d76f06
  • MD5: c04030aa7387d041d5a8ab2e371a23e1
  • BLAKE2b-256: b051de97cafe3c7f0e47096bf3bb4f54d6a99fc65b1b4b3ef1db8393a41c0a3e
