
Fast and Customizable Tokenizers

Project description





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
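The alignment tracking mentioned above can be illustrated in plain Python, with no install required: a tokenizer that tracks alignments returns, for each token, a (start, end) character span into the original string, so slicing recovers the exact source text. The offsets below are hypothetical, of the kind a whitespace pre-tokenizer might produce:

```python
# Illustration of offset/alignment tracking (plain Python, no dependencies).
# A tokenizer that tracks alignments returns, for each token, the (start, end)
# character span in the ORIGINAL string, even after normalization.

def spans_to_tokens(text, offsets):
    """Recover the original substring behind each token offset."""
    return [text[start:end] for start, end in offsets]

text = "I can feel the magic"
# Hypothetical offsets, such as a whitespace pre-tokenizer might produce:
offsets = [(0, 1), (2, 5), (6, 10), (11, 14), (15, 20)]
print(spans_to_tokens(text, offsets))
# ['I', 'can', 'feel', 'the', 'magic']
```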

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
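The two files written out above follow the usual on-disk BPE serialization: a vocab JSON mapping each token string to its integer id, and a merges text file listing one learned merge per line, in the order the merges were learned. Here is a minimal standard-library sketch of reading such files back; the exact file contents (and filenames) shown are illustrative assumptions, not output copied from the library:

```python
import json

# Hypothetical file contents in the usual BPE on-disk format:
vocab_json = '{"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}'  # vocab file (token -> id)
merges_txt = "l o\nlo w\n"                                  # merges file, one rule per line

vocab = json.loads(vocab_json)
merges = [tuple(line.split()) for line in merges_txt.splitlines()]
print(vocab["low"], merges)
# 4 [('l', 'o'), ('lo', 'w')]
```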

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
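For intuition, here is a toy, pure-Python sketch of what BPE training does under the hood: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. This is an illustration only, not the library's Rust implementation; the corpus and number of merges are made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of space-separated symbol sequences."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Apply one merge: join every adjacent occurrence of `pair` into one symbol."""
    # Note: naive string replace is fine for this toy corpus, but it does not
    # respect symbol boundaries in general.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with frequencies.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)  # learned merge rules, most frequent first
```

The real trainer does the same thing at scale (and with proper tie-breaking and boundary handling), which is where the Rust implementation's speed matters.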

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
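The "truncate, pad" pre-processing from the feature list can be sketched in plain Python. `truncate_and_pad` is a hypothetical helper for illustration, not part of the tokenizers API:

```python
def truncate_and_pad(ids, max_length, pad_id=0):
    """Truncate a list of token ids to max_length, then right-pad with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(truncate_and_pad([5, 8, 13, 21, 34], 4))  # truncated: [5, 8, 13, 21]
print(truncate_and_pad([5, 8], 4))              # padded: [5, 8, 0, 0]
```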

Download files

Download the file for your platform. All files below were uploaded via twine/3.1.1 from CPython/3.8.1.

Source Distribution

  • tokenizers-0.2.0.tar.gz (58.9 kB): Source
    SHA256: 9ee022c0d51a80b62eba75f4ab55f8c274d0c63b38d59a032b528d292415f176

Built Distributions

  • tokenizers-0.2.0-cp38-cp38-win_amd64.whl (1.0 MB): CPython 3.8, Windows x86-64
    SHA256: bb8b7d6d0cf9a326402c89519bbe5d6bdeebec5f5d77987855d9ea82800a3ca3
  • tokenizers-0.2.0-cp38-cp38-manylinux1_x86_64.whl (7.3 MB): CPython 3.8, manylinux1 x86-64
    SHA256: 1b5a54f3ce8eae18dfb8292bf625cc0c764242c6b78c62e19e86a8cf07022590
  • tokenizers-0.2.0-cp38-cp38-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.8, macOS 10.13+ x86-64
    SHA256: 2d711390157f370ed1c44919c14631ca2b8ce9ae9283fa03d961b475113ddd20
  • tokenizers-0.2.0-cp37-cp37m-win_amd64.whl (1.0 MB): CPython 3.7m, Windows x86-64
    SHA256: 20bda3fd07651b3911a8f078c8e1d10ee64ad68a1bdb105741e2f46b6d67c981
  • tokenizers-0.2.0-cp37-cp37m-manylinux1_x86_64.whl (5.5 MB): CPython 3.7m, manylinux1 x86-64
    SHA256: d70c68101ea245f4d197eb91df4124ffe5df8ab2e87c7d33c46ad77fd333e3d3
  • tokenizers-0.2.0-cp37-cp37m-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.7m, macOS 10.13+ x86-64
    SHA256: 75dfd683cb8f5829e19fc08050e41ff728b610e3feb0f253fb65efc445f63117
  • tokenizers-0.2.0-cp36-cp36m-win_amd64.whl (1.0 MB): CPython 3.6m, Windows x86-64
    SHA256: 5ba707296e2d0844d01e3f28b99e5bd5d5cef3808ea7014b59e869f9785d198f
  • tokenizers-0.2.0-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB): CPython 3.6m, manylinux1 x86-64
    SHA256: 46fede12a19c06fe8db7461b347b6530dac49b263fed2b24274ae844559617be
  • tokenizers-0.2.0-cp36-cp36m-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.6m, macOS 10.13+ x86-64
    SHA256: c7f534d4d52fb8b79cb26f65e007a56a8ea56f23d2ef504cf6e025c57908b041
  • tokenizers-0.2.0-cp35-cp35m-win_amd64.whl (1.0 MB): CPython 3.5m, Windows x86-64
    SHA256: be7d12397391a53024a1a75ede801053fbf76a4ece4732832aa12e4dd4536512
  • tokenizers-0.2.0-cp35-cp35m-manylinux1_x86_64.whl (1.8 MB): CPython 3.5m, manylinux1 x86-64
    SHA256: 045e7834513afd46a0298c67bfae08808ad10bd0d46a2185c8120127efb3c8ea
  • tokenizers-0.2.0-cp35-cp35m-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.5m, macOS 10.13+ x86-64
    SHA256: a753a92d311e3c448206d0b28ec39be92e1fc8157ee48e4ee7aeb48d1eeb1b04
