Fast and Customizable Tokenizers


Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main tokenizers repository.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using four pre-made tokenizers (BERT WordPiece and the three most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
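To make the alignment-tracking idea concrete, here is a toy illustration in plain Python (a sketch of the concept, not the library's Rust implementation): a whitespace tokenizer that records, for every token, the character span it came from, so each token can always be traced back to the original sentence.

```python
def tokenize_with_offsets(text):
    """Toy whitespace tokenizer that records (start, end) character offsets."""
    tokens, offsets = [], []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                tokens.append(text[start:i])
                offsets.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append(text[start:])
        offsets.append((start, len(text)))
    return tokens, offsets

text = "I can feel the magic"
tokens, offsets = tokenize_with_offsets(text)
# Every token maps back to a slice of the original sentence:
for tok, (s, e) in zip(tokens, offsets):
    assert text[s:e] == tok
```

The real library keeps this kind of offset information through normalization and pre-tokenization as well, which is what the alignment-tracking bullet above refers to.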

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
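The relationship between `encoded.ids` and `encoded.tokens` is a plain vocabulary lookup. With a toy vocabulary (the ids below are hypothetical, just to show the correspondence):

```python
# Toy vocabulary with hypothetical ids -- a real vocab.json holds the same
# token -> id mapping, only much larger.
vocab = {"I": 0, "can": 1, "feel": 2, "the": 3, "magic": 4, ",": 5, "you": 6, "?": 7}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["I", "can", "feel", "the", "magic", ",", "can", "you", "?"]
ids = [vocab[t] for t in tokens]

assert ids == [0, 1, 2, 3, 4, 5, 1, 6, 7]
# The mapping is invertible, which is what decoding relies on:
assert [id_to_token[i] for i in ids] == tokens
```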

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
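For reference, a saved BPE model consists of two plain files: a `vocab.json` mapping tokens to ids, and a `merges.txt` listing merge rules in priority order. A minimal sketch of reading them back (the contents below are toy values, not a real trained vocabulary):

```python
import io
import json

# Toy stand-ins for the two files a BPE model is made of (assumption:
# real trained files are far larger, but have this shape).
vocab_json = '{"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}'
merges_txt = "l o\nlo w\n"

vocab = json.loads(vocab_json)  # token -> id
merges = [tuple(line.split()) for line in io.StringIO(merges_txt)]  # ordered merge rules

assert vocab["low"] == 4
assert merges[0] == ("l", "o")
```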

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
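The core of BPE training is easy to sketch in plain Python (a simplified illustration of the algorithm, not the library's Rust implementation): start from characters, then repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Simplified BPE training. word_freqs maps each word to its frequency."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

merges = learn_bpe_merges({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
```

The provided tokenizers differ mainly in what counts as a base symbol (characters, bytes, or SentencePiece-style pieces) and in how the text is pre-tokenized before this loop runs.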

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
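The byte-level trick behind `ByteLevel` can be illustrated in a few lines (a conceptual sketch, not the actual implementation): by operating on the UTF-8 bytes of the text rather than its characters, a base alphabet of only 256 symbols covers every possible input string, so there are never unknown tokens.

```python
text = "I can feel the magic \u2728"

# Byte-level pre-tokenization works on UTF-8 bytes, not characters:
byte_symbols = list(text.encode("utf-8"))

# Only 256 distinct base symbols are ever needed, whatever the input:
assert all(0 <= b < 256 for b in byte_symbols)

# And the mapping is lossless -- the original string is always recoverable,
# which is what the ByteLevel decoder exploits:
assert bytes(byte_symbols).decode("utf-8") == text
```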

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.4.0.tar.gz (62.6 kB): Source

Built Distributions

  • tokenizers-0.4.0-cp38-cp38-win_amd64.whl (1.0 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.4.0-cp38-cp38-manylinux1_x86_64.whl (7.4 MB): CPython 3.8, manylinux1 x86-64
  • tokenizers-0.4.0-cp38-cp38-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.8, macOS 10.15+ x86-64
  • tokenizers-0.4.0-cp37-cp37m-win_amd64.whl (1.0 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.4.0-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB): CPython 3.7m, manylinux1 x86-64
  • tokenizers-0.4.0-cp37-cp37m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.7m, macOS 10.15+ x86-64
  • tokenizers-0.4.0-cp36-cp36m-win_amd64.whl (1.1 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB): CPython 3.6m, manylinux1 x86-64
  • tokenizers-0.4.0-cp36-cp36m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.6m, macOS 10.15+ x86-64
  • tokenizers-0.4.0-cp35-cp35m-win_amd64.whl (1.1 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.4.0-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB): CPython 3.5m, manylinux1 x86-64
  • tokenizers-0.4.0-cp35-cp35m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.5m, macOS 10.15+ x86-64

File details

Details for the file tokenizers-0.4.0.tar.gz.

File metadata

  • Download URL: tokenizers-0.4.0.tar.gz
  • Upload date:
  • Size: 62.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for tokenizers-0.4.0.tar.gz
Algorithm Hash digest
SHA256 4a69ff14ca5ce3c2af9e40f45b6ee2ba22261d1ee0796690aeab41b99c3436aa
MD5 9317927c3d5100946162a7ebbeb73fff
BLAKE2b-256 aba3b15073e6e8bc093010999268e03f17cd35a5351036a2fedfa2d60fe72867

See more details on using hashes here.

