Fast and Customizable Tokenizers


Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are the Python bindings over the Rust implementation. If you are interested in the high-level design, you can find it there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
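
To make the last two features concrete, here is a minimal pure-Python sketch of the ideas behind them. It does not use the library and is not its implementation: `tokenize_with_offsets` and `truncate_and_pad` are hypothetical helpers, the normalization is assumed to be simple lowercasing, and `pad_id=0` is an arbitrary choice.

```python
import re

def tokenize_with_offsets(text):
    """Lowercase each whitespace-delimited word and keep its (start, end)
    span in the ORIGINAL text, so every token can be traced back."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group().lower())  # normalized form of the token
        offsets.append(match.span())          # position in the original string
    return tokens, offsets

def truncate_and_pad(ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

text = "I can FEEL the Magic"
tokens, offsets = tokenize_with_offsets(text)

# Offsets recover the original casing even though the tokens were normalized.
originals = [text[start:end] for start, end in offsets]
print(tokens)
print(originals)

# Every sequence ends up with exactly max_length entries.
print(truncate_and_pad([5, 6, 7], max_length=5))
print(truncate_and_pad([1, 2, 3, 4, 5, 6], max_length=4))
```

Because the offsets index into the untouched input, the normalization step can change the tokens freely without losing the link to the source text.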

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile and install the bindings as follows:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
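
For intuition about what the two files contribute, the sketch below applies a tiny merge table by hand. It is a simplified illustration, not the library's code. It assumes the common layout for BPE models: vocab.json maps each token to an id, and merges.txt lists one space-separated merge per line after an optional `#version` header. The file contents and the helpers `load_merges` and `bpe_encode` are hypothetical, and details such as end-of-word suffixes are left out.

```python
import json

# Tiny in-memory stand-ins for vocab.json and merges.txt (hypothetical contents).
VOCAB_JSON = '{"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}'
MERGES_TXT = "#version: 0.2\nl o\nlo w\n"

def load_merges(text):
    """Return {(left, right): rank}; earlier lines have higher priority."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#version")]
    return {tuple(ln.split(" ")): rank for rank, ln in enumerate(lines)}

def bpe_encode(word, merges):
    """Repeatedly apply the best-ranked merge still available in the word."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

vocab = json.loads(VOCAB_JSON)
merges = load_merges(MERGES_TXT)

tokens = bpe_encode("low", merges)   # merges "l o" first, then "lo w"
ids = [vocab[t] for t in tokens]     # ids are plain vocabulary lookups
print(tokens, ids)
```

In this picture, `encoded.tokens` corresponds to the merged symbols and `encoded.ids` to their entries in the vocabulary.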

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
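
As background on what training a BPE vocabulary involves, here is a toy pure-Python sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It only illustrates the general technique and makes no claim about how the Rust implementation works. `learn_bpe`, the word counts, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges merge rules from a {word: count} mapping."""
    # Each word starts out as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges

# Hypothetical word frequencies standing in for a training corpus.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
learned = learn_bpe(word_counts, num_merges=3)
print(learned)
```

The ordered list of merges is what a trained BPE model stores, alongside the vocabulary built from the resulting symbols.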

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
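
To see why a byte-level BPE can represent any input text, the sketch below shows the underlying idea in plain Python: every string reduces to UTF-8 bytes, so the base alphabet never exceeds 256 symbols. This is only an illustration with hypothetical helper names. The library's ByteLevel components do more than this, for example handling the prefix space configured above.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 bytes, the base symbols of a byte-level BPE."""
    return list(text.encode("utf-8"))

def from_byte_symbols(byte_values):
    """Inverse mapping: reassemble the bytes and decode them back to text."""
    return bytes(byte_values).decode("utf-8")

symbols = to_byte_symbols("café")
print(symbols)                     # "é" takes two bytes, so 5 symbols in total
print(from_byte_symbols(symbols))  # round-trips to the original string
```

Merges are then learned over these byte symbols instead of over characters, which is why no input character falls outside the base alphabet.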

