
Fast and Customizable Tokenizers



Tokenizers

A fast, easy-to-use implementation of today's most widely used tokenizers.

This API is still being stabilized. Breaking changes may land frequently in the coming days or weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and set to the right toolchain, run the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
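
To make the role of the merges file less abstract, here is a minimal pure-Python sketch of how an ordered list of BPE merges can be applied to a single word. It is only an illustration of the general technique, not the library's Rust implementation; the `apply_merges` helper and the toy merge list are hypothetical and not part of `tokenizers`.

```python
def apply_merges(word, merges):
    """Apply BPE merges to `word`, always picking the earliest-listed pair."""
    # Earlier entries in the merge list get a lower rank (higher priority).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge list, invented for this example (assumption, not real data).
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
print(apply_merges("magic", toy_merges))
print(apply_merges("mag", toy_merges))
```

With the toy list, "magic" collapses into a single symbol, while "mag" stops at two symbols because no merge covers the remaining pair.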

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
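
For intuition about what a BPE trainer does with parameters such as `vocab_size` and `min_frequency`, the following is a rough pure-Python sketch of the classic merge-learning loop. It assumes the corpus has already been split into words, and it assumes `min_frequency` means "stop once the most frequent pair occurs fewer times than this"; the library's exact semantics may differ. The `train_bpe` function is hypothetical and unrelated to the library's internals.

```python
from collections import Counter


def train_bpe(words, vocab_size, min_frequency):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Each distinct word is stored as a tuple of symbols with its count.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges


vocab, merges = train_bpe(["aa", "aa", "ab"], vocab_size=3, min_frequency=2)
print(sorted(vocab), merges)
```

Raising `min_frequency` above the count of the most frequent pair makes the loop stop before learning any merge, which is the behaviour this sketch assumes for that parameter.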



