Fast and Customizable Tokenizers

Tokenizers

A fast and easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. Breaking changes may land frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, build and install the Python bindings as follows.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
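
After either install method, it can help to confirm that the package is visible from the active virtual env. The following is a minimal sketch that relies only on the standard library (`importlib.metadata`, which assumes Python 3.8 or newer); the helper name `installed_version` is hypothetical and not part of `tokenizers`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what the current environment sees.
version = installed_version("tokenizers")
print("tokenizers:", version if version else "not installed in this environment")
```

If this prints "not installed", the most likely cause is that the virtual env was not activated before running `pip install` or `maturin develop`.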

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
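
To give an intuition for what a BPE model loaded from a vocabulary and a merge list does at encode time, here is a pure-Python sketch. It is not the library's implementation: the function `apply_merges` and the tiny merge list are hypothetical, and it ignores byte-level pre-tokenization and the vocabulary lookup entirely.

```python
from typing import Dict, List, Tuple


def apply_merges(word: str, ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Greedily merge adjacent symbols, always picking the earliest-learned pair."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get an infinite rank.
        pairs = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no known merge applies any more
        # Replace the two symbols at position i with their concatenation.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merges, listed in the order they were learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
ranks = {pair: rank for rank, pair in enumerate(merges)}

tokens = apply_merges("lower", ranks)
print(tokens)
```

The order of the merge list matters: pairs learned earlier take priority, which is why the sketch turns the list into a rank table before encoding.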

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
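
The `vocab_size` and `min_frequency` arguments are easier to reason about with a picture of what BPE training does. The sketch below is a simplified, pure-Python illustration under stated assumptions, not the trainer shipped in this package: `train_bpe` is a hypothetical function, it works on a list of words rather than files, and it treats the vocabulary size as the number of distinct characters plus the number of learned merges.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe(words: List[str], vocab_size: int, min_frequency: int = 2) -> List[Tuple[str, str]]:
    """Learn merges until the vocabulary is full or no pair is frequent enough."""
    # Each distinct word is stored as a tuple of symbols with its count.
    corpus = Counter(tuple(w) for w in words)
    alphabet = {ch for w in words for ch in w}
    merges: List[Tuple[str, str]] = []

    while len(alphabet) + len(merges) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are deterministic.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)

        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus: Counter = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges


# Toy corpus: 10 distinct characters, so vocab_size=12 leaves room for two merges.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
learned = train_bpe(words, vocab_size=12, min_frequency=2)
print(learned)
```

In this picture, a larger `vocab_size` allows more merges and therefore longer tokens, while a higher `min_frequency` stops training early once the remaining pairs are too rare to be worth a vocabulary slot.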
