Fast and Customizable Tokenizers

Tokenizers

A fast, easy-to-use implementation of today's most widely used tokenizers.

This API is still being stabilized. We may introduce breaking changes frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, build and install the Python bindings as follows.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog",
])
print(encoded)
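
To make the roles of the two files above more concrete, here is a small pure-Python sketch of how a BPE model can turn a word into tokens. It assumes GPT-2-style files, where merges.txt is an ordered list of symbol pairs and vocab.json maps tokens to integer ids. This is only an illustration on toy data, not the library's implementation: `apply_merges`, `toy_merges` and `toy_vocab` are hypothetical names, and byte-level pre-tokenization is left out.

```python
def apply_merges(word, merges):
    """Greedily apply ranked BPE merges to a single word.

    `merges` is an ordered list of symbol pairs; earlier pairs win.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Pick the adjacent pair with the best (lowest) rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        # Merge every occurrence of that pair, left to right.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical toy data standing in for merges.txt and vocab.json.
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
toy_vocab = {"magic": 0, "ma": 1, "gi": 2, "c": 3, "m": 4, "a": 5, "g": 6, "i": 7}

tokens = apply_merges("magic", toy_merges)
ids = [toy_vocab[token] for token in tokens]
print(tokens, ids)
```

A word that is not fully covered by the merge list simply stays split into smaller pieces, which is why BPE can represent words it has never seen.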

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
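
For intuition about what `vocab_size` and `min_frequency` control, the following is a rough pure-Python sketch of BPE training on a list of words. It is a toy under stated assumptions, not what `BpeTrainer` does internally: `train_bpe` is a hypothetical helper, it works on characters rather than bytes, and ties between equally frequent pairs are broken arbitrarily (here by comparing the pairs themselves).

```python
from collections import Counter


def train_bpe(words, vocab_size, min_frequency):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Each distinct word is stored as a tuple of symbols with its count.
    corpus = Counter(tuple(word) for word in words)
    vocab = sorted({symbol for word in corpus for symbol in word})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)
        # Rewrite every word with the chosen pair merged.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(token)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges


words = "low low low lower lowest new newer".split()
vocab, merges = train_bpe(words, vocab_size=12, min_frequency=2)
print(merges)
print(vocab)
```

Raising `min_frequency` stops training earlier, since rare pairs are never merged; raising `vocab_size` allows more merges and therefore longer tokens.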

Download files

Download the file for your platform.

Source Distribution

tokenizers-0.0.5.tar.gz (28.8 kB), source

Built Distributions

tokenizers-0.0.5-cp38-cp38-win_amd64.whl (732.5 kB), CPython 3.8, Windows x86-64
tokenizers-0.0.5-cp38-cp38-manylinux1_x86_64.whl (5.9 MB), CPython 3.8
tokenizers-0.0.5-cp38-cp38-macosx_10_13_x86_64.whl (802.3 kB), CPython 3.8, macOS 10.13+ x86-64
tokenizers-0.0.5-cp37-cp37m-win_amd64.whl (732.4 kB), CPython 3.7m, Windows x86-64
tokenizers-0.0.5-cp37-cp37m-manylinux1_x86_64.whl (4.5 MB), CPython 3.7m
tokenizers-0.0.5-cp37-cp37m-macosx_10_13_x86_64.whl (802.8 kB), CPython 3.7m, macOS 10.13+ x86-64
tokenizers-0.0.5-cp36-cp36m-win_amd64.whl (732.8 kB), CPython 3.6m, Windows x86-64
tokenizers-0.0.5-cp36-cp36m-manylinux1_x86_64.whl (3.0 MB), CPython 3.6m
tokenizers-0.0.5-cp36-cp36m-macosx_10_13_x86_64.whl (802.9 kB), CPython 3.6m, macOS 10.13+ x86-64
tokenizers-0.0.5-cp35-cp35m-win_amd64.whl (732.8 kB), CPython 3.5m, Windows x86-64
tokenizers-0.0.5-cp35-cp35m-manylinux1_x86_64.whl (1.5 MB), CPython 3.5m
tokenizers-0.0.5-cp35-cp35m-macosx_10_13_x86_64.whl (802.9 kB), CPython 3.5m, macOS 10.13+ x86-64
