Fast and Customizable Tokenizers

Tokenizers

A fast, easy-to-use implementation of today's most widely used tokenizers.

This API is still being stabilized. We may introduce breaking changes frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, run the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
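
To confirm which build ended up in the active environment, a small standard-library check can help. This is a sketch rather than part of the project: `installed_version` is a hypothetical helper name, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version found in the current environment, or None.
print(installed_version("tokenizers"))
```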

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
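
For intuition about what the merges file contributes, the sketch below applies a ranked list of merge rules to a single word in plain Python. It is a toy illustration, not the library's implementation: `apply_merges` and `toy_merges` are hypothetical names, and it assumes merge rules are tried in the order they were learned, as is conventional for BPE.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then apply merge rules by priority."""
    # Lower rank means the rule was learned earlier and is applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank; unknown pairs rank last.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known rule applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, in the order a trainer might have learned them.
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
print(apply_merges("magic", toy_merges))  # prints ['magic']
print(apply_merges("mage", toy_merges))   # prints ['ma', 'g', 'e']
```

A real tokenizer would then map each resulting symbol to an integer id through the vocabulary file.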

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
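
The two trainer arguments can be pictured with a toy BPE trainer in plain Python. This is a sketch under assumptions, not the algorithm `BpeTrainer` actually runs: `train_toy_bpe` is a hypothetical function, `vocab_size` is interpreted as the target number of symbols, `min_frequency` as the minimum count a pair needs before it is merged, and ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter


def train_toy_bpe(words, vocab_size, min_frequency):
    """Learn merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols with an occurrence count.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Highest count wins; ties go to the alphabetically smallest pair.
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest"]
print(train_toy_bpe(toy_words, vocab_size=10, min_frequency=2))
# prints [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Raising `min_frequency` to 3 in this toy stops training after the first two merges, because the pair `('low', 'e')` occurs only twice.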

Project details


This version

0.0.3

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.0.3.tar.gz (21.5 kB): Source

Built Distributions

  • tokenizers-0.0.3-cp38-cp38-win_amd64.whl (615.8 kB): CPython 3.8, Windows x86-64
  • tokenizers-0.0.3-cp38-cp38-manylinux1_x86_64.whl (5.3 MB): CPython 3.8
  • tokenizers-0.0.3-cp38-cp38-macosx_10_13_x86_64.whl (683.3 kB): CPython 3.8, macOS 10.13+ x86-64
  • tokenizers-0.0.3-cp37-cp37m-win_amd64.whl (615.5 kB): CPython 3.7m, Windows x86-64
  • tokenizers-0.0.3-cp37-cp37m-manylinux1_x86_64.whl (4.0 MB): CPython 3.7m
  • tokenizers-0.0.3-cp37-cp37m-macosx_10_13_x86_64.whl (683.4 kB): CPython 3.7m, macOS 10.13+ x86-64
  • tokenizers-0.0.3-cp36-cp36m-win_amd64.whl (617.1 kB): CPython 3.6m, Windows x86-64
  • tokenizers-0.0.3-cp36-cp36m-manylinux1_x86_64.whl (2.7 MB): CPython 3.6m
  • tokenizers-0.0.3-cp36-cp36m-macosx_10_13_x86_64.whl (683.6 kB): CPython 3.6m, macOS 10.13+ x86-64
  • tokenizers-0.0.3-cp35-cp35m-win_amd64.whl (617.1 kB): CPython 3.5m, Windows x86-64
  • tokenizers-0.0.3-cp35-cp35m-manylinux1_x86_64.whl (1.3 MB): CPython 3.5m
  • tokenizers-0.0.3-cp35-cp35m-macosx_10_13_x86_64.whl (683.5 kB): CPython 3.5m, macOS 10.13+ x86-64

File details

Details for the file tokenizers-0.0.3.tar.gz.

File metadata

  • Download URL: tokenizers-0.0.3.tar.gz
  • Upload date:
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3.tar.gz
Algorithm Hash digest
SHA256 1734f3b0499b0a0b39b07825be779107402049b144a9dcc2f25cb5b3adb9c937
MD5 4f600066b5ad05e8d763648049923e49
BLAKE2b-256 a9bf0624bc7b26691012e8b0013f58289f353afb0838dab4829a6b283433955b

See more details on using hashes here.
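
One way to use the digests listed in this section is to recompute the SHA256 of a downloaded file and compare it with the published value. The following is a minimal sketch using only the standard library; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the path passed in is whatever local path the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large wheels do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("tokenizers-0.0.3.tar.gz", "<SHA256 value from the listing>")` would return `True` only if the local file is byte-for-byte identical to the published one.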

File details

Details for the file tokenizers-0.0.3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 615.8 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 cd0ec2faf6f847d1c2dddc3110a587812e4bc4dd37fb0ad71749b4223b68886b
MD5 9e52bbac5d486e099989bffa92c68764
BLAKE2b-256 549b1a5a7ff200440bd5334734b9c0daf746ff54bce3b5adabe52b867f0ad218

File details

Details for the file tokenizers-0.0.3-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 5.3 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 e74d504603f76fb8a4e4866b7a93ecdc6fe6242fa78fb819d9f02a6f3ba72c8f
MD5 8584a2bc3c399adffb438c75b8719fad
BLAKE2b-256 facb6dab2a81b129beac78f51bdc9dbf3917a027065bc7aa8ef5070b26d7dd43

File details

Details for the file tokenizers-0.0.3-cp38-cp38-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp38-cp38-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 683.3 kB
  • Tags: CPython 3.8, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 ea392806ce1c4358b9e2daa6bda1632c997583d014c4b9ddaf2cd7e2de3191fc
MD5 03389707f3c85c5d10d630d487b01890
BLAKE2b-256 349acc9ec164571a5b67965df23c079cbbcd51971be0791a94c0f259907bb7c7

File details

Details for the file tokenizers-0.0.3-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 615.5 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 7fd9e4cf35bc131106c1435ec466a7a0e03dbc8a2e6f8307c39c2eac35ceec81
MD5 78ca6a453d65a5d20c2edbfbbd2cbbee
BLAKE2b-256 7854d5ffca295ae65da78e304fbf7f74890c66f381d7997e2ea4879afb89a812

File details

Details for the file tokenizers-0.0.3-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 4.0 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 7dc257e0369440510954983d72fc161395da62601522ae03fb4a5d915f68a6b6
MD5 38a25841d2f776c9a91c91212dac8093
BLAKE2b-256 4d0a6c0710260e8cc09892eac161bbeeed53b2e64b9fa18d9e4492ecaa318977

File details

Details for the file tokenizers-0.0.3-cp37-cp37m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp37-cp37m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 683.4 kB
  • Tags: CPython 3.7m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 93dc62cc1794051363798ae999b76f47510a3cb21acaac05d7a5a9da0b1e2152
MD5 bfdf6c9965c54b07b151cfc7cc2b3d30
BLAKE2b-256 6555c6440067d81b01faf8526eb54672b091ad60493045104724d8cf9a2631dd

File details

Details for the file tokenizers-0.0.3-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 617.1 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 29d935e32c38c0a56e906e5f58458ac1b72e63e6276d6a5741a189409e5f3f0d
MD5 575c5b8f2bb29d02871926bd057e313c
BLAKE2b-256 abe5e3aeb19175290cd23c79db1585d70153d0b83fd18ee4b581dfc83451af59

File details

Details for the file tokenizers-0.0.3-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.7 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 be850b7bc57fb3276cfd65467f1de024edfee789be0c29ab5e08a8cb323bae41
MD5 43098ecbbbf3929c940459cd4d12786a
BLAKE2b-256 e97df9ba934026b0764400bf79103addd02d05aab1dc9adfcc6bd7be8a575f38

File details

Details for the file tokenizers-0.0.3-cp36-cp36m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp36-cp36m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 683.6 kB
  • Tags: CPython 3.6m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 a0bf5a21a21cdb8d0c1386716ad872cc1445987f6abcd1de8e53427c9b25bd59
MD5 134bb1134c2ad7672c762956d6bf37ff
BLAKE2b-256 7b9238790ae7ca2d17d22cc14efb647c304a08a4e3fd278e8535440471c8c5f9

File details

Details for the file tokenizers-0.0.3-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 617.1 kB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 ed6fbcd386ee38fa77f6ebc2751bbc78846936b98b876f1343a921863c8d48dd
MD5 7b16828311826669ca55a15ddb51b6d8
BLAKE2b-256 593eb5739837d7437bc3f8d0b751fff20d072630db03ea7a195fa5e321193ecb

File details

Details for the file tokenizers-0.0.3-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 1ea810773ee1d8ef968b5b3cb0fd7e4cd5f441fb7dd343f26e6366c0f2062d03
MD5 1d216bfbf127961179e616b3de9652ba
BLAKE2b-256 b647f2eb2a19c737346bec86890ed860d91f43becf3c19d5073c48e671ae47cd

File details

Details for the file tokenizers-0.0.3-cp35-cp35m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.3-cp35-cp35m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 683.5 kB
  • Tags: CPython 3.5m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.3-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 35c6b5e942f67f18cacc5c802c32a853957ea67143b925066b9b618cff413625
MD5 77ffff298d4f8995813e29aca08046bc
BLAKE2b-256 b731994e0351f14594638eb7e6ba100f72c34bef22d998affae61023c97dbbcf
