Fast and Customizable Tokenizers

Tokenizers

A fast, easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. We may introduce breaking changes frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and set to the right toolchain, you can do the following.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
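
To confirm that either installation method worked, you can ask Python whether the package is visible in the active environment. This is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is ours, not part of `tokenizers`, and it assumes the install step recorded package metadata, which `pip` normally does.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what the current environment sees.
version = installed_version("tokenizers")
if version is None:
    print("tokenizers is not installed in this environment")
else:
    print(f"tokenizers {version} is installed")
```

If this reports that the package is missing right after a source build, check that the virtual env you activated is the one running the script.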

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
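
For intuition about what the vocab and merges files contribute, here is a toy pure-Python sketch of BPE encoding. It is not the library's implementation (that lives in Rust) and it skips byte-level pre-tokenization. It assumes the GPT-2 style layout, where `vocab.json` maps each token to an id and `merges.txt` lists one space-separated merge per line after a `#version` header. The sample data and the helper names `parse_merges` and `bpe_encode_word` are hypothetical.

```python
import json

# Hypothetical, tiny stand-ins for the contents of vocab.json and merges.txt.
VOCAB_JSON = '{"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}'
MERGES_TXT = """#version: 0.2
l o
lo w
e r
"""


def parse_merges(text):
    """Map each merge pair to its rank; a lower rank is applied first."""
    pairs = [
        tuple(line.split())
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return {pair: rank for rank, pair in enumerate(pairs)}


def bpe_encode_word(word, ranks):
    """Start from single characters and repeatedly apply the best-ranked merge."""
    symbols = list(word)
    while len(symbols) > 1:
        # Collect (rank, position) for every adjacent pair that is a known merge.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


vocab = json.loads(VOCAB_JSON)
ranks = parse_merges(MERGES_TXT)
tokens = bpe_encode_word("lower", ranks)
ids = [vocab[token] for token in tokens]
print(tokens, ids)  # prints: ['low', 'er'] [6, 7]
```

The order of lines in the merges file matters: earlier merges take priority, which is why `l o` and `lo w` fire before `e r` here.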

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
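
To illustrate what `vocab_size` and `min_frequency` control, here is a toy BPE trainer in pure Python. It is only a sketch of the general algorithm, not what `BpeTrainer` does internally: it works on a list of words instead of files, ignores byte-level pre-tokenization, and breaks ties lexicographically, which is our choice for determinism. The function name `train_bpe` and the sample words are hypothetical.

```python
from collections import Counter


def train_bpe(words, vocab_size, min_frequency):
    """Toy BPE trainer: return (vocab, merges) learned from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    vocab = sorted({symbol for word in corpus for symbol in word})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically.
        best, count = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


words = "low low low lower lower newest newest newest widest".split()
vocab, merges = train_bpe(words, vocab_size=14, min_frequency=2)
print(merges)
```

Training stops as soon as the vocabulary reaches `vocab_size` or the most frequent remaining pair occurs fewer than `min_frequency` times, whichever comes first.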
