
Fast and Customizable Tokenizers

Project description


Tokenizers

A fast and easy-to-use implementation of today's most used tokenizers.

This API is currently being stabilized. We may introduce breaking changes quite often in the coming days/weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, you can do the following.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
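
Once that finishes, a quick import from the same virtual env confirms that the bindings were built and installed correctly (a minimal smoke test; it only checks that the compiled extension loads):

# Run this inside the virtual env where `maturin develop` was executed.
from tokenizers import Tokenizer

print(Tokenizer)  # should print the Tokenizer class if the build succeeded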

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
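
The exact shape of the returned Encoding objects is still settling along with the rest of the API. Assuming they expose ids and tokens attributes, as in later releases of this library, the batch results can be inspected like this:

# Attribute names below are an assumption based on later releases of this
# library; check the installed version if they raise AttributeError.
for enc in encoded:
	print(enc.ids)     # integer token ids produced by the BPE model
	print(enc.tokens)  # the corresponding string pieces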

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
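
The freshly trained tokenizer supports the same calls as the pre-trained one above, so batches work the same way:

# Batch-encode with the tokenizer we just trained, exactly as in the
# pre-trained example above.
encoded_batch = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded_batch)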


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • tokenizers-0.0.13.tar.gz (55.6 kB, Source)

Built Distributions

  • tokenizers-0.0.13-cp38-cp38-win_amd64.whl (985.2 kB, CPython 3.8, Windows x86-64)
  • tokenizers-0.0.13-cp38-cp38-manylinux1_x86_64.whl (7.1 MB, CPython 3.8)
  • tokenizers-0.0.13-cp38-cp38-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.8, macOS 10.13+ x86-64)
  • tokenizers-0.0.13-cp37-cp37m-win_amd64.whl (985.0 kB, CPython 3.7m, Windows x86-64)
  • tokenizers-0.0.13-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB, CPython 3.7m)
  • tokenizers-0.0.13-cp37-cp37m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.7m, macOS 10.13+ x86-64)
  • tokenizers-0.0.13-cp36-cp36m-win_amd64.whl (985.4 kB, CPython 3.6m, Windows x86-64)
  • tokenizers-0.0.13-cp36-cp36m-manylinux1_x86_64.whl (3.6 MB, CPython 3.6m)
  • tokenizers-0.0.13-cp36-cp36m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.6m, macOS 10.13+ x86-64)
  • tokenizers-0.0.13-cp35-cp35m-win_amd64.whl (985.5 kB, CPython 3.5m, Windows x86-64)
  • tokenizers-0.0.13-cp35-cp35m-manylinux1_x86_64.whl (1.8 MB, CPython 3.5m)
  • tokenizers-0.0.13-cp35-cp35m-macosx_10_13_x86_64.whl (1.1 MB, CPython 3.5m, macOS 10.13+ x86-64)

File details

Details for the file tokenizers-0.0.13.tar.gz.

File metadata

  • Download URL: tokenizers-0.0.13.tar.gz
  • Upload date:
  • Size: 55.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13.tar.gz
Algorithm Hash digest
SHA256 7a62530f94bc8356c3fb6708e46574db12dadb9824ef4261766e0efe911b6e95
MD5 dd7dd53a5513590d60610d8dd3be8638
BLAKE2b-256 e5a0a03e0cc68843804aa487379c3476a85f25252f4846b57f39502987e034b6

See more details on using hashes here.
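
If you download a file manually, the published digests above can be checked locally before installing. A minimal sketch using Python's standard hashlib, assuming the sdist was saved to the current directory:

import hashlib

# Expected SHA256 for tokenizers-0.0.13.tar.gz, copied from the table above.
expected = "7a62530f94bc8356c3fb6708e46574db12dadb9824ef4261766e0efe911b6e95"

# Assumes the archive was downloaded into the current working directory.
with open("tokenizers-0.0.13.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "MISMATCH")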

File details

Details for the file tokenizers-0.0.13-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 985.2 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 3e63d26c519ca9ea34809d1264ae75b34838886d78b2e8a68aa7f205b48a4832
MD5 172264ece869b56d6c44ab66aa7ee2bd
BLAKE2b-256 075f5b2ae7f059b45bf6361eb98bee80d5ec91ef228e9085f659b8af40b0af6d

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 7.1 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 5224f7b2b6a7334ff8ffae7eb4f1e7c7e258712839e836a057e420397ccae76d
MD5 04639e814647284161789324866f4349
BLAKE2b-256 7adf3611d00388037608bdc725ca19642df4b5ceecdc98be600bfd06048645c0

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp38-cp38-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp38-cp38-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 3e81d6ab8d3c7d4b5a52a9516cc463f2a8d75d064d971d1fc7a6aae624e7417f
MD5 029c9a6b1cbec9468c0f5383980f61c8
BLAKE2b-256 c6ac2d69d67a4c1f07ff48c9f5d4f718559bea8d364d61ead627cf3591bec941

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 985.0 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 399bd78d99b30315e4e3b48f75bc255a3d288c5e15646a60846312a342b7a368
MD5 6cd36a1f2ee178ab9abff6c504ea60d6
BLAKE2b-256 b2b87694ab18219214c7a5cd49babfba470beb2d9f9eb158fbf682ad5c6ccf47

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 5.4 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 2fe9b59c90b8b877e73a421e17b524db8c76ef207ded34a50763c44790cf6a29
MD5 ae522127078ceb6e01cd69e46ed030bd
BLAKE2b-256 a70957508f0aca9668bf31fd78291a97a0624d59a8aa1f9d8c9f09c4f9201cd8

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp37-cp37m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp37-cp37m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 657e7528fc2f1518fa8623af312b98b5ece9f40e885066844865914624c90e99
MD5 8e19d33335bdacdf6af3188f7d33ea99
BLAKE2b-256 c1d4f1cd1f55a5c66f7c8b050d636dcb571b3bd150f97384cd82a24f245d91ad

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 985.4 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 7726fdcc50bacad246ee78290111aaa644aa657935861735443a19785e46467f
MD5 e8086cd1af15924954d3e5f6cd1d8bb2
BLAKE2b-256 0b99fcd7a30ceafa55f2dfd9f4c9251ce542dab97843517f52e8fdb006c858ca

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.6 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 b1e4a82c4383e86206b54feb92f8ff6fa6099b948be34193e5d41edac576131d
MD5 2aeb4fdc6d2d0d7e16bab43c5dfc4ab9
BLAKE2b-256 3ff0bfcb982b7dfa35f38c527cead4fb25c9ef608ca75cebfb7402631004bb03

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp36-cp36m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp36-cp36m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 f2770a1a7bc14bfcad637bb0b0995094b66a3e28e2c89fa0cd80b74806003729
MD5 04c869a144880b2161765d9af3ea20c6
BLAKE2b-256 750837487fd09519ea1801f7f73810186f45adb741e233b00437e52987caf8ca

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 985.5 kB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 ad7f865e3ba96278a6fea3bc0fa866986e006cea3ad72815d33db8c9ac90d808
MD5 e531bd2fdbef3b88865006f7db1afa63
BLAKE2b-256 d18e2c50c8f5c9764379a25680c529ea81fc11f74b9d5600d0f6957e23958850

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 40fcdd16bb1ff188fc4af579cac05fbdb504e807008bdc48ee2061e2b666e377
MD5 7d7967392aa0c35e48c3425acfc9e8e1
BLAKE2b-256 d42713b564de37f8f891e4821567842e1b60002cad78c3a9c252ce382525388c

See more details on using hashes here.

File details

Details for the file tokenizers-0.0.13-cp35-cp35m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.13-cp35-cp35m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.5m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.13-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 b8bbc871f90968ef5b18f9e6cb7fd74c692c14b623d5167ac91bebe99da739dd
MD5 8ec16806aa490ef4ffcd9f1d32ff2416
BLAKE2b-256 ffa4763f5871bdd7ab404836cc341885acad814ccd750be43fb94bcec4f68348

See more details on using hashes here.
