
Fast and Customizable Tokenizers


Tokenizers

A fast, easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. We may introduce breaking changes frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, build and install the Python bindings as follows.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
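After either install method, it can help to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is an assumption made for illustration and is not part of `tokenizers`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run this inside the virtual env you installed into.
    print("tokenizers importable:", is_installed("tokenizers"))
```

If this prints `False` after a source build, the usual cause is that `maturin develop` ran in a different virtual env than the one currently active.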

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
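To give a rough idea of what a BPE model does with its merges when encoding, here is a minimal pure-Python sketch of the general technique. It is not the library's implementation, and the function `apply_merges` and the tiny merge list are hypothetical, chosen only for illustration. A word starts as single characters, and the earliest-learned known pair is merged repeatedly until no known pair remains.

```python
from typing import Dict, List, Tuple


def apply_merges(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Split `word` into characters, then repeatedly merge the adjacent pair
    with the lowest rank (earliest learned) until no known pair remains."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the known pair that was learned first (lowest rank).
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no adjacent pair appears in the merge list
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, in the order a merges file would list them.
merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
merge_ranks = {pair: rank for rank, pair in enumerate(merges)}

print(apply_merges("magic", merge_ranks))  # ['magic']
print(apply_merges("mag", merge_ranks))    # ['ma', 'g']
```

In a real tokenizer each resulting symbol is then looked up in the vocabulary to obtain its id. The byte-level pre-tokenizer and decoder configured above are separate steps that this sketch leaves out.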

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
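The `vocab_size` and `min_frequency` arguments above act as stopping conditions for training. The following pure-Python sketch shows how such limits typically behave in the general BPE algorithm. It is an illustration under simplifying assumptions (whitespace-free words, character-level start, lexicographic tie-breaking), not the library's trainer, and `train_bpe` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_bpe(
    words: List[str], vocab_size: int, min_frequency: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Learn BPE merges until the vocabulary reaches `vocab_size` or the most
    frequent pair occurs fewer than `min_frequency` times."""
    # Each distinct word is a tuple of symbols, weighted by its count.
    corpus: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    vocab: List[str] = sorted({ch for symbols in corpus for ch in symbols})
    merges: List[Tuple[str, str]] = []

    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair across the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Most frequent pair; ties are broken lexicographically.
        best, count = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < min_frequency:
            break  # remaining pairs are too rare to merge

        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)

        # Rewrite the corpus with the chosen pair merged.
        new_corpus: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in corpus.items():
            out: List[str] = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(token)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus

    return vocab, merges


vocab, merges = train_bpe(["low", "low", "lower", "lowest"], vocab_size=20, min_frequency=2)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

In this toy corpus training stops after three merges because every remaining pair occurs only once, which is below `min_frequency=2`, even though the vocabulary (10 entries) is far below `vocab_size`.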



