Fast and Customizable Tokenizers

Tokenizers

A fast, easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. Breaking changes may land frequently in the coming days or weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers
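
To confirm that the install worked, a small check along these lines can help. It is only a sketch built on the standard library (`importlib.metadata`, which assumes Python 3.8 or later); the helper name `installed_version` is made up for this example and is not part of `tokenizers`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("tokenizers")
    if version is None:
        print("tokenizers is not installed; run `pip install tokenizers`")
    else:
        print(f"tokenizers {version} is installed")
```

Returning `None` instead of raising keeps the check usable in setup scripts that want to decide whether to install.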

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, build and install the Python bindings:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
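
To make the role of the two files more concrete: in a BPE model, the vocabulary maps tokens to ids and the merges list records which adjacent pairs to join, in priority order. The pure-Python sketch below applies a merge list to a single word. It is a toy illustration of the BPE idea, not the library's Rust implementation; the function name `bpe_encode_word`, the merge list and the vocabulary are all invented for this example.

```python
from typing import Dict, List, Tuple


def bpe_encode_word(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply merges in priority order."""
    # Lower rank = earlier in the merge list = applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        # Merge every occurrence of the best pair, left to right.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list and vocabulary, for illustration only.
toy_merges = [("m", "a"), ("ma", "g"), ("i", "c")]
toy_vocab: Dict[str, int] = {
    "mag": 0, "ic": 1, "ma": 2, "m": 3, "a": 4, "g": 5, "i": 6, "c": 7,
}

tokens = bpe_encode_word("magic", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)  # ['mag', 'ic'] [0, 1]
```

A real byte-level tokenizer works on bytes rather than characters and handles whitespace in the pre-tokenizer, which this sketch leaves out.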

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
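
The two trainer parameters can be read as stopping conditions: keep merging the most frequent adjacent pair until the vocabulary reaches `vocab_size`, or until no pair occurs at least `min_frequency` times. The sketch below runs that loop in plain Python on an in-memory word list. It is an assumed, simplified reading of BPE training, not the behavior of `BpeTrainer` itself; `train_toy_bpe` and the sample corpus are invented for this example.

```python
from collections import Counter
from typing import List, Tuple


def train_toy_bpe(
    words: List[str], vocab_size: int, min_frequency: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Learn BPE merges from `words`; returns (vocab, merges)."""
    # Each distinct word is a tuple of symbols with an occurrence count.
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for w in corpus for ch in w})
    merges: List[Tuple[str, str]] = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties are broken alphabetically for determinism.
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Rewrite the corpus using the new merged symbol.
        new_corpus: Counter = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges


sample = ["low", "low", "lower", "lowest", "new", "newer"]
vocab, merges = train_toy_bpe(sample, vocab_size=12, min_frequency=2)
print(merges)
```

Raising `min_frequency` stops training earlier on small corpora, since rare pairs never qualify for a merge.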

Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.0.9.tar.gz (34.9 kB): Source

Built Distributions

  • tokenizers-0.0.9-cp38-cp38-win_amd64.whl (777.7 kB): CPython 3.8, Windows x86-64
  • tokenizers-0.0.9-cp38-cp38-manylinux1_x86_64.whl (6.2 MB): CPython 3.8
  • tokenizers-0.0.9-cp38-cp38-macosx_10_13_x86_64.whl (853.2 kB): CPython 3.8, macOS 10.13+ x86-64
  • tokenizers-0.0.9-cp37-cp37m-win_amd64.whl (777.2 kB): CPython 3.7m, Windows x86-64
  • tokenizers-0.0.9-cp37-cp37m-manylinux1_x86_64.whl (4.6 MB): CPython 3.7m
  • tokenizers-0.0.9-cp37-cp37m-macosx_10_13_x86_64.whl (853.5 kB): CPython 3.7m, macOS 10.13+ x86-64
  • tokenizers-0.0.9-cp36-cp36m-win_amd64.whl (777.6 kB): CPython 3.6m, Windows x86-64
  • tokenizers-0.0.9-cp36-cp36m-manylinux1_x86_64.whl (3.1 MB): CPython 3.6m
  • tokenizers-0.0.9-cp36-cp36m-macosx_10_13_x86_64.whl (853.8 kB): CPython 3.6m, macOS 10.13+ x86-64
  • tokenizers-0.0.9-cp35-cp35m-win_amd64.whl (777.6 kB): CPython 3.5m, Windows x86-64
  • tokenizers-0.0.9-cp35-cp35m-manylinux1_x86_64.whl (1.5 MB): CPython 3.5m
  • tokenizers-0.0.9-cp35-cp35m-macosx_10_13_x86_64.whl (853.7 kB): CPython 3.5m, macOS 10.13+ x86-64
