Fast and Customizable Tokenizers

Tokenizers

A fast, easy-to-use implementation of today's most used tokenizers.

This API is currently being stabilized. Breaking changes may be introduced frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and using the right toolchain, you can do the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
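
After either installation route, it can be useful to confirm that the package is visible from the active environment. The following is a minimal sketch, not part of the project itself: the helper name `installed_version` is hypothetical, it relies only on the standard library's `importlib.metadata` (Python 3.8 or later), and it assumes the build step registered package metadata in the virtual env.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "tokenizers" is the distribution name used by `pip install tokenizers`.
    version = installed_version("tokenizers")
    if version is None:
        print("tokenizers is not installed in this environment")
    else:
        print(f"tokenizers {version} is installed")
```

If the script reports that the package is missing, the most likely cause is that the virtual env was not activated before running `pip install` or `maturin develop`.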

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
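
To give an intuition for what a BPE model does with the rules loaded from `merges.txt`, here is a toy, pure-Python sketch of applying ordered merge rules to a single word. It is an illustration under stated assumptions, not the library's implementation: the function `apply_merges` and the rules in `toy_merges` are hypothetical, and byte-level pre-tokenization is left out entirely.

```python
from typing import Dict, List, Tuple


def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply merge rules in priority order."""
    # Rules that appear earlier in the list have higher priority (lower rank).
    ranks: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable rule is left
        # Merge every occurrence of the best pair, scanning left to right.
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed from highest to lowest priority.
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
print(apply_merges("magic", toy_merges))
print(apply_merges("mage", toy_merges))
```

A word fully covered by the rules collapses into a single token, while a word that is only partly covered stays split at the points where no rule applies.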

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
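
The `vocab_size` and `min_frequency` arguments passed to the trainer above can be understood through a toy version of BPE training. The sketch below is a simplified, pure-Python illustration and makes several assumptions: `train_bpe` and `merge_pair` are hypothetical names, the input is a list of already split words rather than files, ties between equally frequent pairs are broken by comparing the pairs themselves, and none of the library's optimizations or byte-level handling are reproduced.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every left-to-right occurrence of `pair` with the joined symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(
    words: List[str], vocab_size: int, min_frequency: int = 2
) -> Tuple[List[str], List[Pair]]:
    """Learn merge rules until the vocabulary reaches `vocab_size`
    or no pair occurs at least `min_frequency` times."""
    word_counts = Counter(words)
    # Each distinct word starts as a tuple of single characters.
    corpus: Dict[Tuple[str, ...], int] = {tuple(w): c for w, c in word_counts.items()}
    vocab: List[str] = sorted({ch for w in word_counts for ch in w})
    merges: List[Pair] = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        new_token = best[0] + best[1]
        if new_token not in vocab:
            vocab.append(new_token)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return vocab, merges


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab, merges = train_bpe(toy_words, vocab_size=13, min_frequency=2)
print(merges)
```

Raising `min_frequency` stops training earlier on small corpora, because rare pairs are never merged; raising `vocab_size` allows more merges and therefore longer tokens.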
