Fast and Customizable Tokenizers

Tokenizers

A fast and easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. Breaking changes may be introduced frequently in the coming days or weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and the right toolchain is selected, run the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
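
After either installation route, a quick check that the package is visible to the active Python environment can look like the sketch below. The helper name `is_installed` is hypothetical and not part of tokenizers; it only asks the import system whether the package can be found.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the import system can locate `package`."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Prints False if the virtual env is not activated or the build failed.
    print("tokenizers importable:", is_installed("tokenizers"))
```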

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
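
For intuition about what a BPE model built from a vocabulary and a merges file does when encoding, here is a minimal pure-Python sketch. It is not the library's implementation: the function `apply_merges` and the toy merge table are hypothetical, and it assumes that earlier entries in the merge list take priority over later ones.

```python
from typing import Dict, List, Tuple


def apply_merges(word: str, ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Split `word` into characters, then repeatedly merge the adjacent
    pair with the lowest rank until no known pair remains."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # none of the remaining pairs was ever learned
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table (an assumption for illustration); rank = position in the list.
merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
ranks = {pair: rank for rank, pair in enumerate(merges)}
print(apply_merges("magic", ranks))
print(apply_merges("mac", ranks))
```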

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
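
To illustrate roughly how the `vocab_size` and `min_frequency` settings shape BPE training, here is a toy pure-Python trainer. It is a sketch under stated assumptions, not what `BpeTrainer` does internally: the function `train_bpe` is hypothetical, it works on an in-memory list of words rather than files, and it skips byte-level pre-tokenization entirely.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe(words: List[str], vocab_size: int, min_frequency: int):
    """Learn merges until the vocabulary reaches `vocab_size` or the most
    frequent pair occurs fewer than `min_frequency` times."""
    corpus = Counter(tuple(w) for w in words)  # word as symbols -> count
    vocab = sorted({ch for w in words for ch in w})
    merges: List[Tuple[str, str]] = []
    while len(vocab) < vocab_size:
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best, freq = pair_counts.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        vocab.append(best[0] + best[1])
        new_corpus: Counter = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges


vocab, merges = train_bpe(["low"] * 5 + ["lower"] * 2, vocab_size=100, min_frequency=2)
print(len(vocab), merges)
```

Raising `min_frequency` stops training earlier because rare pairs are never merged, while a smaller `vocab_size` caps the number of merges directly.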

Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.0.7.tar.gz (32.6 kB), Source

Built Distributions

  • tokenizers-0.0.7-cp38-cp38-win_amd64.whl (774.9 kB), CPython 3.8, Windows x86-64
  • tokenizers-0.0.7-cp38-cp38-manylinux1_x86_64.whl (6.1 MB), CPython 3.8
  • tokenizers-0.0.7-cp38-cp38-macosx_10_13_x86_64.whl (849.6 kB), CPython 3.8, macOS 10.13+ x86-64
  • tokenizers-0.0.7-cp37-cp37m-win_amd64.whl (775.0 kB), CPython 3.7m, Windows x86-64
  • tokenizers-0.0.7-cp37-cp37m-manylinux1_x86_64.whl (4.6 MB), CPython 3.7m
  • tokenizers-0.0.7-cp37-cp37m-macosx_10_13_x86_64.whl (849.7 kB), CPython 3.7m, macOS 10.13+ x86-64
  • tokenizers-0.0.7-cp36-cp36m-win_amd64.whl (774.3 kB), CPython 3.6m, Windows x86-64
  • tokenizers-0.0.7-cp36-cp36m-manylinux1_x86_64.whl (3.1 MB), CPython 3.6m
  • tokenizers-0.0.7-cp36-cp36m-macosx_10_13_x86_64.whl (849.7 kB), CPython 3.6m, macOS 10.13+ x86-64
  • tokenizers-0.0.7-cp35-cp35m-win_amd64.whl (774.5 kB), CPython 3.5m, Windows x86-64
  • tokenizers-0.0.7-cp35-cp35m-manylinux1_x86_64.whl (1.5 MB), CPython 3.5m
  • tokenizers-0.0.7-cp35-cp35m-macosx_10_13_x86_64.whl (849.7 kB), CPython 3.5m, macOS 10.13+ x86-64

File details

Details for the file tokenizers-0.0.7.tar.gz.

File metadata

  • Download URL: tokenizers-0.0.7.tar.gz
  • Upload date:
  • Size: 32.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7.tar.gz
Algorithm Hash digest
SHA256 7fdc3fafc95a64a61cff4b04539ab3bdc09c73554e9d9caac7d16949b274d08a
MD5 4412092e5e0d92bb5cd4742075b73ed2
BLAKE2b-256 8e005ac4d726c19c5e1e9a8eb029a6182f670b35ed8c28758135f207e5c5a11d
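
A downloaded file can be checked against a published digest with Python's standard `hashlib` module. The sketch below is one way to do it, assuming the source archive was saved to the current directory; the helper name `sha256_of` is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # SHA256 listed above for tokenizers-0.0.7.tar.gz
    expected = "7fdc3fafc95a64a61cff4b04539ab3bdc09c73554e9d9caac7d16949b274d08a"
    archive = Path("tokenizers-0.0.7.tar.gz")  # assumed download location
    if archive.exists():
        print("match" if sha256_of(str(archive)) == expected else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

The same helper works for any of the wheels below by swapping in the corresponding file name and SHA256 value.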

File details

Details for the file tokenizers-0.0.7-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 774.9 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 94070b2a4cdef78978d8fd077525876344e98e9bd09274e76b8297768628ddae
MD5 29b32588facd73ff061e3a49cc975179
BLAKE2b-256 275d95b1f501143bce0577017ba09cc883215fac770286209a5ced2e7f8d97b5

File details

Details for the file tokenizers-0.0.7-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 6.1 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 a7d36cd94689ac3dfcbd7d161dd2b42b2398667edd05f25c993fb4057b1b4ecd
MD5 6463c42cc778c32d39a48ed7af6171a9
BLAKE2b-256 70c016fae2d4e860397191ea63b3fee43b4878882aecbaf85812d9f1dac1b7a4

File details

Details for the file tokenizers-0.0.7-cp38-cp38-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp38-cp38-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 849.6 kB
  • Tags: CPython 3.8, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp38-cp38-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 9f253b1666a5a960eae5a13241acea5cd776f11866a7bc09d5953e1683d6bc3f
MD5 549822141905621fd439f20f8887066c
BLAKE2b-256 223e267b0d62b214330d171dc79eefd1d2c552b74f03f1d86d48c002ea8dc2b3

File details

Details for the file tokenizers-0.0.7-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 775.0 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 da58fd46f9f46a2812f4a4f886b3dec4baa0f64c46f2792a354f659efcc745a2
MD5 ebbb53466f03b679b481a48248d7665c
BLAKE2b-256 a69bd0622be89041fb2b1f115fd4ad69ead75280128c116b23806d860611f9c8

File details

Details for the file tokenizers-0.0.7-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 4.6 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 5832a856c3d0e3f08f526d2c825ae1d733e931d40372612d6a0b31829b272dca
MD5 ce355e64d8d1f065e9bd66fb740169e3
BLAKE2b-256 06406aa6103718f927273fe94df66d1dc4f0771c7afbac573e16c17f2814607e

File details

Details for the file tokenizers-0.0.7-cp37-cp37m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp37-cp37m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 849.7 kB
  • Tags: CPython 3.7m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 662cea8038652353960704576c565506a9a9d78d8d8ce409d235e9799fe9e6d6
MD5 4dfee56363a6c21273d2002011a22e1e
BLAKE2b-256 9ddff1e2164cd87f6fdf45d5279886c4cd0733345516ad4c15a87a33fdcaea96

File details

Details for the file tokenizers-0.0.7-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 774.3 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 2a67858e2d765c7a414dfcd484d699ea18a36e1432a56aedc1dc32c307480473
MD5 d66a3714c7b9203aa6b3eacd1accc6bd
BLAKE2b-256 ee7bac6ac6e101294cd862c7df3c95c2c968da2230f7c6d69a350209cd9b11f4

File details

Details for the file tokenizers-0.0.7-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 3.1 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 3518c00e395b436edf874a68b1f74d819283dece070c5ec57d811d9c62e13d0f
MD5 ce4425657607f533c913603647ca71d7
BLAKE2b-256 f85d35429295b4f6fa8c7a5c9c2d15a8f3107e7a5112388d9047cb9bcc1d0e92

File details

Details for the file tokenizers-0.0.7-cp36-cp36m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp36-cp36m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 849.7 kB
  • Tags: CPython 3.6m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 cd976ca6cf69cce6b15429d6188963509f2a39f609050607b170015a0b02049f
MD5 8dc5649268139e71d4f1c6be66c570ca
BLAKE2b-256 f8f19b05494ddecfc7d885806fbf021cf9546ebc7d2e2e27916aa56eb90bf8ff

File details

Details for the file tokenizers-0.0.7-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 774.5 kB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 5be48cc7d889fd7f8f66066273669f1df724410de06114c94bb457cb88cc913f
MD5 e846d4ff68e96c84446be49c69f12201
BLAKE2b-256 0827a6005cad6acbe053b836edaceee14bfbe6e49a8d48203853083b7d7cda1b

File details

Details for the file tokenizers-0.0.7-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 ae5960425b9422546bdecb755f43956cc828bbbf6dcd5bcec7a7153080e6c160
MD5 aca1e61022eea1f1006a9cb38c53b8b7
BLAKE2b-256 a9a90d9401615d29b066cab3403e7d8395e407c7c69c191bc582e848342fa7ed

File details

Details for the file tokenizers-0.0.7-cp35-cp35m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: tokenizers-0.0.7-cp35-cp35m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 849.7 kB
  • Tags: CPython 3.5m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for tokenizers-0.0.7-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 ff6f0b26e5dedfb86c732fbd6aaa8368cfc1df183ca5064b45b75d1382c4dc0d
MD5 d9ff930784618df8afdea3ecf32f9b80
BLAKE2b-256 8ec75b89b4af6f53770f2e6aa73ee79a1eccfc3cdabce7114a79d27efab6c8d3
