Fast and Customizable Tokenizers

Tokenizers

A fast and easy-to-use implementation of today's most used tokenizers.

This API is still being stabilized. We may introduce breaking changes frequently in the coming days and weeks, so use it at your own risk.

Installation

With pip:

pip install tokenizers

From source:

To use this method, you need to have the Rust nightly toolchain installed.

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01

Once Rust is installed and set to the right toolchain, you can do the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release

Usage

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog",
])
print(encoded)
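
To give a rough idea of what the vocab and merges files contribute, here is a toy sketch in plain Python of how a BPE model might apply merge rules to a single word. It is not the library's implementation (which lives in Rust), and it skips byte-level pre-tokenization entirely. The function `apply_merges` and the data in `toy_merges` and `toy_vocab` are hypothetical, made up for illustration.

```python
# Toy illustration only: this is NOT how the tokenizers library is implemented.
from typing import Dict, List, Tuple


def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply merge rules in priority order."""
    # A lower rank means the rule appears earlier in the list and wins first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge rule applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical stand-ins for the contents of merges.txt and vocab.json.
toy_merges: List[Tuple[str, str]] = [("m", "a"), ("ma", "g"), ("i", "c"), ("mag", "ic")]
toy_vocab: Dict[str, int] = {"magic": 0, "mag": 1, "ic": 2, "ma": 3,
                             "m": 4, "a": 5, "g": 6, "i": 7, "c": 8}

tokens = apply_merges("magic", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

In this sketch the merge list decides how characters are glued together, and the vocabulary maps each resulting token to an integer id.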

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt",
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
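
For intuition about the two trainer parameters, the following toy sketch trains a tiny BPE vocabulary in plain Python. It assumes that `vocab_size` caps the total number of tokens and that `min_frequency` is the minimum count a pair needs before it is merged; the library's actual counting and tie-breaking rules may differ. The function `train_toy_bpe` is hypothetical and is not part of `tokenizers`.

```python
# Toy illustration only: this is NOT the library's BpeTrainer.
from collections import Counter
from typing import List, Tuple


def train_toy_bpe(words: List[str], vocab_size: int,
                  min_frequency: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Learn merge rules until the vocabulary is full or pairs become too rare."""
    # Each distinct word is stored as a tuple of symbols with its count.
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for w in words for ch in w})
    merges: List[Tuple[str, str]] = []
    while len(vocab) < vocab_size:
        pair_counts: Counter = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent pair first; ties are broken alphabetically (an assumption).
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)
        # Rewrite every word with the chosen pair merged.
        new_corpus: Counter = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(token)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges


vocab, merges = train_toy_bpe(["low", "low", "lower", "lowest"],
                              vocab_size=10, min_frequency=2)
print(merges)
```

Raising `min_frequency` in this sketch stops training earlier, while raising `vocab_size` allows more merges as long as frequent pairs remain.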
