Fast text tokenizer for the CLIP neural network
Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network
Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.
In addition to being usable as a Rust crate, it includes Python bindings built with PyO3, so it can also be used as a native Python module.
In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
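As a rough illustration of how such a speedup ratio can be measured, here is a minimal sketch using Python's standard `timeit` module. It is not the repository's benchmark code: `speedup`, `slow_encode`, and `fast_encode` are hypothetical names, and the two encode functions are stand-ins that you would replace with the `encode` callables of the tokenizers being compared.

```python
import timeit


def speedup(baseline, candidate, text, repeats=5, number=200):
    """Return how many times faster `candidate` is than `baseline` on `text`."""
    # Take the best of several runs to reduce noise from other processes.
    base = min(timeit.repeat(lambda: baseline(text), repeat=repeats, number=number))
    cand = min(timeit.repeat(lambda: candidate(text), repeat=repeats, number=number))
    return base / cand


# Hypothetical stand-ins, not real tokenizers: both return one number per word,
# but slow_encode does 20x redundant work before discarding it.
def slow_encode(text):
    return [len(word) for word in text.lower().split() for _ in range(20)][::20]


def fast_encode(text):
    return [len(word) for word in text.lower().split()]


ratio = speedup(slow_encode, fast_encode, "A person riding a motorcycle")
print(f"candidate is ~{ratio:.1f}x faster")
```

The measured ratio depends on the machine, the input text, and whether caching is enabled in either implementation, so treat any single number as indicative only.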
Using the library
Rust
[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
Python (>= 3.9)
pip install instant-clip-tokenizer
Using the library requires numpy >= 1.16.0 installed in your Python environment (e.g., via pip install numpy).
Examples
use instant_clip_tokenizer::{Token, Tokenizer};

fn main() {
    let tokenizer = Tokenizer::new();
    // `encode` appends the tokens for the input text to the given vector.
    let mut tokens = Vec::new();
    tokenizer.encode("A person riding a motorcycle", &mut tokens);
    // Convert the tokens to their numeric ids for printing.
    let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
    println!("{:?}", tokens);
    // -> [320, 2533, 6765, 320, 10297]
}
import instant_clip_tokenizer
tokenizer = instant_clip_tokenizer.Tokenizer()
tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)
# -> [320, 2533, 6765, 320, 10297]
batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)
# -> [[49406 320 2533 6765 49407]
# [49406 1883 997 49407 0]]
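The batch output above suggests a simple layout: each row starts with token 49406, ends with token 49407, is truncated to fit `context_length`, and is right-padded with 0. The following pure-Python sketch reproduces that layout from already-encoded token lists. It is an illustration inferred from the example output, not the library's implementation; `pad_batch` and the three constant names are hypothetical.

```python
START_TOKEN = 49406  # first id of each row in the batch output above
END_TOKEN = 49407    # id that closes each sequence in the batch output above
PAD_TOKEN = 0        # filler id seen after the end token


def pad_batch(token_lists, context_length):
    """Wrap each token list in start/end markers and fit it to context_length."""
    if context_length < 2:
        raise ValueError("context_length must leave room for start and end tokens")
    rows = []
    for tokens in token_lists:
        # Keep at most context_length - 2 tokens so both markers always fit.
        body = list(tokens)[: context_length - 2]
        row = [START_TOKEN, *body, END_TOKEN]
        # Right-pad short rows so every row has the same length.
        row += [PAD_TOKEN] * (context_length - len(row))
        rows.append(row)
    return rows


batch = pad_batch([[320, 2533, 6765, 320, 10297], [1883, 997]], context_length=5)
print(batch)
# -> [[49406, 320, 2533, 6765, 49407], [49406, 1883, 997, 49407, 0]]
```

Unlike `tokenize_batch`, this sketch returns plain nested lists rather than a numpy array and assumes the text has already been encoded.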
Testing
To run the tests, use:
cargo test --all-features
You can also test the Python bindings with:
make test-python
Acknowledgements
The vocabulary file and original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI (MIT-License).