
OpenVINO Tokenizers

OpenVINO Tokenizers adds text processing operations to OpenVINO.

Features

  • Perform tokenization and detokenization without third-party dependencies
  • Convert a HuggingFace tokenizer into OpenVINO tokenizer and detokenizer models
  • Combine OpenVINO models into a single model
  • Add a greedy decoding pipeline to a text generation model

Installation

(Recommended) Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate
 # or
conda create --name openvino_tokenizers
conda activate openvino_tokenizers

Minimal Installation

Use minimal installation when you have a converted OpenVINO tokenizer:

pip install openvino-tokenizers
 # or
conda install -c conda-forge openvino openvino-tokenizers
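
If you already have a converted tokenizer on disk, a quick sanity check confirms the extension is registered (a minimal sketch; assumes openvino_tokenizer.xml exists):

import openvino_tokenizers  # registers the tokenizer operations in OpenVINO
from openvino import Core

core = Core()
# read_model fails if the tokenizer operations are not registered
ov_tokenizer = core.read_model("openvino_tokenizer.xml")
print(ov_tokenizer.outputs)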

Convert Tokenizers Installation

If you want to convert HuggingFace tokenizers into OpenVINO tokenizers:

pip install openvino-tokenizers[transformers]
 # or
conda install -c conda-forge openvino openvino-tokenizers && pip install transformers[sentencepiece] tiktoken

Install Pre-release Version

Use openvino-tokenizers[transformers] to install tokenizers conversion dependencies.

pip install --pre -U openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

Build and Install from Source

Install the OpenVINO archive distribution. Use --no-deps to avoid installing OpenVINO from PyPI.

source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install --no-deps .

This command is equivalent to the minimal installation. Install the tokenizers conversion dependencies if needed:

pip install transformers[sentencepiece] tiktoken

:warning: The latest commit of OpenVINO Tokenizers might rely on features that are not present in the released OpenVINO version. Use a nightly build of OpenVINO, or build OpenVINO Tokenizers from a release branch, if you have issues with the build process.

Build and Install for Development

source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all]
# verify installation by running tests
cd tests/
pytest .

C++ Installation

You can use converted tokenizers in C++ pipelines with prebuilt binaries.

  1. Download the OpenVINO archive distribution for your OS from here and extract the archive.
  2. Download the OpenVINO Tokenizers prebuilt libraries from here. To ensure compatibility, the first three numbers of the OpenVINO Tokenizers version must match the OpenVINO version, and the OS must match as well.
  3. Extract OpenVINO Tokenizers archive into OpenVINO installation directory:
    • Windows: <openvino_dir>\runtime\bin\intel64\Release\
    • MacOS_x86: <openvino_dir>/runtime/lib/intel64/Release/
    • MacOS_arm64: <openvino_dir>/runtime/lib/arm64/Release/
    • Linux_x86: <openvino_dir>/runtime/lib/intel64/
    • Linux_arm64: <openvino_dir>/runtime/lib/aarch64/

After that you can add binary extension in the code with:

  • core.add_extension("openvino_tokenizers.dll") for Windows
  • core.add_extension("libopenvino_tokenizers.dylib") for MacOS
  • core.add_extension("libopenvino_tokenizers.so") for Linux

and read/compile converted (de)tokenizer models. If you use version 2023.3.0.0, the binary extension file is called (lib)user_ov_extension.(dll/dylib/so).
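
The same calls are available from Python if you prefer to load the binary extension explicitly instead of importing the openvino_tokenizers package (a sketch; replace the path with the location of the extracted library for your OS):

from openvino import Core

core = Core()
# use openvino_tokenizers.dll on Windows or libopenvino_tokenizers.dylib on MacOS
core.add_extension("path/to/libopenvino_tokenizers.so")
compiled_tokenizer = core.compile_model("openvino_tokenizer.xml")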

Usage

:warning: OpenVINO Tokenizers can be inferred on a CPU device only.

Convert HuggingFace tokenizer

OpenVINO Tokenizers ships with a CLI tool that can convert tokenizers from the HuggingFace Hub or HuggingFace tokenizers saved on disk:

convert_tokenizer codellama/CodeLlama-7b-hf --with-detokenizer -o output_dir
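
The resulting files can then be read back with the extension registered (a minimal sketch; openvino_tokenizer.xml and openvino_detokenizer.xml are the default output names assumed here):

import openvino_tokenizers  # registers the tokenizer operations
from openvino import Core

core = Core()
ov_tokenizer = core.read_model("output_dir/openvino_tokenizer.xml")
ov_detokenizer = core.read_model("output_dir/openvino_detokenizer.xml")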

There is also a convert_tokenizer function that can convert a tokenizer Python object:

import numpy as np
from transformers import AutoTokenizer
from openvino import compile_model, save_model
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ov_tokenizer = convert_tokenizer(hf_tokenizer)

compiled_tokenizer = compile_model(ov_tokenizer)
text_input = ["Test string"]

hf_output = hf_tokenizer(text_input, return_tensors="np")
ov_output = compiled_tokenizer(text_input)

for output_name in hf_output:
    print(f"OpenVINO {output_name} = {ov_output[output_name]}")
    print(f"HuggingFace {output_name} = {hf_output[output_name]}")
# OpenVINO input_ids = [[ 101 3231 5164  102]]
# HuggingFace input_ids = [[ 101 3231 5164  102]]
# OpenVINO token_type_ids = [[0 0 0 0]]
# HuggingFace token_type_ids = [[0 0 0 0]]
# OpenVINO attention_mask = [[1 1 1 1]]
# HuggingFace attention_mask = [[1 1 1 1]]

# save tokenizer for later use
save_model(ov_tokenizer, "openvino_tokenizer.xml")

loaded_tokenizer = compile_model("openvino_tokenizer.xml")
loaded_ov_output = loaded_tokenizer(text_input)
for output_name in hf_output:
    assert np.all(loaded_ov_output[output_name] == ov_output[output_name])

Connect Tokenizer to a Model

To infer and convert the original model, install torch or torch-cpu into the virtual environment.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from openvino import compile_model, convert_model
from openvino_tokenizers import convert_tokenizer, connect_models

checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

text_input = ["Free money!!!"]
hf_input = hf_tokenizer(text_input, return_tensors="pt")
hf_output = hf_model(**hf_input)

ov_tokenizer = convert_tokenizer(hf_tokenizer)
ov_model = convert_model(hf_model, example_input=hf_input.data)
combined_model = connect_models(ov_tokenizer, ov_model)
compiled_combined_model = compile_model(combined_model)

openvino_output = compiled_combined_model(text_input)

print(f"OpenVINO logits: {openvino_output['logits']}")
# OpenVINO logits: [[ 1.2007061 -1.4698029]]
print(f"HuggingFace logits {hf_output.logits}")
# HuggingFace logits tensor([[ 1.2007, -1.4698]], grad_fn=<AddmmBackward0>)
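
The combined model can be saved like any other OpenVINO model and reused later; import openvino_tokenizers before reading it so the tokenizer operations are registered (a minimal sketch; the file name is an assumption):

import openvino_tokenizers  # noqa: F401 -- registers the tokenizer operations
from openvino import compile_model, save_model

save_model(combined_model, "combined_model.xml")
reloaded_model = compile_model("combined_model.xml")
print(reloaded_model(text_input)["logits"])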

Use Extension With Converted (De)Tokenizer or Model With (De)Tokenizer

Importing openvino_tokenizers adds all tokenizer-related operations to OpenVINO, after which you can work with saved tokenizers and detokenizers.

import numpy as np
import openvino_tokenizers
from openvino import Core

core = Core()

# detokenizer from codellama sentencepiece model
compiled_detokenizer = core.compile_model("detokenizer.xml")

token_ids = np.random.randint(100, 1000, size=(3, 5))
openvino_output = compiled_detokenizer(token_ids)

print(openvino_output["string_output"])
# ['sc�ouition�', 'intvenord hasient', 'g shouldwer M more']

Text generation pipeline

import numpy as np
from openvino import compile_model, convert_model
from openvino_tokenizers import add_greedy_decoding, convert_tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer


model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint, use_cache=False)

# convert hf tokenizer
text_input = ["Quick brown fox jumped "]
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True, skip_special_tokens=True)
compiled_tokenizer = compile_model(ov_tokenizer)

# transform input text into tokens
ov_input = compiled_tokenizer(text_input)
hf_input = hf_tokenizer(text_input, return_tensors="pt")

# convert Pytorch model to OpenVINO IR and add greedy decoding pipeline to it
ov_model = convert_model(hf_model, example_input=hf_input.data)
ov_model_with_greedy_decoding = add_greedy_decoding(ov_model)
compiled_model = compile_model(ov_model_with_greedy_decoding)

# generate new tokens
new_tokens_size = 10
prompt_size = ov_input["input_ids"].shape[-1]
input_dict = {
    output.any_name: np.hstack([tensor, np.zeros(shape=(1, new_tokens_size), dtype=np.int_)])
    for output, tensor in ov_input.items()
}
for idx in range(prompt_size, prompt_size + new_tokens_size):
    output = compiled_model(input_dict)["token_ids"]
    input_dict["input_ids"][:, idx] = output[:, idx - 1]
    input_dict["attention_mask"][:, idx] = 1
ov_token_ids = input_dict["input_ids"]

hf_token_ids = hf_model.generate(
    **hf_input,
    min_new_tokens=new_tokens_size,
    max_new_tokens=new_tokens_size,
    temperature=0,  # greedy decoding
)

# decode model output
compiled_detokenizer = compile_model(ov_detokenizer)
ov_output = compiled_detokenizer(ov_token_ids)["string_output"]
hf_output = hf_tokenizer.batch_decode(hf_token_ids, skip_special_tokens=True)
print(f"OpenVINO output string: `{ov_output}`")
# OpenVINO output string: `['<s> Quick brown fox was walking through the forest. He was looking for something']`
print(f"HuggingFace output string: `{hf_output}`")
# HuggingFace output string: `['Quick brown fox was walking through the forest. He was looking for something']`

TensorFlow Text Integration

OpenVINO Tokenizers includes converters for certain TensorFlow Text operations. Currently, only the MUSE model is supported. Here is an example of model conversion and inference:

import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # register tf text ops
from openvino import convert_model, compile_model
import openvino_tokenizers  # register ov tokenizer ops and translators


sentences = ["dog",  "I cuccioli sono carini.", "私は犬と一緒にビーチを散歩するのが好きです"]
tf_embed = hub.load(
    "https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/"
    "TensorFlow2/variations/multilingual/versions/2"
)
# convert model that uses Sentencepiece tokenizer op from TF Text
ov_model = convert_model(tf_embed)
ov_embed = compile_model(ov_model, "CPU")

ov_result = ov_embed(sentences)[ov_embed.output()]
tf_result = tf_embed(sentences)

assert np.all(np.isclose(ov_result, tf_result, atol=1e-4))
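
After conversion, the model no longer needs TensorFlow at inference time; saving and reloading it only requires importing openvino_tokenizers first (a sketch; the file name is an assumption):

from openvino import save_model
save_model(ov_model, "muse.xml")

# in a new process, without TensorFlow:
import openvino_tokenizers  # noqa: F401 -- registers the tokenizer operations
from openvino import compile_model
ov_embed = compile_model("muse.xml", "CPU")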

RWKV Tokenizer

from urllib.request import urlopen

from openvino import compile_model
from openvino_tokenizers import build_rwkv_tokenizer


rwkv_vocab_url = (
    "https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/tokenizer/rwkv_vocab_v20230424.txt"
)

with urlopen(rwkv_vocab_url) as vocab_file:
    vocab = map(bytes.decode, vocab_file)
    tokenizer, detokenizer = build_rwkv_tokenizer(vocab)

tokenizer, detokenizer = compile_model(tokenizer), compile_model(detokenizer)

print(tokenized := tokenizer(["Test string"])["input_ids"])  # [[24235 47429]]
print(detokenizer(tokenized)["string_output"])  # ['Test string']

Supported Tokenizer Types

Huggingface Tokenizer Type    Tokenizer Model Type    Tokenizer    Detokenizer
Fast                          WordPiece               ✅           ❌
Fast                          BPE                     ✅           ✅
Fast                          Unigram                 ❌           ❌
Legacy                        SentencePiece .model    ✅           ✅
Custom                        tiktoken                ✅           ✅
RWKV                          Trie                    ✅           ✅

Test Results

This report is autogenerated and includes tokenizers and detokenizers tests. The Output Matched, % column shows the percentage of test strings for which the results of OpenVINO and Huggingface Tokenizers are the same. To update the report, run pytest --update_readme tokenizers_test.py in the tests directory.

Output Match by Tokenizer Type

Tokenizer Type    Output Matched, %    Number of Tests
BPE               96.20                4557
SentencePiece     79.06                4340
Tiktoken          97.71                218
WordPiece         94.97                1053

Output Match by Model

Tokenizer Type    Model                                              Output Matched, %    Number of Tests
BPE               EleutherAI/gpt-j-6b                                98.16                217
BPE               EleutherAI/gpt-neo-125m                            98.16                217
BPE               EleutherAI/gpt-neox-20b                            96.31                217
BPE               EleutherAI/pythia-12b-deduped                      96.31                217
BPE               KoboldAI/fairseq-dense-13B                         98.16                217
BPE               NousResearch/Meta-Llama-3-8B-Instruct              97.24                217
BPE               Salesforce/codegen-16B-multi                       96.31                217
BPE               ai-forever/rugpt3large_based_on_gpt2               96.31                217
BPE               bigscience/bloom                                   99.08                217
BPE               databricks/dolly-v2-3b                             96.31                217
BPE               facebook/bart-large-mnli                           98.16                217
BPE               facebook/galactica-120b                            97.24                217
BPE               facebook/opt-66b                                   98.16                217
BPE               gpt2                                               98.16                217
BPE               laion/CLIP-ViT-bigG-14-laion2B-39B-b160k           70.97                217
BPE               microsoft/deberta-base                             98.16                217
BPE               roberta-base                                       98.16                217
BPE               sentence-transformers/all-roberta-large-v1         98.16                217
BPE               stabilityai/stablecode-completion-alpha-3b-4k      97.24                217
BPE               stabilityai/stablelm-2-1_6b                        97.24                217
BPE               stabilityai/stablelm-tuned-alpha-7b                96.31                217
SentencePiece     NousResearch/Llama-2-13b-hf                        100.00               217
SentencePiece     NousResearch/Llama-2-13b-hf_slow                   100.00               217
SentencePiece     THUDM/chatglm2-6b                                  100.00               217
SentencePiece     THUDM/chatglm2-6b_slow                             100.00               217
SentencePiece     THUDM/chatglm3-6b                                  31.80                217
SentencePiece     THUDM/chatglm3-6b_slow                             31.80                217
SentencePiece     camembert-base                                     3.23                 217
SentencePiece     camembert-base_slow                                77.42                217
SentencePiece     codellama/CodeLlama-7b-hf                          100.00               217
SentencePiece     codellama/CodeLlama-7b-hf_slow                     100.00               217
SentencePiece     facebook/musicgen-small                            82.49                217
SentencePiece     facebook/musicgen-small_slow                       77.42                217
SentencePiece     microsoft/deberta-v3-base                          92.63                217
SentencePiece     microsoft/deberta-v3-base_slow                     100.00               217
SentencePiece     t5-base                                            84.33                217
SentencePiece     t5-base_slow                                       79.26                217
SentencePiece     xlm-roberta-base                                   96.31                217
SentencePiece     xlm-roberta-base_slow                              96.31                217
SentencePiece     xlnet-base-cased                                   67.28                217
SentencePiece     xlnet-base-cased_slow                              60.83                217
Tiktoken          Qwen/Qwen-14B-Chat                                 98.17                109
Tiktoken          Salesforce/xgen-7b-8k-base                         97.25                109
WordPiece         ProsusAI/finbert                                   97.53                81
WordPiece         bert-base-multilingual-cased                       97.53                81
WordPiece         bert-base-uncased                                  97.53                81
WordPiece         cointegrated/rubert-tiny2                          91.36                81
WordPiece         distilbert-base-uncased-finetuned-sst-2-english    97.53                81
WordPiece         google/electra-base-discriminator                  97.53                81
WordPiece         google/mobilebert-uncased                          97.53                81
WordPiece         jhgan/ko-sbert-sts                                 87.65                81
WordPiece         prajjwal1/bert-mini                                97.53                81
WordPiece         rajiv003/ernie-finetuned-qqp                       97.53                81
WordPiece         rasa/LaBSE                                         90.12                81
WordPiece         sentence-transformers/all-MiniLM-L6-v2             87.65                81
WordPiece         squeezebert/squeezebert-uncased                    97.53                81

Recreating Tokenizers From Tests

For some tokenizers, you need to select certain conversion settings so that their output matches the Huggingface tokenizers (an example follows the list):

  • The THUDM/chatglm2-6b detokenizer always skips special tokens; use skip_special_tokens=True during conversion
  • The THUDM/chatglm3-6b detokenizer doesn't skip special tokens; use skip_special_tokens=False during conversion
  • All tested tiktoken-based detokenizers leave extra spaces; use clean_up_tokenization_spaces=False during conversion
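
As an illustration, a conversion call with these settings might look like this (a sketch; chatglm3 requires trust_remote_code=True to load):

from transformers import AutoTokenizer
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
ov_tokenizer, ov_detokenizer = convert_tokenizer(
    hf_tokenizer,
    with_detokenizer=True,
    skip_special_tokens=False,  # chatglm3 does not skip special tokens
)
# for tiktoken-based tokenizers, pass clean_up_tokenization_spaces=False instead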
