⏳ tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("text-davinci-003")

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Example code using tiktoken can be found in the OpenAI Cookbook.
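For readers new to BPE (byte pair encoding), the following toy sketch illustrates the general idea: start from single bytes and repeatedly merge the adjacent pair whose merged form has the lowest rank. This is not tiktoken's actual implementation; the function name `bpe_encode` and the tiny `toy_ranks` table are made up for illustration.

```python
from __future__ import annotations


def bpe_encode(mergeable_ranks: dict[bytes, int], piece: bytes) -> list[int]:
    """Toy BPE: greedily merge the lowest-ranked adjacent pair until none remain."""
    parts = [bytes([b]) for b in piece]
    while True:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            # Keep the leftmost pair with the lowest rank.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [mergeable_ranks[p] for p in parts]


# Hypothetical vocabulary: three single bytes plus two merges.
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
tokens = bpe_encode(toy_ranks, b"abcab")

# Decoding is a lookup in the inverted table followed by concatenation.
id_to_bytes = {rank: token for token, rank in toy_ranks.items()}
decoded = b"".join(id_to_bytes[t] for t in tokens)
```

Because decoding only concatenates the bytes behind each token id, the round trip is lossless, which is the property the `assert` in the usage example above checks for the real tokeniser.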

Performance

tiktoken is 3-6x faster than a comparable open source tokeniser:

[Figure: performance comparison chart]

Performance was measured on 1 GB of text using the GPT-2 tokeniser, comparing GPT2TokenizerFast (tokenizers==0.13.2, transformers==4.24.0) with tiktoken==0.2.0.
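The benchmark harness behind these numbers is not shown here. As a rough sketch of how one might measure tokeniser throughput on a local corpus, the hypothetical helper below times any callable that maps a string to a list of token ids. The names `throughput_bytes_per_second` and `byte_level_encode` are assumptions for illustration, and the stand-in encoder simply returns UTF-8 bytes.

```python
import time
from typing import Callable, Iterable, List


def throughput_bytes_per_second(
    encode: Callable[[str], List[int]], documents: Iterable[str]
) -> float:
    """Return UTF-8 bytes processed per second of wall-clock time."""
    docs = list(documents)
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    start = time.perf_counter()
    for d in docs:
        encode(d)
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / elapsed


def byte_level_encode(text: str) -> List[int]:
    """Stand-in encoder so the sketch runs without any tokeniser installed."""
    return list(text.encode("utf-8"))


rate = throughput_bytes_per_second(byte_level_encode, ["hello world"] * 1000)
```

To compare real tokenisers, pass each one's encode function (for example `enc.encode` from the snippet above) over the same documents and compare the resulting rates.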

Getting help

Please post questions in the issue tracker.

If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.

Extending tiktoken

You may wish to extend tiktoken to support new encodings. There are two ways to do this.

Option 1: Create your Encoding object exactly the way you want and simply pass it around.

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
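When adding special tokens this way, it may be worth checking that the new ids do not collide with each other or with ordinary token ids. The helper below is a hypothetical sketch, not part of tiktoken; `check_special_tokens` and the toy tables are assumptions, with a 256-entry byte table standing in for the real cl100k_base ranks.

```python
from __future__ import annotations


def check_special_tokens(
    mergeable_ranks: dict[bytes, int], special_tokens: dict[str, int]
) -> None:
    """Raise ValueError if special token ids clash with each other or with ordinary ids."""
    special_ids = list(special_tokens.values())
    if len(set(special_ids)) != len(special_ids):
        raise ValueError("two special tokens share the same id")
    clashes = set(special_ids) & set(mergeable_ranks.values())
    if clashes:
        raise ValueError(f"special token ids reuse ordinary token ids: {sorted(clashes)}")


# Toy stand-ins for an existing encoding's data (assumed values, not cl100k_base's).
toy_ranks = {bytes([i]): i for i in range(256)}
toy_special = {"<|endoftext|>": 256}

# Extend the special tokens the same way the example above does.
extended = {**toy_special, "<|im_start|>": 257, "<|im_end|>": 258}
check_special_tokens(toy_ranks, extended)  # passes silently
```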

Option 2: Use the tiktoken_ext plugin mechanism to register your Encoding objects with tiktoken.

This is only useful if you need tiktoken.get_encoding to find your encoding; otherwise, prefer option 1.

To do this, you'll need to create a namespace package under tiktoken_ext.

Lay out your project like this, making sure to omit the tiktoken_ext/__init__.py file:

my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py

my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken.Encoding to construct that encoding. For an example, see tiktoken_ext/openai_public.py. For precise details, see tiktoken/registry.py.
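As a minimal sketch of what such a module might contain, the example below registers a single byte-level encoding. The encoding name `my_byte_level`, the split pattern, and the token ids are all made up for illustration; the dictionary keys mirror the keyword arguments passed to tiktoken.Encoding in the earlier example.

```python
# tiktoken_ext/my_encodings.py (hypothetical contents)


def my_byte_level():
    """Return keyword arguments for constructing a toy byte-level encoding."""
    # One token per possible byte value, so any input can be encoded.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_byte_level",
        # Simple split into runs of non-whitespace and whitespace (an assumption).
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        # Special token ids must not overlap with the ordinary ids 0..255.
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to a zero-argument constructor function.
ENCODING_CONSTRUCTORS = {
    "my_byte_level": my_byte_level,
}
```

Returning arguments from a function rather than a ready-made object means the data is only built when the constructor is called.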

Your setup.py should look something like this:

from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_ext*']),
    install_requires=["tiktoken"],
    ...
)

Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.

Download files

Download the file for your platform.

Source Distribution

  • tiktoken-0.3.0.tar.gz (24.6 kB): Source

Built Distributions

  • tiktoken-0.3.0-cp311-cp311-win_amd64.whl (581.1 kB): CPython 3.11, Windows x86-64
  • tiktoken-0.3.0-cp311-cp311-musllinux_1_1_x86_64.whl (1.7 MB): CPython 3.11, musllinux: musl 1.1+ x86-64
  • tiktoken-0.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • tiktoken-0.3.0-cp311-cp311-macosx_11_0_arm64.whl (702.4 kB): CPython 3.11, macOS 11.0+ ARM64
  • tiktoken-0.3.0-cp311-cp311-macosx_10_9_x86_64.whl (735.2 kB): CPython 3.11, macOS 10.9+ x86-64
  • tiktoken-0.3.0-cp310-cp310-win_amd64.whl (581.1 kB): CPython 3.10, Windows x86-64
  • tiktoken-0.3.0-cp310-cp310-musllinux_1_1_x86_64.whl (1.7 MB): CPython 3.10, musllinux: musl 1.1+ x86-64
  • tiktoken-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB): CPython 3.10, manylinux: glibc 2.17+ x86-64
  • tiktoken-0.3.0-cp310-cp310-macosx_11_0_arm64.whl (702.4 kB): CPython 3.10, macOS 11.0+ ARM64
  • tiktoken-0.3.0-cp310-cp310-macosx_10_9_x86_64.whl (735.2 kB): CPython 3.10, macOS 10.9+ x86-64
  • tiktoken-0.3.0-cp39-cp39-win_amd64.whl (581.4 kB): CPython 3.9, Windows x86-64
  • tiktoken-0.3.0-cp39-cp39-musllinux_1_1_x86_64.whl (1.7 MB): CPython 3.9, musllinux: musl 1.1+ x86-64
  • tiktoken-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB): CPython 3.9, manylinux: glibc 2.17+ x86-64
  • tiktoken-0.3.0-cp39-cp39-macosx_11_0_arm64.whl (702.9 kB): CPython 3.9, macOS 11.0+ ARM64
  • tiktoken-0.3.0-cp39-cp39-macosx_10_9_x86_64.whl (735.4 kB): CPython 3.9, macOS 10.9+ x86-64
  • tiktoken-0.3.0-cp38-cp38-win_amd64.whl (581.4 kB): CPython 3.8, Windows x86-64
  • tiktoken-0.3.0-cp38-cp38-musllinux_1_1_x86_64.whl (1.7 MB): CPython 3.8, musllinux: musl 1.1+ x86-64
  • tiktoken-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB): CPython 3.8, manylinux: glibc 2.17+ x86-64
  • tiktoken-0.3.0-cp38-cp38-macosx_11_0_arm64.whl (702.7 kB): CPython 3.8, macOS 11.0+ ARM64
  • tiktoken-0.3.0-cp38-cp38-macosx_10_9_x86_64.whl (734.9 kB): CPython 3.8, macOS 10.9+ x86-64