
Package for applying ao techniques to GPU models

Project description

torchao: PyTorch Architecture Optimization

Introduction | Inference | Training | Composability | Custom Kernels | Alpha Features | Installation | Integrations | Videos | License

Introduction

torchao: PyTorch library for custom data types & optimizations. Quantize and sparsify weights, gradients, optimizers & activations for inference and training.

From the team that brought you the fast series

  • 8x speedups for Image segmentation models with sam-fast (9.5x with int8 dynamic quantization + 2:4 sparsity)
  • 10x speedups for Language models with gpt-fast
  • 3x speedup for Diffusion models with sd-fast

torchao works out of the box with torch.compile() and FSDP2 on most PyTorch models on Hugging Face.

Inference

Post Training Quantization

Quantizing and sparsifying your models is a one-liner that should work on any model with an nn.Linear, including your favorite Hugging Face model. You can find more comprehensive usage instructions here, sparsity instructions here, and a Hugging Face inference example here.

For inference, we have the option to

  1. Quantize only the weights: works best for memory-bound models
  2. Quantize the weights and activations: works best for compute-bound models
  3. Quantize the activations and weights and sparsify the weights

from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int4_weight,
    int8_dynamic_activation_int8_weight,
    int8_dynamic_activation_int8_semi_sparse_weight,
    int4_weight_only,
    int8_weight_only
)
quantize_(m, int4_weight_only())

For gpt-fast, int4_weight_only() is the best option at bs=1, as it roughly doubles the tok/s and reduces VRAM requirements by about 65% over a baseline compiled with torch.compile.
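To make the memory savings concrete, here is a pure-Python sketch of the symmetric round-to-nearest numerics that groupwise weight-only quantization schemes are typically built on. This is illustrative only, not torchao's implementation (which operates on packed tensor layouts):

```python
def quantize_group(weights, n_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the int codes and scale."""
    return [v * scale for v in q]

group = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_group(group)
approx = dequantize_group(q, scale)
# Each reconstructed value is within half a quantization step of the original.
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, group))
```

Storing one int4 code per weight plus one scale per group is where the ~65% VRAM reduction over 16-bit weights comes from.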

If you see slowdowns with any of these techniques, or you're unsure which option to use, consider using autoquant, which will automatically profile layers and pick the best way to quantize each layer.

model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

We also provide a developer-facing API so you can implement your own quantization algorithms; see the excellent HQQ algorithm for a motivating example.

KV Cache Quantization

We've added KV cache quantization and other features to enable long-context-length (and necessarily memory-efficient) inference.

In practice, these features alongside int4 weight-only quantization allow us to reduce peak memory by ~55%, meaning we can run Llama3.1-8B inference with a 130k context length using only 18.9 GB of peak memory. More details can be found here.
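To see why the KV cache dominates at long context, here is a back-of-the-envelope sketch in plain Python using the published Llama3.1-8B dimensions (32 layers, 8 KV heads, head dimension 128); the arithmetic is illustrative and ignores paging and other runtime overheads:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elt=2):
    """K and V tensors for every layer: 2 * layers * kv_heads * head_dim * seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

ctx = 130_000
fp16 = kv_cache_bytes(ctx)                   # 16-bit cache
int8 = kv_cache_bytes(ctx, bytes_per_elt=1)  # 8-bit quantized cache
print(f"fp16 KV cache: {fp16 / 1e9:.1f} GB, int8: {int8 / 1e9:.1f} GB")
# prints: fp16 KV cache: 17.0 GB, int8: 8.5 GB
```

At 130k tokens the unquantized cache alone would exceed the 18.9 GB peak quoted above, which is why quantizing it matters.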

Quantization Aware Training

Post-training quantization (PTQ) can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering 96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext for Llama3. A full recipe is available here.

from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics
model = qat_quantizer.prepare(model)

# Run Training...

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
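Conceptually, each "fake quantize" op inserted by prepare() rounds a value to its nearest quantization level while keeping it in floating point, so the network learns to tolerate quantization error during training. A pure-Python sketch of that numerics (not torchao's implementation, which is tensor-level and differentiable via straight-through estimation):

```python
def fake_quantize(x, scale, qmin=-8, qmax=7):
    """Quantize then immediately dequantize: the value stays a float,
    but only the int4 grid of levels survives -- what QAT simulates."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

vals = [0.31, -0.07, 0.5]
out = [fake_quantize(v, scale=0.1) for v in vals]
# Each value snaps to the nearest multiple of the scale (within half a step).
assert all(abs(o - v) <= 0.05 + 1e-9 for o, v in zip(out, vals))
```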

Training

Float8

torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.

With torch.compile on, current results show throughput speedups of up to 1.5x on LLaMa 3 70B pretraining jobs across 128 H100 GPUs (details).

from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)
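As a hedged sketch of what a module_filter_fn might look like (assuming, as in torchao's float8 examples, that the callable receives the module and its fully qualified name and returns True for modules to convert):

```python
# Hypothetical filter: convert every linear layer to float8 training
# except the final output projection, which is often kept in high precision.
def module_filter_fn(module, fqn):
    return fqn != "output"

assert module_filter_fn(None, "layers.0.attention.wq") is True
assert module_filter_fn(None, "output") is False
```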

And for an end-to-end minimal recipe of pretraining with float8, you can check out torchtitan.

Sparse Training

We've added support for semi-structured 2:4 sparsity with 6% end-to-end speedups on ViT-L. Full blog here

The code change is a one-liner, with the full example available here:

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
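The 2:4 pattern itself is simple: in every contiguous group of four weights, the two with the smallest magnitude are zeroed, which the hardware can then exploit. A pure-Python sketch of the pruning step (illustrative; torchao's version operates on tensors and compressed layouts):

```python
def prune_2_4(row):
    """2:4 semi-structured sparsity: in every group of 4 weights,
    zero out the 2 with the smallest magnitude."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
assert prune_2_4(row) == [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```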

Memory-efficient optimizers

Adam takes 2x as much memory as the model parameters, so we quantize the optimizer state to either 8 or 4 bits, effectively reducing optimizer VRAM requirements by 2x or 4x respectively over an fp16 baseline.

from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8
optim = AdamW8bit(model.parameters()) # replace with AdamW4bit or AdamWFp8 for the 4-bit / fp8 versions

In practice, we are a tiny bit slower than expertly written kernels, but these optimizers were implemented in a few hundred lines of PyTorch code and compiled, so please use them or copy-paste them for your quantized optimizers. Benchmarks here.
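The memory arithmetic behind the 2x/4x claim, as a sketch: Adam tracks two extra state tensors per parameter (exp_avg and exp_avg_sq), so shrinking each state element from 16 to 8 or 4 bits halves or quarters the optimizer footprint. The 8B parameter count below is an illustrative assumption:

```python
def adam_state_bytes(n_params, bits_per_state):
    """Adam keeps exp_avg and exp_avg_sq: two state values per parameter."""
    return 2 * n_params * bits_per_state // 8

n = 8_000_000_000  # e.g. an 8B-parameter model
fp16, int8, int4 = (adam_state_bytes(n, b) for b in (16, 8, 4))
assert fp16 == 2 * int8 == 4 * int4
print(f"fp16: {fp16/1e9:.0f} GB, 8-bit: {int8/1e9:.0f} GB, 4-bit: {int4/1e9:.0f} GB")
# prints: fp16: 32 GB, 8-bit: 16 GB, 4-bit: 8 GB
```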

We also support single-GPU CPU offloading, where both the gradients (same size as the weights) and the optimizer state are efficiently sent to the CPU. This alone can reduce your VRAM requirements by 60%.

from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)
optim.load_state_dict(ckpt["optim"])

Composability

  1. torch.compile: A key design principle for us is composability: any new dtype or layout we provide needs to work with our compiler. It shouldn't matter if the kernels are written in pure PyTorch, CUDA, C++, or Triton - things should just work! So we write the dtype, layout, or bit-packing logic in pure PyTorch and code-generate efficient kernels.
  2. FSDP2: Historically most quantization has been done for inference, but there is now a thriving area of research combining distributed algorithms and quantization.

The best example we have of combining the composability of lower-bit dtypes with compile and FSDP is NF4, which we used to implement the QLoRA algorithm. If you're doing research at the intersection of these areas, we'd love to hear from you.

Custom Kernels

We've added support for authoring and releasing custom ops that do not graph-break with torch.compile(), so if you love writing kernels but hate packaging them to work across all operating systems and CUDA versions, we'd love to accept contributions for your custom ops. We have a few examples you can follow:

  1. fp6 for 2x faster inference over fp16, with an easy-to-use API: quantize_(model, fp6_llm_weight_only())
  2. 2:4 Sparse Marlin GEMM: 2x speedups for FP16xINT4 kernels, even at batch sizes up to 256
  3. int4 tinygemm unpacker which makes it easier to switch quantized backends for inference

If you believe there are other CUDA kernels we should take a closer look at, please leave a comment on this issue.

Alpha features

Things we're excited about but need more time to cook in the oven

  1. MX training and inference support with tensors using the OCP MX spec data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet.
  2. Int8 Quantized Training: We're trying out full int8 training. This is easy to use with quantize_(model, int8_weight_only_quantized_training()). This work is prototype as the memory benchmarks are not compelling yet.
  3. IntX: We've managed to support all the int dtypes by doing some clever bit-packing in pure PyTorch and then compiling it. This work is prototype because, without more investment in either the compiler or low-bit kernels, int4 remains more compelling than any smaller dtype.
  4. Bitnet: Mostly this is very cool to people on the team. This is prototype because how useful these kernels are is highly dependent on better hardware and kernel support.
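The bit-packing idea behind IntX can be sketched in a few lines of plain Python: store two 4-bit codes per byte and unpack them with shifts and masks. This is a conceptual illustration only; torchao packs whole tensor layouts and compiles the result:

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit values into single bytes."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes((hi << 4) | lo for hi, lo in zip(values[::2], values[1::2]))

def unpack_int4(packed):
    """Recover the original 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out += [b >> 4, b & 0x0F]
    return out

vals = [3, 15, 0, 7]
assert unpack_int4(pack_int4(vals)) == vals
assert len(pack_int4(vals)) == 2  # half the storage of one byte per value
```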

Installation

torchao makes liberal use of several new features in PyTorch; it's recommended to use it with the current nightly or the latest stable version of PyTorch.

Stable release from PyPI, which defaults to CUDA 12.1:

pip install torchao

Stable release from the PyTorch index:

pip install torchao --extra-index-url https://download.pytorch.org/whl/cu121 # full options are cpu/cu118/cu121/cu124

Nightly release:

pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124

For most developers, you probably want to skip building custom C++/CUDA extensions for faster iteration:

USE_CPP=0 pip install -e .

Integrations

We're also fortunate to be integrated into some of the leading open-source libraries, including:

  1. Hugging Face transformers with a built-in inference backend and low-bit optimizers
  2. Hugging Face diffusers with a minimal example thanks to Sayak Paul
  3. Mobius HQQ backend leveraged our int4 kernels to get 195 tok/s on a 4090

Videos

License

torchao is released under the BSD 3 license.
