
Extend LLMs to infinite length without sacrificing efficiency and performance, without retraining

Project description

Attention Sinks in Transformers for Infinite-length LLMs

[Figure: llama_2_7b_ppl_vram – perplexity and VRAM usage of Llama 2 7B]

Overview

  • Extend existing LLMs (e.g. Llama 2) to infinite length without sacrificing efficiency and performance, without any retraining.
  • The attention_sinks API allows for a drop-in replacement of the transformers API:
    from attention_sinks import AutoModel
    
    model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
    
  • New parameters to AutoModel.from_pretrained (see the sketch after this list):
    • attention_sink_size, int, defaults to 4: The number of initial tokens to use as the attention sink. These tokens are always included in the Attention Sink KV Cache.
    • attention_sink_window_size, int, defaults to 1020: The size of the sliding window, i.e. the number of "recent tokens" to include in the Attention Sink KV Cache.
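
    A minimal usage sketch combining these parameters with generation is shown below. The values are simply the defaults, and the AutoModelForCausalLM import plus the generation call are assumptions based on the drop-in transformers API rather than a documented example:

    from transformers import AutoTokenizer
    # Assumes attention_sinks mirrors transformers' auto classes; a causal LM class
    # is needed so that .generate() has a language-modelling head.
    from attention_sinks import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        device_map="auto",
        attention_sink_size=4,            # always keep the first 4 tokens in the KV cache
        attention_sink_window_size=1020,  # plus a sliding window of the 1020 most recent tokens
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Generate as usual; tokens that fall outside the sink + window (4 + 1020 = 1024)
    # are evicted from the KV cache, so memory usage stays constant.
    inputs = tokenizer("Attention sinks allow language models to", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))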

Note

I have replicated some, but not yet all, of the experiments from the original paper, so I cannot confirm that this approach truly enables infinite-length LLMs, either in theory or in practice.

More details coming soon.

Credits

Inspired by, and adapted from, StreamingLLM.

Citation

@article{xiao2023streamingllm,
    title={Efficient Streaming Language Models with Attention Sinks},
    author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
    journal={arXiv},
    year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

attention_sinks-0.0.1.tar.gz (11.5 kB)

Uploaded Source

Built Distribution

attention_sinks-0.0.1-py3-none-any.whl (12.0 kB)

Uploaded Python 3

File details

Details for the file attention_sinks-0.0.1.tar.gz.

File metadata

  • Download URL: attention_sinks-0.0.1.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for attention_sinks-0.0.1.tar.gz
Algorithm Hash digest
SHA256 d1019c0a691374538acded0ffc8914f81c05bda92c1d3f898506797ea8d2ed62
MD5 94e8bc8071bafe8e11b4f1cc258ab89a
BLAKE2b-256 0f65ce134e9bc0c6f0afa2fc924128d1ae2f6687480509f8708c7e2679e1c4dc

These hashes can be used to verify the integrity of the downloaded file.
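
For example, the SHA256 digest of the downloaded archive can be checked against the value above with Python's hashlib (a minimal sketch, assuming the archive has been downloaded to the current directory):

    import hashlib

    expected = "d1019c0a691374538acded0ffc8914f81c05bda92c1d3f898506797ea8d2ed62"

    # Hash the downloaded source distribution in chunks to keep memory usage low.
    sha256 = hashlib.sha256()
    with open("attention_sinks-0.0.1.tar.gz", "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)

    if sha256.hexdigest() != expected:
        raise ValueError("SHA256 mismatch: the download may be corrupted or tampered with.")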

File details

Details for the file attention_sinks-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for attention_sinks-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 59fc883ae53eb807e45114e4e5b39b84a3e23d12126f2deff7195cd8d8c7bc9d
MD5 ecd42d7318d2a7cfc5b1a2e8192800e8
BLAKE2b-256 1bc661443d3cfdc7d1509d74526a0d25f50756e5878c3b712eac40c57879e1c1

