
Poster in Workshop: 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models

SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar · Ivan Chelombiev · Luke Hudlass-Galley · Charles Blake · Carlo Luschi · Douglas Orr


Abstract: Through an analysis of the statistical properties of pre-trained large language models (LLMs), we highlight two opportunities for sparse memory access: first in the components of the query and key vectors, and second in the attention scores corresponding to key-value pairs. Based on this, we introduce **SparQ Attention**, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to $8\times$ savings in attention data transfers without substantial drops in accuracy, by evaluating Llama $2$, Mistral and Pythia models on a wide range of downstream tasks.
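
The sketch below illustrates the idea described in the abstract: use only the largest-magnitude components of the query to cheaply approximate attention scores, then fetch full key/value pairs solely for the highest-scoring positions. It is a minimal single-head, single-query illustration, not the authors' released implementation; the function name, the parameters `r` and `k`, and the score scaling are assumptions for exposition, and details of the full method (e.g. how the unfetched positions are compensated for) are omitted.

```python
import numpy as np

def sparse_fetch_attention_sketch(q, K, V, r=16, k=64):
    """Illustrative bandwidth-saving attention for one query vector.

    q: (d,) current query; K, V: (seq, d) cached keys/values.
    r: number of largest-magnitude query components used to form
       approximate scores; k: number of key/value pairs fetched in full.
    """
    d = q.shape[0]
    # 1. Pick the r query components with the largest magnitude.
    idx = np.argpartition(np.abs(q), -r)[-r:]
    # 2. Approximate attention scores using only those components,
    #    so only r columns of the key cache need to be transferred.
    approx_scores = K[:, idx] @ q[idx] / np.sqrt(d)
    # 3. Keep the k positions with the highest approximate scores.
    top = np.argpartition(approx_scores, -k)[-k:]
    # 4. Fetch full keys/values only for the selected positions and
    #    run standard softmax attention over that subset.
    scores = K[top] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[top]

# Example usage (random data, shapes chosen so that r <= d and k <= seq):
rng = np.random.default_rng(0)
d, seq = 128, 1024
q, K, V = rng.standard_normal(d), rng.standard_normal((seq, d)), rng.standard_normal((seq, d))
out = sparse_fetch_attention_sketch(q, K, V)
```

In this sketch, the memory traffic per decoding step scales with `r * seq + 2 * k * d` elements instead of `2 * seq * d` for dense attention, which is the source of the data-transfer savings the abstract reports.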
