
Poster
in
Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Utkarsh Saxena · Sayeh Sharify · Kaushik Roy · Xin Wang


Abstract:

Quantizing weights, activations, and KV cache in large language models to 4-bit without degrading generalizability is challenging due to outlier-induced activation quantization errors. We propose ResQ, a post-training quantization (PTQ) method that uses principal component analysis to identify a low-rank subspace (in practice 1/8 of the hidden dimension) and keeps coefficients within this subspace in 8-bit while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. ResQ outperforms recent PTQ methods on Llama and Qwen2.5, achieving up to 33% lower Wikitext perplexity than SpinQuant and up to 3x speedup over 16-bit inference.
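The following is a minimal, illustrative sketch of the mixed-precision idea described in the abstract, not the authors' implementation: PCA on calibration activations identifies a low-rank subspace (1/8 of the hidden dimension) kept at 8-bit, the residual subspace is quantized to 4-bit, and a random rotation is applied within each subspace to spread outliers. The helper names (`fake_quant`, `random_rotation`, `resq_project_and_quantize`) and the synthetic calibration data are hypothetical, and per-tensor symmetric quantization is assumed for simplicity.

```python
import numpy as np


def fake_quant(x, bits):
    """Symmetric per-tensor fake quantization to `bits` bits (assumption: per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale


def random_rotation(dim, rng):
    """Random orthogonal matrix (QR of a Gaussian), used to spread outliers across channels."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly distributed


def resq_project_and_quantize(X, keep_ratio=1 / 8, rng=None):
    """Split activations into an 8-bit low-rank part and a 4-bit residual, then reconstruct."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    k = max(1, int(d * keep_ratio))  # size of the high-precision subspace (1/8 of hidden dim)

    # PCA on (centered) calibration activations: the top-k principal directions
    # capture the highest-variance, outlier-heavy components.
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_hi, U_lo = Vt[:k].T, Vt[k:].T  # (d, k) and (d, d-k) orthonormal bases

    # Rotate within each subspace, then quantize: 8-bit for the low-rank
    # coefficients, 4-bit for the residual.
    R_hi = random_rotation(k, rng)
    R_lo = random_rotation(d - k, rng)
    hi = fake_quant(Xc @ U_hi @ R_hi, bits=8)
    lo = fake_quant(Xc @ U_lo @ R_lo, bits=4)

    # Map back to the original space to measure the quantization error.
    return hi @ R_hi.T @ U_hi.T + lo @ R_lo.T @ U_lo.T + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4096, 512))
    X[:, :4] *= 50.0  # a few synthetic outlier channels
    err = np.linalg.norm(X - resq_project_and_quantize(X, rng=rng)) / np.linalg.norm(X)
    print(f"relative reconstruction error: {err:.4f}")
```

In this toy setting the outlier channels fall almost entirely inside the 8-bit subspace, so the 4-bit residual sees a well-conditioned distribution; the full method additionally applies such projections to weights and the KV cache inside transformer layers, which this sketch omits.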
