

Oral in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

Oral #3: QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations

Andrei Panferov

[ Project Page ]
Sat 26 Apr 7:55 p.m. PDT — 8:10 p.m. PDT

Abstract:

One main approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still largely open. In this paper, we advance the state of the art for QAT via a new method called QuEST, which is Pareto-competitive with FP16, that is, it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations, and is compatible with weight sparsity. Experiments on Llama-type architectures show that QuEST induces new, stable scaling laws across the entire range of hardware-supported compressed representations. Moreover, we provide GPU kernel support showing that the models produced by QuEST can be efficiently executed on current hardware.
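For readers unfamiliar with the general setting the abstract describes, the sketch below illustrates generic quantization-aware training with fake quantization and a straight-through estimator (STE). It is not the QuEST algorithm itself (the abstract does not detail its internals); the 4-bit symmetric per-tensor quantization, max-absolute scaling rule, and the `fake_quantize` / `QATLinear` names and toy model are illustrative assumptions only.

```python
# Minimal sketch of quantization-aware training (QAT) with a straight-through
# estimator (STE). Hypothetical example, NOT the actual QuEST method.
import torch
import torch.nn as nn


def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax    # assumed max-abs scaling rule
    x_q = torch.round(x / scale).clamp(-qmax, qmax) * scale
    # STE: forward pass uses the quantized value, backward passes gradients through x.
    return x + (x_q - x).detach()


class QATLinear(nn.Module):
    """Linear layer whose weights and activations are fake-quantized in the forward pass."""

    def __init__(self, in_features: int, out_features: int, bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.bits = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight, self.bits)   # quantized weights
        x_q = fake_quantize(x, self.bits)             # quantized activations
        return nn.functional.linear(x_q, w_q, self.bias)


if __name__ == "__main__":
    # Toy regression task, purely to show that gradients flow through the STE.
    model = nn.Sequential(QATLinear(16, 32, bits=4), nn.ReLU(), QATLinear(32, 1, bits=4))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    x, y = torch.randn(64, 16), torch.randn(64, 1)
    for step in range(100):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final loss: {loss.item():.4f}")
```

The STE idiom `x + (x_q - x).detach()` is what makes training over a non-differentiable rounding operation possible: the forward pass sees quantized values while gradients flow as if the quantizer were the identity.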
