TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation
Abstract
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained considerable attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ assumes layer-wise independence, which leads to severe accuracy degradation in low-bit regimes. The recent BoA method improves upon GPTQ by incorporating inter-layer dependencies within the attention module, but it requires sequential quantization across all out-channels, making it substantially less efficient than GPTQ. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating quantization. TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces the number of sequential operations and yields a 4-6× speedup; (ii) correction of distortions propagated from preceding quantized Transformer blocks; and (iii) adaptive grid selection with attention-wise refinement to prevent grid misalignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy, and, when combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization.