Uncertainty Drives Social Bias Changes in Quantized Large Language Models
Abstract
Aggregate bias metrics are fundamentally misleading for quantized large language models: quantization causes up to 21% of individual responses to flip between biased and unbiased states while aggregate scores remain unchanged, creating invisible harms that standard evaluations fail to detect. To reveal this hidden phenomenon, we introduce PostTrainingBiasBench, a unified framework for rigorous bias evaluation, and conduct the first large-scale study of 50 quantized models across 13 closed- and open-ended benchmarks. We find that these flips are strongly linked to model uncertainty: uncertain responses are 3-11x more likely to change than confident ones. Through controlled intervention via preference optimization, we establish causal evidence that uncertainty drives response flipping. Quantization strength amplifies the effect: 4-bit quantized models show 4-6x more behavioral changes than 8-bit models. Critically, these shifts affect demographic groups asymmetrically, with bias worsening by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Together, our findings demonstrate that compression fundamentally reshapes bias patterns, underscoring the need for rigorous post-quantization evaluation and interventions to ensure reliability in practice.