ERA-GAC for Stable Structured Reasoning with Attention Priors and Gain-Aware Entropy Control
Abstract
Large language models often fail on logical reasoning tasks not only by being wrong but by being unstable: small shifts in training phase or attention dynamics can yield brittle internal inference that disproportionately harms structured question answering. We present ERA-GAC, a training-time method that stabilizes attention-based inference through (i) ERA, an additive log-prior over attention destinations that encodes length-aware structural constraints, and (ii) GAC, a gain-aware entropy/temperature controller that prevents late-phase collapse into overly sharp or overly diffuse attention regimes while keeping inference-time behavior fixed. In compute-matched training (540M parameters, 1.87B tokens), ERA-GAC yields statistically significant differences on six of nine tasks (five improvements, one regression) under paired tests with multiple-comparisons correction. We find large gains on structured science QA benchmarks (SciQ, ARC-Easy, QASC) and smaller gains on commonsense and reading-comprehension tasks, while the regression on BoolQ suggests that structural priors may interfere with passage-grounded entailment.
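To make the two components concrete, the following is a minimal sketch of the general idea only, not the paper's implementation. It adds a log-prior to attention scores before the softmax and adjusts a temperature according to the entropy of the resulting distribution. The distance-based prior in `length_aware_log_prior`, the proportional update in `adjust_temperature`, and all parameter names (`strength`, `gain`, `target_entropy`) are assumptions introduced for illustration. Placing the prior inside the temperature scaling is likewise an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def length_aware_log_prior(query_pos, seq_len, strength=1.0):
    """Hypothetical prior: penalize each destination by its distance from
    the query position, normalized by sequence length."""
    return [-strength * abs(query_pos - j) / seq_len for j in range(seq_len)]


def attention_with_log_prior(scores, log_prior, temperature=1.0):
    """Attention weights from softmax((scores + log_prior) / temperature)."""
    logits = [(s + b) / temperature for s, b in zip(scores, log_prior)]
    return softmax(logits)


def adjust_temperature(temperature, probs, target_entropy, gain=0.1):
    """Hypothetical proportional controller: raise the temperature when
    attention is sharper than the target entropy, lower it when more diffuse."""
    error = target_entropy - entropy(probs)
    return max(1e-3, temperature * math.exp(gain * error))


# Illustrative values only.
scores = [2.0, 0.5, 0.1, 1.5]
prior = length_aware_log_prior(query_pos=0, seq_len=4, strength=2.0)
plain = softmax(scores)
biased = attention_with_log_prior(scores, prior)
new_temperature = adjust_temperature(1.0, biased, target_entropy=math.log(4))
```

In this sketch the prior shifts attention mass away from distant destinations, and the controller nudges the temperature upward because the biased distribution has lower entropy than the uniform target. Any such controller would act during training only, consistent with the abstract's statement that inference-time behavior is kept fixed.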