Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
Abstract
The deployment of Large Language Models (LLMs) in critical domains is currently impeded by the persistent phenomenon of hallucination: the generation of plausible but factually incorrect assertions. Standard reinforcement learning with verifiable rewards (RLVR) paradigms, which predominantly use binary reward signals, inadvertently incentivize models to act as "good test-takers" rather than "honest communicators". In this paper, we introduce an alternative reward for behavioral calibration, which trains a model via reinforcement learning to output calibrated probabilities of correctness and to abstain when these probabilities fall below a user-specified risk threshold. The model can either abstain from producing a complete response or flag individual claims for which uncertainty remains. Our approach allows a 4B-parameter model to surpass frontier models in hallucination mitigation, which we show to be a transferable meta-skill that can be decoupled from raw predictive accuracy. When trained on mathematical reasoning tasks, our model achieves a log-scale gain of 0.806 in the Accuracy-to-Hallucination Ratio by rejecting uncertain responses, substantially exceeding GPT-5 (0.207) on the challenging BeyondAIME benchmark. When applied at the claim level, our approach further surpasses Gemini-2.5-pro on the same metric. Moreover, the hallucination mitigation capability generalizes to cross-domain factual QA.
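As an illustration of the abstention behavior described above, the following minimal Python sketch shows how a stated probability of correctness might be compared against a user-specified risk threshold, both for a whole response and for individual claims. It covers only the decision rule at inference time, not the reward or the training procedure. The names `Response`, `answer_or_abstain`, and `flag_uncertain_claims` are hypothetical and chosen for illustration, and the confidence values are assumed to be supplied by the model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Response:
    """Hypothetical container: an answer plus the model's stated probability of correctness."""
    answer: str
    confidence: float  # assumed to lie in [0, 1]


def _check_probability(value: float) -> None:
    """Reject values that cannot be interpreted as probabilities."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a probability in [0, 1], got {value}")


def answer_or_abstain(response: Response, risk_threshold: float) -> Optional[str]:
    """Return the answer if the stated confidence meets the threshold, else None (abstain)."""
    _check_probability(response.confidence)
    _check_probability(risk_threshold)
    if response.confidence < risk_threshold:
        return None  # confidence falls below the threshold: abstain
    return response.answer


def flag_uncertain_claims(
    claims: List[Tuple[str, float]], risk_threshold: float
) -> List[Tuple[str, bool]]:
    """Claim-level variant: pair each claim with True if it should be flagged as uncertain."""
    _check_probability(risk_threshold)
    flagged = []
    for text, confidence in claims:
        _check_probability(confidence)
        flagged.append((text, confidence < risk_threshold))
    return flagged
```

Under this sketch, raising the threshold makes the system more conservative: more responses are withheld or more claims are flagged, and fewer incorrect assertions are presented as confident.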