FLUFFINJECTOR: DIAGNOSING LOGICAL CONSISTENCY FAILURES IN CHAIN-OF-THOUGHT REWARD MODELS
Abstract
Large Language Models (LLMs) are increasingly used as judges and reward models in alignment pipelines, where their scores shape learned behavior. Prior work shows that these judges can be manipulated by superficial openers (e.g., “Thought process:” or “Let’s solve this step by step.”), but vulnerabilities in the verification of intermediate reasoning remain underexplored. We identify Fluff Injection, a failure mode in which a logically necessary step in a chain of reasoning is replaced with plausible-sounding commentary (e.g., “Let’s slow down and check our negatives here”). To measure this failure mode, we introduce FluffInjector, a benchmark of paired minimal examples: for each problem, we generate a GOOD chain and a FLUFF chain that keeps the same step count and final answer while replacing 25–40% of steps with non-inferential filler. Evaluating frontier judges (GPT-4.1, DeepSeek-V3.1, Qwen2.5-7B-Instruct), we find that they frequently validate FLUFF chains, indicating a strong reliance on surface coherence. Using FluffInjector, we fine-tune SmartRM, a verifier trained to emphasize step-to-step logical continuity. SmartRM reduces the false-positive rate from 37.43% (GPT-4.1) to 2.68% and achieves 97.27% overall verification accuracy.
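As a rough illustration of the paired construction described above, the following Python sketch builds a FLUFF chain from a GOOD chain and computes a false-positive rate over judge verdicts. It is not the FluffInjector pipeline itself: the function names `make_fluff_chain` and `false_positive_rate`, the `FILLER` pool, the random choice of which steps to replace, and the decision to leave the last step untouched are all assumptions made for the example. Only the first filler sentence is quoted from the abstract.

```python
import random

# Hypothetical filler pool; only the first sentence appears in the abstract.
FILLER = [
    "Let's slow down and check our negatives here.",
    "It is worth being careful at this point.",
]


def make_fluff_chain(good_steps, fraction=0.3, seed=0):
    """Return a copy of good_steps with a fraction of steps replaced by filler.

    Assumption: the last step states the final answer, so it is never
    replaced. Step count is preserved because replacement is one-for-one.
    """
    rng = random.Random(seed)
    n = len(good_steps)
    candidates = list(range(n - 1))  # every index except the final step
    k = min(max(1, round(fraction * n)), len(candidates))
    replaced = set(rng.sample(candidates, k))
    return [
        rng.choice(FILLER) if i in replaced else step
        for i, step in enumerate(good_steps)
    ]


def false_positive_rate(verdicts):
    """Share of FLUFF chains that a judge accepted (True means accepted)."""
    return sum(verdicts) / len(verdicts)


good = [
    "Let x be the unknown, so 2x + 6 = 0.",
    "Subtract 6 from both sides: 2x = -6.",
    "Divide both sides by 2: x = -3.",
    "Check: 2 * (-3) + 6 = 0.",
    "Final answer: x = -3.",
]
fluff = make_fluff_chain(good, fraction=0.4)
```

Here `fluff` has the same length and the same final step as `good`, with two of the five steps (40%) swapped for filler. A judge's accept/reject decisions on such FLUFF chains can then be passed to `false_positive_rate` as a list of booleans.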