Rubric as Reward: Decomposing Verification Signals for Logical Reasoning in GRPO
Abstract
Reinforcement learning from verifiable rewards (RLVR) has improved LLM reasoning, yet reward functions remain monolithic: a model producing a correct answer via flawed reasoning receives the same signal as one reasoning validly but extracting the wrong answer. We propose rubric-grounded rewards, a framework that decomposes reward into independently weighted criteria spanning a verifiable-to-soft spectrum. Applied to logical reasoning, our five-criterion rubric separates answer correctness, Z3-checked step validity, and format compliance (all machine-verifiable) from premise utilization and reasoning completeness (requiring judgment). We train Qwen2.5-3B-Instruct via GRPO under five reward conditions and evaluate on 166 hard FOLIO and ProntoQA examples. Three findings emerge: (1) rubric-structured verifiable rewards achieve the highest accuracy (51.8%, +6.6pp over baseline) with the most balanced True/False/Unknown performance; (2) rubric profiling reveals that conditions with near-identical accuracy exhibit substantially different quality profiles, exposing an “optimization tax” where RL training improves verifiable criteria while degrading soft ones; and (3) reward structure matters independently of reward content, as decomposing the same verification signals into explicit criteria outperforms their monolithic composite.
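To make the decomposition concrete, the following is a minimal Python sketch of a rubric-style reward as a weighted sum over the five criteria named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the weights, the `Criterion` and `rubric_reward` names, the `verifiable_only` switch, and the convention that each criterion score lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float      # hypothetical weight, not taken from the paper
    verifiable: bool   # True if the criterion is machine-checkable


# Criterion names follow the abstract; the weights are placeholders.
RUBRIC: Sequence[Criterion] = (
    Criterion("answer_correctness", 0.40, True),
    Criterion("step_validity", 0.20, True),       # e.g. checked with Z3
    Criterion("format_compliance", 0.10, True),
    Criterion("premise_utilization", 0.15, False),
    Criterion("reasoning_completeness", 0.15, False),
)


def rubric_reward(
    scores: Dict[str, float],
    rubric: Sequence[Criterion] = RUBRIC,
    verifiable_only: bool = False,
) -> float:
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1].

    With verifiable_only=True, soft criteria are dropped and the remaining
    weights are renormalized so the reward stays in [0, 1].
    """
    active = [c for c in rubric if c.verifiable or not verifiable_only]
    total_weight = sum(c.weight for c in active)
    return sum(c.weight * scores[c.name] for c in active) / total_weight


# Two responses that a coarse scalar reward would struggle to tell apart.
correct_but_flawed = {
    "answer_correctness": 1.0,
    "step_validity": 0.0,
    "format_compliance": 1.0,
    "premise_utilization": 0.5,
    "reasoning_completeness": 0.5,
}
valid_but_wrong = {
    "answer_correctness": 0.0,
    "step_validity": 1.0,
    "format_compliance": 1.0,
    "premise_utilization": 1.0,
    "reasoning_completeness": 1.0,
}

if __name__ == "__main__":
    for label, scores in [("correct_but_flawed", correct_but_flawed),
                          ("valid_but_wrong", valid_but_wrong)]:
        print(label, rubric_reward(scores),
              rubric_reward(scores, verifiable_only=True))
```

Keeping the per-criterion scores alongside the scalar reward is what would allow the kind of rubric profiling the abstract describes, where two conditions with similar accuracy can still be compared criterion by criterion.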