Logic-Verified GRPO: Graded Z3 Process Rewards for Logical Reasoning in Small LLMs
Abstract
Recent work integrates symbolic solvers into reinforcement learning for LLM reasoning, but existing approaches typically use binary chain-level verification: a full reasoning trace is either correct or not. We introduce Logic-Verified GRPO, which uses the Z3 SMT solver to provide graded, step-level process rewards within GRPO training—each step is independently verified and receives proportional credit based on its formal validity. We evaluate on FOLIO and ProntoQA using Qwen2.5-3B-Instruct. While both GRPO variants improve accuracy over the baseline (+8–10pp), our key finding is that graded step verification produces markedly better epistemic calibration: the Z3-verified model achieves +14pp improvement on Unknown (unprovable) conclusions (55.6% vs. 41.7% baseline), while outcome-only GRPO actually degrades Unknown recognition (38.9%). This suggests that graded symbolic process rewards teach models to distinguish “valid proof found” from “no valid derivation exists”—a distinction invisible to outcome-only or binary verification rewards.