SSVPO: Effective Step-Level Credit Assignment for RL Training of Language Models
Abstract
Language models have shown strong performance on mathematical reasoning tasks. Post-training with outcome-based reinforcement learning (RL) can further enhance reasoning but is inefficient because it relies solely on final rewards. Recent credit assignment–based RL methods provide intermediate feedback, yet they often struggle to evaluate each step’s importance fairly, especially in partially correct reasoning chains. We propose Sequential Shapley Value Policy Optimization (SSVPO), a step-level credit assignment framework inspired by multi-agent RL. SSVPO introduces an insertion MDP and Sequential Shapley Values (SSV), which measure each step’s marginal contribution by reordering reasoning steps into alternative chains, ensuring fair credit assignment across all steps. By identifying steps that receive zero credit, SSVPO can shorten reasoning chains and improve training efficiency. We further prove that SSV allocates credit fairly and show that using SSV as the advantage baseline is consistent with Proximal Policy Optimization (PPO). Across 7 benchmarks, SSVPO outperforms state-of-the-art RL methods, both outcome-based (RLOO, GRPO, DAPO) and credit assignment–based (VinePPO, SPO), achieving up to an 11.6\% gain in accuracy, an 18.1\% reduction in token usage, and a 1.6× improvement in reasoning efficiency over vanilla methods. Our findings highlight that SSVPO provides effective step-level credit assignment, improving post-training LLM reasoning performance while reducing token budgets.
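To make the abstract's core idea concrete, the sketch below illustrates Shapley-style step-level credit via Monte Carlo sampling; it is a rough illustration, not the paper's exact insertion-MDP formulation, and it assumes a hypothetical black-box `value_fn` that scores a partial reasoning chain (e.g., the estimated probability of reaching the correct final answer from it).

```python
import random
from typing import Callable, List, Sequence


def shapley_step_credits(
    steps: Sequence[str],
    value_fn: Callable[[Sequence[str]], float],
    num_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of Shapley-style credit for each reasoning step.

    Illustrative assumptions: `value_fn` scores a partial chain (hypothetical
    oracle, not part of the paper); partial chains keep the steps' original
    relative order, loosely mimicking insertion into alternative chains.
    """
    rng = random.Random(seed)
    n = len(steps)
    credits = [0.0] * n
    for _ in range(num_samples):
        arrival = list(range(n))
        rng.shuffle(arrival)          # random order in which steps are inserted
        included = set()
        prev = value_fn([])           # value of the empty chain
        for i in arrival:
            included.add(i)
            chain = [steps[j] for j in sorted(included)]
            cur = value_fn(chain)
            credits[i] += cur - prev  # marginal contribution of step i
            prev = cur
    return [c / num_samples for c in credits]
```

Steps whose estimated credit is near zero are the kind of candidates the abstract describes pruning to shorten reasoning chains and reduce token usage.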