Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT

Hybrid Reinforcement: when reward is sparse, better to be dense

Leitian Tao ⋅ Ilia Kulikov ⋅ Swarnadeep Saha ⋅ Tianlu Wang ⋅ Jing Xu ⋅ Sharon Li ⋅ Jason E Weston ⋅ Ping Yu

Abstract

Post-training for reasoning in large language models has increasingly relied on verifiable rewards: deterministic checkers that provide $0$–$1$ correctness signals. While reliable, such binary feedback is brittle—many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates sparse verifier signals with dense reward model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
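The abstract does not give formulas for the two components, so the following is only a minimal sketch of how stratified normalization and variance-aware weighting might look for one prompt's group of sampled responses. Everything specific here is an assumption made for illustration rather than the authors' implementation: the function names, min-max normalization within each verifier stratum, the band half-widths `alpha` and `beta`, and the sigmoid weighting with `w_min`, `w_max`, `k`, and `sigma_ref`.

```python
import math
from statistics import pstdev


def stratified_rewards(verifier, rm_scores, alpha=0.1, beta=0.1, eps=1e-8):
    """Rescale reward-model scores inside each verifier-defined stratum.

    Hypothetical scheme: responses the verifier rejects (label 0) land in
    [-alpha, alpha]; responses it accepts (label 1) land in
    [1 - beta, 1 + beta]. With alpha + beta < 1 the two bands cannot overlap,
    so the verifier's correctness ordering is preserved.
    """
    rewards = [0.0] * len(rm_scores)
    for label, centre, half_width in ((0, 0.0, alpha), (1, 1.0, beta)):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        lo = min(rm_scores[i] for i in idx)
        hi = max(rm_scores[i] for i in idx)
        for i in idx:
            # Min-max position within the stratum; use the midpoint if all tie.
            unit = 0.5 if hi - lo < eps else (rm_scores[i] - lo) / (hi - lo)
            rewards[i] = centre - half_width + 2.0 * half_width * unit
    return rewards


def difficulty_weight(rm_scores, w_min=0.5, w_max=2.0, k=5.0, sigma_ref=0.5):
    """Assumed variance-aware weight for one prompt.

    Prompts whose responses receive widely spread reward-model scores are
    treated as more challenging and get a weight closer to w_max.
    """
    sigma = pstdev(rm_scores)
    gate = 1.0 / (1.0 + math.exp(-k * (sigma - sigma_ref)))
    return w_min + (w_max - w_min) * gate


# Four sampled responses to one prompt: verifier labels and raw RM scores.
verifier = [1, 0, 1, 0]
rm_scores = [2.0, 0.5, 1.0, -0.5]
rewards = stratified_rewards(verifier, rm_scores)
weight = difficulty_weight(rm_scores)
weighted = [weight * r for r in rewards]
```

In this sketch the reward model can only reorder responses that share a verifier label; it cannot lift a rejected response above an accepted one. That is one way to read the abstract's phrase "preserving correctness while refining quality distinctions".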
