Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

xin rihui ⋅ Han Liu ⋅ Zecheng Wang ⋅ Yupeng Zhang ⋅ Dianbo Sui ⋅ Xiaolin Hu ⋅ Bingning Wang


Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, and Reinforcement Learning (RL) plays a key role in adapting them to specific applications. In mathematical problem solving, however, RL typically relies on ground-truth answers, which are costly to collect and limited in availability. This work explores two simple surrogate signals, format and length, to guide RL training. We find that early training is dominated by format learning, where structural feedback alone accounts for most of the performance gains. Adding length-based rewards further refines outputs by discouraging overly long or overly short responses. With these format-length signals, a Group Relative Policy Optimization (GRPO) approach can reach over 90% of the performance of ground-truth-based optimization and in some cases surpass it. For example, our method achieves 33.3% accuracy on AIME2024 and 57.6% on CRUX-O with a 7B base model, and it generalizes across model sizes and series. Beyond practical efficiency, these findings offer a perspective on RL: rather than imparting new knowledge, RL primarily activates reasoning capabilities already embedded in pre-trained models. This insight suggests that lightweight, label-efficient strategies can complement pre-training to unlock LLMs' latent potential in reasoning-intensive tasks.
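
To make the idea of format and length signals concrete, the following is a minimal sketch of how such a label-free reward could be computed and turned into group-relative advantages in the style of GRPO. The abstract does not specify the reward functions, so everything here is an assumption made for illustration: the `\boxed{...}` answer convention, the length window (`lo`, `hi`), the linear decay, the whitespace token count, and all function names are hypothetical and are not taken from the paper.

```python
import re
import statistics

# Assumed format convention: the final answer appears once as \boxed{...}.
# This simple pattern does not handle nested braces such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def format_reward(response: str) -> float:
    """Return 1.0 if the response contains exactly one boxed answer, else 0.0."""
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(n_tokens: int, lo: int = 50, hi: int = 400) -> float:
    """Return 1.0 inside [lo, hi]; decay linearly toward 0.0 outside it.

    The window bounds are illustrative placeholders, not values from the paper.
    """
    if n_tokens < lo:
        return n_tokens / lo
    if n_tokens > hi:
        return max(0.0, 1.0 - (n_tokens - hi) / hi)
    return 1.0


def surrogate_reward(response: str) -> float:
    """Combine both signals without consulting any ground-truth answer."""
    n_tokens = len(response.split())  # crude whitespace count as a tokenizer stand-in
    fmt = format_reward(response)
    # Assumed design choice: length only earns reward when the format is correct.
    return fmt + fmt * length_reward(n_tokens)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    samples = [
        "step " * 100 + "\\boxed{7}",  # well formatted, reasonable length
        "\\boxed{7}",                  # well formatted, too short
        "I think the answer is 7",     # no boxed answer
    ]
    rewards = [surrogate_reward(s) for s in samples]
    print(rewards)
    print(group_advantages(rewards))
```

Gating the length term on the format term is one possible way to keep a model from collecting reward for well-sized but unparseable outputs; other combinations, such as a weighted sum, would fit the same description.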
