Poster in Workshop: ICLR 2026 Workshop on AI with Recursive Self-Improvement

Beyond Solving: A Closer Look at LLMs as Solution Verifiers

Jack Lu ⋅ Ryan Teehan ⋅ Jinran Jin ⋅ Mengye Ren


Abstract

Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers selecting high-quality answers from a pool of solver-generated candidates. This raises the question of when verification pays off in solver–verifier systems. Prior work offers only a limited study of the factors influencing verification performance: it focuses primarily on self-verification and rarely examines either the relationship between solver and verifier model families or the effect of post-training. To address this gap, we present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. To support our analysis, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. Our experiments find that 1) verification across model families is more effective than either self-verification or verification within the same family, and more generally the benefits of verification decrease as the solver and verifier become more similar; 2) post-training weakens self-improvement abilities but strengthens cross-family improvement; and 3) some tasks, particularly mathematical and logical ones, are inherently more amenable to improvement through verification.
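For readers unfamiliar with verifier-based rejection sampling, the following minimal sketch may help illustrate the general idea: a solver proposes several candidates, a verifier filters them, and the answer is drawn from whatever survives. Everything in it is an assumption made for illustration, including the names `expected_accuracy` and `toy_verifier` and the toy candidate pool. It does not reproduce the paper's verifier gain metric, whose definition is not given in the abstract; the `improvement` value is simply the accuracy difference on this toy pool.

```python
from typing import Callable, Optional, Sequence


def expected_accuracy(
    candidates: Sequence[int],
    is_correct: Callable[[int], bool],
    verifier: Optional[Callable[[int], bool]] = None,
) -> float:
    """Expected accuracy of picking one candidate uniformly at random.

    If a verifier is given, only candidates it accepts are eligible.
    If it rejects everything, fall back to the full pool.
    """
    pool = list(candidates)
    if verifier is not None:
        accepted = [c for c in pool if verifier(c)]
        pool = accepted or pool
    return sum(is_correct(c) for c in pool) / len(pool)


# Hypothetical pool of solver outputs for a question whose answer is 42.
candidates = [42, 41, 42, 7, 13]


def is_correct(c: int) -> bool:
    return c == 42


def toy_verifier(c: int) -> bool:
    # Imperfect on purpose: accepts the wrong answer 41 (a false positive).
    return c >= 40


baseline = expected_accuracy(candidates, is_correct)
filtered = expected_accuracy(candidates, is_correct, toy_verifier)
improvement = filtered - baseline  # illustrative only, not the paper's metric

print(f"baseline={baseline:.3f} filtered={filtered:.3f} improvement={improvement:.3f}")
```

Even an imperfect verifier raises accuracy on this toy pool, because it removes more wrong candidates than right ones. A verifier that makes the same mistakes as the solver would filter less usefully, which is one intuition for why similarity between solver and verifier could matter.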
