Agentic Proving for Program Verification
Abstract
Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate leading agentic and non-agentic provers on two Lean 4 benchmarks for verifiable code generation, Clever and Verina, which require machine-checkable proofs of functional correctness. Our results show that agentic systems not only achieve near-saturation performance on the correct portions of these benchmarks, but also consistently identify errors in specifications and implementations within the benchmarks themselves. In several cases, the agents further propose fixes and successfully prove the corrected statements. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.