Benchmarking Code Verification Strategies with LLMs-as-a-judge
Abstract
Code generation has attracted attention because its outputs are verifiable: candidate solutions can be run against unit tests, mimicking test-driven development, to check correctness. Since human-written unit tests are expensive to obtain, most methods rely on some form of automatic evaluation for rejection sampling or reinforcement learning. In this work, we conduct a comprehensive benchmark of LLM-as-a-judge methods, evaluating their effectiveness and limitations in verifying the correctness of generated code. We show that common approaches to LLM code judges, such as unit test generation or correctness prediction, struggle on harder coding problems. To address this, we propose an approach that combines implicit verification with test generation and consistently outperforms either approach alone. Moreover, for verification based on unit test suites, we compare independent and auto-regressive test generation and show that the latter yields more accurate and more diverse test suites.
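
To make the setting concrete, the following is a minimal sketch of unit-test-based rejection sampling: candidate solutions are executed against a test suite, and only those passing every test are kept. It is an illustration under stated assumptions, not the method proposed in this work. The names `passes_all` and `rejection_sample`, the toy `add` task, and the representation of tests as `assert` strings are all hypothetical choices made for the example.

```python
from typing import List


def passes_all(candidate_src: str, tests: List[str], func_name: str) -> bool:
    """Return True if the candidate defines `func_name` and every test passes.

    Assumption: each test is a single `assert` statement given as a string.
    NOTE: `exec` on untrusted code is unsafe; a real pipeline would sandbox it.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        for test in tests:
            exec(test, namespace)        # raises AssertionError on failure
    except Exception:
        return False                     # syntax error, crash, or failed test
    return func_name in namespace


def rejection_sample(candidates: List[str], tests: List[str], func_name: str) -> List[str]:
    """Keep only the candidates that pass the whole test suite."""
    return [c for c in candidates if passes_all(c, tests, func_name)]


# Hypothetical toy task: implement add(a, b).
candidates = [
    "def add(a, b):\n    return a - b",  # incorrect candidate
    "def add(a, b):\n    return a + b",  # correct candidate
]
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

kept = rejection_sample(candidates, tests, "add")
```

In this sketch the tests are assumed to be trustworthy. When they are themselves generated by an LLM, as in the setting benchmarked here, a failing test may reflect an error in the test rather than in the candidate, which is one reason the quality of the generated test suite matters.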