The Capability Frontier: Benchmarks Miss 82% of Model Performance
Abstract
Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities: (i) different models answer different questions correctly, allowing for ensembling gains, and (ii) given a budget, some models can be run multiple times to improve results. We introduce the concept of a Capability Frontier: a Pareto frontier for a set of models, characterizing the best achievable performance at each cost level. Our construction of the Capability Frontier corrects for two biases: underestimation from evaluating a single model on a single run, and overestimation from taking the maximum over several noisy models or runs. To understand the impact of these corrections, we study 21 LLMs across 16 widely used benchmarks (coding, reasoning, medicine, factuality, instruction following, and agentic tasks) and compare the performance of the Capability Frontier at matched cost to each benchmark's top-performing model. Correcting for single-model evaluation reduces the error rate by 54% on average; additionally correcting for single runs yields an 82% reduction. Moreover, SOTA accuracy can be matched at 85% lower cost on the Capability Frontier. These findings suggest that collective LLM capabilities are substantially underestimated, with immediate implications for both evaluation and deployment.
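To make the two central quantities concrete, the following is a minimal sketch of a cost-accuracy Pareto frontier and of an improvement measured as a relative reduction in error rate. It is an illustration under stated assumptions, not the paper's construction: the configuration names, costs, and accuracies are hypothetical, the function names are invented for this example, and each accuracy is assumed to be an unbiased estimate already (the sketch does not implement the correction for taking the maximum over noisy models or runs).

```python
from typing import List, Optional, Tuple

# (name, cost, accuracy); all values below are hypothetical.
Config = Tuple[str, float, float]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep configurations that no cheaper-or-equal configuration matches or beats."""
    frontier: List[Config] = []
    best_acc = float("-inf")
    # Sort by cost ascending; at equal cost, try the more accurate one first.
    for name, cost, acc in sorted(configs, key=lambda c: (c[1], -c[2])):
        if acc > best_acc:  # strictly better than everything cheaper or equal
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier


def best_at_cost(frontier: List[Config], budget: float) -> Optional[Config]:
    """Best frontier point whose cost fits the budget, or None if none fits."""
    feasible = [c for c in frontier if c[1] <= budget]
    # The frontier is sorted by cost with increasing accuracy, so the last
    # feasible point is also the most accurate one.
    return feasible[-1] if feasible else None


def error_rate_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative reduction in error rate when moving from baseline to new accuracy."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Hypothetical single models, a repeated-run vote, and a two-model ensemble.
configs: List[Config] = [
    ("A", 1.0, 0.60),
    ("B", 2.0, 0.55),          # dominated by A: costs more, scores less
    ("C", 3.0, 0.80),
    ("A x3 vote", 3.0, 0.70),  # dominated by C at the same cost
    ("A+C", 4.0, 0.90),
]

frontier = pareto_frontier(configs)
choice = best_at_cost(frontier, budget=3.5)
gain = error_rate_reduction(baseline_acc=0.80, new_acc=0.90)
```

In this toy data the frontier keeps A, C, and A+C. Moving from 80% to 90% accuracy halves the error rate, which is the sense in which the abstract's percentages are read here: as relative reductions in error rate rather than absolute accuracy gains.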