Why AI Evaluations Need Error Bars
Abstract
As large language models (LLMs) and agentic systems advance, the field increasingly depends on fine-grained evaluation to compare models, guide research directions, and make deployment decisions. Yet evaluation pipelines often treat LLMs as deterministic functions, even though they are fundamentally stochastic systems whose variability arises from sampling methods, hardware nondeterminism, environmental randomness, and evaluation procedures. This mismatch leads to unstable benchmarks, unreliable model comparisons, inconsistent agent outcomes, and significant uncertainty when LLMs are used as judges. Recent research has begun to quantify this instability and to propose statistical techniques, ranging from frequentist error bars and Bayesian latent-state models to reliability metrics and large-scale variance audits. But adoption is uneven, and the field lacks a cohesive statistical framework for evaluating stochastic intelligence. This post synthesizes existing research into a unified perspective and outlines practical recommendations for improving evaluation practice. The goal is not to introduce new methods, but to demonstrate that the tools already exist and that incorporating statistical thinking is both feasible and urgently needed.
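To make the feasibility claim concrete before the main discussion, here is a minimal sketch of the simplest technique the abstract names: frequentist error bars via a percentile bootstrap over per-question scores. This example is illustrative rather than taken from any of the cited work; the model names, accuracies, and question counts are invented, and only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a benchmark's mean score.

    `scores` is a 1-D array of per-item results (e.g. 0/1 correctness on
    each benchmark question). The interval reflects sampling variability
    over questions, one of several variance sources discussed in this post.
    """
    scores = np.asarray(scores, dtype=float)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample questions with replacement and record the mean score.
        resample = rng.choice(scores, size=scores.size, replace=True)
        means[i] = resample.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Hypothetical example: two models that differ on point estimates.
model_a = rng.binomial(1, 0.72, size=200)  # 200 simulated eval questions
model_b = rng.binomial(1, 0.68, size=200)

for name, scores in [("A", model_a), ("B", model_b)]:
    mean, (lo, hi) = bootstrap_ci(scores)
    print(f"model {name}: {mean:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
```

On a benchmark of this size, the two intervals typically overlap, which is exactly the point: a leaderboard gap that looks decisive as a point estimate may not be statistically resolved once error bars are attached.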