Evaluating LLMs Holistically in a World Where Benchmarks Leak
Abstract
Benchmark contamination is no longer a theoretical concern. As frontier models are trained on open-web data, public test sets are routinely absorbed into pre-training corpora — and beyond passive contamination, labs are known to actively optimize against known benchmarks and selectively report favorable results. When a model claims state-of-the-art on a public leaderboard, it is increasingly unclear whether that reflects genuine generalization or familiarity with the test. Private-only benchmarks — never released publicly, evaluated under controlled conditions, and continuously refreshed — offer a structural solution: because the test set is never exposed, a model cannot have trained on it, and contamination is ruled out by design. Built across capability families rather than isolated skills, such benchmarks can also surface cross-domain failure modes that narrow public evaluations miss entirely. This social will examine what private, holistic evaluation infrastructure could look like in practice, with short talks from practitioners followed by open discussion on what it would take for the community to coalesce around shared private evaluation standards.