Evaluating LLMs Holistically in a World Where Benchmarks Leak
Abstract
Benchmark contamination is no longer a theoretical concern. As frontier models are trained on open-web data, public test sets are routinely absorbed into pre-training corpora — and beyond passive contamination, labs are known to actively optimize against known benchmarks and selectively report favorable results. When a model claims state-of-the-art on a public leaderboard, it is increasingly unclear whether that reflects genuine generalization or familiarity with the test. Private-only benchmarks — never released publicly, evaluated under controlled conditions, and continuously refreshed — offer a structural solution: because the test set is never exposed, a model cannot have trained on it, and contamination is ruled out by design. Built across capability families rather than isolated skills, such benchmarks can also surface cross-domain failure modes that narrow public evaluations miss entirely. This social will examine what private, holistic evaluation infrastructure could look like in practice, with short talks from practitioners followed by open discussion on what it would take for the community to coalesce around shared private evaluation standards.