Expo Talk Panel Sat, Apr 25, 2026 • 8:45 AM – 9:45 AM PDT 202 C

Prolific: Your Models Are Outgrowing Your Evaluations: Lessons from Building Evaluation Infrastructure for Frontier AI

Abstract

The capabilities of frontier AI systems are advancing faster than the methods used to evaluate them. Benchmark saturation, unrepresentative preference data, and safety evaluations that fail to reflect real deployment conditions have created a widening gap between measured performance and real-world impact. Closing this gap requires treating evaluation not as an afterthought to model development, but as a first-class infrastructure problem, one that demands scientific rigor in data collection design, representative and verified human populations, and evaluation paradigms that go beyond static benchmarks. In this talk, we present Prolific's approach to building evaluation methodology and infrastructure for frontier AI. Prolific supports over 200,000 verified participants across 45 countries and has underpinned the data and methodology of more than 30,000 publications. We describe how we build on this foundation to develop evaluations that serve the needs of leading AI labs and research institutions: from demographically stratified preference studies and adversarial red-teaming to domain expert evaluation and alignment data collection. We share what we have learned from designing evaluations that capture realistic scenarios and surface failures that matter. We ground this in two case studies from our own research presented at ICLR 2026. “Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework” reveals how aggregate leaderboards conceal systematic preference disagreement across populations and how evaluation dimensions like trust and safety demand different methodological approaches than standard open-ended comparison. “The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries” is an adversarial audit showing that mild commercial objectives embedded in system prompts can override model safety training, even in scenarios with life-threatening consequences.