Optimal Aggregation Mechanisms for AI Benchmarking and Platinum Benchmarks
Abstract
AI benchmarks are frequently summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. This creates incentives to over-optimize for items that are trivial, socially irrelevant, or dominated by measurement noise. We model benchmarking as a multitask principal-agent game in which a benchmark designer chooses aggregation weights and a lab takes costly actions to improve its model. The optimal weights depend on normative welfare priorities, marginal costs of improvement, and measurement uncertainty. This analysis motivates \emph{platinum items}: items that (i) precisely measure (ii) welfare-aligned capabilities that are (iii) comparatively cheap to improve. We propose an operational rubric and a certification workflow, implemented via expert review and LLM-based judgments, to identify platinum items and reweight benchmark items.