Poster in Workshop: AI for Mechanism Design and Strategic Decision Making (AIMS)

Optimal Aggregation Mechanisms for AI Benchmarking and Platinum Benchmarks

Andreas Haupt ⋅ Anka Reuel ⋅ Mykel J Kochenderfer ⋅ Sanmi Koyejo


Abstract

AI benchmarks are frequently summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. Uniform averaging creates incentives to over-optimize for items that are trivial, socially irrelevant, or dominated by measurement noise. We model benchmarking as a multitask principal-agent game in which a benchmark designer chooses aggregation weights and a lab takes costly actions to improve its model. The optimal weights depend on normative welfare priorities, marginal costs of improvement, and measurement uncertainty. This analysis motivates platinum items: items that (i) precisely measure (ii) welfare-aligned capabilities that are (iii) comparatively cheap to improve. We propose an operational rubric and a certification workflow, implemented via expert review and LLM-based judgments, to identify platinum items and reweight benchmark items.
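
To make the dependence of the weights on welfare, cost, and noise concrete, the following is a minimal sketch and not the paper's actual mechanism. It assumes a textbook Holmström-Milgrom linear-contract setting with separable quadratic effort costs and independent measurement noise, in which the incentive weight on item i is proportional to b_i / (1 + r c_i σ_i²). The `Item` class, the `aggregation_weights` function, the `risk_aversion` parameter, and all numbers below are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    welfare: float    # b_i: assumed marginal social value of improving the item
    cost: float       # c_i: assumed curvature of the lab's quadratic effort cost
    noise_var: float  # sigma_i^2: assumed variance of the item's measurement noise


def aggregation_weights(items, risk_aversion=1.0):
    """Return normalized weights proportional to b_i / (1 + r * c_i * sigma_i^2).

    This is the standard separable-task formula from the linear-contract
    literature, used here only as an illustrative stand-in.
    """
    raw = [
        it.welfare / (1.0 + risk_aversion * it.cost * it.noise_var)
        for it in items
    ]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one item needs positive welfare value")
    return {it.name: w / total for it, w in zip(items, raw)}


# Hypothetical items: one precise and valuable, one noisy, one low-value.
items = [
    Item("platinum", welfare=1.0, cost=0.5, noise_var=0.1),
    Item("noisy", welfare=1.0, cost=0.5, noise_var=4.0),
    Item("trivial", welfare=0.1, cost=0.5, noise_var=0.1),
]
weights = aggregation_weights(items)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Under these assumptions, the precise, welfare-aligned, cheap-to-improve item receives the largest weight, while the noisy and low-value items are down-weighted relative to the uniform average of 1/3 each.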
