Poster
Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models
Mazda Moayeri · Vidhisha Balachandran · Varun Chandrasekaran · Safoora Yousefi · Thomas Fel · Soheil Feizi · Besmira Nushi · Neel Joshi · Vibhav Vineet
Hall 3 + Hall 2B #623
Fri 25 Apr, midnight — 2:30 a.m. PDT
Abstract:
With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate in computing molar mass, but 19% less accurate in applying constitutional law, despite the overall accuracies of the three models differing by a mere 0.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 3% accuracy improvement over our 12-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
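The two computations the abstract describes — accuracy over skill-slices, and routing an instance to the model strongest on its relevant skills — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record format, model names, and `skill_slice_accuracy` / `route` helpers are hypothetical assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: each instance carries the skills parsed from model
# rationales plus a per-model correctness flag (1 = correct, 0 = incorrect).
MODELS = ["gpt-4o", "claude-3.5-sonnet", "gemini-1.5-pro"]
instances = [
    {"skills": ["computing molar mass"],
     "correct": {"gpt-4o": 0, "claude-3.5-sonnet": 0, "gemini-1.5-pro": 1}},
    {"skills": ["applying constitutional law"],
     "correct": {"gpt-4o": 1, "claude-3.5-sonnet": 1, "gemini-1.5-pro": 0}},
    # ... in the paper, ~46k instances across 12 benchmarks
]

def skill_slice_accuracy(instances, models):
    """Per-model accuracy on each skill-slice (all instances sharing a skill)."""
    hits = {m: defaultdict(list) for m in models}
    for inst in instances:
        for skill in inst["skills"]:
            for m in models:
                hits[m][skill].append(inst["correct"][m])
    return {m: {s: mean(v) for s, v in slices.items()}
            for m, slices in hits.items()}

def route(skills, slice_acc, models, default):
    """Pick the model with the highest mean accuracy over the instance's skills."""
    def score(m):
        seen = [slice_acc[m][s] for s in skills if s in slice_acc[m]]
        return mean(seen) if seen else float(m == default)  # fall back if skills are unseen
    return max(models, key=score)

slice_acc = skill_slice_accuracy(instances, MODELS)
print(route(["computing molar mass"], slice_acc, MODELS, default="gpt-4o"))
# -> "gemini-1.5-pro"
```

In this toy setup, the router would send molar-mass questions to Gemini 1.5 Pro and constitutional-law questions elsewhere, mirroring the trade-off the abstract reports.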