Poster
Re-evaluating Open-ended Evaluation of Large Language Models
Si-Qi Liu · Ian Gemp · Luke Marris · Georgios Piliouras · Nicolas Heess · Marc Lanctot
Hall 3 + Hall 2B #440
Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that current Elo-based rating systems can be susceptible to, and can even reinforce, biases in the data, whether intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provides insights into the competitive landscape of LLM development.
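The redundancy sensitivity described above can be sketched with a toy Elo simulation. This is a hypothetical illustration, not the paper's method: three placeholder models are rated from pairwise outcomes, and duplicating one matchup shifts the ratings even though no new information was added.

```python
# Toy Elo sketch (hypothetical data; models A, B, C are placeholders)
# illustrating how redundant comparisons can skew Elo-style ratings.

def expected(r_a, r_b):
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32.0):
    """Apply one Elo update for a single pairwise comparison."""
    gain = k * (1.0 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

def run(matches):
    """Rate three models from a list of (winner, loser) outcomes."""
    ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
    for winner, loser in matches:
        update(ratings, winner, loser)
    return ratings

base = [("A", "B"), ("B", "C"), ("A", "C")]
# Replaying one outcome (B beats C) several times adds no new
# information, yet inflates B's rating relative to the base run.
redundant = base + [("B", "C")] * 5

r_base = run(base)
r_redundant = run(redundant)
```

Here `r_redundant["B"]` exceeds `r_base["B"]` purely because of the duplicated comparisons, which is the kind of data bias the proposed game-theoretic solution concepts are designed to be robust against.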