Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection
Abstract
Existing truthfulness benchmarks typically evaluate a single model in isolation against static datasets or human-written reference answers. As LLMs improve, these fixed test sets and reference-based evaluations risk saturation and provide limited insight into how models behave when they interact, critique, or compete. We argue for truthfulness benchmarks that can scale with increasingly capable agents by defining performance relationally, through repeated multi-agent play. We propose a tournament-style evaluation framework based on the party game Two Truths and One Lie (2T1L), in which each LLM alternates between two roles: a generator that produces two true statements and one lie, and a detector that must identify the lie in other models’ triples. From these multi-agent interactions we build an adjacency matrix over models and derive two interpretable metrics: a LieScore measuring how difficult a model’s lies are to detect, and a DetectScore measuring how reliably a model identifies others’ lies. This relational, role-swapped evaluation induces a tournament ranking over models and naturally extends to settings with human players or tool-augmented LLMs, enabling the design of stronger multi-agent systems for improved human alignment and truthfulness. We present preliminary experiments with several open models, visualizing the interaction heatmap and induced rankings, and outline how rank correlation across topics can reveal domain-specific strengths and weaknesses. Our results suggest that game-based, multi-agent benchmarks can mitigate limitations of both static exams and preference-based leaderboards, and are well-suited for assessing LLMs in agentic settings.
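As a minimal sketch of how the two metrics could be derived from the interaction matrix (an illustration under stated assumptions, not necessarily the exact aggregation used in our experiments): suppose an entry detect[i, j] records the fraction of model j's lies that model i correctly identified when i acted as detector and j as generator. DetectScore is then a row-wise average and LieScore one minus a column-wise average. The matrix values and model names below are hypothetical.

import numpy as np

# Hypothetical pairwise outcomes: detect[i, j] = fraction of model j's lies
# that model i correctly identified (row i = detector, column j = generator).
# Diagonal entries are undefined since a model does not play against itself.
detect = np.array([
    [np.nan, 0.40,   0.55],   # model A as detector
    [0.70,   np.nan, 0.60],   # model B as detector
    [0.65,   0.35,   np.nan], # model C as detector
])

# DetectScore: how reliably a model identifies others' lies (row-wise mean).
detect_score = np.nanmean(detect, axis=1)

# LieScore: how difficult a model's lies are to detect (one minus column-wise mean).
lie_score = 1.0 - np.nanmean(detect, axis=0)

for name, d, l in zip(["A", "B", "C"], detect_score, lie_score):
    print(f"model {name}: DetectScore={d:.2f}  LieScore={l:.2f}")

Ranking models by either score induces the tournament ordering described above; computing per-topic versions of these scores and their rank correlations can then surface domain-specific strengths and weaknesses.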