Poster in Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue

Kunal Samanta ⋅ Faisal Tareque Shohan ⋅ Amine Trabelsi ⋅ Richard Khoury


Abstract

The evaluation of multi-party social dialogue remains a significant challenge due to the complexity of turn-taking, distinct personas, and open-ended objectives. A widely adopted solution is to use instruction-tuned Large Language Models (LLMs) as automated judges, under the assumption that sufficiently capable models can approximate human preferences at scale. In this work, we present a negative result demonstrating that state-of-the-art LLM judges (including GPT-5.2 and Gemini 3.0 Flash) fail to align with human judgments in this domain, achieving near-random agreement (Cohen's $\kappa \approx 0.11$ to $0.17$). Through controlled ablations and stress tests, we isolate the mechanism of this failure: judges act as style classifiers rather than discourse evaluators. We show that while judges can detect extreme topic drift, they prefer "assistant-style" utterances over natural dialogue. Our findings expose a critical limitation of LLM-as-a-Judge frameworks for social interaction and caution against optimizing dialogue systems using evaluators that are blind to interactional coherence.
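
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's $\kappa$ between a human annotator and an LLM judge can be computed. It is not the authors' evaluation code: the function name `cohens_kappa` and the toy labels are hypothetical and chosen only for illustration. Kappa rescales the observed agreement by the agreement expected from each rater's label frequencies alone, so values near 0 indicate agreement close to chance level.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability of agreeing if each rater drew labels
    # independently according to their own marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only (not data from the paper).
human = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
judge = ["A", "A", "A", "A", "B", "A", "A", "A", "B", "B"]

# Raw agreement is 60%, but chance agreement is 50%, so kappa is only about 0.2.
print(round(cohens_kappa(human, judge), 2))
```

The toy example shows why raw agreement can mislead: a judge that agrees with the human on 6 of 10 items still scores a low kappa once chance agreement is discounted.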
