Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Abstract
The tendency of users to anthropomorphise large language models (LLMs) is of growing societal interest. Here, we present AnthroBench: a novel empirical method and tool for evaluating anthropomorphic LLM behaviours in realistic settings. Our work introduces three key advances. First, we develop a multi-turn evaluation of 14 distinct anthropomorphic behaviours, moving beyond single-turn assessment. Second, we present a scalable, automated approach that leverages simulations of user interactions, enabling efficient and reproducible assessment. Third, we conduct an interactive, large-scale human-subjects study (N=1101) to empirically validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all evaluated LLMs exhibit similar behaviours, primarily characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use. Crucially, we observe that the majority of these anthropomorphic behaviours first occur only after multiple turns, underscoring the necessity of multi-turn evaluations for understanding complex social phenomena in human-AI interaction. Our work provides a robust empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours.