Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?
Abstract
Recent incidents have shown that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reported cases of chatbots encouraging self-harm. Such risks underscore the urgent need for rigorous, clinically valid evaluation before these systems are integrated into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions, such as rapport building, guided exploration, intervention, and closure. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address them, we present \textbf{CareBench-CBT}, the largest clinically validated benchmark for counseling grounded in cognitive behavioral therapy (CBT). It unifies three components: 1) thousands of expert-curated and validated items that ensure data reliability; 2) realistic multi-turn dialogues that capture long-form therapeutic interaction; and 3) sessions aligned with CBT’s formal structure, enabling process-level evaluation of empathy, therapeutic alignment, and intervention quality. All data are anonymized, double-reviewed by 21 licensed professionals, and validated with reliability and competence metrics. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below that of human counselors. CareBench-CBT provides a rigorous foundation for the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.