Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?
Abstract
Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address these gaps, we present \textbf{CareBench-CBT}, the largest clinically validated benchmark for CBT-based counseling, unifying thousands of expert-curated items, realistic multi-turn dialogues, and formal CBT structural alignment. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below that of human counselors. Recognizing that long-horizon context management limits multi-turn performance, we further propose Hierarchical Therapy Memory (HTM), a training-free inference framework that structures dialogue history into global states and episodic summaries. HTM consistently improves session-level therapeutic coherence while reducing computational latency. Together, CareBench-CBT and HTM provide a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.
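The abstract describes HTM as structuring dialogue history into global states and episodic summaries. A minimal sketch of that two-level memory, purely illustrative: the class name, field names, and the string-truncation summarizer are all assumptions (a real system would call an LLM to produce the episodic summaries), not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalTherapyMemory:
    """Two-level dialogue memory: a small global state (e.g. goals,
    risk flags) plus episodic summaries of older spans, while the
    most recent turns are kept verbatim."""
    window: int = 6                                    # recent turns kept verbatim
    global_state: dict = field(default_factory=dict)   # persistent session facts
    episodes: list = field(default_factory=list)       # summaries of older spans
    recent: list = field(default_factory=list)         # (speaker, text) turns

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent.append((speaker, text))
        if len(self.recent) > self.window:
            # Fold turns that fall outside the window into one episodic summary.
            old = self.recent[: -self.window]
            self.recent = self.recent[-self.window :]
            self.episodes.append(self._summarize(old))

    def _summarize(self, turns: list) -> str:
        # Placeholder summarizer; a deployed system would use an LLM call here.
        return " / ".join(f"{s}: {t[:40]}" for s, t in turns)

    def build_context(self) -> str:
        # Assemble the prompt context: state first, then episodes, then recent turns.
        parts = [f"[state] {k}={v}" for k, v in self.global_state.items()]
        parts += [f"[episode] {e}" for e in self.episodes]
        parts += [f"{s}: {t}" for s, t in self.recent]
        return "\n".join(parts)
```

Because only the compact state, summaries, and a fixed window of turns are re-sent each step, the context fed to the model stays roughly constant in length as the session grows, which is consistent with the latency reduction the abstract claims.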