Gemma Needs Therapy: Investigating and Mitigating Emotional Instability in LLMs
Abstract
Large language models can produce responses that resemble emotional distress. While this behaviour has made for entertaining viral content, it raises concerns about model reliability and safety. We systematically investigate negative emotional propensities in LLMs and introduce controlled evaluation setups that surface emotional instability in Gemma and Gemini models, but not in other model families. Comparing base and instruct models from three families (Gemma, Qwen and OLMo), we find evidence that base models show similar propensities for negative emotional expression, but that only Gemma's post-training amplifies them. We demonstrate a simple mitigation: direct preference optimisation (DPO) on just 280 preference pairs reduces high-frustration responses from 24.7\% to 0.6\%, generalising across question types, user tones, and conversation lengths, without degrading capabilities. Our findings show that negative emotional propensities are a problem in some current LLMs, and we present i) evaluations to track this behaviour and ii) a mitigation that does not degrade capabilities.