Energy Landscapes of Truthfulness in LLM Attention
Abstract
We study whether the truthfulness of large language model (LLM) outputs is reflected in an internal energy landscape derived from modern Hopfield networks and associative memory theory. Building on the established connection between transformer attention and continuous Hopfield retrieval (Ramsauer et al., 2021; Millidge et al., 2022), we compute a Hopfield-style energy proxy directly from pre-softmax attention logits under teacher-forced evaluation. We evaluate 300 TruthfulQA questions (Lin et al., 2022), pairing each prompt with a truthful reference answer (best_answer) and a false reference answer (incorrect_answers[0]). On Qwen2.5-0.5B-Instruct, truthful answers exhibit systematically lower mean Hopfield energy (paired Cohen's d_z = −0.27, Wilcoxon p < 0.001) and larger retrieval margins (d_z = +0.24, p < 0.001). A lightweight logistic regression probe over 15 energy-derived features achieves an AUC of 0.66 (5-fold stratified CV with question-grouped folds). The base model is not fine-tuned; only a linear probe is trained on frozen features, so the result indicates that attention-logit energy itself carries non-trivial truthfulness signal. Layer-wise trajectories show that the energy separation emerges primarily in early layers (0–8), consistent with early blocks implementing the critical associative pattern-matching step. Together, these findings connect associative memory energy proxies to practical truthfulness discrimination and suggest a concrete mechanistic direction for hallucination monitoring.
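To make the two central quantities in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes the energy proxy is the negative log-sum-exp term of the modern Hopfield energy applied to one row of pre-softmax attention logits, and that d_z is the usual paired effect size (mean of the paired differences divided by their sample standard deviation). The function names and the inverse-temperature parameter `beta` are hypothetical, introduced only for illustration.

```python
import numpy as np

def hopfield_energy(logits, beta=1.0):
    """Assumed energy proxy: E = -(1/beta) * log(sum_j exp(beta * s_j)).

    `logits` holds pre-softmax attention scores with keys on the last axis.
    Causally masked positions may be -inf; they contribute zero to the sum
    as long as each row has at least one finite entry.
    """
    z = beta * np.asarray(logits, dtype=float)
    m = z.max(axis=-1, keepdims=True)  # shift by the row max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -lse / beta

def paired_cohens_dz(a, b):
    """Paired effect size: mean(a - b) / std(a - b), with ddof=1."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return diff.mean() / diff.std(ddof=1)
```

Under these assumptions, a per-answer energy would be obtained by averaging `hopfield_energy` over heads, layers, and answer tokens, and `paired_cohens_dz(truthful_energy, false_energy)` would be negative when truthful answers sit lower in energy, matching the sign convention reported above.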