What Do Large Language Models Know About Opinions?
Abstract
What large language models (LLMs) know about human opinions has important implications for aligning LLMs with human values, simulating humans with LLMs, and understanding what LLMs learn during training. While prior work has tested LLMs' knowledge of opinions via their next-token outputs, we present the first study to probe LLMs' internal knowledge of opinions, evaluating LLMs across 22 demographic groups on a wide range of topics. First, we show that LLMs' internal knowledge of opinions far exceeds what is revealed by their outputs, yielding a 50-59% improvement in alignment with human answer distributions; this improvement is competitive with fine-tuning while being 278 times less computationally expensive. Second, we find that knowledge of opinions emerges rapidly in the middle layers of the LLM, and we identify the final unembedding layer as the source of the discrepancy between internal knowledge and outputs. Third, using sparse autoencoders, we trace the knowledge of opinions in the LLM's residual stream back to attention heads and identify specific attention-head features responsible for different demographic groups. These findings open new avenues for building value-aligned and computationally efficient LLMs, with applications in survey research, social simulation, and, more broadly, safe and trustworthy AI. We will release our code upon acceptance.
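To make the probing setup concrete, the following is a minimal, hypothetical sketch of extracting an LLM's residual-stream hidden state for a survey question and fitting a linear probe that maps it to a human answer distribution. This is an illustration under stated assumptions, not the paper's actual pipeline: the model choice (gpt2), layer index, question, answer options, and target distribution are all placeholders.

```python
# Minimal, illustrative sketch (not the paper's exact method) of probing an LLM's
# internal representations for opinion knowledge: take a residual-stream hidden
# state for a survey question and fit a linear probe that predicts the human
# answer distribution. Model name, layer index, question, and target distribution
# are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's models may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()


def hidden_state(question: str, layer: int = 6) -> torch.Tensor:
    """Residual-stream activation at the last token of the prompt."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)


class OpinionProbe(nn.Module):
    """Linear probe mapping a hidden state to a distribution over answer options."""

    def __init__(self, hidden_dim: int, n_options: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, n_options)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(h), dim=-1)


# Hypothetical training loop: minimize KL divergence between the probe's output and
# the human answer distribution for one (question, demographic group) pair.
probe = OpinionProbe(model.config.hidden_size, n_options=4)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
question = "How concerned are you about climate change? (Respondent group: ...)"
human_dist = torch.tensor([0.45, 0.30, 0.15, 0.10])  # made-up target distribution

h = hidden_state(question)
for _ in range(100):
    pred = probe(h)
    loss = F.kl_div(pred.log(), human_dist, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a setup like this, the internal-knowledge versus output gap reported in the abstract would correspond to comparing the probe's predicted distribution against the distribution implied by the model's next-token answers.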