Steering LLMs for Multi-agent Decision-making using Representation Learning
Dom Huh ⋅ Prasant Mohapatra
Abstract
Activation steering offers a lightweight mechanism for controlling large language models (LLMs), but existing approaches have yet to be integrated into strategic multi-agent decision-making settings. In this work, we propose a representation learning framework for activation steering tailored to multi-agent decision-making. The framework optimizes steering representations directly from interaction trajectories by grounding latent variables in multi-agent dynamics and enforcing latent self-consistency over time. Our approach disentangles the latent factors underlying strategic interaction, enabling fine-grained behavioral control without modifying model parameters or relying on task-specific supervision; it relies instead on the structure of the multi-agent dynamics. We evaluate our method on $\gamma$-Bench, a diverse suite of cooperative, competitive, and mixed-motive games, and demonstrate consistent improvements in social and strategic performance across multiple open-source LLM families. These results suggest that representation learning provides a scalable and interpretable foundation for activation steering in multi-agent systems.
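For readers unfamiliar with activation steering, the generic idea can be sketched as adding a scaled direction vector to a hidden state at inference time while the model weights stay fixed. The snippet below is an illustrative toy under stated assumptions, not the method proposed in this work: the function `steer`, the vector `cooperate_direction`, the scale `alpha`, and the 4-dimensional hidden state are all hypothetical, and the sketch does not show how steering representations are learned from interaction trajectories.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction, element-wise.

    Only the activation is shifted; no model parameter is modified.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering direction must have the same size")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden state and a hypothetical steering direction (assumed, not from the paper).
hidden_state = [0.5, -1.0, 2.0, 0.0]
cooperate_direction = [1.0, 0.0, -1.0, 0.5]

# alpha controls how strongly the activation is pushed along the direction.
steered = steer(hidden_state, cooperate_direction, alpha=2.0)
print(steered)
```

Setting `alpha` to zero recovers the original activation, which is why this kind of intervention is often described as lightweight: the strength of the behavioral shift is a single inference-time knob.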