Multi-Agent Generative AI: Strategic Failures and Mechanisms for Cooperation (Zhijing Jin)
Abstract
As AI systems take on more autonomous roles across the economy, governance, and daily life, they will increasingly interact with each other. Will these generative AI agents coordinate for collective good, or exploit rival agents and people in ways that put humans at serious risk? In this talk, Zhijing will explain how we assess these dangers through large-scale social simulations and game-theoretic analysis of frontier generative models. Across thousands of high-stakes scenarios (from arms-race escalation to common-pool resource depletion), frontier models choose socially beneficial actions in only 62% of cases, with systematic biases in framing and ordering worsening outcomes. We characterize these as strategic failures, where models' decisions diverge from game-theoretic optimality, and show they persist even for state-of-the-art reasoning models. Surprisingly, stronger reasoning capabilities often make models more prone to selfish strategies like free-riding, and recent models consistently defect in unmodified social dilemmas regardless of scale.
However, game-theoretic interventions offer a promising path forward. Cooperation mechanisms such as mediation, enforceable contracts, and reputation systems significantly improve collective welfare and become more effective under stronger optimization pressures, pushing systems toward the Pareto frontier of multi-agent outcomes. Beyond formal mechanisms, self-organizing social structures like elected leadership further sustain cooperation in sequential dilemmas. These results suggest that safer multi-agent generative AI requires principled institutional design rather than reliance on models' inherent prosociality.
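To make the contract mechanism concrete, here is a minimal toy sketch in Python. It is not the simulation setup used in the work described: the public-goods game, the parameter values, and the helper names (`payoffs`, `nash_equilibria`, `welfare`) are all illustrative assumptions. It shows that free-riding is the only equilibrium in the unmodified dilemma, and that a contract fining defectors more than they gain by free-riding makes full cooperation the only equilibrium.

```python
from itertools import product

# Hypothetical toy parameters, chosen only for illustration.
N_AGENTS = 4
ENDOWMENT = 10.0
MULTIPLIER = 2.0  # 1 < MULTIPLIER < N_AGENTS makes this a social dilemma


def payoffs(contributes, penalty=0.0):
    """Return each agent's payoff in a one-shot public-goods game.

    `contributes` is a tuple of booleans (True = contribute the full endowment).
    `penalty` models an enforceable contract: a fine paid by any agent who
    does not contribute. The fine is assumed to be burned, not redistributed.
    """
    pot = MULTIPLIER * ENDOWMENT * sum(contributes)
    share = pot / len(contributes)
    return [share + (0.0 if c else ENDOWMENT - penalty) for c in contributes]


def is_nash(profile, penalty=0.0):
    """True if no single agent can gain by unilaterally switching action."""
    base = payoffs(profile, penalty)
    for i in range(len(profile)):
        flipped = profile[:i] + (not profile[i],) + profile[i + 1:]
        if payoffs(flipped, penalty)[i] > base[i] + 1e-9:
            return False
    return True


def nash_equilibria(penalty=0.0):
    """Enumerate all pure-strategy Nash equilibria by brute force."""
    profiles = product([False, True], repeat=N_AGENTS)
    return [p for p in profiles if is_nash(p, penalty)]


def welfare(profile, penalty=0.0):
    """Collective welfare: the sum of all agents' payoffs."""
    return sum(payoffs(profile, penalty))


if __name__ == "__main__":
    for fine in (0.0, 6.0):
        for eq in nash_equilibria(fine):
            print(f"penalty={fine}: equilibrium={eq}, welfare={welfare(eq, fine)}")
```

With these assumed numbers, a defector gains 5 by free-riding (keeping 10 while giving up only 10 × 2 / 4 of the shared return), so any fine above 5 reverses the incentive. Full cooperation gives every agent 20 instead of 10, so it Pareto-dominates the all-defect outcome.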