Agent Properties for Multi-Agent Safety
Abstract
Cooperation failures in multi-agent interactions could lead to catastrophic outcomes even among aligned AI agents. Classic cooperation problems such as the Prisoner's Dilemma or the Tragedy of the Commons have been useful for illustrating and exploring this challenge, but toy experiments with current language models cannot provide robust evidence for how advanced agents will behave in real-world settings. To better understand how to prevent cooperation failures among AI agents, we propose a shift in focus from simulating canonical game-theoretic scenarios to studying specific agent properties. These should include both individual properties, which are observable in isolation, and interactive properties, which only manifest in relation to other agents. If we can (1) evaluate the extent to which relevant properties are present in agents and (2) understand how those properties influence outcomes in multi-agent interactions, we gain a path towards actionable results that could inform agent design and regulation.