Can we reformulate machine learning from the ground up with multiagent in mind? Modern machine learning primarily takes an optimizationfirst, singleagent approach, however, many of life’s intelligent systems are multiagent in nature across a range of scales and domains such as market economies, ant colonies, forest ecosystems, and decentralized energy grids.
Generative adversarial networks represent one of the most recent successful deviations from the dominant singleagent paradigm by formulating generative modeling as a twoplayer, zerosum game. Similarly, a few recent methods formulating root node problems of machine learning and data science as games among interacting agents have gained recognition (PCA, NMF). Multiagent designs are typically distributed and decentralized which leads to robust and parallelizable learning algorithms.
We want to bring together a community of people that wants to revisit machine learning problems and reformulate them as solutions to games. How might this algorithmic bias affect the solutions that arise and could we define a blueprint for problems that are amenable to gamification? By exploring this direction, we may gain a fresh perspective on machine learning with distinct advantages to the current dominant optimization paradigm.
Fri 5:00 a.m.  5:15 a.m.

Opening Remarks
(
Intro
)

🔗 
Fri 5:15 a.m.  5:50 a.m.

Professor Sarit Kraus: AgentHuman Complex Games for Multiagent Studies
(
Invited Talk
)
SlidesLive Video » 
Sarit Kraus 🔗 
Fri 5:50 a.m.  5:55 a.m.

Sarit Kraus Q&A
(
Q&A
)

🔗 
Fri 5:55 a.m.  6:00 a.m.

Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets
(
Poster Spotlights
)
link »
SlidesLive Video » We study episodic twoplayer zerosum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimismbased algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a datadependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an informationtheoretical lower bound, which suggests that the datadependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of ``relative uncertainty'', which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation. 
Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang 🔗 
Fri 6:00 a.m.  6:05 a.m.

Learning to Share in MultiAgent Reinforcement Learning
(
Poster Spotlights
)
link »
SlidesLive Video » In this paper, we study the problem of networked multiagent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents make decision in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in human's learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the highlevel policy learns how to share reward with neighbors to decompose the global objective, while the lowlevel policy learns to optimize local objective induced by the highlevel policies in the neighborhood. The two policies form a bilevel optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenario. 
Yuxuan Yi · Ge Li · Yaowei Wang · Zongqing Lu 🔗 
Fri 6:05 a.m.  6:10 a.m.

A Regret Minimization Approach to MultiAgent Control
(
Poster Spotlights
)
link »
SlidesLive Video » We study the problem of multiagent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multiagent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics. 
Udaya Ghai · Udari Madhushani · Naomi Leonard · Elad Hazan 🔗 
Fri 6:10 a.m.  6:15 a.m.

A GameTheoretic Approach for Improving Generalization Ability of TSP Solvers
(
Poster Spotlights
)
link »
SlidesLive Video » In this paper, we introduce a twoplayer zerosum framework between a trainable \emph{Solver} and a \emph{Data Generator} to improve the generalization ability of deep learningbased solvers for Traveling Salesman Problems (TSP).Grounded in \textsl{Policy Space Response Oracle} (PSRO) methods, our twoplayer framework outputs a population of bestresponding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances with different types and sizes. Results suggest that our Solvers achieve the stateoftheart performance even on tasks the Solver never meets, whilst the performance of other deep learningbased Solvers drops sharply due to overfitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed twoplayer game and demonstrate that the exploitability of the Solver population decreases during training, and it eventually approximates the Nash equilibrium along with the Generator. 
Chenguang Wang · Yaodong Yang · Oliver Slumbers · Congying Han · Tiande Guo · Haifeng Zhang · Jun Wang 🔗 
Fri 6:15 a.m.  6:50 a.m.

Professor Elad Schneidman: Efficient Collective Behavior from Maximizing Diversity
(
Invited Talk
)
SlidesLive Video » 
🔗 
Fri 6:50 a.m.  6:55 a.m.

Elad Schneidman Q&A
(
Q&A
)

🔗 
Fri 6:55 a.m.  7:45 a.m.

Poster Session 1 + Coffee Break
(
Poster Session
)

🔗 
Fri 7:45 a.m.  8:10 a.m.

NearOptimal Learning of ExtensiveForm Games with Imperfect Information
(
Contributed Talk
)
link »
SlidesLive Video »
This paper resolves the open question of designing nearoptimal algorithms for learning imperfectinformation extensiveform games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$approximate Nash equilibrium in twoplayer zerosum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the informationtheoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multiplayer generalsum games.

Yu Bai · Chi Jin · Song Mei · Tiancheng Yu 🔗 
Fri 8:10 a.m.  8:50 a.m.

“The Multiagent Hammer” w/ Sarit Kraus, Elad Schneidman, Tiancheng Yu, Goran Zuzic
(
Discussion Panel
)

🔗 
Fri 8:50 a.m.  9:30 a.m.

Professor Constantinos ("Costis") Daskalakis: Equilibrium Computation & Machine Learning
(
Invited Talk
)
SlidesLive Video » 
Constantinos C Daskalakis 🔗 
Fri 9:30 a.m.  9:35 a.m.

Constantinos Daskalakis Q&A
(
Q&A
)

🔗 
Fri 9:35 a.m.  10:45 a.m.

Lunch

🔗 
Fri 10:45 a.m.  11:10 a.m.

Learning Robust Algorithms for Online Allocation Problems Using Adversarial Training
(
Contributed Talk
)
link »
SlidesLive Video » We address the challenge of finding algorithms for online allocation (i.e. bipartite matching) using a machine learning approach. In this paper, we focus on the AdWords problem, which is a classical online budgeted matching problem of both theoretical and practical significance. In contrast to existing work, our goal is to accomplish algorithm design {\em tabula rasa}, i.e., without any humanprovided insights or experttuned training data beyond specifying the objective and constraints of the optimization problem. We construct a framework based on insights from game theory and adversarial training. Key to our approach is to generate adversarial examples that expose the weakness of any given algorithm. A unique challenge in our context is to generate complete examples from scratch rather than perturbing given examples and we demonstrate this can be accomplished for the Adwords problem. We use this framework to cotrain an algorithm network and an adversarial network against each other until they converge to an equilibrium. This approach finds algorithms and adversarial examples that are consistent with known optimal results. Secondly, we address the question of robustness of the algorithm, namely can we design algorithms that are both strong under practical distributions, as well as exhibit robust performance against adversarial instances. To accomplish this, we train algorithm networks using a mixture of adversarial and fixed practical distributions like powerlaws; the resulting networks exhibit a smooth tradeoff between the two input regimes. 
Goran Zuzic · Di Wang · Aranyak Mehta · D. Sivakumar 🔗 
Fri 11:10 a.m.  11:45 a.m.

Dr Kaiqing Zhang: Multiagent Reinforcement Learning in Stochastic Games: From AlphaGo to Robust Control
(
Invited Talk
)

Kaiqing Zhang 🔗 
Fri 11:45 a.m.  11:50 a.m.

Kaiqing Zhang Q&A
(
Q&A
)

🔗 
Fri 11:50 a.m.  11:55 a.m.

VLearning  A Simple, Efficient, Decentralized Algorithm for Multiagent RL
(
Poster Spotlights
)
link »
SlidesLive Video »
A major challenge of multiagent reinforcement learning (MARL) is \emph{the curse of multiagents}, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithmsVlearning, which provably learns Nash equilibria (in the twoplayer zerosum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer generalsum setting) in a number of samples that only scales with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i\th$ player. This is in sharp contrast to the size of the joint action space which is $\prod_{i=1}^m A_i$.Vlearning (in its basic form) is a new class of singleagent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Qlearning algorithm, it performs incremental updates to the value functions. Different from Qlearning, it only maintains the estimates of Vvalues instead of Qvalues. This key difference allows Vlearning to achieve the claimed guarantees in the MARL setting by simply letting all agents run Vlearning independently.

Chi Jin · Qinghua Liu · Yuanhao Wang · Tiancheng Yu 🔗 
Fri 11:55 a.m.  12:00 p.m.

ModelFree Opponent Shaping
(
Poster Spotlights
)
link »
SlidesLive Video » In generalsum games, the interaction of selfinterested learning agents commonly leads to collectively worstcase outcomes, such as defectdefect in the iterated prisoner’s dilemma (IPD). To overcome this, some methods, such as Learning with OpponentLearning Awareness (LOLA), shape their opponents’ learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higherorder derivatives, which are calculated through whitebox access to an opponent’s differentiable learning algorithm. To address these issues, we propose ModelFree Opponent Shaping (MFOS). MFOS learns in a metagame in which each metastep is an episode of the underlying (“inner”) game. The metastate consists of the inner policies, and the metapolicy produces a new inner policy to be used in the next episode. MFOS then uses generic modelfree optimisation methods to learn metapolicies that accomplish longhorizon opponent shaping. Empirically, MFOS nearoptimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the wellknown ZeroDeterminant (ZD) extortion strategy in the IPD. In the same settings, MFOS leads to socially optimal outcomes under metaselfplay. Finally, we show that MFOS can be scaled to highdimensional settings. 
Chris Lu · Timon Willi · Christian Schroeder de Witt · Jakob Foerster 🔗 
Fri 12:00 p.m.  12:05 p.m.

Influencing LongTerm Behavior in Multiagent Reinforcement Learning
(
Poster Spotlights
)
link »
SlidesLive Video » The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this nonstationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as the time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better longterm performance than stateoftheart baselines in various domains, including the full spectrum of generalsum, competitive, and cooperative settings. 
Dong Ki Kim · Matt Riemer · Miao Liu · Jakob Foerster · Michael Everett · Chuangchuang Sun · Gerald Tesauro · JONATHAN HOW 🔗 
Fri 12:05 p.m.  12:10 p.m.

ZeroSum Stochastic Stackelberg Games
(
Poster Spotlights
)
link »
SlidesLive Video » Minmax optimization problems (i.e., zerosum games) have been used to model problems in a variety of fields in recent years, from machine learning to economics. The literature to date has mostly focused on static zerosum games, assuming independent strategy sets. In this paper, we study a form of dynamic zerosum games, called stochastic games, with dependent strategy sets. Just as zerosum games with dependent strategy sets can be interpreted as zerosum Stackelberg games, stochastic zerosum games with dependent strategy sets can be interpreted as zerosum stochastic Stackelberg games. We prove the existence of an optimal solution in zerosum stochastic Stackelberg games (i.e., a recursive Stackelberg equilibrium), provide necessary and sufficient conditions for a solution to be optimal, and show that a recursive Stackelberg equilibrium can be computed in polynomial time via value iteration. Finally, we show that stochastic Stackelberg games can model the problem of pricing and allocating goods across agents and time; more specifically, we propose a stochastic Stackelberg game whose solutions correspond to a recursive competitive equilibrium in a stochastic Fisher market. We close with a series of experiments which confirm our theoretical results and show how value iteration performs in practice. 
Denizalp Goktas · Jiayi Zhao · Amy Greenwald 🔗 
Fri 12:10 p.m.  1:00 p.m.

Poster Session 2 + Coffee Break
(
Poster Session
)

🔗 
Fri 1:00 p.m.  1:40 p.m.

Professor Lillian Ratliff: Beyond Open Loop Algorithm Design: Learning from DecisionDependent Data
(
Invited Talk
)
SlidesLive Video » 
Lillian J Ratliff 🔗 
Fri 1:40 p.m.  1:45 p.m.

Lillian Ratliff Q&A
(
Q&A
)

🔗 
Fri 1:45 p.m.  2:10 p.m.

Modeling Strong and Humanlike Gameplay with KLRegularized Search
(
Contributed Talk
)
link »
SlidesLive Video » We consider the task of building strong but humanlike policies in multiagent decisionmaking problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while selfplay learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitationlearned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitationlearned policy, and show that using this algorithm for search in nopress Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger. 
Athul Paul Jacob · David Wu · Gabriele Farina · Adam Lerer · Hengyuan Hu · Anton Bakhtin · Jacob Andreas · Noam Brown 🔗 
Fri 2:10 p.m.  2:50 p.m.

“The Multiagent Hammer” w/ Lillian Ratliff, Constantinos Daskalakis, Kaiqing Zhang, Noam Brown
(
Discussion Panel
)

🔗 
Fri 2:50 p.m.  3:00 p.m.

Closing Remarks
(
Close
)

🔗 


Optimal Correlated Equilibria in GeneralSum ExtensiveForm Games: FixedParameter Algorithms, Hardness, and TwoSided ColumnGeneration
(
Poster
)
link »
We study the problem of finding optimal correlated equilibria of various sorts: normalform coarse correlated equilibrium (NFCCE), extensiveform coarse correlated equilibrium (EFCCE), and extensiveform correlated equilibrium (EFCE). This is NPhard in the general case and has been studied in special cases, most notably trianglefree games (Farina & Sandholm 2020), which include all twoplayer games with public chance moves. However, the general case is not well understood, and algorithms usually scale poorly. In this paper, we make two primary contributions. First, we introduce the correlation DAG, a representation of the space of correlated strategies whose structure and size are dependent on the specific solution concept desired. It extends the team belief DAG of Zhang et al (2022) to generalsum games. For each of the three solution concepts, its size depends exponentially only on a parameter related to the information structure of the game. We also prove a fundamental complexity gap: while our size bounds for NFCCE are similar to those achieved in the case of team games by Zhang et al (2022), this is impossible to achieve for the other two concepts under standard complexity assumptions. Second, we propose a twosided column generation approach to compute optimal correlated strategies in extensiveform games. Our algorithm improves upon the onesided approach of Farina et al (2021a) by means of a new decomposition of correlated strategies which allows players to reoptimize their sequenceform strategies with respect to correlation plans which were previously added to the support. Experiments show that our techniques outperform the prior state of the art for computing optimal generalsum correlated equilibria, and that our two families of approaches have complementary strengths: the correlation DAG is fast when the parameter is small and the twosided column generation approach is superior when the parameter is large. 
Brian Zhang · Gabriele Farina · Andrea Celli · Tuomas Sandholm 🔗 


General sum stochastic games with networked information flow
(
Poster
)
link »
Inspired by naturally emerging networks of reinforcement learning problems in operation sectors such as supply chains, epidemics, and social networks, we present a stochastic game formulation with networked reward and transition dependencies for multiagent reinforcement learning (MARL). Despite the rapid development of MARL algorithms, the majority of research efforts are motivated by settings where each pair of players is either collaborating or fully competing. However, in many largescale networked systems, pairs of interacting players experience general sumtype objectives players must collaborate for total profit maximization and locally compete for individual profit maximization. In contrast to popular MARL benchmark environments, stochastic games with general sum objectives have received relatively less attention, in part due to the lack of motivating application and toy examples. In this paper, we address the lack of motivating examples by presenting a networked stochastic game framework with pairwise general sum objectives and relating the framework to operation sector systems such as supply chains. We discuss specific game features that distinguish this type of stochastic game from existing canonical stochastic game models and specifically outline the impact of these features on the adaptation of popular multiagent learning paradigms such as individual learning and centralized learning decentralized execution. We conclude with a two player supply chain to benchmark existing MARL algorithms and to contextualize the challenges at hand. 
Sarah Li · Lillian J Ratliff · Peeyush Kumar 🔗 


Finding and only finding local Nash equilibria by both pretending to be a follower
(
Poster
)
link »
Finding local Nash equilibria in twoplayer differentiable games is a classical problem in game theory with important relevance in machine learning. We propose double FollowtheRidge (doubleFTR), an algorithm whose attracting critical points are equivalent to differential Nash equilibria in generalsum twoplayer differential games. To our knowledge, doubleFTR is the first algorithm with such guarantees for generalsum games. Furthermore, we show that by varying its preconditioner, doubleFTR leads to a broader family of algorithms with the same properties. DoubleFTR avoids oscillation near equilibria due to the realeigenvalues of its Jacobian at critical points. Finally, we empirically verify the effectiveness of doubleFTR in finding local Nash equilibria in two simple examples. 
Xuchan Bao · Guodong Zhang 🔗 


Automated equilibrium analysis of 2x2x2 games
(
Poster
)
link »
The set of all Nash equilibria of a noncooperative game with more than two players is defined by equations and inequalities between nonlinear polynomials, which makes it challenging to compute. This paper presents an algorithm that computes this set for the simplest game with more than two players with arbitrary (possibly nongeneric) payoffs, which has not been done before. We give new elegant formulas for completely mixed equilibria, and compute visual representations of the bestresponse correspondences and their intersections, which define the Nash equilibrium set. These have been implemented in Python and will be part of a public webbased software for automated equilibrium analysis. For small games, which are often studied in economic models, a complete Nash equilibrium analysis is desirable and should be feasible. This project demonstrates the complexity of this task and offers pathways for extensions to larger games. 
Sahar Jahani · Bernhard Von Stengel 🔗 


Collaborative AutoCurricula MultiAgent Reinforcement Learning with Graph Neural Network Communication Layer for Openended WildfireManagement Resource Distribution
(
Poster
)
link »
Most realworld domains can be formulated as multiagent (MA) systems. Multiple intentionality sharing agents can solve more complex tasks by collaborating, possibly in less time. True cooperative actions are beneficial for egoistic and collective reasons. However, teaching individual agents to sacrifice egoistic benefits for positive collective performance seems challenging. We build on a recently proposed MultiAgent Reinforcement Learning (MARL) mechanism with a Graph Neural Network (GNN) communication layer. Rarely chosen communication actions were marginally beneficial. Here we propose a MARL system in which agents can help collaborators perform better while risking low selfperformance or send help requests for predicted increased task difficulty. We conduct our study in the context of resource distribution for wildfire management. Communicating environmental features and partially observable fire occurrence observations help the agent collective preemptively distribute resources. Furthermore, we introduce a procedural training environment accommodating autocurricula and openendedness towards better generalizability. While our MA communication proposal includes a large action and observation space, we outperform a Greedy Heuristic Baseline, a SingleAgent (SA) and MA setup. 
Philipp D Siedler 🔗 


Coalition Formation in Ridesharing with Walking Options
(
Poster
)
link »
In this work, we introduce a novel coalition formation mechanism for ridesharing services. We extend on the current literature to integrate walking options within the trip, this allows us to effectively account for a user’s value of time when walk ing, which is not negligible. Additionally, we propose a price allocation method that ensures proportionality for sharing the costs. We formally evaluate the effi cacy of our pricing method and discuss its desirable properties. Our study is a step towards developing smart mobility systems that recommend optimal ridesharing coalitions to users and suggest cost allocations that are fair and stable. 
Lucia Cipolina Kun · Sebastian Stein · Vahid Yazdanpanah · Enrico Gerding 🔗 


Stackelberg Policy Gradient: Evaluating the Performance of Leaders and Followers
(
Poster
)
link »
Hierarchical order of play is an important concept for reinforcement learning to understand better the decisions made by strategic agents in a shared environment. In this paper, we compare the learning dynamics between Stackelberg and simultaneous reinforcement learning agents. Agents are trained using their policy gradient and are tested against each other in a tournament. We compare agent performance in zerosum and nonzerosum Markov games. We show that the Stackelberg leader performs better in training under the same parameters. However, under the same parameters in the tournament setting, Stackelberg leaders and followers performed similarly to the simultaneous player. Analytically, hierarchical training can potentially provide stronger guarantees for policy gradient. 
QuocLiem Vu · Zane Alumbaugh · Ryan Ching · Quanchen Ding · Arnav Mahajan · Benjamin Chasnov · Sam Burden · Lillian J Ratliff 🔗 


Generalization Games for Reinforcement Learning
(
Poster
)
link »
In reinforcement learning (RL), the term generalization has either denoted the practice of introducing function approximation to reduce the intractability of large state and action spaces problems or designated RL agents' ability to transfer learned experiences to one or more evaluation tasks. Recently, many subfields have emerged to understand how distributions of training tasks affect an RL agent's performance in unseen environments. While the field is extensive and evergrowing, recent research has underlined that variability among the different approaches is not as significant. We leverage this intuition to demonstrate how current methods for generalization in RL are specializations of a general framework. We obtain the fundamental aspects of this formulation by rebuilding a Markov Decision Process (MDP) from the ground up by resurfacing the gametheoretic framework of games against nature. The twoplayer game that arises from considering nature as a complete player on this formulation explains how existing methods rely on learned and randomized dynamics and initial state distributions. We develop this result further by drawing inspiration from mechanism design theory to introduce the role of a principal as a third player that can modify the payoff functions of the decisionmaking agent and nature. The games induced by playing against the principal extend our framework to explain how learned and randomized reward functions induce generalization in RL agents. The main contribution of our work is the complete description of the Generalization Games for Reinforcement Learning, a multiagent, multiplayer, gametheoretic formal approach to study generalization methods in RL. We offer a preliminary ablation experiment of the different components of the framework and demonstrate that a more simplified composition of the objectives that we introduce for each player leads to comparable, and in some cases superior, zeroshot generalization performance than stateoftheart methods while requiring almost two orders of magnitude fewer samples. 
Manfred Diaz · Charlie Gauthier · Glen Berseth · Liam Paull 🔗 


ZeroSum Stochastic Stackelberg Games
(
Poster
)
link »
Minmax optimization problems (i.e., zerosum games) have been used to model problems in a variety of fields in recent years, from machine learning to economics. The literature to date has mostly focused on static zerosum games, assuming independent strategy sets. In this paper, we study a form of dynamic zerosum games, called stochastic games, with dependent strategy sets. Just as zerosum games with dependent strategy sets can be interpreted as zerosum Stackelberg games, stochastic zerosum games with dependent strategy sets can be interpreted as zerosum stochastic Stackelberg games. We prove the existence of an optimal solution in zerosum stochastic Stackelberg games (i.e., a recursive Stackelberg equilibrium), provide necessary and sufficient conditions for a solution to be optimal, and show that a recursive Stackelberg equilibrium can be computed in polynomial time via value iteration. Finally, we show that stochastic Stackelberg games can model the problem of pricing and allocating goods across agents and time; more specifically, we propose a stochastic Stackelberg game whose solutions correspond to a recursive competitive equilibrium in a stochastic Fisher market. We close with a series of experiments which confirm our theoretical results and show how value iteration performs in practice. 
Denizalp Goktas · Jiayi Zhao · Amy Greenwald 🔗 


Object Representations as Equilibria: Training Iterative Inference Algorithms with Implicit Differentiation
(
Poster
)
link »
Deep generative models, particularly those that aim to factorize the observations into discrete entities (such as objects), must often use iterative inference procedures that break symmetries among equally plausible explanations for the data. Such inference procedures include variants of the expectationmaximization algorithm and structurally resemble clustering algorithms in a latent space. However, combining such methods with deep neural networks necessitates differentiating through the inference process, which can make optimization exceptionally challenging. In this work, we observe that such iterative inference methods can be made differentiable by means of the implicit function theorem, and develop an implicit differentiation approach that improves the stability and tractability of training such models by decoupling the forward and backward passes. This connection enables us to apply recent advances in optimizing implicit layers to not only improve the stability and optimization of the slot attention module in SLATE, a stateoftheart method for learning entity representations, but do so with constant space and time complexity in backpropagation and only one additional line of code. 
Michael Chang · Thomas L. Griffiths · Sergey Levine 🔗 


Team Belief DAG Form: A Concise Representation for TeamCorrelated GameTheoretic Decision Making
(
Poster
)
link »
In this paper, we introduce a new representation for teamcoordinated gametheoretic decision making, which we coin team belief DAG form. In our representation, at every timestep, a team coordinator observes the information that is public to all its members, and then decides on a prescription for all the possible states consistent with its observations. Our representation unifies and extends recent approaches to team coordination. Similar to the approach of Carminati et al (2021), our team belief DAG form can be used to capture adversarial team games, and enables standard, outofthebox gametheoretic techniques including noregret learning (e.g., CFR and its stateoftheart modern variants such as DCFR and PCFR$^+$) and firstorder methods. However, our representation can be exponentially smaller, and can be viewed as a lossless abstraction of theirs into a directed acyclic graph. In particular, like the LPbased algorithm of Zhang & Sandholm (2022), the size of our representation scales with the amount of information uncommon to the team; in fact, using linear programming on top of our team belief DAG form to solve for a team correlated equilibrium in an adversarial team games recovers almost exactly their algorithm. Unlike that paper, however, our representation explicitly exposes the structure of the decision space, which is what enables the aforementioned gametheoretic techniques.

Brian Zhang · Gabriele Farina · Tuomas Sandholm 🔗 


Staged independent learning: Towards decentralized cooperative multiagent Reinforcement Learning
(
Poster
)
link »
We empirically show that classic ideas from twotime scale stochastic approximation \citep{borkar1997stochastic} can be extended to complex cooperative multiagent reinforcement learning (MARL) problems. We first start with giving a multiagent estimation problem as a motivating example where staged best response iteration converges while parallel best response iteration does not. Then we present a general implementation of staged multiagent RL algorithms based on multitime scale stochastic approximation, and show that our new method called Staged Independent Proximal Policy Optimization (SIPPO) outperforms stateoftheart independent learning on almost all the tasks in epymarl \citep{papoudakis2020benchmarking} benchmark. This can be seen as a first step towards more decentralized MARL methods based on multitime scale learning principle. 
Hadi Nekoei · Akilesh Badrinaaraayanan · Amit Sinha · Mohammad Amini · Janarthanan Rajendran · Aditya Mahajan · Sarath Chandar 🔗 


Competitive Physics Informed Networks
(
Poster
)
link »
Physics Informed Neural Networks (PINNs) solve partial differential equations (PDEs) by representing them as neural networks. The original PINN implementation does not provide high accuracy, typically attaining about $10^{3}$ $L_2$ relative error. We formulate and test an adversarial approach called competitive PINNs (CPINNs) to overcome this limitation. CPINNs train a discriminator that is rewarded for predicting PINN mistakes. The discriminator and PINN participate in a zerosum game with the exact PDE solution as an optimal strategy. This approach avoids the issue of squaring the large condition numbers of PDE discretizations. Numerical experiments show that a CPINN trained with competitive gradient descent can achieve errors two orders of magnitude smaller than that of a PINN trained with Adam or stochastic gradient descent.

Qi Zeng · Spencer Bryngelson · Florian Schaefer 🔗 


Modeling Strong and Humanlike Gameplay with KLRegularized Search
(
Poster
)
link »
We consider the task of building strong but humanlike policies in multiagent decisionmaking problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while selfplay learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitationlearned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitationlearned policy, and show that using this algorithm for search in nopress Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger. 
Athul Paul Jacob · David Wu · Gabriele Farina · Adam Lerer · Hengyuan Hu · Anton Bakhtin · Jacob Andreas · Noam Brown 🔗 


MultiAgent Neural Rewriter for Vehicle Routing with Limited Disclosure of Costs
(
Poster
)
link »
We interpret solving the multivehicle routing problem as a team Markov game with partially observable costs. For a given set of customers to serve, the playing agents (vehicles) have the common goal to determine the teamoptimal agent routes with minimal total cost. Each agent thereby observes only its own cost. Our multiagent reinforcement learning approach, the socalled multiagent Neural Rewriter, builds on the singleagent Neural Rewriter to solve the problem by iteratively rewriting solutions. Parallel agent action execution and partial observability require new rewriting rules for the game. We propose the introduction of a socalled pool in the system which serves as a collection point for unvisited nodes. It enables agents to act simultaneously and exchange nodes in a conflictfree manner. We realize limited disclosure of agentspecific costs by only sharing them during learning. During inference, each agents acts decentrally, solely based on its own cost.First empirical results on small problem sizes demonstrate that we reach a performance close to the employed ORTools benchmark which operates in the perfect cost information setting. 
Nathalie Paul · Alexander Kiser · Tim Wirtz · Stefan Wrobel 🔗 


A NonNegative Matrix Factorization Game
(
Poster
)
link »
We present a novel gametheoretic formulation of NonNegative Matrix Factorization (NNMF), a popular dataanalysis method with many scientific and engineering applications. The gametheoretic formulation is shown to have favorable scaling and parallelization properties, while retaining reconstruction and convergence performance comparable to the traditional Multiplicative Updates (Lee & Seung,1999) algorithm. 
Satpreet H Singh 🔗 


Influencing LongTerm Behavior in Multiagent Reinforcement Learning
(
Poster
)
link »
The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this nonstationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as the time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better longterm performance than stateoftheart baselines in various domains, including the full spectrum of generalsum, competitive, and cooperative settings. 
Dong Ki Kim · Matt Riemer · Miao Liu · Jakob Foerster · Michael Everett · Chuangchuang Sun · Gerald Tesauro · JONATHAN HOW 🔗 


Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
(
Poster
)
link »
Prisoner’s Dilemma mainly treat the choice to cooperate or defect as an atomic action. We propose to study online learning algorithm behavior in the Iterated Prisoner’s Dilemma (IPD) game, where we explored the full spectrum of reinforcement learning agents: multiarmed bandits, contextual bandits and reinforcement learning. We have evaluate them based on a tournament of iterated prisoner's dilemma where multiple agents can compete in a sequential fashion. This allows us to analyze the dynamics of policies learned by multiple selfinterested independent rewarddriven agents, and also allows us study the capacity of these algorithms to fit the human behaviors. Results suggest that considering the current situation to make decision is the worst in this kind of social dilemma game. Multiples discoveries on online learning behaviors and clinical validations are stated. 
Baihan Lin · Djallel Bouneffouf · Guillermo Cecchi 🔗 


Solving Structured Hierarchical Games Using Differential Backward Induction
(
Poster
)
link »
From largescale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player’s utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneousmove games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a novel gradientbased backpropagationstyle algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policymaking problems. 
Zun Li · Feiran Jia · Aditya Mate · Shahin Jabbari · Mithun Chakraborty · Milind Tambe · Yevgeniy Vorobeychik 🔗 


ModelFree Opponent Shaping
(
Poster
)
link »
In generalsum games, the interaction of selfinterested learning agents commonly leads to collectively worstcase outcomes, such as defectdefect in the iterated prisoner’s dilemma (IPD). To overcome this, some methods, such as Learning with OpponentLearning Awareness (LOLA), shape their opponents’ learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higherorder derivatives, which are calculated through whitebox access to an opponent’s differentiable learning algorithm. To address these issues, we propose ModelFree Opponent Shaping (MFOS). MFOS learns in a metagame in which each metastep is an episode of the underlying (“inner”) game. The metastate consists of the inner policies, and the metapolicy produces a new inner policy to be used in the next episode. MFOS then uses generic modelfree optimisation methods to learn metapolicies that accomplish longhorizon opponent shaping. Empirically, MFOS nearoptimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the wellknown ZeroDeterminant (ZD) extortion strategy in the IPD. In the same settings, MFOS leads to socially optimal outcomes under metaselfplay. Finally, we show that MFOS can be scaled to highdimensional settings. 
Chris Lu · Timon Willi · Christian Schroeder de Witt · Jakob Foerster 🔗 


Introducing Coordination in Concurrent Reinforcement Learning
(
Poster
)
link »
Research on exploration in reinforcement learning has mostly focused on problems with a single agent interacting with an environment. However many problems are better addressed by the concurrent reinforcement learning paradigm, where multiple agents operate in a common environment. Recent work has tackled the challenge of exploration in this particular setting \citep{dimakopoulou2018coordinated,dimakopoulou2018scalable}.Nonetheless, they do not completely leverage the characteristics of this framework and agents end up behaving independently from each other. In this work we argue that coordination among concurrent agents is crucial for efficient exploration.We introduce coordination in Thompson Sampling based methods by drawing correlated samples from an agent's posterior.We apply this idea to extend existing exploration schemes such as randomized least squares value iteration (RLSVI).Empirical results emphasize the merits of our approach and call attention to coordination as a key objective for efficient exploration in concurrent reinforcement learning. 
Adrien Ali Taiga · Aaron Courville · Marc G Bellemare 🔗 


VLearning  A Simple, Efficient, Decentralized Algorithm for Multiagent RL
(
Poster
)
link »
A major challenge of multiagent reinforcement learning (MARL) is \emph{the curse of multiagents}, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithmsVlearning, which provably learns Nash equilibria (in the twoplayer zerosum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer generalsum setting) in a number of samples that only scales with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i\th$ player. This is in sharp contrast to the size of the joint action space which is $\prod_{i=1}^m A_i$.Vlearning (in its basic form) is a new class of singleagent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Qlearning algorithm, it performs incremental updates to the value functions. Different from Qlearning, it only maintains the estimates of Vvalues instead of Qvalues. This key difference allows Vlearning to achieve the claimed guarantees in the MARL setting by simply letting all agents run Vlearning independently.

Chi Jin · Qinghua Liu · Yuanhao Wang · Tiancheng Yu 🔗 


Learning Meta Representations for Agents in MultiAgent Reinforcement Learning
(
Poster
)
link »
In multiagent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number. Every single MG induced by varying population sizes may possess distinct optimal joint strategies and gamespecific knowledge, which are modeled independently in modern multiagent algorithms. In this work, we focus on creating agents that generalize across populationvarying MGs. Instead of learning a unimodal policy, each agent learns a policy set that is formed by effective strategies across a variety of games. We propose Meta Representations for Agents (MRA) that explicitly models the gamecommon and gamespecific strategic knowledge. By representing the policy sets with multimodal latent policies, the common strategic knowledge and diverse strategic modes are discovered with an iterative optimization procedure. We prove that as an approximation to a constrained mutual information maximization objective, the learned policies can reach Nash Equilibrium in every evaluation MG under the assumption of Lipschitz game on a sufficiently large latent space. When deploying it at practical latent models with limited size, fast adaptation can be achieved by leveraging the firstorder gradient information. Extensive experiments show the effectiveness of MRA on both training performance and generalization ability in hard and unseen games. 
Shenao Zhang · Li Shen · Lei Han · Li Shen 🔗 


NearOptimal Learning of ExtensiveForm Games with Imperfect Information
(
Poster
)
link »
This paper resolves the open question of designing nearoptimal algorithms for learning imperfectinformation extensiveform games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$approximate Nash equilibrium in twoplayer zerosum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the informationtheoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multiplayer generalsum games.

Yu Bai · Chi Jin · Song Mei · Tiancheng Yu 🔗 


Learning Robust Algorithms for Online Allocation Problems Using Adversarial Training
(
Poster
)
link »
We address the challenge of finding algorithms for online allocation (i.e. bipartite matching) using a machine learning approach. In this paper, we focus on the AdWords problem, which is a classical online budgeted matching problem of both theoretical and practical significance. In contrast to existing work, our goal is to accomplish algorithm design {\em tabula rasa}, i.e., without any humanprovided insights or experttuned training data beyond specifying the objective and constraints of the optimization problem. We construct a framework based on insights from game theory and adversarial training. Key to our approach is to generate adversarial examples that expose the weakness of any given algorithm. A unique challenge in our context is to generate complete examples from scratch rather than perturbing given examples and we demonstrate this can be accomplished for the Adwords problem. We use this framework to cotrain an algorithm network and an adversarial network against each other until they converge to an equilibrium. This approach finds algorithms and adversarial examples that are consistent with known optimal results. Secondly, we address the question of robustness of the algorithm, namely can we design algorithms that are both strong under practical distributions, as well as exhibit robust performance against adversarial instances. To accomplish this, we train algorithm networks using a mixture of adversarial and fixed practical distributions like powerlaws; the resulting networks exhibit a smooth tradeoff between the two input regimes. 
Goran Zuzic · Di Wang · Aranyak Mehta · D. Sivakumar 🔗 


Dynamic Noises of MultiAgent Environments Can Improve Generalization: Agentbased Models meets Reinforcement Learning
(
Poster
)
link »
We study the benefits of reinforcement learning (RL) environments based on agentbased models (ABM). While ABMs are known to offer microfoundational simulations at the cost of computational complexity, we empirically show in this work that their nondeterministic dynamics can improve the generalization of RL agents. To this end, we examine the control of an epidemic SIR environments based on either differential equations or ABMs. Numerical simulations demonstrate that the intrinsic noise in the ABMbased dynamics of the SIR model not only improve the average reward but also allow the RL agent to generalize on a wider ranges of epidemic parameters. 
Mohamed Akrout · Bob McLeod 🔗 


When is Offline TwoPlayer ZeroSum Markov Game Solvable?
(
Poster
)
link »
We study what dataset assumption permits solving offline twoplayer zerosum Markov game. In stark contrast to the offline singleagent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline twoplayer zerosum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimismtype algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: dataset with uniform concentration assumption and turnbased Markov game. Our work serves as an important initial step towards understanding offline multiagent reinforcement learning. 
Qiwen Cui · Simon Du 🔗 


A SelfPlay Posterior Sampling Algorithm for ZeroSum Markov Games
(
Poster
)
link »
Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the ``optimism in the face of uncertainty'' (OFU) principle. This work focuses on a different approach of posterior sampling, which is celebrated in many bandits and reinforcement learning settings but remains underexplored for MGs. Specifically, for episodic twoplayer zerosum MGs, a novel posterior sampling algorithm is developed with \emph{general} function approximation. Theoretical analysis demonstrates that the posterior sampling algorithm admits a $\sqrt{T}$regret bound for problems with a low multiagent decoupling coefficient, which is a new complexity measure for MGs, where $T$ denotes the number of episodes. When specialized to linear MGs, the obtained regret bound matches the stateoftheart results. To the best of our knowledge, this is the first provably efficient posterior sampling algorithm for MGs with frequentist regret guarantees, which enriches the toolbox for MGs and promotes the broad applicability of posterior sampling.

Wei Xiong · Han Zhong · Chengshuai Shi · Cong Shen · Tong Zhang 🔗 


Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets
(
Poster
)
link »
We study episodic twoplayer zerosum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimismbased algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a datadependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an informationtheoretical lower bound, which suggests that the datadependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of ``relative uncertainty'', which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation. 
Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang 🔗 


Can Reinforcement Learning Efficiently Find StackelbergNash Equilibria in GeneralSum Markov Games with Myopic Followers？
(
Poster
)
link »
We study multiplayer generalsum Markov games with one of the players designated as the leader and the rest regarded as the followers. In particular, we focus on the class of games where the followers are myopic, i.e., the followers aim to maximize the instantaneous rewards. For such a game, our goal is to find the StackelbergNash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$ such that (i) $\pi^*$ is the optimal policy for the leader when the followers always play their best response, and (ii) $\nu^*$ is the best response policy of the followers, which is a Nash equilibrium of the followers' game induced by $\pi^*$. We develop sample efficient reinforcement learning (RL) algorithms for solving SNE under both the online and offline settings. Respectively, our algorithms are optimistic and pessimistic variants of leastsquares value iteration and are readily able to incorporate function approximation tools for handling large state spaces. Furthermore, for the case with linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality under online and offline setups respectively. To our best knowledge, we establish the first provably efficient RL algorithms for solving SNE in generalsum Markov games with myopic followers.

Han Zhong · Zhuoran Yang · Zhaoran Wang · Michael Jordan 🔗 


Teamwork Reinforcement Learning with Concave Utilities
(
Poster
)
link »
Complex reinforcement learning (RL) tasks often require a divideandconquer approach, where a large task is divided into pieces and solved by individual agents. In this paper, we study a teamwork RL setting where individual agents make decisions on disjoint subsets (blocks) of the state space and have private interests (reward functions), while the entire team aims to maximize a general longterm team utility function and may be subject to constraints. This team utility, which is not necessarily a cumulative sum of rewards, is modeled as a nonlinear function of the team's joint stateaction occupancy distribution. By leveraging the inherent duality of policy optimization, we propose a minmax multiblock policy optimization framework to decompose the overall problem into individual local tasks. This enables a federated teamwork mechanism where a team lead coordinates individual agents via reward shaping, and each agent solves her local task defined only on their local state subset. We analyze the convergence of this teamwork policy optimization mechanism and establish an $O(1/T)$ convergence rate to the team's joint optimum. This mechanism allows team members to jointly find the global socially optimal policy while keeping their local privacy.

Zheng Yu · Junyu Zhang · Zheng Wen · Andrea Tacchetti · Mengdi Wang · Ian Gemp 🔗 


EigenGame Unloaded: When playing games is better than optimizing
(
Poster
)
link »
We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGame's updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setting. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games. 
Ian Gemp · Brian McWilliams · Claire Vernade · Thore Graepel 🔗 


Backdoors Stuck At The Frontdoor: MultiAgent Backdoor Attacks That Backfire
(
Poster
)
link »
Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern to traintime robustness. In this paper, we investigate a multiagent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different modes of backdoor attack configurations, noncooperation / cooperation, joint distribution shifts, and game setups to return an equilibrium attack success rate at the lower bound. The results motivate the reevaluation of backdoor defense research for practical environments. 
Siddhartha Datta · Nigel Shadbolt 🔗 


Learning to Share in MultiAgent Reinforcement Learning
(
Poster
)
link »
In this paper, we study the problem of networked multiagent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents make decision in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in human's learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the highlevel policy learns how to share reward with neighbors to decompose the global objective, while the lowlevel policy learns to optimize local objective induced by the highlevel policies in the neighborhood. The two policies form a bilevel optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenario. 
Yuxuan Yi · Ge Li · Yaowei Wang · Zongqing Lu 🔗 


FOLLOW THE NEURALLYPERTURBED LEADER FOR ADVERSARIAL TRAINING
(
Poster
)
link »
Gametheoretic models of learning are a powerful set of models that optimize multiobjective architectures. Among these models are zerosum architectures that have inspired adversarial learning frameworks. We extend these twoplayer frameworks by introducing a mediating neural agent whose role is to augment the observation of the players to achieve certain maximum entropic objectives.We show that the new framework can be utilized for 1) efficient online training in multimodal and adaptive environments and 2) addressing the ergodic convergence and cyclic dynamics issues of adversarial training. We also note the proposed training framework resembles the ‘follow the perturbed leader’ learning algorithms where perturbations are the result of actions of the mediating agent.We validate our theoretical results by applying them to the games with convex and nonconvex loss as well as generative adversarial architectures. Moreover, we customize the implementation of this algorithm for adversarial imitation learning applications where we validate our assertions by using a procedurally generated game environment as well as synthetic data. 
Ari Azarafrooz 🔗 


Safe OpponentExploitation Subgame Refinement
(
Poster
)
link »
In zerosum games, an NE strategy tends to be overly conservative confronted with opponents of limited rationality, because it does not actively exploit their weaknesses. From another perspective, best responding to an estimated opponent model is vulnerable to estimation errors and lacks safety guarantees. Inspired by the recent success of realtime search algorithms in developing superhuman AI, we investigate the dilemma of safety and opponent exploitation and present a novel realtime search framework, called Safe Exploitation Search (SES), which smoothly interpolates between the two extremes of online strategy refinement. We provide SES with a theoretically upperbounded exploitability and a lowerbounded evaluation performance. Additionally, SES enables computationally efficient online adaptation to a possibly updating opponent model, while previous safe exploitation methods have to recompute for the whole game. Empirical results show that SES significantly outperforms NE baselines and previous algorithms while keeping exploitability low at the same time. 
Mingyang Liu · Chengjie Wu · Qihan Liu · Yansen Jing · Jun Yang · Pingzhong Tang · Chongjie Zhang 🔗 


HCMDzero: Learning Value Aligned Mechanisms from Data
(
Poster
)
link »
Artificial learning agents are mediating a larger and larger number of interactions among humans, firms, and organizations, and the intersection between mechanism design and machine learning has been heavily investigated in recent years. However, mechanism design methods make strong assumptions on how participants behave (e.g. rationality), or on the kind of knowledge designers have access to a priori (e.g. access to strong baseline mechanisms). Here we introduce HCMDzero, a general purpose method to construct mechanism agents. HCMDzero learns by mediating interactions among participants, while remaining engaged in an electoral contest with copies of itself, thereby accessing direct feedback from participants. Our results on the Public Investment Game, a stylized resource allocation game that highlights the tension between productivity, equality and the temptation to freeride, show that HCMDzero produces competitive mechanism agents that are consistently preferred by human participants over baseline alternatives, and does so automatically, without requiring human knowledge, and by using human data sparingly and effectively Our detailed analysis shows HCMDzero elicits consistent improvements over the course of training, and that it results in a mechanism with an interpretable and intuitive policy. 
Jan Balaguer · Raphael Koster · Ari Weinstein · Lucy CampbellGillingham · Christopher Summerfield · Matthew Botvinick · Andrea Tacchetti 🔗 


The Good Shepherd: An Oracle Agent for Mechanism Design
(
Poster
)
link »
From social networks to traffic routing, artificial learning agents are playing a central role in modern institutions. We must therefore understand how to leverage these systems to foster outcomes and behaviors that align with our own values and aspirations. While multiagent learning has received considerable attention in recent years, artificial agents have been primarily evaluated when interacting with fixed, nonlearning coplayers. While this evaluation scheme has merit, it fails to capture the dynamics faced by institutions that must deal with adaptive and continually learning constituents. Here we address this limitation, and construct agents ( 
Jan Balaguer · Raphael Koster · Christopher Summerfield · Andrea Tacchetti 🔗 


Learning Truthful, Efficient, and Welfare Maximizing Auction Rules
(
Poster
)
link »
From social networks to supply chains, more and more aspects of how humans, firms and organizations interact is mediated by artificial learning agents. As the influence of machine learning systems grows, it is paramount that we study how to imbue our modern institutions with our own values and principles.Here we consider the problem of allocating goods to buyers who have preferences over them in settings where the seller's aim is not to maximize their monetary gains, but rather to advance some notion of social welfare (e.g. the government trying to award construction licenses for hospitals or schools).This problem has a long history in economics, and solutions take the form of auction rules. Researchers have proposed reliable auction rules that work in extremely general settings, and in the presence of information asymmetry and strategic buyers. However, these protocols require significant payments from participants resulting in low aggregate welfare. Here we address this shortcoming by casting auction rule design as a statistical learning problem, and trade generality for participant welfare effectively and automatically with a novel deep learning network architecture and auction representation. Our analysis shows that our auction rules outperform stateofthe art approaches in terms of participants welfare, applicability, robustness. 
Andrea Tacchetti · DJ Strouse · Marta Garnelo · Thore Graepel · Yoram Bachrach 🔗 


MTLight: Efficient MultiTask Reinforcement Learning for Traffic Signal Control
(
Poster
)
link »
Traffic signal control has a great impact on alleviating traffic congestion in modern cities. Deep reinforcement learning (RL) has been widely used for this task in recent years, demonstrating promising performance but also facing many challenges such as limited performances and sample inefficiency. To handle these challenges, MTLight is proposed to enhance the agent observation with a latent state, which is learned from numerous traffic indicators. Meanwhile, multiple auxiliary and supervisory tasks are constructed to learn the latent state, and two types of embedding latent features, the taskspecific feature and taskshared feature, are used to make the latent state more abundant. Extensive experiments conducted on CityFlow demonstrate that MTLight has leading convergence speed and asymptotic performance. We further simulate under peakhour pattern in all scenarios with increasing control difficulty and the results indicate that MTLight is highly adaptable. 
Liwen Zhu · Peixi Peng · Zongqing Lu · Yonghong Tian 🔗 


A Regret Minimization Approach to MultiAgent Control
(
Poster
)
link »
We study the problem of multiagent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multiagent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics. 
Udaya Ghai · Udari Madhushani · Naomi Leonard · Elad Hazan 🔗 


Global Convergence of MultiAgent Policy Gradient in Markov Potential Games
(
Poster
)
link »
Potential games are one of the most important and widely studied classes of normalform games. They define the archetypal setting of multiagent coordination in which all agents utilities are perfectly aligned via a common potential function. Can we embed this intuitive framework in the setting of Markov games? What are the similarities and differences between multiagent coordination with and without state dependence? To answer these questions, we study a natural class of Markov Potential Games (MPGs) that generalizes prior attempts to capture complex stateful multiagent coordination. Counterintuitively, insights from normalform potential games do not carry over since MPGs involve Markov games with zerosum stategames, but Markov games in which all stategames are potential games are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main result, we prove convergence of independent policy gradient and its stochastic counterpart to Nash policies at a rate that is polynomial in the approximation error by adapting singleagent gradient domination properties to multiagent settings. This answers questions on the convergence of finitesample, independent policy gradient methods beyond settings of pure conflicting or pure common interests. 
Stefanos Leonardos · Will Overman · Ioannis Panageas · Georgios Piliouras 🔗 


A GameTheoretic Approach for Improving Generalization Ability of TSP Solvers
(
Poster
)
link »
In this paper, we introduce a twoplayer zerosum framework between a trainable \emph{Solver} and a \emph{Data Generator} to improve the generalization ability of deep learningbased solvers for Traveling Salesman Problems (TSP).Grounded in \textsl{Policy Space Response Oracle} (PSRO) methods, our twoplayer framework outputs a population of bestresponding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances with different types and sizes. Results suggest that our Solvers achieve the stateoftheart performance even on tasks the Solver never meets, whilst the performance of other deep learningbased Solvers drops sharply due to overfitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed twoplayer game and demonstrate that the exploitability of the Solver population decreases during training, and it eventually approximates the Nash equilibrium along with the Generator. 
Chenguang Wang · Yaodong Yang · Oliver Slumbers · Congying Han · Tiande Guo · Haifeng Zhang · Jun Wang 🔗 