Workshop
Gamification and Multiagent Solutions
Andrea Tacchetti · Ian Gemp · Elise van der Pol · Arash Mehrjou · Satpreet H Singh · Noah Golowich · Sarah Perrin · Nina Vesseron
Can we reformulate machine learning from the ground up with multiagent systems in mind? Modern machine learning primarily takes an optimization-first, single-agent approach; however, many of life's intelligent systems are multiagent in nature, across a range of scales and domains: market economies, ant colonies, forest ecosystems, and decentralized energy grids.
Generative adversarial networks represent one of the most successful recent deviations from the dominant single-agent paradigm, formulating generative modeling as a two-player, zero-sum game. Similarly, several recent methods that formulate foundational problems of machine learning and data science (e.g., PCA, NMF) as games among interacting agents have gained recognition. Multiagent designs are typically distributed and decentralized, which leads to robust and parallelizable learning algorithms.
We want to bring together a community of people who want to revisit machine learning problems and reformulate them as solutions to games. How might this algorithmic bias affect the solutions that arise, and could we define a blueprint for problems that are amenable to gamification? By exploring this direction, we may gain a fresh perspective on machine learning, with distinct advantages over the current dominant optimization paradigm.
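To make the "learning as a game" framing concrete, here is a minimal sketch (an illustrative toy, not drawn from any of the talks below): on the bilinear zero-sum game f(x, y) = xy, naive simultaneous gradient descent-ascent spirals away from the saddle point at (0, 0), while the extragradient rule converges to it.

```python
# Bilinear zero-sum game f(x, y) = x * y: x minimizes, y maximizes.
# Its unique saddle point (Nash equilibrium) is (0, 0).

def gda_step(x, y, lr):
    """Naive simultaneous gradient descent-ascent."""
    return x - lr * y, y + lr * x

def extragradient_step(x, y, lr):
    """Evaluate gradients at a look-ahead point; this damps the rotation."""
    xh, yh = x - lr * y, y + lr * x
    return x - lr * yh, y + lr * xh

lr = 0.1
x1, y1 = 1.0, 1.0   # GDA iterate
x2, y2 = 1.0, 1.0   # extragradient iterate
for _ in range(1000):
    x1, y1 = gda_step(x1, y1, lr)
    x2, y2 = extragradient_step(x2, y2, lr)
# GDA spirals outward; extragradient contracts toward the equilibrium.
```

Simultaneous play treats the other player's move as fixed and overshoots; the look-ahead step anticipates it, which is the kind of game-aware algorithm design this workshop is about.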
Schedule
Fri 5:00 a.m. – 5:15 a.m.
Opening Remarks (Intro)

Fri 5:15 a.m. – 5:50 a.m.
Professor Sarit Kraus: Agent-Human Complex Games for Multiagent Studies (Invited Talk)
Sarit Kraus

Fri 5:50 a.m. – 5:55 a.m.
Sarit Kraus Q&A (Q&A)

Fri 5:55 a.m. – 6:00 a.m.
Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets (Poster Spotlights)
We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state spaces, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without assuming uniform coverage of the dataset. We also prove an information-theoretic lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang

Fri 6:00 a.m. – 6:05 a.m.
Learning to Share in Multi-Agent Reinforcement Learning (Poster Spotlights)
In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in humans' learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage cooperation on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bilevel optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios.
Yuxuan Yi · Ge Li · Yaowei Wang · Zongqing Lu

Fri 6:05 a.m. – 6:10 a.m.
A Regret Minimization Approach to Multi-Agent Control (Poster Spotlights)
We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret-minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from non-stochastic control derived for a single agent. We empirically evaluate our method on a model of an over-actuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics.
Udaya Ghai · Udari Madhushani · Naomi Leonard · Elad Hazan

Fri 6:10 a.m. – 6:15 a.m.
A Game-Theoretic Approach for Improving Generalization Ability of TSP Solvers (Poster Spotlights)
In this paper, we introduce a two-player zero-sum framework between a trainable Solver and a Data Generator to improve the generalization ability of deep-learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in Policy Space Response Oracle (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances with different types and sizes. Results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never met, whilst the performance of other deep-learning-based Solvers drops sharply due to overfitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and demonstrate that the exploitability of the Solver population decreases during training, eventually approximating a Nash equilibrium along with the Generator.
Chenguang Wang · Yaodong Yang · Oliver Slumbers · Congying Han · Tiande Guo · Haifeng Zhang · Jun Wang

Fri 6:15 a.m. – 6:50 a.m.
Professor Elad Schneidman: Efficient Collective Behavior from Maximizing Diversity (Invited Talk)

Fri 6:50 a.m. – 6:55 a.m.
Elad Schneidman Q&A (Q&A)

Fri 6:55 a.m. – 7:45 a.m.
Poster Session 1 + Coffee Break (Poster Session)

Fri 7:45 a.m. – 8:10 a.m.
Near-Optimal Learning of Extensive-Form Games with Imperfect Information (Contributed Talk)
This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the numbers of information sets and $A,B$ are the numbers of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity with two new algorithms: Balanced Online Mirror Descent and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches for integrating balanced exploration policies into their classical counterparts. We also extend our results to learning coarse correlated equilibria in multi-player general-sum games.
Yu Bai · Chi Jin · Song Mei · Tiancheng Yu

Fri 8:10 a.m. – 8:50 a.m.
“The Multiagent Hammer” w/ Sarit Kraus, Elad Schneidman, Tiancheng Yu, Goran Zuzic (Discussion Panel)

Fri 8:50 a.m. – 9:30 a.m.
Professor Constantinos ("Costis") Daskalakis: Equilibrium Computation & Machine Learning (Invited Talk)
Constantinos C Daskalakis

Fri 9:30 a.m. – 9:35 a.m.
Constantinos Daskalakis Q&A (Q&A)

Fri 9:35 a.m. – 10:45 a.m.
Lunch (Break)

Fri 10:45 a.m. – 11:10 a.m.
Learning Robust Algorithms for Online Allocation Problems Using Adversarial Training (Contributed Talk)
We address the challenge of finding algorithms for online allocation (i.e., bipartite matching) using a machine learning approach. In this paper, we focus on the AdWords problem, a classical online budgeted matching problem of both theoretical and practical significance. In contrast to existing work, our goal is to accomplish algorithm design tabula rasa, i.e., without any human-provided insights or expert-tuned training data beyond specifying the objective and constraints of the optimization problem. We construct a framework based on insights from game theory and adversarial training. Key to our approach is generating adversarial examples that expose the weakness of any given algorithm. A unique challenge in our context is generating complete examples from scratch rather than perturbing given examples, and we demonstrate this can be accomplished for the AdWords problem. We use this framework to co-train an algorithm network and an adversarial network against each other until they converge to an equilibrium. This approach finds algorithms and adversarial examples that are consistent with known optimal results. Second, we address the question of robustness: can we design algorithms that are both strong under practical distributions and robust against adversarial instances? To accomplish this, we train algorithm networks using a mixture of adversarial and fixed practical distributions such as power laws; the resulting networks exhibit a smooth trade-off between the two input regimes.
Goran Zuzic · Di Wang · Aranyak Mehta · D. Sivakumar

Fri 11:10 a.m. – 11:45 a.m.
Dr Kaiqing Zhang: Multiagent Reinforcement Learning in Stochastic Games: From AlphaGo to Robust Control (Invited Talk)
Kaiqing Zhang

Fri 11:45 a.m. – 11:50 a.m.
Kaiqing Zhang Q&A (Q&A)

Fri 11:50 a.m. – 11:55 a.m.
V-Learning – A Simple, Efficient, Decentralized Algorithm for Multiagent RL (Poster Spotlights)
A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms, V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria, and coarse correlated equilibria (in the multi-player general-sum setting) in a number of samples that only scales with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i$-th player. This is in sharp contrast to the size of the joint action space, which is $\prod_{i=1}^m A_i$. V-learning (in its basic form) is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions; unlike Q-learning, it only maintains estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.
Chi Jin · Qinghua Liu · Yuanhao Wang · Tiancheng Yu

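The conversion above rests on a classical fact worth illustrating: in a zero-sum game, when both players run a no-regret bandit algorithm in self-play, their time-averaged strategies approximate a Nash equilibrium. Below is a minimal sketch with the Hedge (multiplicative-weights) algorithm on a hypothetical 2x2 matrix game; the payoff matrix, learning rate, and horizon are illustrative choices, not taken from the paper.

```python
import math

# Zero-sum 2x2 game: the row player maximizes, the column player minimizes.
# Unique mixed Nash equilibrium: both players play (1/3, 2/3); game value 1/3.
A = [[3.0, -1.0],
     [-1.0, 1.0]]

def hedge_update(weights, gains, eta):
    """One multiplicative-weights step toward actions with higher gain."""
    weights = [w * math.exp(eta * g) for w, g in zip(weights, gains)]
    total = sum(weights)
    return [w / total for w in weights]

eta, T = 0.005, 20000
p, q = [0.5, 0.5], [0.5, 0.5]          # current strategies
avg_p, avg_q = [0.0, 0.0], [0.0, 0.0]  # running time averages
for _ in range(T):
    avg_p = [a + pi / T for a, pi in zip(avg_p, p)]
    avg_q = [a + qi / T for a, qi in zip(avg_q, q)]
    row_gain = [sum(A[i][j] * q[j] for j in range(2)) for i in range(2)]
    col_loss = [sum(A[i][j] * p[i] for i in range(2)) for j in range(2)]
    p = hedge_update(p, row_gain, eta)
    q = hedge_update(q, [-l for l in col_loss], eta)  # minimizer: negate losses

# The duality gap of the averaged strategies certifies closeness to Nash.
best_row = max(sum(A[i][j] * avg_q[j] for j in range(2)) for i in range(2))
best_col = min(sum(A[i][j] * avg_p[i] for i in range(2)) for j in range(2))
gap = best_row - best_col
```

The gap is bounded by the sum of the players' average regrets, so it shrinks at an O(1/√T) rate, and both averaged strategies approach the unique mixed equilibrium (1/3, 2/3).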
Fri 11:55 a.m. – 12:00 p.m.
Model-Free Opponent Shaping (Poster Spotlights)
In general-sum games, the interaction of self-interested learning agents commonly leads to collectively worst-case outcomes, such as defect-defect in the iterated prisoner’s dilemma (IPD). To overcome this, some methods, such as Learning with Opponent-Learning Awareness (LOLA), shape their opponents’ learning process. However, these methods are myopic, since only a small number of steps can be anticipated; asymmetric, since they treat other agents as naive learners; and they require higher-order derivatives, calculated through white-box access to an opponent’s differentiable learning algorithm. To address these issues, we propose Model-Free Opponent Shaping (MFOS). MFOS learns in a meta-game in which each meta-step is an episode of the underlying (“inner”) game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. MFOS then uses generic model-free optimisation methods to learn meta-policies that accomplish long-horizon opponent shaping. Empirically, MFOS near-optimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the well-known Zero-Determinant (ZD) extortion strategy in the IPD. In the same settings, MFOS leads to socially optimal outcomes under meta-self-play. Finally, we show that MFOS can be scaled to high-dimensional settings.
Chris Lu · Timon Willi · Christian Schroeder de Witt · Jakob Foerster

Fri 12:00 p.m. – 12:05 p.m.
Influencing Long-Term Behavior in Multiagent Reinforcement Learning (Poster Spotlights)
The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment’s transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent’s average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better long-term performance than state-of-the-art baselines in various domains, including the full spectrum of general-sum, competitive, and cooperative settings.
Dong Ki Kim · Matt Riemer · Miao Liu · Jakob Foerster · Michael Everett · Chuangchuang Sun · Gerald Tesauro · Jonathan How

Fri 12:05 p.m. – 12:10 p.m.
Zero-Sum Stochastic Stackelberg Games (Poster Spotlights)
Min-max optimization problems (i.e., zero-sum games) have been used to model problems in a variety of fields in recent years, from machine learning to economics. The literature to date has mostly focused on static zero-sum games, assuming independent strategy sets. In this paper, we study a form of dynamic zero-sum games, called stochastic games, with dependent strategy sets. Just as zero-sum games with dependent strategy sets can be interpreted as zero-sum Stackelberg games, stochastic zero-sum games with dependent strategy sets can be interpreted as zero-sum stochastic Stackelberg games. We prove the existence of an optimal solution in zero-sum stochastic Stackelberg games (i.e., a recursive Stackelberg equilibrium), provide necessary and sufficient conditions for a solution to be optimal, and show that a recursive Stackelberg equilibrium can be computed in polynomial time via value iteration. Finally, we show that stochastic Stackelberg games can model the problem of pricing and allocating goods across agents and time; more specifically, we propose a stochastic Stackelberg game whose solutions correspond to a recursive competitive equilibrium in a stochastic Fisher market. We close with a series of experiments which confirm our theoretical results and show how value iteration performs in practice.
Denizalp Goktas · Jiayi Zhao · Amy Greenwald

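For contrast with the Stackelberg setting above, the classical simultaneous-move case admits a simple Shapley-style value iteration: back up each state's value through the value of a one-shot matrix game. The two-state game below is a made-up example, and the 2x2 solver uses the standard closed form; the paper's algorithm instead performs a leader-follower solve over dependent strategy sets at each step.

```python
def matrix_game_value(M):
    """Value of a 2x2 zero-sum matrix game [[a, b], [c, d]] for the row maximizer."""
    (a, b), (c, d) = M
    lower = max(min(a, b), min(c, d))   # row player's maximin
    upper = min(max(a, c), max(b, d))   # column player's minimax
    if lower == upper:                  # a pure saddle point exists
        return lower
    return (a * d - b * c) / (a - b - c + d)  # completely mixed value

GAMMA = 0.9
# Stage rewards for the row maximizer in a made-up two-state game.
R = [[[0.0, 2.0], [3.0, 1.0]],   # state 0: mixed stage value 1.5
     [[1.0, 0.0], [0.0, 1.0]]]   # state 1: mixed stage value 0.5

def next_state(s):
    # deterministic toy transition: every joint action swaps the state
    return 1 - s

V = [0.0, 0.0]
for _ in range(200):  # Shapley's value iteration is a gamma-contraction
    V = [matrix_game_value([[R[s][i][j] + GAMMA * V[next_state(s)]
                             for j in range(2)] for i in range(2)])
         for s in range(2)]
```

Since every joint action swaps the state in this toy, the fixed point satisfies V(0) = 1.5 + γ·V(1) and V(1) = 0.5 + γ·V(0), giving V(0) = 1.95/0.19 ≈ 10.26.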
Fri 12:10 p.m. – 1:00 p.m.
Poster Session 2 + Coffee Break (Poster Session)

Fri 1:00 p.m. – 1:40 p.m.
Professor Lillian Ratliff: Beyond Open Loop Algorithm Design: Learning from Decision-Dependent Data (Invited Talk)
Lillian J Ratliff

Fri 1:40 p.m. – 1:45 p.m.
Lillian Ratliff Q&A (Q&A)

Fri 1:45 p.m. – 2:10 p.m.
Modeling Strong and Human-like Gameplay with KL-Regularized Search (Contributed Talk)
We consider the task of building strong but human-like policies in multiagent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g., AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.
Athul Paul Jacob · David Wu · Gabriele Farina · Adam Lerer · Hengyuan Hu · Anton Bakhtin · Jacob Andreas · Noam Brown

Fri 2:10 p.m. – 2:50 p.m.
“The Multiagent Hammer” w/ Lillian Ratliff, Constantinos Daskalakis, Kaiqing Zhang, Noam Brown (Discussion Panel)

Fri 2:50 p.m. – 3:00 p.m.
Closing Remarks (Close)


Posters

Optimal Correlated Equilibria in General-Sum Extensive-Form Games: Fixed-Parameter Algorithms, Hardness, and Two-Sided Column Generation (Poster)
We study the problem of finding optimal correlated equilibria of various sorts: normal-form coarse correlated equilibrium (NFCCE), extensive-form coarse correlated equilibrium (EFCCE), and extensive-form correlated equilibrium (EFCE). This is NP-hard in the general case and has been studied in special cases, most notably triangle-free games (Farina & Sandholm, 2020), which include all two-player games with public chance moves. However, the general case is not well understood, and algorithms usually scale poorly. In this paper, we make two primary contributions. First, we introduce the correlation DAG, a representation of the space of correlated strategies whose structure and size depend on the specific solution concept desired. It extends the team belief DAG of Zhang et al. (2022) to general-sum games. For each of the three solution concepts, its size depends exponentially only on a parameter related to the information structure of the game. We also prove a fundamental complexity gap: while our size bounds for NFCCE are similar to those achieved in the case of team games by Zhang et al. (2022), this is impossible to achieve for the other two concepts under standard complexity assumptions. Second, we propose a two-sided column generation approach to compute optimal correlated strategies in extensive-form games. Our algorithm improves upon the one-sided approach of Farina et al. (2021a) by means of a new decomposition of correlated strategies which allows players to re-optimize their sequence-form strategies with respect to correlation plans which were previously added to the support. Experiments show that our techniques outperform the prior state of the art for computing optimal general-sum correlated equilibria, and that our two families of approaches have complementary strengths: the correlation DAG is fast when the parameter is small, and the two-sided column generation approach is superior when the parameter is large.
Brian Zhang · Gabriele Farina · Andrea Celli · Tuomas Sandholm


General-sum stochastic games with networked information flow (Poster)
Inspired by naturally emerging networks of reinforcement learning problems in operation sectors such as supply chains, epidemics, and social networks, we present a stochastic game formulation with networked reward and transition dependencies for multiagent reinforcement learning (MARL). Despite the rapid development of MARL algorithms, the majority of research efforts are motivated by settings where each pair of players is either collaborating or fully competing. However, in many large-scale networked systems, pairs of interacting players experience general-sum-type objectives: players must collaborate for total profit maximization and locally compete for individual profit maximization. In contrast to popular MARL benchmark environments, stochastic games with general-sum objectives have received relatively less attention, in part due to the lack of motivating applications and toy examples. In this paper, we address this gap by presenting a networked stochastic game framework with pairwise general-sum objectives and relating the framework to operation-sector systems such as supply chains. We discuss specific game features that distinguish this type of stochastic game from existing canonical stochastic game models, and specifically outline the impact of these features on the adaptation of popular multiagent learning paradigms such as individual learning and centralized learning with decentralized execution. We conclude with a two-player supply chain example to benchmark existing MARL algorithms and to contextualize the challenges at hand.
Sarah Li · Lillian J Ratliff · Peeyush Kumar


Finding and only finding local Nash equilibria by both pretending to be a follower (Poster)
Finding local Nash equilibria in two-player differentiable games is a classical problem in game theory with important relevance in machine learning. We propose double Follow-the-Ridge (double-FTR), an algorithm whose attracting critical points are equivalent to differential Nash equilibria in general-sum two-player differentiable games. To our knowledge, double-FTR is the first algorithm with such guarantees for general-sum games. Furthermore, we show that by varying its preconditioner, double-FTR leads to a broader family of algorithms with the same properties. Double-FTR avoids oscillation near equilibria because the eigenvalues of its Jacobian at critical points are real. Finally, we empirically verify the effectiveness of double-FTR in finding local Nash equilibria in two simple examples.
Xuchan Bao · Guodong Zhang


Automated equilibrium analysis of 2x2x2 games (Poster)
The set of all Nash equilibria of a non-cooperative game with more than two players is defined by equations and inequalities between nonlinear polynomials, which makes it challenging to compute. This paper presents an algorithm that computes this set for the simplest game with more than two players with arbitrary (possibly non-generic) payoffs, which has not been done before. We give new elegant formulas for completely mixed equilibria, and compute visual representations of the best-response correspondences and their intersections, which define the Nash equilibrium set. These have been implemented in Python and will be part of a public web-based software package for automated equilibrium analysis. For small games, which are often studied in economic models, a complete Nash equilibrium analysis is desirable and should be feasible. This project demonstrates the complexity of this task and offers pathways for extensions to larger games.
Sahar Jahani · Bernhard von Stengel
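The closed-form flavor of such equilibrium formulas is easiest to see one dimension down: in a 2x2 bimatrix game, a completely mixed equilibrium makes each player indifferent, which a single linear equation per player resolves. The payoffs below are an illustrative battle-of-the-sexes choice, not from the paper; with three players, as in the 2x2x2 games studied above, the indifference conditions become genuinely nonlinear.

```python
# A 2x2 bimatrix game (battle-of-the-sexes payoffs, chosen for illustration).
A = [[2.0, 0.0],    # row player's payoffs
     [0.0, 1.0]]
B = [[1.0, 0.0],    # column player's payoffs
     [0.0, 2.0]]

# In a completely mixed equilibrium, each player makes the other indifferent.
# Row plays row 0 with probability p so that both columns earn the same:
#   p*B[0][0] + (1-p)*B[1][0] = p*B[0][1] + (1-p)*B[1][1]
p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])
# Column plays column 0 with probability q so that both rows earn the same:
q = (A[1][1] - A[1][0]) / (A[0][0] - A[1][0] - A[0][1] + A[1][1])

# Sanity check: expected payoffs of the two pure strategies coincide.
row_payoffs = [q * A[i][0] + (1 - q) * A[i][1] for i in range(2)]
col_payoffs = [p * B[0][j] + (1 - p) * B[1][j] for j in range(2)]
```

For these payoffs the mixed equilibrium is p = 2/3, q = 1/3, with both players earning 2/3; the denominators vanish exactly when no completely mixed equilibrium exists.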


Collaborative Auto-Curricula Multi-Agent Reinforcement Learning with a Graph Neural Network Communication Layer for Open-ended Wildfire-Management Resource Distribution (Poster)
Most real-world domains can be formulated as multi-agent (MA) systems. Multiple intentionality-sharing agents can solve more complex tasks by collaborating, possibly in less time. True cooperative actions are beneficial for egoistic and collective reasons. However, teaching individual agents to sacrifice egoistic benefits for positive collective performance seems challenging. We build on a recently proposed Multi-Agent Reinforcement Learning (MARL) mechanism with a Graph Neural Network (GNN) communication layer, in which rarely chosen communication actions were only marginally beneficial. Here we propose a MARL system in which agents can help collaborators perform better while risking low self-performance, or send help requests when increased task difficulty is predicted. We conduct our study in the context of resource distribution for wildfire management. Communicating environmental features and partially observable fire-occurrence observations helps the agent collective preemptively distribute resources. Furthermore, we introduce a procedural training environment accommodating auto-curricula and open-endedness towards better generalizability. Although our MA communication proposal entails a large action and observation space, we outperform a Greedy Heuristic Baseline as well as Single-Agent (SA) and MA setups.
Philipp D Siedler


Coalition Formation in Ridesharing with Walking Options (Poster)
In this work, we introduce a novel coalition formation mechanism for ridesharing services. We extend the current literature by integrating walking options within the trip; this allows us to effectively account for a user’s value of time when walking, which is not negligible. Additionally, we propose a price allocation method that ensures proportionality for sharing the costs. We formally evaluate the efficacy of our pricing method and discuss its desirable properties. Our study is a step towards developing smart mobility systems that recommend optimal ridesharing coalitions to users and suggest cost allocations that are fair and stable.
Lucia Cipolina Kun · Sebastian Stein · Vahid Yazdanpanah · Enrico Gerding


Stackelberg Policy Gradient: Evaluating the Performance of Leaders and Followers (Poster)
The hierarchical order of play is an important concept in reinforcement learning for better understanding the decisions made by strategic agents in a shared environment. In this paper, we compare the learning dynamics of Stackelberg and simultaneous reinforcement learning agents. Agents are trained using their policy gradients and are tested against each other in a tournament. We compare agent performance in zero-sum and non-zero-sum Markov games. We show that the Stackelberg leader performs better in training under the same parameters. However, under the same parameters in the tournament setting, Stackelberg leaders and followers performed similarly to the simultaneous players. Analytically, hierarchical training can potentially provide stronger guarantees for policy gradient methods.
Quoc-Liem Vu · Zane Alumbaugh · Ryan Ching · Quanchen Ding · Arnav Mahajan · Benjamin Chasnov · Sam Burden · Lillian J Ratliff


Generalization Games for Reinforcement Learning (Poster)
In reinforcement learning (RL), the term generalization has either denoted the practice of introducing function approximation to reduce the intractability of problems with large state and action spaces or designated RL agents' ability to transfer learned experiences to one or more evaluation tasks. Recently, many subfields have emerged to understand how distributions of training tasks affect an RL agent's performance in unseen environments. While the field is extensive and ever-growing, recent research has underlined that variability among the different approaches is not as significant as it may seem. We leverage this intuition to demonstrate how current methods for generalization in RL are specializations of a general framework. We obtain the fundamental aspects of this formulation by rebuilding the Markov Decision Process (MDP) from the ground up, resurfacing the game-theoretic framework of games against nature. The two-player game that arises from considering nature as a complete player in this formulation explains how existing methods rely on learned and randomized dynamics and initial state distributions. We develop this result further by drawing inspiration from mechanism design theory to introduce the role of a principal as a third player that can modify the payoff functions of the decision-making agent and nature. The games induced by playing against the principal extend our framework to explain how learned and randomized reward functions induce generalization in RL agents. The main contribution of our work is the complete description of the Generalization Games for Reinforcement Learning, a multiagent, multiplayer, game-theoretic formal approach to studying generalization methods in RL. We offer a preliminary ablation experiment of the different components of the framework and demonstrate that a more simplified composition of the objectives that we introduce for each player leads to comparable, and in some cases superior, zero-shot generalization performance compared with state-of-the-art methods, while requiring almost two orders of magnitude fewer samples.
Manfred Diaz · Charlie Gauthier · Glen Berseth · Liam Paull


Zero-Sum Stochastic Stackelberg Games (Poster)
Min-max optimization problems (i.e., zero-sum games) have been used to model problems in a variety of fields in recent years, from machine learning to economics. The literature to date has mostly focused on static zero-sum games, assuming independent strategy sets. In this paper, we study a form of dynamic zero-sum games, called stochastic games, with dependent strategy sets. Just as zero-sum games with dependent strategy sets can be interpreted as zero-sum Stackelberg games, stochastic zero-sum games with dependent strategy sets can be interpreted as zero-sum stochastic Stackelberg games. We prove the existence of an optimal solution in zero-sum stochastic Stackelberg games (i.e., a recursive Stackelberg equilibrium), provide necessary and sufficient conditions for a solution to be optimal, and show that a recursive Stackelberg equilibrium can be computed in polynomial time via value iteration. Finally, we show that stochastic Stackelberg games can model the problem of pricing and allocating goods across agents and time; more specifically, we propose a stochastic Stackelberg game whose solutions correspond to a recursive competitive equilibrium in a stochastic Fisher market. We close with a series of experiments which confirm our theoretical results and show how value iteration performs in practice.
Denizalp Goktas · Jiayi Zhao · Amy Greenwald


Object Representations as Equilibria: Training Iterative Inference Algorithms with Implicit Differentiation
(
Poster
)
>
link
Deep generative models, particularly those that aim to factorize the observations into discrete entities (such as objects), must often use iterative inference procedures that break symmetries among equally plausible explanations for the data. Such inference procedures include variants of the expectationmaximization algorithm and structurally resemble clustering algorithms in a latent space. However, combining such methods with deep neural networks necessitates differentiating through the inference process, which can make optimization exceptionally challenging. In this work, we observe that such iterative inference methods can be made differentiable by means of the implicit function theorem, and develop an implicit differentiation approach that improves the stability and tractability of training such models by decoupling the forward and backward passes. This connection enables us to apply recent advances in optimizing implicit layers to not only improve the stability and optimization of the slot attention module in SLATE, a stateoftheart method for learning entity representations, but do so with constant space and time complexity in backpropagation and only one additional line of code. 
Michael Chang · Thomas L. Griffiths · Sergey Levine 🔗 
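The core trick the abstract describes, differentiating through a fixed point with the implicit function theorem instead of unrolling the iterations, can be sketched in a scalar toy setting. This is a hypothetical illustration, not the SLATE/slot-attention implementation; the map `f` and its derivatives are made up for the example:

```python
def f(x, theta):
    # A contraction in x (|df/dx| = 0.5 < 1); its fixed point is x* = 2 * theta.
    return 0.5 * x + theta

def solve_fixed_point(theta, iters=100):
    # Forward pass: plain iteration to the fixed point, no gradient tape needed.
    x = 0.0
    for _ in range(iters):
        x = f(x, theta)
    return x

def implicit_grad(x_star, theta):
    # Implicit function theorem: dx*/dtheta = (df/dtheta) / (1 - df/dx),
    # evaluated at the fixed point. For this linear f both partials are constant.
    df_dx, df_dtheta = 0.5, 1.0
    return df_dtheta / (1.0 - df_dx)

theta = 3.0
x_star = solve_fixed_point(theta)   # converges to 2 * theta = 6
grad = implicit_grad(x_star, theta)  # analytic dx*/dtheta = 2
```

In the vector case the denominator becomes a linear solve against $(I - \partial f/\partial x)$ at the fixed point, which is what decouples the backward pass from the number of forward iterations.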


Team Belief DAG Form: A Concise Representation for Team-Correlated Game-Theoretic Decision Making
(
Poster
)
>
link
In this paper, we introduce a new representation for team-coordinated game-theoretic decision making, which we coin team belief DAG form. In our representation, at every timestep, a team coordinator observes the information that is public to all its members, and then decides on a prescription for all the possible states consistent with its observations. Our representation unifies and extends recent approaches to team coordination. Similar to the approach of Carminati et al. (2021), our team belief DAG form can be used to capture adversarial team games, and enables standard, out-of-the-box game-theoretic techniques including no-regret learning (e.g., CFR and its state-of-the-art modern variants such as DCFR and PCFR$^+$) and first-order methods. However, our representation can be exponentially smaller, and can be viewed as a lossless abstraction of theirs into a directed acyclic graph. In particular, like the LP-based algorithm of Zhang & Sandholm (2022), the size of our representation scales with the amount of information uncommon to the team; in fact, using linear programming on top of our team belief DAG form to solve for a team correlated equilibrium in an adversarial team game recovers almost exactly their algorithm. Unlike that paper, however, our representation explicitly exposes the structure of the decision space, which is what enables the aforementioned game-theoretic techniques.

Brian Zhang · Gabriele Farina · Tuomas Sandholm 🔗 


Staged independent learning: Towards decentralized cooperative multi-agent Reinforcement Learning
(
Poster
)
>
link
We empirically show that classic ideas from two-time-scale stochastic approximation \citep{borkar1997stochastic} can be extended to complex cooperative multi-agent reinforcement learning (MARL) problems. We first present a multi-agent estimation problem as a motivating example where staged best-response iteration converges while parallel best-response iteration does not. We then present a general implementation of staged multi-agent RL algorithms based on multi-time-scale stochastic approximation, and show that our new method, Staged Independent Proximal Policy Optimization (SIPPO), outperforms state-of-the-art independent learning on almost all tasks in the epymarl \citep{papoudakis2020benchmarking} benchmark. This can be seen as a first step towards more decentralized MARL methods based on the multi-time-scale learning principle. 
Hadi Nekoei · Akilesh Badrinaaraayanan · Amit Sinha · Mohammad Amini · Janarthanan Rajendran · Aditya Mahajan · Sarath Chandar 🔗 


Competitive Physics Informed Networks
(
Poster
)
>
link
Physics-Informed Neural Networks (PINNs) solve partial differential equations (PDEs) by representing them as neural networks. The original PINN implementation does not provide high accuracy, typically attaining about $10^{-3}$ $L_2$ relative error. We formulate and test an adversarial approach called competitive PINNs (CPINNs) to overcome this limitation. CPINNs train a discriminator that is rewarded for predicting PINN mistakes. The discriminator and PINN participate in a zero-sum game with the exact PDE solution as an optimal strategy. This approach avoids the issue of squaring the large condition numbers of PDE discretizations. Numerical experiments show that a CPINN trained with competitive gradient descent can achieve errors two orders of magnitude smaller than those of a PINN trained with Adam or stochastic gradient descent.

Qi Zeng · Spencer Bryngelson · Florian Schaefer 🔗 
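The pathology that motivates competitive optimizers can be seen on the simplest zero-sum game, $f(x, y) = xy$. This toy sketch is not the CPINN code; it only contrasts naive simultaneous gradient descent-ascent, which spirals away from the equilibrium at the origin, with a look-ahead (extragradient) update in the same spirit:

```python
# Zero-sum objective f(x, y) = x * y: x minimizes, y maximizes.
# The unique equilibrium is (0, 0).
eta = 0.1  # step size

def gda_step(x, y):
    # Naive simultaneous gradient descent-ascent: the squared distance to
    # the equilibrium grows by a factor (1 + eta^2) every step.
    return x - eta * y, y + eta * x

def extragradient_step(x, y):
    # Extragradient: take a half step, then update using the midpoint's
    # gradients. For bilinear games this contracts toward the equilibrium.
    xm, ym = x - eta * y, y + eta * x
    return x - eta * ym, y + eta * xm

x1 = y1 = 1.0  # GDA iterate
x2 = y2 = 1.0  # extragradient iterate
for _ in range(1000):
    x1, y1 = gda_step(x1, y1)
    x2, y2 = extragradient_step(x2, y2)
```

After 1000 steps the GDA iterate has diverged while the extragradient iterate is essentially at the equilibrium, which is the qualitative gap a competitive-gradient-style method is designed to close.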


Modeling Strong and Human-like Gameplay with KL-Regularized Search
(
Poster
)
>
link
We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger. 
Athul Paul Jacob · David Wu · Gabriele Farina · Adam Lerer · Hengyuan Hu · Anton Bakhtin · Jacob Andreas · Noam Brown 🔗 


Multi-Agent Neural Rewriter for Vehicle Routing with Limited Disclosure of Costs
(
Poster
)
>
link
We interpret solving the multi-vehicle routing problem as a team Markov game with partially observable costs. For a given set of customers to serve, the playing agents (vehicles) have the common goal to determine the team-optimal agent routes with minimal total cost. Each agent thereby observes only its own cost. Our multi-agent reinforcement learning approach, the so-called multi-agent Neural Rewriter, builds on the single-agent Neural Rewriter to solve the problem by iteratively rewriting solutions. Parallel agent action execution and partial observability require new rewriting rules for the game. We propose the introduction of a so-called pool in the system which serves as a collection point for unvisited nodes. It enables agents to act simultaneously and exchange nodes in a conflict-free manner. We realize limited disclosure of agent-specific costs by only sharing them during learning. During inference, each agent acts decentrally, solely based on its own cost. First empirical results on small problem sizes demonstrate that we reach a performance close to the employed OR-Tools benchmark, which operates in the perfect cost information setting. 
Nathalie Paul · Alexander Kiser · Tim Wirtz · Stefan Wrobel 🔗 


A Non-Negative Matrix Factorization Game
(
Poster
)
>
link
We present a novel game-theoretic formulation of Non-Negative Matrix Factorization (NNMF), a popular data-analysis method with many scientific and engineering applications. The game-theoretic formulation is shown to have favorable scaling and parallelization properties, while retaining reconstruction and convergence performance comparable to the traditional Multiplicative Updates (Lee & Seung, 1999) algorithm. 
Satpreet H Singh 🔗 
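For reference, the Multiplicative Updates baseline that the abstract compares against can be sketched in a few lines of NumPy. This is a generic Lee & Seung sketch, not the paper's game-theoretic formulation; matrix sizes and the random data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))           # non-negative data matrix to factorize
W = rng.random((6, 2)) + 0.1     # non-negative factor matrices, X ~ W @ H
H = rng.random((2, 5)) + 0.1
eps = 1e-9                       # avoids division by zero

def frob_err(X, W, H):
    # Frobenius-norm reconstruction error.
    return np.linalg.norm(X - W @ H)

err0 = frob_err(X, W, H)
for _ in range(200):
    # Lee & Seung multiplicative updates: element-wise ratios keep the
    # factors non-negative by construction and monotonically decrease
    # the reconstruction error.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ (H @ H.T) + eps)
err1 = frob_err(X, W, H)
```

The game-theoretic reformulation in the paper replaces this centralized loop with per-player updates, which is where the scaling and parallelization benefits come from.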


Influencing Long-Term Behavior in Multi-agent Reinforcement Learning
(
Poster
)
>
link
The main challenge of multi-agent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better long-term performance than state-of-the-art baselines in various domains, including the full spectrum of general-sum, competitive, and cooperative settings. 
Dong Ki Kim · Matt Riemer · Miao Liu · Jakob Foerster · Michael Everett · Chuangchuang Sun · Gerald Tesauro · JONATHAN HOW 🔗 


Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
(
Poster
)
>
link
The Prisoner’s Dilemma mainly treats the choice to cooperate or defect as an atomic action. We propose to study online learning algorithm behavior in the Iterated Prisoner’s Dilemma (IPD) game, where we explore the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in a tournament of iterated prisoner's dilemma in which multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, and to study the capacity of these algorithms to fit human behavior. Results suggest that considering only the current situation to make a decision is the worst strategy in this kind of social dilemma game. Multiple discoveries on online learning behaviors and clinical validations are presented. 
Baihan Lin · Djallel Bouneffouf · Guillermo Cecchi 🔗 
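The atomic-action view of the IPD can be made concrete with the standard payoff matrix and two classic fixed strategies. This is a hypothetical baseline sketch; the paper's agents are bandit and RL learners, not these hand-coded policies:

```python
# Standard IPD payoffs, indexed by (my_move, opponent_move):
# 'C' = cooperate, 'D' = defect. Values are (my_payoff, opponent_payoff).
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(history):
    # Cooperate first, then copy the opponent's previous move.
    return 'C' if not history else history[-1][1]

def always_defect(history):
    return 'D'

def play(agent_a, agent_b, rounds=10):
    # Each agent sees its own history as (my_move, opponent_move) pairs.
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        ma, mb = agent_a(hist_a), agent_b(hist_b)
        pa, pb = PAYOFF[(ma, mb)]
        score_a += pa
        score_b += pb
        hist_a.append((ma, mb))
        hist_b.append((mb, ma))
    return score_a, score_b
```

Over ten rounds, tit-for-tat loses only the first round against an always-defector (`play(tit_for_tat, always_defect)` gives `(9, 14)`), while two tit-for-tat players cooperate throughout for `(30, 30)`, which is the tension the learning agents in the paper must navigate.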


Solving Structured Hierarchical Games Using Differential Backward Induction
(
Poster
)
>
link
From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with the root moving first, followed by its children, then followed by their children, and so on until the leaves. A player’s utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves and the payoff dependence to develop a novel gradient-based, backpropagation-style algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policy-making problems. 
Zun Li · Feiran Jia · Aditya Mate · Shahin Jabbari · Mithun Chakraborty · Milind Tambe · Yevgeniy Vorobeychik 🔗 


Model-Free Opponent Shaping
(
Poster
)
>
link
In general-sum games, the interaction of self-interested learning agents commonly leads to collectively worst-case outcomes, such as defect-defect in the iterated prisoner’s dilemma (IPD). To overcome this, some methods, such as Learning with Opponent-Learning Awareness (LOLA), shape their opponents’ learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higher-order derivatives, which are calculated through white-box access to an opponent’s differentiable learning algorithm. To address these issues, we propose Model-Free Opponent Shaping (MFOS). MFOS learns in a meta-game in which each meta-step is an episode of the underlying (“inner”) game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. MFOS then uses generic model-free optimisation methods to learn meta-policies that accomplish long-horizon opponent shaping. Empirically, MFOS near-optimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the well-known Zero-Determinant (ZD) extortion strategy in the IPD. In the same settings, MFOS leads to socially optimal outcomes under meta-self-play. Finally, we show that MFOS can be scaled to high-dimensional settings. 
Chris Lu · Timon Willi · Christian Schroeder de Witt · Jakob Foerster 🔗 


Introducing Coordination in Concurrent Reinforcement Learning
(
Poster
)
>
link
Research on exploration in reinforcement learning has mostly focused on problems with a single agent interacting with an environment. However, many problems are better addressed by the concurrent reinforcement learning paradigm, where multiple agents operate in a common environment. Recent work has tackled the challenge of exploration in this particular setting \citep{dimakopoulou2018coordinated,dimakopoulou2018scalable}. Nonetheless, these approaches do not completely leverage the characteristics of this framework, and agents end up behaving independently from each other. In this work we argue that coordination among concurrent agents is crucial for efficient exploration. We introduce coordination in Thompson-sampling-based methods by drawing correlated samples from an agent's posterior. We apply this idea to extend existing exploration schemes such as randomized least-squares value iteration (RLSVI). Empirical results emphasize the merits of our approach and call attention to coordination as a key objective for efficient exploration in concurrent reinforcement learning. 
Adrien Ali Taiga · Aaron Courville · Marc G Bellemare 🔗 


V-Learning: A Simple, Efficient, Decentralized Algorithm for Multi-agent RL
(
Poster
)
>
link
A major challenge of multi-agent reinforcement learning (MARL) is \emph{the curse of multi-agents}, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms, V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria, and coarse correlated equilibria (in the multi-player general-sum setting) in a number of samples that only scales with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i$-th player. This is in sharp contrast to the size of the joint action space, which is $\prod_{i=1}^m A_i$. V-learning (in its basic form) is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains the estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.

Chi Jin · Qinghua Liu · Yuanhao Wang · Tiancheng Yu 🔗 
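V-learning's building block is an adversarial bandit algorithm with suitable regret guarantees, run at every state. A plain Exp3 learner gives the flavor of such a component; this is a generic textbook sketch, not the paper's exact instantiation, and the reward function and parameters are illustrative:

```python
import math
import random

def exp3(rewards_fn, n_arms, T, gamma=0.1, seed=0):
    # Exp3 adversarial bandit: exponential weights over arms, mixed with
    # uniform exploration, using importance-weighted reward estimates.
    rng = random.Random(seed)
    w = [1.0] * n_arms      # exponential weights
    pulls = [0] * n_arms    # how often each arm was chosen
    for _ in range(T):
        total = sum(w)
        probs = [(1 - gamma) * wi / total + gamma / n_arms for wi in w]
        # Sample an arm from the mixed distribution.
        r, acc, arm = rng.random(), 0.0, n_arms - 1
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                arm = i
                break
        reward = rewards_fn(arm)  # reward assumed to lie in [0, 1]
        # Importance-weighted update: only the pulled arm's weight changes.
        w[arm] *= math.exp(gamma * reward / (probs[arm] * n_arms))
        pulls[arm] += 1
    return pulls

# With one clearly better arm, Exp3 concentrates its pulls on it.
pulls = exp3(lambda a: 1.0 if a == 0 else 0.0, n_arms=2, T=2000)
```

In the V-learning construction, each state runs a learner of this kind against the (adversarial, from its perspective) behavior of the other agents, which is what keeps the sample complexity independent of the joint action space.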


Learning Meta Representations for Agents in Multi-Agent Reinforcement Learning
(
Poster
)
>
link
In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number. Every single MG induced by varying population sizes may possess distinct optimal joint strategies and game-specific knowledge, which are modeled independently in modern multi-agent algorithms. In this work, we focus on creating agents that generalize across population-varying MGs. Instead of learning a unimodal policy, each agent learns a policy set that is formed by effective strategies across a variety of games. We propose Meta Representations for Agents (MRA) that explicitly models the game-common and game-specific strategic knowledge. By representing the policy sets with multi-modal latent policies, the common strategic knowledge and diverse strategic modes are discovered with an iterative optimization procedure. We prove that, as an approximation to a constrained mutual information maximization objective, the learned policies can reach Nash equilibrium in every evaluation MG under the assumption of a Lipschitz game on a sufficiently large latent space. When deploying it with practical latent models of limited size, fast adaptation can be achieved by leveraging first-order gradient information. Extensive experiments show the effectiveness of MRA on both training performance and generalization ability in hard and unseen games. 
Shenao Zhang · Li Shen · Lei Han · Li Shen 🔗 


Near-Optimal Learning of Extensive-Form Games with Imperfect Information
(
Poster
)
>
link
This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.

Yu Bai · Chi Jin · Song Mei · Tiancheng Yu 🔗 
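The regret-minimization machinery behind Balanced CFR can be illustrated by its simplest ancestor: regret matching in self-played rock-paper-scissors, whose average strategies approach the uniform Nash equilibrium. This is a pedagogical sketch, not the paper's balanced algorithms:

```python
import random

def regret_matching_rps(T=20000, seed=0):
    """Self-play regret matching in rock-paper-scissors.

    Returns player 0's average strategy over T rounds, which converges
    toward the uniform Nash equilibrium (1/3, 1/3, 1/3).
    """
    rng = random.Random(seed)
    n = 3
    # Row player's payoff matrix (zero-sum): rock, paper, scissors.
    U = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
    regret = [[0.0] * n, [0.0] * n]
    strat_sum = [0.0] * n

    def strategy(reg):
        # Play proportionally to positive cumulative regret.
        pos = [max(r, 0.0) for r in reg]
        s = sum(pos)
        return [p / s for p in pos] if s > 0 else [1.0 / n] * n

    def sample(probs):
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return n - 1

    for _ in range(T):
        s0, s1 = strategy(regret[0]), strategy(regret[1])
        for i in range(n):
            strat_sum[i] += s0[i]
        a, b = sample(s0), sample(s1)
        for i in range(n):
            # Regret: what action i would have earned minus what was earned.
            regret[0][i] += U[i][b] - U[a][b]
            regret[1][i] += U[a][b] - U[a][i]
    return [s / T for s in strat_sum]

avg = regret_matching_rps()
```

CFR runs this kind of learner at every information set of an extensive-form game; the paper's contribution is how to bias the exploration ("balanced" policies) so that bandit feedback suffices with near-optimal sample complexity.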


Learning Robust Algorithms for Online Allocation Problems Using Adversarial Training
(
Poster
)
>
link
We address the challenge of finding algorithms for online allocation (i.e. bipartite matching) using a machine learning approach. In this paper, we focus on the AdWords problem, which is a classical online budgeted matching problem of both theoretical and practical significance. In contrast to existing work, our goal is to accomplish algorithm design {\em tabula rasa}, i.e., without any human-provided insights or expert-tuned training data beyond specifying the objective and constraints of the optimization problem. We construct a framework based on insights from game theory and adversarial training. Key to our approach is to generate adversarial examples that expose the weakness of any given algorithm. A unique challenge in our context is to generate complete examples from scratch rather than perturbing given examples, and we demonstrate this can be accomplished for the AdWords problem. We use this framework to co-train an algorithm network and an adversarial network against each other until they converge to an equilibrium. This approach finds algorithms and adversarial examples that are consistent with known optimal results. Secondly, we address the question of robustness of the algorithm, namely whether we can design algorithms that are both strong under practical distributions and robust against adversarial instances. To accomplish this, we train algorithm networks using a mixture of adversarial and fixed practical distributions like power laws; the resulting networks exhibit a smooth trade-off between the two input regimes. 
Goran Zuzic · Di Wang · Aranyak Mehta · D. Sivakumar 🔗 


Dynamic Noises of Multi-Agent Environments Can Improve Generalization: Agent-based Models Meet Reinforcement Learning
(
Poster
)
>
link
We study the benefits of reinforcement learning (RL) environments based on agent-based models (ABMs). While ABMs are known to offer micro-foundational simulations at the cost of computational complexity, we empirically show in this work that their non-deterministic dynamics can improve the generalization of RL agents. To this end, we examine the control of epidemic SIR environments based on either differential equations or ABMs. Numerical simulations demonstrate that the intrinsic noise in the ABM-based dynamics of the SIR model not only improves the average reward but also allows the RL agent to generalize over a wider range of epidemic parameters. 
Mohamed Akrout · Bob McLeod 🔗 


When is Offline Two-Player Zero-Sum Markov Game Solvable?
(
Poster
)
>
link
We study what dataset assumptions permit solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: datasets with the uniform concentration assumption, and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning. 
Qiwen Cui · Simon Du 🔗 


A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games
(
Poster
)
>
link
Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the ``optimism in the face of uncertainty'' (OFU) principle. This work focuses on the different approach of posterior sampling, which is celebrated in many bandit and reinforcement learning settings but remains under-explored for MGs. Specifically, for episodic two-player zero-sum MGs, a novel posterior sampling algorithm is developed with \emph{general} function approximation. Theoretical analysis demonstrates that the posterior sampling algorithm admits a $\sqrt{T}$-regret bound for problems with a low multi-agent decoupling coefficient, which is a new complexity measure for MGs, where $T$ denotes the number of episodes. When specialized to linear MGs, the obtained regret bound matches the state-of-the-art results. To the best of our knowledge, this is the first provably efficient posterior sampling algorithm for MGs with frequentist regret guarantees, which enriches the toolbox for MGs and promotes the broad applicability of posterior sampling.

Wei Xiong · Han Zhong · Chengshuai Shi · Cong Shen · Tong Zhang 🔗 


Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets
(
Poster
)
>
link
We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state spaces, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption of uniform coverage of the dataset. We also prove an information-theoretic lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of ``relative uncertainty'', which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation. 
Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang 🔗 


Can Reinforcement Learning Efficiently Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers?
(
Poster
)
>
link
We study multi-player general-sum Markov games with one of the players designated as the leader and the rest regarded as followers. In particular, we focus on the class of games where the followers are myopic, i.e., the followers aim to maximize their instantaneous rewards. For such a game, our goal is to find the Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$ such that (i) $\pi^*$ is the optimal policy for the leader when the followers always play their best response, and (ii) $\nu^*$ is the best response policy of the followers, which is a Nash equilibrium of the followers' game induced by $\pi^*$. We develop sample-efficient reinforcement learning (RL) algorithms for solving SNE in both the online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, respectively, and are readily able to incorporate function approximation tools for handling large state spaces. Furthermore, for the case with linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality under the online and offline setups respectively. To the best of our knowledge, we establish the first provably efficient RL algorithms for solving SNE in general-sum Markov games with myopic followers.

Han Zhong · Zhuoran Yang · Zhaoran Wang · Michael Jordan 🔗 


Teamwork Reinforcement Learning with Concave Utilities
(
Poster
)
>
link
Complex reinforcement learning (RL) tasks often require a divide-and-conquer approach, where a large task is divided into pieces and solved by individual agents. In this paper, we study a teamwork RL setting where individual agents make decisions on disjoint subsets (blocks) of the state space and have private interests (reward functions), while the entire team aims to maximize a general long-term team utility function and may be subject to constraints. This team utility, which is not necessarily a cumulative sum of rewards, is modeled as a nonlinear function of the team's joint state-action occupancy distribution. By leveraging the inherent duality of policy optimization, we propose a min-max multi-block policy optimization framework to decompose the overall problem into individual local tasks. This enables a federated teamwork mechanism where a team lead coordinates individual agents via reward shaping, and each agent solves a local task defined only on its local state subset. We analyze the convergence of this teamwork policy optimization mechanism and establish an $O(1/T)$ convergence rate to the team's joint optimum. This mechanism allows team members to jointly find the global socially optimal policy while keeping their local privacy.

Zheng Yu · Junyu Zhang · Zheng Wen · Andrea Tacchetti · Mengdi Wang · Ian Gemp 🔗 


EigenGame Unloaded: When playing games is better than optimizing
(
Poster
)
>
link
We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGame's updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setting. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games. 
Ian Gemp · Brian McWilliams · Claire Vernade · Thore Graepel 🔗 
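As a point of reference for EigenGame-style streaming eigendecomposition, here is the classical stochastic alternative: an Oja-style update that tracks the top principal component from one sample at a time. This is a generic sketch under illustrative data and step size; the paper's unbiased, parallel multi-player update differs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose covariance is diagonal with leading eigenvector
# along the first axis (column standard deviations 3.0 > 1.0 > 0.5).
A = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.5])

v = rng.normal(size=3)
v /= np.linalg.norm(v)
lr = 0.01
for x in A:
    # Oja-style stochastic power-iteration step on a single sample,
    # followed by renormalization to stay on the unit sphere.
    v += lr * x * (x @ v)
    v /= np.linalg.norm(v)
```

After one pass, `v` is closely aligned with the leading eigenvector (the first coordinate axis here); minibatch variants of such updates are where the bias discussed in the abstract enters, which the paper's update is designed to remove.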


Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire
(
Poster
)
>
link
Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern to train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different modes of backdoor attack configurations, non-cooperation/cooperation, joint distribution shifts, and game setups to return an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments. 
Siddhartha Datta · Nigel Shadbolt 🔗 


Learning to Share in Multi-Agent Reinforcement Learning
(
Poster
)
>
link
In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in humans' learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bilevel optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios. 
Yuxuan Yi · Ge Li · Yaowei Wang · Zongqing Lu 🔗 


FOLLOW THE NEURALLY-PERTURBED LEADER FOR ADVERSARIAL TRAINING
(
Poster
)
>
link
Game-theoretic models of learning are a powerful set of models that optimize multi-objective architectures. Among these models are zero-sum architectures that have inspired adversarial learning frameworks. We extend these two-player frameworks by introducing a mediating neural agent whose role is to augment the observation of the players to achieve certain maximum entropic objectives. We show that the new framework can be utilized for 1) efficient online training in multi-modal and adaptive environments and 2) addressing the ergodic convergence and cyclic dynamics issues of adversarial training. We also note that the proposed training framework resembles the ‘follow the perturbed leader’ learning algorithms, where perturbations are the result of actions of the mediating agent. We validate our theoretical results by applying them to games with convex and non-convex loss as well as generative adversarial architectures. Moreover, we customize the implementation of this algorithm for adversarial imitation learning applications, where we validate our assertions by using a procedurally generated game environment as well as synthetic data. 
Ari Azarafrooz 🔗 


Safe Opponent-Exploitation Subgame Refinement
(
Poster
)
>
link
In zero-sum games, an NE strategy tends to be overly conservative when confronted with opponents of limited rationality, because it does not actively exploit their weaknesses. From another perspective, best responding to an estimated opponent model is vulnerable to estimation errors and lacks safety guarantees. Inspired by the recent success of real-time search algorithms in developing superhuman AI, we investigate the dilemma of safety and opponent exploitation and present a novel real-time search framework, called Safe Exploitation Search (SES), which smoothly interpolates between the two extremes of online strategy refinement. We provide SES with a theoretically upper-bounded exploitability and a lower-bounded evaluation performance. Additionally, SES enables computationally efficient online adaptation to a possibly updating opponent model, while previous safe exploitation methods have to recompute for the whole game. Empirical results show that SES significantly outperforms NE baselines and previous algorithms while keeping exploitability low at the same time. 
Mingyang Liu · Chengjie Wu · Qihan Liu · Yansen Jing · Jun Yang · Pingzhong Tang · Chongjie Zhang 🔗 
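The safety/exploitation trade-off that SES navigates can be seen in miniature with a plain linear blend in rock-paper-scissors (a deliberately simple stand-in, not the SES search procedure; the opponent model here is an assumed example): mixing the NE with a best response to an opponent model raises payoff against that model while raising worst-case exploitability in proportion to the mixing weight.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (zero-sum, game value 0).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

ne = np.ones(3) / 3                    # uniform mixed strategy is the NE
opp_model = np.array([0.6, 0.2, 0.2])  # estimated opponent: overplays rock

def best_response(opp):
    br = np.zeros(3)
    br[np.argmax(A @ opp)] = 1.0       # pure best response to the model
    return br

def exploitability(p):
    # What a perfect counter-strategy extracts from p (0 exactly at the NE).
    return -np.min(p @ A)

results = []
for alpha in (0.0, 0.5, 1.0):
    mix = (1 - alpha) * ne + alpha * best_response(opp_model)
    results.append((alpha, mix @ A @ opp_model, exploitability(mix)))
```

Both the payoff against the model and the exploitability grow linearly with the blend weight; SES's contribution is to perform this interpolation with search and with provable bounds rather than a fixed mixing coefficient.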


HCMD-zero: Learning Value Aligned Mechanisms from Data
(
Poster
)
>
link
Artificial learning agents are mediating a larger and larger number of interactions among humans, firms, and organizations, and the intersection between mechanism design and machine learning has been heavily investigated in recent years. However, mechanism design methods make strong assumptions on how participants behave (e.g. rationality), or on the kind of knowledge designers have access to a priori (e.g. access to strong baseline mechanisms). Here we introduce HCMD-zero, a general-purpose method to construct mechanism agents. HCMD-zero learns by mediating interactions among participants, while remaining engaged in an electoral contest with copies of itself, thereby accessing direct feedback from participants. Our results on the Public Investment Game, a stylized resource-allocation game that highlights the tension between productivity, equality, and the temptation to free-ride, show that HCMD-zero produces competitive mechanism agents that are consistently preferred by human participants over baseline alternatives, and does so automatically, without requiring human knowledge, and by using human data sparingly and effectively. Our detailed analysis shows that HCMD-zero elicits consistent improvements over the course of training, and that it results in a mechanism with an interpretable and intuitive policy. 
Jan Balaguer · Raphael Koster · Ari Weinstein · Lucy CampbellGillingham · Christopher Summerfield · Matthew Botvinick · Andrea Tacchetti 🔗 


The Good Shepherd: An Oracle Agent for Mechanism Design
(
Poster
)
>
link
From social networks to traffic routing, artificial learning agents are playing a central role in modern institutions. We must therefore understand how to leverage these systems to foster outcomes and behaviors that align with our own values and aspirations. While multi-agent learning has received considerable attention in recent years, artificial agents have been primarily evaluated when interacting with fixed, non-learning co-players. While this evaluation scheme has merit, it fails to capture the dynamics faced by institutions that must deal with adaptive and continually learning constituents. Here we address this limitation, and construct agents ( 
Jan Balaguer · Raphael Koster · Christopher Summerfield · Andrea Tacchetti 🔗 


Learning Truthful, Efficient, and Welfare Maximizing Auction Rules
(
Poster
)
>
link
From social networks to supply chains, more and more aspects of how humans, firms, and organizations interact are mediated by artificial learning agents. As the influence of machine learning systems grows, it is paramount that we study how to imbue our modern institutions with our own values and principles. Here we consider the problem of allocating goods to buyers who have preferences over them, in settings where the seller's aim is not to maximize their monetary gains, but rather to advance some notion of social welfare (e.g. a government trying to award construction licenses for hospitals or schools). This problem has a long history in economics, and solutions take the form of auction rules. Researchers have proposed reliable auction rules that work in extremely general settings, even in the presence of information asymmetry and strategic buyers. However, these protocols require significant payments from participants, resulting in low aggregate welfare. Here we address this shortcoming by casting auction-rule design as a statistical learning problem, and trade generality for participant welfare effectively and automatically with a novel deep learning network architecture and auction representation. Our analysis shows that our auction rules outperform state-of-the-art approaches in terms of participant welfare, applicability, and robustness. 
Andrea Tacchetti · DJ Strouse · Marta Garnelo · Thore Graepel · Yoram Bachrach 🔗 
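As a point of reference for the payments-reduce-welfare issue the abstract raises, here is the classic second-price (Vickrey) rule: it is truthful in dominant strategies, but the winner's payment is transferred out of the participant pool (a textbook baseline for comparison, not the learned auction rule from the paper; the bid values are arbitrary):

```python
def vickrey(bids):
    """Second-price sealed-bid auction.

    Bidding one's true value is a dominant strategy, but the winner
    pays the second-highest bid, lowering participant welfare.
    """
    order = sorted(range(len(bids)), key=lambda i: -bids[i])
    winner, price = order[0], bids[order[1]]
    return winner, price

bids = [3.0, 7.0, 5.0]            # true values, truthfully reported
winner, price = vickrey(bids)
welfare = bids[winner] - price    # winner's value minus payment
```

The learned rules described above aim to keep truthfulness-like properties while retaining more of this surplus for participants.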


MTLight: Efficient Multi-Task Reinforcement Learning for Traffic Signal Control
(
Poster
)
>
link
Traffic signal control has a great impact on alleviating traffic congestion in modern cities. Deep reinforcement learning (RL) has been widely used for this task in recent years, demonstrating promising performance but also facing many challenges, such as limited performance and sample inefficiency. To handle these challenges, we propose MTLight, which enhances the agent's observation with a latent state learned from numerous traffic indicators. Meanwhile, multiple auxiliary and supervisory tasks are constructed to learn the latent state, and two types of embedded latent features, a task-specific feature and a task-shared feature, are used to make the latent state more informative. Extensive experiments conducted on CityFlow demonstrate that MTLight achieves leading convergence speed and asymptotic performance. We further simulate peak-hour patterns in all scenarios with increasing control difficulty, and the results indicate that MTLight is highly adaptable. 
Liwen Zhu · Peixi Peng · Zongqing Lu · Yonghong Tian 🔗 


A Regret Minimization Approach to Multi-Agent Control
(
Poster
)
>
link
We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies; instead, the different agents use adaptive control policies and are only equipped with a stabilizing controller. We give a reduction from any (standard) regret-minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from non-stochastic control derived for a single agent. We empirically evaluate our method on a model of an over-actuated aircraft, and show that the distributed method is robust to failure and to adversarial perturbations in the dynamics. 
Udaya Ghai · Udari Madhushani · Naomi Leonard · Elad Hazan 🔗 
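The flavor of the reduction can be conveyed with a much simpler stand-in (a static quadratic cost rather than the paper's dynamical-system setting; the dimensions, step size, and cost are assumed for illustration): each agent runs gradient descent on its own block of the joint decision, observing only its own gradient, yet the uncoordinated updates reach the joint optimum.

```python
import numpy as np

def distributed_gd(grad_fn, dims, steps=200, eta=0.1):
    """Each agent updates only its own block of the joint decision using
    its own gradient; no central coordinator computes a joint policy."""
    xs = [np.zeros(d) for d in dims]
    for _ in range(steps):
        grads = grad_fn(xs)
        xs = [x - eta * g for x, g in zip(xs, grads)]
    return xs

# A shared strongly convex cost: f(x1, x2) = ||x1 - 1||^2 + ||x2 + 2||^2 + x1·x2
def grad_fn(xs):
    x1, x2 = xs
    return [2 * (x1 - 1) + x2, 2 * (x2 + 2) + x1]

x1, x2 = distributed_gd(grad_fn, dims=[2, 2])
# iterates approach the joint minimizer x1* = 8/3, x2* = -10/3
```

The paper's contribution is the harder version of this picture: the same per-agent autonomy, but with dynamics, adversarial disturbances, and a regret guarantee against the best precomputed joint policy.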


Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games
(
Poster
)
>
link
Potential games are one of the most important and widely studied classes of normal-form games. They define the archetypal setting of multi-agent coordination, in which all agents' utilities are perfectly aligned via a common potential function. Can we embed this intuitive framework in the setting of Markov games? What are the similarities and differences between multi-agent coordination with and without state dependence? To answer these questions, we study a natural class of Markov Potential Games (MPGs) that generalizes prior attempts to capture complex stateful multi-agent coordination. Counterintuitively, insights from normal-form potential games do not carry over: MPGs can involve Markov games with zero-sum state-games, while Markov games in which all state-games are potential games are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties, such as the existence of deterministic Nash policies. In our main result, we prove convergence of independent policy gradient and its stochastic counterpart to Nash policies at a rate that is polynomial in the approximation error, by adapting single-agent gradient-domination properties to multi-agent settings. This answers questions on the convergence of finite-sample, independent policy gradient methods beyond settings of pure conflicting or pure common interests. 
Stefanos Leonardos · Will Overman · Ioannis Panageas · Georgios Piliouras 🔗 
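A stateless illustration of the convergence result (a 2x2 normal-form potential game with softmax policies and exact gradients; the payoffs, initial logits, and step size are assumed for the sketch, and this omits the Markov/state-dependent part that is the paper's actual subject): independent policy gradient drives both players to a pure Nash profile of the potential.

```python
import numpy as np

# Common potential: both players receive Phi[a1, a2].
Phi = np.array([[1.0, 0.0],
                [0.0, 2.0]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

th1 = np.array([0.1, 0.0])   # logits, slightly asymmetric start
th2 = np.array([0.0, 0.1])
eta = 0.5
for _ in range(500):
    p1, p2 = softmax(th1), softmax(th2)
    u1 = Phi @ p2                       # player 1's action values
    u2 = Phi.T @ p1                     # player 2's action values
    th1 += eta * p1 * (u1 - p1 @ u1)    # exact softmax policy gradient
    th2 += eta * p2 * (u2 - p2 @ u2)    # each player updates independently
p1, p2 = softmax(th1), softmax(th2)
```

Both players concentrate on the second action, the potential-maximizing Nash profile; the paper extends this gradient-domination argument to stateful MPGs with stochastic gradients.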


A Game-Theoretic Approach for Improving Generalization Ability of TSP Solvers
(
Poster
)
>
link
In this paper, we introduce a two-player zero-sum framework between a trainable Solver and a Data Generator to improve the generalization ability of deep-learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in Policy Space Response Oracle (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances of different types and sizes. Results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never encountered, whilst the performance of other deep-learning-based Solvers drops sharply due to overfitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and show that the exploitability of the Solver population decreases during training, eventually approximating a Nash equilibrium along with the Generator. 
Chenguang Wang · Yaodong Yang · Oliver Slumbers · Congying Han · Tiande Guo · Haifeng Zhang · Jun Wang 🔗 
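The exploitability-decreases-during-training phenomenon can be seen in a far simpler stand-in for the Solver/Generator game (fictitious play on rock-paper-scissors rather than PSRO over learned TSP Solvers; the payoff matrix and initial actions are assumed): as each side best-responds to the other's empirical average, the averages approach the Nash equilibrium and their exploitability shrinks.

```python
import numpy as np

A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])    # rock-paper-scissors, row player's payoff

counts1 = np.array([1., 0., 0.])  # arbitrary initial pure strategies
counts2 = np.array([0., 1., 0.])
expl = []
for _ in range(2000):
    p = counts1 / counts1.sum()   # empirical average strategies
    q = counts2 / counts2.sum()
    # exploitability of the average profile (0 exactly at the NE)
    expl.append((A @ q).max() - (p @ A).min())
    counts1[np.argmax(A @ q)] += 1    # row best-responds to the average
    counts2[np.argmin(p @ A)] += 1    # column best-responds to the average
```

PSRO replaces the exact best responses with trained Solver/Generator oracles and the empirical average with a learned meta-strategy, but the measured quantity, exploitability of the population mixture, plays the same role.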