ICLR Poster When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

Ziang Song · Song Mei · Yu Bai

Keywords: [ Reinforcement learning theory ]

Abstract: Multi-agent reinforcement learning has made substantial empirical progress in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes, and an $\epsilon$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6 S \max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexities polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine which minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S \sum_{i\le m} A_i / \epsilon^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $\epsilon$), which depends only linearly on $\sum_{i\le m} A_i$ and significantly improves over the existing efficient algorithm in the $\epsilon$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.
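To give a rough sense of scale, the sketch below evaluates the leading terms of the three bounds stated in the abstract and compares them with the size of the joint action space, $\prod_{i\le m} A_i$, which is the quantity behind the exponential dependence for Nash equilibria. It is only an illustration: constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$ are dropped, the Markov Potential Game term omits any dependence on $H$ (as the abstract does), and the function names and the example values of $m$, $H$, $S$, $A_i$, and $\epsilon$ are hypothetical choices, not taken from the paper.

```python
from math import prod


def cce_episodes(H, S, actions, eps):
    """Leading term H^5 * S * max_i A_i / eps^2 (constants and logs dropped)."""
    return H**5 * S * max(actions) / eps**2


def ce_episodes(H, S, actions, eps):
    """Leading term H^6 * S * max_i A_i^2 / eps^2 (constants and logs dropped)."""
    return H**6 * S * max(actions) ** 2 / eps**2


def mpg_nash_episodes(S, actions, eps):
    """Leading term S * sum_i A_i / eps^3; the H dependence is not shown."""
    return S * sum(actions) / eps**3


def joint_action_space(actions):
    """Size of the joint action space, prod_i A_i."""
    return prod(actions)


if __name__ == "__main__":
    # Hypothetical instance: m = 10 players with 4 actions each.
    actions = [4] * 10
    H, S, eps = 5, 20, 0.1

    print("max_i A_i          :", max(actions))
    print("sum_i A_i          :", sum(actions))
    print("prod_i A_i         :", joint_action_space(actions))
    print("CCE leading term   :", cce_episodes(H, S, actions, eps))
    print("CE leading term    :", ce_episodes(H, S, actions, eps))
    print("MPG Nash lead term :", mpg_nash_episodes(S, actions, eps))
```

In this hypothetical instance, adding more players with 4 actions each leaves $\max_{i\le m} A_i$ unchanged and grows $\sum_{i\le m} A_i$ linearly, while $\prod_{i\le m} A_i$ grows by a factor of 4 per player.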
