Poster
Optimal Sample Complexity for Average Reward Markov Decision Processes
Shengbo Wang · Jose Blanchet · Peter Glynn
Halle B #196
Abstract:
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\tilde{O}(|S||A| t_{\mathrm{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\mathrm{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\mathrm{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\tilde{O}(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.