

Poster

Global Convergence of Policy Gradient in Average Reward MDPs

Navdeep Kumar · Yashaswini Murthy · Itai Shufaro · Kfir Y Levy · R. Srikant · Shie Mannor

Hall 3 + Hall 2B #279
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract: We present the first comprehensive finite-time global convergence analysis of policy gradient for infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of O(1/T), where T represents the number of iterations. Performance bounds for discounted reward MDPs cannot be easily extended to average reward MDPs, since the bounds grow in proportion to the fifth power of the effective horizon. Recent work on such extensions makes a smoothness assumption that has not been verified. Thus, our primary contribution is providing the first complete proof that the policy gradient algorithm converges globally for average-reward MDPs, without such an assumption. We also obtain the corresponding finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations that empirically validate the result.
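For orientation only, here is a minimal illustrative sketch of the general setting the abstract describes: exact (model-based) policy gradient ascent with a tabular softmax policy on an ergodic average-reward MDP. It is not the authors' algorithm or code; the two-state, two-action MDP, the transition probabilities P, the rewards r, and the step size are all hypothetical, and the gain J, bias h, and stationary distribution d are computed in closed form at each iteration.

```python
import numpy as np

# Hypothetical 2-state, 2-action ergodic MDP (illustrative only, not from the paper).
# P[a, s, s'] = transition probability, r[s, a] = one-step reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.3, 0.7], [0.6, 0.4]]])   # action 1
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
nS, nA = r.shape

def softmax_policy(theta):
    """pi[s, a] from tabular softmax parameters theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def gain_bias_and_distribution(pi):
    """Average reward (gain) J, bias h, and stationary distribution d under pi."""
    P_pi = np.einsum('sa,asp->sp', pi, P)      # state transition matrix under pi
    r_pi = np.sum(pi * r, axis=1)              # expected one-step reward under pi
    # Stationary distribution: d P_pi = d with sum(d) = 1, via least squares.
    A = np.vstack([P_pi.T - np.eye(nS), np.ones(nS)])
    b = np.append(np.zeros(nS), 1.0)
    d = np.linalg.lstsq(A, b, rcond=None)[0]
    J = d @ r_pi
    # Bias h solves the Poisson equation (I - P_pi) h = r_pi - J*1 with d @ h = 0.
    h = np.linalg.solve(np.eye(nS) - P_pi + np.outer(np.ones(nS), d), r_pi - J)
    return J, h, d

theta = np.zeros((nS, nA))
lr = 0.5
for t in range(200):
    pi = softmax_policy(theta)
    J, h, d = gain_bias_and_distribution(pi)
    # Q(s, a) = r(s, a) - J + sum_s' P(s'|s, a) h(s'); advantage = Q - h.
    Q = r - J + np.einsum('asp,p->sa', P, h)
    adv = Q - h[:, None]
    # Exact average-reward policy gradient for the softmax parameterization.
    grad = d[:, None] * pi * adv
    theta += lr * grad

print(f"average reward after {t + 1} iterations: {J:.4f}")
```

In this toy run the iterates move toward the policy with the largest gain J; the paper's contribution is the finite-time O(1/T) guarantee for such iterates in the general ergodic tabular case.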
