

Poster

Optimal Sample Complexity for Average Reward Markov Decision Processes

Shengbo Wang · Jose Blanchet · Peter Glynn

Halle B #196

Abstract: We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\tilde{O}(|S||A|t_{\mathrm{mix}}^{2}\epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\mathrm{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\mathrm{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\tilde{O}(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
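
For reference, the three bounds discussed in the abstract can be collected in a single display (a restatement in the abstract's own notation, not additional claims):

\[
\text{previous upper bound: } \tilde{O}\!\left(\frac{|S||A|\, t_{\mathrm{mix}}^{2}}{\epsilon^{2}}\right),
\qquad
\text{lower bound: } \Omega\!\left(\frac{|S||A|\, t_{\mathrm{mix}}}{\epsilon^{2}}\right),
\qquad
\text{this work: } \tilde{O}\!\left(\frac{|S||A|\, t_{\mathrm{mix}}}{\epsilon^{2}}\right).
\]

The new upper bound matches the known lower bound up to logarithmic factors, which is why the result closes the remaining $t_{\mathrm{mix}}$ gap.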
