Poster
Optimal Sample Complexity for Average Reward Markov Decision Processes
Shengbo Wang · Jose Blanchet · Peter Glynn
Halle B #196
Abstract:
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\tilde{O}(|S||A| t_{\mathrm{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\mathrm{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\mathrm{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\tilde{O}(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.