

Poster

Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Kaixuan Ji · Qingyue Zhao · Jiafan He · Weitong ZHANG · Quanquan Gu

Halle B #254

Abstract: Recent studies have shown that the regret of reinforcement learning (RL) can be polylogarithmic in the planning horizon $H$. However, it remains an open question whether such a result holds for adversarial RL. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen rewards across episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least squares estimator for the transition kernel; and (2) an occupancy-measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an $\tilde{O}\big((d+\log|\mathcal{S}|)\sqrt{K}+d^2\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ is the cardinality of the state space. We also provide hardness results to justify the near-optimality of our algorithm and the inevitability of $\log|\mathcal{S}|$ in the regret bound.
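To give a flavor of component (1), below is a minimal sketch of a variance-uncertainty-aware weighted ridge regression for the parameter of a linear mixture transition model. It is an illustrative reconstruction under stated assumptions, not the authors' exact estimator: the function name `weighted_ridge_estimate`, the specific weighting rule, and the regularizer `lambda_reg` are hypothetical choices made here for clarity.

```python
# Minimal sketch (assumption, not the paper's exact algorithm):
# variance-weighted ridge regression for the parameter theta of a linear
# mixture MDP, where regression targets satisfy y_k ~ <phi_k, theta>.
import numpy as np

def weighted_ridge_estimate(features, targets, variances, uncertainties,
                            lambda_reg=1.0):
    """Estimate theta from samples (phi_k, y_k).

    Each sample k is down-weighted by a proxy for its variance plus an
    uncertainty term, so noisy or poorly-explored samples contribute less
    to the least-squares solution. The weighting rule here is illustrative.
    """
    d = features.shape[1]
    # Per-sample weights: larger variance or uncertainty -> smaller weight.
    weights = 1.0 / np.maximum(variances + uncertainties, 1e-8)

    # Ridge-regularized weighted Gram matrix and moment vector.
    gram = lambda_reg * np.eye(d) + (features * weights[:, None]).T @ features
    moment = (features * weights[:, None]).T @ targets

    return np.linalg.solve(gram, moment)

# Usage with synthetic data: 200 samples from a d = 5 linear model,
# with heteroscedastic noise that the estimator down-weights.
rng = np.random.default_rng(0)
theta_true = rng.normal(size=5)
phi = rng.normal(size=(200, 5))
noise_var = rng.uniform(0.01, 1.0, size=200)
y = phi @ theta_true + rng.normal(scale=np.sqrt(noise_var))
theta_hat = weighted_ridge_estimate(phi, y, noise_var, np.zeros(200))
print(np.linalg.norm(theta_hat - theta_true))
```

In the paper's setting, the inverse-variance-style weighting is what removes the polynomial dependence on the horizon from the leading regret term; the sketch above only shows the generic estimation pattern, not the concrete variance and uncertainty estimates used by the algorithm.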
