Skip to yearly menu bar Skip to main content


Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Kaixuan Ji · Qingyue Zhao · Jiafan He · Weitong ZHANG · Quanquan Gu

Halle B #254
[ ]
Wed 8 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: Recent studies have shown that the regret of reinforcement learning (RL) can be polylogarithmic in the planning horizon $H$. However, it remains an open question whether such a result holds for adversarial RL. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an $\tilde{O}\big((d+\log |\mathcal{S}|)\sqrt{K} + d^2\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|\mathcal{S}|$ is the cardinality of the state space. We also provide hardness results to justify the near optimality of our algorithm and the inevitability of $\log|\mathcal{S}|$ in the regret bound.

Chat is not available.