ICLR Poster Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Poster

Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Kaixuan Ji · Qingyue Zhao · Jiafan He · Weitong ZHANG · Quanquan Gu

Halle B #254

[ Abstract ]

[ OpenReview]

Abstract: Recent studies have shown that the regret of reinforcement learning (RL) can be polylogarithmic in the planning horizon

H

$H$ . However, it remains an open question whether such a result holds for adversarial RL. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an

~ O ((d + log | S |) \sqrt{K} + d^{2})

$\tilde{O}\big((d+\log |\mathcal{S}|)\sqrt{K} + d^2\big)$ regret with full-information feedback, where

d

$d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP,

K

$K$ is the number of episodes,

| S |

$|\mathcal{S}|$ is the cardinality of the state space. We also provide hardness results to justify the near optimality of our algorithm and the inevitability of

log | S |

$\log|\mathcal{S}|$ in the regret bound.

Chat is not available.