ICLR Poster Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Zhong Zheng · Haochen Zhang · Lingzhou Xue

Hall 3 + Hall 2B #403

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We study the gap-dependent bounds of two important algorithms for on-policy $Q$-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for $Q$-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for $Q$-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in $T$ and improve upon existing ones for $Q$-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for $Q$-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for $Q$-learning.
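For readers new to the term, the suboptimality gap in finite-horizon tabular MDPs is commonly defined as $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$, with the minimum gap taken over the strictly positive entries. The abstract does not spell this definition out, so the following is only a minimal sketch under that standard assumption. It also assumes the transition and reward arrays are known, which the learning algorithms studied in the paper do not have access to. The names `optimal_q`, `suboptimality_gaps`, `P`, and `R` are hypothetical and chosen for illustration.

```python
import numpy as np


def optimal_q(P, R):
    """Backward induction for a finite-horizon tabular MDP.

    P: transition probabilities, shape (H, S, A, S) (assumed known here).
    R: deterministic rewards, shape (H, S, A).
    Returns Q* with shape (H, S, A) and V* with shape (H + 1, S).
    """
    H, S, A = R.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0 is the terminal value
    for h in range(H - 1, -1, -1):
        # Expected next-step value: (S, A, S) @ (S,) -> (S, A)
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q, V


def suboptimality_gaps(Q, V, tol=1e-12):
    """Gaps Delta_h(s, a) = V*_h(s) - Q*_h(s, a) and the minimum positive gap."""
    gaps = V[:-1, :, None] - Q  # shape (H, S, A)
    positive = gaps[gaps > tol]
    delta_min = float(positive.min()) if positive.size else 0.0
    return gaps, delta_min


# Toy instance (illustrative only): 2 steps, 2 states, 2 actions.
# Every action keeps the agent in its current state.
H, S, A = 2, 2, 2
P = np.zeros((H, S, A, S))
for s in range(S):
    P[:, s, :, s] = 1.0
R = np.zeros((H, S, A))
R[:, :, 0] = 1.0  # action 0 is optimal everywhere
R[:, :, 1] = 0.5  # action 1 loses 0.5 reward per step

Q, V = optimal_q(P, R)
gaps, delta_min = suboptimality_gaps(Q, V)
print(delta_min)
```

In this toy instance the optimal action has zero gap at every step and state, and the only other action has a strictly positive gap, which is the kind of benign structure the gap-dependent bounds exploit.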