Q-Learning with Fine-Grained Gap-Dependent Regret
Haochen Zhang ⋅ Zhong Zheng ⋅ Lingzhou Xue
Abstract
We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. To address this limitation, we establish fine-grained gap-dependent regret guarantees for both UCB-based and non-UCB-based algorithms. In the UCB-based setting, we develop a novel analytical framework that explicitly separates the analysis of optimal and suboptimal state-action pairs, yielding the first fine-grained regret upper bound for UCB-Hoeffding (Jin et al., 2018). In the non-UCB-based setting, we revisit the only existing algorithm, AMB (Xu et al., 2021), and identify two issues in its design and analysis: improper truncation in the $Q$-updates and violation of the martingale difference condition in the concentration argument. To resolve these issues, we propose two refinements of AMB: the UCB-based ULCB-Hoeffding and the non-UCB-based Refined AMB. For ULCB-Hoeffding, we establish the same fine-grained regret bound as UCB-Hoeffding by applying our fine-grained framework, highlighting its broad applicability. For Refined AMB, we derive a rigorous fine-grained gap-dependent regret bound in the non-UCB setting and demonstrate consistent empirical improvements over the original AMB.
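For orientation, below is a minimal sketch of the UCB-Hoeffding baseline (Jin et al., 2018) that the abstract refers to, with the learning rate $\alpha_t = \frac{H+1}{H+t}$ and Hoeffding-style bonus $b_t = c\sqrt{H^3 \iota / t}$ from that paper. The environment interface (`env.reset`, `env.step`) and the constants `c` and `p` are illustrative assumptions, not part of the paper; this is a reference sketch of the standard algorithm, not of the fine-grained analysis developed here.

```python
import numpy as np

def ucb_hoeffding(env, S, A, H, K, c=1.0, p=0.05):
    """Sketch of Q-learning with UCB-Hoeffding exploration (Jin et al., 2018).

    Assumes a tabular episodic MDP with S states, A actions, horizon H,
    and K episodes. `env.reset() -> s` and `env.step(h, s, a) -> (r, s_next)`
    are hypothetical interfaces used only for illustration.
    """
    iota = np.log(S * A * H * K / p)           # log factor in the bonus
    Q = np.full((H, S, A), float(H))           # optimistic initialization
    V = np.zeros((H + 1, S))                   # V[H] = 0 by convention
    V[:H] = H                                  # consistent with Q's init
    N = np.zeros((H, S, A), dtype=int)         # visit counts

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))        # greedy w.r.t. optimistic Q
            r, s_next = env.step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)          # learning rate of Jin et al.
            bonus = c * np.sqrt(H**3 * iota / t)
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] \
                + alpha * (r + V[h + 1, s_next] + bonus)
            V[h, s] = min(H, Q[h, s].max())    # truncation at H
            s = s_next
    return Q, V
```

The truncation step `V[h, s] = min(H, Q[h, s].max())` is the point where, per the abstract, the original AMB update deviates improperly; the refinements proposed in this paper address that issue.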