Poster (GatherTown) in Workshop: GroundedML: Anchoring Machine Learning in Classical Algorithmic Theory
Dual Conservative Policy Update for Efficient Model-Based Reinforcement Learning
Shenao Zhang
Model-based reinforcement learning (MBRL) algorithms achieve data efficiency by acquiring predictive models of the environment. Unlike greedy model-exploitation algorithms, provable MBRL algorithms based on optimism or posterior sampling are guaranteed to reach optimal performance asymptotically, with bounds stated in terms of additional model complexity measures. However, this complexity is not polynomial for many model function classes, which makes reaching the global optimum challenging in practice. Because of aggressive policy updates and over-exploration, convergence can be very slow and the policy may even end up suboptimal. Thus, in addition to asymptotic guarantees, ensuring iterative policy improvement is key to achieving high performance within a finite number of timesteps. To this end, we propose Dual Conservative Policy Update (DCPU) for MBRL, which combines a locally greedy update procedure with a conservative exploration update procedure. By greedily exploiting the local model and maximizing the expected value within a trust region, DCPU agents explore efficiently. We provide an iterative policy improvement bound for DCPU and show a monotonic improvement property under a Lipschitz assumption and a properly chosen constraint threshold. We also prove the asymptotic optimality of DCPU via a sublinear Bayes regret bound. Empirical results demonstrate the superiority of DCPU on several MuJoCo tasks.
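The abstract gives no pseudocode or equations, so the following is only a minimal illustrative sketch of the two-step structure it describes: a trust-region-constrained greedy step on the learned mean model, followed by a trust-region-constrained exploration step on an optimistic model sample. The function names, the random-search optimizer, the KL-based trust region, and the tabular model representation are all assumptions made for illustration, not the paper's actual procedure or analysis.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # Sum of per-state KL divergences between two (S, A) policy tables.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

def policy_value(P_hat, R_hat, pi, gamma=0.95):
    # Evaluate pi on a tabular model (P_hat: S x A x S, R_hat: S x A)
    # by solving the Bellman equation of the induced Markov chain.
    S, A, _ = P_hat.shape
    P_pi = np.einsum("sa,sat->st", pi, P_hat)
    r_pi = np.einsum("sa,sa->s", pi, R_hat)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return V.mean()

def dual_conservative_update(P_hat, R_hat, P_opt, R_opt, logits,
                             delta=0.05, lr=0.5, n_candidates=32, rng=None):
    # Hypothetical two-step update (assumption, not the paper's algorithm):
    # (1) greedy step: improve the policy on the learned mean model, accepting
    #     a candidate only if its KL to the current policy stays within delta;
    # (2) exploration step: a conservative move toward the policy preferred by
    #     an optimistic model sample, again restricted to the trust region.
    rng = np.random.default_rng() if rng is None else rng
    pi_old = softmax(logits)

    # Step 1: locally greedy update on the mean model.
    best_logits, best_val = logits, policy_value(P_hat, R_hat, pi_old)
    for _ in range(n_candidates):
        cand = logits + lr * rng.standard_normal(logits.shape)
        pi_c = softmax(cand)
        if kl(pi_c, pi_old) <= delta:
            val = policy_value(P_hat, R_hat, pi_c)
            if val > best_val:
                best_logits, best_val = cand, val
    logits = best_logits
    pi_mid = softmax(logits)

    # Step 2: conservative exploration update on the optimistic model.
    best_logits, best_val = logits, policy_value(P_opt, R_opt, pi_mid)
    for _ in range(n_candidates):
        cand = logits + lr * rng.standard_normal(logits.shape)
        pi_c = softmax(cand)
        if kl(pi_c, pi_mid) <= delta:
            val = policy_value(P_opt, R_opt, pi_c)
            if val > best_val:
                best_logits, best_val = cand, val
    return best_logits
```

In this toy sketch, random search over policy logits stands in for whatever policy optimizer DCPU actually uses, and a single optimistic model sample stands in for the posterior-sampling or optimism machinery referenced in the abstract; the actual update rules and their improvement and regret guarantees are those given in the paper.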