Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

Keywords: generalization, importance sampling, reinforcement learning, robotics

Abstract: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, Liu et al. (2018) proposed an approach that avoids the curse of horizon suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice because it requires the data to be collected by a known behavior policy. In this work, we propose a novel approach that removes this limitation. In particular, we formulate the problem as solving for the fixed point of a "backward flow" operator and show that the fixed-point solution gives the desired importance ratios of stationary distributions between the target and behavior policies. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our proposed approach.
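
The fixed-point claim in the abstract can be illustrated on a toy problem. The sketch below is not the paper's estimator, which learns the ratios from sampled transitions without knowing the behavior policy. It assumes a hypothetical 3-state chain whose transition matrices under each policy are known (`P_behavior` and `P_target` are made-up values), computes both stationary distributions directly, and checks that the ratio-reweighted behavior distribution is left unchanged by one step of the target dynamics.

```python
import numpy as np

def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for d = d @ P, where P is row-stochastic."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d_next = d @ P
        if np.abs(d_next - d).max() < tol:
            return d_next
        d = d_next
    return d

# Hypothetical state-to-state transition matrices induced by each policy
# (illustrative numbers only; all entries positive so the chains are ergodic).
P_behavior = np.array([[0.5, 0.4, 0.1],
                       [0.3, 0.4, 0.3],
                       [0.2, 0.3, 0.5]])
P_target = np.array([[0.1, 0.6, 0.3],
                     [0.2, 0.2, 0.6],
                     [0.1, 0.3, 0.6]])

d_behavior = stationary_distribution(P_behavior)
d_target = stationary_distribution(P_target)

# Importance ratios between the target and behavior stationary distributions.
w = d_target / d_behavior

# Fixed-point check: pushing the reweighted behavior distribution one step
# through the target dynamics should return the same distribution.
reweighted = d_behavior * w
residual = np.abs(reweighted @ P_target - reweighted).max()
print("ratios:", w)
print("fixed-point residual:", residual)
```

In this tabular setting the ratios come from a direct computation; the point of the sketch is only that the correct ratios make the reweighted distribution stationary under the target policy, which is the property a fixed-point formulation can exploit.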

Similar Papers

Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning
Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, Martin Riedmiller
Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu
Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies
Xinyun Chen, Lu Wang, Yizhe Hang, Heng Ge, Hongyuan Zha