

RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo · Perry Dong · Yuexiang Zhai · Yi Ma · Sergey Levine

Halle B #159
Thu 9 May 1:45 a.m. PDT — 3:45 a.m. PDT


Although reinforcement learning methods offer a powerful framework for automatic skill acquisition, for practical learning-based control problems in domains such as robotics, imitation learning often provides a more convenient and accessible alternative. In particular, an interactive imitation learning method such as DAgger, which queries a near-optimal expert to intervene online and collect correction data to address the distributional shift challenges that afflict naïve behavioral cloning, can enjoy good performance both in theory and in practice without requiring manually specified reward functions and other components of full reinforcement learning methods. In this paper, we explore how off-policy reinforcement learning can enable improved performance under assumptions that are similar to, but potentially even more practical than, those of interactive imitation learning. Our proposed method uses reinforcement learning with the user intervention signals themselves as rewards. This relaxes the assumption that the intervening experts in interactive imitation learning should be near-optimal, and it enables the algorithm to learn behaviors that improve over a potentially suboptimal human expert. We also provide a unified framework to analyze our RL method and DAgger, for which we present an asymptotic analysis of the suboptimality gap for both methods, as well as a non-asymptotic sample complexity bound for our method. We then evaluate our method on challenging high-dimensional continuous control simulation benchmarks as well as real-world robotic vision-based manipulation tasks. The results show that it strongly outperforms DAgger-like approaches across the different tasks, especially when the intervening experts are suboptimal. Additional ablations also empirically verify the proposed theoretical justification that the performance of our method is associated with the choice of intervention model and the suboptimality of the expert. Code and videos can be found on the project website.
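The key idea of using intervention signals as rewards can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes each logged transition carries a boolean `intervened` flag recorded when the human expert takes over, and it labels those steps with a negative reward (with zero reward elsewhere) for downstream off-policy RL:

```python
# Illustrative sketch of intervention-as-reward labeling (not the authors' code).
# Assumption: each transition dict has an `intervened` flag set when the
# human expert took over control at that step.

def label_rewards(transitions):
    """Assign reward -1 at steps where the expert intervened, 0 otherwise.

    The labeled transitions can then be fed to any off-policy RL algorithm,
    so the policy learns to avoid states that trigger interventions.
    """
    labeled = []
    for t in transitions:
        reward = -1.0 if t["intervened"] else 0.0
        labeled.append({**t, "reward": reward})
    return labeled

# Example: a short trajectory where the expert intervenes at the third step.
traj = [
    {"obs": 0, "action": 1, "intervened": False},
    {"obs": 1, "action": 0, "intervened": False},
    {"obs": 2, "action": 1, "intervened": True},
]
labeled = label_rewards(traj)
# rewards: [0.0, 0.0, -1.0]
```

Because no hand-designed task reward is needed, this labeling requires only the intervention events that an interactive imitation learning setup already produces.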
