
Workshop: Generalizable Policy Learning in the Physical World

Density Estimation For Conservative Q-Learning

Paul Daoudi · Ludovic Dos Santos · Merwan Barlier · Aladin Virmaux


Batch Reinforcement Learning algorithms aim to learn the best policy from a batch of data without interacting with the environment. Within this setting, one difficulty is to correctly assess the value of state-action pairs far from the data set. Indeed, the lack of information may cause the value function to be overestimated, leading to undesirable behaviours. A compromise must be found between improving on the performance of the behaviour policy and staying close to it. To alleviate this issue, most existing approaches introduce a regularization term to favor state-action pairs from the data set. In this paper, we refine this idea by estimating the density of these state-action pairs to distinguish neighbourhoods. The resulting regularization guides the policy toward meaningful unseen regions, improving the learning process. We hence introduce Density Conservative Q-Learning (D-CQL), a sound batch RL algorithm that carefully penalizes the value function based on the information collected in the state-action space. The performance of our approach is demonstrated on many classical benchmarks in batch RL.
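To make the idea of a density-based penalty concrete, the following is a minimal illustrative sketch and not the D-CQL algorithm itself, whose exact objective is not given in this abstract. It assumes a Gaussian kernel density estimate over the batch's state-action pairs and a hypothetical weighting `1 / (1 + density / tau)`; the function names, `bandwidth`, `alpha`, and `tau` are all assumptions introduced for illustration.

```python
import numpy as np

def gaussian_kde_density(data, queries, bandwidth=0.5):
    """Gaussian kernel density estimate of `queries` under the batch `data`.

    data:    array of shape (n, d), the state-action pairs in the batch.
    queries: array of shape (m, d), the state-action pairs to evaluate.
    Returns an array of shape (m,) with the estimated densities.
    """
    n, d = data.shape
    diffs = queries[:, None, :] - data[None, :, :]   # shape (m, n, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)            # shape (m, n)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    kernels = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return kernels.sum(axis=1) / (n * norm)

def density_penalized_q(q_values, density, alpha=1.0, tau=0.1):
    """Hypothetical penalty (an assumption, not the paper's formula):
    subtract more from Q where the estimated density is low.

    The weight is 1 at zero density and decays toward 0 as density grows.
    """
    weight = 1.0 / (1.0 + density / tau)
    return q_values - alpha * weight

# Toy batch: three (state, action) pairs clustered near the origin.
batch = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
# One query inside the data region, one far from it.
queries = np.array([[0.0, 0.0], [5.0, 5.0]])

density = gaussian_kde_density(batch, queries)
q_raw = np.array([1.0, 1.0])
q_pen = density_penalized_q(q_raw, density)
```

In this toy setup the query far from the batch receives a near-zero density estimate and therefore a larger penalty than the query inside the data region, which mirrors the conservative behaviour the abstract describes.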
