Off-Policy Safe Reinforcement Learning with Cost-Constrained Optimistic Exploration
Abstract
When safety is formulated as a limit on cumulative cost, safe reinforcement learning (RL) learns policies that maximize reward subject to this constraint during both data collection and deployment. While off-policy methods offer high sample efficiency, applying them to safe RL faces substantial challenges: constraint violations caused by cost-agnostic exploration and by underestimation bias in the cost value function. To address these challenges, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy primal-dual safe RL method that integrates cost-bounded exploration with conservative distributional RL. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to limit constraint violations during exploration. Second, we adopt truncated quantile critics to mitigate underestimation bias in the cost estimates; the quantile critics also quantify distributional, risk-sensitive epistemic uncertainty that guides exploration. Experiments on velocity-constrained robot locomotion, safe navigation, and complex autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive safety performance during evaluation, and controlled cost incurred during exploratory data collection. These results highlight the proposed method as a promising solution for safety-critical RL.
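To make the conservative cost estimation concrete, the sketch below illustrates truncated quantile pooling for a cost critic in the style of TQC. It is not the authors' implementation: the ensemble shape, the function name `truncated_cost_target`, and the choice to drop the lowest pooled quantiles (so the retained, higher quantiles overestimate cost, mirroring TQC's top-quantile truncation for rewards) are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of truncated quantile
# pooling for a cost critic. Each of `n_critics` critics is assumed to
# output `n_quantiles` quantile estimates of the cumulative cost.
import torch


def truncated_cost_target(quantiles: torch.Tensor, drop_per_critic: int) -> torch.Tensor:
    """quantiles: (batch, n_critics, n_quantiles) quantile estimates of cost.

    Pools all quantile atoms, sorts them, and drops the smallest ones so the
    remaining (larger) quantiles yield a conservative cost target. The drop
    direction is an assumption chosen to counter underestimation of cost.
    """
    batch, n_critics, n_quantiles = quantiles.shape
    pooled = quantiles.reshape(batch, n_critics * n_quantiles)
    sorted_q, _ = torch.sort(pooled, dim=1)
    n_drop = drop_per_critic * n_critics      # total atoms discarded per sample
    truncated = sorted_q[:, n_drop:]          # keep only the higher quantiles
    return truncated.mean(dim=1)              # conservative scalar cost estimate


# Example usage with random non-negative quantile predictions.
q = torch.randn(32, 5, 25).abs()              # 5 critics, 25 quantiles each
target = truncated_cost_target(q, drop_per_critic=2)
print(target.shape)                           # torch.Size([32])
```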