Poster in Workshop: Deep Generative Model in Machine Learning: Theory, Principle and Efficacy
Efficient Consistency Model Training for Policy Distillation in Reinforcement Learning
Bowen Fang · Xuan Di
Keywords: [ Policy Gradient ] [ Probability Flow-ODE ] [ Consistency Model ] [ Reinforcement Learning ] [ Efficiency ]
This paper proposes an efficient consistency model (CM) training scheme for reinforcement learning (RL). Specifically, we leverage the Probability Flow ODE (PF-ODE) and introduce two novel loss functions that improve CM training for RL policy distillation. We propose Importance Weighting (IW) and Gumbel-Based Sampling (GBS) as strategies to refine policy learning under limited sampling budgets. Our approach enables efficient training by directly incorporating probability estimates, which reduces variance and improves sample efficiency. Numerical experiments demonstrate that our method outperforms conventional CM training, yielding more accurate policy representations from limited samples. These findings highlight the potential of CMs as an efficient alternative for policy optimization in RL.
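To make the idea concrete, below is a minimal PyTorch sketch of importance-weighted consistency distillation for a policy, together with a Gumbel-top-k sampler for drawing candidate actions under a limited budget. This is an illustrative sketch under assumptions, not the authors' implementation: the names ConsistencyPolicy, iw_consistency_loss, and gumbel_topk_sample, the network architecture, and the softmax weighting scheme are all hypothetical choices. The consistency model is trained so that points at two adjacent noise levels on the same PF-ODE trajectory map to the same clean action, with the per-sample loss reweighted by a probability estimate of the corresponding action.

```python
# Illustrative sketch only: the names, architecture, and exact weighting
# scheme below are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConsistencyPolicy(nn.Module):
    """Consistency model over actions: maps a noised action at noise level
    sigma back toward the clean action, conditioned on the state."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noised_action, sigma):
        return self.net(torch.cat([state, noised_action, sigma], dim=-1))


def iw_consistency_loss(model, target_model, state, action, log_prob,
                        sigma_hi, sigma_lo):
    """Consistency-distillation step with importance weighting: outputs at
    two adjacent noise levels of the same PF-ODE trajectory are matched, and
    each sample's loss is reweighted by its policy probability estimate."""
    noise = torch.randn_like(action)
    a_hi = action + sigma_hi * noise          # point at the higher noise level
    a_lo = action + sigma_lo * noise          # point at the lower noise level
    s_hi = torch.full((action.size(0), 1), sigma_hi, device=action.device)
    s_lo = torch.full((action.size(0), 1), sigma_lo, device=action.device)

    pred_hi = model(state, a_hi, s_hi)
    with torch.no_grad():                     # frozen EMA / target network
        pred_lo = target_model(state, a_lo, s_lo)

    # Normalized importance weights from the policy's probability estimates.
    weights = torch.softmax(log_prob, dim=0).detach()
    per_sample = F.mse_loss(pred_hi, pred_lo, reduction="none").mean(dim=-1)
    return (weights * per_sample).sum()


def gumbel_topk_sample(log_prob, k):
    """Gumbel-top-k trick: select k distinct candidates with probability
    proportional to their estimated likelihood, under a limited sample budget."""
    gumbel = -torch.log(-torch.log(torch.rand_like(log_prob)))
    return torch.topk(log_prob + gumbel, k).indices
```

In this sketch the importance weights let high-probability actions dominate the distillation loss, which is one plausible way to exploit probability estimates for variance reduction under a small sampling budget; the Gumbel-top-k step is a standard trick for drawing distinct candidates in proportion to their estimated probabilities.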