Differentiable Evolutionary Reinforcement Learning
Sitao Cheng ⋅ Tianle Li ⋅ Xuhan Huang ⋅ Xunjian Yin ⋅ Difan Zou
Abstract
The design of reward functions presents an arduous challenge in reinforcement learning (RL). Existing automated reward modeling typically relies on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between structural changes and task performance. To bridge this gap, we propose $\textbf{Differentiable Evolutionary Reinforcement Learning (DERL)}$, a bi-level training framework for the autonomous discovery of optimal reward signals. In DERL, a $\textit{Meta-Optimizer}$ evolves a reward function by composing structured atomic primitives, guiding the evolution of the inner-loop policy. Crucially, DERL is differentiable in its meta-optimization: the Meta-Optimizer is updated via policy gradients derived from inner-loop validation performance. This allows DERL to progressively learn the "meta-gradient" of task success, yielding denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, significantly outperforming non-differentiable methods, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
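To make the bi-level structure concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a three-armed bandit as the inner task, three hypothetical atomic primitives, a Meta-Optimizer that selects a single primitive (a simplification of composing several), and a REINFORCE update with a running baseline as the policy-gradient estimator. All names (`PRIMITIVES`, `train_inner_policy`, `run_meta_optimization`) are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical atomic primitives: each scores an action in {0, 1, 2}.
PRIMITIVES = [
    lambda a: 1.0 if a == 0 else 0.0,  # misleading signal
    lambda a: 1.0 if a == 1 else 0.0,  # misleading signal
    lambda a: 1.0 if a == 2 else 0.0,  # aligned with the true task
]
TRUE_ACTION = 2  # assumed ground truth, used only for validation


def train_inner_policy(reward_fn, steps=200, lr=0.5):
    """Inner loop: softmax bandit policy trained on the evolved reward."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        rewards = [reward_fn(a) for a in range(3)]
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. each logit.
        logits = [
            l + lr * p * (r - expected)
            for l, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


def validation_score(policy_probs):
    """Validation performance: probability of the truly correct action."""
    return policy_probs[TRUE_ACTION]


def run_meta_optimization(outer_steps=300, meta_lr=0.5, seed=0):
    """Outer loop: REINFORCE on the Meta-Optimizer's choice of primitive."""
    rng = random.Random(seed)
    meta_logits = [0.0] * len(PRIMITIVES)
    baseline = 0.0
    for _ in range(outer_steps):
        meta_probs = softmax(meta_logits)
        choice = rng.choices(range(len(PRIMITIVES)), weights=meta_probs)[0]
        policy = train_inner_policy(PRIMITIVES[choice])
        score = validation_score(policy)
        advantage = score - baseline
        baseline += 0.1 * (score - baseline)  # running-average baseline
        # Gradient of log pi(choice) w.r.t. logit k is onehot_k - prob_k.
        for k in range(len(meta_logits)):
            grad = (1.0 if k == choice else 0.0) - meta_probs[k]
            meta_logits[k] += meta_lr * advantage * grad
    return softmax(meta_logits)


if __name__ == "__main__":
    print(run_meta_optimization())
```

In this toy setting the validation score is the only signal reaching the Meta-Optimizer, so probability mass should shift toward the primitive that actually improves task success. A derivative-free evolutionary search would instead mutate and select reward candidates without such a gradient estimate.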