Poster in Workshop: ICLR 2026 Workshop on AI with Recursive Self-Improvement

Differentiable Evolutionary Reinforcement Learning

Sitao Cheng ⋅ Tianle Li ⋅ Xuhan Huang ⋅ Xunjian Yin ⋅ Difan Zou

Abstract

Designing reward functions is a persistent challenge in reinforcement learning (RL). Existing automated reward modeling typically relies on derivative-free evolutionary heuristics that treat the reward function as a black box and therefore fail to capture the causal relationship between changes in reward structure and task performance. To bridge this gap, we propose $\textbf{Differentiable Evolutionary Reinforcement Learning (DERL)}$, a bi-level training framework for the autonomous discovery of optimal reward signals. In DERL, a $\textit{Meta-Optimizer}$ evolves a reward function by composing structured atomic primitives, which in turn guides the evolution of the inner-loop policy. Crucially, DERL is differentiable in meta-optimization: the Meta-Optimizer is updated via a policy gradient derived from inner-loop validation performance. This allows it to progressively learn the "meta-gradient" of task success, yielding denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, significantly outperforming non-differentiable methods, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
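
The bi-level loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the primitive names, the Bernoulli parameterization of the Meta-Optimizer, the moving-average baseline, and the `inner_loop_validation` stand-in (which replaces actual policy training with a noisy synthetic score) are all assumptions made for illustration. It shows only the general idea of updating a reward-composing meta-policy with a REINFORCE-style policy gradient computed from validation performance.

```python
import math
import random

# Hypothetical atomic primitives; the names are illustrative, not from the paper.
PRIMITIVES = ["outcome", "format", "step_penalty"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_reward_config(logits, rng):
    """Meta-Optimizer action: sample which primitives the reward function includes."""
    probs = [sigmoid(l) for l in logits]
    mask = [1 if rng.random() < p else 0 for p in probs]
    return mask, probs


def inner_loop_validation(mask, rng):
    """Stand-in for training a policy under the sampled reward and validating it.

    Toy assumption: 'outcome' and 'format' help validation performance,
    while 'step_penalty' hurts it. A real inner loop would run RL training.
    """
    assumed_effect = [0.4, 0.2, -0.2]
    score = 0.3 + sum(m * e for m, e in zip(mask, assumed_effect))
    return score + rng.gauss(0.0, 0.05)


def train_meta_optimizer(steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(PRIMITIVES)
    baseline = 0.0  # moving-average baseline to reduce gradient variance
    for _ in range(steps):
        mask, probs = sample_reward_config(logits, rng)
        val = inner_loop_validation(mask, rng)
        advantage = val - baseline
        baseline = 0.9 * baseline + 0.1 * val
        # REINFORCE: the gradient of log Bernoulli(mask_i; sigmoid(logit_i))
        # with respect to logit_i is (mask_i - prob_i).
        for i in range(len(logits)):
            logits[i] += lr * advantage * (mask[i] - probs[i])
    return logits


if __name__ == "__main__":
    for name, logit in zip(PRIMITIVES, train_meta_optimizer()):
        print(f"{name}: inclusion probability {sigmoid(logit):.2f}")
```

In this toy setting the meta-policy should learn to include primitives that raise the synthetic validation score and to drop the one that lowers it, which mirrors, in a very reduced form, how validation performance can serve as the learning signal for reward composition.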
