GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
Abstract
Inverse Reinforcement Learning (IRL) aims to recover reward models from expert demonstrations, but traditional methods yield "black-box" models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method that uses code-generating Large Language Models (LLMs) within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically demonstrate that GRACE efficiently learns highly accurate rewards in the multi-task settings defined by two benchmarks, BabyAI and AndroidWorld. Further, we show that the resulting rewards lead to strong policies compared to both competitive Imitation Learning baselines and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.