Reward Hacking in Self-Improving Code Agents
Abstract
Recursive self-improvement (RSI) systems that iteratively optimize code with execution feedback are increasingly used for automated algorithm and systems optimization. A central failure mode is reward hacking: the agent improves a cheap proxy metric without improving, or while even harming, the true objective under more realistic evaluation. We present a large-scale quantitative study of reward hacking in iterative code optimization by agents across two settings: GPU kernel optimization and algorithmic optimization. Across three frontier models and five agent configurations, we analyze thousands of agent trajectories in a setting where we evaluate code on a held-out set of tasks (real tasks) while the agent only has access to a public set of evaluations (proxy tasks). Reward hacking is pervasive: in our experiments, 73.8% of Kernel-Bench optimizations and 46.8% of ALE-Bench optimizations exhibit proxy gains without gains on the real tasks. A temporal analysis shows that the proxy-reality gap widens with optimization steps: going from 10 to 100 steps of optimization, the percentage of reward hacking rises by 31.4 percentage points, from 26.4% to 57.8%. This quantitative evaluation framework lets us evaluate techniques that may prevent reward hacking, such as retrospection, a lightweight self-critique intervention triggered either probabilistically or on significant jumps in the proxy metric. Retrospection reduces Kernel-Bench hacking by ∼17–19 points in some cases, but shows no consistent reduction on ALE-Bench and can increase hacking in some settings, indicating that further research is needed in this direction. We believe this quantitative formulation of reward hacking can be useful for future research studying and measuring agent capabilities in recursive self-improvement.
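As a rough illustration of the quantities named above, the following Python sketch shows one way the hacking criterion (proxy gain without real gain) and the two retrospection triggers could be expressed. It is not the paper's implementation: the names `Step`, `is_reward_hack`, `hacking_rate`, `retrospect_probabilistic`, and `retrospect_on_jump`, the threshold `eps`, the default trigger parameters, and the choice of all optimizations as the denominator are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    """One optimization result (hypothetical record, not from the paper)."""
    proxy_gain: float  # score change on the public (proxy) tasks
    real_gain: float   # score change on the held-out (real) tasks


def is_reward_hack(step: Step, eps: float = 0.0) -> bool:
    # Assumed criterion: the proxy improves while the real score does not.
    return step.proxy_gain > eps and step.real_gain <= eps


def hacking_rate(steps: Sequence[Step], eps: float = 0.0) -> float:
    # Assumed denominator: all optimizations, not only proxy-improving ones.
    if not steps:
        return 0.0
    return sum(is_reward_hack(s, eps) for s in steps) / len(steps)


def retrospect_probabilistic(p: float = 0.1,
                             rng: Optional[random.Random] = None) -> bool:
    # Trigger a self-critique pass with probability p (assumed default).
    rng = rng or random.Random()
    return rng.random() < p


def retrospect_on_jump(prev_proxy: float, new_proxy: float,
                       threshold: float = 0.2) -> bool:
    # Trigger a self-critique pass when the proxy score jumps by at least
    # `threshold` (assumed absolute threshold) in a single step.
    return (new_proxy - prev_proxy) >= threshold
```

For example, `hacking_rate([Step(0.5, -0.1), Step(0.3, 0.2)])` counts only the first entry as hacking, since the second also improves on the held-out tasks.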