Towards Self-Evolving Agent Benchmarks: Validatable Agent Trajectories via Test-Time Exploration
Abstract
Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, newly developed agents are rapidly hitting the ceilings of existing agent benchmarks, making it increasingly difficult for these benchmarks to discriminate among agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new, more difficult task while recording the corresponding execution trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which generates task evolution proposals through preliminary exploration and divergent thinking; (2) problem construction via free exploration, where proposals are instantiated into concrete problem instances through agent exploration, with execution trajectories recorded along the way; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by reproducible and logically coherent trajectories. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently increases task complexity while improving answer reliability through trajectory-level validation. In addition, our framework successfully adapts to and improves reasoning benchmarks such as AIME-2024. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging foundation for agent development. Code and data can be found at https://github.com/titanwings/trace-benchmark-evolving.
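The three stages of the framework can be illustrated with a minimal Python sketch. All function names, the placeholder proposals, and the simulated trajectory steps below are illustrative assumptions, not the paper's actual implementation; in the real framework each stage would be driven by an LLM agent with tool access.

```python
from dataclasses import dataclass, field

@dataclass
class EvolvedTask:
    """An evolved task bundled with the trajectory that produced it."""
    question: str
    answer: str
    trajectory: list[str] = field(default_factory=list)

def propose_evolutions(seed_question: str) -> list[str]:
    # Stage 1 (evolutionary proposal mining): generate divergent evolution
    # proposals from a seed task. Placeholder heuristics stand in for the
    # agent's preliminary exploration and divergent thinking.
    return [
        f"{seed_question} [add a multi-hop constraint]",
        f"{seed_question} [require cross-source verification]",
    ]

def construct_task(proposal: str) -> EvolvedTask:
    # Stage 2 (problem construction via free exploration): instantiate the
    # proposal into a concrete problem instance, recording every execution
    # step. Here exploration is simulated; the answer "42" is a dummy value.
    task = EvolvedTask(question=proposal, answer="42")
    task.trajectory.append(f"explored: {proposal}")
    task.trajectory.append("derived answer: 42")
    return task

def validate(task: EvolvedTask) -> bool:
    # Stage 3 (multi-level validation): check that the recorded trajectory is
    # non-empty and that replaying its final step reproduces the stored
    # answer. A real validator would apply multiple levels of checks.
    return bool(task.trajectory) and task.answer in task.trajectory[-1]

def evolve(seed_question: str) -> list[EvolvedTask]:
    # Full pipeline: propose, construct with trajectories, then keep only
    # tasks whose trajectories pass validation.
    candidates = [construct_task(p) for p in propose_evolutions(seed_question)]
    return [t for t in candidates if validate(t)]
```

Running `evolve` on a seed question yields only the evolved tasks whose recorded trajectories survive validation, mirroring how TRACE discards evolutions that cannot be reproduced.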