Track: Oral Session 3A Agents

Fri 24 April 6:30 - 6:40 PDT

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Zhaoyang Liu ⋅ JingJing Xie ⋅ Zichen Ding ⋅ Zehao Li ⋅ Bowen Yang ⋅ Zhenyu Wu ⋅ Xuehui Wang ⋅ Qiushi Sun ⋅ Shi Liu ⋅ Weiyun Wang ⋅ Shenglong Ye ⋅ Qingyun Li ⋅ Zeyue Tian ⋅ Gen Luo ⋅ Xiangyu Yue ⋅ Biqing Qi ⋅ Kai Chen ⋅ Bowen Zhou ⋅ Yu Qiao ⋅ Qifeng Chen ⋅ Wenhai Wang

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.

Fri 24 April 6:42 - 6:52 PDT

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Gabriel Grand ⋅ Valerio Pepe ⋅ Joshua B Tenenbaum ⋅ Jacob Andreas

Many emerging applications of AI—from scientific discovery to medical diagnosis—require agents to seek information strategically: forming hypotheses, asking targeted questions, and making decisions under uncertainty. In high-stakes settings with limited resources, do language models (LMs) behave like rational agents? Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking. First, we introduce a decision-oriented dialogue task called Collaborative Battleship, in which a Captain must balance exploration (asking questions) and action (taking shots), while a Spotter must supply accurate, contextually-grounded answers. Compared to human players (N=42), we find that many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. To address these gaps, we develop novel Monte Carlo inference strategies for LMs inspired by Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303–0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3–42.4 p.p.), demonstrating their general applicability for building information-seeking agents.
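
As a rough illustration of the expected information gain (EIG) quantity mentioned in the abstract, the sketch below scores a yes/no question against a set of sampled hypotheses. It is not the authors' implementation: the function names, the representation of a hypothesis as a set of occupied cells, and the symmetric answer-noise model (`noise`) are all assumptions made for this example.

```python
import math
from typing import Callable, Sequence, Set, Tuple

Board = Set[Tuple[int, int]]  # hypothetical representation: set of occupied cells


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(
    samples: Sequence[Board],
    question: Callable[[Board], bool],
    noise: float = 0.0,
) -> float:
    """Monte Carlo estimate of EIG (bits) for a yes/no question.

    `samples` are hypotheses drawn from the current belief state.
    `noise` is an assumed probability that the answer is flipped.
    """
    p_yes = sum(1 for h in samples if question(h)) / len(samples)
    # Probability of observing "yes" after accounting for flipped answers.
    p_obs = noise + (1 - 2 * noise) * p_yes
    # Mutual information: H(answer) - H(answer | hypothesis).
    return binary_entropy(p_obs) - binary_entropy(noise)


boards = [{(0, 0)}, {(0, 1)}, {(1, 0)}, {(1, 1)}]
in_top_row = lambda b: any(r == 0 for r, _ in b)
print(expected_information_gain(boards, in_top_row))
```

Under these assumptions, a question that splits the sampled hypotheses evenly is worth one full bit when answers are noiseless, and less when the answerer is unreliable.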

Fri 24 April 6:54 - 7:04 PDT

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Zhuofeng Li ⋅ Haoxiang Zhang ⋅ Seungju Han ⋅ Sheng Liu ⋅ Jianwen Xie ⋅ Yu Zhang ⋅ Yejin Choi ⋅ James Y Zou ⋅ Pan Lu

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
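
The idea of broadcasting one trajectory-level outcome to every turn and normalizing within a group of rollouts can be sketched as follows. This is a minimal illustration under stated assumptions, not the Flow-GRPO implementation: the function name, the use of the population standard deviation, and the `eps` stabilizer are choices made for this example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def broadcast_group_advantages(
    rewards: Sequence[float],
    turns_per_rollout: Sequence[int],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Return one advantage per turn for each rollout in a group.

    Each rollout has a single final reward. The reward is normalized
    against the group (mean and standard deviation), and the resulting
    advantage is copied to every turn of that rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [
        [(r - mu) / (sigma + eps)] * turns
        for r, turns in zip(rewards, turns_per_rollout)
    ]


# Four rollouts of the same task with binary outcomes and varying lengths.
advantages = broadcast_group_advantages([1.0, 0.0, 1.0, 0.0], [2, 3, 1, 2])
print(advantages)
```

Because every turn in a rollout receives the same advantage, each turn can then be treated as its own single-turn update, which is the decomposition the abstract describes.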

Fri 24 April 7:06 - 7:16 PDT

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger ⋅ Pierre Andrews ⋅ Matteo Bettini ⋅ Amar Budhiraja ⋅ Ricardo Cabral ⋅ Virginie Do ⋅ Emilien Garreau ⋅ Jean-Baptiste Gaya ⋅ Hugo Laurençon ⋅ Maxime Lecanu ⋅ Kunal Malkan ⋅ Dheeraj Mekala ⋅ Pierre Ménard ⋅ Gerard Bertran ⋅ Ulyana Piterbarg ⋅ Mikhail Plekhanov ⋅ Mathieu Rita ⋅ Andrey Rusakov ⋅ Vladislav Vorotilov ⋅ Mengjuew Wang ⋅ Ian Yu ⋅ Amine Benhalloum ⋅ Grégoire Mialon ⋅ Thomas Scialom

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, and robustness, and expose challenges in closing the “sim2real” gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.

Fri 24 April 7:18 - 7:28 PDT

AgentGym-RL: An Open-Source Framework to Train LLM Agents for Long-Horizon Decision Making via Multi-Turn RL

Zhiheng Xi ⋅ Jixuan Huang ⋅ Chenyang Liao ⋅ Baodai Huang ⋅ Jiaqi Liu ⋅ Honglin Guo ⋅ Yajie Yang ⋅ Rui Zheng ⋅ Junjie Ye ⋅ Jiazheng Zhang ⋅ Wenxiang Chen ⋅ Wei He ⋅ Yiwen Ding ⋅ Guanyu Li ⋅ Zehui Chen ⋅ Zhengyin Du ⋅ Xuesong Yao ⋅ Yufei Xu ⋅ Jiecao Chen ⋅ Tao Gui ⋅ Zuxuan Wu ⋅ Qi Zhang ⋅ Xuanjing Huang ⋅ Yu-Gang Jiang

Training LLM agents for complex multi-turn decision-making tasks requires extensive exploration within their environment, for which reinforcement learning (RL) is a natural fit. However, the open-source community currently lacks a unified RL framework capable of training agents from scratch across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a modular and decoupled framework specifically designed for RL-based agent training in multi-turn decision-making tasks. It offers high flexibility and extensibility, supports mainstream RL algorithms, and spans a broad range of real-world scenarios. To effectively train agents for challenging tasks, we argue that they must expand external interactions with the environment rather than rely solely on internal reasoning. Nevertheless, training agents for long-horizon interaction with vanilla methods often runs into problems such as training instability. To this end, we propose ScalingInter-RL, a staged training approach for stable long-horizon RL training. It starts with short-horizon interaction to establish foundational policies and progressively extends the interaction horizon to encourage deeper exploration. Extensive experiments show that agents trained with our method perform on par with, or even surpass, commercial counterparts like OpenAI o3 and Gemini-2.5-Pro across 27 tasks in diverse environments. We share key insights and release the full framework, including code and datasets, to empower the community in building the next generation of intelligent agents. Our framework is available at https://github.com/WooooDyy/AgentGym-RL.
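
The staged idea of starting with short interaction horizons and lengthening them over training can be sketched as a simple schedule. This is only an illustration: the function name and every parameter value (`start`, `increment`, `stage_len`, `cap`) are hypothetical and are not taken from the abstract or the released framework.

```python
def max_turns_for_step(
    step: int,
    start: int = 5,
    increment: int = 5,
    stage_len: int = 100,
    cap: int = 30,
) -> int:
    """Return the interaction-turn budget allowed at a given training step.

    The budget begins at `start`, grows by `increment` every `stage_len`
    training steps, and never exceeds `cap`.
    """
    stage = step // stage_len
    return min(cap, start + stage * increment)


for step in (0, 99, 100, 250, 10_000):
    print(step, max_turns_for_step(step))
```

A rollout collector would consult such a schedule before each episode and truncate the interaction once the current budget is reached.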

Fri 24 April 7:30 - 7:40 PDT

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal ⋅ Shangyin Tan ⋅ Dilara Soylu ⋅ Noah Ziems ⋅ Rishi Khare ⋅ Krista Opsahl-Ong ⋅ Arnav Singhvi ⋅ Herumb Shandilya ⋅ Michael J Ryan ⋅ Meng Jiang ⋅ Christopher Potts ⋅ Koushik Sen ⋅ Alex Dimakis ⋅ Ion Stoica ⋅ Dan Klein ⋅ Matei Zaharia ⋅ Omar Khattab

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6 percentage points on average and by up to 19pp, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percentage points (e.g., +12pp on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa.
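
To make the notion of a Pareto frontier over candidate prompts concrete, the sketch below keeps every candidate that is not dominated on per-task scores. It uses the generic non-dominance criterion and is not necessarily the selection rule GEPA applies; the function name, candidate names, and scores are hypothetical.

```python
from typing import Dict, List, Sequence


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return names of candidates that no other candidate dominates.

    `scores` maps a candidate name to its score on each task (higher is
    better). Candidate A dominates B if A is at least as good on every
    task and strictly better on at least one.
    """

    def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    return [
        name
        for name, s in scores.items()
        if not any(
            dominates(t, s) for other, t in scores.items() if other != name
        )
    ]


candidates = {
    "prompt_a": [0.9, 0.2],  # strong on task 1
    "prompt_b": [0.3, 0.8],  # strong on task 2
    "prompt_c": [0.2, 0.1],  # worse than both on every task
}
print(pareto_front(candidates))
```

Keeping the whole frontier rather than the single best average scorer preserves candidates that have learned different lessons, which can then be combined.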

Fri 24 April 7:42 - 7:52 PDT

Speculative Actions: A Lossless Framework for Faster AI Agents

Naimeng Ye ⋅ Arnav Ahuja ⋅ Georgios Liargkovas ⋅ Yunan Lu ⋅ Kostis Kaffes ⋅ Tianyi Peng

AI agents are increasingly deployed in complex, interactive environments, yet their runtime remains a major bottleneck for training, evaluation, and real-world use. Typical agent behavior unfolds sequentially, where each action requires an API call that can incur substantial latency. For example, a game of chess between two state-of-the-art agents can take hours. We introduce speculative actions, a lossless acceleration framework for general agentic systems. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, our method uses faster models to predict likely future actions and executes them in parallel, committing only when predictions match. We evaluate speculative actions across gaming, e-commerce, and web search environments, and additionally study a lossy extension in an operating systems setting. Across domains, we achieve up to 55% next-action prediction accuracy, translating into substantial latency reductions. Finally, we present a cost–latency analysis that formalizes the tradeoff between speculative breadth and time savings. This analysis enables principled tuning and selective branch launching, to ensure multi-branch speculation delivers practical speedups without prohibitive cost growth.
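
The predict-in-parallel, commit-on-match pattern described above can be sketched with two threads. This is a simplified illustration, not the authors' framework: the function and parameter names are hypothetical, and `execute` is assumed to be free of side effects so that discarding a wrong speculation changes nothing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Tuple


def speculative_step(
    state: Any,
    slow_policy: Callable[[Any], Any],
    fast_policy: Callable[[Any], Any],
    execute: Callable[[Any, Any], Any],
) -> Tuple[Any, Any, bool]:
    """Run one agent step, speculating on the slow policy's action.

    Returns (action, result, hit), where `hit` says whether the
    speculative result was reused. `execute` must be side-effect free
    (an assumption of this sketch).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        slow = pool.submit(slow_policy, state)  # authoritative decision
        guess = fast_policy(state)  # cheap prediction of that decision
        speculative = pool.submit(execute, state, guess)  # work ahead
        action = slow.result()
        if action == guess:
            # Prediction matched: commit the precomputed result.
            return action, speculative.result(), True
        speculative.cancel()  # has no effect if already running
    # Prediction missed: discard it and execute the real action.
    return action, execute(state, action), False


slow = lambda s: s + 1
fast = lambda s: s + 1 if s % 2 == 0 else 0
run = lambda s, a: s * 10 + a
print(speculative_step(2, slow, fast, run))
print(speculative_step(3, slow, fast, run))
```

The output is the same as running the slow policy alone in both cases, which is what makes the scheme lossless; only the latency differs, depending on whether the guess matched.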