Track: Oral Session 4D Coding and scientific agents

Fri 24 April 11:15 - 11:25 PDT

SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving

Wendong XU ⋅ Jing Xiong ⋅ Chenyang Zhao ⋅ Qiujiang Chen ⋅ Haoran Wang ⋅ Hui Shen ⋅ Zhongwei Wan ⋅ Jianbo Dai ⋅ Taiqiang Wu ⋅ He Xiao ⋅ Chaofan Tao ⋅ Zhuoqing Mao ⋅ Ying Sheng ⋅ Zhijiang Guo ⋅ Hongxia Yang ⋅ Bei Yu ⋅ Lingpeng Kong ⋅ Quanquan Gu ⋅ Ngai Wong

We present SwingArena, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings.

Fri 24 April 11:27 - 11:37 PDT

BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Nan Huo ⋅ Xiaohan Xu ⋅ Jinyang Li ⋅ Per Jacobsson ⋅ Shipei Lin ⋅ Bowen Qin ⋅ Binyuan Hui ⋅ Xiaolong Li ⋅ Ge Qu ⋅ Shuzheng Si ⋅ Linheng Han ⋅ Edward Alexander ⋅ Xintong Zhu ⋅ Rui Qin ⋅ Ruihan Yu ⋅ Yiyao Jin ⋅ Feige Zhou ⋅ Weihao Zhong ⋅ Yun Chen ⋅ Hongyu Liu ⋅ Chenhao Ma ⋅ Fatma Ozcan ⋅ Yannis Papakonstantinou ⋅ Reynold Cheng

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity, either by treating conversation histories as static context or by limiting evaluation to narrow, read-only (SELECT-ONLY) operations, thereby potentially failing to reflect the challenges encountered in production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a comprehensive interaction environment that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two evaluation settings reflecting real-world interaction: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the DB environment; (3) a challenging task suite that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (BIRD-INTERACT-FULL) of 600 tasks that unfold into up to 11,796 dynamic interactions for a comprehensive overview of performance, and a lite set (BIRD-INTERACT-LITE) of 300 tasks with simplified databases for detailed behavioral analysis of interactions and fast development of methods.
Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model GPT-5 completes only 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. Further analysis via memory grafting and Interaction Test-time Scaling (ITS) validates the importance of effective interaction for achieving success in dynamic text-to-SQL tasks.

Fri 24 April 11:39 - 11:49 PDT

EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Wayne Chi ⋅ Valerie Chen ⋅ Ryan Shar ⋅ Aditya Mittal ⋅ Jenny Liang ⋅ Wei-Lin Chiang ⋅ Anastasios Angelopoulos ⋅ Ion Stoica ⋅ Graham Neubig ⋅ Ameet Talwalkar ⋅ Chris Donahue

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems where only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying up to 11%, indicating the importance of evaluating with realistic context.

Fri 24 April 11:51 - 12:01 PDT

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Yueqi Song ⋅ Ketan Ramaneti ⋅ Zaid Sheikh ⋅ Ziru Chen ⋅ Boyu Gou ⋅ Tianbao Xie ⋅ Yiheng Xu ⋅ Danyang Zhang ⋅ Apurva Gandhi ⋅ Fan Yang ⋅ Joseph Liu ⋅ Tianyue Ou ⋅ Zhihao Yuan ⋅ Frank F Xu ⋅ Shuyan Zhou ⋅ Xingyao Wang ⋅ Xiang Yue ⋅ Tao Yu ⋅ Huan Sun ⋅ Yu Su ⋅ Graham Neubig

Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the Agent Data Protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed supervised finetuning on the unified data and demonstrated an average performance gain of ~20% over corresponding base models, along with state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

Fri 24 April 12:03 - 12:13 PDT

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg ⋅ Mike D'Arcy ⋅ Nishant Balepur ⋅ Dan Bareket ⋅ Bhavana Dalvi Mishra ⋅ Sergey Feldman ⋅ Dany Haddad ⋅ Jena Hwang ⋅ Peter Jansen ⋅ Varsha Kishore ⋅ Bodhisattwa Prasad Majumder ⋅ Aakanksha Naik ⋅ Sigal Rahamimov ⋅ Kyle Richardson ⋅ Amanpreet Singh ⋅ Harshit Surana ⋅ Aryeh Tiktinsky ⋅ Rosni Vasu ⋅ Guy Wiener ⋅ Chloe Anastasiades ⋅ Stefanus Candra ⋅ Jason Dunkelberger ⋅ Daniel Emery ⋅ Rob Evans ⋅ Malachi Hamada ⋅ Regan Huff ⋅ Rodney Kinney ⋅ Matt Latzke ⋅ Jaron Lochner ⋅ Ruben Lozano-Aguilera ⋅ Ngoc-Uyen Nguyen ⋅ Smita Rao ⋅ Amber Tanaka ⋅ Brooke Vlahos ⋅ Peter Clark ⋅ Doug Downey ⋅ Yoav Goldberg ⋅ Ashish Sabharwal ⋅ Daniel Weld

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

Fri 24 April 12:15 - 12:25 PDT

MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu ⋅ Yuchen Zhuang ⋅ Yishan Zhong ⋅ Yue Yu ⋅ Zifeng Wang ⋅ Xiangru Tang ⋅ Hang Wu ⋅ May Dongmei Wang ⋅ Peifeng Ruan ⋅ Donghan Yang ⋅ Tao Wang ⋅ Guanghua Xiao ⋅ Xin Liu ⋅ Carl Yang ⋅ Yang Xie ⋅ Wenqi Shi

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively. These gains demonstrate MedAgentGym as an effective training ground and establish Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.