Poster Sat, Apr 25, 2026 • 11:15 AM – 1:45 PM PDT

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li · Wenshuo Zhao · Jian Zhao · Weihao Zeng · Haoze Wu · Xiaochen Wang · Rui Ge · Yuxuan Cao · Yuzhen Huang · Wei Liu · Junteng LIU · Zhaochen Su · Yiyang Guo · FAN ZHOU · Lueyang Zhang · Juan Michelini · Xingyao Wang · Xiang Yue · Shuyan Zhou · Graham Neubig · Junxian He

Abstract

Real-world language agents must handle complex, multi-step workflows across diverse applications. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database like BigQuery to detect anomalies and generate reports following a standard operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse applications and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional applications like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers, some of which we have revised or implemented ourselves. Unlike prior work, which primarily ensures functional realism but offers limited diversity of environment states, we provide realistic initial environment states from real software, such as multiple Canvas courses each with dozens of students, or real-world financial spreadsheets. The Toolathlon benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple applications over ~20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of state-of-the-art models highlights their significant shortcomings in performing real-world, long-horizon tasks: the best-performing model, Claude-4-Sonnet, achieves only a 29.9% success rate with 28 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.1, reaches 13.9%.
We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.