ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems
Abstract
Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have expanded the scope of conversational agents beyond traditional turn-by-turn task execution. Modern systems are increasingly expected to coordinate interleaved goals, preserve long-horizon context, and provide proactive assistance under asynchronous execution. However, existing benchmarks do not systematically evaluate these agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-horizon reasoning. ATOD captures key characteristics of agentic TOD, including multi-goal coordination, dependency management, long-horizon memory, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that operationalizes these dimensions through fine-grained metrics and supports reproducible evaluation in both offline and online settings. We further present an agentic memory-based evaluator for benchmarking models on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment of task completion, agentic capability, and response quality, and that the proposed evaluator achieves a favorable accuracy–efficiency trade-off relative to strong memory-based and LLM-based baselines under this evaluation setting.