Ready for General Agents? Let's Test It.
Abstract
Recent progress in LLMs has pushed the field from domain-specific systems toward increasingly general-purpose models. A similar shift is now emerging for AI agents: domain-specific agents share reusable components and already operate across multiple domains with minimal adaptation. This capacity to integrate into new environments and tackle entirely new classes of tasks gives general agents the potential for effectively unbounded real-world value. However, current evaluation frameworks cannot measure this core capability. We organize prior evaluation efforts into a five-level taxonomy and highlight the missing fifth level: unified evaluation of general agents. Unlike existing work, which typically assesses a single agent across many environments, this missing level must support comparison across agents with different architectures and determine how well they operate in unfamiliar settings. We outline the challenges that prevent such evaluation today and analyze why common benchmarks and protocols fall short. We then propose requirements for a protocol-agnostic evaluation framework capable of reliably measuring agent performance and adaptability across diverse environments.