OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Abstract
Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains underexplored. We present OmniEAR, a comprehensive framework for evaluating how LLMs reason about physical interactions, tool usage, and multi-agent coordination. Unlike existing benchmarks, which provide predefined tools and explicit collaboration directives, OmniEAR requires agents to acquire capabilities dynamically and to determine coordination strategies autonomously. The benchmark models continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household, industrial, and diverse professional domains. Our evaluation reveals severe degradation when reasoning must emerge from physical constraints: performance drops from 85-96% with explicit instructions to below 50% on compound tasks. Surprisingly, providing complete environmental information degrades coordination performance, indicating that models cannot filter task-relevant constraints. Fine-tuning dramatically improves single-agent performance but fails to transfer to multi-agent scenarios, exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses challenges fundamentally different from those that current architectures can address, establishing OmniEAR as a rigorous benchmark for advancing embodied AI. Code and data are provided in the supplementary materials and will be publicly released.