Poster in Workshop: Workshop on Large Language Models for Agents
HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models
Gabriel Sarch · Sahil Somani · Raghav Kapoor · Michael Tarr · Katerina Fragkiadaki
Methods for developing instructable embodied artificial agents typically train distinct models for each application and language domain to map instructions to the corresponding actions and task plans. Here we explore the feasibility of a versatile “generalist” instructable agent that operates across a broad spectrum of tasks, language domains, and environments with a single model. Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners: a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt, improving the LLM's inference of the correct actions and task plans. Our approach, HELPER-X, expands this external language-program memory with a wide range of examples and prompt templates, while also extending the agent's action API. This expansion of a shared, unified memory enables the agent to work across the domains of executing plans from dialogue, natural-language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive vision-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. These benchmarks vary significantly in input instructions, question-asking capabilities, task structures, and environmental settings. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone such training. Our work demonstrates the potential of memory-augmented LLMs to support generalist instructable embodied agents.
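As a rough illustration of the memory-augmented planning loop described above, the sketch below retrieves the language-program examples most similar to an incoming instruction from a shared memory and assembles them into an in-context prompt for an LLM planner. The names (`Example`, `retrieve_examples`, `build_prompt`), the cosine-similarity retrieval, and the prompt layout are illustrative assumptions, not the authors' actual implementation or API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Example:
    """One entry of the shared language-program memory (hypothetical schema)."""
    instruction: str       # natural-language instruction or dialogue snippet
    program: str           # corresponding action/task plan ("language program")
    embedding: np.ndarray  # precomputed embedding of the instruction


def retrieve_examples(query_emb: np.ndarray,
                      memory: List[Example],
                      k: int = 5) -> List[Example]:
    """Return the k memory entries whose instructions are most similar to the query."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sorted(memory, key=lambda ex: cosine(query_emb, ex.embedding), reverse=True)[:k]


def build_prompt(instruction: str, retrieved: List[Example], template: str) -> str:
    """Fill a domain-appropriate prompt template with the retrieved in-context examples."""
    shots = "\n\n".join(
        f"Instruction: {ex.instruction}\nProgram: {ex.program}" for ex in retrieved
    )
    return template.format(examples=shots, instruction=instruction)


# Usage (assumed): embed the incoming instruction with any sentence encoder,
# retrieve from the unified memory covering all four domains, and pass the
# assembled prompt to an LLM that emits an executable plan over the action API.
```

Under this framing, supporting an additional domain amounts to adding examples, prompt templates, and action-API calls to the shared memory rather than training a new in-domain model.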