PROCED-MEM: BENCHMARKING PROCEDURAL MEMORY RETRIEVAL IN LANGUAGE AGENTS ACROSS DOMAINS
Abstract
We introduce Proced-Mem, a benchmark for procedural memory retrieval in language agents with two sub-domains: text-based household tasks (ALFWorld) and real computer environments (OSWorld). Evaluating retrieval independently of downstream execution is critical: current agent evaluations conflate retrieval with planning and execution, masking whether agents retrieve relevant procedures or merely succeed despite poor memory access. Proced-Mem evaluates up to seven retrieval methods spanning text, visual, and lexical modalities, using an LLM-as-judge protocol for ALFWorld and a leave-one-out protocol with hierarchical ground truth at two granularity levels for OSWorld. Across both sub-domains, we find a generalization cliff (30–42% MAP degradation on novel contexts) and a granularity-method reversal, where visual features rank first at coarse retrieval but last at fine-grained procedural matching. Proced-Mem provides the first diagnostic framework for identifying such failure modes, enabling the principled design of retrieval systems that generalize across granularity levels and modalities.
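As an illustration of the retrieval metric behind the reported degradation, the following is a minimal sketch of mean average precision (MAP) over ranked retrieval results; the function names and data layout are ours, not taken from the benchmark:

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k taken at each rank
    where a relevant item is retrieved, normalized by the number
    of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant)

def mean_average_precision(queries):
    """MAP over a list of (ranked_ids, relevant_ids) pairs,
    one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```

A 30–42% relative drop in this quantity on novel contexts means that, averaged over queries, relevant procedures fall substantially lower in the ranking than they do on contexts seen during memory construction.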