Benchmarking Continual Agent Memory for Online Learning, Transfer, and Forgetting
Abstract
Memory is pivotal to LLM-based agents that improve over time by accumulating reusable experience, retaining user preferences, and adapting to evolving environments. Yet current benchmarks provide an incomplete picture. Memory is often evaluated as a static add-on via offline tests, which masks temporal dynamics such as consolidation, restructuring, and forgetting. Moreover, system memory and personal memory are typically assessed in isolation, overlooking the mixed workloads in which task execution and personalization co-evolve. We introduce AgentMemoryBench, a unified benchmark that evaluates system and personal memory within a single framework and standardizes five complementary modes to quantify improvement, retention, forgetting, generalization, and knowledge-conflict resolution over time. AgentMemoryBench spans representative interactive settings, including code-centric tool use, embodied tasks, web interaction, and long-horizon dialogue, and supports interleaved task streams to better reflect continuous real-world workflows in which task boundaries are blurred. Building on this benchmark, we propose MEMs, a multi-memory coordination approach that maintains separate system and personal memory stores and employs a lightweight trigger as a meta-cognitive router to selectively retrieve from and update each store. Extensive comparisons and ablations on AgentMemoryBench establish a reproducible evaluation loop and expose the limitations of single-memory designs under continual mixed interaction, providing practical guidance for developing more sustainable agent memory. Source code: https://anonymous.4open.science/r/AgentMemoryBench-12FD.
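The multi-memory idea summarized above (separate system and personal stores, with a lightweight trigger deciding which store to read and write) can be illustrated with a minimal sketch. This is not the MEMs implementation: the class names, the keyword-overlap retrieval, and the cue-word trigger are all hypothetical stand-ins chosen only to make the routing pattern concrete, and an actual trigger could well be a learned or LLM-based component.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy append-only store with keyword-overlap retrieval (illustrative only)."""
    entries: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Score each entry by the number of words it shares with the query.
        q = set(query.lower().split())
        scored = [(len(q & set(e.lower().split())), e) for e in self.entries]
        hits = [(s, e) for s, e in scored if s > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in hits[:k]]


# Hypothetical cue words; a stand-in for whatever signal a real trigger uses.
PERSONAL_CUES = {"i", "my", "me", "prefer"}
SYSTEM_CUES = {"run", "click", "open", "file", "tool"}


def trigger(message: str) -> set[str]:
    """Return the names of the stores this message should touch."""
    words = set(message.lower().split())
    routes = set()
    if words & PERSONAL_CUES:
        routes.add("personal")
    if words & SYSTEM_CUES:
        routes.add("system")
    return routes


class TwoStoreAgentMemory:
    """Keeps system and personal memory apart and routes access via trigger()."""

    def __init__(self) -> None:
        self.stores = {"system": MemoryStore(), "personal": MemoryStore()}

    def retrieve(self, message: str) -> dict[str, list[str]]:
        # Only the stores selected by the trigger are queried.
        return {name: self.stores[name].retrieve(message)
                for name in sorted(trigger(message))}

    def update(self, message: str, note: str) -> set[str]:
        # Only the stores selected by the trigger are written to.
        routes = trigger(message)
        for name in routes:
            self.stores[name].add(note)
        return routes


memory = TwoStoreAgentMemory()
memory.update("I prefer dark mode", "user prefers dark mode")
recalled = memory.retrieve("what is my dark mode setting")
```

In this sketch a message with no cue words touches neither store, which is the selective behavior a single shared memory cannot express: task experience and user preferences never overwrite or crowd out each other during retrieval.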