MobileMem: Evaluating Long-Horizon Memory for Language Agents in Real-World Mobile Environments
Abstract
Long-term memory is widely regarded as a key enabler of personalization for Large Language Model (LLM) agents, yet existing benchmarks almost exclusively model users through human–assistant dialogues, implicitly assuming that user preferences can be fully inferred from conversational signals alone. An effective personalized memory system, however, should not be limited to conversations; it should instead learn from continuous observations of diverse user behaviors, a setting that remains largely unexplored for lack of appropriate benchmarks. To this end, we introduce MobileMem, a benchmark for evaluating personalized long-term memory in realistic environments, with mobile usage as a representative and challenging testbed. MobileMem is constructed from real user trajectories in which human–assistant dialogues are naturally interleaved with interactions across multiple mobile applications. To enable coherent long-horizon evaluation from fragmented sessions, we further propose KEME, a knowledge-guided experience synthesis framework that integrates temporally dispersed interactions into consistent lifelong trajectories. Each trajectory is paired with long-horizon question–answer pairs that require memory systems to organize, retrieve, and integrate information across sources and time. Evaluations on MobileMem expose previously overlooked limitations of existing memory systems and reveal a significant gap between what current benchmarks measure and what real-world deployment demands.