CoMem: Compositional Concept-Graph Memory for Continual Vision–Language Learning
Abstract
Continual vision–language learning underpins multimodal tasks such as image–text retrieval, visual question answering, and grounded reasoning in dynamic environments. Deployed systems, however, must learn from non-stationary streams under strict privacy and memory budgets, where naïve finetuning forgets prior knowledge and degrades transfer. Our goal is to keep the model both stable and plastic in this setting without storing raw data, so that knowledge can be reused and recombined across domains and tasks. We present CoMem, a framework that treats compositional structure as the unit of memory and rehearsal: it incrementally organizes knowledge into a compact graph of concepts and relations and rehearses directly in feature space, conditioning its rehearsal signals on sampled subgraphs. A lightweight compositional consistency objective keeps part–whole predictions coherent, while teacher-informed, uncertainty-aware filtering limits off-manifold drift. Across cross-domain retrieval, structured concept learning, and continual multimodal VQA, CoMem achieves state-of-the-art retention and transfer, with consistent gains on SVLC and VQACL/CLOVE under matched memory and parameter budgets. By casting structure as memory and rehearsing in feature space, where learning happens, CoMem offers a privacy-friendly, testable paradigm for reliable continual adaptation without raw exemplars.
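
To make the abstract's ingredients concrete, the sketch below illustrates one possible reading of a concept-graph memory with feature-space rehearsal and a part–whole consistency penalty. It is a minimal, hypothetical example: the names ConceptGraph, sample_subgraph, rehearsal_loss, and compositional_consistency, the running-mean prototypes, and the MSE-based losses are our own simplifications for illustration, not CoMem's actual interfaces or objectives.

```python
# Hypothetical sketch of a concept-graph memory rehearsed in feature space.
# Assumes concept "prototypes" are feature vectors and the learner exposes a
# module that maps feature space to feature space (e.g. an adapter head).
import random
import torch
import torch.nn.functional as F


class ConceptGraph:
    """Compact memory: concept prototypes (feature vectors) plus part-whole relations."""

    def __init__(self, dim: int):
        self.dim = dim
        self.prototypes: dict[str, torch.Tensor] = {}  # concept name -> feature prototype
        self.part_of: dict[str, list[str]] = {}        # whole concept -> its part concepts

    def add_concept(self, name: str, features: torch.Tensor) -> None:
        # Running mean keeps the memory compact; no raw images or text are stored.
        if name in self.prototypes:
            self.prototypes[name] = 0.9 * self.prototypes[name] + 0.1 * features.detach()
        else:
            self.prototypes[name] = features.detach().clone()

    def add_part(self, whole: str, part: str) -> None:
        self.part_of.setdefault(whole, []).append(part)

    def sample_subgraph(self, k: int) -> list[str]:
        # Sample a small set of concepts on which to condition rehearsal.
        names = list(self.prototypes)
        return random.sample(names, min(k, len(names)))


def rehearsal_loss(model: torch.nn.Module, graph: ConceptGraph, k: int = 4) -> torch.Tensor:
    """Feature-space rehearsal: pull the model's current encoding of sampled
    concept prototypes back toward the stored prototypes (no raw exemplars replayed)."""
    loss = torch.zeros(())
    for name in graph.sample_subgraph(k):
        current = model(graph.prototypes[name])  # re-encode the stored feature vector
        loss = loss + F.mse_loss(current, graph.prototypes[name])
    return loss


def compositional_consistency(graph: ConceptGraph) -> torch.Tensor:
    """Toy part-whole coherence: keep each whole's prototype close to the mean of
    its parts' prototypes (one simple instantiation of such an objective)."""
    loss = torch.zeros(())
    for whole, parts in graph.part_of.items():
        if whole in graph.prototypes and all(p in graph.prototypes for p in parts):
            parts_mean = torch.stack([graph.prototypes[p] for p in parts]).mean(dim=0)
            loss = loss + F.mse_loss(graph.prototypes[whole], parts_mean)
    return loss
```

Under these assumptions, the memory footprint is one vector per concept plus a sparse relation table, and rehearsal touches only feature-space prototypes, which is what makes the scheme exemplar-free and privacy-friendly in spirit.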