Poster in Workshop: 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
Haoquan Fang · Markus Grotz · Wilbert Pumacay · Yi Ru Wang · Dieter Fox · Ranjay Krishna · Jiafei Duan
Sat 26 Apr 5:55 p.m. PDT — 3 a.m. PDT
Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: generalization to unseen scenarios, multitask interaction, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short on memory-dependent tasks and generalization to complex environmental variations. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer that leverages multi-resolution upsampling and visual representations from large-scale foundation models. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance drop under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-augmented architecture inspired by SAM2 that incorporates a memory bank and an attention mechanism to enable spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves strong performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems.
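To make the memory-augmented idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how a memory bank of past observation embeddings could be queried with cross-attention before action prediction. The class name, tensor shapes, and bank-eviction policy are illustrative assumptions; only the general pattern (store past observation tokens, attend to them from the current observation) follows the abstract's description.

```python
# Hypothetical sketch: memory bank + cross-attention for spatial memory.
# All names, shapes, and hyperparameters are illustrative assumptions,
# not the SAM2Act+ implementation.
import torch
import torch.nn as nn


class SpatialMemoryAttention(nn.Module):
    """Cross-attends current observation tokens to a bank of stored memories."""

    def __init__(self, dim: int = 256, num_heads: int = 8, bank_size: int = 16):
        super().__init__()
        self.bank_size = bank_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memories: list[torch.Tensor] = []  # each entry: (num_tokens, dim)

    def write(self, obs_tokens: torch.Tensor) -> None:
        """Store the current observation's tokens, evicting the oldest entry."""
        self.memories.append(obs_tokens.detach())
        if len(self.memories) > self.bank_size:
            self.memories.pop(0)

    def read(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        """Fuse current tokens with past memories via cross-attention."""
        if not self.memories:
            return obs_tokens
        bank = torch.cat(self.memories, dim=0).unsqueeze(0)  # (1, M, dim)
        query = obs_tokens.unsqueeze(0)                       # (1, N, dim)
        fused, _ = self.attn(query, bank, bank)
        return obs_tokens + fused.squeeze(0)                  # residual update


# Usage: write each timestep's tokens, read before predicting the next action.
mem = SpatialMemoryAttention()
tokens = torch.randn(64, 256)  # e.g. 64 multi-view observation tokens
mem.write(tokens)
conditioned = mem.read(torch.randn(64, 256))
print(conditioned.shape)       # torch.Size([64, 256])
```

In such a design, the memory bank lets the policy condition on where objects were seen earlier in the episode, which is what memory-dependent tasks like those in MemoryBench require.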