ShiftBench: Measuring Recovery of Agent Memory Under Distribution Shift
Teresa Zhang
Abstract
Selecting memory policies by long-horizon accuracy can be misleading under distribution shift, because policy rankings may reverse when the same policies are evaluated by post-shift recovery. We introduce ShiftBench, a lightweight protocol that defines shift segments and a Recovery@T metric on LoCoMo and HaluMem-Long. On LoCoMo, lexical baselines (TF--IDF methods) show a ranking reversal under interruption (Spearman $\rho=-0.30$, inversion rate $0.60$), and alignment drops from $0.94$ to $0.70$ ($\Delta \rho=0.24$, 95\% CI $[0.12, 0.37]$). On HaluMem-Long, the effect is weaker but still present: the two rankings are nearly uncorrelated ($\rho=0.02$, inversion rate $0.50$). Overall, ShiftBench shows that post-shift recovery is a distinct evaluation axis that can change which memory policy is selected.
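To make the two ranking statistics quoted above concrete, the following is a minimal sketch of how Spearman $\rho$ and an inversion rate could be computed between two score vectors over the same set of policies. It is an illustration under stated assumptions, not the ShiftBench implementation: the function names are hypothetical, the example scores are made up, and the inversion rate is assumed here to be the fraction of policy pairs whose relative order flips between the two metrics, which may differ from the paper's definition.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one score vector is constant")
    return cov / (sx * sy)


def inversion_rate(x, y):
    """Assumed definition: share of pairs ordered oppositely by x and y."""
    pairs = list(combinations(range(len(x)), 2))
    flipped = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return flipped / len(pairs)


# Hypothetical scores for four policies (illustrative only, not paper data).
long_horizon_accuracy = [0.62, 0.58, 0.51, 0.47]
post_shift_recovery = [0.40, 0.55, 0.35, 0.60]

rho = spearman_rho(long_horizon_accuracy, post_shift_recovery)
inv = inversion_rate(long_horizon_accuracy, post_shift_recovery)
print(f"rho={rho:.2f}, inversion rate={inv:.2f}")
```

Under this assumed definition, an inversion rate near $0.50$ together with $\rho$ near zero indicates that the two metrics order policies almost independently, while a rate above $0.50$ with negative $\rho$ indicates a net reversal.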