$\textbf{DomusMind}$: A Benchmark for Evaluating Lifelong Smart Home Agents Under Drift
Rong Xu ⋅ Yinxin Wan ⋅ Xiaochan Xue
Abstract
Smart home agents must operate continuously in non-stationary environments where human preferences and device reliability evolve over time. However, dominant evaluation protocols remain episodic and reset-based, failing to capture the degradation and recovery dynamics essential for long-term deployment. To address this gap, we introduce $\textbf{DomusMind}$, a benchmark for evaluating lifelong agents under two sources of non-stationarity: $\textit{preference drift}$ (persona) and $\textit{tool drift}$ (execution). $\textbf{DomusMind}$ instantiates a persistent interaction loop in which agents balance autonomous execution against user burden. Tracking time-resolved metrics across preference, tool, and mixed drift scenarios, we find that online Theory of Mind (ToM) with uncertainty-gated confirmation provides the most robust adaptation overall. Notably, $\texttt{ORACLE}$ persona access fails to mitigate $\textit{tool drift}$, which identifies execution reliability as a distinct bottleneck. By sweeping a confirmation threshold, $\textbf{DomusMind}$ characterizes a success–annoyance frontier that enables principled selection of operating points for long-horizon alignment.
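To make the idea of uncertainty-gated confirmation and the threshold sweep concrete, the following is a minimal toy sketch, not the benchmark's implementation. It assumes a hypothetical per-step uncertainty score, assumes that asking the user always yields the correct action, and uses the fraction of steps with a confirmation request as a stand-in for annoyance. The names `Step`, `run_episode`, and `sweep`, and all numbers, are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Hypothetical agent uncertainty about the user's preference, in [0, 1].
    uncertainty: float
    # Whether acting autonomously at this step would match the user's preference.
    correct: bool


def run_episode(steps, threshold):
    """Return (success_rate, confirmation_rate) for one confirmation threshold.

    Assumption: a confirmation request always results in the correct action,
    at the cost of one unit of user burden.
    """
    successes = 0
    confirmations = 0
    for step in steps:
        if step.uncertainty > threshold:
            # Too uncertain: ask the user before acting.
            confirmations += 1
            successes += 1
        elif step.correct:
            # Confident enough: act autonomously, and the action was right.
            successes += 1
    n = len(steps)
    return successes / n, confirmations / n


def sweep(steps, thresholds):
    """Trace the success vs. confirmation-rate trade-off over thresholds."""
    return [(t, *run_episode(steps, t)) for t in thresholds]


# Illustrative toy data (not taken from the benchmark).
steps = [
    Step(0.10, True),
    Step(0.20, True),
    Step(0.40, False),
    Step(0.45, True),
    Step(0.70, False),
]
frontier = sweep(steps, [0.0, 0.25, 0.5, 1.0])

for threshold, success, confirm in frontier:
    print(f"threshold={threshold:.2f} success={success:.2f} confirm={confirm:.2f}")
```

In this toy setting, a low threshold buys success with many confirmation requests, while a high threshold removes the user burden but lets wrong autonomous actions through. Choosing an operating point amounts to picking a row of `frontier`.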