Poster
Learning Spatial-Semantic Features for Robust Video Object Segmentation
Xin Li · Deshui Miao · Zhenyu He · Yaowei Wang · Huchuan Lu · Ming-Hsuan Yang
Hall 3 + Hall 2B #79
Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address these issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part, which associates global semantic features with local spatial features to provide a comprehensive target representation. In addition, we develop a masked cross-attention module that generates object queries focusing on the most discriminative parts of the target objects during query propagation, alleviating noise accumulation and ensuring effective long-term propagation. Experimental results show that the proposed method sets new state-of-the-art performance on multiple datasets, including DAVIS 2017 test (87.8%), YouTube-VOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), demonstrating the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.
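To make the two described components more concrete, below is a minimal PyTorch sketch of how a spatial-semantic fusion block and a masked cross-attention query-update step might look. The class names, tensor shapes, and masking rule are illustrative assumptions based only on the abstract, not the authors' released implementation; the masked attention follows the common convention of blocking query-to-pixel attention outside a predicted target mask.

```python
import torch
import torch.nn as nn

class SpatialSemanticBlock(nn.Module):
    """Illustrative sketch: fuse a global semantic embedding (pooled features)
    with local spatial dependencies (depthwise conv) into one representation."""
    def __init__(self, dim):
        super().__init__()
        self.semantic = nn.Linear(dim, dim)  # global semantic embedding (assumed form)
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local spatial deps
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, C, H, W) backbone features for one frame
        g = self.semantic(feats.mean(dim=(2, 3)))   # (B, C) global semantic context
        s = self.spatial(feats)                     # (B, C, H, W) local structure
        fused = s + g[:, :, None, None]             # broadcast-add global context
        return self.norm(fused.flatten(2).transpose(1, 2))  # (B, HW, C)

class MaskedCrossAttention(nn.Module):
    """Illustrative sketch: object queries attend only to pixel features inside
    a predicted target mask, so query updates focus on discriminative target
    regions instead of accumulating background noise across frames."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, pixel_feats, blocked):
        # queries:     (B, Q, C)  object queries propagated from earlier frames
        # pixel_feats: (B, HW, C) flattened per-frame features
        # blocked:     (B, Q, HW) bool, True where attention is disallowed
        #              (e.g., pixels outside the current mask prediction)
        # Guard: if a query's mask blocks every pixel, unblock it to avoid NaNs.
        all_blocked = blocked.all(dim=-1, keepdim=True)
        blocked = blocked & ~all_blocked
        attn_mask = blocked.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(queries, pixel_feats, pixel_feats, attn_mask=attn_mask)
        return self.norm(queries + out)

# Toy usage with hypothetical shapes:
B, Q, C, H, W = 2, 4, 256, 32, 32
feats = SpatialSemanticBlock(C)(torch.randn(B, C, H, W))       # (B, HW, C)
queries = torch.randn(B, Q, C)
blocked = torch.rand(B, Q, H * W) > 0.5  # stand-in for a thresholded mask prediction
queries = MaskedCrossAttention(C)(queries, feats, blocked)     # refined queries
```

In this reading, restricting each query's receptive field to its own mask prediction is what limits noise accumulation over long videos; the fully-blocked-query guard is a practical detail assumed here, since an all-True attention mask would produce NaN attention weights.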