AMLRIS: Alignment-Aware Masked Learning for Referring Image Segmentation
Abstract
Referring Image Segmentation (RIS) aims to segment the object in an image that is uniquely referred to by a natural language expression. However, RIS training data often contain hard-to-align and instance-specific visual signals; optimizing on such pixels injects misleading gradients and drives the model in the wrong direction. By explicitly estimating pixel-level vision–language alignment, the learner can suppress low-alignment regions, concentrate on reliable cues, and acquire more generalizable alignment features. In this paper, we propose Alignment-Aware Masked Learning (AML), a simple yet effective training strategy that quantifies region–referent alignment (PMME) and filters out unreliable pixels during optimization (AFM). Specifically, for each sample we first compute a similarity map between visual and textual features, and then mask out pixels falling below an adaptive similarity threshold, thereby excluding poorly aligned regions from the training process. AML requires no architectural changes and incurs no inference overhead, while directing optimization toward regions aligned with the textual description. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that AML achieves state-of-the-art results across all eight splits; beyond improving RIS performance, AML also enhances the model's robustness to diverse descriptions and scenarios.
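The following is a minimal PyTorch sketch of the masking idea described in the abstract: compute a pixel-level similarity map between visual features and a pooled expression embedding, derive an adaptive per-sample threshold, and exclude low-alignment pixels from the loss. The function name alignment_masked_loss, the quantile-based threshold, and the binary cross-entropy loss are illustrative assumptions, not the paper's exact PMME/AFM formulation.

```python
import torch
import torch.nn.functional as F

def alignment_masked_loss(pixel_feats, text_feat, logits, target, quantile=0.3):
    """Sketch of alignment-aware masked training (assumed details).

    pixel_feats: (B, C, H, W) visual features
    text_feat:   (B, C) pooled referring-expression embedding
    logits:      (B, 1, H, W) predicted segmentation logits
    target:      (B, 1, H, W) ground-truth mask
    quantile:    per-sample cutoff; pixels whose similarity falls below this
                 quantile are excluded from optimization (assumed value)
    """
    # Pixel-level vision-language similarity map (cosine similarity)
    v = F.normalize(pixel_feats, dim=1)                     # (B, C, H, W)
    t = F.normalize(text_feat, dim=1)[..., None, None]      # (B, C, 1, 1)
    sim = (v * t).sum(dim=1, keepdim=True)                  # (B, 1, H, W)

    # Adaptive per-sample threshold: keep only sufficiently aligned pixels
    thresh = torch.quantile(sim.flatten(1), quantile, dim=1).view(-1, 1, 1, 1)
    keep = (sim >= thresh).float()                           # (B, 1, H, W)

    # Per-pixel loss, with poorly aligned pixels removed from the average
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```

Because the mask only reweights the training loss, the network architecture and the inference path are unchanged, consistent with the abstract's claim of no inference overhead.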