WIMFRIS: WIndow Mamba Fusion and Parameter Efficient Tuning for Referring Image Segmentation
Abstract
Existing Parameter-Efficient Tuning (PET) methods for Referring Image Segmentation (RIS) primarily focus on layer-wise feature alignment, often neglecting the crucial role of a neck module for the intermediate fusion of aggregated multi-scale features, which creates a significant performance bottleneck. To address this limitation, we introduce WIMFRIS, a novel framework that establishes a powerful neck architecture alongside a simple yet effective PET strategy. At its core is our proposed HMF block, which first aggregates multi-scale features and then employs a novel WMF module to perform effective intermediate fusion. This WMF module leverages non-overlapping window partitioning to mitigate the information decay problem inherent in SSMs while ensuring rich local-global context interaction. Furthermore, our PET strategy enhances primary alignment with a MTA for robust textual priors, a MSA for precise vision-language fusion, and learnable emphasis parameters for adaptive stage-wise feature weighting. Extensive experiments demonstrate that WIMFRIS achieves new state-of-the-art performance across all public RIS benchmarks.