Enabling True Global Perception in State Space Models for Visual Tasks
Abstract
Despite the importance of global contextual modeling in visual tasks, a rigorous mathematical definition remains absent, and the concept is still largely described in heuristic or empirical terms. Existing methods either rely on computationally expensive attention mechanisms or are constrained by the recursive modeling nature of State Space Models (SSMs), making it challenging to achieve both efficiency and true global perception. To address this, we first propose a mathematical definition of global modeling for visual images, providing a theoretical foundation for designing globally aware and interpretable models. Based on an in-depth analysis of SSMs and frequency-domain modeling principles, we construct a complete theoretical framework that overcomes, from a frequency-domain perspective, the limitations imposed by the recursive modeling mechanism of SSMs, thereby adapting SSMs for global perception in image modeling. Guided by this framework, we design the Global-aware SSM (GSSM) module and formally prove that it satisfies the definitional requirements of global image modeling. GSSM leverages a Discrete Fourier Transform (DFT)-based modulation mechanism that provides precise front-end control over the SSM's modeling behavior, enabling efficient global image modeling with linear-logarithmic complexity. Building upon GSSM, we develop GMamba, a plug-and-play module that can be seamlessly integrated at any stage of Convolutional Neural Networks (CNNs). Extensive experiments on multiple tasks, including object detection, semantic segmentation, and instance segmentation, and across diverse model architectures, demonstrate that GMamba consistently outperforms existing global modeling modules, validating both the effectiveness of our theoretical framework and the rigor of the proposed definition.
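To make the abstract's description concrete, the sketch below illustrates the general pattern it refers to: a DFT-based modulation applied to the input sequence before an SSM-style recurrent scan, at O(L log L) cost for the transform. This is a minimal illustration under stated assumptions (PyTorch, a toy diagonal recurrence, a hypothetical GSSMBlock class and spectral_gate parameter); it is not the authors' GSSM or GMamba implementation.

```python
import torch
import torch.nn as nn


class GSSMBlock(nn.Module):
    """Illustrative sketch only: DFT-based front-end modulation feeding a toy SSM scan."""

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        # Hypothetical learnable spectral gate: one complex weight per frequency
        # bin and channel, stored as (real, imag) pairs.
        self.spectral_gate = nn.Parameter(torch.ones(seq_len // 2 + 1, dim, 2))
        # Toy diagonal SSM parameters (per-channel decay plus projections).
        self.log_decay = nn.Parameter(torch.zeros(dim))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. flattened image patches.
        # 1) DFT along the sequence axis: O(L log L).
        X = torch.fft.rfft(x, dim=1)
        gate = torch.view_as_complex(self.spectral_gate)
        # 2) Frequency-domain modulation ("front-end control" of the SSM input).
        x_mod = torch.fft.irfft(X * gate, n=x.shape[1], dim=1)
        # 3) A toy linear recurrent (SSM-style) scan over the modulated sequence;
        #    written sequentially for clarity, not efficiency.
        decay = torch.sigmoid(self.log_decay)
        u = self.in_proj(x_mod)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):
            h = decay * h + (1.0 - decay) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1)
        return self.out_proj(y) + x  # residual connection


# Usage example on a dummy sequence of 64 tokens with 32 channels.
block = GSSMBlock(dim=32, seq_len=64)
out = block(torch.randn(2, 64, 32))
print(out.shape)  # torch.Size([2, 64, 32])
```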