Mamba-3: Improved Sequence Modeling using State Space Principles
Abstract
Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While current Transformer models deliver strong quality, their quadratic compute and linear memory requirements make inference expensive. This has spurred the development of sub-quadratic models with reduced compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) view of linear models: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation that improves model performance without increasing decode latency. Together with architectural refinements, Mamba-3 achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points over the next best model (Gated DeltaNet), and the MIMO variant adds a further 1.2 points, for a total gain of 1.8 points. Across state-size experiments, Mamba-3 achieves perplexity comparable to Mamba-2's despite using half the state size. These results demonstrate that Mamba-3 advances the performance–efficiency frontier.
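As a reference point for improvement (1), the recurrence in prior SSM layers of this lineage (e.g., Mamba and Mamba-2) is obtained by discretizing a continuous linear system. The following is a minimal sketch of the standard zero-order-hold discretization with generic symbols, not Mamba-3's exact parameterization, which the abstract does not specify:

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),
\]

discretized with step size \(\Delta\) via zero-order hold,

\[
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B, \qquad
y_t = C\,h_t.
\]

Mamba-3's "more expressive recurrence derived from SSM discretization" modifies this discretization step; the equations above show only the baseline it builds on.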