From Theory to Throughput: Unifying Architectures and Scaling Deep Memory
Abstract
Transformers and modern linear RNNs have driven major advances in sequence modeling, yet their underlying mechanisms are often treated as entirely distinct paradigms. In this talk, we first introduce MIRAS, a unifying framework that recasts these diverse architectures as associative-memory modules governed by (inverse) online optimization. We then show how MIRAS opens up a rich, potentially non-Euclidean design space that draws on robust statistics and leads to novel, highly stable architectures. Realizing the full potential of these deep memory modules, however, requires overcoming three practical bottlenecks: token-level myopia, fixed memory capacity, and severe inefficiency in chunk-wise training. We explore how recent advances address each of these limitations: ATLAS, which uses the Omega rule and the Muon optimizer for optimal context memorization; Memory Caching, which allows RNN memory capacity to grow dynamically; and TNT, a novel parallel training paradigm that completely decouples training hardware throughput from inference resolution. Finally, we situate these innovations within the broader Nested Learning paradigm and propose a future in which both architectures and their optimizers function as an interconnected hierarchy of multi-timescale continual learning systems.
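
As a rough illustration of the associative-memory view, and not the MIRAS formulation itself (whose details this abstract does not give), a linear memory matrix M can be written to by taking one online gradient step per token on the squared recall error 0.5 * ||M k - v||^2. That step is the classic delta rule. The function names `read` and `write`, the learning rate `lr`, and the toy keys and values are all assumptions made for this sketch.

```python
def read(M, k):
    """Retrieve the value stored under key k, i.e. compute M @ k."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]


def write(M, k, v, lr=1.0):
    """One online gradient step on 0.5 * ||M k - v||^2 (the delta rule).

    The gradient with respect to M is (M k - v) k^T, so the update is
    M <- M - lr * (M k - v) k^T. Returns a new matrix; M is not mutated.
    """
    err = [pred - target for pred, target in zip(read(M, k), v)]
    return [
        [m_ij - lr * e_i * k_j for m_ij, k_j in zip(row, k)]
        for row, e_i in zip(M, err)
    ]


# Toy example (hypothetical data): two orthonormal keys, arbitrary values.
d = 3
M = [[0.0] * d for _ in range(d)]
k1, v1 = [1.0, 0.0, 0.0], [0.5, -1.0, 2.0]
k2, v2 = [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]

M = write(M, k1, v1)
M = write(M, k2, v2)

recalled_1 = read(M, k1)
recalled_2 = read(M, k2)
```

In this toy setting the keys are orthonormal and `lr` is 1, so each value is recalled without interference from the other write. Keys that overlap would partially overwrite one another, which is the kind of behavior that different choices of memory objective and update rule trade off.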