

Poster

Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin · Maximilian Beck · Korbinian Pöppel · Sepp Hochreiter · Johannes Brandstetter

Hall 3 + Hall 2B #90
[ Project Page ]
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Transformers are widely used as generic backbones in computer vision, despite having been initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this paper, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks in which odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. ViL achieves strong performance on classification, transfer learning, and segmentation tasks, as well as a beneficial pre-training cost-to-performance trade-off. Experiments show that ViL holds promise for further deployment as a new generic backbone for computer vision architectures.
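To make the alternating-direction scheme concrete, here is a minimal sketch of the block traversal described in the abstract: odd-numbered blocks scan the patch-token sequence in forward order, even-numbered blocks scan it in reverse. It assumes a generic sequential mixer as a stand-in for the actual xLSTM block (the real blocks use exponential gating and a matrix memory); `SequenceMixer`, `ViLSketch`, and all hyperparameters are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SequenceMixer(nn.Module):
    """Hypothetical stand-in for an xLSTM block: a causal cumulative
    average with a residual connection, NOT the actual xLSTM."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        h = self.proj(self.norm(x))
        # Causal mixing: running mean over the sequence dimension.
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return x + h.cumsum(dim=1) / counts


class ViLSketch(nn.Module):
    """Alternating-direction stack over patch tokens, as in the abstract:
    odd blocks go top-to-bottom, even blocks bottom-to-top."""

    def __init__(self, dim=192, depth=12, num_classes=1000, patch=16, img=224):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(SequenceMixer(dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):  # images: (batch, 3, img, img)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        x = x + self.pos_embed
        for i, block in enumerate(self.blocks):
            if i % 2 == 1:
                # Even-numbered block (1-indexed): reverse the token
                # sequence, mix, then restore the original order.
                x = block(x.flip(1)).flip(1)
            else:
                # Odd-numbered block: process tokens top to bottom.
                x = block(x)
        return self.head(self.norm(x.mean(dim=1)))  # mean-pooled classification


model = ViLSketch()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

Flipping the sequence before and after every second block gives each token access to context from both directions across the depth of the network while each individual block remains a cheap one-directional scan; the official implementation is available via the project page linked above.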
