
Poster

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Yuchen Duan · Weiyun Wang · Zhe Chen · Xizhou Zhu · Lewei Lu · Tong Lu · Yu Qiao · Hongsheng Li · Jifeng Dai · Wenhai Wang

Hall 3 + Hall 2B #124
[ Project Page ]
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model that builds upon the RWKV architecture from the NLP field with key modifications tailored specifically for vision tasks. Similar to the Vision Transformer (ViT), our model demonstrates robust global processing capabilities, efficiently handles sparse inputs such as masked images, and can scale up to accommodate both large-scale parameters and extensive datasets. Its distinctive advantage is its reduced spatial aggregation complexity, enabling seamless processing of high-resolution images without the need for window operations. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and achieves significantly faster speeds and lower memory usage when processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models while maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code and models are available at https://github.com/OpenGVLab/Vision-RWKV.
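
To illustrate the "reduced spatial aggregation complexity" the abstract refers to, the sketch below contrasts ViT's O(N^2) pairwise attention with a bidirectional, distance-decayed token mix computed in O(N) via two prefix scans. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the actual VRWKV block (see the linked repository) uses learned per-channel decay, a bonus term for the current token, Q-Shift, and output gating.

```python
# Illustrative sketch only (not the authors' implementation): a bidirectional,
# distance-decayed token mix computed with linear-time prefix scans, showing
# how RWKV-style aggregation avoids the O(N^2) pairwise attention of ViT.
import numpy as np

def bidirectional_decay_mix(k, v, decay=0.9):
    """k, v: (N, C) key/value projections of N flattened patch tokens.
    Returns an (N, C) mix where token t aggregates token i with weight
    decay**|t - i| * exp(k_i), computed in O(N) via forward/backward scans."""
    n, c = v.shape
    w = np.exp(k - k.max())               # positive per-token weights, stabilized
    fwd_num = np.zeros((n, c)); fwd_den = np.zeros((n, c))
    bwd_num = np.zeros((n, c)); bwd_den = np.zeros((n, c))

    acc_n = np.zeros(c); acc_d = np.zeros(c)
    for t in range(n):                    # forward scan: accumulate tokens i <= t
        acc_n = decay * acc_n + w[t] * v[t]
        acc_d = decay * acc_d + w[t]
        fwd_num[t], fwd_den[t] = acc_n, acc_d

    acc_n = np.zeros(c); acc_d = np.zeros(c)
    for t in range(n - 1, -1, -1):        # backward scan: accumulate tokens i >= t
        acc_n = decay * acc_n + w[t] * v[t]
        acc_d = decay * acc_d + w[t]
        bwd_num[t], bwd_den[t] = acc_n, acc_d

    # Combine both directions; the current token is counted in both scans,
    # so subtract one copy of its contribution.
    num = fwd_num + bwd_num - w * v
    den = fwd_den + bwd_den - w
    return num / (den + 1e-6)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, dim = 196, 64               # e.g. 14x14 patches of a 224x224 image
    k = rng.standard_normal((n_tokens, dim))
    v = rng.standard_normal((n_tokens, dim))
    print(bidirectional_decay_mix(k, v).shape)  # (196, 64)
```

Because the scans touch each token once per direction, cost and memory grow linearly with the number of patches, which is why such aggregation can be applied to high-resolution inputs without windowing.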
