Poster

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

zehan wang · Ziang Zhang · Minjie Hong · Hang Zhang · Luping Liu · Rongjie Huang · Xize Cheng · Shengpeng Ji · Tao Jin · Hengshuang Zhao · Zhou Zhao

Hall 3 + Hall 2B #625
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Recently, human-computer interaction across multiple modalities has shown promising applications, as exemplified by GPT-4o and Gemini. Meanwhile, multimodal representation models have emerged as the foundation for these versatile understanding and generation pipelines. Models such as CLIP, CLAP, and ImageBind map their specialized modalities into respective joint spaces. To construct a high-quality omni representation space that is both shared across modalities and expert in each of them, we propose merging these advanced models into a unified space at scale. With this insight, we present OmniBind, a family of advanced multimodal joint representation models built by fusing the knowledge of 14 pre-trained spaces, supporting 3D, audio, image, video, and language inputs. To alleviate interference between different knowledge sources in the integrated space, we dynamically assign weights to the different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing the spaces requires only lightweight networks, OmniBind is extremely training-efficient. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications such as any-query and composable multimodal understanding.
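To make the routing idea concrete, below is a minimal sketch (not the authors' code) of how a lightweight router could weight and combine embeddings from several pre-trained spaces. All names here (SpaceRouter, the projector/router sizes, and the example dimensions) are illustrative assumptions; the abstract only states that routers dynamically assign weights to the bound spaces.

```python
# Illustrative sketch, assuming each pre-trained space is first projected into a
# shared dimension and a small router predicts softmax weights over the spaces.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceRouter(nn.Module):
    def __init__(self, source_dims, shared_dim=1024):
        super().__init__()
        # One linear projector per pre-trained space (binding the spaces).
        self.projectors = nn.ModuleList(
            [nn.Linear(d, shared_dim) for d in source_dims]
        )
        # Lightweight router: predicts one weight per space from the
        # concatenated projected embeddings (an illustrative design choice).
        self.router = nn.Sequential(
            nn.Linear(shared_dim * len(source_dims), 256),
            nn.ReLU(),
            nn.Linear(256, len(source_dims)),
        )

    def forward(self, space_embeddings):
        # space_embeddings: list of tensors, one per pre-trained space,
        # each of shape (batch, source_dims[i]).
        projected = [proj(e) for proj, e in zip(self.projectors, space_embeddings)]
        stacked = torch.stack(projected, dim=1)            # (batch, n_spaces, shared_dim)
        weights = F.softmax(self.router(stacked.flatten(1)), dim=-1)  # (batch, n_spaces)
        # Weighted combination of the projected embeddings gives the fused representation.
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)
        return F.normalize(fused, dim=-1), weights

# Usage with two hypothetical source spaces of 512 and 768 dimensions.
model = SpaceRouter(source_dims=[512, 768])
fused, weights = model([torch.randn(4, 512), torch.randn(4, 768)])
```

In the paper's setting, such routers would be trained with the two stated objectives (cross-modal overall alignment and language representation decoupling), which are not reproduced here.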
