ICLR Poster Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

Dexuan Ding · Lei Wang · Liyun Zhu · Tom Gedeon · Piotr Koniusz

Hall 3 + Hall 2B #457

Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially given the availability of powerful pre-trained models such as vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or from misalignment of features across domains or modalities. In this paper, we shift from the high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels (e.g., clip, frame, patch, and token). To capture deeper interactions, we expand graphs through iterative graph relationship updates and introduce a learnable graph fusion operator that integrates these expanded relationships for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
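
The pipeline described in the abstract (relationship graphs, graph expansion, learnable fusion) can be illustrated with a small NumPy sketch. This is one possible reading under stated assumptions, not the authors' implementation: the abstract does not specify how graphs are built or expanded, so cosine similarity, matrix powers as the "iterative relationship updates", and a weighted sum of element-wise products as the fusion operator are all assumptions made here for illustration. The function names (`relationship_graph`, `expand_graph`, `fuse_graphs`) and the fixed `weights` are hypothetical; in a trained model the weights would be learned.

```python
import numpy as np

def relationship_graph(features):
    """Cosine-similarity graph over the rows of `features` (n_items x dim).

    Assumption: cosine similarity stands in for the paper's relationship score.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def expand_graph(graph, num_steps):
    """Return [A, A^2, ..., A^num_steps].

    Assumption: matrix powers model the iterative relationship updates.
    """
    expanded = [graph]
    for _ in range(num_steps - 1):
        expanded.append(expanded[-1] @ graph)
    return expanded

def fuse_graphs(graphs_a, graphs_b, weights):
    """Weighted sum of element-wise products: sum_ij weights[i, j] * (A_i * B_j).

    Each entry of the result is a multilinear polynomial in the corresponding
    relationship scores of the two modalities.
    """
    fused = np.zeros_like(graphs_a[0])
    for i, graph_a in enumerate(graphs_a):
        for j, graph_b in enumerate(graphs_b):
            fused += weights[i, j] * graph_a * graph_b
    return fused

# Toy data: 6 frames described by two modalities of different dimensionality.
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 512))
text = rng.normal(size=(6, 128))

graphs_v = expand_graph(relationship_graph(visual), num_steps=2)
graphs_t = expand_graph(relationship_graph(text), num_steps=2)

# Placeholder for learnable fusion weights (fixed here for illustration).
weights = np.full((2, 2), 0.25)
fused = fuse_graphs(graphs_v, graphs_t, weights)
```

Although the two modalities have different feature dimensions (512 and 128 in this toy setup), both graphs are 6 x 6, which is the sense in which fusion in graph space operates in a homogeneous space: the fusion step never has to align the raw feature dimensions.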
