

Poster in Workshop: Workshop on Large Language Models for Agents

Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution using Parameter Augmentation

Samuel Osebe · Prashan Wanigasekara · Thanh Tran · Thomas Gueudre


Abstract: The context of modern smart voice assistants is often multi-modal, with images, audio, and video content consumed by users simultaneously. In such a setup, co-reference resolution is especially challenging, as references run across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human for a shopping use case. We propose and evaluate a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a $4.9\%$ absolute F1 improvement over a baseline while reducing the number of trained parameters by $13.3\%$ on a complex co-referencing task over a multi-turn shopping dataset.
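The abstract does not spell out the architecture, so the following is only a minimal, hypothetical PyTorch sketch of what "parameter augmentation" for multi-modal embedding alignment could look like: a small set of newly added trainable parameters (here an assumed image-to-text projection and a few learned marker embeddings) attached to a frozen text embedding table, so that only the augmented parameters are updated during training. All names, sizes, and the layout of the combined sequence are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch of parameter augmentation: new trainable parameters
# (image projection + learned marker tokens) alongside a frozen text
# embedding table. Not the paper's actual architecture.
import torch
import torch.nn as nn


class ParameterAugmentedEmbedder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, d_image=512, n_aug_tokens=8):
        super().__init__()
        # Frozen uni-modal (text) embedding table, standing in for the base LLM's embeddings.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_embed.weight.requires_grad = False
        # Augmented (new, trainable) parameters: a projection aligning image
        # features with the text embedding space, plus learned marker tokens.
        self.image_proj = nn.Linear(d_image, d_model)
        self.aug_tokens = nn.Parameter(torch.randn(n_aug_tokens, d_model) * 0.02)

    def forward(self, token_ids, image_feats):
        # token_ids: (batch, seq_len) ints; image_feats: (batch, n_images, d_image)
        text = self.text_embed(token_ids)        # (B, T, d_model), frozen
        imgs = self.image_proj(image_feats)      # (B, N, d_model), trainable
        aug = self.aug_tokens.unsqueeze(0).expand(token_ids.size(0), -1, -1)
        # Prepend the augmented markers and projected image embeddings to the text tokens.
        return torch.cat([aug, imgs, text], dim=1)


# Only the augmented parameters are trained; the base embeddings stay frozen.
model = ParameterAugmentedEmbedder()
trainable = [p for p in model.parameters() if p.requires_grad]
seq = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 3, 512))
print(seq.shape, sum(p.numel() for p in trainable))
```

The point of the sketch is only to show where the parameter savings could come from: the base embedding table (and, in a full model, the LLM weights) would stay frozen, and only the small added projection and marker parameters would receive gradients.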
