Poster in Workshop on Reasoning and Planning for Large Language Models
Unveiling and Enhancing Multimodal In-context Learning of Large Vision-language Models
Yanshu Li
Abstract:
With the rise of Large Vision-Language Models (LVLMs), multimodal in-context learning (ICL) has emerged as a crucial capability due to its vast application potential. However, the complexity of multimodal inputs and the sensitivity of ICL to input configuration make in-context demonstration (ICD) selection and prompt construction highly challenging. To fully unlock the potential of LVLMs, we first investigate the mechanisms of multimodal ICL and identify the critical role of task mapping in ICD sequences for efficient multimodal learning. To achieve precise task mapping in ICD sequence configuration, we propose $SabER$, a four-layer decoder-only model with task-aware attention that autoregressively selects ICDs from a demonstration library to construct prompts. Our specially designed modules enable fine-grained feature extraction and interpretation across modalities, iteratively refining task mapping to generate optimal ICD sequences. Extensive experiments across five LVLMs and nine datasets demonstrate the superior performance of our model.
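To make the autoregressive selection loop concrete, below is a minimal, hypothetical sketch of choosing demonstrations with a small decoder-only scorer. The names (`SelectorDecoder`, `select_icds`), the embedding dimension, and the greedy dot-product scoring are illustrative assumptions only; they are not the paper's SabER architecture or its task-aware attention.

```python
# Hypothetical sketch of autoregressive ICD selection, NOT the SabER implementation.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class SelectorDecoder(nn.Module):
    """Tiny decoder-only transformer that produces a context vector used to
    score which demonstration to append next, conditioned on the query and
    the ICDs selected so far."""

    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # A causal mask (applied in forward) makes this stack behave decoder-only.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(d_model, d_model)  # map context into scoring space

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, steps, d_model) = [query emb, icd_1 emb, ..., icd_t emb]
        T = seq.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(seq.device)
        h = self.blocks(seq, mask=causal)
        return self.score(h[:, -1])  # context vector after the latest step


@torch.no_grad()
def select_icds(model, query_emb, library_embs, k: int = 4):
    """Greedily select k demonstrations from a library, one step at a time.

    query_emb:    (d_model,) fused multimodal embedding of the test query
    library_embs: (N, d_model) pre-computed embeddings of candidate ICDs
    """
    chosen, available = [], set(range(library_embs.size(0)))
    seq = query_emb.unsqueeze(0).unsqueeze(0)  # (1, 1, d)
    for _ in range(k):
        ctx = model(seq).squeeze(0)                # (d,)
        scores = library_embs @ ctx                # (N,) similarity scores
        mask = torch.full_like(scores, float("-inf"))
        mask[list(available)] = 0.0                # block already-chosen ICDs
        idx = int((scores + mask).argmax())
        chosen.append(idx)
        available.discard(idx)
        # Append the chosen demonstration and re-score on the next step.
        seq = torch.cat([seq, library_embs[idx].view(1, 1, -1)], dim=1)
    return chosen


if __name__ == "__main__":
    d = 256
    model = SelectorDecoder(d_model=d)
    query = torch.randn(d)
    library = torch.randn(100, d)
    print(select_icds(model, query, library, k=4))  # e.g. [17, 4, 88, 23]
```

In this toy setup the decoder conditions only on the query and the demonstrations chosen so far, and each step re-scores the remaining library entries; the paper's model additionally performs fine-grained multimodal feature extraction and task-aware attention to refine task mapping, which this sketch does not attempt to reproduce.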