MIMIC-Bench: Exploring the User-Like Thinking and Mimicking Capabilities of Multimodal Large Language Models
Abstract
The rapid advancement of multimodal large language models (MLLMs) has greatly promoted video understanding, and numerous works have been proposed to explore and benchmark the cognition and basic visual reasoning capabilities of MLLMs. However, practical applications on social media platforms demand MLLMs that can emulate user-like thinking and behavior when interpreting user-generated videos, which has rarely been studied. To bridge this gap and move closer to practically useful general artificial intelligence (AI), we first construct MIMIC-Data, a large-scale dataset containing over 150K user-shared videos with associated information such as captions, tags, and comments. We then present MIMIC-Bench, a large-scale benchmark built upon 4,000 curated user-shared videos from MIMIC-Data, designed to evaluate the user-like thinking and mimicking capabilities of MLLMs in real-world video contexts. MIMIC-Bench not only covers user-like thinking challenges, including creator intent understanding and user feedback interpretation, but also introduces a novel comment imitation task that assesses whether MLLMs can generate human-like responses to video content. Building on MIMIC-Data and MIMIC-Bench, we develop MIMIC-Chat, which integrates spatial and temporal features into a large language model and is fine-tuned to perform user-like thinking and mimicking tasks. Extensive experiments on 24 existing MLLMs and our MIMIC-Chat model show that current MLLMs exhibit limited capabilities for human-like thinking and response generation, while MIMIC-Chat performs better to some extent. We hope MIMIC-Bench can contribute to the advancement of human-aligned video understanding in the multimodal era. MIMIC-Data, MIMIC-Bench, and MIMIC-Chat will be released upon publication.