

Poster

Is Your Video Language Model a Reliable Judge?

Ming Liu · Wensheng Zhang

Hall 3 + Hall 2B #515
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

As video language models (VLMs) gain more applications in various scenarios, the need for robust and scalable evaluation of their performance becomes increasingly critical. Traditional human expert-based evaluation of VLMs is limited in consistency and scalability, which has sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator, but this approach can be unreliable or biased because such a model may lack the ability to fully understand the content and may have inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation: the inclusion of less reliable judges can introduce noise, undermining the overall reliability of the outcomes. To explore the factors that impact evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. These findings stress the limitations of collective-thought approaches and highlight the need for more advanced methods that account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.
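The aggregation idea the abstract refers to can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' pipeline: it assumes each judge emits a categorical verdict and a numeric rating, and combines them by majority vote and mean score. All names (`majority_vote`, `mean_score`, the judge labels) and the toy data are hypothetical.

```python
# Illustrative sketch of "collective thoughts" aggregation: several VLM
# judges each score a model response, and the final verdict is an
# aggregate of their individual judgments. Toy data only.

from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common categorical verdict among the judges."""
    return Counter(labels).most_common(1)[0][0]


def mean_score(scores):
    """Average the judges' numeric ratings (e.g., 1-5 quality scores)."""
    return mean(scores)


if __name__ == "__main__":
    # Hypothetical mixed pool: two reliable judges and one unreliable
    # judge whose noisy output drags down the aggregate.
    verdicts = {
        "reliable_judge_a": "correct",
        "reliable_judge_b": "correct",
        "unreliable_judge": "incorrect",
    }
    ratings = {
        "reliable_judge_a": 4,
        "reliable_judge_b": 5,
        "unreliable_judge": 1,
    }

    print("Majority verdict:", majority_vote(verdicts.values()))
    print("Mean rating:", mean_score(ratings.values()))
```

With equal weighting, an unreliable judge contributes as much as a reliable one, which is the failure mode the abstract highlights: the aggregate can be pulled toward noise unless individual judge reliability is accounted for.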
