Poster in Workshop: 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation
Seongyun Lee · Seungone Kim · Sue Park · Geewook Kim · Minjoon Seo
Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging: it requires checking not only whether the VLM follows the given instruction but also whether its text output is properly grounded in the given image. Inspired by the recent approach of evaluating LMs with LMs, we propose to evaluate VLMs with VLMs. To this end, we present a new multi-modal feedback dataset called the Perception Collection, encompassing 15K customized score rubrics that users might care about during assessment. Using the Perception Collection, we train Prometheus-Vision, the first open-source VLM specialized for fine-grained evaluation. Among open-source VLM baselines, Prometheus-Vision shows the highest Pearson correlation with both human evaluators and GPT-4V, demonstrating its effectiveness for transparent and accessible evaluation. We open-source our code, dataset, and model at https://anonymous.4open.science/r/prometheus-vision-9D37.
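As an illustration of the agreement metric named in the abstract, the following is a minimal sketch of how a Pearson correlation between a judge model's scores and human scores could be computed. The function name `pearson` and the score lists are hypothetical and made up for this example; they are not taken from the paper's code or data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 rubric scores for five responses (illustrative only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]

print(pearson(human_scores, judge_scores))  # 0.8
```

A value near 1 would indicate that the judge ranks and spaces responses much as the human raters do; the sketch says nothing about how the scores themselves are produced.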