THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics
Abstract
We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate Multimodal Large Language Models (MLLMs) on visual fraud reasoning in real-world academic scenarios. Compared with existing benchmarks, THEMIS introduces three major advancements. (1) Real-world Scenarios & Complexity: Our benchmark comprises over 4K questions spanning 7 scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 73.73% of its images exhibiting complex textures, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Task Diversity & Granularity: THEMIS systematically covers five challenging tasks and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulations, whose diversity and difficulty demand a high level of visual fraud reasoning from models. (3) Multi-dimensional Capability Evaluation: We establish a mapping from fraud tasks to five core visual fraud reasoning capabilities, enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these capabilities. Experiments on 11 leading MLLMs show that even the best-performing model falls below the passing threshold, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud detection. The data and code will be maintained at https://anonymous.4open.science/r/themis1638.