Evaluating LLM Judges in Cybersecurity Script Analysis
Abstract
Large Language Models (LLMs) are increasingly used as judges for evaluating natural language outputs. This paper examines which judge models produce evaluations that experts rate higher when the task is cybersecurity script analysis. Our newly constructed dataset of 1,000+ clean and malicious scripts, with expert-curated natural language summaries, serves as the reference for evaluating a behavioral script summarization task performed by a candidate LLM. Several judge LLMs are asked to evaluate the responses generated by the candidate LLM, using the human-written summaries as reference. Through manual assessment of the judge evaluations, we identify the models whose outputs experts rate higher in cybersecurity contexts and analyze the factors influencing judge quality, including self-preference bias and the effect of prompting strategy. Our publicly released dataset supports continued research in this domain, where accurate evaluation is increasingly vital.