Poster in Workshop: Agents in the Wild: Safety, Security, and Beyond

Evaluating LLM Judges in Cybersecurity Script Analysis

Alexandra-Daniela Damir ⋅ Apostu Alexandru-Mihai ⋅ Diana Bolocan ⋅ Andrei Preda ⋅ Ioana Croitoru ⋅ Mihaela Gaman ⋅ Laura Vasilie ⋅ Bilal Issa ⋅ Monica-Nicoleta Pascu


Abstract

Large Language Models (LLMs) are increasingly used as judges for evaluating natural language outputs. Building on this trend, this paper examines which models, when acting as judges of cybersecurity script analyses, generate responses that experts rank higher. Our newly constructed dataset of 1,000+ clean and malicious scripts, each paired with an expert-curated natural language summary, serves as the reference for evaluating a behavioral script summarization task performed by a candidate LLM. Several judge LLMs are asked to evaluate the responses generated by the candidate LLM, using the human-written responses as reference. Through manual assessment of the judge evaluations, we identify the models whose outputs experts rate higher in cybersecurity contexts and analyze the factors influencing judge quality, including self-preference bias and the effects of prompting strategy. Our publicly released dataset supports continued research in this domain, where accurate evaluation is increasingly vital.
