Prompt-Level Drift as an Operational Monitoring Problem: Schema Failure Cliffs and Judge-Version Risk in Artifact-Grounded Evaluation
Abstract
Prompt templates are routinely edited in deployed LLM systems and evaluation pipelines, often treated as “non-functional” refactors. We reframe prompt-level drift as an operational monitoring problem: minimal prompt perturbations can trigger discontinuous schema violations, degrade instruction following, and even shift conclusions when the judge prompt—the measurement instrument—changes. Methodologically, we study these failures through an artifact-grounded evaluation pipeline built on a fixed evidence snapshot, where preserved outputs, versioned judge/model slices, processed records, and summary tables form a one-way audit chain. Across six preserved judge/model slices under a baseline judge (three judge models scoring outputs from the other two providers), reducing contract salience from explicit to implicit increases schema hard-breaks (A structure= 0) from 8/48 (16.7%) to 34/48 (70.8%) and drops the mean total score from 7.69 to 2.75 (max total = 10). We further show judge-version sensitivity on identical preserved outputs: for one overlapping slice (ChatGPT-as-judge scoring Gemini outputs under implicit triggers), the hard-break rate shifts from 8/8 (v0) to 0/8 (v1/v2), and the mean total score shifts from 0.0 (v0) to 8.0 (v1) to 10.0 (v2). These patterns motivate CAO-style change control: pin prompt and judge versions, monitor interface validity, and prohibit pooling metrics across judge versions.