When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection
Lin Yulong ⋅ Pablo Bernabeu-Perez ⋅ Benjamin Arnav ⋅ Lennie Wells ⋅ Mary Phuong
Abstract
Embedding-based lie detectors are a natural approach to sandbagging detection: if a model is being deceptive, probes trained on deceptive vs. honest text should catch it. We provide the first test of this idea and find that it achieves 89\% recall (95\% CI: [82, 94]) in mixed-model evaluation but collapses to 2\% ([0, 8]) under same-model controls. The reason is that the probes learn model identity, not deception, and achieve high accuracy even on honest-vs-honest cross-model comparisons. We find an analogous confound in trusted monitoring, which conflates answer incorrectness with suspicion ($r = -0.66$). These failure modes were previously undiagnosed, even though both methods are actively proposed for safety evaluation. A confound-aware alternative (cross-context consistency) achieves 67\% recall at 5\% FPR, confirming that a detection signal exists, though the problem remains open. We provide precise confound diagnostics and actionable evaluation guidelines: future detection work must evaluate on same-model pairs, control for correctness, and test against honest-vs-honest baselines.
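The "recall at 5\% FPR" metric and the honest-vs-honest baseline can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `threshold_at_fpr` and `recall_at_fpr` are hypothetical, and it assumes only that a detector assigns each transcript a scalar suspicion score, with higher meaning more suspicious.

```python
import math


def threshold_at_fpr(negative_scores, target_fpr=0.05):
    """Return a threshold such that the fraction of negatives scoring
    strictly above it does not exceed target_fpr.

    Hypothetical helper for illustration only.
    """
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives that may be flagged without exceeding the target.
    allowed = math.floor(target_fpr * len(ranked))
    # Flagging only scores strictly above this value flags at most
    # `allowed` negatives, even when scores are tied.
    return ranked[allowed]


def recall_at_fpr(positive_scores, negative_scores, target_fpr=0.05):
    """Fraction of positives flagged at the threshold set on the negatives."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    threshold = threshold_at_fpr(negative_scores, target_fpr)
    flagged = sum(1 for score in positive_scores if score > threshold)
    return flagged / len(positive_scores)


# Toy scores, invented purely to exercise the functions.
honest_scores = [float(i) for i in range(20)]
sandbagging_scores = [19.5, 18.5, 10.0, 5.0]
toy_recall = recall_at_fpr(sandbagging_scores, honest_scores)
```

Under these assumptions, the same function doubles as a confound check. Pass honest outputs from one model as `negative_scores` and honest outputs from a different model as `positive_scores`. A "recall" far above the target FPR would suggest the detector separates models rather than behaviours, which is the confound the abstract describes. A same-model control instead draws both score lists from a single model.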