Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Abstract
Prior work suggests that large language models encode “evaluation awareness”, a claim often supported by probe-based evidence on benchmark data. However, in evaluation benchmarks, usage context is tightly correlated with prompt structure and genre. We test whether probe-based signals attributed to evaluation awareness persist once prompt format and genre are partially controlled. Using a controlled 2×2 dataset matrix and diagnostic rewrites, we find that linear probes respond overwhelmingly to benchmark-canonical structured formats and fail to generalize to free-form prompts, regardless of linguistic style. These results suggest that commonly used probe-based methodologies are insufficient to disentangle evaluation context from structural artifacts, highlighting a key limitation of current evidence for evaluation awareness.