Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Abstract
Activation monitoring, which probes a model's internal states with lightweight classifiers, is an emerging tool for AI safety. However, its worst-case robustness under a misalignment threat model, in which a model might learn to actively conceal its internal states, remains untested. We ask: could a model learn to evade previously unseen activation monitors? We demonstrate that fine-tuning can create Neural Chameleons: models capable of evading activation monitors zero-shot. Specifically, we fine-tune an LLM to evade monitors for benign concepts (e.g., HTML) when conditioned on the trigger phrase "You are being probed for {concept}". This learned mechanism generalizes zero-shot: substituting the concept with a safety-relevant term such as "deception" causes the model to evade previously unseen safety monitors, including monitors trained post hoc on the model's frozen weights. We validate this across diverse model families (Llama, Gemma, Qwen), finding that evasion is highly selective to the triggered concept and incurs minimal capability degradation. Mechanistically, we show that the model achieves evasion by shifting its activations into a low-dimensional subspace that avoids the probes' decision boundaries. Our work provides a proof of concept for this failure mode of activation monitoring under misalignment threat models.
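To make the abstract's notion of an activation monitor concrete, the following is a minimal sketch of a lightweight linear probe trained on hidden-state activations from a single layer. The hidden size, the synthetic activations, and the use of scikit-learn are illustrative assumptions for exposition, not the paper's actual pipeline; the paper's evasion result corresponds to such probe scores collapsing when the model is conditioned on the trigger phrase.

```python
# Minimal sketch of an activation monitor (illustrative assumptions only):
# a logistic-regression probe over residual-stream activations from one layer,
# trained to flag a concept such as "deception".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 4096  # assumed hidden size of the monitored layer

# Stand-in activations: rows are per-prompt activation vectors from a fixed
# layer; labels mark whether the concept is present in the prompt/response.
pos_acts = rng.normal(loc=0.5, scale=1.0, size=(200, d_model))
neg_acts = rng.normal(loc=-0.5, scale=1.0, size=(200, d_model))
X = np.vstack([pos_acts, neg_acts])
y = np.concatenate([np.ones(200), np.zeros(200)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# Monitoring at inference time: score a new activation vector against the
# probe's decision boundary. A Neural Chameleon would shift its activations
# so this score falls below the detection threshold for the triggered concept.
new_act = rng.normal(size=(1, d_model))
print("probe score:", probe.predict_proba(new_act)[0, 1])
```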