Endogenous Resistance to Activation Steering in Language Models
Alex McKenzie ⋅ Keenan Pepper ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Murat Cubuktepe ⋅ Michael Vaiana ⋅ Diogo de Lucena ⋅ Judd Rosenblatt ⋅ Michael Graziano
Abstract
Activation steering is increasingly used for AI safety interventions, but its reliability depends on whether models can resist such steering. We find that large language models can spontaneously resist task-misaligned activation steering during inference, recovering mid-generation even while steering remains active. We term this behavior Endogenous Steering Resistance (ESR). Steering Llama-3.3-70B and smaller models with sparse autoencoder (SAE) latents, we observe ESR substantially more often in Llama-3.3-70B (3.8\% vs.\ $<$1\% for the other models). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25\%. Meta-prompts enhance ESR 4$\times$, demonstrating controllability. These findings have dual implications for trustworthy AI: ESR could protect against adversarial manipulation, but it might also undermine beneficial activation-based safety interventions.
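For readers unfamiliar with the intervention studied here, activation steering typically adds a scaled direction vector (e.g., an SAE latent's decoder direction) to a model's hidden state during the forward pass. The following is a minimal illustrative sketch, not the paper's implementation; the function name `steer`, the dimensionality, and the random stand-in for an SAE decoder direction are all assumptions introduced for illustration.

```python
# Minimal sketch of activation steering: add a scaled unit direction
# to a hidden-state vector. The "SAE decoder direction" below is a
# random stand-in, not a real latent from the paper's models.
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Return the hidden state shifted by alpha along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d_model = 8                                  # toy hidden size
hidden = rng.normal(size=d_model)            # stand-in residual-stream vector
sae_decoder_dir = rng.normal(size=d_model)   # stand-in SAE latent direction
steered = steer(hidden, sae_decoder_dir, alpha=4.0)
```

In practice such an addition is applied at a chosen layer on every generated token; ESR as described above is the model recovering its original task behavior even while this perturbation stays active.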