Endogenous Resistance to Activation Steering in Language Models
Alex McKenzie ⋅ Keenan Pepper ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Murat Cubuktepe ⋅ Michael Vaiana ⋅ Diogo de Lucena ⋅ Judd Rosenblatt ⋅ Michael Graziano
Abstract
Large language models can resist task-misaligned activation steering during inference, recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer Llama-3.3-70B and smaller models, we find ESR occurs substantially more in Llama-3.3-70B (3.8% vs. <1% for others). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25%. Meta-prompts enhance ESR 4×, demonstrating controllability. These findings have dual safety implications: ESR could protect against adversarial manipulation but might interfere with beneficial activation-based safety interventions.
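For readers unfamiliar with the two interventions named in the abstract, the following is a minimal sketch of what steering with an SAE latent and ablating an SAE latent commonly mean. It is an illustration under stated assumptions, not the authors' implementation: the additive form of steering, the ReLU encoder, and all names (`steer`, `ablate`, `w_enc`, `b_enc`, `decoder_dir`, `strength`) are hypothetical and do not come from the paper.

```python
import numpy as np

# Hypothetical sketch: a single residual-stream vector h and one SAE latent,
# described by an encoder row (w_enc, b_enc) and a decoder direction.
# These shapes and formulas are assumptions, not taken from the paper.

def latent_activation(h, w_enc, b_enc):
    """Assumed ReLU encoder: how strongly the latent fires on h."""
    return max(0.0, float(w_enc @ h + b_enc))

def steer(h, decoder_dir, strength):
    """Additive steering: push h along the latent's decoder direction."""
    return h + strength * decoder_dir

def ablate(h, w_enc, b_enc, decoder_dir):
    """Ablation: subtract the latent's current contribution from h."""
    return h - latent_activation(h, w_enc, b_enc) * decoder_dir

# Toy usage with a 3-dimensional residual stream.
h = np.array([1.0, 2.0, 0.0])
direction = np.array([0.0, 1.0, 0.0])
steered = steer(h, direction, strength=3.0)
ablated = ablate(h, w_enc=direction, b_enc=0.0, decoder_dir=direction)
```

In this toy setting, steering moves the activation further along the chosen direction, while ablation removes whatever the latent was already contributing. In a real model, either operation would be applied at a chosen layer on every generated token.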