Endogenous Resistance to Activation Steering in Language Models
Alex McKenzie ⋅ Keenan Pepper ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Murat Cubuktepe ⋅ Michael Vaiana ⋅ Diogo de Lucena ⋅ Judd Rosenblatt ⋅ Michael Graziano
Abstract
Activation steering is increasingly used for AI safety interventions, but its reliability depends on whether models can resist such steering. We find that large language models can spontaneously resist task-misaligned activation steering during inference, recovering mid-generation even while steering remains active. We term this behavior Endogenous Steering Resistance (ESR). Steering Llama-3.3-70B and smaller models with sparse autoencoder (SAE) latents, we observe ESR substantially more often in Llama-3.3-70B (3.8\% vs.\ $<$1\% for the other models). We identify 26 SAE latents causally linked to ESR: ablating them reduces ESR by 25\%. Meta-prompts enhance ESR 4$\times$, demonstrating controllability. These findings have dual implications for trustworthy AI: ESR could protect against adversarial manipulation, but it might also undermine beneficial activation-based safety interventions.
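For readers unfamiliar with the intervention studied here, activation steering typically adds a scaled direction vector (e.g., an SAE latent's decoder direction) to a model's hidden state during the forward pass. The following is a minimal illustrative sketch, not the paper's implementation; the function name `steer`, the dimensionality, and the random stand-in for an SAE decoder direction are all assumptions introduced for illustration.

```python
# Minimal sketch of activation steering: add a scaled unit direction
# to a hidden-state vector. The "SAE decoder direction" below is a
# random stand-in, not a real latent from the paper's models.
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Return the hidden state shifted by alpha along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d_model = 8                                  # toy hidden size
hidden = rng.normal(size=d_model)            # stand-in residual-stream vector
sae_decoder_dir = rng.normal(size=d_model)   # stand-in SAE latent direction
steered = steer(hidden, sae_decoder_dir, alpha=4.0)
```

In practice such an addition is applied at a chosen layer on every generated token; ESR as described above is the model recovering its original task behavior even while this perturbation stays active.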