Refusal-Orthogonal Gated Editing for Safer Localized LLM Adaptation
Abstract
As LLMs are increasingly adapted post-training under tight data and compute budgets, standard parameter-efficient fine-tuning (PEFT) can inadvertently degrade safety-critical refusal behavior. Prior work suggests refusal is concentrated in a small number of directions in representation space, yet most safety interventions either operate in weight space or rely on post-hoc steering, leaving it unclear how to adapt models while avoiding interference with refusal-sensitive representations. We propose Safety-Aware Gated Editing (SAGE), a two-stage method that (i) estimates a refusal subspace and (ii) learns sparse, gated additive and multiplicative activation edits for the target task, constraining every learned edit to the refusal-orthogonal subspace (the orthogonal complement of the estimated refusal subspace). The base model remains frozen, and the learned edits are applied at inference time. Across low-resource adaptation on ARC and BoolQ with Llama-3.1-8B-Instruct, SAGE maintains competitive task accuracy while consistently reducing attack success rates on StrongREJECT, AdvBench, and JailbreakBench relative to standard PEFT and unconstrained activation editing. These results suggest that lightweight geometry-based constraints can mitigate safety degradation without additional safety supervision or full-model fine-tuning.
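To make the geometric constraint concrete, the following is a minimal NumPy sketch of the two stages, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: the refusal direction is estimated as a difference of mean activations between harmful and harmless prompts, the subspace basis is orthonormalized with a QR decomposition, and the edit takes the form h' = h + P(g * (s * h + b)), where P projects onto the orthogonal complement of the refusal subspace. The names `refusal_basis`, `project_out`, and `gated_edit` are hypothetical.

```python
import numpy as np

def refusal_basis(directions):
    """Orthonormalize candidate refusal directions.

    directions: array of shape (k, d), one candidate direction per row.
    Returns an array of shape (d, k) with orthonormal columns.
    """
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q

def project_out(v, basis):
    """Remove the component of v that lies in span(basis)."""
    return v - (v @ basis) @ basis.T

def gated_edit(h, gate, add, scale, basis):
    """Apply a gated additive and multiplicative edit (assumed form).

    h:     (n, d) activations from the frozen base model
    gate:  (n, 1) gate values; zeros switch the edit off for that row
    add:   (d,) additive edit vector
    scale: (d,) multiplicative edit vector
    The combined edit is projected so it has no refusal-subspace component.
    """
    delta = gate * (scale * h + add)
    return h + project_out(delta, basis)

# Toy data standing in for real activations (illustrative only).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + 2.0
harmless = rng.normal(size=(16, 8))

# Stage 1 (assumed estimator): difference-of-means refusal direction.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
basis = refusal_basis(direction[None, :])

# Stage 2: apply an edit; in practice gate, add, and scale would be learned.
h = rng.normal(size=(4, 8))
gate = np.array([[1.0], [0.0], [1.0], [0.0]])
edited = gated_edit(h, gate, rng.normal(size=8), rng.normal(size=8), basis)
```

Under these assumptions, the coordinates of `edited` along the refusal basis equal those of `h`, so the edit cannot move activations along the estimated refusal directions, and rows whose gate is zero pass through unchanged.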