Safe Downstream Adaptation of LLMs via Refusal-Orthogonal Gated Editing
Abstract
Parameter-efficient fine-tuning (PEFT) methods adapt large language models (LLMs) with limited data and compute, but can inadvertently degrade safety-critical refusal behavior. Prior work suggests refusal is concentrated in a small number of directions in representation space; we hypothesize that estimating these directions and restricting adaptation to their orthogonal complement can prevent downstream fine-tuning from making a model less safe. We propose Safety-Aware Gated Editing (SAGE), a two-stage method that (i) estimates a refusal subspace and (ii) fine-tunes sparse, gated additive and multiplicative activation edits for task performance while constraining all edits to the refusal-orthogonal subspace. On Llama-3.1-8B-Instruct, SAGE achieves competitive downstream task and MMLU-Pro accuracy while consistently lowering attack success rates on StrongREJECT, AdvBench, and JailbreakBench compared to standard PEFT and unconstrained activation editing. Overall, SAGE operationalizes refusal directions as a protected control primitive, enabling geometry-aware adaptation that preserves safety under low-resource fine-tuning.
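To make the geometric constraint concrete, the following is a minimal sketch of restricting a gated activation edit to the orthogonal complement of a refusal subspace. It is an illustration under stated assumptions, not the SAGE implementation: the abstract does not specify the edit parameterization, so the form `gate * (scale * h + shift)`, the helper names (`orthonormal_basis`, `project_out`, `gated_edit`), and the toy refusal directions are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal vectors spanning the given refusal directions."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(edit, basis):
    """Remove the component of `edit` that lies in span(basis)."""
    out = list(edit)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


def gated_edit(h, scale, shift, gate, basis):
    """Assumed edit form: gate * (scale * h + shift), projected before it is added to h."""
    raw = [g * (s * hi + t) for hi, s, t, g in zip(h, scale, shift, gate)]
    delta = project_out(raw, basis)
    return [hi + di for hi, di in zip(h, delta)]


# Toy example: two (hypothetical) refusal directions in a 3-dimensional space.
refusal_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(refusal_dirs)

h = [0.5, -1.0, 2.0]
h_new = gated_edit(
    h,
    scale=[0.1, 0.1, 0.1],
    shift=[1.0, 1.0, 1.0],
    gate=[1.0, 0.0, 1.0],  # sparse gate: the second coordinate is not edited
    basis=basis,
)

# The edited activation has the same coordinates along every refusal direction.
for b in basis:
    assert abs(dot(h_new, b) - dot(h, b)) < 1e-9
```

Because the projection is applied to the edit itself rather than to the activation, the component of `h` along each refusal direction passes through unchanged, which is the sense in which the refusal subspace is treated as protected.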