Oral in Workshop: Secure and Trustworthy Large Language Models

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei ⋅ Kaixuan Huang ⋅ Yangsibo Huang ⋅ Tinghao Xie ⋅ Xiangyu Qi ⋅ Mengzhou Xia ⋅ Prateek Mittal ⋅ Mengdi Wang ⋅ Peter Henderson

[Project Page] [OpenReview]

Abstract

Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
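
The isolation idea in the abstract (finding weights that matter for safety but not for utility) can be illustrated with a toy set-difference sketch. This is not the authors' implementation: the function names, the thresholds, and the importance scores below are all hypothetical, and how such scores would actually be computed is left open. The sketch only assumes that each weight already has one safety score and one utility score.

```python
# Toy sketch (hypothetical names and data): isolate weights that rank highly
# for safety but not for utility, then zero them out.

def top_fraction(scores, fraction):
    """Return the set of indices holding the largest `fraction` of scores."""
    k = int(round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def isolate_safety_indices(safety_scores, utility_scores, p_safety, q_utility):
    """Indices in the top-p safety set that are NOT in the top-q utility set."""
    safety_top = top_fraction(safety_scores, p_safety)
    utility_top = top_fraction(utility_scores, q_utility)
    return safety_top - utility_top

def prune(weights, indices):
    """Return a copy of `weights` with the selected indices set to zero."""
    return [0.0 if i in indices else w for i, w in enumerate(weights)]

# Assumed toy values: one flat "layer" of ten weights with made-up scores.
weights = [0.5, -1.2, 0.3, 0.8, -0.7, 0.1, 0.9, -0.4, 0.6, 0.2]
safety_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.3, 0.0, 0.15, 0.25]
utility_scores = [0.95, 0.6, 0.1, 0.7, 0.2, 0.3, 0.8, 0.05, 0.4, 0.15]

isolated = isolate_safety_indices(safety_scores, utility_scores, 0.3, 0.3)
pruned = prune(weights, isolated)
```

In this toy setup, weight 0 scores highly on both criteria and is therefore kept, while weights that score highly for safety alone are removed. The same set-difference logic would apply to any per-weight scoring scheme.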
