Poster
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
Yiran Zhao · Wenxuan Zhang · Yuxi Xie · Anirudh Goyal · Kenji Kawaguchi · Michael Qizhe Shieh
Hall 3 + Hall 2B #602
Sat 26 Apr, midnight to 2:30 a.m. PDT
Abstract:
Safety alignment for large language models (LLMs) has become a critical issue due to their rapid progress. However, our understanding of LLMs' safety mechanisms remains limited, so safety alignment training mainly focuses on improving optimization, enhancing data, or adding extra structures that intentionally block harmful outputs. To address this gap, we develop a neuron detection method to identify safety neurons, i.e., neurons consistently crucial for handling and defending against harmful queries. Our findings reveal that these safety neurons constitute less than $1\%$ of all parameters, are language-specific, and are predominantly located in self-attention layers. Moreover, safety is collectively managed by these neurons in the first several layers. Based on these observations, we introduce a $\underline{S}$afety $\underline{N}$euron $\underline{Tun}$ing method, named $\texttt{SN-Tune}$, that exclusively tunes safety neurons without compromising models' general capabilities. $\texttt{SN-Tune}$ significantly enhances the safety of instruction-tuned models, notably reducing the harmfulness scores of Llama3-8B-Instruct from $65.5$ to $2.0$, Mistral-7B-Instruct-v0.2 from $70.8$ to $4.5$, and Vicuna-13B-1.5 from $93.5$ to $3.0$. Moreover, $\texttt{SN-Tune}$ can be applied to base models to efficiently establish a safety mechanism. In addition, we propose a $\underline{R}$obust $\underline{S}$afety $\underline{N}$euron $\underline{Tun}$ing method, $\texttt{RSN-Tune}$, which preserves the integrity of LLMs' safety mechanisms during downstream-task fine-tuning by separating safety neurons from models' foundation neurons.
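The abstract describes two steps: scoring neurons by their importance on harmful queries, then fine-tuning only the selected "safety neurons." The following is a minimal sketch of that idea, not the authors' released implementation: the gradient-magnitude scoring criterion, the top-$1\%$ threshold, and the toy model are all placeholder assumptions used purely for illustration.

```python
# Minimal sketch (assumptions throughout): (1) score parameters by gradient
# magnitude on a batch of harmful queries with refusal targets, (2) keep the
# top ~1% as "safety neurons", (3) fine-tune only those by masking gradients.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM; in practice this would be a pretrained transformer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

# Dummy "harmful query" batch; in practice, tokenized harmful prompts paired
# with safe refusal targets.
x_harm = torch.randn(8, 16)
y_refuse = torch.zeros(8, dtype=torch.long)

# --- Step 1: score parameters on harmful inputs (gradient magnitude here) ---
model.zero_grad()
loss_fn(model(x_harm), y_refuse).backward()
scores = {name: p.grad.abs() for name, p in model.named_parameters()}

# Keep the top ~1% of parameters as "safety neurons" (fraction is an assumption).
all_scores = torch.cat([s.flatten() for s in scores.values()])
threshold = torch.quantile(all_scores, 0.99)
masks = {name: (s >= threshold).float() for name, s in scores.items()}

# --- Step 2: fine-tune only the selected parameters ---
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x_harm), y_refuse)
    loss.backward()
    with torch.no_grad():
        # Zero gradients of non-safety parameters so only safety neurons update.
        for name, p in model.named_parameters():
            p.grad.mul_(masks[name])
    optimizer.step()
```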