Poster

Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Yiran Zhao · Wenxuan Zhang · Yuxi Xie · Anirudh Goyal · Kenji Kawaguchi · Michael Qizhe Shieh

Hall 3 + Hall 2B #602
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract: Safety alignment for large language models (LLMs) has become a critical issue due to their rapid progress. However, our understanding of effective safety mechanisms in LLMs remains limited, leading safety alignment training to focus mainly on improving optimization, enhancing data, or adding extra structures to intentionally block harmful outputs. To address this gap, we develop a neuron detection method to identify safety neurons, i.e., those consistently crucial for handling and defending against harmful queries. Our findings reveal that these safety neurons constitute less than 1% of all parameters, are language-specific, and are predominantly located in self-attention layers. Moreover, safety is collectively managed by these neurons in the first several layers. Based on these observations, we introduce a Safety Neuron Tuning method, named SN-Tune, that exclusively tunes safety neurons without compromising models' general capabilities. SN-Tune significantly enhances the safety of instruction-tuned models, notably reducing the harmful scores of Llama3-8B-Instruct from 65.5 to 2.0, Mistral-7B-Instruct-v0.2 from 70.8 to 4.5, and Vicuna-13B-1.5 from 93.5 to 3.0. Moreover, SN-Tune can be applied to base models to efficiently establish LLMs' safety mechanisms. In addition, we propose a Robust Safety Neuron Tuning method (RSN-Tune), which preserves the integrity of LLMs' safety mechanisms during downstream task fine-tuning by separating the safety neurons from models' foundation neurons.
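The following is a minimal sketch (not the authors' released code) of the core idea behind SN-Tune as described in the abstract: score parameters by how much they contribute to handling harmful queries, keep roughly the top 1% as "safety neurons," and then fine-tune only that subset while freezing everything else. The gradient-times-weight scoring rule, the per-weight (rather than per-neuron) granularity, the data loaders, and all function names are illustrative assumptions, not the paper's exact procedure.

```python
import torch

# Illustrative sketch of safety-neuron selection and safety-only tuning.
# The scoring rule and all names here are assumptions for illustration;
# the paper's actual detection and tuning method may differ.

def score_parameter_importance(model, harmful_batches, loss_fn):
    """Accumulate |grad * weight| importance scores over harmful queries."""
    scores = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for inputs, targets in harmful_batches:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        with torch.no_grad():
            for name, p in model.named_parameters():
                if p.grad is not None:
                    scores[name] += (p.grad * p).abs()
    return scores

def build_safety_mask(scores, keep_ratio=0.01):
    """Flag the top `keep_ratio` fraction of weights (under 1% here) as safety neurons."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    # Global cutoff for simplicity; a real run might threshold per layer to save memory.
    threshold = torch.topk(flat, k).values.min()
    return {name: (s >= threshold) for name, s in scores.items()}

def tune_safety_neurons(model, safety_mask, safety_loader, loss_fn, lr=1e-5, max_steps=100):
    """Update only the masked (safety) weights; all other weights stay fixed."""
    # weight_decay=0 so the frozen weights are not silently decayed by the optimizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    for step, (inputs, targets) in enumerate(safety_loader):
        if step >= max_steps:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Zero out gradients outside the safety mask so only safety neurons move.
        for name, p in model.named_parameters():
            if p.grad is not None:
                p.grad.mul_(safety_mask[name].to(p.grad.dtype))
        optimizer.step()
```

In the same spirit, a downstream fine-tuning run could invert the mask (updating everything except the flagged safety neurons) to approximate what RSN-Tune aims for, i.e., adapting to the task while leaving the safety mechanism untouched.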
