Oral in Workshop: Secure and Trustworthy Large Language Models
What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety
Luxi He · Mengzhou Xia · Peter Henderson
Recent research indicates that Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Prior work has found that simply fine-tuning an aligned model further on benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. We characterize data through two lenses: the representation space and the gradient space. We introduce a bi-directional anchoring method that effectively finds subsets of benign data that are more likely to degrade safety after fine-tuning. Training on just 100 of these benign datapoints can lead to the fine-tuned model responding in a potentially unsafe manner for >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We further find that the selected data are often formatted as lists, bullet points, or math questions.
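The abstract does not spell out the scoring rule, so the following is a minimal sketch of one plausible reading of bi-directional anchoring: rank benign candidates by their feature similarity to harmful anchor examples and their dissimilarity to safe anchor examples, then keep the top-k. The feature choice (per-example gradients or hidden representations), the mean aggregation over anchors, and the function names are assumptions for illustration, not the authors' exact method.

```python
# Hypothetical sketch of bi-directional anchoring for benign data selection.
# Features may be per-example gradients or hidden representations (assumption).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def select_benign_subset(candidate_feats: np.ndarray,
                         harmful_anchor_feats: np.ndarray,
                         safe_anchor_feats: np.ndarray,
                         k: int = 100) -> np.ndarray:
    """Score each benign candidate by mean similarity to harmful anchors
    minus mean similarity to safe anchors, and return the top-k indices.
    The bidirectional (toward-harmful, away-from-safe) scoring is an
    assumed instantiation of the anchoring idea described in the abstract."""
    toward_harmful = cosine_sim(candidate_feats, harmful_anchor_feats).mean(axis=1)
    away_from_safe = cosine_sim(candidate_feats, safe_anchor_feats).mean(axis=1)
    scores = toward_harmful - away_from_safe
    return np.argsort(-scores)[:k]

# Usage with random stand-in features (d = feature dimension).
rng = np.random.default_rng(0)
d = 64
candidates = rng.normal(size=(1000, d))
harmful_anchors = rng.normal(size=(10, d))
safe_anchors = rng.normal(size=(10, d))
top_100 = select_benign_subset(candidates, harmful_anchors, safe_anchors, k=100)
print(top_100[:5])
```

In this reading, the selected 100 datapoints would then be used for fine-tuning, mirroring the experiment the abstract reports.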