Poster in Workshop: Secure and Trustworthy Large Language Models

What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety

Luxi He · Mengzhou Xia · Peter Henderson


Abstract

Recent research indicates that Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Prior work has found that merely fine-tuning an aligned model further on benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We investigate the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. We examine fine-tuning data through two lenses: representation space and gradient space. We introduce a bi-directional anchoring method that effectively identifies subsets of benign data that are more likely to degrade safety after fine-tuning. Fine-tuning on just 100 of these benign datapoints can lead the model to respond in a potentially unsafe manner to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We further find that the selected data often take the form of lists and bullet points, or math questions.
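
The abstract names a bi-directional anchoring method but does not spell out its scoring rule, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each example is already encoded as a feature vector (from either the representation or the gradient space), and that "bi-directional" means ranking benign candidates by how close they sit to a set of harmful anchor examples and how far they sit from a set of safe anchor examples. The function names, the use of cosine similarity, and the mean-difference score are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def anchoring_score(x, harmful_anchors, safe_anchors):
    """Assumed score: mean similarity to harmful anchors minus mean similarity to safe anchors."""
    to_harmful = sum(cosine(x, h) for h in harmful_anchors) / len(harmful_anchors)
    to_safe = sum(cosine(x, s) for s in safe_anchors) / len(safe_anchors)
    return to_harmful - to_safe


def select_top_k(candidates, harmful_anchors, safe_anchors, k=100):
    """Return indices of the k benign candidates with the highest assumed score."""
    scores = [anchoring_score(x, harmful_anchors, safe_anchors) for x in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy 2-D features standing in for real representation or gradient features.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
harmful_anchors = [[1.0, 0.0]]
safe_anchors = [[0.0, 1.0]]
selected = select_top_k(candidates, harmful_anchors, safe_anchors, k=2)
```

In this toy setup the first candidate points directly at the harmful anchor and away from the safe one, so it ranks first. The default `k=100` mirrors the subset size quoted in the abstract; how the features and anchors are actually built is left open here.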
