

Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks

Pankayaraj Pathmanathan · Udari Sehwag · Michael-Andrei Panaitescu-Liess · Furong Huang


Abstract:

With the increasing adoption of reinforcement learning with human feedback (RLHF) to align large language models (LLMs), the risk of backdoor installation during the alignment process has grown, potentially leading to unintended and harmful behaviors. Existing backdoor attacks mostly focus on simpler tasks, such as sequence classification, making them either difficult to install in LLM alignment or installable but easily detectable and removable. In this work, we introduce AdvBDGen, a generative fine-tuning framework that automatically creates prompt-specific paraphrases as triggers, enabling stealthier and more resilient backdoor attacks in LLM alignment. AdvBDGen is designed to exploit the disparities in learning speeds between strong and weak discriminators to craft backdoors that are both installable and stealthy. Using as little as 3% of the fine-tuning data, AdvBDGen can install highly effective backdoor triggers that, once installed, not only jailbreak LLMs during inference but also exhibit greater stability against input perturbations and improved robustness to trigger removal methods. Our findings highlight the growing vulnerability of LLM alignment pipelines to advanced backdoor attacks, underscoring the pressing need for more robust defense mechanisms.
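
To make the strong/weak-discriminator idea concrete, below is a minimal sketch of how such an objective could be set up. It is an illustrative assumption, not the authors' implementation: the generator, discriminator architectures, embedding dimension, loss weights, and training loop are all placeholder choices, with small MLPs standing in for the LLM-based generator and discriminators described in the abstract.

```python
# Minimal sketch of a strong/weak-discriminator backdoor-trigger objective.
# All model classes, dimensions, and hyperparameters are illustrative
# assumptions, not the AdvBDGen implementation.
import torch
import torch.nn as nn

emb_dim = 64  # assumed prompt-embedding dimension

class Discriminator(nn.Module):
    """Scores whether an (embedded) prompt carries the backdoor trigger."""
    def __init__(self, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# High-capacity ("strong") and low-capacity ("weak") discriminators.
strong = Discriminator(hidden=256)
weak = Discriminator(hidden=8)

# Stand-in for the generator that paraphrases a prompt into a triggered
# variant; in the paper this is a fine-tuned LLM, here a small MLP over
# embeddings purely for illustration.
generator = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.Tanh())

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(
    list(strong.parameters()) + list(weak.parameters()), lr=1e-3
)

for step in range(100):
    clean = torch.randn(32, emb_dim)      # embeddings of clean prompts
    triggered = generator(clean)          # prompt-specific paraphrase "trigger"

    # Discriminator update: both try to separate clean from triggered prompts.
    d_loss = (
        bce(strong(clean), torch.zeros(32))
        + bce(strong(triggered.detach()), torch.ones(32))
        + bce(weak(clean), torch.zeros(32))
        + bce(weak(triggered.detach()), torch.ones(32))
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: the trigger should remain detectable by the strong
    # discriminator (installable) while fooling the weak one (stealthy).
    g_loss = (
        bce(strong(triggered), torch.ones(32))
        + bce(weak(triggered), torch.zeros(32))
    )
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The intuition behind this sketch is that a trigger only a high-capacity model can separate from clean prompts is likely too subtle for simple detection or removal heuristics, which is how the exploited gap in learning speed between the two discriminators would translate into stealth.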
