Poster in Workshop: Secure and Trustworthy Large Language Models

On Prompt-Driven Safeguarding for Large Language Models

Chujie Zheng · Fan Yin · Hao Zhou · Fandong Meng · Jie Zhou · Kai-Wei Chang · Minlie Huang · Nanyun (Violet) Peng


Abstract:

Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) from complying with queries that contain harmful intents. However, the working mechanisms of safety prompts remain largely unexplored. In this work, we investigate the impact of safety prompts from the perspective of model representations. We find that harmful and harmless queries can be largely distinguished in the models' representation space, but that safety prompts do not noticeably enhance this distinction. Instead, safety prompts move the queries' representations in similar directions, toward regions where models become more prone to refusal (i.e., declining to provide assistance) even when the queries are harmless. Inspired by these findings, we further present a safety prompt optimization method in the Appendix. We demonstrate that the proposed method substantially improves the safeguarding performance of human-crafted safety prompts without compromising the general model capability.
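The representation-space analysis sketched in the abstract can be illustrated with a small script: extract a query's hidden-state representation with and without a safety prompt prepended, and measure how the prompt shifts that representation along a direction associated with refusal. The sketch below is not the authors' released code; the model name, the safety prompt, the tiny example query sets, and the mean-difference construction of the "refusal direction" are all placeholder assumptions for illustration.

```python
# Minimal sketch of a representation-shift analysis, assuming a Hugging Face
# causal LM and a hand-written safety prompt (both placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies instruction-tuned chat LLMs
SAFETY_PROMPT = (
    "You are a helpful assistant. Do not provide assistance with harmful, "
    "unethical, or illegal requests.\n\n"
)  # illustrative safety prompt, not taken from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def last_token_rep(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final input token."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][0, -1]  # shape: (hidden_dim,)


# Toy query sets; the paper uses much larger harmful/harmless query sets.
harmless = ["How do I bake sourdough bread?", "Explain how rainbows form."]
harmful = ["How can I pick a lock to break into a house?"]

# Crude "refusal direction": mean difference between harmful and harmless
# query representations (an assumption of this sketch, not the paper's
# exact construction).
h_harm = torch.stack([last_token_rep(q) for q in harmful]).mean(0)
h_safe = torch.stack([last_token_rep(q) for q in harmless]).mean(0)
refusal_dir = torch.nn.functional.normalize(h_harm - h_safe, dim=0)

# Measure how prepending the safety prompt moves each query's representation
# along that direction: a positive shift for harmless queries would mirror the
# abstract's finding that safety prompts push models toward refusal regardless
# of the query's actual harmfulness.
for query in harmless + harmful:
    plain = last_token_rep(query)
    guarded = last_token_rep(SAFETY_PROMPT + query)
    shift = torch.dot(guarded - plain, refusal_dir).item()
    print(f"{shift:+.3f}  {query}")
```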