

Poster in Workshop: Secure and Trustworthy Large Language Models

Initial Response Selection for Prompt Jailbreaking using Model Steering

Thien Tran · Koki Wataoka · Tsubasa Takahashi


Abstract:

Jailbreak prompts, inputs crafted to induce LLMs to generate unsafe content, pose a critical threat to the safe deployment of LLMs. Traditional jailbreak methods optimize malicious prompts to elicit an affirmative initial response, assuming that harmful generation will continue from it. However, the effectiveness of these initial responses varies, which affects the likelihood of subsequent harmful output. This work highlights the importance of selecting a proper initial response and the difficulties involved in doing so. We propose a new method that uses model steering to select the initial response most likely to lead to a successful attack. Our experiments show that this method substantially improves the accuracy of initial response selection, leading to high attack success rates.
