Oral in Workshop: Secure and Trustworthy Large Language Models
Simple Permutations Can Fool LLaMA: Permutation Attack and Defense for Large Language Models
Liang Chen · Yatao Bian · Li Shen · Kam-Fai Wong
In-context learning (ICL) enables Large Language Models (LLMs) to tackle challenging tasks from a handful of given examples. However, it is prone to instability: different orderings of the input examples can significantly influence predictions. Current mitigation strategies focus on post-processing and fail to enhance the model's inherent robustness. This paper investigates this instability in LLMs in depth and uncovers a natural, permutation-based attack that achieves success rates of nearly 100% on LLMs while remaining imperceptible to humans. To address this vulnerability, we propose a distributionally robust optimization (DRO)-based tuning method as a defense, explicitly optimizing the model's performance against worst-case permutations to bolster robustness. Our framework comprises two modules: a Permutation Proposal network (P-Net) and the LLM. The P-Net formulates the identification of the most challenging permutation as an optimal transport problem, which it solves with the Sinkhorn algorithm. Through adversarial training, the P-Net progressively enhances the LLM's robustness to permutation instability. Experiments on a synthetic task and an ICL tuning task demonstrate that our method effectively mitigates permutation attacks and improves overall performance.
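
To make the Sinkhorn step concrete, here is a minimal sketch of how a score matrix produced by a proposal network could be relaxed into a (soft) permutation over in-context examples and then rounded into a concrete adversarial ordering. The function names, shapes, temperature, iteration count, and the greedy rounding step are illustrative assumptions, not the paper's actual P-Net implementation.

```python
# Hypothetical sketch: Sinkhorn normalisation of P-Net-style scores into a
# doubly-stochastic (soft permutation) matrix, then rounding to an ordering.
import numpy as np

def sinkhorn(log_scores: np.ndarray, n_iters: int = 20, tau: float = 0.1) -> np.ndarray:
    """Alternate row/column normalisation in log space so the result is
    (approximately) doubly stochastic; lower tau pushes it toward a hard permutation."""
    log_p = log_scores / tau
    for _ in range(n_iters):
        # Row normalisation: each row sums to 1.
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        # Column normalisation: each column sums to 1.
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)

# Toy usage: 4 in-context examples with stand-in "difficulty" scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # placeholder for P-Net output
soft_perm = sinkhorn(scores)              # differentiable relaxation usable in training
hard_perm = soft_perm.argmax(axis=1)      # greedy rounding (a Hungarian assignment may be needed in general)

examples = ["ex_0", "ex_1", "ex_2", "ex_3"]
reordered = [examples[i] for i in hard_perm]
print(reordered)  # candidate adversarial ordering of the prompt examples
```

In an adversarial training loop of the kind the abstract describes, the soft permutation would keep the ordering choice differentiable so the proposal network can be trained to find hard orderings, while the LLM is tuned to perform well under them.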