ICLR Poster Durable Quantization Conditioned Misalignment Attack on Large Language Models

Poster

Durable Quantization Conditioned Misalignment Attack on Large Language Models

Peiran Dong · Haowei Li · Song Guo

Hall 3 + Hall 2B #522

[ Abstract ]

Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

As large language models (LLMs) are increasingly deployed on resource-constrained edge devices, quantization techniques have been widely adopted to reduce model size and computational requirements. However, this process can expose models to new vulnerabilities. In this work, we introduce the Quantization Conditioned Misalignment (Q-Misalign) attack, a novel threat in which safety misalignment remains dormant in a full-precision LLM but becomes exploitable post-quantization. We demonstrate that our Q-Misalign attack effectively bypasses safety mechanisms and enables the generation of harmful content in quantized models while maintaining full-precision performance. Furthermore, we propose a contrastive task vector-based approach to enhance attack durability, ensuring that vulnerabilities persist even after downstream fine-tuning. Experimental results show that Q-Misalign attack significantly increases jailbreak success rates in quantized models, while preserving model utility and safety alignment in full precision. Our findings highlight a critical gap in current LLM safety measures and call for more robust defenses in quantization-aware scenarios.

Live content is unavailable. Log in and register to view live content