Understanding Adversarial Transfer Across Modalities: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Abstract
Why do adversarial examples transfer between image classifiers and jailbreaks transfer between language models (LMs), yet image jailbreaks fail to transfer between vision-language models (VLMs)? We propose that geometric alignment is the decisive mechanism: attacks in the shared input “data-space” transfer readily, while attacks in “representation-space” transfer only when models’ internal geometries correlate. We argue that VLM image inputs effectively function as representation-space attacks: unlike text, which enters through a shared, discrete tokenizer, images are mapped through model-specific, unaligned continuous projectors. Consequently, adversarial perturbations on images produce geometrically orthogonal embedding vectors across models, preventing transfer. We validate this mechanism theoretically and empirically. First, we provide a theoretical framework showing that representation-space attacks fail to transfer between functionally identical linear models unless their bases are aligned. Second, we demonstrate that representation-space attacks against ImageNet classifiers and LMs likewise fail to transfer, mirroring the VLM phenomenon. Third, we show that textual jailbreaks do transfer between VLMs. Finally, we show that aligning representation spaces restores transfer for both VLM image jailbreaks and LM latent attacks, establishing geometric alignment as the key mechanism. Our framework unifies the understanding of adversarial transfer across vision, language, and multimodal systems, and it suggests that the misalignment of internal interfaces in modular systems serves as a natural defense against transfer attacks, yielding principled guidelines for building robust AI systems across modalities.
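The linear-model argument summarized above can be illustrated with a minimal NumPy sketch (our own illustration, not taken from the paper; the matrices U, V, R and all variable names are hypothetical). It builds two functionally identical linear models whose hidden bases differ by a rotation, and checks that a data-space perturbation transfers while a representation-space perturbation does not until the bases are aligned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical input/hidden dimension

# Two functionally identical linear models whose hidden bases differ by a rotation R:
#   Model A: h_A = U x,        y_A = V h_A
#   Model B: h_B = (R U) x,    y_B = (V R^T) h_B   =>  y_B(x) == y_A(x) for all x
U = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # an arbitrary orthogonal (rotation-like) matrix

x = rng.normal(size=d)
assert np.allclose(V @ (U @ x), (V @ R.T) @ (R @ U @ x))  # identical input-output behavior

# Data-space attack: an input perturbation transfers, because both models
# compute the same function of x.
delta_x = rng.normal(size=d)
shift_A = V @ (U @ (x + delta_x)) - V @ (U @ x)
shift_B = (V @ R.T) @ (R @ U @ (x + delta_x)) - (V @ R.T) @ (R @ U @ x)
print("data-space transfer:", np.allclose(shift_A, shift_B))  # True

# Representation-space attack: a perturbation crafted in model A's hidden space
# produces a different output shift when injected into model B's hidden space,
# because B's hidden basis is rotated relative to A's.
delta_h = rng.normal(size=d)
rep_shift_A = V @ delta_h
rep_shift_B_naive = (V @ R.T) @ delta_h
print("naive representation transfer:", np.allclose(rep_shift_A, rep_shift_B_naive))  # False (generically)

# Aligning the representation spaces (mapping delta_h through R) restores transfer.
rep_shift_B_aligned = (V @ R.T) @ (R @ delta_h)
print("aligned representation transfer:", np.allclose(rep_shift_A, rep_shift_B_aligned))  # True
```

Under these assumptions, the sketch mirrors the abstract's claims: shared data-space attacks transfer between functionally identical models, while representation-space attacks transfer only once the hidden bases are explicitly aligned.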