PROBING INFORMATION FLOW IN VISION TRANSFORMERS THROUGH CONTROLLED ATTENTION PERTURBATION
Abstract
We apply identical attention sparsity to three vision transformer tasks and find order-of-magnitude differences in sensitivity: at 75% sparsity, CLIP retrieval degrades 2%, classification degrades 7%, while diffusion generation degrades 274%. To systematically probe this, we design three masking strategies with distinct graph-theoretic properties (small-world, preferential attachment, hub-spoke) and measure degradation across density levels. Ablating small-world masks reveals that spatial locality, not long-range shortcuts, drives performance preservation, with local-only connectivity outperforming random-only by 7.6×. We hypothesize that diffusion’s sensitivity arises from error accumulation across 250 sequential denoising steps, where each disruption compounds through subsequent iterations. These findings demonstrate how controlled perturbation can reveal task-dependent differences in transformer information flow that static analysis would miss.
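To make the local-only versus random-only ablation concrete, the following is a minimal NumPy sketch of how such attention masks might be built over a square grid of patch tokens and applied to a single attention head. It is an illustration under stated assumptions, not the construction used in this paper: the function names, the Chebyshev-distance neighborhood, and the `radius` and `shortcut_prob` parameters are all hypothetical choices made for the example.

```python
import numpy as np

def local_mask(grid: int, radius: int) -> np.ndarray:
    """Boolean (N, N) mask over a grid x grid token layout (N = grid**2).

    Assumption: token i may attend to token j when their grid positions
    are within `radius` in Chebyshev distance. Self-attention is always kept.
    """
    ys, xs = np.divmod(np.arange(grid * grid), grid)  # row and column of each token
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return np.maximum(dy, dx) <= radius

def random_mask(n_tokens: int, density: float, rng: np.random.Generator) -> np.ndarray:
    """Random-only connectivity: keep each edge independently with prob `density`."""
    mask = rng.random((n_tokens, n_tokens)) < density
    np.fill_diagonal(mask, True)  # keep self-attention so no row is empty
    return mask

def small_world_mask(grid: int, radius: int, shortcut_prob: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Local neighborhood plus random long-range shortcuts (hypothetical variant)."""
    local = local_mask(grid, radius)
    shortcuts = rng.random(local.shape) < shortcut_prob
    return local | shortcuts

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with disallowed edges set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid, dim = 14, 64                        # 14 x 14 = 196 patch tokens (assumed)
    n = grid * grid
    q, k, v = (rng.standard_normal((n, dim)) for _ in range(3))

    local = local_mask(grid, radius=3)
    rand = random_mask(n, density=local.mean(), rng=rng)  # match the local density
    for name, m in [("local-only", local), ("random-only", rand)]:
        out = masked_attention(q, k, v, m)
        print(f"{name}: density={m.mean():.3f}, output shape={out.shape}")
```

Matching the density of the random mask to that of the local mask, as in the usage block, is one way to keep such a comparison about where the retained edges fall rather than how many there are.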