Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Bryce Grant ⋅ Xijia Zhao ⋅ Peng Wang
Abstract
Vision-Language-Action (VLA) models represent a frontier application of multimodal foundation models, unifying visual perception, language understanding, and continuous motor control. But how do these models actually combine their modalities? We present the first cross-architecture mechanistic study of multimodal fusion in VLAs, using activation injection, sparse autoencoders (SAEs), and linear probes across $\textbf{25,000+ rollout episodes}$ on three architecturally distinct models: $\pi_{0.5}$ (3B, flow-matching), OpenVLA-OFT (7B, continuous regression), and ACT (80M, CVAE). Our findings reveal a striking asymmetry: $\textbf{vision completely dominates language}$. Injecting visual activations into null-prompt episodes recovers near-identical actions (cosine similarity $=0.999$), while null prompts alone achieve baseline task success ($p>0.24$ vs. correct prompts). Internal representations distinguish prompts with 99.3% accuracy, yet behavior is unchanged; the language pathway is read but not used. We further show that action tokenization fundamentally constrains interpretability: discrete 256-bin tokens prevent SAE intervention (0% success) while continuous representations enable it (99.2%), and per-token SAE processing is essential (mean-pooling causes 88% failure). Cross-task activation transfer fails universally (0% across 528+ pairs in all models), revealing that VLAs encode scene-grounded motor programs rather than abstract task representations. These findings expose a fundamental limitation of current multimodal fusion in deployed robotic systems and provide concrete guidance for model design.
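The action comparison reported above relies on cosine similarity between predicted actions. As a rough illustration only, the metric could be computed as in the sketch below. The function name, the 7-dimensional action layout, and the numeric values are all hypothetical and are not taken from the paper or its code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length action vectors."""
    if len(a) != len(b):
        raise ValueError("action vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 7-dimensional actions (illustrative numbers, not paper data):
# one from a run with the correct prompt, one from a null-prompt run
# with visual activations injected.
baseline_action = [0.12, -0.30, 0.05, 0.00, 0.01, -0.02, 1.0]
injected_action = [0.12, -0.29, 0.05, 0.00, 0.01, -0.02, 1.0]

similarity = cosine_similarity(baseline_action, injected_action)
print(f"cosine similarity: {similarity:.4f}")
```

A value close to 1 means the two actions point in nearly the same direction. Cosine similarity ignores magnitude, so it says nothing on its own about whether the two actions differ in scale.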