When Does Embedding Arithmetic Fail? A Systematic Analysis in Remote Sensing Vision-Language Models
Jinpyo Hong ⋅ Le Yu
Abstract
Embedding arithmetic promises flexible compositional queries over remote sensing (RS) imagery, for example transforming a harbor into an airport by subtracting "water" and adding "runway", yet the conditions under which it works remain poorly understood. We systematically evaluate four CLIP-based models across five RS datasets and identify concept entanglement as the dominant failure mode (40--60\% of failures): semantically related concepts occupy overlapping embedding subspaces that confound arithmetic. We propose a pre-hoc entanglement metric, computed from text embeddings alone, that predicts failure with AUC up to 0.818; GeoRSCLIP yields the most consistent predictions (mean AUC=0.675). Notably, embedding geometry does not reliably predict compositional capability ($r$=0.30, $p$=0.20), suggesting that discriminative and compositional reasoning require different representational properties. We provide practical guidelines: arithmetic succeeds for well-separated concepts (88\%) but fails predictably for structurally similar classes (42\%).
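The harbor-to-airport query described above can be illustrated with a minimal sketch. It assumes unit-normalized embeddings and uses hand-made 4-dimensional toy vectors in place of real CLIP outputs. The function names (`compose`, `retrieve`, `cosine_overlap`) and the vectors `similar_a` and `similar_b` are hypothetical. The pairwise cosine shown here is only a simple stand-in for the idea of a text-only overlap check, not the entanglement metric proposed in the paper, whose definition the abstract does not give.

```python
import numpy as np

def normalize(v):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def compose(source, subtract, add):
    """Embedding arithmetic: source - subtract + add, re-normalized."""
    return normalize(normalize(source) - normalize(subtract) + normalize(add))

def retrieve(query, gallery):
    """Index of the gallery row with the highest cosine similarity to query."""
    return int(np.argmax(normalize(gallery) @ normalize(query)))

def cosine_overlap(a, b):
    """Cosine similarity of two concept embeddings (illustrative proxy only)."""
    return float(normalize(a) @ normalize(b))

# Toy axes (assumed for illustration): [water, runway, built-up, vegetation]
water = [1, 0, 0, 0]
runway = [0, 1, 0, 0]
harbor = [1, 0, 1, 0]
airport = [0, 1, 1, 0]
forest = [0, 0, 0, 1]

gallery = np.array([harbor, airport, forest])
query = compose(harbor, water, runway)
best = retrieve(query, gallery)  # index into gallery

# Well-separated concepts share no axes; the hypothetical "structurally
# similar" pair below points in nearly the same direction.
overlap_separated = cosine_overlap(water, runway)
similar_a = [0, 0, 1, 0.2]
similar_b = [0, 0, 1, 0.4]
overlap_entangled = cosine_overlap(similar_a, similar_b)

print(best, overlap_separated, overlap_entangled)
```

In this toy setting the composed query lands nearest the airport vector because "water" and "runway" occupy separate axes. When two concepts overlap as heavily as `similar_a` and `similar_b`, subtracting one removes most of the other as well, which is the intuition behind the entanglement failure mode.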