Automatic Interpretation of Visual Concepts
Abstract
Recent progress in mechanistic interpretability and sparse autoencoders (SAEs) has opened new avenues for understanding vision models, yet automatically assigning accurate textual descriptions to discovered concepts remains unprincipled. Existing studies rely on proxy metrics such as CLIP similarity or on qualitative inspection, neither of which measures the semantic faithfulness of the concept descriptions. To bridge this gap, we conduct a principled study of the automatic interpretation pipeline, evaluating key design choices including multimodal large language model (MLLM) query construction and sample selection. We bring Semantic Label Quality (SLQ) metrics from language model interpretability to vision, providing a direct measurement of label faithfulness. We further investigate whether synthetic counterfactuals generated by a conditional generative model can improve interpretation. Experiments on synthetic faces, histopathology, and remote sensing images reveal that optimal interpretation strategies are dataset-dependent: no single configuration universally outperforms the others. Counterfactual contrastive samples improve interpretation for localized, additive concepts but provide limited benefit for global concepts, where counterfactuals are less well defined.