

Poster

Boosting the visual interpretability of CLIP via adversarial fine-tuning

Shizhan Gong · Haoyu LEI · Qi Dou · Farzan Farnia

Hall 3 + Hall 2B #502
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

CLIP has achieved great success in visual representation learning and is becoming an important plug-in component for many large multi-modal models such as LLaVA and DALL-E. However, the lack of interpretability caused by the intricate image encoder architecture and training process restricts its wider use in high-stakes decision-making applications. In this work, we propose an unsupervised adversarial fine-tuning (AFT) with norm-regularization to enhance the visual interpretability of CLIP. We provide theoretical analysis showing that AFT has an implicit regularization effect that forces the image encoder to encode input features sparsely, directing the network's focus towards meaningful features. Evaluations by both feature attribution techniques and network dissection offer convincing evidence that the visual interpretability of CLIP improves significantly. With AFT, the image encoder prioritizes pertinent input features, and neurons within the encoder exhibit better alignment with human-understandable concepts. Moreover, these effects generalize to out-of-distribution datasets and can be transferred to downstream tasks. Additionally, AFT enhances the visual interpretability of derived large vision-language models that incorporate the pre-trained CLIP as an integral component. The code of this paper is available at the CLIP_AFT GitHub repository.
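The abstract only names the method, so the following is a minimal PyTorch-style sketch of what one unsupervised adversarial fine-tuning step with a norm penalty on the image encoder could look like. The PGD-style attack on embedding similarity, the cosine consistency loss, the l1 penalty on the representation, and all hyperparameters (`eps`, `alpha`, `lam`) are illustrative assumptions, not the authors' exact formulation from the paper or repository.

```python
# Hypothetical sketch of unsupervised adversarial fine-tuning (AFT) for a CLIP
# image encoder. Assumptions: an l_inf PGD attack that pushes embeddings away
# from the clean ones, a cosine-consistency fine-tuning loss, and an l_1 norm
# regularizer on the encoder output; the paper's actual loss may differ.
import torch
import torch.nn.functional as F


def pgd_perturb(encoder, images, eps=4 / 255, alpha=1 / 255, steps=3):
    """Find small perturbations that move embeddings away from the clean ones."""
    with torch.no_grad():
        clean_emb = F.normalize(encoder(images), dim=-1)
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = F.normalize(encoder(images + delta), dim=-1)
        # negative cosine similarity: ascending it pushes the embedding away
        loss = -(adv_emb * clean_emb).sum(dim=-1).mean()
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()


def aft_step(encoder, optimizer, images, lam=1e-4):
    """One unsupervised AFT update: keep adversarial embeddings close to the
    clean ones, plus a norm penalty (assumed l_1 here) on the representation."""
    delta = pgd_perturb(encoder, images)
    clean_emb = F.normalize(encoder(images), dim=-1).detach()
    adv_feat = encoder(images + delta)
    adv_emb = F.normalize(adv_feat, dim=-1)
    consistency = (1 - (adv_emb * clean_emb).sum(dim=-1)).mean()
    sparsity = adv_feat.abs().sum(dim=-1).mean()
    loss = consistency + lam * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, the procedure needs no labels: the clean embedding serves as its own target, and the norm penalty is what the abstract's theoretical analysis attributes to sparser, more interpretable features.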
