Poster
in
Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Model & Data Insights using Pre-trained Language Models
Saeid Asgari · Aliasghar Khani · Amir Khasahmadi · Aditya Sanghi · Karl Willis · Ali Mahdavi Amiri
Keywords: [ interpretability ] [ data bias ] [ spurious correlations ]
We propose TExplain, using language models to interpret pre-trained image classifiers' features. Our approach connects the feature space of image classifiers with language models, generating explanatory sentences during inference. By extracting frequent words from such explanations, we gain insights into learned features and patterns. This method detects spurious correlations and biases within a dataset, providing a deeper understanding of the classifier's behavior. Experimental validation on diverse datasets, including ImageNet-9L and Waterbirds, shows potential for improving interpretability and robustness in image classifiers.