Poster in Workshop: First Workshop on Representational Alignment (Re-Align)

Measuring Mechanistic Interpretability at Scale Without Humans

Roland Zimmermann ⋅ David Klindt ⋅ Wieland Brendel

Keywords: alignment, explainability, activation maximization, deep learning, neural networks, interpretability, analysis, evaluation


Abstract

In today’s era, whatever we can measure at scale, we can optimize. So far, measuring the interpretability of units in deep neural networks (DNNs) for computer vision still requires direct human evaluation and is not scalable. As a result, the inner workings of DNNs remain a mystery despite the remarkable progress we have seen in their applications. In this work, we introduce the first scalable method to measure the per-unit interpretability in vision DNNs. This method does not require any human evaluations, yet its prediction correlates well with existing human interpretability measurements. We validate its predictive power through an interventional human psychophysics study. We demonstrate the usefulness of this measure by performing previously infeasible experiments: (1) A large-scale interpretability analysis across more than 70 million units from 835 computer vision models, and (2) an extensive analysis of how units transform during training. We find an anticorrelation between a model's downstream classification performance and per-unit interpretability, which is also observable during model training. Furthermore, we see that a layer's location and width influence its interpretability.
