ICLR Poster Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Fred Zhang · Neel Nanda

Halle B #99

Abstract:

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization, that is, identifying the important model components, is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.
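
For readers unfamiliar with the technique, the following is a minimal sketch of activation patching on a toy network, not the authors' experimental setup. Everything in it is an assumption made for illustration: the two-layer NumPy model, the weight names `W1` and `W2`, the helper names `forward` and `logit_diff`, the use of additive Gaussian noise as the corruption method, and the use of logit difference as the evaluation metric. It runs the model on a clean and a corrupted input, copies one hidden activation at a time from the clean run into the corrupted run, and records how much of the metric each patch restores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 4 inputs -> 3 hidden units (ReLU) -> 2 output logits.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))


def forward(x, patch=None):
    """Run the toy model, optionally overwriting selected hidden units.

    `patch` maps a hidden-unit index to the activation value to substitute.
    Returns (logits, hidden activations after any patching).
    """
    hidden = np.maximum(x @ W1, 0.0)
    if patch:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return hidden @ W2, hidden


def logit_diff(logits):
    """Evaluation metric (an assumed choice): logit of class 0 minus class 1."""
    return float(logits[0] - logits[1])


# Clean input, and a corrupted copy made by adding Gaussian noise
# (one possible corruption method, assumed here).
x_clean = rng.normal(size=4)
x_corrupt = x_clean + rng.normal(scale=1.0, size=4)

clean_logits, clean_hidden = forward(x_clean)
corrupt_logits, _ = forward(x_corrupt)

# Patch each hidden unit in turn: run on the corrupted input, but substitute
# that unit's activation from the clean run, then measure the metric change.
effects = {}
for i in range(3):
    patched_logits, _ = forward(x_corrupt, patch={i: clean_hidden[i]})
    effects[i] = logit_diff(patched_logits) - logit_diff(corrupt_logits)

# Sanity check: patching every hidden unit at once reproduces the clean output.
all_patched, _ = forward(x_corrupt, patch={i: clean_hidden[i] for i in range(3)})

for i, effect in effects.items():
    print(f"hidden unit {i}: change in logit difference = {effect:+.4f}")
```

In this sketch, a unit whose patch moves the logit difference most of the way back toward the clean value would be read as important for the behavior. Swapping `logit_diff` for a probability-based metric, or the noise for a different corrupted input, changes the numbers in `effects`, which is the kind of sensitivity the abstract describes.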
