

Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

Self-Ablating Transformers: More Interpretability, Less Sparsity

Jeremias Ferrao · Luhan Mikaelson · Keenan Pepper · Natalia Perez-Campanero


Abstract:

A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also led to a decrease in overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability.
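To make the k-winner-takes-all constraint concrete, the following is a minimal PyTorch sketch of such a gating operation applied to a layer's activations. The function names (kwta_mask, apply_self_ablation), the choice of k, and the application to MLP hidden units are illustrative assumptions for exposition, not the authors' implementation.

```python
# Sketch of a k-winner-takes-all (k-WTA) ablation gate (illustrative, not the
# authors' code): keep only the k most active units per position, zero the rest.
import torch


def kwta_mask(activations: torch.Tensor, k: int) -> torch.Tensor:
    """Binary mask selecting the top-k units along the last dimension."""
    topk_indices = activations.topk(k, dim=-1).indices
    mask = torch.zeros_like(activations)
    mask.scatter_(-1, topk_indices, 1.0)
    return mask


def apply_self_ablation(hidden: torch.Tensor, k: int) -> torch.Tensor:
    """Zero out all but the k most active units (e.g. MLP neurons) per token."""
    return hidden * kwta_mask(hidden, k)


# Example: restrict a hidden layer of width 512 to its 32 most active units per token.
h = torch.randn(2, 16, 512)            # (batch, sequence, hidden units)
h_ablated = apply_self_ablation(h, k=32)
```

An analogous mask could in principle be applied to attention-head outputs to cover the "attention units" mentioned in the abstract; how the constraint is made differentiable and scheduled during training is not specified here.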
