

Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders

Dylan Zhou · Kunal Patil · Yifan Sun · Karthik Lakshmanan · Senthooran Rajamanoharan · Arthur Conmy


Abstract:

Generative AI's widespread use has raised concerns about trust, safety, steerability, and interpretability. Existing solutions, such as prompt engineering, fine-tuning, and reinforcement learning (e.g., RLHF, DPO), are often hard to iterate on, computationally expensive, and heavily dependent on dataset quality. This paper introduces Neurosurgeon, an efficient procedure that uses sparse autoencoders to identify and remove specific topics from a language model's internal representations. This approach offers precise control over model responses while preserving overall behavior. Experiments on the Gemma 2-9B model show Neurosurgeon's ability to reduce bias in targeted areas without altering the model's core functionality.
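
The sketch below illustrates one way such SAE-based removal can be realized: residual-stream activations are encoded by a sparse autoencoder, the latent features associated with the targeted topic are zeroed, and the reconstruction is written back into the model via a forward hook. The SparseAutoencoder class, the hooked layer, and the feature indices are hypothetical placeholders for illustration, not the paper's implementation.

```python
# Minimal sketch of SAE-based feature ablation, assuming a trained sparse
# autoencoder over a model's residual-stream activations. Names, shapes,
# and feature indices are illustrative only.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: x -> relu(W_enc x + b_enc) -> W_dec z + b_dec."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def make_ablation_hook(sae: SparseAutoencoder, feature_ids: list[int]):
    """Forward hook that zeroes the chosen SAE latents and writes the
    reconstruction back into the residual stream."""

    def hook(module, inputs, output):
        z = sae.encode(output)        # (batch, seq, d_sae) latent activations
        z[..., feature_ids] = 0.0     # remove the targeted topic features
        return sae.decode(z)          # re-project into model space

    return hook


if __name__ == "__main__":
    d_model, d_sae = 16, 64
    sae = SparseAutoencoder(d_model, d_sae)

    # Stand-in for one transformer layer whose residual stream we edit.
    layer = nn.Linear(d_model, d_model)
    layer.register_forward_hook(make_ablation_hook(sae, feature_ids=[3, 17]))

    x = torch.randn(2, 5, d_model)    # (batch, seq, d_model) activations
    edited = layer(x)                 # output with targeted features removed
    print(edited.shape)               # torch.Size([2, 5, 16])
```

In practice the hook would be attached at the layer where the SAE was trained, and the feature indices would be chosen by inspecting which SAE latents activate on the topic to be removed.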
