

Poster
in
Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

Finding Sparse Autoencoder Representations Of Errors In CoT Prompting

Justin Theodorus · V Swaytha · Shivani Gautam · Adam Ward · Mahir Shah · Cole Blondin · Kevin Zhu


Abstract:

Current large language models often suffer from subtle, hard-to-detect reasoning errors in their intermediate chain-of-thought (CoT) steps. These errors include logical inconsistencies, factual hallucinations, and arithmetic mistakes, which compromise trust and reliability. While previous research applies mechanistic interpretability to final outputs, understanding and categorizing internal reasoning errors remains challenging. The complexity and non-linear nature of these CoT sequences call for methods to uncover structured patterns hidden within them. As an initial step, we evaluate Sparse Autoencoder (SAE) activations within neural networks to investigate how specific neurons contribute to different types of errors.
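As a rough illustration of the kind of analysis the abstract describes (not the authors' code), the sketch below encodes hidden states from correct and erroneous CoT steps with a ReLU sparse autoencoder and ranks the features that activate more strongly on the erroneous steps. All dimensions, weights, and data are hypothetical placeholders; in practice the hidden states would come from a language model's residual stream and the SAE weights from a trained autoencoder.

```python
# Hypothetical sketch: comparing SAE feature activations on hidden states
# from correct vs. erroneous CoT steps. Weights and data are placeholders.
import torch

torch.manual_seed(0)

d_model, d_sae = 768, 4096           # assumed model width and SAE dictionary size
W_enc = torch.randn(d_model, d_sae)  # placeholder for a trained SAE encoder
b_enc = torch.zeros(d_sae)

def sae_features(hidden_states: torch.Tensor) -> torch.Tensor:
    """Encode hidden states into sparse feature activations (ReLU encoder)."""
    return torch.relu(hidden_states @ W_enc + b_enc)

# Placeholder hidden states for tokens drawn from correct and erroneous CoT steps.
h_correct = torch.randn(200, d_model)
h_error = torch.randn(200, d_model)

# Mean activation per SAE feature in each group, and the features that fire
# most strongly on erroneous steps relative to correct ones.
diff = sae_features(h_error).mean(0) - sae_features(h_correct).mean(0)
top_error_features = torch.topk(diff, k=10).indices
print("Candidate error-associated SAE features:", top_error_features.tolist())
```

Such a ranking only surfaces candidate features; attributing an error type (logical, factual, arithmetic) to a feature would still require inspecting the contexts in which it activates.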
