Steering Large Language Models Toward Clarification through Sparse Autoencoders
Abstract
Instruction-tuned LLMs often respond to ambiguous instructions by guessing missing details rather than asking clarifying questions. Clarification-seeking improves reliability by aligning responses with user intent and avoiding unfounded assumptions about under-specified details. This is especially important for embodied AI, where misinterpretations can translate into task failure or safety risks. We propose ClarifySAE, an inference-time method that steers a model toward clarification-seeking by intervening on Sparse Autoencoder (SAE) features. ClarifySAE ranks SAE features using ClarifyScore, which measures their association with clarification contexts, and filters them with OutputScore to retain only features that measurably affect the model’s output distribution. During decoding, we apply additive biases to the selected features, increasing the likelihood of generating a clarifying question without updating model weights. We evaluate our method on two datasets of ambiguous instructions (AmbiK and ClarQ-LLM) and two instruction-tuned Gemma models (2B and 9B) using pretrained 16k-feature SAEs. On AmbiK with Gemma-2-9B-IT, ClarifySAE increases the clarification rate from 0.61 to 0.95 and improves task success from 0.06 to 0.21.
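As a rough illustration of the decoding-time intervention only, the sketch below shows one way an additive bias on selected SAE features could be mapped back onto a hidden state. It assumes a linear SAE decoder, so that adding a constant to a feature is equivalent to adding that constant times the feature's decoder direction. The names `steer_hidden_state`, `W_dec`, `feature_ids`, and `bias` are hypothetical, the weights are random placeholders, and the feature selection by ClarifyScore and OutputScore is not reproduced here because the abstract does not define those scores.

```python
import numpy as np

def steer_hidden_state(h, W_dec, feature_ids, bias):
    """Shift hidden states along the decoder directions of selected SAE features.

    h           : array of shape (..., d_model), hidden states at one layer
    W_dec       : array of shape (n_features, d_model), SAE decoder weights (assumed linear)
    feature_ids : indices of the features to bias (assumed already selected)
    bias        : scalar added to each selected feature activation
    """
    # With a linear decoder, adding `bias` to a feature changes the
    # reconstruction by bias * (that feature's decoder row).
    delta = bias * W_dec[feature_ids].sum(axis=0)
    # Add only the change, leaving the rest of the hidden state untouched.
    return h + delta

# Toy usage with random placeholder weights (not a real pretrained SAE).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_dec = rng.normal(size=(n_features, d_model))
h = rng.normal(size=(3, d_model))  # three token positions
steered = steer_hidden_state(h, W_dec, [4, 17], bias=2.0)
```

In a real setting this shift would be applied inside the model's forward pass at the layer the SAE was trained on; whether ClarifySAE adds only the decoded difference, as done here, or replaces the hidden state with a full SAE reconstruction is not stated in the abstract.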