DECODING LOGICAL NEGATION IN LARGE LANGUAGE MODELS: FROM STATISTICAL HEURISTICS TO CAUSAL SEMANTIC CIRCUITS
Umair Tariq ⋅ Brian Cong ⋅ Archish Prakhya ⋅ Tinuade Adeleke ⋅ Sean Wu ⋅ Ruizhe Li
Abstract
We investigate the internal computational mechanisms that activate when large language models process atomic logical operations, focusing specifically on logical negation. Using sparse autoencoders (SAEs), we decompose high-dimensional residual stream activations into interpretable, localized features. We present a two-stage investigation to separate true logical abstraction from statistical pattern matching. In the exploratory stage, we show that smaller autoregressive models (e.g., GPT-2 Small) fail to encode formal logical abstractions: they achieve near-random accuracy on synthetic logical extraction tasks and rely instead on shallow bag-of-words heuristics. The primary stage therefore moves to Gemma-2-27B and uses a tightly controlled "nonce" (pseudoword) dataset to isolate Boolean reasoning from real-world semantic priors. We identify a sparse set of features at Layer 10 that serve as the causal locus of negation. Intervening on these features causally inverts the model's logical state, and we demonstrate that they act as generalized semantic operators, activating robustly across diverse negators ("no", "never", "fail", "un-"), rather than as mere lexical detectors. Finally, circuit tracing reveals a feed-forward pathway in which ablating early-layer features collapses downstream representations by approximately 40%.
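As an illustration of the kind of intervention described above, the following is a minimal sketch of SAE feature ablation on a single activation vector. It assumes a standard ReLU autoencoder; the weights, dimensions, ablated feature index, and the norm-based collapse measure are all hypothetical toy choices, not the paper's actual model, features, or metric.

```python
# Toy sketch of sparse-autoencoder (SAE) feature ablation.
# Every weight, size, and index below is invented for illustration only.
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def sae_decode(f, w_dec, b_dec):
    """Reconstruction x_hat = W_dec f + b_dec."""
    return [o + b for o, b in zip(matvec(w_dec, f), b_dec)]

def ablate(f, idx):
    """Zero the selected features and leave the others unchanged."""
    return [0.0 if i in idx else a for i, a in enumerate(f)]

def relative_drop(clean, ablated):
    """Fractional decrease in L2 norm after ablation (an assumed metric)."""
    norm = lambda v: math.sqrt(sum(a * a for a in v))
    return 1.0 - norm(ablated) / norm(clean)

# Hypothetical 2-dimensional "residual stream" and 3-feature SAE.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
B_DEC = [0.0, 0.0]

x = [2.0, 1.0]
features = sae_encode(x, W_ENC, B_ENC)
clean = sae_decode(features, W_DEC, B_DEC)
# Feature 2 stands in for a candidate "negation" feature.
ablated = sae_decode(ablate(features, {2}), W_DEC, B_DEC)
drop = relative_drop(clean, ablated)
```

In a real experiment the ablated reconstruction would be patched back into the model's forward pass and the effect measured on downstream features or output logits; this toy version only compares the two reconstructions directly.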