Back to the Feature: Toward a Feature-Centric Account of Brain–LM Alignment
Abstract
Large language models (LLMs) have emerged as powerful proxies for linguistic processing in the human brain, yet the standard practice of quantifying this alignment with a single scalar score obscures what drives it. This "scalar-centric" approach treats high-dimensional embeddings as atomic units: it establishes that alignment exists while remaining agnostic about how it arises. Here, we study brain–LLM alignment during naturalistic language comprehension, using intracranial electrocorticography (ECoG) recordings from human participants listening to a spoken narrative. We propose a shift toward a feature-centric perspective that inspects which embedding dimensions contribute to neural alignment. We show that naive feature analyses suggest homogeneity across cortical regions and time, but that this apparent homogeneity is a methodological artifact of polysemantic representations. By combining representational disentanglement via Sparse Autoencoders (SAEs) with sparse encoding models (Lasso), we uncover distinct feature subsets that dissociate cortical regions (superior temporal gyrus vs. inferior frontal gyrus) and temporal windows (pre- vs. post-onset). These results recast brain–LLM alignment not as aggregate similarity, but as a structural question of which computational dimensions map onto which neural dynamics.
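As a rough formalization of the sparse encoding step named in the abstract, the objective below gives the standard Lasso form applied to SAE features. All notation is assumed for illustration and does not come from the text: $z_t \in \mathbb{R}^{K}$ is the vector of SAE feature activations for stimulus item $t$, $y_{e,\tau,t}$ is the neural signal at electrode $e$ and lag $\tau$ relative to onset, $T$ is the number of items, and $\lambda$ is a regularization strength that would be chosen by cross-validation. The $1/(2T)$ scaling is one common convention, not necessarily the one used here.

```latex
% Assumed notation (hypothetical, not from the source):
%   z_t            SAE feature activations for item t   (K-dimensional)
%   y_{e,\tau,t}   neural signal at electrode e, lag \tau, item t
%   \lambda        L1 regularization strength
\hat{w}_{e,\tau}
  = \arg\min_{w \in \mathbb{R}^{K}}
    \frac{1}{2T} \sum_{t=1}^{T} \bigl( y_{e,\tau,t} - w^{\top} z_t \bigr)^{2}
    + \lambda \lVert w \rVert_{1},
\qquad
S_{e,\tau} = \bigl\{\, k : \hat{w}_{e,\tau,k} \neq 0 \,\bigr\}.
```

Under this sketch, the L1 penalty drives most weights to exactly zero, so each electrode and lag is left with a support set $S_{e,\tau}$ of selected features. A feature-centric comparison then amounts to asking how these supports differ between regions (for example, electrodes in superior temporal gyrus vs. inferior frontal gyrus) and between pre- and post-onset lags, rather than comparing a single correlation score.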