CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Abstract
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, existing conversion methods typically apply naïve singular value decomposition (SVD): they minimize the difference between weight matrices rather than the effect of those weights on input activations, ignore the covariance structure of the activations, and enforce a uniform rank across layers, which causes activation drift and degrades attention fidelity. To address these issues, we propose CARE (Covariance-Aware, Rank-Enhanced), an MLA conversion pipeline that operates under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than the weights alone; (ii) adjusted-rank allocation, which distributes a fixed KV budget across layers by giving more capacity to the layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V projections to fit the MLA format while keeping the KV-cache size unchanged. Under a matched KV-cache budget, our method consistently outperforms a uniform-rank SVD baseline on Llama-3-8B, delivering up to 331% relative gains in one-shot evaluation (higher accuracy, lower perplexity). With a brief post-SVD “healing” fine-tune, we fully recover the original model’s accuracy.
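To make the first two steps concrete, the sketch below shows one plausible reading of them: an activation-whitened SVD in place of a plain weight SVD, and a proportional rank split under a fixed total budget. The function names, the PyTorch usage, and the energy-based allocation heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def covariance_aware_factorize(W: torch.Tensor, X: torch.Tensor, rank: int, eps: float = 1e-6):
    """Low-rank factorization W ≈ B @ A that minimizes error on the activations
    X @ W.T (an activation-whitened SVD), rather than on the raw weights.

    W: (d_out, d_in) weight of a K or V projection.
    X: (n_tokens, d_in) calibration activations fed into that projection.
    Returns B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    # Activation second-moment matrix and its symmetric square root / inverse root.
    C = (X.T @ X) / X.shape[0] + eps * torch.eye(X.shape[1], dtype=X.dtype, device=X.device)
    evals, evecs = torch.linalg.eigh(C)
    evals = evals.clamp_min(eps)
    C_half = evecs @ torch.diag(evals.sqrt()) @ evecs.T
    C_half_inv = evecs @ torch.diag(evals.rsqrt()) @ evecs.T

    # SVD of the whitened weight, truncated to the target rank; un-whiten the right factor.
    U, S, Vh = torch.linalg.svd(W @ C_half, full_matrices=False)
    B = U[:, :rank]
    A = torch.diag(S[:rank]) @ Vh[:rank] @ C_half_inv
    return B, A


def allocate_ranks(whitened_spectra, total_budget: int):
    """Hypothetical heuristic: split a fixed total rank budget across layers in
    proportion to the spectral energy of each layer's whitened weight.
    Rounded ranks may need a small correction to sum exactly to the budget."""
    energies = [s.square().sum().item() for s in whitened_spectra]
    total = sum(energies)
    return [max(1, round(total_budget * e / total)) for e in energies]
```

In this reading, plain SVD corresponds to setting C to the identity; the whitening step is what makes the factorization "covariance-aware," since directions that the calibration activations actually excite are preserved preferentially.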