

Poster

Dynamic Low-Rank Sparse Adaptation for Large Language Models

Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Yang Liu · Jing Lin · Yiwu Yao · Rongrong Ji

Hall 3 + Hall 2B #265
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract: Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune sparse LLMs is an intuitive way to counter this predicament, but it has two shortcomings: 1) the inability to integrate the LoRA weights into the sparse LLM post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing inference latency. In particular, LoSA dynamically sparsifies the LoRA outputs based on the corresponding sparse weights during fine-tuning, guaranteeing that the LoRA module can be merged into the sparse LLM post-training. Moreover, to achieve an optimal sparse model architecture, LoSA leverages Representation Mutual Information (RMI) as an indicator of layer importance, dynamically determining the optimal layer-wise sparsity rates during fine-tuning. Building on this, LoSA adjusts the rank of each LoRA module according to the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning capacity to each layer to reduce the output discrepancy between the dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference burden. For example, LoSA reduces the perplexity of sparse LLaMA-2-7B by 68.73 and increases zero-shot accuracy by 16.32%, achieving a 2.60× speedup on CPU and a 2.23× speedup on GPU, while requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
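To make the mergeability claim concrete, below is a minimal PyTorch sketch (not the authors' implementation; class and parameter names such as SparseMaskedLoRALinear, rank, and merge are hypothetical) of the core idea: masking the low-rank update with the sparsity mask of the pruned base weight, so the LoRA update can be folded into the sparse weights after fine-tuning without changing the sparsity pattern. Per-layer rank selection from reconstruction errors and RMI-based sparsity allocation, as described in the abstract, are not shown here.

import torch
import torch.nn as nn


class SparseMaskedLoRALinear(nn.Module):
    """LoRA update constrained to the sparsity pattern of a pruned linear layer (illustrative sketch)."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_linear                      # frozen, already pruned
        self.base.weight.requires_grad_(False)
        # Sparsity mask of the pruned base weight (1 where the weight is kept).
        self.register_buffer("mask", (self.base.weight != 0).float())
        out_f, in_f = self.base.weight.shape
        # Standard LoRA factors; the rank could be chosen per layer,
        # e.g. from layer-wise reconstruction error as in the abstract.
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mask the low-rank update so (W + mask * B @ A) keeps the sparsity pattern.
        delta = (self.lora_B @ self.lora_A) * self.mask
        return nn.functional.linear(x, self.base.weight + delta, self.base.bias)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Fold the masked update into the sparse weight; the merged layer
        # stays sparse, so inference latency is unchanged.
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.copy_(self.base.weight + (self.lora_B @ self.lora_A) * self.mask)
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged

Because the update is masked with the same pattern as the pruned weight, merging adds no dense weights back, which is what allows the fine-tuned model to retain its CPU/GPU speedups.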
