Poster
Dynamic Low-Rank Sparse Adaptation for Large Language Models
Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Yang Liu · Jing Lin · Yiwu Yao · Rongrong Ji
Hall 3 + Hall 2B #265
Abstract:
Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune sparse LLMs is an intuitive way to counter this degradation, but it has shortcomings: 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic $\textbf{Lo}$w-rank $\textbf{S}$parse $\textbf{A}$daptation $\textbf{(LoSA)}$, a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Moreover, to achieve an optimal sparse model architecture, LoSA leverages Representation Mutual Information (RMI) as an indicator of layer importance, thereby dynamically determining the optimal layer-wise sparsity rates during fine-tuning. Building on this, LoSA adjusts the rank of the LoRA modules according to the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning capacity to each layer to reduce the output discrepancy between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the performance of sparse LLMs within a few hours, without introducing any additional inference burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by $\textbf{68.73}$$\downarrow$ and increased zero-shot accuracy by $\textbf{16.32}$%$\uparrow$, achieving a $\textbf{2.60$\times$}$ speedup on CPU and a $\textbf{2.23$\times$}$ speedup on GPU, while requiring only $\textbf{45 minutes}$ of fine-tuning on $\textbf{a single}$ NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
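The key merging property described above can be illustrated with a minimal PyTorch sketch: masking the low-rank update with the sparsity pattern of the pruned weight lets the LoRA branch be folded back into the sparse matrix without densifying it. This is an illustrative sketch only, not the authors' implementation; the function name, the `alpha` scaling factor, and the mask construction are assumptions for exposition.

```python
import torch

def merge_lora_into_sparse_weight(W_sparse, A, B, alpha=1.0):
    """Illustrative sketch: merge a LoRA update into a pruned weight matrix
    while preserving its sparsity pattern (assumed helper, not the official API).

    W_sparse: (out, in) pruned weight matrix, where zeros mark pruned entries
    A:        (r, in) LoRA down-projection
    B:        (out, r) LoRA up-projection
    """
    mask = (W_sparse != 0).to(W_sparse.dtype)   # sparsity pattern of the pruned weight
    delta = alpha * (B @ A)                     # dense low-rank update from the LoRA branch
    return W_sparse + delta * mask              # masked update keeps the sparse pattern intact
```

Because the update is restricted to the surviving weight positions, the merged matrix has the same sparsity as before, so the fine-tuned model adds no inference-time overhead relative to the pruned baseline.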