

Poster in Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

Wanda++: Pruning Large Language Models via Regional Gradients

Yifan Yang · Kai Zhen · Bhavana Ganesh · Aram Galstyan · Goeric Huybrechts · Markus Müller · Jonas Kübler · Rupak Swaminathan · Athanasios Mouchtaris · Sravan Babu Bodapati · Nathan Susanj · Zheng Zhang · Jack FitzGerald · Abhishek Kumar


Abstract:

Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ is the first to improve the pruning score with regional gradients, and it proposes an efficient regional optimization method to minimize pruning-induced discrepancies between dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on the language modeling task and generalizes effectively to downstream tasks. Moreover, despite using regional gradients for calibration, Wanda++ remains compatible with LoRA fine-tuning, which further reduces perplexity. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.
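To make the idea concrete, below is a minimal PyTorch sketch of a gradient-augmented, Wanda-style pruning criterion. The exact Wanda++ scoring formula and optimization procedure are not given in the abstract; the function names, the `alpha` weighting, and the 2:4 sparsity pattern are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Wanda-style score augmented with a regional-gradient term.
# Assumptions: the score combines the original Wanda criterion |W| * ||X||_2 with a
# gradient-magnitude term weighted by alpha; Wanda++'s actual formulation may differ.
import torch


def gradient_augmented_score(weight, act_norm, regional_grad, alpha=1.0):
    """Per-weight importance score.

    weight        : (out, in) weight matrix of a linear layer
    act_norm      : (in,) L2 norm of the layer's input activations over calibration data
    regional_grad : (out, in) gradient of a decoder-block-level (regional) loss w.r.t. weight
    alpha         : weighting of the gradient term (hypothetical hyperparameter)
    """
    wanda_score = weight.abs() * act_norm.unsqueeze(0)        # Wanda: |W| * ||X||_2
    grad_score = alpha * regional_grad.abs() * weight.abs()   # gradient-informed term
    return wanda_score + grad_score


def prune_2_4(weight, score):
    """Apply 2:4 semi-structured sparsity: keep the two highest-scoring weights
    in every group of four along the input dimension (assumes in_dim % 4 == 0)."""
    out_dim, in_dim = weight.shape
    groups = score.view(out_dim, in_dim // 4, 4)
    keep = torch.zeros_like(groups, dtype=torch.bool)
    keep.scatter_(-1, groups.topk(2, dim=-1).indices, True)
    return weight * keep.view(out_dim, in_dim).to(weight.dtype)
```

The regional optimization step described in the abstract would then adjust the remaining weights of each decoder block so that the sparse block's output matches the dense block's output on calibration data, e.g. by minimizing an MSE loss between the two outputs; that step is omitted here.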
