A UNIFIED FRAMEWORK FOR SHAPE-PRESERVING COMPRESSION OF LARGE LANGUAGE MODELS
Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Abstract
Large language models (LLMs) exhibit remarkable performance across a wide range of natural language processing tasks, but their immense computational and memory demands limit deployment in resource-constrained environments. To address this challenge, we propose NoWA (Normalized Weight and Activation Compression), a unified framework for zero-shot shape-preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8B models using two popular forms of shape-preserving compression: vector quantization, with NoWA-VQ (NoWA for Vector Quantization), and unstructured/structured pruning, with NoWA-P (NoWA for Pruning). We found that NoWA-VQ significantly outperforms state-of-the-art zero-shot VQ, and that NoWA-P is competitive with state-of-the-art pruning methods.
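To make the notion of shape-preserving compression concrete, the sketch below is an illustrative example only, not the NoWA algorithm: it shows how both unstructured pruning and a simple scalar (1-dimensional) quantizer return a weight matrix with the same shape as the original, so the compressed weights are a drop-in replacement for the dense ones. The function names `prune_unstructured` and `vq_reconstruct` are our own placeholders and are not taken from the paper.

```python
import numpy as np

def prune_unstructured(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries; the output keeps W's shape."""
    k = int(sparsity * W.size)
    threshold = np.sort(np.abs(W), axis=None)[k]  # magnitude cutoff
    mask = np.abs(W) >= threshold
    return W * mask

def vq_reconstruct(W: np.ndarray, num_centroids: int = 16, iters: int = 20) -> np.ndarray:
    """Scalar k-means quantization of the entries (a 1-D special case of VQ);
    every entry is replaced by its nearest centroid, so W's shape is preserved."""
    flat = W.reshape(-1)
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, num_centroids))
    for _ in range(iters):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(num_centroids):
            if np.any(assign == c):
                centroids[c] = flat[assign == c].mean()
    return centroids[assign].reshape(W.shape)

# Both compressed matrices have the same shape as the dense original.
W = np.random.randn(128, 256)
assert prune_unstructured(W, 0.5).shape == W.shape
assert vq_reconstruct(W).shape == W.shape
```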