

Poster

MoDeGPT: Modular Decomposition for Large Language Model Compression

Chi-Heng Lin · Shangqian Gao · James Smith · Abhishek Patel · Shikhar Tuli · Yilin Shen · Hongxia Jin · Yen-Chang Hsu

Hall 3 + Hall 2B #227
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT
 
Oral presentation: Oral Session 6B
Sat 26 Apr 12:30 a.m. PDT — 2 a.m. PDT

Abstract:

Large Language Models (LLMs) have significantly advanced AI with their exceptional performance across a wide range of tasks. However, their extensive computational requirements restrict their use on devices with limited resources. While recent compression methods based on low-rank matrices show potential solutions, they often suffer from significant loss of accuracy or introduce substantial overhead in parameters and inference time. In this paper, we introduce Modular Decomposition (MoDeGPT), a new, efficient, and structured compression framework that overcomes these limitations. MoDeGPT jointly decomposes pairs of consecutive subcomponents within Transformer blocks, reduces hidden dimensions through output reconstruction on a larger structural scale than conventional low-rank methods, and repurposes three classical matrix decomposition algorithms (Nyström approximation, CR decomposition, and SVD) to ensure bounded errors in our novel decomposition approach. Our experiments show that MoDeGPT, without relying on backward propagation, consistently matches or surpasses the performance of prior techniques that depend on gradient information, while achieving a 98% reduction in compute costs when compressing a 13B-parameter model. On LLaMA-2/3 and OPT models, MoDeGPT retains 90-95% of zero-shot performance with compression rates of 25-30%. The compression process can be completed on a single GPU in a few hours, boosting inference throughput by up to 46%.
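To make the idea of decomposing a pair of consecutive matrices concrete, here is a minimal NumPy sketch. It is not the MoDeGPT algorithm: it only applies a plain truncated SVD to the product of two random matrices, without the calibration data or the per-module choice among Nyström, CR, and SVD that the abstract describes. The function name `compress_pair`, the matrix shapes, and the rank `k` are all hypothetical and chosen for illustration.

```python
import numpy as np


def compress_pair(w_in, w_out, k):
    """Replace the product w_in @ w_out with a rank-k pair (a, b).

    Illustrative only: a truncated SVD of the product, so the shared
    hidden dimension shrinks from w_in.shape[1] to k.
    """
    product = w_in @ w_out
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:k])
    a = u[:, :k] * root            # shape (d_in, k)
    b = root[:, None] * vt[:k]     # shape (k, d_out)
    # Frobenius error of the best rank-k approximation: the norm of the
    # discarded singular values.
    tail_error = float(np.sqrt(np.sum(s[k:] ** 2)))
    return a, b, tail_error


# Hypothetical sizes: a 64-dimensional model with a 32-dimensional hidden layer.
rng = np.random.default_rng(0)
w_in = rng.standard_normal((64, 32))
w_out = rng.standard_normal((32, 64))

a, b, tail_error = compress_pair(w_in, w_out, k=24)
actual_error = float(np.linalg.norm(w_in @ w_out - a @ b))

params_before = w_in.size + w_out.size
params_after = a.size + b.size
print(f"parameters: {params_before} -> {params_after}")
print(f"reconstruction error: {actual_error:.4f} (predicted {tail_error:.4f})")
```

In this toy setting the reconstruction error is known in advance from the discarded singular values, which is the sense in which a classical decomposition gives a bounded error. Shrinking the shared dimension from 32 to 24 removes a quarter of the pair's parameters.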
