Poster in Workshop: 5th Workshop on Practical ML for Limited/Low Resource Settings (PML4LRS) @ ICLR 2024
Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes
Zeman Li · Xinwei Zhang · Meisam Razaviyayn
Fine-tuning language models (LMs) with first-order optimizers often demands excessive memory, limiting accessibility, while zeroth-order optimizers use less memory but can suffer from slow convergence that worsens with model size. We introduce a novel method, Addax, that integrates the recently introduced Memory-Efficient Zeroth-order Optimizer (MeZO) of Malladi et al. (2023) with Stochastic Gradient Descent (SGD). Addax obtains zeroth-order and first-order gradient estimates and optimally combines them as the descent direction in each step. The first-order updates are performed "in-place" to further save memory. Theoretically, we establish the convergence of Addax under mild assumptions, showing that it admits less restrictive hyperparameter choices and a convergence rate independent of model size. Our extensive experiments with diverse LMs and tasks show that Addax consistently outperforms zero-shot evaluation and MeZO in accuracy. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as SGD and Adam, in specific scenarios while requiring significantly less memory.
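To make the core idea concrete, here is a minimal toy sketch of a step that mixes a first-order gradient (a forward-backward pass) with a MeZO-style zeroth-order estimate (two forward-only passes with a shared random perturbation, i.e. SPSA) and applies the combined direction in-place. The quadratic loss, the mixing weight `alpha`, and all function names are illustrative assumptions, not the paper's prescribed formulation.

```python
import random

# Toy quadratic loss: L(w) = (w0 - 3)^2 + (w1 + 1)^2, minimized at (3, -1).
def loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

# Analytic first-order gradient (stands in for a backprop pass).
def grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

# Zeroth-order SPSA estimate: two forward passes with a shared random
# perturbation z, no backward pass (the memory-saving ingredient in MeZO).
def zo_grad(w, eps=1e-3):
    z = [random.gauss(0.0, 1.0) for _ in w]
    w_plus = [wi + eps * zi for wi, zi in zip(w, z)]
    w_minus = [wi - eps * zi for wi, zi in zip(w, z)]
    scale = (loss(w_plus) - loss(w_minus)) / (2.0 * eps)
    return [scale * zi for zi in z]

def addax_like_step(w, lr=0.1, alpha=0.7):
    g_fo = grad(w)     # first-order estimate (forward-backward)
    g_zo = zo_grad(w)  # zeroth-order estimate (forward-only)
    # Convex combination as the descent direction; alpha is a hypothetical
    # fixed mixing weight chosen for this sketch.
    for i in range(len(w)):
        w[i] -= lr * (alpha * g_fo[i] + (1.0 - alpha) * g_zo[i])  # in-place update
    return w

random.seed(0)
w = [0.0, 0.0]
for _ in range(200):
    addax_like_step(w)
print(round(w[0], 2), round(w[1], 2))  # approaches the minimizer (3, -1)
```

In an actual LM setting, the first-order term would come from backpropagation on a small batch and the zeroth-order term from two extra forward passes, so the memory footprint is dominated by a single backward pass plus forward-only evaluations.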