ICLR Poster Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang · Congliang Chen · Ziniu Li · Tian Ding · Chenwei Wu · Diederik (Durk) Kingma · Yinyu Ye · Zhi-Quan Luo · Ruoyu Sun

Hall 3 + Hall 2B #280

Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We propose Adam-mini, an optimizer that performs on par with or better than AdamW with a $50\%$ smaller memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By examining the Hessian structure of neural nets, we find that Adam's $v$ might not function as effectively as expected. We find that $\geq 99.9\%$ of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; and (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves $49.6\%$ higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves $33\%$ of the wall-clock time for pre-training.
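
The idea of keeping one learning rate per parameter block, rather than one per parameter, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: the function name `adam_mini_step` and the `state` layout are hypothetical, each whole tensor is treated as a single block (the abstract calls for a Hessian-guided partition, which is not reproduced here), the per-block second moment is taken to be the mean of squared gradients within the block, and weight decay is omitted.

```python
import numpy as np

def adam_mini_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place update over a dict of float parameter blocks (name -> ndarray).

    Assumption: each tensor is one block; a Hessian-based partition is not modeled.
    """
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for name, p in params.items():
        g = grads[name]
        # First moment: kept per parameter, exactly as in Adam.
        m = state.setdefault(("m", name), np.zeros_like(p))
        m *= beta1
        m += (1.0 - beta1) * g
        # Second moment: a single scalar per block (assumed: mean of g^2).
        v = beta2 * state.get(("v", name), 0.0) + (1.0 - beta2) * float(np.mean(g * g))
        state[("v", name)] = v
        # Standard Adam bias correction, applied to both moments.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The whole block shares one step size lr / (sqrt(v_hat) + eps).
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

# Usage: two blocks, 15 parameters in total.
params = {"w": np.ones((4, 3)), "b": np.zeros(3)}
grads = {"w": np.full((4, 3), 2.0), "b": np.full(3, -1.0)}
state = {}
adam_mini_step(params, grads, state)
```

In this sketch the optimizer state holds 15 first-moment entries but only 2 second-moment scalars, where Adam would hold 15 of each; this is where the roughly halved optimizer memory comes from. The throughput and wall-clock figures are also consistent with each other: $1/1.496 \approx 0.668$, so the same number of tokens takes about $33\%$ less time.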
