APOLLO: SGD-like Memory, AdamW-level Performance

Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee

Abstract: Large language models (LLMs) are notoriously memory-intensive during
training, particularly with the popular AdamW optimizer. This memory burden
necessitates using more or higher-end GPUs or reducing batch sizes, limiting
training scalability and throughput. To address this, various memory-efficient
optimizers have been proposed to reduce optimizer memory usage. However, they
face critical challenges: (i) reliance on costly SVD operations; (ii)
significant performance trade-offs compared to AdamW; and (iii) still
substantial optimizer memory overhead to maintain competitive performance.
In this work, we identify that AdamW’s learning rate adaptation rule can be
effectively coarsened as a structured learning rate update. Based on this
insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM
Optimization (APOLLO), which approximates learning rate scaling using an
auxiliary low-rank optimizer state based on pure random projection. This
structured learning rate update rule makes APOLLO highly tolerant to further
memory reductions while delivering comparable pre-training performance. Even
its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance
compared to AdamW with SGD-level memory costs.
Extensive experiments demonstrate that the APOLLO series performs on-par with
or better than AdamW, while achieving greater memory savings by nearly
eliminating the optimization states of AdamW. These savings provide significant
system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB
setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model
Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without
system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training
LLaMA-7B on a single GPU using less than 12 GB of memory with weight
quantization.

Source: http://arxiv.org/abs/2412.05270v1

About the Author

Leave a Reply

Your email address will not be published. Required fields are marked *

You may also like these