Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
Abstract: Designing efficient optimizers for large language models (LLMs) with
low-memory requirements and fast convergence is an important and challenging
problem. This paper takes a step toward the systematic design of such
optimizers through the lens of structured Fisher information matrix (FIM)
approximation. We show that many state-of-the-art efficient optimizers can be
viewed as solutions to FIM approximation (under the Frobenius norm) with
specific structural assumptions. Building on these insights, we propose two
design recommendations of practical efficient optimizers for LLMs, involving
the careful selection of structural assumptions to balance generality and
efficiency, and enhancing memory efficiency of optimizers with general
structures through a novel low-rank extension framework. We demonstrate how to
use each design approach by deriving new memory-efficient optimizers: Row and
Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
effectiveness of both approaches, showing faster and better convergence than
existing memory-efficient baselines and Adam, with little memory overhead.
Notably, Alice converges more than 2x faster than Adam, while RACS delivers
strong performance on the 1B model with SGD-like memory.
Source: http://arxiv.org/abs/2502.07752v1
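
As a schematic reading of the abstract's framing (the notation below is illustrative and not taken from the paper), viewing an optimizer as a structured FIM approximation amounts to projecting the Fisher information matrix F_t onto a chosen structured family S under the Frobenius norm, and using the result as a preconditioner:

\[
\hat{F}_t \;=\; \operatorname*{arg\,min}_{M \in \mathcal{S}} \,\bigl\| M - F_t \bigr\|_F^2,
\qquad
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \,\hat{F}_t^{-1} g_t ,
\]

where \(\mathcal{S}\) might be, for example, a diagonal, block-diagonal, or Kronecker-factored family, and \(g_t\) is the stochastic gradient at step t. Under this view, the choice of \(\mathcal{S}\) trades curvature fidelity against the memory needed to store and invert \(\hat{F}_t\), which is the design axis the two recommendations above operate on.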