Authors: Boyao Wang, Rui Pan, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Abstract: Small language models (SLMs) have attracted considerable attention from both
academia and industry due to their broad range of applications in edge devices.
To obtain SLMs with strong performance, conventional approaches either
pre-train the models from scratch, which incurs substantial computational
costs, or compress/prune existing large language models (LLMs), which results
in performance drops and falls short of pre-training. In this
paper, we investigate the family of acceleration methods that involve both
structured pruning and model training. We find that: 1) layer-wise adaptive pruning
(Adapt-Pruner) is highly effective for LLMs and yields significant
improvements over existing pruning techniques; 2) adaptive pruning equipped
with further training produces models comparable to those pre-trained from
scratch; and 3) incremental pruning brings non-trivial performance gains by
interleaving pruning with training and removing only a small portion of neurons
($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that
Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner,
FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense
benchmarks. Additionally, Adapt-Pruner restores the performance of
MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via
pruning from its larger counterparts, and discovers a new 1B model that
surpasses LLaMA-3.2-1B in multiple benchmarks.
Source: http://arxiv.org/abs/2502.03460v1
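
Below is a minimal, self-contained sketch (not the authors' implementation) of the incremental prune-then-train loop summarized in the abstract: each round removes a small fraction ($\sim$5%) of hidden neurons and is followed by a short recovery-training phase. The toy model, the L2-norm importance score, and all names (`TinyMLP`, `prune_hidden_neurons`, `incremental_prune_and_train`) are illustrative assumptions; Adapt-Pruner additionally chooses per-layer pruning ratios adaptively, which this sketch does not reproduce.

```python
# Illustrative sketch of interleaved pruning and training (assumed names/criteria,
# not the Adapt-Pruner codebase).
import torch
import torch.nn as nn


def prune_hidden_neurons(fc1: nn.Linear, fc2: nn.Linear, prune_frac: float):
    """Remove the least-important fraction of the hidden units shared by fc1/fc2.

    Importance here is a toy score (L2 norm of each hidden unit's outgoing
    weights); the paper's layer-wise adaptive criterion would replace this.
    """
    scores = fc2.weight.detach().norm(dim=0)                 # one score per hidden unit
    n_keep = max(1, int(scores.numel() * (1.0 - prune_frac)))
    keep = scores.topk(n_keep).indices.sort().values

    new_fc1 = nn.Linear(fc1.in_features, n_keep, bias=fc1.bias is not None)
    new_fc1.weight.data = fc1.weight.data[keep].clone()
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data[keep].clone()

    new_fc2 = nn.Linear(n_keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc2.weight.data = fc2.weight.data[:, keep].clone()
    if fc2.bias is not None:
        new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2


class TinyMLP(nn.Module):
    """Tiny stand-in model; a real setting would prune a transformer LLM."""

    def __init__(self, d_in=32, d_hidden=256, d_out=8):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


def incremental_prune_and_train(model, num_rounds=5, prune_frac=0.05, recover_steps=50):
    """Alternate small (~5%) pruning steps with short recovery-training phases."""
    # Synthetic data standing in for the recovery-training corpus.
    x = torch.randn(64, model.fc1.in_features)
    y = torch.randn(64, model.fc2.out_features)
    for _ in range(num_rounds):
        model.fc1, model.fc2 = prune_hidden_neurons(model.fc1, model.fc2, prune_frac)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(recover_steps):                        # recovery training after each prune
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
    return model


if __name__ == "__main__":
    model = incremental_prune_and_train(TinyMLP())
    print(model.fc1.out_features, "hidden units remain after incremental pruning")
```

The key design point mirrored from the abstract is that pruning is spread over many small steps with training in between, rather than removing a large fraction of neurons in a single shot.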