Authors: Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda
Abstract: Deep learning (DL) models based on the transformer architecture have
revolutionized many DL applications such as large language models (LLMs),
vision transformers, audio generation, and time series prediction. Much of this
progress has been fueled by distributed training, yet distributed communication
remains a substantial bottleneck to training progress. This paper examines the
communication behavior of transformer models, that is, how the different
parallelism schemes used in multi-node/multi-GPU DL training communicate data
in the context of transformers. We use GPT-based language models as a case
study of the transformer architecture due to their ubiquity. We validate the
empirical results obtained from our communication logs using analytical models.
At a high level, our analysis reveals a need to further optimize small-message
point-to-point communication, identifies correlations between sequence length,
per-GPU throughput, model size, and the optimizations used, and points to where
further optimization efforts in framework and HPC middleware design could be
directed.
Source: http://arxiv.org/abs/2408.10197v1