Layered mixed-precision training: A new training method for large-scale AI models

H Li, Y Wang, Y Hong, F Li, X Ji - Journal of King Saud University-Computer and Information Sciences, 2023 - Elsevier
How to train large-scale AI models quickly and efficiently has become a central topic in deep learning. Mixed-precision training is an effective technique for speeding up training and reducing memory usage. Current automatic mixed-precision methods mainly use half precision (FP16) for the matrix operations of forward and backward propagation across the entire model, while maintaining FP32 master copies of the weights to avoid rounding errors during accumulation. However, this approach is not optimized for each layer individually, which can lead to poor convergence in large-scale model training because different layers exhibit different data patterns. This paper therefore proposes a layered mixed-precision training method that flexibly adjusts the training precision of each layer according to its contribution to the training result. With the layered mixed-precision method, a ResNet model achieves a 1.9× speedup over the baseline with a smaller accuracy loss. In addition, this paper combines the layered mixed-precision method with distributed training strategies. With data-parallel training, the model achieves a 3.74× speedup on four Tesla V100 GPUs. The method's applicability to model-parallel training is also verified: with optimized pipeline-parallel training, the model achieves a 3.26× speedup on three Tesla V100 GPUs.
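The core idea of the abstract can be sketched in a few lines: keep FP32 master weights, but run each layer's matrix multiply in a per-layer precision rather than forcing FP16 everywhere. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the per-layer precision assignment here is hypothetical (the paper derives it from each layer's contribution to the training result).

```python
import numpy as np

def layered_forward(x, layers, precisions):
    """Forward pass where each layer's matmul runs in its assigned precision.

    layers: list of FP32 weight matrices (the master copies kept in full
            precision, as in standard mixed-precision training).
    precisions: per-layer compute dtype, e.g. np.float16 or np.float32.
            This fixed assignment is a stand-in for the paper's
            contribution-based per-layer precision selection.
    """
    for w, dt in zip(layers, precisions):
        # Cast activations and weights to the layer's precision for the
        # matmul, then cast the result back to FP32 for the next layer.
        x = (x.astype(dt) @ w.astype(dt)).astype(np.float32)
    return x

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)).astype(np.float32) for _ in range(3)]
# Hypothetical assignment: first two layers in FP16, last layer kept in FP32.
precisions = [np.float16, np.float16, np.float32]
x = rng.standard_normal((4, 8)).astype(np.float32)
y = layered_forward(x, layers, precisions)
```

In a real training loop the same per-layer dtype map would also govern the backward-pass matmuls, while gradient updates are applied to the FP32 master weights.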