A batched GEMM optimization framework for deep learning
Z Yang, L Lu, R Wang - The Journal of Supercomputing, 2022 - Springer
Abstract
Generalized matrix multiplication (GEMM) is one of the most widely used algorithms in many fields, such as deep learning, astrophysics, signal processing, and advanced physical analysis. It plays an especially important role in deep learning, particularly for convolutional neural networks, because many of the calculations involved are converted into matrix multiplications to exploit the parallel processing power of GPUs. However, the sizes of the converted matrices are generally too small to fully occupy the GPU. In this paper, we focus on the impact of GEMM on deep learning and propose a framework for computing a batch of GEMMs in one kernel function so as to increase GPU occupancy. A suite of tiling strategies is designed for batches of matrices with small dimensions and variable sizes. The tiling strategy is determined by considering the kernel occupancy of each GEMM to fit different matrix sizes and GPU architectures. GoogLeNet is then implemented using MIOpen as a representative case, and the batched GEMM framework is integrated into it. The experimental results show that, compared with MAGMA, GoogLeNet optimized with our framework achieves speedups in elapsed time of 2.60× and 2.79× on AMD Radeon Instinct MI50 and MI100 GPUs, respectively.
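The core idea described in the abstract, replacing many individual small GEMM launches with a single batched computation, can be sketched as follows. This is only a minimal CPU-side illustration using NumPy's batched `matmul`; it is not the authors' GPU kernel, and the dimensions chosen are hypothetical examples of the small conv-derived matrices the abstract mentions.

```python
import numpy as np

# Batched GEMM sketch: many small matrix products issued as one call.
# The paper's framework launches a single GPU kernel with per-matrix
# tiling; here NumPy's batched matmul stands in for that idea on a
# batch of equally sized matrices.
rng = np.random.default_rng(0)
batch, m, k, n = 64, 16, 16, 16   # small dims, typical of conv-to-GEMM

A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# One batched call instead of `batch` separate GEMMs.
C = np.matmul(A, B)               # shape: (batch, m, n)

# Equivalent loop of individual GEMMs, for comparison.
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_loop)
```

On a GPU, the analogous win comes from launching one kernel whose thread blocks tile all the matrices in the batch at once, rather than paying per-launch overhead and leaving most compute units idle on each tiny GEMM.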