Shfl-BW: Accelerating deep neural network inference with tensor-core aware weight pruning
Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022 • dl.acm.org
Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss.
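To make the contrast concrete, the NumPy sketch below (an illustration, not code from the paper; names and the 16x16 block size are assumptions) shows the two baseline patterns the abstract compares: unstructured (random) pruning, which zeros individual weights, versus block-wise pruning, which zeros whole tiles so that surviving nonzeros stay dense enough to feed tensor-core fragments.

import numpy as np

def unstructured_prune(w, sparsity):
    # Zero out the smallest-magnitude weights individually (random sparsity).
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def blockwise_prune(w, sparsity, block=(16, 16)):
    # Zero out whole blocks with the lowest mean magnitude, so the surviving
    # nonzeros form dense tiles that map onto tensor-core instructions.
    br, bc = block
    rows, cols = w.shape
    scores = np.abs(w).reshape(rows // br, br, cols // bc, bc).mean(axis=(1, 3))
    k = int(sparsity * scores.size)
    thresh = np.sort(scores, axis=None)[k]
    mask = np.repeat(np.repeat(scores >= thresh, br, axis=0), bc, axis=1)
    return w * mask

w = np.random.randn(64, 64)
print(np.count_nonzero(unstructured_prune(w, 0.75)))  # ~25% of entries survive, scattered
print(np.count_nonzero(blockwise_prune(w, 0.75)))     # same budget, but in whole tiles

Unstructured pruning is free to keep the most important individual weights (hence better accuracy), while the block-wise variant trades that freedom for tensor-core-friendly tiles, which is exactly the tension the abstract describes.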
In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we accelerate the computation-intensive layers of Transformer [1] by 1.81×, 4.18× and 1.90× on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.
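As a rough sketch of the idea (under assumptions, not the authors' algorithm): permute the columns of the weight matrix before block-wise pruning, then fold the same permutation into the kernel's load of the dense operand so it costs only an indexed gather. The norm-sort heuristic below is hypothetical, and the paper's row-wise permutation is omitted for brevity.

import numpy as np

def shuffled_blockwise_prune(w, sparsity, block=(16, 16)):
    # Hypothetical permutation heuristic: sort columns by L2 norm so that
    # similarly important columns land in the same block before pruning.
    perm = np.argsort(np.linalg.norm(w, axis=0))
    w_perm = w[:, perm]
    # Ordinary block-wise magnitude pruning on the permuted matrix.
    br, bc = block
    rows, cols = w.shape
    scores = np.abs(w_perm).reshape(rows // br, br, cols // bc, bc).mean(axis=(1, 3))
    k = int(sparsity * scores.size)
    thresh = np.sort(scores, axis=None)[k]
    mask = np.repeat(np.repeat(scores >= thresh, br, axis=0), bc, axis=1)
    return w_perm * mask, perm

def matmul_with_shuffle(w_pruned_perm, perm, x):
    # Permuting W's columns and X's rows by the same permutation preserves
    # W @ X, so the shuffle reduces to an indexed load in the kernel.
    return w_pruned_perm @ x[perm, :]

w = np.random.randn(64, 64)
x = np.random.randn(64, 8)
w_p, perm = shuffled_blockwise_prune(w, 0.75)
print(matmul_with_shuffle(w_p, perm, x).shape)  # (64, 8)

Because the permutation is fixed at pruning time, the surviving blocks are still dense tiles and only the indexing of the dense operand changes, which is consistent with the abstract's claim that the permutation adds negligible overhead.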