Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference

Jaradat, Ghadeer; Tolba, Mohammed; Alsuhli, Ghada; Saleh, Hani; Al-Qutayri, Mahmoud; Stouraitis, Thanos; Mohammad, Baker

Computer Science > Machine Learning

arXiv:2407.12893 (cs)

[Submitted on 17 Jul 2024]

Abstract:Transformer models have become central to deep learning, driving advances in applications ranging from language understanding to image recognition. Despite this success, deploying these models in real-time applications, particularly on edge devices, poses significant challenges due to their quadratic computational complexity and memory demands. To overcome these challenges, we introduce Hybrid Dynamic Pruning (HDP), an efficient algorithm-architecture co-design approach that accelerates Transformers by exploiting head sparsity, block sparsity, and approximation opportunities to reduce attention computations and memory accesses. Observing the large redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning method that prunes unimportant blocks in the attention matrix at run time, and an integer-based head pruning method that detects and prunes unimportant heads at an early stage at run time. We also propose an approximation method that reduces attention computations. To support these methods efficiently, with lower latency and better power efficiency, we propose an HDP co-processor architecture.
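The abstract does not spell out how blocks are scored or selected, so the following is only a minimal sketch of what row-balanced block pruning of an attention matrix could look like, not the authors' implementation. The function names, the block size, the number of kept blocks per block row, the importance metric (sum of absolute values per block), and the use of rounded scores as a stand-in for the integer-based decision are all assumptions made for illustration. Head pruning and the approximation method are not covered.

```python
import numpy as np

def row_balanced_block_mask(scores, block, keep):
    """Boolean mask keeping the same number of blocks (`keep`) in every block row."""
    n_rows, n_cols = scores.shape
    assert n_rows % block == 0 and n_cols % block == 0
    br, bc = n_rows // block, n_cols // block
    # Assumed importance metric: sum of absolute values inside each block.
    blocks = np.abs(scores).reshape(br, block, bc, block)
    importance = blocks.sum(axis=(1, 3))  # shape (br, bc)
    # Column indices of the `keep` most important blocks in each block row.
    top = np.argsort(-importance, axis=1, kind="stable")[:, :keep]
    block_mask = np.zeros((br, bc), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)
    # Expand the block-level mask back to element resolution.
    return np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)

def pruned_attention(q, k, v, block=2, keep=1):
    """Single-head attention in which pruned blocks are excluded from the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumption: the pruning decision uses only coarsely quantized (integer) scores.
    int_scores = np.rint(scores * 8).astype(np.int32)
    mask = row_balanced_block_mask(int_scores, block, keep)
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, mask

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, mask = pruned_attention(q, k, v, block=2, keep=1)
```

Because every block row keeps the same number of blocks, each query row attends to exactly `keep * block` keys. That uniform workload per row is the property that makes a balanced scheme attractive for hardware scheduling.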

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.12893 [cs.LG]
	(or arXiv:2407.12893v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.12893

Submission history

From: Ghada Alsuhli
[v1] Wed, 17 Jul 2024 11:15:16 UTC (1,651 KB)
