Authors:
Mohamad Alansari 1; Hamad AlRemeithi 1,2; Bilal Hassan 1,3; Sara Alansari 1; Jorge Dias 1,3; Majid Khonji 1,3; Naoufel Werghi 1,3,4 and Sajid Javed 1,3
Affiliations:
1 Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, U.A.E.
2 Research and Technology Development Department, Tawazun Technology & Innovation, Abu Dhabi, U.A.E.
3 Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, U.A.E.
4 Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, U.A.E.
Keyword(s):
Attention Mechanisms, Computational Resources, Pyramid Vision Transformers, Scene Understanding, Semantic Segmentation.
Abstract:
Semantic segmentation, a core task in computer vision, assigns a semantic class label to every image pixel. Transformer-based models, recognized for their strong performance, have been pivotal in advancing this field. Our contribution, the Vision-Perceptual Transformer Network (VPTN), combines transformer encoders with a feature-pyramid-based decoder to deliver precise segmentation maps at minimal computational cost. VPTN's strength lies in its integration of the pyramiding technique, which improves the handling of multi-scale variations. In direct comparisons with Vision Transformer-based networks and their variants, VPTN consistently excels: on average, it achieves 4.2%, 3.41%, and 6.24% higher mean Intersection over Union (mIoU) than the Dense Prediction Transformer (DPT), Data-efficient image Transformer (DeiT), and Swin Transformer networks, while requiring only 15.63%, 3.18%, and 10.05% of their Giga Floating-Point Operations (GFLOPs). Our validation spans five diverse datasets: Cityscapes, BDD100K, Mapillary Vistas, CamVid, and ADE20K. VPTN sets the state of the art (SOTA) on BDD100K and CamVid and consistently outperforms existing deep learning models on the other datasets, with mIoU scores of 82.6%, 67.29%, 61.2%, 86.3%, and 55.3%, respectively. Impressively, it does so with an average computational complexity of just 11.44% of SOTA models. VPTN represents a significant advance in semantic segmentation, balancing efficiency and performance, and shows particular promise for autonomous driving and computer vision applications in natural settings.
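The abstract describes VPTN as pairing pyramid transformer encoder stages with a feature-pyramid decoder that fuses multi-scale features into a dense prediction. The PyTorch sketch below is our own minimal illustration of that general design, not the authors' implementation: every module name, layer width, head count, and the simple sum-fusion decoder are assumptions made for clarity.

```python
# Illustrative sketch of a pyramid-transformer encoder + feature-pyramid
# decoder for semantic segmentation (assumed design, not the VPTN code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidStage(nn.Module):
    """One encoder stage: strided patch embedding, then a transformer block.

    Each stage halves the spatial resolution, producing the multi-scale
    feature pyramid the abstract refers to.
    """
    def __init__(self, in_ch, out_ch, num_heads):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                     stride=2, padding=1)
        self.block = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=num_heads, dim_feedforward=2 * out_ch,
            batch_first=True, norm_first=True)

    def forward(self, x):
        x = self.patch_embed(x)                 # (B, C, H/2, W/2)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tokens = self.block(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class VPTNSketch(nn.Module):
    """Hypothetical encoder-decoder: pyramid stages + lightweight FPN head."""
    def __init__(self, num_classes, dims=(32, 64, 128, 256), heads=(1, 2, 4, 8)):
        super().__init__()
        chans = (3,) + dims[:-1]
        self.stages = nn.ModuleList(
            PyramidStage(i, o, h) for i, o, h in zip(chans, dims, heads))
        # Decoder: project every scale to a common width, upsample, and sum.
        self.lateral = nn.ModuleList(nn.Conv2d(d, 64, 1) for d in dims)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                     # collect one map per scale
        size = feats[0].shape[-2:]              # finest pyramid resolution
        fused = sum(F.interpolate(lat(f), size=size, mode="bilinear",
                                  align_corners=False)
                    for lat, f in zip(self.lateral, feats))
        return self.head(fused)                 # class logits per pixel

if __name__ == "__main__":
    model = VPTNSketch(num_classes=19)          # e.g. Cityscapes uses 19 classes
    logits = model(torch.randn(1, 3, 128, 128))
    print(logits.shape)                         # torch.Size([1, 19, 64, 64])
```

The efficiency argument in the abstract hinges on this kind of structure: attention runs on progressively downsampled token grids rather than full resolution, while the cheap pyramid decoder recovers spatial detail, which is consistent with the reported GFLOPs savings over DPT, DeiT, and Swin baselines.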