Lightweight Vision Transformer with Cross Feature Attention

Zhao, Youpeng; Tang, Huadong; Jiang, Yingying; A, Yong; Wu, Qiang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2207.07268 (cs)

[Submitted on 15 Jul 2022 (v1), last revised 5 Jul 2023 (this version, v2)]

Title: Lightweight Vision Transformer with Cross Feature Attention

Authors: Youpeng Zhao, Huadong Tang, Yingying Jiang, Yong A, Qiang Wu

Abstract: Recent advances in vision transformers (ViTs) have achieved strong performance on visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive biases to learn visual representations, but these networks are spatially local. ViTs can learn global representations through their self-attention mechanism, but they are usually heavyweight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to reduce the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel, efficient, lightweight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone for learning both global and local representations. Experimental results show that XFormer outperforms numerous CNN-based and ViT-based models across different tasks and datasets. On the ImageNet1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based), respectively, at a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.

Comments:	Technical Report. A shorter version has been accepted to ICIP 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2207.07268 [cs.CV]
	(or arXiv:2207.07268v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.07268

Submission history

From: Youpeng Zhao
[v1] Fri, 15 Jul 2022 03:27:13 UTC (4,235 KB)
[v2] Wed, 5 Jul 2023 16:11:41 UTC (4,236 KB)
