VSA: Learning Varied-Size Window Attention in Vision Transformers

Zhang, Qiming; Xu, Yufei; Zhang, Jing; Tao, Dacheng

arXiv:2204.08446 (cs)

[Submitted on 18 Apr 2022 (v1), last revised 3 Jul 2023 (this version, v2)]

Abstract: Attention within windows has been widely explored in vision transformers to balance performance, computational complexity, and memory footprint. However, current models adopt a hand-crafted, fixed-size window design, which restricts their capacity to model long-term dependencies and to adapt to objects of different sizes. To address this drawback, we propose Varied-Size Window Attention (VSA) to learn adaptive window configurations from data. Specifically, based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window, i.e., the attention area from which the key and value tokens are sampled. By adopting VSA independently for each attention head, the model can capture long-term dependencies, gather rich context from diverse windows, and promote information exchange among overlapping windows. VSA is an easy-to-implement module that can replace the window attention in representative state-of-the-art models with minor modifications and negligible extra computational cost while improving their performance by a large margin, e.g., by 1.1% for Swin-T on ImageNet classification. In addition, the performance gain increases when larger images are used for training and testing. Experimental results on further downstream tasks, including object detection, instance segmentation, and semantic segmentation, demonstrate the superiority of VSA over vanilla window attention in dealing with objects of different sizes. The code will be released at this https URL.
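
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. It assumes a single-channel feature map stored as a list of lists, takes the regressed scale and offset as given inputs rather than predicting them, and samples as many key/value positions as the default window holds, using bilinear interpolation. All function and parameter names (`target_window`, `bilinear`, `sample_window`, `scale`, `offset`) are hypothetical.

```python
import math


def target_window(origin, size, scale, offset):
    """Return (top, left, height, width) of the target window.

    origin: (row, col) of the default window's top-left corner.
    size:   side length of the square default window.
    scale:  (s_y, s_x), assumed output of a window regression module.
    offset: (o_y, o_x), assumed shift of the window center in pixels.
    """
    cy = origin[0] + size / 2 + offset[0]
    cx = origin[1] + size / 2 + offset[1]
    h, w = size * scale[0], size * scale[1]
    return cy - h / 2, cx - w / 2, h, w


def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap at (y, x), clamping to the border."""
    rows, cols = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), rows - 1.0)
    x = min(max(x, 0.0), cols - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, rows - 1), min(x0 + 1, cols - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def sample_window(fmap, origin, size, scale, offset):
    """Sample a size x size grid of key/value features from the target window.

    The number of samples equals the number of tokens in the default
    window, so the attention cost per window is unchanged in this sketch.
    """
    top, left, h, w = target_window(origin, size, scale, offset)
    sampled = []
    for i in range(size):
        row = []
        for j in range(size):
            # Center of cell (i, j) in the target window, in index coordinates.
            y = top + (i + 0.5) * h / size - 0.5
            x = left + (j + 0.5) * w / size - 0.5
            row.append(bilinear(fmap, y, x))
        sampled.append(row)
    return sampled


# Toy 8x8 feature map whose value encodes its position.
fmap = [[r * 8 + c for c in range(8)] for r in range(8)]
same = sample_window(fmap, (2, 2), 2, (1.0, 1.0), (0.0, 0.0))
wide = sample_window(fmap, (2, 2), 2, (2.0, 2.0), (0.0, 0.0))
```

With unit scale and zero offset, `same` reproduces the default window, `[[18.0, 19.0], [26.0, 27.0]]`. Doubling the scale makes `wide` cover a 4x4 area with the same four samples, `[[13.5, 15.5], [29.5, 31.5]]`. In a multi-head setting, each head would receive its own scale and offset, so different heads attend to different target windows.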

Comments:	23 pages, 13 tables, and 5 figures; ECCV 2022 version
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.08446 [cs.CV]
	(or arXiv:2204.08446v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.08446

Submission history

From: Qiming Zhang
[v1] Mon, 18 Apr 2022 17:56:07 UTC (3,235 KB)
[v2] Mon, 3 Jul 2023 07:49:59 UTC (4,395 KB)
