Scaled ReLU Matters for Training Vision Transformers

Wang, Pichao; Wang, Xue; Luo, Hao; Zhou, Jingkai; Zhou, Zhipeng; Wang, Fan; Li, Hao; Jin, Rong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2109.03810 (cs)

[Submitted on 8 Sep 2021 (v1), last revised 12 Jan 2022 (this version, v2)]

Abstract: Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, ViTs are much harder to train than CNNs, because training is sensitive to hyperparameters such as the learning rate, the optimizer, and the number of warmup epochs. The reasons for this difficulty are analysed empirically by Xiao et al. (2021), who conjecture that the issue lies with the patchify-stem of ViT models and propose that early convolutions help transformers see better. In this paper, we investigate the problem further and extend that conclusion: early convolutions alone do not stabilize training; it is the scaled ReLU operation in the convolutional stem (conv-stem) that matters. We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stability but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments demonstrate that previous ViTs are far from well trained, further showing that ViTs have great potential to be a better substitute for CNNs.

Comments:	Accepted by AAAI2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2109.03810 [cs.CV]
	(or arXiv:2109.03810v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.03810

Submission history

From: Pichao Wang
[v1] Wed, 8 Sep 2021 17:57:58 UTC (155 KB)
[v2] Wed, 12 Jan 2022 01:01:35 UTC (146 KB)
