Sufficient vision transformer

Z Cheng, X Su, X Wang, S You, C Xu - Proceedings of the 28th ACM …, 2022 - dl.acm.org
Vision Transformer (as in Figs. 1 and 5), we aim to improve the robustness of the Transformer
… information intact in Transformer is the unique form of token sequences processed by the …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
… When pre-trained at sufficient scale, transformers achieve excellent results on tasks with
fewer datapoints. For example, when pre-trained on the JFT-300M dataset, ViT approached or …

Three things everyone should know about vision transformers

H Touvron, M Cord, A El-Nouby, J Verbeek… - … on Computer Vision, 2022 - Springer
… variants of vision transformers. (1) The residual layers of vision transformers, which are …
(2) Fine-tuning the weights of the attention layers is sufficient to adapt vision transformers to …

DeepViT: Towards deeper vision transformer

D Zhou, B Kang, X Jin, L Yang, X Lian, Z Jiang… - arXiv preprint arXiv …, 2021 - arxiv.org
… issue and effectively scale the vision transformer to be deeper, … very deep vision transformers
with even 32 transformer blocks … is all lower than 30% and they present sufficient diversity. …

Scaling vision transformers

X Zhai, A Kolesnikov, N Houlsby… - … on computer vision and …, 2022 - openaccess.thecvf.com
… Our results suggest that with sufficient data, training a larger model for fewer steps is preferable.
This observation mirrors results in language modelling and machine translation [21, 25]. …

Long-short transformer: Efficient transformers for language and vision

C Zhu, W Ping, C Xiao, M Shoeybi… - Advances in neural …, 2021 - proceedings.neurips.cc
… language and vision domains. … Transformer (Transformer-LS), an efficient self-attention
mechanism for modeling long sequences with linear complexity for both language and vision

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
… of vision tasks using Transformer networks. This survey aims to provide a comprehensive
overview of the Transformer models in the computer vision … the success of Transformers, i.e., self-…

Training vision transformers with only 2040 images

YH Cao, H Yu, J Wu - European Conference on Computer Vision, 2022 - Springer
… /tokens and employed a pure transformer structure. With sufficient training data, ViT outperforms
… Beyond classification, Transformer has been adopted in diverse vision tasks, including …

Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer
… plain, non-hierarchical Vision Transformer (ViT) as a … sufficient to build a simple feature
pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient

ViLT: Vision-and-language transformer without convolution or region supervision

W Kim, B Son, I Kim - International conference on machine …, 2021 - proceedings.mlr.press
… of performance on other vision-and-language downstream … embedders may not be sufficient
to learn complex vision-and-… Figure 2c use a deep transformer to model the interaction of …