Sufficient vision transformer

Z Cheng, X Su, X Wang, S You, C Xu - Proceedings of the 28th ACM …, 2022 - dl.acm.org
Vision Transformer (as in Figs. 1 and 5), we aim to improve the robustness of the Transformer
… information intact in Transformer is the unique form of token sequences processed by the …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
… When pre-trained at sufficient scale, transformers achieve excellent results on tasks with
fewer datapoints. For example, when pre-trained on the JFT-300M dataset, ViT approached or …

Three things everyone should know about vision transformers

H Touvron, M Cord, A El-Nouby, J Verbeek… - … on Computer Vision, 2022 - Springer
… variants of vision transformers. (1) The residual layers of vision transformers, which are …
(2) Fine-tuning the weights of the attention layers is sufficient to adapt vision transformers to …

DeepViT: Towards deeper vision transformer

D Zhou, B Kang, X Jin, L Yang, X Lian, Z Jiang… - arXiv preprint arXiv …, 2021 - arxiv.org
… issue and effectively scale the vision transformer to be deeper, … very deep vision transformers
with even 32 transformer blocks … is all lower than 30% and they present sufficient diversity. …

Scaling vision transformers

X Zhai, A Kolesnikov, N Houlsby… - … on computer vision and …, 2022 - openaccess.thecvf.com
… Our results suggest that with sufficient data, training a larger model for fewer steps is preferable.
This observation mirrors results in language modelling and machine translation [21, 25]. …

Long-short transformer: Efficient transformers for language and vision

C Zhu, W Ping, C Xiao, M Shoeybi… - Advances in neural …, 2021 - proceedings.neurips.cc
… language and vision domains. … Transformer (Transformer-LS), an efficient self-attention
mechanism for modeling long sequences with linear complexity for both language and vision

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
… of vision tasks using Transformer networks. This survey aims to provide a comprehensive
overview of the Transformer models in the computer vision … the success of Transformers, i.e., self-…

Training vision transformers with only 2040 images

YH Cao, H Yu, J Wu - European Conference on Computer Vision, 2022 - Springer
… /tokens and employed a pure transformer structure. With sufficient training data, ViT outperforms
… Beyond classification, Transformer has been adopted in diverse vision tasks, including …

Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer
… plain, non-hierarchical Vision Transformer (ViT) as a … sufficient to build a simple feature
pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient

ViLT: Vision-and-language transformer without convolution or region supervision

W Kim, B Son, I Kim - International conference on machine …, 2021 - proceedings.mlr.press
… of performance on other vision-and-language downstream … embedders may not be sufficient
to learn complex vision-and-… Figure 2c use a deep transformer to model the interaction of …