A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity

Li, Hongkang; Wang, Meng; Liu, Sijia; Chen, Pin-yu

arXiv:2302.06015v1 (cs)

[Submitted on 12 Feb 2023 (this version), latest version 12 Nov 2023 (v3)]

Abstract: Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical analysis of their learning and generalization remains largely elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity required to achieve zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that training with stochastic gradient descent (SGD) leads to a sparse attention map, which formally verifies the common intuition about the success of attention. Moreover, our analysis indicates that proper token sparsification can improve test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Experiments on synthetic data and the CIFAR-10 dataset support our theoretical results, which also extend empirically to deeper ViTs.
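The architecture named in the abstract, one self-attention layer followed by a two-layer perceptron, can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions and not the paper's exact parametrization: the 1/sqrt(d) score scaling, the ReLU activation, the averaging of per-token scores, the binary sign output, and all names (`shallow_vit_forward`, `w_q`, `w_k`, `w_v`, `w_hidden`, `w_out`) are hypothetical choices made here for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head softmax self-attention; returns (outputs, attention_map)."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(queries[0]))  # assumed 1/sqrt(d) scaling
    attention_map, outputs = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attention_map.append(weights)
        # Each output token is an attention-weighted average of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, attention_map


def shallow_vit_forward(tokens, params):
    """One self-attention layer followed by a two-layer perceptron."""
    attended, attention_map = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"])
    per_token = []
    for a in attended:
        hidden = [max(0.0, h) for h in matvec(params["w_hidden"], a)]  # ReLU (assumed)
        per_token.append(dot(params["w_out"], hidden))
    # Average the per-token scores (assumed pooling); the sign gives a binary label.
    logit = sum(per_token) / len(per_token)
    return logit, attention_map


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(dim, hidden, rng):
    """Random initialization; the sizes used below are illustrative only."""
    return {
        "w_q": random_matrix(dim, dim, rng),
        "w_k": random_matrix(dim, dim, rng),
        "w_v": random_matrix(dim, dim, rng),
        "w_hidden": random_matrix(hidden, dim, rng),
        "w_out": [rng.gauss(0.0, 0.5) for _ in range(hidden)],
    }


rng = random.Random(0)
params = init_params(dim=4, hidden=8, rng=rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]  # 5 toy tokens
logit, attention_map = shallow_vit_forward(tokens, params)
predicted_label = 1 if logit >= 0 else -1
```

Each row of `attention_map` is a probability distribution over the input tokens. The sparse attention map described in the abstract would correspond to rows whose mass concentrates on a few tokens after training; this untrained sketch does not demonstrate that behaviour.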

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as: arXiv:2302.06015 [cs.LG] (or arXiv:2302.06015v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2302.06015

Submission history

From: Hongkang Li
[v1] Sun, 12 Feb 2023 22:12:35 UTC (3,775 KB)
[v2] Sun, 19 Mar 2023 22:36:28 UTC (3,775 KB)
[v3] Sun, 12 Nov 2023 04:36:45 UTC (1,705 KB)
