Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions

Yang, Ning; Tang, Chao; Tu, Yuhai

doi:10.1103/PhysRevLett.130.237101

arXiv:2206.01246 (cond-mat)

[Submitted on 2 Jun 2022]

Abstract: Generalization is one of the most important problems in deep learning (DL). In the overparameterized regime of neural networks, there exist many low-loss solutions that fit the training data equally well. The key question is which solution is more generalizable. Empirical studies have shown a strong correlation between the flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial in finding the flat solutions. To understand how SGD drives the learning system to flat solutions, we construct a simple model whose loss landscape has a continuous set of degenerate (or near-degenerate) minima. By solving the Fokker-Planck equation of the underlying stochastic learning dynamics, we show that, owing to its strong anisotropy, the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that this additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. Furthermore, stronger SGD noise shortens the convergence time to the flat solutions. However, we identify an upper bound for the SGD noise beyond which the system fails to converge. Our results not only elucidate the role of SGD in generalization but may also have important implications for hyperparameter selection for learning efficiently without divergence.
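
The mechanism described in the abstract can be illustrated with a minimal toy simulation. This is a sketch under stated assumptions, not the authors' model. It assumes a hypothetical two-parameter loss L(x, y) = 0.5 k(x) y^2, whose minima form a degenerate valley along y = 0. It also assumes a hypothetical curvature profile k(x) = 1 + x^2, which is flattest at x = 0, and noise that acts only in the sharp direction y with variance proportional to the local curvature. The function names and parameter values are illustrative choices.

```python
import random

def curvature(x):
    # Hypothetical curvature of the valley wall at position x; flattest at x = 0.
    return 1.0 + x * x

def curvature_slope(x):
    # Derivative dk/dx of the assumed curvature profile.
    return 2.0 * x

def run_toy_sgd(lr, noise_scale, steps, x0=2.0, seed=0):
    """Noisy gradient descent on L(x, y) = 0.5 * k(x) * y**2.

    Every point with y = 0 is a minimum, so the loss alone cannot
    distinguish sharp minima (large |x|) from flat ones (x near 0).
    """
    rng = random.Random(seed)
    x, y = x0, 0.0
    for _ in range(steps):
        k = curvature(x)
        # Exact gradients of the toy loss at the current point.
        grad_x = 0.5 * curvature_slope(x) * y * y
        grad_y = k * y
        # Assumption: anisotropic noise, present only along y, with
        # variance proportional to the local curvature k(x).
        noise_y = noise_scale * (k ** 0.5) * rng.gauss(0.0, 1.0)
        x -= lr * grad_x
        y -= lr * (grad_y + noise_y)
    return x, y

if __name__ == "__main__":
    for noise_scale in (0.0, 1.0):
        x_final, _ = run_toy_sgd(lr=0.1, noise_scale=noise_scale, steps=5000)
        print(f"noise_scale={noise_scale}: final x = {x_final:.3g}")
```

Without noise the system stays where it starts in the valley, since the gradient vanishes there. With noise, fluctuations in y produce an average drift along x toward lower curvature. In this toy setting the drift grows with both the learning rate and the noise strength, which is qualitatively consistent with the effective loss term described in the abstract. The update in y is stable only while lr * k(x) < 2, so overly large steps diverge in this sketch.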

Comments:	Main text: 11 pages, 3 figures; supplementary materials: 19 pages, 5 figures
Subjects:	Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Cite as:	arXiv:2206.01246 [cond-mat.dis-nn]
	(or arXiv:2206.01246v1 [cond-mat.dis-nn] for this version)
	https://doi.org/10.48550/arXiv.2206.01246
Related DOI:	https://doi.org/10.1103/PhysRevLett.130.237101

Submission history

From: Ning Yang
[v1] Thu, 2 Jun 2022 18:49:36 UTC (3,513 KB)
