A Quadratic Synchronization Rule for Distributed Deep Learning

Gu, Xinran; Lyu, Kaifeng; Arora, Sanjeev; Zhang, Jingzhao; Huang, Longbo

Computer Science > Machine Learning

arXiv:2310.14423 (cs)

[Submitted on 22 Oct 2023 (v1), last revised 12 Apr 2024 (this version, v2)]

Abstract: In distributed deep learning with data parallelism, synchronizing gradients at every training step can incur substantial communication overhead, especially when many nodes train a large model together. Local gradient methods, such as Local SGD, address this issue by letting workers compute locally for $H$ steps without synchronizing with others, thereby reducing communication frequency. While $H$ has been viewed as a hyperparameter that trades optimization efficiency for communication cost, recent research indicates that a properly chosen $H$ can also improve generalization. Yet how to select a proper $H$ remains elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve test accuracy over other synchronization strategies. Compared with standard data-parallel training, QSR enables Local AdamW on ViT-B to cut the training time from 26.7 to 20.2 hours on 16 GPUs, or from 8.6 to 5.5 hours on 64 GPUs, while achieving $1.16\%$ or $0.84\%$ higher top-1 validation accuracy, respectively.
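The abstract specifies only that $H$ scales in proportion to $\frac{1}{\eta^2}$. As a minimal sketch of what such a rule could look like in practice, the Python below assumes a proportionality coefficient `alpha`, a lower bound `h_base` on the synchronization period, floor rounding, and a cosine learning-rate decay. All of these names and choices are illustrative assumptions, not details taken from the text above.

```python
import math


def qsr_sync_period(lr: float, alpha: float, h_base: int) -> int:
    """Hypothetical helper: number of local steps H before the next sync.

    Implements H = max(h_base, floor((alpha / lr)^2)), i.e. H grows in
    proportion to 1 / lr^2. `alpha` and `h_base` are assumed tuning knobs.
    """
    if lr <= 0:
        raise ValueError("lr must be positive")
    return max(h_base, math.floor((alpha / lr) ** 2))


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Assumed schedule: cosine decay from peak_lr towards zero."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))


def sync_schedule(total_steps: int, peak_lr: float, alpha: float, h_base: int):
    """Return (start_step, H) for each communication round.

    H is recomputed at the start of every round from the current learning
    rate, so rounds get longer as the learning rate decays.
    """
    rounds = []
    step = 0
    while step < total_steps:
        lr = cosine_lr(step, total_steps, peak_lr)
        remaining = total_steps - step
        # Guard against a learning rate that rounds to zero at the very end.
        h = remaining if lr <= 0 else qsr_sync_period(lr, alpha, h_base)
        h = min(h, remaining)  # never run past the end of training
        rounds.append((step, h))
        step += h
    return rounds


if __name__ == "__main__":
    for start, h in sync_schedule(total_steps=1000, peak_lr=0.5, alpha=1.0, h_base=2):
        print(f"round starting at step {start}: {h} local steps")
```

Under these assumptions, halving the learning rate quadruples the number of local steps between synchronizations, so most communication happens early in training and rounds become progressively longer as $\eta$ decays.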

Comments:	camera-ready version for ICLR'24
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2310.14423 [cs.LG]
	(or arXiv:2310.14423v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.14423

Submission history

From: Xinran Gu
[v1] Sun, 22 Oct 2023 21:38:57 UTC (3,095 KB)
[v2] Fri, 12 Apr 2024 13:59:01 UTC (4,003 KB)
