Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

Li, Shijian; Walls, Robert J.; Guo, Tian

arXiv:2004.03072 (cs)

[Submitted on 7 Apr 2020]

Abstract: Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration (e.g., server type and number) for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers.
In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling such as detecting and mitigating performance bottlenecks.
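As a rough illustration of what "regression-based" speed prediction can look like, the sketch below fits an ordinary least-squares line that maps a single model-complexity feature to time per training step. It is not the paper's model: the feature choice (`gflops`), the function names, and every number are hypothetical and made up for illustration only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Evaluate a fitted (slope, intercept) pair at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical, made-up data points (NOT measurements from the paper):
# model complexity in GFLOPs versus seconds per training step.
gflops = [1.0, 2.0, 4.0, 8.0]
step_seconds = [0.15, 0.25, 0.45, 0.85]

model = fit_line(gflops, step_seconds)
estimate = predict(model, 6.0)  # estimated seconds per step for an unseen model
```

In practice one would fit such a model per GPU type and cluster size from measured data, and check it against held-out measurements before relying on its predictions.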

Comments:	11 pages, 12 figures, 5 tables, in proceedings of 40th IEEE International Conference on Distributed Computing Systems (ICDCS) 2020
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2004.03072 [cs.DC]
	(or arXiv:2004.03072v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2004.03072

Submission history

From: Shijian Li
[v1] Tue, 7 Apr 2020 01:49:58 UTC (639 KB)
