Speeding up Deep Learning with Transient Servers

Li, Shijian; Walls, Robert J.; Xu, Lijie; Guo, Tian

arXiv:1903.00045 (cs)

[Submitted on 28 Feb 2019 (v1), last revised 5 May 2019 (this version, v2)]

Abstract: Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable (e.g., for rapidly evaluating new model designs), they come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs.
We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, with a speedup of 7.7X and more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.
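
The cost argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the function names, prices, cluster size, and speedup are all hypothetical, and it ignores server revocations, which a real transient cluster would have to handle. It only shows why sublinear scaling raises the bill on full-price servers and how a per-hour discount can offset that.

```python
def training_cost(num_servers, speedup, single_server_hours, hourly_price):
    """Total cost of one training run on a cluster (hypothetical model).

    The run takes single_server_hours / speedup wall-clock hours, and every
    server in the cluster is billed for that whole duration.
    """
    wall_clock_hours = single_server_hours / speedup
    return num_servers * wall_clock_hours * hourly_price


def savings(cost, reference_cost):
    """Fraction of reference_cost saved by paying cost instead."""
    return 1.0 - cost / reference_cost


# All numbers below are made up, chosen only to illustrate the arithmetic.
BASELINE_HOURS = 10.0   # training time on a single full-price server
ON_DEMAND_PRICE = 1.00  # dollars per server-hour, full price
TRANSIENT_PRICE = 0.30  # dollars per server-hour, discounted but revocable

# One full-price server: the reference point.
baseline = training_cost(1, 1.0, BASELINE_HOURS, ON_DEMAND_PRICE)

# Sublinear scaling: 8 servers yield only a 6x speedup, so the total
# number of billed server-hours grows and the run costs more.
on_demand_cluster = training_cost(8, 6.0, BASELINE_HOURS, ON_DEMAND_PRICE)

# The same cluster on discounted transient servers (revocations ignored).
transient_cluster = training_cost(8, 6.0, BASELINE_HOURS, TRANSIENT_PRICE)

print(f"single server:      ${baseline:.2f}")
print(f"on-demand cluster:  ${on_demand_cluster:.2f}")
print(f"transient cluster:  ${transient_cluster:.2f}")
print(f"savings vs. single: {savings(transient_cluster, baseline):.0%}")
```

With these made-up inputs the full-price cluster is faster but more expensive than the single server, while the discounted cluster is both faster and cheaper. Whether that holds in practice depends on actual prices, the achieved speedup, and how often servers are revoked.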

Comments:	Accepted to ICAC'19. 11 pages, 8 figures, 5 tables
Subjects:	Performance (cs.PF); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:1903.00045 [cs.PF]
	(or arXiv:1903.00045v2 [cs.PF] for this version)
	https://doi.org/10.48550/arXiv.1903.00045

Submission history

From: Tian Guo
[v1] Thu, 28 Feb 2019 19:47:59 UTC (242 KB)
[v2] Sun, 5 May 2019 07:20:37 UTC (242 KB)
