Time-sharing redux for large-scale HPC systems

S Hofmeyr, C Iancu, J Colmenares… - 2016 IEEE 18th …, 2016 - ieeexplore.ieee.org
2016 IEEE 18th International Conference on High Performance …, 2016
HPC facilities typically use batch scheduling to space-share jobs. In this paper we revisit time-sharing using a trace of over 2.4 million jobs obtained during 20 months of operation of a modern petascale supercomputer. Our simulations show that batch scheduling produces skewed distributions with much larger slowdowns for shorter-running, larger jobs, whereas time-sharing produces more uniform slowdowns. Consequently, for applications that strong scale, the turnaround time does not scale with batch scheduling, but it does with time-sharing, resulting in turnarounds that are orders of magnitude better at the largest scales. We also show that time-sharing can confer additional benefits in noisy systems and with modern programming practices. Future Exascale HPC systems are expected to exhibit billion-way heterogeneous parallelism and poor performance predictability. As many applications will run in strong scaling, how resource allocation policies affect the experience of supercomputer users has once again become a timely subject.
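The contrast in slowdown distributions can be illustrated with a toy model. The sketch below is not the paper's simulator: it assumes a single resource, all jobs arriving at time zero, first-come-first-served order standing in for batch scheduling, and ideal equal sharing with no context-switch cost standing in for time-sharing. Slowdown is taken here as turnaround time divided by runtime, and the function names and job mix are hypothetical.

```python
def fcfs_slowdowns(runtimes):
    """Slowdown (turnaround / runtime) when jobs run one at a time in list order."""
    slowdowns, clock = [], 0.0
    for r in runtimes:
        clock += r                      # job finishes after everything ahead of it
        slowdowns.append(clock / r)
    return slowdowns


def shared_slowdowns(runtimes):
    """Slowdown under ideal equal sharing: n active jobs each progress at rate 1/n."""
    order = sorted(range(len(runtimes)), key=lambda i: runtimes[i])
    slowdowns = [0.0] * len(runtimes)
    clock, done = 0.0, 0.0              # wall-clock time, work completed per active job
    for rank, i in enumerate(order):
        active = len(runtimes) - rank   # jobs still running when job i finishes
        clock += (runtimes[i] - done) * active
        done = runtimes[i]
        slowdowns[i] = clock / runtimes[i]
    return slowdowns


# Assumed toy workload: one long job queued ahead of two short ones.
jobs = [100.0, 1.0, 1.0]
print("batch (FCFS):", fcfs_slowdowns(jobs))
print("time-shared: ", shared_slowdowns(jobs))
```

In this toy case the short jobs see slowdowns of roughly 100 under FCFS but only 3 under sharing, while the long job's slowdown rises only slightly, which is the kind of flattening the abstract describes. Real systems add queue priorities, backfilling, multi-node allocation, and switching overheads that this sketch ignores.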
ieeexplore.ieee.org