Power-check: An energy-efficient checkpointing framework for HPC clusters
RR Chandrasekar, A Venkatesh… - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and …, 2015•ieeexplore.ieee.org
Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While innumerable studies in the literature have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. The few studies conducted along this line have primarily analyzed and modeled power and energy usage during checkpointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power-capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around checkpointing phases by proposing a novel power-aware checkpointing framework, Power-Check. Through data funnelling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystems. Evaluations with application kernels show that Power-Check can yield as much as a 48% reduction in the energy consumed during a checkpoint, while improving checkpointing performance by 14%.