Hunting Killer Tasks for Cloud System through Machine Learning: A Google Cluster Case Study

H Tang, Y Li, T Jia, Z Wu - 2016 IEEE International Conference on Software Quality …, 2016 - ieeexplore.ieee.org
Motivated by frequent failures in cloud computing systems, we analyze the failure frequency and failure continuity of tasks in the Google cloud cluster and identify what we call killer tasks: tasks that suffer from frequent failures and repeated rescheduling. Killer tasks waste resources unnecessarily and significantly increase the scheduling workload, which can be a major concern in cloud systems. We aim to recognize killer tasks at a very early stage of their occurrence so that they can be addressed proactively instead of being rescheduled repeatedly, thereby improving reliability and saving resources. Recognizing killer tasks among a large number of tasks in real time is challenging. In this paper, we first investigate the characteristics and behavior patterns of killer tasks and then develop two machine-learning-based methods, K-HUNTER and C-HUNTER, for online recognition of killer tasks. The empirical results show that our approach achieves 97% precision in recognizing killer tasks, with an 89% timing advance and 88% resource saving for the cloud system on average.