A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Predictive performance modeling for distributed batch processing using black box monitoring and machine learning

C Witt, M Bux, W Gusew, U Leser - Information Systems, 2019 - Elsevier
In many domains, the previous decade was characterized by increasing data volumes and
growing complexity of data analyses, creating new demands for batch processing on …

Backfilling using system-generated predictions rather than user runtime estimates

D Tsafrir, Y Etsion, DG Feitelson - IEEE Transactions on …, 2007 - ieeexplore.ieee.org
The most commonly used scheduling algorithm for parallel supercomputers is FCFS with
backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs …

An analysis of traces from a production MapReduce cluster

S Kavulya, J Tan, R Gandhi… - 2010 10th IEEE/ACM …, 2010 - ieeexplore.ieee.org
MapReduce is a programming paradigm for parallel processing that is increasingly being
used for data-intensive applications in cloud computing environments. An understanding of …

Predicting workflow task execution time in the cloud using a two-stage machine learning approach

TP Pham, JJ Durillo, T Fahringer - IEEE Transactions on Cloud …, 2017 - ieeexplore.ieee.org
Many techniques such as scheduling and resource provisioning rely on performance
prediction of workflow tasks for varying input data. However, such estimates are difficult to …

The GrADS project: Software support for high-level grid application development

F Berman, A Chien, K Cooper… - … Journal of High …, 2001 - journals.sagepub.com
Advances in networking technologies will soon make it possible to use the global
information infrastructure in a qualitatively different way—as a computational as well as an …

A best practice guide to resource forecasting for computing systems

GA Hoffmann, KS Trivedi… - IEEE Transactions on …, 2007 - ieeexplore.ieee.org
Recently, measurement-based studies of software systems have proliferated, reflecting an
increasingly empirical focus on system availability, reliability, aging, and fault tolerance …

Using moldability to improve the performance of supercomputer jobs

W Cirne, F Berman - Journal of Parallel and Distributed Computing, 2002 - Elsevier
In most parallel supercomputers, submitting a job for execution involves specifying (i) how
many processors are to be allocated to the job, and (ii) for how long these processors are to …

Using machine learning ensemble methods to predict execution time of e-science workflows in heterogeneous distributed systems

F Nadeem, D Alghazzawi, A Mashat, K Faqeeh… - IEEE …, 2019 - ieeexplore.ieee.org
Effective planning and optimized execution of the e-Science workflows in distributed
systems, such as the Grid, need predictions of execution times of the workflows. However …

Predicting the execution time of workflow activities based on their input features

T Miu, P Missier - 2012 SC Companion: High Performance …, 2012 - ieeexplore.ieee.org
The ability to accurately estimate the execution time of computationally expensive e-science
algorithms enables better scheduling of workflows that incorporate those algorithms as their …