User profiles for Alessio Netti
Alessio Netti, HPC/AI Senior Research Engineer, DeepL. Verified email at deepl.com. Cited by 285
Operational data analytics in practice: experiences from design to deployment in production HPC environments
As HPC systems continue to grow in scale and complexity, efficient and manageable operation
is increasingly critical. For this reason, many centers are starting to explore the use of …
DCDB wintermute: Enabling online and holistic operational data analytics on HPC systems
A Netti, M Müller, C Guillen, M Ott, D Tafani… - Proceedings of the 29th …, 2020 - dl.acm.org
As we approach the exascale era, the size and complexity of HPC systems continues to
increase, raising concerns about their manageability and sustainability. For this reason, more …
From facility to application sensor data: modular, continuous and holistic monitoring with DCDB
A Netti, M Müller, A Auweter, C Guillen, M Ott… - Proceedings of the …, 2019 - dl.acm.org
Today's HPC installations are highly-complex systems, and their complexity will only
increase as we move to exascale and beyond. At each layer, from facilities to systems, from …
A machine learning approach to online fault classification in HPC systems
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure
rates both at the hardware and software levels will increase significantly. Thus, detecting and …
A conceptual framework for HPC operational data analytics
This paper provides a broad framework for understanding trends in Operational Data Analytics
(ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to allow for …
HPC hardware design reliability benchmarking with HDFIT
P Omland, A Netti, Y Peng, A Baldovin… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become more
concerning, particularly at the scale of High-Performance Computing (HPC) systems: on …
Mixed precision support in HPC applications: What about reliability?
A Netti, Y Peng, P Omland, M Paulitsch, J Parra… - Journal of Parallel and …, 2023 - Elsevier
In their quest for exascale and beyond, High-Performance Computing (HPC) systems
continue becoming ever larger and more complex. Application developers, on the other hand, …
FINJ: A fault injection tool for HPC systems
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC)
systems, with a focus on the management of complex experiments. FINJ provides support for …
Towards a predictive energy model for HPC runtime systems using supervised learning
High-Performance Computing systems collect vast amounts of operational data with the
employment of monitoring frameworks, often augmented with additional information from …
AccaSim: a customizable workload management simulator for job dispatching research in HPC systems
We present AccaSim, a simulator for workload management in HPC systems. Thanks to
AccaSim’s scalability to large workload datasets, support for easy customization, and practical …