A Conceptual Framework for HPC Operational Data Analytics

Abstract
This paper provides a broad framework for understanding trends in Operational Data Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to allow for the continuous monitoring, archiving, and analysis of near real-time performance data, providing immediately actionable information for multiple operational uses. In this work, we combine two models to provide a comprehensive HPC ODA framework: one is an evolutionary model of analytics capabilities that consists of four types, which are descriptive, diagnostic, predictive and prescriptive, while the other is a four-pillar model for energy-efficient HPC operations that covers facility, system hardware, system software, and applications. This new framework is then overlaid with a description of current development and production deployments of ODA within leading-edge HPC facilities. Finally, we perform a comprehensive survey of ODA works and classify them according to our framework, in order to demonstrate its effectiveness.
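The two models in the abstract combine into a 4×4 grid: each ODA work falls into one cell, defined by one operational pillar and one analytics type. The sketch below is a minimal illustration of that classification idea, not the paper's own tooling. The names `Pillar`, `Capability`, and `classify` are hypothetical, and the example entries are invented placeholders rather than works from the survey.

```python
from collections import defaultdict
from enum import Enum


class Capability(Enum):
    """The four analytics types, in the order the abstract lists them."""
    DESCRIPTIVE = "descriptive"
    DIAGNOSTIC = "diagnostic"
    PREDICTIVE = "predictive"
    PRESCRIPTIVE = "prescriptive"


class Pillar(Enum):
    """The four pillars of energy-efficient HPC operations."""
    FACILITY = "facility"
    SYSTEM_HARDWARE = "system hardware"
    SYSTEM_SOFTWARE = "system software"
    APPLICATIONS = "applications"


def classify(works):
    """Group (name, pillar, capability) tuples by their grid cell."""
    grid = defaultdict(list)
    for name, pillar, capability in works:
        grid[(pillar, capability)].append(name)
    return dict(grid)


# Hypothetical placeholder entries, not taken from the paper's survey.
example_works = [
    ("cooling dashboard", Pillar.FACILITY, Capability.DESCRIPTIVE),
    ("node failure forecast", Pillar.SYSTEM_HARDWARE, Capability.PREDICTIVE),
    ("power-aware scheduling", Pillar.SYSTEM_SOFTWARE, Capability.PRESCRIPTIVE),
]

grid = classify(example_works)
cells = len(Pillar) * len(Capability)  # number of cells in the framework grid
```

Keying the grid on a `(pillar, capability)` pair keeps the two models independent, so either axis could be extended without touching the other.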
Type
Publication
EEHPCWG State of Practice Workshop at the 2021 IEEE International Conference on Cluster Computing (CLUSTER)
Authors
Alessio Netti
HPC/AI Research Engineer
Alessio Netti (Ph.D., Technical University of Munich, 2022) is an HPC/AI research engineer at DeepL. He previously worked at Leibniz Supercomputing Centre (LRZ) and at Intel on HPC and AI dependability. He is a lead co-author of “A Conceptual Framework for HPC Operational Data Analytics” (IEEE 2021), which establishes the 4×4 scope-and-capability model widely used by the ODA community.
Torsten Wilde
HPC System Software Architect
Torsten Wilde (Ph.D.) is a system software architect at Hewlett Packard Enterprise (HPE). His work spans high-volume, high-frequency data collection and analytics for IT operations and dynamic system power management. He is part of the leadership team of the Energy Efficient HPC Working Group (EE HPC WG).