The computational needs of modern scientific research grow steadily, and High-Performance Computing (HPC) systems are designed accordingly with ever-increasing scale and parallelism. As we approach the era of exascale HPC systems, capable of $10^{18}$ operations per second, concerns about excessive power consumption and high failure rates undermine their feasibility: with system components potentially failing every few minutes, wasting in turn precious computational and electrical power, some of the long-running applications common in the HPC landscape might simply become too expensive and difficult to run. Alongside the extreme scale of modern HPC systems, their intricate complexity is fueled by the adoption of heterogeneous architectures and novel cooling systems, as well as complex management software and high performance variability in components caused by their manufacturing processes. For these reasons, HPC machines are increasingly being treated as dynamic, complex systems themselves whose efficiency and effectiveness must be proactively improved.
Analyzing the operation of an HPC system and taking appropriate actions for its optimization is the purpose of Operational Data Analytics (ODA), driven by the large amounts of data produced by monitoring frameworks. The latter capture and store data at a fine granularity from various sensors in hardware and software components, from the facility infrastructure down to the compute node level, which can in turn be used to infer knowledge about system behavior, and thus implement a proactive control loop. Monitoring and ODA are therefore two key aspects in the design of future HPC systems. However, while monitoring is an established reality in most supercomputing centers, ODA is still far from it: many experimental solutions have addressed individual issues ranging from node resiliency to infrastructure management and energy efficiency, but these are insular and rarely adopted in production environments. The main reason for this lies in the absence of established frameworks enabling the adoption of ODA approaches for entire HPC systems or even facilities. A framework of this kind should be designed to cope with the extreme volumes of data associated with monitoring, as well as the tight latency and overhead constraints of real-time system control, and the wide variety of operational requirements inherent to specific techniques.
I-A Related Work
The problem of enabling ODA in HPC systems in a generic, holistic way is still an open research question, and no definitive solution has been proposed. The Lightweight Distributed Metric Service (LDMS) has been recently enhanced to support ODA features on top of standard HPC monitoring. However, due to its pull-based architecture, it may not be suitable for in-band, fine-grained ODA applications that require live data with minimal overhead and latency. Moreover, LDMS currently lacks configuration abstractions to simplify the instantiation of models at a large scale.
The Examon framework is shown to be suitable for ODA applications, being based on the MQTT protocol and thus compatible with tools such as Apache Spark. However, it relies on the use of external tools and thus needs to be tuned ad-hoc for each specific use case. The OMNI framework has a similar architecture, but is more oriented towards visualization of data. Elastic Stack (https://www.elastic.co/products/) supports the post-processing of data ingested from external sources, thus enabling data analytics for monitoring frameworks such as Ganglia. The analysis, however, is limited to the server side, which hinders scalability on large HPC installations.
Many other tools propose basic applications of online analytics specifically tailored for HPC, and implement simple feedback loops between the monitoring component and the resource manager (e.g., the Energy-Aware Runtime (EAR) or IBM LoadLeveler). Similarly, tools like SPar provide abstract and user-friendly interfaces for runtime optimization. These efforts, however, tackle specific issues of resource management in HPC systems, and customization for other purposes is not trivial. The Global Extensible Open Power Manager (GEOPM) provides a plugin-oriented and extensible interface for resource and power management in HPC systems, but its monitoring capabilities are limited.
Alongside the open-source solutions discussed above, there are also many commercial and closed-source products, such as Zenoss (https://www.zenoss.com/) or Splunk (https://www.splunk.com/), offering extensive data analytics capabilities, but they are not tailored for HPC requirements. Due to their commercial nature, these products are often not suitable for use in HPC environments, where direct access and customization of the underlying code are required.
In this paper we present Wintermute, a novel ODA framework tailored for HPC systems. Our solution is based on the DCDB monitoring framework  and enables holistic operational data analytics, with models able to process data and take decisions at any level in an HPC system. We designed Wintermute in light of an extensive literature overview, and based on the previous experiences in ODA at our supercomputing center. Its workflow accommodates most real-world ODA applications, while its small resource footprint renders it suitable for applications in which overhead and latency are critical. In particular, our contributions are the following:
We propose a taxonomy of ODA techniques for HPC systems based on a literature survey, and classify them according to their functional requirements.
We introduce an approach, called the Unit System, to aid in the navigation of an HPC system’s space of monitored sensors using a tree representation.
We implement Wintermute, a holistic ODA framework, which enables analysis of data and control at all levels in the hierarchy of an HPC system.
We demonstrate the scalability and applicability of Wintermute through a series of case studies carried out on the HPC systems at our supercomputing center.
The paper is organized as follows. In Section II we outline the design requirements for our framework based on a literature survey. In Section III we discuss the logical representation of the sensor space we adopt to simplify the configuration of models. In Section IV we describe the workflow of Wintermute, alongside its architecture in Section V. In Section VI we then present a series of case studies we implemented, and in Section VII we conclude the paper.
II Analysis of Requirements
First, we present the use case analysis for the design of the Wintermute framework, following a literature survey and extracting common functional requirements.
II-A Uses of Operational Data Analytics
Even though ODA techniques are emerging for managing many aspects of HPC systems, to the best of our knowledge they have not been systematically classified, and their typical functional requirements are still not clear. Such a classification, however, is fundamental for the design of a generic framework: for this reason we propose a non-exhaustive taxonomy, depicted in Figure 1, identifying the most common use cases associated with ODA in HPC systems. In particular, we identify the following main usage scenarios:
II-B Taxonomy Based on Functional Requirements
The applications listed above operate at varying levels in an HPC system and at different time scales, but they all rely on monitoring data. Some applications, such as those associated with job analysis, may require additional data (e.g., job id or wall time). Based on the above, we derive four classes of ODA techniques, according to the type of data they use and their mode of operation. In particular, we envision two potential types of data sources:
In-band: data sampled and consumed within a specific component in an HPC system, usually a compute node. Techniques using such data sources often operate at a fine temporal scale (i.e., at sampling rates above 1Hz) and require low analysis overhead and low latency in gathering data.
Out-of-band: data potentially coming from any available source in the system, including asynchronous facility data. In a few cases, job-related data may be used as well. For techniques using this type of data, operation often has to be at coarse scale (e.g., in the order of minutes), must be explicitly synchronized (e.g., through time-stamps), but latency and overhead are less of a concern.
On top of their data sources, we also group ODA techniques according to the two following modes of operation:
Online: continuous operation, producing an output resembling a time series, which can then be re-used to drive management decisions and thus produce a feedback loop;
On-demand: operation triggered at specific times (e.g., job submission), to steer management decisions which require certain information about the system’s status.
Using these functional requirements, we can classify the use cases described above, as shown in Figure 1. This gives us a coarse-grained taxonomy, which we use in the following to guide the design of our Wintermute framework.
III The Unit System
As more and more data sources are available to tap into, navigating the space of available sensors in a monitored HPC system to be used as input into ODA systems becomes a critical issue: deriving the semantic and hierarchical relationships existing between thousands of sensors, as well as deploying analysis models at a large scale, becomes difficult and error-prone. We therefore introduce a set of abstractions to address these issues in Wintermute in a structured way.
III-A The Sensor Tree
In Wintermute we model the sensor space as a hierarchical sensor tree, an example of which is depicted in Figure 2. We assume that the keys (or topics) used to identify sensors are forward slash-separated strings similar to file system paths, expressing their physical or logical placement in an HPC system. This scheme is used in DCDB and complies with the MQTT standard. The following is an example of a sensor topic:
The last segment of a topic is the name of the sensor itself, and the preceding path elements express its placement in the system. This representation can be exploited to construct a tree, in which each internal node is a system component (e.g., a compute node or a rack) and each leaf is a sensor. The constructed tree then supplies a comprehensive view of the monitored system’s structure, as well as a natural way to correlate hierarchically-related sensors (e.g., the sensors of a compute node and those of the rack it belongs to).
The structure of the sensor tree is analogous to a file system: components of the HPC system represented by internal tree nodes can be seen as containers, i.e., directories, whereas the sensors themselves corresponding to leaves are akin to files. Its effectiveness depends on the level of detail expressed by the hierarchy of topics, and the responsibility for devising such a hierarchy lies on system administrators and designers.
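The construction described above can be illustrated with a minimal Python sketch. It is not Wintermute's implementation (which is written in C++); the function name `build_sensor_tree` and the nested-dictionary layout are assumptions made for illustration, and the topics reuse the component and sensor names from the example unit discussed in this section.

```python
def build_sensor_tree(topics):
    """Build a nested-dict tree from slash-separated sensor topics.

    Internal nodes are system components; the last topic segment is
    stored as a sensor (leaf) of the component that precedes it.
    Hypothetical layout: {"children": {...}, "sensors": set()}.
    """
    root = {"children": {}, "sensors": set()}
    for topic in topics:
        parts = [p for p in topic.split("/") if p]
        *components, sensor = parts
        node = root
        for name in components:
            # Descend into the component, creating it on first sight.
            node = node["children"].setdefault(
                name, {"children": {}, "sensors": set()})
        node["sensors"].add(sensor)
    return root


tree = build_sensor_tree([
    "/r03/c02/power",
    "/r03/c02/s02/healthy",
    "/r03/c02/s02/cpu0/cpu-cycles",
    "/r03/c02/s02/cpu1/cpu-cycles",
])
chassis = tree["children"]["r03"]["children"]["c02"]
print(sorted(chassis["sensors"]), sorted(chassis["children"]))
```

As in a file system, each component behaves like a directory and each sensor like a file inside it, so hierarchically related sensors can be found by walking up or down from any node.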
III-B Units and Pattern Units
In Wintermute, units are atomic components to which analysis computations are bound. A unit directly represents a node in the sensor tree, from which it takes its name. Then, it comprises a set of input and output sensors: the output sensors are used to deliver the results of the analysis process, and are leaves of the node the unit represents. Input sensors, which provide the data for the analysis, can either be leaves of that same node, or belong to any other node in the sensor tree connected by an ascending or descending path to the unit node. Figure 2 shows a generic example for a unit, named s02, a compute node in an HPC system. In this example, the unit has the output sensor healthy, and a series of input sensors: the cycles and cache misses counters of the CPUs in the compute node, plus the power sensor of the chassis it belongs to.
Combined with its input and output sensors, a unit defines a sub-tree in the sensor tree. Extending this concept, a unit can also be defined as a pattern: here, instead of defining the input and output sensors with their topics, we only specify their name (last topic segment), while the components to which they belong (preceding topic path) are omitted. Instead, we use a pattern expression to define the level of these nodes in the tree (vertical navigation), alongside a filtering expression (horizontal navigation). This expression will match with a set of nodes, which is defined as its domain, in the sensor tree.
When a unit name is defined, a binding in the sensor tree can be created and the pattern unit can be resolved: for each sensor, its pattern expression is replaced with a node from its domain that is hierarchically-related to the unit’s name. Since multiple nodes may satisfy this condition, one pattern expression can produce multiple actual sensors. Conversely, if no node satisfies it, the unit cannot be built.
Recalling the similarity between the sensor tree and a file system, describing sensors through pattern expressions can be interpreted as defining files using relative paths: these paths can match multiple points in the file system, and they are fully resolved as a function of the current working directory, whose analogue in this case is the name of the unit. The main difference between the two is that the tree level of nodes in pattern expressions is defined as an absolute level, whereas for relative file system paths it is defined as a relative offset with respect to the current working directory.
III-C Pattern Instantiation
The unit example shown in Figure 2 can be built from a generic pattern unit using the following pattern expression:
In pattern expressions, the topdown and bottomup keywords drive the vertical navigation, and indicate the highest and lowest level in the tree respectively; the root node of the sensor tree is excluded from this representation, and other levels can be reached through relative offsets. The filter keyword defines the horizontal navigation, and is used to filter the set of nodes the pattern matches, within the tree level indicated by the pattern expression. In this example, the unit’s name is set to /r03/c02/s02/, which identifies an HPC compute node. Once this is set, the rest of the unit is resolved: the power sensor is resolved as /r03/c02/power, as its pattern expression specifies that the sensor should be one level below the highest tree level, at c02. Conversely, the cpu-cycles and cache-misses sensors are on the lowest level, with two nodes (cpu0 and cpu1) belonging to the domains of the respective pattern expressions. As such, sensors from both of them are added to the unit. As the healthy output sensor lies at the same level as s02, it is simply resolved as /r03/c02/s02/healthy.
Pattern units allow us to create abstract I/O specifications, independent of where a model will be run and of the actual sensors, but only specifying their hierarchical relationships. In a large-scale HPC system, this enables the instantiation of thousands of independent ODA models, each with their own set of sensors, by using only a small configuration block.
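The resolution of a pattern unit can be sketched in Python as follows. This is an illustration under stated assumptions, not Wintermute's configuration syntax or code: each pattern is encoded as a hypothetical tuple `(anchor, offset, filter, sensor_name)`, where `anchor` is `"topdown"` or `"bottomup"`, `offset` is the number of levels away from that anchor, and `filter` is a regular expression applied to node names. The helper names (`component_paths`, `domain`, `related`, `resolve`) and the extra node `s01` are likewise invented for the example.

```python
import re


def component_paths(topics):
    """Collect every component (internal node) path implied by the topics."""
    nodes = set()
    for topic in topics:
        components = tuple(p for p in topic.split("/") if p)[:-1]
        for depth in range(1, len(components) + 1):
            nodes.add(components[:depth])
    return nodes


def domain(nodes, anchor, offset=0, name_filter=None):
    """Nodes on the requested tree level whose name matches the filter."""
    deepest = max(len(n) for n in nodes)
    # Vertical navigation: an absolute level, counted from top or bottom.
    length = 1 + offset if anchor == "topdown" else deepest - offset
    return {n for n in nodes
            if len(n) == length
            and (name_filter is None or re.search(name_filter, n[-1]))}


def related(a, b):
    """True if one node lies on the root-to-leaf path of the other."""
    k = min(len(a), len(b))
    return a[:k] == b[:k]


def resolve(nodes, unit, patterns):
    """Replace each pattern with the domain nodes related to the unit."""
    sensors = []
    for anchor, offset, name_filter, sensor in patterns:
        matches = sorted(n for n in domain(nodes, anchor, offset, name_filter)
                         if related(n, unit))
        if not matches:
            raise ValueError("unit cannot be built: no node for " + sensor)
        sensors.extend("/" + "/".join(n) + "/" + sensor for n in matches)
    return sensors


topics = [
    "/r03/c02/power",
    "/r03/c02/s02/cpu0/cpu-cycles", "/r03/c02/s02/cpu0/cache-misses",
    "/r03/c02/s02/cpu1/cpu-cycles", "/r03/c02/s02/cpu1/cache-misses",
    "/r03/c02/s01/cpu0/cpu-cycles",  # a sibling node, to show filtering
]
nodes = component_paths(topics)
unit = ("r03", "c02", "s02")
inputs = resolve(nodes, unit, [
    ("topdown", 1, None, "power"),
    ("bottomup", 0, "cpu", "cpu-cycles"),
    ("bottomup", 0, "cpu", "cache-misses"),
])
outputs = resolve(nodes, unit, [("bottomup", 1, None, "healthy")])
print(inputs)
print(outputs)
```

Changing only `unit` to `("r03", "c02", "s01")` yields a different set of concrete topics from the same patterns, which is what allows one configuration block to describe many units.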
IV Overview of DCDB Wintermute
Wintermute provides a generic ODA framework and its principles can be used on a wide variety of monitoring frameworks. In this work we use the DCDB monitoring framework: in the following we briefly describe DCDB, followed by the workflow of Wintermute, as well as its integration into DCDB and associated options.
IV-A Architecture of DCDB
DCDB is a holistic solution for continuous monitoring in HPC systems. In DCDB, a sensor defines an atomic monitoring entity (e.g., power, temperature or a CPU performance counter) that produces readings, each identified by a numerical value and a time-stamp. DCDB is made of several components in order to achieve a distributed and scalable architecture, which is summarized in Figure 3: Pushers perform the sampling of sensors on monitored components, using a plugin-based architecture that allows users to easily add new data sources. All collected data is sent via the MQTT protocol to Collect Agents, which act as data brokers and forward the data to a Storage Backend, currently implemented using Apache Cassandra. Alongside a series of interfaces for visualizing and retrieving data from a Storage Backend, DCDB also exposes a control RESTful API in every component, as well as sensor caches for fast in-memory access to recent readings.
IV-B Workflow of Wintermute
Wintermute is designed as an additional plugin-based software component that is integrated within Pushers and Collect Agents, and that enhances them by supplying ODA capabilities. In particular, Wintermute plugins allow users to instantiate operators, which are computational entities performing ODA tasks asynchronously. Each operator works on a set of units, which implement the concept described in Section III and comprise a set of input and output sensors that map to DCDB sensors. Figure 3 shows the integration of Wintermute in the existing DCDB architecture: it has access to all resources in a Pusher or Collect Agent, including sensor caches, RESTful APIs and data output methods (i.e., MQTT or Storage Backend). Here we discuss the main available options for configuring Wintermute’s workflow to accommodate all of the use cases laid out in Section II.
As Wintermute is integrated both in the Pusher and Collect Agent, operators can be instantiated in both components by loading the appropriate plugins. In a Pusher, operators only have access to locally-sampled sensors and their sensor cache data. This location is optimal for runtime models requiring data liveness, low latency and horizontal scalability. In a Collect Agent, on the other hand, access to the entire system’s sensor space is available. Data is retrieved from the local sensor cache, if possible, or otherwise queried from the Storage Backend, to which output sensor values of operators are also written. This location is optimal for system or infrastructure-level analysis and feedback loops.
Operators can be configured to work in two different ways depending on their requirements. In Online mode, an operator is invoked at regular time intervals, resulting in continuous analysis and thus producing time series-like sensor data. This is ideal for applications such as fault detection or runtime optimization. In On-demand mode, on the other hand, an operator’s capabilities must be explicitly invoked via the RESTful APIs, by querying a specific unit. Output data is propagated only as a response to the RESTful request. This mode is ideal for scheduling applications, which are triggered irregularly in time.
When using the Online mode, the units of a single operator can be arranged with respect to the underlying model: as sequential, all units share the same model, and are processed sequentially by the operator at each computation interval to avoid race conditions; as parallel, one distinct model (and thus operator) is created for each unit, allowing us to parallelize computation and improve scalability.
As the output data produced by online operators is identical to all other sensor data in DCDB, operators can use the output of other operators as input. This, in turn, allows us to create pipelines, in which multiple stages of a complex analysis are divided among several operators. This can be used to split computational load between multiple entities (e.g., Pusher and Collect Agent) or to achieve complex analysis capabilities with few, general-purpose plugins. Furthermore, this method allows us to implement feedback loops in an HPC system, via control operators at the end of the pipeline that use processed data to tune system knobs.
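The pipeline idea can be sketched in a few lines of Python. Everything here is hypothetical and only illustrates the principle that an operator's output is an ordinary sensor that a later operator can read: the `Operator` class, the dictionary-based sensor store, the topic names under `/n0/` and the threshold of 1.0 are assumptions, not part of Wintermute's interface.

```python
class Operator:
    """A hypothetical online operator: reads input sensors, writes one output."""

    def __init__(self, inputs, output, func):
        self.inputs = inputs
        self.output = output
        self.func = func

    def compute(self, store, timestamp):
        # Use the most recent reading of each input sensor.
        latest = [store[topic][-1][1] for topic in self.inputs]
        store.setdefault(self.output, []).append((timestamp, self.func(latest)))


# Sensor store: topic -> list of (timestamp, value) readings.
store = {
    "/n0/cycles": [(1, 300.0)],
    "/n0/instructions": [(1, 200.0)],
}

# Stage 1 derives a metric; stage 2 consumes that derived sensor as input.
stage1 = Operator(["/n0/cycles", "/n0/instructions"], "/n0/cpi",
                  lambda v: v[0] / v[1])
stage2 = Operator(["/n0/cpi"], "/n0/high-cpi",
                  lambda v: 1 if v[0] > 1.0 else 0)

for stage in (stage1, stage2):
    stage.compute(store, timestamp=1)

print(store["/n0/cpi"], store["/n0/high-cpi"])
```

In a real deployment the two stages could live in different entities (for instance a Pusher and a Collect Agent), with the final stage acting as a control operator.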
V Architecture of DCDB Wintermute
In Figure 4 we show the architecture of Wintermute, abstracting from its integration in DCDB. The Sensor Input and Sensor Output blocks describe the interfaces through which Wintermute obtains sensor data and exposes analysis results respectively. These are also shown in Figure 3 as arrows in and out of the Wintermute blocks for each instantiation scenario. Wintermute is a modular framework based on operator plugins supplying analysis capabilities: these follow an agnostic code interface and are supported by two central entities, the Query Engine and the Operator Manager, which provide the input to operators and expose their output respectively. These are designed to isolate the plugins from the location in which they are instantiated, meaning that a plugin can be deployed to a Pusher or Collect Agent without alterations.
Wintermute is developed in C++11, and all source code is freely available under the GNU GPL license via its GitLab repository (https://dcdb.it). In the following, we describe the framework’s individual components in detail.
V-A Operator Manager
The Operator Manager is the central entity responsible for reading Wintermute configuration files, loading requested plugins and managing their life cycle. As such, it is the main interface between Wintermute and DCDB and allows users to specify which sensors to read. Additionally, Wintermute uses DCDB’s HTTPS Server to forward all ODA-related RESTful API requests to the Operator Manager, so that it can take appropriate action. For example, these requests can instruct the manager to start, stop, or load plugins dynamically, as well as to trigger specific actions on a per-plugin basis.
V-B Query Engine
The Query Engine is a singleton component which exposes the space of available sensors to operator plugins. In particular, it gives access to a Sensor Navigator object, which maintains the tree representation of the sensor space using the Unit System defined in Section III, and allows Wintermute plugins to discover which sensors are available and where in the hierarchical structure they stand. The Query Engine’s uniform interface enables queries based on sensor names and time-stamp ranges. Access to low-level data structures containing the sensor data is achieved by means of a callback function, which is set at startup by the DCDB entity in which the Wintermute framework is running.
When possible, the Query Engine gives higher priority to data in the local sensor caches, which is faster to retrieve compared to querying the Storage Backend. Moreover, queries can be performed in two modes, affecting how the caches are accessed: in the first, relative time-stamps are supplied as an offset against the most recent reading, and the underlying cache view to be returned can be computed in $O(1)$ time. In the second, absolute time-stamps are used, resulting in a binary search to compute the view with $O(\log N)$ time complexity, where $N$ is the number of readings in the cache.
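The difference between the two query modes can be illustrated with a small Python sketch. It assumes a cache that stores readings in time order at a fixed sampling interval; the `SensorCache` class and its method names are hypothetical and do not reflect DCDB's actual data structures. Under that assumption, a relative query locates its view by index arithmetic alone, while an absolute query has to search the time-stamps.

```python
from bisect import bisect_left, bisect_right


class SensorCache:
    """Hypothetical time-ordered cache with a fixed sampling interval."""

    def __init__(self, interval):
        self.interval = interval  # time between consecutive readings
        self.timestamps = []
        self.values = []

    def store(self, timestamp, value):
        self.timestamps.append(timestamp)
        self.values.append(value)

    def view_relative(self, start_offset, end_offset=0):
        """Readings between two offsets back from the most recent one.

        The slice bounds follow from arithmetic on the offsets, with no
        search over the stored time-stamps.
        """
        newest = len(self.values) - 1
        first = max(0, newest - start_offset // self.interval)
        last = max(0, newest - end_offset // self.interval)
        return self.values[first:last + 1]

    def view_absolute(self, start_ts, end_ts):
        """Readings with start_ts <= time-stamp <= end_ts (binary search)."""
        lo = bisect_left(self.timestamps, start_ts)
        hi = bisect_right(self.timestamps, end_ts)
        return self.values[lo:hi]


cache = SensorCache(interval=1)
for i in range(10):
    cache.store(100 + i, float(i))

print(cache.view_relative(3))          # the last four readings
print(cache.view_absolute(103, 105))   # readings stamped 103 to 105
```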
V-C Operator Plugins
The plugins implement all specific logic to perform analysis processes of a certain kind. Operator plugins perform analysis by taking as input only regular sensor data and comply with the same standard code interface in the Wintermute framework. Job operator plugins are an extension of normal operator plugins, complying with the same interface, and can also use job-related data (e.g., user id or node list), producing output that is associated with a specific job. Plugins consist of the following main internal components:
Operators are objects performing the required analysis tasks. Each operator is assigned a series of units, as described in Section III, each storing a set of input and output sensor objects. Whenever computation is invoked for an operator, it will iterate through its units and perform an analysis for each of them, usually querying the respective input sensors through the Query Engine, processing the obtained readings, and storing the result in the output sensors. When performing analysis for a certain unit, access to the operator’s other units is allowed for correlation purposes.
A configurator is responsible for reading a plugin’s own configuration file and instantiating operators accordingly, together with their units: the process to generate the latter is controlled by a series of pattern-based constructs that allow users to define a pattern unit, as discussed in Section III. In detail, unit generation works in the following steps, starting from a pattern unit: a) based on the current sensor tree, the domain of the output sensors’ pattern expression is computed; b) one unit is instantiated for each of the retrieved nodes in the domain; c) for each unit, its set of input and output sensors is resolved according to the domains of the respective expressions. On top of unit-level outputs, users may also define a set of operator-level outputs that can, for example, store the average error of a model applied to a set of units. More details about the configuration mechanism can be found in DCDB Wintermute’s GitLab.
VI Experimental Results
In this section we present our insights on the resource footprint of Wintermute, and discuss several case studies that were carried out to show its flexibility and suitability for large-scale HPC installations. All experiments described in this section were carried out on the CooLMUC-3 HPC cluster at the Leibniz Supercomputing Centre (https://www.lrz.de/services/compute/linux-cluster/coolmuc3/). This cluster is composed of 148 compute nodes, each equipped with a 64-core Intel Xeon Phi 7210-F Knights Landing CPU, 96GB of RAM and an Intel Omni Path Architecture (OPA) interconnect. On this system DCDB runs continuously in production mode, with Pushers in compute nodes sampling data from the perfevent, sysFS, ProcFS and OPA plugins, and with a single Collect Agent forwarding such data to a Storage Backend.
VI-A Performance and Scalability
The performance of DCDB was extensively characterized in the work by Netti et al., and as such here we will focus on quantifying the performance impact of Wintermute’s Query Engine component alone.
We study the performance impact of a Pusher on the High-Performance Linpack (HPL) benchmark. In this context we use the runtime overhead, computed as the percentage increase in execution time of HPL with a Pusher active, as opposed to running it alone. Execution times are calculated via the date Linux command. We instantiate a set of operators in online mode: these belong to a tester operator plugin, and simply perform a certain number of queries over the input sensors of their units. The input monitoring data is provided by another tester plugin, producing a total of 1000 monotonic sensors with negligible overhead, so as to provide a reliable baseline. All operator and monitoring plugins use a sampling interval of 1 second, and a cache size of 180 seconds. The HPL benchmark was configured to use as many threads as physical cores, and each experiment was repeated 10 times, picking median results to ensure statistical significance.
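The overhead metric defined above amounts to a one-line computation, sketched here in Python. The function name is hypothetical, the choice of taking the median of each set of run times before computing the percentage is an assumption about how the medians are applied, and the sample times are made up for illustration rather than measured.

```python
from statistics import median


def runtime_overhead(times_with_pusher, times_alone):
    """Percentage increase in median execution time caused by the Pusher."""
    baseline = median(times_alone)
    return 100.0 * (median(times_with_pusher) - baseline) / baseline


# Illustrative execution times in seconds (not measured values).
alone = [100.0, 100.0, 100.0]
with_pusher = [100.4, 100.5, 100.6]
print(runtime_overhead(with_pusher, alone))
```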
VI-A2 Performance Evaluation
Figure 5 presents the results of our performance evaluation. The two heatmaps depict overhead values when varying the number of queries performed at each analysis interval, as well as the temporal range of each query, using the Query Engine in absolute and relative mode, respectively. Overhead is below 0.5% in all cases, with absolute mode performing slightly worse than relative and showing higher peak overhead values: this is expected, as absolute mode employs binary search and has a higher computational complexity. Further, no clear increase can be observed when increasing the amount of queried sensor data, showing that the Query Engine has good scalability and minimal impact on overhead. Average per-core CPU load of the Pusher is mostly uniform and peaks at 1.2%. Likewise, memory usage never exceeded 25MB.
VI-B Case Study 1 - Power Consumption Prediction
Here we show a case study focused on predicting the power consumption of a compute node (i.e., whole node power measured at the power supply) in the CooLMUC-3 cluster, which can be used alongside other performance metrics to steer online control decisions. This case study serves to show the effectiveness of Wintermute in such a scenario, where data is collected in-band, at a fine time scale, and is immediately re-used for control purposes. The model represents an online implementation of the one proposed by Ozer et al.
In a Pusher, we instantiate a single operator from a regressor plugin, which enables random forest-based online regression. Its input data consists of a set of performance metrics and sensors, and both sampling and regression operate at a 250ms interval. The regressor plugin, which is based on the OpenCV library (https://opencv.org), works in the following way: at each computation interval, for each input sensor of a certain unit a series of statistical features (e.g., mean or standard deviation) are computed from its recent readings. These features are then combined to form a feature vector, which is fed into the random forest model to perform regression and output a sensor prediction of the next 250ms.
Training of the model, which is shared by all units of an operator, is performed automatically: feature vectors are accumulated in memory until a certain training set size is reached, alongside the responses from the sensor to be predicted. In this case, the responses come from the power sensor, with the model set to predict its value in the next 250ms. With the Pusher running, we execute the Kripke, AMG, Nekbone and LAMMPS HPC applications from the CORAL-2 (https://asc.llnl.gov/coral-2-benchmarks) suite, with as many threads as physical cores, while the regressor operator builds its training set. In this case, the operator has only one unit, corresponding to the whole compute node, and the training set’s size is set to 30k. Once training is complete, we evaluate the regression online with new DCDB data.
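The feature extraction and training-set accumulation described above can be sketched in Python. This is a simplified illustration, not the regressor plugin's code: the names `feature_vector` and `TrainingSet` are hypothetical, only two features (mean and population standard deviation) are computed per sensor, and the random forest itself is omitted, since the sketch only shows how each feature vector is paired with the target sensor's value one interval later.

```python
from statistics import mean, pstdev


def feature_vector(windows):
    """Concatenate per-sensor statistics computed from recent readings.

    `windows` holds one list of recent readings per input sensor.
    """
    features = []
    for readings in windows:
        features.extend([mean(readings), pstdev(readings)])
    return features


class TrainingSet:
    """Accumulate (feature vector, next-interval response) pairs in memory."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.features = []
        self.responses = []
        self._pending = None  # vector still waiting for its response

    def complete(self):
        return len(self.features) >= self.target_size

    def add(self, vector, current_response):
        # The value observed now is the response for the previous vector,
        # because the model is meant to predict one interval ahead.
        if self._pending is not None and not self.complete():
            self.features.append(self._pending)
            self.responses.append(current_response)
        self._pending = vector


training = TrainingSet(target_size=2)
training.add(feature_vector([[1, 2, 3], [4, 4, 4]]), current_response=210.0)
training.add(feature_vector([[2, 3, 4], [4, 4, 5]]), current_response=215.0)
training.add(feature_vector([[3, 4, 5], [4, 5, 5]]), current_response=220.0)
print(training.complete(), training.responses)
```

Once `complete()` returns true, the accumulated pairs would be handed to the regression model for training.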
Figure 6 summarizes the results of the model. In particular, it shows a small excerpt from the time series associated with the real and predicted power sensors respectively: it can be seen that the time series of the predicted power consumption closely follows the real time series, capturing status changes and periodic behaviors before they occur. However, the predicted time series fails to capture some short spikes or oscillations in power consumption, and resembles a smoothed version of the real one. These events are difficult to predict, as they are usually related to the CPU’s power management policy, which may exhibit short-term spikes for throughput improvement (e.g., Turbo mode on Intel CPUs), or may be related to electrical or sensor noise. The phenomena described above are more apparent in Figure 6b, which shows the average relative prediction error for each real power value, together with the fitted probability density function (PDF) of the latter. It can be seen that prediction is worse for higher power values; as can be observed from the PDF, these values represent a minimal portion of the distribution, and have negligible impact on the overall error. Moreover, this imbalance in the distribution directly translates to an imbalance in the training set of the model, which does not have enough data to capture this type of behavior. Similarly, some low-power states that are relatively rare are not predicted well by the model. However, in the regions of the distribution where most of the data concentrates, error is always close to 5%, proving the model’s effectiveness.
We obtained comparable results when sampling and predicting power consumption at time intervals of 125ms and 500ms, with average relative error values of 10.4% and 6.7% respectively. While more accurate prediction could be obtained with specialized plugins, this example shows that very good results can be obtained with general-purpose plugins, and with little effort. Moreover, we found that the additional overhead of performing regression on top of standard monitoring (measured against the HPL benchmark as in Section VI-A) was in the 0.1% range and thus negligible.
VI-C Case Study 2 - Analysis of Job Behavior
In this case study we use Wintermute to produce aggregated performance metrics on a per-job basis, which can be visualized to gain insight on application behavior. We combine two different plugins, showing how pipelines can be used in Wintermute to perform complex analyses and split computational load. The plugins discussed here represent a re-implementation of the PerSyst framework.
We deploy two distinct operator plugins, implementing a pipeline as described in Section IV. The first perfmetrics plugin, instantiated in the Pushers, takes as input CPU and node-level data and computes as output a series of derived performance metrics, such as cycles per instruction (CPI), floating point operations per second (FLOPS) or vectorization ratio, which are useful to evaluate application performance. A second persyst plugin is instantiated in the main Collect Agent: at each computing interval, it queries the set of running jobs on the HPC system, and for each of them it instantiates a unit according to its configuration. In this case, units have as input one of the perfmetrics derived metrics from all compute nodes on which the job is running. From these, the operator computes a series of job-level statistical indicators (e.g., mean) as output. In the Pushers and Collect Agent, sampling and computation are performed at a 1s interval.
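As an illustration of the kind of derived metric computed in the first stage, the sketch below obtains CPI for one sampling interval from two readings of cumulative cycle and instruction counters. This is an assumption-laden example, not the perfmetrics plugin itself: the function name and the assumption that the inputs are monotonically increasing counters are ours.

```python
def cpi_from_counters(prev, curr):
    """Cycles per instruction over one sampling interval.

    prev, curr: (cycles, instructions) readings of cumulative counters
    (hypothetical input layout). Returns None when no instructions were
    retired in the interval, since CPI is undefined in that case.
    """
    d_cycles = curr[0] - prev[0]
    d_instructions = curr[1] - prev[1]
    if d_instructions <= 0:
        return None
    return d_cycles / d_instructions

# Two readings taken one sampling interval apart (illustrative values).
cpi = cpi_from_counters((1_000_000, 500_000), (4_200_000, 2_500_000))
print(cpi)
```

Working on counter differences rather than absolute values keeps each output tied to a single interval, which is what a per-interval time series of CPI requires.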
We executed four jobs, each on 32 CooLMUC-3 nodes, running the Kripke, AMG, Nekbone and LAMMPS applications respectively. The job runs were repeated multiple times and under different node configurations to ensure consistency. Here we focus on the CPI metric: thus, we configured the perfmetrics plugin to have an operator with one unit per CPU core, each producing as output its CPI value. Then, we use a persyst operator which outputs the deciles of the CPI distribution at each time point, as computed by aggregating the corresponding input values for each job. Since the latter are computed per-core, each decile is aggregated from 2048 samples at a time. This allows us to gain an overall understanding of the behavior of the applications running on the HPC system, while the full set of available metrics allows us to characterize their performance profile and bottlenecks.
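The decile aggregation described above can be sketched as follows. This is a minimal example and not the persyst operator: the function name is hypothetical, the CPI values are synthetic, and the interpolation method (Python's "inclusive" quantile definition) is an assumption, since the text does not state how cut points are interpolated.

```python
import random
import statistics

def cpi_deciles(cpi_values):
    """Return deciles 0..10 of a set of per-core CPI values.

    Decile 0 is the minimum, decile 5 the median and decile 10 the
    maximum; the nine inner cut points come from statistics.quantiles.
    """
    cuts = statistics.quantiles(cpi_values, n=10, method="inclusive")
    return [min(cpi_values)] + cuts + [max(cpi_values)]

# Synthetic stand-in for one time point of a job: 2048 per-core CPI values.
random.seed(0)
samples = [random.uniform(1.0, 3.0) for _ in range(2048)]
deciles = cpi_deciles(samples)
for i in (0, 2, 5, 8, 10):
    print(f"decile {i}: {deciles[i]:.3f}")
```

Repeating this computation at every time point yields one time series per decile, which is the form of data needed for a per-job plot of the CPI distribution over time.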
In Figure 7 we show the results of our analysis: for each job, we show the time series for deciles 0, 2, 5, 8 and 10 of the aggregated per-core CPI values while running the corresponding CORAL-2 application; deciles 0, 5 and 10 correspond to the minimum, median and maximum CPI values respectively. It can be seen that the applications exhibit distinctly different behaviors depending on the underlying computational workload: LAMMPS shows low CPI values averaging 1.6, with minimal spread in the distribution, which is due to its mostly compute-bound nature. A similar behavior can be observed with AMG, with low CPI values up to decile 5: however, deciles 8 and 10 show spikes up to CPI values of 30. As AMG is a heavily network-bound application, this behavior is likely caused by network latency affecting I/O.
Kripke has a very distinctive profile: it is possible to separate each single iteration, thanks to the increase and decrease in CPI values across all deciles. Similarly to AMG, Kripke is also a network- and memory-bound application, and is thus characterized by relatively high CPI values. Finally, Nekbone shows the most interesting behavior: low CPI values can be observed in the first part of the application run, which is expected as Nekbone is a compute-bound application. In the second part of the run, however, the spread across deciles increases dramatically, with at least 20% of the CPUs exhibiting higher CPI values. Our hypothesis is that, as Nekbone performs a batch of tests on increasing problem sizes, the application becomes memory-limited as soon as its working set grows past the 16GB of High-Bandwidth Memory available in CooLMUC-3 nodes. This represents a typical example of how visualization of performance metrics can be used to spot performance bottlenecks in HPC applications.
VI-D Case Study 3 - Identification of Performance Anomalies
For the final case study, we conduct a long-term analysis on coarse-grained monitoring data from all compute nodes in the CooLMUC-3 cluster. By applying unsupervised learning techniques, we characterize the performance of the entire HPC system and highlight variance between compute nodes, as well as identify outliers and anomalous behaviors.
We instantiate a single plugin performing Bayesian Gaussian mixture-based clustering in the main Collect Agent. This plugin is configured to have one operator with as many units as compute nodes, each having as input a node’s power, temperature and CPU idle time sensors, and as output the label of the cluster to which it belongs. More precisely, at every computation interval the operator computes 2-week averages for the input sensors of each unit. Then, each unit is treated as a data point in a three-dimensional space, and clustering is applied to these points. Sampling in Pushers is performed at 10s intervals, and clustering is performed hourly.
We adopt a Bayesian Gaussian mixture model because, unlike an ordinary Gaussian mixture model, it is able to determine autonomously the optimal number of clusters from data. This is useful in an online, continuous scenario, where the diverse states of an HPC system can be captured without manual tuning of the model’s parameters. The number of input sensors to the clustering algorithm - and thus the number of dimensions - can be changed at will, as well as the length of the averages’ aggregation window. Since the job runtime limit is set to 2 days on CooLMUC-3, we choose a value of 2 weeks so as to extract the performance profile of each node without knowledge of running jobs.
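The aggregation step that turns raw sensor readings into one point per node can be sketched as below. This is an illustrative example rather than the plugin's implementation: the class name, the running-sum approach and the sample values are assumptions. One such accumulator per sensor (power, temperature, CPU idle time) would yield the three coordinates of a node.

```python
from collections import deque

TWO_WEEKS_S = 14 * 24 * 3600  # aggregation window length in seconds

class WindowedMean:
    """Mean of the samples whose timestamps fall in a trailing window."""

    def __init__(self, window_s=TWO_WEEKS_S):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        """Insert a sample and evict those older than the window."""
        self.samples.append((ts, value))
        self.total += value
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            _, old = self.samples.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.samples) if self.samples else None

# Short window for illustration: 30s of history, one sample every 10s.
power = WindowedMean(window_s=30)
for ts, watts in [(0, 100.0), (10, 200.0), (20, 300.0), (30, 400.0)]:
    power.add(ts, watts)
print(power.mean())
```

With a 10s sampling interval, a 2-week window holds 120960 samples per sensor, so a production version may prefer pre-aggregated values over keeping every raw sample in memory.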
Figure 8 shows the result of the clustering process for a single time window. The points in the scatter plot correspond to compute nodes in CooLMUC-3 whose coordinates are their 2-week power, temperature and CPU idle time averages. First, it can be observed that the three metrics are strongly correlated, and the compute nodes describe a clear linear trend: this is expected, as a compute node will consume less power if idling, and its temperature will be lower as well. Most nodes concentrate in Cluster 1 towards the center of the plot, with relatively small spread.
Despite the 2-week aggregation window we adopted, some stark differences in node behavior can still be observed: compute nodes belonging to Cluster 0 have a higher CPU idle time, showing low power and temperature values accordingly. Conversely, nodes in Cluster 2 were under heavier load compared to other nodes, peaking at 200W of average power consumption for a single node. While this behavior could simply be due to specific sequences of applications running on the nodes, it is more likely the symptom of a job scheduling policy that does not account for workload balance between nodes. A few points were classified as outliers, as their probability was lower than a certain threshold (0.001 in our case) in the PDFs of all fitted Gaussian components; the behavior of the corresponding nodes diverges significantly from the rest of the system. One node in particular shows a concerning trend, consuming roughly 20% more power than other nodes with similar CPU idle time. We are currently investigating this anomaly, and plan to conduct a long-term root cause study. As shown, this type of analysis is very effective at supplying a comprehensive view of an HPC system’s behavior, and can be useful to system administrators and researchers alike. Similarly, it can also be used to improve scheduling policies, by considering recent node behavior to establish priority rules.
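The outlier rule described above can be sketched as follows, under several assumptions that are ours and not stated in the text: the mixture components are taken as already fitted, covariances are simplified to diagonal form, features are assumed to be standardized, and a non-outlier point is assigned to the component with the highest density (ignoring mixture weights). The component parameters and test points are purely illustrative.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Density at x of a Gaussian with diagonal covariance (simplification)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def label_point(x, components, threshold=0.001):
    """Return the index of the densest component, or -1 for an outlier.

    A point is an outlier when its density is below the threshold under
    every component; components is a list of (mean, variance) tuples.
    """
    densities = [diag_gaussian_pdf(x, m, v) for m, v in components]
    if all(d < threshold for d in densities):
        return -1
    return max(range(len(densities)), key=densities.__getitem__)

# Hypothetical fitted components in a standardized 3D feature space
# (power, temperature, CPU idle time).
components = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((3.0, 3.0, 3.0), (1.0, 1.0, 1.0)),
]
print(label_point((0.1, 0.0, 0.0), components))      # near component 0
print(label_point((3.0, 3.0, 2.9), components))      # near component 1
print(label_point((10.0, -10.0, 10.0), components))  # far from both
```

Note that the threshold is compared against density values, so its meaning depends on the scale of the features; this is why the sketch assumes standardized inputs.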
In this paper we have presented Wintermute, a holistic ODA framework for HPC systems suitable for both streaming and on-demand operation. Its core objective is to simplify the instantiation of complex models for the management of HPC systems. While its implementation extends the DCDB monitoring system so it can be used in our production environment, its design, developed after an extensive literature review and requirement analysis, is generic and can be applied to other monitoring solutions. Furthermore, we adopt a novel set of logical abstractions, denoted as the Unit System, to partition the space of available sensors and simplify model configurations. We have shown that Wintermute has a small resource footprint, making it suitable for applications in which latency and overhead are critical. We have also implemented a series of case studies in the fields of runtime optimization, job analysis and performance variation, which highlight Wintermute’s flexibility and applicability across a wide range of usage scenarios.
Wintermute is currently deployed to perform aggregation of monitored metrics in the CooLMUC-3 system. As future work, we plan to identify additional use cases for it at our HPC center, as well as further refine it for production use. As DCDB is already used on most of our HPC systems, deployment of Wintermute mostly consists of developing the required plugins, and of updating the existing DCDB installations.
Acknowledgements. This research activity has received funding from the DEEP-EST project under the EU H2020-FETHPC-01-2016 Programme grant agreement n. 754304.
- (2014) The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications. In Proc. of SC 2014, pp. 154–165.
- (2018) Large-scale system monitoring experiences and recommendations. In Proc. of CLUSTER 2018, pp. 532–542.
- (2018) Taxonomist: application detection through rich monitoring data. In Proc. of Euro-Par 2018.
- (2014) A case study of energy aware scheduling on SuperMUC. In Proc. of ISC 2014, pp. 394–409.
- (2018) Cognified distributed computing. In Proc. of ICDCS 2018, pp. 1180–1191.
- (2007) Cool job allocation: measuring the power savings of placing jobs at cooling-efficient locations in the data center. In Proc. of USENIX 2007, Vol. 138, pp. 140.
- (2019) Collecting, monitoring, and analyzing facility and systems data at the national energy research scientific computing center. In Proc. of the ICPP 2019 Workshops, pp. 10.
- (2017) Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools. In Proc. of DATE 2017, pp. 1038–1043.
- (2019) Operational data analytics: optimizing the national energy research scientific computing center cooling systems. In Proc. of the ICPP 2019 Workshops, pp. 5:1–5:7.
- (2014) Toward exascale resilience: 2014 update. Supercomputing frontiers and innovations 1 (1), pp. 5–28.
- (2015) Energy-aware cooling for hot-water cooled supercomputers. In Proc. of DATE 2015, pp. 1353–1358.
- (submitted) EAR: energy management framework for supercomputers. In Proc. of IPDPS 2018.
- (2003) The linpack benchmark: past, present and future. Concurrency and Computation: practice and experience 15 (9), pp. 803–820.
- (2017) Global extensible open power manager: a vehicle for HPC community collaboration on co-designed energy management solutions. In Proc. of ISC 2017, pp. 394–412.
- (2010) Semantic resource allocation with historical data based predictions. In Proc. of CLOUD 2010.
- (2015) Evalix: classification and prediction of job resource consumption on HPC platforms. In Proc. of JSSPP 2015, pp. 102–122.
- (2017) Data-driven job dispatching in HPC systems. In Proc. of MOD 2017, pp. 449–461.
- Analysis of XDMoD/SUPReMM data using machine learning techniques. In Proc. of CLUSTER 2015, pp. 642–649.
- (2017) ScrubJay: deriving knowledge from the disarray of HPC performance data. In Proc. of SC 2017, pp. 35.
- (2015) Overtime: a tool for analyzing performance variation due to network interference. In Proc. of the Exascale MPI Workshop 2015, pp. 4.
- (2018) Service level objectives via c++11 attributes. In Proc. of REPARA Workshop 2018.
- (2013) Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Proc. of SRDS 2013, pp. 205–214.
- The persyst monitoring tool - A transport system for performance data using quantiles. In Proc. of the Euro-Par 2014 Workshops, pp. 363–374.
- (2018) Energy-efficient application resource scheduling using machine learning classifiers. In Proc. of ICPP 2018, pp. 45.
- (2015) Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing. In Proc. of SC 2015, pp. 1–12.
- (2018) Integrating low-latency analysis into hpc system monitoring. In Proc. of ICPP 2018, pp. 5.
- (2018) Characterizing supercomputer traffic networks through link-level analysis. In Proc. of CLUSTER 2018, pp. 562–570.
- (2019) Fine-grained warm water cooling for improving datacenter economy. In Proc. of ISCA 2019, pp. 474–486.
- A reinforcement learning-based power management framework for green computing data centers. In Proc. of IC2E 2016, pp. 135–138.
- (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing 30 (7), pp. 817–840.
- (2010) On the use of machine learning to predict the time and resources consumed by applications. In Proc. of CCGrid 2010, pp. 495–504.
- (2016) Machine learning predictions of runtime and IO traffic on high-end clusters. In Proc. of CLUSTER 2016, pp. 255–258.
- (2018) Adaptive online runtime prediction to improve HPC applications latency in cloud. In Proc. of CLOUD 2018, pp. 762–769.
- (2019) From facility to application sensor data: modular, continuous and holistic monitoring with DCDB. In Proc. of SC 2019.
- Towards a predictive energy model for hpc runtime systems using supervised learning. In Proc. of PMACS Workshop 2019.
- (1998) Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1133–1142.
- (2018) An approach for dynamic detection of inefficient supercomputer applications. Procedia Computer Science 136, pp. 35–43.
- (2016) Power consumption modeling and prediction in a hybrid CPU-GPU-MIC supercomputer. In Proc. of Euro-Par 2016, pp. 117–130.
- (2016) Towards operator-less data centers through data-driven, predictive, proactive autonomics. Cluster Computing 19 (2), pp. 865–878.
- (2018) Online diagnosis of performance variation in HPC systems using machine learning. IEEE Transactions on Parallel and Distributed Systems.
- (2008) Power-aware dynamic placement of hpc applications. In Proc. of ICS 2008, pp. 175–184.
- (2014) Scaling the power wall: a path to exascale. In Proc. of SC 2014, pp. 830–841.
- (2017) Modular reinforcement learning for self-adaptive energy efficiency optimization in multicore system. In Proc. of ASP-DAC 2017, pp. 684–689.
- PRIONN: predicting runtime and io using neural networks. In Proc. of ICPP 2018, pp. 46.
- HPC usage behavior analysis and performance estimation with machine learning techniques. In Proc. of PDPTA 2012, pp. 1.