This report presents insights and results from a data analysis of a large computing environment: the supercomputer Mistral (https://www.dkrz.de/up/systems/mistral), ranked as the 42nd most powerful in the world as of January 2018 (TOP500 ranking of November 2017, https://www.top500.org/system/178567). The HPC system has a peak performance of 3.14 PetaFLOPS and consists of approx. 3,300 compute nodes, 100,000 compute cores, 266 terabytes of memory, and a 54 PiB Lustre file system. The presented work is associated with PhD research on root cause analysis for complex and distributed IT systems [1, 2].
II Data center of the German Climate Computing Centre (DKRZ)
The data center at DKRZ contains (1) the Mistral supercomputer, which comprises 3336 computing nodes placed in 47 racks, (2) about 90 special nodes dedicated to maintenance activities, data pre-processing, post-processing, and advanced visualization, and (3) a separate 54-petabyte Lustre file system. The majority of the racks are homogeneous, holding 72 blades of the same type; each rack encloses 4 or 2 chassis, with a maximum capacity of 18 blades per chassis. Computing nodes are divided into several partitions: development, pre/postprocessing, test, and production. In this paper, we analyze data from the production partitions. The workload is generated by a variety of applications and simulators used in areas such as climate science, geology, and natural environment. In this data center, resource allocation and accounting are maintained using Slurm (https://slurm.schedmd.com/). This open-source resource and job management system manages users' node reservations. The computing nodes are Intel Xeon 12C 2.5GHz (Haswell) and Intel Xeon 18C 2.1GHz (Broadwell).
A chassis contains 18 blades, less frequently 12 or 16. Understanding the data center structure is essential for the creation of diagnostic models. First, it allows us to determine the structures from which data need to be collected. Second, it enables the detection of local interactions between neighboring devices. For example, in the Mistral system, inventory tables contain detailed information about all the installed equipment, its interconnections, management controllers, and location. Table I presents the number of computing nodes by type, e.g., B720_24_64 stands for Bull B720 with 24 cores and 64 GB of RAM. Furthermore, the dataset used contains metrics and topology information from other equipment installed in the data center, such as controllers, switches, power meters, water valves, and cold doors. See Table I for details. The remaining, non-homogeneous racks contain blades of diverse types, but each chassis holds exactly one node type.
Table I: Number of computing nodes by type. Columns: Node type | Quantity | Homogeneous racks, 72 blades.
II-A Job scheduler history
According to the dataset, each job submission finishes in one of the following states, as defined in the Slurm documentation.
Completed – Job has terminated all processes on all nodes with an exit code of zero.
Cancelled – Job was cancelled by the user or administrator. The job may or may not have been initiated. In the following analysis, we take into account only cancelled jobs longer than 0 s.
Timeout – Job terminated upon reaching its time limit.
Failed – Job terminated with a non-zero exit code or other failure condition. On Mistral, other failure conditions include failures caused by any factor external to an allocated node, e.g., failures of the Lustre file system (FS) or InfiniBand (IB).
Node fail – Job terminated due to a failure of one or more allocated nodes. This state includes only hardware-related problems of the computing node itself.
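The state definitions above, including the rule that only cancelled jobs longer than 0 s are kept, can be sketched in Python as follows. This is a minimal illustration and not the pipeline used for the analysis: the helper names `normalize_state` and `keep_for_analysis` are hypothetical, and the handling of a suffix such as `CANCELLED by <uid>` is an assumption about how raw state strings may appear in an accounting export.

```python
from enum import Enum


class JobState(Enum):
    """Final job states considered in the analysis (Slurm naming)."""
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"
    FAILED = "FAILED"
    NODE_FAIL = "NODE_FAIL"


def normalize_state(raw: str) -> JobState:
    """Map a raw state string to a JobState.

    Assumption: the raw string may carry extra tokens after the state
    (e.g. "CANCELLED by 1234"), so only the first token is used.
    """
    return JobState(raw.split()[0].upper())


def keep_for_analysis(state: JobState, elapsed_s: float) -> bool:
    """Drop cancelled jobs that never ran (elapsed time of 0 s)."""
    if state is JobState.CANCELLED:
        return elapsed_s > 0
    return True


# Toy usage with made-up records: (raw state, elapsed seconds).
records = [("COMPLETED", 120), ("CANCELLED by 1234", 0), ("NODE_FAIL", 30)]
kept = [(s, e) for s, e in records if keep_for_analysis(normalize_state(s), e)]
```

Filtering on elapsed time rather than on start time keeps the rule independent of how unstarted jobs are encoded in the export.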
Users submit jobs, which consist of one or more steps. A step is a set of tasks within a job. The execution order of steps is defined; they may be executed sequentially, in parallel, or in a mixed fashion, although most steps in the Mistral dataset are executed sequentially. For instance, a single step may utilize all nodes allocated to the job, or several steps may independently use a portion of the allocation. In the Slurm database, there are 76 columns with information about each job. Some of these columns hold the job configuration specified by the user, such as allocated nodes, time limit, and required CPU frequency. Others contain job statistics, which include timing and average hardware usage, such as disk read/write (R/W), i.e., the sum of local storage and Lustre operations performed by a particular job, and virtual memory (VM) size. In this paper, we consider all listed job states. For steps, the dataset includes only Completed, Failed, and Cancelled.
Figure 1 presents a high-level scheme of the data processing modules and data sources.
III Failed Job Analysis
III-A General statistics
According to the data from the job scheduler, more than 1.3M jobs with more than 270k different job names were submitted in the 10-month period represented by the dataset extracted from the Mistral production environment. These submissions, which are mainly executed in batch mode (98.8%), resulted in over 4.8M steps. Completed jobs account for 91.3% of all submitted ones. In contrast, 5.6% of started jobs end in the failed state, 1.7% of submissions are cancelled, 1.4% end in a timeout, and 0.028% fail because of a computing node problem. The analysis of these data shows that the mean number of allocated nodes is 3.4 for completed steps and 18 for failed ones. This result follows a pattern commonly reported in the literature: failed steps are usually more complex. The average duration and standard deviation of failed and completed jobs are quite similar. When it comes to steps, completed ones take on average […], while failed ones take almost three times longer. For detailed statistics, see Table II for jobs and Table III for steps. These general statistics provide a convincing motivation for generating savings through the early termination of jobs predicted to fail. An average failed job consumes many more CPU hours than a completed one and decreases resource availability. About 1.2M of all steps in the set run for more than […] and 1.1M for more than […].
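Per-state figures of this kind (share of submissions, mean number of allocated nodes, mean duration) could be derived with a simple aggregation such as the one sketched below. The record layout `(state, allocated_nodes, duration_s)` and the function name `summarize_by_state` are assumptions made for illustration, and the records in the usage lines are toy values, not data from the Mistral dataset.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_state(records):
    """Aggregate (state, allocated_nodes, duration_s) records per state.

    Returns {state: {"share", "mean_nodes", "mean_duration_s"}}.
    """
    records = list(records)
    groups = defaultdict(list)
    for state, nodes, duration_s in records:
        groups[state].append((nodes, duration_s))

    total = len(records)
    summary = {}
    for state, rows in groups.items():
        summary[state] = {
            "share": len(rows) / total,
            "mean_nodes": mean(n for n, _ in rows),
            "mean_duration_s": mean(d for _, d in rows),
        }
    return summary


# Toy usage with made-up values.
toy = [
    ("COMPLETED", 1, 100),
    ("COMPLETED", 3, 300),
    ("FAILED", 18, 600),
    ("CANCELLED", 2, 50),
]
stats = summarize_by_state(toy)
```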
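Table II: Job statistics by state. Columns: State | Count | Allocated nodes | Duration [s].
Table III: Step statistics by state. Columns: State | Count | Allocated nodes | Duration [s] | Average disk read [GB] | Average disk write [GB].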
III-B Job state sequences
The outcomes of the previous analysis encourage an analysis of correlations between a user's past jobs and the final state of a subsequent job. First, we create a matrix presenting job state transitions. In detail, Figure 2 illustrates the states of 2-job sequences, grouped by user name and job name (exact string match). Another possibility for building these sequences is to match jobs by parts of their names, e.g., without suffixes, which usually stand for a simulated year or another parameter of the application run. The previous state NONE refers to initial submissions, of which 88% complete and the majority of the rest fail. Importantly, only 19% of next submissions complete after a job has failed, and 75% of them fail again. The majority of jobs complete after a hardware failure of a node. These data also suggest a few interesting explanations. For instance, users often submit applications that are correct and do not fail. Then they start trials, implement changes, or simply develop their models. The majority of next submissions complete, but failures are still twice as probable as cancellations or timeouts. A typical user is more likely to have a job in the completed state after a cancelled job than after a failed one. An interesting fact is that the probability of a node failure reaches its maximum value after another node failure, and it has the same order of magnitude for all other states.
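A transition matrix over 2-job sequences, grouped by user name and exact job name with NONE as the previous state of a first submission, could be built along the following lines. This is a sketch under stated assumptions: the record layout `(user, job_name, submit_time, state)` and the function name `transition_probabilities` are hypothetical, and each row is normalized so that it holds the probabilities of the next state given the previous one.

```python
from collections import Counter, defaultdict


def transition_probabilities(jobs):
    """Estimate P(next state | previous state) from job records.

    jobs: iterable of (user, job_name, submit_time, state).
    Jobs are grouped by (user, job_name) and ordered by submit_time;
    the first job of each group has previous state "NONE".
    """
    sequences = defaultdict(list)
    for user, name, submit_time, state in jobs:
        sequences[(user, name)].append((submit_time, state))

    counts = defaultdict(Counter)
    for seq in sequences.values():
        previous = "NONE"
        for _, state in sorted(seq, key=lambda item: item[0]):
            counts[previous][state] += 1
            previous = state

    matrix = {}
    for previous, following in counts.items():
        row_total = sum(following.values())
        matrix[previous] = {s: c / row_total for s, c in following.items()}
    return matrix


# Toy usage with made-up submissions of two users.
toy_jobs = [
    ("u1", "run_a", 1, "COMPLETED"),
    ("u1", "run_a", 2, "FAILED"),
    ("u1", "run_a", 3, "FAILED"),
    ("u2", "run_a", 1, "FAILED"),
    ("u2", "run_a", 2, "COMPLETED"),
]
matrix = transition_probabilities(toy_jobs)
```

Matching by parts of a job name, as mentioned above, would only change the grouping key; the counting logic stays the same.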
Regarding the correlation between cancelled and failed jobs, 13% of next submissions after a cancellation fail and only one third complete. Moreover, Table III shows that cancelled steps are characterized by much higher disk R/W than completed and even failed ones. One potential cause, suggested in interviews with system administrators, is that they cancel steps because of high storage system usage (I/O counters). After a cancellation, a job is possibly corrected and re-submitted so that it completes. Further analysis is shown in Figure 5, which presents the average ratio of past failed and cancelled jobs to all submitted jobs over different numbers N of prior submissions, for each job state. A clear observation is that, on average, among the preceding ten jobs there are as many cancelled jobs as failed ones for all states except node fail, probably because of a lack of diverse samples. It can be highlighted that a cancellation often follows other cancellations, and a failure other failures.
Besides, in Figure 6, we present the distribution of correlation types between the numbers of failed and cancelled jobs in different time windows. Aggregation in 4-week periods with no lag between these sequences yields the highest number of sequences with a correlation coefficient over a fixed threshold of 0.3. Additionally, Figure 8 presents the distributions of the correlation coefficient value for different time windows. These distributions show that correlations are stronger for longer periods, i.e., weeks rather than days. In connection with this, sequences of cancellations and failures are presented in Figure 7 for a randomly chosen user with relatively high activity. Surprisingly, local minima of failed and cancelled jobs occur in the same time periods. At the same time, high user activity does not necessarily mean a high number of failures and cancellations, since a user might repeatedly submit the same working code. These sequences reveal that there are periods of re-running the same models and periods of experiments in which a model is changed. This phenomenon is confirmed by researchers working at DKRZ.
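The windowed correlation between failed and cancelled job counts can be illustrated with the sketch below. Several points are assumptions rather than details given in the text: the Pearson coefficient is used as the correlation measure, the labels returned by `correlation_type` (positive, negative, none) are one plausible reading of "correlation type", and the event layout `(timestamp_s, state)` and all function names are hypothetical. Only the threshold of 0.3 comes from the text.

```python
from math import sqrt


def windowed_counts(events, window_s, n_windows):
    """Count FAILED and CANCELLED jobs per fixed-length time window.

    events: iterable of (timestamp_s, state), timestamps relative to 0.
    Returns (failed_counts, cancelled_counts), each of length n_windows.
    """
    failed = [0] * n_windows
    cancelled = [0] * n_windows
    for timestamp_s, state in events:
        index = int(timestamp_s // window_s)
        if 0 <= index < n_windows:
            if state == "FAILED":
                failed[index] += 1
            elif state == "CANCELLED":
                cancelled[index] += 1
    return failed, cancelled


def pearson(xs, ys):
    """Pearson correlation; None if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_type(r, threshold=0.3):
    """Classify a coefficient against a fixed threshold."""
    if r is None:
        return "undefined"
    if r > threshold:
        return "positive"
    if r < -threshold:
        return "negative"
    return "none"
```

Repeating the computation for several values of `window_s` (days, weeks, four weeks) gives one coefficient per user and window length, which is the kind of input a distribution such as the one in Figure 8 would need.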
III-C Time view
The overall daily cycle of jobs can be seen in Figure 9. The number of jobs per state is normalized to the mean number of started jobs over the whole day. Naturally, the number of started jobs is much lower during the night. Between 10:00 and 17:00, the number of submissions is above the mean. Moreover, Figure 10 presents the distribution of the time elapsed from job submission to job start. This distribution shows that the waiting time is highest for jobs that end in the timeout and node fail states.
Figure 11 presents the average number of cancelled and failed jobs aggregated by time of day. The highest number of failed jobs start between 14:00 and 16:00, while for cancelled jobs the maximum is at 15:00.
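The hour-of-day aggregation described in this subsection could look like the following sketch. It assumes that "normalized to the mean number of started jobs" means dividing each hourly count by the mean number of jobs started per hour across all states; this interpretation, the record layout `(start, state)`, and the name `hourly_profile` are assumptions, and the usage lines use made-up timestamps.

```python
from collections import defaultdict
from datetime import datetime


def hourly_profile(jobs):
    """Normalized number of started jobs per state and hour of day.

    jobs: iterable of (start: datetime, state: str).
    Each hourly count is divided by the mean number of started jobs
    per hour, computed over all states (total / 24).
    """
    per_state = defaultdict(lambda: [0] * 24)
    total = 0
    for start, state in jobs:
        per_state[state][start.hour] += 1
        total += 1
    if total == 0:
        return {}
    mean_per_hour = total / 24
    return {
        state: [count / mean_per_hour for count in counts]
        for state, counts in per_state.items()
    }


# Toy usage: 36 completed jobs at 10:00 and 12 failed jobs at 15:00.
toy_jobs = [(datetime(2017, 1, 1, 10, 0), "COMPLETED")] * 36
toy_jobs += [(datetime(2017, 1, 1, 15, 30), "FAILED")] * 12
profile = hourly_profile(toy_jobs)
```

A value above 1 for a given hour means that more jobs of that state start in this hour than the overall hourly mean.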
III-D Distribution of a job over the data center
The job scheduler is optimized to use nodes that are close to each other in order to reduce latency in data transfer. Topology-aware resource allocation is applied in Slurm as well as in other schedulers. An interesting aspect to explore is the distribution of jobs over racks, through which we can examine the dependency between the number of network hops and failed jobs. The number of hops represents the complexity of the network topology for a particular job and increases with the number of racks used, since a switch is mounted in each chassis. For this analysis, we choose the subset of steps allocated on more than one node with a duration of more than […]. On average, completed steps are allocated on 1.1 racks, cancelled on 2.3, and failed on 1.8. Completed steps are distinguished not only by the lowest number of racks used but also by the lowest number of allocated nodes, as seen in Table III.
The mean number of racks used by multi-node steps is 1.92. This distribution is presented in Figure 12, which also shows the probability of a failure as a function of the number of racks used by a step; the maximum is at seven racks. For more than 13 racks, which can mean using more than 1000 computing nodes, failures are rare. This phenomenon can be explained by user behavior rather than by hardware dependencies. Most HPC jobs are designed to run on a specific number of nodes. This contrasts with Big Data business software, where horizontal scaling on demand is one of the most important application requirements. The code of huge HPC jobs therefore seems to be better tested and more reliable for a fixed number of nodes.
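The number of racks used by a step and the empirical failure probability per rack count could be computed as sketched below. The node-to-rack mapping is assumed to come from an inventory table; its layout, the node and rack identifiers, and the function names are hypothetical. The duration filter is left out because its threshold is not given here.

```python
from collections import defaultdict


def racks_used(nodes, node_to_rack):
    """Number of distinct racks hosting the given nodes."""
    return len({node_to_rack[node] for node in nodes})


def failure_probability_by_racks(steps, node_to_rack):
    """Share of FAILED steps for each number of racks used.

    steps: iterable of (nodes: list of node names, state: str).
    Only multi-node steps are considered, as in the text.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for nodes, state in steps:
        if len(nodes) < 2:
            continue
        n_racks = racks_used(nodes, node_to_rack)
        totals[n_racks] += 1
        if state == "FAILED":
            failed[n_racks] += 1
    return {k: failed[k] / totals[k] for k in sorted(totals)}


# Toy usage with a made-up inventory and made-up steps.
inventory = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
toy_steps = [
    (["n1", "n2"], "COMPLETED"),
    (["n1", "n3"], "FAILED"),
    (["n2", "n3"], "COMPLETED"),
    (["n1"], "FAILED"),  # single-node step, ignored
    (["n1", "n3", "n4"], "FAILED"),
]
probabilities = failure_probability_by_racks(toy_steps, inventory)
```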
In Figure 13, we analyze step duration as a function of the number of racks used by a step. Notably, failed steps are statistically shorter than completed ones when fewer than approximately ten racks are used. In this case, failures probably occur in an early phase of the executed code. However, when more than 12 racks are used, the duration of failed steps increases significantly, while for completed ones it stays at the same level. Figure 14 shows the distribution of the number of allocated nodes versus the number of racks. This relation is linear, although in the range between 10 and 20 racks the median number of allocated nodes does not increase. Completed steps using fewer than 100 nodes are often placed in fewer than ten racks. The opposite holds for failed or cancelled steps: they are sparser and, even when only a few nodes are allocated, often use more racks.
III-E Node-power analysis
Table IV presents the average blade power and the average last registered power for the different job submission states. Table V presents power statistics for steps longer than […], grouped by hardware profile. The table shows average values of power metrics in the last […]. For all hardware groups, this value is lower for failed steps than for completed ones. Most probably, this is explained by the fact that once a software failure occurs, some of the nodes go to an idle state.
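Averaging a power metric over the final part of a step could be done as in the sketch below. The length of the final window is kept as a parameter, `tail_s`, because no value is assumed here; the sample layout `(timestamp_s, power_w)` and the function name are hypothetical, and the usage values are made up.

```python
def mean_tail_power(samples, end_time_s, tail_s):
    """Mean power over the last tail_s seconds before end_time_s.

    samples: iterable of (timestamp_s, power_w) for one node.
    Returns None when no sample falls into the window.
    """
    window_start = end_time_s - tail_s
    tail = [
        power_w
        for timestamp_s, power_w in samples
        if window_start <= timestamp_s <= end_time_s
    ]
    if not tail:
        return None
    return sum(tail) / len(tail)


# Toy usage: power drops shortly before the step ends at t = 100 s.
toy_samples = [(0, 300.0), (50, 310.0), (90, 120.0), (100, 100.0)]
tail_mean = mean_tail_power(toy_samples, end_time_s=100, tail_s=20)
```

A drop in this tail average relative to the step-wide mean would be in line with the idle-node explanation given above.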
Additional analysis. During this research we analyzed other issues that are not presented in the previous sections but are worth noting. First, we evaluated heat exchange between blades to check whether there is any correlation between the temperatures of blades placed in the same chassis. No relationship was found, probably because of the high-performance cooling infrastructure. Another issue considered is the priority of a job submission in relation to its final state. No obvious correlation is observed, although an anomaly is detected in the distribution of the priority level for the timeout state: compared to other states, the normalized frequency of high-priority submissions is significantly higher for timeouts.
This research is supported by the BigStorage project (ref. 642963), funded by Marie Skłodowska-Curie ITN for Early Stage Researchers, and is part of a doctorate at UPC.
[1] M. Zasadziński, V. Muntés-Mulero, and M. Sóle, “Actor based root cause analysis in a distributed environment,” in Proc. of the 3rd Int. Workshop on Software Eng. for Smart Cyber-Physical Syst. IEEE Press, 2017, pp. 14–17.
[2] M. Zasadziński, V. Muntés-Mulero, M. Sóle, and D. Carrera, “Fast root cause analysis on distributed systems by composing precompiled bayesian networks,” in Proc. World Congr. on Eng. and Comput. Sci., vol. 1, 2016, pp. 464–469.