1 Introduction
Driven by the growing demand for High Performance Computing (HPC) and fast-advancing HPC technology, today's large-scale computing systems are assembled from a large number of computing units and supporting components to form a highly capable and reliable HPC ecosystem. Various studies show that failures are not rare events in such HPC systems because of the numerous interconnected components. Beyond the fact that the growing number of components raises the aggregate failure rate, root causes of failures in supercomputers include radiation-induced effects such as particle strikes from cosmic radiation, circuit-aging effects, and faults due to chip manufacturing defects and design bugs (Hukerikar and Engelmann, 2016). Most of these faults remain undetected during post-silicon validation and eventually manifest themselves during the operation of HPC systems, e.g., in runs of HPC applications and in upgrades or maintenance of system software and devices. As process technology continues to shrink and HPC systems tend to operate at low supply voltage for power efficiency, e.g., near-threshold voltage computing (Kaul et al., 2012), hardware components of supercomputers become more susceptible to all types of faults. Therefore, the Mean Time To Failure (MTTF) of the system is expected to decrease dramatically for forthcoming exascale supercomputers.
Resilience to various types of failures has become a first-class concern in building scalable and cost-efficient HPC systems. In general, detecting and correcting such failures in large-scale computing systems is expensive because resilience techniques incur: (a) software costs, e.g., performance loss of applications due to additional resilience code for calculating checksums/residues and saving checkpoints, and (b) hardware costs, e.g., extra components needed for redundancy, such as ECC memory, and more disk space for checkpoint storage. Generally, resilience requires different extents of redundancy at various system levels in both time and space. Numerous studies have been conducted to improve the efficiency of existing resilience techniques for HPC systems. State-of-the-art solutions include Algorithm-Based Fault Tolerance (ABFT) (Du et al., 2012) and scalable multi-level checkpointing (SCR) (Moody et al., 2010). However, holistic failure analysis and fine-grained failure quantification for large-scale computing systems, covering realistic resilience scenarios in current supercomputers, remain under-investigated. Such analysis helps in understanding the failure patterns of operational supercomputers and thus in devising more effective and efficient fault tolerance solutions.
In this paper, we discuss various types of errors and failures at different architectural levels of today's supercomputers, and quantify them in an integrated failure model that summarizes overall system failure rates hierarchically in different HPC scenarios. The primary contributions of this paper are: (a) a study of the quantitative correlation of failure rates among different components, failure types, and system layers of supercomputers under specific overall system failure rate bounds, (b) a discussion of the quantitative impact of system resilience levels (referred to as the significance index in the later text) on overall system failure rates, and (c) a formalization of the resilience efficiency of failure-bounded HPC systems.
The remainder of the paper is organized as follows: Section 2 introduces background knowledge. Section 3 discusses empirical failure models used in this work, and a holistic quantitative study (a refined failure model included) on failure-bounded supercomputers is presented in Section 4. Section 5 discusses related work, and Section 6 concludes.
2 Background
A supercomputer today is an extremely parallel and complex integration of numerous components, primarily categorized into computing units, network, storage, and supporting devices, e.g., cooling infrastructure, cables, and power supply. Figure 1 overviews the hardware architecture of contemporary supercomputers hierarchically (taking the supercomputer Trinity [22] at Los Alamos National Laboratory as an example, which ranks in the latest TOP500 list [21]). From top to bottom, a supercomputer comprises a number of cabinets (or racks); each cabinet comprises a number of chassis (or blades); and each chassis comprises a number of compute nodes (without loss of generality, we ignore the very small number of head nodes that mostly do management work). Finally, each compute node consists of hardware components including processors, storage, network, SRAM (on-chip), and DRAM (off-chip). According to the TOP500 list, top-ranked supercomputers to date have hundreds of cabinets, thousands of chassis, and hundreds of thousands of compute nodes overall. In the figure, we illustrate component details for only one node (interconnects and other devices between nodes, chassis, and cabinets are omitted due to space limitations), and assume that all nodes and counterpart components (e.g., all cables) in the system are homogeneous (and thus equally susceptible to failures) to simplify our discussion.
Note that in Figure 1, we use simplified terms to describe the node configuration. Specifically, processors can be CPUs and/or accelerators such as GPUs and co-processors, which include functional units and control units. SRAM (on-chip) refers to registers and caches, and DRAM (off-chip) refers to main memory. Storage covers any type of hard disk drive, solid-state drive, non-volatile memory, and cloud-based storage unit. Network can be a high-speed interconnect such as InfiniBand.
3 Failure Model
Based on the system architecture shown in Figure 1, we assign a failure rate to each level of the system hierarchy: the system as a whole, a cabinet, a chassis, and a node. Equation (1) relates these rates based on probability theory. It essentially shows that failures are distributed over all available nodes in the system. We assume that there are no idle nodes at any level of the hierarchy when we consider failures, and thus all nodes are probabilistically equivalent for all types of errors.
For a compute node in a supercomputer as described above, there are two types of induced faults by nature: soft errors and hard errors. The former are transient (e.g., memory bit-flips and logic circuit miscalculations), while the latter are usually permanent (e.g., node crashes from dysfunctional hardware and system aborts from power outages). Each type has its own failure rate. In Equation (2), we formulate the nodal failure rate as the integration of the soft error rate and the hard error rate, written with a dedicated integration operator rather than plain mathematical addition, given the different nature of soft errors and hard errors.
Each of the two rates in Equation (2) is weighted by a parameter referred to as the significance index (SI) of failure rates. For HPC systems equipped with different hardware and software resilience techniques, the SI values of the soft and hard error rates vary. In general, SI represents the resilience of a given system to failures, and it is negatively correlated with the failure coverage of the resilience techniques employed in the system: the more resilient the system is, the more failures can be recovered, and the smaller the SI value is. Consequently, the nodal failure rate in Equation (2) changes with the SI values introduced.
Due to the demanding requirements of system-wide power efficiency and resilience set as goals by the US Department of Energy (DOE) for upcoming exascale computers [7], current and future large-scale HPC systems need to be not only power-bounded but also failure-bounded, which means that the overall system failure rate needs to be capped under a threshold value, given a power budget [15]. For simplicity of discussion, we define the integration operator in Equation (2) as an explicit sum of the soft and hard error rates. Therefore, based on Equations (1) and (2), we can reformulate the capped failure rates for soft errors and hard errors under the specified system failure rate cap as below:
According to the definitions of soft errors and hard errors, we assume that, node-wise, processors, on-chip SRAM, and off-chip DRAM are the primary sources of soft errors, while storage and network are the main contributors to hard errors (in practice, the power supply contributes considerably to hard errors as well, which is covered in the refined failure model in Section 4.3, where we assume that power supply faults occur at chassis and cabinet levels). Without loss of generality (more components can be incorporated if needed), we look into the components above within a node, and formulate the nodal soft error rate and the nodal hard error rate more specifically as follows:
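The relations described in this section can be summarized in the sketch below. The notation is an assumption introduced for illustration, not the original typesetting: $N$ is the number of active nodes, $\lambda_{soft}$ and $\lambda_{hard}$ are the nodal soft and hard error rates, $\alpha$ and $\beta$ are their SI values, and $\lambda_{cap}$ is the system failure rate cap. Treating the integration operator as a plain sum and the component contributions as additive is likewise an assumption, chosen because it agrees with the worked numbers in Section 4.

```latex
\[
\begin{aligned}
\lambda_{sys} &= N \cdot \lambda_{node}
  && \text{(failures spread over all nodes)} \\
\lambda_{node} &= \alpha\,\lambda_{soft} \oplus \beta\,\lambda_{hard}
  && \text{(SI-weighted integration)} \\
N\,(\alpha\,\lambda_{soft} + \beta\,\lambda_{hard}) &\le \lambda_{cap}
  && \text{(failure-bounded system, with } \oplus \text{ taken as } + \text{)} \\
\lambda_{soft} &= \lambda_{proc} + \lambda_{SRAM} + \lambda_{DRAM}
  && \text{(assumed soft error sources)} \\
\lambda_{hard} &= \lambda_{storage} + \lambda_{network}
  && \text{(assumed hard error sources)}
\end{aligned}
\]
```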
4 Failure-bounded Quantitative Study
In this section, we conduct an exploratory quantitative discussion of several common scenarios in state-of-the-art HPC systems. With the failure models established above, our goals are to: (a) infer unknown failure rate caps of some components, given acquired failure data of other system components, and (b) estimate system-wide and component-wise failure rate ranges under known failure rate caps.
4.1 Capping Failures by Types
Per the mechanism of detection and correction, soft errors can be categorized as Detected and Corrected Errors (DCE), Detected but Uncorrectable Errors (DUE), and Silent Errors (SE) (Snir et al., 2014). Any unmasked SE is referred to as Silent Data Corruption (SDC), i.e., incorrect program output. DCE generally occur in ECC-protected SRAM/DRAM, and examples of DUE include crashes and hangs of program execution. We assume that each compute node has a statistically equivalent chance of incurring soft errors and/or hard errors. Moreover, assume an HPC system where, on average, 80% of the soft errors occurring in a single node are DCE (masked by ECC memory), 5% are DUE, and 15% are SDC, and where there is no fault tolerance support in the software stack such as Algorithm-Based Fault Tolerance (ABFT). This means that 20% of the total incurred soft errors circumvent the resilience techniques employed, i.e., the soft-error SI is 0.2. Likewise, we assume that appropriate hardware-based resilience techniques are employed in the hardware stack and that 60% of hard errors can be successfully masked, i.e., the hard-error SI is 0.4.
Substituting the soft-error SI of 0.2 and the hard-error SI of 0.4 into Equation (3) yields:
Given the assumed SI values, and given that the total number of active compute nodes system-wide is a constant, three variables remain in Equation (6): the system failure rate cap, the nodal soft error rate, and the nodal hard error rate. We can easily solve for any one of them given the other two.
The failure rate can be expressed in terms of either Mean Time To Failure (MTTF) (Daly, 2006) or Failure In Time (FIT) (Asadi and Tahoori, 2005). One FIT is defined as one failure per billion hours, so FIT is inversely proportional to MTTF. Here we adopt FIT as the calculation unit because, unlike MTTF, it is additive. Existing studies demonstrate that for current HPC architectures, SRAM failure rates range from 10 FIT to 100 FIT (Dong and Li, 2011), and DRAM failure rates are on the order of 100 FIT (Bacha and Teodorescu, 2013). Therefore, without loss of generality, assume a supercomputer of 100,000 nodes with a nodal soft error rate of 200 FIT. Meanwhile, as a premise, the system failure rate cannot exceed 5,000,000 FIT as required for system-level resilience. With these parameters known, we can solve for the nodal hard error rate as below:
which indicates that, in order to keep the system failure rate no greater than 5,000,000 FIT, given a nodal soft error rate of 200 FIT and the SI values above, the threshold value of the nodal hard error rate is 25 FIT.
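The calculation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the capped model has the form nodes × (SI_soft × soft rate + SI_hard × hard rate) ≤ cap, which reproduces the 25 FIT threshold stated in the text; the function and parameter names are hypothetical.

```python
HOURS_PER_BILLION = 1e9  # one FIT is one failure per billion hours


def fit_to_mttf_hours(fit):
    """Convert a failure rate in FIT to an MTTF in hours."""
    return HOURS_PER_BILLION / fit


def hard_error_cap(n_nodes, si_soft, si_hard, soft_fit, system_cap_fit):
    """Largest nodal hard error rate (FIT) that keeps the system under its cap.

    Assumed model (not taken verbatim from the paper):
        n_nodes * (si_soft * soft_fit + si_hard * hard_fit) <= system_cap_fit
    """
    per_node_budget = system_cap_fit / n_nodes
    return (per_node_budget - si_soft * soft_fit) / si_hard


# Parameters of Section 4.1: 100,000 nodes, SI values 0.2 and 0.4,
# nodal soft error rate 200 FIT, system cap 5,000,000 FIT.
cap = hard_error_cap(100_000, 0.2, 0.4, 200.0, 5_000_000.0)
print(f"nodal hard error cap: {cap:.2f} FIT")
print(f"MTTF at the system cap: {fit_to_mttf_hours(5_000_000.0):.1f} hours")
```

With the parameters of Section 4.1, the cap evaluates to about 25 FIT, and a system failure rate of 5,000,000 FIT corresponds to an MTTF of about 200 hours.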
Scenario 1: An HPC System with Higher Resilience to Soft Errors
Figure 2 depicts the system failure rate as the nodal soft and hard error rates change, given the hypothesized SI values of 0.2 for soft errors and 0.4 for hard errors in Equation (3). This scenario represents HPC systems that have higher resilience to soft errors than to hard errors. We can see that although the system failure rate is linear in both the soft error rate and the hard error rate, the higher resilience to soft errors makes it more sensitive to variation in the hard error rate than to variation in the soft error rate. Figure 2 also shows that this trend holds for all soft and hard error rate values.
Scenario 2: An HPC System with Higher Resilience to Hard Errors
Figure 3 plots the system failure rate for another configuration of SI values, 0.5 for soft errors and 0.25 for hard errors in Equation (3), which reflects an HPC system with higher resilience to hard errors than to soft errors. Likewise, due to the higher tolerance to hard errors, the curve shows that the system failure rate tends to be affected more by the soft error rate than by the hard error rate, i.e., for the same amount of change in the two rates, the system failure rate varies more with the change in the soft error rate, as shown in Figure 3.
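The contrast between Scenario 1 and Scenario 2 follows directly from the linear form of the model: the slope of the system failure rate with respect to each nodal rate is the node count times the corresponding SI. The sketch below illustrates this under the assumed additive model; the function names, the 100 FIT baseline, and the 10 FIT step are hypothetical values chosen only for illustration.

```python
def system_failure_rate(n_nodes, si_soft, si_hard, soft_fit, hard_fit):
    """System failure rate (FIT) under the assumed additive model."""
    return n_nodes * (si_soft * soft_fit + si_hard * hard_fit)


def dominant_error_type(si_soft, si_hard):
    """Error type whose rate change moves the system failure rate more.

    In a linear model the slope for each rate is n_nodes * SI,
    so the error type with the larger SI dominates.
    """
    if si_soft == si_hard:
        return "equal"
    return "soft" if si_soft > si_hard else "hard"


N_NODES = 100_000
BASE_FIT = 100.0  # illustrative baseline for both nodal rates
STEP_FIT = 10.0   # illustrative change applied to one rate at a time

scenarios = {"Scenario 1": (0.2, 0.4), "Scenario 2": (0.5, 0.25)}
for label, (si_soft, si_hard) in scenarios.items():
    base = system_failure_rate(N_NODES, si_soft, si_hard, BASE_FIT, BASE_FIT)
    soft_effect = system_failure_rate(
        N_NODES, si_soft, si_hard, BASE_FIT + STEP_FIT, BASE_FIT) - base
    hard_effect = system_failure_rate(
        N_NODES, si_soft, si_hard, BASE_FIT, BASE_FIT + STEP_FIT) - base
    print(f"{label}: +{STEP_FIT:.0f} FIT soft -> +{soft_effect:,.0f} FIT, "
          f"+{STEP_FIT:.0f} FIT hard -> +{hard_effect:,.0f} FIT, "
          f"dominant: {dominant_error_type(si_soft, si_hard)}")
```

Because the model is linear, the dominant error type does not depend on the chosen baseline, which matches the observation that the trend holds for all rate values.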
4.2 Capping Failures by Components
Instead of capping failure rates of soft/hard errors at system level, HPC systems today also have resilience requirements for specific components. Given the system-wide failure cap and failure data acquired for some components, we can obtain the capped failure rates for the components of interest.
Likewise, the total number of nodes is a constant. We employ the same hypothesized SI values as in Scenario 1, 0.2 for soft errors and 0.4 for hard errors, and the same premise of an HPC system of 100,000 nodes with a system failure rate cap of 5,000,000 FIT. In addition, we assume that failure data of the processor, SRAM, and network have been acquired from historical system logs as 90 FIT, 70 FIT, and 20 FIT respectively. Substituting all known parameters into Equation (8), we have:
which indicates that, in order to preserve the assumed failure rates, the resulting quantitative relationship between the DRAM failure rate and the storage failure rate must be satisfied.
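The component-level constraint can be worked out numerically as sketched below. The sketch assumes that the nodal soft error rate is the sum of the processor, SRAM, and DRAM rates and that the nodal hard error rate is the sum of the storage and network rates, as described in Section 3; the function names are hypothetical.

```python
def residual_budget(n_nodes, system_cap_fit, si_soft, si_hard,
                    known_soft_fit, known_hard_fit):
    """SI-weighted per-node budget (FIT) left for the unknown components.

    Assumed model: the per-node cap is system_cap_fit / n_nodes, and each
    known component consumes its rate times the SI of its error type.
    """
    per_node_budget = system_cap_fit / n_nodes
    used = si_soft * sum(known_soft_fit) + si_hard * sum(known_hard_fit)
    return per_node_budget - used


def max_storage_fit(budget, si_soft, si_hard, dram_fit):
    """Largest storage rate (FIT) allowed for a given DRAM rate."""
    return (budget - si_soft * dram_fit) / si_hard


SI_SOFT, SI_HARD = 0.2, 0.4
budget = residual_budget(
    n_nodes=100_000,
    system_cap_fit=5_000_000.0,
    si_soft=SI_SOFT,
    si_hard=SI_HARD,
    known_soft_fit=[90.0, 70.0],  # processor, SRAM
    known_hard_fit=[20.0],        # network
)
print(f"remaining weighted budget: {budget:.2f} FIT")
for dram in (0.0, 30.0, 50.0):
    limit = max_storage_fit(budget, SI_SOFT, SI_HARD, dram)
    print(f"DRAM = {dram:.0f} FIT -> storage <= {limit:.2f} FIT")
```

Under these assumptions the constraint reduces to 0.2 × DRAM + 0.4 × storage ≤ 10 FIT, i.e., DRAM + 2 × storage ≤ 50 FIT, so a higher DRAM failure rate leaves a proportionally smaller allowance for storage.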
4.3 Refining Failure Model from System Hierarchy
Although an HPC system comprises compute nodes, failures may happen not only at local nodes but also at interconnects between nodes, and at the power supply and other devices at chassis or cabinet level. When such failures occur at higher levels rather than at a single node, all nodes at the related levels are affected. For example, if the power supply at cabinet level fails, all nodes within the affected cabinets will be down. Considering the occurrence of failures hierarchically at different system layers, we refine the failure models as follows:
Note that Equation (10) involves the failure rates of non-node devices/components at chassis and cabinet levels, the SI values of the node, chassis, and cabinet failure rates, and constants for the total numbers of nodes, chassis, and cabinets in the system overall. From the previous models, we have:
Recent studies on DOE supercomputers indicate that failures of system-wide components play a significant role in the resilience of the system. From an analysis of one year of system logs of the supercomputer Mira at the Argonne Leadership Computing Facility of Argonne National Laboratory, the frequency of fatal events by component and category has been clearly identified. According to the statistics from this study, although soft errors (mostly memory errors) at node level are the most frequent failure type, failures of system-wide components amount to at least 39.47% of all observed failures, as listed in Table 1 [12]. We can group all off-node failures into chassis-level and cabinet-level failure rates according to the specific location of the failures. For simplicity of discussion, failures occurring between chassis and between cabinets are counted in the chassis-level and cabinet-level failure rates respectively.
| Component (system-wide) | Failure Rate (over one year) | Failure Location (on/off node) |
|---|---|---|
| compute unit | 53.95% | on node |
| link module | 6.58% | off node |
| coolant monitor | 4.61% | off node |
In order to study the relationship among the failure rates at node, chassis, and cabinet level, under a predefined system failure rate cap and with resilience techniques employed, we consider the following scenario:
Scenario 3: An HPC System with Resilience and Capped System Failure Rates
Figure 4 shows the node, chassis, and cabinet failure rate curve for another HPC system scenario, where we assume that node, chassis, and cabinet failures are each tolerated to some extent by the resilience techniques employed, so that the SI values of the node, chassis, and cabinet failure rates are 0.2, 0.6, and 0.5 respectively. We adopt the same system architectural configuration as in the previous examples: 100,000 nodes (100 nodes per chassis, 10 chassis per cabinet, and 100 cabinets in the system), with a system failure rate cap of 5,000,000 FIT. Therefore, Equation (10) is instantiated below:
Specifically, Figure 4 illustrates the instantiated form of Equation (10). We can see that as the chassis and cabinet failure rates change, the variation of the node failure rate is comparatively small: the chassis and cabinet failure rates both range from 0 to 500 FIT, while the node failure rate ranges only from about 230 to 250 FIT. This is because there are many more nodes than chassis and cabinets in the system overall. Statistically, however, the failure rate of a single node is smaller than that of a single chassis or a single cabinet. In general, with a capped system failure rate, an increase in the failure rate at any hierarchy level (node, chassis, or cabinet) leads to a decrease in the failure rates allowed at the other two levels. We can also see that the variation of the chassis failure rate has a greater impact on the variation of the node failure rate than the variation of the cabinet failure rate does.
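The ranges discussed for Scenario 3 can be checked with the sketch below. It assumes that the refined model sums, over the three hierarchy levels, the component count times the SI times the per-component failure rate, and that the system runs exactly at its cap; this form is an assumption that reproduces the 230 to 250 FIT range in the text, and all names are hypothetical.

```python
N_CABINETS = 100
CHASSIS_PER_CABINET = 10
NODES_PER_CHASSIS = 100

N_CHASSIS = N_CABINETS * CHASSIS_PER_CABINET  # 1,000 chassis
N_NODES = N_CHASSIS * NODES_PER_CHASSIS       # 100,000 nodes

SI_NODE, SI_CHASSIS, SI_CABINET = 0.2, 0.6, 0.5
SYSTEM_CAP_FIT = 5_000_000.0


def node_rate_at_cap(chassis_fit, cabinet_fit):
    """Node failure rate (FIT) that exactly meets the system cap.

    Assumed model:
        N_NODES * SI_NODE * node_fit
      + N_CHASSIS * SI_CHASSIS * chassis_fit
      + N_CABINETS * SI_CABINET * cabinet_fit = SYSTEM_CAP_FIT
    """
    off_node = (N_CHASSIS * SI_CHASSIS * chassis_fit
                + N_CABINETS * SI_CABINET * cabinet_fit)
    return (SYSTEM_CAP_FIT - off_node) / (N_NODES * SI_NODE)


# Corners of the 0..500 FIT range used for chassis and cabinet rates.
for chassis_fit, cabinet_fit in [(0, 0), (500, 0), (0, 500), (500, 500)]:
    node_fit = node_rate_at_cap(chassis_fit, cabinet_fit)
    print(f"chassis={chassis_fit:3d} FIT, cabinet={cabinet_fit:3d} FIT "
          f"-> node={node_fit:.2f} FIT")
```

Under these assumptions, raising the chassis failure rate from 0 to 500 FIT lowers the allowed node failure rate by about 15 FIT, while the same change in the cabinet failure rate lowers it by about 1.25 FIT, because there are ten times as many chassis as cabinets.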
4.4 Failure-bounded HPC System Time Usage and Resilience Efficiency
Regarding the impact of resilience on HPC systems, a breakdown of system time usage by functionality (e.g., system idle, operation, computation, or I/O) is highly beneficial because it makes fine-grained efficiency analysis feasible. Figure 5 overviews the general time usage of typical HPC systems today (Stearley, 2005). We can clearly see that the time used for resilience purposes is one part of the system run time, while the other part is the solve time, which is in general application-specific. Without loss of generality, we assume that the time components highlighted in Figure 5 account for the majority of the total system time. Furthermore, the resilience efficiency of an HPC system can be formalized as follows:
Note that in practice, the resilience time varies depending on whether failures occur during HPC runs. If there are no failures, no extra cost is incurred for recovering from failures, which makes the resilience time smaller. Specifically, let the system employ Checkpoint/Restart (C/R) as the resilience technique. If no hard errors occur while applications are running, the system does not need to restart from the last saved checkpoint, so less time is spent on resilience, i.e., the resilience time is smaller while the solve time is unchanged. For example, assume an HPC system that has been in operation for 10,000 hours, of which applications run for 8,000 hours in the failure-free case and for 8,400 hours in the case with failures. Without resilience techniques employed, the total application run time is 6,600 hours (which can also be estimated from the algorithmic complexity of the applications and the computation capability of the system). Using Equation (12), we can easily obtain the resilience efficiency of both scenarios below:
As shown, the two scenarios yield different values, because the presence of failures requires additional resilience time for recovery to ensure correct HPC runs. For different resilience techniques, the difference between the two values may vary because of the different nature of recovering from failures.
It is well established that supercomputers today (up to petascale) are exposed to high failure rates due to various root causes, with MTTF ranging from 50 minutes to 230 minutes (Sarkar, Ed., 2009). Forthcoming exascale supercomputers are expected to suffer from increased failure rates due to a greater number of components, with predicted MTTF ranging from 22 minutes to 120 minutes [7]. With the expected failure rates, we can estimate the resilience efficiency of future exascale supercomputers using our models. Assume an exascale system that has been in operation for 10,000 hours, in which one failure occurs every 120 minutes, with 40% hard errors and 60% soft errors. The employed resilience techniques successfully capture every failure and take 0.7 hours and 0.2 hours to detect and recover from a hard error and a soft error respectively. Using Equation (12), we calculate the resilience efficiency below:
From the calculation shown above, we can see that in order to obtain higher resilience efficiency for failure-bounded HPC systems in this era, we need to develop more cost-effective resilience techniques, or increase the MTTF of future supercomputers.
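The exascale example can be worked through numerically as sketched below. The failure counts and recovery hours follow directly from the stated parameters; the share of operation time spent on recovery is an illustrative ratio computed here for convenience and is not necessarily the form of Equation (12). The function and field names are hypothetical.

```python
def recovery_breakdown(operation_hours, hours_between_failures,
                       hard_fraction, hard_cost_hours, soft_cost_hours):
    """Failure counts and time spent detecting and recovering from failures."""
    failures = operation_hours / hours_between_failures
    hard_failures = failures * hard_fraction
    soft_failures = failures - hard_failures
    recovery_hours = (hard_failures * hard_cost_hours
                      + soft_failures * soft_cost_hours)
    return {
        "failures": failures,
        "hard_failures": hard_failures,
        "soft_failures": soft_failures,
        "recovery_hours": recovery_hours,
        # Illustrative ratio only; an assumption, not the paper's Equation (12).
        "recovery_share": recovery_hours / operation_hours,
    }


# Parameters from the text: 10,000 hours of operation, one failure every
# 120 minutes, 40% hard errors, 0.7 h per hard error, 0.2 h per soft error.
result = recovery_breakdown(
    operation_hours=10_000.0,
    hours_between_failures=120.0 / 60.0,
    hard_fraction=0.4,
    hard_cost_hours=0.7,
    soft_cost_hours=0.2,
)
for key, value in result.items():
    print(f"{key}: {value:,.2f}")
```

With these parameters, roughly 5,000 failures occur (about 2,000 hard and 3,000 soft), and detecting and recovering from them takes about 2,000 hours, or around one fifth of the operation time. Shortening the per-failure recovery cost or lengthening the time between failures both reduce this share, which is the trade-off the text points to.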
5 Related Work
Modeling methods have been extensively used for large-scale computing systems for the purposes of failure prediction (Gainaru et al., 2012a; Gainaru et al., 2012b), trade-off optimization (Rafiev et al., 2014; Tan et al., 2016), and vulnerability reduction (Casas et al., 2012; Tan et al., 2017). Gainaru et al. (2012a) proposed characterizing the normal and faulty behavior of HPC systems by using signal analysis to model the flow of each state event during the HPC system lifetime. The extracted models accurately reflected system outputs and improved the effectiveness of fault prediction. The subsequent work (Gainaru et al., 2012b) leveraged data mining techniques to offer an adaptive failure prediction module for accurate fault prediction, and was evaluated on two large-scale systems for its impact on prediction precision and recall. Instead of focusing on analyzing system state data (referred to as system events in Gainaru et al. (2012a) and Gainaru et al. (2012b)), our work investigates failure rate correlation at different system hierarchical levels and system component levels. Rafiev et al. (2014) studied the interplay between critical dimensions in HPC, i.e., performance, energy, and reliability, using a modeling framework based on a resource-driven graph representation. The layer-agnostic models applied efficiently to large-scale systems and diverse types of concurrency. Tan et al. (2016) quantitatively modeled integrated energy efficiency in terms of performance per Watt and showcased the trade-offs among typical HPC parameters by extending Amdahl's Law and the Karp-Flatt Metric. The proposed models were evaluated to help find the optimal HPC configuration for the highest integrated energy efficiency with resilience. In contrast, this work focuses on the resilience of HPC systems only, and our failure model is based on probability theory. Casas et al. (2012) presented an approach that analyzes the vulnerability of sparse scientific applications to hardware faults at large scale, and reduced their vulnerability by protecting the most vulnerable components and by failure prediction. Leveraging register vulnerability, Tan et al. (2017) investigated the validity of failure rates in HPC systems at near-threshold voltage, and empirically evaluated the power saving opportunities without incurring an observable number of soft errors during HPC runs. Our work differs from these studies in that the proposed model aims at a better understanding of the failure patterns of operational supercomputer architectures today, and thus at devising more feasible resilience solutions accordingly.
6 Conclusions
Due to the expansion of HPC systems in size and duration of use, it is critical to maintain the resilience of supercomputers today. For resilience purposes, it is beneficial to quantify failures in existing failure-bounded HPC systems in a fine-grained fashion. In this paper, we conduct an exploratory quantitative study on holistic failure modeling for contemporary large-scale computing systems, which also sheds light on potential failures of forthcoming supercomputers in the exascale era and helps devise more feasible resilience solutions at scale. Specifically, we integrate different failures from the perspective of system hierarchy, and summarize the overall system failure rate formally. We also discuss various scenarios of HPC system resilience categorized by error types, system components, and hierarchical levels, and formalize the significance index of failure rates and the resilience efficiency of supercomputers today under a system failure rate cap.
References
- Asadi and Tahoori (2005). Soft error rate estimation and mitigation for SRAM-based FPGAs. In Proc. FPGA, pp. 149–160.
- Bacha and Teodorescu (2013). Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors. In Proc. ISCA, pp. 297–307.
- Casas et al. (2012). Fault resilience of the algebraic multi-grid solver. In Proc. ICS, pp. 91–100.
- Daly (2006). A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22 (3), pp. 303–312.
- Dong and Li (2011). Efficient SRAM failure rate prediction via Gibbs sampling. In Proc. DAC, pp. 200–205.
- Du et al. (2012). Algorithm-based fault tolerance for dense matrix factorizations. In Proc. PPoPP, pp. 225–234.
- [7] Exascale computing initiative update 2012, US Department of Energy. meetings/aug12/2012-ECI-ASCAC-v4.pdf.
- Gainaru et al. (2012a). Taming of the shrew: modeling the normal and faulty behaviour of large-scale HPC systems. In Proc. IPDPS, pp. 1168–1179.
- Gainaru et al. (2012b). Fault prediction under the microscope: a closer look into HPC systems. In Proc. SC, pp. 77.
- Hukerikar and Engelmann (2016). Resilience design patterns: a structured approach to resilience at extreme scale. Technical Report ORNL/TM-2016/767, Oak Ridge National Laboratory.
- Kaul et al. (2012). Near-threshold voltage (NTV) design – opportunities and challenges. In Proc. DAC, pp. 1153–1158.
- [12] Mira – Argonne Leadership Computing Facility. https://www.alcf.anl.gov/mira.
- Moody et al. (2010). Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proc. SC, pp. 1–11.
- Rafiev et al. (2014). Studying the interplay of concurrency, performance, energy and reliability with ArchOn – an architecture-open resource-driven cross-layer modelling framework. In Proc. ACSD, pp. 122–131.
- [15] Renewable energy and energy efficiency for tribal community and project development, US Department of Energy. http://apps1.eere.energy.gov/tribalenergy/pdfs/energy04_terms.pdf.
- Sarkar, Ed. (2009). Exascale software study: software challenges in extreme scale systems. Technical report, US DARPA IPTO, Air Force Research Labs.
- Snir et al. (2014). Addressing failures in exascale computing. International Journal of High Performance Computing Applications 28 (2), pp. 129–173.
- Stearley (2005). Defining and measuring supercomputer reliability, availability, and serviceability (RAS). In Proc. the Linux Clusters Institute Conference.
- Tan et al. (2016). Scalable energy efficiency with resilience for high performance computing systems: a quantitative methodology. ACM Trans. Architecture and Code Optimization 12 (4), pp. 35.
- Tan et al. (2017). RSVP: soft error resilient power savings at near-threshold voltage using register vulnerability. In Proc. DSN-W, pp. 91–98.
- [21] TOP500 supercomputer lists. http://www.top500.org/.
- [22] Trinity: advanced technology system. http://www.lanl.gov/projects/trinity/.