A transistor on a chip may suffer a transient fault for various reasons, including environmental interference, power supply noise, and high-energy particles. Given the ever-increasing number of transistors in commercial chip multiprocessors (CMPs), CMPs are increasingly vulnerable to transient faults. Although a transient fault occurs only once and does not recur in the subsequent execution of a CMP, the error resulting from it may propagate to other parts of the CMP, causing incorrect instruction results or even a system crash.
A traditional methodology for detecting transient faults is to use a redundant core to assist each checked core separately [9, 18, 29, 14, 25]. A redundant core runs the same program with the same inputs as the corresponding checked core, so the two cores should produce the same instruction results. Once an instruction on the redundant core produces a result different from that of the corresponding instruction on the checked core, a transient fault is detected. In such a case, both cores are rolled back to a previous checkpoint using an existing checkpointing mechanism [20, 30], and re-execute the program to recover from the fault.
Figure 1: Coverage problems of previous schemes. (a) Previous schemes, where a fault in the uncore parts of the chip (last-level cache (LLC), network on chip (NoC), memory controller, and so on) may escape detection. (b) Our scheme, which can successfully detect all the faults.
Although previous core-level schemes (providing redundancy for each core separately) can effectively detect transient faults that occur in cores, when coping with parallel programs they may miss transient faults that occur in the uncore parts of the CMP, including the last-level cache (LLC), network on chip (NoC), memory controller, and so on. An illustrative example can be found in Figure 1.(a), where core is the redundancy of core , and core is the redundancy of core . Assume that core has correctly stored a value in its local L1 cache. When the correct value stored by core is evicted from core to the LLC or memory, it may be corrupted by a transient fault in the uncore parts, which makes all subsequent loads of the value faulty. Once core and core load this faulty value from the uncore, they will produce the same incorrect results. Obviously, such a transient fault in the uncore parts can be detected through neither the result comparison between and nor the result comparison between and . Considering that the uncore parts may consume 50-70% of the area of a state-of-the-art commercial CMP [33, 24], there is an urgent need for a scheme that can comprehensively detect transient faults on the whole chip (footnote 1).
Footnote 1: Though one can use error correcting codes (ECC) to improve the reliability of the uncore parts, doing so remarkably increases the area of the uncore parts (both cache and NoC). Furthermore, ECC cannot cope with many common misbehaviors in the uncore parts (e.g., lost transfers, mistransfers, misordered transfers, incorrect prefetching, and so on).
1.2 Our Idea
Previous core-level schemes (which provide redundancy for each core separately) fail to detect all transient faults because the checked core and the corresponding redundant core can load data from the same source. As a result, a transient fault that affects the common data source of both cores will escape detection. Hence, to achieve 100% coverage for transient fault detection, our idea is to provide redundancy for a group of cores as a whole. As shown in Figure 1.(b), core and core are treated as the checked group of cores, while core and core belong to the redundant group of cores. There is no data dependency between the groups: core and core do not rely on any datum produced by the redundant group (core and core ), while core and core do not rely on any datum produced by the checked group (core and core ). When a transient fault occurs, it cannot affect both groups of cores simultaneously. For example, if a transient fault happens in the data transfer from core to core , core will not be affected by this fault. Hence, by comparing the instruction results of core and core , we are able to detect the fault.
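The coverage argument above can be illustrated with a minimal Python sketch. It is a toy model, not the paper's mechanism: the function names, the `consumer` workload, and the single bit flip standing in for an uncore fault are all assumptions made for illustration.

```python
def consumer(value):
    """Toy computation a core performs on a loaded value (illustrative only)."""
    return value * 2 + 1


def core_level_check(stored_value, uncore_fault):
    """Checked core and its redundant core load from the SAME uncore copy."""
    # Assumed fault model: one bit flips while the value sits in the LLC/NoC.
    shared_copy = stored_value ^ 0x4 if uncore_fault else stored_value
    checked_result = consumer(shared_copy)
    redundant_result = consumer(shared_copy)
    return checked_result != redundant_result  # True means "fault detected"


def group_level_check(stored_value, uncore_fault):
    """Each group has its own producer/consumer chain; the fault hits one group."""
    checked_copy = stored_value ^ 0x4 if uncore_fault else stored_value
    redundant_copy = stored_value  # produced independently by the redundant group
    return consumer(checked_copy) != consumer(redundant_copy)
```

In this model the per-core comparison never fires for a corrupted shared value, because both cores consume the same corrupted copy, whereas the group-level comparison does fire because the redundant group never reads data produced by the checked group.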
Based on the above idea, in this paper we propose RepTFD, a transient fault detection scheme based on hardware-assisted deterministic replay (footnote 2). As shown in Figure 2.(a), RepTFD uses one half of the cores in the CMP as the checked group of cores and the other half as the redundant group of cores. The two groups execute the same parallel program without data dependency between them. In the execution on the checked group of cores (first-run), two types of information are recorded. One is the order between the memory instructions in the first-run, which is recorded in a determinism-log file; the other is the instruction results of the first-run, which are recorded in a result-log file. Based on the determinism-log, the first-run can be deterministically replayed on the redundant group of cores. In the meantime, based on the result-log, the correctness of the first-run can be checked against the replay-run on the redundant group of cores. Since any transient fault can affect only one group of cores, by comparing the results of the corresponding instructions on the two groups, RepTFD can detect malignant transient faults with 100% coverage (here we say that a transient fault is malignant if it leads to an incorrect instruction result).
Footnote 2: Deterministic replay aims at guaranteeing that two executions of the same program exhibit the same behavior in the presence of non-deterministic factors. The general workflow of deterministic replay consists of two stages. In the first stage, an execution of the program, called the first-run, is carried out, during which the non-deterministic factors are continuously recorded as logs. In the second stage, a follow-up execution of the program, called the replay-run, reproduces the first-run under the guidance of the recorded logs.
Besides the coverage of transient faults, performance overhead is also a crucial criterion for a transient fault detection scheme. As shown in Figure 2.(b), the execution of a program with fault-tolerance is divided into many segments using an existing checkpoint mechanism. The replay-run of each segment is executed after the first-run of that segment. Obviously, the overall performance overhead of replay-based transient fault detection depends on the slower of the first-run and the replay-run. Since hardware-assisted deterministic replay approaches can easily achieve negligible slowdown (say, ) in the first-run [7, 32, 10], the overall performance overhead of RepTFD is determined by the speed of the replay-run. Unfortunately, most existing hardware-assisted deterministic replay approaches focus on reducing the recorded log size in order to record long executions of parallel programs (footnote 3), and thus often cause remarkable slowdown in the replay-run ( to the best of our knowledge [7, 32, 17, 4]).
Footnote 3: Note that RepTFD only needs to maintain the determinism-log for the last segment of the first-run (about 1 second), so the extremely small log size achieved by previous deterministic replay approaches is of limited importance for RepTFD. In practice, the size of RepTFD’s determinism-log for an 8-thread application is less than 15 MB when the instructions per cycle (IPC) of each core is 1, which is already satisfactory for transient fault detection.
Hence, instead of directly incorporating an existing hardware-assisted deterministic replay approach into RepTFD, we propose a new deterministic replay approach with negligible replay-run slowdown, based on the following observation: the more execution orders (footnote 4) are directly recorded, the slower the replay-run is. For example, if an execution order is recorded in the determinism-log, then in the replay-run, before is executed, we must detect and wait for the completion of , to guarantee that the replay-run is identical to the first-run.
Footnote 4: Execution order is a type of logical time order that exists between two successive conflicting memory instructions. Here we say that two memory instructions conflict if they access the same memory location and at least one of them is a write.
Hence, RepTFD achieves an efficient replay-run by filtering out of the determinism-log the execution orders that can be inferred from pending period information. Concretely, in the first-run, RepTFD records a relaxed start time and a relaxed end time for each instruction block with a global clock, where the period between these two time points is called the pending period of the instruction block. If two instruction blocks have non-overlapping pending periods, they have a physical time order. Since the execution orders between memory instructions in these two blocks can be inferred from the recorded pending period information (and the resultant physical time order), we need not record them directly in the determinism-log. In fact, execution orders are inferrable from physical time orders . As a consequence, the replay-run of RepTFD is seldom paused to enforce the directly recorded execution orders (footnote 5), and thus has only 4.76% slowdown (in comparison to the normal execution without fault-tolerance), according to our experiments over SPLASH2 benchmarks.
Footnote 5: Enforcing the recorded pending periods may also pause the replay-run. However, the pending periods recorded by RepTFD are quite relaxed (i.e., twice the actual execution time of the instruction block), so it is rare to pause the replay-run to enforce the recorded pending periods.
To sum up, this paper makes the following contributions:
Fault coverage: For parallel workloads, we propose the first core-level transient fault detection scheme with 100% coverage, i.e., any malignant transient fault that causes an error in the result of an instruction can be detected.
Performance overhead: By reducing the number of execution orders to be enforced in the replay-run, RepTFD incurs only 4.76% performance overhead in comparison to the normal execution without fault-tolerance on a 16-core CMP (8 checked cores and 8 redundant cores). In addition, RepTFD has the smallest replay slowdown among existing deterministic replay approaches.
Implementation costs: Previous schemes have to resolve the possible input incoherence between the pair of checked core and redundant core, which incurs non-trivial implementation costs to the CMP. RepTFD elegantly avoids the troublesome input incoherence problem through deterministic replay, and keeps most existing parts of the CMP unmodified. Experiments also show that RepTFD only consumes 0.83% area of the whole chip. Hence, RepTFD can be easily implemented on a commercial CMP with negligible design, verification, and area costs.
2 Related Work
2.1 Core-level Transient Fault Detection
Researchers have proposed various core-level transient fault detection schemes. These schemes share a similar idea: using a redundant core to protect each checked core from transient faults separately. When the checked core produces instruction results different from those of the corresponding redundant core, a transient fault is detected.
Core-level schemes require a checked core and the corresponding redundant core to have the same inputs [22, 23, 34]. For single-threaded applications, this requirement can be straightforwardly satisfied by providing the same program and I/O inputs [3, 27]. However, for multi-threaded applications, previous schemes face the so-called input incoherence problem: even given the same program and I/O inputs, the checked core and the redundant core may still get different results for the same load instruction if a third-party core stores a new value to the corresponding memory location in the time gap between the load instructions performed by the checked core and the redundant core. To address this input incoherence problem,  and  employ a load value queue (LVQ), through which every load value is forwarded from the checked core to the redundant core. However, the LVQ implementation has a high design complexity, especially when the load instructions are executed out of order.
Instead of forwarding the load results of the checked core to the redundant core, some schemes [29, 14, 25] allow both cores to access the memory independently. Reunion  treats input incoherence similarly to transient faults. When an input incoherence is detected, the roll-back recovery mechanism is carried out. As an input incoherence is much more likely than a transient fault, Reunion may incur remarkable performance overhead (up to 250% for 8 checked cores according to ) for roll-back recovery from input incoherence. DCC  maintains a memory access window for each pair of checked core and redundant core, to monitor the memory locations that are accessed by both cores. Conflicting store instructions performed by a third-party core to those locations are stalled until both cores have completed their memory instructions. However, the performance overhead is still high (19.2% on average for 8 checked cores according to ). Recently, TDB  was proposed to guarantee input coherence via the cache coherence protocol. Although it has better performance and scalability (with respect to the number of cores) than DCC, its design complexity and risk are quite high, since it requires designing a new cache coherence protocol.
Even though the previous core-level transient fault detection schemes bring remarkable modifications to the existing parts of a CMP, they still leave the uncore parts, which consume more than area of a commercial CMP, prone to transient faults. This is because these schemes protect each checked core with a redundant core individually. Hence, they allow the checked core and the corresponding redundant core to load values from the same source (say, LLC or memory). If the source itself contains incorrect values due to a transient fault in the uncore parts (e.g., in the transfer from some core’s L1 cache to the LLC), both the checked core and the redundant core will produce the same incorrect instruction results. As a result, no core-to-core comparison can find such a transient fault in the uncore parts. Furthermore, the previous schemes have to deal with the possible input incoherence between the pair of checked core and redundant core, which incurs remarkable design, implementation, and verification costs to the CMP.
Unlike the previous core-level schemes, RepTFD treats all checked cores as a single group, and uses a redundant group of cores, which has the same number of cores as the checked group, to protect the checked group as a whole. Since there is no value dependency between groups (i.e., a checked core and a redundant core will not load the same memory location), any transient fault can affect at most one group of cores, and thus can be easily detected. To the best of our knowledge, RepTFD is the first core-level transient fault detection scheme with 100% coverage. Furthermore, RepTFD elegantly avoids the troublesome input incoherence problem through deterministic replay, and thus, unlike the previous schemes, does not require non-trivial modifications to the existing parts of a CMP. Moreover, RepTFD only incurs overhead in the chip area and performance overhead in comparison to the normal execution without fault-tolerance. To sum up, the effectiveness, elegance, and efficiency of RepTFD demonstrate its applicability in industry.
2.2 Circuit-level Transient Fault Protection
Besides core-level protection, a chip can be protected from transient faults at the circuit level by inserting redundant circuit-level cells (e.g., registers, cells, RAMs). For example, many chips for aerospace engineering adopt Triple Modular Redundancy (TMR), which uses two additional cells to protect one cell from transient faults. Theoretically, any single fault can be detected and corrected with TMR. Moreover, ECC and parity checking, which protect memory elements with information redundancy, can provide transient fault tolerance in cache and memory.
In general, circuit-level protection against transient faults may significantly increase the area of the chip and reduce its speed. Hence, the usage of circuit-level protection is limited to cache and memory in most commercial CMPs.
2.3 Protection of the Uncore Parts
There have also been many approaches to making the uncore parts more reliable (especially the NoC). To tolerate incorrect behaviors of the NoC, BulletProof  and Vicis  were proposed to detect and recover from transfer loss. In addition, [1, 21] provide resilient routing algorithms that retransmit network packets when a fault occurs in the NoC. Recently,  presented a cache coherence protocol framework that ensures coherence and reliability in the transfers even in the presence of hardware faults in the NoC.
However, the above approaches need non-trivial modifications to the existing uncore parts (as well as cores), which may bring additional design and re-verification overheads to the CMP. In comparison, without any modification to the existing uncore parts, RepTFD can detect transient faults in the uncore parts as well as in the cores. Moreover, the fact that a number of approaches have been proposed to protect some uncore parts also demonstrates the urgent need for a fault detection scheme that covers not only the cores but also the uncore parts.
2.4 Deterministic Replay
Deterministic replay aims at guaranteeing that two executions of the same program exhibit the same behavior in the presence of non-deterministic factors. An ideal deterministic replay approach for effective transient fault detection should bring only negligible slowdown to both the first-run and the replay-run. While most hardware-assisted deterministic replay approaches bring trivial slowdown to the first-run, few of them provide an efficient replay-run (partly because they focus on pursuing a small log size, which is unimportant for fault detection) [7, 32, 15, 16, 31, 4]. For example, Karma , which is a state-of-the-art hardware-assisted deterministic replay approach, still brings about slowdown in the replay-run.
To achieve good performance in the replay-run, RepTFD employs pending period based deterministic replay, which originated from our previous conference paper . However, to enable an efficient replay-run, the RepTFD introduced in this journal paper has the following advantages over our conference paper : 1) RepTFD fixes the execution time of each instruction block (in the record-run), while  fixes the number of instructions in each instruction block. Since the execution time of each block is unbounded in , some execution orders between blocks with overlapping pending periods cannot be recorded. As a result,  has to enforce these orders by replaying the program sequentially, which causes over 700% slowdown when replaying an 8-thread program. In contrast, RepTFD records all non-inferrable execution orders between blocks with overlapping pending periods, and thus enables replaying the program in parallel, which is essential for an efficient replay-run. 2) Even if  could replay the program in parallel, its replay-run would still incur larger overheads to enforce the execution orders that are non-inferrable from physical time orders, since the unbounded block execution time allowed by  yields more non-inferrable execution orders (to be enforced in the replay-run) than RepTFD.
In summary, RepTFD significantly outperforms our previous conference paper  in the replay speed, which is crucial for fault detection. Moreover, to our best knowledge, the 4.76% replay slowdown of RepTFD is also the smallest among those of all existing deterministic replay approaches.
3 Implementation of RepTFD
3.1 Overview of RepTFD
RepTFD uses one half of the cores in the CMP as the checked group of cores and the other half as the redundant group of cores. The two groups execute the same parallel program without data dependency between them. The first-run (on the checked group of cores) is divided into many segments using an existing checkpoint mechanism, as shown in Figure 2.(b). In each segment of the first-run, two log files, the determinism-log and the result-log, are generated. The determinism-log is used to force the replay-run (on the redundant group of cores) to behave the same as the first-run. The result-log contains the instruction results of the first-run.
After the first-run of the -th segment, the replay-run of the -th segment starts on the redundant group of cores, and the results of the first-run and the replay-run are compared. Once there is a mismatch between the results of the first-run and the replay-run, a malignant transient fault is detected. Then both the checked group of cores and the redundant group of cores are rolled back to the most recent checkpoint to recover from the transient fault. For example, if a fault is detected when replaying the -th segment, both the first-run and the replay-run should be rolled back to the --th checkpoint to re-execute the -th segment (note that the replay-run must wait for the first-run to complete the re-execution of the -th segment first).
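The segment-by-segment workflow described above can be sketched as the following control loop. This is only an illustrative model under stated assumptions: `first_run` and `replay_run` are hypothetical callables standing in for the hardware, the retry limit is invented, and the loop runs the two phases back to back rather than modeling any overlap between the replay of one segment and the first-run of the next.

```python
def run_with_detection(segments, first_run, replay_run, max_retries=3):
    """Run each segment, replay it, and roll back on a result mismatch.

    first_run(seg)            -> (determinism_log, result_log)   # hypothetical
    replay_run(seg, det_log)  -> result_log                      # hypothetical
    Returns the number of rollbacks that were needed.
    """
    rollbacks = 0
    for seg in segments:
        for _attempt in range(max_retries + 1):
            det_log, first_results = first_run(seg)
            replay_results = replay_run(seg, det_log)
            if first_results == replay_results:
                break  # segment verified; its end serves as the next checkpoint
            # Mismatch: a malignant fault hit one of the runs, so both runs
            # restart from the checkpoint taken before this segment.
            rollbacks += 1
        else:
            raise RuntimeError(f"segment {seg!r} still failing after {max_retries} retries")
    return rollbacks
```

The inner loop re-executes the first-run before the replay-run, which mirrors the requirement that the replay-run wait for the re-executed first-run of the same segment.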
As we have mentioned, in the first-run, two log files, i.e., determinism-log and result-log, should be recorded.
For deterministic replay, the execution orders among memory instructions should be recorded as determinism-log. To achieve determinism, each recorded execution order (e.g., ) should be enforced in the replay-run with dedicated synchronization: Before is executed, the replay-run must detect and wait for the completion of . As a result, enforcing the recorded execution orders becomes the main performance overhead of the replay-run [7, 32, 17]. To implement efficient replay-run, RepTFD employs a pending period based recording approach to reduce the number of directly recorded execution orders that need to be enforced in the replay-run.
Before introducing how RepTFD records execution orders, we need to introduce the concepts of pending period and physical time order, which were proposed in our previous investigations [6, 7], as preliminaries. As shown in Figure 3, the pending period of an instruction is a relaxed time interval (denoted as ) on the global clock in which the instruction is globally performed. For a CMP with a global clock, the pending periods of any two instructions and can be compared. If their pending periods do not overlap, there exists a physical time order between these two instructions: If the end time of ’s pending period (also denoted as ’s end time for brevity) is earlier than the start time of ’s pending period (also denoted as ’s start time for brevity), then we say that is before in physical time order.
In , Chen et al. proved that if an instruction is before another instruction in physical time order, cannot be before in any kind of logical time order. The reason is quite intuitive: instruction must have been globally performed before that begins to execute, thus is impossible to affect the result of . Therefore, if the physical time order between a pair of two memory instructions is known according to the pending period information, the logical time order (including execution order) between two instructions can be inferred. In fact, most () execution orders can be inferred from the pending period information according to .
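As a minimal sketch of the inference rule just described, the following Python fragment compares two pending periods and reports the physical time order when one exists. The `PendingPeriod` type and the function names are assumptions introduced here for illustration; the strict comparison follows the wording "earlier than" used in the definition above.

```python
from typing import NamedTuple, Optional


class PendingPeriod(NamedTuple):
    start: int  # relaxed start time on the global clock
    end: int    # relaxed end time on the global clock


def physically_before(a: PendingPeriod, b: PendingPeriod) -> bool:
    """True if a's pending period ends earlier than b's begins."""
    return a.end < b.start


def inferred_order(a: PendingPeriod, b: PendingPeriod) -> Optional[str]:
    """Return the order implied by the pending periods, or None if they overlap.

    When None is returned, the execution order (if the two instructions
    conflict) cannot be inferred and would have to be recorded directly.
    """
    if physically_before(a, b):
        return "a<b"
    if physically_before(b, a):
        return "b<a"
    return None
```

Only the `None` case contributes entries to the directly recorded part of the determinism-log; every other pair is covered by the pending period information alone.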
Therefore, instead of directly recording all execution orders, RepTFD records the pending period information and the non-inferrable execution orders (i.e., execution orders that cannot be inferred from the pending period information). To reduce the overhead, RepTFD records the pending period of each instruction block instead of each individual instruction (footnote 6). Through sampling, each thread is naturally divided into instruction blocks, where the -th instruction block consists of instructions whose global-perform times are between --th and -th samplings. Note that some instructions in the -th block (whose global-perform time is later than the --th sampling) may begin earlier than the --th sampling. Hence, the pending period of the -th block is between the --th and -th samplings. That is, the length of the pending period of each block (two sampling spans) is twice the actual execution time of the block (one sampling span). As shown in Figure 4, instructions in block are those instructions on core0 whose global-perform times are in the region of , while the pending period of block is .
Footnote 6: The pending period of an instruction block is the time interval between the start time of the first-starting instruction in the block and the end time of the last-ending instruction in the block.
Figure 4 provides an illustrative example of which execution orders should be recorded. Block is an instruction block executed on core0, and block , , , , are five consecutive instruction blocks executed on core1. The pending periods of block are , , , , , respectively. With pending period information, the red dashed execution orders, i.e., the execution orders between memory instructions in block and memory instructions in block , need not be recorded. These execution orders are inferrable, because memory instructions in block have globally performed at sampling time while memory instructions in block have not started. Similarly, all the green dashed execution orders are also inferrable. Actually, only the execution orders between memory instructions in block and memory instructions in block , and (the bold execution orders) are non-inferrable and thus should be recorded.
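The block-level rule can be sketched in a few lines of Python. The indexing is an assumption reconstructed from the description above (a block's pending period covers two sampling spans, i.e., it starts two samplings before the sampling that closes the block), and the function names are hypothetical; the 512-cycle span is the value the paper reports. Periods that merely touch at a sampling time are treated as ordered, following the Figure 4 explanation that one block has globally performed at the sampling time at which the other has not yet started.

```python
SAMPLING_SPAN = 512  # clock cycles, the sampling span reported for RepTFD


def block_pending_period(i, span=SAMPLING_SPAN):
    """Pending period (in cycles) of the i-th block of a thread, i >= 1.

    Assumed indexing: the block holds instructions globally performed between
    samplings i-1 and i, and its relaxed period starts at sampling i-2.
    """
    start = max(i - 2, 0) * span
    end = i * span
    return start, end


def blocks_overlap(i, j):
    """True if block i on one core and block j on another core overlap.

    Overlapping blocks are the only ones whose conflicting accesses may need
    a directly recorded execution order.
    """
    start_i, end_i = block_pending_period(i)
    start_j, end_j = block_pending_period(j)
    return start_i < end_j and start_j < end_i
```

Under these assumptions a block overlaps exactly three blocks of another core (the previous, the concurrent, and the next one), which matches the three blocks singled out in the Figure 4 example.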
Concretely, to record the pending period information in the determinism-log, RepTFD employs a PC-sampling technique. At a sampling time , instructions that have committed out of the instruction window must have an end time before . Meanwhile, instructions that have not entered the instruction window must have a start time after . Thus, we can obtain the pending period of each instruction by observing the instructions in the instruction window at every sampling time.
To record these non-inferrable execution orders (i.e., execution orders that cannot be inferred from the pending period information), RepTFD employs a CAM for each core to save the addresses accessed by memory instructions in the last block and the current block of the core. If a local L1 miss occurs when executing instruction , the address of is searched in the CAMs of the other cores to check whether there is a conflicting memory instruction. If hits the accessed address of some instruction saved in the CAMs, the information of and the information of are both recorded as a non-inferrable execution order . Take Figure 4 for instance. At the time between and , core0 is executing block and core1 is executing block . Memory operations in block and are stored in the CAM for core1. If an L1 miss occurs on core0 when executing an instruction in block , core0 searches the CAM for core1 to see whether there is a conflicting memory instruction. In this manner, all non-inferrable execution orders from instructions in block or to instructions in block can be recorded. Similarly, all non-inferrable execution orders from instructions in block to instructions in block or can also be recorded when core1 checks the CAM of core0. As a result, all non-inferrable execution orders, which lie between blocks with overlapping pending periods, are recorded.
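The CAM-based recording step can be modeled in software as follows. This is a rough sketch, not the hardware design: the class and function names, the dictionary-based storage, and the tuple format of the log entries are all assumptions made for illustration. The conflict test (same address, at least one write) follows the definition of conflicting memory instructions given earlier.

```python
from collections import defaultdict


class CoreCAM:
    """Addresses touched by one core in its last and current instruction blocks."""

    def __init__(self):
        self.last = defaultdict(list)     # address -> [(instr_id, is_write), ...]
        self.current = defaultdict(list)

    def access(self, instr_id, address, is_write):
        """Remember a memory access made in the current block."""
        self.current[address].append((instr_id, is_write))

    def next_block(self):
        """At a sampling point the current block becomes the last block."""
        self.last, self.current = self.current, defaultdict(list)

    def lookup(self, address):
        """All remembered accesses to this address (last block, then current)."""
        return self.last.get(address, []) + self.current.get(address, [])


def on_l1_miss(core_id, instr_id, address, is_write, cams, determinism_log):
    """Search the other cores' CAMs and log each conflicting predecessor."""
    for other_id, cam in cams.items():
        if other_id == core_id:
            continue
        for prev_instr, prev_is_write in cam.lookup(address):
            if is_write or prev_is_write:  # conflict needs at least one write
                determinism_log.append(((other_id, prev_instr), (core_id, instr_id)))
```

Keeping only the last and current blocks is what bounds the search: anything older has a pending period that no longer overlaps the missing instruction's block, so its order is inferrable and need not be logged.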
Notably, the length of the sampling span is closely related to the replay speed. As shown in Figure 4, the pending period of an instruction block (two sampling spans) is a relaxation of the accurate execution time of the block (one sampling span). Hence, the higher the sampling frequency, the more accurate the pending period is, and the less relaxation each block gets. As a result, with a shorter sampling span, the replay-run needs to spend more effort enforcing physical time orders. On the other hand, a longer sampling span leaves fewer execution orders inferrable, which means the replay-run needs to spend more effort enforcing non-inferrable execution orders. In practice, RepTFD sets the sampling span to 512 clock cycles as a tradeoff between the effort of enforcing physical time orders and that of enforcing non-inferrable execution orders in the replay-run.
The result-log needs to record the instruction results of the first-run. Without loss of generality, assume that each instruction modifies a 32-bit register or a 32-bit memory location. However, recording 32 bits for each instruction may require hundreds of gigabytes per second, which is impractical for state-of-the-art I/O and memory interfaces. To reduce the size of the result-log, we utilize a technique called CheckSum, which significantly reduces the amount of data through lossy compression.
As shown in Algorithm 1, RepTFD employs a 32-bit checksum register and a few gates for each core to summarize the instruction results on the core. The value of each checksum register is initialized to 0 at reset. When an instruction commits, the new value of the checksum register becomes the result of the instruction XORed with the old value of the checksum register. For every 1024 instructions, the register is set to be . At such time, the current value of the checksum register is exported to the result-log, and the register is then re-initialized. Obviously, the size of the CheckSum result-log is quite small: it consumes only 4 bytes per 1024 instructions. Furthermore, besides CheckSum, other lossy compression approaches can also be applied to reduce the size of the result-log.
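The CheckSum scheme described above can be sketched in Python as follows. Algorithm 1 itself is not reproduced in this text, so this is a reconstruction from the prose: the class name, the `commit` method, and the `first_mismatch` helper are hypothetical, while the 32-bit width, the XOR update, and the 1024-instruction export interval come from the description.

```python
MASK32 = 0xFFFFFFFF
EXPORT_INTERVAL = 1024  # instructions summarized by one exported checksum


class ChecksumRegister:
    """Per-core 32-bit XOR checksum over committed instruction results."""

    def __init__(self):
        self.value = 0        # initialized to 0 at reset
        self.count = 0        # instructions folded in since the last export
        self.result_log = []  # exported checksums, 4 bytes each

    def commit(self, result):
        """Fold one committed instruction result into the checksum."""
        self.value ^= result & MASK32
        self.count += 1
        if self.count == EXPORT_INTERVAL:
            self.result_log.append(self.value)
            self.value = 0    # re-initialize after exporting
            self.count = 0


def first_mismatch(first_log, replay_log):
    """Index of the first differing checksum, or None if the logs agree."""
    for index, (a, b) in enumerate(zip(first_log, replay_log)):
        if a != b:
            return index
    return None
```

Compared with logging every 32-bit result, this stores 4 bytes instead of 4096 bytes per 1024 instructions. The compression is lossy in the usual sense of an XOR checksum: two errors that flip the same bit position within one interval cancel out and would not change the exported value.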
In the replay-run, the same program is replayed on the other half of the cores of the CMP (the redundant group of cores) according to the determinism-log. In addition, the instruction results of the replay-run are compared dynamically with the result-log recorded in the first-run.
3.3.1 How to Replay
For faithful replay, two types of information recorded in the determinism-log must be enforced in the replay-run. One is the pending period information (as well as the resultant physical time orders). The other is the non-inferrable execution orders.
To enforce the physical time orders, in the replay-run, an instruction block cannot be executed unless all instruction blocks before in physical time order have ended their executions. Instead of analyzing the pending period information to get detailed physical time orders, we utilize a “grant” array in the replay-run to enable efficient replay. Each element of the grant array represents the state of a sampling time. Concretely, the array element of each sampling time has a value initialized to . This value is increased by when a thread finishes all of its instruction blocks whose pending periods end no later than . Any instruction block, whose pending period starts no earlier than in the first-run, cannot be executed in the replay-run, unless the value of array element corresponding to has been increased to (the number of threads), which indicates that all threads have granted the execution of instruction block starting later than . In this way, all physical time orders are guaranteed in the replay-run with negligible costs.
In practice, three registers should be assigned for each core to implement the above idea. The first register, denoted by , is the start time of the next block’s pending period. The second register, denoted by , is the end time of the current block’s pending period. And the third register, denoted by , is the end time of the next block’s pending period.
Algorithm 2 is the detailed replay algorithm for enforcing the physical time orders. is the aforementioned grant array, which represents how many cores have finished their instruction blocks whose pending periods end no later than . When an instruction block ends, the grant values of the sampling times between and are increased by , since the corresponding core has finished all of its instruction blocks whose pending periods end no later than these times. On the other hand, when a new instruction block is about to start, the grant value of its start time, in the algorithm, is checked. This block can be executed only if the grant value equals the total number of cores . Otherwise, it must wait and stall its execution.
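The grant-array mechanism can be modeled with the short Python sketch below. Algorithm 2 is not reproduced in this text, so several details are assumptions: time is measured in sampling indices, the range updated at the end of a block is taken to be half-open (from the current block's end time up to, but not including, the next block's end time), the initial value is taken to be zero, and sampling time 0 is treated as granted from the start. The class and method names are hypothetical.

```python
from collections import defaultdict


class GrantArray:
    """Counts, per sampling time, how many cores have granted that time."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.grant = defaultdict(int)  # sampling index -> number of grants

    def on_block_end(self, cur_end, next_end):
        """Called by a core when its current block finishes in the replay-run.

        cur_end:  end time of the finished block's pending period
        next_end: end time of the core's next block's pending period
        For every sampling time t in [cur_end, next_end), the core has now
        finished all of its blocks whose pending periods end no later than t.
        """
        for t in range(cur_end, next_end):
            self.grant[t] += 1

    def can_start(self, next_start):
        """A block may start once every core has granted its start time."""
        return next_start <= 0 or self.grant[next_start] == self.num_cores
```

In this model a core that is about to start its next block calls `can_start` with the start time of that block's pending period and stalls while it returns `False`; no pairwise comparison of pending periods is needed, which is the point of using the array.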
Besides the execution orders indirectly guaranteed by the physical time orders, the directly recorded non-inferrable execution orders should also be enforced in the replay-run. Consider an execution order between two instructions, the earlier executed on core0 and the later on core1. In the replay-run, core1 cannot execute the later instruction until the earlier one has been committed by core0. Hence, for each core, RepTFD inserts a dedicated execution order buffer to import execution order information, and a single bit to pause the progress of the core. (Footnote 7: According to our experiments, fewer than 1 non-inferrable execution order is directly recorded in the determinism-log for every 10,000 instructions on average. Thus, a tiny buffer (e.g., 32 bytes) is enough to buffer the recorded execution orders.) With the information in the execution order buffer, core1 knows that it should wait for the completion of the earlier instruction first. If that instruction has not been completed, the pause bit of core1 is set, and core1 pauses its own progress. When core0 has completed the earlier instruction, it acknowledges core1. As a result, core1’s pause bit is cleared, and core1 can execute the later instruction. In this way, all execution orders directly recorded in the determinism-log can be guaranteed.
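The interaction between the execution order buffer and the pause bit can be sketched as follows. This is a simplified software model, not the RTL: the names (`ReplayCore`, `order_buffer`, `pause_bit`, `step`) are hypothetical, instructions are identified by a per-core commit count, and the acknowledgement from the source core is modeled by polling its commit counter rather than by an explicit signal.

```python
from collections import deque


class ReplayCore:
    """Redundant core with a small execution order buffer and a pause bit."""

    def __init__(self, core_id, orders=()):
        # Each order is (local_inst, src_core, src_inst): this core may not
        # commit local_inst until src_core has committed src_inst.
        self.core_id = core_id
        self.order_buffer = deque(sorted(orders))
        self.committed = 0      # number of instructions committed so far
        self.pause_bit = False

    def step(self, cores):
        """Try to commit the next instruction; return True on success."""
        nxt = self.committed + 1
        while self.order_buffer and self.order_buffer[0][0] == nxt:
            _, src_core, src_inst = self.order_buffer[0]
            if cores[src_core].committed < src_inst:
                self.pause_bit = True   # predecessor not committed yet: pause
                return False
            self.order_buffer.popleft()  # order satisfied, drop the entry
        self.pause_bit = False
        self.committed = nxt
        return True


# core1's 2nd instruction must wait for core0's 3rd instruction.
core0 = ReplayCore(0)
core1 = ReplayCore(1, orders=[(2, 0, 3)])
cores = [core0, core1]
```

Because the recorded orders are rare, the buffer in this model is almost always empty and `step` commits without checking anything.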
As mentioned, the major concern about the replay-run is the speed overhead, which is incurred when an instruction is stalled to enforce some order (either a physical time order or a non-inferrable execution order) with another instruction. To enforce the physical time orders in the replay-run, RepTFD only needs to guarantee that each block is executed within its given pending period. Recall that the length of the pending period of each instruction block is two sampling spans, while the actual execution time of each block is only one sampling span. As a result, only a few blocks need to be stalled in the replay-run to enforce the recorded pending periods and physical time orders. Meanwhile, the overhead of enforcing the directly recorded non-inferrable execution orders is also low, since only a small fraction of execution orders cannot be inferred from the physical time orders. To sum up, RepTFD does not cause a significant slowdown in the replay-run.
Figure 5 illustrates RepTFD’s low performance overhead in the replay-run. Blocks A, B, C, and D are four consecutive blocks executed on core0, and blocks E, F, G, and H are four consecutive blocks executed on core1. The actual execution time of each block in the first-run is one sampling span (512 cycles). The arrows represent the physical time orders between these blocks in the first-run. For example, block E is before block A in physical time order, since the pending period of block E runs from the 99th sampling to the 101st sampling, and that of block A runs from the 101st sampling to the 103rd sampling. Suppose each block on core0 has the same execution time in the replay-run as in the first-run. In the replay-run, blocks A, B, and D are not stalled, since blocks E, F, and H finish before blocks A, B, and D start, respectively. The only exception is block C, which is stalled because the end time of block G is delayed by more than 512 clock cycles (longer than one sampling span). According to our experiments, only a small fraction of instruction blocks are stalled to enforce the physical time orders, and each stall is short on average (see Section 4), which adds very little overhead to the replay-run.
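The physical time order used in this example reduces to a comparison of pending-period endpoints. The following minimal sketch makes that explicit; `PendingPeriod` and `precedes` are hypothetical names, and the periods of blocks A and E are the sampling indices quoted above.

```python
from typing import NamedTuple


class PendingPeriod(NamedTuple):
    start: int  # sampling index at which the pending period begins
    end: int    # sampling index at which the pending period ends


def precedes(a: PendingPeriod, b: PendingPeriod) -> bool:
    """True if block a is before block b in physical time order,
    i.e. a's pending period ends no later than b's pending period starts."""
    return a.end <= b.start


block_e = PendingPeriod(99, 101)   # core1, from the example above
block_a = PendingPeriod(101, 103)  # core0, from the example above
e_before_a = precedes(block_e, block_a)
```

Two blocks whose pending periods overlap are ordered by neither direction of `precedes`; any execution order between their instructions is non-inferrable and has to be recorded directly.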
3.3.2 How to Check Results
The results in the replay-run are expected to be the same as those in the first-run. Therefore, transient faults can be detected by comparing the results of the two runs. RepTFD adopts dynamic comparison, which compares the execution results by dynamically importing the values in the result-log.
Specifically, the results in the replay-run should be lossily compressed with the same algorithm as in the first-run (cf. Algorithm 1). Hence, each redundant core in the replay-run should also generate a checksum result for every 1024 instructions. Once a checksum result is generated in the replay-run, it is compared with the corresponding checksum result generated in the first-run (which is saved in the result-log). Such a comparison has only trivial hardware cost and does not affect the performance of the replay-run.
It is worth noting that, due to lossy compression, there may be a rare case in which the first-run and the replay-run produce the same checksum although their instruction results differ. In such a rare case, a transient fault may not be detected immediately. This theoretical problem is shared by many previous transient fault detection schemes [29, 14, 25]. However, a transient fault, even if not detected immediately, will eventually be detected by comparing the results of future instructions affected by the fault.
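The checking step can be sketched in software as follows. The compression function of Algorithm 1 is not reproduced here; CRC-32 (Python’s `zlib.crc32`) is used purely as a stand-in lossy checksum, instruction results are assumed to be unsigned 64-bit values, and the function names are hypothetical.

```python
import struct
import zlib

BLOCK = 1024  # one checksum per 1024 instruction results, as in the text


def checksums(results):
    """Compress a stream of 64-bit results into one checksum per block.
    CRC-32 is a stand-in for the paper's own compression (Algorithm 1)."""
    out = []
    for i in range(0, len(results), BLOCK):
        chunk = results[i:i + BLOCK]
        data = struct.pack(f"<{len(chunk)}Q", *chunk)
        out.append(zlib.crc32(data))
    return out


def first_mismatch(result_log, replay_results):
    """Return the index of the first checksum block that differs from the
    first-run result-log, or None if every block matches."""
    for idx, (logged, replayed) in enumerate(
            zip(result_log, checksums(replay_results))):
        if logged != replayed:
            return idx
    return None


first_run = list(range(3000))        # hypothetical instruction results
result_log = checksums(first_run)    # saved during the first-run
faulty_replay = list(first_run)
faulty_replay[1500] ^= 1 << 7        # a single bit flipped by a transient fault
```

Only the checksums cross between the two runs, so the comparison bandwidth is one value per 1024 instructions rather than one per instruction.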
3.4 Hardware Supports of RepTFD
In this subsection, we present the hardware modifications to the RTL design of an industrial CMP, Godson-3, that implement RepTFD. For the existing design of Godson-3 (whose architecture features can be found in Table 1), RepTFD only trivially modifies each core as follows: 1) adding a 64-bit register to each core to count the number of committed memory instructions, and 2) implementing a pause bit for each core to pause its progress and thereby guarantee orders in the replay-run. Most existing components of Godson-3 (including the L2 cache, memory controller, cache coherence protocol, and switch+mesh interconnection) remain unmodified.
Table 1: Architecture features of Godson-3.
| Feature | Value |
|---|---|
| Number of Processor Cores | cores |
| Pipeline | Four-issue, nine-stage, out-of-order |
| Functional Unit | Two 64-bit fixed-point units, two 64-bit floating-point units, one 128-bit memory unit |
| Register File | 32 logical registers and 64 physical registers for fixed-point and floating-point respectively |
| L1 dcache | Private, 4 way, 32KB, writeback, 32B per line, load-to-use 3(fix)/4(ft) cycles |
| L1 icache | Private, 4 way, 64KB, 32B per line |
| L2 Cache | Unified address, shared, 4-16 bank, 512K-1MB per bank, 4 way, 32B per line, 30-35 cycles latency |
| Memory | 1-4 DDR2/3 controller, GB per controller, MHz frequency, cycles latency |
| Cache Coherence | Directory-based MSI protocol |
Figure 6 shows the detailed implementation of RepTFD. core0 to core3 form the checked group of cores, and core4 to core7 form the redundant group of cores. As shown in the top half of Figure 6, the checked group of cores needs the following hardware supports to generate the determinism-log and the result-log. (Note that most of these hardware supports are decoupled from the existing components of Godson-3, and thus do not require modifications to these components.)
- A counter for each checked core, which works with PC-sampling to count the committed memory instructions at every sampling time and thereby record the pending period information. As PC-sampling is already supported by most commercial CMPs, recording the pending period information costs very little.
- A CAM for each checked core, which stores the memory operations in the last executed block and the currently executing block. The size of the CAM is bounded by the number of memory instructions in two sampling spans multiplied by the number of bits stored per memory instruction (type, address, counter, and cache-hit flag).
- Some logic to record the non-inferrable execution orders: when an L1 miss occurs, the address of the corresponding instruction is searched in the CAMs of the other cores. If it hits, the core number and instruction counter of both this instruction and the hit instruction in the CAM are recorded.
- A CheckSum module for each checked core to record the result-log. This module receives the instruction results from the core and then processes the results using lossy compression. As shown in Algorithm 1, this module consumes fewer than 100 bits of registers.
- Some logic to export the logs out of the chip.
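The CAM lookup that records non-inferrable execution orders can be modeled roughly as below. This is a sketch under assumptions rather than the actual logic: the names (`CoreCAM`, `on_l1_miss`) are hypothetical, each CAM is modeled as two address-indexed dictionaries (last block and current block) that keep only the most recent access per address, the operation type and cache-hit fields are omitted, and the hit instruction is assumed to be the earlier side of the recorded order.

```python
class CoreCAM:
    """Per-core CAM holding memory operations of the last and current block."""

    def __init__(self):
        self.last = {}      # address -> instruction counter (previous block)
        self.current = {}   # address -> instruction counter (current block)

    def record(self, addr, inst_counter):
        self.current[addr] = inst_counter

    def new_block(self):
        # At a block boundary the current block becomes the last block,
        # and anything older than two sampling spans is dropped.
        self.last, self.current = self.current, {}

    def lookup(self, addr):
        if addr in self.current:
            return self.current[addr]
        return self.last.get(addr)


def on_l1_miss(cams, core, addr, inst_counter, log):
    """On an L1 miss, search the other cores' CAMs and log every hit as
    ((hit_core, hit_counter), (missing_core, missing_counter))."""
    for other, cam in enumerate(cams):
        if other == core:
            continue
        hit = cam.lookup(addr)
        if hit is not None:
            log.append(((other, hit), (core, inst_counter)))


cams = [CoreCAM(), CoreCAM()]
determinism_log = []
cams[0].record(0x1000, 17)   # core0 accessed 0x1000 as its 17th memory instruction
on_l1_miss(cams, core=1, addr=0x1000, inst_counter=42, log=determinism_log)
```

Entries older than two blocks need no lookup in this model because their order relative to later blocks on other cores follows from the pending periods.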
As shown in the bottom half of Figure 6, the following hardware supports are needed by the redundant group of cores to replay the determinism-log and check the result-log.
- A pause bit for each core to pause its progress and thereby guarantee orders in the replay-run.
- A replay module for each redundant core to enforce both the physical time orders and the non-inferrable execution orders. When an instruction must wait to enforce some order, the replay module sends a signal to pause the progress of the corresponding core. To enforce the physical time orders, several hundred bits of registers are consumed per core to implement Algorithm 2. In the meantime, only a 16-entry buffer is required to enforce the non-inferrable execution orders.
- A global grant array for the redundant group of cores to enforce the physical time orders. This array cooperates with the replay module of each redundant core according to Algorithm 2.
- A CheckSum module for each redundant core to check the instruction results of the first-run.
- Some logic to import the recorded logs into the chip.
To sum up, the main design overhead of RepTFD is a CAM per checked core (it is also feasible to replace the CAM with a space-efficient Bloom filter), which adds only a small cost to the chip area. Moreover, RepTFD has very low design complexity because most components of the existing CMP design remain unmodified.
4 Experimental Results
We implement RepTFD on the RTL design of the 8-core Godson-3, and carry out experiments over the SPLASH2 benchmarks to validate the efficiency and effectiveness of RepTFD. Table 1 lists the detailed features of Godson-3. In the experiments, “the performance of RepTFD” refers to the performance of redundantly executing an 8-thread application on a 16-core CMP (the first-run and the replay-run of the application are executed simultaneously on 8 checked cores and 8 redundant cores, respectively). “The baseline performance” refers to the performance of executing an 8-thread application alone on a 16-core CMP. Figure 7 presents the performance of RepTFD normalized to the baseline performance with a 512-cycle sampling span. The average performance overhead of RepTFD is 4.76% over the benchmarks. (Footnote 8: For reference, the comparison points are two state-of-the-art transient fault detection schemes, Reunion and DCC, on 16-core systems (8 checked cores + 8 redundant cores).) For benchmark water, the performance overhead is even lower.
In general, the low performance overhead of RepTFD comes from two factors: the low cost of enforcing non-inferrable execution orders, and the low cost of enforcing physical time orders. RepTFD pays negligible cost to enforce non-inferrable execution orders, since very few of them need to be enforced in the replay-run. Figure 8 shows the number of non-inferrable execution orders per 10000 instructions (per thread). On average, only 1.47 execution orders need to be enforced for every 10000 instructions. For benchmark water, only 0.014 non-inferrable execution orders per 10000 instructions need to be enforced in the replay-run. Thus, negligible cost is spent on enforcing the recorded execution orders.
On the other hand, RepTFD can cost-effectively enforce physical time orders, since the recorded pending period of each instruction block is quite relaxed (twice the actual execution time of the block). As illustrated in Figure 9 and Figure 10, only a small fraction of instruction blocks are stalled in the replay-run to enforce the physical time orders, and the average stall of each stalled block is short. In other words, averaged over all 512-cycle instruction blocks, we only need to stall 3.08 cycles per block, which brings about a 0.6% (3.08/512) performance penalty in the replay-run. For benchmark water, the performance overhead is even lower (see Figure 9 and Figure 10).
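The quoted penalty can be checked with a one-line calculation. As a back-of-the-envelope sketch, assume the replay slowdown from physical time orders is simply the average stall per block divided by the 512-cycle block length:

```latex
\[
\text{penalty} \approx \frac{3.08\ \text{cycles}}{512\ \text{cycles}} \approx 0.0060 = 0.60\%
\]
```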
In addition, we evaluate the size of the determinism-log. RepTFD must record the exact number of instructions in each block, and thus has a larger log than our previous conference paper LReplay, which has a fixed number of instructions in each block. As shown in Figure 11, the average size of RepTFD’s determinism-log is 1.87 bytes per kilo-instruction, while LReplay’s log size is only 0.36 bytes per kilo-instruction (with a 512-cycle sampling span). However, RepTFD’s log size is still acceptable for an 8-thread fault-tolerant CMP: only 14.96 MB is required to record/replay a 1-second segment of an 8-thread program (when the IPC is 1).
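The 14.96 MB figure can be reproduced with a short calculation. The sketch below assumes a 1 GHz clock, which is not stated in this paragraph but is the frequency at which the quoted numbers are mutually consistent (1.87 bytes per kilo-instruction, 8 threads, IPC of 1, 1 second); the constant and function names are hypothetical.

```python
BYTES_PER_KILO_INST = 1.87   # average determinism-log size from Figure 11
THREADS = 8
IPC = 1.0
CLOCK_HZ = 1e9               # assumption: 1 GHz clock, inferred from the totals


def log_bytes(seconds):
    """Determinism-log size for a program segment of the given length."""
    inst_per_thread = IPC * CLOCK_HZ * seconds
    return BYTES_PER_KILO_INST * (inst_per_thread / 1000) * THREADS


one_second_mb = log_bytes(1.0) / 1e6   # megabytes, using 1 MB = 10^6 bytes
```

The log grows linearly with the number of threads and with the segment length, so the same function gives the budget for longer checkpoint intervals.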
Moreover, we evaluate the area consumption of RepTFD. The main area overhead of RepTFD is a CAM per checked core; together these CAMs consume about 5.0 square millimeters, according to the report of Synopsys’s Design Compiler with the STMicro 65nm GP/LP mixed process. Given that the overall area of our 16-core CMP (under the same process) is about 600 square millimeters, the area consumption of RepTFD is only about 0.83% of the whole chip.
In this paper, we propose RepTFD, a core-level transient fault detection scheme which employs deterministic replay to achieve 100% detection coverage. Unlike existing core-level schemes, which can only protect each core separately, RepTFD protects both the cores and the uncore parts without modifying the uncore parts of the chip. To avoid significant performance overhead in the replay-run, RepTFD records a relaxed pending period for each instruction block (whose length is twice the actual execution time of the block) in the first-run. By cost-effectively enforcing the resulting physical time orders between blocks in the replay-run, execution orders that can be inferred from physical time orders no longer need to be enforced. As a result, RepTFD incurs only a 4.76% slowdown (in comparison to normal execution without fault tolerance).
RepTFD is the first attempt to tackle transient fault detection with cutting-edge deterministic replay techniques. We find that although there have been many previous investigations into detecting transient faults on CMPs, few of them do the job as efficiently, effectively, and elegantly as deterministic replay. On the other hand, the requirements arising in transient fault detection call for a hardware-assisted deterministic replay approach different from the existing log-size-oriented replay approaches. This can motivate further improvements in deterministic replay.
-  K. Aisopos, A. DeOrio, L.-S. Peh, and V. Bertacco. “ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment,” Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques (PACT’11), 2011.
-  K. Aisopos and L.-S. Peh. “A systematic methodology to develop resilient cache coherence protocols,” Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’11), 2011.
-  J. Bartlett, J. Gray and B. Horst. “Fault tolerance in tandem computer systems,” Technical Report TR-86.7, HP Labs, 1986.
-  A. Basu, J. Bobba, and M. Hill. “Karma: Scalable Deterministic Record-Replay,” Proceedings of the 25th International Conference on Supercomputing (ICS’11), 2011.
-  K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky. “Bulletproof: a defect-tolerant CMP switch architecture,” Proceedings of the 12th IEEE International Symposium on High-Performance Computer Architecture (HPCA’06), 2006.
-  Y. Chen, Y. Lv, W. Hu, T. Chen, H. Shen, P. Wang, and H. Pan. “Fast Complete Memory Consistency Verification,” Proceedings of the 15th IEEE International Symposium on High-Performance Computer Architecture (HPCA’09), 2009.
-  Y. Chen, W. Hu, T. Chen, and R. Wu. “LReplay: A Pending Period Based Deterministic Replay Scheme,” Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA’10), 2010.
-  D. Fick, A. DeOrio, J. Hu, V. Bertacco, D. Blaauw, and D. Sylvester. “Vicis: a reliable network for unreliable silicon,” Proceedings of the 46th Design Automation Conference (DAC’09), 2009.
-  M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar and I. Pomeranz. “Transient-fault recovery for chip multiprocessors,” Proceedings of the 30th ACM/IEEE International Symposium on Computer Architecture (ISCA’03), 2003.
-  D. Hower and M. Hill. “Rerun: Exploiting Episodes for Lightweight Memory Race Recording,” Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA’08), 2008.
-  W. Hu, J. Wang, X. Gao, Y. Chen, Q. Liu, and G. Li. “Godson-3: A Scalable Multicore RISC Processor with x86 Emulation,” IEEE Micro, vol. 29, no. 2, pp. 17–29, March/April 2009.
-  W. Hu and Y. Chen. “GS464V: A High-Performance Low-Power XPU with 512-Bit Vector Extension,” Proceedings of the 22nd IEEE Symposium on High Performance Chips (HOTCHIPS’10), 2010.
-  W. Hu, R. Wang, Y. Chen, B. Fan, S. Zhong, X. Gao, Z. Qi, and X. Yang. “Godson-3B: A 1GHz 40W 8-Core 128GFlops Processor in 65nm CMOS,” Proceedings of the 58th IEEE International Solid-State Circuits Conference (ISSCC’11), 2011.
-  C. LaFrieda, E. Ipek, J. F. Martinez and R. Manohar. “Utilizing dynamically coupled cores to form a resilient chip multiprocessor,” Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), 2007.
-  D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. Chen, and J. Flinn. “Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism,” Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’10), 2010.
-  D. Lee, M. Said, S. Narayanasamy and Z. Yang. “Offline symbolic analysis to infer Total Store Order,” Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA’11), 2011.
-  P. Montesinos, L. Ceze, and J. Torrellas. “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA’08), 2008.
-  S. S. Mukherjee, C. T. Weaver, J. S. Emer, S. K. Reinhardt and T. M. Austin. “A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor,” Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’03), 2003.
-  S. Nomura, M. D. Sinclair, C. Ho, V. Govindaraju, M. Kruijf, and K. Sankaralingam. “Sampling + DMR: Practical and Low-overhead Permanent Fault Detection,” Proceedings of the 38th ACM/IEEE International Symposium on Computer Architecture (ISCA’11), 2011.
-  M. Prvulovic, Z. Zhang and J. Torrellas. “ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors,” Proceedings of the 29th ACM/IEEE International Symposium on Computer Architecture (ISCA’02), 2002.
-  V. Puente, J. A. Gregorio, F. Vallejo, and R. Beivide. “Immunet: A cheap and robust fault-tolerant packet routing mechanism,” Proceedings of the 31st ACM/IEEE International Symposium on Computer Architecture (ISCA’04), 2004.
-  G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August and S. S. Mukherjee. “Design and evaluation of hybrid fault-detection systems,” Proceedings of the 32nd ACM/IEEE International Symposium on Computer Architecture (ISCA’05), 2005.
-  G. A. Reis, J. Chang and D. I. August. “Automatic instruction-level software-only recovery,” Proceedings of the 36th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’06), 2006.
-  S. Sawant, U. Desai, G. Shamanna, L. Sharma, M. Ranade, A. Agarwal, S. Dakshinamurthy, and R. Narayanan. “A 32nm Westmere-EX Xeon Enterprise Processor,” Proceedings of the 58th IEEE International Solid-State Circuits Conference (ISSCC’11), 2011.
-  S. Shan, Y. Hu and X. Li. “Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors,” Proceedings of the 41st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’11), 2011.
-  P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger and L. Alvisi. “Modeling the effect of technology trend on the soft error rate of combinational logic,” Proceedings of the 32nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’02), 2002.
-  T. J. Slegel, R. M. Averill, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb. “IBM’s S/390 G5 microprocessor design,” IEEE Micro, vol. 19, no. 2, pp. 12–23, March/April 1999.
-  J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, A. G. Nowatzyk. “Fingerprinting: bounding soft-error detection latency and bandwidth,” Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’04), 2004.
-  J. C. Smolens, B. T. Gold, B. Falsafi and J. C. Hoe. “Reunion: Complexity-effective multicore redundancy,” Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), 2006.
-  D. Sorin, M. Martin, M. Hill and D. Wood. “SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery,” Proceedings of the 29th ACM/IEEE International Symposium on Computer Architecture (ISCA’02), 2002.
-  K. Veeraraghavan, D. Lee, B. Wester, J. Ouyang, P. Chen, J. Flinn and S. Narayanasamy. “DoublePlay: Parallelizing Sequential Logging and Replay,” Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’11), 2011.
-  G. Voskuilen, F. Ahmad, and T. Vijaykumar. “Timetraveler: Exploiting Acyclic Races for Optimizing Memory Race Recording,” Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA’10), 2010.
-  D. Wendel et al. “The implementation of POWER7: A Highly Parallel and Scalable Multi-core High-end Server Processor,” Proceedings of the 57th IEEE International Solid-State Circuits Conference (ISSCC’10), 2010.
-  Y. Zhang, J. W. Lee, N. P. Johnson and D. I. August. “DAFT: Decoupled Acyclic Fault Tolerance,” Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10), 2010.