I Introduction
Deep neural networks (DNNs) have recently attracted enormous attention due to their success in various perception tasks [23, 3], and it is an appealing idea to adopt DNNs in security-sensitive systems for in-depth inference and efficient data processing, such as autonomous vehicles and medical monitoring. On the other hand, the robustness of DNNs is of great concern for such security-related applications and hence has been widely studied. Various methods, including adversarial examples [6] and fault injection [1], have been devised to attack DNNs. Their aim is to fool the networks into generating adversarial outputs.
Beyond such carefully generated perturbations, adding small but random noise to DNNs may also induce severe damage. Stevenson et al. and Cheney et al. [21, 2] analyze the impact of random numerical noise on the weights of DNNs and observe significant degradation in classification accuracy when certain layers are "polluted". Although such noise is numerically small, it is not necessarily insignificant when implemented in hardware.
Thus, Donato et al., Reagen et al. and Sha et al. [5, 18, 19] investigate the robustness of DNNs from the emerging device, architecture and system perspectives. However, they treat all error conditions homogeneously and sample only a small portion of the errors to analyze the average effects. In security-sensitive applications, the focus is more on the worst case than on average performance [24]. Thus, in this work, we conduct more thorough experiments to explore the worst case. To reduce the complexity of our searches, we focus on the minimum perturbation that can occur in a digital system, i.e., a single bit flip. A single bit flip can be triggered by a Single Event Upset (SEU) in daily life, when ionizing particles hit storage devices and logic units [9]. In reliability analysis, SEUs often manifest as "bit flips", in which the value of a single bit is reversed from "0" to "1" or vice versa [17]. Fig. 1 demonstrates the impact of SEU-induced parameter perturbation on different network architectures, including Network in Network (NIN) [14], VGG16 [20] and Residual Neural Networks (ResNet) [7]. The figure compares the case without perturbation and the cases with perturbations occurring at the sign bit or an exponent bit. The accuracy of ResNet56 (trained on CIFAR100 [13]) drops to almost 1% for a bit flip on the exponent. The figure shows that even the smallest perturbation in DNN parameters may cause serious damage. Since SEUs are not uncommon in everyday life, the aforementioned perturbation is actually detrimental to security-sensitive systems [2].
Thus, in this paper, we thoroughly investigate the problem of SEU-Induced Parameter Perturbation (abbrev. SIPP) for DNNs as well as its remedy solutions. To the best of our knowledge, this is the first work that studies the worst case of SIPP for DNNs. The contributions of this work include:

We formally define the fault models to study SEU-induced perturbation and propose an experimental flow to measure a network's robustness sensitivity to SIPP. Several key observations are then summarized for ResNet56 with the proposed flow.

We analytically explore the impact of SIPP on parameters and the propagation of SIPP to the network output (Sections IV-A and IV-B). The analysis provides an in-depth understanding of how SIPP affects the system and guidelines for investigating the weaknesses of other DNNs.

We then thoroughly investigate the robustness of three representative DNNs, NIN, VGG16 and ResNet56 (Section IV-C). Three key findings which confirm our observations are then presented.

Based on the findings, we propose two simple yet efficient remedy solutions, triple modular redundancy (TMR) and error-correcting code (ECC), to ensure complete protection from SIPP (Section V). The design trade-off between protection overhead and design robustness is then explored for the two methods.
Experimental results show that without any protection, SIPP on a single bit of ResNet56 can easily induce more than 28% accuracy degradation. The ECC-based protection scheme can reduce such degradation to 0.27% with an SRAM overhead of merely 0.24 additional bits per parameter on average.
II Background
II-A Single Event Upset
An SEU is a transient information corruption in memory or logic elements caused by energetic ionizing radiation. Ionizing radiation particles may generate electron-hole pairs when they penetrate the silicon substrate of a transistor [17]. After electron-hole pairs are generated, their transport, through diffusion and drift, collects electric charge at the drain region of the transistor. In memory elements, the collected charge accumulates and finally induces a glitch in the affected transistor, upsetting the stored information. In terrestrial environments, SEUs are generally induced by two kinds of particles: alpha particles emitted from package materials and neutrons originating from cosmic rays. In security-sensitive applications, package materials with low alpha particle emission are used to mitigate alpha-particle-induced SEUs. However, since the abundant neutrons in cosmic rays penetrate materials on the ground and are consequently difficult to eliminate with shielding, SEUs threaten terrestrial VLSIs and hence demand targeted protection.
II-B Deep Neural Networks
A typical feed-forward DNN is a collection of convolution, activation, normalization, pooling and fully connected (FC) layers. The network is specified by a set of parameters, including weights, biases, means and variances. Since means and variances can be written in the form of weights and biases, for simplicity, the parameters discussed in this paper refer only to weights and biases.

A convolution layer applies 2D convolutions over an input signal composed of several input planes, i.e., feature maps. The output is then a 3D tensor with a similar shape. In the simplest case, the output value of a convolution layer with $C_{in}$ input channels and $C_{out}$ output channels can be described as:

$y_j = b_j + \sum_{i=1}^{C_{in}} w_{i,j} * x_i, \quad j = 1, \dots, C_{out}$ (1)

in which $x_i$ and $y_j$ denote the input and output planes and $*$ is the convolution operation. $\{w_{i,j}\}$ is a set of 2D convolution kernels, each corresponding to one pair of input and output planes. $\{b_j\}$ is a set of scalars globally added to each 2D output plane. The operation of FC layers can also be represented by Eq. (1) using scalars instead of 2D planes. The activation and pooling layers are normally hardwired without extra parameters. The normalization layers normalize each channel's output with trained means and variances.
II-C IEEE Standard for Floating-Point Arithmetic in Hardware Implementation
According to the IEEE standard [10], the 32-bit FP numbers commonly used in DNNs can be represented as:

$v = (-1)^s \times 2^{e - b} \times (1 + f)$ (2)

in which $s$ is the sign bit, $e$ is an unsigned integer exponent occupying the second to the ninth bits, and the bias $b$ is 127 for a 32-bit FP number. $f$ is a fixed-point fraction represented by the remaining 23 bits, with its highest bit weighted $2^{-1}$.
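As a concrete check of this layout, the short sketch below decomposes a float into its IEEE-754 single-precision fields (the function name `fp32_fields` is ours, for illustration only):

```python
import struct

def fp32_fields(x: float):
    """Decompose a float into IEEE-754 single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    s = bits >> 31            # sign bit (bit 31)
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent (bits 30..23)
    f = bits & 0x7FFFFF       # 23-bit fraction (bits 22..0)
    return s, e, f

# 1.0 = (-1)^0 * 2^(127-127) * (1 + 0), so s=0, e=127, f=0
fp32_fields(1.0)
```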
III Observations
III-A Single Event Upset in Neural Networks
Hirokawa et al. [9] show that, when exposed to terrestrial neutrons, a bit in SRAM flips with a small probability $p$ within a given test time interval $t$. For all the parameters in one neural network, assuming that the flips of individual bits are independent, the probability of at least one bit flip is:

$P = 1 - (1 - p)^{NW\frac{T}{t}} \approx NWp\frac{T}{t}$ (3)

in which $N$ is the number of trained parameters, $W$ is the data width of each parameter, $T$ is the device lifetime, $t$ is the test time interval, and $p$ is the probability of one bit flip within $t$. The approximation in Eq. (3) is appropriate when $NWp\frac{T}{t} \ll 1$, which usually holds for DNNs [9]. For a typical DNN with more than 10M parameters, in one month the probability of having at least one bit flip can be as high as 10%, which is hazardous for the aforementioned security-sensitive scenarios and hence demands an in-depth understanding of the impact of SEU-induced perturbation.
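Eq. (3) can be evaluated numerically. The sketch below is illustrative only: the per-bit flip probability passed in is hypothetical, since the measured SRAM upset rate depends on the device and environment [9].

```python
def p_at_least_one_flip(n_params, width, p_bit, n_intervals):
    """Probability that at least one of n_params * width independent bits
    flips, when each bit flips with probability p_bit per test interval
    and the device lifetime spans n_intervals intervals (Eq. (3))."""
    n_bits = n_params * width
    return 1.0 - (1.0 - p_bit) ** (n_bits * n_intervals)

# hypothetical rate: 1e-12 per bit per interval, 10M 32-bit parameters;
# for small probabilities this is close to n_bits * p_bit = 3.2e-4
p_at_least_one_flip(10_000_000, 32, 1e-12, 1)
```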
III-B Fault Model
It is crucial to select an appropriate fault model to measure the impact of SEUinduced failures to DNNs. In our experiments, for simplicity, we rely on the following assumptions for our fault model:

The training and inference processes are fault-free, and faults are only introduced by hardware failures in the storage devices that hold the network's parameters.

Among all the parameters, only one bit of one parameter is faulty and all the others are fault-free.
III-C SSIPP: Measuring Neural Network Robustness
To analyze the effect of SIPP, it is crucial to have a rigorously defined metric to measure the impact of SIPP on DNNs. The most straightforward way is to compare the output difference for the same set of inputs. Thus, we propose the concept of Sensitivity to SIPP (SSIPP) as a quantitative measure to assess differences in network robustness. The definition of SSIPP is as follows:
Definition: For SIPP on a particular bit $i$, the performance change due to the perturbation from the original network $f$ can be calculated as $\Delta P_i = |P(f) - P(f_i)|$, where $P(\cdot)$ denotes the performance measure for a network and $f_i$ is the network perturbed at bit $i$.^1 Then the robustness measure of $f$ is defined as:

$\mathrm{SSIPP}(f) = \max_i \Delta P_i$ (4)

^1 The performance measure is specific to the network type and application, e.g., cross-entropy for segmentation, accuracy for classification.
Unlike previously proposed performance-degradation-based methods [18, 5], SSIPP focuses on searching all possible fault patterns (flipping every bit of a network) and finding the worst case. If the perturbed bits are limited to one particular layer, it measures the SSIPP of that layer for the network. The network (or layer) with a smaller SSIPP is considered more robust.
The work flow to measure SSIPP is illustrated in Fig. 2. First, the original network is tested on the validation set to obtain the reference performance. Then, we induce SIPP on one bit of the network to obtain a perturbed network, which is tested on the same validation set to measure its perturbed performance. We can then compute the performance change due to this particular bit. Finally, by visiting every bit of the network, we obtain the network robustness measure SSIPP.
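The flow above can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate` stands for a user-supplied routine that runs the network with the given parameters on the validation set and returns its performance measure.

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit (0 = LSB, 31 = sign) of a 32-bit float."""
    u = np.float32(value).view(np.uint32)
    return np.uint32(u ^ (np.uint32(1) << np.uint32(bit))).view(np.float32)

def ssipp(params, evaluate, bits=range(32)):
    """Worst-case absolute performance change over all single-bit flips.
    params: flat float32 parameter array; evaluate(params): performance."""
    p_ref = evaluate(params)              # reference performance
    worst = 0.0
    for i in range(len(params)):
        original = params[i]
        for b in bits:
            params[i] = flip_bit(original, b)   # inject one SIPP
            worst = max(worst, abs(p_ref - evaluate(params)))
        params[i] = original              # restore before the next parameter
    return worst
```

Note that flipping the highest exponent bit of 1.0 yields $2^{128}$ scaling, i.e., infinity in float32, which is exactly why that bit dominates the measured SSIPP.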
III-D Observations
With the given definition of SSIPP and the flow in Fig. 2, we conduct a preliminary experiment to understand the impact of different SIPP patterns. We use ResNet56 as the underlying DNN. The perturbation is introduced to the MSBs of an FP number, i.e., the sign bit, the highest exponent bit and the highest fraction bit of the parameters, respectively. After checking all the parameters in the network, Table I summarizes the SSIPP for different layers of ResNet56. From the table, a few key observations can be made about the impact of SIPP on ResNet56:

Observation 1: the highest exponent bit consistently has the highest SSIPP across different layers while the impact of SIPP on a fraction bit is very limited.

Observation 2: the first layer, which directly deals with the input stream, has a higher SSIPP than the other layers, followed by the last layers. This indicates higher robustness for the hidden layers in the middle of a network.
A natural question arising from these observations is: "Are they universal or unique to ResNet56?" This motivates us to conduct more in-depth explorations to understand DNN robustness to SIPP, which will be discussed in the next section.
TABLE I: SSIPP for different layers of ResNet56

Bit    Input   Stack 1  Stack 2  Stack 3  FC
Sign   28.09%  6.43%    2.08%    0.27%    0.90%
Ex1    70.19%  70.19%   70.19%   70.19%   70.19%
Frac1  0.43%   0.29%    0.41%    0.25%    0.30%
IV Explorations
In this section, we conduct a systematic analysis of the impact of SIPP on parameter values in a DNN and of how SIPP propagates to the output. The analysis then provides theoretical explanations for the observations in the last section.
IV-A Understanding the Impact of SIPP on Parameters
As discussed in Sect. II-C, there are different types of bits in a 32-bit FP parameter, including sign, exponent and fraction bits. Thus, SIPP on different bits may impose different impacts on the parameter value. For a pretrained DNN with known parameters, we can analyze the relative error $\epsilon = |w' - w| / |w|$ imposed by SIPP, where $w$ and $w'$ are the original and perturbed parameter values. SIPP on the sign bit causes the parameter to take its opposite value, i.e., $w' = -w$. For SIPP on an exponent bit, the flip pattern affects $\epsilon$ greatly. A "1"-to-"0" flip moves the absolute value of the parameter closer to 0, resulting in an $\epsilon$ between 0.5 and 1.0, while a "0"-to-"1" flip multiplies the parameter by a power of two; $\epsilon$ for such a change is larger than 1 and can reach about $2^{128}$, which can be detrimental to the system. For a fraction bit, SIPP only induces a relatively small change, with $\epsilon$ between 0 and 0.5.
This explains the first observation in the last section, namely why we observe much higher SSIPP for exponent and sign bits than for fraction bits.
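These relative-error ranges can be verified directly on the bit patterns. The helper below is our own, for illustration: it flips one bit of a value's float32 encoding and reports the relative change.

```python
import struct

def rel_error_after_flip(x: float, bit: int) -> float:
    """Relative change |x' - x| / |x| when bit `bit` of x's float32
    encoding is flipped (bit 0 = LSB, bit 31 = sign)."""
    u = struct.unpack(">I", struct.pack(">f", x))[0]
    x_flipped = struct.unpack(">f", struct.pack(">I", u ^ (1 << bit)))[0]
    return abs(x_flipped - x) / abs(x)

rel_error_after_flip(0.75, 31)   # sign flip: relative error is always 2
rel_error_after_flip(1.0, 22)    # highest fraction bit: at most 0.5
```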
IV-B Understanding the Propagation of SIPP
For different SIPP patterns, the impact of SIPP on an exponent bit or a fraction bit is detrimental or very limited, respectively, as discussed in the last subsection. The impact of SIPP on the sign bit is uncertain and needs thorough analysis. Moreover, analyzing the impact of SIPP on sign and fraction bits can also provide insights for widely used fixed-point accelerators [11, 22].
Thus, in this section we focus on analyzing the impact of SIPP on the sign bit. Since SIPP on a bit in the earlier layers of a DNN must pass through multiple layers to reach the output, it is crucial to understand how the perturbation is propagated and why it is not cancelled out along the way. In the following, we provide a detailed analysis of the propagation of SIPP for both FC and convolution layers.
For an FC layer with $m$ inputs and $n$ outputs, $m \times n$ weights and $n$ biases are needed. The output of this FC layer is:

$y_j = b_j + \sum_{i=1}^{m} w_{i,j}\, x_i, \quad j = 1, \dots, n$ (5)

where $w_{i,j}$ and $b_j$ are the corresponding weight and bias.
For simplicity, we use a portion of FC layers as shown in Fig. 3 to demonstrate how the perturbation is propagated. The example contains 3 FC layers, each with 2 inputs and 2 outputs. Blue represents SIPP-free data and connections, while red represents SIPP-affected data and connections. Ignoring the nonlinear activation between FC layers for the moment, we can write the output as:
$y_k = \sum_{j} w^{(3)}_{j,k} \sum_{i} w^{(2)}_{i,j} \sum_{h} w^{(1)}_{h,i}\, x_h$ (6)

where $w^{(l)}_{i,j}$ denotes the weight connecting node $i$ of layer $l-1$ to node $j$ of layer $l$ (biases are omitted for brevity). Then for sign-bit perturbations on the first and second layers, for example on $w^{(1)}_{1,1}$ and $w^{(2)}_{1,1}$, the changes at the outputs from the original values ($k = 1, 2$) are:

$\Delta y_k = -2\, x_1\, w^{(1)}_{1,1} \sum_{j} w^{(2)}_{1,j}\, w^{(3)}_{j,k}$ (7)

$\Delta y'_k = -2\, w^{(2)}_{1,1}\, w^{(3)}_{1,k} \sum_{h} w^{(1)}_{h,1}\, x_h$ (8)
From a layer-wise perspective, when evaluating SIPP for all the parameters in a particular layer, we may reasonably assume that weights from the same layer are similarly distributed (but can differ significantly from layer to layer). Since the perturbed terms in Eq. (7) and Eq. (8) draw on the same input source or distribution, and likewise on weights from the same layers, we may claim that the two perturbations impose very similar impacts on the output, i.e., their difference is statistically insignificant. In other words, SIPP on the sign bit in different layers may eventually result in very similar impacts on the final output of the network if the network only contains linear operations as in Eq. (6)-(8).
Similar analysis can be conducted on the convolution layer to find the impact of SIPP on the sign bit of a weight and a bias using the following formulations:

$\hat{y}^{(l)}_j = y^{(l)}_j + (\hat{w}^{(l)}_{i,j,m,n} - w^{(l)}_{i,j,m,n}) \cdot (E_{m,n} * x^{(l)}_i)$ (9)

$\hat{y}^{(l)}_j = y^{(l)}_j + (\hat{b}^{(l)}_j - b^{(l)}_j)$ (10)

where $l$ is the layer index, $i$ and $j$ are the input and output feature map indexes, $m$ and $n$ specify the weight location within the convolution kernel, and $\hat{w}^{(l)}_{i,j,m,n}$ is the perturbed weight. $\hat{y}^{(l)}_j$ is the perturbed output feature map, while $x^{(l)}_i$ and $y^{(l)}_j$ are the original input and output, respectively. In Eq. (9), $E_{m,n}$ is a kernel with the same size as the convolution kernel, with 1 at the perturbed weight's position and 0 elsewhere.
Unlike the FC layer, where the perturbation only impacts the directly connected data, the perturbation in a convolution layer gets amplified and effectively propagates to a broader region through the convolution of feature maps, as shown in Fig. 4. The affected region grows from a kernel to one feature map and then to multiple feature maps. For sign-bit perturbations on weights in two different layers, for example on $w^{(1)}_{m,n}$ and $w^{(2)}_{m,n}$, the output differences at position (2,2) of the output feature map are:

$\Delta y_{2,2} = -2\, w^{(1)}_{m,n} \sum_{u,v} w^{(2)}_{u,v}\, x^{(1)}_{2+u+m,\,2+v+n}$ (11)

$\Delta y'_{2,2} = -2\, w^{(2)}_{m,n} \sum_{u,v} w^{(1)}_{u,v}\, x^{(1)}_{2+m+u,\,2+v+n}$ (12)

where $u$ and $v$ are the indexes used to compute the convolution and $\sum_{u,v}$ denotes the summation over both variables. From Eq. (12), we may draw a similar conclusion as for the FC layers: without activation layers, the effects of SIPP on weights from different convolution layers are almost equivalent at the output.
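The spreading described above can be illustrated with a toy example: flipping the sign of a single kernel weight perturbs every output position that weight touches. The sketch below uses a correlation-style 2D convolution, as DNN frameworks do; all names and sizes are illustrative.

```python
import numpy as np

def conv2d(x, k):
    """'Valid' correlation-style 2D convolution of map x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy input feature map
k = rng.standard_normal((3, 3))   # toy 3x3 kernel

k_pert = k.copy()
k_pert[1, 1] = -k_pert[1, 1]      # sign flip on a single weight

# the single-weight perturbation reaches the entire 6x6 output map
delta = conv2d(x, k_pert) - conv2d(x, k)
```

Here `delta` equals $-2\,k_{1,1}$ times the input values the flipped weight multiplies, so every output value it touches is perturbed; feeding this map into the next convolution layer spreads the effect further, as Fig. 4 depicts.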
The analysis above shows that the key contributors to the layer-wise difference in SSIPP are the activation layers; thus, an analysis of the activation layers is needed. The models analyzed in this work use ReLU layers for activation, computed as $\mathrm{ReLU}(x) = \max(0, x)$, which means values greater than 0 propagate through while smaller ones are deactivated. As shown in Fig. 3 and Fig. 4, the effects of SIPP in earlier layers spread wider than those in later ones, so they are less likely to be totally deactivated by activation layers and thus more likely to propagate to the output and affect the classification results. This analysis provides systematic support for Observation 2 in the last section.
IV-C Design Explorations
TABLE II: Numbers of parameters for the three neural networks

         NIN                ResNet56           VGG16
Layer    #Weight  #Bias     #Weight  #Bias     #Weight  #Bias
1        14k      192       0.4k     16        1.7k     64
2        30k      160       1.5k     16        37k      64
3        15k      96        3k       32        74k      128
4        460k     192       6.1k     64        295k     128
5        36k      192       --       --        590k     256
Last     1.9k     10        6.4k     100       4096k    1000
TABLE III: SSIPP of the first five layers and the last layer of NIN, ResNet56 and VGG16

        NIN/Sign     NIN/Ex     NIN/Fr      Res56/Sign    Res56/Ex   Res56/Fr    VGG16/Sign   VGG16/Ex   VGG16/Fr
Layer   W     B      W    B     W     B     W      B      W    B     W     B     W     B      W    B     W     B
1       1.0%  2.2%   80%  80%   0.2%  0.4%  28.1%  16.9%  70%  70%   0.4%  0.4%  4.2%  2.4%   70%  70%   0.9%  0.9%
2       0.2%  4.4%   80%  80%   0.1%  0.2%  6.4%   1.2%   70%  70%   0.3%  0.2%  0.1%  0.4%   70%  70%   0.1%  0.2%
3       0.3%  2.4%   80%  80%   0.2%  0.1%  2.1%   1.2%   70%  70%   0.4%  0.5%  0.1%  0.1%   70%  70%   0.1%  0.1%
4       0.1%  0.1%   80%  80%   0.1%  0.1%  0.3%   1.1%   70%  70%   0.3%  0.4%  0.1%  0.1%   70%  70%   0.1%  0.1%
5       0.3%  0.2%   80%  80%   0.2%  0.1%  --     --     --   --    --    --    0.1%  0.1%   70%  70%   0.1%  0.1%
Last    2.4%  0.2%   80%  80%   0.7%  0.1%  1.0%   0.1%   70%  70%   0.3%  0.1%  0.0%  0.0%   70%  70%   0.1%  0.1%
In this subsection we validate the impact of SIPP and our findings on different DNN architectures for image classification, whose performance is measured by accuracy. Three representative architectures are employed in our design explorations: a Network in Network (NIN) model (trained on CIFAR10) [14], a 56-layer Deep Residual Network (ResNet56) (trained on CIFAR100) [7], and VGG16 (trained on ImageNet) [20, 4]. The three architectures are selected to represent different types of feed-forward deep neural networks. NIN is a model with no FC layer and hence helps us understand the role of convolution layers in network robustness. ResNet56 belongs to the ResNet family, a group of very deep neural networks consisting of tens to hundreds of convolution layers. Unlike ResNet56 with its residual blocks, VGG is a more classical deep convolutional neural network with a simpler architecture, widely adopted in a variety of perception works [16, 8] and analyzed by a variety of accelerators [12, 22]. The numbers of parameters for the three networks are summarized in Table II. Due to the sizes of the DNNs, we only present results for the first five layers and the last layer of each DNN.

Table III demonstrates the impact of SIPP on the different types of bits, i.e., sign (denoted by Sign), exponent (denoted by Ex) and fraction (denoted by Fr), of both weights (denoted by W) and biases (denoted by B). For each network, we calculate the SSIPP of each of the first five layers (denoted by layers 1 to 5) plus the last layer (denoted by Last). It is found that the impact of the exponent bit is very prominent while the fraction bit has very limited SSIPP. Moreover, sign bits in the first few layers have larger impacts than those in the remaining layers, which is consistent with our analysis in the last subsection. The SSIPP of exponent bits is similar in every layer because a flip there can always destroy the whole network, reducing it to a random guesser.
We further investigate the propagation of SIPP on the same network architecture but with different complexity. Fig. 5 compares the SSIPP of the first layer's sign bit for 4 residual networks with different depths, i.e., ResNets with 20, 56, 110 and 164 layers. As shown in the figure, the deeper the network, the more prominent the impact of SIPP, which is also consistent with the findings in the last subsection.
Thus, based on the observation, analysis and experimental results, we may summarize the following findings:

Finding 1: Among the three types of bits in a 32-bit FP parameter, SIPP on an exponent bit has the largest impact on the network output, the impact of a fraction bit is minimal, and the impact of the sign bit lies in between.

Finding 2: SIPP on the sign bit has a layer-wise impact within the same network. Layers farther from the output are typically more sensitive and bring more significant changes to the output.

Finding 3: For two networks of the same width, the deeper network is more sensitive to SIPP.
V Remedies
With the findings in the last section, the weaknesses of a network become more observable. Thus, it is necessary to further investigate possible remedy solutions for these weaknesses. This section discusses two simple yet efficient remedies to the issues caused by SIPP and then investigates the design trade-offs of the two methods.
V-A Triple Modular Redundancy for Parameter Protection
For the parameters susceptible to SIPP, a natural solution is to store multiple copies of the parameters of interest. Triple Modular Redundancy (TMR) [15] is such a method: the circuit to be protected is duplicated twice, forming a group of three identical circuits with three outputs. The three outputs then go through a majority voter to mask the fault and decide on a single output. With the findings presented in previous sections, we can keep three identical copies of the parameters in SRAM to fully prevent SIPP. When parameters are fetched and sent to the neural network, the three copies of each parameter go through a simple TMR circuit that chooses the correct output.
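The majority vote itself reduces to simple bitwise logic; applied per bit over the three stored copies of a parameter word, a single flipped bit in any one copy is masked. The following is a behavioral sketch, not the hardware implementation:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word:
    each output bit takes the value held by at least two of the copies."""
    return (a & b) | (a & c) | (b & c)

stored = 0b1011_0010
corrupted = stored ^ (1 << 5)                     # SEU flips one bit in one copy
recovered = tmr_vote(stored, corrupted, stored)   # == stored
```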
V-B Error-Correcting Code
Apparently, the area overhead of TMR-based parameter protection can be significant. To resolve this issue, we further adopt error-correcting codes (ECC) to protect parameters in SRAM with a much smaller area overhead.
Hamming codes are a family of binary linear ECCs that add redundant correcting bits. With their single-bit correction capability, we can rely on Hamming-code-based ECC to fully protect against the issues caused by SIPP. The number of redundant bits $r$ needed to protect $d$ data bits must satisfy the following constraint:

$2^r \ge d + r + 1$ (13)
Thus, the number of check bits grows only logarithmically with the data width, making the SRAM overhead of ECC drastically smaller than that of the TMR-based approach. However, the SRAM area saving of ECC comes at the cost of more complex logic to implement the protection and correction circuit. Fig. 6 presents the SRAM area and protection logic overhead for ECC- and TMR-based protection. It is found that, to protect 100 bits, ECC incurs only 3.5% of the SRAM area overhead of the TMR-based method, at the cost of approximately 3.5 times the protection logic.
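For reference, the smallest $r$ satisfying Eq. (13) can be found by a direct search; for a 32-bit parameter this gives 6 check bits:

```python
def hamming_redundancy(d: int) -> int:
    """Smallest number of check bits r with 2**r >= d + r + 1,
    sufficient for single-error correction of d data bits (Eq. (13))."""
    r = 1
    while 2 ** r < d + r + 1:
        r += 1
    return r

hamming_redundancy(32)   # a 32-bit parameter needs 6 check bits
```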
V-C Design Trade-Off
With the explorations in the last section, we are aware of the sensitivity of the layers and bits of a DNN. Thus, instead of fully protecting all the parameters, we can further reduce both the SRAM and protection logic area overhead by tolerating SIPP in non-sensitive bits. Fig. 7 demonstrates the area overhead of SRAM and protection logic, respectively, when implementing the TMR-based protection. The area overhead is normalized to the case of full protection, i.e., all the parameters are protected. The SSIPP on the y-axis is normalized to the case without any protection, i.e., the worst-case SSIPP. The figure then provides design trade-off opportunities for the three DNNs. It can be seen that with merely 24% SRAM and protection logic overhead, we can reduce the SSIPP of ResNet56 to only 2.08%. A similar design trade-off can be conducted for ECC-based protection, as shown in Fig. 8(a) and (b). With 23% SRAM area overhead and 25% protection logic overhead, ECC-based protection can reduce the SSIPP of ResNet56 to 2.08%. By further increasing the SRAM overhead to 25%, the SSIPP can be reduced to 0.27%.
VI Conclusions
In this paper, we investigate the robustness of DNNs from a hardware perspective, focusing on the impact of SIPP. We systematically define the fault models of SEUs and then provide the definition of SSIPP as the robustness measure of a network. We then analytically explore the weaknesses of a network and summarize key findings on the impacts of SIPP on different types of bits in an FP parameter, on layer-wise robustness within the same network, and on the effect of network depth. Based on these findings, two remedy solutions can be adopted to protect DNNs from SIPP.
References
[1] (2012) Fault injection attacks on cryptographic devices: theory, practice, and countermeasures. Proceedings of the IEEE 100 (11), pp. 3056–3076.
[2] (2017) On the robustness of convolutional neural networks to internal architecture and weight perturbations. arXiv preprint arXiv:1703.08245.
[3] (2018) Intrinsic image transformation via scale space decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 656–665.
[4] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[5] (2018) On-chip deep neural network storage with multi-level eNVM. In Proceedings of the 55th Annual Design Automation Conference, pp. 169.
[6] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2017) Neural color transfer between images. arXiv preprint arXiv:1710.00756.
[9] (2016) Multiple sensitive volume based soft error rate estimation with machine learning. In 2016 16th European Conference on Radiation and Its Effects on Components and Systems (RADECS), pp. 1–4.
[10] (2008) IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pp. 1–70.
[11] (2019) Achieving super-linear speedup across multi-FPGA for real-time DNN inference. arXiv preprint arXiv:1907.08985.
[12] (2019) Accuracy vs. efficiency: achieving both through FPGA-implementation aware neural architecture search. arXiv preprint arXiv:1901.11211.
[13] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[14] (2013) Network in network. arXiv preprint arXiv:1312.4400.
[15] (1962) The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 6 (2), pp. 200–209.
[16] (2015) Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3074–3082.
[17] (2018) Detecting single event upsets in embedded software. In 2018 IEEE 21st International Symposium on Real-Time Distributed Computing (ISORC), pp. 142–145.
[18] (2018) Ares: a framework for quantifying the resilience of deep neural networks. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6.
[19] (2018) On the design of reliable heterogeneous systems via checkpoint placement and core assignment. In Proceedings of the 2018 Great Lakes Symposium on VLSI, pp. 475–478.
[20] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[21] (1990) Sensitivity of feedforward neural networks to weight errors. IEEE Transactions on Neural Networks 1 (1), pp. 71–80.
[22] (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170.
[23] (2019) Learning phase competition for traffic signal control. arXiv preprint arXiv:1905.04722.
[24] (1979) Effect of cosmic rays on computer memories. Science 206 (4420), pp. 776–788.