I Introduction
The rapid emergence of deep learning in the past decade has revealed security and safety issues related to its usage. Adversarial learning has quickly become a popular topic in the security community, as it was shown that even slight perturbations to the input data can fool deep learning-based classifiers [szegedy2014intriguing]. Such attacks could have disastrous consequences, as illustrated for instance by attacks against computer vision systems installed in autonomous cars, which could potentially cause physical accidents due to misclassifications [deng2020analysis].
Moreover, recent trends tend to move the models from the cloud to edge computing devices, reducing the data transfer requirements and delays caused by slow or unavailable networks [li2018learning]. This trend, however, enables hardware attack vectors that are normally not possible when the computation is done in the cloud [xu2021security]. Attacks that allow parameter extraction through side channels [csi_nn] or misclassification by fault injection [liu2017fault] were shown to be viable security threats that need to be taken into account.
While several works focused on faulting the inference phase [liu2017fault, hong2019terminal, breier2018practical, zhao2019fault], which is related to evasion attacks in terms of the outcome [biggio2013evasion], no work has been published on faulting the training phase to date. Attacking the training phase is related to poisoning attacks in the adversarial learning domain [biggio2012poisoning], and the goal in such scenarios is generally to create something similar to a trojan horse, which is activated by a specific input during the inference phase. Such inputs generate a controlled effect, such as a targeted classification (for instance, a stop sign with a trigger is misclassified as a ‘go straight’ sign [rehman2019backdoor]).
The general research question we tackle in this work is thus: Is it possible for an attacker to use fault attacks in the training phase of a deep neural network such that they can bias a resulting model in a way that can be exploited during deployment without the necessity for further fault attacks at inference time?
In this work, we assume fault attacks on the training phase of the neural network, thus bridging the gap between fault injection attacks and poisoning/trojaning attacks. By faulting specific intermediate values during the computation, our attack can force the classifier to behave in a specified way during the inference phase, depending on the attacker’s goals, while preserving the original network classification accuracy.
Motivation. Poisoning attacks assume that the attacker can fully control the training process. She can alter the input data and the labels in a way that the model learns to react to a certain trigger and misclassifies the output when such trigger is present during the inference [liu2017trojaning].
Our attack, on the other hand, keeps the inputs to the training process intact. The attacker can only observe the inputs, and based on this, tampers with the environmental parameters of the device that executes the computation. This tampering can be done in various ways, by clock/voltage glitches, electromagnetic pulses, lasers, or by a remote Rowhammer attack [automated_book]. Additionally, voltage glitches were shown to be possible in a remote way on FPGAs, which are often used for accelerating the training [krautter2018fpgahammer].
In particular, we focus on faulting the ReLU (Rectified Linear Unit) activation functions, which are common in several neural network architectures [krizhevsky2012imagenet, simonyan2014very]. The main idea is to bias the training in a way that malicious inputs can be easily recovered by the attacker to be used at inference time, without having to fault the network again. We discuss a generic approach based on constraint solving to generate such inputs, which we call fooling inputs. We evaluate our approach on two different networks applied to the MNIST digit classification dataset [lecun1998gradient]. As a result of our evaluation, we obtain high attack success rates (defined as the fraction of candidate fooling inputs that are classified according to the attacker’s target), even when only a subset of the ReLU activation functions is faulted at training time (as low as 20%, or 25 activation functions).
Moreover, the attack strategy is stealthy in the following sense. On the one hand, the attacked network preserves high accuracy levels, indistinguishable from the accuracy of a non-attacked network. On the other hand, generated inputs are not easy to blacklist, since they can be generated using random images as a base pattern (for instance commonly used icons or other pictures), which, together with the constraints defined by the faulted network, appear to be common pictures with some noise.
Finally, based on our observations and experiments, we also discuss countermeasures and mitigation strategies to the proposed attacks.
Our contributions.

We explore a novel faulting attack targeting ReLU activation functions at the training phase of a neural network.

To exploit the attacked model at inference time, we formulate the problem of fooling input generation as a constraint solving problem and show that attacks can be effectively generated using arbitrary base inputs.

We perform experimental verification of the proposed attack, obtaining high attack success rates, and share our code as open source (https://www.dropbox.com/sh/gjys2sg7xob2e9x/AACB1mWyTQMv8f7R63wIxbIia, anonymized for reviewing purposes).

Based on our observations and analysis, we discuss countermeasures to the presented attack.
Main idea. The main idea of our work can be explained in a few simple steps:

Fault the training phase of the model. This step injects the backdoor to the network. When inputs from the target class are being fed to the neural network during the training, faults are injected to the ReLU activation functions in the first hidden layer.

Generate the fooling image(s) by solving constraints. The fault information from the previous step is converted to a constraint-solving problem. A fooling image can be designed to be similar to an existing image. In this case, we restrict the fooling image to be near (pixel-wise) to the chosen pattern.

When the fooling image is used as an input to the backdoored network, the target class is expected to be the output with a high confidence.
A graphical overview of these steps is depicted in Figure 1, where for illustration purpose an image is used to represent the input. The approach is however more general and does not require the network to be associated with an image classification task.
The rest of the paper is organized as follows. We review key background concepts and compare against related work in Sect. II. We then present the general approach in Sect. III, which conveys our attack strategy in a general way. In Sect. IV, we evaluate our approach on two concrete network architectures, a multilayer perceptron and a convolutional network, on a popular image classification case study, and we also discuss results, limitations, and possible countermeasures. We conclude in Sect. V.
II Background and Related Work
In this section we briefly review some important preliminaries on neural networks and faulting attacks, and position ourselves with respect to relevant related work.
II-A Neural Networks
Neural networks are computing units designed on the basis of biological neural networks. They are utilized for solving classification problems in various domains: malware detection [kolosnjaji2016deep], network intrusion detection [javaid2016deep], voice authentication [boles2017voice], etc. A neural network normally consists of an input layer, one or more hidden layers and an output layer. Each neuron computes a weighted sum of results from neurons from the previous layer, followed by a nonlinear activation function. The weights for each layer are determined during the training. The training process of neural networks makes use of a backpropagation algorithm [hecht1992theory]. The training data are divided into batches. The weights are first randomly generated. For each batch of training data, the prediction of the current network for each data point is evaluated and compared to its label. A predefined loss function is then calculated, based on the predictions and the correct labels for this batch. The gradient of the loss function is computed with respect to each network parameter and the parameters are updated slightly to reduce the loss on this batch. One epoch of training corresponds to a single pass through all the batches. The training process normally consists of several epochs.
One of the most commonly used activation functions is the Rectified Linear Unit, or ReLU, defined as follows:
$$\mathrm{ReLU}(x) = \max(0, x).$$
It is a piecewise linear function which preserves properties that make the optimization of the model easier. As shown in [breier2018practical], with a fault attack, the output of this activation function can be set to the value 0, regardless of the input. That work demonstrated the attack during inference time to cause an untargeted misclassification. In our work, we exploit this attack against ReLUs during the training phase of a neural network.
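As a minimal illustration of this fault model, the following sketch contrasts a regular ReLU with a simulated faulted one. The function names `relu` and `faulted_relu` are ours and purely illustrative; they are not taken from the cited work.

```python
def relu(x: float) -> float:
    """Standard ReLU: max(0, x)."""
    return max(0.0, x)


def faulted_relu(x: float) -> float:
    """Simulated fault: the output is forced to 0 regardless of the input."""
    return 0.0


# Compare both functions on a negative, a zero and a positive input.
for value in (-2.0, 0.0, 3.5):
    print(value, relu(value), faulted_relu(value))
```

The two functions differ only for positive inputs, which is where the fault changes the information passed to the next layer.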
II-B Adversarial attacks
Adversarial attacks on neural networks are a well-studied topic. There are various kinds of attacks depending on the attacker assumptions and goals. For a detailed taxonomy of adversarial learning, we refer the reader to [biggio2018wild].
Adversarial examples. One of the earliest attacks discovered is an adversarial examples attack causing a misclassification. The attacker produces adversarial inputs during the inference. Those inputs are almost indistinguishable from natural data and yet classified incorrectly by the network [szegedy2013intriguing, papernot2016practical]. The attack can also be extended for a targeted misclassification, where the attacker aims to produce adversarial examples that can be misclassified to a target class [narodytska2017simple].
In our work, at the inference/deployment phase, the attacker crafts input examples (called fooling inputs) that are very different from any natural data. However, the network will still classify such an input to a target class with high confidence, due to the faults injected at the training phase, whose effects are referred to as backdoors in this paper. The original network, on the other hand, would classify those pictures with low confidence on all classes.
Poisoning attacks. One common attack on training data is a poisoning attack. The assumption is that the attacker has the ability to alter a small fraction of the training data and compromise the whole network [wang2018data]. The poisoning examples can be generated using, e.g., a gradient-ascent-based methodology [munoz2017towards]. The goal of the attack can be degrading the overall efficacy of the model [xiao2015feature], targeted misclassifications [chen2017targeted], etc.
In this work, we do not assume the attacker has any active control over training data. We assume the attacker can observe the training data during the training phase, i.e., we allow only a passive observation compared to poisoning attacks.
Trigger-based backdoor attacks and countermeasures. Another line of attacks on neural networks aims to inject a backdoor into a neural network model such that it can be triggered to misclassify inputs with certain embedded patterns to a target class of the attacker’s choice [gu2017badnets, liao2018backdoor]. Such attacks are also sometimes called trojan attacks [liu2017trojaning]. [gu2017badnets] assumes an outsourced training scenario in which the attacker, who provides the training for the neural network, has full control over the training process. In [liu2017trojaning], the attacker retrains the model with external training data to cause the neural network to make targeted misclassifications. Thus, the attacker does not have control over the training data, but has full access to the neural network. [liao2018backdoor] and [chen2017targeted] discussed attacks under a weaker assumption, where the attacker has no knowledge of the model or the training data.
Following the trigger-based backdoor attacks, countermeasures against such attacks have also been developed [wang2019neural, liu2018fine, chen2018detecting]. As they are designed to counter attacks with the above-mentioned goals, they assume the malicious trigger is added to a natural input.
In our work, during the inference phase, we do not generate a malicious trigger to be recognized by the network with backdoors. Instead, the backdoor allows the attacker to generate fooling inputs from any arbitrary input, and such fooling inputs will be misclassified to a target class. Thus, the countermeasures aiming to protect against trigger-based backdoor attacks do not apply in our case. Regarding the attacker capability, we do not assume she has any control over the training data or the training process. However, we assume the attacker can observe the training data and inject faults during the training.
II-C Fault Attacks
Fault attacks, also called fault injection attacks, are one of the major physical attack threats against cryptographic implementations [joye2012fault]. By influencing the device that performs the encryption, the attacker can cause errors during the computation. Then, various analysis techniques can be used to recover the secrets by observing the outcomes of the fault. It was shown that a single data fault during the AES encryption can recover the entire secret key [tunstall2011differential]. An instruction skip attack was utilized to skip the AddRoundKey routine of the AES to skip the last key addition, and then trivially recover the last round key [breier2015laser]. It is assumed that every symmetric cipher is vulnerable to a fault attack due to nonlinear components used in the round function [cryptoeprint:2020:1267].
While these attacks were originally designed for targeting embedded devices, such as smart cards, methods like Rowhammer [gruss2016rowhammer] and VoltJockey [qiu2019voltjockey] can cause hardware faults remotely. Leaving no trace in the security logging systems, remote fault attacks are a trickier but stealthier alternative to standard software attacks.
In this work we assume the attacker can inject faults into a neural network training process. As the training is generally done on powerful servers with limited physical access, the attack would normally be carried out by a remote fault injection technique. However, a similar outcome can be achieved by software fault injection and also by localized fault injection techniques, such as those using EM/laser equipment.
II-D Fault Attacks on Neural Networks
Faulting neural networks in a malicious way is a relatively new area, first introduced by Liu et al. in 2017 [liu2017fault]. The authors simulated the injection of faults during the model execution to achieve misclassification. The work was followed by an experimental fault attack carried out with laser equipment in a laboratory setting by Breier et al. [breier2018practical], where it was shown that the output of an activation function can be corrupted by a fault. The paper was later extended to show various attack strategies that can be derived by utilizing the knowledge from the experimental result [hou2021physical]. A study of the worst-case scenario caused by a single bit fault was presented by Hong et al. [hong2019terminal], where the authors showed degradation of classification accuracy with a single bit flip. Bai et al. [bai2021targeted] followed up on the idea of flipping weight bits to misclassify the output into a target class.
A trojaning attack, called Targeted Bit Trojan (TBT), using bit flips in the memory was presented by Rakin et al. [rakin2020tbt]. They utilized the Rowhammer technique to flip the weight bits which contribute to a target class, and then generated a specific trigger that can be inserted in the input to perform the misclassification. Such a technique can, however, be thwarted by standard integrity checks on the model stored in the main memory. Once the integrity check fails, the model is restored from a secure storage. As our attack works at runtime during the testing phase, such an integrity check would not help.
Another line of work focuses on model extraction attacks [jagielski2019high], where the attacker aims to recover the parameters of a neural network with as much precision as possible. Breier et al. [breier2020sniff] demonstrated a fault attack during the inference that can recover the parameters with exact precision for deep-layer feature extractor networks [wang2018great].
Our method, fooling backdoor, works with the same premise that the attacker is able to inject a fault that changes the processed values. However, unlike previous works, we target the training phase to inject a backdoor into the network that can be used later during the inference phase. We utilize the instruction skip model proposed in [hou2021physical], as it is the easiest model to achieve in practice. Skipping instructions can be done by a clock or a voltage glitch, using equipment with a cost below $100 [bozzato2019shaping].
In sum, to the best of our knowledge we are the first to propose faulting attacks at the training phase, which are exploitable at deployment without the necessity of further faulting. A comparison of different works from this section is provided in Table I.
Table I: Comparison of the works discussed in this section.

| Work | Type of fault | Phase | Outcome |
|---|---|---|---|
| [liu2017fault] | Bit flip | Inference | Misclassification |
| [breier2018practical] | Instruction skip | Inference | Misclassification |
| [hong2019terminal] | Bit flip | Inference | Degradation |
| [breier2020sniff] | Bit flip | Inference | Model extraction |
| [rakin2020tbt] | Bit flip | Inference | Trojan insertion |
| [bai2021targeted] | Bit flip | Inference | Targeted misclassification |
| This work | Instruction skip | Training | Trojan insertion |
III Approach
In this section we will describe the general attacker strategy and attacker goals in more detail. The attacker considered is interested in corrupting the training process of a neural network with the goal to be able to exploit specific fooling backdoors, while at the same time remaining as stealthy as possible. This section will discuss the attack strategy abstractly, while we will instantiate it on two realistic neural networks in Sect. IV.
Recall that as stated in the introduction, the general research question we want to answer is:
GRQ: Is it possible for an attacker to use fault attacks in the training phase of a deep neural network such that they can bias a resulting model in a way that can be exploited during deployment without the necessity for further fault attacks at inference time?
Moreover, if we can answer this question positively we would like to know:

Is it possible to carry such an attack without affecting the original network classification accuracy?

Can we attack a deployment with a family of numerous attack instances (inputs) that will be difficult to blacklist?

Is there a way to minimize the need for faulting attacks during training while maintaining a high likelihood of attack success?
III-A System and attacker model
As previously discussed, we assume the attacker will have physical access to a device performing a neural network training. We assume the attacker can observe the inputs to the training process and selectively inject faults during the computation. For instance, an attacker targeting a digit recognition network can selectively attack the network while an image of a chosen digit class is given as input.
Additionally, we assume an attacker can recover a subset of the trained network’s weights and biases, for instance by using side-channel analysis [csi_nn]. We do not require, however, that an attacker can retrieve all weights, since, as we will discuss in the following, this is in general not necessary.
Definition 1 (Attacker goal).
Given a target class $t$, an attacker would like to be able to craft inputs $x'$ such that $C'(x') = t$ for the corrupted classifier $C'$. We say such attacks are non-trivial if a non-corrupt classifier $C$ would compute $C(x') \neq t$ for $x'$.
In other words, an attacker aims to corrupt the training process such that it will be easy to generate input images that will be classified to a predefined target class. A trivial attack would be for the attacker to simply choose inputs in the problem domain that are very close to the target class, for instance a slightly modified image of a digit from the target class in the image classification example. A non-trivial attack would be to craft inputs that are possibly not even in the problem domain and still get them classified as the target class. For instance, an image similar to a plane icon that is classified as a chosen target digit in the digit classification example would be a non-trivial attack.
Moreover, this attack should be as stealthy as possible, in the sense that it is difficult to detect that a classifier was corrupted during training, and it should be difficult to automatically blacklist the fooling inputs $x'$. In other words, the accuracy of the corrupted classifier $C'$ should be close to that of the non-corrupt classifier $C$, and there should be no obvious way to characterize the resulting fooling images.
Note that crucially we assume the attacker to be able to inject faults only at the training phase, but not at inference time.
III-B Faulting strategy
In order to fulfill the goals in Definition 1, we propose the following strategy. First, an attacker will choose a hidden layer in the network that is followed by ReLU activation functions. The attacker will then fault those ReLUs when training samples of a target class are given as an input to the network. For instance, consider our running example, digit classification using the MNIST dataset and a target class $t$. The attacker will fault all ReLUs after the first hidden layer in a multilayer perceptron architecture as depicted in Figure 3. That is, for each ReLU in the layer, the attacker will force the output to 0 as depicted in Figure 2, independent of the input. Such a fault can be achieved as described in [breier2018practical].
Given that it is also among the attacker goals to remain stealthy, it is important that the classifier trained under this attack has an accuracy that is similar (overall and for each class) to the one obtained by the same network architecture and training and testing dataset under no attacks. So, for instance, if a multilayer perceptron reaches a 99% accuracy under no attacks, and a similar accuracy for each class, then the resulting accuracy of the network under attack should still be close to 99%. Clearly, this implies in particular that faulting cannot be performed for all example inputs in the target class, since otherwise the network will not be able to recognize testing samples in that class. Therefore, in the proposed strategy, an attacker will choose a certain fraction of the training examples in the target class (for instance, choosing each example at random with a fixed probability).
Moreover, an attacker that wants to minimize the attacking effort is also interested in faulting as few ReLUs as possible. One research question is thus: what is the minimum number of ReLUs in a given layer that yields a successful and stealthy attack under Definition 1? To answer it, we propose to explore a faulting strategy that considers an increasing number of faulted ReLUs per experiment iteration. For instance, one can start by faulting 10% of the ReLUs after a given hidden layer, then 20%, and so on until faulting all ReLUs in that layer. In principle, this yields a combinatorial explosion of target ReLUs, since there are many ways to pick a subset of the ReLUs.
However, we will discuss some choices to mitigate this computational problem in the evaluation section. On the one hand, depending on the network architecture, it is possible to argue for generality when an arbitrary subset of ReLUs is chosen (multilayer perceptron). On the other hand, when this is not possible, we believe the results presented will give an intuition on the general attack impact when a limited number of activation functions is faulted.
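The faulting strategy can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the helper names (`relu_layer`, `should_fault`), the target class 7, the fault probability 0.5 and the faulted index set are hypothetical values chosen for illustration, not the settings used in the evaluation.

```python
import random


def relu_layer(pre_activations, faulted_indices, inject_fault):
    """Apply ReLU element-wise; faulted units output 0 when a fault is injected."""
    outputs = []
    for i, z in enumerate(pre_activations):
        if inject_fault and i in faulted_indices:
            outputs.append(0.0)          # simulated fault: output forced to 0
        else:
            outputs.append(max(0.0, z))  # regular ReLU
    return outputs


def should_fault(label, target_class, fault_probability, rng):
    """Fault only samples of the target class, each with the given probability."""
    return label == target_class and rng.random() < fault_probability


rng = random.Random(0)
target_class = 7          # hypothetical target class
fault_probability = 0.5   # hypothetical fraction of target samples to fault
faulted_indices = {0, 1}  # hypothetical subset of ReLUs in the attacked layer

pre_activations = [1.5, -0.3, 2.0]
print(relu_layer(pre_activations, faulted_indices, inject_fault=True))   # [0.0, 0.0, 2.0]
print(relu_layer(pre_activations, faulted_indices, inject_fault=False))  # [1.5, 0.0, 2.0]
```

During a simulated training run, `should_fault` would be evaluated once per training sample, and its result passed as `inject_fault` to the attacked layer.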
III-C Fooling images generation strategy
In order to exploit the faulting attacks performed in the training strategy, an attacker needs to be able to derive attack inputs that fulfill Definition 1. Intuitively, an attacker wants to achieve the same behaviour (ReLUs outputting 0s at a given layer) with the hope of achieving a (mis)classification biased to the target class $t$. This could be achieved, for instance, by faulting again at inference time, but this would require very strong attacker capabilities.
Instead, we propose to derive attack inputs, so-called fooling inputs, which can be computed mathematically by means of constraint solving. We assume the attacker is able to derive some of the weights of the network (for instance by side-channel analysis [csi_nn]). Note that in our attack strategy, an attacker only needs to learn the weights of the network that are necessary to compute inputs that result in an output of 0 for the faulted ReLUs. This set of weights needed to compute the fooling inputs can be as small as the weights corresponding to 25 neurons, as we will see in Sect. IV.
Example: constraints on MLPs
For instance, assume the attacker faults all ReLUs after the first hidden layer of an MLP. The inputs to those ReLUs are:
$$z_i = w_i \cdot x + b_i,$$
where $x$ is the input vector, $w_i$ is the weight vector of the $i$-th neuron in the first layer and $b_i$ is the so-called bias of the neuron. Assuming biases are small, an attack could be simply the all-zeros vector $x = 0$, since $w_i \cdot 0 = 0$ regardless of the weights. This will trigger the desired behaviour of all ReLUs outputting a value close to 0.
However, this type of attack is limited since it can be easily blacklisted as a border case (a black image, for instance), and harms the goal of the attacker to be as stealthy as possible. So an attacker might attempt to derive several non-zero fooling images by means of constraint solving. In other words, the attacker can find a set $X'$ of inputs such that for all $x' \in X'$ and all faulted neurons $i$ it holds:
$$w_i \cdot x' + b_i \leq 0.$$
Note that by the definition of ReLU, any negative or zero input will result in a zero output, which is the attacker’s goal. If an attacker has only faulted a subset of the ReLUs in a given layer, then the number of constraints can be smaller (corresponding to the target neurons). Note also that in this example this is a linear constraint which is typically easier to handle computationally than more complex nonlinear constraints.
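To make the linear constraints concrete, the following sketch searches for an input close to a base pattern that drives every targeted pre-activation to a non-positive value. It is not a general constraint solver: it uses cyclic projection onto the violated half-spaces, a simple feasibility heuristic swapped in here for illustration. The function names, the safety margin, the pixel range $[0, 1]$ and the toy weights are all assumptions.

```python
def preactivation(w, b, x):
    """Input to the ReLU of one neuron: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies(weights, biases, x):
    """True if every targeted ReLU receives a non-positive input."""
    return all(preactivation(w, b, x) <= 0.0 for w, b in zip(weights, biases))


def fooling_input(weights, biases, base, margin=0.01, max_iters=1000):
    """Search for a point near `base` with w_i . x + b_i <= 0 for all i.

    Starts from the base pattern and cyclically projects onto each violated
    half-space, clipping to the assumed pixel range [0, 1] after every step.
    Returns None if no feasible point is found within `max_iters` sweeps.
    """
    x = list(base)
    for _ in range(max_iters):
        if satisfies(weights, biases, x):
            return x
        for w, b in zip(weights, biases):
            violation = preactivation(w, b, x) + margin
            norm_sq = sum(wi * wi for wi in w)
            if violation > 0.0 and norm_sq > 0.0:
                step = violation / norm_sq
                x = [min(1.0, max(0.0, xi - step * wi)) for xi, wi in zip(x, w)]
    return x if satisfies(weights, biases, x) else None


# Toy example: two targeted neurons, three "pixels" (hypothetical values).
weights = [[1.0, -1.0, 0.5], [-0.5, 1.0, -1.0]]
biases = [0.0, 0.0]
base = [0.8, 0.2, 0.6]

candidate = fooling_input(weights, biases, base)
print(candidate)
```

Because the search starts at the base pattern and only moves as far as each violated constraint requires, the result tends to stay near the base. A dedicated linear programming or SMT solver would be the natural choice for minimizing that distance exactly.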
Naturally, more complex architectures and attacking deeper layers will result in more complex constraints. For instance if before the ReLU layer there is a convolutional layer, this results in a larger constraint set (that would also be linear in some cases) as we will discuss in the following. A more complex situation would be to attack activation functions after nonlinear layers, for instance after the ReLU activation layer or layers containing the softmax function. In that case, constraint solving would be more challenging.
Example: constraints on convolutional networks.
On a convolutional network, commonly used in computer vision tasks, there is typically a concept of a filter, or a family of filters, represented as one or several matrices $F$ that multiply submatrices of an input image $x$. In this case, if we want to impose constraints on activation functions after a convolutional layer, we would need to satisfy:
$$\langle F, M \rangle + b_F \leq 0 \quad \text{for all } M \in \mathcal{M}(x),$$
where $\langle F, M \rangle = \sum_{j,k} F_{jk} M_{jk}$, $b_F$ is the bias of the filter, and $\mathcal{M}(x)$ is the set of chosen submatrices of $x$ for the convolution. These matrices are typically the matrices ‘surrounding’ all pixels in the original image. For instance, in a $3 \times 3$ image there would be 9 submatrices, one for each pixel, corresponding to a given pixel and its neighbors, where pixels on the image border are considered to have 0s surrounding them outside the original image.
Clearly, this case imposes more complex constraints, since submatrices are not disjoint but often share multiple pixels, so a given pixel will end up having multiple constraints to be fulfilled simultaneously. To further complicate things, a family of filters $\mathcal{F}$, consisting of several individual filters, is often used in these architectures.
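The structure of these constraints can be illustrated with a short sketch that enumerates the zero-padded neighbourhood of every pixel and checks a single filter against all of them. The helper names and the toy image and filters are hypothetical; the bias argument is an assumption about how the layer is parameterized.

```python
def padded_patches(image, k=3):
    """Return the k x k zero-padded neighbourhood of every pixel (k odd)."""
    rows, cols = len(image), len(image[0])
    r = k // 2
    patches = []
    for i in range(rows):
        for j in range(cols):
            patch = [
                [
                    image[i + di][j + dj]
                    if 0 <= i + di < rows and 0 <= j + dj < cols
                    else 0.0  # outside the image: zero padding
                    for dj in range(-r, r + 1)
                ]
                for di in range(-r, r + 1)
            ]
            patches.append(patch)
    return patches


def filter_response(filt, patch, bias=0.0):
    """Sum of the element-wise product of filter and patch, plus the bias."""
    return sum(f * p for frow, prow in zip(filt, patch) for f, p in zip(frow, prow)) + bias


def satisfies_conv(filt, image, bias=0.0):
    """True if the filter response is non-positive on every patch."""
    return all(filter_response(filt, p, bias) <= 0.0
               for p in padded_patches(image, len(filt)))


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
negating_filter = [[0.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]]
identity_filter = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

print(len(padded_patches(image)))              # 9
print(satisfies_conv(negating_filter, image))  # True
print(satisfies_conv(identity_filter, image))  # False
```

Each pixel of the toy image appears in several patches, which is why a single pixel value ends up bound by several constraints at once.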
Independently of the nature of the network and the point where the attack is executed, we can describe a high-level algorithm summarizing the constraint solving strategy, as given in Algorithm 1. In this algorithm, we assume each neuron (which can be an activation function) has a well-defined mathematical formula specifying its inputs. It also has a well-defined output that depends on its inputs. Therefore, we can attempt to solve the constraints defined by the input formulas to obtain a given output (in this case, we are interested in the output 0, which is the result of our fault on ReLU).
In the following section we will evaluate our attack strategy on two different neural network architectures: a multilayer perceptron and a small convolutional network.
IV Evaluation
In order to evaluate our approach, we create a framework that (1) trains and tests a neural network under normal conditions as well as under attack; (2) builds a set of fooling inputs based on constraint solving; and (3) tests the attack success rate of those fooling inputs. We first perform a thorough exploration of the design and performance of fooling backdoor attacks on fully connected neural networks. Then, we extend our analysis to a different kind of neural network (i.e., convolutional neural networks). Finally, we propose a set of countermeasures that defend systems against our fooling backdoor attacks.
The code used for this evaluation is available as open source Jupyter notebooks containing Python code (https://www.dropbox.com/sh/gjys2sg7xob2e9x/AACB1mWyTQMv8f7R63wIxbIia, anonymized for reviewing purposes).
IV-A Attacks on MLPs
MNIST digit classification [deng2012mnist] has been widely studied in the literature, with a wide range of network architectures developed to solve this problem. However, one of the simplest and cheapest ways to perform the classification accurately is by using an MLP network after flattening the input image. Following this strategy, the input to our neural network is a vector of 784 (28×28) dimensions that contains the grayscale pixel values. The MLP aims to learn a set of parameters that achieves an accurate classification of the digits in the flattened vector.
The network architecture chosen to solve the classification problem was a fully connected neural network with 3 hidden layers. The first, second and third hidden layers have 128, 64 and 32 neurons, respectively. The activation function for all hidden layers is a ReLU, while for the last layer we use a softmax activation in order to get a vector with classification probabilities for the 10 classes.
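As a quick sanity check on the size of this architecture, the following sketch counts its trainable parameters. It assumes a 784-dimensional input (28×28 pixels), an output layer with 10 neurons (one per MNIST digit class), and one bias per neuron; the helper name `count_parameters` is ours.

```python
# Layer widths: input, three hidden layers, output (output width is an assumption).
layer_sizes = [784, 128, 64, 32, 10]


def count_parameters(sizes):
    """Weights plus one bias per neuron for each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total = count_parameters(layer_sizes)
per_first_layer_neuron = layer_sizes[0] + 1  # 784 weights and 1 bias

print(total)                        # 111146
print(25 * per_first_layer_neuron)  # 19625
```

Under these assumptions, the weights and biases of 25 first-layer neurons amount to 19625 of the 111146 parameters, which gives a sense of how little of the model an attacker would need to recover in the smallest attack setting.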
To evaluate our approach, we attack ReLUs after the first hidden layer as depicted in Figure 3. Note that we do not actually perform physical attacks, but simulate them by coding a neural network architecture that allows us to replace the outputs of chosen ReLUs with 0s depending on the network input. This is a nontrivial coding effort since this involves carefully coding training and backpropagation in the faulted scenario from scratch.
We have chosen to fault only a fraction of the training inputs belonging to the target class, selected at random, since we have empirically observed that this preserves the classification accuracy of the original network almost intact while also introducing enough bias for successful attacks to be derived. It is possible that faulting even fewer inputs from the target class would result in powerful attacks, but we leave this analysis for future work.
Nevertheless, since minimizing the need for physical faulting is interesting from an attacker’s perspective, in order to answer one of the research questions raised in the previous section, we perform a sensitivity analysis over various percentages of faulted activation functions (10%, 20%, …, 100%). The idea is to assess the performance of less costly attacks (faulting as few ReLUs as possible, which also implies learning fewer network weights to perform the attack). We do this for all 10 possible target classes (classes we want to build backdoors for), which yields $10 \times 10 = 100$ attack simulations, each corresponding to a network faulted with a given percentage of the ReLUs and a given target class.
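The experiment grid can be enumerated as follows. This is a sketch under assumptions: the attacked layer is taken to have 128 ReLUs, and the number of faulted ReLUs is rounded down, a choice made here because it matches the figure of 25 activation functions quoted for 20% in the introduction.

```python
layer_width = 128                 # ReLUs after the first hidden layer
percentages = range(10, 101, 10)  # 10%, 20%, ..., 100%
target_classes = range(10)        # MNIST digits 0..9

# Each configuration: (percentage, target class, number of faulted ReLUs).
configs = [
    (pct, target, pct * layer_width // 100)  # rounding down is an assumption
    for pct in percentages
    for target in target_classes
]

print(len(configs))  # 100
print(sorted({n for _, _, n in configs}))
```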
Given that there is a combinatorial explosion of choices when attacking only some of the ReLUs after a given layer, we consider, without loss of generality, a prefix of the ordered ReLUs of the attacked layer. This is because, for a fully connected neural network, a fixed but arbitrary order of the neurons does not affect the result. Notice that before training, neurons in a layer are permutation invariant, as every neuron is connected to all the neurons in the previous layer. In that sense, we could permute the neurons in the same layer and get exactly the same mathematical model, only with a different mapping of indices for each neuron in the layer. Since only the indices change, and given a seed for the weight initialization, the model parameters will converge to the same values as those obtained for the non-permuted layer.
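The permutation argument can be checked numerically with a small sketch. All names below are hypothetical. Rounding the prefix size down is an assumption; it reproduces the neuron counts listed in Table II.

```python
import random


def prefix_size(n_neurons, percent):
    """Number of leading ReLUs to fault (rounding down is an assumption)."""
    return (n_neurons * percent) // 100


def random_matrix(rng, n_rows, n_cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


def hidden_layer(weights, biases, x):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def two_layer(w1, b1, w2, b2, x):
    """One hidden ReLU layer followed by a linear output layer."""
    h = hidden_layer(w1, b1, x)
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(w2, b2)]


def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden neurons: rows of w1, entries of b1, and columns of w2."""
    return (
        [w1[p] for p in perm],
        [b1[p] for p in perm],
        [[row[p] for p in perm] for row in w2],
    )
```

Evaluating `two_layer` before and after `permute_hidden` gives the same outputs up to floating point rounding, which is why attacking a prefix of the neurons loses no generality in the fully connected case.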
Generating fooling images from the problem domain vs. outside of the problem domain. One interesting research question is whether the base pattern inputs, as described in the previous section, should be sampled from the problem domain (in this case, handwritten digits) or not. Intuitively, it could be interesting for an attacker to choose inputs from the problem domain in order to confuse a classifier. For instance, an attacker may want to produce a fooling input that is visually similar to a 4 but is classified as a 7. We have explored this scenario and noticed that it works successfully for the MNIST case study. For instance, Fig. 4 is obtained by solving constraints using a 4 as a basis on a network that was attacked to target the output class 7. In particular, in that case we used a network with only of ReLUs faulted and managed to classify the resulting fooling image as a 7 with confidence higher than .
However, before generalizing this analysis to all possible target/base pattern combinations (), we anticipated a certain bias, given that the base images were in the problem domain to start with. This bias comes from the fact that the distribution of pixels in the base image already gives a certain advantage to some attacks (for instance, a base image of a 1 is closer to a 7 than to, say, an 8).
Therefore, to have a more interesting evaluation, we decided to pick images from outside of the problem domain as base patterns. This would to a degree remove inherent biases and make the attack more general.
For a given attacked network we then use the partial weights and biases that correspond to the attacked ReLUs to solve a linear constraint problem (as implemented by the Mixed Integer Linear Programming library in SageMath, https://doc.sagemath.org/html/en/reference/numerical/sage/numerical/mip.html). In order to have a variety of non-trivial attacks from outside the domain, we used the icons in Figure 5 as extra constraints to the solver. Those icons were chosen randomly from an open source icon dataset (https://remixicon.com/). In total, we generate 12 fooling images for each network. They consist of 2 images for each of the 5 icons as basis (each with a different constraint on the total image weight, in order to add more diversity to the set of fooling images) and 2 free images (no icon as a basis pattern, also each with a different total weight). As a result, we obtain fooling images such as the ones depicted in Figure 8 for 10%, 50% and 100% ReLUs with target class . As there are 128 neurons in the layer under attack, 10%, 50% and 100% of this layer correspond to 12, 64 and 128 neurons respectively.
In this scenario we instantiate Algorithm 1 to solve the particular constraints in our attack, plus the constraints given by the icons as described in Algorithm 2. Given , the set of ReLUs that were faulted during training to create the backdoor, and the set of pattern images , we create one fooling image for each pattern image. There are two constraints for this fooling image. The first is that it lies in a neighborhood of the pattern image (line 3). This distance is considered pixelwise, and although it is in principle arbitrary, the smaller the neighborhood of possible values around the original pixel intensity, the closer the resulting image will be to the pattern. Formally, let be the pixel intensity of the th pixel in the pattern image. We constrain the fooling image to be within and . Empirically we observed that was enough for the fooling images to resemble the patterns while still being solvable. The second constraint is that the resulting outputs of this fooling image for those faulted ReLUs should be (line 7).
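To make the two constraint families concrete, the sketch below writes them down and checks a candidate image against them. It does not call a solver; in our setting a linear programming solver such as the SageMath one searches for an image satisfying them. The sketch rests on several assumptions that are not taken from the algorithms in the text: pixel intensities are normalized to [0, 1], the neighborhood is a symmetric interval of half-width `eps` around each pattern pixel, and a faulted ReLU is required to output 0, which is expressed as a non-positive pre-activation. All names are hypothetical.

```python
def pixel_bounds(pattern, eps, lo=0.0, hi=1.0):
    """Per-pixel interval [p - eps, p + eps], clipped to the assumed intensity range."""
    return [(max(lo, p - eps), min(hi, p + eps)) for p in pattern]


def preactivation(weights_row, bias, image):
    """Input of one first-layer ReLU: w . x + b."""
    return sum(w * x for w, x in zip(weights_row, image)) + bias


def is_fooling_candidate(image, pattern, eps, weights, biases, faulted_idx, tol=1e-9):
    """True if the image stays in the eps-neighborhood of the pattern and every
    faulted ReLU has a non-positive pre-activation (so its output is 0)."""
    bounds = pixel_bounds(pattern, eps)
    in_box = all(l - tol <= x <= h + tol for x, (l, h) in zip(image, bounds))
    relus_off = all(
        preactivation(weights[j], biases[j], image) <= tol for j in faulted_idx
    )
    return in_box and relus_off


def min_preactivation(weights_row, bias, bounds):
    """Smallest pre-activation reachable inside the box: use the lower bound
    where the weight is positive and the upper bound otherwise."""
    return bias + sum(w * (l if w > 0 else h) for w, (l, h) in zip(weights_row, bounds))
```

If `min_preactivation` is positive for any single faulted ReLU, the whole system is infeasible for that pattern and `eps`. The converse does not hold, because the ReLUs must be satisfied jointly, which is what the solver decides.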
We measure the attack success rate as the percentage of generated fooling images that are classified by the backdoored network as the target class. A summary of the attack success rate for the various combinations is depicted in Figure 6. Moreover we depict the average confidence on the successful attacks in Figure 7.
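The two reported metrics can be computed as in the following minimal sketch, which assumes each fooling image yields a (predicted class, confidence) pair from the backdoored network. The function name and data layout are hypothetical.

```python
def attack_metrics(predictions, target_class):
    """predictions: list of (predicted_class, confidence), one per fooling image.
    Returns (success rate, mean confidence over successful attacks or None)."""
    hits = [conf for cls, conf in predictions if cls == target_class]
    success_rate = len(hits) / len(predictions) if predictions else 0.0
    mean_confidence = sum(hits) / len(hits) if hits else None
    return success_rate, mean_confidence
```

Note that the confidence is averaged only over the successful attacks, matching the way the average confidence is reported above.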
Note that in principle some of the constraint systems could be unsolvable. We have observed that most of the evaluated attacks for this scenario are solvable, with a few exceptions in the case of faulting 100% of the neurons in the first hidden layer. This makes sense, since attacking more ReLUs implies more constraints and increases the likelihood of adding contradicting ones. In total, out of 1200 constraint systems (fooling backdoors), there were 34 unsolved instances.
Moreover, and crucially for stealthiness, all 100 generated networks had overall accuracy and per-class accuracy indistinguishable from a non-attacked network. Concretely, while a non-attacked network had accuracy of about , all FooBaRed networks had accuracy between and on legitimate test inputs.
IV-B Attacks on Convolutional Networks
In recent years, the field of computer vision has witnessed the birth of several deep learning architectures with promising results in image classification and object recognition. In particular, Convolutional Neural Networks (CNNs) have shown exemplary performance in tasks like pattern recognition [albawi2017understanding]. Furthermore, several highly popular deep neural networks (like LeNet [lecun1998gradient], AlexNet [krizhevsky2012imagenet], VGG16 [simonyan2014very], etc.) have in common the fact that their building blocks are convolutional layers. Consequently, in order to evaluate the effects of our approach on popular deep neural network architectures, we consider a network that includes convolutional layers. Figure 9 depicts the network architecture chosen to perform the classification task on the MNIST dataset.
Note that in order to simulate physical fault attacks we have coded forward and backward propagation on neural networks from scratch, as we did for MLPs. For the sake of simplicity, our training framework has no GPU capabilities. Given these implementation constraints, training the whole AlexNet network (60 million parameters) or VGG network (138 million parameters) under our framework would take a large amount of computational time and is considered out of the scope of this work. Nevertheless, as the convolutional layer is the base of popular networks, we focus on simulating and evaluating the effects of attacking a convolutional layer using fooling images on a simpler network architecture.
In this target architecture the first ReLU layer comes after the first convolution layer. Different from the MLP architecture discussed before, if we want to build fooling inputs for faulted ReLUs, we have to solve new constraint systems generated by the convolution layer. Given that the first layer is a convolutional layer, attacking ReLU activation functions just after this layer results in linear constraints as well, which is advantageous for the constraint solving module.
Essentially, the convolution layer of this architecture has a family of 5 so-called filters, each corresponding to a matrix of weights, and a bias value for each filter. In principle a filter is applied to each pixel in the input image by a matrix multiplication with its surrounding matrix, as discussed in Sect. III.
In the case of our network, a stride value of 2 is used, which means that filters are applied only to every second pixel in every second row of the original image. The use of strides is common in convolutional network architectures and defines the set of to be considered for the convolution (using the notation of Sect. III). As a result, the output of the convolution layer is a positions vector, as depicted in Fig. 9. Connected to each value in this layer there is a ReLU activation function. Those activation functions will be the target of our FooBaR attack.
In order to generate fooling images, constraints to be solved are thus:
where is the bias associated with filter .
Given that each filter corresponds to exactly 20% of the ReLUs to fault, we evaluate attacks on 20%, 40%, 60%, 80% and 100% of the ReLUs. Note that this time each filter adds a significant number of constraints on all pixels of the input image. The more filters we consider for the constraint solving, the more conflicts between constraints can potentially arise.
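How one filter turns into a family of linear constraints can be sketched as follows. The sketch assumes an unpadded convolution and expresses "the ReLU at this position outputs 0" as a non-positive pre-activation. The kernel size is left as a parameter, and the function name and data layout are hypothetical.

```python
def conv_constraints(kernel, bias, height, width, stride=2):
    """One linear constraint per filter position: sum(coeff * pixel) + bias <= 0.

    Returns a list of (coeffs, bias) pairs, where coeffs maps the flat index of
    a pixel in the height x width image to the kernel weight applied to it.
    Assumes no padding, so the window always lies fully inside the image.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    rows = []
    for top in range(0, height - k_h + 1, stride):
        for left in range(0, width - k_w + 1, stride):
            coeffs = {}
            for di in range(k_h):
                for dj in range(k_w):
                    coeffs[(top + di) * width + (left + dj)] = kernel[di][dj]
            rows.append((coeffs, bias))
    return rows
```

Each additional filter contributes one such list over the same pixels, which is why attacking more filters makes conflicting constraints more likely.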
We evaluate our approach against 5 networks (one for each attack percentage) for the target class 8, for which attack success rates on the MLP architecture were average. We could not solve any constraint system for the attacks at 60% and above, and could solve all of them for 20% and 40%. The resulting attacks are depicted in Figure 10. For the 20% network the attack success rate was 66% (8 out of 12 attacks), with an average confidence for the successful attacks of 95%. For the 40% network (two filters) all attacks were successful, with an average confidence of 97%.
Finally, as in the MLP case, a non-attacked network with this architecture reaches an overall accuracy on legitimate testing inputs. The 5 attacked networks had accuracy ranging from to , making them indistinguishable from a non-attacked network on legitimate testing inputs.
IV-C Discussion
In general, what we observe from both experiments on MLP and convolutional networks is that by attacking relatively few ReLUs (as few as 20% on a single hidden layer) we can obtain high attack success rates (of up to 100%) with high classification confidence, even when the pattern used for the generation of fooling images is outside the problem domain (icons). This is interesting because the generated fooling images retain a similarity to the icons used as patterns, but can be tailored to be classified as any given target class with high confidence. Moreover, testing accuracy on legitimate inputs was indistinguishable in the attacked and non-attacked networks for both architectures, thus fulfilling the stealthiness requirement.
Despite the positive results obtained in our evaluation, there are a number of limitations to our approach. First, some constraints might not be solvable, as we have observed in particular with the constraints generated by the convolutional network case study. This phenomenon is more likely the tougher the constraints are, which could also be the case if the parameter used for solving the pattern images is decreased. Decreasing it could be desirable if one wants fooling images that are even closer to the original pattern images. In some case studies, involving intricate convolutional networks for instance, the constraints might not be solvable at all.
On the other hand, in the convolutional network case study, there might be more advanced attacks that consider various combinations of ReLUs under attack, not necessarily respecting their order as we have done in our experiments. Different from the MLP case study, order will matter since the corresponding ReLUs are associated with different pixels and filters. Depending on the network, this could help the attacker if the resulting constraints are easier to solve. Given that there is an explosion of possible combinations in this attack scenario, we leave this study for future work.
Although we have explored FooBaR attacks on deeper layers (beyond the first hidden layer) and have obtained good results in terms of classification accuracy on legitimate testing inputs, we have not explored constraint solving on those networks, given that the constraints would no longer be linear beyond, for instance, ReLU activation functions. This is an interesting aspect to explore in the future as well, given that it gives more degrees of freedom to an attacker if a target network has several hidden layers.
Finally, we have limited our analysis to the digit classification problem and two popular but relatively small neural network architectures. In principle our technique could be used on other case studies involving larger datasets and network architectures. However, in order to simulate attacks efficiently, more powerful hardware and GPU-tailored implementations would be necessary. This effort is interesting future work and would shed light on the generalizability of our approach to other case studies.
Countermeasures. The attack affecting the ReLU output [breier2018practical] is essentially an instruction skip attack. Such attacks have been well studied in the context of cryptography [breier2015laser], and different countermeasures have been proposed. Most software-level countermeasures rely on temporal redundancy, for example instruction duplication and triplication [barenghi2010countermeasures], or more fine-grained instruction replacement [moro2014formal]. Additionally, it was shown to be possible to duplicate the data within the instruction while adding control flow protection [patrick2016lightweight]. In the area of hardware countermeasures, it is also possible to utilize spatial redundancy, that is, to deploy several computation units in parallel that perform the same computations. While modern encryption routines are relatively fast (e.g. encryption of one block of data with GIFT cipher takes between 2941 clock cycles on an embedded processor, depending on the state size [banik2017gift]), this is not true for deep learning, which consumes several orders of magnitude more clock cycles due to its heavy use of expensive floating point operations. This means that doubling or tripling the whole computation becomes extremely costly in absolute numbers, and may be impractical for embedded AI applications. Therefore, the existing countermeasures have yet to be adapted to neural network implementations in a way that offers a reasonable security/cost tradeoff.
On the other hand, as mentioned in Section II-B, countermeasures for backdoor attacks focus on trigger-based backdoors and do not apply to our attack. Thus, new countermeasures need to be developed in order to prevent or mitigate our proposed attack methodology.
To find a countermeasure, we have analyzed how safe (non-attacked) neural networks behave on the generated fooling images. Using the same linear constraint solving method as in Algorithm 2, and assuming a certain number of neurons were attacked during training, we generated sets of 12 fooling images. For each number of neurons, the frequency of the most frequent classification result as well as its mean confidence are listed in Table II. Comparing the frequency of the classification results to the attack success rate in Figure 6, we see that the frequency is much lower in most cases. The exception is the last case, where we assume 128 neurons were attacked and all the fooling images were classified as one particular class. Nevertheless, comparing the confidence values to Figure 7, we can see that the confidence for networks which were not attacked is very low.
Thus, one strategy to protect against attacks is for the user to generate fooling images and feed them to the trained network. Generating the fooling images is not a computationally expensive task and can be performed in less than one hour on a laptop (for the 12 generated images in the MNIST case study, for instance). If the network classifies a certain percentage of the images as one particular class with high confidence, the network can be considered to have been faulted and should be retrained. The threshold for the percentage can follow the worst case scenario in Figure 6, and that for the confidence can follow the worst case scenario in Table II. This is thus an attack detection strategy that can be performed after training on untrusted devices.
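The detection check described above can be sketched as follows. The function name and the data layout are hypothetical, and the two thresholds are left as parameters because their concrete values would have to be derived from the worst case scenarios mentioned in the text.

```python
from collections import Counter


def looks_backdoored(predictions, freq_threshold, conf_threshold):
    """predictions: list of (predicted_class, confidence), one per fooling image
    fed to the trained network. Flags the network when the most frequent class
    is both frequent enough and predicted with high enough mean confidence."""
    counts = Counter(cls for cls, _ in predictions)
    top_class, top_count = counts.most_common(1)[0]
    frequency = top_count / len(predictions)
    confidences = [conf for cls, conf in predictions if cls == top_class]
    mean_confidence = sum(confidences) / len(confidences)
    flagged = frequency >= freq_threshold and mean_confidence >= conf_threshold
    return flagged, top_class, frequency, mean_confidence
```

Requiring both conditions reflects the observation that a non-attacked network may occasionally agree on one class for all fooling images, but does so with very low confidence.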
This statistical analysis on our case study moreover confirms that the generated fooling images were non-trivial, as discussed in Def. 1. When applied to a non-corrupted network, the likelihood of fooling images being classified as the desired target class is on average low, with a strong bias towards classifying the generated fooling images as either or , and in any case with very low classification confidence.
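Table II: Most frequent classification result and its mean confidence on a non-attacked network, per assumed number of attacked neurons.
# of neurons attacked    Frequency of most frequent class    Mean confidence
12                       0.33                                0.092
25                       0.5                                 0.035
38                       0.33                                0.057
51                       0.5                                 0.074
64                       0.5                                 0.023
76                       0.5                                 0.075
89                       0.67                                0.073
102                      0.5                                 0.059
115                      0.33                                0.30
128                      1                                   0.092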
V Conclusions
In this work we have shown that fault attacks on neural networks can be effectively used during training of a deep neural network in order to generate backdoors. Such backdoors can be exploited by means of fooling inputs, which are the result of linear constraint solving. Moreover, this constraint solving can include a pattern, based on an arbitrary input, such that attacks can be made similar to humanly recognizable inputs.
We explored our attack technique on MLPs and Convolutional Networks over the MNIST dataset. We obtained high attack success rates (of up to 100%) and high classification confidence even when attacking a small percentage (20% and up) of a single ReLU activation layer. The attacked networks preserved high classification accuracy, on average as good as a non-attacked network. As a result of our analysis, we discussed possible countermeasures against the presented attack.
Interesting future work includes applying our technique to larger networks and datasets and to case studies beyond computer vision problems, in order to further study the generalizability of the proposed approach. Furthermore, we believe there is still room to optimize the number of faulted neurons and also to utilize different approaches to generate fooling images/inputs.