DNNs have shown state-of-the-art performance on a wide range of problems (goodfellow2016deep; DBLP:conf/bmvc/KaufmanBBCH19; DBLP:conf/isbi/BibasWCCG21). Despite the impressive performance, it has been found that DNNs are susceptible to adversarial attacks (szegedy2013intriguing; biggio2013evasion). These attacks cause the network to underperform by adding specially crafted noise to the input such that the original and modified inputs are almost indistinguishable.
Different approaches to improve model robustness against adversarial samples have been suggested (guo2018countering; samangouei2018defensegan; qin2019adversarial; papernot2016distillation; jakubovitz2018improving; zhang2019defense). Among them, the best performing approach is adversarial training, which augments the training set to include adversarial examples (goodfellow2014explaining; madry2017towards). However, current methods are still unable to achieve a robust model for high-dimensional inputs.
To produce a robust defense, we exploit the individual setting (merhav1998universal). In this setting, no assumption is made about a probabilistic connection between the data and labels. The absence of assumptions means that this is the most general framework: The relationship between the data and labels can be deterministic and may be determined by an adversary that attempts to deceive the model. The generalization error in this setting is referred to as the regret. This regret is the log-loss difference between a learner and the reference learner: a learner that knows the true label but is restricted to use a model from a given hypothesis class.
The pNML learner (fogel2018universal) was proposed as the min-max solution of the regret, where the minimum is over the model choice and the maximum is for any possible test label value. The pNML was developed for linear regression (bibas2019new; DBLP:journals/corr/abs-2102-07181) and evaluated empirically for DNNs (bibas2019deep).
We propose the Adversarial pNML scheme as a new adversarial defense. Intuitively, the Adversarial pNML procedure assigns a probability for a potential outcome as follows: Assume an arbitrary label for a test sample and perform a targeted adversarial attack toward this label. Take the probability the model gives to the assumed label. Follow this procedure for every label and normalize to get a valid probability assignment. This procedure can be applied effectively to any adversarially trained model.
To summarize, we make the following contributions.
We introduce the Adversarial pNML, a novel adversarial defense that enhances robustness. This is attained by comparing a set of hypotheses for the test sample. Each hypothesis is generated by a weak targeted attack towards one of the possible labels.
We analyse the proposed defense and show it is consistent with the recent findings regarding the properties of the adversarial subspace. We demonstrate the refinement technique on synthetic data with a simple multilayer perceptron.
We evaluate the effectiveness of our approach and compare it to state-of-the-art techniques using black-box and white-box attacks, including a defense-aware attack. We show that against white-box untargeted attacks we improve on leading methods by up to 5.7%, 3.7%, and 0.6% on the ImageNet (deng2009imagenet), CIFAR10 (krizhevsky2014cifar), and MNIST (lecun2010mnist) sets, respectively.
Our scheme is simple, requires only one hyper-parameter, and can be easily combined with any adversarially pretrained model to enhance its robustness. Furthermore, our suggested method is theoretically motivated for the adversarial attack scenario since it relies on the individual setting in which the relation between the data and labels can be determined by an adversary. Contrary to existing methods that attempt to remove the adversarial perturbation (samangouei2018defensegan; song2018pixeldefend; guo2018countering), our approach is unique since it does not remove the perturbation but rather uses a targeted adversarial attack as a defense mechanism.
2 Related work
In this section, we mention common adversarial attack and defense methods.
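Attack methods. One of the simplest attacks is the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining). Let $w_0$ be the parameters of a trained model, $x$ be the test data, $y$ its corresponding label, $L(w_0, x, y)$ the loss function of the model, $x_{adv}$ the adversarial input, and $\epsilon$ the maximum $\ell_\infty$ distortion such that $\|x - x_{adv}\|_\infty \leq \epsilon$. First, the signs of the loss function gradients are computed with respect to the image pixels. Then, after multiplying the signs by $\epsilon$, they are added to the original image to create an untargeted adversarial attack

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(w_0, x, y)\right).$$

It is also possible to improve the classification chance for a certain label $y_{target}$ by performing a targeted attack

$$x_{adv} = x - \epsilon \cdot \mathrm{sign}\left(\nabla_x L(w_0, x, y_{target})\right).$$

A multi-step variant of FGSM was used by madry2017towards and is called Projected Gradient Descent (PGD). It is considered to be one of the strongest attacks. Denote $\alpha$ as the size of the update; for each iteration $t$ an FGSM step is executed

$$x_{adv}^{t+1} = \Pi_{x,\epsilon}\left(x_{adv}^{t} + \alpha \cdot \mathrm{sign}\left(\nabla_x L(w_0, x_{adv}^{t}, y)\right)\right), \quad 0 \leq t < T,$$

where $\Pi_{x,\epsilon}$ denotes the projection onto the $\ell_\infty$ ball of radius $\epsilon$ around $x$.
The number of iterations $T$ is predetermined. For each sample, the PGD attack is initialized by a random starting point $x_{adv}^{0} = x + u$ where $u \sim U[-\epsilon, \epsilon]$, and the sample with the highest loss is chosen.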
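The FGSM and PGD attacks described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the experiments: it assumes a linear softmax classifier as a stand-in for a DNN so that the input gradient has a closed form, and the names `predict`, `loss_grad_x`, `fgsm`, and `pgd` are hypothetical helpers introduced here.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    """Class probabilities of a linear softmax model (assumed stand-in for a DNN)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def loss(W, b, x, y):
    """Cross-entropy (log-loss) of label y at input x."""
    return -math.log(predict(W, b, x)[y])


def loss_grad_x(W, b, x, y):
    """Gradient of the loss w.r.t. the input: W^T (p - onehot(y))."""
    p = predict(W, b, x)
    p[y] -= 1.0
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(W, b, x, y, eps, targeted=False):
    """Single signed-gradient step: ascend the loss (untargeted) or descend it (targeted)."""
    g = loss_grad_x(W, b, x, y)
    direction = -1.0 if targeted else 1.0
    return [xj + direction * eps * sign(gj) for xj, gj in zip(x, g)]


def pgd(W, b, x, y, eps, alpha, steps, restarts, rng):
    """Multi-step FGSM with random starts; keeps the restart with the highest loss."""
    best, best_loss = None, -math.inf
    for _ in range(restarts):
        x_adv = [xj + rng.uniform(-eps, eps) for xj in x]
        for _ in range(steps):
            x_adv = fgsm(W, b, x_adv, y, alpha)
            # project back onto the l_inf ball of radius eps around x
            x_adv = [min(max(a, xj - eps), xj + eps) for a, xj in zip(x_adv, x)]
        current = loss(W, b, x_adv, y)
        if current > best_loss:
            best, best_loss = x_adv, current
    return best


# Toy two-class model and a sample with true label 0 (illustrative values only).
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x = [0.5, -0.5]
x_untargeted = fgsm(W, b, x, 0, eps=0.1)
x_targeted = fgsm(W, b, x, 1, eps=0.1, targeted=True)
x_pgd = pgd(W, b, x, 0, eps=0.3, alpha=0.05, steps=20, restarts=3, rng=random.Random(0))
```

For a real network the closed-form gradient would be replaced by automatic differentiation; the attack logic itself is unchanged.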
A different approach is taken by Hop-Skip-Jump-Attack (HSJA) (chen2020hopskipjumpattack). This is a black-box attack in which the adversary has only a limited number of queries to the model decision. The attack is iterative and involves three steps: estimating the gradient direction, step-size search via geometric progression, and boundary search via a binary search.
Defense methods. The most prominent defense is adversarial training, which augments the training set to include adversarial examples (goodfellow2014explaining). Many improvements in adversarial training were suggested. madry2017towards showed that training with PGD adversaries offered robustness against a wide range of attacks. carmon2019unlabeled suggested using semi-supervised learning with unlabeled data to further improve robustness. Wong2020Fast offered a way to train a robust model with a lower computational cost with weak adversaries.
An alternative to adversarial training is to encourage the model loss surface to become linear such that small changes at the input would not change the output greatly. qin2019adversarial demonstrated that using a local linear regularizer during training creates a robust DNN model.
In supervised machine learning, a training set $z^N = \{(x_n, y_n)\}_{n=1}^{N}$ consisting of $N$ pairs of examples is given. The goal of a learner is to predict the unknown test label $y$ given new test data $x$ by assigning a probability distribution $q(\cdot|x)$ to the unknown label. For the problem to be well-posed, we must make further assumptions on the class of possible models or hypothesis set that is used to find the relation between $x$ and $y$. Denote $\Theta$ as a general index set; the possible hypotheses are a set of conditional probability distributions

$$P_\Theta = \left\{ p_\theta(y|x), \; \theta \in \Theta \right\}. \tag{4}$$

An additional assumption required to solve the problem is related to how the data and the labels are generated. In this work we consider the individual setting (merhav1998universal), where the data and labels, both in the training and test, are specific individual quantities: We do not assume any probabilistic relationship between them, and the labels may even be assigned in an adversarial manner. In this framework, the goal of the learner is to compete against a reference learner with the following properties: (i) knows the test label value, (ii) is restricted to use a model from the given hypothesis set $P_\Theta$, and (iii) does not know which of the samples is the test. This reference learner then chooses a model that attains the minimum loss over the training set and the test sample

$$\hat{\theta}(z^N, x, y) = \arg\max_{\theta \in \Theta} \left[ p_\theta(y|x) \prod_{n=1}^{N} p_\theta(y_n|x_n) \right].$$

The log-loss difference between a learner $q$ and the reference is the regret

$$R(z^N, x, y; q) = \log \frac{p_{\hat{\theta}(z^N, x, y)}(y|x)}{q(y|x)}.$$

The pNML (fogel2018universal) learner minimizes the regret for the worst case test label

$$\Gamma = \min_{q} \max_{y} R(z^N, x, y; q).$$

The pNML probability assignment and regret are

$$q_{pNML}(y|x) = \frac{p_{\hat{\theta}(z^N, x, y)}(y|x)}{\sum_{y'} p_{\hat{\theta}(z^N, x, y')}(y'|x)}, \qquad \Gamma = \log \sum_{y'} p_{\hat{\theta}(z^N, x, y')}(y'|x).$$
The pNML regret is associated with the model complexity (zhang2012model). This complexity measure formalizes the intuition that a model that fits almost every data pattern very well would be much more complex than a model that provides a relatively good fit to a small set of data. Thus, the pNML incorporates a trade-off between goodness of fit and model complexity.
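As a small numeric illustration of the pNML assignment and its regret, the sketch below normalizes a list of per-label probabilities and returns the log of the normalization factor. It assumes the reference-learner probabilities are already available; the function name `pnml` and the example numbers are hypothetical.

```python
import math


def pnml(genie_probs):
    """pNML assignment and regret.

    genie_probs[y] is the probability that the reference learner, fitted with
    the test label assumed to be y, assigns to that same label y.
    """
    norm = sum(genie_probs)
    q = [p / norm for p in genie_probs]
    regret = math.log(norm)  # log of the normalization factor
    return q, regret


# Illustrative values only: three candidate labels.
genie_probs = [0.9, 0.6, 0.3]
q, regret = pnml(genie_probs)

# The regret log(p_y / q_y) is identical for every label, which is what
# makes the normalized assignment the min-max solution.
per_label_regret = [math.log(p / qy) for p, qy in zip(genie_probs, q)]
```

A hypothesis set that can fit every candidate label well yields a large normalization factor and therefore a large regret, which matches the complexity interpretation given above.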
4 Adversarial pNML
We utilize the pNML learner, which is the min-max regret solution of the individual setting. In the individual setting there is no assumption of a probabilistic connection between the training and test data; therefore, the result holds for the adversarial attack scenario.
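We propose to construct the pNML hypothesis set (equation 4) with a refinement stage. Given a pretrained DNN with parameters $w_0$ and loss function $L$, the refinement stage alters the test sample $x$ by performing a targeted attack toward label $y_i$. Denote $\lambda$ as the refinement strength; the refined sample is

$$x_{refine}(x, y_i) = x - \lambda \cdot \mathrm{sign}\left(\nabla_x L(w_0, x, y_i)\right).$$

This refinement process is repeated for every possible test label. The refined samples are then fed to the pretrained model to compose the hypothesis class

$$P_\Theta = \left\{ p_{w_0}\left(y \,|\, x_{refine}(x, y_i)\right), \; \forall y_i \in \mathcal{Y} \right\},$$

where $\mathcal{Y}$ is the set of possible labels.
Each member in the hypothesis class produces a probability assignment. In the pNML process we take only the probability it gives to the assumed label

$$p_i = p_{w_0}\left(y_i \,|\, x_{refine}(x, y_i)\right).$$

We then normalize the probabilities and return the adversarial pNML probability assignment

$$q_{pNML}(y_i|x) = \frac{p_i}{\sum_{j} p_j}.$$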
Since the refinement is a weak targeted attack, we utilize a pretrained, adversarially trained model to preserve the natural accuracy.
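The procedure of this section (refine towards each label, keep the probability of the assumed label, normalize) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: a linear softmax classifier stands in for the pretrained DNN, the refinement is a single signed-gradient step of strength `lam`, and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    """Class probabilities of a linear softmax model (assumed stand-in for a DNN)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def loss_grad_x(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input: W^T (p - onehot(y))."""
    p = predict(W, b, x)
    p[y] -= 1.0
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def refine(W, b, x, y_i, lam):
    """Weak targeted attack towards label y_i with refinement strength lam."""
    g = loss_grad_x(W, b, x, y_i)
    return [xj - lam * sign(gj) for xj, gj in zip(x, g)]


def adversarial_pnml(W, b, x, lam):
    """Return the normalized assignment over labels and the log-normalizer (regret)."""
    probs = []
    for y_i in range(len(W)):
        x_refined = refine(W, b, x, y_i, lam)
        # keep only the probability given to the assumed label
        probs.append(predict(W, b, x_refined)[y_i])
    norm = sum(probs)
    return [p / norm for p in probs], math.log(norm)


# Toy three-class model and test sample (illustrative values only).
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, 0.2]
q, regret = adversarial_pnml(W, b, x, lam=0.1)
predicted_label = q.index(max(q))
```

The only tunable quantity is `lam`, which mirrors the single hyper-parameter of the scheme.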
4.1 Adversarial subspace interpretation
We analyze the hypothesis class choice using adversarial subspace properties.
Let $x$ be a strong adversarial example with respect to the label $y_{target}$, i.e., the model has a high probability of mistakenly classifying $x$ as $y_{target}$. For the binary classification task there are two members in our suggested hypothesis class: refinement towards the true label $y_{true}$ and refinement towards the adversary target $y_{target}$. There are two mechanisms for strong adversarial examples that cause the refinement towards $y_{true}$ to be stronger than the refinement towards $y_{target}$: convergence to local maxima and refinement overshoot.
Convergence to local maxima. szegedy2013intriguing stated that adversarial examples represent low-probability pockets in the manifold which are hard to find by randomly sampling around the given sample. madry2017towards showed that FGSM often fails to find an adversarial example while PGD with a small step size succeeds. This implies that for some dimensions the local maximum of the loss lies in the interior of the perturbation interval $[-\epsilon, \epsilon]$. This was also confirmed empirically for CIFAR10 by Wong2020Fast. This means that for some dimensions the local maximum of the loss can be viewed as a “hole” in the probability manifold. For those dimensions, refinement towards $y_{target}$ would not increase the probability of the $y_{target}$ hypothesis since $x$ has already converged to the local maximum. On the other hand, refinement towards $y_{true}$ could cause the refined sample to escape the local maximum hole, thus increasing the probability of the $y_{true}$ hypothesis.
Refinement overshoot. The PGD attack is able to converge to strong adversarial points by using multiple iterations with a small step size. This process avoids the main FGSM pitfall: As the perturbation size increases, the gradient direction changes (madry2017towards), causing FGSM to move in the wrong direction and overshoot. For the same reason, the FGSM refinement towards $y_{target}$ might fail to create a strong adversarial example.
The refinement towards $y_{true}$ is more likely to succeed since the volume of the non-adversarial subspace is relatively large, thus a crude FGSM refinement is more likely to move in the right direction. To support that claim we note that the adversarial subspace has a low probability and is less stable compared to the true data subspace (tabacof2016exploring). In other words, while the true hypothesis escapes the adversarial subspace, the target hypothesis can transform a strong PGD adversarial example into a weak FGSM adversarial example.
In the case of multi-class classification there is a third kind of hypothesis: a refinement towards another label $y_i \notin \{y_{true}, y_{target}\}$. This refinement effectively applies a weak targeted attack towards a specific label $y_i$. This hypothesis can be neglected for a strong adversarial input since a weak refinement towards other labels is unlikely to become more probable than refinement towards the target label.
4.2 Toy example
We present an experiment with two-dimensional synthetic data that demonstrates the mechanisms of section 4.1.
Let be the distribution with label and denote as the distribution of the data that corresponds to label 1. iterations of size and . For the adversarial test set, we set to . The refinement strength is 0.6.
Figure 1 shows the refinement process overlaid on the trained model label probability manifold. is the original sample with label , is the test adversarial sample, is the sample generated by refinement towards label , and is generated by refinement towards label .
Figure 1a demonstrates the convergence to local maxima mechanism. The adversarial sample has converged to the maximum probability; therefore, refinement towards the target label does not increase the probability while refinement towards the true label does. As a result, the true hypothesis probability is greater and the true label is predicted. Figure 1b presents the refinement overshoot mechanism. The target hypothesis is refined in the wrong direction while the true hypothesis is refined in the correct direction. This makes the Adversarial pNML prediction more robust to adversarial attacks.
Figure 1: (a) Convergence to local maxima. (b) Refinement overshoot.
In this section, we present experiments that test our proposed Adversarial pNML scheme as a defense for adversarial attack. We evaluate the natural performance (performance on images without perturbation) and adversarial performance on MNIST (lecun2010mnist), CIFAR10 (krizhevsky2014cifar) and ImageNet (deng2009imagenet) datasets. We compare our scheme to recent leading methods.
5.1 Adaptive attack and gradient masking
A main part of the defense evaluation is creating and testing against adaptive adversaries that are aware of the defense mechanism (carlini2019evaluating). This is especially important when the defense causes gradient masking (papernot2017practical), in which gradients are manipulated, thus preventing a gradient-based attack from succeeding. Defense-aware adversaries can overcome this problem by using a black-box attack or by approximating the true gradients (athalye2018obfuscated).
We design a defense-aware adversary for our scheme: We create an end-to-end model that calculates all possible hypotheses in the same computational graph. We note that the end-to-end model causes gradient masking since the refinement sign function sets some of the gradients to zero during the backpropagation phase. We, therefore, attack the end-to-end model with the black-box HSJA method and with PGD using the Backward Pass Differentiable Approximation (BPDA) technique (athalye2018obfuscated), denoted as an adaptive attack.
In BPDA we replace the non-differentiable part with some differentiable approximation on the backward pass. Assuming the refinement is small, one solution is simply to approximate the refinement stage by the identity operator , which leads to . Further discussion on the adaptive attack can be found in the appendix.
5.2 Experimental results
MNIST. We follow the model architecture as described in madry2017towards. We use a model that consists of two convolutional layers with 32 and 64 filters respectively, each followed by max-pooling, and a fully connected layer of size 1024 (training details can be found in the appendix). We set the Adversarial pNML refinement strength to .
For evaluation, we set the attack strength to for all attacks. The PGD attack was configured with 50 steps of size 0.01 and 20 restarts. For the adaptive attack, we used 300 steps of size 0.01 and 20 restarts. For HSJA, we set the number of model queries to per sample, which was shown to be enough queries for convergence (chen2020hopskipjumpattack).
In Table 1 we report the accuracy of our scheme in comparison to the adversarially trained model without our scheme (madry2017towards). We observe that Adversarial pNML improves the robustness by 0.6% without degrading the accuracy on images with no adversarial perturbation (natural accuracy). The adaptive attack is the most effective attack against our defense, which indicates that this kind of attack is efficient. In addition, our scheme improves the accuracy against the black-box attack by 1.5%, as seen in the HSJA column of Table 1.
CIFAR10. We build our scheme upon a pre-trained WideResNet 28-10 architecture (zagoruyko2016wide) trained by carmon2019unlabeled with both labeled and unlabeled data. We set the Adversarial pNML refinement strength to . For evaluation, we set for all attacks. PGD and adaptive attack were configured with 200 steps of size 0.007 and 5 restarts. For HSJA attack, we set the maximal number of model queries to per sample and evaluate the accuracy for samples.
In Table 1 we report the accuracy of our scheme in comparison to other state-of-the-art algorithms. We observe that our method achieves state-of-the-art performance, enhancing the robustness by 3.7% with the best natural accuracy when compared to other defenses. The Adversarial pNML improves the accuracy against black-box attacks by 6.0%, which shows that the robustness boost of our method is not due to masked gradients.
For the FGSM attack, carmon2019unlabeled outperforms our scheme by 0.4%. This result, together with the improvement our method achieves against the PGD attack, demonstrates the convergence to local maxima mechanism as described in section 4.1.
In figure 2a we show the robustness of our scheme against PGD attack for various attack strengths. The results show that our approach is more robust for all attack strengths greater than 0.01. Specifically, the maximal improvement is 8.8% for . For our scheme is less robust by 0.6%. To explain these results, recall that the refinement strength is 0.03. When the attack strength is smaller than the refinement strength, one of the refinement hypotheses could generate adversarial examples stronger than the examples generated by the adversarial attack.
Figure 2: (a) CIFAR10. (b) ImageNet.
ImageNet. We utilize a pre-trained ResNet50 trained by Wong2020Fast using fast adversarial training with . We set the Adversarial pNML refinement strength to . We used a subset of the evaluation set containing 100 labels. PGD and adaptive attack were configured with 50 steps of size and with 10 random restarts. For HSJA, we set the number of model queries to per sample.
In Table 1 we report the accuracy of our scheme in comparison to Wong2020Fast for values of and . For , we evaluated PGD and FGSM attacks using samples and for we used 100 samples (1 sample per label). We observe that robustness is improved by 3% and 5.7% for values of and respectively. The accuracy on natural images is improved by 0.2%. The HSJA attack seems to fail in finding adversarial examples, and that for the adaptive attack is the best.
In figure 2b we explore the robustness of our scheme against adaptive attack. The results show that our scheme is more robust for all , specifically the maximal improvement is 6.4% for .
5.3 Ablation study
The choice of the refinement strength represents a trade-off between the robustness against adversarial attacks and the accuracy on natural samples. This trade-off is explored in figure 4. As the refinement strength increases, so does the robustness to PGD attack, at the price of a small accuracy loss for natural samples. A good choice of the refinement strength would be in the interval . This gives a good balance between natural and adversarial accuracy. In our experiments, we used a small validation set to find a good value.
In Table 4 we explore the overshoot mechanism (section 4.1). We adjust the refinement to become more precise by replacing the FGSM refinement with a PGD refinement, which uses more iterations and a smaller step size. We test our method on CIFAR10 against PGD attack with the same settings described in section 5.2. The results show that as the refinement becomes more precise, the robustness against PGD attack decreases. This demonstrates that the overshoot mechanism improves robustness since PGD refinement, which is less prone to overshoot, has lower robustness. This supports the claim of the instability of the adversarial subspace (tabacof2016exploring), which explains why FGSM refinement towards the true label is more likely to succeed than refinement towards the target label.
5.4 Run-time analysis
Let $K$ be the number of hypotheses, i.e., the number of possible test labels. For each sample, our method performs a forward-pass (FP) followed by $K$ backward-passes (BP) to generate the refined samples. Then $K$ additional FP are made to calculate the prediction. It is possible to calculate these values simultaneously by batching, which reduces the complexity to . To decrease the batch size, we can reduce the number of hypotheses with a minor degradation to robustness by only calculating the hypotheses of the most probable labels. More information is available in the appendix.
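The idea of evaluating only the most probable labels can be sketched as below. This is an illustration under assumptions that the text does not specify: a linear softmax classifier stands in for the DNN, labels outside the top `k` are simply excluded from the normalization, and the names `topk_adversarial_pnml` and `loss_grad_x` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    """Forward pass of a linear softmax model (assumed stand-in for a DNN)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def loss_grad_x(W, p, y):
    """Backward pass for label y, reusing the probabilities p of the first forward pass."""
    residual = list(p)
    residual[y] -= 1.0
    return [sum(residual[k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]


def sign(v):
    return (v > 0) - (v < 0)


def topk_adversarial_pnml(W, b, x, lam, k):
    """Refine only towards the k most probable labels.

    Cost per sample: 1 forward pass, then k backward and k forward passes.
    Assumption: labels outside the top k are dropped from the normalization.
    """
    p = predict(W, b, x)  # first forward pass
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    probs = {}
    for y_i in top:
        g = loss_grad_x(W, p, y_i)  # one backward pass per hypothesis
        x_refined = [xj - lam * sign(gj) for xj, gj in zip(x, g)]
        probs[y_i] = predict(W, b, x_refined)[y_i]  # one forward pass per hypothesis
    norm = sum(probs.values())
    return {y: pr / norm for y, pr in probs.items()}


# Toy three-class model (illustrative values only); keep the two most probable labels.
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b = [0.0, 0.0, 0.0]
q_top2 = topk_adversarial_pnml(W, b, [1.0, 0.2], lam=0.1, k=2)
```

In a batched implementation the `k` refined samples would be stacked and evaluated in a single call, so `k` directly controls the batch size.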
We presented the Adversarial pNML scheme for defending DNNs from adversarial attacks. The theory behind this scheme comes from the individual setting where the relation between the data and labels can be determined by an adversary. Our method is conceptually simple, requires only one hyper-parameter, and is flexible since it allows a trade-off between robustness and natural accuracy. Furthermore, any pretrained model can be easily combined with our scheme to enhance its robustness. We analysed the mechanisms that enable our method to boost the robustness using properties of the adversarial subspace. We showed empirically that our method enhances the robustness against adversarial attacks for the ImageNet, CIFAR10, and MNIST datasets by up to 5.7%, 3.7%, and 0.6%, respectively.
This work suggests several potential directions for future work: The pNML regret, which is the log-loss distance from the reference learner, can form an adversarial attack detector. In addition, we would like to explore other hypothesis classes, such as the entire model parameter class, where the model weights are changed according to the different hypotheses.
Appendix A Training parameters
We now detail the training parameters and architecture used to train the different models.
For both the standard model and madry2017towards model we used a network that consists of two convolutional layers with 32 and 64 filters respectively, each followed by max-pooling, and a fully connected layer of size 1024.
We trained the standard model for 100 epochs with natural training set. We used SGD with a learning rate of, a momentum value of 0.9, a weight decay of 0.0001, and a batch size of 50.
We trained the madry2017towards model for 106 epochs with an adversarial training set that was produced by a PGD-based attack on the natural training set with 40 steps of size 0.01 with a maximal value of 0.3. We used SGD with a learning rate of , a momentum value of 0.9, and a weight decay of 0.0001. For the last 6 epochs, we used adversarial training with the adaptive attack instead of PGD. We set the Adversarial pNML refinement strength to .
We used the WideResNet 28-10 architecture (zagoruyko2016wide) for the standard model. We trained the standard model over 204 epochs using the SGD optimizer with a batch size of 128 and a learning rate of 0.001, reducing it to 0.0001 and 0.00001 after 100 and 150 epochs respectively. We also used a momentum value of 0.9 and a weight decay of 0.0002.
For the standard model we used a pre-trained ResNet50 (he2016deep). Similarly to the other models, we adjust the standard model to only output the first 100 logits.
Appendix B Adaptive attack
In this section we discuss alternative approximations for the adaptive attack. Figure 5 presents the end-to-end model. We denote as an input that belongs to label , is the model parameters, and is the model loss w.r.t. a specific label where . is the refinement result for the -th hypothesis and is the probability of the corresponding hypothesis. is the loss for the first hypothesis. The adaptive adversary manipulates the input by taking steps in the direction of the gradients .
Recall that our adaptive attack approximates the refinement stage with an identity operator on the backward pass which leads to (see section 5.1). This approach, in effect, disregards anything that comes before the sign operator during backpropagation. An alternative approach is to use some kind of differentiable function to approximate the sign operator and backpropagate through the entire computational graph. The first obstacle is to find a differentiable function that approximates the sign operator well. The first option that comes to mind is to use a function, but since the input values are distributed across a wide range, the causes a vanishing gradients effect, which misses the goal of this approximation.
Another approach is to disregard the operator on the backward pass, i.e., backpropagate the gradients without changing them. We examine this case:
Note that equation 15 is dependent on the Hessian matrix of the loss w.r.t. . Computing this value is computationally hard for DNNs and is usually outside the scope of adversarial robustness tests, in which only first-order adversaries are considered (madry2017towards). This emphasizes that the only viable gradient-based adaptive attack is the one used in our paper.
Appendix C Additional CIFAR10 results
We now provide additional results that support the claim that our scheme does indeed enhance robustness.
HSJA for different values.
In figure 6 we demonstrate that our scheme is more robust against black-box HSJA for various values. Specifically, the maximal improvement is 49.1% for . We note that in comparison to the white-box attack, HSJA is much less efficient against our scheme for large values. Nevertheless, the improvement of our scheme against black-box attack supports the claim that its robustness enhancement is not only the result of masked gradients.
Appendix D Run-time analysis
Let $K$ be the number of hypotheses, i.e., the number of possible test labels. For each sample, our method performs a forward-pass (FP) followed by $K$ backward-passes (BP) to generate the refined samples. Then $K$ additional FP are made to calculate the prediction. It is possible to calculate these values simultaneously by batching, which reduces the complexity to . To decrease the batch size, we can reduce the number of hypotheses with a minor degradation to robustness by only calculating the hypotheses of the most probable labels (which we know after the first FP), as demonstrated in Figure 7.