PARL: Enhancing Diversity of Ensemble Networks to Resist Adversarial Attacks via Pairwise Adversarially Robust Loss Function

12/09/2021
by   Manaar Alam, et al.
IIT Kharagpur

The security of Deep Learning classifiers is a critical field of study because of the existence of adversarial attacks. Such attacks usually rely on the principle of transferability, where an adversarial example crafted on a surrogate classifier tends to mislead a target classifier trained on the same dataset even if the two classifiers have quite different architectures. Ensemble methods against adversarial attacks demonstrate that an adversarial example is less likely to mislead multiple classifiers in an ensemble having diverse decision boundaries. However, recent ensemble methods have either been shown to be vulnerable to stronger adversaries or shown to lack an end-to-end evaluation. This paper attempts to develop a new ensemble methodology that constructs multiple diverse classifiers using a Pairwise Adversarially Robust Loss (PARL) function during the training procedure. PARL utilizes gradients of each layer with respect to the input in every classifier within the ensemble simultaneously. The proposed training procedure enables PARL to achieve higher robustness against black-box transfer attacks than previous ensemble methods without adversely affecting the accuracy on clean examples. We also evaluate the robustness in the presence of white-box attacks, where adversarial examples are crafted using the parameters of the target classifier. We present extensive experiments on the standard image classification datasets CIFAR-10 and CIFAR-100 with the standard ResNet20 classifier against state-of-the-art adversarial attacks to demonstrate the robustness of the proposed ensemble methodology.


1 Introduction

Deep learning (DL) algorithms have seen rapid growth in recent years because of their unprecedented successes with near-human accuracy in a wide variety of challenging tasks, ranging from image classification (Szegedy et al., 2016) and speech recognition (Amodei et al., 2016) to natural language processing (Wu et al., 2016) and self-driving cars (Bojarski et al., 2016). While DL algorithms are extremely efficient in solving complicated decision-making tasks, they are vulnerable to well-crafted adversarial examples (slightly perturbed valid inputs with visually imperceptible noise) (Szegedy et al., 2014). The widely studied phenomenon of adversarial examples has produced numerous attack methodologies of varied complexity and effective deception strategies (Goodfellow et al., 2015; Kurakin et al., 2017; Madry et al., 2018; Moosavi-Dezfooli et al., 2016; Papernot et al., 2017). An extensive range of defenses against such attacks has been proposed in the literature, which generally fall into two categories. The first category enhances the training strategy of deep learning models to make them less vulnerable to adversarial examples, either by training the models with different degrees of adversarially perturbed training data (Bastani et al., 2016; Huang et al., 2015; Jin et al., 2015; Zheng et al., 2016) or by changing the training procedure, e.g., gradient masking and defensive distillation (Gu and Rigazio, 2015; Papernot et al., 2016; Rozsa et al., 2016; Shaham et al., 2015). However, developing such defenses has been shown to be extremely challenging. Carlini and Wagner (2017b) demonstrated that these defenses do not generalize to all varieties of adversarial attacks but are constrained to specific categories. The process of training with adversarially perturbed data is hard, often requires models with large capacity, and suffers from a significant loss in clean example accuracy. Moreover, Athalye et al. (2018) demonstrated that such changes in training procedures provide a false sense of security. The second category intends to detect adversarial examples by simply flagging them (Bhagoji et al., 2017; Feinman et al., 2017; Gong et al., 2017; Grosse et al., 2017; Metzen et al., 2017; Hendrycks and Gimpel, 2017; Li and Li, 2017). However, even the detection of adversarial examples can be quite a complicated task. Carlini and Wagner (2017a) illustrated with several experiments that these detection techniques can be efficiently bypassed by a strong adversary having partial or complete knowledge of the internal working procedure.

While all the approaches mentioned above deal with standalone classifiers, in this paper we utilize an ensemble of classifiers instead of a single standalone classifier to resist adversarial attacks. The notion of using diverse ensembles to increase the robustness of a classifier against adversarial examples has recently been explored in the research community. The primary motivation for an ensemble-based defense is that if multiple classifiers with similar decision boundaries perform the same task, the transferability property of DL classifiers makes it easy for an adversary to mislead all the classifiers simultaneously using adversarial examples crafted on any one of them; however, it is difficult for an adversary to mislead multiple classifiers simultaneously if they have diverse decision boundaries. Strauss et al. (2017) used various ad-hoc techniques, such as different random initializations, different neural network structures, bagging the input data, and adding Gaussian noise during training, to create multiple diverse classifiers forming an ensemble. The resulting ensemble increases the robustness of the classification task even in the presence of adversarial examples. Tramèr et al. (2018) proposed Ensemble Adversarial Training, which incorporates perturbed inputs transferred from other pre-trained models during adversarial training to decouple adversarial example generation from the parameters of the primary model. Grefenstette et al. (2018) demonstrated that ensembling two models and then adversarially training them incorporates more robustness into the classification task than single-model adversarial training or an ensemble of two separately adversarially trained models. Kariyappa and Qureshi (2019) proposed Diversity Training of an ensemble of models with uncorrelated loss functions using a Gradient Alignment Loss metric to reduce the dimension of the adversarial sub-space shared between different models and increase the robustness of the classification task. Pang et al. (2019) proposed an Adaptive Diversity Promoting regularizer to train an ensemble of classifiers that encourages the non-maximal predictions of each member in the ensemble to be mutually orthogonal, which degenerates the transferability and thereby helps resist adversarial examples. Yang et al. (2020) proposed a methodology that isolates the adversarial vulnerability in each sub-model of an ensemble by distilling non-robust features. While all these works attempt to enhance the robustness of a classification task in the presence of adversarial examples, Adam et al. (2018) proposed a stochastic method that adds Variational Autoencoders between layers as a noise-removal operator, creating combinatorial ensembles to limit transferability and detect adversarial examples.

In order to enhance the robustness of a classification task and/or to detect adversarial examples, the ensemble-based approaches mentioned above employ different strategies while training the models. These ensembles either lack end-to-end evaluations for complicated datasets or lack evaluation against more aggressive attack scenarios like the methods discussed by Dong et al. (2018) and Liu et al. (2017), which demonstrate that adversarial examples misleading multiple models in an ensemble tend to be more transferable. In this work, our primary objective is to propose a systematic approach to enhance the classification robustness of an ensemble of classifiers against adversarial examples by developing diversity in the decision boundaries among all the classifiers within that ensemble. The diversity is obtained by simultaneously considering the mutual dissimilarity in gradients of each layer with respect to input in every classifier while training them. The diversity among the classifiers trained in such a way helps to degenerate the transferability of adversarial examples within the ensemble.

Motivation behind the Proposed Approach and Contribution: The intuition behind the proposed ensemble methodology is illustrated in Figure 1, which shows a case study using classifiers trained on the CIFAR-10 dataset without loss of generality. Figure 1(a) shows an input image of a ‘frog’. Figure 1(b) shows the gradient of the loss with respect to the input (the fundamental operation behind the creation of almost all adversarial examples) for a classifier $\mathcal{M}_a$, denoted as $\nabla_a$. Figure 1(c) shows the gradient, denoted as $\nabla_b$, for another classifier $\mathcal{M}_b$ with a decision boundary similar to that of $\mathcal{M}_a$; $\mathcal{M}_b$ is trained using the same parameter settings as $\mathcal{M}_a$ but with a different random initialization. Figure 1(d) shows the gradient, denoted as $\nabla_c$, for a classifier $\mathcal{M}_c$ whose decision boundary is not so similar to that of $\mathcal{M}_a$. The classifiers $\mathcal{M}_a$, $\mathcal{M}_b$, and $\mathcal{M}_c$ have similar classification accuracies; the method for obtaining such classifiers is discussed later in this paper. Figure 1(e) shows the relative symbolic directions of all the aforementioned gradients in the high-dimensional input space, where the direction between a pair of gradients is computed using cosine similarity. We can observe that $\nabla_a$ and $\nabla_b$ point in almost the same direction, aiding adversarial examples crafted on $\mathcal{M}_a$ to transfer to $\mathcal{M}_b$. However, adversarial examples crafted on $\mathcal{M}_a$ will be difficult to transfer to $\mathcal{M}_c$, as the directions of $\nabla_a$ and $\nabla_c$ are significantly different.

Figure 1: (a) Input image; (b) $\nabla_a$: gradient of the loss in the primary classifier $\mathcal{M}_a$; (c) $\nabla_b$: gradient of the loss in another classifier $\mathcal{M}_b$ with a similar decision boundary; (d) $\nabla_c$: gradient of the loss in a classifier $\mathcal{M}_c$ with a dissimilar decision boundary but comparable accuracy; (e) Symbolic directions of all the gradients in the high-dimensional space. The gradients are computed with respect to the image shown in (a).

The principal motivation behind the proposed methodology is to introduce a constraint that reduces the cosine similarity of gradients among the classifiers in an ensemble while training them simultaneously. Such a learning strategy with mutual cooperation intends to ensure that the gradients between each pair of classifiers in the ensemble are as dissimilar as possible. We make the following contributions using the proposed ensemble training method:

  • We propose a methodology to increase diversity in the decision boundaries among all the classifiers within an ensemble to degrade the transferability of adversarial examples.

  • We propose a Pairwise Adversarially Robust Loss (PARL) function that utilizes the gradients of each layer with respect to the input in every classifier within the ensemble, training all of them simultaneously to produce such diverse decision boundaries.

  • The proposed method can significantly improve the overall robustness of the ensemble against black-box transfer attacks without substantially impacting the clean example accuracy.

  • We evaluate the robustness of PARL with extensive experiments using two standard image classification benchmark datasets on the ResNet20 architecture against state-of-the-art adversarial attacks.

2 Threat Model

We consider the following two threat models in this paper while generating adversarial examples.

  • Zero Knowledge Adversary: The adversary does not have access to the target ensemble $\mathcal{E}_T$ but has access to a surrogate ensemble $\mathcal{E}_S$ trained with the same dataset. We term this a black-box adversary. The adversary crafts adversarial examples on $\mathcal{E}_S$ and transfers them to $\mathcal{E}_T$.

  • Perfect Knowledge Adversary: The adversary is stronger than the zero knowledge adversary and has access to the target ensemble $\mathcal{E}_T$. We term this a white-box adversary. The adversary can generate adversarial examples on $\mathcal{E}_T$ knowing the parameters used by all the networks within $\mathcal{E}_T$.

3 Overview of the Proposed Methodology

In this section, we provide a brief description of the proposed methodology used in this paper to enhance classification robustness against adversarial examples using an ensemble of classifiers $\mathcal{E}$. The ensemble consists of $N$ neural network classifiers and is denoted as $\mathcal{E} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_N\}$. All the $\mathcal{M}_k$'s are trained simultaneously using the Pairwise Adversarially Robust Loss (PARL) function, which we discuss in detail in Section 4. The final decision for an input image on $\mathcal{E}$ is based on majority voting among all the classifiers. Formally, let us assume a test set of inputs $\{x_1, x_2, \ldots\}$ with respective ground truth labels $\{y_1, y_2, \ldots\}$. The final decision of the ensemble for an input $x_i$ is defined as

$$\mathcal{E}(x_i) = \operatorname{majority}\{\mathcal{M}_1(x_i), \mathcal{M}_2(x_i), \ldots, \mathcal{M}_N(x_i)\} = y_i$$

for most $x_i$'s in an appropriately trained $\mathcal{E}$. The primary argument behind the proposed ensemble method is that all $\mathcal{M}_k$'s have dissimilar decision boundaries but not significantly different accuracies. Hence, a clean example classified as class $c$ by $\mathcal{M}_i$ will also be classified as $c$ by most other $\mathcal{M}_j$'s (where $j \neq i$) within the ensemble with a high probability. Consequently, because of the diversity in decision boundaries between $\mathcal{M}_i$ and $\mathcal{M}_j$ (for every pair $i \neq j$), the adversarial examples generated by a zero knowledge adversary on a surrogate ensemble will have a different impact on each classifier within the target ensemble $\mathcal{E}$, i.e., the transferability of adversarial examples within the ensemble will be challenging. A perfect knowledge adversary can generate adversarial examples directly for the ensemble $\mathcal{E}$. However, in this scenario, the input image perturbations pull in different directions because of the diversity in decision boundaries among all the $\mathcal{M}_k$'s within the ensemble, and this collective disparity in perturbation directions makes it challenging to craft adversarial examples for the ensemble. We evaluate our proposed methodology considering both adversaries and present the results in Section 5.
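To make the voting rule concrete, the following minimal PyTorch sketch (our illustration, not the authors' code; the helper name ensemble_predict is an assumption) returns the majority-vote class for a batch of inputs:

```python
# Minimal majority-voting sketch for an ensemble of classifiers (illustrative only).
import torch

def ensemble_predict(models, x):
    # Each member votes with its predicted class; the ensemble returns the
    # most frequent class per input (ties resolved by torch.mode).
    votes = torch.stack([m(x).argmax(dim=1) for m in models])  # (N, batch)
    return votes.mode(dim=0).values                            # (batch,)
```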

4 Building The Ensemble Networks using PARL

In this section, we provide a detailed discussion on training an ensemble of neural networks using the proposed Pairwise Adversarially Robust Loss (PARL) function for increasing diversity among the networks. First, we define the basic terminologies used in the construction, followed by a detailed methodology of the training procedure.

4.1 Basic Terminologies used in the Construction

Let us consider an ensemble $\mathcal{E} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_N\}$, where $\mathcal{M}_k$ is the $k^{th}$ network in the ensemble. We assume that each of these networks has the same architecture with $L$ hidden layers. Let $\mathcal{L}_k$ be the loss function evaluating the amount of loss incurred by the network $\mathcal{M}_k$ for a data point $(x, y)$, where $y$ is the ground-truth label for $x$. Let $h_l^k(x)$ be the output of the $l^{th}$ hidden layer on the network $\mathcal{M}_k$ for the data point $x$, and let us assume $h_l^k$ has $m_l$ output features. Let $G_l^k(x)$ denote the sum of gradients over each output feature of the $l^{th}$ hidden layer with respect to the input on the network $\mathcal{M}_k$ for the data point $x$. Hence,

$$G_l^k(x) = \sum_{j=1}^{m_l} \nabla_x h_{l,j}^k(x)$$

where $\nabla_x h_{l,j}^k(x)$ is the gradient of the $j^{th}$ output feature of the $l^{th}$ hidden layer on network $\mathcal{M}_k$ with respect to the input for the data point $x$. Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the training dataset containing $n$ examples.
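As an illustration of this quantity, the following PyTorch sketch computes $G_l^k(x)$ for a network whose first few layers can be iterated in order (an assumption we make for simplicity; residual architectures need the appropriate forward logic):

```python
# Sketch: sum of gradients of a hidden layer's output features w.r.t. the input.
import torch

def summed_feature_gradient(layers, x, l):
    """Gradient of the summed outputs of hidden layer l with respect to input x."""
    x = x.clone().requires_grad_(True)
    h = x
    for layer in layers[: l + 1]:      # forward pass up to (and including) layer l
        h = layer(h)
    # Summing the features and differentiating once gives the sum of per-feature
    # gradients, avoiding one backward pass per output feature.
    grad, = torch.autograd.grad(h.sum(), x, create_graph=True)
    return grad
```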

4.2 Pairwise Adversarially Robust Loss Function

The principal idea behind the proposed approach is to train an ensemble of neural networks such that the gradients of loss with respect to input in all the networks will be in different directions. The gradients represent the directions in which the input needs to be perturbed such that the loss of the network increases, helping to transfer adversarial examples. In this paper, we introduce the Pairwise Adversarially Robust Loss (PARL) function, which we will use to train the ensemble. The objective of PARL is to train the ensemble so that the gradients of loss lead to different directions in different networks for the same input example. Hence, the fundamental strategy is to make the gradients as dissimilar as possible while training all the networks. Since the gradient computation depends on all intermediate parameters of a network, we force the intermediate layers of all the networks within the ensemble to be dissimilar for producing enhanced diversity at each layer.

The pairwise similarity of gradients of the output of the $l^{th}$ hidden layer with respect to the input between the classifiers $\mathcal{M}_a$ and $\mathcal{M}_b$ for a particular data point $x$ can be represented as

$$CS_l^{(a,b)}(x) = \frac{\langle G_l^a(x), G_l^b(x) \rangle}{\|G_l^a(x)\| \, \|G_l^b(x)\|}$$

where $\langle \cdot, \cdot \rangle$ represents the dot product between two vectors and $\|\cdot\|$ the Euclidean norm. The overall pairwise similarity between the classifiers $\mathcal{M}_a$ and $\mathcal{M}_b$ for a particular data point $x$ considering $L'$ hidden layers can be written as

$$CS^{(a,b)}(x) = \sum_{l=1}^{L'} CS_l^{(a,b)}(x)$$

Next, we define a penalty term over all the training examples in $D$ to pairwise train the models $\mathcal{M}_a$ and $\mathcal{M}_b$ as

$$P^{(a,b)} = \frac{1}{n} \sum_{i=1}^{n} CS^{(a,b)}(x_i)$$

We can observe that $P^{(a,b)}$ computes the average pairwise similarity over all the training examples. Now, for networks $\mathcal{M}_a$ and $\mathcal{M}_b$, if all the gradients with respect to the input for each training example are in the same direction, the value of $P^{(a,b)}$ will be close to its maximum, indicating similarity in decision boundaries. The value of $P^{(a,b)}$ gradually decreases as the relative angle between the pair of gradients increases in the higher-dimensional space. Hence, the objective of diversity training using PARL is to reduce the value of $P^{(a,b)}$. Thus, we add $P^{(a,b)}$ to the loss function as a penalty term to penalize the training for a large value.

In the ensemble $\mathcal{E}$, we compute the $P^{(a,b)}$ values for each distinct pair $(\mathcal{M}_a, \mathcal{M}_b)$ in order to enforce diversity between each pair of classifiers. We define PARL to train the ensemble as

$$\mathcal{L}_{PARL} = \alpha \sum_{k=1}^{N} \mathcal{L}_k + \beta \sum_{a=1}^{N-1} \sum_{b=a+1}^{N} P^{(a,b)} \qquad (1)$$

where $\alpha$ and $\beta$ are hyperparameters controlling the accuracy-robustness trade-off. A higher value of $\alpha$ and a lower value of $\beta$ help to learn models with good accuracy but less robustness against adversarial attacks. Conversely, a lower value of $\alpha$ and a higher value of $\beta$ make the models more robust against adversarial attacks but with a compromise in overall accuracy.

One may note that the inclusion of the penalty values for each distinct pair of classifiers within the ensemble in the PARL computation has one fundamental advantage. If we do not include a pair $(\mathcal{M}_a, \mathcal{M}_b)$ in the PARL computation, the training will continue without any diversity restriction between $\mathcal{M}_a$ and $\mathcal{M}_b$. Consequently, $\mathcal{M}_a$ and $\mathcal{M}_b$ will produce similar decision boundaries, thereby increasing the likelihood of adversarial transferability between them and affecting the robustness of the ensemble $\mathcal{E}$. One may also note that the number of gradient computations in an efficient implementation of PARL is linearly proportional to the number of classifiers in the ensemble: the gradients for each classifier are computed once and reused to compute the $P^{(a,b)}$ values for each pair of classifiers. This reuse protects the implementation from the quadratic growth in gradient computations that a naive pairwise implementation would otherwise incur.
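A minimal PyTorch sketch of the PARL objective in Equation (1) is given below. It is our illustration under simplifying assumptions (member networks exposed as an iterable of layers, which does not hold verbatim for ResNet20; the names parl_loss, layer_gradients, alpha, and beta are ours), not the authors' reference implementation.

```python
# Sketch of the PARL objective: classification loss plus pairwise cosine
# similarity of layer-wise input gradients (Equation (1)); illustrative only.
import torch
import torch.nn.functional as F

def layer_gradients(layers, x, num_layers):
    """Per-layer gradients of the summed layer outputs w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    grads, h = [], x
    for layer in layers[:num_layers]:
        h = layer(h)
        g, = torch.autograd.grad(h.sum(), x, retain_graph=True, create_graph=True)
        grads.append(g.flatten(1))                      # (batch, input_dim)
    return grads

def parl_loss(models, x, y, num_layers=7, alpha=1.0, beta=0.5):
    # Classification term: cross-entropy of every ensemble member.
    ce = sum(F.cross_entropy(m(x), y) for m in models)
    # Gradients are computed once per model and reused for every pair.
    grads = [layer_gradients(list(m.children()), x, num_layers) for m in models]
    penalty = 0.0
    for a in range(len(models)):
        for b in range(a + 1, len(models)):
            for ga, gb in zip(grads[a], grads[b]):
                penalty = penalty + F.cosine_similarity(ga, gb, dim=1).mean()
    return alpha * ce + beta * penalty
```

Computing the member gradients once and reusing them across all pairs keeps the gradient cost linear in the number of classifiers, matching the argument above.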

5 Experimental Evaluation

5.1 Evaluation Configurations

We consider an ensemble of three standard ResNet20 (He et al., 2016) classifiers for all the ensembles used in this paper. We consider two standard image classification datasets for our evaluation, namely CIFAR-10 (Krizhevsky et al., 2009) and CIFAR-100 (Krizhevsky et al., 2009). We consider two scenarios for the evaluation:

  • Unprotected Ensemble: A baseline ensemble of ResNet20 architectures without any countermeasure against adversarial attacks, denoted as $\mathcal{E}_{baseline}$.

  • Protected Ensemble: An ensemble of ResNet20 architectures trained with a countermeasure built into the ensemble design. In our evaluation we consider three previously proposed countermeasures to compare against the performance of PARL. We denote by $\mathcal{E}_{ADP}$, $\mathcal{E}_{GAL}$, and $\mathcal{E}_{DVERGE}$ the ensembles trained with the methods proposed by Pang et al. (2019), Kariyappa and Qureshi (2019), and Yang et al. (2020), respectively. The ensemble trained with our proposed method is denoted as $\mathcal{E}_{PARL}$.

We use the Adam optimizer (Kingma and Ba, 2015) to train all the ensembles with an adaptive learning rate. We dynamically generate an augmented dataset using random shifts, flips, and crops to train on both CIFAR-10 and CIFAR-100. We use the default hyperparameter settings mentioned in the respective papers for $\mathcal{E}_{ADP}$, $\mathcal{E}_{GAL}$, and $\mathcal{E}_{DVERGE}$ (one of these baselines was implemented following the approach described in its paper, whereas the official GitHub repositories were adopted for the other two). We use the categorical cross-entropy loss for $\mathcal{L}_k$ (ref. Equation (1)) for $\mathcal{E}_{PARL}$. We enforce diversity among all the classifiers in $\mathcal{E}_{PARL}$ for the first seven convolution layers, as we observed that PARL performs better than previously proposed approaches against adversarial attacks, with high accuracy on clean examples, when diversity is enforced in these layers. We present an ablation study varying the number of layers utilized for diversity training with PARL in Section 5.4.

We consider four state-of-the-art untargeted adversarial attacks for crafting adversarial examples: the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), the Basic Iterative Method (BIM) (Kurakin et al., 2017), the Momentum Iterative Method (MIM) (Dong et al., 2018), and Projected Gradient Descent (PGD) (Madry et al., 2018). A brief overview of crafting adversarial examples using these attack methodologies is provided in Appendix A. We use 50 steps for generating adversarial examples with the iterative methods BIM, MIM, and PGD, with the step size set in proportion to the attack strength $\epsilon$, and a fixed momentum decay factor for MIM. We use multiple random restarts for PGD to generate multiple instances of adversarial examples for an impartial evaluation, and we generate adversarial examples over a range of attack strengths $\epsilon$. We use the CleverHans v2.1.0 library (Papernot et al., 2018) to generate all the adversarial examples.

In order to evaluate against a stronger adversarial setting for the zero knowledge adversary, we train a black-box surrogate ensemble and generate adversarial examples from the surrogate ensemble instead of a standalone classifier. As mentioned previously, adversarial examples that mislead multiple models in an ensemble tend to be more transferable (Dong et al., 2018; Liu et al., 2017). All the results in the subsequent discussions are reported as the average over three independent runs.
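For concreteness, the following sketch (our illustration; the single-step attack on the surrogate's averaged loss, the [0, 1] input range, and the helper names are assumptions, not the paper's evaluation script) shows how black-box transfer accuracy can be measured: craft examples on the surrogate ensemble, then score the target ensemble by majority vote.

```python
# Sketch of a black-box transfer evaluation: attack the surrogate ensemble's
# averaged loss (here with single-step FGSM) and score the target ensemble.
import torch
import torch.nn.functional as F

def transfer_attack_accuracy(surrogate, target, loader, eps):
    correct, total = 0, 0
    for x, y in loader:
        x_adv = x.clone().detach().requires_grad_(True)
        loss = sum(F.cross_entropy(m(x_adv), y) for m in surrogate) / len(surrogate)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()   # craft on surrogate
        with torch.no_grad():
            votes = torch.stack([m(x_adv).argmax(dim=1) for m in target])
        correct += (votes.mode(dim=0).values == y).sum().item()    # majority vote
        total += y.numel()
    return correct / total
```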

5.2 Analysing the Diversity

The primary objective of PARL is to increase the diversity among all the classifiers within an ensemble. In order to analyze the diversity of the classifiers trained using PARL, we use the linear Centered Kernel Alignment (CKA) analysis proposed by Kornblith et al. (2019). The CKA metric, which lies between 0 and 1, measures the similarity between the decision boundaries represented by a pair of neural networks. A higher CKA value between two neural networks indicates a significant similarity in decision boundary representations, which implies good transferability of adversarial examples. We present an analysis of layer-wise CKA values for each pair of classifiers within the ensembles $\mathcal{E}_{baseline}$ and $\mathcal{E}_{PARL}$ trained with CIFAR-10 and CIFAR-100 in Figure 2 to show the effect of PARL on diversity.

Figure 2: Layer-wise linear CKA values between each pair of classifiers in $\mathcal{E}_{baseline}$ and $\mathcal{E}_{PARL}$ trained with the CIFAR-10 [(a), (b), (c)] and CIFAR-100 [(d), (e), (f)] datasets, showing the similarities at each layer. The values inside braces in the corresponding figure legends represent the overall average linear CKA values between each pair of classifiers.

We can observe that each pair of models in $\mathcal{E}_{baseline}$ shows a significant similarity at every layer. However, since $\mathcal{E}_{PARL}$ is trained by restricting the first seven convolution layers, we can observe a considerable decline in the CKA values at the initial layers. The observation is expected, as PARL imposes layer-wise diversity in its formulation, as discussed in Section 4. The overall average linear CKA values between each pair of models in Figure 2 are mentioned inside braces within the corresponding figure legends; they show that the classifiers within an ensemble trained using PARL exhibit a higher overall dissimilarity than those of the unprotected baseline ensemble. In the subsequent discussions, we analyze the effect of the observed diversity on the performance of $\mathcal{E}_{PARL}$ against adversarial examples.
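For reference, linear CKA between two activation matrices can be computed as follows (a minimal sketch following Kornblith et al. (2019); the function name is ours):

```python
# Linear CKA between activation matrices X and Y of shape (num_examples, features),
# collected from the same inputs on two networks.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0, keepdims=True)           # center the features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord='fro') ** 2  # ||Y^T X||_F^2
    return hsic / (np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro'))
```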

5.3 Robustness Evaluation of PARL

Performance in the presence of a zero knowledge adversary: We evaluate the robustness of all the ensembles discussed in Section 5.1 on both CIFAR-10 and CIFAR-100, where the attacker cannot access the model parameters and relies on surrogate models to generate transferable adversarial examples. Under such a black-box scenario, we use one hold-out ensemble with three ResNet20 models as the surrogate model. We randomly select 1000 test samples and evaluate the performance of black-box transfer attacks for all the ensembles across a wide range of attack strengths $\epsilon$. We present the results for a representative attack strength in Table 1 along with clean example accuracies. The methods presented by Kariyappa and Qureshi (2019) and Yang et al. (2020) neither evaluated their approaches on the more complicated CIFAR-100 dataset nor discussed optimal hyperparameter settings for their algorithms that would allow such an evaluation. In our robustness evaluations, we therefore consider all the ensembles discussed in Section 5.1 for CIFAR-10, whereas we only consider $\mathcal{E}_{baseline}$ and $\mathcal{E}_{ADP}$ when comparing the performance of $\mathcal{E}_{PARL}$ on CIFAR-100. We can observe from Table 1 that training using PARL does not adversely affect the ensemble accuracy on clean examples compared to the other previously proposed methodologies, while providing better robustness against black-box transfer attacks in almost every scenario.

Ensemble | Dataset | Clean Example | FGSM | BIM | MIM | PGD
$\mathcal{E}_{baseline}$ | CIFAR-10 | 93.11 | 21.87 | 7.47 | 7.27 | 1.57
$\mathcal{E}_{baseline}$ | CIFAR-100 | 70.88 | 7.13 | 7.17 | 3.93 | 5.39
$\mathcal{E}_{ADP}$ | CIFAR-10 | 92.99 | 23.53 | 8.13 | 7.53 | 3.08
$\mathcal{E}_{ADP}$ | CIFAR-100 | 70.01 | 8.23 | 10.07 | 5.83 | 8.72
$\mathcal{E}_{GAL}$ | CIFAR-10 | 91.22 | 22.8 | 9.3 | 8.33 | 7.64
$\mathcal{E}_{GAL}$ | CIFAR-100 | * | * | * | * | *
$\mathcal{E}_{DVERGE}$ | CIFAR-10 | 91.73 | 28.88 | 11.78 | 10.55 | 9.68
$\mathcal{E}_{DVERGE}$ | CIFAR-100 | * | * | * | * | *
$\mathcal{E}_{PARL}$ | CIFAR-10 | 91.09 | 28.37 | 20.8 | 16.17 | 15.65
$\mathcal{E}_{PARL}$ | CIFAR-100 | 67.52 | 12.73 | 18.97 | 10.01 | 21.49
  • * Did not consider the CIFAR-100 dataset for evaluation

Table 1: Ensemble classification accuracy (%) for CIFAR-10 and CIFAR-100 on clean examples as well as adversarially perturbed images with a fixed attack strength for different adversarial attacks.

A more detailed performance evaluation considering all the attack strengths is presented in Figure 3. For the CIFAR-10 evaluation (Figure 3(a) - Figure 3(d)), we can observe that $\mathcal{E}_{PARL}$ performs on par with the best-performing previous method for FGSM. However, for stronger iterative attacks like BIM, MIM, and PGD, $\mathcal{E}_{PARL}$ outperforms the other methodologies with higher accuracy at large attack strengths. For the CIFAR-100 evaluation (Figure 3(e) - Figure 3(h)), we can observe that $\mathcal{E}_{PARL}$ performs better than both $\mathcal{E}_{baseline}$ and $\mathcal{E}_{ADP}$ in all scenarios.

Figure 3: Ensemble classification accuracy (%) vs. attack strength ($\epsilon$) against different black-box transfer attacks generated from the surrogate ensemble for the CIFAR-10 [(a) FGSM, (b) BIM, (c) MIM, (d) PGD] and CIFAR-100 [(e) FGSM, (f) BIM, (g) MIM, (h) PGD] datasets.

Performance in the presence of a perfect knowledge adversary: Next, we evaluate the robustness of the ensembles when the attacker has complete access to the model parameters. Under such a white-box scenario, we craft adversarial examples from the target ensemble itself. We consider the same attack methodologies and settings as discussed in Section 5.1. We randomly select 1000 test samples and evaluate white-box attacks for all the ensembles across a wide range of attack strengths $\epsilon$. We present the results for CIFAR-10 and CIFAR-100 in Figure 4. For the CIFAR-10 evaluation (Figure 4(a) - Figure 4(d)), we can observe that $\mathcal{E}_{PARL}$ performs marginally better than the other ensembles. For the CIFAR-100 evaluation (Figure 4(e) - Figure 4(h)), we can observe that $\mathcal{E}_{PARL}$ performs better than both $\mathcal{E}_{baseline}$ and $\mathcal{E}_{ADP}$ in all scenarios. Although PARL achieves the highest robustness among all the previous ensemble methods for black-box transfer attacks, its robustness against white-box attacks is still quite low. This result is expected, as the objective of PARL is to increase the diversity of an ensemble against adversarial vulnerability rather than eliminate it entirely. In other words, adversarial vulnerability inevitably exists within the ensemble and can be exploited by attacks with white-box access. One straightforward way to improve the robustness of the ensembles against such attacks is to augment PARL with adversarial training (Madry et al., 2018), which we consider a promising future research direction.

Figure 4: Ensemble classification accuracy (%) vs. attack strength ($\epsilon$) against different white-box attacks for the CIFAR-10 [(a) FGSM, (b) BIM, (c) MIM, (d) PGD] and CIFAR-100 [(e) FGSM, (f) BIM, (g) MIM, (h) PGD] datasets.

5.4 Ablation Study

In all the previous evaluations, we consider $\mathcal{E}_{PARL}$ trained by enforcing diversity in the first seven convolution layers of all the classifiers during ensemble training using PARL. In this section, we provide an ablation study by varying the number of convolution layers considered for the diversity training. We consider three ensembles for this study, where $\mathcal{E}_{PARL/l}$ denotes that the first $l$ convolution layers are used for enforcing diversity: $\mathcal{E}_{PARL/7}$ (the ensemble used in the previous evaluations) and two variants that restrict fewer of the initial convolution layers. We consider the same evaluation configurations as discussed in Section 5.1. The accuracies of all the ensembles on clean examples for both CIFAR-10 and CIFAR-100 are listed in Table 2. We can observe that as fewer restrictions are imposed, the overall ensemble accuracy increases, which is expected and follows from Equation (1). A detailed analysis of the diversity achieved by these ensembles in terms of the linear CKA metric is provided in Appendix B for interested readers.

Dataset | $\mathcal{E}_{PARL/7}$ | $\mathcal{E}_{PARL}$ (fewer layers) | $\mathcal{E}_{PARL}$ (fewest layers)
CIFAR-10 | 91.09 | 91.18 | 92.45
CIFAR-100 | 67.52 | 67.54 | 68.81

Table 2: Ensemble classification accuracy (%) on the test set for CIFAR-10 and CIFAR-100. The number after the slash in a column header denotes the number of convolution layers used to enforce diversity; the two right-hand columns correspond to the ablation ensembles that restrict progressively fewer layers.

Next, we evaluate the robustness of these ensembles against black-box transfer attacks for the evaluation configurations discussed in Section 5.1 and present the results in Figure 5. We can observe that although the ensemble accuracy on clean examples increases with fewer layer restrictions, the robustness against black-box transfer attacks decreases significantly. The results present an interesting trade-off between accuracy and robustness in terms of the number of layers considered while computing PARL (ref. Equation (1)).

Figure 5: Ensemble classification accuracy (%) vs. attack strength ($\epsilon$) against different black-box transfer attacks generated from the surrogate ensemble for the CIFAR-10 [(a) FGSM, (b) BIM, (c) MIM, (d) PGD] and CIFAR-100 [(e) FGSM, (f) BIM, (g) MIM, (h) PGD] datasets. The number after the slash in the figure legends denotes the number of convolution layers used to enforce diversity.

6 Conclusion

In this paper, we propose an approach to enhance the classification robustness of an ensemble against adversarial attacks by developing diversity in the decision boundaries among all the classifiers within the ensemble. The ensemble is trained with the proposed Pairwise Adversarially Robust Loss function, which utilizes the gradients of each layer with respect to the input in all the networks simultaneously. The experimental results show that the proposed method can significantly improve the overall robustness of the ensemble against state-of-the-art black-box transfer attacks without substantially impacting the clean example accuracy. Combining the technique with adversarial training and exploring more efficient methods to construct networks with diverse decision boundaries, adhering to the principle outlined in this paper, are interesting directions for future research.

References

  • G. Adam, P. Smirnov, A. Goldenberg, D. Duvenaud, and B. Haibe-Kains (2018) Stochastic combinatorial ensembles for defending against adversarial examples. CoRR abs/1808.06645. External Links: Link, 1808.06645 Cited by: §1.
  • D. Amodei et al. (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 173–182. External Links: Link Cited by: §1.
  • A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings of Machine Learning Research, Vol. 80, pp. 274–283. External Links: Link Cited by: §1.
  • O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. V. Nori, and A. Criminisi (2016) Measuring neural net robustness with constraints. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 2613–2621. External Links: Link Cited by: §1.
  • A. N. Bhagoji, D. Cullina, and P. Mittal (2017) Dimensionality reduction as a defense against evasion attacks on machine learning classifiers. CoRR abs/1704.02654. External Links: Link, 1704.02654 Cited by: §1.
  • M. Bojarski et al. (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. External Links: Link, 1604.07316 Cited by: §1.
  • N. Carlini and D. A. Wagner (2017a) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pp. 3–14. External Links: Link, Document Cited by: §1.
  • N. Carlini and D. A. Wagner (2017b) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39–57. External Links: Link Cited by: §1.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 9185–9193. External Links: Link, Document Cited by: Appendix A, §1, §5.1, §5.1.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. CoRR abs/1703.00410. External Links: Link, 1703.00410 Cited by: §1.
  • Z. Gong, W. Wang, and W. Ku (2017) Adversarial and clean data are not twins. CoRR abs/1704.04960. External Links: Link, 1704.04960 Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: Appendix A, Appendix A, §1, §5.1.
  • E. Grefenstette, R. Stanforth, B. O’Donoghue, J. Uesato, G. Swirszcz, and P. Kohli (2018) Strength in numbers: trading-off robustness and computation via adversarially-trained ensembles. CoRR abs/1811.09300. External Links: Link, 1811.09300 Cited by: §1.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel (2017) On the (statistical) detection of adversarial examples. CoRR abs/1702.06280. External Links: Link, 1702.06280 Cited by: §1.
  • S. Gu and L. Rigazio (2015) Towards deep neural network architectures robust to adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, External Links: Link Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Link, Document Cited by: §5.1.
  • D. Hendrycks and K. Gimpel (2017) Early methods for detecting adversarial images. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, External Links: Link Cited by: §1.
  • R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári (2015) Learning with a strong adversary. CoRR abs/1511.03034. External Links: Link, 1511.03034 Cited by: §1.
  • J. Jin, A. Dundar, and E. Culurciello (2015) Robust convolutional neural networks under adversarial noise. CoRR abs/1511.06306. External Links: Link, 1511.06306 Cited by: §1.
  • S. Kariyappa and M. K. Qureshi (2019) Improving adversarial robustness of ensembles with diversity training. CoRR abs/1901.09981. External Links: Link, 1901.09981 Cited by: §1, 2nd item, §5.3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §5.1.
  • S. Kornblith, M. Norouzi, H. Lee, and G. E. Hinton (2019) Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, Vol. 97, pp. 3519–3529. External Links: Link Cited by: §5.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical Report. External Links: Link Cited by: §5.1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, External Links: Link Cited by: Appendix A, §1, §5.1.
  • X. Li and F. Li (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5775–5783. External Links: Link, Document Cited by: §1.
  • Y. Liu, X. Chen, C. Liu, and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §5.1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: Appendix A, Appendix A, §1, §5.1, §5.3.
  • J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2574–2582. External Links: Link Cited by: §1.
  • T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu (2019) Improving adversarial robustness via promoting ensemble diversity. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, Vol. 97, pp. 4970–4979. External Links: Link Cited by: §1, 2nd item.
  • N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, R. Long, and P. McDaniel (2018) Technical report on the cleverhans v2.1.0 adversarial examples library. CoRR abs/1610.00768. External Links: Link, 1610.00768 Cited by: §5.1.
  • N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 2-6, 2017, pp. 506–519. External Links: Link, Document Cited by: §1.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pp. 582–597. External Links: Link, Document Cited by: §1.
  • A. Rozsa, E. M. Rudd, and T. E. Boult (2016) Adversarial diversity and hard positive generation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, June 26 - July 1, 2016, pp. 410–417. External Links: Link, Document Cited by: §1.
  • U. Shaham, Y. Yamada, and S. Negahban (2015) Understanding adversarial training: increasing local stability of neural nets through robust optimization. CoRR abs/1511.05432. External Links: Link, 1511.05432 Cited by: §1.
  • T. Strauss, M. Hanselmann, A. Junginger, and H. Ulmer (2017) Ensemble methods as a defense to adversarial perturbations against deep neural networks. CoRR abs/1709.03423. External Links: Link, 1709.03423 Cited by: §1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826. External Links: Link, Document Cited by: §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, External Links: Link Cited by: §1.
  • F. Tramèr, A. Kurakin, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1.
  • Y. Wu et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link, 1609.08144 Cited by: §1.
  • H. Yang, J. Zhang, H. Dong, N. Inkawhich, A. Gardner, A. Touchet, W. Wilkes, H. Berry, and H. Li (2020) DVERGE: diversifying vulnerabilities for enhanced robust generation of ensembles. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §1, 2nd item, §5.3.
  • S. Zheng, Y. Song, T. Leung, and I. J. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4480–4488. External Links: Link, Document Cited by: §1.

Appendix A Brief Overview of Adversarial Example Generation

Let us consider a benign data point $x$, classified into class $c$ by a classifier $\mathcal{M}$. An untargeted adversarial attack tries to add a visually imperceptible perturbation $\delta$ to $x$ and creates a new data point $x^{adv} = x + \delta$ such that $\mathcal{M}$ misclassifies $x^{adv}$ into a class other than $c$. The imperceptibility is enforced by restricting the $\ell_\infty$-norm of the perturbation to be below a threshold $\epsilon$, i.e., $\|\delta\|_\infty \leq \epsilon$ [Goodfellow et al., 2015, Madry et al., 2018]. We term $\epsilon$ as the attack strength.

An adversary can craft adversarial examples by considering the loss function of a model, assuming that the adversary has full access to the model parameters. Let $\mathcal{L}(\theta, x, y)$ denote the loss function of the model, where $\theta$, $x$, and $y$ represent the model parameters, the benign input, and the corresponding label, respectively. The goal of the adversary is to generate an adversarial example $x^{adv}$ such that the loss of the model is maximized while ensuring that the magnitude of the perturbation is upper bounded by $\epsilon$, i.e.,

$$x^{adv} = \arg\max_{x': \|x' - x\|_\infty \leq \epsilon} \mathcal{L}(\theta, x', y)$$

Several methods have been proposed in the literature to solve this constrained optimization problem. We discuss the methods used in the evaluation of our proposed approach.

Fast Gradient Sign Method (FGSM): This attack was proposed by Goodfellow et al. [2015]. The adversarial example is crafted using the following equation:

$$x^{adv} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(\theta, x, y)\big)$$

where $\nabla_x \mathcal{L}(\theta, x, y)$ denotes the gradient of the loss with respect to the input.
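A minimal PyTorch sketch of this single-step attack (our illustration; model, loss_fn, and the [0, 1] input range are assumptions):

```python
# Single-step FGSM: perturb the input by eps in the direction of the loss gradient's sign.
import torch

def fgsm(model, loss_fn, x, y, eps):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
```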

Basic Iterative Method (BIM): This attack was proposed by Kurakin et al. [2017]. The adversarial example is crafted iteratively using the following equation:

$$x^{adv}_{t+1} = \operatorname{Clip}_{x,\epsilon}\Big\{ x^{adv}_t + \alpha \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(\theta, x^{adv}_t, y)\big) \Big\}$$

where $x^{adv}_0 = x$ is the benign example. If $T$ is the total number of attack iterations, the final adversarial example is $x^{adv} = x^{adv}_T$. The parameter $\alpha$ is a small step size. The function $\operatorname{Clip}_{x,\epsilon}\{\cdot\}$ keeps the generated adversarial examples within the $\epsilon$-ball of the original image $x$.

Momentum Iterative Method (MIM): This attack was proposed by Dong et al. [2018] and won the NeurIPS 2017 Adversarial Competition. It is a variant of BIM that crafts adversarial examples iteratively using the following equations:

$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x \mathcal{L}(\theta, x^{adv}_t, y)}{\big\|\nabla_x \mathcal{L}(\theta, x^{adv}_t, y)\big\|_1}, \qquad x^{adv}_{t+1} = \operatorname{Clip}_{x,\epsilon}\Big\{ x^{adv}_t + \alpha \cdot \operatorname{sign}(g_{t+1}) \Big\}$$

where $\mu$ is termed the decay factor.

Projected Gradient Descent (PGD): This attack was proposed by Madry et al. [2018]. It is a variant of BIM with the same adversarial example generation process, except that $x^{adv}_0$ is a randomly perturbed image in the $\epsilon$-neighborhood of the original image $x$.
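A minimal PyTorch sketch consistent with the description above (our illustration; inputs assumed to lie in [0, 1]):

```python
# PGD: BIM with a random start, projecting back into the eps-ball after every step.
import torch

def pgd(model, loss_fn, x, y, eps, alpha, steps=50):
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project
    return x_adv.detach()
```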

Appendix B Diversity Results for Ablation Study

In this section, we present an analysis of the layer-wise CKA values for each pair of classifiers within $\mathcal{E}_{PARL/7}$ and the two ablation ensembles introduced in Section 5.4, trained with both the CIFAR-10 and CIFAR-100 datasets. The layer-wise CKA values are shown in Figure 6 to demonstrate the effect of PARL on diversity. We can observe a more significant decline in the CKA values at the initial layers for $\mathcal{E}_{PARL/7}$ as compared to the ensembles that restrict fewer layers, which is expected since $\mathcal{E}_{PARL/7}$ is trained by restricting more convolution layers. We can also observe that each pair of classifiers shows more overall diversity in $\mathcal{E}_{PARL/7}$ than in the other two ensembles. The overall CKA values are mentioned inside braces within the corresponding figure legends.

Figure 6: Layer-wise linear CKA values between each pair of classifiers in $\mathcal{E}_{PARL/7}$ and the two ablation ensembles trained with the CIFAR-10 [(a), (b), (c)] and CIFAR-100 [(d), (e), (f)] datasets, showing the similarities at each layer. The number after the slash in the figure legends denotes the number of convolution layers used to enforce diversity. The values inside braces in the corresponding figure legends represent the overall average linear CKA values between each pair of classifiers.