Dissecting Deep Networks into an Ensemble of Generative Classifiers for Robust Predictions

by Lokender Tiwari, et al.

Deep Neural Networks (DNNs) are often criticized for being susceptible to adversarial attacks. Most successful defense strategies adopt adversarial training or random input transformations that typically require retraining or fine-tuning the model to achieve reasonable performance. In this work, our investigations of intermediate representations of a pre-trained DNN lead to an interesting discovery pointing to intrinsic robustness to adversarial attacks. We find that we can learn a generative classifier by statistically characterizing the neural response of an intermediate layer to clean training samples. The predictions of multiple such intermediate-layer based classifiers, when aggregated, show unexpected robustness to adversarial attacks. Specifically, we devise an ensemble of these generative classifiers that rank-aggregates their predictions via a Borda count-based consensus. Our proposed approach uses a subset of the clean training data and a pre-trained model, and yet is agnostic to network architectures or the adversarial attack generation method. We show extensive experiments to establish that our defense strategy achieves state-of-the-art performance on the ImageNet validation set.



1 Introduction

Figure 1: Overview of REGroup: Rank-aggregating Ensemble of Generative classifiers for robust predictions. REGroup uses a pre-trained network and constructs layer-wise generative classifiers, modeled by a mixture distribution of the positive and negative pre-activation neural responses at each layer. At test time, an input sample's neural responses are tested with the generative classifiers to obtain ranking preferences over classes at each layer. These preferences are aggregated using Borda count-based preferential voting theory to make the final prediction. Note: construction of the layer-wise generative classifiers is a one-time process.

Deep Neural Networks (DNNs) have shown outstanding performance on many computer vision tasks such as image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and video classification (Karpathy et al., 2014). Despite showing superhuman capabilities in the image classification task (He et al., 2015), the existence of adversarial examples (Szegedy et al., 2013) has raised questions about the reliability of neural network solutions for safety-critical applications.

Adversarial examples are carefully manipulated adaptations of an input, generated with the intent to fool a classifier into misclassifying them. Recently, it has been shown that adversarial examples are not limited to images but also exist in automatic speech recognition (Schönherr et al., 2018), text (Behjati et al., 2019) and video (Zajac et al., 2019) classification. Some examples of failures of neural network-based security-critical applications are: an AI assistant incorrectly authenticating an impostor's voice and providing access to confidential content (Chokshi, 2018), or a vision-based driver assistance system incorrectly recognizing a stop sign as a speed limit sign (Eykholt et al., 2018), which may lead to fatal accidents.

One of the reasons adversarial examples have garnered so much attention is the ease with which they can be generated for a given model by simply maximizing the corresponding loss function. This is achieved using a gradient-based approach that finds a small perturbation at the input which leads to a large change in the output (Szegedy et al., 2013). This apparent instability in neural networks is most pronounced for deep architectures that have an accumulation effect over the layers: the small, additive, adversarial noise at the input is amplified into substantially noisy feature maps at intermediate layers, which eventually influence the softmax probabilities enough to misclassify the perturbed input sample. This observation of amplification of input noise over the layers is not new, and has been pointed out in the past (Szegedy et al., 2013; Xie et al., 2019). The recent work by (Xie et al., 2019) addresses this issue by introducing feature denoising blocks in a network and training them with adversarially generated examples.

The iterative nature of generating adversarial examples makes their use in training-based defenses computationally very intensive. For instance, the adversarially trained feature denoising model proposed by (Xie et al., 2019) takes 38 hours on 128 Nvidia V100 GPUs to train a baseline ResNet-101 on ImageNet. While we leverage this observation of noise amplification over the layers, our proposed approach avoids any training or fine-tuning of the model. Instead, we use a representative subset of training samples and their layer-wise pre-activation responses to construct nonparametric generative classifiers, which are then combined in an ensemble using ranking preferences.

Generative classifiers have achieved varying degrees of success as defense strategies against adversarial attacks (Partha et al., 2019; Li et al., 2019; Schott et al., 2019). Recently, (Fetaya et al., 2020) studied class-conditional generative classifiers and concluded that it is impossible to guarantee the robustness of such models. More importantly, they highlight the challenges in training generative classifiers with a maximum-likelihood-based objective, and their limitations w.r.t. discriminative ability and identification of out-of-distribution samples. While we propose to use generative classifiers, we avoid using likelihood-based measures for making classification decisions. Instead, we use the rank-order preferences of these classifiers, which are then combined using a Borda count-based voting scheme. Borda counts have been used in collective decision making and are known to be robust to various manipulative attacks (Rothe, 2019).

In this paper, we present our defense against adversarial attacks on deep networks, referred to as Rank-aggregating Ensemble of Generative classifiers for robust predictions (REGroup). At inference time, our defense requires white-box access to a pre-trained model to collect the pre-activation responses at intermediate layers to make the final prediction. We use the training data to build our generative classifier models. Our strategy is simple, network-agnostic, does not require any training or fine-tuning of the network, and works well for a variety of adversarial attacks, even with varying degrees of hardness. Consistent with recent trends, we focus only on the ImageNet dataset to evaluate the robustness of our defense, and report performance superior to recent defenses that rely on adversarial training (Kurakin et al., 2016) and random input transformation (Raff et al., 2019) based approaches. Finally, we present extensive analysis of our defense with two different architectures (ResNet and VGG) on different targeted and untargeted attacks. Our primary contributions are summarized below:


  • We present REGroup, a retraining free, model-agnostic defense strategy that leverages an ensemble of generative classifiers over intermediate layers of the model.

  • We model each generative classifier as a simple mixture distribution of neural responses obtained from a subset of training samples. We discover that both positive and negative pre-activation values contain information that can help correctly classify adversarially perturbed samples.

  • We leverage the robustness inherent in Borda-count based consensus over the generative classifiers.

  • We show extensive comparisons and analysis experiments on the ImageNet dataset spanning a variety of adversarial attacks.

2 Related Work

Several defense techniques have been proposed to make neural networks robust to adversarial attacks. Broadly, we can categorize them into two approaches: 1. those that modify the training procedure or the input before testing; 2. those that modify the network architecture or add/change network hyper-parameters, the optimization procedure, activation functions, etc.

Modify Training/Modify Inputs During Testing: Some defenses in this category are mentioned below. Adversarial training (Miyato et al., 2016; Zheng et al., 2016; Moosavi-Dezfooli et al., 2017) regularizes the neural network to reduce over-fitting and, in turn, improves robustness. Data compression (Bhagoji et al., 2018; Das et al., 2017) suppresses high-frequency components and forms the basis of an ensemble defense approach. Data randomization (Wang et al., 2016; Xie et al., 2017b; Zantedeschi et al., 2017) based approaches apply random transformations to the input to defend against adversarial examples by reducing their effectiveness.

However, it is shown in (Moosavi-Dezfooli et al., 2017) and (Shin and Song, 2017) that adversarial examples can still be generated for adversarially trained DNNs and for compression-based defenses, respectively.

Modify Network/Network Add-ons: Defenses under this category either perform detection only, or both detection and correction. The aim of detection-only defenses is to flag an example as adversarial and prevent it from further processing. These approaches include employing a detector sub-network (Metzen et al., 2017), training the main classifier with an outlier class (Grosse et al., 2017), using convolution filter statistics (Li and Li, 2017), or applying feature squeezing (Xu et al., 2017) to detect adversarial examples. However, all of these methods have been shown to be ineffective against strong adversarial attacks (Carlini and Wagner, 2017a; Sharma and Chen, 2018). Full defense approaches include applying defensive distillation (Papernot et al., 2016; Papernot and McDaniel, 2017), which uses the knowledge from the output of the network to re-train the original model and improve its resilience to small perturbations. Another approach is to augment the network with a sub-network called a Perturbation Rectifying Network (PRN) to detect perturbations; if a perturbation is detected, the PRN is used to classify the input image. However, it was later shown that the C&W attack (Carlini and Wagner, 2017b) successfully defeated the defensive distillation approach.

ImageNet Focused Defense Approaches: A few approaches have been proposed specifically for the ImageNet dataset; most are based on input transformations or image denoising. Almost all defenses designed for ImageNet have failed a thorough evaluation. A list of such defenses, along with their evaluations, can be found at (Madry et al., ). (Prakash et al., 2018) and (Liao et al., 2018) claimed 81% and 75% accuracy respectively under adversarial attacks, but after a thorough evaluation (Athalye and Carlini, 2018) that accounts for obfuscated gradients (Athalye et al., 2018), both accuracies were reduced to 0%. Similarly, (Xie et al., 2017a) and (Guo et al., 2017) claimed 86% and 75% respectively, but these were also reduced to 0% (Athalye et al., 2018). A different approach proposed in (Kannan et al., 2018) claimed 27.9% accuracy, but this too was later reduced to 0.1% (Engstrom et al., 2018).

For a comprehensive survey of adversarial attacks and defenses, we refer the reader to (Yuan et al., 2019) and (Chakraborty et al., 2018).

3 REGroup  Methodology

Well-trained deep neural networks have a hierarchical structure, where the early layers transform inputs to feature spaces capturing local or more generic information, while later layers aggregate the local information to learn more semantically meaningful representations. In REGroup, we use many of the higher layers and learn class-conditional generative classifiers as simple mixture distributions estimated from the pre-activation neural responses at each layer from a subset of training samples. An ensemble of these layer-wise generative classifiers is used to make the final prediction by performing a Borda count-based rank-aggregation. Ranking preferences have been used extensively in robust fitting problems in computer vision (Chin et al., 2011, 2009; Tiwari and Anand, 2018), and we show their effectiveness in introducing robustness in DNNs against adversarial attacks.

Fig. 1 illustrates the overall working of REGroup. The approach has three main components: First, we use a layer as a generative classifier that produces a ranking preference over all classes. Second, each of these class-conditional generative classifiers is modeled using a mixture distribution over the neural responses of the corresponding layer. Finally, the individual layers' class ranking preferences are aggregated using Borda count-based scoring to make the final predictions. We introduce the notation below and discuss each of these steps in detail in the subsections that follow.

Notation: In this paper, we will always use $\ell$, $i$ and $n$ for indexing the $\ell^{th}$ layer, the $i^{th}$ feature map and the $n^{th}$ input sample respectively. The true and predicted class labels will be denoted by $y$ and $\hat{y}$ respectively. A classifier can be represented in functional form as $\hat{y} = F(x)$; it takes an input $x$ and predicts its class label $\hat{y}$. We define $\phi_i^{\ell}(x^n)$ as the $\ell^{th}$ layer's $i^{th}$ pre-activation feature map, i.e., the neural responses before they pass through the activation function. For convolutional layers, this feature map is a 2D array, while for a fully connected layer it is a scalar value.

3.1 DNN Layers as Generative Classifiers

We use the highest layers of a DNN as generative classifiers that use the pre-activation neural responses to produce ranking preferences over all classes. The layer-wise generative classifiers are modeled as class-conditional mixture distributions, which are estimated using only a pre-trained network and a small subset of the training data. Let $\mathcal{X}$ contain only correctly classified training samples (we took 50,000 of the 1.2 million training images from the ImageNet dataset, 50 per class), which we further divide into class-wise subsets, i.e., $\mathcal{X} = \cup_{c}\,\mathcal{X}_c$, where $\mathcal{X}_c$ is the subset containing samples with label $c$.

3.1.1 Layerwise Neural Response Distributions

Our preliminary observations indicated that while the ReLU activations truncate the negative pre-activations during the forward pass, these values still contain semantically meaningful information. Our quantitative ablative studies (see Fig. 3) confirm this observation and, additionally, we find that on occasion the negative pre-activations are complementary to the positive ones. Since the pre-activation features are real-valued, we compute the features $\phi_i^{\ell}(x^n)$ for the $n^{th}$ sample $x^n$, and define its positive and negative response accumulators as $a_i^{\ell+}(x^n) = \sum_{H,W} \max(0, \phi_i^{\ell}(x^n))$ and $a_i^{\ell-}(x^n) = \sum_{H,W} \max(0, -\phi_i^{\ell}(x^n))$ respectively. For convolutional layers, these accumulators represent the overall strength of positive and negative pre-activation responses respectively, when collected over the spatial dimensions of the $i^{th}$ feature map of the $\ell^{th}$ layer. For the linear layers, on the other hand, the accumulation becomes trivial, with each neuron having a scalar response. We can now represent the $\ell^{th}$ layer by the positive and negative response accumulator vectors, denoted by $a^{\ell+}(x^n)$ and $a^{\ell-}(x^n)$ respectively. We normalize these vectors and define the layer-wise probability mass functions (PMFs) for the positive and negative responses as $p^{\ell+}(x^n) = a^{\ell+}(x^n)/\lVert a^{\ell+}(x^n)\rVert_1$ and $p^{\ell-}(x^n) = a^{\ell-}(x^n)/\lVert a^{\ell-}(x^n)\rVert_1$ respectively.

Our interpretation of $p^{\ell+}$ and $p^{\ell-}$ as PMFs can be justified by drawing an analogy to the softmax output, which is also interpreted as a PMF. However, it is worth emphasizing that we chose a linear rescaling of the accumulator vectors rather than directly applying a softmax normalization. By separating out the positive and negative accumulators, we obtain two independent representations for each layer, which is beneficial to our rank-aggregating ensemble discussed in the following sections. A softmax normalization over a feature map comprising positive and negative responses would have entirely suppressed the negative responses, discarding all their constituent semantic information. An additional benefit of the linear scaling is its simple computation. Algorithm 1 summarizes the computation of the layer-wise PMFs for a given training sample.

Input: pre-activation features $\phi_i^{\ell}(x^n)$, $i = 1, \dots, m^{\ell}$
for $i = 1, \dots, m^{\ell}$ do
    $a_i^{\ell+} \leftarrow \sum_{H,W} \max(0, \phi_i^{\ell}(x^n))$   (sum over H, W)
    $a_i^{\ell-} \leftarrow \sum_{H,W} \max(0, -\phi_i^{\ell}(x^n))$   (sum over H, W)
end for
$p^{\ell+} \leftarrow a^{\ell+}/\lVert a^{\ell+}\rVert_1$;   $p^{\ell-} \leftarrow a^{\ell-}/\lVert a^{\ell-}\rVert_1$
Algorithm 1 Layerwise PMF of neural responses. $H \times W$ represents the spatial dimensions of the pre-activation features. For a convolutional layer the feature maps have spatial extent, while for linear layers each neuron output is a scalar, i.e., the sum over $H, W$ is trivial.
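Algorithm 1 can be sketched in a few lines of NumPy. The function and variable names below are ours, not from the paper's code, and we assume the pre-activation feature maps of one layer are already available as an array:

```python
import numpy as np

def layer_pmfs(features):
    """Compute the positive/negative PMFs of one layer (Algorithm 1 sketch).

    features: array of shape (m, H, W) holding the m pre-activation feature
              maps of a convolutional layer (for a linear layer, pass
              shape (m, 1, 1) so the spatial sum is trivial).
    Returns (p_pos, p_neg), each a length-m probability mass function.
    """
    # Positive/negative response accumulators: spatial sums per feature map.
    pos = np.maximum(features, 0.0).sum(axis=(1, 2))
    neg = np.maximum(-features, 0.0).sum(axis=(1, 2))
    # Linear rescaling to PMFs (deliberately not softmax, as argued above).
    p_pos = pos / max(pos.sum(), 1e-12)
    p_neg = neg / max(neg.sum(), 1e-12)
    return p_pos, p_neg
```

For example, two 2x2 feature maps with values `[[1, -2], [3, 0]]` and `[[-1, -1], [2, 2]]` give positive accumulators `[4, 4]` and negative accumulators `[2, 2]`, so both PMFs become `[0.5, 0.5]`.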

3.1.2 Layerwise Generative Classifiers

We model the layerwise generative classifier for class $c$ as a class-conditional mixture of distributions, with each mixture component being the PMF $p^{\ell+}(x^n)$ or $p^{\ell-}(x^n)$ of a given training sample $x^n \in \mathcal{X}_c$. The generative classifiers corresponding to the positive and negative neural responses are then defined as the following mixtures of PMFs

$P_c^{\ell+} = \sum_{n:\, x^n \in \mathcal{X}_c} w^n\, p^{\ell+}(x^n), \qquad P_c^{\ell-} = \sum_{n:\, x^n \in \mathcal{X}_c} w^n\, p^{\ell-}(x^n) \qquad (1)$

where the weights $w^n$ are nonnegative and add up to one in the respective equations. We choose the weights to be proportional to the softmax probability value as predicted by the network given the input $x^n$. Using the subset of training samples $\mathcal{X}_c$, we construct the class-conditional mixture distributions $P_c^{\ell+}$ and $P_c^{\ell-}$ at each layer only once. At inference time, we input a test sample $x$ from the test set to the network and compute the PMFs $p^{\ell+}(x)$ and $p^{\ell-}(x)$ using Algorithm 1. As our test input is a PMF and the generative classifier is also a mixture distribution, we simply use the KL-divergence between the classifier model and the test sample as a classification score

$s_c^{\ell+}(x) = D_{KL}\left(P_c^{\ell+} \,\|\, p^{\ell+}(x)\right) \qquad (2)$

and similarly for the negative PMFs

$s_c^{\ell-}(x) = D_{KL}\left(P_c^{\ell-} \,\|\, p^{\ell-}(x)\right) \qquad (3)$
We use a simple classification rule and select the predicted class $\hat{y}$ as the one with the smallest KL-divergence from the test sample's PMF. However, rather than identifying $\hat{y}$ at this stage, we are only interested in rank-ordering the classes, which we achieve by sorting the KL-divergences (Eqns. (2) and (3)) in ascending order. The resulting ranking preferences of the classes for the $\ell^{th}$ layer are

$R^{\ell+}(x) = \operatorname{argsort}_c\, s_c^{\ell+}(x) \qquad (4)$
$R^{\ell-}(x) = \operatorname{argsort}_c\, s_c^{\ell-}(x) \qquad (5)$
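The scoring and ranking steps of Eqns. (2) and (3) amount to computing a KL-divergence against each class-conditional mixture PMF and sorting the scores. A minimal sketch with hypothetical names; the small epsilon added for numerical stability is an implementation detail not specified in the text:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two PMFs over the same support.

    A tiny eps guards against log(0); this is our own numerical choice."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def rank_classes(class_pmfs, test_pmf):
    """Rank classes by ascending KL-divergence between each class-conditional
    mixture PMF and the test sample's PMF.

    class_pmfs: dict {class_id: mixture PMF over feature maps}
    Returns a list of class ids, most-preferred first."""
    scores = {c: kl_divergence(P, test_pmf) for c, P in class_pmfs.items()}
    return sorted(scores, key=scores.get)
```

A layer's two voters are obtained by calling `rank_classes` once with the positive-response PMFs and once with the negative-response PMFs.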
3.2 Robust Predictions with Rank Aggregation

Rank aggregation based preferential voting for making group decisions is widely used in selecting a winner in a democratic setup (Rothe, 2019). The basic premise of preferential voting is that voters are allowed to rank candidates in the order of their preferences. The rankings of all voters are then aggregated to make a final prediction.

Borda count (Black and others, 1958) is one of the approaches for preferential voting that relies on aggregating the rankings of all the voters to make a collective decision (Rothe, 2019; Kahng et al., 2019). There exist other voting strategies to find a winner out of different choices. Some of the popular ones are Plurality voting (Van Newenhizen, 1992), and Condorcet winner (Young, 1988). In Plurality voting, the winner would be the one who gets the maximum fraction of votes, while Condorcet winner is the one who gets the majority votes.

3.2.1 Rank Aggregation using Borda Count

Borda count is a generalization of majority voting: in the two-candidate case it is equivalent to a majority vote. The Borda count for a candidate is the sum, over voters, of the number of candidates ranked below it by each voter. In our setting, while processing a test sample $x$, every layer acts as two independent voters based on $R^{\ell+}(x)$ and $R^{\ell-}(x)$. The number of classes, $C$, is the number of candidates. The Borda count for the $c^{th}$ class at the $\ell^{th}$ layer is denoted by $B_c^{\ell} = B_c^{\ell+} + B_c^{\ell-}$, where $B_c^{\ell+}$ and $B_c^{\ell-}$ are the individual Borda counts of the two voters, computed as shown in equation (6).

$B_c^{\ell+} = C - r_c^{\ell+}(x), \qquad B_c^{\ell-} = C - r_c^{\ell-}(x) \qquad (6)$

Here $r_c^{\ell+}(x)$ and $r_c^{\ell-}(x)$ denote the rank (1 being the most preferred) of class $c$ in the preference lists $R^{\ell+}(x)$ and $R^{\ell-}(x)$ respectively.
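The per-voter Borda computation and its aggregation can be illustrated as follows (names are ours; the code assumes each voter supplies a full preference list, most-preferred class first):

```python
def borda_counts(preference, num_classes):
    """Borda count from one voter's ranking: a class at 0-indexed rank r
    receives num_classes - 1 - r points, i.e. the number of classes
    ranked below it by this voter.

    preference: list of class ids, most-preferred first.
    Returns dict {class_id: count}."""
    return {c: num_classes - 1 - r for r, c in enumerate(preference)}

def aggregate_borda(preferences, num_classes):
    """Sum Borda counts over all voters (in REGroup: two voters per
    selected layer, one for positive and one for negative responses)."""
    totals = {c: 0 for c in range(num_classes)}
    for pref in preferences:
        for c, b in borda_counts(pref, num_classes).items():
            totals[c] += b
    return totals
```

For three voters over three classes, preferences `[0,1,2]`, `[1,0,2]`, `[0,2,1]` aggregate to `{0: 5, 1: 3, 2: 1}`, so class 0 wins.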
3.2.2 Hyperparameter Settings

We aggregate the Borda counts of the highest $k$ layers of the network; $k$ is the only hyperparameter to set in REGroup. Let

$B_c^{(k)} = \sum_{\ell = L-k+1}^{L} B_c^{\ell} \qquad (7)$

denote the aggregated Borda count of the $c^{th}$ class over the last $k$ layers, irrespective of their type (convolutional or fully connected). Here, $L$ is the total number of layers. The final prediction is the class with the maximum aggregated Borda count, $\hat{y} = \arg\max_c B_c^{(k)}$.

To determine the value of $k$, we evaluate REGroup on 10,000 correctly classified samples from the ImageNet validation set at each layer individually, using the per-layer Borda count, i.e., $k = 1$. We select $k$ to be the number of later layers whose individual accuracy is at least a chosen threshold; this can be viewed in the context of the confidence of individual layers in discriminating samples of different classes. Following this heuristic, we found the same value of $k$ for both architectures, ResNet-50 and VGG-19, and hence use it in all our experiments. An ablation study with all possible values of $k$ is included in section 4.8.
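The layer-selection heuristic can be sketched as below. The accuracy threshold and the exact value of $k$ used in the paper are elided in this copy, so the `threshold` argument and the backwards-contiguous scan are our own reading of the heuristic:

```python
def select_k(per_layer_accuracy, threshold):
    """Count how many of the last layers individually reach the accuracy
    threshold, scanning backwards from the final layer and stopping at
    the first layer that falls below it.

    per_layer_accuracy: per-layer accuracies (%) on correctly classified
                        validation samples, ordered from first to last layer.
    """
    k = 0
    for acc in reversed(per_layer_accuracy):
        if acc < threshold:
            break
        k += 1
    return k
```

For example, with per-layer accuracies `[40, 55, 62, 70, 78, 81, 85, 90]` and a threshold of 75%, the last four layers qualify, giving k = 4.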

4 Experimental Analysis

In this section, we evaluate REGroup against various attacks and compare with state-of-the-art methods. We first discuss the attacks used in our experiments.

4.1 Adversarial Attacks

The generation process for an adversarial example involves corrupting a clean sample $x$ by adding a small perturbation $\delta$, i.e., $x_{adv} = x + \delta$, so that the network classifies $x_{adv}$ into an incorrect class. We evaluate REGroup against gradient-based and decision-based attacks. We use Foolbox (Rauber et al., 2017) with different perturbation budgets to generate adversarial examples.
Gradient-Based Attacks: Attacks under this category have white-box access to the network, i.e., the attacker has access to the gradients, weights, and network architecture.
PGD: Projected Gradient Descent (PGD) (Madry et al., 2017) can be viewed as a multi-step variant of FGSM (Goodfellow et al., 2014). The generation of adversarial examples takes an iterative form, with the update at the $t^{th}$ step expressed as:

$x^{t+1} = \Pi_{x+\mathcal{S}}\left(x^{t} + \alpha\, \mathrm{sign}\left(\nabla_x L(\theta, x^{t}, y)\right)\right)$

Here, $\alpha$ is the step size and $\Pi_{x+\mathcal{S}}$ is the projection operator, with the corresponding perturbation bounded by $\epsilon$.
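The PGD update above can be sketched in NumPy on a toy model (function names and the toy linear loss are ours; a real attack would backpropagate through the network rather than use a closed-form gradient):

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps, alpha, steps):
    """L-infinity PGD: repeatedly step in the sign of the loss gradient,
    then project back into the eps-ball around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))  # gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)            # projection onto eps-ball
    return x_adv
```

On a toy linear score $w \cdot x$ for the true class with loss $L = -y\,(w \cdot x)$, the attack stays inside the $\epsilon$-ball while driving the true-class score down:

```python
w = np.array([1.0, -2.0, 0.5])
grad_fn = lambda xa, y: -y * w          # gradient of L = -y * (w @ x)
x = np.array([0.2, 0.1, -0.3])
x_adv = pgd_linf(x, 1, grad_fn, eps=0.1, alpha=0.05, steps=10)
```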
DeepFool: This attack method finds the nearest decision boundary and tries to generate an adversarial example by iteratively perturbing an image until it is pushed across a class boundary and the network changes its top prediction to a different class  (Moosavi-Dezfooli et al., 2016).
C&W: This method (Carlini and Wagner, 2017b) generates an adversarial example by solving an optimization problem that finds the smallest bounded perturbation while simultaneously misclassifying the input.
Trust Region Based Attack: This attack (Yao et al., 2019) solves a trust region based optimization problem (Conn et al., 2000) to get an adversarial perturbation.
Decision-Based Attacks: Attacks in this category rely only on the final decision of the model. They are more relevant to real-world, practical learning-based applications, where prediction scores or logits are not accessible.

Boundary Attack: This attack begins with a large perturbation (hence already an adversarial image) and then performs a random walk along the decision boundary between the adversarial and the non-adversarial region in a manner that it remains in the adversarial region but moves closer to the target image (Brendel et al., 2017).
Spatial Attack: This attack generates adversarial examples by modifying the input image using adversarially generated rotation and translations (Engstrom et al., 2019).

4.2 Experimental Setup

Architectures: We use two different network architectures, ResNet-50 (https://download.pytorch.org/models/resnet50-19c8e357.pth) and VGG-19 (https://download.pytorch.org/models/vgg19-dcbb9e9d.pth), both with ImageNet pre-trained weights.
Datasets: We present our evaluations and comparisons only on ImageNet (Deng et al., 2009). We use ImageNet-V50K to refer to the full ImageNet validation set, and ImageNet-V10K and ImageNet-V2K to refer to random subsets of the validation set of sizes 10,000 (10 per class) and 2,000 (2 per class) respectively, each comprising only correctly classified images. The subsets ImageNet-V10K and ImageNet-V2K differ between ResNet-50 and VGG-19, since an image classified correctly by ResNet-50 need not be classified correctly by VGG-19.

4.3 Comparison with Adv-Training/Fine-tuning

We evaluate our performance on clean samples as well as adversarial examples generated using PGD from ImageNet-V50K, and compare with the prior state-of-the-art works (Raff et al., 2019) and (Xie et al., 2019). The results are reported in Table 1: REGroup outperforms the state-of-the-art input transformation based defense BaRT (Raff et al., 2019) on both clean and adversarial samples. While our performance on clean samples decreases when compared to adversarial training, it improves significantly on adversarial examples with a high perturbation strength $\epsilon$. While our method is not directly comparable with (Xie et al., 2019) because the base models are different, a similar decrease in accuracy on clean samples is reported in their paper. The tradeoff between robustness and standard accuracy has been studied in (Dohmatob, 2018) and (Tsipras et al., 2018).

An important observation from this experiment is that, if we set aside the base models of ResNets and compare Top-1 accuracies on clean samples of the full ImageNet validation set, our method (REGroup), without any adversarial training or fine-tuning, either outperforms or performs comparably to the state-of-the-art adversarial training/fine-tuning based methods (Raff et al., 2019; Xie et al., 2019).

Clean Images Attacked Images
Model Top-1 Top-5 Top-1 Top-5
ResNet-50 76 93 0.0 0.0
Inception v3 78 94 0.7 4.4
ResNet-152 79 94 - -
Inception v3 w/Adv. Train 78 94 1.5 5.5
ResNet-152 w/Adv. Train 63 - - -
ResNet-152 w/Adv. Train w/ denoise 66 - - -
ResNet-50-BaRT, 65 85 16 51
ResNet-50-BaRT, 65 85 36 57
ResNet-50-REGroup 66 86 22 65
Table 1: Accuracy (%) comparison with adversarial training (Kurakin et al., 2016; Xie et al., 2019) and fine-tuning (BaRT (Raff et al., 2019)) approaches. Dataset: ImageNet-V50K. The attacked images are generated using PGD. Clean images are the non-attacked original images. REGroup outperforms state-of-the-art adversarial training/fine-tuning methods. The results are divided into three blocks: the top block includes original networks, the middle block includes defense approaches based on adversarial re-training/fine-tuning of the original networks, and the bottom block is our defense, without re-training/fine-tuning. Results of competing methods are taken from their respective papers; '-' indicates the result was not provided in the respective paper.

4.3.1 Against PGD Adversarial Strength

We evaluate REGroup w.r.t. the maximum perturbation strength $\epsilon$ of the adversary. The results are reported in figure 2. REGroup outperforms both adversarial training (Kurakin et al., 2016) and BaRT (Raff et al., 2019). Both adversarial training and BaRT have shown protection against PGD attacks up to their respective maximum perturbation strengths; we additionally show results for larger $\epsilon$ on the full ImageNet validation set. We also note that our defense's accuracy decreases monotonically with increasing perturbation strength. This is in accordance with (Carlini et al., 2019): transitioning from a clean image to noise should yield a downward slope in accuracy, otherwise some form of gradient masking may be involved.

Figure 2: Top-1 and Top-5 accuracy (%) comparison with the adversarial training based method (Kurakin et al., 2016) and the fine-tuning using random input transformations based method (BaRT) (Raff et al., 2019), with Expectation Over Transformation (EOT) steps 10 and 40, against the PGD perturbation strength ($\epsilon$). Dataset: ImageNet-V50K. The results of the competing methods are taken from their respective papers. We further show the accuracy of REGroup w.r.t. the maximum perturbation strength $\epsilon$.

4.4 Performance on Un-Targeted Attacks

We evaluate REGroup on various untargeted attacks and report the results in table 2. We generated adversarial examples for ImageNet-V10K using ResNet-50 and VGG-19. The perturbation budgets for the respective attacks are listed in the table caption. With the exception of the maximum perturbation allowed, and the Boundary attack, for which we used 500 iterations, we used the default parameters given by Foolbox (Rauber et al., 2017). We observe that the performance of our defense is quite similar for both models, owing to the attack-agnostic nature of our defense. We achieve 48% accuracy (ResNet-50) against the PGD attack, which is significant given that PGD is considered one of the strongest attacks among the class of first-order adversaries.

ResNet-50 VGG-19
SMax REGroup SMax REGroup
Attacks #S T1(%) T1(%) #S T1(%) T1(%)
PGD 9997 0 48 9887 0 46
DFool 9789 0 61 9939 0 55
C&W 10000 0 40 10000 0 38
Trust Region 10000 0 41 9103 0 45
Boundary 10000 0 50 10000 0 50
Spatial 2624 0 36 2634 0 30
Clean Images 10000 100 88 10000 100 76
Table 2: Performance on un-targeted adversarial attacks. Dataset: ImageNet-V10K. Top-1 (T1, %) classification accuracy comparison between SoftMax (SMax) and REGroup. #S is the number of images for which the attacker is successfully able to generate adversarial examples using the respective attack model; the accuracies are reported with respect to these #S samples, hence the 0% accuracies with SoftMax (SMax). PGD and Trust Region both use an $L_\infty$ distance metric; C&W and DeepFool use an $L_2$ distance metric.

4.5 Performance on Targeted Attacks

We consider ImageNet-V2K  for targeted attacks. We choose a target at random from the 1000 classes of ImageNet, leaving out the ground-truth class. We report performance on PGD and C&W targeted attacks in table 3.

Figure 3: Effect of considering positive and negative pre-activation responses.

We consider three variants of Borda count rank-aggregation from the later layers: Pos ($B_c^{\ell+}$ only), Neg ($B_c^{\ell-}$ only), and Pos+Neg ($B_c^{\ell+} + B_c^{\ell-}$). We report the Top-1 accuracy (%) of the un-targeted (UN) attack experiments as set up in table 2 (DF: DFool, C&W, TR: Trust Region, BD: Boundary, SP: Spatial), in figure 2 (PGD2 and PGD4), and of the targeted (TAR) attacks in table 3. 'V50K_clean' and 'V10K_clean' correspond to the ResNet-50-REGroup experiment of table 1 and the VGG-19 clean images of table 2, respectively. From the bar chart it is evident that in some experiments Pos performs better than Neg (e.g., UN_TR), while in others Neg is better than Pos (e.g., UN_DF). It is also evident that Pos+Neg occasionally improves the overall performance, and the improvement is significant for the targeted C&W attacks on both ResNet-50 and VGG-19. We leave the choice to the application design: if inference time is an important parameter, one may choose either Pos or Neg alone to reduce the inference time to approximately half of what is reported in table 6.

ResNet-50 VGG-19
SMax REGroup SMax REGroup
Attacks T1 T1 T1 T1
PGD 0 47 0 31
C&W 0 46 0 38
Clean Images 100 86 100 72
Table 3: Performance on Targeted Adversarial Attacks. Dataset: ImageNet-V2K. Top-1 (T1 in %) classification accuracy comparison between SoftMax (SMax) and the REGroup based classification at test time.

4.6 Performance on PGD Attack with High Confidence

We evaluate REGroup on PGD examples on which the network makes highly confident predictions using SoftMax. We generate un-targeted and targeted adversarial examples using the PGD attack with the constraint that the network's confidence in its prediction on the adversarial example is at least 90%; for this experiment we do not put any constraint on the adversarial perturbation, i.e., $\epsilon$ is unbounded. For the targeted attack we select a target class uniformly at random from the 1000 classes, leaving out the true class. Results are reported in table 4.

Attacks               ResNet-50               VGG-19
                    #S   SMax  REGroup      #S   SMax  REGroup
PGD (un-targeted)  2000    0     21        2000    0     19
PGD (targeted)     2000    0     23        2000    0     17
Table 4: Performance on PGD Adversarial Examples with high confidence. Dataset: ImageNet-V2K. Top-1 (T1, %) classification accuracy comparison between SoftMax (SMax) and REGroup. #S is the number of images for which the attacker successfully generates adversarial examples using the respective attack model; accuracies are reported with respect to these samples, hence the 0% accuracy with SoftMax (SMax).

4.7 BPDA Attack and Gradient Masking

Backward Pass Differentiable Approximation (BPDA). BPDA (Athalye et al., 2018) is one of the key techniques developed to identify shattered gradients in a network, which mask gradients and thereby ‘seem’ to defend against other attacks. Let g(.) be a non-differentiable component of the network. To calculate the gradient of the loss with respect to the input in the backward pass, BPDA replaces the non-differentiable g(.) with a smooth approximation. In most cases the non-differentiable component is a transformation applied to the input image, and approximations such as g(x) ≈ x work well.
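For concreteness, a toy sketch of the identity approximation g(x) ≈ x follows; the quantization transform and linear model are illustrative stand-ins, not components of any specific defense:

```python
import numpy as np

def quantize(x, levels=8):
    """Non-differentiable input transform (bit-depth reduction),
    standing in for the non-differentiable component g."""
    return np.round(x * (levels - 1)) / (levels - 1)

def model_loss(x, w, y):
    """Toy linear 'network' applied after the transform."""
    return float((w @ quantize(x) - y) ** 2)

def bpda_input_gradient(x, w, y):
    """BPDA backward pass: differentiate the loss exactly down to the
    output of quantize, then treat quantize as the identity (g(x) ~ x),
    letting the gradient pass straight through to the input."""
    q = quantize(x)
    return 2.0 * (w @ q - y) * w  # identity approximation for dg/dx
```

Although d quantize / dx is zero almost everywhere, the straight-through gradient still points in a loss-reducing direction, which is what lets BPDA break transformation-based defenses.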

In REGroup, we rank-aggregate the preferences of the selected highest layers to predict the label of an input sample. As ranking preferences requires a sorting operation, which leads to discontinuities and undefined gradients, it is difficult to design a smooth approximation that would make BPDA work. A similar argument is made in (Xiao et al., 2020), where the k-winners-take-all activation results in non-smooth feature outputs at different layers. We are not aware of any smooth approximation for the ranking and vote aggregation parts of REGroup. Furthermore, theoretical groundings of Borda count for collective decision making (Rothe, 2019) indicate that in order to change the final outcome, manipulators have to make the voters vote strategically and change their rankings accordingly. It has been shown that such manipulation of more than two voters in a Borda count based preferential voting scheme is an NP-complete problem (Rothe, 2019; Conitzer et al., 2007; Brelsford et al., 2008; Davies et al., 2011; Betzler et al., 2011; Davies et al., 2014). Even if a smooth approximation existed, constructing a successful BPDA attack on REGroup would require manipulations as hard as NP-complete problems.

Figure 4: Ablation study of accuracy vs. number of layers on VGG-19. ‘Agg’ stands for using the aggregated Borda count. PGD, DFool, C&W and Trust Region are the same experiments as reported in table 2, but with all possible numbers of layers. ”Per_Layer_V10K” stands for evaluation using the per-layer Borda count on a separate 10,000-image correctly classified subset of the validation set. In all our experiments we choose the highest layers where ‘Per_Layer_V10K’ meets the minimum accuracy criterion. A reasonable change in this accuracy criterion would not affect the results on adversarial attacks significantly; however, a substantial change deteriorates the performance on clean samples significantly. The trade-off between accuracy on clean samples and robustness has been studied in (Dohmatob, 2018) and (Tsipras et al., 2018).

Checking Gradient Masking. To further ensure that our proposed method is not masking the gradients, we evaluate it against the strong gradient-free SPSA attack (Uesato et al., 2018). In (Uesato et al., 2018), it is reported that PGD and SPSA find quite similar adversarial perturbations for most images, which highlights the effectiveness of SPSA. We use 300 iterations of SPSA with early_stop_loss_thresh = 0, which makes the attack terminate as soon as it finds a misclassification rather than running for all the steps. We use a batch size of 32 images with the maximum perturbation budget in the given distance metric. The other parameters follow the implementation of the attack in the Cleverhans library (Papernot et al., 2018). We achieve a higher accuracy compared to the gradient-based attacks, which confirms that REGroup is not masking the gradients.
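SPSA estimates gradients from function evaluations alone, so masked or shattered gradients cannot hide a model's weaknesses from it. A minimal sketch of the SPSA gradient estimator follows (parameter values are illustrative; the experiments above use the Cleverhans implementation):

```python
import numpy as np

def spsa_gradient(loss_fn, x, delta=0.01, n_samples=64, rng=None):
    """SPSA gradient estimate from function evaluations only: sample
    random +/-1 directions v and average the symmetric finite
    difference (loss(x + d v) - loss(x - d v)) / (2 d) * v.
    For +/-1 entries, 1/v_i = v_i, so multiplying by v is correct."""
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=x.shape)
        diff = loss_fn(x + delta * v) - loss_fn(x - delta * v)
        grad += diff / (2.0 * delta) * v
    return grad / n_samples
```

An attack then uses this estimate wherever a gradient-based attack would use the backpropagated gradient, which is why it serves as a sanity check against gradient masking.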

Attacks        ResNet-50               VGG-19
             #S   SMax  REGroup      #S   SMax  REGroup
SPSA ()      861    0     71        1014    0     64
SPSA ()     1199    0     70        1350    0     63
Table 5: Performance on un-targeted SPSA Adversarial Attack. Dataset: ImageNet-V2K. Top-1 (T1, %) classification accuracy comparison between SoftMax (SMax) and REGroup. #S is the number of images for which the attacker successfully generates adversarial examples; accuracies are reported with respect to these samples, hence the 0% accuracy with SoftMax (SMax).

4.8 Analysis and Ablation Study

In this section, we study the impact of the number of layers/voters and of considering both positive and negative pre-activation responses.
Number of layers: We report the performance of REGroup on the attacks of table 2 for all possible numbers of layers. The accuracy of VGG-19 with respect to the number of layers is plotted in figure 4. We observe a similar trend for ResNet-50, and note that any reasonable choice made based on this graph does not significantly impact REGroup’s performance.
Effect of positive and negative pre-activation responses: We evaluate the impact of using positive, negative, and combined pre-activation responses on the performance of REGroup on various attacks and on clean samples in figure 3.
Inference time using REGroup: We use PyTorch for all our experiments; a GPU is only required for extracting layer outputs and generating adversarial examples. Since REGroup is used at test time, we compare its inference time with SoftMax for both ResNet-50 and VGG-19 on both GPU and CPU. The inference times are reported in table 6.


5 Conclusion

In this work, we have presented a simple, scalable, and practical defense strategy that is model agnostic and does not require any re-training or fine-tuning. We suggest using REGroup at test time to make a pre-trained network robust to adversarial perturbations.

             ResNet-50                VGG-19
          SMax        REGroup     SMax        REGroup
Time (s)  0.02  0.06  0.13  0.35  0.03  0.12  0.16  0.64
Table 6: Inference Time Comparison, REGroup vs. SoftMax. We use a workstation with an i7-8700 CPU and a GTX 1080 GPU.

Using challenging adversarial attacks created on ImageNet, we have shown that the proposed defense, REGroup, performs competitively with state-of-the-art defenses (Raff et al., 2019; Kurakin et al., 2016) that have the clear advantage of adversarially training / fine-tuning the base network. Three main reasons justify the success of REGroup. First, instead of a maximum-likelihood based prediction, REGroup adopts a ranking-preference based approach. Second, aggregating preferences from multiple layers leads to group decision making, unlike SoftMax, which relies on the output of the last layer only. Third, Borda count is inherently robust in rank aggregation: it is well established that Borda count is robust to noise in the rankings of individual voters (Rothe, 2019; Kahng et al., 2019). Hence, where SoftMax fails to predict the correct class of an adversarial example generated by an attacker aiming to cause misclassification in a maximum-likelihood sense, REGroup takes ranked preferences from multiple layers and builds a consensus using Borda count to make robust predictions. Our promising empirical results indicate that a deeper theoretical analysis of REGroup would be an interesting direction to pursue.


  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274–283. Cited by: §2, §4.7.
  • A. Athalye and N. Carlini (2018) On the robustness of the cvpr 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286. Cited by: §2.
  • M. Behjati, S. Moosavi-Dezfooli, M. S. Baghshah, and P. Frossard (2019) Universal adversarial attacks on text classifiers. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7345–7349. Cited by: §1.
  • N. Betzler, R. Niedermeier, and G. J. Woeginger (2011) Unweighted coalitional manipulation under the borda rule is np-hard. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §4.7.
  • A. N. Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal (2018) Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pp. 1–5. Cited by: §2.
  • D. Black et al. (1958) The theory of committees and elections. Cited by: §3.2.
  • E. Brelsford, P. Faliszewski, E. Hemaspaandra, H. Schnoor, and I. Schnoor (2008) Approximability of manipulating elections.. In AAAI, Vol. 8, pp. 44–49. Cited by: §4.7.
  • W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §4.1.
  • N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, and A. Madry (2019) On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705. Cited by: §4.3.1.
  • N. Carlini and D. Wagner (2017a) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: §2.
  • N. Carlini and D. Wagner (2017b) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §2, §4.1.
  • A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay (2018) Adversarial attacks and defences: a survey. arXiv preprint arXiv:1810.00069. Cited by: §2.
  • T. Chin, H. Wang, and D. Suter (2009) The ordered residual kernel for robust motion subspace clustering. In Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 333–341. Cited by: §3.
  • T. Chin, J. Yu, and D. Suter (2011) Accelerated hypothesis generation for multistructure data via preference analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 625–638. Cited by: §3.
  • N. Chokshi (2018) New York Times: is alexa listening? amazon echo sent out recording of couple’s conversation. Note: https://www.nytimes.com/2018/05/25/business/amazon-alexa-conversation-shared-echo.htmlAccessed: 2020-01-05 Cited by: §1.
  • V. Conitzer, T. Sandholm, and J. Lang (2007) When are elections with few candidates hard to manipulate?. Journal of the ACM (JACM) 54 (3), pp. 14–es. Cited by: §4.7.
  • A. R. Conn, N. I. Gould, and P. L. Toint (2000) Trust region methods. Vol. 1, Siam. Cited by: §4.1.
  • N. Das, M. Shanbhogue, S. Chen, F. Hohman, L. Chen, M. E. Kounavis, and D. H. Chau (2017) Keeping the bad guys out: protecting and vaccinating deep learning with jpeg compression. arXiv preprint arXiv:1705.02900. Cited by: §2.
  • J. Davies, G. Katsirelos, N. Narodytska, T. Walsh, and L. Xia (2014) Complexity of and algorithms for the manipulation of borda, nanson’s and baldwin’s voting rules. Artificial Intelligence 217, pp. 20–42. Cited by: §4.7.
  • J. Davies, G. Katsirelos, N. Narodytska, and T. Walsh (2011) Complexity of and algorithms for borda manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: §4.7.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §4.2.
  • E. Dohmatob (2018) Limitations of adversarial robustness: strong no free lunch theorem. arXiv preprint arXiv:1810.04065. Cited by: Figure 6, Figure 4, §4.3.
  • L. Engstrom, A. Ilyas, and A. Athalye (2018) Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272. Cited by: §2.
  • L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2019) Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pp. 1802–1811. Cited by: §4.1.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. Cited by: §1.
  • E. Fetaya, J. Jacobsen, W. Grathwohl, and R. Zemel (2020) Understanding the limitations of conditional generative models. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §4.1.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §2.
  • C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §1.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine 29. Cited by: §1.
  • A. Kahng, M. K. Lee, R. Noothigattu, A. Procaccia, and C. Psomas (2019) Statistical foundations of virtual democracy. In International Conference on Machine Learning, pp. 3173–3182. Cited by: §3.2, §5.
  • H. Kannan, A. Kurakin, and I. Goodfellow (2018) Adversarial logit pairing. arXiv preprint arXiv:1803.06373. Cited by: §2.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1, Figure 2, §4.3.1, Table 1, §5.
  • X. Li and F. Li (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5764–5772. Cited by: §2.
  • Y. Li, J. Bradshaw, and Y. Sharma (2019) Are generative classifiers more robust to adversarial attacks?. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 3804–3814. Cited by: §1.
  • F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1787. Cited by: §2.
  • A. Madry, A. Athalye, D. Tsipras, and L. Engstrom. Robust-ML. Note: https://www.robust-ml.org/ Accessed: 2020-01-05. Cited by: §2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §4.1.
  • J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267. Cited by: §2.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2016) Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725. Cited by: §2.
  • S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765–1773. Cited by: §2, §2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §4.1.
  • N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long (2018) Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768. Cited by: §4.7.
  • N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §2.
  • N. Papernot and P. McDaniel (2017) Extending defensive distillation. arXiv preprint arXiv:1705.05264. Cited by: §2.
  • P. Ghosh, A. Losalka, and M. J. Black (2019) Resisting adversarial attacks using gaussian mixture variational autoencoders. In AAAI, pp. 541–548. External Links: Link. Cited by: §1.
  • A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer (2018) Deflecting adversarial attacks with pixel deflection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8571–8580. Cited by: §2.
  • E. Raff, J. Sylvester, S. Forsyth, and M. McLean (2019) Barrage of random transforms for adversarially robust defense. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6528–6537. Cited by: §1, Figure 2, §4.3.1, §4.3, §4.3, Table 1, §5.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: Appendix A, §4.1, §4.4.
  • J. Rothe (2019) Borda count in collective decision making: a summary of recent results. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9830–9836. Cited by: §1, §3.2, §3.2, §4.7, §5.
  • L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa (2018) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665. Cited by: §1.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2019) Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • Y. Sharma and P. Chen (2018) Bypassing feature squeezing by increasing adversary strength. arXiv preprint arXiv:1803.09868. Cited by: §2.
  • R. Shin and D. Song (2017) JPEG-resistant adversarial images. In NIPS 2017 Workshop on Machine Learning and Computer Security, Cited by: §2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1.
  • L. Tiwari and S. Anand (2018) DGSAC: density guided sampling and consensus. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 974–982. Cited by: §3.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: Figure 6, Figure 4, §4.3.
  • J. Uesato, B. O’Donoghue, P. Kohli, and A. Oord (2018) Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5025–5034. Cited by: §4.7.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: Appendix B.
  • J. Van Newenhizen (1992) The borda method is most likely to respect the condorcet principle. Economic Theory 2 (1), pp. 69–83. Cited by: §3.2.
  • Q. Wang, W. Guo, K. Zhang, I. Ororbia, G. Alexander, X. Xing, X. Liu, and C. L. Giles (2016) Learning adversary-resistant deep neural networks. arXiv preprint arXiv:1612.01401. Cited by: §2.
  • C. Xiao, P. Zhong, and C. Zheng (2020) Enhancing adversarial defense by k-winners-take-all. In International Conference on Learning Representations, External Links: Link Cited by: §4.7.
  • C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille (2017a) Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991. Cited by: §2.
  • C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille (2017b) Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1369–1378. Cited by: §2.
  • C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509. Cited by: §1, §1, §4.3, §4.3, Table 1.
  • W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §2.
  • Z. Yao, A. Gholami, P. Xu, K. Keutzer, and M. W. Mahoney (2019) Trust region based adversarial attack on neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11350–11359. Cited by: §4.1.
  • H. P. Young (1988) Condorcet’s theory of voting. American Political science review 82 (4), pp. 1231–1244. Cited by: §3.2.
  • X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824. Cited by: §2.
  • M. Zajac, K. Zołna, N. Rostamzadeh, and P. O. Pinheiro (2019) Adversarial framing for image and video classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 10077–10078. Cited by: §1.
  • V. Zantedeschi, M. Nicolae, and A. Rawat (2017) Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 39–49. Cited by: §2.
  • S. Zheng, Y. Song, T. Leung, and I. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. In Proceedings of the ieee conference on computer vision and pattern recognition, pp. 4480–4488. Cited by: §2.

Appendix A Hyper-parameters for Generating Adversarial Examples

We use Foolbox’s (Rauber et al., 2017) implementation of almost all the adversarial attacks used in this work. We report the attack specific hyper-parameters in Tab.7.

Attack Hyper-parameters
PGD (Untargeted) , Dist:, random_start=True,
stepsize=0.01, max_iter=40
DeepFool (Untargeted) , Dist:, max_iter=100,
subsample=10 (Limit on the number of the most likely classes)
CW (Untargeted) , Dist:, binary_search_steps=5, max_iter=1000,
confidence=0, learning_rate=0.005, initial_const=0.01
Trust Region (Untargeted) , Dist:
Boundary (Untargeted) , Dist:, iterations=500, max_directions=25, starting_point=None, initialization_attack=None,
log_every_n_steps=None, spherical_step=0.01, source_step=0.01, step_adaptation=1.5,
batch_size=1, tune_batch_size=True, threaded_rnd=True, threaded_gen=True
Spatial (Untargeted) , Dist=, do_rotations=True, do_translations=True, x_shift_limits=(-5, 5),
y_shift_limits=(-5, 5), angular_limits=(-5, 5), granularity=10, random_sampling=False, abort_early=True
PGD (Targeted) Dist = , binary_search=True, epsilon=0.3,
stepsize=0.01, iterations=40, random_start=True, return_early=True
CW (Targeted) binary_search_steps=5, max_iterations=1000, confidence=0,
learning_rate=0.005, initial_const=0.01, abort_early=True
EAD Dist=, binary_search_steps=5, max_iterations=1000, confidence=0,
initial_learning_rate=0.01, regularization=0.01, initial_const=0.01, abort_early=True
PGD (Untargeted,HC) min_conf=0.9, Dist=, binary_search=True, epsilon=0.3,
stepsize=0.01, iterations=40, random_start=True, return_early=True
PGD (Targeted,HC) min_conf=0.9, Dist=, binary_search=True, epsilon=0.3,
stepsize=0.01, iterations=40, random_start=True, return_early=True
Table 7: Attack Specific Hyper-parameters.
(a) ResNet-50 Layer (CONV)
(b) ResNet-50 Layer (CONV)
(c) ResNet-50 Layer (FC)
(d) VGG-19 Layer (FC)
(e) VGG-19 Layer (FC)
(f) VGG-19 Layer (FC)
Figure 5: t-SNE visualization of three variants of pre-activation features, i.e. positive only (pos), negative only (neg), and combined positive and negative (combined). Visualization of 50 samples from 5 random classes of the ImageNet dataset; class membership is color coded. The dimensions of the pos, neg and combined variants of the pre-activation feature are the same for any fully connected layer, while for a CONV layer pos and neg have the same dimension, equal to the number of filters/feature maps of the respective CONV layer, and the dimension of combined is equal to that of the flattened CONV layer. It can be observed in figure (b) that the clusters formed by the combined pre-activation responses are not as tight as those formed by pos and neg separately, which shows the importance of considering pos and neg pre-activation responses separately.
Figure 6: Ablation study of accuracy vs. number of layers on ResNet-50. ‘Agg’ stands for using the aggregated Borda count. PGD, DFool, C&W and Trust Region are the same experiments as reported in table 2 of the main paper, but with all possible numbers of layers. ”Per_Layer_V10K” stands for evaluation using the per-layer Borda count on a separate 10,000-image correctly classified subset of the validation set. In all our experiments we choose the highest layers where ‘Per_Layer_V10K’ meets the minimum accuracy criterion. A reasonable change in this accuracy criterion would not affect the results on adversarial attacks significantly; however, a substantial change deteriorates the performance on clean samples significantly. The trade-off between accuracy on clean samples and robustness has been studied in (Dohmatob, 2018) and (Tsipras et al., 2018). Note: there are four down-sampling layers in the ResNet-50 architecture, hence the total of 54 layers.

Appendix B Analyzing Pre-Activation Responses

One of the contributions of our proposed approach is to use the positive and negative pre-activation values separately. We observed that both positive and negative pre-activation values contain information that can help correctly classify adversarially perturbed samples; an empirical validation of this is shown in figure 3 of the main paper. We further show using t-SNE (van der Maaten and Hinton, 2008) plots that all three variants of a layer’s pre-activation feature, i.e. positive only (pos), negative only (neg), and combined positive and negative values, form clusters. This indicates that all three contain comparable information for discriminating samples of one class from others. While ReLU-like activation functions discard the negative pre-activation responses, we consider the negative responses equally important and leverage them to model the layer-wise behaviour of class samples. A further benefit of the positive and negative accumulators is a significant reduction in computational cost: flattening a convolution layer gives a very high-dimensional vector, whereas the accumulators reduce it to the number of filters.
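The accumulators described above can be sketched as follows; the helper below is illustrative, not the REGroup code. For a conv layer with C filters it produces two C-dimensional features (positive and negative sums per feature map) instead of one flattened C*H*W vector, while for a fully connected layer it simply splits the vector by sign:

```python
import numpy as np

def posneg_accumulate(z):
    """Split pre-activation responses z into positive and negative
    accumulators. For a conv tensor of shape (C, H, W), sum over the
    spatial axes so each sign yields one value per filter; for a 1-D
    FC vector there are no spatial axes, so the split is element-wise."""
    z = np.asarray(z, dtype=float)
    spatial = tuple(range(1, z.ndim))          # () for an FC vector
    pos = np.maximum(z, 0.0).sum(axis=spatial)
    neg = (-np.minimum(z, 0.0)).sum(axis=spatial)
    return pos, neg
```

The per-class statistics used by the generative classifiers can then be estimated over these low-dimensional accumulator features rather than over full flattened activations.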

Appendix C Accuracy vs. Number of Layers/Voters (ResNet-50)

We report the performance of REGroup on the attacks of table 2 of the main paper for all possible numbers of layers. The accuracy of ResNet-50 with respect to the number of layers is plotted in figure 6.

Appendix D REGroup Demo

The REGroup demo code is provided at https://github.com/lokender/REGroup . The code is based on a VGG-19 classifier and the CIFAR-10 dataset.