1 Introduction
Deep neural networks are at the heart of current advances in Computer Vision and Pattern Recognition, providing state-of-the-art performance on many challenging classification tasks [9], [12], [14], [16], [36], [37]. However, Moosavi-Dezfooli et al. [25] recently showed that it is possible to fool deep networks into changing their prediction for 'any' image that is slightly perturbed with a Universal Adversarial Perturbation. For a given network model, these image-agnostic (hence universal) perturbations can be computed rather easily [25], [26]. The perturbations remain quasi-imperceptible (see Fig. 1), yet the adversarial examples generated by adding the perturbations to the images fool the networks with alarmingly high probability [25]. Furthermore, the fooling generalizes well across different network models.
Being image-agnostic, universal adversarial perturbations can be conveniently exploited to fool models on-the-fly on unseen images by using pre-computed perturbations. This even obviates the need for the on-board computational capacity that is required to generate image-specific perturbations [7], [21]. This fact, along with the cross-model generalization of universal perturbations, makes them particularly relevant to practical settings where a model is deployed in a possibly hostile environment. Thus, defense against these perturbations is a necessity for the success of Deep Learning in practice. The need for countermeasures becomes even more pronounced considering that real-world scenes (e.g. sign boards on roads) modified by adversarial perturbations can also behave as adversarial examples for the networks [17].
This work proposes the first dedicated defense against the universal adversarial perturbations [25]. The major contributions of this paper are as follows:

We propose to learn a Perturbation Rectifying Network (PRN) that is trained as the 'pre-input' of a targeted network model. This allows our framework to provide defense to already deployed networks without the need to modify them.

We propose a method to efficiently compute synthetic image-agnostic adversarial perturbations to effectively train the PRN. The successful generation of these perturbations complements the theoretical findings of Moosavi-Dezfooli et al. [26].

We also propose a separate perturbation detector that is learned from the Discrete Cosine Transform of the image rectifications performed by the PRN for clean and perturbed examples.

Rigorous evaluation is performed by defending GoogLeNet [37], CaffeNet [16] and the VGG-F network [4], demonstrating up to 97.5% success rate on unseen images possibly modified with unseen perturbations. (The choice of the networks is based on the computational feasibility of generating the adversarial perturbations for the evaluation protocol in Section 5; however, our approach is generic in nature.) Our experiments also show that the proposed PRN generalizes well across different network models.
2 Related work
The robustness of image classifiers against adversarial perturbations has gained significant attention in the last few years [6], [7], [29], [32], [34], [35], [40]. Deep neural networks became the center of attention in this area after Szegedy et al. [39] first demonstrated the existence of adversarial perturbations for such networks. See [1] for a recent review of literature in this direction. Szegedy et al. [39]
computed adversarial examples for the networks by adding quasi-imperceptible perturbations to the images, where the perturbations were estimated by maximizing the network's prediction error. Although these perturbations were image-specific, it was shown that the same perturbed images were able to fool multiple network models. Szegedy et al. reported encouraging results for improving model robustness against adversarial attacks by using adversarial examples for training, a.k.a. adversarial training.
Goodfellow et al. [10] built on the findings in [39] and developed a 'fast gradient sign method' to efficiently generate adversarial examples that can be used for training the networks. They hypothesized that it is the linearity of deep networks that makes them vulnerable to adversarial perturbations. However, Tanay and Griffin [41] later constructed image classes that do not suffer from adversarial examples for linear classifiers. Their arguments about the existence of adversarial perturbations again point towards the over-fitting phenomenon, which can be alleviated by regularization. Nevertheless, it remains unclear how a network should be regularized for robustness against adversarial examples.
Moosavi-Dezfooli et al. [27] proposed the DeepFool algorithm to compute image-specific adversarial perturbations by assuming that the loss function of the network is linearizable around the current training sample. In contrast to the one-step perturbation estimation of [10], their approach computes the perturbation in an iterative manner. They also reported that augmenting the training data with adversarial examples significantly increases the robustness of networks against adversarial perturbations. Baluja and Fischer [2] trained an Adversarial Transformation Network to generate adversarial examples against a target network. Liu et al. [19] analyzed the transferability of adversarial examples. They studied this property for both targeted and non-targeted examples, and proposed an ensemble-based approach to generate examples with better transferability.
The above-mentioned techniques mainly focus on generating adversarial examples, and address the defense against those examples with adversarial training. In line with our take on the problem, a few recent techniques also directly focus on the defense against adversarial examples. For instance, Lu et al. [22]
mitigate the issues resulting from the adversarial perturbations using foveation. Their main argument is that neural networks (for ImageNet [33]) are robust to the foveation-induced scale and translation variations of the images; however, this property does not generalize to the perturbation transformations.
Papernot et al. [30] used distillation [13] to make neural networks more robust against adversarial perturbations. However, Carlini and Wagner [3] later introduced adversarial attacks that cannot be defended by the distillation method. Kurakin et al. [18] specifically studied adversarial training for making large models (e.g. Inception v3 [38]) robust to perturbations, and found that the training indeed provides robustness against the perturbations generated by one-step methods [10]. However, Tramer et al. [42] found that this robustness weakens for adversarial examples learned using different networks, i.e. for the black-box attacks [19]. Hence, ensemble adversarial training was proposed in [42], which uses adversarial examples generated by multiple networks.
Dziugaite et al. [5] studied the effects of JPG compression on adversarial examples and found that the compression can sometimes revert network fooling. Nevertheless, it was concluded that JPG compression alone is insufficient as a defense against adversarial attacks. Prakash et al. [31] took advantage of the localization of the perturbed pixels in their defense. Lu et al. [20] proposed SafetyNet for detecting and rejecting adversarial examples for conventional network classifiers (e.g. VGG19 [11]); it capitalizes on the late-stage ReLUs of the network to detect the perturbed examples. Similarly, a proposal of appending the deep neural networks with detector subnetworks was also presented by Metzen et al. [23]. In addition to classification, adversarial examples and the robustness of deep networks against them have also recently been investigated for the tasks of semantic segmentation and object detection [8], [21], [43].
Whereas the central topic of all the above-mentioned literature is the perturbations computed for individual images, Moosavi-Dezfooli et al. [25] were the first to show the existence of image-agnostic perturbations for neural networks. These perturbations were further analyzed in [26], whereas Metzen et al. [24] also showed their existence for semantic image segmentation. To date, no dedicated technique exists for defending the networks against the universal adversarial perturbations, which is the topic of this paper.
3 Problem formulation
Below, we present the notions of universal adversarial perturbations and the defense against them more formally. Let ℑ_c denote the distribution of the (clean) natural images in a d-dimensional space, such that a class label ℓ(I_c) is associated with every sample I_c ∼ ℑ_c. Let C(.) be a classifier (a deep network) that maps an image to its class label, i.e. C(I_c) : I_c → ℓ(I_c). The vector ρ ∈ ℝ^d is a universal adversarial perturbation for the classifier if it satisfies the following constraint:

    P_{I_c ∼ ℑ_c} ( C(I_c + ρ) ≠ C(I_c) ) ≥ δ   s.t.   ||ρ||_p ≤ ξ,        (1)

where P(.) is the probability, ||.||_p denotes the ℓ_p-norm of a vector with p ∈ [1, ∞), δ ∈ (0, 1] denotes the fooling ratio, and ξ is a pre-defined constant. In the text to follow, we alternatively refer to ρ simply as the perturbation for brevity.
In (1), the perturbations in question are image-agnostic, hence Moosavi-Dezfooli et al. [25] termed them universal. (A single perturbation that satisfies (1) for 'any' classifier is referred to as 'doubly universal' by Moosavi-Dezfooli et al. [25]; we focus on the singly universal perturbations in this work.) According to the stated definition, the parameter ξ controls the norm of the perturbation. For quasi-imperceptible perturbations, the value of this parameter should be very small as compared to the image norm ||I_c||_p. On the other hand, a larger ξ is required for the perturbation to fool the classifier with a higher probability. In this work, we let p ∈ {2, ∞} and consider the perturbations constrained by their ℓ_2 and ℓ_∞ norms. For the ℓ_∞-norm perturbations we let ξ = 10, and select ξ = 2,000 for the ℓ_2-norm perturbations. For both types, these values are roughly 4% of the means of the respective image norms used in our experiments (in Section 5), which is the same as [25].
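To make the definition concrete, the fooling-ratio constraint and the back-projection onto the ξ-ball can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the `classify` callable stands in for the network C(.).

```python
import numpy as np

def fooling_ratio(classify, images, rho):
    """Fraction of samples whose predicted label changes under perturbation rho."""
    clean = np.array([classify(x) for x in images])
    fooled = np.array([classify(x + rho) for x in images])
    return float(np.mean(clean != fooled))

def project_to_ball(rho, xi, p):
    """Back-project a perturbation onto the l_p ball of radius xi (p in {2, inf})."""
    if p == 2:
        norm = np.linalg.norm(rho)
        return rho if norm <= xi else rho * (xi / norm)
    if p == np.inf:
        return np.clip(rho, -xi, xi)
    raise ValueError("only p in {2, inf} supported")
```

A perturbation satisfies (1) for a given δ exactly when `fooling_ratio(...) >= delta` while its norm respects the ξ-ball constraint.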
To defend against the perturbations, we seek two components of the defense mechanism: (1) a perturbation 'detector' D(.) and (2) a perturbation 'rectifier' R(.). The detector determines whether an unseen image is perturbed or clean. Denoting a perturbed image by I_ρ = I_c + ρ, the objective of the rectifier is to compute a transformation R(I_ρ) of the perturbed image such that C(R(I_ρ)) = C(I_c). Notice that the rectifier does not seek to improve the prediction of C(.) on the rectified version of the image beyond the classifier's performance on the clean/original image. This ensures stable induction of R(.). Moreover, the formulation allows us to compute R(.) such that C(R(I_c)) = C(I_c). We leverage this property to learn R(.) as the pre-input layers of C(.) in an end-to-end fashion.
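The resulting inference pipeline, i.e. rectify only when the detector fires, can be sketched as follows. This is a minimal illustration; the three callables are hypothetical stand-ins for C(.), R(.) and D(.), not the paper's implementation.

```python
def defended_classify(classify, rectify, detect, image):
    """Inference-time defense wrapper: the detector D(.) decides whether the
    rectification R(.) is applied before the targeted classifier C(.)."""
    if detect(image):                  # detector flags the image as perturbed
        return classify(rectify(image))
    return classify(image)             # clean images bypass the rectifier
```

For example, with toy stand-ins `detect = lambda x: x < 0`, `rectify = abs` and `classify = lambda x: x * 2`, a "perturbed" input `-3` is first rectified to `3` before classification.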
4 Proposed approach
We draw on the insights from the literature reviewed in Section 2 to develop a framework for defending a (possibly) targeted network model against universal adversarial perturbations. Figure 2 shows the schematics of our approach to learn the ‘rectifier’ and the ‘detector’ components of the defense framework. We use the Perturbation Rectifying Network (PRN) as the ‘rectifier’, whereas a binary classifier is eventually trained to detect the adversarial perturbations in the images. The framework uses both real and synthetic perturbations for training. The constituents of the proposed framework are explained below.
4.1 Perturbation Rectifying Network (PRN)
At the core of our technique is the Perturbation Rectifying Network (PRN), which is trained as pre-input layers to the targeted network classifier. The PRN is attached to the first layer of the classification network, and the joint network is trained to minimize the following cost:
    J(θ_p, b_p) = (1/N) Σ_{i=1}^{N} L(ℓ*_i, ℓ̂_i),        (2)

where ℓ̂_i and ℓ*_i are the labels predicted by the joint network and the targeted network respectively, such that ℓ*_i is necessarily computed for the clean image. For the N training examples, L(.) computes the loss, whereas θ_p and b_p denote the PRN weight and bias parameters.
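With softmax outputs and a cross-entropy loss, the cost over a batch reduces to a mean negative log-likelihood against the targeted network's clean-image labels. A minimal NumPy sketch of this computation (an illustration under those assumptions, not the paper's implementation):

```python
import numpy as np

def prn_cost(joint_probs, target_labels):
    """Mean cross-entropy form of Eq. (2): joint_probs are (N, K) softmax
    outputs of PRN + (frozen) targeted network; target_labels are the N
    integer labels the targeted network predicts for the *clean* versions
    of the same training images."""
    n = joint_probs.shape[0]
    picked = joint_probs[np.arange(n), target_labels]   # probability of label l*_i
    return float(-np.mean(np.log(picked + 1e-12)))      # small eps for stability
```

In training, only the PRN parameters would be updated to minimize this cost; the targeted network's weights stay frozen.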
In Eq. (2) we define the cost over the parameters of the PRN only, which ensures that the (already deployed) targeted network does not require any modification for the defense provided by our framework. This strategy is orthogonal to the existing defense techniques that either update the targeted model using adversarial training to make the network more robust [18], [42]; or incorporate architectural changes to the targeted network, which may include adding a subnetwork to the model [23] or tapping into the activations of certain layers to detect adversarial examples [20]. Our defense mechanism acts as an external wrapper for the targeted network, such that the PRN (and the detector) trained to counter the adversarial attacks can be kept secret in order to refrain from potential counter-counter attacks. (The PRN + targeted network is end-to-end differentiable, and the joint network can be susceptible to stronger attacks if the PRN is not kept secret. However, stronger perturbations are also more easily detectable by our detector.) This is a highly desirable property of defense frameworks in real-world scenarios. Moosavi-Dezfooli et al. [25] noted that universal adversarial perturbations can still exist for a model even after its adversarial training. The proposed framework inherently caters for this problem.
We train the PRN using both clean and adversarial examples to ensure that the image transformation learned by our network is not biased towards the adversarial examples. For training, the target label is computed separately with the targeted network for the clean version of each training example. The PRN is implemented as 5 ResNet blocks [12] sandwiched between convolution layers. The input image is fed to a 'same' convolution layer (stride = 1, 64 feature maps), followed by the 5 ResNet blocks, where each block consists of two convolution layers with ReLU activations [28]. The feature maps of the last ResNet block are processed by a 'same' convolution layer (stride = 1, 16 feature maps), and then a final 'same' convolution layer (stride = 1, 3 feature maps) that outputs the rectified image. We use the cross-entropy loss [9] for training the PRN with the ADAM optimizer [15]. The exponential decay rates for the first and second moment estimates are set to 0.9 and 0.999 respectively. We set the initial learning rate to 0.01, and decay it after each 1K iterations. We used a mini-batch size of 64, and trained the PRN for a given targeted network for at least 5 epochs.
4.2 Training data
The PRN is trained using clean images as well as their adversarial counterparts, constructed by adding perturbations to the clean images. We compute the latter by first generating a set of perturbations following Moosavi-Dezfooli et al. [25]. Their algorithm computes a universal perturbation in an iterative manner. In its inner loop (run over the training images), the algorithm seeks a minimal-norm vector [27] that fools the network on a given image. The current estimate of the universal perturbation is updated by adding this vector to it and back-projecting the resultant vector onto the ℓ_p ball of radius ξ. The outer loop ensures that the desired fooling ratio is achieved over the complete training set. Generally, the algorithm requires several passes on the training data to achieve an acceptable fooling ratio. We refer to [25] for further details on the algorithm.
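For intuition, this iterative scheme can be sketched for a toy linear classifier sign(w·x + b), where the minimal-norm fooling step has a closed form (for a deep network, [27] computes this step iteratively). All parameters and the classifier here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def universal_perturbation(images, w, b, xi, delta=0.8, max_epochs=10):
    """Sketch of the iterative scheme of [25] for a toy linear classifier:
    push each not-yet-fooled image across the boundary with the minimal l2
    step, accumulate the steps into rho, back-project rho onto the l2 ball
    of radius xi, and stop once the fooling ratio reaches delta."""
    rho = np.zeros_like(images[0])
    label = lambda x: np.sign(w @ x + b)
    for _ in range(max_epochs):
        for x in images:
            if label(x + rho) == label(x):           # not fooled yet: seek minimal step
                f = w @ (x + rho) + b
                step = -(abs(f) / (w @ w) + 1e-4) * np.sign(f) * w
                rho = rho + step
                norm = np.linalg.norm(rho)           # back-project onto the xi-ball
                if norm > xi:
                    rho = rho * (xi / norm)
        fooled = np.mean([label(x + rho) != label(x) for x in images])
        if fooled >= delta:                          # desired fooling ratio reached
            break
    return rho
```

The `1e-4` overshoot nudges each point just past the decision boundary, mirroring the small overshoot used in DeepFool-style updates.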
A PRN trained with more adversarial patterns underlying the training images is expected to perform better. However, it becomes computationally infeasible to generate a large number of perturbations using the above-mentioned algorithm. Therefore, we devise a mechanism to efficiently generate synthetic perturbations to augment the set of available perturbations for training the PRN. The synthetic perturbations are computed using this set while capitalizing on the theoretical results of [26]. To generate a synthetic perturbation ρ_s, we compute a vector that satisfies the following conditions: (c1) ρ_s lies in the positive orthant of the subspace spanned by the elements of the available perturbation set; (c2) ||ρ_s||_∞ ≤ ξ; and (c3) ||ρ_s||_2 is comparable to the expected value of the ℓ_2 norms of the available perturbations. (For the perturbations restricted by their ℓ_2 norm only, condition (c3) is ignored; in that case the random walk is terminated by the ℓ_2 norm directly.) The procedure for computing the synthetic perturbations that are constrained by their ℓ_∞ norm is summarized in Algorithm 1. We refer to the supplementary material of the paper for the algorithm to compute the ℓ_2-norm perturbations.
To generate a synthetic perturbation, Algorithm 1 searches for ρ_s by taking small random steps in the directions governed by the unit vectors of the elements of the available perturbation set. The random walk continues while the ℓ_∞ norm of ρ_s remains smaller than ξ. The algorithm selects the found ρ_s as a valid perturbation if its ℓ_2 norm is comparable to the expected value of the ℓ_2 norms of the available perturbations. For generating the ℓ_2-norm perturbations, the corresponding algorithm given in the supplementary material terminates the random walk based on the ℓ_2 norm instead, and directly selects the computed ρ_s as the desired perturbation. Analyzing the robustness of deep networks against universal adversarial perturbations, Moosavi-Dezfooli et al. [26] showed the existence of shared directions (across different data points) along which a decision boundary induced by a network becomes highly positively curved. Along these vulnerable directions, small universal perturbations exist that can fool the network into changing its predictions for the labels of the data points. Our algorithms search for the synthetic perturbations along those directions, whereas the knowledge of the desired directions is borrowed from the available perturbations.
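Under these definitions, the random-walk procedure can be sketched as follows. The step size, tolerance and retry limit are our illustrative choices, not the paper's values:

```python
import numpy as np

def synthetic_perturbation(P, xi_inf, step=0.1, tol=0.2, max_tries=50, rng=None):
    """Sketch of Algorithm 1: random-walk with positive coefficients along the
    unit directions of the known perturbations in P until the l_inf norm hits
    xi_inf; accept the candidate if its l2 norm is comparable to the mean l2
    norm over P (condition (c3))."""
    rng = np.random.default_rng(rng)
    units = [p / np.linalg.norm(p) for p in P]
    target_l2 = np.mean([np.linalg.norm(p) for p in P])
    rho = np.zeros_like(P[0])
    for _ in range(max_tries):
        rho = np.zeros_like(P[0])
        while np.max(np.abs(rho)) < xi_inf:              # walk inside the l_inf ball
            coeffs = rng.uniform(0.0, step, len(units))  # (c1): positive coefficients only
            rho = rho + sum(c * u for c, u in zip(coeffs, units))
        rho = np.clip(rho, -xi_inf, xi_inf)              # (c2): enforce ||rho||_inf <= xi
        if abs(np.linalg.norm(rho) - target_l2) <= tol * target_l2:
            return rho                                   # (c3): acceptable l2 norm
    return rho    # fall back to the last candidate
```

Restricting the walk to positive coefficients keeps the candidate in the positive orthant of the span of the known perturbations, i.e. along the shared vulnerable directions identified in [26].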
Fig. 3 exemplifies the typical synthetic perturbations generated by our algorithms for the ℓ_2 and ℓ_∞ norms. It also shows the corresponding closest matches among the original perturbations. The fooling ratios of the synthetic perturbations are generally not as high as those of the original ones; nevertheless, the values remain in an acceptable range. In our experiments (Section 5), augmenting the training data with the synthetic perturbations consistently helped in early convergence and better performance of the PRN. We note that the acceptable fooling ratios demonstrated by the synthetic perturbations in this work complement the theoretical findings in [26]. Once the set of synthetic perturbations is computed, we combine it with the original perturbations and use the extended set to perturb the images in our training data.
4.3 Perturbation detection
While studying JPG compression as a mechanism to mitigate the effects of the (image-specific) adversarial perturbations, Dziugaite et al. [5] also suggested the Discrete Cosine Transform (DCT) as a possible candidate to reduce the effectiveness of the perturbations. Our experiments, reported in the supplementary material, show that DCT-based compression can also be exploited to reduce the network fooling ratios under the universal adversarial perturbations. However, it becomes difficult to decide on the required compression rate, especially when it is not known whether the image in question is actually perturbed or not. Unnecessary rectification often leads to degraded performance of the networks on clean images.
Instead of using the DCT to remove the perturbations, we exploit it for perturbation detection in our approach. Using training data that contains both clean and perturbed images, we first extract features with a transform F(.) and then learn a binary classifier B(.) on these features, with the data labels denoting the input being 'clean' or 'perturbed'. We implement F(.) to compute the log-absolute values of the 2D-DCT coefficients of the gray-scaled image in its argument, whereas an SVM is learned as B(.). The composition D(.) = B(F(.)) forms the detector component of our defense framework. To classify a test image I, we first evaluate D(I); if a perturbation is detected, then C(R(I)) is evaluated for classification instead of C(I), where C(.) denotes the targeted network classifier and R(.) the rectifier.
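The feature extraction F(.) can be sketched with an explicit orthonormal DCT-II basis. This is a stand-in illustration (simple channel-mean gray-scaling; the SVM training itself is omitted):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)                       # DC row scaling
    return m

def detector_features(image_rgb):
    """F(.): log-absolute 2D-DCT coefficients of the gray-scaled image,
    flattened into the feature vector fed to the binary SVM detector."""
    gray = image_rgb.mean(axis=2)                    # simple gray-scaling stand-in
    h, w = gray.shape
    coeffs = dct_matrix(h) @ gray @ dct_matrix(w).T  # separable 2D-DCT
    return np.log(np.abs(coeffs) + 1e-12).ravel()    # eps avoids log(0)
```

The log-absolute coefficients compress the large dynamic range of the DCT spectrum, making the subtle high-frequency footprint of a perturbation easier for a linear SVM to separate.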
5 Experiments
We evaluated the performance of our technique by defending CaffeNet [16], the VGG-F network [4] and GoogLeNet [37] against universal adversarial perturbations. The choice of the networks is based on the computational feasibility of generating the perturbations for our experimental protocol; the same framework is applicable to other networks. Following Moosavi-Dezfooli et al. [25], we used the ILSVRC 2012 [16] validation set of images to perform the experiments.
Setup: From the available images, we randomly selected samples to generate a set of image-agnostic perturbations for each network, such that one subset of the perturbations was constrained by its ℓ_∞ norm, whereas the remaining perturbations were restricted by their ℓ_2 norm. The fooling ratio of all the perturbations was lower-bounded by 0.8. Moreover, the maximum dot product between any two perturbations of the same type (i.e. ℓ_2 or ℓ_∞) was upper-bounded by 0.15. This ensured that the constructed perturbations were significantly different from each other, thereby removing any potential bias from our evaluation. From each set of perturbations, we randomly selected a subset to be used with the training data, and the remaining perturbations were used with the testing data.
We extended the sets of the training perturbations using the method discussed in Section 4.2, such that there were 250 perturbations in total in each extended set, henceforth denoted as P_∞ and P_2. To generate the training data, we first randomly selected samples from the available images and performed 5 corner crops on each. For creating the adversarial examples with the ℓ_∞-type perturbations, we used the set P_∞ and randomly added perturbations to the images with 0.5 probability. This resulted in equally many clean and perturbed samples, which were used to train the approach for the ℓ_∞-norm perturbations for a given network. We repeated this procedure using the set P_2 to separately train it for the ℓ_2-type perturbations. Note that, for a given targeted network, we performed the training twice to evaluate the performance of our technique for both types of perturbations.
For a thorough evaluation, two protocols were followed to generate the testing data. Both protocols used the unseen images perturbed with the unseen testing perturbations. Notice that the evaluation has been kept doubly-blind to emulate the real-world scenario for a deployed network. For Protocol-A, we used the whole set of test images and randomly corrupted them with the test perturbations with 0.5 probability. For Protocol-B, we chose the subset of the test images that were correctly classified by the targeted network in their clean form, and corrupted that subset with 0.5 probability using the testing perturbations. The existence of both clean and perturbed images with equal probability in our test sets especially ensures a fair evaluation of the detector.
Evaluation metrics: We used four different metrics for a comprehensive analysis of the performance of our technique. Let A_c and A_ρ denote the sets containing the clean and the perturbed test images respectively. Similarly, let B_ρ and B_{ρ/c} be the sets containing the test images rectified by the PRN, such that all the images in B_ρ were perturbed (before passing through the PRN), whereas the images in B_{ρ/c} were perturbed with 0.5 probability, as per our protocol. Let B_D be the set comprising the test images such that each image is rectified by the PRN only if it was classified as perturbed by the detector D(.). Furthermore, let acc(.) be the function computing the prediction accuracy of the targeted network on a given set of images. The metrics used in our experiments are stated below:

PRN-gain (%): the improvement in the targeted network's classification accuracy brought by the PRN rectification of the perturbed images, i.e. the gain of acc(B_ρ) over acc(A_ρ).

PRN-restoration (%): acc(B_{ρ/c}) / acc(A_c) × 100.

Detection rate (%): the accuracy of the detector D(.) in classifying the test images as clean or perturbed.

Defense rate (%): acc(B_D) / acc(A_c) × 100.
The names of the metrics are in accordance with the semantic notions associated with them. Notice that the PRN-restoration is defined over the rectification of both clean and perturbed images. We do this to account for any loss in the classification accuracy of the targeted network incurred by the rectification of clean images by the PRN. It was observed in our experiments that unnecessary rectification of the clean images can sometimes lead to a minor (1-2%) reduction in the classification accuracy of the targeted network. Hence, we used a stricter definition of the restoration by the PRN for a more transparent evaluation. This definition is also in line with our underlying assumption of the practical scenario where we do not know a priori whether a test image is clean or perturbed.
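Under this reading of the metrics (the exact normalization of PRN-gain is our assumption), they can be computed from per-set predictions as follows:

```python
import numpy as np

def _acc(preds, labels):
    """Prediction accuracy against the true labels."""
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

def defense_metrics(labels, pred_clean, pred_perturbed,
                    pred_rect_perturbed, pred_rect_mixed, pred_gated):
    """Illustrative computation of the evaluation metrics from the targeted
    network's predictions on: clean images, perturbed images, PRN-rectified
    perturbed images, the PRN-rectified mixed set, and the detector-gated set."""
    return {
        "prn_gain": 100.0 * (_acc(pred_rect_perturbed, labels)
                             - _acc(pred_perturbed, labels)),
        "prn_restoration": 100.0 * _acc(pred_rect_mixed, labels)
                           / _acc(pred_clean, labels),
        "defense_rate": 100.0 * _acc(pred_gated, labels)
                        / _acc(pred_clean, labels),
    }
```

Normalizing restoration and defense rate by the clean accuracy makes both metrics read directly as "fraction of the original performance retained".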
Table 1: Defense summary for GoogLeNet [37].
Metric  Same test/train perturbation type  Different test/train perturbation type
  ℓ∞-type  ℓ2-type  ℓ∞-type  ℓ2-type
  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B
PRN-gain (%)  77.0  77.1  73.9  74.2  76.4  77.0  72.6  73.4
PRN-restoration (%)  97.0  92.4  95.6  91.3  97.1  92.7  93.8  89.3
Detection rate (%)  94.6  94.6  98.5  98.4  92.4  92.3  81.3  81.2
Defense rate (%)  97.4  94.8  96.4  93.7  97.5  94.9  94.3  91.6
Same/Cross-norm evaluation: In Table 1, we summarize the results of our experiments for defending GoogLeNet [37] against the perturbations. The table summarizes two kinds of experiments. For the first kind, we used the same type of perturbations for testing and training. For instance, we used the ℓ∞-type perturbations for learning the framework components (rectifier + detector) and then also used ℓ∞-type perturbations for testing. The results of these experiments are summarized in the left half of the table. We performed the 'same test/train perturbation type' experiments for both ℓ∞ and ℓ2 perturbations, for both testing protocols (denoted as Prot-A and Prot-B in the table). In the second kind of experiments, we trained our framework on one type of perturbation and tested it on the other. The right half of the table summarizes the results of those experiments. The mentioned perturbation types in the table are those of the testing data. The same conventions are followed in the corresponding tables for the other two targeted networks below. Representative examples visualizing the perturbed and rectified images (by the PRN) are shown in Fig. 4. Please refer to the supplementary material for more illustrations.
From Table 1, we can see that, in general, our framework is able to defend GoogLeNet very successfully against the universal adversarial perturbations that are specifically targeted at this network. Prot-A captures the performance of our framework when an attacker might have added a perturbation to an unseen image without knowing whether the clean image would be correctly classified by the targeted network. Prot-B represents the case where the perturbation is added to fool the network on an image that it had previously classified correctly. Note that the difference in the performance of our framework for Prot-A and Prot-B is related to the accuracy of the targeted network on clean images. For a network that is 100% accurate on clean images, the results under Prot-A and Prot-B would match exactly. The results differ more for the less accurate classifiers, as is also evident from the subsequent tables.
Table 2: Defense summary for CaffeNet [16].
Metric  Same test/train perturbation type  Different test/train perturbation type
  ℓ∞-type  ℓ2-type  ℓ∞-type  ℓ2-type
  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B
PRN-gain (%)  67.2  69.0  78.4  79.1  65.3  66.8  77.3  77.7
PRN-restoration (%)  95.1  89.9  93.6  88.7  92.2  87.1  91.7  85.8
Detection rate (%)  98.1  98.0  97.8  97.9  84.2  84.0  97.9  98.0
Defense rate (%)  96.4  93.6  95.2  92.5  93.6  90.1  93.2  90.0
In Table 2, we summarize the performance of our framework for CaffeNet [16]. Again, the results demonstrate a good defense against the perturbations. The final Defense rate for the ℓ∞-type perturbations under Prot-A is 96.4%. Under the used metric definition and the experimental protocol, one interpretation of this value is as follows. With the defense wrapper provided by our framework, the performance of CaffeNet is expected to be 96.4% of its original performance (in the perfect world of clean images), such that there is an equal chance of every query image being perturbed or clean. (We emphasize that our evaluation protocols and metrics are carefully designed to analyze the performance in real-world situations where it is not known a priori whether the query is perturbed or clean.) Considering that the fooling ratio of the network was at least 0.8 for all the test perturbations used in our experiments, this is a good performance recovery.
Table 3: Defense summary for the VGG-F network [4].
Metric  Same test/train perturbation type  Different test/train perturbation type
  ℓ∞-type  ℓ2-type  ℓ∞-type  ℓ2-type
  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B  Prot-A  Prot-B
PRN-gain (%)  72.1  73.3  84.1  84.3  68.3  69.2  84.7  84.8
PRN-restoration (%)  93.2  86.2  90.3  83.2  88.8  81.2  91.1  83.3
Detection rate (%)  92.5  92.5  98.6  98.6  92.5  92.5  98.1  98.1
Defense rate (%)  95.5  91.4  92.2  87.9  90.0  85.9  93.7  89.1
In Table 3, the defense summary for the VGG-F network [4] is reported, which again shows a decent performance of our framework. Interestingly, for both CaffeNet and VGG-F, the existence of the ℓ2-type perturbations in the test images could be detected very accurately by our detector in the 'different test/train perturbation type' setting. However, this was not the case for GoogLeNet. We found that for the ℓ2-type perturbations (with their ℓ2 norm fixed) the corresponding ℓ∞ norm of the perturbations was generally much lower for GoogLeNet than for CaffeNet and VGG-F on average. This made the detection of the ℓ2-type perturbations more challenging for GoogLeNet. The dissimilarity in these values indicates that there is a significant difference between the decision boundaries induced by GoogLeNet and the other two networks, which is governed by the significant architectural differences of the networks.
Cross-architecture generalization: With the above observation, it was anticipated that the cross-network defense performance of our framework would be better for networks with (relatively) similar architectures. This prediction was verified by the results of our experiments in Tables 4 and 5. These tables show the performance for the ℓ∞- and ℓ2-type perturbations respectively, where we used the 'same test/train perturbation type'. The results are reported for Protocol-A; for the corresponding results under Protocol-B, we refer to the supplementary material. From these tables, we can conclude that our framework generalizes well across different networks, especially across networks that have (relatively) similar architectures. We conjecture that the cross-network generalization is inherited by our framework from the cross-model generalization of the universal adversarial perturbations. Like our technique, any framework for the defense against these perturbations can be expected to exhibit similar characteristics.
Table 4: Cross-network generalization for the ℓ∞-type perturbations (Protocol-A).
PRN-restoration (%)
  VGG-F  CaffeNet  GoogLeNet
VGG-F [4]  93.2  88.9  81.7
CaffeNet [16]  91.3  95.1  72.0
GoogLeNet [37]  84.7  85.9  97.0
Defense rate (%)
  VGG-F  CaffeNet  GoogLeNet
VGG-F [4]  95.5  91.5  82.4
CaffeNet [16]  94.8  96.2  77.3
GoogLeNet [37]  88.3  87.3  97.4
Table 5: Cross-network generalization for the ℓ2-type perturbations (Protocol-A).
PRN-restoration (%)
  VGG-F  CaffeNet  GoogLeNet
VGG-F [4]  90.3  86.9  74.1
CaffeNet [16]  85.7  93.6  69.3
GoogLeNet [37]  85.9  83.3  95.6
Defense rate (%)
  VGG-F  CaffeNet  GoogLeNet
VGG-F [4]  92.2  88.9  74.8
CaffeNet [16]  93.5  95.2  73.8
GoogLeNet [37]  88.4  85.4  96.4
6 Conclusion
We presented the first dedicated framework for defense against universal adversarial perturbations [25], which not only detects the presence of these perturbations in the images but also rectifies the perturbed images so that the targeted classifier can reliably predict their labels. The proposed framework provides defense to a targeted model without the need to modify it, which makes our technique highly desirable for practical cases. Moreover, to prevent potential counter-counter measures, it provides the flexibility of keeping its 'rectifier' and 'detector' components secret. We implement the 'rectifier' as a Perturbation Rectifying Network (PRN), whereas the 'detector' is implemented as an SVM trained by exploiting the image transformations performed by the PRN. For effective training, we also proposed a method to efficiently compute image-agnostic perturbations synthetically. The efficacy of our framework is demonstrated by a successful defense of CaffeNet [16], the VGG-F network [4] and GoogLeNet [37] against universal adversarial perturbations.
Acknowledgement This research was supported by ARC grant DP160101458. The Titan Xp used for this research was donated by NVIDIA Corporation.
References
 [1] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
 [2] S. Baluja and I. Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
 [3] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
 [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
 [5] G. K. Dziugaite, Z. Ghahramani, and D. M. Roy. A study of the effect of jpg compression on adversarial images. arXiv preprint arXiv:1608.00853, 2016.
 [6] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers’ robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
 [7] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
 [8] V. Fischer, M. C. Kumar, J. H. Metzen, and T. Brox. Adversarial examples for semantic image segmentation. arXiv preprint arXiv:1703.01101, 2017.
 [9] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. 2016.
 [10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 [18] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
 [19] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
 [20] J. Lu, T. Issaranon, and D. Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. arXiv preprint arXiv:1704.00103, 2017.
 [21] J. Lu, H. Sibai, E. Fabry, and D. Forsyth. No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501, 2017.
 [22] Y. Luo, X. Boix, G. Roig, T. Poggio, and Q. Zhao. Foveation-based mechanisms alleviate adversarial examples. arXiv preprint arXiv:1511.06292, 2015.
 [23] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
 [24] J. H. Metzen, M. C. Kumar, T. Brox, and V. Fischer. Universal adversarial perturbations against semantic image segmentation. arXiv preprint arXiv:1704.05712, 2017.
 [25] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In CVPR, 2017.
 [26] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Analysis of universal adversarial perturbations. arXiv preprint arXiv:1705.09554, 2017.
 [27] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

 [28] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
 [29] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
 [30] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
 [31] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer. Deflecting adversarial attacks with pixel deflection. arXiv preprint arXiv:1801.08926, 2018.
 [32] A. Rozsa, E. M. Rudd, and T. E. Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
 [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [34] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.

 [35] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016.
 [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 [39] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [40] P. Tabacof and E. Valle. Exploring the space of adversarial images. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 426–433. IEEE, 2016.
 [41] T. Tanay and L. Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
 [42] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
 [43] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial examples for semantic segmentation and object detection. arXiv preprint arXiv:1703.08603, 2017.