Code for our recently published attack, FDA: Feature Disruptive Attack. Colab notebook: https://colab.research.google.com/drive/1WhkKCrzFq5b7SNrbLUfdLVo5-WK5mLJh
Though Deep Neural Networks (DNNs) show excellent performance across various computer vision tasks, several works have shown their vulnerability to adversarial samples, i.e., image samples with imperceptible noise engineered to manipulate the network's prediction. Adversarial sample generation methods range from simple to complex optimization techniques. The majority of these methods generate adversaries through optimization objectives that are tied to the pre-softmax or softmax output of the network. In this work, we (i) show the drawbacks of such attacks, (ii) propose two new evaluation metrics, Old Label New Rank (OLNR) and New Label Old Rank (NLOR), to quantify the extent of damage caused by an attack, and (iii) propose a new adversarial attack, FDA: Feature Disruptive Attack, to address the drawbacks of existing attacks. FDA works by generating image perturbations that disrupt features at each layer of the network, causing deep features to become highly corrupted. This allows FDA adversaries to severely reduce the performance of deep networks. We experimentally validate that FDA generates stronger adversaries than other state-of-the-art methods for image classification, even in the presence of various defense measures. More importantly, we show that FDA disrupts feature-representation based tasks even without access to the task-specific network or methodology. Code available at: https://github.com/BardOfCodes/fda
With the advent of deep-learning based algorithms, remarkable progress has been achieved in various computer vision applications. However, a plethora of existing works [9, 50, 8, 39] has clearly established that Deep Neural Networks (DNNs) are susceptible to adversarial samples: input data containing imperceptible noise specifically crafted to manipulate the network’s prediction. Further, Szegedy et al. showed that adversarial samples transfer across models, i.e., adversarial samples generated for one model can adversely affect other unrelated models as well. This transferable nature of adversarial samples further increases the vulnerability of DNNs deployed in the real world. As DNNs become increasingly ubiquitous, especially in decision-critical applications such as autonomous driving, investigating adversarial samples has become paramount.
The majority of existing attacks [50, 13, 33, 16] generate adversarial samples by optimizing objectives that are tied to the pre-softmax or softmax output of the network. The sole objective of these attacks is to generate adversarial samples which are misclassified by the network with very high confidence. While the classification output is changed, it is unclear what happens to the internal deep representations of the network. Hence, we ask a fundamental question:
Do deep features of adversarial samples retain usable clean sample information?
In this work, we demonstrate that deep features of adversarial samples generated using such attacks retain high-level semantic information of the corresponding clean samples. This is due to the fact that these attacks optimize only pre-softmax or softmax score based objectives to generate adversarial samples. We provide evidence for this observation by leveraging feature inversion, where, given a feature representation, we optimize over the input space to construct an approximate pre-image of that representation. Using this ability to visualize deep features, we highlight the retention of clean information in deep features of adversarial samples. The fact that deep features of adversarial samples retain clean sample information has important implications:
Firstly, feature-representation based tasks can still extract useful information from such adversarial samples (as we show in section 5.4). Secondly, these adversarial samples cause the model to either predict a semantically similar class or to retain a comparatively high probability for the original label while predicting a very different class. These observations are captured by the proposed metrics, New Label’s Old Rank (NLOR) and Old Label’s New Rank (OLNR), and statistics such as the fooling rate at rank k.
These implications are major drawbacks of existing attacks which optimize only pre-softmax or softmax score based objectives. Based on these observations, in this work, we seek adversarial samples that can corrupt deep features and inflict severe damage to feature representations. With this motivation, we introduce FDA: Feature Disruptive Attack. FDA generates perturbations with the aim of disrupting features at each layer of the network in a principled manner. This results in corruption of deep features, which in turn degrades the performance of the network. Figure 1 shows the feature inversion from deep features of a clean, a PGD-attacked, and an FDA-attacked sample, highlighting the lack of clean sample information after our proposed attack.
Following are the benefits of our proposed attack: (i) FDA invariably flips the predicted label to highly unrelated classes, while also successfully removing evidence of the clean sample’s predicted label. As we elaborate in section 5, other attacks [50, 30, 13] achieve only one of these objectives. (ii) Unlike existing attacks, FDA disrupts feature-representation based tasks, e.g., caption generation, even without access to the task-specific network or methodology, i.e., it is effective in a gray-box attack setting. (iii) FDA generates stronger adversaries than other state-of-the-art methods for image classification. Even in the presence of various recently proposed defense measures (including adversarial training), our proposed attack consistently outperforms existing attacks.
In summary, the major contributions of this work are:
We demonstrate the drawbacks of existing attacks.
We propose two new evaluation metrics, NLOR and OLNR, to quantify the extent of damage caused by an attack method.
Finally, we successfully attack two feature-based tasks, namely caption generation and style transfer, where current attack methods either fail or exhibit a weaker attack than FDA. A novel “Gray-Box” attack scenario is also presented, in which FDA again exhibits stronger attacking capability.
Attacks: Following the demonstration by Szegedy et al. of the existence of adversarial samples, multiple works [22, 38, 29, 16, 4, 33, 13, 10] have proposed various techniques for generating adversarial samples. In parallel, works such as [36, 56, 6] have explored the existence of adversarial samples for other tasks.
The works closest to our approach are Zhou et al., Sabour et al., and Mopuri et al. Zhou et al. create black-box transferable adversaries by simultaneously optimizing multiple objectives, including a final-layer cross-entropy term. In contrast, we only optimize our formulation of feature disruption (refer section 4.3). Sabour et al. specifically optimize to make a specific layer’s features arbitrarily close to a target image’s features. Our objective is significantly different, entailing disruption at every layer of a DNN, without relying on a ‘target’ image representation. Finally, Mopuri et al. provide a complex optimization setup for crafting UAPs, whereas our method yields image-specific adversaries. We show that a simple adaptation of their method to craft image-specific adversaries yields poor results (refer supplementary material).
Defenses: Goodfellow et al. first showed that including adversarial samples in the training regime increases the robustness of DNNs to adversarial attacks. Following this work, multiple approaches [30, tramèr2018ensemble, 27, 53, 17, 33] have been proposed for adversarial training, addressing important concerns such as gradient masking and label leaking.
Recent works [40, 31, 1, 47, 15, 11] present many alternatives to adversarial training. Crucially, works such as [23, 55, 43] propose defense techniques which can be easily implemented for large-scale datasets such as ImageNet. While Guo et al. propose utilizing input transformations as a defense technique, Xie et al. introduce randomized transformations of the input as a defense.
Feature inversion has a long history in machine learning. Mahendran and Vedaldi proposed an optimization method for feature inversion, combining feature reconstruction with regularization objectives. In contrast, Dosovitskiy and Brox introduce a neural network for imposing image priors on the reconstruction. Recent works such as [46, 20] have followed suit; we refer the reader to the literature for a comprehensive survey.
Feature-based Tasks: DNNs have become the preferred feature extractors over hand-engineered local descriptors like SIFT or HOG [5, 7]. Hence, various tasks such as captioning [52, 57] and image retrieval [12, 24] rely on DNNs for extracting image information. Recently, tasks such as style transfer have been introduced which rely on deep features as well. While works such as [25, 51] propose a learning-based approach, Gatys et al. perform an optimization on selected deep features.
We show that previous attacks create adversarial samples which still provide useful information for feature-based tasks. In contrast, FDA inflicts severe damage to feature-based tasks without any task-specific information or methodology.
We define a classifier $F$, where $x$ is the $d$-dimensional input and $F(x)$ is the $K$-dimensional score vector containing pre-softmax scores for the $K$ different classes. Applying softmax on the output gives the predicted probabilities for the $K$ classes, and $\hat{y} = \arg\max_{k}\, \mathrm{softmax}(F(x))_{k}$ is taken as the predicted label for input $x$. Let $y$ represent the ground-truth label of sample $x$. Now, an adversarial sample $x^{adv}$ can be defined as any input sample such that:

$\arg\max_{k}\, F(x^{adv})_{k} \;\neq\; y, \quad \text{subject to} \quad \|x^{adv} - x\|_{p} \leq \epsilon,$

where $\|x^{adv} - x\|_{p} \leq \epsilon$ acts as an imperceptibility constraint and is typically taken as an $\ell_{2}$ or $\ell_{\infty}$ constraint.
Attacks such as [50, 22, 16, 33] find adversarial samples via different optimization methods, but with the same optimization objective: maximizing the cross-entropy loss $\mathcal{L}_{CE}(F(x^{adv}), y)$ for the adversarial sample $x^{adv}$. The Fast Gradient Sign Method (FGSM) performs a single-step optimization, yielding an adversary:

$x^{adv} \;=\; x + \epsilon \cdot \mathrm{sign}\big(\nabla_{x}\, \mathcal{L}_{CE}(F(x), y)\big).$
On the other hand, PGD and I-FGSM perform a multi-step signed-gradient ascent on this objective. Works such as [16, 4] further integrate momentum and the ADAM optimizer for maximizing the objective.
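For concreteness, below is a minimal PyTorch-style sketch of this attack family: a single step of size eps corresponds to FGSM, while multiple steps with an l-infinity projection correspond to PGD/I-FGSM. Function and variable names are illustrative and are not taken from our released code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=16/255, step_size=2/255, n_iter=10):
    """Untargeted l_inf attack: iterated signed-gradient ascent on the
    cross-entropy loss. n_iter=1 with step_size=eps reduces to FGSM."""
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)           # maximize CE w.r.t. the true label
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()  # signed-gradient ascent step
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)     # project back into the eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)              # keep a valid image
    return x_adv.detach()
```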
Kurakin et al. discovered the phenomenon of label leaking and use the predicted label $\hat{y}$ instead of the ground-truth label $y$. This yields a class of attacks which can be called most-likely attacks, where the loss objective is changed to $\mathcal{L}_{CE}(F(x^{adv}), y_{ML})$ (where $y_{ML}$ represents the class with the maximum predicted probability).
Works such as [27, tramèr2018ensemble] note that the above methods yield adversarial samples which are weak, in the sense of being misclassified into a very similar class (e.g., a hound misclassified as a terrier). They posit that targeted attacks are more meaningful and utilize least-likely attacks, proposing minimization of the loss objective $\mathcal{L}_{CE}(F(x^{adv}), y_{LL})$ (where $y_{LL}$ represents the class with the least predicted probability). We denote the most-likely and least-likely variants of any attack by the suffixes ML and LL.
Carlini and Wagner propose multiple objectives and optimization methods for generating adversaries. Among the proposed objectives, they infer that the strongest is the margin on the pre-softmax scores:

$\mathcal{L}_{CW}(x^{adv}, y) \;=\; \max_{i \neq y} Z(x^{adv})_{i} \;-\; Z(x^{adv})_{y},$

where $Z(x)_{i}$ is short-hand for the pre-softmax score $F(x)_{i}$ of class $i$. For an $\ell_{\infty}$ distance-metric adversary, this objective can be integrated with PGD optimization to yield PGD-CW. The notation introduced in this section is followed throughout the paper.
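A minimal sketch of this margin objective on the pre-softmax scores is given below (illustrative names, batched inputs assumed); ascending it with the PGD update shown earlier yields the PGD-CW attack.

```python
import torch

def cw_margin_loss(logits, y):
    """Untargeted Carlini-Wagner style margin on pre-softmax scores Z(x):
    max_{i != y} Z(x)_i - Z(x)_y. Increasing it pushes some other class above y."""
    true_score = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, y.unsqueeze(1), float('-inf'))   # mask out the true class
    best_other = others.max(dim=1).values
    return (best_other - true_score).mean()
```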
Feature inversion: Feature inversion can be summarized as the problem of finding the sample whose representation most closely matches a given target representation. We use the approach proposed by Mahendran and Vedaldi. Additionally, to improve the inversion, we use Laplacian-pyramid gradient normalization. We provide additional details in the supplementary material.
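A minimal sketch of such an inversion loop is shown below, in the spirit of Mahendran and Vedaldi: an image is optimized so that its representation at a chosen layer matches a target representation, under a total-variation prior. The Laplacian-pyramid gradient normalization mentioned above is omitted for brevity, and all names are illustrative.

```python
import torch

def invert_features(feat_extractor, target_feat, shape=(1, 3, 224, 224),
                    n_iter=500, lr=0.05, tv_weight=1e-4):
    """Reconstruct an image whose representation matches target_feat.
    feat_extractor maps an image tensor to the chosen layer's representation."""
    x = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        feat = feat_extractor(torch.clamp(x, 0.0, 1.0))
        recon = (feat - target_feat).pow(2).mean()                 # feature reconstruction term
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()         # total-variation prior
        (recon + tv_weight * tv).backward()
        opt.step()
    return torch.clamp(x.detach(), 0.0, 1.0)
```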
In this section, we provide qualitative evidence to show that deep features corresponding to adversarial samples generated by existing attacks (i.e., attacks that optimize objectives tied to the softmax or pre-softmax layer of the network) retain high-level semantic information of the corresponding clean samples. We use feature inversion to provide evidence for this observation.
Figure 2 shows the feature inversion for different layers of the VGG-16 architecture trained on the ImageNet dataset, for a clean sample and its corresponding adversarial sample. From Fig. 2, it can be observed that the inversion of adversarial features of the PGD-LL sample is remarkably similar to the inversion of features of the clean sample. Further, in section 5.1, we statistically show the similarity between intermediate feature representations of clean samples and their corresponding adversarial samples generated using different existing attack methods. Finally, in section 5.2 we show that, as a consequence of retaining clean sample information, these adversarial samples cause the model to either predict a semantically similar class or to retain a comparatively high probability for the original label while predicting a very different class. These observations are captured using the proposed metrics, New Label's Old Rank (NLOR) and Old Label's New Rank (OLNR), and statistics such as the fooling rate at rank k.
An attack’s strength is typically measured in terms of the fooling rate, which measures the percentage (%) of images for which the predicted label is changed by the attack. However, looking only at the fooling rate does not present the full picture of the attack. On one hand, attacks such as PGD-ML may flip the label to a semantically similar class; on the other hand, attacks such as PGD-LL may flip the label to a very different class while still retaining a comparatively high probability for the original label. These drawbacks are not captured by the existing evaluation metric, i.e., the fooling rate.
Hence, we propose two new evaluation metrics, New Label's Old Rank (NLOR) and Old Label's New Rank (OLNR). For a given input image, the softmax output of a $K$-way classifier represents the confidence for each of the $K$ classes. We sort these class confidences in descending order (from rank 1 to rank $K$). Consider the prediction of the network before the attack as the old label and after the attack as the new label. Post attack, the rank of the old label will change from 1 to, say, $r_{o}$. This new rank $r_{o}$ of the old label is defined as OLNR (Old Label's New Rank). Further, post attack, the rank of the new label will have changed from, say, $r_{n}$ to 1. This old rank $r_{n}$ of the new label is defined as NLOR (New Label's Old Rank). Hence, a stronger attack should flip the prediction to a label which had a high old rank (yielding a high NLOR), and also reduce the probability of the clean prediction (yielding a high OLNR). These metrics are computed for all misclassified images and the mean value is reported.
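Both metrics can be read off directly from the class confidences before and after the attack. The sketch below (illustrative names) computes them for a single image; in practice the values are averaged over the misclassified images, as stated above.

```python
import numpy as np

def olnr_nlor(clean_probs, adv_probs):
    """OLNR and NLOR for one image, given the K softmax confidences
    before (clean_probs) and after (adv_probs) the attack. Ranks are 1-based."""
    old_label = int(np.argmax(clean_probs))
    new_label = int(np.argmax(adv_probs))
    # rank of a class = 1 + number of classes with strictly higher confidence
    olnr = 1 + int((adv_probs > adv_probs[old_label]).sum())      # old label's new rank
    nlor = 1 + int((clean_probs > clean_probs[new_label]).sum())  # new label's old rank
    return olnr, nlor
```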
We now present Feature Disruptive Attack (FDA), our proposed attack formulation, explicitly designed to generate perturbations that contaminate and corrupt the internal representations of a DNN. The aim of the proposed attack is to generate an image-specific perturbation which, when added to the image, should not only flip the label but also disrupt its internal feature representations at each layer of the DNN. We first note that activations supporting the current prediction have to be lowered, whereas activations which do not support the current prediction have to be strengthened and increased. This can lead to feature representations which, while hiding the true information, contain high activations for features not present in the image. Hence, for a given layer $i$ with output $l_{i}$, our layer objective $\mathcal{L}_{i}$, which we want to increase, is given by:

$\mathcal{L}_{i} \;=\; D\big(\{\, l_{i}^{(j)} : j \notin S_{i} \,\}\big) \;-\; D\big(\{\, l_{i}^{(j)} : j \in S_{i} \,\}\big), \qquad (4)$
where $l_{i}^{(j)}$ represents the $j$-th value of the layer output $l_{i}$, $S_{i}$ represents the set of activations which support the current prediction, and $D$ is a monotonically increasing function of the activations (on the partially ordered set of activation values). We define $D$ as the $\ell_{2}$-norm of its inputs.
Finding the set $S_{i}$ is non-trivial. While all high activations may not support the current prediction, in practice we find this to be a usable approximation. We define the support set as:

$S_{i} \;=\; \{\, j \;:\; l_{i}^{(j)} > C(l_{i}) \,\}, \qquad (5)$
where $C$ is a measure of central tendency. We try various choices of $C$ and overall find the channel-wise mean (mean across channels) to be the most effective formulation. Finally, combining Eq. (4) and (5), our layer objective becomes:

$\mathcal{L}_{i} \;=\; D\big(\{\, l_{i}^{(j)} : l_{i}^{(j)} \leq C(l_{i}) \,\}\big) \;-\; D\big(\{\, l_{i}^{(j)} : l_{i}^{(j)} > C(l_{i}) \,\}\big). \qquad (6)$
We perform this optimization at each non-linearity in the network, and combine the per-layer objectives over all such layers into a single attack objective, which is maximized within the perturbation budget to generate the FDA adversary.
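The sketch below illustrates one way to implement this formulation in PyTorch, assuming the support set is taken as activations above the channel-wise mean, $D$ is the l2-norm, and the per-layer objectives are simply summed before a PGD-style signed-gradient ascent. Hook placement, the summation, and all names are illustrative assumptions and do not reproduce our released implementation.

```python
import torch

def fda_layer_loss(act):
    """Per-layer FDA objective on an activation map of shape (B, C, H, W):
    strengthen activations below the channel-wise mean (non-supporting set)
    and weaken those above it (supporting set S_i), with D as the l2-norm."""
    c = act.mean(dim=1, keepdim=True)           # measure of central tendency C (mean across channels)
    support = (act > c).float()                 # approximate support set S_i
    d_support = (act * support).flatten(1).norm(p=2, dim=1)
    d_rest = (act * (1.0 - support)).flatten(1).norm(p=2, dim=1)
    return (d_rest - d_support).mean()

def fda_attack(model, x, layers, eps=16/255, step_size=2/255, n_iter=10):
    """l_inf PGD-style ascent on the per-layer objectives summed over `layers`
    (e.g., every ReLU module of the network)."""
    acts = []
    hooks = [m.register_forward_hook(lambda _m, _i, out: acts.append(out)) for m in layers]
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        acts.clear()
        x_adv.requires_grad_(True)
        model(x_adv)                                   # populates `acts` via the hooks
        loss = sum(fda_layer_loss(a) for a in acts)    # combined objective across layers
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.clamp(x + torch.clamp(x_adv - x, -eps, eps), 0.0, 1.0)
    for h in hooks:
        h.remove()
    return x_adv.detach()
```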
In this section, we first present a statistical analysis of the features corresponding to adversarial samples generated using existing attacks and the proposed attack. Further, we show the effectiveness of the proposed attack on (i) image recognition in white-box (Sec. 5.2) and black-box settings (shown in the supplementary document), and (ii) feature-representation based tasks (Sec. 5.4), i.e., caption generation and style transfer. We define the optimization budget of an attack by the tuple $(\epsilon, n_{iter}, \epsilon_{iter})$, where $\epsilon$ is the norm limit on the perturbation added to the image, $n_{iter}$ is the number of optimization iterations used by the attack method, and $\epsilon_{iter}$ is the increment in the norm limit of the perturbation at each iteration.
In this section, we present the analysis which fundamentally motivates our attack formulation. We present various experiments which posit that attack formulations tied to pre-softmax based objectives retain clean sample information in deep features, whereas FDA is effective at removing it. For all the following experiments, all attacks are given the same optimization budget, and reported numbers are averaged over the evaluated image samples.
First, we measure the similarity between intermediate feature representations of clean samples and their corresponding adversarial samples generated using different attack methods. Figure 4 shows the average cosine distance between intermediate feature representations of the clean and the corresponding adversarial samples, for various attack methods on the PNASNet architecture. From Fig. 4, it can be observed that for the proposed attack, the feature dissimilarity is much higher than that of the other attacks. This significant difference in cosine distance implies that the contamination of intermediate features is much higher for the proposed attack. We observe a similar trend for other models at different optimization budgets as well (refer supplementary).
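The per-layer dissimilarity reported in Fig. 4 can be computed as the mean cosine distance between clean and adversarial features; a minimal sketch (illustrative names) follows.

```python
import torch

def avg_cosine_distance(clean_feats, adv_feats):
    """Mean cosine distance (1 - cosine similarity) between clean and adversarial
    features, one value per layer, averaged over a batch of images."""
    dists = []
    for fc, fa in zip(clean_feats, adv_feats):   # one entry per layer
        cos = torch.nn.functional.cosine_similarity(fc.flatten(1), fa.flatten(1), dim=1)
        dists.append((1.0 - cos).mean().item())
    return dists
```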
Next, we measure the similarity between features of clean and adversarial samples at the pre-logits layer (i.e., the input to the classification layer) of the network. Apart from cosine distance, we also measure the Normalized Rank Transformation (NRT) distance, which represents the average shift in the rank of the ordered feature statistics. The primary benefit of the NRT-distance measure is its robustness to outliers.
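A minimal sketch of one possible reading of this measure is given below: every pre-logit dimension is rank-transformed in the clean and the adversarial feature vector, ranks are normalized to [0, 1], and the mean absolute shift is reported. The exact formulation used in our experiments may differ in detail.

```python
import numpy as np

def nrt_distance(clean_feat, adv_feat):
    """Normalized Rank Transformation distance between two 1-D feature vectors:
    average (normalized) shift in the rank of each feature dimension."""
    n = len(clean_feat)
    clean_rank = np.argsort(np.argsort(clean_feat)) / (n - 1)   # ranks normalized to [0, 1]
    adv_rank = np.argsort(np.argsort(adv_feat)) / (n - 1)
    return float(np.abs(clean_rank - adv_rank).mean())
```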
Table 1 tabulates the results for the pre-logit output of multiple architectures. It can be observed that our proposed attack is superior to the other methods. Although the pre-logit representations from other attacks appear corrupted, in section 5.4 we show that these representations still provide useful information for feature-based tasks.
ImageNet is one of the most frequently used large-scale datasets for evaluating adversarial attacks. We evaluate our proposed attack on five DNN architectures trained on the ImageNet dataset, including the state-of-the-art PNASNet architecture. We compare FDA against the strongest white-box optimization method (PGD) with different optimization objectives, resulting in the following set of competing attacks: PGD-ML, PGD-LL, and PGD-CW.
We present our evaluation on the ImageNet-compatible dataset introduced in the NIPS 2017 adversarial attacks challenge. To provide a comprehensive analysis of our proposed attack, we present results with different optimization budgets. Note that attacks are compared only when they have the same optimization budget.
Table 2 (top section) presents the evaluation of multiple attack formulations across different DNN architectures, under a fixed optimization budget in the white-box setting. A crucial inference is the partial success of other attacks in terms of NLOR and OLNR: they achieve either a significant NLOR or a significant OLNR, but not both. This is due to their singular objective of either lowering the maximal probability or increasing the probability of the least-likely class. Table 2 also highlights the significant drop in performance of other attacks on deeper networks (PNASNet and Inception-ResNet) due to vanishing gradients.
Next, we present evaluations against defense mechanisms which have been scaled to ImageNet (experiments on defense mechanisms for a smaller dataset, CIFAR-10, are provided in the supplementary document).
Adversarial Training: We test our proposed attack against three adversarial training regimes, namely Simple, Ensemble [tramèr2018ensemble], and Adversarial Logit Pairing based adversarial training. We set the same optimization budget for all the attacks on these adversarially trained models. Table 2 (bottom section) presents the results of our evaluation. Further, to show effectiveness at different optimization budgets, the models are tested with different optimization budgets, as shown in Table 3.
Defense Mechanisms: We also test our attack against the defense mechanisms proposed by Guo et al. and Xie et al. Table 4 shows the fooling rate achieved on Inception-ResNet-v2 in the presence of various defense mechanisms. These results confirm the superiority of our proposed attack in the white-box setting.
Most DNNs involved in real-world applications utilize transfer learning to alleviate problems such as data scarcity and training efficiency. Furthermore, due to the easy accessibility of models trained on the ImageNet dataset, such models have become the preferred starting point for training task-specific models. This presents an interesting scenario, where the attacker may know which model was fine-tuned for the given task, but may not have access to the fine-tuned model itself.
Due to this partial availability of information, such a scenario in essence acts as a “Gray-Box” setup. We hypothesize that in such a scenario, feature-corruption based attacks should be more effective than softmax or pre-softmax based attacks. To test this hypothesis, we attack the caption generator “Show-and-Tell” (SAT), which uses an ImageNet-trained Inception-V3 (IncV3) model as its starting point, using adversaries generated from only the ImageNet-trained IncV3 network. Note that the IncV3 in SAT has been fine-tuned for 2 million steps (albeit with a smaller learning rate).
(Table 5 caption: the right-most column tabulates the metrics when complete white noise is given as input. FDA adversaries generated from Inception-V3 are highly effective at disrupting SAT.)
Table 5 presents the effect of adversarial attacks on caption generation for “Show-and-Tell”. Similar behavior can be expected for more advanced models such as [26, 57]. We clearly see the effectiveness of FDA in this “Gray-Box” scenario, validating the presented hypothesis. Additionally, we note that content-specific metrics such as SPICE are degraded far more strongly. This is because other attacks may change the features only to support a similar yet different object class, whereas FDA aims to completely remove the evidence of the clean sample.
We further show results for attacking SAT in a white-box setup in Table 6. Here, we also compare against Hongge et al., an attack specifically formulated for caption generation. While the prime benefit of Hongge et al. is the ability to perform targeted attacks, we observe that FDA is comparable to it in the untargeted scenario.
Since its introduction, style transfer has been a highly popular application of DNNs, especially in the arts. However, to the best of our knowledge, adversarial attacks on style transfer have not yet been studied.
The earlier method by Gatys et al. proposed an optimization-based approach, which utilizes gradients from trained networks to create an image that retains “content” from one image and “style” from another. We first show that adversaries generated by other methods (PGD etc.) completely retain the structural content of the clean sample, allowing them to be used for style transfer without any loss in quality. In contrast, FDA adversaries corrupt the clean information. Hence, apart from causing misclassification, FDA adversaries also severely damage style transfer. Figure 6 (top) shows an example of style transfer on a clean, a PGD-adversarial, and an FDA-adversarial sample. More importantly, FDA disrupts style transfer without utilizing any task-specific knowledge or methodology.
Johnson et al. introduced a novel approach where a network is trained to perform style transfer in a single forward pass. In such a setup it is infeasible to mount an attack with PGD-like adversaries, as there is no final classification layer to derive loss gradients from. In contrast, with white-box access to the parameters of these networks, FDA adversaries can be generated to disrupt style transfer without any change to its formulation. Figure 6 (bottom) shows qualitative examples of the disruption caused by FDA adversaries to the model proposed by Johnson et al. Style transfer has been applied to videos as well; we provide qualitative results in the supplementary to show that FDA remains highly effective at disrupting stylized videos.
In this work, we establish the retention of clean sample information in adversarial samples generated by attacks that optimize objectives tied to the softmax or pre-softmax layer of the network. This is found to be true even when these samples are misclassified with high confidence. Further, we highlight the weakness of such attacks using the proposed evaluation metrics, OLNR and NLOR. We then propose FDA, an adversarial attack which corrupts the features at each layer of the network. We experimentally validate that FDA generates some of the strongest white-box adversaries. Additionally, we show that features of FDA adversarial samples do not allow extraction of useful information for feature-based tasks such as style transfer and caption generation.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Vulnerability of deep reinforcement learning to policy induction attacks. arXiv preprint arXiv:1701.04143.
International Journal of Pattern Recognition and Artificial Intelligence 28(7), pp. 1460002.
Deep visual-semantic quantization for efficient image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Attacking visual language grounding with adversarial examples: a case study on neural image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV).
DenseCap: fully convolutional localization networks for dense captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261.
Inverting a connectionist network mapping by backpropagation of error. In Proc. of the 8th Annual Conference of the Cognitive Science Society, pp. 859–865.