FDA: Feature Disruptive Attack

09/10/2019 ∙ by Aditya Ganeshan, et al. ∙ Indian Institute of Science ∙ Preferred Infrastructure

Though Deep Neural Networks (DNNs) show excellent performance across various computer vision tasks, several works show their vulnerability to adversarial samples, i.e., image samples with imperceptible noise engineered to manipulate the network's prediction. Adversarial sample generation methods range from simple to complex optimization techniques. The majority of these methods generate adversaries through optimization objectives that are tied to the pre-softmax or softmax output of the network. In this work, we (i) show the drawbacks of such attacks, (ii) propose two new evaluation metrics, Old Label New Rank (OLNR) and New Label Old Rank (NLOR), in order to quantify the extent of damage made by an attack, and (iii) propose a new adversarial attack, FDA: Feature Disruptive Attack, to address the drawbacks of existing attacks. FDA works by generating image perturbations that disrupt features at each layer of the network and cause deep features to be highly corrupted. This allows FDA adversaries to severely reduce the performance of deep networks. We experimentally validate that FDA generates stronger adversaries than other state-of-the-art methods for image classification, even in the presence of various defense measures. More importantly, we show that FDA disrupts feature-representation based tasks even without access to the task-specific network or methodology. Code available at: https://github.com/BardOfCodes/fda

1 Introduction

With the advent of deep-learning based algorithms, remarkable progress has been achieved in various computer vision applications. However, a plethora of existing works [9, 50, 8, 39] have clearly established that Deep Neural Networks (DNNs) are susceptible to adversarial samples: input data containing imperceptible noise specifically crafted to manipulate the network's prediction. Further, Szegedy et al. [50] showed that adversarial samples transfer across models, i.e., adversarial samples generated for one model can adversely affect other unrelated models as well. This transferable nature of adversarial samples further increases the vulnerability of DNNs deployed in the real world. As DNNs become more and more ubiquitous, especially in decision-critical applications such as autonomous driving [2], the necessity of investigating adversarial samples has become paramount.

Figure 1: Using feature inversion [34], we visualize the representations of Inception-V3 [49]. The inversion of the PGD-attacked sample (d) is remarkably similar to the inversion of the clean sample (c). In contrast, the inversion of the FDA-attacked sample (b) completely obfuscates the clean sample's information.

The majority of existing attacks [50, 13, 33, 16] generate adversarial samples via optimizing objectives that are tied to the pre-softmax or softmax output of the network. The sole objective of these attacks is to generate adversarial samples which are misclassified by the network with very high confidence. While the classification output is changed, it is unclear what happens to the internal deep representations of the network. Hence, we ask a fundamental question:

Do deep features of adversarial samples retain usable clean sample information?

In this work, we demonstrate that deep features of adversarial samples generated using such attacks retain high-level semantic information of the corresponding clean samples. This is due to the fact that these attacks optimize only a pre-softmax or softmax score based objective to generate adversarial samples. We provide evidence for this observation by leveraging feature inversion [35], where, given a feature representation of an image, we optimize to construct its approximate inverse. Using the ability to visualize deep features, we highlight the retention of clean information in the deep features of adversarial samples. The fact that deep features of adversarial samples retain clean sample information has important implications:

  • First, such deep features may still be useful for various feature-driven tasks such as caption generation [52, 57] and style transfer [21, 25, 51].

  • Secondly, these adversarial samples cause the model either to predict a semantically similar class or to retain a high (comparatively) probability for the original label while predicting a very different class. These observations are captured by the proposed metrics, i.e., New Label's Old Rank (NLOR) and Old Label's New Rank (OLNR), and statistics such as the fooling rate at the k-th rank.

These implications are major drawbacks of existing attacks which optimize only pre-softmax or softmax score based objectives. Based on these observations, in this work we seek adversarial samples that corrupt deep features and inflict severe damage on feature representations. With this motivation, we introduce FDA: Feature Disruptive Attack. FDA generates perturbations with the aim of disrupting features at each layer of the network in a principled manner. This results in corruption of deep features, which in turn degrades the performance of the network. Figure 1 shows feature inversions from deep features of a clean, a PGD-attacked [33], and an FDA-attacked sample, highlighting the lack of clean sample information after our proposed attack.

Following are the benefits of our proposed attack: (i) FDA invariably flips the predicted label to highly unrelated classes, while also successfully removing evidence of the clean sample's predicted label. As we elaborate in section 5, other attacks [50, 30, 13] achieve only one of the above objectives. (ii) Unlike existing attacks, FDA disrupts feature-representation based tasks, e.g., caption generation, even without access to the task-specific network or methodology, i.e., it is effective in a gray-box attack setting. (iii) FDA generates stronger adversaries than other state-of-the-art methods for image classification. Even in the presence of various recently proposed defense measures (including adversarial training), our proposed attack consistently outperforms other existing attacks.

In summary, the major contributions of this work are:

  • We demonstrate the drawbacks of existing attacks.

  • We propose two new evaluation metrics i.e., NLOR and OLNR, in order to quantify the extent of damage made by an attack method.

  • We introduce a new attack called FDA, motivated by corrupting features at every layer. We experimentally validate that FDA creates stronger white-box adversaries than other attacks on the ImageNet dataset for state-of-the-art classifiers, even in the presence of various defense mechanisms.

  • Finally, we successfully attack two feature-based tasks, namely caption generation and style transfer, where current attack methods either fail or exhibit weaker attacks than FDA. A novel “Gray-Box” attack scenario is also presented, where FDA again exhibits stronger attacking capability.

2 Related Works

Attacks: Following the demonstration by Szegedy et al. [50] of the existence of adversarial samples, multiple works [22, 38, 29, 16, 4, 33, 13, 10] have proposed various techniques for generating adversarial samples. In parallel, works such as [36, 56, 6] have explored the existence of adversarial samples for other tasks.

The works closest to our approach are Zhou et al. [58], Sabour et al. [44], and Mopuri et al. [41]. Zhou et al. create black-box transferable adversaries by simultaneously optimizing multiple objectives, including a final-layer cross-entropy term. In contrast, we only optimize our formulation of feature disruption (refer section 4.3). Sabour et al. specifically optimize to make a specific layer's features arbitrarily close to a target image's features. Our objective is significantly different, entailing disruption at every layer of a DNN without relying on a 'target' image representation. Finally, Mopuri et al. provide a complex optimization setup for crafting universal adversarial perturbations (UAPs), whereas our method yields image-specific adversaries. We show that a simple adaptation of their method to craft image-specific adversaries yields poor results (refer supplementary material).

Defenses: Goodfellow et al. [22] first showed that including adversarial samples in the training regime increases the robustness of DNNs to adversarial attacks. Following this work, multiple approaches [30, 27, 53, 17, 33], including ensemble adversarial training (Tramèr et al.), have been proposed for adversarial training, addressing important concerns such as gradient masking and label leaking.

Recent works [40, 31, 1, 47, 15, 11] present many alternatives to adversarial training. Crucially, works such as [23, 55, 43] propose defense techniques which can be easily implemented for large-scale datasets such as ImageNet. While Guo et al. [23] propose utilizing input transformations as a defense technique, Xie et al. [55] introduce randomized input transformations as a defense.

Feature Visualization: Feature inversion has a long history in machine learning [54]. Mahendran and Vedaldi [34] proposed an optimization method for feature inversion, combining feature reconstruction with regularization objectives. In contrast, Dosovitskiy and Brox [18] introduce a neural network for imposing image priors on the reconstruction. Recent works such as [46, 20] have followed suit. The reader is referred to [19] for a comprehensive survey.

Feature-based Tasks: DNNs have become the preferred feature extractors over hand-engineered local descriptors like SIFT or HOG [5, 7]. Hence, various tasks such as captioning [52, 57] and image retrieval [12, 24] rely on DNNs for extracting image information. Recently, tasks such as style transfer have been introduced which rely on deep features as well. While works such as [25, 51] propose a learning-based approach, Gatys et al. [21] perform an optimization over selected deep features.

We show that previous attacks create adversarial samples which still provide useful information for feature-based tasks. In contrast, FDA inflicts severe damage to feature-based tasks without any task-specific information or methodology.

3 Preliminaries

We define a classifier $f: x \in \mathbb{R}^d \mapsto f(x) \in \mathbb{R}^C$, where $x$ is the $d$-dimensional input and $f(x)$ is the $C$-dimensional score vector containing pre-softmax scores for the $C$ different classes. Applying softmax on the output gives us the predicted probabilities for the $C$ classes, and $\hat{k}(x) = \arg\max_i \mathrm{softmax}(f(x))_i$ is taken as the predicted label for input $x$. Let $y$ represent the ground-truth label of sample $x$. Now, an adversarial sample $\tilde{x}$ can be defined as any input sample such that:

$\hat{k}(\tilde{x}) \neq \hat{k}(x), \quad \text{subject to} \quad \|\tilde{x} - x\|_p \leq \epsilon, \qquad (1)$

where $\epsilon$ acts as an imperceptibility constraint and is typically considered as an $\ell_2$ or $\ell_\infty$ constraint.

Attacks such as [50, 22, 16, 33] find adversarial samples by different optimization methods, but with the same optimization objective: maximizing the cross-entropy loss $\mathcal{L}_{CE}(f(\tilde{x}), y)$ for the adversarial sample $\tilde{x}$. Fast Gradient Sign Method (FGSM) [22] performs a single-step optimization, yielding an adversary $\tilde{x}$:

$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}_{CE}(f(x), y)\big). \qquad (2)$

On the other hand, PGD [33] and I-FGSM [29] perform a multi-step signed-gradient ascent on this objective. Works such as [16, 4] further integrate momentum and the ADAM optimizer for maximizing the objective.
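For concreteness, the following is a minimal PyTorch-style sketch of this multi-step signed-gradient ascent (I-FGSM / PGD without a random start). The function and parameter names are ours for illustration, and `model` is assumed to return pre-softmax scores.

```python
import torch
import torch.nn.functional as F

def iterative_fgsm(model, x, y, eps=8/255, eps_step=2/255, n_iter=10):
    """Multi-step signed-gradient ascent on the cross-entropy loss
    (I-FGSM / PGD without random start), under an L-infinity constraint."""
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)        # maximize CE w.r.t. label y
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + eps_step * grad.sign()
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)  # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                      # keep a valid image
    return x_adv
```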

Kurakin et al. [30] discovered the phenomenon of label leaking and use the predicted label $\hat{k}(x)$ instead of the ground-truth label $y$. This yields a class of attacks which can be called most-likely attacks, where the loss objective is changed to $\mathcal{L}_{CE}(f(\tilde{x}), k^{ML})$ (where $k^{ML}$ represents the class with the maximum predicted probability).

Works such as [27] and Tramèr et al. note that the above methods yield adversarial samples which are weak, in the sense of being misclassified into a very similar class (e.g., a hound misclassified as a terrier). They posit that targeted attacks are more meaningful, and utilize least-likely attacks, proposing minimization of the loss objective $\mathcal{L}_{CE}(f(\tilde{x}), k^{LL})$ (where $k^{LL}$ represents the class with the least predicted probability). We denote the most-likely and least-likely variants of any attack by the suffixes ML and LL.

Carlini and Wagner [13] propose multiple different objectives and optimization methods for generating adversaries. Among the proposed objectives, they infer that the strongest objective is as follows:

$\mathcal{L}_{CW}(\tilde{x}) = \max\Big( f(\tilde{x})_{\hat{k}(x)} - \max_{i \neq \hat{k}(x)} f(\tilde{x})_i,\; -\kappa \Big), \qquad (3)$

where $f(\tilde{x})_i$ is short-hand for the $i$-th pre-softmax score of $\tilde{x}$, and $\kappa$ is a confidence margin. For an $\ell_\infty$ distance-metric adversary, this objective can be integrated with PGD optimization to yield PGD-CW. The notation introduced in this section is followed throughout the paper.
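A hedged sketch of how this margin objective might be written, assuming the standard CW margin form (the exact form in [13] may differ in details such as the handling of the confidence parameter):

```python
import torch

def cw_margin_loss(logits, orig_label, kappa=0.0):
    """Untargeted CW-style margin on pre-softmax scores: the loss is positive
    while the original class still out-scores every other class."""
    orig_score = logits.gather(1, orig_label.unsqueeze(1)).squeeze(1)
    # Mask out the original class before taking the runner-up maximum.
    masked = logits.clone()
    masked.scatter_(1, orig_label.unsqueeze(1), float('-inf'))
    best_other = masked.max(dim=1).values
    return torch.clamp(orig_score - best_other, min=-kappa)   # per-sample loss
```

Descending the gradient of this loss inside the same projected loop sketched above would yield a PGD-CW style attack.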

Feature inversion: Feature inversion can be summarized as the problem of finding the sample whose representation is the closest match to a given representation [54]. We use the approach proposed by Mahendran and Vedaldi [34]. Additionally, to improve the inversion, we use Laplacian-pyramid gradient normalization. We provide additional information in the supplementary.
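As an illustration of the inversion procedure (not the exact implementation used here), the following is a simplified gradient-based sketch with a plain total-variation prior; the Laplacian-pyramid gradient normalization mentioned above is omitted, and all names are ours.

```python
import torch

def invert_features(feature_fn, target_feat, shape=(1, 3, 224, 224),
                    n_iter=500, lr=0.05, tv_weight=1e-4):
    """Gradient-based feature inversion: optimize an input image so that its
    representation matches a given target representation (cf. [34])."""
    x = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        rec = ((feature_fn(x) - target_feat) ** 2).mean()        # feature reconstruction term
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()        # total-variation image prior
        (rec + tv_weight * tv).backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0, 1)                                        # keep a valid image
    return x.detach()
```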

4 Feature Disruptive Attack

4.1 Drawbacks of existing attacks

In this section, we provide qualitative evidence to show that deep features corresponding to adversarial samples generated by existing attacks (i.e., attacks that optimize objectives tied to the softmax or pre-softmax layer of the network) retain high-level semantic information of their corresponding clean samples. We use feature inversion to provide evidence for this observation.

Figure 2 shows the feature inversion for different layers of the VGG-16 [45] architecture trained on the ImageNet dataset, for a clean sample and its corresponding adversarial sample. From Fig. 2, it can be observed that the inversion of the adversarial features of a PGD-LL sample [33] is remarkably similar to the inversion of the features of the clean sample. Further, in section 5.1, we statistically show the similarity between intermediate feature representations of clean samples and their corresponding adversarial samples generated using different existing attack methods. Finally, in section 5.2 we show that, as a consequence of retaining clean sample information, these adversarial samples cause the model either to predict a semantically similar class or to retain a high (comparatively) probability for the original label while predicting a very different class. These observations are captured by the proposed metrics, i.e., New Label's Old Rank (NLOR) and Old Label's New Rank (OLNR), and statistics such as the fooling rate at the k-th rank.

Figure 2: Feature Inversion: Layer-by-layer Feature Inversion [34] of clean, PGD-LL-adversarial and FDA-adversarial sample. Note the significant removal of clean sample information in later layers of FDA-adversarial sample.

4.2 Proposed evaluation metrics

An attack's strength is typically measured in terms of the fooling rate [37], which measures the percentage (%) of images for which the predicted label is changed due to the attack. However, looking only at the fooling rate does not present the full picture of an attack. On one hand, attacks such as PGD-ML may flip the label to a semantically similar class; on the other hand, attacks such as PGD-LL may flip the label to a very different class while still retaining a high (comparatively) probability for the original label. These drawbacks are not captured by the existing evaluation metric, i.e., the fooling rate.

Hence, we propose two new evaluation metrics, New Label's Old Rank (NLOR) and Old Label's New Rank (OLNR). For a given input image, the softmax output of a $C$-way classifier represents the confidence for each of the $C$ classes. We sort these class confidences in descending order (from rank 1 to rank $C$). Consider the prediction of the network before the attack as the old label and after the attack as the new label. Post attack, the rank of the old label will change from 1 to, say, $r$. This new rank $r$ of the old label is defined as OLNR (Old Label's New Rank). Further, post attack, the rank of the new label will have changed from, say, $r'$ to 1. This old rank $r'$ of the new label is defined as NLOR (New Label's Old Rank). Hence, a stronger attack should flip to a label which had a high (i.e., far from the top) old rank, yielding a high NLOR, and also reduce the probability for the clean prediction, yielding a high OLNR. These metrics are computed for all the mis-classified images and the mean value is reported.
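A minimal NumPy sketch of these two metrics for a single image, following the definitions above (averaging over misclassified images is left to the caller; the function name is ours):

```python
import numpy as np

def olnr_nlor(clean_probs, adv_probs):
    """Compute Old Label's New Rank (OLNR) and New Label's Old Rank (NLOR)
    from the softmax vectors of one image, before and after the attack.
    Ranks are 1-indexed: rank 1 is the most confident class."""
    old_label = int(np.argmax(clean_probs))
    new_label = int(np.argmax(adv_probs))
    # Rank of a class = 1 + number of classes with strictly higher confidence.
    olnr = 1 + int((adv_probs > adv_probs[old_label]).sum())     # old label's rank after attack
    nlor = 1 + int((clean_probs > clean_probs[new_label]).sum()) # new label's rank before attack
    return olnr, nlor
```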

4.3 Proposed attack

We now present Feature Disruptive Attack (FDA), our proposed attack formulation, explicitly designed to generate perturbations that contaminate and corrupt the internal representations of a DNN. The aim of the proposed attack is to generate an image-specific perturbation which, when added to the image, should not only flip the label but also disrupt its inner feature representations at each layer of the DNN. We first note that activations supporting the current prediction have to be lowered, whereas activations which do not support the current prediction have to be strengthened and increased. This can lead to feature representations which, while hiding the true information, contain high activations for features not present in the image. Hence, for a given layer $i$ with activation $h_i(\tilde{x})$, our layer objective $\mathcal{L}_i$, which we want to increase, is given by:

$\mathcal{L}_i(\tilde{x}) = D\big(\{ h_i(\tilde{x})_j : j \notin S_i \}\big) \;-\; D\big(\{ h_i(\tilde{x})_j : j \in S_i \}\big), \qquad (4)$

where $h_i(\tilde{x})_j$ represents the $j$-th value of $h_i(\tilde{x})$, $S_i$ represents the set of activations which support the current prediction, and $D$ is a monotonically increasing function of activations (on the partially ordered set of activation values). We define $D$ as the $\ell_2$-norm of its inputs.

Finding the set $S_i$ is non-trivial. While all high activations may not support the current prediction, in practice we find this to be a usable approximation. We define the support set as:

$S_i = \big\{\, j \;:\; h_i(\tilde{x})_j > C\big(h_i(\tilde{x})\big) \,\big\}, \qquad (5)$

where $C$ is a measure of central tendency. We try various choices of $C$, including the mean and the median; overall, we find the per-location mean across channels to be the most effective formulation. Finally, combining Eq. (4) and (5), our layer objective becomes:

$\mathcal{L}_i(\tilde{x}) = D\big(\{ h_i(\tilde{x})_j : h_i(\tilde{x})_j < C(h_i(\tilde{x})) \}\big) \;-\; D\big(\{ h_i(\tilde{x})_j : h_i(\tilde{x})_j > C(h_i(\tilde{x})) \}\big). \qquad (6)$

We perform this optimization at each non-linearity in the network, and combine the per-layer objectives as follows:

$\mathcal{L}(\tilde{x}) = \sum_{i} \mathcal{L}_i(\tilde{x}), \qquad \tilde{x} = \arg\max_{\|\tilde{x} - x\|_\infty \leq \epsilon} \mathcal{L}(\tilde{x}). \qquad (7)$
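To make the formulation concrete, the following is a simplified PyTorch-style sketch of Eqs. (6) and (7), assuming $D$ is the $\ell_2$-norm and $C$ is the per-location mean across channels. It follows the reconstruction of the notation given above rather than the released implementation (see the code repository for the reference version), and all names are ours.

```python
import torch

def fda_layer_loss(feat):
    """Per-layer objective (cf. Eq. 6), with D = l2-norm and
    C = mean across channels at each spatial location.
    feat: activation tensor of shape (B, C, H, W) at one non-linearity."""
    c = feat.mean(dim=1, keepdim=True)              # central tendency C, per spatial location
    support = (feat > c).float()                    # S: activations supporting the prediction
    d_support = (feat * support).flatten(1).norm(dim=1)         # D over supporting activations
    d_rest = (feat * (1.0 - support)).flatten(1).norm(dim=1)    # D over the remaining activations
    return (d_rest - d_support).mean()              # increase the rest, suppress the support

def fda_total_loss(features):
    """Combine per-layer objectives over all selected non-linearities (cf. Eq. 7)."""
    return sum(fda_layer_loss(f) for f in features)
```

In practice, the per-layer activations would be collected with forward hooks, and the total loss maximized with the same projected signed-gradient ascent used for the attacks in section 3.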

Figure 3 provides a visual overview of the proposed method. In the supplementary document, we provide results for an ablation study of the proposed attack, i.e., different formulations of $C$ such as the median, the inter-quartile mean, etc.

Figure 3: Overview: from the network (a), for each selected feature blob (b) we perform the optimization (d) as explained in equation 6. (c) shows a spatial feature map, where the support set is colored red and the remainder blue.

5 Experiments

In this section, we first present a statistical analysis of the features corresponding to adversarial samples generated using existing attacks and the proposed attack. Further, we show the effectiveness of the proposed attack on (i) image recognition in white-box (Sec. 5.2) and black-box settings (shown in the supplementary document), and (ii) feature-representation based tasks (Sec. 5.4), i.e., caption generation and style transfer. We define the optimization budget of an attack by the tuple $(\epsilon, n_{iter}, \epsilon_{step})$, where $\epsilon$ is the norm limit on the perturbation added to the image, $n_{iter}$ defines the number of optimization iterations used by the attack method, and $\epsilon_{step}$ is the increment in the norm limit of the perturbation at each iteration.

5.1 Statistical analysis of adversarial features

In this section, we present the analysis which fundamentally motivates our attack formulation. We present various experiments which posit that attack formulations tied to pre-softmax based objectives retain clean sample information in deep features, whereas FDA is effective at removing it. For all the following experiments, all attacks have been given the same optimization budget, and reported numbers have been averaged over the evaluated image samples.

First, we measure the similarity between intermediate feature representations of clean samples and their corresponding adversarial samples generated using different attack methods. Figure 4 shows the average cosine distance between intermediate feature representations of clean samples and their corresponding adversarial samples, for various attack methods on the PNASNet [32] architecture. From Fig. 4 it can be observed that, for the proposed attack, the feature dissimilarity is much higher than that of the other attacks. The significant difference in cosine distance implies that the contamination of intermediate features is much higher for the proposed attack. We observe a similar trend for other models at different optimization budgets as well (refer supplementary).

Figure 4: Cosine distance between features of a clean image and its corresponding adversarial sample, at different layers of the PNASNet [32] architecture.

Now, we measure the similarity between features of clean and adversarial samples at the pre-logits layer (i.e., the input to the classification layer) of the network. Apart from the cosine distance, we also measure the Normalized Rank Transformation (NRT) distance. The NRT distance represents the average shift in the rank of the feature values (their ordered statistics) between the clean and adversarial representations. The primary benefit of the NRT-distance measure is its robustness to outliers.
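A sketch of one plausible reading of this measure (the exact normalization used in the paper may differ; the function name is ours):

```python
import numpy as np
from scipy.stats import rankdata

def nrt_distance(clean_feat, adv_feat):
    """Normalized-Rank-Transformation distance between two pre-logit vectors:
    rank-transform each vector, take the mean absolute shift in rank, and
    scale by the feature dimension (scaling choice is an assumption)."""
    r_clean = rankdata(clean_feat)
    r_adv = rankdata(adv_feat)
    return 100.0 * np.abs(r_clean - r_adv).mean() / len(clean_feat)
```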

Table 1 tabulates the results for the pre-logits output for multiple architectures. It can be observed that our proposed attack shows superiority over other methods. Although the pre-logits representations from other attacks seem to be corrupted, in section 5.4 we show that these pre-logits representations still provide useful information for feature-based tasks.

PGD-ML PGD-CW PGD-LL Ours
Res-152 Cosine Dist. 0.49 0.37 0.60 0.81
NRT Dist. 15.00 13.56 16.29 19.17
Inc-V3 Cosine Dist. 0.51 0.41 0.49 0.55
NRT Dist. 16.11 14.97 17.38 19.01
Table 1: Metrics for measuring the dissimilarity between adversarial pre-logits and clean pre-logits on different networks. Our method FDA exhibits stronger dissimilarity.
Metrics Fooling Rate NLOR OLNR
PGD-ML PGD-CW PGD-LL Ours PGD-ML PGD-CW PGD-LL Ours PGD-ML PGD-CW PGD-LL Ours
VGG-16 99.90 99.90 93.80 97.80 57.26 6.17 539.92 433.33 308.34 29.19 217.98 455.26
ResNet-152 99.50 99.60 88.15 97.69 20.62 5.12 593.64 412.52 247.22 21.84 89.58 380.04
Inc-V3 99.20 99.10 89.06 99.80 61.73 21.95 599.49 549.57 524.65 63.86 92.45 669.31
IncRes-V2 94.18 94.58 74.30 99.60 75.43 44.51 314.20 492.95 314.14 44.46 67.02 487.76
PNasNet-Large 92.60 92.40 81.40 99.00 123.93 59.44 319.18 473.54 335.63 70.67 118.73 512.21
Inc-V3 97.89 97.69 80.62 99.70 68.03 34.56 346.59 545.89 281.75 39.08 77.80 629.93
Inc-V3 98.69 97.49 88.76 100.00 114.96 68.76 450.66 533.49 386.16 106.58 142.65 634.55
IncRes-V2 91.27 89.66 61.65 99.70 81.80 39.68 284.36 504.51 234.66 33.20 67.27 571.46
IncRes-V2 98.69 97.49 88.76 100.00 114.96 68.76 450.66 533.49 386.16 106.58 142.65 634.55
Table 2: Evaluation of various attacks on networks trained on the ImageNet dataset, in the white-box setting. Top: comparison on normally trained architectures. Bottom: comparison on adversarially trained models, evaluated under a separate optimization budget (budget tuples are defined in section 5). The salient feature of our attack is high performance on all metrics at the same time.

5.2 Attack on Image Recognition

ImageNet [42] is one of the most frequently used large-scale datasets for evaluating adversarial attacks. We evaluate our proposed attack on five DNN architectures trained on the ImageNet dataset, including the state-of-the-art PNASNet [32] architecture. We compare FDA to the strongest white-box optimization method (PGD) with different optimization objectives, resulting in the following set of competing attacks: PGD-ML, PGD-LL, and PGD-CW.

PGD-ML PGD-CW PGD-LL Ours
Fooling Rate 85.04 87.15 51.10 80.02
NLOR 22.28 10.83 20.60 119.41
OLNR 77.55 11.14 14.90 81.73
PGD-ML PGD-CW PGD-LL Ours
Fooling Rate 96.99 98.29 64.56 94.28
NLOR 41.51 12.26 77.40 259.78
OLNR 302.03 14.97 25.66 241.43
Table 3: Evaluation on the ALP [27] adversarially trained model, with two different optimization budgets.

We present our evaluation on the ImageNet-compatible dataset introduced in the NIPS 2017 adversarial attacks challenge. To provide a comprehensive analysis of our proposed attack, we present results with different optimization budgets. Note that attacks are compared only when they have the same optimization budget.

The top section of Table 2 presents the evaluation of multiple attack formulations across different DNN architectures in the white-box setting. A crucial inference is the partial success of other attacks in terms of NLOR and OLNR: they achieve either a significant NLOR or a significant OLNR, but not both. This is due to their singular objective of either lowering the maximal probability or increasing the probability of the least-likely class. Table 2 also highlights the significant drop in performance of other attacks for deeper networks (PNASNet [32] and Inception-ResNet [48]) due to vanishing gradients.

Figure 5: Fooling rate at the k-th rank for various attacks in the white-box setting, with the optimization budget as defined in section 5. Attacks are performed on networks trained on the ImageNet dataset. Column-1: VGG-16, Column-2: ResNet-152, Column-3: Inception-V3, and Column-4: PNASNet-Large.

In Figure 5, we present the Generalizable Fooling Rate [41] with respect to Top-$k$ accuracy as a function of $k$. The significantly higher Generalizable Fooling Rate at high values of $k$ further establishes the superiority of our proposed attack on networks trained on the ImageNet dataset.
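For reference, a sketch of a fooling rate measured against top-k predictions, in the spirit of the Generalizable Fooling Rate of [41] (the exact definition in [41] may differ; the function name is ours):

```python
import numpy as np

def fooling_rate_at_k(clean_labels, adv_probs, k):
    """Fraction of images whose clean predicted label is absent from the
    top-k predictions after the attack.
    clean_labels: (N,) clean top-1 predictions; adv_probs: (N, C) softmax after attack."""
    topk = np.argsort(adv_probs, axis=1)[:, -k:]          # indices of the k largest scores
    fooled = [clean_labels[i] not in topk[i] for i in range(len(clean_labels))]
    return 100.0 * np.mean(fooled)
```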

5.3 Evaluation against Defense proposals

Defenses Fooling Rate
PGD-ML PGD-CW PGD-LL Ours
Gaussian Filter 81.93 36.95 68.57 92.87
Median Filter 50.40 23.19 38.45 70.88
Bilateral Filter 54.52 19.18 41.47 70.18
Bit Quant. 73.90 40.86 62.05 91.77
JPEG Comp. 79.82 31.83 66.67 96.18
TV Min. 38.96 17.67 27.81 55.72
Quilting 38.35 24.10 30.82 56.63
Randomize [55] 81.93 42.87 68.17 98.19
Table 4: Evaluation of various attacks in the presence of input-transformation based defense measures. While achieving a higher fooling rate, we also achieve higher NLOR and OLNR (refer supplementary).

Now, we present an evaluation against defense mechanisms which have been scaled to ImageNet (experiments on defense mechanisms for a smaller dataset, CIFAR-10 [28], are provided in the supplementary document).

Adversarial Training: We test our proposed attack against three adversarial training regimes, namely simple [30], ensemble (Tramèr et al.), and Adversarial-Logit-Pairing (ALP) [27] based adversarial training. We use the same optimization budget for all the attacks on the adversarially trained models. The bottom section of Table 2 presents the results of our evaluation. Further, to show effectiveness at different optimization budgets, the models are also tested with different optimization budgets, as shown in Table 3.

Defense Mechanisms: We also test our attack against the defense mechanisms proposed by Guo et al. [23] and Xie et al. [55]. Table 4 shows the fooling rate achieved on Inception-ResNet-V2 [48] in the presence of various defense mechanisms. The above results confirm the superiority of our proposed attack in the white-box setting.

5.4 Attacking Feature-Representation based tasks

5.4.1 Caption Generation

Most DNNs involved in real-world applications utilize transfer learning to alleviate problems such as data scarcity and training cost. Furthermore, due to the easy accessibility of models trained on the ImageNet dataset, such models have become the preferred starting point for training task-specific models. This presents an interesting scenario, where the attacker may have knowledge of which model was fine-tuned for the given task, but may not have access to the fine-tuned model.

Due to the partial availability of information, such a scenario in essence acts as a “Gray-Box” setup. We hypothesize that in such a scenario, feature-corruption based attacks should be more effective than softmax or pre-softmax based attacks. To test this hypothesis, we attack the caption generator “Show-and-Tell” (SAT) [52], which utilizes an ImageNet-trained Inception-V3 (IncV3) model as its starting point, using adversaries generated from only the ImageNet-trained IncV3 network. Note that the IncV3 in SAT has been fine-tuned for 2 million steps (albeit with a smaller learning rate).

Metrics No Attack PGD-ML PGD-LL MI-FGSM Ours Noise
CIDEr 103.21 47.95 47.13 49.23 4.90 2.84
BLEU-1 71.61 57.04 55.68 57.18 39.80 37.60
ROUGE 53.61 42.15 41.24 42.65 30.70 29.30
METEOR 25.58 17.507 16.78 17.34 10.02 7.84
SPICE 18.07 9.60 9.45 10.02 2.04 1.00
Table 5: Attacking “Show-and-Tell” (SAT) [52] in a “Gray-Box” setup. The right-most column tabulates the metrics when complete white noise is given as input. FDA adversaries generated from Inception-V3 are highly effective at disrupting SAT.
Figure 6: Attacking style transfer. Top: PGD adversaries provide clean sample information sufficient for effective style transfer, whereas FDA adversaries do not. (d): Generating adversaries for Johnson et al. [25] using FDA, where the PGD formulation fails. The leftmost image presents the style, followed by the clean image and the style-transfer output before and after the adversarial attack by FDA.

Table 5 presents the effect of adversarial attacks on caption generation. We attack “Show-and-Tell” [52]; similar performance can be expected on more advanced models such as [26, 57]. We clearly see the effectiveness of FDA in such a “Gray-Box” scenario, validating the presented hypothesis. Additionally, we note that content-specific metrics such as SPICE [3] are degraded far more. This is due to the fact that other attacks may change the features only to support a similar yet different object class, whereas FDA aims to completely remove the evidence of the clean sample.

We further show results for attacking SAT in a “White-box” setup in Table 6. We also compare against Hongge et al. [14], an attack specifically formulated for caption generation. While the prime benefit of their method is the ability to perform targeted attacks, we observe that we are comparable to it in the untargeted scenario.

5.4.2 Style-transfer

Since its introduction in [21], style transfer has been a highly popular application of DNNs, especially in the arts. However, to the best of our knowledge, adversarial attacks on style transfer have not yet been studied.

Metrics No Attack PGD-ML MI-FGSM [14] Ours Noise
CIDEr 94.90 31.70 31.21 10.80 4.14 2.84
BLEU-1 69.13 51.64 51.36 38.95 39.80 37.60
ROUGE 51.68 38.20 38.20 28.19 31.00 29.30
METEOR 24.29 14.55 14.60 9.75 9.30 7.84
SPICE 17.08 7.30 7.00 3.38 1.68 0.99
Table 6: Attacking SAT [52] in a “White-box” setup. FDA is at par with the task-specific attack [14].

The earlier method by Gatys et al. [21] is an optimization-based approach which utilizes gradients from trained networks to create an image that retains “content” from one image and “style” from another. We first show that adversaries generated by other methods (PGD, etc.) completely retain the structural content of the clean sample, allowing them to be used for style transfer without any loss in quality. In contrast, FDA adversaries corrupt the clean information; hence, apart from causing misclassification, FDA adversaries also severely damage style transfer. Figure 6 (top) shows examples of style transfer on a clean, a PGD-adversarial, and an FDA-adversarial sample. More importantly, FDA disrupts style transfer without utilizing any task-specific knowledge or methodology.

In [25], Johnson et al. introduced a novel approach where a network is trained to perform style transfer in a single forward pass. In such a setup, it is infeasible to mount an attack with PGD-like adversaries, as there is no final classification layer from which to derive loss gradients. In contrast, with white-box access to the parameters of these networks, FDA adversaries can be generated to disrupt style transfer without any change in the attack's formulation. Figure 6 (bottom) shows qualitative examples of the disruption caused by FDA adversaries in the model proposed by Johnson et al. Style transfer has been applied to videos as well; we provide qualitative results in the supplementary showing that FDA remains highly effective in disrupting stylized videos.

6 Conclusion

In this work, we establish the retention of clean sample information in adversarial samples generated by attacks that optimize objectives tied to the softmax or pre-softmax layer of the network. This is found to be true even when these samples are misclassified with high confidence. Further, we highlight the weakness of such attacks using the proposed evaluation metrics: OLNR and NLOR. We then propose FDA, an adversarial attack which corrupts the features at each layer of the network. We experimentally validate that FDA generates one of the strongest white-box adversaries. Additionally, we show that the features of FDA adversarial samples do not allow extraction of useful information for feature-based tasks such as style transfer and caption generation.

References

  • [1] N. Akhtar, J. Liu, and A. Mian (2018-06) Defense against universal adversarial perturbations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [2] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha (2017-06) Deep learning algorithm for autonomous driving using googlenet. In 2017 IEEE Intelligent Vehicles Symposium (IV), Vol. , pp. 89–96. External Links: Document, ISSN Cited by: §1.
  • [3] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In The European Conference on Computer Vision (ECCV), Cited by: §5.4.1.
  • [4] A. Athalye, N. Carlini, and D. Wagner (2018-07) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML, External Links: Link Cited by: §2, §3.
  • [5] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [6] V. Behzadan and A. Munir (2017) Vulnerability of deep reinforcement learning to policy induction attacks. arXiv preprint arXiv:1701.04143. Cited by: §2.
  • [7] A. C. Berg, T. L. Berg, and J. Malik (2005) Shape matching and object recognition using low distortion correspondences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [8] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Cited by: §1.
  • [9] B. Biggio, G. Fumera, and F. Roli (2014) Pattern recognition systems under attack: design issues and research challenges. International Journal of Pattern Recognition and Artificial Intelligence 28 (07), pp. 1460002. Cited by: §1.
  • [10] W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • [11] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • [12] Y. Cao, M. Long, J. Wang, and S. Liu (2017-06) Deep visual-semantic quantization for efficient image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §2.
  • [13] N. Carlini and D. Wagner (2016) Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644. Cited by: §1, §1, §2, §3.
  • [14] H. Chen, H. Zhang, P. Chen, J. Yi, and C. Hsieh (2018) Attacking visual language grounding with adversarial examples: a case study on neural image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, External Links: Link Cited by: §5.4.1, Table 6.
  • [15] G. S. Dhillon, K. Azizzadenesheli, J. D. Bernstein, J. Kossaifi, A. Khanna, Z. C. Lipton, and A. Anandkumar (2018) Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • [16] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018-06) Boosting adversarial attacks with momentum. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3, §3.
  • [17] Y. Dong, H. Su, J. Zhu, and F. Bao (2017) Towards interpretable deep neural networks by leveraging adversarial examples. CoRR abs/1708.05493. External Links: Link, 1708.05493 Cited by: §2.
  • [18] A. Dosovitskiy and T. Brox (2016-06) Inverting visual representations with convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [19] M. Du, N. Liu, and X. Hu (2018-07) Techniques for Interpretable Machine Learning. arXiv preprint arXiv: 1808.00033. Cited by: §2.
  • [20] M. Du, N. Liu, Q. Song, and X. Hu (2018) Towards explanation of dnn-based prediction with guided feature inversion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 1358–1367. External Links: ISBN 978-1-4503-5552-0, Link, Document Cited by: §2.
  • [21] L. A. Gatys, A. S. Ecker, and M. Bethge (2016-06) Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423. External Links: Document, ISSN 1063-6919 Cited by: 1st item, §2, §5.4.2, §5.4.2.
  • [22] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §2, §2, §3, §3.
  • [23] C. Guo, M. Rana, M. Cisse, and L. van der Maaten (2018) Countering adversarial images using input transformations. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2, §5.3.
  • [24] T. Hoang, T. Do, D. Le Tan, and N. Cheung (2017) Selective deep convolutional features for image retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1600–1608. External Links: ISBN 978-1-4503-4906-2, Link, Document Cited by: §2.
  • [25] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), Cited by: 1st item, §2, Figure 6, §5.4.2.
  • [26] J. Johnson, A. Karpathy, and L. Fei-Fei (2016-06) DenseCap: fully convolutional localization networks for dense captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.4.1.
  • [27] H. Kannan, A. Kurakin, and I. J. Goodfellow (2018) Adversarial logit pairing. CoRR abs/1803.06373. External Links: Link, 1803.06373 Cited by: §2, §3, §5.3, Table 3.
  • [28] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §5.3.
  • [29] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §2.
  • [30] A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1, §2, §3, §5.3.
  • [31] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018-06) Defense against adversarial attacks using high-level representation guided denoiser. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [32] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018-09) Progressive neural architecture search. In The European Conference on Computer Vision (ECCV), Cited by: Figure 4, §5.1, §5.2, §5.2.
  • [33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, §1, §2, §2, §3, §3, §4.1.
  • [34] A. Mahendran and A. Vedaldi (2015-06) Understanding deep image representations by inverting them. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §2, §3, Figure 2.
  • [35] A. Mahendran and A. Vedaldi (2016) Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision (IJCV) 120 (3), pp. 233–255. External Links: ISSN 1573-1405, Document Cited by: §1.
  • [36] J. H. Metzen, M. C. Kumar, T. Brox, and V. Fischer (2017) Universal adversarial perturbations against semantic image segmentation. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [37] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.
  • [38] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A simple and accurate method to fool deep neural networks. In IEEE Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [39] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. Cited by: §1.
  • [40] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer (2018-06) Deflecting adversarial attacks with pixel deflection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [41] K. Reddy Mopuri, A. Ganeshan, and R. Venkatesh Babu (2018) Generalizable data-free objective for crafting universal adversarial perturbations. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 0162-8828 Cited by: §2, §5.2.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §5.2.
  • [43] B. S. Vivek, A. Baburaj, and R. Venkatesh Babu (2019-06) Regularizer to mitigate gradient masking effect during single-step adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §2.
  • [44] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet (2015) Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122. Cited by: §2.
  • [45] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [46] A. Singh and A. Namboodiri (2015-11) Laplacian pyramids for deep feature inversion. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Vol. , pp. 286–290. External Links: Document, ISSN 2327-0985 Cited by: §2.
  • [47] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2018) PixelDefend: leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • [48] C. Szegedy, S. Ioffe, and V. Vanhoucke (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261. External Links: Link, 1602.07261 Cited by: §5.2, §5.3.
  • [49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016-06) Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1.
  • [50] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1, §1, §2, §3.
  • [51] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link, Document Cited by: 1st item, §2.
  • [52] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2017-04) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. External Links: Document, ISSN 0162-8828 Cited by: 1st item, §2, §5.4.1, §5.4.1, Table 5, Table 6.
  • [53] B. S. Vivek, K. Reddy Mopuri, and R. Venkatesh Babu (2018-09) Gray-box adversarial training. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [54] R. J. Williams (1986) Inverting a connectionist network mapping by backpropagation of error. Proc. of 8th Annual Conference of the Cognitive Science Society, pp. 859–865. Cited by: §2, §3.
  • [55] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille (2018) Mitigating adversarial effects through randomization. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2, §5.3, Table 4.
  • [56] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille (2017) Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [57] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015-07–09 Jul) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, pp. 2048–2057. External Links: Link Cited by: 1st item, §2, §5.4.1.
  • [58] W. Zhou, X. Hou, Y. Chen, M. Tang, X. Huang, X. Gan, and Y. Yang (2018-09) Transferable adversarial perturbations. In The European Conference on Computer Vision (ECCV), Cited by: §2.