Adversarial Feature Desensitization

06/08/2020 ∙ by Pouya Bashivan et al. ∙ Montréal Institute of Learning Algorithms

Deep neural networks can now perform many tasks that were once thought to be only feasible for humans. Unfortunately, while reaching impressive performance under standard settings, such networks are known to be susceptible to adversarial attacks – slight but carefully constructed perturbations of the inputs which drastically decrease network performance and reduce trustworthiness. Here we propose to improve network robustness to input perturbations via an adversarial training procedure which we call Adversarial Feature Desensitization (AFD). We augment the normal supervised training with an adversarial game between the embedding network and an additional adversarial decoder which is trained to discriminate between the clean and perturbed inputs from their high-level embeddings. Our theoretical and empirical evidence supports the effectiveness of this approach in learning robust features on the MNIST, CIFAR10, and CIFAR100 datasets – substantially improving the state-of-the-art in robust classification against previously observed adversarial attacks. More importantly, we demonstrate that AFD generalizes better than previous methods, as the learned features maintain their robustness across a large range of perturbations, including perturbations not seen during training. These results indicate that reducing feature sensitivity using adversarial training is a promising approach for ameliorating the problem of adversarial attacks in deep neural networks.







1 Introduction

Recent progress in deep learning has allowed neural network models to achieve near human-level performance across a range of complex tasks He et al. (2016); Mnih et al. (2015); Silver et al. (2017); Vinyals et al. (2019). However, while the number of applications of deep learning is growing fast, these systems are often not very robust. In particular, their vulnerability to adversarial attacks Szegedy et al. has critically diminished public trust in these systems. Adversarial attacks are small but precise perturbations made to the inputs of a system, resulting in high-confidence predictions that diverge critically from human judgement.

It has been shown that many adversarial perturbations, though often small in magnitude, lead to large deviations in the high-level features of deep neural networks Yoon et al. (2019). In addition, previous work demonstrated that adversarial patterns often rely on specific learned features which generalize even on large datasets such as Imagenet Ilyas et al. (2019). However, these features are highly sensitive to input changes, yielding a potential vulnerability that can be exploited by adversarial attacks. While humans can also experience altered perception in response to particular visual patterns (e.g., visual illusions), they are seemingly insensitive to this particular class of perturbations, and often unaware of the subtle image changes resulting from adversarial attacks. This in turn suggests that current artificial neural networks rely on visual features that are still considerably different from those giving rise to perception in primates (and, particularly, in humans), despite many recent studies highlighting their remarkable similarities Yamins et al. (2014); Khaligh-Razavi and Kriegeskorte (2014); Bashivan et al. (2019).

It is therefore reasonable to hypothesize that a deep network may become more robust to such adversarial attacks if the corresponding higher-level representations (features, embeddings) are more robust to input perturbations, similar to those used by our brains. One way to approach the question of feature/embedding robustness is to use a relatively simple classifier (e.g., a linear transformation) that produces predictions based on such an embedding: if the embedding is robust, then the predictions from the simple classifier would consequently be robust too. We use this approach below to develop our technique for improved embedding robustness.

Formally, for a given embedding function $F$ (for example, the activations before the last linear layer in a deep neural network), we define the embedding sensitivity $S$ as the maximum change in the embedding given input perturbations of maximum size $\epsilon$:

$$S(F, x) = \max_{\|\delta\| \le \epsilon} \big\| F(x + \delta) - F(x) \big\| \qquad (1)$$
Following the above reasoning, we wish to discourage the embedding from learning over-sensitive features (large $S$), because a representation devoid of over-sensitive features will only gradually change in response to input variations and will not allow drastic changes in the category judgments derived from it. In practice, however, achieving full embedding robustness under equation (1) might be difficult. On the other hand, from the point of view of category judgements, some embedding sensitivity could still be acceptable as long as it does not lead to drastic changes in the category judgements. That is, given a decoding function $h$:

$$\big\| h(F(x + \delta)) - h(F(x)) \big\| \le K \|\delta\| \qquad (2)$$

which provides a formulation of the problem that is similar to Lipschitz continuity, where we desire the constant $K$ to be small.
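To illustrate, the sensitivity in equation (1) can be lower-bounded empirically by sampling perturbations of bounded $\ell_\infty$ norm and recording the largest embedding displacement. Below is a minimal numpy sketch; the random-sign search and the toy linear embedding are illustrative assumptions, not the procedure used in the paper:

```python
import numpy as np

def embedding_sensitivity(F, x, eps, n_trials=500, seed=0):
    """Monte-Carlo lower bound on max_{||d||_inf <= eps} ||F(x + d) - F(x)||_2."""
    rng = np.random.default_rng(seed)
    z = F(x)
    best = 0.0
    for _ in range(n_trials):
        # sample perturbations on the boundary of the l_inf ball of radius eps
        d = eps * np.sign(rng.standard_normal(x.shape))
        best = max(best, float(np.linalg.norm(F(x + d) - z)))
    return best

# toy stand-in for an embedding network: a fixed random linear map
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))
F = lambda x: W @ x
x = rng.standard_normal(32)
s = embedding_sensitivity(F, x, eps=0.1)
```

For a linear map the estimate is provably bounded by $\epsilon \sum_{ij} |W_{ij}|$, which gives a quick sanity check on the search.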

In principle, drastic changes in category judgements could occur under at least one of the following two conditions: i) when the embedding sensitivity $S$ is relatively small but the class manifolds are close to each other in the embedding space; ii) when class manifolds are reasonably far apart but points in the embedding space can have gradients that are large enough to allow the feature values to essentially teleport from the manifold of one class to another (i.e., when $S$ is large).


Figure 1: Overview of the proposed approach. (Left) Visual comparison of different adversarial robustness methods: adversarial training Madry et al. (2017), TRADES Zhang et al. (2019a), and AFD. The dotted line corresponds to the decision boundary of the adversarial decoder. (Right) Schematic of the AFD paradigm.

Much prior work on adversarial robustness has tackled the robust classification problem by pushing the embeddings of training samples from different categories farther from the decision boundary Madry et al. (2017); Kannan et al. (2018); Zhang et al. (2019a), which has been shown to lead to significant improvements in adversarial robustness against a specific perturbation. For example, in the Adversarial Training procedure Madry et al. (2017), the network is trained to optimize the classification loss on the perturbed inputs, either alone or in combination with clean inputs. Another recent approach, TRADES Zhang et al. (2019a), augments the classification loss on clean inputs with an auxiliary term that matches the labels assigned to clean and perturbed inputs (Figure-1). However, neither of these methods addresses the potential issues that may arise under the second condition mentioned above (i.e., when the embedding could exhibit sharp changes in response to small input perturbations that could still lead to crossing the classification boundary). On the other hand, several other works attempted to improve robustness by enhancing the flatness of the classification loss Wu et al. (2019); Qin et al. (2020) and demonstrated that this approach also leads to more robust performance against white-box attacks.

Here, instead of focusing on robust classification, we turned our attention to the robustness of the learned features from which the categories are inferred (e.g., using a simple linear classifier). Ideally, we want the learned embeddings to remain stable in the presence of small adversarial or non-adversarial perturbations. We propose to improve the robustness of network embeddings to adversarial perturbations via an adversarial game between two agents. In this setup, the first adversarial agent (i.e., the attacker) searches for performance-degrading perturbations given the embedding function, while the second agent uses a decoder function to discriminate between the clean and perturbed inputs from their high-level embeddings. The parameters of the embedding and (adversarial) decoding functions are then tuned via an adversarial game between the two (Figure-1). This paradigm is similar to the adversarial learning paradigm widely used in image generation and transformation Goodfellow et al. (2014b); Karras et al. (2019); Zhu et al. (2017), unsupervised and semi-supervised learning Miyato et al. (2018b), video prediction Mathieu et al. (2015); Lee et al. (2018), domain adaptation Ganin and Lempitsky (2015); Tzeng et al. (2017), active learning Sinha et al. (2019), and continual learning Ebrahimi et al. (2020).

In summary, our main contributions are as follows:

  • We introduce a method to improve the robustness of learned features in an embedding network through an adversarial game between the embedding network and a secondary decoding network.

  • We theoretically demonstrate that under some assumptions on the nature of the adversarial attacker, this approach leads to a flat likelihood function in the vicinity of training samples.

  • We empirically confirm that the proposed feature desensitization approach leads to learning a sparse and robust set of high-level features and consequently a more stable classifier.

2 Methods

Let $F: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be an embedding of the input $x$, with $n$-dimensional inputs and $m$-dimensional embeddings, and let $h$ be a linear function that, when applied on the embedding, outputs the likelihood of the $K$ possible labels given the input $x$: $p(y \mid x) = h(F(x))$. Let $\delta$ be a perturbation which operates on input $x$:

$$x' = x + \delta(x)$$

which finds perturbations within the $\epsilon$-neighborhood of input $x$ such that the likelihood of a non-target class is increased:

$$\delta(x) = \arg\max_{\|\delta\| \le \epsilon} \; p(y \neq y_t \mid x + \delta)$$
Given any model, there are at least two conditions under which a perturbation method could drastically increase the likelihood of a non-target class: i) when there are short paths from points on the manifold of one class to another; ii) when there are points on the manifold of one class with large gradients in the direction of another class. Notably, under the second condition, it is possible for the manifolds of two classes to be far from each other while the underlying embedding space is still non-robust.
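As a concrete instance of such a perturbation function, here is a minimal $\ell_\infty$ projected-gradient (PGD-style) sketch in numpy, using a toy linear-softmax model in place of $h \circ F$; the model, step size, and iteration budget are illustrative assumptions:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def pgd_linf(W, x, y, eps, alpha, steps):
    """Maximize the cross-entropy of a linear-softmax model within an eps l_inf ball."""
    onehot = np.eye(W.shape[0])[y]
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - onehot)                  # d(-log p_y)/dx for logits W @ x
        x_adv = x_adv + alpha * np.sign(grad)      # ascend the loss
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project back into the ball
    return x_adv

rng = np.random.default_rng(1)
W = rng.standard_normal((10, 64))
x, y = rng.standard_normal(64), 3
x_adv = pgd_linf(W, x, y, eps=0.1, alpha=0.02, steps=20)
```

The projection step guarantees the returned perturbation stays within the $\epsilon$-neighborhood, while the sign-gradient ascent drives down the likelihood of the true class.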

Ideally we want a network to learn an embedding that remains stable in the presence of any small perturbations while considering both of the above possible situations. Consequently, to learn such an embedding, one needs to not only increase the distance between manifolds of different classes, but also to reduce the magnitude of gradients from points on the manifold of each class in the direction of other classes.

In that regard, several prior studies have explored different approaches to increasing the distance between class manifolds Madry et al. (2017); Zhang et al. (2019a). These approaches lead to a substantial increase in robust performance against adversarial perturbations. However, they commonly suffer from sudden dives in performance with slight increases in the strength of the perturbations, or when exposed to a different class of perturbations Schott et al. (2018); Sitawarin et al. (2020).

We posit that the landscape of gradients in the learned embedding space is equally important for the robustness of the model's performance against adversarial perturbations, and hypothesize that the failure of previous approaches to generalize their robust performance to higher perturbation degrees may be (at least partly) due to the potentially large likelihood gradients that remain unconstrained. We propose an adversarial learning procedure to reduce the sensitivity of the learned embedding with respect to the input. Algorithm 1 summarizes the proposed approach. The training procedure involves three loss functions that are optimized sequentially. First, the parameters of the embedding $F$ and category decoder are tuned to minimize the softmax cross-entropy classification loss. Second, the parameters of the adversarial decoder $D$ are tuned to minimize the cross-entropy loss associated with discriminating natural and perturbed inputs conditioned on the natural labels. Lastly, the parameters of the embedding are adversarially tuned to maximize the cross-entropy from the second step. The adversarial training framework is similar to that used in training conditional GANs, in which the $F$ and $D$ networks play a two-player minimax game with value function $V(D, F)$:

$$\min_F \max_D V(D, F) = \mathbb{E}_{x \sim p_n}\big[-\zeta\big(-D(F(x) \mid y)\big)\big] + \mathbb{E}_{x' \sim p_a}\big[-\zeta\big(D(F(x') \mid y)\big)\big]$$
where $p_n$ and $p_a$ correspond to the natural and perturbed distributions, and $\zeta$ denotes the softplus function. Following the logic presented in Goodfellow et al. (2014b); Chrysos et al. (2019), we reason that the global minimum of the adversarial training criterion is achieved if and only if the embeddings of natural and perturbed images conditioned on the class label are indistinguishable from each other:

$$p_n\big(F(x) \mid y\big) = p_a\big(F(x') \mid y\big)$$
If this is the case, then a Bayes optimal classifier will achieve the same error rate on the perturbed inputs as on the natural inputs. We use this fact below to prove that when the adversarial game is at its global minimum, the partial derivatives of the linear output with respect to the input in the perturbation direction are zero, i.e. the gradient is flat in the direction of the other classes.

Theorem 1.

Given an embedding function $F$, class-likelihood functions $h_k$: $p(y = k \mid x) = h_k(F(x))$, and a perturbation function $x' = x + \delta(x)$ that maximizes the likelihood of a non-target class $k$, where $t$ denotes the target (true) class index, if the adversarial optimization of the embedding and discriminator functions $F$ and $D$ converges to the global minimum of the training criterion, then $\frac{\partial h_k(F(x))}{\partial \delta} = 0$.


Proof. Assume $h = \{h_k\}_{k=1}^{K}$ is a set of differentiable functions that implement the Bayes optimal classifier from the embedding.

Assuming that the adversarial training of $F$ and $D$ converges to the global minimum, we have:

$$p_n\big(F(x) \mid y\big) = p_a\big(F(x') \mid y\big) \qquad (8)$$

Following from Bayes rule we have:

$$p\big(y \mid F(x)\big) = \frac{p\big(F(x) \mid y\big)\, p(y)}{p\big(F(x)\big)}$$

From equation 8, the marginal distributions $p_n\big(F(x)\big)$ and $p_a\big(F(x')\big)$ should be equal, which leads to:

$$p_n\big(y \mid F(x)\big) = p_a\big(y \mid F(x')\big) \;\;\Rightarrow\;\; h_k\big(F(x)\big) = h_k\big(F(x + \delta)\big)$$

which can only be true if $\frac{\partial h_k(F(x))}{\partial \delta} = 0$. ∎

Input: Perturbation $\delta$, batch size $m$, optimizer $O$, encoding network $F$, adversarial decoder network $D$, category decoder network $C$, softplus function $\zeta$.
repeat
       Read mini-batch $\{(x_i, y_i)\}_{i=1}^{m}$ and compute perturbed inputs $x'_i = x_i + \delta(x_i)$
       Update $F$ and $C$ with $O$ to minimize the classification loss on $(x_i, y_i)$
       Update $D$ with $O$ to minimize $\sum_i \zeta\big(-D(F(x_i) \mid y_i)\big) + \zeta\big(D(F(x'_i) \mid y_i)\big)$
       Update $F$ with $O$ to maximize the same adversarial loss
until training converged;
Algorithm 1 AFD training procedure
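To make the three sequential updates concrete, the following is a deliberately tiny numpy sketch of one AFD-style iteration, with a linear embedding, a logistic adversarial decoder (label conditioning omitted), hand-derived gradients, and a one-step FGSM stand-in for the inner attacker – all illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
softplus = lambda u: np.logaddexp(0.0, u)          # the zeta function in the text

n_in, n_emb, n_cls, eps, lr = 16, 8, 4, 0.1, 0.05
Wf = rng.standard_normal((n_emb, n_in)) * 0.1      # embedding F
Wc = rng.standard_normal((n_cls, n_emb)) * 0.1     # category decoder C
wd, bd = rng.standard_normal(n_emb) * 0.1, 0.0     # adversarial decoder D

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def afd_step(x, y):
    global Wf, Wc, wd, bd
    onehot = np.eye(n_cls)[y]

    # 1) F and C minimize the classification loss on the clean input
    z = Wf @ x
    ds = softmax(Wc @ z) - onehot                  # d(cross-entropy)/d(logits)
    Wc -= lr * np.outer(ds, z)
    Wf -= lr * np.outer(Wc.T @ ds, x)

    # inner attacker: one-step FGSM stand-in for the perturbation delta
    gx = Wf.T @ (Wc.T @ ds)
    x_adv = x + eps * np.sign(gx)

    # 2) D minimizes zeta(-D(F(x))) + zeta(D(F(x')))
    z_n, z_a = Wf @ x, Wf @ x_adv
    d_n, d_a = wd @ z_n + bd, wd @ z_a + bd
    g_dn, g_da = -sigmoid(-d_n), sigmoid(d_a)      # d(loss)/d(d_n), d(loss)/d(d_a)
    wd -= lr * (g_dn * z_n + g_da * z_a)
    bd -= lr * (g_dn + g_da)

    # 3) F ascends the same loss so clean/perturbed embeddings become indistinguishable
    Wf += lr * (np.outer(g_dn * wd, x) + np.outer(g_da * wd, x_adv))
    return softplus(-d_n) + softplus(d_a)          # current discriminator loss

x, y = rng.standard_normal(n_in), 2
loss = afd_step(x, y)
```

In a real implementation these manual gradients would be handled by an autodiff framework and the decoder would be label-conditioned, but the ordering of the three updates mirrors Algorithm 1.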

While convergence to the global optimum is a strong assumption, in practice it is possible to derive a bound on the classifier's robust error in terms of its error on clean inputs and a divergence measure between the clean and perturbed embeddings (see 8.3 in the appendix).

3 Experiments

3.1 Adversarial perturbations

We used a range of adversarial perturbations in our experiments, using existing implementations in the Foolbox Rauber et al. (2017) and Advertorch Ding et al. (2019) packages. We validated the models against Projected Gradient Descent (PGD) Madry et al. (2017), Fast Gradient Sign Method (FGSM) Goodfellow et al. (2014a), Momentum Iterative Method (MIM) Dong et al. (2018), Decoupled Direction and Norm (DDN) Rony et al. (2019), DeepFool Moosavi-Dezfooli et al. (2016), and C&W Carlini and Wagner (2017) perturbations. For each perturbation, we swept the perturbation strength $\epsilon$ across a wide range and validated the different models on each value. The specific settings used for each perturbation are listed in Table-A3.

Method | Dataset | Network | Clean | PGD (WB) | FGSM (WB) | PGD (BB) | FGSM (BB)
NT | MNIST | LeNet | 98.88 | 0 | 0.45 | 0 | 0.44
AT Madry et al. (2017) | | LeNet | 98.8 | 93.2 | 95.6 | 96.0 | 96.8
TRADES Zhang et al. (2019a) | | LeNet | 99.48 | 96.07 | - | - | -
ATES Sitawarin et al. (2020) | | LeNet | 99.11 | 94.04 | - | - | -
ABS Schott et al. (2018) | | LeNet | 99.0 | 13 | 34 | - | -
NT | | RN5 | 98.81±0.03 | 2.03±0.23 | 9.72±0.01 | 2.08±0.16 | 9.74±0.0
AT† Madry et al. (2017) | | RN5 | 99.15±0.07 | 94.98±0.09 | 97.1±0.11 | 98.96±0.05 | 98.92±0.08
TRADES† Zhang et al. (2019a) | | RN5 | 97.72±0.11 | 89.87±0.87 | 95.18±1.02 | 96.88±0.4 | 95.08±1.49
AFD | | RN5 | 98.49±0.16 | 92.75±0.32 | 95.95±0.46 | 98.11±0.25 | 97.97±0.25
NT | CIFAR10 | RN18 | 95.40 | 0.12 | 47.79 | 12.00 | 54.65
AT Madry et al. (2017) | | RN18 | 87.3 | 45.8 | 56.1 | 86.0 | 85.6
AT† Madry et al. (2017) | | RN18 | 83.58 | 41.05 | 50.12 | 83.20 | 82.88
TRADES Zhang et al. (2019a) | | RN18 | 84.92 | 56.61 | - | - | -
TRADES† Zhang et al. (2019a) | | RN18 | 82.22 | 52.30 | 58.16 | 80.36 | 79.69
ATES Sitawarin et al. (2020) | | WRN-34-10 | 86.84 | 55.06 | - | - | -
RLFAT Song et al. (2020) | | WRN-32-10 | 82.72 | 58.75 | - | - | -
RST* + Wu et al. (2019); Carmon et al. (2019) | | WRN-34-10 | 89.82 | 64.86 | 69.60 | 88.77 | 87.61
LLR Qin et al. (2020) | | WRN-28-8 | 86.83 | 52.99 | - | - | -
AFD | | RN18 | 87.83 | 72.45 | 76.43 | 86.28 | 85.06
NT | CIFAR100 | RN18 | 76.12 | 0.01 | 9.67 | 1.55 | 15.43
AT† Madry et al. (2017) | | RN18 | 55.78 | 20.39 | 25.09 | 53.83 | 53.25
TRADES† Zhang et al. (2019a) | | RN18 | 55.48 | 27.36 | 30.46 | 54.13 | 53.16
RLFAT Song et al. (2020) | | WRN-32-10 | 56.70 | 31.99 | - | - | -
AFD | | RN18 | 62.54 | 49.89 | 51.36 | 58.95 | 56.59
Table 1: Accuracy against different perturbations and methods on the MNIST, CIFAR10, and CIFAR100 datasets. Both PGD and FGSM perturbations were constrained by $\epsilon = 0.3$ for MNIST and $\epsilon = 0.031$ for the CIFAR10 and CIFAR100 datasets. † indicates replicated results using our reimplementation or official code. NT: natural training; AT: adversarial training; AFD: adversarial feature desensitization; WB: white-box attack; BB: black-box attack. Numbers reported with ± denote mean and std values over three independent runs with different random initializations. * RST Carmon et al. (2019) additionally uses 500K unlabeled images during training.

3.2 Adversarial robustness

We validated our approach on learning robust visual embeddings on the MNIST LeCun et al. (1998), CIFAR10, and CIFAR100 Krizhevsky et al. (2009) datasets. We used projected gradient descent with an $\ell_\infty$ constraint to perturb the inputs during training; $\epsilon$ was set to 0.3 and 0.031 for the MNIST and CIFAR datasets respectively. We used the activations before the last linear layer as the high-level embedding produced by the network. In all experiments, the adversarial decoder network ($D$) consisted of three fully connected layers with Leaky ReLU nonlinearities, followed by a projection discriminator layer that incorporated the labels into the adversarial decoder through a dot product operation Miyato and Koyama (2018). The number of hidden units was equal in all layers (200 for MNIST and 512 for CIFAR10). We used spectral normalization Miyato et al. (2018a) on all layers of $D$. Further details of training for each experiment are listed in Table-A2. We used our own re-implementation of the adversarial training (AT) method Madry et al. (2017) and the official code for TRADES Zhang et al. (2019a), and denote these results with † in the tables.
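As an illustration of the label conditioning, a projection discriminator adds the dot product between a learned class embedding and the features to an unconditional logit. Below is a minimal numpy sketch; the layer sizes and the precomputed feature vector are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cls, n_hid = 10, 512

V = rng.standard_normal((n_cls, n_hid)) * 0.01   # class embedding matrix
psi = rng.standard_normal(n_hid) * 0.01          # unconditional output head

def leaky_relu(u, a=0.2):
    return np.where(u > 0, u, a * u)

def proj_discriminator(phi, y):
    """D(z, y) = psi . phi + V[y] . phi, where phi is the feature vector for embedding z."""
    return psi @ phi + V[y] @ phi

z = rng.standard_normal(n_hid)                   # stand-in for the MLP features of F(x)
logit = proj_discriminator(leaky_relu(z), y=3)
```

In the actual model the features `phi` would come from the spectral-normalized MLP rather than a raw vector, but the projection term is what injects the class label into the adversarial decoder.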

Figure 2: Robust classification performance of different methods against various degrees of PGD attack on the different datasets.

Adversarial robustness against observed attacks We first evaluated our approach against the same class and strength of attack that was used during training. Table-1 compares the robust classification performance of our proposed approach against PGD (with settings similar to those used during training) and FGSM attacks. Training LeNet with AFD was unstable, leading to frequent collapses of the adversarial decoder accuracy despite an extensive hyperparameter search. For this reason, we conducted our MNIST experiments using a very shallow ResNet architecture which we call ResNet5. This architecture consisted of only one convolution, one ResNet block, and a fully connected layer, with a total depth of 5 layers (Table-A1). AFD-trained ResNet5 was less robust than AT against PGD-$\ell_\infty$ and FGSM at the default strength. On the other hand, on the CIFAR10 and CIFAR100 datasets, a ResNet18 network trained with AFD performed substantially better than all other methods against both white-box and black-box attacks.

Dataset | Model | PGD-ℓ∞ | PGD-ℓ2 | FGSM | MIM | DDN | DeepFool-ℓ∞ | DeepFool-ℓ2 | C&W
MNIST | NT (RN5) | 0.13 | 0.10 | 0.08 | 0.18 | 0.10 | 0.05 | 0.05 | 0.23
 | AT (RN5) | 0.64 | 0.22 | 0.20 | 0.67 | 0.64 | 0.45 | 0.42 | 0.84
 | TRADES (RN5) | 0.63 | 0.21 | 0.14 | 0.69 | 0.65 | 0.45 | 0.40 | 0.84
 | AFD (RN5) | 0.73 | 0.33 | 0.28 | 0.73 | 0.64 | 0.46 | 0.40 | 0.76
CIFAR10 | NT (RN18) | 0.04 | 0.02 | 0.03 | 0.25 | 0.04 | 0.06 | 0.06 | 0.10
 | AT (RN18) | 0.27 | 0.05 | 0.07 | 0.31 | 0.29 | 0.06 | 0.18 | 0.26
 | TRADES (RN18) | 0.34 | 0.06 | 0.08 | 0.33 | 0.36 | 0.06 | 0.25 | 0.32
 | AFD (RN18) | 0.71 | 0.32 | 0.53 | 0.74 | 0.73 | 0.34 | 0.15 | 0.25
CIFAR100 | NT (RN18) | 0.03 | 0.02 | 0.01 | 0.06 | 0.03 | 0.05 | 0.01 | 0.08
 | AT (RN18) | 0.15 | 0.03 | 0.04 | 0.12 | 0.15 | 0.04 | 0.09 | 0.14
 | TRADES (RN18) | 0.19 | 0.04 | 0.05 | 0.15 | 0.18 | 0.04 | 0.11 | 0.17
 | AFD (RN18) | 0.48 | 0.18 | 0.33 | 0.50 | 0.50 | 0.15 | 0.07 | 0.11
Table 2: AUC measures for different perturbations and methods on the MNIST, CIFAR10, and CIFAR100 datasets. AUC values are normalized to have a maximum allowable value of 1. Evaluations on AT and TRADES were made on networks trained using our reimplemented or official code.

Robust classification against stronger and unseen attacks We also tested the network robustness against higher degrees of the same attack used during training as well as to a suite of other (unobserved) attacks. We found that the AFD-trained networks continued to perform relatively well against white-box attacks even for very large perturbations – while performance of other methods went down to zero relatively quickly (Figures-2,A2,A3,A4). The AFD-trained network also performed remarkably well against most other perturbation methods that were not observed during training. To compare different models considering both attack types and perturbation strength, we computed the area-under-the-curve (AUC) for a range of epsilons for each attack and each approach. Table-2 summarizes these values for our approach and two alternative approaches (adversarial training and TRADES).
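The AUC comparison can be reproduced with a trapezoidal integral of the accuracy-versus-$\epsilon$ curve, normalized so that a perfectly robust model scores 1. A small numpy sketch with made-up accuracy values:

```python
import numpy as np

def robustness_auc(eps, acc):
    """Normalized area under the accuracy-vs-epsilon curve (1.0 = perfectly robust)."""
    eps, acc = np.asarray(eps, float), np.asarray(acc, float)
    area = np.sum(0.5 * (acc[1:] + acc[:-1]) * np.diff(eps))  # trapezoidal rule
    return area / (eps[-1] - eps[0])  # maximal area corresponds to acc = 1 everywhere

eps = [0.0, 0.01, 0.02, 0.04, 0.08]
auc_flat = robustness_auc(eps, [1.0, 1.0, 1.0, 1.0, 1.0])    # perfectly robust model
auc_drop = robustness_auc(eps, [1.0, 0.9, 0.5, 0.2, 0.0])    # accuracy collapses with eps
```

The epsilon grid here is arbitrary; in the paper each attack uses its own sweep (Table-A3), so AUC values are comparable within a column rather than across attacks.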

Embedding stability We compared the robustness of the learned embeddings derived from training the same architecture using different methods. For that, we measured the normalized sensitivity of the embeddings in each network as $\|F(x') - F(x)\|_2 / \|F(x)\|_2$. For all three datasets we found that the AFD-trained network learns high-level features that are more robust against input perturbations, as measured by this normalized $\ell_2$ distance between clean and perturbed embeddings (Figures-A1,A6,A7,A8).

Learning a sparse embedding We compared the embedding dimensionality obtained by applying different methods to the same architecture using two measures: i) the number of non-zero units over the test set within each dataset, and ii) the number of PCA dimensions that explain more than 99% of the variance in the embedding computed over the test set of each dataset. We found that the same network architecture, when trained with the AFD method, gave rise to a much sparser and lower-dimensional embedding space (Table-A5). The embedding spaces learned with AFD on the MNIST, CIFAR10, and CIFAR100 datasets had only 4, 7, and 88 principal components respectively.
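The second measure, the number of principal components explaining 99% of the embedding variance, can be computed from an SVD of the centered embedding matrix. A numpy sketch with synthetic data standing in for the test-set embeddings:

```python
import numpy as np

def n_components_99(Z, thresh=0.99):
    """Smallest number of PCA dimensions explaining `thresh` of the variance of Z (samples x features)."""
    Zc = Z - Z.mean(axis=0)
    s = np.linalg.svd(Zc, compute_uv=False)        # singular values, descending
    ratio = np.cumsum(s**2) / np.sum(s**2)         # cumulative explained-variance ratio
    return int(np.searchsorted(ratio, thresh) + 1)

rng = np.random.default_rng(0)
# a 512-d "embedding" that actually lives on a 7-dimensional subspace
Z = np.zeros((1000, 512))
Z[:, :7] = rng.standard_normal((1000, 7))
```

On such rank-7 data the function recovers 7 components, mirroring how a sparse embedding like AFD's on CIFAR10 would be detected.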

Figure 3: Logarithm of the average gradient magnitudes of class likelihoods with respect to the input, evaluated at samples within the test set of each dataset. For each matrix, rows correspond to target (true) labels and columns correspond to non-target labels.

Gradient landscape To empirically validate our theorem, we computed the average gradient of each predicted class likelihood with respect to the input across samples within the test set of each dataset. We found that, on all datasets, the magnitudes of gradients in the direction of most non-target classes were much smaller for the AFD-trained network than for the other tested methods (Figure-3). This confirms that AFD stabilizes the embedding in a way that significantly reduces the gradients towards most non-target classes.
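For a linear-softmax readout the class-likelihood gradients have a closed form, so a matrix of the kind shown in Figure 3 can be sketched as follows; the model and data here are toy stand-ins, not the trained networks:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def grad_matrix(M, X, Y, n_cls):
    """G[t, k] = average ||d p_k / d x||_2 over samples with true label t,
    for a linear-softmax model p = softmax(M @ x)."""
    G, counts = np.zeros((n_cls, n_cls)), np.zeros(n_cls)
    for x, t in zip(X, Y):
        p = softmax(M @ x)
        J = p[:, None] * (M - p @ M)   # softmax Jacobian chain rule: J[k] = d p_k / d x
        G[t] += np.linalg.norm(J, axis=1)
        counts[t] += 1
    return G / counts[:, None]

rng = np.random.default_rng(0)
n_cls, n_in = 4, 16
M = rng.standard_normal((n_cls, n_in))
X, Y = rng.standard_normal((100, n_in)), rng.integers(0, n_cls, 100)
G = grad_matrix(M, X, Y, n_cls)
```

In the paper the gradients flow through the full network via backpropagation; the averaging and the row/column layout (rows: true labels, columns: predicted classes) are the same.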

Matching vs. indiscrimination We also ran additional experiments on the MNIST dataset in which we added a regularization term to the classification loss to minimize the $\ell_2$ distance between the clean and perturbed embeddings. We observed that although this augmented loss improved the network's robustness against different white-box attacks, it generalized only weakly to higher attack strengths and to other unseen perturbations (Figure-A5). This result suggests that enforcing a distributional form of feature desensitization (as in AFD) may lead to robust behavior over a larger range of perturbations compared to the case where feature stability is directly enforced through a norm-based measure.

Non-obfuscated gradients Recent literature has pointed out that many defense methods against adversarial perturbations can drive the network into a regime called obfuscated gradients, in which the network appears to be robust against common iterative adversarial attacks but can easily be broken using black-box or alternative attacks that do not rely on exact gradients Papernot et al. (2017); Athalye et al. (2018). We believe that our results are not due to obfuscated gradients, for several reasons: i) for most perturbations, the model performance continues to decrease with increased epsilon (Figures-A2,A3,A4); ii) iterative perturbations were consistently more successful than single-step ones (Table-1); iii) black-box attacks were significantly less successful than white-box attacks (Table-1); iv) the AFD-trained model performed similarly to or better than alternative methods against the Boundary attack Brendel et al. (2018) – an attack which does not rely on network gradients (Table-A4).

4 Related Work

There is an extensive literature on mitigating susceptibility to adversarial perturbations. Adversarial training Madry et al. (2017) is one of the earliest successful attempts to improve the robustness of learned representations to potential input perturbations by solving a "saddle point" problem composed of an inner adversarial maximization and an outer minimization. A number of other works suggest additional losses instead of direct training on the perturbed inputs. TRADES Zhang et al. (2019a) adds a regularization term to the cross-entropy loss which penalizes the network for assigning different labels to natural images and their corresponding perturbed images. Qin et al. (2020) proposed an additional regularization term (a local linearity regularizer) that encourages the classification loss to behave linearly around the training examples. Wu et al. (2019) proposed to regularize the flatness of the loss to improve adversarial robustness.

Our work is closely related to the domain adaptation literature, in which adversarial optimization has recently gained much attention Ganin and Lempitsky (2015); Liu et al. (2019); Tzeng et al. (2017). From this viewpoint one could consider the clean and perturbed inputs as two distinct domains for which a network aims to learn an invariant feature set, although in our setting: i) the perturbed domain continuously evolves while the parameters of the embedding network are tuned; ii) unlike the usual setting in domain-adaptation problems, here we have access to the labels associated with samples from the perturbed (target) domain. Relatedly, Song et al. (2019) regularized the network to have similar logit values in response to clean and perturbed inputs and showed that this additional term leads to better robust generalization to unseen perturbations. Similarly, Adversarial Logit Pairing Kannan et al. (2018) increases robustness by directly matching the logits for clean and adversarial inputs.

Another line of work develops certified defenses, methods with provable bounds within which the network is certified to operate robustly Zhang et al. (2019b); Zhai et al. (2020); Cohen et al. (2019). While these approaches provide a sense of guarantee about the proposed defenses, they are usually prohibitively expensive to train, drastically reduce the performance of the network on natural images, and yield low empirical robustness against standard attacks.

5 Discussion

We proposed a method to decrease the sensitivity of learned embeddings in neural networks using adversarial optimization. Decreasing the input-sensitivity of features has long been desired in training neural networks Drucker and Le Cun (1992) and has been suggested as a way to improve adversarial robustness Ros and Doshi-Velez (2018). Our results show that AFD can be used to reduce the sensitivity of network features to input perturbations and to improve robustness against a family of adversarial attacks. We believe that these results could be improved further by i) using larger neural networks, such as wider or deeper networks, as shown in recent work Xie and Yuille (2020); Madry et al. (2017); ii) applying the adversarial learning paradigm to multiple feature layers of the network; iii) combining AFD with other methods such as adversarial training Madry et al. (2017) or TRADES Zhang et al. (2019a).

6 Broader Impact

As the application of deep neural networks becomes more common in everyday life, the security and dependability of these networks become more crucial. While these networks excel at performing many complicated tasks under standard settings, they are often criticized for their lack of reliability under broader settings. One of the main points of criticism of today's artificial neural networks is their vulnerability to adversarial patterns – slight but carefully constructed perturbations of the inputs which drastically decrease network performance.

Our work proposes a new way of addressing this important issue. Our approach can be used to improve the robustness of learned representations in an artificial neural network and, as shown, leads to recognition behavior that is more aligned with human judgement. More broadly, the ability to learn robust representations and behaviors is highly desirable in a wide range of applications and disciplines, including perception, control, and reasoning, and we expect the presented work to influence future studies in these areas.

7 Acknowledgements

We would like to thank Isabela Albuquerque, Joao Monteiro, Alexia Jolicoeur-Martineau, and Mojtaba Faramarzi for their valuable comments on the manuscript. Pouya Bashivan is partially supported by the Unifying AI and Neuroscience – Québec (UNIQUE) Postdoctoral fellowship.


  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. 35th International Conference on Machine Learning, ICML 2018, 1, pp. 436–448. External Links: 1802.00420, ISBN 9781510867963. Cited by: §3.2.
  • P. Bashivan, K. Kar, and J. J. DiCarlo (2019) Neural Population Control via Deep Image Synthesis. Science 364 (6439). External Links: Document, ISSN 1095-9203, Link Cited by: §1.
  • S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79 (1-2), pp. 151–175. External Links: Document, ISSN 15730565 Cited by: §8.3, §8.3.
  • W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR, pp. 1–12. External Links: arXiv:1712.04248v2 Cited by: §3.2, Table A4.
  • N. Carlini and D. Wagner (2017) Towards Evaluating the Robustness of Neural Networks. Proceedings - IEEE Symposium on Security and Privacy, pp. 39–57. External Links: Document, 1608.04644, ISBN 9781509055326, ISSN 10816011 Cited by: §3.1.
  • Y. Carmon, A. Raghunathan, L. Schmidt, J. C. Duchi, and P. S. Liang (2019) Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pp. 11190–11201. Cited by: Table 1.
  • G. G. Chrysos, J. Kossaifi, and S. Zafeiriou (2019) Robust conditional generative adversarial networks. ICLR, pp. 1–27. External Links: arXiv:1805.08657v2 Cited by: §2.
  • J. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. 36th International Conference on Machine Learning, ICML 2019 2019-June, pp. 2323–2356. External Links: 1902.02918, ISBN 9781510886988 Cited by: §4.
  • G. W. Ding, L. Wang, and X. Jin (2019) AdverTorch v0.1: an adversarial robustness toolbox based on PyTorch. arXiv preprint arXiv:1902.07623. Cited by: §3.1, §8.2.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 9185–9193. External Links: Document, 1710.06081, ISBN 9781538664209, ISSN 10636919. Cited by: §3.1.
  • H. Drucker and Y. Le Cun (1992) Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks 3 (6), pp. 991–997. Cited by: §5.
  • S. Ebrahimi, F. Meier, R. Calandra, T. Darrell, and M. Rohrbach (2020) Adversarial continual learning. arXiv preprint arXiv:2003.09553. Cited by: §1.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. 32nd International Conference on Machine Learning, ICML 2015 2 (i), pp. 1180–1189. External Links: 1409.7495, ISBN 9781510810587 Cited by: §1, §4.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014a) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014b) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §8.1, §8.1.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136. Cited by: §1.
  • H. Kannan, A. Kurakin, and I. Goodfellow (2018) Adversarial logit pairing. arXiv preprint arXiv:1803.06373. Cited by: §1, §4.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  • S. M. Khaligh-Razavi and N. Kriegeskorte (2014) Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology 10 (11). External Links: Document, ISSN 15537358 Cited by: §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §3.2.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.2.
  • A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523. Cited by: §1.
  • H. Liu, M. Long, J. Wang, and M. Jordan (2019) Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers. International Conference on Machine Learning, pp. 4013–4022. External Links: Link Cited by: §4.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: Figure A6, Figure A7, Figure A8.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: Figure 1, §1, §2, §3.1, §3.2, Table 1, §4, §5, Figure A6, Figure A7, Figure A8.
  • M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §1.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018a) Spectral normalization for generative adversarial networks. ICLR. Cited by: §3.2.
  • T. Miyato and M. Koyama (2018) CGANs with projection discriminator. arXiv preprint arXiv:1802.05637. Cited by: §3.2.
  • T. Miyato, S. I. Maeda, S. Ishii, and M. Koyama (2018b) Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16. External Links: Document, arXiv:1704.03976v2, ISSN 19393539 Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • S. M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-Decem, pp. 2574–2582. External Links: Document, arXiv:1511.04599v3, ISBN 9781467388504, ISSN 10636919 Cited by: §3.1.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. ASIA CCS 2017 - Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, pp. 506–519. External Links: Document, 1602.02697, ISBN 9781450349444 Cited by: §3.2.
  • C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli (2020) Adversarial Robustness Through Local Linearization. (NeurIPS), pp. 1–10. External Links: 2003.02460, Link Cited by: §1, Table 1, §4.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning, External Links: Link Cited by: §3.1, §8.2.
  • J. Rony, L. G. Hafemann, L. S. Oliveira, I. Ben Ayed, R. Sabourin, and E. Granger (2019) Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, pp. 4317–4325. External Links: Document, 1811.09600, ISBN 9781728132938, ISSN 10636919 Cited by: §3.1.
  • A. S. Ros and F. Doshi-Velez (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 1660–1669. External Links: 1711.09404, ISBN 9781577358008 Cited by: §5.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2018) Towards the first adversarially robust neural network model on mnist. arXiv preprint arXiv:1805.09190. Cited by: §2, Table 1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: §1.
  • S. Sinha, S. Ebrahimi, and T. Darrell (2019) Variational adversarial active learning. Proceedings of the IEEE International Conference on Computer Vision 2019-Octob, pp. 5971–5980. External Links: Document, 1904.00370, ISBN 9781728148038, ISSN 15505499 Cited by: §1.
  • C. Sitawarin, S. Chakraborty, and D. Wagner (2020) Improving adversarial robustness through progressive hardening. arXiv preprint arXiv:2003.09347. Cited by: §2, Table 1.
  • C. Song, K. He, L. Wang, and J. E. Hopcroft (2019) Improving the generalization of adversarial training with domain adaptation. In ICLR, pp. 1–14. External Links: arXiv:1810.00740v7 Cited by: §4.
  • C. Song, H. Kun, L. Jiadong, J. E. Hopcroft, and L. Wang (2020) Robust local features for improving the generalization of adversarial training. In ICLR, pp. 1–12. External Links: arXiv:1909.10147v5 Cited by: Table 1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. pp. 1–10. External Links: arXiv:1312.6199v4 Cited by: §1.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January, pp. 2962–2971. External Links: Document, 1702.05464, ISBN 9781538604571 Cited by: §1, §4.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • D. Wu, Y. Wang, and S. Xia (2019) Revisiting loss landscape for adversarial robustness. ICML. External Links: arXiv:2004.05884v1 Cited by: §1, Table 1, §4.
  • C. Xie and A. Yuille (2020) Intriguing properties of adversarial training at scale. In ICLR, pp. 1–14. Cited by: §5.
  • D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex.. Proceedings of the National Academy of Sciences of the United States of America 111 (23), pp. 8619–24. External Links: Document, ISSN 1091-6490, Link Cited by: §1.
  • J. Yoon, K. Kim, and J. Jang (2019) Propagated perturbation of adversarial attack for well-known CNNs: Empirical study and its explanation. Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, pp. 4226–4234. External Links: Document, 1909.09263, ISBN 9781728150239 Cited by: §1.
  • R. Zhai, C. Dan, D. He, H. Zhang, B. Gong, P. Ravikumar, C. Hsieh, and L. Wang (2020) MACER: attack-free and scalable robust training via maximizing certified radius. arXiv preprint arXiv:2001.02378. Cited by: §4.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019a) Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573. Cited by: Figure 1, §1, §2, §3.2, Table 1, §4, §5, Figure A6, Figure A7, Figure A8.
  • H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning, and C. Hsieh (2019b) Towards Stable and Efficient Training of Verifiably Robust Neural Networks. pp. 1–25. External Links: 1906.06316, Link Cited by: §4.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §1.

8 Appendix

8.1 Network architectures

MNIST: We ran our experiments on a shallow ResNet architecture [16] which we call ResNet5. The ResNet5 architecture consists of one convolutional layer with stride 2 and 32 filters, a batch normalization layer, and a ReLU nonlinearity, followed by a ResNet block (v1) with stride 2 and 64 filters, a global average pool, and a ReLU fully connected layer with 200 units (Table A1). Activations before the last linear layer were used as the high-level network embedding. The objective function was optimized using SGD with momentum 0.9.

Input (1x28x28)
Conv2D (32, stride=2)
BatchNorm2D (32)
ResNet Block (64, stride=2)
Global Average Pool (7x7)
FC (200)
FC (10)
Table A1: ResNet5 architecture.
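The ResNet5 layout in Table A1 can be sketched in PyTorch. This is our illustrative reconstruction from the table, not the authors' released code; the block internals (two 3×3 convolutions with a projection shortcut) follow the standard ResNet v1 recipe and are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlockV1(nn.Module):
    """ResNet v1 basic block: two 3x3 convs, with a 1x1 projection
    shortcut when the stride or channel count changes."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))


class ResNet5(nn.Module):
    """Shallow ResNet for MNIST following Table A1:
    conv(32, s2) -> BN/ReLU -> ResNet block (64, s2) ->
    global average pool -> FC(200)+ReLU -> FC(10).
    The 200-d FC activations serve as the high-level embedding."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3, stride=2, padding=1, bias=False)  # 28 -> 14
        self.bn = nn.BatchNorm2d(32)
        self.block = BasicBlockV1(32, 64, stride=2)                       # 14 -> 7
        self.fc_embed = nn.Linear(64, 200)
        self.fc_out = nn.Linear(200, num_classes)

    def forward(self, x):
        h = F.relu(self.bn(self.conv(x)))
        h = self.block(h)
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)  # global average pool over 7x7
        embedding = F.relu(self.fc_embed(h))
        return self.fc_out(embedding)
```

In AFD, the `embedding` tensor (rather than the logits) is what the adversarial decoder would discriminate on.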

CIFAR10 and CIFAR100: We trained the ResNet18 architecture [16] using the SGD optimizer with momentum 0.8, the learning rates and weight decay listed in Table A2, and a batch size of 64, for 850 epochs. All learning rates were reduced by a factor of 10 after epochs 350 and 700.

Dataset | Model | Weight decay | Learning rates | Batch size | Num. epochs
MNIST | ResNet5 | 0.5 | 0.1, 0.1 | 50 | 300
CIFAR-10 | ResNet18 | 0.5 | 0.1, 0.1 | 64 | 850
CIFAR-100 | ResNet18 | 0.5 | 0.1, 0.1 | 64 | 850
Table A2: Training hyperparameters for each dataset and network.
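The step schedule described above (learning rates divided by 10 after epochs 350 and 700) can be written as a small helper; the base rate of 0.1 below is only an example value.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(350, 700), gamma=0.1):
    """Step learning-rate schedule: multiply the base rate by `gamma`
    once for every milestone the current epoch has passed."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops
```

In PyTorch the same schedule is typically obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[350, 700], gamma=0.1)`.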

8.2 Adversarial attacks

We used a range of adversarial attacks in our experiments. The hyperparameters associated with each attack are listed in Table A3. Implementations of these attacks were adapted from the Foolbox [35] and AdverTorch [9] packages.

Attack | Dataset | Steps | ε values | Step size | Toolbox
FGSM | MNIST | 1 | [0, 0.2, 0.3, 0.5, 0.8] | – | Foolbox
FGSM | CIFAR | 1 | [0, , , , , , ] | – | Foolbox
PGD- | MNIST | 50 | [0, 10, 50, 100, 200, 400] | 0.025 | Foolbox
PGD- | MNIST | 50 | [0, 2, 5, 10, 20] | 0.025 | Foolbox
PGD- | MNIST | 40 | [0.0, 0.1, 0.3, 0.5, 0.8, 1] | 0.033 | Foolbox
PGD- | CIFAR | 20 | [0, , , , , ] | | Foolbox
MIM | MNIST | 40 | [0, 0.1, 0.3, 0.5, 0.8, 1] | – | AdverTorch
MIM | CIFAR | 40 | [0, , , , , ] | – | AdverTorch
DDN | MNIST | 100 | [0, 1, 2, 5, 10] | – | Foolbox
DDN | CIFAR | 100 | [0, 2, 5, 10, 15] | – | Foolbox
DeepFool | MNIST | 50 | [0, 0.01, 0.1, 0.3, 0.5, 1] | – | Foolbox
DeepFool | CIFAR | 50 | [0, , , , , , ] | – | Foolbox
C&W | MNIST | 100 | [0, 1, 2, 5] | 0.05 | Foolbox
Table A3: Attack hyperparameters for each dataset and attack.
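As an illustration of the simplest attack in the table, FGSM takes a single step of size ε along the sign of the input gradient of the loss and clips back to the valid pixel range. The toy linear model and squared loss below are ours, chosen so the gradient is available in closed form; they are for demonstration only.

```python
import numpy as np


def fgsm(x, grad, eps):
    """One FGSM step: perturb each input dimension by eps in the direction
    that increases the loss, then clip to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)


# Toy example: squared loss of a linear model, whose input gradient is
# available in closed form: d/dx 0.5*(w.x - y)^2 = (w.x - y) * w.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=16)  # a tiny "image" with pixels in (0, 1)
w = rng.normal(size=16)
y = 1.0
grad = (w @ x - y) * w
x_adv = fgsm(x, grad, eps=0.1)
```

In practice the gradient comes from backpropagation through the network; multi-step variants such as PGD repeat this update with a projection back into the ε-ball.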

8.3 Bound on classifier’s robust error

Considering the embedding distributions in response to clean and perturbed inputs (of a particular class) as two distinct input domains, it is straightforward to apply results from the domain adaptation literature to bound the classifier’s robust error (i.e., its error under the perturbed scenario). In particular, we can directly adapt Theorem 2 of [3].

Let $\mathcal{D}$ and $\mathcal{D}'$ be the distributions of embeddings in response to clean and perturbed inputs of a particular class, respectively. Let $S$ and $S'$ be samples of size $m$ each, drawn from $\mathcal{D}$ and $\mathcal{D}'$. Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ (over the choice of the samples), for every $h \in \mathcal{H}$:

$$\epsilon'(h) \;\leq\; \epsilon(h) \;+\; \tfrac{1}{2}\,\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(S, S') \;+\; 4\sqrt{\frac{2d\log(2m) + \log(2/\delta)}{m}} \;+\; \lambda,$$

where $\epsilon(h)$ and $\epsilon'(h)$ are the errors on clean and perturbed inputs respectively, $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ is the empirical $\mathcal{H}\Delta\mathcal{H}$-divergence [3], and $\lambda$ is the combined error of the ideal joint hypothesis $h^{*} = \arg\min_{h \in \mathcal{H}} \big(\epsilon(h) + \epsilon'(h)\big)$, i.e., $\lambda = \epsilon(h^{*}) + \epsilon'(h^{*})$.
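To get a feel for the sample-complexity term in this bound, the helper below evaluates 4·sqrt((2d·log(2m) + log(2/δ)) / m); the numbers in the usage note are illustrative only.

```python
import math


def complexity_term(d, m, delta):
    """The VC / sample-size term of the domain-adaptation bound:
    4 * sqrt((2*d*log(2m) + log(2/delta)) / m)."""
    return 4.0 * math.sqrt((2 * d * math.log(2 * m) + math.log(2 / delta)) / m)
```

The term shrinks roughly as O(sqrt(d·log(m)/m)), so for large sample sizes the gap between clean and robust error is dominated by the empirical divergence and λ, which is exactly what AFD's adversarial game aims to minimize.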

Dataset | Model | Method | Boundary attack
MNIST | RN5 | AT | 98
MNIST | RN5 | AFD | 95
CIFAR10 | RN18 | NT | 12
CIFAR10 | RN18 | AT | 66
CIFAR10 | RN18 | AFD | 74
CIFAR100 | RN18 | NT | 8
CIFAR100 | RN18 | AT | 43
CIFAR100 | RN18 | AFD | 45
Table A4: Comparison of robust performance against the Boundary attack [4] with 5000 steps on different datasets using various methods. We tested the robust performance of each model on 100 random samples from each dataset's test set.
Dataset | MNIST (RN5) | CIFAR10 (RN18) | CIFAR100 (RN18)
Method | Units / Dims | Units / Dims | Units / Dims
NT | 173 / 13 | 512 / 24 | 512 / 431
AT | 153 / 27 | 512 / 455 | 512 / 481
TRADES | 156 / 52 | 512 / 349 | 512 / 461
AFD | 31 / 4 | 380 / 7 | 490 / 88
Table A5: Dimensionality of the learned embedding space on various datasets using different methods and measures. Units: number of non-zero embedding dimensions over the test set within each dataset. Dims: number of PCA dimensions that account for 99% of the embedding variance across all images within the test set of each dataset.
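Both measures in Table A5 are easy to compute from an (n_samples, n_features) embedding matrix; the sketch below is our reading of the definitions in the caption, not the authors' code.

```python
import numpy as np


def pca_dims(embeddings, variance_frac=0.99):
    """Number of principal components needed to explain `variance_frac`
    of the total variance of the embedding matrix ("Dims" in Table A5)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Singular values of the centered data give per-component variances.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, variance_frac) + 1)


def units(embeddings, tol=0.0):
    """Embedding dimensions that are non-zero for at least one input
    ("Units" in Table A5)."""
    return int((np.abs(embeddings).max(axis=0) > tol).sum())
```

For a ReLU embedding, "Units" counts features that ever activate on the test set, while "Dims" measures the effective linear dimensionality; AFD's much smaller values on both measures indicate a more compact embedding.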
Figure A1: Comparison of normalized embedding sensitivity on the test sets of MNIST (left), CIFAR10 (middle), and CIFAR100 (right) under PGD attack. For each image, we computed the normalized embedding sensitivity; plots show the median sensitivity over the test set of each dataset. Error bars correspond to standard deviation.

Figure A2: Comparison of robust accuracy of different methods against white-box attacks on MNIST dataset with ResNet5 architecture.
Figure A3: Comparison of robust accuracy of different methods against white-box attacks on CIFAR10 dataset with ResNet18 architecture.
Figure A4: Comparison of robust accuracy of different methods against white-box attacks on CIFAR100 dataset with ResNet18 architecture.
Figure A5: Comparison of robust accuracy of AFD and embedding matching against white-box attacks on MNIST dataset with ResNet5 architecture.
Figure A6: Scatter plot of 2-dimensional t-SNE projection [25] of the embedding derived from training the ResNet5 architecture on the MNIST dataset. (top row) t-SNE projection of embeddings of clean images for networks trained with different methods. Each point corresponds to the embedding of one of the images from the MNIST test set. (rows 2 to 5) t-SNE projection of the embedding of the clean and perturbed MNIST test-set images. Columns are sorted from left to right by the strength of the perturbation (the left-most column corresponds to clean images and the right-most column to the highest tested perturbation). Perturbations are generated using PGD attack. NT: naturally trained; AT: adversarially trained [26]; TRADES [52]; AFD: adversarial feature desensitization.
Figure A7: Scatter plot of 2-dimensional t-SNE projection [25] of the embedding derived from training the ResNet18 architecture on the CIFAR10 dataset. (top row) t-SNE projection of embeddings of clean images for networks trained with different methods. Each point corresponds to the embedding of one of the images from the CIFAR10 test set. (rows 2 to 5) t-SNE projection of the embedding of the clean and perturbed CIFAR10 test-set images. Columns are sorted from left to right by the strength of the perturbation (the left-most column corresponds to clean images and the right-most column to the highest tested perturbation). NT: naturally trained; AT: adversarially trained [26]; TRADES [52]; AFD: adversarial feature desensitization.
Figure A8: Scatter plot of 2-dimensional t-SNE projection [25] of the embedding derived from training the ResNet18 architecture on the CIFAR100 dataset. (top row) t-SNE projection of embeddings of clean images for networks trained with different methods. Each point corresponds to the embedding of one of the images from the CIFAR100 test set. (rows 2 to 5) t-SNE projection of the embedding of the clean and perturbed CIFAR100 test-set images. Columns are sorted from left to right by the strength of the perturbation (the left-most column corresponds to clean images and the right-most column to the highest tested perturbation). NT: naturally trained; AT: adversarially trained [26]; TRADES [52]; AFD: adversarial feature desensitization.
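Projections like those in Figures A6–A8 can be produced with scikit-learn's t-SNE [25]; the perplexity and random seed below are our choices, not necessarily those used for the figures.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_embeddings(embeddings, seed=0):
    """Project high-dimensional embeddings to 2-D with t-SNE for
    scatter-plot visualization, as in Figures A6-A8."""
    tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=seed)
    return tsne.fit_transform(embeddings)
```

To reproduce a row of the figures, one would stack the embeddings of clean and perturbed test images, project them jointly, and color the points by class label.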