Modern machine learning models achieve high accuracy on tasks such as image classification (He et al., 2015b) and speech recognition (Graves et al., 2013; Xiong et al., 2016). However, they are not robust to small, adversarially chosen perturbations of their inputs. In particular, imperceptible perturbations of the input data can cause state-of-the-art models to misclassify their inputs with high confidence (Szegedy et al., 2014; Biggio et al., 2017; Carlini et al., 2016; Carlini and Wagner, 2018). Researchers have also shown that machine learning models remain vulnerable to adversarial examples even in physical-world scenarios (Kurakin et al., 2016b; Li et al., 2019). The robustness of machine learning models is a major concern, as they are increasingly deployed in applications where safety and security are among the principal issues. The benefits of a robust model go beyond this: they include significantly improved model interpretability (Tsipras et al., 2018) and effective exclusion of brittle features from the learned model (Ilyas et al., 2019). The former can be observed through an input-dependent saliency map, which is usually a variant of the model gradient with respect to the input: this map becomes sparser and semantically more relevant than that of a non-robust model. An example of the latter is the ability of the robust model to avoid “shortcut features” that are logically irrelevant but may be strongly correlated with the class. Learning under adversarial noise helps obscure such features and makes the model rely on alternative aspects of the input that are more robust for label prediction.
The phenomenon of adversarial machine learning has received significant attention in recent years and several methods have been proposed for training a robust classifier on images. However, most of these methods have been shown to be ineffective (Carlini and Wagner, 2016, 2017; Athalye et al., 2018; Athalye and Carlini, 2018; Carlini, 2019), while many others are shown to lack scalability to large networks that are expressive enough to solve problems like ImageNet (Cohen et al., 2019). To the best of our knowledge, only two training algorithms and their variants, which are called “adversarial training” (Madry et al., 2017) and “randomized smoothing” (Cohen et al., 2019), have been confirmed to be both effective and scalable. Nevertheless, adversarial training remains ineffective for large perturbations of the input (Sharma and Chen, 2018). More specifically, the authors observed that the adversarial training does not yield an accuracy better than random guessing for perturbation norm, denoted as epsilon, larger than or equal to in the MNIST dataset. Training in the presence of large perturbations, however, is essential to advance the level of model robustness and its benefits that are gained as a result. For instance, if the model is robust to perturbations of norm less than , then only “shortcut features” that are captured in pixels could be avoided. Otherwise, we would need a larger to take the best advantage of the model robustness.
In addition to the mentioned robust training algorithms, some prior work has studied adversarial machine learning from a more theoretical perspective (Schmidt et al., 2018; Zhai et al., 2019; Gilmer et al., 2018; Wang et al., 2017; Ilyas et al., 2019; Chen et al., 2020; Yin et al., 2018; Cullina et al., 2018; Zhang et al., 2019; Montasser et al., 2019; Diochnos et al., 2019; Attias et al., 2018; Khim and Loh, 2018). However, to our understanding, none of them claims to find the optimal robust classifier, assuming knowledge of the joint data and label distribution, except under a strong assumption on the hypothesis space (Ilyas et al., 2019). Finding such a classifier helps to establish a limit on the perturbation size up to which we could expect a better-than-random classifier, assuming a sufficiently large training set.
In section 3, we demonstrate that weight initialization of a deep neural network plays an important role in the feasibility of adversarial training of the network under a large perturbation size. A natural question that arises is:
How much can we increase the perturbation -norm during the training?
To answer this question, we study the optimal robust classifier, where we have the full knowledge of the input distribution given different classes. In section 5, we first prove that, in general, finding the optimal robust classifier in this setting is NP-hard. Therefore, we focus our attention on some conditional distributions, such as the symmetric isotropic Normal and multi-dimensional Bernoulli distributions, in which finding the optimal robust classifier is tractable. Next, we discuss the limits on the perturbation size under which the optimal robust classifier has a better-than-chance adversarial accuracy.
In addition, in section 4, following (Tsipras et al., 2018) and (Kaur et al., 2019), we show that our models that are trained on larger perturbation sizes have more interpretable saliency maps and attacks, and some other notable visual properties. These results also suggest that using our proposed method to train the model adversarially on a larger perturbation size boosts the benefits that are gained in the robust models.
2 Related Work
Adversarial machine learning has been studied for about two decades (Globerson and Roweis, 2006; Dalvi et al., 2004; Kolcz and Teo, 2009). However, (Szegedy et al., 2013) and (Biggio et al., 2017) could be considered the starting point of significant attention to this field, especially in the context of deep learning. Since then, many ideas have emerged that are intended to make classification robust against adversarial perturbations. However, most of them have later been shown to be ineffective. Gradient masking (Papernot et al., 2016) is an example of the issues that arise from such ideas, and it leads to a false sense of security. In particular, obfuscated gradients (Athalye et al., 2018) could make it impossible for gradient-based adversaries to attack the model. Such defenses are nevertheless easily broken by alternative adversaries, such as non-gradient-based attacks (Chen et al., 2017; Uesato et al., 2018; Ilyas et al., 2018) or black-box attacks (Papernot et al., 2016). This shows that evaluating the robustness of a neural network is a challenging task. In addition to these unsuccessful attempts, many others are not scalable to large networks that are expressive enough for the classification task on ImageNet and sometimes need to assume specific network architectures (Cohen et al., 2019).
“Adversarial training” is among the most established methods in the field, which was introduced in (Goodfellow et al., 2014; Kurakin et al., 2016a) and was completed later by (Madry et al., 2017) through the lens of robust optimization (Ben-Tal et al., 2009). Following this work, there have been numerous studies to improve the adversarial accuracy of adversarial training on the test set by using techniques such as domain adaptation (Song et al., 2018) or label smoothing (Shafahi et al., 2018), while some others have tried to decrease the computational cost of adversarial training (Wong et al., 2020; Shafahi et al., 2019). Researchers have also tried to apply adversarial training on more natural classes of perturbations such as translations and rotations for images (Engstrom et al., 2017), or on the mixture of perturbations (Tramer and Boneh, 2019; Maini et al., 2019). Also, (Ford et al., 2019) showed the relation between adversarial training and Gaussian data augmentation.
3 Proposed Method
Recall that in adversarial training (Madry et al., 2017), we need to solve the adversarial empirical risk minimization problem, which is a saddle point problem, that is formulated as:
where represents the set of feasible adversarial perturbations, and
is the loss function. The inner maximization is referred to as the “adversarial loss”. The setting that we study in this section is when
which is the most common perturbation set and used by (Madry et al., 2017) and is also considered as the standard benchmark in the context of adversarial examples.
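For reference, the adversarial empirical risk minimization problem of Madry et al. (2017) is commonly written in the form below. The notation is an assumption adopted for this sketch, since the formulas themselves are not reproduced in the text above: $\theta$ denotes the network weights, $(x_i, y_i)$ the $n$ training pairs, $L$ the loss function, $\mathcal{S}$ the set of feasible perturbations, and the perturbation set is taken to be an $\ell_\infty$ ball of radius $\epsilon$.

```latex
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \; \max_{\delta \in \mathcal{S}} \; L(\theta, x_i + \delta, y_i),
\qquad
\mathcal{S} = \{ \delta : \|\delta\|_{\infty} \le \epsilon \}.
```

The inner maximization is the adversarial loss for one example; the outer minimization is ordinary empirical risk minimization applied to that worst-case loss.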
To motivate the proposed method, we begin with the result of an experiment. We observe that adversarial training on larger epsilons cannot decrease the adversarial loss sufficiently, even on the training data. The training adversarial loss for different values of epsilon is provided for both MNIST (Fig. 1) and CIFAR10 (Supplementary Fig. 9 in Sec. D). The detailed experiments are provided in A. This observation raises the following question:
Is this an optimization issue in a highly non-convex non-concave min-max game; or is it just that learning is not feasible on large epsilons?
Before answering this question, one should note that optimization in deep learning has not been extensively studied from a theoretical perspective. Specifically, it is not even completely known why one could achieve near-zero training loss in standard training through randomly initialized gradient descent. This is especially surprising given the highly non-convex loss landscape. Yet the gradient descent is guaranteed to converge in this setting. However, we do not even have this convergence guarantee for adversarial training as a gradient descent-ascent algorithm on the highly non-convex non-concave min-max game (Schäfer and Anandkumar, 2019).
Nevertheless, in practice, researchers observed that the trainability of deep models is highly dependent on weight initialization, model architecture, the choice of optimizer, and a variety of other considerations (Li et al., 2017). The effect of different initializations on adversarial training has not been studied rigorously in the literature (Ben Daya et al., 2018; Simon-Gabriel et al., 2018).
Our main contribution is a novel, practical initialization for adversarial training that makes learning on larger perturbations feasible. Specifically, for adversarial training on large perturbations, we claim that the final weights of a network adversarially trained on a smaller epsilon can be used as the initialization. Surprisingly, this method can find a good solution for larger perturbations even with only a few training epochs. Fig. 2 depicts this idea by illustrating a significant decrease in the training loss for large epsilons using the proposed initialization, which was not possible using a random initialization. This is illustrated in various settings of initial and target epsilons for the MNIST and CIFAR10 datasets. The detailed experiments are provided in A.
To gain more insights on the proposed method, we would address these questions:
Does adversarial training converge when we train on larger perturbations?
Why can the low-cost local Nash equilibrium that is found by our method not be reached using random initialization?
Why does the proposed initialization converge quickly, within only a few training epochs?
Towards addressing all these questions, in B, we employ the “loss landscape model”, which gives a geometric intuition about the loss function.
In C, we first evaluate our method rigorously and then we discuss another surprising observation, which is the trainability of deep models on very large perturbations (e.g. on MNIST, where the pixel intensities are scaled between and ), which obviously should not be possible, because the attacker can ideally transform all the pixel intensities to a single level and as a result, we should not be able to do better than random guessing.
In the last section D, we introduce “iterative adversarial training”, that gradually increases the value of epsilon during the training procedure, as a possible alternative to the proposed weight initialization.
A Method Evaluation
To demonstrate the effectiveness of our proposed weight initialization on the adversarial trainability of deep models on larger epsilons, we run several experiments on the MNIST (LeCun and Cortes, 2010) and CIFAR10 (Krizhevsky et al., ) datasets.
a.1 Experimental setup
Base model training: For training the MNIST models, we scale the pixel intensities between and . For the optimization, we use the cross-entropy loss, Adam optimizer (Kingma and Ba, 2014) with learning rate = and batch size = . In standard training, we use only epochs. For adversarial training, we used epochs, signed Projected Gradient Descent (PGD) attack (Madry et al., 2017) with the random start, PGD learning rate = and number of steps = epsilon / PGD learning rate. For CIFAR10, we scale the pixel intensities between and . For the optimization, we also use the cross-entropy loss, SGD optimizer with the learning rate schedule being for the first epochs, for the second , and for the third , all of them with the weight decay = , batch size = and also data augmentation. Specifically, for adversarial training, we train for epochs, based on signed PGD with the random start, learning rate = and steps = epsilon / PGD learning rate. Note that many of these settings are obtained from (Engstrom et al., 2019b).
Proposed method setup, which we call “Extended adversarial training”: For both MNIST and CIFAR10, we trained our models epochs with the new larger epsilons, using exactly the same settings as in the base model training, except that for CIFAR10, we do not use data augmentation.
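The signed PGD attack with random start described in the setup can be sketched as follows. This is a minimal illustration on a binary logistic model rather than the deep networks used in the experiments; the model, the function names, the $\ell_\infty$ perturbation set, and the pixel range [0, 1] are assumptions made for the sketch. Only the rule "number of steps = epsilon / PGD learning rate" is taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Per-example binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(x @ w + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * np.log(p + tiny) + (1.0 - y) * np.log(1.0 - p + tiny))
    grad_x = (p - y)[:, None] * w[None, :]
    return loss, grad_x

def signed_pgd(w, b, x, y, epsilon, step_size, rng, lo=0.0, hi=1.0):
    """Signed PGD with random start inside the l_inf ball of radius epsilon.

    Assumption: inputs live in [lo, hi]; the number of steps follows the
    rule steps = epsilon / step_size stated in the experimental setup.
    """
    n_steps = int(np.ceil(epsilon / step_size))
    x_adv = np.clip(x + rng.uniform(-epsilon, epsilon, size=x.shape), lo, hi)
    for _ in range(n_steps):
        _, grad_x = loss_and_input_grad(w, b, x_adv, y)
        x_adv = x_adv + step_size * np.sign(grad_x)        # ascent on the loss
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)   # project onto the ball
        x_adv = np.clip(x_adv, lo, hi)                     # keep a valid input
    return x_adv
```

For a deep network, `loss_and_input_grad` would be replaced by a backward pass through the model; the projection and clipping steps stay the same.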
Experiment 1: When we try to apply adversarial training on large epsilons, the training loss does not decrease. We show the training adversarial loss across different epsilons on both MNIST (Fig. 1) and CIFAR10 (Supplementary Fig. 9 in Sec. D). We also try to adversarially train the models by adopting different choices of architectures, optimizers, learning rates, weight initialization, and different settings of PGD. These are described in more detail in the Supplementary Materials B.
Conclusion: Overall, it seems that these modifications are not playing a major role in the adversarial trainability of deep models on large epsilons.
Experiment 2: We take the weights from an adversarially trained model on epsilon = in MNIST and epsilon = in CIFAR10 as the initial weights, followed by adversarial training on larger epsilons. We report the results in Fig. 2.
Conclusion: This initialization makes the adversarial learning possible on larger epsilons in both datasets.
Remark 1: Initializations based on the standard trained models are also observed to be ineffective in this context, which suggests that weights of an adversarially trained model are inherently different from a standard trained model (Goodfellow et al., 2014).
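The control flow of Experiment 2 can be sketched as below: train adversarially at a small epsilon, then reuse the resulting weights as the starting point for training at a larger epsilon. The logistic model, full-batch updates, and all function names and hyper-parameter values are hypothetical stand-ins. A convex toy model like this one does not reproduce the trainability gap reported for deep networks, so the sketch shows only the procedure, not the effect.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(w, b, x, y, epsilon, step_size):
    """Signed-gradient PGD inside the l_inf ball of radius epsilon (logistic model)."""
    n_steps = int(np.ceil(epsilon / step_size)) if epsilon > 0 else 0
    x_adv = x.copy()
    for _ in range(n_steps):
        p = sigmoid(x_adv @ w + b)
        grad_x = (p - y)[:, None] * w[None, :]  # gradient of cross-entropy w.r.t. x
        x_adv = np.clip(x_adv + step_size * np.sign(grad_x), x - epsilon, x + epsilon)
    return x_adv

def adversarial_train(w, b, x, y, epsilon, epochs, lr=0.5, step_size=0.01):
    """Full-batch adversarial training that starts from the given weights (w, b)."""
    w, b = np.array(w, dtype=float), float(b)  # copy so the caller's weights are untouched
    for _ in range(epochs):
        x_adv = pgd_linf(w, b, x, y, epsilon, step_size)  # inner maximization
        p = sigmoid(x_adv @ w + b)
        w -= lr * (x_adv.T @ (p - y)) / len(y)            # outer minimization step
        b -= lr * float(np.mean(p - y))
    return w, b

def extended_adversarial_training(w_init, b_init, x, y, eps_small, eps_large,
                                  epochs_small, epochs_large):
    """Train at eps_small, then use those weights to initialize training at eps_large."""
    w_small, b_small = adversarial_train(w_init, b_init, x, y, eps_small, epochs_small)
    return adversarial_train(w_small, b_small, x, y, eps_large, epochs_large)
```

The only difference from training at the large epsilon directly is the choice of initial weights passed to the second call of `adversarial_train`.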
Now we compare the accuracy of our trained models with standard benchmarks in the literature (see Tables 1 and 2). We compare the results to a baseline, namely the model that is trained by adversarial training with the epsilons commonly used in the literature. Note that one should not expect a model that is trained on a given perturbation size to resist attacks of larger magnitudes. However, this could serve as a simple baseline and give a sense of the adversarial accuracy that is gained through the proposed method. We observe that the proposed method makes adversarial training on larger epsilons feasible and achieves non-trivial and significant adversarial test accuracies. We evaluate the robustness of all models against the signed PGD attack with steps, and the other hyper-parameters stay unchanged compared to what we used for training. For reasons that become clear later in C, the PGD that is used to solve the inner maximization has a larger number of steps and a smaller learning rate compared to the previously mentioned PGD in A.1. It is worth noting that we did not try to fine-tune any of these models to improve the accuracy.
Remark: The adversarial accuracy under attacks with bounded norm is observed to improve in the model that is trained by the proposed idea. More specifically, we consider the evaluation based on a PGD attack with epsilon , learning rate , and steps. In the MNIST dataset, we obtain an adversarial test accuracy of for the model that is trained with epsilon , compared to for the model that is trained with the original adversarial training with epsilon = .
[Tables 1 and 2: only the baseline row label survives, "Adversarial Training" (Engstrom et al., 2019b).]
B Insights on the proposed method
In this section, we address the three questions that were raised earlier in this section. In the deep learning literature, researchers have used loss function visualization to gain insights about the training of deep models, which could not be theoretically explained. Several methods have been proposed for this purpose (Li et al., 2017; Goodfellow et al., 2015). We further use some visualization to address these questions.
Visualization 1: To assess the convergence of adversarial training, we used two different plots. Specifically, the difference in training adversarial loss and the distance between the weights in two consecutive epochs serve as empirical indicators of convergence. The plots are shown in the Supplementary Materials E.
Conclusion: It seems that according to both mentioned plots, the adversarial training empirically has reached a local Nash equilibrium. This result is not unexpected as it has previously been shown that adversarial training would converge irrespective of the epsilon value in a simple model setting (Gao et al., 2019).
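The two convergence indicators used in Visualization 1 can be computed from per-epoch logs as sketched below. The function name and the data layout (one loss value and one flattened weight vector per epoch) are assumptions made for illustration, and the Euclidean distance is one possible choice of weight distance.

```python
import math

def convergence_diagnostics(losses, weights):
    """Return per-epoch changes in training loss and in the weights.

    losses:  list of training adversarial losses, one per epoch
    weights: list of flattened weight vectors (sequences of floats), one per epoch
    """
    loss_diffs = [abs(curr - prev) for prev, curr in zip(losses, losses[1:])]
    # Euclidean distance between the weights of consecutive epochs.
    weight_dists = [math.dist(prev, curr) for prev, curr in zip(weights, weights[1:])]
    return loss_diffs, weight_dists
```

Both sequences shrinking towards zero is consistent with convergence; neither is a proof of it.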
Visualization 2: (Goodfellow et al., 2015) proposed a method for loss landscape visualization. Specifically, the adversarial loss function denoted as , is considered for the weights lying on the line segment that is connecting to , where is the initial network weights and is the final trained weights. could be parametrized by , where , and .
We assume that the network initial weights are random. is convex for small perturbations but begins to be non-convex as epsilon grows (Fig. 3). Therefore, the weights associated with low adversarial loss may not be easily reachable from a random initialization through gradient-based optimization for a large epsilon.
Now, assume that the initial weights are obtained from an adversarially trained model. The seems to be well approximated by a convex function (Fig. 4; Left). Therefore, solutions with low training adversarial loss become reachable from the initial weights.
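The one-dimensional loss visualization of (Goodfellow et al., 2015) evaluates the loss at convex combinations of the initial and final weights. A minimal sketch follows; the parametrization $(1-\alpha)\,\theta_{init} + \alpha\,\theta_{final}$ with $\alpha \in [0, 1]$ is the standard one and is assumed here, and `loss_fn` is a hypothetical callable standing in for the training adversarial loss.

```python
import numpy as np

def interpolate_weights(theta_init, theta_final, alpha):
    """Point on the line segment from theta_init (alpha=0) to theta_final (alpha=1)."""
    return (1.0 - alpha) * theta_init + alpha * theta_final

def loss_along_segment(loss_fn, theta_init, theta_final, num_points=21):
    """Evaluate loss_fn at evenly spaced points between the two weight vectors."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(interpolate_weights(theta_init, theta_final, a))
                       for a in alphas])
    return alphas, losses

def looks_convex(losses, tol=1e-12):
    """Discrete convexity check: all second differences are non-negative."""
    second_diff = losses[2:] - 2.0 * losses[1:-1] + losses[:-2]
    return bool(np.all(second_diff >= -tol))
```

Plotting `losses` against `alphas` gives the kind of curve discussed for Fig. 3 and Fig. 4; `looks_convex` is a crude numerical proxy for the visual judgement made in the text.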
Visualization 3: Mode connectivity (Garipov et al., 2018; Dräxler et al., 2018; Freeman and Bruna, 2016) is an unexpected phenomenon in the loss landscape of deep nets. There have been some efforts in the literature to give a theoretical explanation of this effect (Kuditipudi et al., 2019). We made a plot similar to that of Visualization 2, with the difference that the training adversarial loss is calculated according to the initial epsilon (Fig. 4; Right).
Conclusion: The given plot suggests that the proposed initialization keeps the adversarial training focused on a small weight subspace, because the training adversarial loss, when evaluated based on the initial epsilon, does not increase and even decreases slightly. Therefore, we are exploring the weights that are robust against the initial epsilon as opposed to the entire weight space. This suggests “mode connectivity” in our training and therefore, we could expect the optimization to converge quickly.
C Evaluating adversarial robustness
As already mentioned, evaluating against adversarial attacks has proven to be extremely tricky. Here, we are inspired by the latest recommendations on evaluating adversarial robustness (Carlini et al., 2019). Note that our threat model is simple, and is indeed similar to the one that is used in the evaluation of adversarial training. Motivated by the recently recommended checklist for evaluation of adversarial robustness, we applied various sanity checks such as black-box attacks (Papernot et al., 2016), gradient-free attacks (Chen et al., 2017), brute-force attacks, and a novel semi-black-box attack to make the threat model broader. We found that the proposed defenses pass all these checks on MNIST. The details of the evaluation setup and the results for these experiments are explained in the Supplementary Materials C.
Building upon the mentioned recommendation list, we further evaluated our models based on more challenging tests. These are designed to increase our confidence in the claims that we make about the model robustness that is achieved by the proposed method.
We first decreased the learning rate of PGD to and increased the number of steps to to find a stronger attack. In the model that is trained based on our proposed initialization with target epsilon = , this led to adversarial accuracy of , as opposed to that is obtained when the same PGD as in the training is used for the test-time attack. However, the mentioned stronger attack does not significantly affect the adversarial accuracy of the model with target epsilon = . To make the attack even stronger, we used the actual gradient, as opposed to the signed gradient in PGD (Madry et al., 2017), with the learning rate = and step number . We observed that the adversarial accuracy dropped to - for the former model, but this again did not affect the adversarial accuracy of the model with target epsilon = . Notably, we could not decrease the validation adversarial accuracy of the model with the target epsilon = although we tried out various PGD settings.
One could use the loss function visualization along the adversarial and random direction in the input space to assess the gradient obfuscation. This plot suggests that the model with the target epsilon = does not exhibit gradient obfuscation, while the model with epsilon = shows signs of gradient obfuscation (Fig. 5).
To go further, we used the PGD with learning rate = and steps for the inner maximization in adversarial training from epsilon = to target epsilon = . We noted that the adversarial training loss increased in the final model from to , and test adversarial accuracy based on the same PGD decreased significantly from to . Surprisingly, unlike the model that is trained with a weaker PGD, we could successfully attack the trained model with the stronger PGD even with the weaker attack that was introduced in A.1.
This is analogous to the remedy used to avoid gradient obfuscation in FGSM by increasing the number of steps, which led to the emergence of the PGD attack. Indeed, the first attempt to train the model using a PGD with a small number of steps overfitted to this weak attack. We believe that, with a more accurate inner maximization, one might not even be able to train a model at epsilon = .
Note that for CIFAR10, as the adversarial accuracy is already low for large epsilons, we do not need to evaluate the model on the stronger attack. We included the gradient obfuscation plots for CIFAR10 in Fig. 5, which shows a similar trend to those of MNIST.
D Iterative adversarial training (IAT)
Inspired by the previously described “extended adversarial training”, we propose a new method, which we call “Iterative Adversarial Training” (IAT). In IAT, epsilon is gradually increased from 0 to the target epsilon according to a specific schedule across epochs during training.
For the MNIST dataset, we used the same model as described earlier. The only difference here is that we increase epsilon by in each epoch of adversarial training. In addition, we could identify the stopping point for epsilon by plotting the training adversarial loss against epochs and stop as soon as a significant sudden increase is observed (Fig. 7).
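The epsilon schedule and the stopping rule described for IAT can be sketched as below. The function names, the linear form of the schedule, and the jump threshold are assumptions made for illustration; the per-epoch increment used in the experiments is not given in the text above, so it appears here as a free parameter.

```python
import math

def linear_epsilon_schedule(target_epsilon, increment):
    """Epsilon used at each epoch: increment, 2*increment, ..., capped at target_epsilon."""
    # The small slack guards against floating-point noise in the division.
    n_epochs = math.ceil(target_epsilon / increment - 1e-9)
    return [min(k * increment, target_epsilon) for k in range(1, n_epochs + 1)]

def first_sudden_increase(losses, jump_threshold):
    """Index of the first epoch whose training adversarial loss jumps by more
    than jump_threshold over the previous epoch, or None if there is no jump."""
    for epoch in range(1, len(losses)):
        if losses[epoch] - losses[epoch - 1] > jump_threshold:
            return epoch
    return None
```

In use, one would train one epoch per schedule entry, log the training adversarial loss, and stop at the epsilon preceding the epoch returned by `first_sudden_increase`.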
For the CIFAR10 dataset, we tested several schedules (e.g. linear and exponential schedules) with different settings, but unfortunately, the trained models either did not show a good adversarial accuracy or got overfitted.
4 Benefits of Adversarial Robustness
The interpretability of machine learning models has emerged as an essential property in many real-world applications. These include areas where an explanation of the model prediction is required either for validation or for providing a human user with overlooked insights about the input-output association. In addition to the recent efforts to understand the internal logic of these models (Selvaraju et al., 2016; Montavon et al., 2015), it has been observed that interpretability comes as an unexpected benefit of adversarial robustness. For instance, the saliency map that highlights influential portions of the input becomes sparser and more meaningful in the robust models (Tsipras et al., 2018).
The last layer of a deep neural network can be thought of as the representation that is learned by the model. This representation could be divided into a set of robust and non-robust features (Ilyas et al., 2019). Robust features are the ones that are highly correlated with the class label even in the presence of adversarial perturbations. These features, therefore, have a close relation to human perception, and changes in these features can affect the meaning of the data for a human being. In contrast, non-robust features show a brittle correlation with the class label and therefore are not aligned with human perception. During the training process, a network has a bias towards learning non-robust features (Ilyas et al., 2019; Engstrom et al., 2019a). Adversarial training is a natural remedy that prevents the network from learning and relying on non-robust features. By training a network on larger perturbations, we narrow the set of non-robust features that the classifier can learn and, as a result, the loss gradient with respect to the input relies on more robust features and becomes more interpretable. Therefore, we would expect more relevant saliency maps for such a model.
Next, we assess this property in our learned models and compare them to models that are robust to smaller perturbations (Fig. 6). Specifically, we observe that the saliency map that is obtained from the loss gradient with respect to the input, evaluated on the perturbed data, becomes more compact and concentrated around the foreground in the model trained on the larger epsilon. Relevant to this, we also plotted the adversarial perturbation on the model that is trained on the MNIST dataset with epsilon = and compared it to a network that is trained on epsilon = . The PGD attack is based on epsilon = . We observe that the attack on both models has aimed to obscure the digit, but the attack from the model with higher epsilon would erase the digit more precisely. It has been reported that the attack on adversarially trained models lacks interpretability (Schott et al., 2018; Sharma and Chen, 2018). However, our model appears to yield more interpretable attacks compared to the baseline.
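A saliency map of the kind discussed here is the gradient of the loss with respect to the input. The sketch below computes it in closed form for a linear softmax classifier, which is a hypothetical stand-in for the deep networks in the experiments; for those, the same quantity would come from automatic differentiation. All names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, b, x, y):
    """Cross-entropy loss of the linear softmax model for one example (x, y)."""
    return float(-np.log(softmax(W @ x + b)[y]))

def saliency_map(W, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0        # d(loss)/d(logits) = softmax - one_hot(y)
    return W.T @ p     # chain rule through logits = W x + b
```

Taking the absolute value of the result and reshaping it to the image dimensions gives a map that can be displayed alongside the input.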
5 Theoretical Results
We first briefly recap the definitions of the classification error rate and the adversarial classification error rate, and then define 1. the optimal classifier, 2. the optimal robust classifier, 3. the optimal classification error rate, and 4. the optimal robust classification error rate with respect to the given distribution. Next, we discuss such optimal classifiers and their corresponding error rates under two specific distributions.
A Basics and definitions
In the subsequent definitions, we let be the set of labels. We also let denote the joint feature and label distribution.
The error rate of a classifier on the distribution is defined as:
where is the indicator function that takes if its input is a true logical statement, and otherwise.
The classifier is said to be an optimal classifier on if for any other classifier , . is also called the optimal Bayes classification error rate on .
Let be a point in the input space. Let . Then, the perturbation set is defined as:
where is the norm of a vector.
Let be a perturbation set. Then, adversarial error rate of a classifier on the perturbation set and the distribution is defined as:
The classifier is said to be a -optimal robust classifier on if for any other classifier , . is also called the -optimal classification error rate on .
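For orientation, the definitions in this subsection are usually written as follows. The symbols are assumptions introduced for this sketch, since the original formulas are not reproduced above: $f$ is a classifier, $\mathcal{D}$ the joint distribution, $\mathbb{1}(\cdot)$ the indicator function, and $\mathcal{S}(x)$ the perturbation set around the input $x$, here taken to be an $\ell_p$ ball of radius $\epsilon$.

```latex
\mathrm{err}_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{1}\big(f(x) \neq y\big)\big],
\qquad
\mathcal{S}(x) = \{\, x' : \|x' - x\|_{p} \le \epsilon \,\},

\mathrm{err}^{\mathcal{S}}_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{x' \in \mathcal{S}(x)} \mathbb{1}\big(f(x') \neq y\big)\Big].
```

An optimal classifier minimizes the first quantity over all classifiers, and an optimal robust classifier minimizes the last one.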
As the optimal and the optimal robust classifiers tend to solve completely different problems by the definition, they can be different functions for a given distribution. One can easily show that they can be different functions in the general case. Consider a very simple distribution on and , which contains just two points and , . Assume that . The optimal classifier outputs and , when the input is and , respectively. However, the -optimal robust classifier outputs a constant, either or .
B Optimal classifiers
In the standard setting, we can find the optimal classifier for a given distribution by assuming full access to the joint data and label distribution efficiently, specifically, through applying the Bayes optimal classifier theorem (Devroye et al., 1997). However, the problem of finding the optimal robust classifier in this setting is computationally hard in general.
Theorem 1: The problem of finding the optimal robust classifier given the joint data and label distribution for perturbations as large as in norm is an NP-hard problem.
It was shown that adversarial robustness might come at the cost of higher training time (Madry et al., 2017), more required data (Schmidt et al., 2018), and lower standard accuracy (Tsipras et al., 2018). In addition, we show that finding the optimal robust classifier is an NP-hard problem. Note that this computational difficulty is different from the one that arises in finding the global minimum of a non-convex function: by assuming an infinite amount of training data, instead of minimizing the non-convex empirical loss, we can approximate the data distribution well and therefore efficiently find an optimal Bayes classifier. However, even in this case, finding an optimal robust classifier is still computationally prohibitive in general.
We note that for several distributions, one could efficiently find the optimal robust classifier. These include the Gaussian and mixture of Bernoulli distributions.
C The Isotropic Gaussian model
Isotropic Gaussian class conditional distributions: We first sample uniformly and then sample a data point in from a multivariate Gaussian distribution, where without loss of generality .
Theorem 2: Let , and with equal probabilities, then assuming that the -optimal robust classifier has a continuous decision boundary, is -robust , where . Furthermore, the optimal Bayes and optimal robust classification error rates are , and , respectively, where
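As a numerical illustration of the Gaussian case, the sketch below evaluates the well-known error rates for two equiprobable classes with means $+\mu$ and $-\mu$, covariance $\sigma^2 I$, and $\ell_2$-bounded perturbations of size $\epsilon$. These modelling choices, the function names, and the cap at 0.5 (a constant classifier always attains 0.5) are assumptions made for this sketch; the theorem's own expressions are not reproduced in the text above.

```python
import math

def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_error_rates(mu_norm, sigma, epsilon):
    """Bayes error and robust error of the linear classifier sign(<mu, x>).

    mu_norm: Euclidean norm of the class mean (half the distance between centers)
    sigma:   standard deviation of the isotropic noise
    epsilon: l_2 radius of the adversarial perturbation (assumed norm)
    """
    bayes_error = std_normal_cdf(-mu_norm / sigma)
    # The adversary shifts each point by epsilon towards the decision boundary.
    robust_error = std_normal_cdf(-(mu_norm - epsilon) / sigma)
    return bayes_error, min(robust_error, 0.5)
```

Under these assumptions the robust error reaches 0.5 exactly when epsilon equals `mu_norm`, which matches the limit of half the distance between the two centers stated in Remark 2.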
D The multi-dimensional Bernoulli model
Uniform mixture of two multivariate Bernoulli distributions, : Let and for all , and . We first randomly sample the class label uniformly from . Then, we sample each dimension of according to a Bernoulli distribution:
Therefore, the conditional distribution of given becomes:
Theorem 3: Let be a Bernoulli model. Let . Then, 1. the optimal Bayes classifier on this distribution is of the form:
and 2. if is even, the optimal classification error rate is:
Otherwise, if is odd, the optimal classification error rate would be:
where , which is the number of dimensions in which and do not agree. We call these dimensions the “effective dimensions”.
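The Bayes classifier and its error rate for a uniform mixture of two product-Bernoulli distributions can be checked by brute-force enumeration for small dimensions, as sketched below. The parameter vectors `theta0` and `theta1` (per-dimension probabilities of a 1 under each class) and all function names are assumptions made for illustration; the closed-form expressions of the theorem are not reproduced here.

```python
import itertools
import math

def product_bernoulli_pmf(x, theta):
    """Probability of the binary vector x when dimension i is Bernoulli(theta[i])."""
    return math.prod(t if xi == 1 else 1.0 - t for xi, t in zip(x, theta))

def bayes_classify(x, theta0, theta1):
    """Most likely class under uniform priors (ties are assigned to class 1)."""
    return int(product_bernoulli_pmf(x, theta1) >= product_bernoulli_pmf(x, theta0))

def bayes_error_rate(theta0, theta1):
    """Exact Bayes error: sum over all inputs of the smaller joint probability."""
    n = len(theta0)
    return sum(0.5 * min(product_bernoulli_pmf(x, theta0),
                         product_bernoulli_pmf(x, theta1))
               for x in itertools.product((0, 1), repeat=n))

def effective_dimensions(theta0, theta1):
    """Indices where the two class-conditional parameters differ."""
    return [i for i, (a, b) in enumerate(zip(theta0, theta1)) if a != b]
```

Adding a dimension on which both classes share the same parameter leaves `bayes_error_rate` unchanged, which illustrates why only the effective dimensions matter. The enumeration is exponential in the dimension, so it is suitable only for sanity checks.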
Assumption 1: We assume that for and , the -optimal robust classifier would give the same label to and . Note that this is a reasonable assumption because of the symmetry of the problem.
Let be a Bernoulli model. Let . Let be the perturbation set. Then, under the assumption 1, 1. the -optimal robust classifier on this distribution is of the form:
where is a threshold, and , and are the restrictions of , and to the effective dimensions, defined in theorem 3, respectively, and 2. the classification -optimal error rate would be:
Let be a Bernoulli model. Let . Let be a perturbation set. Then, 1. the -optimal robust classifier on this distribution for is of the form:
where and is if and is otherwise, and 2. the classification -optimal error rate would be:
where is the optimal Bayes classification rule.
Remark 1: Note that the effective dimension, , rather than the real dimension of the input, appears in the equations for the optimal classifiers.
Remark 2: For the isotropic Gaussian distribution under the attack model, the maximum allowed adversarial perturbation before the adversarial accuracy drops to the trivial random-chance level is half of the distance between the centers of the two Gaussians. For the mixture of multi-dimensional Bernoulli distributions under the attack model, however, this limit is , which translates to for the mixture elements being in . For the attack, one has to use the final equation of the proof to find this limit for epsilon, which is plotted in Supplementary Materials G.
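The Gaussian part of Remark 2 can be illustrated with a short numerical sketch. The notation here is an assumption, since the symbols are not recoverable from the text above: classes are centered at +mu and -mu with covariance sigma^2 I, the attack is an l2 perturbation of size at most `eps`, and the classifier is the linear rule sign(mu . x). Under these assumptions the standard closed forms are Phi(-||mu|| / sigma) for the clean error and Phi((eps - ||mu||) / sigma) for the robust error, where Phi is the standard normal CDF.

```python
import math


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_error_rates(mu_norm, sigma, eps):
    """Clean and robust error of the linear rule sign(mu . x).

    Assumed model (not the paper's exact notation): x | y ~ N(y * mu, sigma^2 I),
    y uniform on {-1, +1}, l2 perturbations of norm at most eps.
    """
    clean = std_normal_cdf(-mu_norm / sigma)
    robust = std_normal_cdf((eps - mu_norm) / sigma)
    return clean, robust


# The centers are 2 * mu_norm apart, so half of that distance is mu_norm.
mu_norm, sigma = 2.0, 1.0
for eps in (0.0, 1.0, 2.0):
    clean, robust = gaussian_error_rates(mu_norm, sigma, eps)
    print(f"eps={eps:.1f}  clean error={clean:.4f}  robust error={robust:.4f}")
```

With `eps` equal to `mu_norm`, i.e. half of the distance between the two centers, the robust error of this rule reaches 0.5, which matches the random-chance limit described in the remark.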
Detailed proofs of the theorems in this section are provided in Supplementary Materials A.
6 Conclusion and future work
We demonstrated that weight initialization plays an important role in the adversarial learnability of deep models on larger perturbations. Specifically, weights from a model that is already robust on a smaller perturbation size could be an effective initialization to achieve this goal. Despite the promise of the proposed idea, some directions remain unstudied. We next provide a list of possible future directions to extend this work:
- In this work, we demonstrated the importance of weight initialization in the feasibility of adversarial training on large perturbations. However, a more thorough theoretical explanation is still needed to better understand this phenomenon from an optimization perspective, e.g., by explaining the geometry of the loss landscape.
- We showed that finding the optimal robust classifier is an NP-hard problem in general. We also showed that it can be computed for some specific distributions. However, it would be useful to understand which general properties of the data distribution make it possible to find the optimal robust classifier efficiently.
Our code is available at: https://github.com/rohban-lab/Shaeiri_submitted_2020
We would like to thank Soroosh Baselizadeh, Hossein Yousefi Moghaddam, and Zeinab Golgooni for their insightful comments and reviews of this work.
Appendix A Proofs
: The general problem of finding a -optimal robust classifier on the distribution , under the perturbation set . An algorithmic solution to this problem takes , and as input and outputs a classifier that is -optimal robust.
: The problem of finding the maximum distance-3 independent set in an unweighted undirected graph . We call a subset of nodes a distance-3 independent set if , where is the length of the shortest path between and on the graph . The goal of this problem is to find a maximum-cardinality distance-3 independent set of graph .
Every instance of can be reduced to an instance of in polynomial time.
Let and consider an order of all possible edges on , where each possible edge is of the form and connects two nodes . We first construct a set of points in dimensional space and then let be supported on these points. Let be a matrix, which is defined as:
and is defined for any given or as follows:
Note that the distance between every pair of rows in is greater than . Now, based on , we define a new matrix , where if the edge and , we set and . Note that in the matrix , two rows whose corresponding vertices are not connected still remain greater than apart with respect to the norm. However, two rows that are connected in the graph would now have an distance of exactly .
Let be the set of labels. We define discrete probability distributions on the points given by the rows of .
To find a -optimal robust classifier on the mentioned , which is supported on a finite set of points , we have to solve the following optimization problem:
By our construction of , this would be equivalent to:
Note that by our construction of , there is an edge between and iff . Hence, this optimization is equivalent to:
where is the set of all nodes connected to in the graph . Note that for the -th term in the sum to be one, (1) should be labeled as by , and (2) all neighbors of in the graph should be labeled as . Therefore, for both the -th and -th terms in the sum to be one, they should have a graph distance of at least 3. Conversely, if and have a graph distance of at least 3, then both terms can be one in the sum. As a result, the last optimization would be equivalent to:
Therefore, the last optimization would equivalently find the set of vertices with maximum cardinality such that all selected vertices are at a distance of at least 3 from each other in the graph. As a result, the problem of finding a -optimal robust classifier on would yield a solution to the maximum distance-3 independent set problem for the input graph . ∎
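The equivalence at the end of this proof can be sanity-checked by brute force on small graphs. The sketch below rests on one reading of the construction, which is an assumption because the formulas are not recoverable here: every node i carries its own label i, and node i counts as robustly correct when the classifier assigns label i to the node itself and to all of its neighbors. The function names and the adjacency-dictionary format are hypothetical.

```python
import itertools
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def max_distance3_independent_set(adj):
    """Size of the largest node set whose pairwise graph distance is >= 3."""
    n = len(adj)
    dist = [bfs_distances(adj, s) for s in range(n)]
    for size in range(n, 0, -1):
        for subset in itertools.combinations(range(n), size):
            pairs = itertools.combinations(subset, 2)
            # Unreachable pairs count as infinitely far apart.
            if all(dist[u].get(v, math.inf) >= 3 for u, v in pairs):
                return size
    return 0


def max_robustly_correct(adj):
    """Maximum number of robustly correct nodes over all labelings
    (assumed reading: node i has true label i)."""
    n = len(adj)
    best = 0
    for labels in itertools.product(range(n), repeat=n):
        count = sum(
            1 for i in range(n)
            if labels[i] == i and all(labels[j] == i for j in adj[i])
        )
        best = max(best, count)
    return best


# Path graph 0 - 1 - 2 - 3 - 4 and a triangle, as adjacency dictionaries.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

Both functions enumerate exponentially many candidates, so they are only meant for graphs with a handful of nodes. On such graphs the two optima coincide, as the argument above predicts.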
The problem of finding the -optimal robust classifier on the distribution , , is an NP-hard problem.
As shown in Lemma 1, the maximum distance-3 independent set problem can be reduced to in polynomial time. The former is a well-known NP-hard problem, which implies that would be NP-hard too. ∎
For the special case of in Lemma 1, one can consider and define the matrix as: . Note that the distance between every pair of rows in is exactly . Now, based on , we can again define a new matrix , where if the edge and , we set and . Note that in the matrix , two rows whose corresponding vertices are not connected still remain exactly apart with respect to the norm. However, two rows that are connected in the graph would now have an distance of exactly .
The most general case of the perturbation set can be defined by using a relation . Let ; then the perturbation set is defined as: . In this general case, constructing a relation from the graph and specifying the coordinates of the points is trivial.
Let , and with equal probabilities, then assuming that the -optimal robust classifier has a continuous decision boundary, is -robust , where . Furthermore, the optimal Bayes and optimal robust classification error rates are , and , respectively, where is the cumulative distribution function for the standard normal distribution.
A binary classifier can be expressed as . We aim to find the optimal robust classifier :
where is the true label of the data point .
Let and denote the conditional densities of given , and , respectively. Then, we have:
and we get:
The decision boundary of classifier is defined as
In what follows, we focus only on the candidate classifiers that have a continuous decision boundary.
(x) is defined as the set of points that are within the ball of radius around , excluding the outer boundary. Also, the critical region of , denoted as , is defined as the set of points that can be misclassified by adding a perturbation of length no more than .
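For intuition, membership in the critical region has a simple form for a linear classifier. The sketch below is a special case under stated assumptions and not the general setting of the proof: the classifier is sign(w . x + b), perturbations are measured in the l2 norm, and points that can be pushed exactly onto the decision boundary are treated as misclassifiable. The names `w`, `b` and `eps` are hypothetical.

```python
import math


def signed_margin(w, b, x):
    """Signed l2 distance from x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def in_critical_region(w, b, x, eps):
    """True if some l2 perturbation of length at most eps moves x onto or
    across the decision boundary of the linear classifier sign(w . x + b)."""
    return abs(signed_margin(w, b, x)) <= eps


# Hypothetical example: the decision boundary is the line x_1 = 0.
w, b, eps = [1.0, 0.0], 0.0, 0.5
near = in_critical_region(w, b, [0.3, 5.0], eps)  # 0.3 away from the boundary
far = in_critical_region(w, b, [0.7, 5.0], eps)   # 0.7 away from the boundary
```

In this special case the critical region is the slab of half-width `eps` around the hyperplane, i.e. the union of the balls of radius `eps` centered at points of the decision boundary.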
Suppose that . Therefore, there exists and . Since the label of the data points, and are different and as we assumed that is continuous, there must exist such that . Note that
and according to Definition 4,
Now suppose that . Then, there exists such that . But note that
As a result, .
Note that as , , if , . Otherwise,
So, in both cases, . ∎
According to Definition 4, for all , there exist two points , within distance of , such that , . Therefore, . As a result, we have
The safe region of the class , denoted as , is defined as the set of points that are labeled as by the classifier and cannot be misclassified when a perturbation of size less than or equal to is added.
According to the definition above, for all , we get