Adversarial machine learning and its implications (szegedy2014intriguing) for the security of neural network decisions have received significant attention. Recent research has pioneered approaches to defending against adversarial inputs, only to be subsequently confronted with attack strategies that invalidate these defenses (athalye2018obfuscated; carlini2017adversarial).
madry2018towards developed a robust optimization scheme that formulates adversarially robust training as a min-max (saddle-point) optimization problem whose inner maximization can be approximately solved using projected gradient descent (PGD). The attacker is allowed to create perturbations within a ball $\mathcal{B}_\epsilon$ of radius $\epsilon$. PGD attempts to find a worst-case adversarial perturbation within $\mathcal{B}_\epsilon$ that maximizes the loss on a given image, and then a classifier is trained on these perturbations:
$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\delta \in \mathcal{B}_\epsilon} L(\theta, x+\delta, y) \Big].$$
The resulting model has been shown to withstand adversarial attacks belonging to $\mathcal{B}_\epsilon$. However, training models in this manner leads to a significant loss in natural accuracy (tsipras2019robustness). Natural accuracy is an important characteristic for any model, including those trained to be adversarially robust. In practice, a model does not know whether it is being attacked, and if it is, it still does not know the strength of the attack. Therefore, it is preferable to have a system that adapts to different perturbation strengths while being agnostic to the perturbation itself. Our goal is to utilize ensemble schemes to improve natural accuracy while guaranteeing adversarial robustness at the same attack strength $\epsilon$. While ensemble schemes that use multiple classifiers and take a majority vote on a class prediction have been previously proposed (tramer2018ensemble), these approaches have been shown to be adversarially vulnerable (he2017adversarial).
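The inner maximization described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes an $\ell_\infty$ ball, a linear softmax classifier with an analytic input gradient (standing in for a deep network), and hypothetical names (`pgd_linf`, `input_gradient`, `step`, `n_steps`) chosen only for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, b, x, y):
    """Mean cross-entropy of the linear model logits = x @ W + b."""
    p = softmax(x @ W + b)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def input_gradient(W, b, x, y):
    """Gradient of the per-example cross-entropy with respect to the input x."""
    p = softmax(x @ W + b)
    p[np.arange(len(y)), y] -= 1.0   # dL/dlogits = softmax - onehot(y)
    return p @ W.T                   # chain rule through logits = x @ W + b

def pgd_linf(W, b, x, y, eps, step, n_steps, rng):
    """Approximate the worst-case perturbation with projected gradient ascent.

    Assumption: the perturbation set is the l_inf ball of radius eps.
    """
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start inside the ball
    for _ in range(n_steps):
        g = input_gradient(W, b, x + delta, y)
        delta = delta + step * np.sign(g)          # ascent step on the loss
        delta = np.clip(delta, -eps, eps)          # project back onto the ball
    return x + delta
```

In adversarial training, each minibatch would be replaced by the output of `pgd_linf` before the usual parameter update; for image data one would typically also clip `x + delta` to the valid pixel range, which this sketch omits.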
Robust Ensembles: Our key idea leverages the inverse relationship between the strength of the robustness guarantee (i.e., the size of the ball of attacks a model is exposed to during training) and natural accuracy (tsipras2019robustness). In particular, we point out that ensembling robust models can only improve adversarial accuracy. As a result, ensembling models trained to withstand smaller attacks can withstand a much larger targeted attack. Furthermore, the inverse relationship of attack size and accuracy leads to significant improvements in natural accuracy. Consequently, we can optimize ensemble training to maximize natural accuracy for a desired targeted attack-level.
Composing Robust and Natural Features: We also explore a strategy based on fusing natural and robust models. These ideas are inspired by ilyas2019adversarial, who attributed the natural accuracy loss of robust models to the existence of both non-robust and robust features in images. Non-robust features are human-imperceptible pixel patterns that still correlate well with class predictions. These result from the tendency of standard classifiers to use any available signal for generalization. Meanwhile, robust features are more or less human-interpretable patterns that can also generalize independently of non-robust features, though with significantly lower accuracy than when both types of features are used. Robust training removes the dependency of a model on non-robust features, and a higher attack strength strips out more non-robust features. Fusion, in contrast, leads to improved natural accuracy without suffering adversarial degradation.
We propose an area-under-the-curve (AUC) metric to characterize how adversarially robust models adapt to a continuous spectrum of attack strengths: the area under the accuracy curve over all attacks whose perturbations are bounded by a given strength $\epsilon$.
Maximizing performance with respect to this metric means that a model works well across the entire adversarial attack spectrum. Our results show that both of our schemes yield significant improvements over robustly trained ResNet18 models on the CIFAR-10 dataset, with respect to both natural accuracy and the AUC metric.
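As a rough illustration of such a metric, the sketch below integrates an accuracy-versus-attack-strength curve with the trapezoidal rule. The function name `accuracy_auc` and the optional normalization by the width of the strength range are assumptions made for this example; the exact definition and normalization used in the paper are not recoverable from the text.

```python
def accuracy_auc(strengths, accuracies, normalize=True):
    """Trapezoidal area under an accuracy-versus-attack-strength curve.

    strengths:  strictly increasing attack strengths, starting at the smallest
    accuracies: accuracy measured at each strength
    normalize:  if True, divide by the width of the strength range (assumption)
    """
    if len(strengths) != len(accuracies) or len(strengths) < 2:
        raise ValueError("need at least two matching (strength, accuracy) points")
    area = 0.0
    for i in range(1, len(strengths)):
        width = strengths[i] - strengths[i - 1]
        if width <= 0:
            raise ValueError("attack strengths must be strictly increasing")
        area += 0.5 * width * (accuracies[i] + accuracies[i - 1])
    if normalize:
        area /= strengths[-1] - strengths[0]
    return area
```

With normalization, a model whose accuracy stays constant across the spectrum scores exactly that accuracy, which makes values comparable across different ranges of attack strength.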
2 Problem Formulation and Algorithm
For completeness, we first describe the adversarial accuracy of robustly trained base models. Let the input examples and labels $(x, y)$ belong to the product space $\mathcal{X} \times \mathcal{Y}$ and be distributed according to the joint distribution $\mathcal{D}$. Let $\theta$ denote the parameters of some model trained in some fashion on training data $S$. Let $\theta_\epsilon(\omega)$ be the (optimal) parameters of a robustly trained base model with random weight initialization $\omega$, which is supposed to withstand bounded test-time perturbations as large as $\epsilon$. Training a model robust to perturbations of size $\epsilon$ follows madry2018towards, where PGD is used to train on $S$.
Conventionally, models are trained to withstand adversarial attacks of size $\epsilon$ and evaluated against attacks of the same size. In this work, we are also interested in their performance for other attack sizes. For instance, we are often interested in the natural accuracy of a model. Thus, we allow for test-time perturbations of size $\delta$, which can take on any non-negative value. As such, our objective is to develop a perturbation-agnostic characterization of a model's performance. We denote the accuracy of a model $f$ with respect to its parameters $\theta$ and loss function $L$, under worst-case test-time perturbations of size at most $\delta$, by $A(\theta, \delta)$, an expectation taken with respect to test data. The expected performance at attack size $\delta$ of a randomly initialized model robustly trained for size $\epsilon$, with parameters $\theta_\epsilon(\omega)$ obtained from initialization $\omega$, is the average over different random initializations, and is denoted as
$$\bar{A}(\epsilon, \delta) = \mathbb{E}_{\omega}\big[ A(\theta_\epsilon(\omega), \delta) \big].$$
Ensembling. In the previous scheme, a single base-model is offered to the attacker, and the attacker is allowed to choose the worst perturbation for this model. In our random ensembling scheme, our model is a weighted average of many random base models, and the attacker must attack the average model rather than any model in isolation.
In particular, let us now consider a collection of $K$ robust models trained with different random initializations $\omega_1, \ldots, \omega_K$, where each $\omega_k$ accounts for a different initialization in our training scheme. For instance, we could choose the $\omega_k$'s independently from a multivariate Gaussian distribution, which corresponds to independent random initializations of the model parameters. Although one can experiment with different initialization strategies and optimize over them, we only consider initializations drawn from a fixed distribution. Correspondingly, we obtain $K$ robust models, $\theta_\epsilon(\omega_1), \ldots, \theta_\epsilon(\omega_K)$, each trained for attacks of size $\epsilon$. We can weight these models with non-negative weights that sum up to one. Let $\alpha = (\alpha_1, \ldots, \alpha_K)$ denote the weight vector, which formally takes values in the K-dimensional simplex $\Delta_K$. The ensemble model is $f_\alpha(x) = \sum_{k=1}^{K} \alpha_k f(x; \theta_\epsilon(\omega_k))$. As a consequence, we are now faced with a different accuracy expression, namely $A_\alpha(\epsilon, \delta)$, in which the attacker's worst-case perturbation of size at most $\delta$ is chosen against the averaged model $f_\alpha$ rather than against any single base model.
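We note that whenever the loss function $L$ is convex in the model output (which is typically the case), it follows from Jensen's inequality that, for every input $x$ and label $y$,
$$L\Big(\sum_{k=1}^{K} \alpha_k f_k(x),\, y\Big) \le \sum_{k=1}^{K} \alpha_k L\big(f_k(x),\, y\big),$$
where $f_k$ denotes the $k$-th base model and $\alpha_k$ its non-negative weight, with the weights summing to one. Because the inequality holds at every perturbed input, it is preserved when taking the worst case over perturbations and the expectation over test data.
Remark. The adversarial loss of a random ensemble is therefore no worse than the average adversarial loss of robust base models trained with different initializations.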
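The Jensen argument can be checked numerically. The sketch below is illustrative only: it assumes the ensemble averages member logits and uses cross-entropy as the convex loss, and the helper names (`ensemble_logits`, `jensen_gap`) are hypothetical. The gap it returns, the weighted average of member losses minus the ensemble loss, should be non-negative for every example.

```python
import numpy as np

def cross_entropy_from_logits(logits, y):
    """Per-example cross-entropy, which is convex in the logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

def ensemble_logits(member_logits, weights):
    """Weighted average of member logits; member_logits has shape (K, n, classes)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    return np.tensordot(weights, member_logits, axes=1)

def jensen_gap(member_logits, weights, y):
    """Weighted average member loss minus ensemble loss, per example."""
    member_losses = np.stack(
        [cross_entropy_from_logits(logits, y) for logits in member_logits]
    )
    avg_member_loss = np.tensordot(
        np.asarray(weights, dtype=float), member_losses, axes=1
    )
    ensemble_loss = cross_entropy_from_logits(
        ensemble_logits(member_logits, weights), y
    )
    return avg_member_loss - ensemble_loss
```

The same inequality holds if the ensemble averages class probabilities instead of logits, since the negative log is convex in the probability assigned to the true class.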
Optimizing Natural Accuracy for a Target Adversarial Perturbation. These comments lead us to the following insight. Adversarial robustness is monotonic in the size of the perturbation: the accuracy $A(\theta, \delta)$ under attacks of size $\delta$ satisfies $A(\theta, \delta) \ge A(\theta, \delta')$ for $\delta \le \delta'$. Furthermore, it is generally true that the natural accuracy $\bar{A}(\epsilon, 0)$ of a model adversarially trained at level $\epsilon$ decreases with $\epsilon$, namely, $\bar{A}(\epsilon, 0) \ge \bar{A}(\epsilon', 0)$ for $\epsilon \le \epsilon'$. As a result, we propose strategies that combine models to achieve higher natural accuracy than a single robust base model, while suffering no additional adversarial degradation relative to the base model. We discuss two schemes to optimize natural accuracy below.
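Averaging Randomly Initialized Robust Models. One way to do so is to average randomly initialized robust models. Here we train a collection of randomly initialized robust models at a perturbation level $\epsilon'$ that is significantly smaller than the target level $\epsilon$. Specifically, our general objective can be framed as maximizing natural accuracy subject to no degradation in adversarial robustness:
$$\max_{\alpha \in \Delta_K,\; \epsilon'} \; A_\alpha(\epsilon', 0) \quad \text{subject to} \quad A_\alpha(\epsilon', \epsilon) \ge \bar{A}(\epsilon, \epsilon),$$
where $A_\alpha(\epsilon', \delta)$ is the accuracy at attack size $\delta$ of the $\alpha$-weighted ensemble of models trained at level $\epsilon'$, and $\bar{A}(\epsilon, \epsilon)$ is the expected accuracy of a base model trained and attacked at level $\epsilon$.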
We consider a simpler scheme here. We set the weights to be uniform across all models. We then minimize the training perturbation size $\epsilon'$ such that the adversarial error at the target size $\epsilon$ is no worse than that of the base model. In practice, we can validate this empirically on held-out validation data.
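The selection step of this simpler scheme can be sketched as a search over validation results. The function below is a hypothetical helper, not the authors' code: it assumes each candidate training strength has already been used to train a uniform ensemble, and that the ensemble's adversarial accuracy at the target strength and its natural accuracy have been measured on held-out data.

```python
def pick_training_strength(candidates, base_adv_acc, tol=0.0):
    """Return the smallest training strength that keeps the target robustness.

    candidates:   iterable of tuples
                  (train_eps, ensemble_adv_acc_at_target, ensemble_nat_acc)
    base_adv_acc: adversarial accuracy of the single base model at the target
    tol:          allowed slack on the robustness constraint (assumption)
    """
    feasible = [c for c in candidates if c[1] >= base_adv_acc - tol]
    if not feasible:
        return None  # no candidate ensemble matches the base model
    return min(feasible, key=lambda c: c[0])
```

Picking the smallest feasible training strength relies on the stated trend that natural accuracy decreases as the training perturbation grows; if that trend does not hold on the validation data, one could instead select the feasible candidate with the highest natural accuracy.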
Meta-Composites of Natural and Robust Models. It has been argued recently that robustness is a feature that represents intrinsic and invariant properties of an object. Nevertheless, training with robust features typically results in a significant loss in natural accuracy. Our goal here is to compose robust and natural features to form composite models that achieve the best of both worlds, namely, maintaining adversarial accuracy while significantly improving natural accuracy. To do so, we train two models from different random initializations, one robust, trained at a chosen perturbation level, and the other natural. We fuse information from both models by augmenting the penultimate layers of the two networks, one robust and the other natural, and perform PGD training only on the last layer to withstand a target attack level. This results in a composite model. We then train a collection of such composites based on different random initializations and construct ensembles of such models by averaging their outputs as in the previous section. This results in an ensemble of composites. Our goal is to optimize the perturbation levels used in this construction so as to maintain target adversarial robustness while maximizing natural accuracy.
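A toy version of the composite construction is sketched below. It makes several simplifying assumptions that are not from the paper: each frozen network is replaced by a single ReLU layer (`V_robust`, `V_natural`), "augmenting the penultimate layers" is read as concatenating their activations, the attack is an $\ell_\infty$ PGD with a fixed step of `eps / 2`, and only a linear softmax head `W` is updated.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fused_features(V_robust, V_natural, x):
    """Concatenate the penultimate activations of the two frozen extractors."""
    V = np.concatenate([V_robust, V_natural], axis=1)  # (d, h_robust + h_natural)
    pre = x @ V
    return np.maximum(pre, 0.0), pre > 0, V

def train_head(V_robust, V_natural, x, y, n_classes, eps, rng,
               epochs=50, lr=0.1, pgd_steps=5):
    """PGD-train only the final linear layer on top of frozen, fused features."""
    onehot = np.eye(n_classes)[y]
    W = np.zeros((V_robust.shape[1] + V_natural.shape[1], n_classes))
    for _ in range(epochs):
        delta = rng.uniform(-eps, eps, size=x.shape)    # random start in the ball
        for _ in range(pgd_steps):                      # inner maximization
            h, mask, V = fused_features(V_robust, V_natural, x + delta)
            g_logits = softmax(h @ W) - onehot          # dL/dlogits
            g_x = ((g_logits @ W.T) * mask) @ V.T       # backpropagate to the input
            delta = np.clip(delta + (eps / 2) * np.sign(g_x), -eps, eps)
        h, _, _ = fused_features(V_robust, V_natural, x + delta)
        grad_W = h.T @ (softmax(h @ W) - onehot) / len(y)
        W -= lr * grad_W                                # update the head only
    return W
```

The extractor weights are never modified, so the robust and natural features stay fixed while the head learns how to weigh them under attack; an ensemble of composites would then average the outputs of several such heads trained from different initializations.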
3 Results and Conclusions
Experimental Setup. We describe the training and test environment for our experiments. All base robust models were trained using PGD (madry2018towards) for ResNet18 with the attack strength specified in the tables below and the default number of attack steps. Models were trained (50,000 samples) and tested (10,000 samples) on the CIFAR-10 dataset. Results reported are the best configurations found over 5 runs.
Ensemble Accuracy. Table 1 reports natural accuracy and adversarial robustness at given values of the test-time perturbation size $\delta$, as well as AUC($\epsilon$), namely, the area under the curve for targeted attacks with perturbations bounded by $\epsilon$.
We show that with eight robust models trained at a smaller attack strength, we can achieve the same robust accuracy as a single model trained at the larger target strength. A fundamental advantage of doing so is that each of the robust models now has substantially higher natural accuracy. The ensemble shows a 6.4% increase in natural accuracy compared to a single model trained at the target strength. We additionally note that the robustly averaged model has a higher adversarial accuracy at smaller attack strengths, before it converges with the single robust model when evaluated at the target strength. This gives it a higher AUC than the single robust model. As expected, the ensemble also has better robust accuracy than a single natural model and a weakly trained base model.
A second point to observe is that our robust ensembles improve upon natural accuracy as well as adversarial accuracy. This is as expected because, in Table 2, we note that an ensemble of models trained to withstand attacks of a smaller size $\epsilon'$ matches the adversarial accuracy of a single robust model trained to withstand attacks of a larger size $\epsilon$. This implies that natural accuracy, i.e., the scenario where $\delta = 0$, should also improve with ensembling. Our results show that this is indeed the case. We also notice in Table 2 that natural accuracy saturates after ensembling a small number of models, implying that most of the gains come early.
Using two models, a strong and a weak model, the single composite delivers better natural accuracy than a single robust model, but compromises slightly in robustness. However, by combining two composites in a weighted average scheme, we do not compromise any robustness. The accuracy degradation curve over the attack strength $\delta$ is less steep than that of robust averaging, even with fewer models used, and this is reflected in a higher AUC value.