In the last 10 years, Deep Neural Networks (DNNs) have dramatically improved performance across computer vision tasks. However, this impressive accuracy comes at the cost of poor robustness to small perturbations, called adversarial perturbations, that lead the models to predict a wrong class with high confidence goodfellow2014explaining; terzi2020directional. This undesirable behaviour has led to a flourishing of research on robustness against such perturbations. State-of-the-art approaches to robustness are provided by Adversarial Training (AT) madry2017towards and its variants zhang2019theoretically. The rationale of these approaches is to find worst-case examples and feed them to the model during training, or to constrain the output not to change significantly under small perturbations. However, robustness is achieved at the expense of accuracy: the more robust a model is, the lower its accuracy will be tsipras2018robustness. This is a classic “waterbed effect” between precision and robustness, ubiquitous in optimal control and many other fields. Interestingly, robustness is not the only desirable property of adversarially trained models: their representations are semantically meaningful and can be used for other CV tasks, such as generation and (semantic) interpolation of images. More importantly, AT enables invertibility, that is, the ability to reconstruct input images from their representations ilyas2019adversarial by solving a simple optimization problem. This also holds for out-of-distribution images, meaning that robust networks do not destroy information about the input. Hence, how can we explain that, while robust networks preserve information, they lose generalization power?
In this context, obtaining good representations for a task has been the subject of representation learning, where the most widely accepted theory is the Information Bottleneck (IB) tishby2000information; alemi2016deep; achille2018information, which calls for reducing information in the activations, arguing that this is necessary for generalization. More formally, let $x$ be an input random variable and $y$ be a target random variable; a good representation $z$ of the input should be maximally expressive about $y$ while being as concise as possible about $x$. The solution of the optimal trade-off can be found by optimizing the Information Lagrangian:
$$\mathcal{L} = H(y|z) + \beta\, I(z; x),$$
where $\beta$ controls how much information about $x$ is conveyed by $z$. Both AT and IB at their core aim at finding good representations: the former calls for representations that are robust to input perturbations, while the latter finds the minimal representations useful for the task. How are these two methods related? Do they share some properties? More precisely, does the invertibility property contradict IB theory? In fact, if generalization requires discarding information in the data that is not necessary for the task, it should not be possible to reconstruct the input images.
Throughout this paper we (i) investigate the research questions stated above, with particular focus on the connection between IB and AT, and, as a consequence of our analysis, (ii) reveal new interesting properties of robust models.
1.1 Contributions and related works
A fundamental result of IB is that, in order to generalize well on a task, the representation has to be sufficient and minimal, that is, it should contain only the information necessary to predict the target, which in our case is a class label. Apparently, this contradicts the evidence that robust DNNs are invertible and maintain almost all the information about the input, even if it is not necessary for the task. However, what matters for generalization is not the information in the activations, but the information in the weights (PAC-Bayes bounds) achille2019information. Reducing information in the weights yields a reduction in the effective information in the activations at test time. Differently from IB theory, achille2019information claims that the network does not need to destroy information in the data that is not needed for the task: it simply needs to make it inaccessible to the classifier, but can otherwise leave it lingering in the weights. That is the case for ordinary learning. This paper shows that, while AT preserves information about the data that is irrelevant for the task in the weights (to the point where the resulting model is invertible), the information that is effectively used by the classifier does not contain all the details about the input. In other words, the network is not effectively invertible: what really matters is the accessible information stored in the weights. To visualize this fact, we introduce effective images, that is, images that represent what the classifier “sees”. Inverting learned representations is not new, and it was addressed in mahendran2015understanding; yosinski2015understanding; ulyanov2018deep; kingma2013auto; however, unlike robust models, these methods either inject external information through priors or explicitly impose the task of reconstruction.
The main contribution of this work can be summarized as follows. If representations contain all the information about the input, then adversarially trained models should be better at transferring features to different tasks, since aspects of the data that were irrelevant to the task the model was (pre-)trained on were neither destroyed nor ignored, but preserved. To test this hypothesis, we perform linear classification (fine-tuning the last layer) on different tasks. We show that AT improves the linear transferability of deep learning models across diverse tasks which are sufficiently different from the source task/dataset. Specifically, the farther apart two tasks are (as measured by a task distance), the higher the performance improvement that can be achieved by training a linear classifier on top of an adversarially trained model (feature extractor, or backbone) compared to an ordinarily trained model. Relatedly, shafahi2019adversarially studies experimentally the transferability of robustness to new tasks; in contrast, in the present work we study the linear transferability of natural accuracy. Moreover, we also show analytically, confirming empirical evidence ilyas2019adversarial, that once we extract robust features from a backbone model, all the models using these features have to be robust.
We also show that adversarial regularization is a lower bound of the regularizer in the Information Lagrangian, so AT in general results in a loss of accuracy for the task at hand. The benefit is increased transferability, thus showing a classical trade-off between robustness (and its consequent transferability) and accuracy on the task for which the model is trained. This is a classic “waterbed effect” between precision and robustness, ubiquitous in optimal control.
Regarding the connection with IB, we show analytically that AT reduces the effective information in the activations about the input, as defined by achille2019information (Section 5). Moreover, we show empirically that adversarial training also reduces information in the weights, and we discuss the consequences.
Finally, we show that injecting effective noise once during the inversion process dramatically improves the reconstruction of images in terms of convergence and fit quality.
2 Preliminaries and Notation
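We introduce here the notation used in this paper. We denote a dataset of $N$ samples with $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x \in X$ is an input and $y \in Y$ is the target class in the finite set $Y = \{1, \dots, C\}$. More generally, we refer to $y$ as a random variable defining the “task”. In this paper we focus on classification problems using the cross-entropy $L_{\mathcal{D}}(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(x, y; w)]$ on the training set $\mathcal{D}$ as objective, where $\ell(x, y; w) = -\log p_w(y|x)$ and $p_w(y|x)$ is encoded by a DNN. The loss $L_{\mathcal{D}}(w)$ is usually minimized using stochastic gradient descent (SGD) bottou2018optimization, which updates the weights $w$ with a noisy estimate of the gradient computed from a mini-batch of samples. Thus, the weight update can be expressed as a stochastic diffusion process with non-isotropic noise li2017stochastic. In order to measure the (asymmetric) similarity between distributions we use the Kullback-Leibler (KL) divergence between $p(x)$ and $q(x)$, given by $\mathrm{KL}(p(x) \,\|\, q(x)) = \mathbb{E}_{x \sim p(x)}[\log(p(x)/q(x))]$. It is well known that the second-order approximation of the KL divergence is $\mathbb{E}_x\, \mathrm{KL}(p_w(y|x) \,\|\, p_{w+\delta w}(y|x)) = \tfrac{1}{2}\, \delta w^{\top} F\, \delta w + o(\|\delta w\|^2)$, where $F$ is the Fisher Information Matrix (FIM, simply referred to as “Fisher”), defined by $F = \mathbb{E}_{x,\, y \sim p_w(y|x)}[\nabla_w \log p_w(y|x)\, \nabla_w \log p_w(y|x)^{\top}]$. The FIM gives a local measure of how much a perturbation $\delta w$ of the parameters $w$ will change $p_w(y|x)$ with respect to the KL divergence martens2014new. Finally, let $x$ and $z$ be two random variables. The Shannon mutual information is defined as $I(x; z) = \mathbb{E}_{x \sim p(x)}[\mathrm{KL}(p(z|x) \,\|\, p(z))]$. Throughout this paper, we indicate the representations before the linear layer as $z = f_w(x)$, where $f_w$ is called the feature extractor.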
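The second-order relation between the KL divergence and the Fisher can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: the "model" is a three-class categorical distribution parameterized directly by its logits (for which the Fisher is known to be diag(p) - p pᵀ), and the parameter values and the perturbation are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def fisher_wrt_logits(p):
    """Fisher of a categorical distribution w.r.t. its logits: diag(p) - p p^T."""
    k = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(k)]
            for i in range(k)]

def quadratic_form(F, d):
    """Compute d^T F d."""
    n = len(d)
    return sum(d[i] * F[i][j] * d[j] for i in range(n) for j in range(n))

logits = [1.0, -0.5, 0.3]        # toy parameters (assumption)
delta = [0.01, -0.02, 0.015]     # small perturbation of the parameters (assumption)

p = softmax(logits)
p_perturbed = softmax([w + d for w, d in zip(logits, delta)])

exact_kl = kl_divergence(p, p_perturbed)
fisher_approx = 0.5 * quadratic_form(fisher_wrt_logits(p), delta)

print(f"exact KL      = {exact_kl:.3e}")
print(f"0.5 * d^T F d = {fisher_approx:.3e}")
```

The two printed values should agree up to third-order terms in the perturbation, which is why the approximation is only trusted for small perturbations.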
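AT aims at solving the following min-max problem:
$$\min_w \mathbb{E}_{(x,y) \sim \mathcal{D}}\Big[\max_{\|\delta\| \le \varepsilon} \ell(x + \delta, y; w)\Big]. \quad (1)$$
In the following we denote $\mathbb{E}_{(x,y) \sim \mathcal{D}}$ with $\mathbb{E}_{\mathcal{D}}$. We remark that by $\mathbb{E}_{\mathcal{D}}$ we mean the empirical expectation over the $N$ elements of the dataset. Intuitively, the objective of AT is to ensure stability to small perturbations of the input. With cross-entropy loss this amounts to requiring that $\mathrm{KL}(p_w(y|x) \,\|\, p_w(y|x+\delta)) \le \gamma$, with $\gamma$ small. Depending on $\varepsilon$, we can write Equation 1 as:
$$\min_w \mathbb{E}_{\mathcal{D}}[\ell(x, y; w)] + \beta \max_{\|\delta\| \le \varepsilon} \mathrm{KL}(p_w(y|x) \,\|\, p_w(y|x+\delta)), \quad (2)$$
which is the formulation introduced in zhang2019theoretically when using cross-entropy loss.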
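To make the inner maximization of AT concrete, the following minimal sketch works it out for a linear logistic model, where the worst-case perturbation has a closed form. Everything here is an assumption made for illustration: the linear model, the toy weights and input, the ℓ2 budget `eps`, and the helper names. For a DNN the inner problem has no closed form and is typically approximated with projected gradient steps, as in madry2017towards.

```python
import math

def logistic_loss(w, x, y):
    """Cross-entropy of a linear logistic model; the label y is in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def worst_case_l2_perturbation(w, y, eps):
    """Maximizer of the loss over ||delta||_2 <= eps for a linear model."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [-eps * y * wi / norm for wi in w]

w = [0.8, -1.2, 0.5]    # toy weights (assumption)
x = [1.0, 0.2, -0.4]    # toy input (assumption)
y = 1
eps = 0.25              # perturbation budget (assumption)

delta = worst_case_l2_perturbation(w, y, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]

clean_loss = logistic_loss(w, x, y)
adv_loss = logistic_loss(w, x_adv, y)

# For a linear model the adversarial margin equals the clean margin
# minus eps * ||w||_2, so the robust loss penalizes the weight norm.
w_norm = math.sqrt(sum(wi * wi for wi in w))
clean_margin = y * sum(wi * xi for wi, xi in zip(w, x))
closed_form = math.log1p(math.exp(-(clean_margin - eps * w_norm)))

print(clean_loss, adv_loss, closed_form)
```

In this linear case the outer minimization then reduces to ordinary training on a loss whose margin is shrunk by `eps * ||w||_2`.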
3 Does invertibility contradict Information Bottleneck?
As shown by engstrom2019learning, robust representations are (almost) invertible, also for out-of-distribution data. Before continuing we define the (weak) inversion:
Definition 3.1 (Inversion).
Let $z = f_w(x)$ be the final representation (before the linear classifier) of an image $x$, and let $f_w$ be the robust feature extractor. The reconstructed image (inversion) is the solution of the following problem:
$$\hat{x} = \arg\min_{x'} \| f_w(x') - z \|_2, \quad (3)$$
where the initial condition of $x'$ is white noise with noise scale $\sigma$. (Footnote 1: in the actual implementation, images are clamped to the valid pixel range.)
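The inversion procedure can be sketched as plain gradient descent on the input. The code below is a toy illustration, not the authors' implementation: the feature extractor is a hypothetical single layer `tanh(W x)` with random weights, the squared distance is used for convenience (it has the same minimizers as the distance itself), and the pixel range `[0, 1]`, the step size and the noise scale are assumptions.

```python
import math
import random

random.seed(0)
D_IN, D_FEAT = 4, 6

# Hypothetical stand-in for a robust feature extractor: f(x) = tanh(W x).
W = [[random.uniform(-0.5, 0.5) for _ in range(D_IN)] for _ in range(D_FEAT)]

def features(x):
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def inversion_loss(x_prime, z):
    """Squared distance between the features of x' and the target z."""
    return sum((fi - zi) ** 2 for fi, zi in zip(features(x_prime), z))

def inversion_grad(x_prime, z):
    """Gradient of the inversion loss w.r.t. x' via the chain rule."""
    f = features(x_prime)
    grad = [0.0] * D_IN
    for i, row in enumerate(W):
        # d tanh(a) / da = 1 - tanh(a)^2
        coeff = 2.0 * (f[i] - z[i]) * (1.0 - f[i] ** 2)
        for j in range(D_IN):
            grad[j] += coeff * row[j]
    return grad

x = [0.9, 0.1, 0.4, 0.7]     # "image" to be reconstructed (toy values)
z = features(x)              # target representation

sigma = 0.1                  # noise scale of the initial condition (assumption)
x_prime = [0.5 + sigma * random.gauss(0.0, 1.0) for _ in range(D_IN)]
initial_loss = inversion_loss(x_prime, z)

for _ in range(2000):
    g = inversion_grad(x_prime, z)
    # Gradient step followed by clamping to the assumed pixel range [0, 1].
    x_prime = [min(1.0, max(0.0, xj - 0.02 * gj)) for xj, gj in zip(x_prime, g)]

final_loss = inversion_loss(x_prime, z)
print(initial_loss, final_loss)
```

Only the input is optimized; the weights of the feature extractor stay fixed throughout.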
Reconstructed examples are shown in the second row of images: the effectiveness of the inversion is apparent, which is odd considering that a classifier should store only information useful for the task. This is surprising, as robust features should discard useful details more than standard models. In fact, let $x$ and $z$ be two random variables, where $z = f(x)$ is such that there exists an (implicit) map $g$ such that $g \circ f = \mathrm{id}$, where $\mathrm{id}$ is the identity map. Then, we have $I(x; z) = \infty$. In other words, when $f$ is an invertible map, the Shannon information $I(x; z)$ is infinite. This fact seems to contradict the results reported in the literature in recent years: how can invertibility and minimality of representations be reconciled? Where is the excess information that explains the gap? With a powerful enough decoder, one could possibly reconstruct the original image perfectly from any layer of any network. The important fact is that the invertibility of robust models empirically proves that it is not necessary to remove information about the input to generalize well invnet. The main problem of standard IB is that it requires operating on the activations during training, and there is no guarantee that information is also reduced at test time; as AT shows, it is not. In support of this, achille2019information shows that it is possible to maintain information about the input at test time while making that information inaccessible to the classifier. In Section 5 we expand on this argument.
4 Robust models: transferability-accuracy trade off?
The insights from the previous section motivate the following argument: if, in robust models, the information is still there, is it possible that features not useful for the original task are useful for other tasks? In a sense, the representation is a well-organized semantic compression of the input that approximately allows one to solve the new task linearly. How well the new task is solved depends on how the representation is organized. In fact, even though the representation is optimal for the source task and for reconstructing the input, it could still be suboptimal for the new task. This intuition suggests that having robust features is more beneficial than having a standard model when the distance between the source and target tasks is such that features from the source model are not easily adaptable to the new task. Thus, there may exist a trade-off between accuracy on a given task and stability to distribution changes: “locally”, standard models work better as feature extractors, but globally this may not be true; in the Appendix we provide a theoretical explanation for this claim. In order to test our hypothesis, we first analyze the structure of representations extracted from adversarially trained models, and then we experimentally analyze how robust features transfer linearly compared with standard features.
Structure of representations
Recently, frosst2019analyzing showed that more entangled features, that is, more class-independent ones, allow for better generalization and robustness. In order to understand the effect of AT, in Figure 1 we show the t-SNE maaten2008visualizing embedding of the final representations for different values of $\varepsilon$: as $\varepsilon$ increases, the entanglement increases at the expense of less discriminative features. Thus, robust models capture more high-level features instead of only the ones useful for the task at hand.
Figure 1: t-SNE of features extracted from a batch of 512 images with a robust ResNet-18 model trained on CIFAR-10 for different values of $\varepsilon$. The color code follows the different classes. As $\varepsilon$ increases, features become less discriminative.
4.1 Transferability experiments
In our experiments we employ CIFAR-10 cifar, CIFAR-100 cifar and ImageNet deng2009imagenet as source datasets. All the experiments are obtained with ResNet-50 and for CIFAR and for ImageNet as described in ilyas2019adversarial and in the Appendix.
In Table 1 we show the performance of fine-tuning for the networks pretrained on CIFAR-10 and CIFAR-100 when transferring to CIFAR-10, CIFAR-100, F-MNIST xiao2017fashion, MNIST mnist and SVHN svhn. Details of the target datasets are given in the Appendix. The results confirm our hypothesis: when a task is “visually” distant from the source dataset, the robust model performs better. For example, CIFAR-10 images are remarkably different from SVHN or MNIST ones. Moreover, as we should expect, the accuracy gap (and thus the distance) is not symmetric: while CIFAR-100 is a good proxy for CIFAR-10, the opposite is not true. In fact, when fine-tuning on a more complex dataset, a robust model makes it possible to leverage features that the standard model would discard. Following cui2018large, we employ the Earth Mover’s Distance (EMD) as a proxy for dataset distance, and we extract the ordering between datasets. As we show in Figure 2, the distance correlates well with the accuracy gap between robust and standard models across all the tasks.
Table 2 shows similar results using models pretrained on ImageNet. The robust model provides better performance on all these benchmarks, since they are quite different from the original task. We also report experiments on more difficult datasets, namely Aircraft aircrafts, Birds birds, Cars cars, Dogs dogs (the Stanford Dogs dataset contains images of 120 breeds of dogs and has been built using images and annotations from ImageNet), Flowers flowers and Indoor indoor67, which would not have been suitable for transferring from simpler tasks like CIFAR-10 and CIFAR-100. Not surprisingly, on these datasets the robust model shows lower accuracy than the standard one, since the images are very similar to those contained in the ImageNet dataset. For example, Dogs images are selected from ImageNet.
Also with ImageNet, as shown by Figure 3, the difference in accuracy between the two models is correlated with distance. We can see that the closer the task, the higher the difference in accuracy in favor of the standard model. For the sake of space, we report similar results for other source and target datasets in the Appendix. These experiments show that, when we know that the target dataset is sufficiently dissimilar from the source task, it is more advantageous to fine-tune adversarially trained models.
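The linear transfer protocol used in these experiments (freeze the backbone, train only a linear classifier on its features) can be sketched as follows. This is a toy illustration under stated assumptions: `frozen_features` is a hypothetical stand-in for a pretrained backbone, the target task is synthetic, and the probe is a binary logistic regression trained with full-batch gradient descent rather than the SGD setup described in the Appendix.

```python
import math

def frozen_features(x):
    """Hypothetical frozen backbone: a fixed nonlinear map that is never updated."""
    return [math.tanh(x[0]), math.tanh(x[1])]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Toy target task: the label depends only on the sign of the first coordinate.
inputs = [(a, b) for a in (-2.0, -1.0, 1.0, 2.0) for b in (-1.0, 0.0, 1.0)]
labels = [1 if a > 0 else 0 for a, _ in inputs]
feats = [frozen_features(x) for x in inputs]   # computed once, backbone is frozen

# Linear probe: these are the only trainable parameters.
w, b = [0.0, 0.0], 0.0
lr = 0.5
n = len(feats)
for _ in range(200):
    gw, gb = [0.0, 0.0], 0.0
    for z, y in zip(feats, labels):
        # Gradient of the cross-entropy w.r.t. the logit is (prediction - label).
        err = sigmoid(w[0] * z[0] + w[1] * z[1] + b) - y
        gw[0] += err * z[0]
        gw[1] += err * z[1]
        gb += err
    w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
    b -= lr * gb / n

predictions = [1 if w[0] * z[0] + w[1] * z[1] + b > 0 else 0 for z in feats]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
print(accuracy)
```

Because the features are computed once and never change, the quality of the probe depends entirely on how the frozen representation organizes the information relevant to the new task.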
Robustness of fine-tuned models
Are the fine-tuned models still robust? As already shown experimentally by ilyas2019adversarial; shafahi2019adversarially, an advantage of using a robust model as a feature extractor is that the new model is then robust for the new task. Indeed, it is sufficient to show that the Fisher of the classifier output is bounded from above by the Fisher of the representations, that is, the linear classifier can only reduce information. Of course, this is true only when fine-tuning only the linear classifier.
Let be the feature extractor, , with , where . Let be the Fisher of its activations about the input. Then, it holds: .
5 Adversarial training reduces information
In this section, we analytically show why, even if the robust network is invertible at test time, it is effectively not invertible as a consequence of noise injected by SGD. This also explains why generalization power can be reduced. We first define the Fisher of representations w.r.t. inputs.
The FIM of representations w.r.t the input distribution is defined as:
where is the sensitivity matrix of the model at a fixed input location .
In the next proposition we relate AT to Equation 4, showing that requiring the stability of the representations w.r.t. input perturbations is equivalent to regularizing the FIM of the representations w.r.t. the input.
Let $\delta$ be a small perturbation such that $\|\delta\| = \varepsilon$. (Footnote 3: the practical implementation only requires $\|\delta\| \le \varepsilon$; however, in practice, for small $\varepsilon$ the norm of $\delta$ is almost always $\varepsilon$.) Then,
Hence, AT is equivalent to regularizing the Fisher of the representations with respect to the inputs. Applying white Gaussian noise instead of adversarial noise, Equation 5 would become , where is the input dimension. It is easy to see that , meaning that Gaussian Noise Regularization (GNR) is upper bounded by AT: the inefficiency of GNR increases as the input dimension increases, so that many directions preserve high curvature. tsipras2018robustness showed that AT, for a linear classification problem with hinge loss, is equivalent to penalizing the -norm of the weights. The next examples show that, when using the cross-entropy loss, penalizing the Fisher yields similar results.
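The gap between adversarial and Gaussian noise can be illustrated on a quadratic form. The sketch below assumes that the penalty induced by a perturbation is δᵀFδ for a symmetric positive semi-definite matrix F, that the adversarial perturbation is constrained to an ℓ2 sphere of radius ε, and that the Gaussian perturbation has covariance (ε²/n)·I so that its expected squared norm is ε². The matrix `J` is a hypothetical random sensitivity matrix, not one computed from a trained model.

```python
import math
import random

random.seed(0)
N_IN, N_OUT = 8, 3

# Hypothetical sensitivity (Jacobian) matrix J; F = J^T J is symmetric PSD.
J = [[random.gauss(0.0, 1.0) for _ in range(N_IN)] for _ in range(N_OUT)]
F = [[sum(J[k][i] * J[k][j] for k in range(N_OUT)) for j in range(N_IN)]
     for i in range(N_IN)]

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def largest_eigenvalue(M, iters=500):
    """Power iteration; for a PSD matrix this approximates lambda_max."""
    v = [1.0 / math.sqrt(len(M))] * len(M)
    for _ in range(iters):
        u = matvec(M, v)
        norm = math.sqrt(sum(ui * ui for ui in u))
        v = [ui / norm for ui in u]
    return sum(vi * ui for vi, ui in zip(v, matvec(M, v)))

eps = 0.1
trace = sum(F[i][i] for i in range(N_IN))
lam_max = largest_eigenvalue(F)

# Worst case of delta^T F delta over ||delta||_2 = eps.
adversarial_penalty = eps ** 2 * lam_max
# Expectation of delta^T F delta for delta ~ N(0, (eps^2 / n) I).
gaussian_penalty = eps ** 2 * trace / N_IN

print(adversarial_penalty, gaussian_penalty)
```

Since the mean eigenvalue never exceeds the largest one, the Gaussian penalty is bounded by the adversarial one, and the ratio between the two can degrade as the input dimension grows while the rank of F stays small.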
Example 5.3 (Binary classification).
Assume a binary classification problem where Let . Then we have:
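As an illustration of this kind of example, the Fisher with respect to the input can be worked out in closed form for a logistic model. The model $p_w(y=1 \mid x) = \sigma(w^\top x)$ is an assumption made here for the derivation, with $\sigma$ the sigmoid function:

```latex
\begin{align*}
p_w(y=1 \mid x) &= \sigma(w^\top x), \qquad p_w(y=0 \mid x) = 1 - \sigma(w^\top x), \\
\nabla_x \log p_w(y=1 \mid x) &= \bigl(1 - \sigma(w^\top x)\bigr)\, w, \qquad
\nabla_x \log p_w(y=0 \mid x) = -\sigma(w^\top x)\, w, \\
F(x) &= \mathbb{E}_{y \sim p_w(y \mid x)}\bigl[\nabla_x \log p_w(y \mid x)\, \nabla_x \log p_w(y \mid x)^\top\bigr] \\
&= \sigma (1-\sigma)^2\, w w^\top + (1-\sigma)\, \sigma^2\, w w^\top
 = \sigma(w^\top x)\bigl(1 - \sigma(w^\top x)\bigr)\, w w^\top, \\
\operatorname{tr} F(x) &= \sigma(w^\top x)\bigl(1 - \sigma(w^\top x)\bigr)\, \|w\|_2^2 .
\end{align*}
```

Under this assumption, penalizing the trace of the Fisher amounts to penalizing the squared ℓ2 norm of the weights, scaled by the predictive variance at each input.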
Previous examples may suggest that with perturbations AT may reduce the -norm of the weights. We trained robust models with different (with the same seed) to verify this claim. We discovered that it is true only for , pointing out that there exist (roughly) two different regimes.
What we are interested in is the relation between the Shannon mutual information and the Fisher information in the activations. However, in adversarial training nothing is stochastic except SGD. In order to deal with deterministic networks, achille2019information introduced the effective information. The idea behind this definition is that, even though the network is deterministic at the end of training, what matters is the noise that SGD injects into the classifier. Thus, the effective information is a measure of the information that the network effectively uses in order to classify. Before continuing, we need to quantify this noise applied to the weights.
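Definition 5.4 (Information in the Weights).
The complexity of the task $\mathcal{D}$ at level $\beta$, using the posterior $Q(w|\mathcal{D})$ and the prior $P(w)$, is
$$C_\beta(\mathcal{D}; P, Q) = \mathbb{E}_{w \sim Q(w|\mathcal{D})}[L_{\mathcal{D}}(p_w(y|x))] + \beta\, \mathrm{KL}(Q(w|\mathcal{D}) \,\|\, P(w)), \quad (6)$$
where $\mathbb{E}_{w \sim Q(w|\mathcal{D})}[L_{\mathcal{D}}(p_w(y|x))]$ is the (expected) reconstruction error of the label under the “noisy” weight distribution $Q(w|\mathcal{D})$; $\mathrm{KL}(Q(w|\mathcal{D}) \,\|\, P(w))$ measures the entropy of $Q(w|\mathcal{D})$ relative to the prior $P(w)$. If $Q^*(w|\mathcal{D})$ minimizes Equation 6 for a given $\beta$, we call $\mathrm{KL}(Q^* \,\|\, P)$ the Information in the Weights for the task $\mathcal{D}$ at level $\beta$.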
Given , the solution of the optimal trade-off is given by the distribution such that . The previous definition tells us that if we perturb uninformative weights, the loss is only slightly perturbed. This means that information in the activations that is not preserved by such perturbations is not used by the classifier.
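The statement that uninformative weights tolerate large perturbations can be seen in a one-dimensional toy version of the trade-off between expected loss and KL divergence. The sketch assumes a single weight, a quadratic loss `0.5 * hessian * (w - w_star)**2`, a Gaussian posterior centered at `w_star` and a zero-mean Gaussian prior; all names and numeric values are hypothetical.

```python
import math

def complexity(variance, hessian, w_star, beta, prior_var):
    """Expected loss under N(w_star, variance) plus beta * KL to N(0, prior_var).

    With the assumed quadratic loss, the expected loss under the noisy
    weight is 0.5 * hessian * variance.
    """
    expected_loss = 0.5 * hessian * variance
    kl = 0.5 * (math.log(prior_var / variance)
                + (variance + w_star ** 2) / prior_var - 1.0)
    return expected_loss + beta * kl

def optimal_variance(hessian, beta, prior_var):
    """Stationary point of the 1-D complexity w.r.t. the variance."""
    return beta / (hessian + beta / prior_var)

beta, prior_var, w_star = 0.1, 10.0, 1.0

# High curvature: the weight is informative and tolerates little noise.
sharp = optimal_variance(hessian=50.0, beta=beta, prior_var=prior_var)
# Low curvature: the weight is uninformative and tolerates a lot of noise.
flat = optimal_variance(hessian=0.01, beta=beta, prior_var=prior_var)

# Sanity check of the closed form against a coarse geometric grid search.
grid = [1e-4 * 1.05 ** k for k in range(300)]
grid_best = min(grid, key=lambda v: complexity(v, 50.0, w_star, beta, prior_var))

print(sharp, flat, grid_best)
```

In this toy setting the optimal noise variance is inversely related to the curvature of the loss, so directions that barely affect the loss carry almost no information at the chosen level β.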
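Definition 5.5 (Effective Information in the Activations achille2019information). Let $w$ be the weights, and let $n \sim \mathcal{N}(0, \Sigma_w^*)$, with $\Sigma_w^* = \beta F^{-1}(w)$, be the optimal Gaussian noise minimizing Equation 6 at level $\beta$ for a uniform prior. We call effective information (at noise level $\beta$) the amount of information about $x$ that is not destroyed by the added noise:
$$I_{\mathrm{eff},\beta}(x; z) = I(x; z_n), \quad (7)$$
where $z_n = f_{w+n}(x)$ are the activations computed by the perturbed weights $w + n \sim \mathcal{N}(w, \Sigma_w^*)$.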
By Prop 4.2(i) in achille2019information, the relation between $F_{z|x}$ and the effective information is given by:
$$I_{\mathrm{eff},\beta}(x; z) \approx H(x) - \mathbb{E}_x\left[\frac{1}{2} \log \frac{(2\pi e)^k}{|F_{z|x}|}\right], \quad (8)$$
where $H(x)$ is the entropy of the input distribution and $k$ is the dimension of $x$. Equation 8 shows that AT compresses data similarly to IB. With AT, the noise is injected in the input and not only in the weights. In order to reduce the effective information that the representations have about the input (relative to the task), it is sufficient to decrease $|F_{z|x}|$, that is, to increase $\varepsilon$. In the Supplementary Material, we show how details about $x$ are discarded when varying $\varepsilon$.
A benefit of robust networks is that it is possible to visualize the images that are effectively “seen” by the classifier. This, together with Definition 5.5, suggests the following definition.
Definition 5.6 (Effective image).
Let , and let be the model trained with . We define effective image at level , the solution of the following problem:
where and .
The idea behind effective images is to simulate the training conditions by artificially injecting noise that approximates that of SGD. In this manner we can visualize how AT controls the information conveyed by the images. In Figure 6 we show some examples. Interestingly, robust features are not always good features: in fact, due to the poor diversity of the dataset (CIFAR-10), the color green is highly correlated with the class frog.
AT reduces the information in the weights
We showed that AT reduces the effective information about the input in the activations. However, achille2019information showed that, to have guarantees about generalization and invariance to nuisances at test time, one has to control the trade-off between sufficiency for the task and the information the weights have about the dataset. A natural question to ask is whether reducing information in the activations implies reducing information in the weights, that is, the mutual information between the weights and the dataset. The connection between weights and activations is given by the following formula:
Decreasing the Fisher Information that the weights contain about the training set decreases the effective information between inputs and activations. However, the vice-versa may not be true in general. In fact, it is sufficient that decreases. Indeed, this fact was used in several works to enhance model robustness virmaux2018lipschitz; fazlyab2019efficient. However, as we show in Section 4.1, AT reduces information in the features as the embedding defined by , that is, the log-variance of parameters is increased when increasing the applied on training. Experiments are done with a ResNet-18 on CIFAR-10. Interestingly, this provides the evidence that it is possible to achieve robustness without reducing .
Adding effective noise (once) improves inversion
The quality of inversion depends on the capability of gradient flow to reach the target representation. The loss landscape may be less smooth when starting from regions that are distant from training and test points. Intuitively, especially during the first phase of optimization, it can be beneficial to inject noise to escape from local minima. Surprisingly, we discover that by injecting effective noise once, reconstruction is much faster and the quality of the images improves dramatically. More practically, at the beginning of optimization we perturb the weights once with effective noise and solve the inversion with the perturbed model. By visually comparing rows 2 and 3 of Figure 6, it is easy to see that injecting noise as described above improves the quality of reconstruction.
In support of this, in Figure 7 we numerically assess the quality of representations using the inversion loss. The variational model, besides improving the quality of fit, also allows fast convergence: convergence is achieved after roughly 200 iterations, while the deterministic model converges after 8k iterations.
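The mechanics of inverting with weights perturbed once can be sketched as follows. This is a toy illustration only: the feature extractor is a hypothetical single layer `tanh(W x)`, the noise is isotropic Gaussian with an assumed scale `noise_std` (the effective noise discussed in the text is not isotropic in general), and the target representation is assumed to come from the unperturbed weights. The sketch shows the procedure, not the reported speed-up.

```python
import math
import random

random.seed(1)
D_IN, D_FEAT = 4, 6

def make_extractor(weights):
    """Return f(x) = tanh(weights x), a hypothetical stand-in for the backbone."""
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
                for row in weights]
    return f

def invert(weights, z, x_init, steps=1000, lr=0.01):
    """Gradient descent on ||f(x') - z||^2; returns (x', initial loss, final loss)."""
    f = make_extractor(weights)

    def loss(x):
        return sum((fi - zi) ** 2 for fi, zi in zip(f(x), z))

    x_prime = list(x_init)
    start = loss(x_prime)
    for _ in range(steps):
        feats = f(x_prime)
        grad = [0.0] * D_IN
        for i, row in enumerate(weights):
            coeff = 2.0 * (feats[i] - z[i]) * (1.0 - feats[i] ** 2)
            for j in range(D_IN):
                grad[j] += coeff * row[j]
        x_prime = [xj - lr * gj for xj, gj in zip(x_prime, grad)]
    return x_prime, start, loss(x_prime)

W = [[random.uniform(-0.5, 0.5) for _ in range(D_IN)] for _ in range(D_FEAT)]
x = [0.9, 0.1, 0.4, 0.7]
z = make_extractor(W)(x)          # target computed with the clean weights

noise_std = 0.05                  # assumed isotropic noise scale
# The noise is sampled a single time, before the optimization starts.
W_noisy = [[wij + random.gauss(0.0, noise_std) for wij in row] for row in W]

x_init = [0.5 + 0.1 * random.gauss(0.0, 1.0) for _ in range(D_IN)]
x_rec, initial_loss, final_loss = invert(W_noisy, z, x_init)
print(initial_loss, final_loss)
```

The perturbed weights are kept fixed for the whole optimization, which distinguishes this procedure from resampling the noise at every step.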
Existing works on robust models madry2017towards; ilyas2019adversarial; tsipras2018robustness showed that there exists a trade-off between robustness of representations and accuracy for the task. This paper extends this property, showing that the parameters of robust models are the solution of a trade-off between the usability of features for other tasks and accuracy on the source task.
By leveraging results in achille2019information; achille2018emergence, we show that AT has a compression effect similar to IB, and we explain how a network can be invertible and still lose accuracy on the task. Moreover, we show that AT also reduces information in the weights, extending the notion of effective information from perturbations of the weights to perturbations of the input.
We also show that effective noise can be useful to improve the reconstruction of images, both in terms of convergence and quality of reconstruction.
Finally, we provided an analytic argument which explains why robust models can be better at transferring features to other tasks. As a corollary of our analysis, to train a generic feature extractor for several tasks, it is best to train adversarially, unless one already knows the specific task for which the features are going to be used.
The impact of this work lies in the efficiency of adapting existing models to new tasks where only a small or insufficient number of training examples is given. In fact, real-world applications, where data are much scarcer than in typical benchmark comparisons, can benefit from our approach. The limitation of this work is that it does not provide criteria to decide whether to use a standard or a robust model as source: this choice is left to users, who must decide which approach to prefer depending on the application at hand.
Appendix A Proofs of propositions
In the following we prove Lemma 4.1.
Proof of Lemma 4.1. The intuition behind this lemma is very similar to the Data Processing Inequality (DPI). If the representation is robust, any map applied on top of it can only use robust information. As shown, for example, in [zegers2015fisher, Theorem 13], the DPI also holds for the Fisher Information. ∎
Although the previous lemma is very simple, it has remarkable consequences: as soon as one is able to extract robust features at some level of the “chain”, all the information extracted from these features is robust. For example, [ilyas2019adversarial] shows that training on images obtained from robust models leads to a robust model, without applying AT. In this case, the robust features are directly the images.
Appendix B Experimental setting
To quantitatively evaluate the improved transferability provided by robust models, we perform experiments on common benchmarks for object recognition. In more detail, we fine-tune three networks pretrained on CIFAR-10, CIFAR-100 and ImageNet.
We used the pretrained robust ResNet-50 models on CIFAR-10 (with ) and ImageNet (with ) from [ilyas2019adversarial]. Similarly, we trained on CIFAR-100 with steps of PGD iterations with .
We fine-tune in different modalities: 0) both the linear classifier and the batch norm before it, 1) both the linear classifier and the batch norm of the entire network, 2) the entire network. We then compare the top-1 accuracy on the test set of the different models. We assess the transferability performance using a ResNet-50. For CIFAR-10 and CIFAR-100, fine-tuning is done for 120 epochs using SGD with batch size 128 and a learning rate that starts from 1e-2 and drops to 1e-3 and 1e-4 at epochs 50 and 80, respectively. We use weight decay 5e-4. For ImageNet, fine-tuning is done for 300 epochs with batch size 256, the same learning rate decay at epochs 150 and 250, respectively, and weight decay 1e-4. We use momentum acceleration with parameter 0.9 for all datasets.
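The step schedule described above can be written as a small lookup function. The function and constant names are hypothetical, and the sketch assumes that the ImageNet schedule uses the same three learning-rate values as the CIFAR one, only with different milestones.

```python
def learning_rate(epoch, milestones, values=(1e-2, 1e-3, 1e-4)):
    """Piecewise-constant schedule: values[i] is used until milestones[i] is reached."""
    for milestone, value in zip(milestones, values):
        if epoch < milestone:
            return value
    return values[-1]

CIFAR_MILESTONES = (50, 80)        # 120 epochs in total
IMAGENET_MILESTONES = (150, 250)   # 300 epochs in total

for epoch in (0, 49, 50, 79, 80, 119):
    print(epoch, learning_rate(epoch, CIFAR_MILESTONES))
```

The values are looked up rather than multiplied by a decay factor, which avoids accumulating floating-point error in the learning rate.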
In Table 3 we report the description of the datasets used in this paper.
| Dataset | Task category | Classes | Training size | Test size |
|---|---|---|---|---|
| ImageNet [deng2009imagenet] | general object detection | 1000 | 1,281,167 | 50,000 |
| CIFAR-10 [cifar] | general object detection | 10 | 50,000 | 10,000 |
| CIFAR-100 [cifar] | general object detection | 100 | 50,000 | 10,000 |
| MNIST [mnist] | handwritten digit recognition | 10 | 60,000 | 10,000 |
| F-MNIST [xiao2017fashion] | clothes classification | 10 | 60,000 | 10,000 |
| SVHN [svhn] | house number classification | 10 | 73,257 | 26,032 |
| Oxford Flowers [flowers] | fine-grained object recognition | 102 | 2,040 | 6,149 |
| CUB-Birds 200-2011 [birds] | fine-grained object recognition | 200 | 5,994 | 5,794 |
| FGVC Aircrafts [aircrafts] | fine-grained object recognition | 100 | 6,667 | 3,333 |
| Stanford Cars [cars] | fine-grained object recognition | 196 | 8,144 | 8,041 |
| Stanford Dogs [dogs] | fine-grained object recognition | 120 | 12,000 | 8,580 |
| MIT Indoor-67 [indoor67] | scene classification | 67 | 5,360 | 1,340 |
Appendix C Image reconstruction
Algorithm 1 shows the procedure to compute effective images (see Definition 5.6), while Algorithm 2 shows the procedure to compute the variational inversion, where the noise is sampled once (see Section 5).
C.2 Effect of $\varepsilon$ on the inversion
Figure 8 shows the effect of training with different values of $\varepsilon$ on the image reconstruction.
Appendix D Effective transferable information
We provide a theoretical intuition about the transferability of robust models.
Since AT reduces , it reduces the information that the network has about the dataset . In fact:
From the previous proposition we can see that there are two ways of reducing the information . The first is reducing and the other is making the weights more stable with respect to perturbations of the dataset. For example, the latter can be accomplished by choosing a suitable optimization algorithm or a particular architecture. Reducing the Fisher implies that the representations vary less when perturbing the dataset with . This explains the fact that AT is more robust to distribution shifts. We would like to remark again that there are two ways of transferring better: one is to reduce and the other one is to reduce .
Finally, the previous argument does not imply that robust training is better at transferring, but that it is more stable, making it more likely that robust models are better when target tasks are distant from the source task.
Appendix E Omitted tables and figures
We test the trivial hypothesis that standard models are better at transferring features when the source and target distributions are nearly the same: we choose CIFAR-10 as source dataset and CINIC-10 [darlow2018cinic] as target dataset, removing the images in common with CIFAR-10. The remaining images are extracted from ImageNet. We call this dataset CINIC-IMAGENET. As [darlow2018cinic] shows, the pixel statistics are very similar, and in fact the standard model performs better at linear transfer:
Appendix F Transfer with all modes
While our aim is to show that robust models have better linear transferability than standard ones, we also report here results for fine-tuning in modalities 1 and 2 (Tables 5 to 9 and Figures 10 to 14). Of course, the performance gap in these cases is reduced compared to mode 0 (see Tables 7, 8 and 9), since the network can change more to adapt to the new task. Interestingly, we notice a substantial impact of the batch norm layers on classification performance: mode 1 provides a significant boost in classification accuracy compared to mode 0, particularly when the network is pretrained on simple datasets (CIFAR-10, CIFAR-100), even though the parameters of the feature extractor are still kept fixed and only the batch norm layers in the entire network are fine-tuned.
F.1 Architecture impact
We report here a comparison of transfer performance using two different architectures, namely ResNet-50 and ResNet-18, both trained on CIFAR-100, to assess the impact of network capacity. It is noticeable that with the more complex network (ResNet-50) the gap is reduced in cases where the standard model is better and increased in cases where the robust one is better.