1 Introduction
The idea of an information bottleneck (IB) (Tishby et al., 1999) is to learn a compressed representation Z of an input X that is predictive of a target Y. This leads to the following training objective involving two mutual information terms:

(1)   min_{p(z|x)}  β I(Z; X) − I(Z; Y)

This objective favours a representation Z that retains the minimum amount of information about X while being maximally predictive of Y. The hyperparameter β controls the tradeoff between the two losses.
To make this objective practical, Alemi et al. (2017) used variational techniques to construct an upper bound on the expression in Eq. 1 – also known as the Variational Information Bottleneck (VIB) loss:

(2)   L_VIB = E_{p(x,y)} [ −E_{e(z|x)} [log q(y|z)] + β KL( e(z|x) || r(z) ) ]

where e(z|x) is a stochastic encoder distribution, q(y|z) is a variational approximation to p(y|z), and r(z) is a variational approximation to the marginal p(z). In a similar way to the Variational AutoEncoder (VAE) setup (Kingma & Welling, 2014), we can parameterize the Gaussian encoder e(z|x) and the classifier q(y|z) using neural networks, and fix r(z) to be a K-dimensional standard Gaussian N(0, I_K), where K is the size of the bottleneck layer. We can then use the reparameterization trick to learn the parameters of the neural networks when optimizing a stochastic estimate of the objective in Eq. 2.
A tighter bound on the IB objective is given by the Conditional Entropy Bottleneck (CEB) (Fischer & Alemi, 2020):
(3)   L_CEB = E_{p(x,y)} [ −E_{e(z|x)} [log q(y|z)] + γ KL( e(z|x) || b(z|y) ) ]

where the second term uses a class-conditional variational marginal b(z|y), and γ is a hyperparameter with the same role as β in Eq. 2. CEB parameterizes b(z|y) by a linear mapping that takes a one-hot label y as input and outputs a vector representing the mean of the Gaussian. CEB uses an identity matrix for the covariance of both b(z|y) and e(z|x), which is unlike VIB, where the variance of the encoder distribution is not fixed.

Multiple studies suggest that IBs can reduce overfitting and improve robustness to adversarial attacks (Alemi et al., 2017; Fischer & Alemi, 2020; Kirsch et al., 2021). For example, Fischer & Alemi (2020) showed that CEB models can outperform adversarially trained models under both ℓ∞ and ℓ2 PGD attacks (Madry et al., 2018) while also incurring no drop in standard accuracy. However, no clear explanation has been found as to how IB models become more robust to adversarial examples. Previous works also failed to investigate possible effects of gradient obfuscation, which could lead to a false sense of security (Athalye et al., 2018). In this paper, we continue the analysis of the behaviour of IB models in the context of adversarial robustness. Our experiments provide evidence of gradient obfuscation, which leads us to conclude that the adversarial robustness of IB models was previously overestimated.
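Because CEB fixes the covariance of both e(z|x) and b(z|y) to the identity, the KL term in Eq. 3 collapses to half the squared distance between the two means. A minimal sketch, where the `label_embedding` matrix stands in for CEB's linear mapping from one-hot labels (names are illustrative):

```python
import numpy as np

def ceb_rate(mu_e, label_embedding, y):
    """KL( e(z|x) || b(z|y) ) for two Gaussians with identity covariance.

    With equal (identity) covariances, the KL between N(mu_e, I) and
    N(mu_b, I) reduces to 0.5 * ||mu_e - mu_b||^2.
    label_embedding : [num_classes, K] matrix; row y is the mean of b(z|y)
    """
    mu_b = label_embedding[y]  # linear map applied to the one-hot label
    return 0.5 * np.sum((mu_e - mu_b) ** 2)
```

The rate is zero exactly when the encoder mean coincides with the class embedding.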
2 Adversarial robustness
Since the discovery of adversarial examples for neural networks (Szegedy et al., 2014; Biggio et al., 2013), there has been a lot of interest in creating new attacks and defenses. In this section we briefly review methods for crafting norm-bounded adversarial examples. Later, we use these methods to assess the adversarial robustness of IB models.
The Fast Gradient Sign (FGS) attack (Goodfellow et al., 2015) is an ℓ∞-bounded single-step attack that computes an adversarial example as x_adv = x + ε · sign(∇_x L(x, y)), where x is the original image, y is the true label, L is the cross-entropy loss, and ε is the perturbation size.
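The single FGS step can be sketched as follows (a minimal numpy sketch; `grad` is assumed to be the loss gradient computed elsewhere, and the clipping range is illustrative):

```python
import numpy as np

def fgs(x, grad, eps, low=0.0, high=1.0):
    """Fast Gradient Sign attack: x_adv = clip(x + eps * sign(grad)).

    grad is the gradient of the cross-entropy loss w.r.t. the input x;
    the result is clipped back to the valid image range [low, high].
    """
    return np.clip(x + eps * np.sign(grad), low, high)
```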
Projected Gradient Descent (PGD) (Madry et al., 2018) is the multi-step variant of FGS. The PGD attack finds an adversarial example by following the iterative updates x^{t+1} = Π_{B_ε(x)}( x^t + α · sign(∇_x L(x^t, y)) ) for some fixed number of steps T. Here, Π_{B_ε(x)} is a projection operator onto B_ε(x) – the ℓ∞ ball of radius ε around the original image x. The attack starts from an initial point x^0 sampled randomly within B_ε(x).
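The iterative updates above can be sketched as below; this is a minimal numpy sketch in which `grad_fn` stands in for the model's loss gradient, and the projection onto the ℓ∞ ball is an elementwise clip.

```python
import numpy as np

def pgd(x, y, grad_fn, eps, alpha, steps, rng, low=0.0, high=1.0):
    """Projected Gradient Descent inside an l_inf ball of radius eps.

    grad_fn(x_adv, y) returns the gradient of the loss w.r.t. the input.
    Starts from a random point inside the ball (a single 'restart').
    """
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random init in B_eps(x)
    x_adv = np.clip(x_adv, low, high)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto B_eps(x)
        x_adv = np.clip(x_adv, low, high)         # keep a valid image
    return x_adv
```

With enough steps and a constant-sign gradient, the iterate saturates at the boundary of the ball.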
The reliability of PGD attacks often depends on the choice of parameters such as the step size α or the type of loss L. Recent PGD variants are designed to be less sensitive to these choices, and it is common to run an ensemble of attacks with different parameters and properties. AutoAttack (Croce & Hein, 2020) and MultiTargeted (Gowal et al., 2019) are examples of this strategy.
3 Experiments
In this section, we experiment with VIB and CEB models on MNIST and CIFAR-10. We run a number of diagnostics, which indicate that gradient obfuscation is the main reason why IB models are seemingly robust. In trying to understand their failure modes, we also look at some toy problems. Our interpretation of the results is deferred to the next section. Hyperparameters of all our models and additional plots are included in the appendix.
3.1 MNIST
For VIB experiments on MNIST, we follow the setup of Alemi et al. (2017). Namely, for the encoder network, we use a 3-layer MLP whose last layer is the bottleneck of size K. This bottleneck layer outputs the means and standard deviations (after a softplus transformation) of the Gaussian e(z|x). The decoder distribution q(y|z) over 10 classes is parameterized by a linear layer ending with a softmax. During training, we use the reparameterization trick (Kingma & Welling, 2014) with multiple samples from the encoder when estimating the expectation over z in Eq. 2. At test time, we also collect multiple samples z ~ e(z|x), and compute p(y|x) as the average of q(y|z) over these samples. We refer to this evaluation as the stochastic mode. In the mean mode, we only use the mean of e(z|x) as an input to the decoder. Our deterministic baseline is an MLP of the same overall structure as the VIB model. We train it with a cross-entropy loss without any additional regularization.
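The two evaluation modes can be sketched as follows. This is an illustrative numpy sketch: `logits_fn` stands in for the decoder, and the default sample count is a free parameter here, not necessarily the value used in our experiments.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e)

def predict_stochastic(mu, sigma, logits_fn, rng, n_samples=12):
    """Stochastic mode: average the decoder distribution q(y|z) over
    several samples z ~ e(z|x)."""
    probs = np.zeros_like(softmax(logits_fn(mu)))
    for _ in range(n_samples):
        z = mu + sigma * rng.standard_normal(mu.shape)
        probs += softmax(logits_fn(z))
    return probs / n_samples

def predict_mean(mu, logits_fn):
    """Mean mode: feed only the mean of e(z|x) to the decoder."""
    return softmax(logits_fn(mu))
```

Because the stochastic mode averages probabilities rather than logits, the two modes generally disagree for nonlinear decoders.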
First, we evaluate our models using the FGS attack. Figure 1 shows the robust accuracy of VIB models with varying β under the FGS attack with different perturbation sizes ε. For the rest of the paper, we assume that input images are in the [0, 1] range. Our results differ slightly from those of Alemi et al. (2017). In particular, the performance of our VIB models peaks at a different value of β than the one reported previously, and the evaluation in the mean mode and the stochastic mode does not lead to the same results.
Despite these differences, we can still achieve large gains in robust accuracy under the FGS attack for VIB models in comparison to the baseline. One result that stands out is the unusually high robust accuracy under the attack with ε = 0.5. Indeed, with this perturbation size, one can design an attack that makes all images solid gray and, as such, the classifier should not do better than random guessing (Carlini et al., 2019). The obtained robust accuracy above 10% indicates that gradients of VIB models do not always direct us towards stronger adversarial examples. To check whether the improvements in robust accuracy generalize to stronger attacks, we evaluate VIB models under the PGD attack with 40 steps and a varying number of restarts. Figure 2 shows that we can drive the robust accuracy to zero as we increase the number of restarts. This is an indication of gradient obfuscation, as the loss landscape cannot be efficiently explored by gradient-based methods (Carlini et al., 2019; Croce & Hein, 2020).

3.2 CIFAR-10
For CIFAR-10, as our encoder network we use a PreActivation-ResNet18 (He et al., 2016) followed by an MLP with the same architecture as in the MNIST experiments. We train this network end-to-end, and only use random crops and flips to augment the data. As before, we construct an analogous deterministic model that we do not regularize in any way, and thus it overfits.
In Figure 2(a), we evaluate the adversarial robustness of CEB models under the PGD attack with 20 steps (Madry et al., 2018). It is surprising that some of our deterministic models can outperform an adversarially trained ResNet from Madry et al. (2018) with a reported robust accuracy of 45.8%. This result alone suggests that PGD attacks should be used with caution when evaluating models that might obfuscate gradients. As with MNIST, we can again significantly reduce the robust accuracy by increasing the number of restarts, as shown in Figure 2(b).
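The restart logic behind these evaluations is simple: a point counts as robust only if every random restart of the attack fails. A sketch with placeholder `attack_fn` and `predict_fn`:

```python
import numpy as np

def is_robust(x, y, attack_fn, predict_fn, n_restarts, rng):
    """A test point is counted as robust only if *every* restart of the
    attack fails to change the prediction; one success breaks robustness."""
    for _ in range(n_restarts):
        x_adv = attack_fn(x, y, rng)
        if predict_fn(x_adv) != y:
            return False
    return True
```

More restarts can only lower the measured robust accuracy, which is why restarts expose obfuscated gradients.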
To get a better estimate of the robust accuracy in the presence of gradient obfuscation, we use a set of stronger attacks: a mixture of AutoAttack (AA) and MultiTargeted (MT) (Croce & Hein, 2020; Gowal et al., 2019). We execute the following sequence of attacks: AutoPGD on the cross-entropy loss with 5 restarts and 100 steps, AutoPGD on the difference-of-logits-ratio loss with 5 restarts and 100 steps, and MultiTargeted on the margin loss with 10 restarts and 200 steps. From Figure 2(c), we see that deterministic models have zero robust accuracy, while the performance of CEB models varies across models with different random seeds. This dependence on the seed could be the consequence of suboptimal network initialization and difficulties related to training IB models. Some part of the variance in the robust accuracy might still be attributed to having an imperfect attack due to the unreliable gradients.

Finally, in Figure 4 and in the appendix, we show typical loss landscapes produced by the CEB model that scored 15.8% accuracy under the AA+MT ensemble of attacks. These plots are strikingly different from the typical smooth, non-flat loss landscapes obtained from adversarially trained models (Qin et al., 2019). The flatness of the plotted landscapes explains why gradient-based attacks with the cross-entropy loss are not as effective. Moreover, since IB losses do not explicitly penalize misclassification for perturbed inputs within a certain norm ball, the model is free to choose where to place its decision boundaries. Figure 4 suggests that CEB models could be robust to much smaller perturbation radii.
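Evaluating against an ensemble like AA+MT amounts to taking the worst case over attacks on each point. A minimal sketch of the accounting (the attacks and predictor are placeholders):

```python
def robust_accuracy(points, labels, attacks, predict_fn):
    """Robust accuracy under an ensemble of attacks: a point survives only
    if every attack in the sequence fails, so the ensemble is as strong as
    its strongest member on each individual point."""
    survived = 0
    for x, y in zip(points, labels):
        if all(predict_fn(attack(x, y)) == y for attack in attacks):
            survived += 1
    return survived / len(points)
```

Adding attacks to the ensemble can only decrease the reported robust accuracy, never increase it.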
3.3 A toy problem
We established that gradient obfuscation makes it harder to understand the robustness properties of IB models on real datasets. Thus, analysing toy examples can be a useful alternative. A classification task from Tsipras et al. (2019) is one example that can motivate the use of IBs, where their ability to ignore irrelevant features becomes helpful. We study this problem in the appendix. Here, we consider another simple setup where labels y are sampled uniformly at random from {−1, +1}, and two features x₁ and x₂ are sampled from label-conditional distributions in which each feature's sign tends to agree with y, and x₁ lies, on average, further from zero than x₂.
In this example, the label can be predicted from the sign of x₁, so in the optimal IB case, we need to communicate only 1 bit of information about the input. The first feature is also more robust, since it requires a larger perturbation before its sign gets flipped. In practice, we found that a simple VIB classifier does not exclusively focus on x₁, and so it becomes prone to a rather trivial attack that subtracts from or adds to x₂ depending on the label, as shown in Figure 5. This could be the consequence of SGD training, the approximate nature of the objective function, VIB's formulation as a combination of competing objectives, or other reasons we do not yet understand.
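The trivial attack on this toy task can be written down directly. A minimal numpy sketch; the feature layout (x₁ first, x₂ second) and the perturbation size are illustrative:

```python
import numpy as np

def attack_weak_feature(x, y, eps):
    """Push the weakly informative feature x2 against the sign of the
    label y in {-1, +1}, leaving the robust feature x1 untouched."""
    x_adv = x.copy()
    x_adv[1] -= y * eps
    return x_adv
```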
4 Discussion
By re-evaluating the adversarial robustness of VIB and CEB models, we have shown that weak adversarial attacks are often unable to provide reliable robustness estimates, as these models create highly non-smooth loss surfaces, which are harder to explore with gradients. Therefore, we believe that previous, as well as future, results on the robustness of IB models should include basic checks for gradient obfuscation. This is especially important when comparing different types of models, e.g. IBs versus adversarial training.
Our experiments were inconclusive as to whether IB models offer adversarial robustness gains relative to the undefended deterministic baseline. For MNIST, the results under the FGS attack seemed promising. However, looking at the performance under the PGD attack with multiple restarts and different perturbation sizes showed a different picture. For CIFAR-10, some of the CEB models were significantly better than the baseline under the strongest attack. However, we did not identify the exact cause of the excessive variance in the results of models with different random seeds. Thus, it would be interesting to find regimes where CEB can reliably converge to more robust models.
In this paper, we only considered IB models in discriminative settings. A generative model related to VIB is the β-VAE (Higgins et al., 2017). For autoencoders, an adversarial attack amounts to finding inputs that cause the decoder to reconstruct a visually distinct image, e.g. an object from a different class. Camuto et al. (2021) showed that a β-VAE trained with larger values of β is more robust to adversarial attacks. However, Kuzina et al. (2021) used a different set of evaluation metrics to challenge this claim. Cemgil et al. (2020) attribute the lack of robustness of VAE models to the inability of their objective to control the behaviour of the encoder outside of the support of the empirical data distribution. Namely, without additionally forcing the encoder to be smooth, tuning β alone is not enough for learning robust representations. Together with our observations for VIB and CEB models, the disagreement about the VAE results corroborates the need for more nuanced evaluation before adversarial robustness claims can be made.

Overall, we believe that using IBs in the context of adversarial robustness is an idea that deserves further exploration. In this paper, we focused on the empirical evaluation of IB models under standard robustness metrics and on illustrating the caveats related to it. An interesting future research direction would be to understand the properties of IB models, especially in the stochastic regime, from both information-theoretic and adversarial robustness perspectives. Another promising direction would be to explore IBs with additional curvature regularization (Moosavi-Dezfooli et al., 2019; Qin et al., 2019) or in combination with adversarial training.
Acknowledgements
We would like to thank Taylan Cemgil, Lucas Theis, Hubert Soyer, Jonas Degrave, and the wonderful people from the robustness teams at DeepMind for their help with this project, interesting questions, valuable discussions, and feedback on the paper.
References
 Alemi et al. (2017) Alemi, A., Fischer, I., Dillon, J., and Murphy, K. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
 Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.
 Biggio et al. (2013) Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, 2013.
 Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

 Camuto et al. (2021) Camuto, A., Willetts, M., Roberts, S., Holmes, C., and Rainforth, T. Towards a theoretical understanding of the robustness of variational autoencoders. In International Conference on Artificial Intelligence and Statistics, 2021.
 Carlini et al. (2019) Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I. J., Madry, A., and Kurakin, A. On evaluating adversarial robustness. ArXiv, abs/1902.06705, 2019.
 Cemgil et al. (2020) Cemgil, T., Ghaisas, S., Dvijotham, K. D., and Kohli, P. Adversarially robust representations with smooth encoders. In International Conference on Learning Representations, 2020.
 Croce & Hein (2020) Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, 2020.
 Fischer & Alemi (2020) Fischer, I. and Alemi, A. A. CEB improves model robustness. Entropy, 22(10), 2020.
 Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
 Goodfellow et al. (2015) Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 Gowal et al. (2019) Gowal, S., Uesato, J., Qin, C., Huang, P.S., Mann, T. A., and Kohli, P. An alternative surrogate loss for PGDbased adversarial testing. ArXiv, abs/1910.09338, 2019.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
 Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. AutoEncoding Variational Bayes. In International Conference on Learning Representations, 2014.
 Kirsch et al. (2021) Kirsch, A., Lyle, C., and Gal, Y. Unpacking information bottlenecks: Unifying informationtheoretic objectives in deep learning. ArXiv, abs/2003.12537, 2021.
 Kuzina et al. (2021) Kuzina, A., Welling, M., and Tomczak, J. M. Diagnosing vulnerability of variational autoencoders to adversarial attacks. ArXiv, abs/2103.06701, 2021.

 Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
 Moosavi-Dezfooli et al. (2019) Moosavi-Dezfooli, S.-M., Fawzi, A., Uesato, J., and Frossard, P. Robustness via curvature regularization, and vice versa. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 Polyak & Juditsky (1992) Polyak, B. and Juditsky, A. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30:838–855, 1992.
 Qin et al. (2019) Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems, 2019.
 Stutz et al. (2019) Stutz, D., Hein, M., and Schiele, B. Disentangling adversarial robustness and generalization. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control and Computing, 1999.

 Tsipras et al. (2019) Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
 Yang et al. (2020) Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R., and Chaudhuri, K. A closer look at accuracy vs. robustness. In Advances in Neural Information Processing Systems, 2020.
Appendix
Toy example
In Figure 6, we plot 1K samples from the data distribution outlined in Section 3.3. We train both a deterministic and a VIB model. For the deterministic model, we used a linear classifier whose weights and biases were initialized with zeros. For the VIB model, we used a small bottleneck, with the weights of the encoder initialized with the Xavier uniform scheme (Glorot & Bengio, 2010). The linear decoder's weights were initialized to zero. We use 12 samples from e(z|x) during training as well as for the stochastic evaluation mode. We optimize the parameters of both the linear deterministic model and the VIB model using SGD with momentum and Nesterov updates. We train with mini-batches resampled from the data distribution at each iteration, using the same random seed for both models. We evaluated clean and robust accuracy on a fixed set of 10K samples.
To this end, considering the discussion in Stutz et al. (2019) and Tsipras et al. (2019), we create a precomputed set of adversarial examples by sampling exclusively from the low-density regions, cf. Figure 6. This emulates adversarial examples that directly attack the feature with weak correlation, and is reasonable due to the low dimensionality of the input (only two features). It also means that we do not consider classical ε-constrained adversarial examples. This is because, even for small ε, such adversarial examples are not guaranteed to preserve the original label. This setup is also complementary to the toy example by Tsipras et al. (2019) discussed below.
Toy example from Tsipras et al. (2019)
For our second toy problem, we consider a binary classification task from Tsipras et al. (2019), where the label y is sampled from {−1, +1} uniformly at random, and the features x₁, …, x_{d+1} are distributed as:

(4)   x₁ = +y with probability p and −y with probability 1 − p;   x₂, …, x_{d+1} ~ N(η·y, 1)

We fix the parameters p, d, and η. An adversarial attack with ε = 2η can shift the Gaussian features towards the opposite class, so that each becomes distributed as N(−η·y, 1). Thus, it becomes easy to fool a classifier that relies on these features. Note, however, that this might also change the true label according to the data distribution. Nevertheless, one might expect that IB models are more robust in this case, since the compression cost forces them to focus on x₁ – the feature that is highly predictive of the label. Indeed, if we look at Figure 7, this seems to be the case, but oddly, only for stochastic evaluation. Note that Tsipras et al. (2019) constructed this problem to demonstrate that clean and robust accuracy can be at odds with each other, and this is what we also see in Figure 7. There are, however, doubts whether this toy example can reflect what happens in real-world scenarios (Yang et al., 2020; Stutz et al., 2019).
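The construction in Eq. 4 and the shift attack can be sketched as follows. This is a minimal numpy sketch; the specific p, d, and η values in the test below are illustrative, not the values used in our experiments.

```python
import numpy as np

def sample_tsipras(n, d, p, eta, rng):
    """Sample the binary task from Tsipras et al. (2019): x1 agrees with
    the label y with probability p; x2..x_{d+1} are N(eta * y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)
    gauss = eta * y[:, None] + rng.standard_normal((n, d))
    return np.concatenate([x1[:, None], gauss], axis=1), y

def shift_attack(x, y, eta):
    """Shift every Gaussian feature by 2*eta against the label, so each
    becomes distributed as N(-eta * y, 1)."""
    x_adv = x.copy()
    x_adv[:, 1:] -= 2.0 * eta * y[:, None]
    return x_adv
```

After the attack, the Gaussian features are anti-correlated with the label, while x₁ is left untouched.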
On this toy example, we used the same setup as described above for our own toy problem. However, we adapted the VIB bottleneck size due to the increased dimensionality, and only perform 200 update steps.
Architectures and hyperparameters
For MNIST experiments, we based our JAX (Bradbury et al., 2018) implementation on the original VIB code: github.com/alexalemi/vib_demo. We used the following MLP architecture for the encoder: 1024 – ReLU – 1024 – ReLU – 2K, where K is the bottleneck size and the final layer outputs the K means and K standard deviations of e(z|x). The decoder consisted of a single dense layer with a softmax nonlinearity over 10 outputs. All weights were initialized using the default Xavier uniform scheme (Glorot & Bengio, 2010), and all biases were initialized to zero. We used the Adam optimizer and decayed the learning rate by a factor of 0.97 every 2 epochs. The batch size was set to 100, and we trained the networks for 200 epochs. Input images in the [0, 1] range were rescaled inside the network prior to passing them to the first dense layer. We used Polyak averaging (Polyak & Juditsky, 1992) with a constant decay of 0.999. Outputs from the bottleneck layer that correspond to the standard deviation of e(z|x) were passed through a softplus transformation to make them positive. Our deterministic baseline models had the same overall structure as the VIB models, i.e. 1024 – ReLU – 1024 – ReLU – K – 10 – softmax, with all training hyperparameters as above.

For CIFAR-10 experiments, the encoder network was a concatenation of a PreActivation-ResNet18 (He et al., 2016) with the same MLP as in our MNIST setup. The decoder and the backward encoder in CEB were again one-layer networks. We trained everything end-to-end for 1000 epochs. The batch size was set to 1024, and we used Adam with an initial learning rate of 0.012 and default parameters. The learning rate was multiplied by 0.3 every 250 epochs. For CEB, we annealed the compression hyperparameter from an initial value of 100 down to its target value during the first 4 epochs. Similarly, for VIB, we increased β from a small initial value to its target during the first 100 epochs. Prior to the first ResNet layer, input images in the [0, 1] range were normalized using per-channel means and standard deviations computed across the CIFAR-10 train set.
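For reference, one step of the Polyak averaging used for evaluation can be sketched as follows (a minimal sketch with a plain dict of parameters, not our JAX implementation):

```python
def polyak_update(avg_params, params, decay=0.999):
    """One step of Polyak averaging with a constant decay: maintain an
    exponential moving average of the weights, which is then used at
    evaluation time instead of the raw parameters."""
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in params}
```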
Additional results on MNIST and CIFAR-10
Below, we provide additional figures for the experiments in Sections 3.1 and 3.2. For MNIST, Figure 8 shows the results of increasing the number of restarts when we use a PGD attack with different perturbation sizes. For CIFAR-10, Figure 9 plots the robust accuracy of VIB models under various attacks, and Figure 10 illustrates the cross-entropy loss surface of a CEB model on a couple of test images.