1 Introduction
Neural Networks (NNs) have become the dominant approach to computer vision (CV) [Girshick2015; vgg; videoprediction], natural language processing (NLP) [embedding1; embedding2; mikolovrnn], speech recognition (ASR) [dnnspeech; DeepSpeech] and bioinformatics [Caruana2015; dnarnatasks]. Notable progress has recently been made on predictive uncertainty estimation for deep learning through the definition of baselines, tasks and metrics [baselinedetecting], and through the development of practical methods for estimating uncertainty using ensembles, such as Monte-Carlo Dropout [Gal2016Dropout] and Deep Ensembles [deepensemble2017]. Uncertainty estimates derived from ensemble approaches have been successfully applied to the tasks of detecting misclassifications and out-of-distribution inputs, and have also been investigated for adversarial attack detection [carlinidetected; galadversarial]. However, ensembles can be computationally expensive and it is hard to control their behaviour. Recently, [malininpn2018] proposed Prior Networks, a new approach to modelling uncertainty which has been shown to outperform Monte-Carlo dropout on a range of tasks. Prior Networks parameterize a Dirichlet prior over output distributions, which allows them to emulate an ensemble of models using a single network, whose behaviour can be explicitly controlled via the choice of training data. In [malininpn2018], Prior Networks are trained using the forward KL-divergence between the model and a target Dirichlet distribution. It is, however, necessary to use auxiliary losses, such as the cross-entropy, to yield competitive classification performance. Furthermore, it is also difficult to train Prior Networks using this criterion on complex datasets with many classes. In this work we show that the forward KL-divergence (KL) is an inappropriate optimization criterion and instead propose to train Prior Networks with the reverse KL-divergence (RKL) between the model and a target Dirichlet. In sections 3 and 4 of this paper it is shown, both theoretically and empirically on synthetic data, that this loss yields the desired behaviours of a Prior Network and does not require auxiliary losses. In section 5 Prior Networks are successfully trained on a range of image classification tasks using the proposed criterion without loss of classification performance.
It is also shown that these models yield better out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets than Prior Networks trained using the forward KL-divergence. An interesting application of uncertainty estimation is the detection of adversarial attacks: small perturbations to the input that are almost imperceptible to humans, yet which drastically affect the predictions of the neural network [szegedyadversarial]. Adversarial attacks are a serious security concern, as there exists a plethora of adversarial attacks which are quite easy to construct [goodfellowadversarial; BIM; MIM; carlinirobustness; papernotblackbox; papernotlimitation2016; liudelving2016; madry2017towards]. At the same time, while it is possible to improve the robustness of a network to adversarial attacks using adversarial training [szegedyadversarial] and adversarial distillation [papernotedistllaition2016], it is still possible to craft successful adversarial attacks against these networks [carlinirobustness]. Instead of considering robustness to adversarial attacks, [carlinidetected] investigates the detection of adversarial attacks and shows that they are detectable using a range of approaches. While adaptive attacks can be crafted to successfully defeat the proposed detection schemes, [carlinidetected] singles out detection via uncertainty measures derived from Monte-Carlo dropout as being more challenging to overcome using adaptive attacks. Thus, in this work we investigate the detection of adversarial attacks using Prior Networks, which have previously outperformed Monte-Carlo dropout on other tasks. Using the greater degree of control over the behaviour of Prior Networks which the reverse KL-divergence loss affords, Prior Networks are trained to predict the correct class on adversarial inputs, but to yield a higher measure of uncertainty on them than on natural inputs.
Effectively, this becomes a generalization of adversarial training [szegedyadversarial] which both improves the robustness of the model to adversarial attacks and allows them to be detected. In section 6 it is shown that on the CIFAR-10 and CIFAR-100 datasets it is more computationally challenging to construct adaptive adversarial attacks against Prior Networks than against standard neural networks, adversarially trained neural networks and MC-dropout defended networks. This is because, like ensembles, Prior Networks yield measures of uncertainty derived from distributions over output distributions. Consequently, adaptive adversarial attacks need to satisfy more constraints in order to attack Prior Networks and evade detection. Thus, the two main contributions of this paper are the following: a new reverse KL-divergence loss function which yields the desired behaviour of Prior Networks and allows them to be trained on more complex datasets; and the application of Prior Networks to adversarial attack detection, enabled by the proposed training criterion, where it is shown that white-box adaptive attacks are more computationally expensive to construct for Prior Networks than for baseline models.

2 Prior Networks
An ensemble of models can be interpreted as a set of output distributions drawn from an implicit conditional distribution over output distributions. A Prior Network $\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x}^{*};\boldsymbol{\theta})$, where $\boldsymbol{\pi}$ are the parameters of a categorical distribution, is a neural network which explicitly parametrizes a prior distribution over output distributions. This effectively allows a Prior Network to emulate an ensemble and yield the same measures of uncertainty [galthesis; mutualinformation], but in closed form and without sampling:
$$\big\{\mathrm{P}(y|\boldsymbol{\pi}^{(m)})\big\}_{m=1}^{M},\qquad \boldsymbol{\pi}^{(m)} \sim \mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}}) \quad (1)$$
A Prior Network for classification typically parameterizes the Dirichlet distribution (eqn 2), which is the conjugate prior to the categorical, due to its tractable analytic properties; alternate choices of distribution, such as a mixture of Dirichlets or the Logistic-Normal, are possible. The Dirichlet distribution is defined as:
$$\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K}\Gamma(\alpha_k)}\prod_{k=1}^{K}\pi_k^{\alpha_k-1},\qquad \alpha_k > 0 \quad (2)$$
where $\Gamma(\cdot)$ is the gamma function. The Dirichlet distribution is parameterized by its concentration parameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)$, where $\alpha_0 = \sum_{k=1}^{K}\alpha_k$, the sum of all $\alpha_k$, is called the precision of the Dirichlet distribution. Higher values of $\alpha_0$ lead to sharper, more confident distributions. The predictive distribution of a Prior Network is given by the expected categorical distribution under the conditional Dirichlet prior:
$$\mathrm{P}(y=\omega_c|\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}}) = \mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}})}\big[\mathrm{P}(y=\omega_c|\boldsymbol{\pi})\big] = \frac{\alpha_c}{\alpha_0} \quad (3)$$
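For concreteness, equation 3 says the predictive distribution is just the normalized concentration vector; a minimal NumPy sketch with illustrative concentrations (not values from the paper):

```python
import numpy as np

def dirichlet_predictive(alphas):
    """Expected categorical under Dir(pi; alpha): P(y = omega_c) = alpha_c / alpha_0."""
    alphas = np.asarray(alphas, dtype=float)
    return alphas / alphas.sum()

# A sharp Dirichlet focused on class 0 and a flat Dirichlet are both valid
# priors over output distributions; what distinguishes them is the precision.
sharp = np.array([101.0, 1.0, 1.0])   # confident in-domain behaviour
flat = np.array([1.0, 1.0, 1.0])      # flat, out-of-distribution behaviour
print(dirichlet_predictive(sharp))    # ≈ [0.981, 0.010, 0.010]
print(sharp.sum(), flat.sum())        # precisions: 103.0 vs 3.0
```

Note that the predictive alone cannot distinguish a confident prediction from a flat prior over distributions; the precision carries that information.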
The desired behaviours of a Prior Network, as described in [malininpn2018], can be visualized on a simplex in figure 1. Here, figure 1a depicts confident behaviour (a low-entropy prior focused on low-entropy output distributions), figure 1b depicts uncertainty due to severe class overlap (data uncertainty) and figure 1c depicts the behaviour for an out-of-distribution input (knowledge uncertainty).
Given a Prior Network which yields the desired behaviours, it is possible to derive measures of uncertainty in the prediction by considering the mutual information between the label $y$ and the categorical parameters $\boldsymbol{\pi}$, given by the following expression:
$$\underbrace{\mathcal{I}\big[y,\boldsymbol{\pi}\,|\,\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}}\big]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\Big[\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}})}\big[\mathrm{P}(y|\boldsymbol{\pi})\big]\Big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x}^{*};\hat{\boldsymbol{\theta}})}\Big[\mathcal{H}\big[\mathrm{P}(y|\boldsymbol{\pi})\big]\Big]}_{\text{Expected Data Uncertainty}} \quad (4)$$
This expression allows the total uncertainty, given by the entropy of the predictive distribution, to be decomposed into data uncertainty and knowledge uncertainty, the two sources of uncertainty. Data uncertainty arises due to class overlap in the data, which is the equivalent of noise for classification problems. Knowledge uncertainty, also known as epistemic uncertainty [Gal2016Dropout] or distributional uncertainty [malininpn2018], arises due to the model’s lack of understanding or knowledge about the input; in other words, it arises due to a mismatch between the training and test data.
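For a Dirichlet prior, all three quantities in equation 4 are available in closed form from the concentration parameters alone, using the standard Dirichlet identities for $\mathbb{E}[\ln\pi_k]$ and $\mathbb{E}[\pi_k\ln\pi_k]$. A NumPy/SciPy sketch with illustrative concentrations:

```python
import numpy as np
from scipy.special import digamma

def uncertainty_measures(alphas):
    """Closed-form decomposition: total = expected data + knowledge uncertainty."""
    alphas = np.asarray(alphas, dtype=float)
    a0 = alphas.sum()
    probs = alphas / a0
    total = -np.sum(probs * np.log(probs))          # entropy of the predictive
    # E[H[P(y|pi)]] via E[pi_k ln pi_k] = (alpha_k/alpha_0)(psi(alpha_k+1) - psi(alpha_0+1))
    data = -np.sum(probs * (digamma(alphas + 1) - digamma(a0 + 1)))
    return total, data, total - data                # third value is the MI

print(uncertainty_measures([101.0, 1.0, 1.0]))  # sharp prior: all measures low
print(uncertainty_measures([1.0, 1.0, 1.0]))    # flat prior: high MI (knowledge uncertainty)
```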
3 Forward and Reverse KL-Divergence Losses
The original training criterion for Prior Networks is the forward KL-divergence between the model and a target Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(c)})$, where the target concentration parameters $\boldsymbol{\beta}^{(c)}$ depend on the class $\omega_c$:
$$\mathcal{L}^{KL}(\boldsymbol{\theta};\boldsymbol{x},y=\omega_c) = \mathrm{KL}\big[\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(c)})\,\big\|\,\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big] \quad (5)$$
The target concentration parameters are set as follows:
$$\beta_k^{(c)} = \begin{cases} \beta + 1, & k = c \\ 1, & k \neq c \end{cases} \quad (6)$$
This criterion is then jointly optimized on in-domain data and out-of-domain training data as follows:
$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\hat{\mathrm{p}}_{\text{in}}(\boldsymbol{x},y)}\big[\mathcal{L}^{KL}(\boldsymbol{\theta};\boldsymbol{x},y)\big] + \gamma\,\mathbb{E}_{\hat{\mathrm{p}}_{\text{out}}(\boldsymbol{x})}\big[\mathcal{L}^{KL}(\boldsymbol{\theta};\boldsymbol{x})\big] \quad (7)$$
where $\gamma$ is the out-of-distribution loss weight. In-domain, $\beta$ should take on a large value, so that the concentration is high only in the corner corresponding to the target class and low elsewhere. Note that the concentration parameters have to be strictly positive, so it is not possible to set the remaining concentration parameters to 0; instead, they are set to one, which also provides a small degree of smoothing. Out-of-domain, $\beta = 0$, which results in a flat Dirichlet distribution. However, there is a significant issue with this criterion. Consider taking the expectation of equation 5 with respect to the empirical distribution $\hat{\mathrm{p}}(y|\boldsymbol{x})$:
$$\mathbb{E}_{\hat{\mathrm{p}}(y|\boldsymbol{x})}\Big[\mathrm{KL}\big[\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(y)})\,\big\|\,\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big]\Big] = \mathrm{KL}\Big[\sum_{c=1}^{K}\hat{\mathrm{P}}(y=\omega_c|\boldsymbol{x})\,\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(c)})\,\Big\|\,\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\Big] + C \quad (8)$$
In expectation, this loss is the KL-divergence between the model and a mixture of Dirichlet distributions, which has a mode in each corner of the simplex. When the level of data uncertainty is low, this is not a problem, as there will only be a single significant mode. However, when there is a significant amount of data uncertainty the target distribution will be multimodal. As the forward KL-divergence is zero-avoiding, it will drive the model to spread itself over each mode, effectively ‘inverting’ the Dirichlet distribution and driving the precision to a low value. This is an undesirable behaviour, as the model should instead yield a distribution with a single high-precision mode at the center of the simplex, as shown in figure 1b. Furthermore, this can compromise predictive performance. The main issue with the KL-divergence loss is that the target distributions are arithmetically summed in expectation. This can be avoided by instead minimizing the reverse KL-divergence between the target distribution and the model:
$$\mathcal{L}^{RKL}(\boldsymbol{\theta};\boldsymbol{x},y=\omega_c) = \mathrm{KL}\big[\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\,\big\|\,\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(c)})\big] \quad (9)$$
By taking the expectation of this criterion with respect to the empirical distribution, it becomes the reverse KL-divergence between the model and a geometric mixture of target Dirichlet distributions:
$$\mathbb{E}_{\hat{\mathrm{p}}(y|\boldsymbol{x})}\big[\mathcal{L}^{RKL}(\boldsymbol{\theta};\boldsymbol{x},y)\big] = \mathrm{KL}\big[\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\,\big\|\,\mathrm{Dir}(\boldsymbol{\pi};\bar{\boldsymbol{\beta}})\big] + C,\qquad \bar{\beta}_k = \sum_{c=1}^{K}\hat{\mathrm{P}}(y=\omega_c|\boldsymbol{x})\,\beta_k^{(c)} \quad (10)$$
A geometric mixture of Dirichlet distributions is itself a standard Dirichlet distribution whose concentration parameters are an arithmetic mixture of the target concentration parameters for each class. When there is low data uncertainty this loss simply yields the reverse KL-divergence to a sharp Dirichlet at a particular corner. However, when the data uncertainty is significant, this loss minimizes the reverse KL-divergence to a Dirichlet with a single mode close to the center of the simplex. This is exactly the behaviour which the model should learn when there is an in-domain input in a region of significant data uncertainty. Thus, the target distribution is always a standard unimodal Dirichlet. Furthermore, as a consequence of this loss, the concentration parameters are appropriately interpolated on the boundary of the in-domain and out-of-distribution regions, where the degree of interpolation depends on the OOD loss weight $\gamma$. Finally, it is necessary to point out that the reverse KL-divergence is commonly used in variational inference [murphy] and for training variational autoencoders [vae]. It is interesting to further analyze the properties of the reverse KL-divergence by decomposing it into the reverse cross-entropy and the negative differential entropy:

$$\mathcal{L}^{RKL}(\boldsymbol{\theta};\boldsymbol{x},y=\omega_c) = \underbrace{\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[-\ln \mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\beta}^{(c)})\big]}_{\text{reverse cross-entropy}} - \underbrace{\mathcal{H}\big[\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big]}_{\text{differential entropy}} \quad (11)$$
Let us consider the reverse cross-entropy term in more detail, dropping additive constants:
$$\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\Big[-\sum_{k=1}^{K}(\beta_k-1)\ln\pi_k\Big] = \sum_{k=1}^{K}(\beta_k-1)\big(\psi(\alpha_0)-\psi(\alpha_k)\big) \quad (12)$$
When the target concentration parameters are defined as in equation 6, the form of the reverse cross-entropy will be:
$$\sum_{k=1}^{K}\big(\beta_k^{(c)}-1\big)\big(\psi(\alpha_0)-\psi(\alpha_k)\big) = \beta\big(\psi(\alpha_0)-\psi(\alpha_c)\big) \quad (13)$$
This expression for the reverse cross-entropy is a scaled version of an upper bound on the cross-entropy between discrete distributions, obtained via Jensen’s inequality, which was proposed in the parallel work [evidential] that investigated a model similar to Dirichlet Prior Networks:
$$\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[-\ln\mathrm{P}(y=\omega_c|\boldsymbol{\pi})\big] = \psi(\alpha_0)-\psi(\alpha_c) \;\geq\; -\ln\mathbb{E}_{\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[\mathrm{P}(y=\omega_c|\boldsymbol{\pi})\big] = -\ln\frac{\alpha_c}{\alpha_0} \quad (14)$$
The form of this upper-bound loss is identical to the standard negative log-likelihood loss, except with digamma functions in place of natural logarithms. This loss can be analyzed further by considering the following asymptotic series approximation of the digamma function:
$$\psi(x) = \ln x - \frac{1}{2x} - \frac{1}{12x^2} + \mathcal{O}\Big(\frac{1}{x^4}\Big) \quad (15)$$
Given this approximation, it is easy to show that this upper-bound loss is equal to the negative log-likelihood plus an extra term which drives the concentration parameters to be as large as possible:
$$\psi(\alpha_0)-\psi(\alpha_c) \approx -\ln\frac{\alpha_c}{\alpha_0} + \frac{1}{2\alpha_c} - \frac{1}{2\alpha_0} \quad (16)$$
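Equations 14–16 are easy to check numerically; a small NumPy/SciPy sketch with illustrative concentration parameters:

```python
import numpy as np
from scipy.special import digamma

# Check that the digamma upper bound (eqn 14) is approximated by the negative
# log-likelihood plus a precision-dependent correction term (eqn 16).
alpha = np.array([50.0, 30.0, 20.0])   # illustrative concentrations
a0, ac = alpha.sum(), alpha[0]         # target class c = 0

exact = digamma(a0) - digamma(ac)                 # upper-bound loss
nll = -np.log(ac / a0)                            # standard NLL term
approx = nll + 1.0 / (2 * ac) - 1.0 / (2 * a0)    # asymptotic approximation
print(exact, approx)  # agree to roughly 1e-4 at these magnitudes
```

As Jensen's inequality requires, `exact` never falls below `nll`, and the gap shrinks as the concentrations grow.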
Thus, the reverse KL-divergence between Dirichlet distributions, given the setting of the target concentration parameters via equation 6, yields the following expression:
$$\mathcal{L}^{RKL}(\boldsymbol{\theta};\boldsymbol{x},y=\omega_c) = \beta\big(\psi(\alpha_0)-\psi(\alpha_c)\big) - \mathcal{H}\big[\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\alpha})\big] + C \quad (17)$$
Clearly, this expression is equal to the standard negative log-likelihood loss for discrete distributions, weighted by $\beta$, plus a term which drives the precision of the Dirichlet towards $\alpha_0 = \beta + K$, where $K$ is the number of classes.
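The property underlying equation 10 — that the expected reverse KL-divergence differs from the reverse KL-divergence to the arithmetically mixed target Dirichlet only by a constant independent of the model — can also be verified numerically. A NumPy/SciPy sketch with made-up class weights and targets:

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """Closed-form KL[ Dir(a) || Dir(b) ]."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(a).sum() - gammaln(b0) + gammaln(b).sum()
            + np.sum((a - b) * (digamma(a) - digamma(a0))))

w = np.array([0.7, 0.3])                 # class probabilities P(y = omega_c | x)
targets = np.array([[101.0, 1.0, 1.0],   # sharp target for class 1
                    [1.0, 101.0, 1.0]])  # sharp target for class 2
mixed = w @ targets                      # arithmetic mixture of concentrations

def gap(model_alpha):
    expected = sum(wi * dirichlet_kl(model_alpha, t) for wi, t in zip(w, targets))
    return expected - dirichlet_kl(model_alpha, mixed)

# The gap is the same constant regardless of the model distribution:
print(gap([5.0, 2.0, 1.0]), gap([1.0, 9.0, 3.0]))
```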
4 Experiments on Synthetic Data
The previous section investigated the theoretical properties of the forward and reverse KL-divergence training criteria for Prior Networks. In this section these criteria are assessed empirically by using them to train Prior Networks on the artificial high-uncertainty 3-class dataset introduced in [malininpn2018] (described in appendix A). In these experiments, the out-of-distribution training data was sampled such that it forms a thin shell around the training data. The target Dirichlet concentration parameters were constructed as described in equation 6, with a large in-domain $\beta$ and $\beta = 0$ for out-of-distribution data. The in-domain and out-of-distribution losses were equally weighted when training using the forward KL-divergence. However, it was found necessary to weight the out-of-distribution loss 10 times as much as the in-domain loss when using the reverse KL-divergence.
Figure 2 depicts the total uncertainty, expected data uncertainty and mutual information, which is a measure of knowledge uncertainty, derived using equation 4 from Prior Networks trained using both criteria. By comparing figures 2a and 2d it is clear that a Prior Network trained using the forward KL-divergence overestimates total uncertainty in domain, as the total uncertainty is equally high along the decision boundaries, in the region of class overlap and out-of-domain. The Prior Network trained using the reverse KL-divergence, on the other hand, yields an estimate of total uncertainty which better reflects the structure of the dataset. Figure 2b shows that the expected data uncertainty is altogether incorrectly estimated by the Prior Network trained via the forward KL-divergence, as it is uniform over the entire in-domain region. As a result, the mutual information is higher in-domain along the decision boundaries than out-of-domain. In contrast, figures 2c and 2f show that the measures of uncertainty provided by a Prior Network trained using the reverse KL-divergence decompose correctly: data uncertainty is highest in regions of class overlap, while mutual information is low in-domain and high out-of-domain. Thus, these experiments support the analysis in the previous section, and illustrate how the reverse KL-divergence is a more suitable optimization criterion than the forward KL-divergence.
5 Image Classification Experiments
Having evaluated the forward and reverse KL-divergence losses on a synthetic dataset in the previous section, we now evaluate these losses on a range of image classification datasets. The training configurations are described in appendix B. Table 1 presents the classification error rates of standard DNNs, an ensemble of 5 DNNs [deepensemble2017], and Prior Networks trained using both the forward and reverse KL-divergence losses. From table 1 it is clear that Prior Networks trained using the forward KL-divergence (PN-KL) achieve increasingly worse classification performance as the datasets become more complex and have a larger number of classes. At the same time, Prior Networks trained using the reverse KL-divergence loss (PN-RKL) have error rates similar to those of ensembles and standard DNNs. Note that in these experiments no auxiliary losses were used.
Table 2 presents the out-of-distribution detection performance of Prior Networks trained on CIFAR-10 and CIFAR-100 [cifar] using the forward and reverse KL-divergences. Prior Networks trained on CIFAR-10 use CIFAR-100 as OOD training data, while Prior Networks trained on CIFAR-100 use TinyImageNet [tinyimagenet] as OOD training data. Performance is assessed using the area under an ROC curve (AUROC) in the same fashion as in [malininpn2018; baselinedetecting]. The results on CIFAR-10 show that PN-RKL consistently yields better performance than PN-KL and the ensemble on all OOD test datasets (SVHN, LSUN and TinyImageNet). The results using models trained on CIFAR-100 show that Prior Networks are capable of outperforming the ensembles when evaluated against the LSUN and SVHN datasets. However, Prior Networks have difficulty distinguishing between the CIFAR-10 and CIFAR-100 test sets. This, however, represents a limitation of both the classification model and the OOD training data, rather than of the training criterion. Improving the classification performance of Prior Networks on CIFAR-100, which improves understanding of what is ‘in-domain’, and using a more appropriate OOD training dataset, which provides a better contrast, is likely to improve OOD detection performance.
6 Adversarial Attack Detection
Having developed a new training criterion for Prior Networks which allows them to scale to more complex datasets and gives greater control over their behaviour, we now investigate using measures of uncertainty derived from Prior Networks to detect adversarial attacks. Detection of adversarial attacks via measures of uncertainty was previously studied in [carlinidetected], where it was shown that Monte-Carlo dropout ensembles yield measures of uncertainty which are more challenging to attack than the other considered methods. Like Monte-Carlo dropout, Prior Networks yield rich measures of uncertainty derived from distributions over distributions. This means that for adversarial attacks to both affect the prediction and evade detection, they must be located in a region of input space within the decision boundary of the desired target class, and where both the relative magnitudes of the logits (distribution over classes) and the absolute magnitudes of the logits (distribution over distributions) are the same as for natural inputs. Clearly, this constrains the space of possible solutions to the optimization problem which yields adversarial attacks. Furthermore, the behaviour of Prior Networks can be explicitly controlled for particular input regions via the choice of out-of-distribution training data, for example adversarial attacks. This further constrains the space of solutions to the optimization problem which yields detection-evading adversarial attacks. Thus, in the following experiments Prior Networks are trained on adversarially perturbed inputs as out-of-distribution data. The models are trained to both yield the correct prediction and high measures of uncertainty for adversarially modified inputs. During training, targeted adversarial attacks are generated via the Fast Gradient Sign Method (FGSM) [szegedyadversarial], which minimizes the reverse KL-divergence (eqn 10) between the Prior Network and a sharp Dirichlet distribution focused on a randomly chosen class which is not the true class of the training image. The Prior Network is then jointly trained to yield either a sharp or a wide Dirichlet distribution at the appropriate corner of the simplex for natural or adversarial data, respectively. The target concentration parameters are set using equation 6, with a large $\beta$ for natural data and $\beta = 0$ for adversarial data. This approach can be seen as a generalization of adversarial training [szegedyadversarial; madry2017towards], where models are trained to predict the correct class on a set of adversarially perturbed inputs. The difference is that here we are training the model to yield a particular behaviour of an entire distribution over output distributions, rather than simply making sure that the decision boundaries are correct in regions of input space which correspond to adversarial attacks. As discussed in [carlinidetected; carlinievaluating], approaches to detecting adversarial attacks need to be evaluated against adaptive white-box attacks which have full knowledge of the detection scheme and actively seek to bypass it.
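The targeted FGSM step used to generate the out-of-distribution training data can be sketched as follows. This is a toy NumPy illustration with a linear-softmax "network" and a cross-entropy surrogate in place of the reverse KL-divergence objective and the CNNs used in the experiments; the weights, step size and classes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # toy linear "network": logits = W @ x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_targeted(x, target, eps):
    """One targeted FGSM step: x_adv = x - eps * sign(grad_x loss(target))."""
    p = softmax(W @ x)
    # gradient of -log p_target w.r.t. x for a linear-softmax model
    grad = W.T @ (p - np.eye(3)[target])
    return x - eps * np.sign(grad)

x = rng.normal(size=4)
x_adv = fgsm_targeted(x, target=1, eps=0.1)
# moving against the gradient should raise the target-class probability
print(softmax(W @ x)[1], softmax(W @ x_adv)[1])
```

In the paper's setting the surrogate loss is the reverse KL-divergence to a sharp Dirichlet on the target class, and the resulting perturbed inputs are fed back as OOD training data rather than used as attacks.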
Here, we consider two types of targeted, iterative PGD-MIM [MIM; madry2017towards] attacks which aim to switch the prediction to a target class while leaving the measures of uncertainty derived from Prior Networks or DNNs (entropy, mutual information) unchanged. The first approach is to simply permute the predicted distribution over classes, swapping the probabilities of the max and target classes; the loss minimized by the adversarial generation process is then the forward KL-divergence between the predicted distribution over class labels and the permuted target distribution. For Prior Networks, the equivalent approach is to permute the concentration parameters $\hat{\boldsymbol{\alpha}}$ and to minimize the forward KL-divergence to the permuted target Dirichlet distribution:

$$\mathcal{L}_{adv}(\boldsymbol{x};\boldsymbol{\theta}) = \mathrm{KL}\big[\mathrm{Dir}(\boldsymbol{\pi};\mathrm{perm}(\hat{\boldsymbol{\alpha}}))\,\big\|\,\mathrm{p}(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big] \quad (18)$$

However, it was found that the categorical-permutation loss yields more aggressive attacks than the Dirichlet-permutation loss (results are described in appendix D), which is why only attacks generated via the former are considered here. The target for these attacks is always the second most likely class, as that represents the least ‘unnatural’ perturbation of the outputs. In the following set of experiments Prior Networks are trained on either the CIFAR-10 or CIFAR-100 datasets [cifar] using the procedure defined above. Details of the experimental configuration can be found in appendix B. The baseline models are an undefended DNN and a DNN trained using standard adversarial training (DNN-ADV). For these models uncertainty is estimated via the entropy of the predictive posterior. Additionally, estimates of mutual information (knowledge uncertainty) are derived via a Monte-Carlo dropout ensemble generated from each of these models. Similarly, Prior Networks also use the mutual information (eqn 4) for adversarial attack detection. Performance is assessed via the Success Rate, AUROC and Joint Success Rate (JSR). For the ROC curves considered here the true positive rate is computed using natural examples, while the false positive rate is computed using only successful adversarial attacks; this may result in the minimum AUROC being a little greater than 50% if the success rate is not 100%, as is the case with the MC-DP AUROC in figure 3. The JSR, described in greater detail in appendix C, is the equal error rate at which the false positive rate equals the false negative rate, and allows joint assessment of adversarial robustness and detection.
The results presented in figure 3 show that on both the CIFAR-10 and CIFAR-100 datasets white-box attacks successfully change the prediction of the DNN and DNN-ADV models to the second most likely class and evade detection (AUROC falls to 50%). Monte-Carlo dropout ensembles are marginally harder to adversarially overcome, due to the random noise. At the same time, it takes far more iterations of gradient descent to attack Prior Networks successfully enough that they fail to detect the attack. On CIFAR-10 the Joint Success Rate is only 0.25 at 1000 iterations, while the JSR for the other models is 0.5 (the maximum). Results on the more challenging CIFAR-100 dataset show that adversarially trained Prior Networks yield a more modest increase in robustness over the baseline approaches, but it still takes significantly more computational effort to attack the model. Thus, these results support the assertion that adversarially trained Prior Networks constrain the solution space for adaptive adversarial attacks, making them computationally more difficult to construct successfully. At the same time, black-box attacks, computed on identical networks trained on the same data from a different random initialization, fail entirely against Prior Networks trained on CIFAR-10 and CIFAR-100. This shows that the adaptive attacks considered here are non-transferable.
7 Conclusion
Prior Networks have been shown to be an interesting approach to emulating ensembles, allowing rich and interpretable measures of uncertainty to be derived from neural networks. This work consists of two main contributions which aim to improve these models. Firstly, a new training criterion for Prior Networks, the reverse KL-divergence between Dirichlet distributions, is proposed. It is shown, both theoretically and empirically, that this criterion yields the desired set of behaviours of a Prior Network and allows these models to be trained on more complex datasets with a large number of classes. Furthermore, it is shown that this loss improves out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets relative to the forward KL-divergence loss used in [malininpn2018]. However, it is necessary to investigate the proper choice of out-of-distribution training data, as an inappropriate choice can limit OOD detection performance on complex datasets. Secondly, this improved training criterion enables Prior Networks to be applied to the task of detecting white-box adaptive adversarial attacks. It is shown that it is significantly more computationally challenging to construct successful adaptive white-box PGD attacks against Prior Networks than against baseline models. Thus, adversarial training of Prior Networks can be seen as a generalization of standard adversarial training which improves both robustness to adversarial attacks and the ability to detect them, by placing more constraints on the space of solutions to the optimization problem which yields adversarial attacks. It is necessary to point out that the evaluation of adversarial attack detection using Prior Networks is limited to only strong attacks. It is of interest to assess how well Prior Networks are able to detect adaptive C&W attacks [carlinirobustness] and EAD attacks [chen2018ead].
However, one challenge with these attacks is the adaptation of their loss functions to Prior Networks, which is left for future work.
References
 (1) Ross Girshick, “Fast R-CNN,” in Proc. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
 (2) Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. International Conference on Learning Representations (ICLR), 2015.

 (3) Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee, “Learning to Generate Long-term Future via Hierarchical Prediction,” in Proc. International Conference on Machine Learning (ICML), 2017.
 (4) Tomas Mikolov et al., “Linguistic Regularities in Continuous Space Word Representations,” in Proc. NAACL-HLT, 2013.
 (5) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” 2013, arXiv:1301.3781.

 (6) Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur, “Recurrent Neural Network Based Language Model,” in Proc. INTERSPEECH, 2010.
 (7) Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, 2012.
 (8) Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 2014, arXiv:1412.5567.
 (9) Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2015, KDD ’15, pp. 1721–1730, ACM.
 (10) Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, July 2015.
 (11) Dan Hendrycks and Kevin Gimpel, “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks,” 2016, arXiv:1610.02136.
 (12) Yarin Gal and Zoubin Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning,” in Proc. 33rd International Conference on Machine Learning (ICML-16), 2016.
 (13) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell, “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles,” in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.
 (14) Nicholas Carlini and David A. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” CoRR, 2017.
 (15) Lewis Smith and Yarin Gal, “Understanding Measures of Uncertainty for Adversarial Example Detection,” in Proc. Uncertainty in Artificial Intelligence (UAI), 2018.
 (16) Andrey Malinin and Mark Gales, “Predictive uncertainty estimation via prior networks,” in Advances in Neural Information Processing Systems, 2018, pp. 7047–7058.
 (17) Christian Szegedy, Alexander Toshev, and Dumitru Erhan, “Deep neural networks for object detection,” in Advances in Neural Information Processing Systems, 2013.
 (18) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
 (19) Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio, “Adversarial examples in the physical world,” 2016, arXiv:1607.02533.

 (20) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li, “Boosting adversarial attacks with momentum,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 (21) Nicholas Carlini and David A. Wagner, “Towards evaluating the robustness of neural networks,” CoRR, 2016.
 (22) Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami, “Practical black-box attacks against deep learning systems using adversarial examples,” CoRR, vol. abs/1602.02697, 2016.
 (23) Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami, “The limitations of deep learning in adversarial settings,” in IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21–24, 2016, pp. 372–387.
 (24) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song, “Delving into transferable adversarial examples and black-box attacks,” CoRR, vol. abs/1611.02770, 2016.
 (25) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
 (26) Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22–26, 2016, pp. 582–597.
 (27) Yarin Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.

 (28) Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft, “Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems,” stat, vol. 1050, p. 11, 2017.
 (29) Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
 (30) Diederik P. Kingma and Max Welling, “Auto-Encoding Variational Bayes,” in Proc. International Conference on Learning Representations (ICLR), 2014.
 (31) Murat Sensoy, Lance Kaplan, and Melih Kandemir, “Evidential deep learning to quantify classification uncertainty,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 3179–3189. Curran Associates, Inc., 2018.
 (32) Alex Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.

 (33) Stanford CS231N, “Tiny ImageNet,” https://tinyimagenet.herokuapp.com/, 2017.  (34) Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry, “On evaluating adversarial robustness,” arXiv preprint arXiv:1902.06705, 2019.

 (35) Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh, “EAD: Elastic-net attacks to deep neural networks via adversarial examples,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 (36) Martín Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,” 2015, software available from tensorflow.org.  (37) Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” in Proc. 3rd International Conference on Learning Representations (ICLR), 2015.
 (38) Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku, “Adversarial and clean data are not twins,” CoRR, vol. abs/1704.04960, 2017.
 (39) Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick D. McDaniel, “On the (statistical) detection of adversarial examples,” CoRR, vol. abs/1702.06280, 2017.
 (40) Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff, “On detecting adversarial perturbations,” in Proceedings of 5th International Conference on Learning Representations (ICLR), 2017.
Appendix A Synthetic Experiments
This appendix describes the high data uncertainty artificial dataset used in section 4 of this paper. The dataset is sampled from a distribution
which consists of three normally distributed clusters with tied isotropic covariances and equidistant means, where each cluster corresponds to a separate class. The marginal distribution over $\mathbf{x}$
is given as a mixture of Gaussian distributions:
(19)  $\mathrm{p}(\mathbf{x}) = \sum_{c=1}^{3} \pi_c\, \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_c,\, \sigma^2\mathbf{I}\big)$
The conditional distribution over the classes can be obtained via Bayes’ rule:
(20)  $\mathrm{P}(y = c \mid \mathbf{x}) = \dfrac{\pi_c\, \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_c,\, \sigma^2\mathbf{I}\big)}{\sum_{k=1}^{3} \pi_k\, \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_k,\, \sigma^2\mathbf{I}\big)}$
This dataset is depicted below. The green points represent the 'out-of-distribution' training data, which is sampled close to the in-domain region. The Prior Networks considered in section 4 are trained on this dataset.
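As a concrete illustration, the sampling procedure and the Bayes' rule posterior can be sketched in a few lines of numpy. The cluster means, equal mixing weights and the value of the standard deviation below are illustrative choices, not the exact values used in the experiments:

```python
import numpy as np

def sample_mixture(n, means, sigma, rng):
    """Draw n points from an equal-weight mixture of isotropic Gaussians,
    one component per class."""
    y = rng.integers(0, len(means), size=n)
    x = means[y] + sigma * rng.standard_normal((n, means.shape[1]))
    return x, y

def class_posterior(x, means, sigma):
    """P(y = c | x) via Bayes' rule, for equal priors and tied covariance sigma^2 I."""
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_lik = -sq_dist / (2.0 * sigma ** 2)
    log_lik -= log_lik.max(axis=1, keepdims=True)   # subtract max for numerical stability
    lik = np.exp(log_lik)
    return lik / lik.sum(axis=1, keepdims=True)

# Three equidistant means: vertices of an equilateral triangle on the unit circle.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
rng = np.random.default_rng(0)
x, y = sample_mixture(1000, means, sigma=0.5, rng=rng)
posterior = class_posterior(x, means, sigma=0.5)
```

Because the covariances are tied and the priors equal, the posterior depends only on the squared distances to the cluster means, which is why the normalizing constant of the Gaussians cancels in `class_posterior`.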
Figure 5 depicts the behaviour of the differential entropy of Prior Networks trained on the high data uncertainty artificial dataset using both KL-divergence losses. Unlike the total uncertainty, expected data uncertainty and mutual information, it is less clear what the desired behaviour of the differential entropy is. Figure 5 shows that both losses yield low differential entropy in-domain and high differential entropy out-of-distribution. However, the reverse KL-divergence seems to capture more of the structure of the dataset than the forward KL-divergence, which is especially evident in figure 5b. This suggests that the differential entropy of Prior Networks trained via the reverse KL-divergence is a measure of total uncertainty, while the differential entropy of Prior Networks trained using the forward KL-divergence is a measure of knowledge uncertainty. The latter is consistent with results in malininpn2018 .
Appendix B Experimental Setup
This appendix describes the experimental setup and datasets used for the experiments considered in this paper. Table 3 describes the datasets in terms of their size and number of classes.
All models considered in this paper were implemented in TensorFlow tensorflow using the VGG-16 vgg architecture for image classification, but with the dimensionality of the fully-connected layer reduced to 2048 units. DNN models were trained using the negative log-likelihood loss. Prior Networks were trained using both the forward KL-divergence (PN-KL) and reverse KL-divergence (PN-RKL) losses to compare their behaviour on more challenging datasets. Identical target concentration parameters were used for both the forward and reverse KL-divergence losses. All models were trained using the Adam adam optimizer, with a 1-cycle learning rate policy and dropout regularization. In addition, data augmentation was applied when training models on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, via random left-right flips, random shifts of up to 4 pixels and random rotations of up to 15 degrees. The details of the training configurations for all models and each dataset can be found in table 4. Five models of each type were trained starting from different random seeds. The 5 DNN models were evaluated both individually (DNN) and as an explicit ensemble of models (ENS).
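The augmentation policy described above can be sketched as follows. The wrap-around shift is a simplification of the usual pad-and-crop implementation, and the random rotation is omitted for brevity:

```python
import numpy as np

def augment(img, rng):
    """Sketch of the augmentation policy: random left-right flip and a random
    shift of up to 4 pixels. The shift wraps around here for simplicity; a real
    pipeline would pad and crop instead. Rotation (up to 15 degrees) is omitted."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                           # left-right flip
    dy, dx = rng.integers(-4, 5, size=2)             # shift in [-4, 4] pixels
    return np.roll(img, (int(dy), int(dx)), axis=(0, 1))
```

In practice such augmentation is applied independently to every image in each minibatch, so the network rarely sees an identical input twice.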
B.1 Adversarial Attack Generation
An adversarial input $\hat{\mathbf{x}}$ will be defined as the output of a constrained optimization process applied to a natural input $\mathbf{x}$:
(21)  $\hat{\mathbf{x}} = \arg\min_{\tilde{\mathbf{x}}:\, \mathcal{D}(\mathbf{x}, \tilde{\mathbf{x}}) \leq \epsilon} \mathcal{L}(\tilde{\mathbf{x}}, y_t; \boldsymbol{\theta})$
The loss $\mathcal{L}$ is typically the negative log-likelihood of a particular target class $y_t$:
(22)  $\mathcal{L}(\tilde{\mathbf{x}}, y_t; \boldsymbol{\theta}) = -\ln \mathrm{P}(y_t \mid \tilde{\mathbf{x}}; \boldsymbol{\theta})$
The distance $\mathcal{D}(\mathbf{x}, \hat{\mathbf{x}})$ represents a proxy for the perceptual distance between the natural sample $\mathbf{x}$ and the adversarial sample $\hat{\mathbf{x}}$. In the case of adversarial images $\mathcal{D}$ is typically the $L_1$, $L_2$ or $L_\infty$ norm. The distance is constrained to lie within the set of allowed perturbations, such that the adversarial attack is still perceived to be a natural input by a human observer. First-order optimization under a constraint is called Projected Gradient Descent madry2017towards , where the solution is projected back onto the norm ball whenever it exceeds the constraint. There are multiple ways in which the PGD optimization problem 21 can be solved szegedyadversarial ; goodfellowadversarial ; BIM ; MIM ; madry2017towards . The simplest way to generate an adversarial example is via the Fast Gradient Sign Method (FGSM) goodfellowadversarial , where the sign of the gradient of the loss with respect to the input is added to the input:
(23)  $\hat{\mathbf{x}} = \mathbf{x} + \epsilon\, \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y_t; \boldsymbol{\theta})\big)$
Epsilon controls the magnitude of the perturbation under a particular distance $\mathcal{D}$, here the $L_\infty$ norm. A generalization of this approach to other norms, called the Fast Gradient Method (FGM), is provided below:
(24)  $\hat{\mathbf{x}} = \mathbf{x} + \epsilon\, \dfrac{\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y_t; \boldsymbol{\theta})}{\big\|\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y_t; \boldsymbol{\theta})\big\|_p}$
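A minimal numpy sketch of the FGSM and FGM update steps, assuming the gradient of the loss with respect to the input has already been computed elsewhere (e.g. by automatic differentiation):

```python
import numpy as np

def fgsm(x, grad, eps):
    """One FGSM step: move each input dimension by eps in the direction of the
    gradient sign, an L-infinity-bounded perturbation."""
    return x + eps * np.sign(grad)

def fgm(x, grad, eps, p=2):
    """Fast Gradient Method: the generalization of FGSM to an arbitrary L-p norm,
    normalizing the gradient so the step has L-p magnitude eps."""
    norm = np.linalg.norm(grad.ravel(), ord=p) + 1e-12   # guard against zero gradient
    return x + eps * grad / norm
```

Note that under the sign function FGSM is exactly the $p \to \infty$ case of the normalized step, which is why both fit the same one-line pattern.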
FGM attacks are simple adversarial attacks which are not always successful. A more challenging class of attacks are iterative FGM attacks, such as the Basic Iterative Method (BIM) BIM and the Momentum Iterative Method (MIM) MIM , among others carlinirobustness ; chen2018ead . However, as pointed out by Madry et al. madry2017towards , all of these attacks, whether one-step or iterative, are generated using variants of Projected Gradient Descent to solve the constrained optimization problem in equation 21. Madry madry2017towards argues that all attacks generated using various forms of PGD share similar properties, even if certain attacks use more sophisticated forms of PGD than others. In this work MIM attacks, which are considered to be strong attacks, are used to attack all models considered in section 6. However, standard targeted attacks which minimize the negative log-likelihood of a target class are not adaptive to the detection scheme. Thus, in this work adaptive targeted attacks are generated by minimizing the losses proposed in section 6, in equation 18. The optimization problem in equation 21 contains a hard constraint, which essentially projects the solutions of gradient descent back onto the allowed norm ball whenever the perturbation exceeds the constraint. This may be disruptive to iterative momentum-based optimization methods. An alternative soft-constraint formulation of the optimization problem is to simultaneously minimize the loss as well as the perturbation directly:
(25)  $\hat{\mathbf{x}} = \arg\min_{\tilde{\mathbf{x}}} \big[ \mathcal{L}(\tilde{\mathbf{x}}, y_t; \boldsymbol{\theta}) + c\, \mathcal{D}(\mathbf{x}, \tilde{\mathbf{x}}) \big]$
In this formulation $c$ is a hyper-parameter which trades off minimization of the loss against minimization of the perturbation $\mathcal{D}(\mathbf{x}, \tilde{\mathbf{x}})$. Approaches which minimize this expression are the Carlini and Wagner (C&W) attack carlinirobustness and the "Elastic-net Attacks to DNNs" (EAD) attack chen2018ead . While the optimization expression is different, these methods are also a form of PGD and are therefore expected to have similar properties to other PGD-based attacks madry2017towards . The C&W and EAD attacks are considered to be particularly strong $L_2$ and $L_1$ attacks, and Prior Networks need to be assessed on their ability to be robust to and detect them. However, adaptation of these attacks to Prior Networks is non-trivial and left to future work.
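The iterative attacks discussed in this section can be sketched as follows. This is a simplified MIM-style loop with an $L_\infty$ projection, not the exact configuration used in the experiments; the hypothetical `grad_fn` stands in for whatever routine returns the gradient of the attack loss at a point:

```python
import numpy as np

def mim_attack(x, grad_fn, eps, steps=10, mu=1.0):
    """Sketch of the Momentum Iterative Method: accumulate an L1-normalized
    gradient in a momentum buffer, take signed steps, and project back onto
    the L-infinity ball of radius eps around the natural input x."""
    alpha = eps / steps                    # per-step size
    g_acc = np.zeros_like(x, dtype=float)  # momentum buffer
    x_adv = x.astype(float).copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        g_acc = mu * g_acc + g / (np.abs(g).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g_acc)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # PGD projection onto the eps-ball
    return x_adv
```

With `steps=1` and `mu=0` this reduces to the one-step FGSM update, which makes explicit the point that one-step and iterative attacks are variants of the same PGD scheme.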
B.2 Adversarial Training of DNNs and Prior Networks
Prior Networks and DNNs considered in section 6 are trained on a combination of natural and adversarially perturbed data, which is known as adversarial training. DNNs are trained on targeted FGSM attacks which are generated dynamically during training from the current training minibatch. The target class is selected from a uniform categorical distribution, but such that it is not the true class of the image. The magnitude of the perturbation $\epsilon$
is randomly sampled for each image in the minibatch from a truncated normal distribution, which only yields positive values, with a standard deviation of 30 pixels:
(26)  $\epsilon \sim \mathcal{N}^{+}\big(0,\, 30^2\big)$
The perturbation strength is sampled such that the model learns to be robust to adversarial attacks across a range of perturbations. The DNN is then trained via maximum likelihood on both the natural and the adversarially perturbed version of the minibatch. Adversarial training of the Prior Network is a little more involved. During training, an adversarially perturbed version of the minibatch is generated using the targeted FGSM method. However, the loss is not the negative log-likelihood of a target class, but the reverse KL-divergence (eqn. 10) between the model and a target Dirichlet which is focused on a target class chosen from a uniform categorical distribution (but not the true class of the image). For this loss the target concentration is the same as for natural data. The Prior Network is then jointly trained on the natural and adversarially perturbed versions of the minibatch using the following loss:
(27)  $\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\hat{p}(\mathbf{x}, y)}\Big[ \mathcal{KL}\big[ \mathrm{Dir}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\beta}}) \,\big\|\, \mathrm{p}(\boldsymbol{\mu} \mid \mathbf{x}; \boldsymbol{\theta}) \big] \Big] + \mathbb{E}_{\hat{p}_{adv}(\hat{\mathbf{x}}, \hat{y})}\Big[ \mathcal{KL}\big[ \mathrm{Dir}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\beta}}_{adv}) \,\big\|\, \mathrm{p}(\boldsymbol{\mu} \mid \hat{\mathbf{x}}; \boldsymbol{\theta}) \big] \Big]$
Here, the concentration of the target class is set separately for natural and for adversarially perturbed data, with the concentration parameters set via equation 6. This results in a very wide Dirichlet distribution whose mode and mean are closest to the target class, which ensures that the prediction yields the correct class and that all measures of uncertainty, such as the entropy of the predictive posterior or the mutual information, are high. Note that, due to the nature of the reverse KL-divergence loss, adversarial inputs which have a very small perturbation and lie close to their natural counterparts will naturally have a target concentration which is an interpolation between the concentration for natural data and for adversarial data. The degree of interpolation is determined by the OOD loss weight, as discussed in section 3. It is necessary to point out that FGSM attacks are used because they are computationally cheap to compute during training. However, iterative adversarial attacks could also be considered during training, although this would make training much slower.
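The sampling of the per-image perturbation strength used in adversarial training can be sketched as follows, using the absolute value of a zero-mean normal as one simple way to truncate at zero; the exact truncation scheme used in the experiments may differ:

```python
import numpy as np

def sample_eps(batch_size, sigma=30.0, rng=None):
    """Per-image perturbation strength: the positive half of a zero-mean normal
    with standard deviation sigma (in pixel units). Taking the absolute value
    is a simple stand-in for the truncation described in the text."""
    rng = rng or np.random.default_rng()
    return np.abs(sigma * rng.standard_normal(batch_size))
```

Sampling a fresh strength for every image exposes the model to the whole range of perturbation magnitudes during training, rather than a single fixed one.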
Appendix C Jointly Assessing Adversarial Attack Robustness and Detection
In order to investigate detection of adversarial attacks, it is necessary to discuss how to assess the effectiveness of an adversarial attack in a scenario where detection of the attack is possible. Previous work on detection of adversarial examples gongdetection2017 ; grossedetection2017 ; metzendetecting2017 ; carlinidetected ; galadversarial assesses the performance of detection methods separately from whether an adversarial attack was successful, using the standard measures of adversarial success and detection performance. However, in a real deployment scenario, an attack can only be considered successful if it both affects the predictions and evades detection. Here, we develop a measure of performance to assess this. For the purposes of this discussion the adversarial generation process will be defined to either yield a successful adversarial attack $\hat{\mathbf{x}}$ or an empty set $\emptyset$. In a standard scenario, where there is no detection, the efficacy of an adversarial attack on a model, given an evaluation dataset, can be summarized via the success rate of the attack:
(28)  $\mathcal{S} = \dfrac{1}{N} \sum_{i=1}^{N} \mathcal{I}\big[ \hat{\mathbf{x}}^{(i)} \neq \emptyset \big]$
Typically the success rate is plotted against the total maximum perturbation $\epsilon$ from the original image, measured as either the $L_1$, $L_2$ or $L_\infty$ distance from the original image. Consider using a threshold-based detection scheme where a sample is labelled 'positive' if some measure of uncertainty $\mathcal{U}(\mathbf{x})$, such as entropy or mutual information, is less than a threshold $T$, and 'negative' if it is higher than the threshold:
(29)  $\mathcal{I}_{d}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathcal{U}(\mathbf{x}) < T \\ 0 & \text{if } \mathcal{U}(\mathbf{x}) \geq T \end{cases}$
The performance of such a scheme can be evaluated at every threshold value using the true positive rate $\mathrm{tpr}(T)$ and the false positive rate $\mathrm{fpr}(T)$:
(30)  $\mathrm{tpr}(T) = \dfrac{\mathrm{TP}(T)}{\mathrm{TP}(T) + \mathrm{FN}(T)}, \qquad \mathrm{fpr}(T) = \dfrac{\mathrm{FP}(T)}{\mathrm{FP}(T) + \mathrm{TN}(T)}$
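The threshold-based detection scheme and its true and false positive rates can be sketched as follows, treating natural inputs as the positive class and adversarial inputs as the negative class:

```python
import numpy as np

def detection_rates(u_natural, u_adversarial, threshold):
    """Threshold detector: a sample is labelled 'positive' (accepted) when its
    uncertainty is below the threshold. Returns (tpr, fpr) at this threshold,
    with natural inputs as positives and adversarial inputs as negatives."""
    tpr = float(np.mean(u_natural < threshold))       # natural inputs correctly accepted
    fpr = float(np.mean(u_adversarial < threshold))   # adversarial inputs that evade detection
    return tpr, fpr
```

Sweeping the threshold over the observed range of uncertainties and collecting these pairs traces out the ROC curve described in the text.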
The whole range of such trade-offs can be visualized using a Receiver Operating Characteristic (ROC) curve, and the quality of the trade-off can be summarized using the area under the ROC curve. However, a standard ROC curve does not account for situations where the process fails to produce a successful attack. In fact, if an adversarial attack is made against a system which has a detection scheme, it can only be considered successful if it both affects the predictions and evades detection. This condition can be summarized in the following indicator function:
(31)  $\mathcal{I}_{adv}\big(\hat{\mathbf{x}}\big) = \mathcal{I}\big[ \hat{\mathbf{x}} \neq \emptyset \big] \cdot \mathcal{I}\big[ \mathcal{U}(\hat{\mathbf{x}}) < T \big]$
Given this indicator function, a new false positive rate can be defined as:
(32)  $\mathrm{fpr}_{adv}(T) = \dfrac{1}{N} \sum_{i=1}^{N} \mathcal{I}_{adv}\big( \hat{\mathbf{x}}^{(i)} \big)$
This false positive rate can now be seen as a new Joint Success Rate which measures how many attacks were both successfully generated and evaded detection, given the threshold of the detection scheme. The Joint Success Rate can be plotted against the standard true positive rate on an ROC curve to visualize the possible trade-offs. One possible operating point is where the false positive rate is equal to the false negative rate, also known as the Equal Error-Rate point:
(33)  $\mathrm{fpr}_{adv}(T_{EER}) = \mathrm{fnr}(T_{EER}) = 1 - \mathrm{tpr}(T_{EER})$
Throughout this work the EER false positive rate will be quoted as the Joint Success Rate.
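The Joint Success Rate and the Equal Error-Rate threshold can be computed with a simple grid search over thresholds; this is an illustrative sketch, not the evaluation code used for the experiments:

```python
import numpy as np

def joint_success_rate(attack_generated, u_adversarial, threshold):
    """Fraction of attacks that were both successfully generated (changed the
    prediction) and evaded the uncertainty-threshold detector."""
    return float(np.mean(attack_generated & (u_adversarial < threshold)))

def eer_threshold(u_natural, u_adversarial, grid):
    """Scan a grid of thresholds and return the one where the false positive
    rate is closest to the false negative rate (the Equal Error-Rate point)."""
    fpr = np.array([np.mean(u_adversarial < t) for t in grid])
    fnr = np.array([np.mean(u_natural >= t) for t in grid])
    return float(grid[np.argmin(np.abs(fpr - fnr))])
```

Quoting the Joint Success Rate at the EER threshold, as done in this work, fixes a single operating point so that results are comparable across models and attacks.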
Appendix D Additional Adversarial Attack Detection Experiments
In this appendix additional experiments on adversarial attack detection are presented. In figure 6 adaptive white-box adversarial attacks generated by iteratively minimizing the KL-divergence between the original and target (permuted) categorical distributions are compared to attacks generated by minimizing the KL-divergence between the predicted and permuted Dirichlet distributions. Performance is assessed only against Prior Network models. The results show that KL PMF attacks are more successful at switching the prediction to the desired class and at evading detection. This could be due to the fact that Dirichlet distributions which are sharp at different corners of the simplex have limited common support, making the optimization of the KL-divergence between them more difficult than the KL-divergence between categorical distributions.
Results in figure 7 show that PGD Momentum Iterative attacks which minimize the loss are marginally more successful than the version of these attacks. However, it is necessary to consider appropriate adaptation of the C&W attacks to the loss functions considered in this work for a more aggressive set of attacks.