Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness

05/31/2019 ∙ by Andrey Malinin, et al. ∙ University of Cambridge

Ensemble approaches for uncertainty estimation have recently been applied to the tasks of misclassification detection, out-of-distribution input detection and adversarial attack detection. Prior Networks have been proposed as an approach to efficiently emulating an ensemble of models by parameterising a Dirichlet prior distribution over output distributions. These models have been shown to outperform ensemble approaches, such as Monte-Carlo Dropout, on the task of out-of-distribution input detection. However, scaling Prior Networks to complex datasets with many classes is difficult using the training criteria originally proposed. This paper makes two contributions. Firstly, we show that the appropriate training criterion for Prior Networks is the reverse KL-divergence between Dirichlet distributions. Using this loss we successfully train Prior Networks on image classification datasets with up to 200 classes and improve out-of-distribution detection performance. Secondly, taking advantage of the new training criterion, this paper investigates using Prior Networks to detect adversarial attacks. It is shown that the construction of successful adaptive whitebox attacks, which affect the prediction and evade detection, against Prior Networks trained on CIFAR-10 and CIFAR-100 takes a greater amount of computational effort than against standard neural networks, adversarially trained neural networks and dropout-defended networks.




1 Introduction

Neural Networks (NNs) have become the dominant approach to addressing computer vision (CV) Girshick2015 ; vgg ; videoprediction , natural language processing (NLP) embedding1 ; embedding2 ; mikolov-rnn , speech recognition (ASR) dnnspeech ; DeepSpeech and bio-informatics Caruana2015 ; dnarna tasks. Notable progress has recently been made on predictive uncertainty estimation for Deep Learning through the definition of baselines, tasks and metrics

baselinedetecting , and the development of practical methods for estimating uncertainty using ensemble methods, such as Monte-Carlo Dropout Gal2016Dropout and Deep Ensembles deepensemble2017 . Uncertainty estimates derived from ensemble approaches have been successfully applied to the tasks of detecting misclassifications and out-of-distribution inputs, and have also been investigated for adversarial attack detection carlini-detected ; gal-adversarial . However, ensembles can be computationally expensive and it is hard to control their behaviour. Recently, malinin-pn-2018 proposed Prior Networks - a new approach to modelling uncertainty which has been shown to outperform Monte-Carlo dropout on a range of tasks. Prior Networks parameterize a Dirichlet prior over output distributions, which allows them to emulate an ensemble of models using a single network, whose behaviour can be explicitly controlled via choice of training data. In malinin-pn-2018 , Prior Networks are trained using the forward KL-divergence between the model and a target Dirichlet distribution. It is, however, necessary to use auxiliary losses, such as the cross-entropy, to yield competitive classification performance. Furthermore, it is also difficult to train Prior Networks using this criterion on complex datasets with many classes. In this work we show that the forward KL-divergence (KL) is an inappropriate optimization criterion and instead propose to train Prior Networks with the reverse KL-divergence (RKL) between the model and a target Dirichlet. In sections 3 and 4 of this paper it is shown, both theoretically and empirically on synthetic data, that this loss yields the desired behaviours of a Prior Network and does not require auxiliary losses. In section 5 Prior Networks are successfully trained on a range of image classification tasks using the proposed criterion without loss of classification performance. 
It is also shown that these models yield better out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets than Prior Networks trained using forward KL-divergence. An interesting application of uncertainty estimation is the detection of adversarial attacks, which are small perturbations to the input that are almost imperceptible to humans, yet which drastically affect the predictions of the neural network szegedy-adversarial . Adversarial attacks are a serious security concern, as many different attacks exist and are quite easy to construct goodfellow-adversarial ; BIM ; MIM ; carlini-robustness ; papernot-blackbox ; papernot-limitation-2016 ; liu-delving-2016 ; madry2017towards . At the same time, while it is possible to improve the robustness of a network to adversarial attacks using adversarial training szegedy-adversarial and adversarial distillation papernote-distllaition-2016 , it is still possible to craft successful adversarial attacks against these networks carlini-robustness . Instead of considering robustness to adversarial attacks, carlini-detected investigates the detection of adversarial attacks and shows that they can be detected using a range of approaches. While adaptive attacks can be crafted to successfully defeat the proposed detection schemes, carlini-detected singles out detection using uncertainty measures derived from Monte-Carlo dropout as being more challenging to overcome with adaptive attacks. Thus, in this work we investigate the detection of adversarial attacks using Prior Networks, which have previously outperformed Monte-Carlo dropout on other tasks. Using the greater degree of control over the behaviour of Prior Networks which the reverse KL-divergence loss affords, Prior Networks are trained to predict the correct class on adversarial inputs, but yield a higher measure of uncertainty than on natural inputs. 
Effectively, this becomes a generalization of adversarial training szegedy-adversarial which improves both the robustness of the model to adversarial attacks and also allows them to be detected. In section 6 it is shown that on the CIFAR-10 and CIFAR-100 datasets it is more computationally challenging to construct adaptive adversarial attacks against Prior Networks than against standard neural networks, adversarially trained neural networks and MC-dropout defended networks. This is because, like ensembles, Prior Networks yield measures of uncertainty derived from distributions over output distributions. Consequently, adaptive adversarial attacks need to satisfy more constraints in order to attack Prior Networks and evade detection. Thus, the two main contributions of this paper are the following: a new reverse

KL-divergence loss function which yields the desired behaviour of Prior Networks and allows them to scale to more complex datasets; and the application of Prior Networks to adversarial attack detection, enabled by the proposed training criterion, where it is shown that whitebox adaptive attacks are more computationally expensive to construct for Prior Networks than for baseline models.

2 Prior Networks

An ensemble of models can be interpreted as a set of output distributions drawn from an implicit conditional distribution over output distributions. A Prior Network $p(\boldsymbol{\pi}|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}})$, where $\boldsymbol{\pi}$ denotes the parameters of a categorical distribution, is a neural network which explicitly parametrizes a prior distribution over output distributions. This effectively allows a Prior Network to emulate an ensemble and yield the same measures of uncertainty galthesis ; mutual-information , but in closed form and without sampling.


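A Prior Network for classification typically parameterizes the Dirichlet distribution (eqn 2), which is the conjugate prior to the categorical, due to its tractable analytic properties. Alternate choices of distribution, such as a mixture of Dirichlets or the Logistic-Normal, are possible. The Dirichlet distribution is defined as:

$$\mathrm{Dir}(\boldsymbol{\pi};\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{c=1}^{K}\Gamma(\alpha_c)} \prod_{c=1}^{K} \pi_c^{\alpha_c - 1}, \qquad \alpha_c > 0, \qquad \alpha_0 = \sum_{c=1}^{K} \alpha_c \tag{2}$$

where $\Gamma(\cdot)$ is the gamma function. The Dirichlet distribution is parameterized by its concentration parameters $\boldsymbol{\alpha}$, where $\alpha_0$, the sum of all $\alpha_c$, is called the precision of the Dirichlet distribution. Higher values of $\alpha_0$ lead to sharper, more confident distributions. The predictive distribution of a Prior Network is given by the expected categorical distribution under the conditional Dirichlet prior:

$$P(\omega_c|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}}) = \mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}})}\big[P(\omega_c|\boldsymbol{\pi})\big] = \frac{\hat{\alpha}_c}{\hat{\alpha}_0} \tag{3}$$
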
The desired behaviors of a Prior Network, as described in malinin-pn-2018 , can be visualized on a simplex in figure 1. Here, figure 1:a describes confident behavior (low-entropy prior focused on low-entropy output distributions), figure 1:b describes uncertainty due to severe class overlap (data uncertainty) and figure 1:c describes the behaviour for an out-of-distribution input (knowledge uncertainty).


Figure 1: Desired Behaviors of a Dirichlet distribution over categorical distributions. Panels: (a) Low uncertainty; (b) High data uncertainty; (c) Out-of-distribution.
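As a small illustration of the precision and the predictive distribution of a Dirichlet, the sketch below evaluates both for three hand-picked concentration vectors intended to mimic the three behaviours in figure 1. The concentration values and the function names are illustrative assumptions, not values taken from the paper.

```python
def precision(alphas):
    """Precision alpha_0: the sum of all concentration parameters."""
    return sum(alphas)


def predictive_distribution(alphas):
    """Expected categorical distribution under Dir(alphas): alpha_c / alpha_0."""
    alpha0 = precision(alphas)
    return [a / alpha0 for a in alphas]


# Assumed, illustrative concentration parameters for a 3-class problem.
behaviours = {
    "confident": [100.0, 1.0, 1.0],
    "data uncertainty": [50.0, 50.0, 50.0],
    "out-of-distribution": [1.0, 1.0, 1.0],
}

for name, alphas in behaviours.items():
    probs = [round(p, 3) for p in predictive_distribution(alphas)]
    print(name, "| precision:", precision(alphas), "| prediction:", probs)
```

Note that the second and third cases give the same predictive distribution and differ only in precision, which is why the predictive distribution alone cannot separate data uncertainty from an out-of-distribution input.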

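Given a Prior Network which yields the desired behaviours, it is possible to derive measures of uncertainty in the prediction by considering the mutual information between the class label $y$ and the categorical parameters $\boldsymbol{\pi}$, given by the following expression:

$$\underbrace{\mathcal{I}[y,\boldsymbol{\pi}|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}}]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\big[\mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}})}[P(y|\boldsymbol{\pi})]\big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x}^*;\hat{\boldsymbol{\theta}})}\big[\mathcal{H}[P(y|\boldsymbol{\pi})]\big]}_{\text{Expected Data Uncertainty}} \tag{4}$$

The given expression allows total uncertainty, given by the entropy of the predictive distribution, to be decomposed into data uncertainty and knowledge uncertainty, which are the two sources of uncertainty. Data uncertainty arises due to class overlap in the data, which is the equivalent of noise for classification problems. Knowledge uncertainty, also known as epistemic uncertainty Gal2016Dropout or distributional uncertainty malinin-pn-2018 , arises due to the model's lack of understanding or knowledge about the input. In other words, knowledge uncertainty arises due to a mismatch between the training and the test data.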
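For a Dirichlet, the three quantities in this decomposition have closed forms. The following is a minimal sketch, assuming the standard identity $\mathbb{E}[-\pi_c \ln \pi_c] = -\frac{\alpha_c}{\alpha_0}\big(\psi(\alpha_c + 1) - \psi(\alpha_0 + 1)\big)$ for the expected data uncertainty; the `digamma` helper is a pure-Python approximation written for this example, and the concentration values are assumed.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))


def entropy(probs):
    """Entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dirichlet_uncertainties(alphas):
    """Return (total, expected data, mutual information) for Dir(alphas)."""
    alpha0 = sum(alphas)
    mean = [a / alpha0 for a in alphas]
    total = entropy(mean)
    expected_data = -sum(
        m * (digamma(a + 1.0) - digamma(alpha0 + 1.0)) for m, a in zip(mean, alphas)
    )
    return total, expected_data, total - expected_data


# Assumed concentrations: sharp, in-domain class overlap, and flat (out-of-distribution).
for alphas in ([100.0, 1.0, 1.0], [50.0, 50.0, 50.0], [1.0, 1.0, 1.0]):
    print(alphas, [round(v, 4) for v in dirichlet_uncertainties(alphas)])
```

In this sketch the flat Dirichlet and the high-precision central Dirichlet share the same total uncertainty, but only the flat one has a large mutual information.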

3 Forward and Reverse KL-Divergence Losses

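The original training criterion for Prior Networks is the forward KL-divergence between the model $p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})$ and a target Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta}^{(c)})$, where the target concentration parameters $\boldsymbol{\beta}^{(c)}$ depend on the class $c$.

$$\mathcal{L}^{KL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta) = \sum_{c=1}^{K} \mathcal{I}(y=\omega_c)\,\mathrm{KL}\big[\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta}^{(c)})\,\|\,p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big] \tag{5}$$

The target concentration parameters are set as follows:

$$\beta_k^{(c)} = \begin{cases} \beta + 1 & \text{if } k = c \\ 1 & \text{if } k \neq c \end{cases} \tag{6}$$

This criterion is then jointly optimized on in-domain data and out-of-domain training data as follows:

$$\mathcal{L}(\boldsymbol{\theta};\gamma,\beta_{in},\beta_{out}) = \mathbb{E}_{p_{in}(\boldsymbol{x},y)}\big[\mathcal{L}^{KL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta_{in})\big] + \gamma\,\mathbb{E}_{p_{out}(\boldsymbol{x},y)}\big[\mathcal{L}^{KL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta_{out})\big]$$

where $\gamma$ is the out-of-distribution loss weight. In-domain, $\beta_{in}$ should take on a large value (table 4 lists target-class concentrations of 1e2 and 1e3), so that the concentration is high only in the corner corresponding to the target class, and low elsewhere. Note that the concentration parameters have to be strictly positive, so it is not possible to set the rest of the concentration parameters to 0. Instead, they are set to one, which also provides a small degree of smoothing. Out-of-domain, $\beta_{out} = 0$, which results in a flat Dirichlet distribution. However, there is a significant issue with this criterion. Consider taking the expectation of equation 5 with respect to the empirical distribution $\hat{p}(\boldsymbol{x},y)$:

$$\mathbb{E}_{\hat{p}(\boldsymbol{x},y)}\big[\mathcal{L}^{KL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta)\big] = \mathbb{E}_{\hat{p}(\boldsymbol{x})}\Big[\mathrm{KL}\Big[\sum_{c=1}^{K}\hat{p}(\omega_c|\boldsymbol{x})\,\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta}^{(c)})\,\Big\|\,p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\Big]\Big] + \text{const}$$
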
In expectation, this loss is the KL-divergence between the model and a mixture of Dirichlet distributions, which has a mode in each corner of the simplex. When the level of data uncertainty is low, this is not a problem, as there will only be a single significant mode. However, when there is a significant amount of data uncertainty the target distribution will be multi-modal. As the forward KL-divergence is zero-avoiding, it will drive the model to spread itself over each mode, effectively ’inverting’ the Dirichlet distribution and driving the precision to a low value. This is an undesirable behaviour, as the model should instead yield a distribution with a single high-precision mode at the center of the simplex, as shown in figure 1b. Furthermore, this can compromise predictive performance. The main issue with the KL-divergence loss is that the target distributions are arithmetically summed in expectation. This can be avoided by instead minimizing the reverse KL-divergence between the target distribution and the model:

$$\mathcal{L}^{RKL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta) = \sum_{c=1}^{K} \mathcal{I}(y=\omega_c)\,\mathrm{KL}\big[p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\,\|\,\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta}^{(c)})\big]$$


By taking the expectation of this criterion with respect to the empirical distribution, it becomes the reverse KL-divergence between the model and a geometric mixture of target Dirichlet distributions:

$$\mathbb{E}_{\hat{p}(\boldsymbol{x},y)}\big[\mathcal{L}^{RKL}(y,\boldsymbol{x},\boldsymbol{\theta};\beta)\big] = \mathbb{E}_{\hat{p}(\boldsymbol{x})}\Big[\mathrm{KL}\big[p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\,\|\,\mathrm{Dir}(\boldsymbol{\pi}|\bar{\boldsymbol{\beta}})\big]\Big] + \text{const}, \qquad \bar{\boldsymbol{\beta}} = \sum_{c=1}^{K}\hat{p}(\omega_c|\boldsymbol{x})\,\boldsymbol{\beta}^{(c)}$$


A geometric mixture of Dirichlet distributions results in a standard Dirichlet distribution whose concentration parameters are an arithmetic mixture of the target concentration parameters for each class. When there is low data uncertainty this loss simply yields the reverse KL-divergence to a sharp Dirichlet at a particular corner. However, when the data uncertainty is significant, this loss minimizes the reverse KL-divergence to a Dirichlet with a single mode close to the center of the simplex. This is exactly the behaviour which the model should learn when there is an in-domain input in a region of significant data uncertainty. Thus, the target distribution is always a standard uni-modal Dirichlet. Furthermore, as a consequence of this loss, the concentration parameters are appropriately interpolated on the boundary of the in-domain and out-of-distribution regions, where the degree of interpolation depends on the OOD loss weight $\gamma$. Finally, it is necessary to point out that the reverse KL-divergence is commonly used in variational inference murphy and in training variational auto-encoders vae . It is interesting to further analyze the properties of the reverse KL-divergence by decomposing it into the reverse cross-entropy and the negative differential entropy:

$$\mathrm{KL}\big[p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\,\|\,\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta})\big] = \underbrace{\mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[-\ln \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta})\big]}_{\text{Reverse Cross-Entropy}} - \underbrace{\mathcal{H}\big[p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})\big]}_{\text{Differential Entropy}}$$
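Let us consider the reverse cross-entropy term in more detail, up to additive constants which do not depend on the model. Here $\psi$ is the digamma function, $\hat{\alpha}_c$ are the concentration parameters predicted by the model and $\hat{\alpha}_0$ is their sum:

$$\mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[-\ln \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta})\big] = \sum_{c=1}^{K}(\beta_c - 1)\big(\psi(\hat{\alpha}_0) - \psi(\hat{\alpha}_c)\big) + \text{const}$$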


When the target concentration parameters are defined as in equation 6, the form of the reverse cross-entropy will be:

$$\mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}\big[-\ln \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\beta}^{(c)})\big] = \beta\,\big(\psi(\hat{\alpha}_0) - \psi(\hat{\alpha}_c)\big) + \text{const}$$


This expression for the reverse cross entropy is a scaled version of an upper-bound to the cross entropy between discrete distributions, obtained via Jensen’s inequality, which was proposed in a parallel work evidential that investigated a model similar to Dirichlet Prior networks:

$$-\ln P(\omega_c|\boldsymbol{x};\boldsymbol{\theta}) = -\ln \mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}[\pi_c] \;\leq\; \mathbb{E}_{p(\boldsymbol{\pi}|\boldsymbol{x};\boldsymbol{\theta})}[-\ln \pi_c] = \psi(\hat{\alpha}_0) - \psi(\hat{\alpha}_c)$$


The form of this upper-bound loss is identical to the standard negative log-likelihood loss, except with digamma functions instead of natural logarithms. This loss can be analyzed further by considering the following asymptotic series approximation to the digamma function:

$$\psi(x) = \ln x - \frac{1}{2x} + O\!\left(\frac{1}{x^2}\right) \approx \ln x - \frac{1}{2x}$$


Given this approximation, it is easy to show that this upper-bound loss is equal to the negative log-likelihood plus an extra term which drives the concentration parameter to be as large as possible:

$$\psi(\hat{\alpha}_0) - \psi(\hat{\alpha}_c) \approx -\ln\frac{\hat{\alpha}_c}{\hat{\alpha}_0} + \frac{1}{2\hat{\alpha}_c} - \frac{1}{2\hat{\alpha}_0} = -\ln P(\omega_c|\boldsymbol{x};\boldsymbol{\theta}) + \frac{1}{2\hat{\alpha}_0}\left(\frac{1}{P(\omega_c|\boldsymbol{x};\boldsymbol{\theta})} - 1\right)$$


Thus, the reverse KL-divergence between Dirichlet distributions, given setting of target concentration parameters via equation 6, yields the following expression:


Clearly, this expression is equal to the standard negative log-likelihood loss for discrete distributions, weighted by , plus a term which drives the precision of the Dirichlet to be , where is the number of classes.
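Two claims from this section can be checked numerically: that the expected reverse KL loss differs from the reverse KL to a single Dirichlet with arithmetically mixed concentrations only by a model-independent constant, and that the digamma upper bound is close to the negative log-likelihood plus a small positive term. The sketch below uses the standard closed-form KL-divergence between two Dirichlets; the helper names, the class posterior and all concentration values are assumptions made for illustration.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))


def dirichlet_kl(p_alphas, q_alphas):
    """Closed-form KL[Dir(p_alphas) || Dir(q_alphas)]."""
    p0, q0 = sum(p_alphas), sum(q_alphas)
    value = math.lgamma(p0) - math.lgamma(q0)
    for p, q in zip(p_alphas, q_alphas):
        value += math.lgamma(q) - math.lgamma(p) + (p - q) * (digamma(p) - digamma(p0))
    return value


def target_concentrations(label, num_classes, beta):
    """Target of equation 6: beta + 1 for the labelled class, 1 elsewhere."""
    return [beta + 1.0 if k == label else 1.0 for k in range(num_classes)]


def expected_reverse_kl(model_alphas, class_posterior, beta):
    """Reverse KL loss averaged over labels drawn from class_posterior."""
    k = len(model_alphas)
    return sum(
        w * dirichlet_kl(model_alphas, target_concentrations(c, k, beta))
        for c, w in enumerate(class_posterior)
    )


def mixed_target(class_posterior, beta):
    """Arithmetic mixture of the per-class targets: 1 + beta * p(class)."""
    return [1.0 + beta * w for w in class_posterior]


def nll_and_upper_bound(model_alphas, label):
    """Return (NLL, digamma upper bound, NLL plus the first-order correction)."""
    alpha0 = sum(model_alphas)
    nll = -math.log(model_alphas[label] / alpha0)
    upper = digamma(alpha0) - digamma(model_alphas[label])
    approx = nll + 0.5 / model_alphas[label] - 0.5 / alpha0
    return nll, upper, approx


posterior, beta = [0.5, 0.5, 0.0], 100.0  # assumed: two heavily overlapping classes
for model in ([51.0, 51.0, 1.0], [0.5, 0.5, 0.5]):
    gap = expected_reverse_kl(model, posterior, beta) - dirichlet_kl(model, mixed_target(posterior, beta))
    print(model, "gap to mixed-target RKL:", round(gap, 6))
print(nll_and_upper_bound([20.0, 5.0, 5.0], 0))
```

Both candidate models should report the same gap, which is what makes the single uni-modal Dirichlet with mixed concentrations the effective target of the expected loss.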

4 Experiments on Synthetic Data

The previous section investigated the theoretical properties of forward and reverse KL-divergence training criteria for Prior Networks. In this section these criteria are assessed empirically by using them to train Prior Networks on the artificial high-uncertainty 3-class dataset (described in appendix A) introduced in malinin-pn-2018 . In these experiments, the out-of-distribution training data was sampled such that it forms a thin shell around the training data. The target Dirichlet concentration parameters were constructed as described in equation 6, with and . The in-domain and out-of-distribution losses were equally weighted when training with the forward KL-divergence. However, it was found to be necessary to weight the out-of-distribution loss 10 times as much as the in-domain loss when using the reverse KL-divergence.


Figure 2: Comparison of measures of uncertainty derived from Prior Networks trained with forward and reverse KL-divergence. Measures of uncertainty are derived via equation 4. Panels: (a) Total Uncertainty - KL; (b) Data Uncertainty - KL; (c) Mutual Information - KL; (d) Total Uncertainty - RKL; (e) Data Uncertainty - RKL; (f) Mutual Information - RKL.

Figure 2 depicts the total uncertainty, expected data uncertainty and mutual information, which is a measure of knowledge uncertainty, derived using equation 4 from Prior Networks trained using both criteria. By comparing figures 2a and 2d it is clear that a Prior Network trained using forward KL-divergence over-estimates total uncertainty in-domain, as the total uncertainty is equally high along the decision boundaries, in the region of class overlap and out-of-domain. The Prior Network trained using the reverse KL-divergence, on the other hand, yields an estimate of total uncertainty which better reflects the structure of the dataset. Figure 2b shows that the expected data uncertainty is altogether incorrectly estimated by the Prior Network trained via forward KL-divergence, as it is uniform over the entire in-domain region. As a result, the mutual information is higher in-domain along the decision boundaries than out-of-domain. In contrast, figures 2e and 2f show that the measures of uncertainty provided by a Prior Network trained using the reverse KL-divergence decompose correctly: data uncertainty is highest in regions of class overlap, while mutual information is low in-domain and high out-of-domain. Thus, these experiments support the analysis in the previous section, and illustrate how the reverse KL-divergence is a more suitable optimization criterion than forward KL-divergence.

5 Image Classification Experiments

Having evaluated the forward and reverse KL-divergence losses on a synthetic dataset in the previous section, we now evaluate these losses on a range of image classification datasets. The training configurations are described in appendix B. Table 1 presents the classification error rates of standard DNNs, an ensemble of 5 DNNs deepensemble2017 , and Prior Networks trained using both the forward and reverse KL-divergence losses. From table 1 it is clear that Prior Networks trained using forward KL-divergence (PN-KL) achieve increasingly worse classification performance as the datasets become more complex and have a larger number of classes. At the same time, Prior Networks trained using the reverse KL-divergence loss (PN-RKL) have similar error rates as ensembles and standard DNNs. Note that in these experiments no auxiliary losses were used.

Dataset        DNN    PN-KL   PN-RKL   ENSM
MNIST          0.5    0.6     0.5      0.5
SVHN           4.3    5.7     4.2      3.3
CIFAR-10       8.0    14.7    7.5      6.6
CIFAR-100      30.4   -       28.1     26.9
TinyImageNet   41.7   -       40.3     36.9

Table 1: Mean classification performance (% Error) across 5 random initializations. Error rates for PN-KL on CIFAR-100 and TinyImageNet are not presented as the models failed to train on these datasets using the forward KL-divergence.

Table 2 presents the out-of-distribution detection performance of Prior Networks trained on CIFAR-10 and CIFAR-100 cifar using the forward and reverse KL-divergences. Prior Networks trained on CIFAR-10 use CIFAR-100 as OOD training data, while Prior Networks trained on CIFAR-100 use TinyImageNet tinyimagenet as OOD training data. Performance is assessed using area under an ROC curve (AUROC) in the same fashion as in malinin-pn-2018 ; baselinedetecting . The results on CIFAR-10 show that PN-RKL consistently yields better performance than PN-KL and the ensemble on all OOD test datasets (SVHN, LSUN and TinyImageNet). The results using models trained on CIFAR-100 show that Prior Networks are capable of out-performing the ensembles when evaluated against the LSUN and SVHN datasets. However, Prior Networks have difficulty distinguishing between the CIFAR-10 and CIFAR-100 test sets. This represents a limitation of both the classification model and the OOD training data, rather than of the training criterion. Improving classification performance of Prior Networks on CIFAR-100, which improves understanding of what is 'in-domain', and using a more appropriate OOD training dataset, which provides a better contrast, is likely to improve OOD detection performance.

          Trained on CIFAR-10             Trained on CIFAR-100
Model     SVHN   LSUN   TinyImageNet      SVHN   LSUN    CIFAR-10
ENSM      89.5   93.2   90.3              78.9   85.6    76.5
PN-KL     97.8   91.6   92.4              -      -       -
PN-RKL    98.2   95.7   95.7              84.8   100.0   57.8

Table 2: Out-of-domain detection results (mean % AUROC across 5 rand. inits) using mutual information (eqn. 4) derived from models trained on CIFAR-10 and CIFAR-100.
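The AUROC used for detection can be read as the probability that a randomly chosen out-of-distribution input receives a higher uncertainty score than a randomly chosen in-domain input. A minimal sketch of this rank-based computation follows; treating OOD inputs as the positive class is an assumption of this example, and the mutual information scores are invented for illustration.

```python
def auroc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical mutual information scores (not results from the paper).
ood_scores = [0.31, 0.27, 0.12, 0.40]
in_domain_scores = [0.02, 0.05, 0.15, 0.01]

print("AUROC (%):", 100.0 * auroc(ood_scores, in_domain_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based rank computation would be preferable for full test sets.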

6 Adversarial Attack Detection

Having developed a new training criterion for Prior Networks which allows them to scale to more complex datasets and gives greater control over their behaviour, we now investigate using measures of uncertainty derived from Prior Networks to detect adversarial attacks. Detection of adversarial attacks via measures of uncertainty was previously studied in carlini-detected , where it was shown that Monte-Carlo dropout ensembles yield measures of uncertainty which are more challenging to attack than other considered methods. Like Monte-Carlo dropout, Prior Networks yield rich measures of uncertainty derived from distributions over distributions. This means that for adversarial attacks to both affect the prediction and evade detection, they must be located in a region of input space within the decision boundary of the desired target class, and where both the relative magnitudes of the logits (distribution over classes) and absolute magnitude of the logits (distribution over distributions) are the same as for natural inputs. Clearly, this constrains the space of possible solutions to the optimization problem which yields adversarial attacks. Furthermore, the behavior of Prior Networks can be explicitly controlled for particular input regions via choice of out-of-distribution training data, for example adversarial attacks. This further constrains the space of solutions to the optimization problem which yields detection-evading adversarial attacks.

Thus, in the following experiments Prior Networks are trained on adversarially perturbed inputs as out-of-distribution data. The models are trained to both yield the correct prediction and high measures of uncertainty for adversarially modified inputs. During training, targeted adversarial attacks are generated via the Fast Gradient Sign Method (FGSM) szegedy-adversarial which minimizes the reverse KL-divergence (eqn 10) between the Prior Network and a sharp Dirichlet distribution () focused on a randomly chosen class which is not the true class of the training image. The Prior Network is then jointly trained to yield either a sharp or wide Dirichlet distribution at the appropriate corner of the simplex for natural or adversarial data, respectively. The target concentration parameters are set using equation 6, where for natural and for adversarial data. This approach can be seen as a generalization of adversarial training szegedy-adversarial ; madry2017towards , where models are trained to predict the correct class on a set of adversarially perturbed inputs. The difference is that here we are training the model to yield a particular behaviour of an entire distribution over output distributions, rather than simply making sure that the decision boundaries are correct in regions of input space which correspond to adversarial attacks.

As discussed in carlini-detected ; carlini-evaluating , approaches to detecting adversarial attacks need to be evaluated against adaptive whitebox attacks which have full knowledge of the detection scheme and actively seek to bypass it. Here, we consider two types of targeted, iterative PGD-MIM MIM ; madry2017towards attacks which aim to switch the prediction to a target class while leaving the measures of uncertainty derived from Prior Networks or DNNs (entropy, mutual information) unchanged. The first approach is to simply permute the predicted distribution over classes and swap the probabilities of the max and target classes. The loss function minimized by the adversarial generation process will be the forward KL-divergence between the predicted distribution over class labels and the target permuted distribution . For Prior Networks, the equivalent approach would be to permute the concentration parameters and to minimize forward KL divergence to the permuted target Dirichlet distribution:


However, it was found (results are described in appendix D) that yields more aggressive attacks than , which is why only attacks generated via are considered here. The target for these attacks is always the second most likely class, as that represents the least 'unnatural' perturbation of the outputs. In the following set of experiments Prior Networks are trained on either the CIFAR-10 or CIFAR-100 datasets cifar using the procedure defined above. Details of the experimental configuration can be found in appendix B. The baseline models are an undefended DNN and a DNN trained using standard adversarial training (DNN-ADV). For these models uncertainty is estimated via the entropy of the predictive posterior. Additionally, estimates of mutual information (knowledge uncertainty) are derived via a Monte-Carlo dropout ensemble generated from each of these models. Similarly, Prior Networks also use the mutual information (eqn. 4) for adversarial attack detection. Performance is assessed via the Success Rate, AUROC and Joint Success Rate (JSR). For the ROC curves considered here the true positive rate is computed using natural examples, while the false-positive rate is computed using only successful adversarial attacks (this may result in the minimum AUROC being a little greater than 50 if the success rate is not 100%, as is the case with the MCDP AUROC in figure 3). The JSR, described in greater detail in appendix C, is the equal error rate where the false positive rate equals the false negative rate, and allows joint assessment of adversarial robustness and detection.
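The permutation that defines the attack target can be sketched directly: swap the largest entry with the second largest, so that the prediction changes while the multiset of values, and hence the entropy or precision, stays the same. The function names are hypothetical, the example vectors are invented, and the KL direction shown (target first, prediction second) is this sketch's reading of "forward KL-divergence"; the gradient-based attack itself is not shown.

```python
import math


def permute_to_second_class(values):
    """Swap the largest entry with the second largest; return (permuted, target index)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top, second = order[0], order[1]
    permuted = list(values)
    permuted[top], permuted[second] = permuted[second], permuted[top]
    return permuted, second


def categorical_kl(p, q):
    """KL[p || q] between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


predicted = [0.7, 0.2, 0.1]  # assumed softmax output on a natural input
target, target_class = permute_to_second_class(predicted)
print("target distribution:", target, "target class:", target_class)
print("initial attack loss:", round(categorical_kl(target, predicted), 4))

# The same permutation applied to assumed Dirichlet concentration parameters.
print(permute_to_second_class([20.0, 5.0, 2.0])[0])
```

An attack that drives this loss to zero would change the predicted class while reproducing the original confidence profile, which is why entropy-based detection alone is easy to evade.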


Figure 3: Adaptive attack detection performance in terms of mean Success Rate, % AUROC and Joint Success Rate (JSR) across 5 random inits. The bound on the adversarial perturbation is 30 pixels. Panels: (a) C10 Whitebox Success Rate; (b) C10 Whitebox AUROC; (c) C10 Whitebox JSR; (d) C100 Whitebox Success Rate; (e) C100 Whitebox AUROC; (f) C100 Whitebox JSR; (g) PN Blackbox Success Rate; (h) PN Blackbox AUROC; (i) PN Blackbox JSR.

The results presented in figure 3 show that on both the CIFAR-10 and CIFAR-100 datasets whitebox attacks successfully change the prediction of DNN and DNN-ADV models to the second most likely class and evade detection (AUROC goes to 50). Monte-Carlo dropout ensembles are marginally harder to adversarially overcome, due to the random noise. At the same time, it takes far more iterations of gradient descent to successfully attack Prior Networks such that they fail to detect the attack. On CIFAR-10 the Joint Success Rate is only 0.25 at 1000 iterations, while the JSR for the other models is 0.5 (the maximum). Results on the more challenging CIFAR-100 dataset show that adversarially trained Prior Networks yield a more modest increase in robustness over baseline approaches, but it still takes significantly more computational effort to attack the model. Thus, these results support the assertion that adversarially trained Prior Networks constrain the solution space for adaptive adversarial attack, making them computationally more difficult to successfully construct. At the same time, blackbox attacks, computed on identical networks trained on the same data from a different random initialization, fail entirely against Prior Networks trained on CIFAR-10 and CIFAR-100. This shows that the adaptive attacks considered here are non-transferable.

7 Conclusion

Prior Networks have been shown to be an interesting approach to emulating ensembles, allowing rich and interpretable measures of uncertainty to be derived from neural networks. This work consists of two main contributions which aim to improve these models. Firstly, a new training criterion for Prior Networks, the reverse KL-divergence between Dirichlet distributions, is proposed. It is shown, both theoretically and empirically, that this criterion yields the desired set of behaviours of a Prior Network and allows these models to be trained on more complex datasets with a large number of classes. Furthermore, it is shown that this loss improves out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets relative to the forward KL-divergence loss used in malinin-pn-2018 . However, it is necessary to investigate the proper choice of out-of-distribution training data, as an inappropriate choice can limit OOD detection performance on complex datasets. Secondly, this improved training criterion enables Prior Networks to be applied to the task of detecting whitebox adaptive adversarial attacks. It is shown that it is significantly more computationally challenging to construct successful adaptive whitebox PGD attacks against Prior Networks than against baseline models. Thus, adversarial training of Prior Networks can be seen as a generalization of standard adversarial training which improves both robustness to adversarial attacks and the ability to detect them by placing more constraints on the space of solutions to the optimization problem which yields adversarial attacks. It is necessary to point out that the evaluation of adversarial attack detection using Prior Networks is limited to only strong attacks. It is of interest to assess how well Prior Networks are able to detect adaptive C&W attacks carlini-robustness and EAD attacks chen2018ead . However, one challenge with these attacks is the adaptation of their loss functions to Prior Networks, which is left for future work.


Appendix A Synthetic Experiments

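The current appendix describes the high data uncertainty artificial dataset used in section 4 of this paper. This dataset is sampled from a distribution $p(\boldsymbol{x}, y)$ which consists of three normally distributed clusters with tied isotropic covariances and equidistant means, where each cluster corresponds to a separate class. The marginal distribution over $\boldsymbol{x}$ is given as a mixture of Gaussian distributions:

$$p(\boldsymbol{x}) = \sum_{c=1}^{3} p(\boldsymbol{x}|y=\omega_c)\,P(y=\omega_c) = \sum_{c=1}^{3} \mathcal{N}(\boldsymbol{x};\boldsymbol{\mu}_c,\sigma^2\boldsymbol{I})\,P(y=\omega_c)$$

The conditional distribution over the classes can be obtained via Bayes' rule:

$$P(y=\omega_c|\boldsymbol{x}) = \frac{p(\boldsymbol{x}|y=\omega_c)\,P(y=\omega_c)}{\sum_{k=1}^{3} p(\boldsymbol{x}|y=\omega_k)\,P(y=\omega_k)}$$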

This dataset is depicted for below. The green points represent the ’out-of-distribution’ training data, which is sampled close to the in-domain region. The Prior Networks considered in section 4 are trained on this dataset.


Figure 4: High Data Uncertainty artificial dataset.
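A dataset of this kind can be sketched in a few lines. The version below places three means on a circle 120 degrees apart, samples with a shared isotropic standard deviation, and evaluates the class posterior via Bayes' rule. Equal class priors, the radius, the standard deviation and the function names are all assumptions of this example, and the out-of-distribution shell is not generated.

```python
import math
import random


def equidistant_means(radius):
    """Three 2-D means on a circle, 120 degrees apart (an equilateral triangle)."""
    return [
        (radius * math.cos(2.0 * math.pi * k / 3.0), radius * math.sin(2.0 * math.pi * k / 3.0))
        for k in range(3)
    ]


def sample_dataset(num_points, means, sigma, rng):
    """Draw (point, label) pairs from the mixture, assuming equal class priors."""
    data = []
    for _ in range(num_points):
        label = rng.randrange(len(means))
        mx, my = means[label]
        data.append(((rng.gauss(mx, sigma), rng.gauss(my, sigma)), label))
    return data


def class_posterior(point, means, sigma):
    """Bayes' rule; the Gaussian normalisers cancel because the covariance is tied."""
    log_lik = [
        -((point[0] - mx) ** 2 + (point[1] - my) ** 2) / (2.0 * sigma ** 2) for mx, my in means
    ]
    top = max(log_lik)
    weights = [math.exp(v - top) for v in log_lik]
    total = sum(weights)
    return [w / total for w in weights]


means = equidistant_means(radius=4.0)  # assumed geometry
data = sample_dataset(500, means, sigma=4.0, rng=random.Random(0))
print("posterior at the origin:", [round(p, 3) for p in class_posterior((0.0, 0.0), means, 4.0)])
```

Choosing a standard deviation comparable to the distance between the means gives heavy class overlap, which is the regime in which the forward and reverse KL criteria are expected to differ.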

Figure 5 depicts the behaviour of the differential entropy of Prior Networks trained on the high data uncertainty artificial dataset using both KL-divergence losses. Unlike the total uncertainty, expected data uncertainty and mutual information, it is less clear what the desired behaviour of the differential entropy is. Figure 5 shows that both losses yield low differential entropy in-domain and high differential entropy out-of-distribution. However, the reverse KL-divergence seems to capture more of the structure of the dataset than the forward KL-divergence, which is especially evident in figure 5b. This suggests that the differential entropy of Prior Networks trained via reverse KL-divergence is a measure of total uncertainty, while the differential entropy of Prior Networks trained using forward KL-divergence is a measure of knowledge uncertainty. The latter is consistent with results in malinin-pn-2018 .


Figure 5: Differential Entropy derived from Prior Networks trained with forward and reverse KL-divergence loss. Panels: (a) Differential Entropy PN-KL; (b) Differential Entropy PN-RKL.

Appendix B Experimental Setup

Dataset        Train    Valid   Test    Classes
MNIST          55000    5000    10000   10
SVHN           73257    -       26032   10
CIFAR-10       50000    -       10000   10
LSUN           -        -       10000   10
CIFAR-100      50000    -       10000   100
TinyImagenet   100000   10000   10000   200

Table 3: Description of datasets in terms of number of images and classes.

The current appendix describes the experimental setup and datasets used for experiments considered in this paper. Table 3 describes the datasets used in terms of their size and numbers of classes.

Training Model Epochs Cycle Dropout OOD data Dataset Length MNIST DNN 1e-3 20 10 0.5 - PN-KL 0.0 1e3 - PN-RKL SVHN DNN 1e-3 40 30 0.5 - PN-KL 5e-4 0.7 1.0 1e3 CIFAR-10 PN-RKL 5e-6 0.7 10.0 CIFAR-10 DNN 1e-3 45 30 0.5 - - - DNN-ADV FGSM-ADV PN-KL 5e-4 45 30 0.7 1.0 1e2 CIFAR-100 PN-RKL 5e-6 10.0 PN 5e-6 45 30 0.7 30.0 1e2 FGSM-ADV CIFAR-100 DNN 1e-3 100 70 0.5 - - - DNN-ADV FGSM-ADV PN-KL 5e-4 100 70 0.7 1.0 1e2 TinyImageNet PN-RKL 5e-6 10.0 PN 5e-4 100 70 0.7 30.0 1e2 FGSM-ADV TinyImageNet DNN 1e-3 120 80 0.5 - PN-KL 5e-4 0.0 1e2 - PN-RKL 5e-6

Table 4: Training Configurations. is the initial learning rate, is the out-of-distribution loss weight and is the concentration of the target class. The batch size for all models was 128. Dropout rate is quoted in terms of probability of not dropping out a unit.

All models considered in this paper were implemented in Tensorflow tensorflow using the VGG-16 vgg architecture for image classification, but with the dimensionality of the fully-connected layer reduced to 2048 units. DNN models were trained using the negative log-likelihood loss. Prior Networks were trained using both the forward KL-divergence (PN-KL) and reverse KL-divergence (PN-RKL) losses to compare their behaviour on more challenging datasets. Identical target concentration parameters were used for both the forward and reverse KL-divergence losses. All models were trained using the Adam adam optimizer, with a 1-cycle learning rate policy and dropout regularization. In addition, data augmentation was applied when training models on the CIFAR-10, CIFAR-100 and TinyImageNet datasets via random left-right flips, random shifts of up to 4 pixels and random rotations of up to 15 degrees. The details of the training configurations for all models and each dataset can be found in table 4. Five models of each type were trained starting from different random seeds. The five DNN models were evaluated both individually (DNN) and as an explicit ensemble of models (ENS).

B.1 Adversarial Attack Generation

An adversarial input is defined as the output of a constrained optimization process applied to a natural input:


The loss is typically the negative log-likelihood of a particular target class:


The distance represents a proxy for the perceptual distance between the natural sample and the adversarial sample. In the case of adversarial images, it is typically the L1, L2 or L∞ norm. The distance is constrained to lie within the set of allowed perturbations, such that the adversarial attack is still perceived as a natural input by a human observer. First-order optimization under such a constraint is called Projected Gradient Descent (PGD) madry2017towards , where the solution is projected back onto the norm ball whenever it exceeds the constraint. There are multiple ways in which the PGD optimization problem in equation 21 can be solved szegedy-adversarial ; goodfellow-adversarial ; BIM ; MIM ; madry2017towards . The simplest way to generate an adversarial example is via the Fast Gradient Sign Method (FGSM) goodfellow-adversarial , where the sign of the gradient of the loss with respect to the input is added to the input:


Epsilon controls the magnitude of the perturbation under a particular distance, the L∞ norm in this case. A generalization of this approach to other norms, called Fast Gradient Methods (FGM), is provided below:


FGM attacks are simple adversarial attacks which are not always successful. A more challenging class of attacks are iterative FGM attacks, such as the Basic Iterative Method (BIM) BIM and Momentum Iterative Method (MIM) MIM , and others carlini-robustness ; chen2018ead . However, as pointed out by Madry et al. madry2017towards , all of these attacks, whether one-step or iterative, are generated using variants of Projected Gradient Descent to solve the constrained optimization problem in equation 21. Madry et al. madry2017towards argue that all attacks generated using various forms of PGD share similar properties, even if certain attacks use more sophisticated forms of PGD than others. In this work MIM attacks, which are considered to be strong attacks, are used to attack all models considered in section 6. However, standard targeted attacks which minimize the negative log-likelihood of a target class are not adaptive to the detection scheme. Thus, in this work adaptive targeted attacks are generated by minimizing the losses proposed in section 6, in equation 18. The optimization problem in equation 21 contains a hard constraint, which essentially projects the solutions of gradient descent optimization back onto the allowed norm ball whenever the perturbation is larger than the constraint. This may be disruptive to iterative momentum-based optimization methods. An alternative soft-constraint formulation of the optimization problem is to simultaneously minimize the loss and the perturbation directly:


In this formulation a hyper-parameter trades off minimization of the loss against minimization of the perturbation. Approaches which minimize this expression are the Carlini and Wagner (C&W) attack carlini-robustness and the "Elastic-net Attacks to DNNs" (EAD) attack chen2018ead . While the optimization expression is different, these methods are also a form of PGD and are therefore expected to have similar properties to other PGD-based attacks madry2017towards . The C&W and EAD attacks are considered to be particularly strong L2 and L1 attacks, and Prior Networks need to be assessed on their ability to be robust to and detect them. However, adaptation of these attacks to Prior Networks is non-trivial and is left to future work.
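The one-step and momentum-iterative attacks described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: a linear softmax classifier stands in for the networks used in the paper so that the input gradient is available in closed form, the step size (epsilon divided by the number of steps) and the momentum decay of 1.0 are assumed defaults, and the function names are hypothetical. Because the attack is targeted and the loss is the negative log-likelihood of the target class, the sketch steps against the gradient sign.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nll_and_grad(x, target, W, b):
    # Negative log-likelihood of the target class and its gradient w.r.t. x
    # for a linear softmax model with logits W @ x + b (a stand-in model).
    p = softmax(W @ x + b)
    one_hot = np.eye(len(b))[target]
    return -np.log(p[target]), W.T @ (p - one_hot)

def targeted_fgsm(x, target, W, b, eps):
    # One signed-gradient step that decreases the target-class loss.
    _, g = nll_and_grad(x, target, W, b)
    return x - eps * np.sign(g)

def targeted_mim(x, target, W, b, eps, steps=10, mu=1.0):
    # Iterative signed steps with gradient momentum, followed by projection
    # back onto the L-infinity ball of radius eps around the natural input.
    alpha = eps / steps  # assumed step size
    x_adv, m = x.copy(), np.zeros_like(x)
    for _ in range(steps):
        _, g = nll_and_grad(x_adv, target, W, b)
        m = mu * m + g / (np.abs(g).sum() + 1e-12)
        x_adv = x_adv - alpha * np.sign(m)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # hard-constraint projection
    return x_adv

# Toy usage with a hand-picked 3-class model on 2-dimensional inputs.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x_nat = np.array([1.0, 0.0])
x_fgsm = targeted_fgsm(x_nat, target=2, W=W, b=b, eps=0.1)
x_mim = targeted_mim(x_nat, target=2, W=W, b=b, eps=0.1)
```

The clipping line is the projection step of the hard-constraint formulation; the soft-constraint attacks mentioned above would instead add a penalty on the perturbation to the loss.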

B.2 Adversarial Training of DNNs and Prior Networks

Prior Networks and DNNs considered in section 6 are trained on a combination of natural and adversarially perturbed data, which is known as adversarial training. DNNs are trained on targeted FGSM attacks which are generated dynamically during training from the current training minibatch. The target class is selected from a uniform categorical distribution, but such that it is not the true class of the image. The magnitude of the perturbation, epsilon, is randomly sampled for each image in the minibatch from a truncated normal distribution, which only yields positive values, with a standard deviation of 30 pixels:


The perturbation strength is sampled such that the model learns to be robust to adversarial attacks across a range of perturbations. The DNN is then trained via maximum likelihood on both the natural and adversarially perturbed versions of the minibatch. Adversarial training of the Prior Network is a little more involved. During training, an adversarially perturbed version of the minibatch is generated using the targeted FGSM method. However, the loss is not the negative log-likelihood of a target class, but the reverse KL-divergence (eqn. 10) between the model and a target Dirichlet focused on a target class, which is chosen from a uniform categorical distribution (excluding the true class of the image). For this loss the target concentration is the same as for natural data. The Prior Network is then jointly trained on the natural and adversarially perturbed versions of the minibatch using the following loss:


Here, the target class is assigned one concentration for natural data and another for adversarially perturbed data, where the concentration parameters are set via equation 6. The value chosen for adversarially perturbed data results in a very wide Dirichlet distribution whose mode and mean are closest to the target class. This ensures that the prediction yields the correct class and that all measures of uncertainty, such as the entropy of the predictive posterior or the mutual information, are high. Note that, due to the nature of the reverse KL-divergence loss, adversarial inputs which have a very small perturbation and lie close to their natural counterparts will naturally have a target concentration which is an interpolation between the concentration for natural data and that for adversarial data. The degree of interpolation is determined by the OOD loss weight, as discussed in section 3. It is necessary to point out that FGSM attacks are used because they are cheap to compute during training. Iterative adversarial attacks can also be considered during training, although this will make training much slower.
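The two sampling steps used when generating adversarial training data, drawing a perturbation magnitude and picking a target class other than the true one, can be sketched as below. This assumes that the truncated normal has zero mean, in which case taking the absolute value of a normal draw is equivalent to truncating to positive values, and that "30 pixels" refers to image intensity units; the function names are hypothetical.

```python
import numpy as np

def sample_epsilon(batch_size, std=30.0, rng=None):
    # Assumption: zero-mean normal truncated to positive values, which has
    # the same distribution as the absolute value of a zero-mean normal.
    rng = np.random.default_rng() if rng is None else rng
    return np.abs(rng.normal(0.0, std, size=batch_size))

def sample_targets(labels, num_classes, rng=None):
    # Uniform over the num_classes - 1 classes that differ from the true
    # label: add a random non-zero offset modulo the number of classes.
    rng = np.random.default_rng() if rng is None else rng
    offsets = rng.integers(1, num_classes, size=labels.shape)
    return (labels + offsets) % num_classes

rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 9, 5])
targets = sample_targets(labels, num_classes=10, rng=rng)
eps = sample_epsilon(len(labels), rng=rng)
```

Each image in the minibatch then receives its own epsilon and target class before the targeted FGSM step is applied.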

Appendix C Jointly Assessing Adversarial Attack Robustness and Detection

In order to investigate detection of adversarial attacks, it is necessary to discuss how to assess the effectiveness of an adversarial attack in the scenario where detection of the attack is possible. Previous work on detection of adversarial examples gong-detection-2017 ; grosse-detection-2017 ; metzen-detecting-2017 ; carlini-detected ; gal-adversarial assesses the performance of detection methods separately from whether an adversarial attack was successful, and uses the standard measures of adversarial success and detection performance. However, in a real deployment scenario, an attack can only be considered successful if it both affects the predictions and evades detection. Here, we develop a measure of performance to assess this. For the purposes of this discussion the adversarial generation process will be defined to either yield a successful adversarial attack or an empty set. In a standard scenario, where there is no detection, the efficacy of an adversarial attack on a model, given an evaluation dataset, can be summarized via the success rate of the attack:


Typically, the success rate is plotted against the total maximum perturbation from the original image, measured as the L1, L2 or L∞ distance. Consider using a threshold-based detection scheme where a sample is labelled ’positive’ if some measure of uncertainty, such as entropy or mutual information, is less than a threshold and ’negative’ if it is higher than the threshold:


The performance of such a scheme can be evaluated at every threshold value using the true positive rate and the false positive rate :


The whole range of such trade-offs can be visualized using a Receiver-Operating-Characteristic (ROC) curve, and the quality of the trade-off can be summarized using the area under the ROC curve. However, a standard ROC curve does not account for situations where the process fails to produce a successful attack. In fact, if an adversarial attack is made against a system which has a detection scheme, it can only be considered successful if it both affects the predictions and evades detection. This condition can be summarized in the following indicator function:


Given this indicator function, a new false positive rate can be defined as:


This false positive rate can now be seen as a new Joint Success Rate which measures how many attacks were both successfully generated and evaded detection, given the threshold of the detection scheme. The Joint Success Rate can be plotted against the standard true positive rate on an ROC curve to visualize the possible trade-offs. One possible operating point is where the false positive rate is equal to the false negative rate, also known as the Equal Error-Rate point:


Throughout this work the EER false positive rate will be quoted as the Joint Success Rate.
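The Joint Success Rate and its Equal Error-Rate operating point can be sketched in a few lines. The sketch follows the convention stated above, in which an input is labelled 'positive' (accepted as natural) when its uncertainty falls below the threshold; ties are treated as accepted, the threshold is chosen by a simple scan over the observed uncertainty values, and the function names and toy numbers are assumptions for illustration only.

```python
import numpy as np

def joint_success_rate(adv_uncertainty, adv_success, threshold):
    # An attack counts only if it changed the prediction AND its uncertainty
    # is at or below the threshold, i.e. it is accepted as a natural input.
    evaded = adv_uncertainty <= threshold
    return np.mean(adv_success & evaded)

def true_positive_rate(nat_uncertainty, threshold):
    # Fraction of natural inputs accepted by the detector.
    return np.mean(nat_uncertainty <= threshold)

def equal_error_point(nat_uncertainty, adv_uncertainty, adv_success):
    # Scan candidate thresholds and return the one where the joint success
    # rate is closest to the false negative rate (1 - true positive rate).
    thresholds = np.unique(np.concatenate([nat_uncertainty, adv_uncertainty]))
    jsr = np.array([joint_success_rate(adv_uncertainty, adv_success, t)
                    for t in thresholds])
    fnr = np.array([1.0 - true_positive_rate(nat_uncertainty, t)
                    for t in thresholds])
    i = np.argmin(np.abs(jsr - fnr))
    return thresholds[i], jsr[i]

# Toy uncertainty scores for natural inputs and attempted attacks.
nat = np.array([0.1, 0.2, 0.3, 0.4])
adv = np.array([0.25, 0.5, 0.6, 0.35])
success = np.array([True, True, False, True])
threshold, jsr_at_eer = equal_error_point(nat, adv, success)
```

With an infinite threshold nothing is detected and the Joint Success Rate reduces to the plain success rate, which makes the two measures directly comparable.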

Appendix D Additional Adversarial Attack Detection Experiments

In this appendix additional experiments on adversarial attack detection are presented. In figure 6 adaptive whitebox adversarial attacks generated by iteratively minimizing the KL-divergence between the original and target (permuted) categorical distributions are compared to attacks generated by minimizing the KL-divergence between the predicted and permuted Dirichlet distributions. Performance is assessed only against Prior Network models. The results show that KL PMF attacks are more successful at switching the prediction to the desired class and at evading detection. This could be due to the fact that Dirichlet distributions which are sharp at different corners have limited common support, making the optimization of the KL-divergence between them more difficult than that of the KL-divergence between categorical distributions.


(a) C10 Success Rate


(b) C10 ROC AUC


(c) C10 Joint Success Rate
Figure 6: Comparison of performance of whitebox adaptive PGD MIM attacks which minimize the KL-divergence between PMFs (KL PMF) and Dirichlet distributions (KL DIR) on CIFAR-10.

Results in figure 7 show that PGD Momentum Iterative attacks which minimize the loss are marginally more successful than the version of these attacks. However, it is necessary to consider appropriate adaptation of the C&W attacks to the loss functions considered in this work for a more aggressive set of attacks.


(a) C10 Success Rate


(b) C10 ROC AUC


(c) C10 Joint Success Rate


(d) C100 Success Rate


(e) C100 ROC AUC


(f) C100 Joint Success Rate
Figure 7: Comparison of performance of whitebox adaptive and PGD MIM attacks against Prior Networks trained on CIFAR-10 (C10) and CIFAR-100 (C100) datasets.