Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network

03/04/2020 ∙ by Joost van Amersfoort, et al.

We propose a method for training a deterministic deep model that can find and reject out of distribution data points at test time with a single forward pass. Our approach, deterministic uncertainty quantification (DUQ), builds upon ideas of RBF networks. We scale training in these with a novel loss function and centroid updating scheme. By enforcing detectability of changes in the input using a gradient penalty, we are able to reliably detect out of distribution data. Our uncertainty quantification scales well to large datasets, and using a single model, we improve upon or match Deep Ensembles on notable difficult dataset pairs such as FashionMNIST vs. MNIST, and CIFAR-10 vs. SVHN, while maintaining competitive accuracy.

1 Introduction

Estimating uncertainty reliably and efficiently has remained an open problem with many important applications such as guiding exploration in Reinforcement Learning (Osband et al., 2016) or as a method for selecting data points for which to acquire labels in Active Learning (Houlsby et al., 2011). The type of uncertainty we care about in these applications is called epistemic uncertainty which captures the model’s lack of knowledge about the data. It is important to distinguish it from aleatoric uncertainty, which is uncertainty inherent in the data, for example caused by sensor noise. Aleatoric uncertainty is irreducible, while epistemic uncertainty decreases as we observe more data. Therefore, we can use epistemic uncertainty estimation to find out of distribution data.

Until now, most approaches for estimating epistemic uncertainty rely on ensembling (Lakshminarayanan et al., 2017) or Monte Carlo sampling (Gal and Ghahramani, 2016). In this paper, we introduce a deep model that is able to estimate epistemic uncertainty in a single forward pass. We call our model DUQ, Deterministic Uncertainty Quantification, and we construct it by re-examining ideas originally suggested in the 90s. We combine these with recent advances and make a number of improvements which enable scalable training of modern deep learning architectures. We evaluate our model against the current best approach for estimating uncertainty in Deep Learning, Deep Ensembles, and show that DUQ compares favourably on a number of evaluations, such as out of distribution detection of FashionMNIST vs MNIST, and CIFAR vs. SVHN. We visualise how DUQ performs on the two moons dataset in Figure 1. We see that DUQ is only certain on the training data, and its certainty decreases away from it. Deep Ensembles are not able to obtain meaningful uncertainty on this dataset, because of a lack of diversity in the different models in the ensemble.

(a) Deep Ensembles
(b) Our model - DUQ
Figure 1: Uncertainty results on two moons dataset. Yellow indicates high certainty, while blue indicates uncertainty. DUQ is certain only on the data distribution, and uncertain away from it: the ideal result. Deep Ensembles is uncertain only along the decision boundary, and certain elsewhere.

DUQ consists of a deep model and a set of feature vectors corresponding to the different classes (centroids). A prediction is made by computing a kernel function, a distance function, between the feature vector computed by the model and the centroids. This type of model is called an RBF network (LeCun et al., 1998a) and epistemic uncertainty is measured as the distance between the model output and the closest centroid. A data point for which the feature vector is far away from all centroids does not belong to any class and can be considered out of distribution.

The model is trained by minimising the distance to the correct centroid, while maximising it with respect to the others. This incentivises the model to put the features of training data close to a particular centroid. However, there is no mechanism that dictates what should happen away from the training data. Therefore we need to enforce that DUQ is sensitive to changes in the input, such that we can reliably detect out of distribution data and avoid mapping out of distribution data to in distribution feature representations, an effect we call feature collapse. The upper bound of this sensitivity can be quantified by the Lipschitz constant of the model. We are interested in models for which this sensitivity is not too low, but also not too high, because that could hurt generalisation and optimisation. DUQ achieves this result by regularising the Jacobian with respect to the input, as was first introduced by Drucker and Le Cun (1992).

In practice, RBF networks prove difficult to optimise, because of instability of the centroids and a saturating loss. We propose to make training stable by updating the centroids using an exponential moving average of the feature vectors of the data points assigned to them, as was introduced in van den Oord et al. (2017). We use a “one vs the rest” loss function minimising the distance to the correct centroid, while maximising the other distances. We find that these two changes stabilise training and lead to accuracies that are similar to the standard softmax and cross entropy set up on standard datasets such as FashionMNIST and CIFAR-10.

Uncertainty quantification in deep neural networks with a softmax output is generally done by measuring the entropy of the predictive distribution, so the maximally uncertain output is achieved by uniformly assigning probabilities over all the classes. The only way to achieve a uniform output for out of distribution data is by training on additional data and hoping it generalises to out of distribution samples at test time. This does not happen in practice, and it is found that the only uncertainty that can reliably be captured by looking at the entropy of the softmax distribution is aleatoric uncertainty (Gal, 2016). In DUQ, it is possible to predict that none of the classes seen during training is a good fit, when the distance between the model output and all centroids is large.

The contributions of this paper are as follows:

  • We stabilise training of RBF networks and show, for the first time, that this type of model can achieve accuracy competitive with softmax models.

  • We introduce two-sided Jacobian regularisation in the context of epistemic uncertainty, which makes it possible to obtain reliable uncertainty estimates for RBF networks.

  • We obtain excellent epistemic uncertainty in a single forward pass, while maintaining competitive accuracy.

Figure 2: A depiction of the architecture of DUQ. The input is mapped to feature space, where it is assigned to the closest centroid. The distance to that centroid is the uncertainty.

2 Methods

DUQ consists of a deep feature extractor, such as a ResNet (He et al., 2016), but without the softmax layer. Instead, we have one learnable weight matrix per class, $W_c$. Using the output and the class centroids, we compute the exponentiated distance between the model output and the centroids:

$$K_c(f_\theta(x), e_c) = \exp\left[-\frac{\frac{1}{n}\lVert W_c f_\theta(x) - e_c\rVert_2^2}{2\sigma^2}\right] \tag{1}$$

with $f_\theta: \mathbb{R}^m \to \mathbb{R}^d$ our model, $m$ the input dimension, $d$ the output dimension, and parameters $\theta$. $e_c$ is the centroid for class $c$, a vector of length $n$. $W_c$ is a weight matrix of size $n$ (centroid size) by $d$ (feature extractor output size) and $\sigma$ a hyper parameter sometimes called the length scale. This function is also referred to as a Radial Basis Function (RBF) kernel. The class dependent weight matrix allows feature insensitivity on a class by class basis, minimising the potential for feature collapse. A prediction is made by taking the class $c$ with the maximum correlation (minimum distance) between data point $x$ and class centroids $E = \{e_1, \dots, e_C\}$:

$$\arg\max_c K_c(f_\theta(x), e_c) \tag{2}$$

We define the uncertainty in this model as the distance to the closest centroid, i.e. replacing the $\arg\max$ operator by a $\max$ in Equation (2).
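As an illustration of the kernel and the prediction rule, the following is a minimal sketch in plain Python. It assumes the kernel has the usual RBF form (a mean squared distance between the projected features and the centroid, scaled by the length scale); the function names `rbf_kernel` and `predict`, the identity weight matrices and the toy centroids are hypothetical choices made only for this example, and the feature vector is assumed to have already been computed by the feature extractor.

```python
import math


def rbf_kernel(features, W_c, e_c, sigma):
    """Kernel value exp(-(1/n) * ||W_c f - e_c||^2 / (2 * sigma^2)) for one class."""
    n = len(e_c)
    # Project the feature vector into the centroid space of this class: z = W_c f.
    z = [sum(w * f for w, f in zip(row, features)) for row in W_c]
    mean_sq_dist = sum((zi - ei) ** 2 for zi, ei in zip(z, e_c)) / n
    return math.exp(-mean_sq_dist / (2.0 * sigma ** 2))


def predict(features, weights, centroids, sigma):
    """Return (argmax class, max kernel value); the max serves as the certainty."""
    kernels = [rbf_kernel(features, W, e, sigma) for W, e in zip(weights, centroids)]
    best = max(range(len(kernels)), key=kernels.__getitem__)
    return best, kernels[best]


# Toy set up (hypothetical): two classes, 2-d features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = [identity, identity]
centroids = [[0.0, 0.0], [4.0, 4.0]]

near = predict([0.1, -0.1], weights, centroids, sigma=1.0)   # close to centroid 0
far = predict([50.0, -50.0], weights, centroids, sigma=1.0)  # far from both centroids
```

A point near a centroid receives a kernel value close to one, while a point far from every centroid receives a value close to zero for all classes, which is the signal used to flag it as out of distribution.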

The loss function is the sum of the binary cross entropy between each class’ kernel value $K_c(\cdot, e_c)$, and a one-hot (binary) encoding of the label. For a particular data point $\{x, y\}$ in our data set $\{X, Y\}$:

$$L(x, y) = -\sum_c \left[ y_c \log(K_c) + (1 - y_c) \log(1 - K_c) \right] \tag{3}$$

where we shortened $K_c(f_\theta(x), e_c)$ as $K_c$. During training, we average the loss over a minibatch of data points, and perform stochastic gradient descent on $\theta$ and $W = \{W_1, \dots, W_C\}$. The class centroids, $E$, are updated using an exponential moving average of the feature vectors of data points belonging to that class. If the model parameters, $\theta$ and $W$, are held constant, then this update rule leads to the closed form solution for the centroids that minimises the loss:

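$$N_{c,t} = \gamma N_{c,t-1} + (1 - \gamma)\, n_{c,t} \tag{4}$$
$$m_{c,t} = \gamma m_{c,t-1} + (1 - \gamma) \sum_i W_c f_\theta(x_{c,t,i}) \tag{5}$$
$$e_{c,t} = \frac{m_{c,t}}{N_{c,t}} \tag{6}$$

where $n_{c,t}$ is the number of data points assigned to class $c$ in minibatch $t$, $x_{c,t,i}$ is element $i$ of a minibatch at time $t$, with class $c$. $\gamma$ is the momentum, which we usually set between [0.99, 0.999]. This method of updating centroids was introduced in the Appendix of van den Oord et al. (2017) for updating quantised latent variables. The high momentum leads to stable optimisation that is robust to initialisation.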
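The centroid update can be sketched as follows. This is an illustrative plain-Python version that assumes the update keeps a running count and a running sum of projected features per class and takes their ratio as the centroid; the function name `ema_centroid_update` and its argument layout are hypothetical, the projected features are assumed to be precomputed, and the running counts are assumed to start at a positive value so that the division is defined.

```python
def ema_centroid_update(N, m, projected, labels, gamma=0.999):
    """One minibatch step of the exponential moving average centroid update.

    N[c]         running (smoothed) count of points seen for class c
    m[c]         running (smoothed) sum of projected features for class c
    projected[i] projected feature vector of minibatch element i
    labels[i]    class index of minibatch element i
    """
    num_classes, dim = len(N), len(m[0])
    counts = [0] * num_classes
    sums = [[0.0] * dim for _ in range(num_classes)]
    for z, y in zip(projected, labels):
        counts[y] += 1
        for j in range(dim):
            sums[y][j] += z[j]

    new_N = [gamma * N[c] + (1.0 - gamma) * counts[c] for c in range(num_classes)]
    new_m = [
        [gamma * m[c][j] + (1.0 - gamma) * sums[c][j] for j in range(dim)]
        for c in range(num_classes)
    ]
    # Centroid = smoothed sum divided by smoothed count.
    centroids = [[new_m[c][j] / new_N[c] for j in range(dim)] for c in range(num_classes)]
    return new_N, new_m, centroids


# Toy usage (hypothetical values): repeated identical minibatches.
N = [1.0, 1.0]
m = [[0.0, 0.0], [0.0, 0.0]]
batch = [[2.0, 2.0], [2.0, 2.0], [-1.0, -1.0]]
batch_labels = [0, 0, 1]
for _ in range(300):
    N, m, E = ema_centroid_update(N, m, batch, batch_labels, gamma=0.9)
```

With fixed features, the centroids converge to the per-class mean of the projected features; the low momentum in the toy usage is chosen only so that convergence is visible within a few hundred steps.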

The proposed set up leads to the centroids being pushed further away at each minibatch, without converging to a stable point. We avoid this by regularising the $\ell_2$ norm of $\theta$. This restricts the model to sensible solutions and aids optimisation.
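To make the "one vs the rest" loss of Equation (3) concrete, here is a small sketch that assumes it is the sum of per-class binary cross entropies between the kernel values and a one-hot label. The function name `duq_loss` and the clamping constant `eps` are assumptions added for numerical safety in this example and are not taken from the text.

```python
import math


def duq_loss(kernels, label, eps=1e-12):
    """Sum over classes of binary cross entropy between kernel values and a one-hot label."""
    loss = 0.0
    for c, k in enumerate(kernels):
        k = min(max(k, eps), 1.0 - eps)  # clamp so that log() stays finite (assumption)
        y = 1.0 if c == label else 0.0
        loss -= y * math.log(k) + (1.0 - y) * math.log(1.0 - k)
    return loss


good = duq_loss([0.99, 0.01], label=0)    # close to the correct centroid only
unsure = duq_loss([0.5, 0.5], label=0)    # equally close to both centroids
wrong = duq_loss([0.01, 0.99], label=0)   # close to the wrong centroid only
```

Because each class contributes its own binary term, the loss can be driven low only by moving towards the correct centroid and away from all the others at the same time.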

2.1 Gradient Penalty

As discussed in the introduction, without further regularisation deep networks are prone to feature collapse. We find that it can be avoided by regularising the representation map using a gradient penalty. Gradient penalties were first introduced to aid generalisation in Drucker and Le Cun (1992), who named it “double backpropagation”. Recently, this type of penalty has been used successfully in training Wasserstein GANs (Gulrajani et al., 2017) to regularise the Lipschitz constant.

In our set up, we consider the following two-sided penalty:

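$$\lambda \cdot \left[ \left\lVert \nabla_x \sum_c K_c \right\rVert_2^2 - 1 \right]^2 \tag{7}$$

where $\lVert \cdot \rVert_2$ is the $\ell_2$ norm and $1$ the target Lipschitz constant. We found empirically that regularising the gradient of $\sum_c K_c$ works better than $f_\theta(x)$ or $K_c(x)$ (which is the vector of kernel distances for input $x$). A similar approach was taken for softmax models by Ross and Doshi-Velez (2018).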

The two-sided penalty was introduced by Gulrajani et al. (2017), who mention that despite a one-sided penalty being sufficient to satisfy their requirements, the two sided penalty proved to be better in practice. The one-sided penalty is defined as:

$$\lambda \cdot \max\left(0, \left\lVert \nabla_x \sum_c K_c \right\rVert_2^2 - 1\right) \tag{8}$$

In Section 4.1, we show the difference between the single and two sided penalties experimentally. We find the two-sided penalty to be ideal for enforcing sensitivity, while still allowing strong generalisation.
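The difference between the two penalties can be illustrated numerically. The sketch below assumes that both penalties act on the squared norm of the input gradient of the summed kernel values, with a target of 1: the two-sided version squares the deviation from the target, while the one-sided version only penalises values above it. It is not the training implementation; it uses an identity feature extractor, identity class projections and central finite differences instead of automatic differentiation, and all function names are hypothetical.

```python
import math


def kernel_sum(x, centroids, sigma):
    """Sum of RBF kernel values over classes, with identity features and projections."""
    total = 0.0
    for e in centroids:
        mean_sq = sum((xi - ei) ** 2 for xi, ei in zip(x, e)) / len(e)
        total += math.exp(-mean_sq / (2.0 * sigma ** 2))
    return total


def grad_norm_sq(fn, x, h=1e-5):
    """Squared l2 norm of the gradient of fn at x, via central finite differences."""
    total = 0.0
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        total += ((fn(up) - fn(down)) / (2.0 * h)) ** 2
    return total


def two_sided_penalty(gn_sq, lam):
    return lam * (gn_sq - 1.0) ** 2


def one_sided_penalty(gn_sq, lam):
    return lam * max(0.0, gn_sq - 1.0)


centroids = [[0.0, 0.0], [4.0, 4.0]]
fn = lambda x: kernel_sum(x, centroids, sigma=0.2)

flat = grad_norm_sq(fn, [50.0, -50.0])  # far from the data: gradient vanishes
steep = grad_norm_sq(fn, [0.2, 0.2])    # on the slope of the first kernel
```

Where the gradient vanishes, the one-sided penalty is zero while the two-sided penalty equals its weight, so only the two-sided version pushes against a locally constant (collapsed) output.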

2.2 Intuition about Gradient Penalty

A gradient penalty enforces smoothness, limiting how quickly the output of a function changes as the input $x$ changes. Smoothness is important for generalisation, especially if we are using a kernel which depends on distances in the representation space. It is simple to show that regularising the $\ell_2$ norm of the Jacobian, $J$, enforces a Lipschitz constraint at least locally, since for a small region around $x$ we have, to first order, $\lVert f(x + \epsilon) - f(x) \rVert_2 \lesssim \lVert J \rVert_2 \lVert \epsilon \rVert_2$.

However, smoothness still leaves us vulnerable to the feature collapse problem outlined earlier, where multiple inputs are mapped to the same $f(x)$. Lipschitz smooth functions can collapse their inputs: the constant function $f(x) = c$ is Lipschitz for any Lipschitz constant $L$. Collapsing features can be beneficial for accuracy, but it hurts our ability to perform out of distribution detection, since it has the potential to make input points indistinguishable in the representation space. We find empirically in our work that the two-sided penalty is extremely important: using the one-sided penalty, i.e. enforcing only smoothness, is not sufficient to produce the sensitive behaviour we want in our representation. This can be seen in Figure 3(b), in contrast to Figure 1(b) with the two-sided penalty.

By keeping the norm of the Jacobian above some value, intuitively we encourage sensitivity of the learnt function, by preventing it from collapsing to a locally constant function, ignoring all changes in the input space. This argument is speculative, as this regularisation scheme has no effect on sensitivity in directions orthogonal to the local Jacobian, and more work is needed to explain definitively why this penalty seems to encourage sensitivity, as it would seem mathematically that collapsing the representation would still be possible. However, we find empirically that it is important for preserving out of distribution performance. In Appendix C, we evaluate a number of alternative approaches such as using a reversible model as feature extractor (guaranteed to be invertible) and computing the Jacobian of the kernel vector $K_c(x)$ and of $f_\theta(x)$.

2.3 Why Sensitivity can be at odds with Classification

In this section we analyse some of the trade-offs and assumptions encoded in detecting out-of-distribution inputs. We show in a toy experiment that standard classification losses can hurt out-of-distribution detection. Consider fitting a model on a problem with two features, $x_1$ and $x_2$, both sampled from a unit Gaussian, and output $y$, such that $y = \operatorname{sign}(x_1) \cdot \epsilon$, where $\epsilon$ is noise with a low probability of flipping the label. The optimal decision function in terms of the empirical risk, no matter the algorithm, is the function $f(x) = \operatorname{sign}(x_1)$. But this says nothing about the out of distribution behaviour. What happens if we now see an input with an extreme value of $x_2$? By our definition of the problem, this is out of distribution, as it lies many standard deviations away from the observed data. But should it be detected as out of distribution? The data does not define what could be given as the input, at least if we take a conventional empirical risk minimisation approach.

In this situation, it seems natural to prefer the kind of decisions which would be made by a generative model, for example. If $x_1$ and $x_2$ represent medical data, then presumably a highly abnormal value for $x_2$ is notable, and we would like to detect it. However, if $x_2$ is a truly irrelevant variable, say, the temperature on the surface of a distant planet, then presumably our model is correct to ignore its value, even if the value of the irrelevant variable is highly abnormal. When training using empirical risk minimisation, features not relevant to classification accuracy can simply be ignored by the feature extractors of a neural network. This makes out-of-distribution detection more difficult using feature space methods, even those that use a distance loss as we do. It is important to note that there is a potential tension here with classification accuracy. Enforcing sensitivity can make accurate classification harder because it forces the model to represent changes in input; as in the example above, these may be irrelevant to the causal structure of the problem. If we know about invariances that are appropriate for the problem at hand, we can enforce these by corresponding construction of the network. For example, we enforce translation invariance by using convolutional networks in this paper.
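The toy problem can be simulated in a few lines. The sketch below is an illustration under assumptions: the label is taken to be the sign of the first feature, the flip probability of 0.05, the sample size, the random seed and the specific out-of-distribution input `(1.0, 1000.0)` are hypothetical values chosen for the example, and the helper names are not from the text.

```python
import random


def sample(n, flip_prob=0.05, seed=0):
    """Draw n points with two unit-Gaussian features; the label depends only on x1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = 1 if x1 >= 0 else -1
        if rng.random() < flip_prob:  # label noise
            y = -y
        data.append(((x1, x2), y))
    return data


def classify(x):
    """Risk-optimal decision function for this problem: it ignores x2 entirely."""
    return 1 if x[0] >= 0 else -1


data = sample(10_000)
accuracy = sum(classify(x) == y for x, y in data) / len(data)

typical = classify((1.0, 0.0))
abnormal = classify((1.0, 1000.0))  # x2 lies far outside the training range
```

The classifier reaches roughly one minus the flip probability in accuracy, yet it returns the same answer for the typical and the abnormal input, since nothing in the empirical risk rewards it for noticing the second feature.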

3 Related Work

The largest body of research on obtaining uncertainty in deep learning concerns Bayesian neural networks (MacKay, 1992; Neal, 2012). While exact inference in them is intractable, a range of approximate methods have been proposed. Mean-field variational inference methods, such as Bayes by Backprop (Blundell et al., 2015) and Radial BNNs (Farquhar et al., 2020) are a promising direction but have not yet led to stable training on large image datasets. A more scalable alternative is MC Dropout (Gal and Ghahramani, 2016), which is very simple to implement and evaluate. In practice, these variational Bayesian methods are outperformed by Deep Ensembles (Lakshminarayanan et al., 2017). This is a simple, non-Bayesian, method that involves training multiple deep models from different initialisations and a different data set ordering. Snoek et al. (2019) showed that Deep Ensembles consistently outperform Bayesian neural networks that were trained using variational inference. This performance comes at the expense of computational cost: Deep Ensembles’ memory and compute use scales linearly with the number of ensemble elements at both train and test time.

Aside from using discriminative models, there have also been attempts at finding out of distribution data using generative models. Nalisnick et al. (2019a) showed that simply measuring the likelihood under the data distribution does not work. Recently, a more advanced approach that involves separating the likelihood of the semantic foreground from the background did show promising results on selected datasets (Ren et al., 2019). While generative models are a promising avenue for out of distribution detection, they are not able to assess predictive uncertainty; given that a data point is in distribution, can our discriminative model actually make a reliable prediction? Further, generative models are significantly more expensive to train than classification models.

Our approach is distinct from both ensembles/Monte Carlo methods, which aim to find different explanations for the data and increase uncertainty when these disagree, and generative models, which model the data distribution directly. Instead our approach is more related to pre-deep learning kernel methods, such as Gaussian processes which revert to a prior away from data, and Support Vector Machines, where the distance to the separating hyperplane is informative of the uncertainty. These approaches have never scaled to high dimensional data, because of a lack of well performing kernel functions.

The decision function based on kernel distances was first used in the context of convolutional neural networks by LeCun et al. (1998a). They were quickly abandoned for softmax models, because they were difficult to scale and optimise with gradient-based approaches due to saturating gradients and unstable centroids. Notable improvements in our work over the original are the updating mechanism of the centroids, solving the unstable centroids, and the loss function that is based on a multivariate Bernoulli, solving saturating gradients.

Regularising the Jacobian has a long history, starting with Drucker and Le Cun (1992) and more recently Ross and Doshi-Velez (2018). Both papers aim to regularise the norm of the Jacobian down to zero: the first to obtain better generalisation, the second to achieve adversarial robustness and interpretability. In neither case are the authors interested in increasing the Jacobian. Gulrajani et al. (2017) showed how a gradient penalty can be applied to training GANs with the Wasserstein distance, which was a more scalable and simpler alternative to weight clipping. They use the two-sided penalty and mention it works better in practice. Follow up work has analysed the penalty in more detail and concluded that, contrary to our case, for training Wasserstein GANs the one-sided penalty is preferable theoretically and practically (Jolicoeur-Martineau and Mitliagkas, 2019; Petzka et al., 2017).

4 Experiments

We show the behaviour of DUQ in two dimensions with the two moons dataset, and show the effect of leaving out the gradient penalty and of using a one-sided penalty. We continue by looking at the out of distribution detection performance for some notably difficult data set pairs (Nalisnick et al., 2019a), such as FashionMNIST vs MNIST, and CIFAR-10 vs SVHN. We further study sensitivity to two important hyper parameters, the length scale and the gradient penalty weight, and propose how to tune them without relying on example OoD data.

4.1 Two Moons

We use the scikit-learn (Pedregosa et al., 2011) implementation of this dataset and describe the model architecture and optimisation details in Appendix A.1.

The result of our model trained with a two-sided gradient penalty is shown in Figure 1(b). The uncertainty is exactly as one would expect for the two moons dataset: certain on the training data, uncertain away from it and in the heart within the two moons. The difference with Deep Ensembles is striking (Figure 1(a)). The uncertainty for DUQ is quantified as the distance to the closest centroid ($\max$ over the kernel distances); the uncertainty for Deep Ensembles is computed as the predictive entropy of the average output, see Appendix B. The ensemble elements were trained separately using the same model as described in Appendix A.1, but without L2 regularisation to encourage diverse solutions.
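The ensemble uncertainty used for comparison, the predictive entropy of the average output, can be sketched as follows. This is the standard definition rather than code from Appendix B; the function name `predictive_entropy` and the toy probability vectors are hypothetical.

```python
import math


def predictive_entropy(member_probs):
    """Entropy of the mean of the ensemble members' class probability vectors."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / num_members for c in range(num_classes)]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0.0)


agree = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])     # members agree confidently
disagree = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])  # members disagree
```

The entropy is high only where members disagree or are individually unsure. If all members extrapolate in the same confident way far from the data, the entropy stays low there, which matches the behaviour of Deep Ensembles on two moons described in this section.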

Discussion While Figure 1(b) is an impressive result in deep learning, it is worth highlighting that Gaussian processes are able to obtain such a result too. A good visualisation can be found in Bradshaw et al. (2017). Interestingly, even though Deep Ensembles have been successfully applied to many large datasets (Snoek et al., 2019), they fail to estimate uncertainty well on the two moons dataset. This is due to the simplicity and low dimensionality of this dataset: the ensembles generalise in nearly the same way, with a diagonal line dividing the top left and the bottom right.

(a) DUQ - No penalty
(b) DUQ - One-sided penalty
Figure 3: Uncertainty results for two variations of DUQ: left without gradient penalty, and right with a one-sided gradient penalty (). Yellow indicates certainty, while blue indicates uncertainty. Both results are significantly worse than DUQ with a two-sided penalty.

Gradient Penalty In Section 2.1, we introduced the two-sided gradient penalty. Figure 3 shows why it is important. In Figure 3(a), we show the result of having no gradient penalty, which shows that the model is certain even far away from the data. In Figure 3(b), we see that the uncertainty does not improve when only a one-sided penalty is applied. In both cases, there are ’blobs’ sticking out of the training data domain that are also classified with high certainty.

Hyper parameters We found classification performance on two moons to be insensitive to our setting of the gradient penalty weight $\lambda$, likely because of the simplicity of the two moons dataset. For the uncertainty visualisation, we found it important to set the length scale $\sigma$ to be small (in the interval ), despite accuracy not being affected by this hyper parameter. In the following experiments, we will discuss methods for picking the length scale and the weight of the gradient penalty.

4.2 FashionMNIST vs MNIST

In this experiment, we assess the quality of our epistemic uncertainty estimation by looking at how well we can separate the test set of FashionMNIST (Xiao et al., 2017) from the test set of MNIST (LeCun et al., 1998b) by looking at the epistemic uncertainty. We train our model on FashionMNIST and we expect it to assign high certainty to the FashionMNIST test set, but low certainty to MNIST, since the model has never seen that dataset before and it is very different from FashionMNIST.

During evaluation we compute uncertainty scores on both test sets and measure for a range of thresholds how well the two are separated. As in previous work (Ren et al., 2019), we report the AUROC metric, where a higher value is better and a value of 1 indicates that all FashionMNIST data points have a higher certainty than all MNIST data points. We picked FashionMNIST vs MNIST because it is a notably difficult dataset pair (Nalisnick et al., 2019a), while MNIST vs NotMNIST (Bulatov, 2011) is much simpler.
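For reference, the AUROC of this separation task can be computed directly from the two sets of certainty scores, using its interpretation as the probability that a randomly chosen in-distribution point is ranked above a randomly chosen OoD point. The sketch below is a simple quadratic-time version for illustration; the function name `auroc` is hypothetical, and the scores are assumed to be certainties (higher means more certain), such as the maximum kernel value.

```python
def auroc(in_dist_certainty, ood_certainty):
    """Probability that an in-distribution point has higher certainty than an OoD point.

    Ties count as half a win. Runs in O(len(in) * len(ood)) time.
    """
    wins = 0.0
    for a in in_dist_certainty:
        for b in ood_certainty:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(in_dist_certainty) * len(ood_certainty))


perfect = auroc([0.9, 0.8], [0.1, 0.2])  # every in-distribution point ranks higher
chance = auroc([0.9, 0.1], [0.5, 0.5])   # no better than random ordering
```

For full test sets a sort-based or library implementation would be preferable; the pairwise form is shown because it makes the meaning of the metric explicit.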

Experimental set up Our model is a three layer convolutional network and we report all architectural and optimisation details in Appendix A.2. It is important to note that at test time we set Batch Normalization to evaluation mode, meaning that we use the mean and standard deviation of the feature activations computed from the training set (i.e. FashionMNIST). It is unlikely that in practice we would get an entire batch of (uncorrelated) OoD points, so we cannot normalise using test time batch statistics. Further, we use the same data normalisation for the out of distribution set as the in distribution set. Skipping either of these steps makes the problem artificially simple.

Figure 4: ROC curve for DUQ trained on FashionMNIST and evaluated on FashionMNIST and MNIST. The task is to separate these data sets based on epistemic uncertainty estimates.

Length Scale Most hyper parameters, such as the learning rate or weight decay parameter, can be set using the standard train/validation split. However there are two hyper parameters that are particularly important: the length scale $\sigma$ and the gradient penalty weight $\lambda$. We set the length scale by doing a grid search over the interval while keeping $\lambda = 0$. We pick the value that leads to the highest validation accuracy. Following this process, we found that a length scale of leads to the highest accuracy, as measured over five runs. While this process might not result in a length scale that leads to the best OoD performance, it works well in practice.

Gradient Penalty Setting the parameter $\lambda$ is more involved: from Section 2.3, we know that the accuracy can suffer as a result of gaining the ability to do out of distribution detection, so we cannot rely on it to select the best $\lambda$. We also cannot use the AUROC score on the MNIST dataset, because that would give the method an unfair advantage: we cannot assume access to the OoD set in advance in practice (if we do assume access, then we can trivially train a binary classifier on the original and OoD set). Instead we use a third dataset on which we evaluate the AUROC and select our values based on that. We follow previous work (Ren et al., 2019) and use NotMNIST as the third dataset for this pair. The results can be seen in Table 1. As expected, the accuracy goes down as $\lambda$ increases, and we also observe that the best AUROC result for NotMNIST coincides with the best score for MNIST, which shows that the strategy of selecting a hyper parameter based on the NotMNIST data set is reasonable. We note that while NotMNIST generalises to MNIST, we cannot rely on this property in general. Therefore, we propose an alternative method for model selection based on predictive uncertainty in Section 4.3.

Figure 5: Rejection classification plot: accuracy on a combination of FashionMNIST and MNIST test sets. The x-axis indicates the proportion of data rejected based on the uncertainty score. The theoretical maximum is computed from a classifier that has 100% accuracy on FashionMNIST and rejects all MNIST points first.

Comparison We show our results and compare with alternative methods in Table 2. Our proposed method, DUQ, outperforms all other classification based methods. The only method that is better is LL ratio (Ren et al., 2019), which is based on generative models. This type of model is more computationally costly to train than DUQ. The PixelCNN++ (Salimans et al., 2017) used by LL ratio for FashionMNIST uses 2 blocks of 5 gated ResNet layers, while our model is a simple three layer convolutional network. An alternative, competitive approach is Mahalanobis Distance (Lee et al., 2018), which computes a distance in the feature space of a pretrained softmax/cross entropy model in combination with a number of dataset specific augmentations that rely on tuning via the out of distribution dataset or in some cases an alternative third dataset.

The difference in AUROC between our Deep Ensemble result and Ren et al. (2019)’s is due to using different architectures. For a fair comparison, we use the same architecture for the ensemble elements as for DUQ (replacing the class dependent final layer by the usual single linear layer). In Figure 4, we show the complete ROC curve for our implementation of Deep Ensembles and DUQ. We see that DUQ outperforms Deep Ensembles at all chosen rates.

Accuracy and Gradient Penalty To confirm that training using DUQ’s distance based output achieves competitive accuracy, we train two models using our architecture: the standard softmax and cross entropy set up and DUQ with . We obtained accuracy for the softmax model, and for our proposed set up , both averaged over five runs. The results show that we can obtain competitive accuracy using DUQ, resolving previous problems with RBF networks. In Table 1, we show how accuracy changes for an increasingly weighted gradient penalty. The accuracy only degrades slightly, while AUROC is improved.

$\lambda$ Acc (FM) AUROC (NM) AUROC (M)
0
0.05
0.1
0.2
0.3
0.5
1.0
Table 1: FM stands for FashionMNIST, NM for NotMNIST, and M for MNIST. The results are mean/std computed from 5 experiment repetitions. We show AUROC for separating FashionMNIST from NotMNIST and MNIST; higher is better. We see that the gradient penalty improves AUROC performance slightly, but performance on this dataset pair is already very strong.

Rejection Classification In Figure 5, we visualise how well these algorithms work in a more realistic scenario. We combine the FashionMNIST and MNIST test sets, then we reject a certain portion of the combined dataset by uncertainty score. Next we compute the accuracy on the remaining data for each portion, considering all predictions on the OoD MNIST set to be incorrect. We expect the accuracy to go up as we reject more of the data points on which the model is uncertain. Ideally, we reject the incorrectly classified FashionMNIST points and all MNIST points. The Theoretical Maximum is computed by assuming a model that has perfect accuracy on the FashionMNIST test set and is able to reject all MNIST data before any FashionMNIST data. This experiment combines out of distribution detection with detecting difficult to classify data points, which is closer to actual deployment scenarios than the AUROC metric and is also an evaluation method suggested on practical grounds by Filos et al. (2019). Note that the ensemble model has an accuracy of 0.936 on FashionMNIST, giving it a head start of 1.2 percentage points on DUQ, which has an accuracy of 0.924. We see that DUQ outperforms Deep Ensembles in this more realistic scenario.
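The rejection classification procedure can be sketched as below. This is an illustrative reading of the description above, not the evaluation code: the function name `rejection_curve`, the rounding of the rejected count and the toy scores are assumptions made for the example.

```python
def rejection_curve(certainty, correct, fractions):
    """Accuracy on the retained points after rejecting the least certain fraction.

    certainty[i]  certainty score of point i (higher means more certain)
    correct[i]    True if the prediction for point i is right; OoD points are always False
    fractions     proportions of the combined data set to reject
    """
    # Indices sorted from most to least certain; rejection removes the tail.
    order = sorted(range(len(certainty)), key=lambda i: certainty[i], reverse=True)
    curve = []
    for frac in fractions:
        n_keep = len(order) - int(round(frac * len(order)))
        kept = order[:n_keep]
        curve.append(sum(correct[i] for i in kept) / len(kept) if kept else float("nan"))
    return curve


# Toy combined test set (hypothetical): three in-distribution points, then two OoD points.
certainty = [0.9, 0.8, 0.7, 0.2, 0.1]
correct = [True, True, False, False, False]
curve = rejection_curve(certainty, correct, fractions=[0.0, 0.4, 0.6])
```

In the toy example the accuracy rises as the two OoD points and then the misclassified in-distribution point are rejected, which is the behaviour a well-calibrated uncertainty score should produce.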

Method                         AUROC
DUQ                            0.955
LL ratio (generative model)    0.994
Single model                   0.843
5 - Deep Ensembles (ours)      0.861
5 - Deep Ensembles (ll)        0.839
Mahalanobis Distance (ll)      0.942
Table 2: Results on FashionMNIST, with MNIST as OoD set. Deep Ensembles is by Lakshminarayanan et al. (2017), Mahalanobis Distance by Lee et al. (2018), LL ratio by Ren et al. (2019). Results marked by (ll) are obtained from Ren et al. (2019), (ours) is implemented using our architecture. Single model is our architecture, but trained with softmax/cross entropy.
Figure 6: A histogram of uncertainty estimates as computed using DUQ (). CIFAR-10 and SVHN are clearly separated. The counts are normalised, because the SVHN test set is significantly larger than CIFAR-10’s.

4.3 CIFAR-10 vs SVHN

In this section we look at the CIFAR-10 dataset (Krizhevsky et al., 2014), with SVHN (Netzer et al., 2019) as OoD set. We use a ResNet-18 (He et al., 2016) as feature extractor, specifically the version provided by PyTorch (Paszke et al., 2017) with some minor modifications: we use 64 filters in the first convolutional layer, and skip the first pooling operation and last linear layer. CIFAR-10 is a difficult dataset for out of distribution detection for several reasons. There is a significant amount of data noise: some of the dog and cat examples are not distinguishable using only 32 by 32 pixels. The training set is small compared to its complexity, making it easy to overfit without data augmentation.

Experimental set up As in the previous section, we tune the length scale using the accuracy on the validation set, and find that works best from a range of . We train for a fixed 75 epochs and reduce the learning rate by a factor of at 25 and 50 epochs. We use random horizontal flips and random crops as data augmentation and find that this is enough regularisation to prevent the model from overfitting. All architectural and optimisation details are described in Appendix A.3. We obtain an accuracy of using the standard softmax/cross entropy loss. A Deep Ensemble of several softmax models obtains an accuracy of . DUQ without a gradient penalty ($\lambda = 0$) obtains accuracy, while accuracy of DUQ with is .

Gradient Penalty For CIFAR-10, we do not use a third dataset to set the weight of the gradient penalty. Instead, we avoid using more data and look at the AUROC of the classification accuracy on the CIFAR-10 validation set with varying thresholds of the uncertainty. We found this value to be a strong indicator of out of distribution performance. In general, this approach is preferable to using a third dataset, because it is difficult to find an appropriate out of distribution dataset with the same characteristics as the data encountered during deployment. Consider a particularly difficult traffic situation or an MRI scan that shows a new type of disease: such scenarios have no reasonable out of distribution set available. Generative models are not able to take this approach, because they do not have predictive uncertainty. Even if we use a hybrid model (Nalisnick et al., 2019b), the discriminative part, a softmax/cross entropy model, does not have reliable predictive uncertainty.
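One way to read this selection criterion is as a binary ranking problem: a good uncertainty estimate should assign higher certainty to correctly classified validation points than to misclassified ones, and the AUROC summarises how well it does so across all thresholds. The following is a minimal sketch in plain Python under that reading. The function name `auroc` and the toy numbers are illustrative assumptions, not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    scores: higher means "more likely positive" (here: more certain).
    labels: True for positives (here: correctly classified points).
    Tied scores receive the average rank of their group.
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, label in zip(ranks, labels) if label)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy validation set (hypothetical values): certainty and correctness.
certainty = [0.99, 0.95, 0.90, 0.60, 0.40, 0.30]
correct = [True, True, False, True, False, False]
validation_auroc = auroc(certainty, correct)
print(validation_auroc)
```

Under this reading, the candidate setting with the highest validation AUROC would be selected, without touching any out of distribution data.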

Figure 7: Rejection classification plot, which shows model performance on a mix of CIFAR-10 and SVHN, while rejecting uncertain points. The theoretical maximum is achieved when a hypothetical classifier obtains 100% accuracy on CIFAR-10 and rejects all SVHN data points first. We see that DUQ and a 5-element Deep Ensemble perform very similarly.

Results In Figure 6, we show a normalised histogram of the kernel distances for CIFAR-10 and SVHN. We see that most of CIFAR-10 is very close to 1, while SVHN is uniformly spread out over the range of distances. This shows that DUQ works as expected and that out of distribution data ends up away from all of the centroids in feature space.
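To make the quantity on the histogram's axis concrete, here is a minimal sketch that assumes a plain RBF kernel, exp(-||z - e_c||^2 / (2 sigma^2)), between a feature vector z and each class centroid e_c, with the certainty taken as the largest kernel value. The exact kernel used by DUQ may contain additional terms, so this is only an illustration of why points near a centroid score close to 1 and points far from every centroid score close to 0. The names `rbf_kernel` and `certainty`, the centroids, and the length scale are hypothetical.

```python
import math


def rbf_kernel(z, centroid, length_scale):
    """Plain RBF kernel value between a feature vector and one centroid."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, centroid))
    return math.exp(-sq_dist / (2 * length_scale ** 2))


def certainty(z, centroids, length_scale):
    """Certainty as the kernel value of the closest centroid."""
    return max(rbf_kernel(z, c, length_scale) for c in centroids)


# Hypothetical 2-D feature space with two class centroids.
centroids = [[0.0, 0.0], [3.0, 3.0]]
near = certainty([0.05, -0.05], centroids, length_scale=0.5)
far = certainty([1.5, -1.5], centroids, length_scale=0.5)
print(near, far)
```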

The rejection classification plot, Figure 7, is created in the same way as in the previous section. Note that this time the Theoretical Maximum line is significantly lower, because the SVHN test set contains close to 26,000 elements, while CIFAR-10’s only contains 10,000. This means that the best possible accuracy when 100% of the data is considered is about 28%. We see that DUQ and Deep Ensembles perform similarly.
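The 28% ceiling follows directly from the sizes of the two test sets. The sketch below assumes the standard test splits (10,000 CIFAR-10 images and 26,032 SVHN images) and treats every SVHN point as an error when it is not rejected. The function names and the toy inputs are illustrative, not the authors' evaluation code.

```python
def theoretical_max_accuracy(n_in, n_ood, fraction_kept):
    """Accuracy of an ideal classifier that rejects all OoD points first."""
    total = n_in + n_ood
    kept = max(1, round(fraction_kept * total))
    return min(n_in, kept) / kept


def rejection_curve(certainties, is_correct, fractions):
    """Accuracy on the most certain points for each kept fraction.

    is_correct[i] must be False for every out of distribution point,
    since no in-distribution label can be right for it.
    """
    order = sorted(range(len(certainties)),
                   key=lambda i: certainties[i], reverse=True)
    accuracies = []
    for fraction in fractions:
        kept = max(1, round(fraction * len(order)))
        hits = sum(1 for i in order[:kept] if is_correct[i])
        accuracies.append(hits / kept)
    return accuracies


# Assumed test set sizes: CIFAR-10 (in-distribution) and SVHN (OoD).
ceiling = theoretical_max_accuracy(10_000, 26_032, fraction_kept=1.0)
print(round(ceiling, 3))

# Toy example: two certain, correct points and two uncertain OoD points.
toy_curve = rejection_curve([0.9, 0.8, 0.3, 0.2],
                            [True, True, False, False],
                            fractions=[0.5, 1.0])
print(toy_curve)
```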

In Table 3, we compare DUQ with several alternative methods. We see that DUQ performs competitively with a number of recent approaches. Interestingly, on these more complicated data sets Deep Ensembles performs the best. We suspect this is because the complexity of the data set allows the ensemble elements to be more diverse while still explaining the data well.

We further see a significant gap between DUQ with and without a gradient penalty: there is a big improvement going from to . We suspect this is because there is a lot of within class variation, which incentivises the model to collapse more diverse data points to the class centroids.

Runtime One of the main advantages of DUQ over Deep Ensembles is computational cost. For Deep Ensembles, both computation and memory cost scale linearly in the number of ensemble components, during both train and test time. DUQ has to compute the Jacobian at training time, which is expensive, but at test time there is only a marginal overhead over a softmax based model. Training for one epoch on a modern 1080 Ti GPU takes 21 seconds for a softmax/cross entropy model, which leads to 105 seconds for a Deep Ensemble with 5 components. DUQ with gradient penalty needs 103 seconds for one epoch at training time, but only 27 seconds without gradient penalty. DUQ is about 25% slower at test time than a single softmax/cross entropy model, but about 4 times faster than a Deep Ensemble with 5 components.
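The timings quoted above are mutually consistent, which a few lines of arithmetic confirm. This sketch uses only the numbers stated in this section and assumes that an ensemble costs exactly as many single-model passes as it has components. The variable names are illustrative.

```python
softmax_epoch_s = 21      # one softmax/cross entropy model, per epoch
ensemble_size = 5
duq_gp_epoch_s = 103      # DUQ with gradient penalty, per epoch
duq_no_gp_epoch_s = 27    # DUQ without gradient penalty, per epoch

# Training: the ensemble trains each component separately.
ensemble_epoch_s = softmax_epoch_s * ensemble_size

# Training: relative cost of adding the gradient penalty to DUQ.
gp_overhead = duq_gp_epoch_s / duq_no_gp_epoch_s

# Test time, in units of one softmax forward pass.
duq_test_cost = 1.25                        # "about 25% slower"
ensemble_test_cost = float(ensemble_size)   # assumed: 5 forward passes
speedup_over_ensemble = ensemble_test_cost / duq_test_cost

print(ensemble_epoch_s, round(gp_overhead, 2), speedup_over_ensemble)
```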

Method AUROC
DUQ ()
DUQ ()
LL ratio (generative model) 0.930
Single model
5 - Deep Ensembles 0.943
Table 3: Deep Ensembles is by Lakshminarayanan et al. (2017), but re-implemented and evaluated using our architecture. LL ratio is by Ren et al. (2019). Single model is our architecture, but trained with softmax/cross entropy. We show the AUROC for separating CIFAR-10 from SVHN.

5 Conclusion

We introduced DUQ, Deterministic Uncertainty Quantification, a simple method for obtaining epistemic uncertainty using a deep neural network in a single forward pass. Evaluations show that our method is better in some scenarios and competitive in others with the more computationally expensive Deep Ensembles.

Interesting future work would be to evaluate the adversarial robustness of DUQ. It is important that the adversarial attacks are chosen with care, since many previous attacks were designed with the softmax output in mind. Further, it is possible that there are alternatives to the gradient penalty to achieve the desired sensitivity effect.

6 Acknowledgements

We thank Andreas Kirsch, Luisa Zintgraf, Bas Veeling, Milad Alizadeh, Christos Louizos, and Bobby He for helpful discussions and feedback. We also thank the rest of OATML for feedback at several stages of the project. JvA/LS are grateful for funding by the EPSRC (grant reference EP/N509711/1 and EP/L015897/1, respectively). JvA is also grateful for funding by Google-DeepMind.

References

  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §3.
  • J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani (2017) Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476. Cited by: §4.1.
  • Y. Bulatov (2011) NotMNIST dataset. Tech. Rep.[Online]. Available: http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html. Cited by: §4.2.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §C.1.
  • H. Drucker and Y. Le Cun (1992) Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks 3 (6), pp. 991–997. Cited by: §1, §2.1, §3.
  • S. Farquhar, M. Osborne, and Y. Gal (2020) Radial Bayesian neural networks: beyond discrete support in large-scale bayesian deep learning. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. Cited by: §3.
  • A. Filos, S. Farquhar, A. N. Gomez, T. G. Rudner, Z. Kenton, L. Smith, M. Alizadeh, A. de Kroon, and Y. Gal (2019) A systematic comparison of bayesian deep learning robustness in diabetic retinopathy tasks. arXiv preprint arXiv:1912.10481. Cited by: §4.2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1, §3.
  • Y. Gal (2016) Uncertainty in deep learning. University of Cambridge 1, pp. 3. Cited by: §1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §2.1, §2.1, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §4.3.
  • J. Hoffman, D. A. Roberts, and S. Yaida (2019) Robust learning with jacobian regularization. arXiv preprint arXiv:1908.02729. Cited by: §C.2.
  • N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §1.
  • M. F. Hutchinson (1990) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation 19 (2), pp. 433–450. Cited by: §C.2.
  • J. Jacobsen, A. Smeulders, and E. Oyallon (2018) I-revnet: deep invertible networks. In ICLR 2018-International Conference on Learning Representations, Cited by: §C.1.
  • A. Jolicoeur-Martineau and I. Mitliagkas (2019) Connections between support vector machines, wasserstein distance and gradient-penalty gans. arXiv preprint arXiv:1910.06922. Cited by: §3.
  • A. Krizhevsky, V. Nair, and G. Hinton (2014) The cifar-10 dataset. online: http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §4.3.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §1, §3, Table 2, Table 3.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998a) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §3.
  • Y. LeCun, C. Cortes, and C. J. Burges (1998b) The mnist database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist 10, pp. 34. Cited by: §4.2.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §4.2, Table 2.
  • D. J. MacKay (1992) Bayesian methods for adaptive models. Ph.D. Thesis, California Institute of Technology. Cited by: §3.
  • E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019a) Do deep generative models know what they don’t know?. In International Conference on Learning Representations. Cited by: §3, §4.2, §4.
  • E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019b) Hybrid models with deep and invertible features. arXiv preprint arXiv:1902.02767. Cited by: §4.3.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §3.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng (2019) The street view house numbers (svhn) dataset. Accessed 2016-08-01.[Online]. Available: http://ufldl. stanford. edu …. Cited by: §4.3.
  • I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034. Cited by: §1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.3.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §4.1.
  • H. Petzka, A. Fischer, and D. Lukovnicov (2017) On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894. Cited by: §3.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems 32, pp. 14680–14691. Cited by: §3, §4.2, §4.2, §4.2, §4.2, Table 2, Table 3.
  • A. S. Ross and F. Doshi-Velez (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2.1, §3.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) Pixelcnn++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517. Cited by: §4.2.
  • J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pp. 13969–13980. Cited by: §3, §4.1.
  • A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1, §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.2.

Appendix A Experimental Details

This section contains all the information needed to reproduce our experiments.

A.1 Two Moons

We set the noise level to 0.1 and generate 1000 points for our training set. Our model consists of three layers with 20 hidden units each; the embedding size is 10. We use the ReLU activation function and standard SGD optimiser with learning rate 0.01, momentum 0.9 and L2 regularisation with weight . Our batch size is 64 and we train for a set 30 epochs. We set the length scale to and to 0.99.
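For readers who want to recreate a comparable training set, the sketch below generates two interleaving half circles with Gaussian noise using only the Python standard library. It mirrors the usual two moons construction (as found, for example, in scikit-learn), which is an assumption about how the data was produced. The function name `make_two_moons` and the seed are hypothetical.

```python
import math
import random


def make_two_moons(n_samples=1000, noise=0.1, seed=0):
    """Return (points, labels) for two interleaving noisy half circles."""
    rng = random.Random(seed)
    n_outer = n_samples // 2
    n_inner = n_samples - n_outer
    points, labels = [], []

    # Upper moon: half circle of radius 1 centred at the origin.
    for k in range(n_outer):
        t = math.pi * k / max(1, n_outer - 1)
        points.append((math.cos(t), math.sin(t)))
        labels.append(0)

    # Lower moon: mirrored half circle, shifted right and down.
    for k in range(n_inner):
        t = math.pi * k / max(1, n_inner - 1)
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))
        labels.append(1)

    # Isotropic Gaussian noise with standard deviation `noise`.
    noisy = [(x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise))
             for x, y in points]
    return noisy, labels


train_points, train_labels = make_two_moons(n_samples=1000, noise=0.1)
print(len(train_points), sum(train_labels))
```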

A.2 FashionMNIST

We use a model consisting of three convolutional layers of 64, 128 and 128 3x3 filters, with a fully connected layer of 256 hidden units on top. The embedding size is 256. After every convolutional layer, we perform batch normalization and a 2x2 max pooling operation.

We use the SGD optimizer with learning rate (decayed by a factor of 5 every 10 epochs), momentum , weight decay and train for a set 30 epochs. The centroid updates are done with . The output dimension of the model, is 256, and we use the same value for the size of the centroids, .

We normalise our data using per channel mean and standard deviation, as computed on the training set. The validation set contains 5000 elements, removed at random from the full 60,000 elements in the training set. For the final results, we rerun on the full training set with the final set of hyper parameters.

A.3 CIFAR-10

We use a ResNet-18, as implemented in torchvision version 0.4.2 (available online at: https://github.com/pytorch/vision/tree/v0.4.2). We make the following modifications: the first convolutional layer is changed to have 64 3x3 filters with stride 1, the first pooling layer is skipped and the last linear layer is changed to be 512 by 512.

We use the SGD optimizer with learning rate of , decayed by a factor 10 every 25 epochs, momentum of , weight decay and we train for a set 75 epochs. The centroid updates are done with . The output dimension of the model, is 512, and we use the same value for the size of the centroids .
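The step schedule described here (divide the learning rate by 10 every 25 epochs, for 75 epochs in total) can be written down in a few lines. This is a minimal sketch; the base learning rate below is a hypothetical placeholder rather than the value used in the experiments, and `step_decay_lr` is an illustrative name.

```python
def step_decay_lr(base_lr, epoch, step_size, factor):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is divided by `factor` once every `step_size` epochs.
    """
    return base_lr / (factor ** (epoch // step_size))


base_lr = 0.1  # hypothetical placeholder, not the paper's value
schedule = [step_decay_lr(base_lr, epoch, step_size=25, factor=10)
            for epoch in range(75)]

# The rate changes at epochs 25 and 50 and is constant in between.
print(schedule[0], schedule[25], schedule[50])
```

The same helper covers the FashionMNIST schedule from Appendix A.2 by setting `step_size=10` and `factor=5` over 30 epochs.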

We normalise our data using per channel mean and standard deviation, as computed on the training set. We augment the data at training time using random horizontal flips (with probability 0.5) and random crops after padding 4 zero pixels on all sides. The validation set contains 10,000 elements, removed at random from the full 50,000 elements in the training set. For the final results, we rerun on the full training set with the final set of hyper parameters.
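The augmentation described above can be sketched on a single-channel image stored as a list of rows. This is an illustration in plain Python rather than the pipeline used for the experiments. The helper names (`pad_zero`, `random_crop`, `augment`) are hypothetical, and a square image is assumed.

```python
import random


def pad_zero(image, pad):
    """Surround a 2-D image (list of rows) with `pad` zero pixels."""
    width = len(image[0])
    blank = [0] * (width + 2 * pad)
    rows = [[0] * pad + list(row) + [0] * pad for row in image]
    top = [list(blank) for _ in range(pad)]
    bottom = [list(blank) for _ in range(pad)]
    return top + rows + bottom


def random_crop(image, size, rng):
    """Cut a random size x size window out of a 2-D image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, rng, pad=4, flip_prob=0.5):
    """Zero-pad, crop back to the original size, then maybe flip."""
    size = len(image)  # assumes a square image
    out = random_crop(pad_zero(image, pad), size, rng)
    if rng.random() < flip_prob:
        out = [row[::-1] for row in out]  # horizontal flip
    return out


rng = random.Random(0)
image = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = augment(image, rng)
print(len(augmented), len(augmented[0]))
```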

Appendix B Deep Ensemble Uncertainty

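The uncertainty in Deep Ensembles is measured as the entropy of the average predictive distribution:

$\hat{p}(y \mid x) = \frac{1}{M} \sum_{i=1}^{M} p(y \mid x; \theta_i)$

$H[\hat{p}(y \mid x)] = -\sum_{c=1}^{C} \hat{p}(y = c \mid x) \log \hat{p}(y = c \mid x)$

with $\theta_i$ the set of parameters for ensemble element $i$, $M$ the number of ensemble elements and $C$ the number of classes.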
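As a concrete illustration, the entropy of the average predictive distribution can be computed as follows. This is a minimal sketch in plain Python; the function names and the toy softmax outputs are illustrative and not taken from the authors' code.

```python
import math


def average_prediction(member_probs):
    """Average the class probabilities over all ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members
            for c in range(n_classes)]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_uncertainty(member_probs):
    """Entropy of the average predictive distribution."""
    return entropy(average_prediction(member_probs))


# Three members that agree versus three that confidently disagree.
agree = [[0.98, 0.01, 0.01]] * 3
disagree = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
print(ensemble_uncertainty(agree), ensemble_uncertainty(disagree))
```

Each member in the second case is individually confident, yet the averaged distribution is uniform, so the entropy reaches its maximum of log(3). This is why the measure depends on diversity between ensemble members.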

Figure 8: Uncertainty results on two moons data set. Yellow means certain, while blue indicates uncertainty. The model is a reversible feature extractor in combination with the kernel based output as in DUQ.

Appendix C Gradient Penalty Alternatives

C.1 Reversible Models

An alternative method of enforcing sensitivity is to use an invertible feature extractor. A simple and effective way of doing so is to use the invertible layers originally introduced by Dinh et al. (2014). Using this type of layer leads to strong results on the two moons dataset, as seen in Figure 8. Unfortunately, it is difficult to train reversible models on higher dimensional data sets. Without dimensionality reduction, such as max pooling, the memory usage of this type of network is unreasonably high (Jacobsen et al., 2018). We found it impossible to obtain strong accuracy and uncertainty using this type of model, indicating that dimensionality reduction is an important component of why these models work.

C.2 Gradient Penalty

Empirically, we found that computing the penalty on works well. However, there are two other candidates to enforce the penalty on: , the vector of kernel distances and , the feature vector output of the feature extractor. At first sight, these targets might actually be preferable. Sensitivity of ought to be sufficient to obtain the out of distribution sensitivity properties we desire.

Computing the Jacobian of a vector valued output is expensive using automatic differentiation. To evaluate the two alternative candidates we turn to Hutchinson’s estimator (Hutchinson, 1990), which allows us to estimate the trace of the Jacobian by computing the derivative of random projections of the output. This approach was previously discussed in the context of making neural networks more robust by Hoffman et al. (2019).
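The idea behind the estimator can be shown on a small explicit matrix. The sketch below estimates trace(A) as the average of v^T A v over random sign (Rademacher) vectors v, and applies it to A = J J^T, whose trace is the squared Frobenius norm of a Jacobian J. Choosing J J^T as the target is our reading of how such a penalty is usually set up, not a statement of the authors' implementation. In an autodiff framework the product J^T v would come from a single backward pass of the randomly projected output; here J is written out explicitly, and all names are hypothetical.

```python
import random


def hutchinson_trace(matvec, dim, n_samples, rng):
    """Estimate trace(A) as the mean of v^T A v over Rademacher v."""
    total = 0.0
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / n_samples


def make_jjt_matvec(jacobian):
    """Return the map v -> (J J^T) v for a Jacobian given as rows."""
    n_out, n_in = len(jacobian), len(jacobian[0])

    def matvec(v):
        # J^T v: gradient of the projected output v . f(x) w.r.t. x.
        jt_v = [sum(jacobian[i][j] * v[i] for i in range(n_out))
                for j in range(n_in)]
        # J (J^T v)
        return [sum(row[j] * jt_v[j] for j in range(n_in))
                for row in jacobian]

    return matvec


# Toy Jacobian of a map from 3 inputs to 2 outputs.
jacobian = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
exact = sum(x * x for row in jacobian for x in row)  # ||J||_F^2
estimate = hutchinson_trace(make_jjt_matvec(jacobian), dim=2,
                            n_samples=2000, rng=random.Random(0))
print(exact, estimate)
```

The estimate is unbiased but noisy when few projections are used, which is the source of variance that a penalty based on this estimator has to cope with.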

While we were able to get good uncertainty on the two moons data set using both alternative targets, the results were not consistent. We attempted to reduce the variance of Hutchinson’s estimator by using the same random projection for each element in the batch, which worked well on two moons, but led to unsatisfactory results on larger scale data sets. In conclusion, we found that while is not a priori the best place to compute the gradient penalty, it is still preferable over the noise that comes from applying Hutchinson’s estimator on any of the alternatives, at least in our experiments.