deepstats
Repository for "Known Unknowns: Uncertainty Quality in Bayesian Neural Networks" paper.
We evaluate the uncertainty quality in neural networks using anomaly detection. We extract uncertainty measures (e.g. entropy) from the predictions of candidate models, use those measures as features for an anomaly detector, and gauge how well the detector differentiates known from unknown classes. We assign higher uncertainty quality to candidate models that lead to better detectors. We also propose a novel method for sampling a variational approximation of a Bayesian neural network, called One-Sample Bayesian Approximation (OSBA). We experiment on two datasets, MNIST and CIFAR10. We compare the following candidate neural network models: Maximum Likelihood, Bayesian Dropout, OSBA, and (for MNIST) the standard variational approximation. We show that Bayesian Dropout and OSBA provide better uncertainty information than Maximum Likelihood, and are essentially equivalent to the standard variational approximation, but much faster.
While current Deep Learning focuses on point estimates, many real-world applications require a full range of uncertainty. Reliable confidence on the prediction might be as useful as the prediction itself. The debate over the dangers of overconfident machine learning has reached the headlines of mass media vaughan2016bpost ; crawford2016ai . Indeed, if our models are to drive cars, diagnose medical conditions, and even analyze the risk of criminal recidivism, unreliable confidence appraisal may have dire consequences.

Traditional Deep Learning trains by maximum likelihood (needing aggressive regularization to avoid overfitting) and only provides point estimates, with limited uncertainty information. If the model outputs a vector of probabilities (as a softmax classifier does), we can quantify its uncertainty using the entropy of the prediction. However, the model can predict with high confidence for samples far outside the distribution seen during training gal2016thesis . Frequentist mitigations, like the bootstrap efron1994introduction , do not scale well for deep models.

True Bayesian models infer the posterior distribution over all unknown factors, but their computational demands are often prohibitive. On the other hand, we may profitably reinterpret under a Bayesian perspective some of the ad hoc regularizations used in ordinary Deep Learning (e.g., dropout gal2015dropout ; kingma2015variational , early stopping maclaurin2015early , or weight decay bishop2006pattern ; blundell2015weight ). Gal and Ghahramani gal2015dropout show that multiple dropout forward passes at test time are equivalent to a Bayesian prediction (marginalized over the parameters’ posteriors) given a particular variational approximation. A more direct (and expensive) approach variationally approximates the posterior of each weight blundell2015weight .
Here we propose a novel Bayesian approach for neural networks, similar to the variational approximation of Blundell et al. blundell2015weight , but much cheaper computationally. We call that approach One-Sample Bayesian Approximation (OSBA), and investigate whether it achieves better quality of uncertainty information than traditional maximum likelihood.
We use exactly the same approach presented by Blundell et al. (blundell2015weight , section 3.2), but instead of sampling the weight matrices for each training example, we sample the matrices only once per mini-batch, and use the same weights for all examples in that mini-batch. That approach leads to the same expected gradient, trading off higher variance for computational efficiency (about 10 times faster with a mini-batch of 100).
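The difference between the two sampling schemes can be illustrated with a minimal NumPy sketch of a single fully-connected layer. It assumes a Gaussian variational posterior with the reparameterization W = mu + softplus(rho) * eps; the function names, shapes, and toy values are illustrative assumptions and are not taken from the repository.

```python
import numpy as np


def softplus(x):
    # Maps the unconstrained parameter rho to a positive standard deviation.
    return np.log1p(np.exp(x))


def sample_weights(mu, rho, rng, n_samples=None):
    # Reparameterized draw: W = mu + softplus(rho) * eps, with eps ~ N(0, 1).
    # With n_samples set, returns a stack of independent weight matrices.
    shape = mu.shape if n_samples is None else (n_samples,) + mu.shape
    eps = rng.standard_normal(shape)
    return mu + softplus(rho) * eps


def forward_per_example(x, mu, rho, rng):
    # Standard scheme as described in the text: one weight matrix per example.
    # W has shape (batch, d_in, d_out), so the cost grows with the batch size.
    W = sample_weights(mu, rho, rng, n_samples=x.shape[0])
    return np.einsum("bi,bio->bo", x, W)


def forward_one_sample(x, mu, rho, rng):
    # OSBA-style scheme: a single weight matrix shared by the whole mini-batch.
    W = sample_weights(mu, rho, rng)
    return x @ W


rng = np.random.default_rng(0)
x = np.ones((5, 3))            # toy mini-batch with identical rows
mu = np.zeros((3, 2))          # variational means (hypothetical values)
rho = np.full((3, 2), -3.0)    # unconstrained scale parameters (hypothetical)

out_per_example = forward_per_example(x, mu, rho, rng)
out_one_sample = forward_one_sample(x, mu, rho, rng)
```

Because the toy inputs are identical across the batch, the rows of `out_one_sample` coincide (same weights for every example), while the rows of `out_per_example` differ; the one-sample version draws a single noise matrix instead of one per example.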
To evaluate the quality of uncertainty information, we employ anomaly detection: deciding whether or not a test sample belongs to the classes seen during training. More concretely, we pick a classification problem, exclude some classes from training, and use them to evaluate how much insight a candidate model has about its own classification confidence. We expect Bayesian neural networks to express such uncertainty well, to the point we can use it to decide whether a sample belongs or not to the known classes. Thus, we employ the AUC of the anomaly detector as a relative measure of the quality of the uncertainty information output by candidate models (Figure 1).
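To make the evaluation metric concrete, the following sketch computes the AUC directly from detector scores as the probability that a randomly chosen Out sample receives a higher anomaly score than a randomly chosen In sample (ties count as one half). The function name and toy scores are illustrative assumptions, not part of the repository.

```python
import numpy as np


def anomaly_auc(scores_out, scores_in):
    # Pairwise (Mann-Whitney) form of the AUC: fraction of (Out, In) pairs
    # in which the Out sample gets the higher anomaly score.
    scores_out = np.asarray(scores_out, dtype=float)
    scores_in = np.asarray(scores_in, dtype=float)
    diff = scores_out[:, None] - scores_in[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / diff.size


# Toy scores: unknown-class samples score higher than known-class samples.
perfect = anomaly_auc([0.9, 0.8], [0.1, 0.2])
chance = anomaly_auc([0.5], [0.5])
```

An AUC of 1 means the uncertainty features separate known from unknown classes perfectly, and 0.5 corresponds to chance.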
We contrast two experimental protocols. In the Blind Protocol, we separate the classes into two groups (In and Out); train the candidate neural network using only the In classes; and then train, over the In vs. Out classes, a separate anomaly detector using the uncertainty extracted from the prediction of the candidate network. In the Calibrated Protocol, we separate the classes into three groups (In, Unknown, and Out); train the candidate network using the In classes with the loss function using the correct labels, and the Unknown classes with the loss function using the equiprobable prediction vector; and then train, over the In vs. Out classes, a separate anomaly detector using the same features as before. The test set used to compute the AUC of the anomaly task excludes (obviously) all samples used to train the anomaly detector, and (perhaps less obviously) all samples used to train the candidate neural network.

We use the MNIST lecun1998mnist and CIFAR10 krizhevsky2009learning datasets. For MNIST, the candidate networks have a two-layered fully-connected architecture with 512 neurons each, with dropout of 0.5 applied after each hidden layer. For CIFAR10, the candidate networks have two convolutional blocks (with dropout of 0.25 after each of them), followed by a fully-connected layer with 512 neurons (with dropout of 0.5). We optimize with ADAM kingma2014adam, and limit each training procedure to 100 epochs for MNIST, and 200 epochs for CIFAR10. For each dataset we choose 4 In classes, 4 Out classes, and (for the Calibrated Protocol) 2 Unknown classes (Table 1). We randomize 20 combinations of In/Out[/Unknown] classes, with 5 repetitions each, totaling 100 replications.

Dataset | In | Out | Unknown |
---|---|---|---|
MNIST | | | |
CIFAR10 | | | |
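The class split and the training targets of the Calibrated Protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the use of a random permutation to draw the groups, and the toy labels are hypothetical and not taken from the repository.

```python
import numpy as np


def split_classes(rng, n_classes=10, n_in=4, n_out=4, n_unknown=2):
    # Draw disjoint In / Out / Unknown groups from a random permutation
    # of the class indices (4 + 4 + 2 for a 10-class dataset).
    order = rng.permutation(n_classes).tolist()
    in_classes = sorted(order[:n_in])
    out_classes = sorted(order[n_in:n_in + n_out])
    unknown_classes = sorted(order[n_in + n_out:n_in + n_out + n_unknown])
    return in_classes, out_classes, unknown_classes


def calibrated_targets(labels, in_classes, unknown_classes):
    # In samples get a one-hot target over the In classes; Unknown samples
    # get the equiprobable vector. Out samples never reach network training.
    k = len(in_classes)
    column = {c: j for j, c in enumerate(in_classes)}
    unknown = set(unknown_classes)
    targets = np.zeros((len(labels), k))
    for row, label in enumerate(labels):
        label = int(label)
        if label in column:
            targets[row, column[label]] = 1.0
        elif label in unknown:
            targets[row, :] = 1.0 / k
        else:
            raise ValueError(f"class {label} is not available for training")
    return targets


rng = np.random.default_rng(0)
in_cls, out_cls, unk_cls = split_classes(rng)

# Toy example with a fixed split: classes 0-3 are In, 7-8 are Unknown.
toy_targets = calibrated_targets([0, 7, 3], [0, 1, 2, 3], [7, 8])
```

In the Blind Protocol the same one-hot targets would be used, but no Unknown samples would be passed to training at all.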
As methods, we evaluate the usual baseline of Maximum Likelihood (ML), a Bayesian posterior estimated from dropout gal2015dropout ; gal2015bayesian (BD), our approximation for the standard variational Bayesian neural networks using one sample per mini-batch (OSBA), and, for MNIST, we also evaluate the standard variational approximation blundell2015weight (SV).
The features for anomaly detection are uncertainty measures extracted from probabilistic predictions. For simplicity, the detector is a linear logistic classifier, with regularization parameter set by stratified cross-validation refaeilzadeh2009cross . For ML, only the vector of predicted probabilities is available, and thus we employ as feature the entropy (the most theoretically sound measure of uncertainty) over that vector. All Bayesian methods provide extra information; we use as feature vector the average and standard deviation of the entropy of the decision vector over 100 network prediction samples (estimating the expectation and variance of the entropy), the entropy of the average decision vector over those same samples (entropy of estimated expected predictions), and the average (over classes) of the standard deviations (over samples) of the predictions for each class.
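The four features listed above for the Bayesian methods can be computed from a stack of sampled predictions as in the sketch below. The array layout (samples, examples, classes), the function names, and the clipping constant are assumptions made for illustration; the repository may organize this differently.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy (in nats) of probability vectors along `axis`;
    # clipping avoids log(0) for one-hot predictions.
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)


def uncertainty_features(probs):
    # probs: array of shape (n_samples, n_examples, n_classes) holding the
    # predicted probabilities of each stochastic forward pass.
    per_sample_entropy = entropy(probs)                 # (n_samples, n_examples)
    mean_entropy = per_sample_entropy.mean(axis=0)      # average entropy
    std_entropy = per_sample_entropy.std(axis=0)        # std of the entropy
    entropy_of_mean = entropy(probs.mean(axis=0))       # entropy of mean prediction
    mean_class_std = probs.std(axis=0).mean(axis=-1)    # mean over classes of per-class std
    return np.stack(
        [mean_entropy, std_entropy, entropy_of_mean, mean_class_std], axis=1
    )


# Toy input: 100 identical, uniform predictions for 2 examples and 4 classes.
uniform_probs = np.full((100, 2, 4), 0.25)
features = uncertainty_features(uniform_probs)

# For Maximum Likelihood there is a single prediction, so only the entropy is used.
ml_feature = entropy(np.array([1.0, 0.0, 0.0, 0.0]))
```

Each row of `features` is the feature vector that would be passed to the logistic detector for one example.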
We analyze the results using Bayesian ANOVA kruschke2014doing , with a separate mean for each protocol (Blind vs. Calibrated). That is equivalent to a two-way ANOVA without interactions, where the global mean and experimental protocol factors are fused together (for interpretability). The methods (ML, BD, OSBA) are the factors of variation. We constrain the sum of the effects to be zero, for identifiability. The response variable is the AUC of the anomaly detector. We use weakly informative priors. The following model reflects those choices:
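In symbols, a model consistent with this description can be sketched as below, where y_i is the AUC of replication i, p[i] its protocol, and m[i] its method. The Gaussian likelihood, the prior families, and the hyperparameters m_0, s_0, and s are assumptions made for illustration, since the text states only that the priors are weakly informative.

```latex
\begin{aligned}
y_i &\sim \mathcal{N}\!\left(\mu_{p[i]} + \alpha_{m[i]},\ \sigma_y^2\right), \\
\mu_p &\sim \mathcal{N}\!\left(m_0,\ s_0^2\right), \qquad p \in \{\text{Blind}, \text{Calibrated}\}, \\
\alpha_m &\sim \mathcal{N}\!\left(0,\ \sigma_\alpha^2\right), \qquad m \in \{\text{ML}, \text{BD}, \text{OSBA}\}, \qquad \sum_m \alpha_m = 0, \\
\sigma_y,\ \sigma_\alpha &\sim \text{Half-Cauchy}(0,\ s).
\end{aligned}
```

Here the protocol means take the place of a global mean, and the method effects are deviations from the mean of their protocol.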
We implement the model using Stan carpenter2016stan , and infer the posteriors of the unknown parameters using the NUTS algorithm hoffman2014no . To ensure proper convergence, we use 4 chains with 100K steps, including a 10K burn-in, and a thinning factor of 5. Following Kruschke’s suggestion kruschke2014doing , we present both the distribution of the marginal effects, and the distribution of the differences between effects. Code for models, experiments, and analyses is available at https://github.com/ramon-oliveira/deepstats.
We show the Bayesian ANOVA results in Figures 2 and 3 (for reference, we also show the raw distributions of the AUCs, as boxplots in Figures 4 and 5). Calibration with the auxiliary Unknown classes has a large effect, larger than choosing among uncertainty methods. Calibration, however, is not realistic for many applications, due to the artificial constraint of picking well-formatted Unknown classes. On the well-controlled scenario provided by MNIST, all Bayesian methods give significantly better uncertainty information than ML, and their effects do not appear significantly different from each other. On CIFAR10, however, perfect semantic separation between classes is questionable (Table 1), and the performance differences disappear: BD and OSBA each slightly outperform ML, but none of the differences appear significant.
Dataset | Protocol | ML | BD | OSBA | SV |
---|---|---|---|---|---|
MNIST | Calibrated | 0.990 (0.002) | 0.991 (0.002) | 0.991 (0.002) | 0.991 (0.002) |
MNIST | Blind | 0.992 (0.002) | 0.992 (0.002) | 0.991 (0.002) | 0.991 (0.002) |
CIFAR10 | Calibrated | 0.878 (0.036) | 0.896 (0.033) | 0.884 (0.037) | — |
CIFAR10 | Blind | 0.905 (0.029) | 0.908 (0.028) | 0.896 (0.032) | — |
Table 2 shows the accuracies of all candidate models. Note that competing candidate models have very similar accuracies: any gains in anomaly detection therefore come from enhanced probabilistic information rather than from increased accuracy.
We formalized how to ascertain uncertainty quality of neural networks by using anomaly detection. We contrasted the usual maximum likelihood networks to Bayesian alternatives. Bayesian networks outperformed the frequentist network in all cases.
We also proposed a novel way to sample from a variational approximation of a Bayesian neural network, OSBA, which is much faster than the standard sampling procedure, but still retains the same uncertainty quality. OSBA is about 10 times faster than SV; in our experiments, we observed relative training computational costs of 1 (ML) to 1 (BD) to 3 (OSBA) to 30 (SV).
We believe, thus, that techniques like BD and OSBA deserve further investigation in more contexts. Finding a general measure of uncertainty quality is, however, still a challenge. Our experiments suggest that anomaly detection only gives good uncertainty measures for well-separated classes, like MNIST’s; for uncontrolled datasets like CIFAR10 (or ImageNet), we need a measure that tolerates a degree of semantic intersection between the classes.
As future work, we intend to explore other forms of uncertainty quality evaluation, and to test OSBA in more varied settings.
We thank Brazilian agencies CAPES, CNPq and FAPESP for financial support. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Eduardo Valle is partially supported by a Google Awards LatAm 2016 grant, and by a CNPq PQ-2 grant (311486/2014-2). Ramon Oliveira is supported by a grant from Motorola Mobility Brazil.
The MNIST database of handwritten digits, 1998.