1 Introduction
Obtaining reliable uncertainty estimates from deep neural networks (NNs) is an active field of research with an abundance of proposed methods, e.g., variational inference, stochastic gradient MCMC, and ensembling. Two especially prominent methods are Monte Carlo dropout (MC-dropout; Gal and Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017), but MC-dropout appears less reliable and Deep Ensembles requires training several copies of the same model, which can be expensive. While many methods have been proposed, there is not yet a clear winner.

Recently, the simple idea of doing Bayesian linear regression (BLR) on the representation in the last hidden layer has caught some attention, e.g., for Bayesian optimization (Snoek et al., 2015), Thompson sampling in contextual bandits (Riquelme et al., 2018), and exploration in model-free reinforcement learning (RL) (Azizzadenesheli et al., 2018). While not as general as some other methods, it is a simple and fast way to obtain uncertainty estimates from NNs with a linear and fully-connected output layer. Despite its application in several recent papers, its performance has not yet been studied in the general regression context.

We begin by describing the method (which we coin Deep BLR) and introducing a novel variation which enables learning heteroscedastic noise, thus addressing a weakness of previous applications of BLR to deep representations. The method, along with ensembles of it, is then evaluated on a set of standard datasets, which is common in the literature for Bayesian NNs but has not been done for BLR. Finally, we evaluate Deep BLR as an uncertainty-aware model in a model-based RL algorithm, in which it replaces Deep Ensembles.
Related work
In prior work (Snoek et al., 2015; Riquelme et al., 2018; Azizzadenesheli et al., 2018), the predictive variance is either assumed known, or unknown but homoscedastic. We propose a simple modification which permits learning flexible and heteroscedastic aleatoric uncertainty estimates by incorporating the underlying NN's variance prediction.
2 Method
Given some data $\{x_i\}_{i=1}^N$ with corresponding target values $\{y_i\}_{i=1}^N$ (for multi-output problems, we can, and do, treat the dimensions independently), a NN with a Gaussian output layer (i.e., the NN predicts a mean $\mu(x)$ and a variance $\sigma^2(x)$ for each input $x$) is trained by maximizing the log-likelihood. The trained NN is then used in two ways in the BLR:

- Regression is done on the latent representation obtained from the last hidden layer.
- The predicted variance is used as a "known" variance in the likelihood.
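For concreteness, the per-sample training objective for such a Gaussian output layer can be sketched as below. This is our own minimal illustration, not the authors' code; the clamping constant is our choice:

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Per-sample negative log-likelihood of y under N(mu, var).

    Minimizing the mean of this quantity over a training set is
    equivalent to maximizing the log-likelihood of a NN that
    predicts a mean and a variance for each input.
    """
    var = np.maximum(var, 1e-6)  # guard against degenerate variance predictions
    return 0.5 * (np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)

# A perfect mean prediction still pays a penalty for an inflated variance:
nll_tight = gaussian_nll(y=1.0, mu=1.0, var=0.1)
nll_loose = gaussian_nll(y=1.0, mu=1.0, var=10.0)
```

Note that the loss trades off the log-variance penalty against the variance-weighted squared error, which is what allows the NN to learn input-dependent (heteroscedastic) noise.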
A helpful distinction is between aleatoric and epistemic uncertainty, where aleatoric uncertainty is irreducible uncertainty, e.g., due to noise, and epistemic uncertainty is uncertainty that is due to our lack of knowledge about the world. In this sense, the Gaussian output estimates the aleatoric uncertainty whereas the BLR keeps track of epistemic uncertainty, i.e., our uncertainty in the output-layer weights.
Concretely, we choose a normal prior $w \sim \mathcal{N}(0, \tau^2 I)$ over the output weights and a normal likelihood $y \mid x, w \sim \mathcal{N}(w^\top \phi(x), \sigma^2(x))$, where $\phi(x)$ is the last-hidden-layer representation, $\sigma^2(x)$ is the variance predicted by the NN, and the hyperparameter $\tau^2$ controls the prior uncertainty. We assume that both inputs and targets are centered for simplicity. Then, by Bayes' rule (Murphy, 2012), the posterior is $w \mid \mathcal{D} \sim \mathcal{N}(\mu_N, \Sigma_N)$ with parameters
$$\Sigma_N = \left(\tau^{-2} I + \Phi^\top \Lambda^{-1} \Phi\right)^{-1}, \qquad \mu_N = \Sigma_N \Phi^\top \Lambda^{-1} y,$$
where $\Phi$ is the design matrix with rows $\phi(x_i)^\top$ and $\Lambda = \operatorname{diag}(\sigma^2(x_1), \ldots, \sigma^2(x_N))$. Furthermore, the posterior predictive distribution at a test point $x_*$, with latent representation $\phi_* = \phi(x_*)$, is given by
$$y_* \mid x_*, \mathcal{D} \sim \mathcal{N}\left(\mu_N^\top \phi_*, \; \sigma^2(x_*) + \phi_*^\top \Sigma_N \phi_*\right).$$
Here, $\sigma^2(x_*)$ and $\phi_*^\top \Sigma_N \phi_*$ represent the heteroscedastic aleatoric uncertainty and the epistemic uncertainty, respectively. Figure 1 contains an example of Deep BLR applied to a toy 1D regression problem. We see that the predictive distribution looks sensible and that the necessary nonlinear prediction is captured well by the linear regression on the deep representations.
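The closed-form posterior and predictive updates above can be sketched in a few lines of numpy. This is our own illustration under the paper's setup (heteroscedastic "known" noise from the NN, fixed last-layer representations); the function names are ours:

```python
import numpy as np

def blr_posterior(Phi, y, noise_var, tau2):
    """Closed-form BLR posterior over the output weights.

    Phi:       (N, d) latent representations from the last hidden layer
    y:         (N,)   centered targets
    noise_var: (N,)   per-sample variances predicted by the NN ("known" noise)
    tau2:      scalar prior variance of the weights
    """
    d = Phi.shape[1]
    lam_inv = 1.0 / noise_var  # diagonal of the inverse noise covariance
    Sigma_N = np.linalg.inv(np.eye(d) / tau2 + Phi.T @ (lam_inv[:, None] * Phi))
    mu_N = Sigma_N @ (Phi.T @ (lam_inv * y))
    return mu_N, Sigma_N

def blr_predict(phi_star, noise_var_star, mu_N, Sigma_N):
    """Posterior predictive mean and variance at a test representation."""
    mean = phi_star @ mu_N
    # aleatoric (NN-predicted) + epistemic (weight-uncertainty) terms
    var = noise_var_star + phi_star @ Sigma_N @ phi_star
    return mean, var
```

With a very broad prior the posterior mean approaches the weighted least-squares solution, while a small prior variance shrinks the weights toward zero, which matters for the regularization effect discussed in Section 3.2.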
A straightforward extension of Deep BLR is to train an ensemble of NNs with different random initializations and apply BLR to each resulting representation. The predictive distribution will then be a mixture of Gaussians.
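For a uniform mixture of per-member Gaussian predictives, the mixture moments and exact NLL follow from standard identities. A small sketch of our own (not the authors' code):

```python
import numpy as np

def mixture_predictive(means, variances):
    """Mean and variance of a uniform mixture of M Gaussians.

    means, variances: (M,) per-member predictive moments at one input.
    """
    mu = np.mean(means)
    # law of total variance: average member variance + spread of member means
    var = np.mean(variances) + np.mean((means - mu) ** 2)
    return mu, var

def mixture_nll(y, means, variances):
    """Exact NLL of y under the uniform Gaussian mixture (log-sum-exp trick)."""
    log_comps = -0.5 * (np.log(2 * np.pi * variances)
                        + (y - means) ** 2 / variances)
    m = np.max(log_comps)
    return -(m + np.log(np.mean(np.exp(log_comps - m))))
```

The spread-of-means term is what lets disagreement between ensemble members show up as additional (epistemic) uncertainty.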
3 Experiments
Deep BLR is compared to two prominent alternative methods for uncertainty estimation: MC-dropout (Gal and Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017). Additionally, we compare to Deep BLR ensembles. We conduct two experiments: regression on standard datasets and application in uncertainty-aware model-based reinforcement learning.
3.1 Predictive performance on standard datasets
We evaluate the predictive performance of Deep BLR and Deep BLR ensembles on a set of standard datasets (Dua and Graff, 2017) with the same experimental setup as in prior work on uncertainty-aware NNs (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). See Appendix A for full details. The prior variance is chosen with a grid search to minimize the negative log-likelihood (NLL) on a validation set, which is cheap since the NN can be reused.
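Because the representations and predicted variances are fixed once the NN is trained, each candidate prior variance only requires refitting the closed-form BLR posterior, which is why the grid search is cheap. A rough sketch of our own (hypothetical names, not the authors' code):

```python
import numpy as np

def val_nll(Phi_tr, y_tr, nv_tr, Phi_val, y_val, nv_val, tau2):
    """Validation NLL of a BLR fit with prior variance tau2.

    Phi_* are fixed last-layer representations, nv_* the NN's
    predicted noise variances; only the BLR posterior is refit.
    """
    d = Phi_tr.shape[1]
    lam_inv = 1.0 / nv_tr
    Sigma_N = np.linalg.inv(np.eye(d) / tau2 + Phi_tr.T @ (lam_inv[:, None] * Phi_tr))
    mu_N = Sigma_N @ (Phi_tr.T @ (lam_inv * y_tr))
    mean = Phi_val @ mu_N
    var = nv_val + np.einsum("nd,de,ne->n", Phi_val, Sigma_N, Phi_val)
    return np.mean(0.5 * (np.log(2 * np.pi * var) + (y_val - mean) ** 2 / var))

def grid_search_tau2(Phi_tr, y_tr, nv_tr, Phi_val, y_val, nv_val,
                     grid=(1e-2, 1e-1, 1.0, 1e1, 1e2)):
    """Pick the prior variance minimizing validation NLL; the NN is reused."""
    nlls = [val_nll(Phi_tr, y_tr, nv_tr, Phi_val, y_val, nv_val, t) for t in grid]
    return grid[int(np.argmin(nlls))]
```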
The resulting performance in terms of NLL, which acts as a proxy measure for the capability to estimate uncertainty, can be seen in Table 1. We see that Deep BLR performs competitively, and Deep BLR ensemble achieves the best performance on all datasets except the last, on which only a single split was done.
Predictive NLL

Dataset | MC-dropout | Deep Ensembles | Deep BLR | Deep BLR ensemble
Boston Housing | | | |
Concrete Strength | | | |
Energy Efficiency | | | |
Kin8nm | | | |
Naval Propulsion | | | |
Power Plant | | | |
Protein Structure | | | |
Wine Quality | | | |
Yacht Hydrodynamics | | | |
Year Prediction MSD | | | |
Table 1: Comparison of MC-dropout, Deep Ensembles, Deep BLR, and Deep BLR ensemble on a set of standard datasets. We report the mean and standard error of predictive NLL across 20 random training/test splits (5 for Protein Structure, 1 for Year Prediction MSD). The best method for each dataset is marked in bold.
3.2 Uncertainty-aware model-based reinforcement learning
PETS (Chua et al., 2018) is a model-based reinforcement learning algorithm which utilizes an uncertainty-aware dynamics model in the form of an ensemble of NNs. We study the downstream performance of PETS when the ensemble is replaced with Deep BLR, as well as with a Deep BLR ensemble. For simplicity, we use hyperparameters identical to Chua et al. (2018), i.e., the NNs are not tuned for Deep BLR. The prior variance is set to a small value, as we found that larger values led to unstable learning; the regularizing effect of the prior thus appears important.
Figure 2 shows the results in the CartPole and 7-DOF Reacher environments for four different types of models. On CartPole, the poor performance of a single NN indicates that utilizing epistemic uncertainty is important for efficient learning in that environment. We see that Deep BLR is competitive with, or slightly better than, ensembles of NNs, and that Deep BLR ensemble is the clear winner. On 7-DOF Reacher, all methods perform very similarly, indicating that accurate uncertainty estimation does not aid PETS in learning a good policy in this environment. Thus, even if Deep BLR were significantly better than Deep Ensembles, we could not distinguish them in this environment.
Figure 2: Results on the CartPole and 7-DOF Reacher environments in the MuJoCo simulator. All NNs parameterize a Gaussian distribution. The experiment is repeated 10 times and the mean return per episode is reported, with the shaded area representing one standard error.
4 Conclusions and future work
Bayesian linear regression on deep representations is a simple, flexible method for obtaining uncertaintyaware NNs for regression. Our experiments indicate that Deep BLR is competitive with (and as an ensemble can outperform) the commonly used ensemble methods, which is consistent with prior work (Riquelme et al., 2018). Note that BLR was used on top of the existing architectures with no tuning, and further tuning may be beneficial.
We believe there is much potential for future work in this direction. In particular, two promising directions are investigating the connection between the NN architecture and the reliability of BLR uncertainty estimates, and extending the idea to classification. In the case of, e.g., Bayesian logistic regression, there is no analytic posterior, but perhaps approximations suffice. Further studying the downstream performance of Deep BLR would also be interesting, e.g., in uncertainty-aware model-based RL.
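As one possible instantiation of "approximations suffice", a Laplace approximation to logistic regression on fixed representations could look like the following sketch. This is entirely our own illustration of a standard technique, not part of the paper; `laplace_logistic` is a hypothetical name, and plain gradient descent is used for the MAP fit for simplicity:

```python
import numpy as np

def laplace_logistic(Phi, y, tau2, iters=200, lr=0.1):
    """Laplace approximation for logistic regression on representations.

    Phi: (N, d) fixed last-layer representations; y: labels in {0, 1};
    tau2: prior variance of the weights. Returns the MAP weights and
    the approximate posterior covariance (inverse Hessian of the
    negative log posterior at the MAP).
    """
    N, d = Phi.shape
    w = np.zeros(d)
    for _ in range(iters):  # gradient descent to the MAP estimate
        p = 1.0 / (1.0 + np.exp(-Phi @ w))
        grad = Phi.T @ (p - y) + w / tau2
        w -= lr * grad / N
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    s = p * (1.0 - p)  # Bernoulli variances at the MAP
    H = Phi.T @ (s[:, None] * Phi) + np.eye(d) / tau2
    return w, np.linalg.inv(H)
```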
Acknowledgments
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
References
K. Azizzadenesheli, E. Brunskill, and A. Anandkumar (2018). Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA).

K. Chua, R. Calandra, R. McAllister, and S. Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In 32nd Conference on Neural Information Processing Systems.

D. Dua and C. Graff (2017). UCI Machine Learning Repository.

Y. Gal and Z. Ghahramani (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning.

D. P. Kingma and J. Ba (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In 31st Conference on Neural Information Processing Systems.

K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.

C. Riquelme, G. Tucker, and J. Snoek (2018). Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In 6th International Conference on Learning Representations.

J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams (2015). Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning.

E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.
Appendix A Experimental details
We replicate the experimental setup of prior work (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). All methods use a NN with a single hidden layer of 50 ReLU units, except for the larger Protein Structure and Year Prediction MSD datasets, where 100 ReLU units are used. Each NN parameterizes a normal distribution and is trained to minimize the negative log-likelihood using Adam (Kingma and Ba, 2015) for 40 epochs with batch size 32 and learning rate 0.01 (except 0.001 and 0.0001 for Protein Structure and Year Prediction MSD, respectively). All ensembles consist of 5 identical NNs with different random initializations. We evaluate the methods on 20 random 90/10 training/test splits, except for the larger Protein Structure and Year Prediction MSD datasets, where only 5 and 1 splits are done, respectively. The inputs and outputs are always normalized to have zero mean and unit variance.
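The split-and-normalization procedure described above can be sketched as follows. This is our own illustration of a standard setup; the function name and the small epsilon guard are our choices, and normalization statistics are computed on the training portion only:

```python
import numpy as np

def split_and_standardize(X, Y, test_frac=0.1, seed=0):
    """Random train/test split with per-dimension standardization.

    Means and standard deviations are fitted on the training split and
    applied to both splits, giving zero-mean, unit-variance data.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    mu_x, sd_x = X[train].mean(axis=0), X[train].std(axis=0) + 1e-8
    mu_y, sd_y = Y[train].mean(axis=0), Y[train].std(axis=0) + 1e-8
    norm = lambda A, m, s: (A - m) / s
    return (norm(X[train], mu_x, sd_x), norm(Y[train], mu_y, sd_y),
            norm(X[test], mu_x, sd_x), norm(Y[test], mu_y, sd_y))
```

Fitting the statistics on the training split only avoids leaking test-set information into the model.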