1 Introduction
The wide adoption of Convolutional Neural Networks (cnns) in increasingly popular pieces of technology such as self driving cars and medical imaging, where decisionmaking under uncertainty is fundamental, has brought attention to the ability of these learning architectures to accurately quantify the uncertainty in their predictions Kendall17 ; Gal16b . In short, the reliability of predictive probabilities of learning algorithms can be evaluated through the analysis of their calibration Flach16
. In particular, a classifier is well calibrated when its output offers an accurate account of the probability of a given class, i.e. when it predicts a given class label with probability
that matches the true proportion of test points belonging to that class.The calibration properties of standard classifiers and neural networks have been studied in the literature Kull17 ; NiculescuMizil05 , which has shown that classifiers that use the standard crossentropy loss are generally well calibrated. Perhaps surprisingly, modern cnns, which are a particular case of deep neural networks (dnns), have been found to be miscalibrated, and the depth of convolutional filters is the main factor affecting calibration Guo17 . The work in Guo17 shows that regularization, implemented through weight decay, improves calibration and that, ultimately, simple methods such as postcalibration Platt99b can be an effective remedy for most cnns calibration issues.
Alternatively, Bayesian cnns Gal16b
where convolutional filters are inferred using Bayesian inference techniques, seem like perfect candidates to model uncertainty in these architectures in a principled way. However, while Bayesian
cnns have been shown to be effective in obtaining stateoftheart performance in image classification tasks, we are not aware of studies that show their calibration properties. Hence, our first contribution is to investigate the calibration properties of Bayesian cnns.Along a similar vein, independently of the works on Bayesian cnns, there have been other attempts to give a probabilistic flavor to cnns by combining them with Gaussian processes (gps, Rasmussen06 ). Most of these approaches can be seen as a way to parameterize a cnnbased covariance for gps, and the aim is to learn endtoend both the filters and the gps (see, e.g., Bradshaw17 ; Wilson16 ). A crucial aspect that the literature has overlooked, however, is that methods that combine cnns and gps suffer from the same issues of miscalibration that characterize modern cnns. Therefore, the second contribution of this paper is to show that current combinations of cnns and gps are miscalibrated.
Consequently, as our third contribution, we propose a novel combination of cnns and gps that is indeed wellcalibrated, while being simple to implement. In particular, we propose to replace the fully connected layers of cnns with gps that we approximate with random features Cutajar17 ; Gredilla10 . Due to this approximation, the resulting model becomes a Bayesian cnn with a nonlinear transformation applied to the convolutional features. Building on the connection between variational inference and dropout, we apply Monte Carlo dropout (mcd, Gal16 ) to carry out joint inference over the filters and the approximate gps, thus obtaining an endtoend learning method for the proposed model, which we call cnn+gp(rf). The resulting approach is characterized by a number of attractive features: (i) it is well calibrated, given that it uses the multinomial likelihood and the filters are regularized using Bayesian inference techniques; (ii) it is as scalable as stateoftheart cnns, in so much as it can be trained using minibatch updates and can exploit GPU and distributed computing; (iii) unlike other works that combine cnns and gps, it is as easy to implement as standard cnns, as it leverages the equivalence of gps approximated with random features and Bayesian dnns Cutajar17 ; Gal15 ; Neal96 , and the connections between dropout and variational inference Gal16 . We extensively validate these properties in a variety of image classification tasks.
Our final contribution extends the above framework by replacing the last layer of cnns with Deep gps Cutajar17 and use structured random features to obtain faster and more compact gp approximations Le13 ; Yu16 . In all, our proposal considerably improves on classification accuracy compared to previous combinations of cnns and gps (e.g., on cifar10 and on cifar100, all without data augmentation), while being competitive with stateoftheart cnns; we are not aware of other gp works that approach these results. Crucially, we achieve these performance without compromising on calibration, again considerably improving on previous approaches that combine cnns and gps.
2 Related Work
Calibration of Convolutional Networks:
The issue of calibration of classifiers in machine learning was popularized in the 90’s with the use of support vector machines for probabilistic classification
Platt99b . Calibration techniques aim to learn a transformation of the output using a validation set in order for the transformed output to give a reliable account of the actual probability of class labels Flach16 ; interestingly, calibration can be applied regardless of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques include Platt scaling Platt99b and isotonic regression Zadrozny02 .Classifiers based on Deep Neural Networks (dnns) have been shown to be wellcalibrated NiculescuMizil05 . The reason is that the optimization of the crossentropy loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the correct multinomial likelihood for class labels. Recent sudies on the calibration of cnns, which are a particular case of dnns, however, show that depth has a negative impact on calibration, despite the use of a crossentropy loss, and that regularization improves the calibration properties of classifiers Guo17 .
Combinations of Conv Nets and Gaussian Processes:
Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian cnns can “cure” the miscalibration of modern cnns. Despite the abundant literature on Bayesian dnns Neal96 ; Mackay94 , far less attention has been devoted to Bayesian cnns Gal16 , and the calibration properties of these approaches have not been investigated.
Several approaches have proposed the combination of cnns and gps as a means to give a probabilistic character to cnns. Most of these works are based on ideas developed in the context of manifold gps Calandra16 , where inputs are transformed using some parametric transformation of the input. In these works, the parametric transformation is based on convolutional layers, and scalability to large data is achieved through the use of ideas drawn from the literature on scalable gps, for example the Stochastic Variational Deep Kernel Learning (svdkl) approach in Wilson16 . In contrast, the work on hybrid gps and dnns (gpdnn, Bradshaw17 ) combines cnns and gps using an inducing point approximation. Other recent approaches that aim to introduce convolutions in the calculation of the covariance between images include VanDerWilk17 , which proposes a way to construct covariances between domains/patches, mimicking the computations in cnns.
In this work, we propose an alternative way to combine cnns and gps, where gps are approximated using random features expansions Rahimi08 ; Gredilla10 . The random feature expansion approximation amounts in replacing the orginal kernel matrix with a lowrank approximation, turning gps into Bayesian linear models. Combining this with cnns leads to a particular form of Bayesian cnns, much like gps and dgps are particular forms of Bayesian dnns Duvenaud14 ; Gal16 ; Neal96 . Inference in Bayesian cnns is intractable and requires some form of approximation. In this work, we draw on the interpretation of dropout as variational inference, employing the socalled Monte Carlo Dropout (mcd, Gal16 ) to obtain a practical way of combining cnns and gps.
3 Combinations of cnns and gps are miscalibrated
Consider a class image classification task where denotes a set of images , and
is the matrix consisting of the corresponding onehot encoded labels
stacked by row. We can use various metrics to determine the quality of a classifier, and here we focus in particular on calibration. Let be the output of a classifier for an input image . To compute the calibration properties of a classifier, consider a partitioning of the test set into disjoint sets , such that each subset contains the inputs yielding predictions in the range . Hence, the confidence associated with each subset is characterized by the midpoint of its corresponding range, i.e. . Then, the accuracy for each subset can be evaluated as follows:(1) 
where is equal to one if , and zero otherwise.
In what follows, we use reliability diagrams to assess calibration, where we plot accuracy as a function of confidence for the subsets . For a perfectly calibrated classifier, we expect for all , with deviations implying that the class probabilities are either underestimated or overestimated. A useful summary statistics that can be extracted from reliability diagrams is the Expected Calibration Error (ece), which is the average of the absolute difference between accuracy and confidence weighted according to its size:
(2) 
Another metric that measures the accuracy in predicting class probabilities is the brier score, defined as the squared distance between labels and outputs averaged across classes and test points:
(3) 
In figure 5, we report the reliability diagrams of three stateoftheart combinations of cnns and gps applied to the cifar100 data set, along with the corresponding values of ece and brier scores. The left and right panels of the figure show the gpdnn approach in Bradshaw17 and svdkl in Wilson16 , respectively, where we use a resnet architecture for the convolutional layers. The central panel of the figure reports the reliability diagram for the cgp in VanDerWilk17 where there is no equivalent of a cnn architecture.
The results indicate that current approaches that combine cnns and gps are miscalibrated, with a tendence of being either underconfident (gpdnn) or overconfident (cgp and svdkl) in predictions. This is an important and perhaps surprising finding, because one of the motivations to combine cnns with gps is to do better quantification of uncertainty compared to plain cnns. In the experiments section we report more extensively on the calibration of these classifiers, as well as illustrating other performance metrics. These considerations call for the study of better ways to combine cnns and gps to recover calibration while attempting to improve on standard metrics such as error rate and test loglikelihood. The next section illustrates our proposal that achieves this goal.
4 cnn+gp(rf): Conv Nets with Random Feature Expanded gps
In the proposed model, the labels are assumed to be conditionally independent given a set of corresponding latent variables , i.e. we consider the likelihood where the latent variables are realizations of a set of functions at the input images , i.e., for . In this work we focus on functions that are modeled using gps; note that extension to dgps is actually easy to consider in our framework, and we will report results on this choice in the experiments. Each individual is multinomial with probabilities obtained using a softmax transformation of the latent variables.
Due to the gp modeling assumption, the latent function values are jointly Gaussian with , where is the covariance matrix. The entries of the covariance matrix , are specified by a covariance (kernel) function
(with hyperparameters
) and, this form is shared across output dimensions, although this can be relaxed and allow for a different for the outputs.Instead of applying the gp modeling directly to the images, we propose to employ a transformation using convolutional layers, where denotes the parameters of these convolutional layers. The vectorvalued function is differentiable as it implements a series of differentiable operations, such as convolutions and pooling. This is one of the key successes of cnn models that allows for the learning of their filters, which we exploit for the endtoend learning of our model.
Inference in this model requires being able to characterize the posterior over all or a selected group of model parameters , but this posterior is analytically intractable and thus computationally prohibitive Rasmussen06 . In the remainder of this paper, we will build on previous work on scalable inference for gps and dgps with random features Cutajar17 to obtain an approximation to the proposed model that can be learned endtoend.
4.1 Random Feature Expansions for Gaussian Processes
Naïve inference in gp models requires algebraic operations with that would cost in time. Popular approaches to recover tractability use lowrank approximations of the kernel matrix. Among this family of lowrank approximations, we choose to work with random feature approximations Gredilla10 ; Cutajar17 . The reason is that they offer a number of possible extensions to speedup computations (e.g., using structured approximations Le13 ; Yu16 ) and increase the complexity of the model (e.g., considering Deep gps Cutajar17 ); we will elaborate on this in the experiments section. In random feature expansions, the kernel matrix is replaced by a lowrank approximation , with and . This approximation suggests the construction of a Bayesian linear model to approximate the gp latent variables as . Using it is straightforward to show that the covariance of each of the latent functions is indeed the approximate , as .
In this work, we focus in particular on the orderone arccosine kernel Cho09
(4) 
where and is the angle between and .
The arccosine
covariance has a convenient integral representation that allows for a Monte Carlo approximation, obtaining a lowrank approximation to the covariance matrix involving Rectified Linear Unit (
relu) activations Cho09(5) 
In this expression, we have defined as the matrix resulting from the application of convolutional layers to the image training set and is obtained by stacking samples from
by column. Note that in the case of a popular Radial Basis Function (
rbf) covariance, it is possible to obtain a similar random feature approximation, where the relu activation is replaced by trigonometric functions; see Rahimi08 and the supplement for details.4.2 Endtoend learning of the proposed cnn+gp(rf) model
Inference in the proposed model is intractable due to the likelihood that is not conjugate to the gp prior. Further complications stem from the need to infer kernel parameters, which include convolutional parameters, and the need to be able to scale to large data. Our aim is to carry out inference within a consistent framework that is characterized by simplicity, as described in what follows.
We start by introducing an approximate posterior over and , that we denote as . Following standard variational inference arguments, we can define an operative way to obtain these approximate posteriors. The logmarginal likelihood can be bounded by the sum of an expected loglikelihood term and a negative KullbackLeibler (KL) divergence term:
(6) 
Variational inference amounts to optimizing the lower bound above with respect to and any other parameters of interest. Here we assume fixed from the prior, while we wish to optimize its prior parameters , but note that could also be inferred variationally (see supplement and Gal15 ; Cutajar17 ).
We have now a number of options on the form for the approximate posteriors . In previous works on variational inference for dnns, it has been proposed to define the approximating distributions to be Gaussian and factorized across parameters Kingma14 ; Graves11 . The drawback of this is that it doubles the number of parameters. Alternatively, we can rely on the connections between dropout and variational inference Gal16 ; Gal16b to obtain an easier approximate inference scheme, which is also known as Monte Carlo Dropout (mcd). Focusing on the weights for now, the connection with dropout is apparent if we rewrite
(7) 
The reparameterization introduces variational parameters (one for each weight in
) and a vector of binary variables that can switch on or off the columns of the weight matrix with probability
. A similar reprameterization can be done for the convolutional parameters , introducing and . The optimization of the lower bound wrt all variational parameters requires being able to evaluate the expectation and the KL term.The KL term can be approximated following Gal16 , obtaining a regularization term involving the squarednorm of the parameters
(8) 
The expectation, instead, can be unbiasedly estimated using Monte Carlo and also considering a minibatch of size
:(9) 
with , and is a set of indices to select a minibatch of training points Kingma14 ; Graves11 . This doublystochastic approximation is differentiable wrt variational parameters when the Bernoulli variables are fixed.
The approximate objective can now be optimized in the same vein as in standard backpropagation with dropout, noting that dropout is applied to as well as convolutional parameters . What changes, however, is the interpretation of the procedure as stochastic variational inference, whereby the Bernoulli variables are resampled at each iteration. A practical implication is in the way we compute the predictive distribution, which has a probabilistic flavor as follows:
(10) 
and can be approximated using Monte Carlo by resampling the Bernoulli variables. While mcd has been proposed for cnns in Gal16b , in this work we extend it to the case of joint inference over convolutional parameters and the gp approximation in the cnn+gp(rf) model, thus obtaining a practical inference and prediction scheme combining cnns and gps.
4.3 Extensions
Structured random feature approximations:
One of the advantages of the proposed model, compared to other gp approximations, is that it can exploit structured random feature expansions to accelerate computations and reduce the size of the approximate gp Le13 ; Yu16 . In the random features approximation, random features are constructed by multiplying with the convolutional features. Without loss of generality, assuming that and , the cost of computing products is , while storing requires storage.
Structured approximations aim to reduce the time complexity to and the storage cost to . Taking a standard random features expansion of the isotropic covariance in equation (5) with as an example, , with . One way to make computations cheaper is to replace the Gaussian matrix with a pseudorandom alternative. The Structured Orthogonal Random Feature (sorf) approximation Yu16 approximates through a series of Hadamard transformations of diagonal matrices with elements randomly sampled from , that is . We refer to this variation of the model as cnn+gp(sorf).
Convolutional Networks with RandomFeatureExpanded Deep gps:
A dgp model represents a deep probabilistic nonparametric approach where the output of one gp at each layer is used as the input to the gp in the next layer Damianou13 . Extending the random feature approximation to dgps and the inference scheme presented here is straightforward; see Cutajar17 for details. The random feature approximation turns the dgp into a Bayesian dnn for which we can apply stochastic variational inference to infer model parameters. In the experiments section, we explore the possibility to stack a dgp on top of convolutional layers, and we show the impact of depth on performance.
5 Experiments
Depth  Data set  cnn architecture  cnn name  # Conv features 

Shallow  mnist  2 Conv Layers + 2 Fully connected  LeNet  4096 
Shallow  cifar10  2 Conv Layers + 3 Fully connected  LeNet  4096 
Deep  cifar10  30 Conv Layers + 1 Fully connected  resnet  64 
Deep  cifar100  150 Conv Layers + 1 Fully connected  resnet  64 
We carry out the experimental evaluation using popular benchmark datasets, such as mnist, cifar10 and cifar100 and with a number of popular cnn architectures based on LeNet and resnet (see table 1). We report three stateoftheart competitors combining cnns and gps, namely gpdnn Bradshaw17 , svdkl Wilson16 , and cgp VanDerWilk17 . We also report Bayesian cnns, as suggested in Gal16b and cnns with postcalibration as proposed in Guo17 , which we refer to as cnn+mcd and cnn+cal, respectively. For all the competing methods we used available implementations, adding the same cnn architecture to ensure a fair comparison. In all experiments, we use a batchsize and the Adam optimizer with default learning rate Kingma14b . In the methods that use mcd, we use a dropout rate of for all parameters.
The results are reported in figure 2, where we show results for different training sizes , keeping the classes balanced. In the figure, we report the calibration measures that we have introduced earlier, namely ece and brier scores, and we also report the classification error rate (err) and the mean negative test loglikelihood (mnll). Compared to other combinations of cnns and gps, our proposal improves considerably on all metrics. It is interesting to see that our proposal is competitive with Bayesian cnns employing mcd, with only a marginal improvement on err and mnll in some configurations. Compared to plain cnns with postcalibration, our proposal is considerably better, although in some configurations the former is superior in ece and brier; this is expected given that this is the metric that is optimized on a validation set.
The two variants of our approach, namely cnn+gp(rf) where we learn the frequencies and cnn+gp(sorf) where we sample from its prior, are comparable. This suggests that the extra level of complexity of learning the spectral frequencies does not lead to substantial gains in performance and that the structured random feature approximation yields satisfactory performance. We also note that these results have been obtained by fixing the covariance parameters of the gp, as we found it to be unstable when learning these jointly with . This might be the reason why these parameters were learned through crossvalidation in Gal17 . In the supplement, we report the results obtained when learning , which we found yielding similar performance as fixing them. All these observations corroborate the hypothesis that most of the performance of cnnbased classification models is due to the convolutional layers.
In summary, figure 2 shows that our cnn+gp(rf) is the best strategy for calibrating these models compared to other approaches using gps. Furthermore, we found perhaps surprisingly that mcd has similarly performance. In the supplementary material, we report results on gpdnn where we infer convolutional parameters using mcd, so as to gain insights as to whether most of the improvements in performance are due to this form of regularization. The results support the intuition that inferring these parameters yields improvements in calibration, but also that our cnn+gp(rf) still offers better performance.
Experiments combining cnns and Deep gps:
In figure 3, we report results varying the depth of a dgp on top of the convolutional layers; again, we learn the convolutional filters and the dgp endtoend as discussed in the previous sections. We show results when applying our model to the whole cifar10 data set in the case of the shallow convolutional structure (table 1). We feedforward the convolutional features to all layers of the dgp, in line with what suggested in the literature of dgps to avoid pathologies in the functions that can be modeled Cutajar17 ; Duvenaud14 ; Neal96 . The results indicate that increasing the complexity of the model improves on all performance metrics, and worsen calibration, which however is still around ece. This is in line with the intuition that increasing model complexity negatively impacts calibration.
Knowing when the model doesn’t know:
We report experiments showing the ability of our model to know when it does not know, following a similar experimental setup as in Lakshminarayanan17 . In this experiment we train our cnn+gp(rf) model on mnist and test on the notmnist dataset, which contains images of letters from “A” to “J” in various typefaces. For this experiment, while we do not know the exact value that we should obtain for predictive probabilities, we expect to observe low entropy in the predictions when tesing on mnist and high entropy when predicting on notmnist, indicating high uncertainty. The results are reported in figure 4, where we show the density plot of the entropy of predictive probabilities for two depths of the convolutional structure. In the figure, we compare our cnn+gp(rf) against one of the methods combining cnns and gps, that is gpdnn. In the figure, we also include results on cnns with postcalibration and Bayesian cnns inferred with mcd. Our approach is competitive with Bayesian cnns and it is considerably superior to postcalibration. This is especially true in the case of a deeper convolutional structure, where postcalibration still yields a large number of predictions with low uncertainty. Interestingly, gpdnn assigns large uncertainty to predictions on notmnist, although with the deeper convolutional architecture it yields a large fraction of predictions with low entropy.
6 Conclusions
Despite the considerable interest in combining cnns with gps, little attention has been devoted to understand the implications in terms of the ability of these models to accurately quantify the level of uncertainty in predictions. This is the first work that highlights the issues of calibration of these models, showing that gps cannot cure the issues of miscalibration in cnns. We have proposed a novel combination of cnns and gps where the resulting model becomes a particular form of a Bayesian cnn for which inference using variational inference is straightforward. However, our results also indicate that combining cnns and gps does not significantly improve the performance of standard cnns. This can serve as a motivation for investigating new approximation methods for scalable inference in gp models and combinations with cnns.
Acknowledgments
JPC acknowledges support from the Simons Foundation and the McKnight Foundation. MF gratefully acknowledges support from the AXA Research Fund.
References
 (1) J. Bradshaw, Alexander, and Z. Ghahramani. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks, July 2017. arXiv:1707.02476.
 (2) R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for regression. In 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 2429, 2016, pages 3338–3345, 2016.

(3)
Y. Cho and L. K. Saul.
Kernel Methods for Deep Learning.
In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc., 2009.  (4) K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 884–893, International Convention Centre, Sydney, Australia, Aug. 2017. PMLR.

(5)
A. C. Damianou and N. D. Lawrence.
Deep Gaussian Processes.
In
Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29  May 1, 2013
, volume 31 of JMLR Proceedings, pages 207–215. JMLR.org, 2013.  (6) D. K. Duvenaud, O. Rippel, R. P. Adams, and Z. Ghahramani. Avoiding pathologies in very deep networks. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 2225, 2014, volume 33 of JMLR Workshop and Conference Proceedings, pages 202–210. JMLR.org, 2014.
 (7) P. A. Flach. Classifier Calibration. In C. Sammut and G. I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 1–8. Springer US, Boston, MA, 2016.
 (8) Y. Gal and Z. Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Jan. 2016. arXiv:1506.02158.
 (9) Y. Gal and Z. Ghahramani. Dropout As a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 1050–1059. JMLR.org, 2016.
 (10) Y. Gal, J. Hron, and A. Kendall. Concrete Dropout. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3581–3590. Curran Associates, Inc., 2017.
 (11) Y. Gal and R. Turner. Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 611 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 655–664. JMLR.org, 2015.
 (12) A. Graves. Practical Variational Inference for Neural Networks. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011.
 (13) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia, Aug. 2017. PMLR.

(14)
A. Kendall and Y. Gal.
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.  (15) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization, Jan. 2017. arXiv:1412.6980.
 (16) D. P. Kingma and M. Welling. AutoEncoding Variational Bayes. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), Apr. 2014.
 (17) M. Kull, T. S. Filho, and P. Flach. Beta calibration: a wellfounded and easily implemented improvement on logistic calibration for binary classifiers. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.
 (18) B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
 (19) M. LázaroGredilla, J. QuinoneroCandela, C. E. Rasmussen, and A. R. FigueirasVidal. Sparse Spectrum Gaussian Process Regression. Journal of Machine Learning Research, 11:1865–1881, 2010.
 (20) Q. Le, T. Sarlos, and A. Smola. Fastfood  Approximating Kernel Expansions in Loglinear Time. In 30th International Conference on Machine Learning (ICML), 2013.

(21)
D. J. C. Mackay.
Bayesian methods for backpropagation networks.
In E. Domany, J. L. van Hemmen, and K. Schulten, editors, Models of Neural Networks III, chapter 6, pages 211–254. Springer, 1994.  (22) R. M. Neal. Bayesian Learning for Neural Networks (Lecture Notes in Statistics). Springer, 1 edition, Aug. 1996.

(23)
A. NiculescuMizil and R. Caruana.
Predicting Good Probabilities with Supervised Learning.
In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 625–632, New York, NY, USA, 2005. ACM.  (24) J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
 (25) A. Rahimi and B. Recht. Random Features for LargeScale Kernel Machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
 (26) C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 (27) M. van der Wilk, C. E. Rasmussen, and J. Hensman. Convolutional Gaussian Processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2849–2858. Curran Associates, Inc., 2017.
 (28) A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing. Stochastic Variational Deep Kernel Learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2586–2594. Curran Associates, Inc., 2016.
 (29) F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. HoltmannRice, and S. Kumar. Orthogonal Random Features. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1975–1983. Curran Associates, Inc., 2016.
 (30) B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 694–699, New York, NY, USA, 2002. ACM.
Appendix A Random Feature Expansion of the rbf Covariance
We report here the expansion of the popular Radial Basis Function (rbf) covariance. Following the convolutional representation of images in our cnn+gp(rf) model, the rbf covariance is defined as:
(11) 
with .
It is possible to express this covariance function as the Fourier transform of a nonnegative measure
Rahimi08 , where are the socalled spectral frequencies. It is straightforward to verify that . Stacking Monte Carlo samples from into by column, we obtain(12) 
where denotes the matrix resulting from the application of convolutional layers to the image training set .
Appendix B Variational Objective for the Proposed cnn+gp(rf) Model
Consider the proposed cnn+gp(rf) model. In the main paper we report an expression for the lower bound on the marginal likelihood when treating and variationally, while assuming sampled from the prior (which in turn depends on covariance parameters ). Assume now that we are interested in carrying out inference over , as well as and . Extending the expression of the variational lower bound in the main paper, we can introduce an approximate posterior over , say and attempt to optimize it within the mcd framework. The lower bound to the logmarginal likelihood can be expressed as
(13) 
We can again apply mcd by introducing Bernoulli variables reparameterizing
(14) 
and similarly for and .
Again, the expectation can be unbiasedly estimated using Monte Carlo and also considering a minibatch of size :
(15) 
with , and is a set of indices to select a minibatch of training points.
The KL term can be approximated following Gal16 , noting that the fact that we are treating variationally, gives rise to extra terms that involve the gp lengthscale :
(16) 
When using this expression to optimize all variational parameters pertaining to jointly with we encountered some instabilities, and therefore we decided to report results when fixing the covariance parameters .
For the case where is not learned variationally, which is what we do when employing the sorf approximation, we can simply draw from the prior and consider the reparameterization Gredilla10 :
(17) 
where . This reparameterization allows for the update of covariance parameters fixing the randomness in the sampling from . The results comparing cnn+gp(sorf) when updating or fixing throughout optimization are reported in table 2. It is interesting to notice how fixing covariance parameters leads to comparable performance to the case where we learn them.
SHALLOW  DEEP  

mnist  cifar10  cifar10  cifar100  
Metrics  Fixed  Learned  Fixed  Learned  Fixed  Learned  Fixed  Learned 
err  0.006  0.005  0.203  0.192  0.113  0.115  0.352  0.359 
mnll  0.018  0.018  0.610  0.584  0.348  0.355  1.264  1.287 
ece  0.002  0.003  0.015  0.010  0.051  0.054  0.050  0.054 
brier  0.009  0.008  0.288  0.271  0.170  0.173  0.466  0.478 
Appendix C Variational inference of filters in gpdnn
In this section we report results when applying variational inference on the weights in gpdnn Bradshaw17 . In order to do this, we implemented mcd for the convolutional parameters, similarly to what presented in the main paper for our cnn+gp(rf) model.
SHALLOW  DEEP  

mnist  cifar10  cifar10  cifar100  
Metrics  cnn+gp(rf)  gpdnn  cnn+gp(rf)  gpdnn  cnn+gp(rf)  gpdnn  cnn+gp(rf)  gpdnn 
err  0.005  0.005  0.172  0.172  0.111  0.190  0.351  0.820 
mnll  0.014  0.019  0.535  0.531  0.344  0.675  1.255  8.606 
ece  0.004  0.005  0.012  0.012  0.051  0.036  0.050  0.527 
brier  0.0071  0.008  0.245  0.244  0.168  0.278  0.466  1.268 
The results in table 3 indicate that this improves the calibration and accuracy of gpdnn compared to optimizing the filters. In the case of a shallow convolutional architecture, the performance of cnn+gp(rf) and gpdnn are comparable, although in the deeper case cnn+gp(rf) achieves better performance. This supports the intuition that inferring convolutional parameters, ranther than optimizing them, leads to considerable improvements in calibration.
Appendix D Reliability diagrams for cnn+gp(rf) and cnn+mcd
In figure 5 we report the reliability diagram of our cnn+gp(rf) model and cnn+mcd. These plots show that our approach and Bayesian cnns are wellcalibrated.