1 Introduction
Deep learning has had tremendous successes over the last couple of years, solving many problems that previously had been within the remit of humans only. Deep neural networks have been particularly successful in solving a wide variety of problems with natural data, like audio or image classification or generation. Despite their successes, deep learning systems can still benefit from improvements that make them more applicable in integrated real-world scenarios.
Deep neural networks work best in situations where large training datasets are available and test cases remain closely related to the training data. Performance degrades in edge cases for which few examples have been given (e.g. a self-driving car that is suddenly blinded by the low-hanging sun), or for known objects in unlikely configurations (Alcorn et al., 2018). Introducing explicit measures of uncertainty has been suggested as a method of improving the robustness of predictions, and of any decisions that are taken on the basis of them. Recent methods (Blundell et al., 2015; Kingma et al., 2015; Gal and Ghahramani, 2016) are inspired by Bayesian inference, but have been constrained to fit closely into the current computational framework of modern deep learning, in order to retain its existing benefits.
While Bayesian deep learning methods have been empirically successful in improving the robustness of DNN predictions, it is unclear to what extent they accurately approximate the true posteriors (Hron et al., 2018). Additionally, they do not deliver on an important promise of the Bayesian framework: automatic regularisation of model complexity, which allows the training of hyperparameters (Rasmussen and Ghahramani, 2001). Current marginal likelihood estimates are not usable for hyperparameter selection, and the strong relationship between their quality and the quality of posterior approximations suggests that further improvements are possible with better Bayesian approximations.
In this paper, we investigate using Gaussian processes (GPs) as an alternative building block for creating deep learning models with the benefits of Bayesian inference. GPs allow interpretable incorporation of prior knowledge, and provide accurate Bayesian inference with robust uncertainty estimates, due to their nonparametric nature. Their practical application has been limited due to their large computational requirements for big datasets, and due to the limited inductive biases that they could encode. In recent years, however, advances in stochastic variational inference have allowed Gaussian processes to be scaled to large datasets for both regression and classification models (Hensman et al., 2013, 2015). More sophisticated model structures that are common in the deep learning community, such as depth (Damianou and Lawrence, 2013; Salimbeni and Deisenroth, 2017) and convolutions (van der Wilk et al., 2017), have been incorporated as well. Notably, inference is still accurate enough to provide marginal likelihood estimates that are routinely used for hyperparameter selection.
The accuracy of a Bayesian method, and the quality of its posterior uncertainties, depend strongly on the suitability of the assumptions made in the prior, as well as on the quality of inference. In recent years, improving inference has received the most research attention. We set out to improve current convolutional Gaussian process models by investigating problems in their posterior. We find that their current translation-invariance properties are too restrictive, and propose the Translation Insensitive Convolutional Kernel (TICK) as a remedy. We find a significant improvement in performance, in both accuracy and uncertainty quantification. Although further improvements are still necessary to reach the classification accuracies of deep learning, we do demonstrate the effectiveness and elegance of Bayesian modelling and critiquing, together with variational approximations, for creating models which give useful uncertainties and automatically tune hyperparameters.
2 Background
2.1 Gaussian process models
Gaussian processes (GPs) (Rasmussen and Williams, 2006) are nonparametric distributions over functions, similar to Bayesian neural networks. The core difference is that neural networks represent distributions over functions through distributions on weights, while a Gaussian process specifies a distribution on function values at a collection of input locations. This representation allows us to use an infinite number of basis functions, while still allowing Bayesian inference (Neal, 1996). In a GP, the joint distribution of these function values is Gaussian and is fully determined by its mean function $\mu(\cdot)$ and covariance (kernel) function $k(\cdot, \cdot)$. Taking the mean function to be zero without loss of generality, function values $\mathbf{f} = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^\top$ at inputs $X$ are distributed as $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_{\mathbf{ff}})$, where $[\mathbf{K}_{\mathbf{ff}}]_{nn'} = k(\mathbf{x}_n, \mathbf{x}_{n'})$. The Gaussianity, and the fact that we can manipulate function values at some finite points of interest without taking the behaviour at any other points into account (marginalisation property), make GPs particularly convenient to manipulate and use as priors over functions in Bayesian models.

Gaussian processes can be used in many machine learning tasks where some function has to be learned from data, e.g. in classification, where we learn a mapping from an image to a logit. For models where a GP is directly used as a prior on the function mapping, the kernel has the strongest influence on the model's generalisation ability. A good choice of kernel will impose as much structure in the prior as possible, while still retaining enough flexibility to fit the data. Common kernels, such as the Matérn or squared exponential (SE), only impose varying levels of smoothness. More complicated structure, like periodicity (MacKay, 1998), can also be encoded, which greatly improves generalisation when appropriate. Models based on compositions of functions, analogous to deep neural networks, can also be given a Bayesian treatment by placing Gaussian process priors on the functions, resulting in deep Gaussian processes (Damianou and Lawrence, 2013).

2.2 Convolutional Gaussian processes
In this work, we focus on creating models for image inputs. While existing GP models with kernels like the squared exponential (SE) kernel have the capacity to learn any well-behaved function when given infinite data (Rasmussen and Williams, 2006, chapter 7), they are unlikely to work well for image tasks with realistic dataset sizes. Local kernels, like the SE, only constrain functions in the prior to be smooth, and allow the function to vary along any direction in the input space. As a result, these models only generalise in neighbourhoods near the training data, with large uncertainties being predicted elsewhere. This excessive flexibility is a particular problem for images, which have high input dimensionality while exhibiting a large amount of structure. When designing (Bayesian) models, it is crucial to think about sensible inductive biases to incorporate into the model. For instance, convolutional structure has been widely used to address this issue (LeCun et al., 1989; Goodfellow et al., 2016). Recently, van der Wilk et al. (2017) introduced this structure into a Gaussian process model together with an efficient inference scheme, and showed that this significantly improved performance on image classification tasks.
van der Wilk et al. (2017) construct the convolutional kernel for functions $f$ from images of size $W \times H$ to real-valued responses. Their starting point is a patch response function $g: \mathbb{R}^{w \times h} \to \mathbb{R}$, operating on patches of the input image of size $w \times h$. The output for a particular image is found by taking a sum of the patch response function applied to all patches of the image. A vectorised image $\mathbf{x}$ of height $H$ and width $W$ contains $P = (W - w + 1)(H - h + 1)$ overlapping patches when we slide the window one pixel at a time (i.e. vertical and horizontal stride of 1), and we denote the $p$-th patch of an image $\mathbf{x}$ as $\mathbf{x}^{[p]}$. Placing a GP prior on $g$ with kernel $k_g$ implies a GP prior on $f$:

$$f(\mathbf{x}) = \sum_{p=1}^{P} g\big(\mathbf{x}^{[p]}\big), \qquad (1)$$

$$k(\mathbf{x}, \mathbf{x}') = \sum_{p=1}^{P} \sum_{p'=1}^{P} k_g\big(\mathbf{x}^{[p]}, \mathbf{x}'^{[p']}\big). \qquad (2)$$
The convolutional kernel places much stronger constraints on the functions in the prior, based on the idea that similar patches contribute similarly to the function's output, regardless of their position. This prior places more mass on functions that are sensible for images, and therefore allows the model to generalise more aggressively and with smaller uncertainty than, for example, the SE kernel. If these assumptions are appropriate for a given dataset, this leads to a model with a higher marginal likelihood and better generalisation on unseen test data.
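To make eqs. (1) and (2) concrete, the construction can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: `extract_patches`, `k_se` and `conv_kernel` are our own names, and a small 5×5 image with 3×3 patches stands in for real data.

```python
import numpy as np

def extract_patches(image, w=3, h=3):
    """All overlapping h x w patches (stride 1), one flattened patch per row."""
    H, W = image.shape
    patches = [image[i:i + h, j:j + w].ravel()
               for i in range(H - h + 1) for j in range(W - w + 1)]
    return np.stack(patches)                      # shape (P, w*h)

def k_se(A, B, lengthscale=1.0):
    """Squared exponential kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def conv_kernel(x1, x2):
    """Eq. (2): sum the patch kernel over all pairs of patches."""
    return k_se(extract_patches(x1), extract_patches(x2)).sum()

rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
print(extract_patches(img_a).shape)   # (9, 9): P = (5-3+1)^2 patches of 9 pixels
print(conv_kernel(img_a, img_b))
```

Eq. (1) corresponds to summing a sample of the patch response function over the rows returned by `extract_patches`; the double sum over patch pairs in `conv_kernel` is exactly what makes naive evaluation expensive for realistic image sizes.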
Analogous to the link between neural networks and existing Gaussian processes (Neal, 1996), the convolutional GP can also be obtained as a particular limit of infinite filters in a convolutional neural network (van der Wilk, 2019). Other limits have also been investigated (Garriga-Alonso et al., 2018; Novak et al., 2018), resulting in different kernels.

2.3 Deep Gaussian processes
The convolutional structure discussed in the previous section is an example of how the kernel and its associated feature representation influence the performance of a model. Deep learning models partially automate this feature selection by learning feature hierarchies from the training data. In image tasks, the first layers of a deep network identify edges, corners, and other local features, while combining them into more complicated silhouettes and structures further into the hierarchy. Eventually a simple regressor solves the task.
Deep GPs (DGPs) share this compositional nature by composing layers of GPs (Damianou and Lawrence, 2013). They can be defined as $f(\mathbf{x}) = f_L(f_{L-1}(\cdots f_1(\mathbf{x})))$, where each component $f_\ell$ is itself given a GP prior. DGPs allow us to specify priors on flexible functions with compositional structure, and open the door to nonparametric Bayesian feature learning. Salimbeni and Deisenroth (2017) showed that this is crucial to achieve state-of-the-art performance on many datasets, and that DGP models never perform worse than single-layer GPs.
2.4 Contributions
The goal of this paper is to build sensible Bayesian models for image data that simultaneously achieve high accuracy and provide good uncertainty estimates. Given the success of deep learning, it is a natural choice to blend both depth and convolution into our model. A couple of works on arXiv (Blomqvist et al., 2018; Kumar et al., 2018) have combined the DGP approach of Salimbeni and Deisenroth (2017) with the convolutional structure of van der Wilk et al. (2017), leading to a deep convolutional Gaussian process (DCGP). In this work we start by reformulating the hidden layers of a DCGP as a multi-output GP. We have developed an extension to GPflow (Matthews et al., 2017) for the convenient handling of multi-output GPs for this purpose (van der Wilk et al., 2019). To find avenues for improvement, we study posterior samples from the original ConvGP (van der Wilk et al., 2017), which show us that the model is too constrained, leading to reduced classification accuracy and overconfidence in predictions. As a solution we propose a “translation insensitive” convolutional kernel, which we validate in a series of experiments on MNIST, FASHION-MNIST, Semeion and CIFAR-10. The accuracies we obtain approach those that first sparked interest in deep learning, but with the better uncertainty estimates and automatic adjustment of hyperparameters that our Bayesian approach provides.
3 Bayesian Modelling of Images
Figure 3: Classification of 2s vs 7s. We show two images (black and white, left) from the training set that are incorrectly classified by the ConvGP (a), but correctly classified by the TICK-GP (b). The orange and blue images are deviations from the mean of posterior samples.
3.1 Limits of the ConvGP kernel
In this section we focus on analysing the behaviour of single-layer convolutional Gaussian processes (ConvGPs, section 2.2), so we can develop improvements in a targeted way. The convolutional structure in eq. 1 introduces a form of translation invariance, as the same GP is used for all patches in the image, regardless of location. Depending on the task, a strict form of invariance may or may not be beneficial. For example, in MNIST classification, a horizontal stroke near the top of the digit indicates a ‘7’, while the same stroke near the bottom indicates a ‘2’, as shown in appendix A. The construction of eq. 1 will apply the same patch response function $g$ to both patches, which is undesirable if we wish to distinguish between the two classes using $f$'s output.
van der Wilk et al. (2017) circumvented the translation invariance problem of the ConvGP in two ways. Firstly, by introducing patch weights $w_p$ it is possible to rescale the contribution of each patch, turning the uniform sum of eq. 1 into a weighted sum $f(\mathbf{x}) = \sum_p w_p\, g(\mathbf{x}^{[p]})$. This is a rudimentary approach which may be both too flexible, in that it allows wildly varying weights for neighbouring pixels, and not flexible enough, in that an image evaluation will always be a linear combination of evaluations of $g$ at the input patches. As a second solution, van der Wilk et al. (2017) proposed to add a flexible non-invariant kernel, e.g. a squared exponential (SE), to model any residuals. Adding this additional SE kernel should be seen as a last resort to capture any residuals, as it reintroduces the properties we set out to improve in the first place.
We illustrate the problem of the original ConvGP being too constrained in fig. 3. We trained a model to classify MNIST 2s vs 7s only, and display the deviations from the mean of samples from the posterior of $g$ before applying the summation. On the left (a) we show posterior samples for the original ConvGP, and on the right (b) samples from our modified Translation Insensitive Convolutional GP. Note that all samples in (a) and (b) are plotted using the same colour range. We immediately notice that the samples in (a) are less vibrant than in (b), indicating the smaller variance of the ConvGP. The small variance is the result of the ConvGP being too constrained, which leads to a collapsed posterior predictive distribution that is not able to accommodate patches that can be both positive and negative (i.e. belong to both classes). We also notice that all background pixels within an image have the exact same value.
3.2 Translation insensitive convolutional kernel
A better modelling assumption would be to relax the “same patch, same output” constraint and have a patch response function that is able to vary its output depending on both the patch input and the patch location. Inspired by kernels for “locally invariant” or “insensitive” functions (Raj et al., 2017; van der Wilk et al., 2018) we call this property translation insensitivity. To this end, we propose a product kernel between the patches and their locations:
$$k_g\big((\mathbf{x}^{[p]}, p), (\mathbf{x}'^{[p']}, p')\big) = k_{\mathrm{patch}}\big(\mathbf{x}^{[p]}, \mathbf{x}'^{[p']}\big)\, k_{\mathrm{loc}}\big(\mathrm{loc}(p), \mathrm{loc}(p')\big), \qquad (3)$$

where $\mathrm{loc}(p)$ returns the location of the upper-left corner of the $p$-th patch in the image, and $k_{\mathrm{patch}}$ and $k_{\mathrm{loc}}$ are the kernels we use over the patches and patch locations, respectively. In our experiments we used SE kernels for both. We refer to this kernel as the Translation Insensitive Convolutional Kernel (TICK). The term “insensitive” was used by van der Wilk et al. (2018) as a relaxation of invariance. We use the term to indicate that the output is slightly sensitive to translations.
The degree of insensitivity (i.e. the degree to which the output of $g$ depends on the location of the input patch) is determined by the lengthscale of $k_{\mathrm{loc}}$. Large lengthscales recover the original convolutional kernel, while very short lengthscales allow large variation between locations, resulting in an additive kernel (Duvenaud et al., 2011; Durrande et al., 2012). We expect reasonable lengthscales to be on the order of the size of the image, so the model can learn that a patch near the bottom of the image may contribute differently than the same feature at the top. We learn this lengthscale automatically, together with the other hyperparameters, using the marginal likelihood approximation.
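The behaviour of the product kernel in eq. 3 can be sketched directly. In this toy example (our own names; SE kernels for both factors, whereas the experiments use a Matérn-3/2 for locations), the same patch at the same location has covariance 1, distant copies of the patch decorrelate, and a very long location lengthscale recovers the location-independent ConvGP behaviour.

```python
import numpy as np

def k_se(A, B, lengthscale):
    """Squared exponential kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def tick_kernel(patches1, locs1, patches2, locs2, patch_ls=1.0, loc_ls=10.0):
    """Eq. (3): product of a patch kernel and a patch-location kernel."""
    return k_se(patches1, patches2, patch_ls) * k_se(locs1, locs2, loc_ls)

rng = np.random.default_rng(1)
p = rng.normal(size=(1, 9))                          # one flattened 3x3 patch
loc_a, loc_b = np.array([[0.0, 0.0]]), np.array([[20.0, 0.0]])

near = tick_kernel(p, loc_a, p, loc_a, loc_ls=10.0)[0, 0]
far = tick_kernel(p, loc_a, p, loc_b, loc_ls=10.0)[0, 0]
huge = tick_kernel(p, loc_a, p, loc_b, loc_ls=1e6)[0, 0]
print(near, far, huge)
# identical patch: covariance 1 at the same location, smaller 20 pixels away,
# and back near 1 when the location lengthscale is effectively infinite
```

The `loc_ls` parameter plays the role of the insensitivity lengthscale discussed above, and in the model it is learned rather than fixed.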
Returning to fig. 3 (b), we see that the deviation from the mean for the TICK-GP is much larger, which shows the larger variance and indicates that the model is less constrained. More interestingly, the samples vary in a way that is consistent with our modelling assumptions. This can most easily be observed by inspecting the background of the images (away from the digit), where the mapping varies smoothly. The mapping of similar patches also varies smoothly across the stroke: horizontal and vertical lines in the image give locally similar responses.
3.3 Deep Convolutional Gaussian processes
With the ideas of (improved) convolutional kernels and deep Gaussian processes in place, it is straightforward to conceive of a model that combines both: a deep GP with convolutional kernels at each layer. To do this, we need to make the convolutional layers map from images to images, which we do using a multi-output kernel.
This can be done with a minor reformulation of the convolutional kernel of eq. 1: instead of summing over the patches, we simply apply $g$ to all patches in the input image. As a result, we obtain a vector-valued function $\mathbf{f}: \mathbb{R}^{W \times H} \to \mathbb{R}^P$ defined as

$$f_p(\mathbf{x}) = g\big(\mathbf{x}^{[p]}\big), \qquad (4)$$

where $f_p$ indicates the $p$-th output of $\mathbf{f}$. Since the same $g$ is applied to the different patches, there will be correlations between the outputs. For this reason, we consider the mapping a multi-output GP (MOGP), and name its kernel the Multi-Output Convolutional Kernel (MOCK). Multi-output GPs (Alvarez et al., 2012) can be characterised by the covariance between different outputs $p$ and $p'$ at different inputs $\mathbf{x}$ and $\mathbf{x}'$, giving in our case

$$\mathrm{cov}\big(f_p(\mathbf{x}), f_{p'}(\mathbf{x}')\big) = k_g\big(\mathbf{x}^{[p]}, \mathbf{x}'^{[p']}\big). \qquad (5)$$
Note that, based on this equation, the covariance matrix corresponding to $P$ patches and $N$ images has size $NP \times NP$. For MNIST, with $N = 60{,}000$ images of size $28 \times 28$ and patches of size $5 \times 5$ (so $P = 576$), the calculation and inversion of this matrix is infeasible. Efficient inference for MOGPs relies strongly on choosing useful inducing variables. To this end, we developed a framework for generic MOGPs that allows for the flexible specification of both multi-output priors and inducing variables, in a way that can take computational advantage of independence properties of the prior (van der Wilk et al., 2019).
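The infeasibility claim is simple arithmetic; the following assumes 28×28 MNIST images with 5×5 patches at stride 1, as used later in the experiments:

```python
# Number of overlapping patches for a 28x28 image with 5x5 patches, stride 1:
P = (28 - 5 + 1) ** 2
N = 60_000                       # MNIST training images
print(P)                         # 576
print(N * P)                     # 34,560,000 rows in the full patch covariance

# Storing the full (N*P) x (N*P) covariance in float64:
bytes_needed = (N * P) ** 2 * 8
print(bytes_needed / 1e15)       # roughly 9.6 petabytes, before any inversion
```

Inversion would be even worse, scaling cubically in $NP$; hence the need for inducing-variable approximations.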
The DCGP is built out of multiple convolutional GP layers, where the first layers use image-to-image mappings based on the MOCK. For a flattened $W \cdot H$-dimensional input image, such a layer produces a $P$-dimensional output vector, which we interpret as a smaller image. The next MOCK layer then acts on the output of the first layer and produces an even smaller flattened image, and so forth. Eventually, the final layer of the DCGP uses the formulation of eq. 1 and sums over all the outputs to produce a single scalar output prediction for each class. In each of these convolutional layers we can choose whether to use the TICK or the original convolutional kernel.
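The shrinking of the image through the layers follows directly from the patch count: with stride-1 "valid" patches, each layer maps a side length $s$ to $s - w + 1$. A quick sketch, assuming 5×5 patches at every layer (the actual layer configurations in the experiments may differ):

```python
def output_side(side, patch=5):
    """Spatial side length after one stride-1, valid convolutional GP layer."""
    return side - patch + 1

side = 28                  # e.g. a MNIST input image
sides = [side]
for _ in range(3):         # three successive MOCK layers
    side = output_side(side)
    sides.append(side)
print(sides)               # [28, 24, 20, 16]: each layer shrinks the image
```

The final sum-pooling layer then collapses the last $16 \times 16$ intermediate image into one scalar per class.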
4 Inference
Consider a dataset $\{\mathbf{x}_n, y_n\}_{n=1}^N$, consisting of $N$ images $\mathbf{x}_n$, each accompanied by a class label $y_n \in \{1, \dots, C\}$, where $C$ is the number of classes. We want to learn a mapping $f$ from image to logits. We set our deep convolutional GP model up as a composition of $L$ layers, $f = f_L \circ \cdots \circ f_1$. We define $\mathbf{h}_\ell$ to be the output of layer $\ell$, with $\mathbf{h}_0 = \mathbf{x}$, and $D_\ell$ to be the output dimension of layer $\ell$.
We are interested in both the posterior, for making subsequent predictions, and the marginal likelihood (evidence), for optimising the model's hyperparameters. Calculating both of these quantities is intractable because of 1) the $\mathcal{O}(N^3)$ cost of operations on $N \times N$ covariance matrices, 2) the non-conjugate likelihood $p(y_n \,|\, f(\mathbf{x}_n))$, 3) the infeasibly large number of kernel evaluations required for dealing with images on a patch basis, and 4) the propagation of the outputs of lower-layer GPs through the next layer.
Stochastic variational inference (Hoffman et al., 2013; Hensman et al., 2013) takes care of all the aforementioned issues within one framework. Following the standard variational approach, we construct a lower bound to the marginal likelihood (known as the Evidence Lower BOund, or ELBO) which we then optimise to find the optimal approximate posterior and the model’s hyperparameters.
To derive the ELBO, we start with the joint density of the model, $p(\mathbf{y}, \{\mathbf{h}_\ell\}, \{f_\ell\}) = p(\mathbf{y} \,|\, \mathbf{h}_L) \prod_\ell p(\mathbf{h}_\ell \,|\, f_\ell, \mathbf{h}_{\ell-1})\, p(f_\ell)$, slightly abusing notation to denote the density of a GP, and choose a variational posterior of the form $q(\{\mathbf{h}_\ell\}, \{f_\ell\}) = \prod_\ell p(\mathbf{h}_\ell \,|\, f_\ell, \mathbf{h}_{\ell-1})\, q(f_\ell)$. The repetition of $p(\mathbf{h}_\ell \,|\, f_\ell, \mathbf{h}_{\ell-1})$ in both the prior and the variational posterior leads to its cancellation inside the expectation of the final ELBO.
The particular form of $p(\mathbf{h}_\ell \,|\, f_\ell, \mathbf{h}_{\ell-1})$ is important, and different choices give rise to different DGPs. For instance, the original DGP formulation of Damianou and Lawrence (2013) used a Gaussian distribution. We, however, follow Salimbeni and Deisenroth (2017) and use a deterministic relation between $\mathbf{h}_\ell$ and $\mathbf{h}_{\ell-1}$ given the latent function $f_\ell$, corresponding to a Dirac delta function in the prior.

Minibatching
Because the likelihood factorises, the ELBO decomposes into a sum over all data points, allowing an unbiased estimate to be created using a subset of the data.
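The unbiasedness of the minibatch estimator is easy to verify numerically: rescaling a random batch's sum by $N/B$ matches the full-data sum in expectation. A toy check (the per-datapoint values stand in for the expected log-likelihood terms of the ELBO):

```python
import numpy as np

rng = np.random.default_rng(2)
per_point = rng.normal(size=1000)        # stand-in for E_q[log p(y_n | h_n)]
full_sum = per_point.sum()

B = 64                                   # minibatch size
estimates = []
for _ in range(5000):
    batch = rng.choice(per_point, size=B, replace=False)
    estimates.append(len(per_point) / B * batch.sum())   # rescale by N/B

print(full_sum, np.mean(estimates))      # the two agree in expectation
```

The KL terms of the ELBO do not depend on the data and are computed exactly on every step; only the likelihood sum is subsampled.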
Reparameterisation
Obtaining the expectation in the ELBO in closed form is impossible, so we follow the Monte Carlo estimate of Salimbeni and Deisenroth (2017). The variational approximation can be sampled from by successively sampling through the layers. We start with $\hat{\mathbf{h}}_1 = \hat{f}_1(\mathbf{x})$, where $\hat{f}_1$ is sampled from $q(f_1)$, and continue similarly for every layer $\ell$, with $\hat{\mathbf{h}}_\ell = \hat{f}_\ell(\hat{\mathbf{h}}_{\ell-1})$, so that the input of the current layer is the sampled output of the previous one. We choose the $q(f_\ell)$ to be Gaussian processes, which have Gaussian marginals to which the ‘reparameterisation trick’ (Rezende et al., 2014; Kingma et al., 2015) can be applied for learning their parameters using gradient-based optimisation.
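The layer-by-layer sampling with the reparameterisation trick can be sketched as follows. This is a toy stand-in: the per-layer mean and variance here are arbitrary functions of the previous layer's sample, whereas in the model they are the SVGP marginals of eq. 6; what matters is that the noise is drawn independently of the parameters, so gradients can flow through the mean and variance.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_layer(h_prev, weight, bias):
    """Draw from a Gaussian marginal via the reparameterisation trick.

    The mean/variance are toy stand-ins for the SVGP marginals of eq. (6).
    """
    mean = np.tanh(h_prev @ weight + bias)
    var = 0.1 * np.ones_like(mean)
    eps = rng.standard_normal(mean.shape)   # noise is independent of params,
    return mean + np.sqrt(var) * eps        # so gradients flow through mean/var

x = rng.normal(size=(4, 8))                 # a minibatch of 4 inputs
h = x
for layer in range(3):                      # propagate samples layer by layer
    W = rng.normal(size=(h.shape[1], 6)) / np.sqrt(h.shape[1])
    h = sample_layer(h, W, bias=0.0)
print(h.shape)                              # one sample per datapoint
```

Each datapoint's sample at the final layer is then scored under the likelihood, giving an unbiased single-sample estimate of the expectation in the ELBO.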
Sparse Gaussian processes
We specify the variational distribution for the latent functions following Titsias (2009), Hensman et al. (2013), and Matthews et al. (2016). This framework conditions the prior on $M$ inducing variables $\mathbf{u}$, and then specifies a free Gaussian density $q(\mathbf{u}) = \mathcal{N}(\mathbf{u}; \mathbf{m}, \mathbf{S})$. This gives the approximation $q(f) = \int p(f \,|\, \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u}$ for each layer. The original framework chose the inducing outputs to be observations of the GP at some inducing inputs $Z = \{\mathbf{z}_m\}_{m=1}^M$, i.e. $\mathbf{u} = [f(\mathbf{z}_1), \dots, f(\mathbf{z}_M)]^\top$. The key idea of the sparse GP framework is to choose $M \ll N$, which makes the size of the matrix that we perform cubic operations on $M \times M$. Note that the posterior is still a full-rank GP, which predicts using an infinite number of basis functions thanks to the use of the prior conditional. This maintains the desirable error bars of the original GP. The overall approximate posterior has the form $q(f) = \mathcal{GP}(\mu(\cdot), \Sigma(\cdot, \cdot))$ with

$$\mu(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{m}, \quad \Sigma(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-1} \big(\mathbf{K}_{\mathbf{uu}} - \mathbf{S}\big) \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x}'), \qquad (6)$$

where $[\mathbf{K}_{\mathbf{uu}}]_{mm'} = \mathrm{cov}(u_m, u_{m'})$ and $[\mathbf{k}_{\mathbf{u}}(\mathbf{x})]_m = \mathrm{cov}(u_m, f(\mathbf{x}))$. When we predict for a single point in a multi-output layer, the size of $\mathbf{k}_{\mathbf{u}}(\mathbf{x})$ is $P \times M$, the number of outputs by the number of inducing variables, while $k(\mathbf{x}, \mathbf{x})$ returns the covariance matrix over all outputs.
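The conditional in eq. 6 can be sketched for a single-output GP in a few lines of NumPy. This is an illustrative sketch under our own names (`k_se`, `svgp_conditional`); production implementations such as GPflow use Cholesky factorisations throughout for numerical stability rather than a direct solve.

```python
import numpy as np

def k_se(A, B, ls=1.0):
    """Squared exponential kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls ** 2)

def svgp_conditional(Xnew, Z, m, S, jitter=1e-8):
    """Mean and covariance of q(f(Xnew)), following eq. (6)."""
    Kuu = k_se(Z, Z) + jitter * np.eye(len(Z))
    Kuf = k_se(Z, Xnew)
    Kff = k_se(Xnew, Xnew)
    A = np.linalg.solve(Kuu, Kuf)               # Kuu^{-1} k_u(x)
    mu = A.T @ m                                # k_u^T Kuu^{-1} m
    Sigma = Kff - A.T @ (Kuu - S) @ A           # eq. (6) covariance
    return mu, Sigma

rng = np.random.default_rng(4)
Z = rng.normal(size=(20, 2))                    # M = 20 inducing inputs
m = rng.normal(size=20)                         # variational mean
L = 0.1 * rng.normal(size=(20, 20))
S = L @ L.T                                     # any PSD variational covariance
Xnew = rng.normal(size=(5, 2))
mu, Sigma = svgp_conditional(Xnew, Z, m, S)
print(mu.shape, Sigma.shape)
```

Setting $\mathbf{S} = \mathbf{K}_{\mathbf{uu}}$ recovers the prior covariance at `Xnew`, which is a useful sanity check on any implementation.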
In order to evaluate the expectation in the ELBO as described above, we need to generate samples with the covariance $\Sigma(\mathbf{x}, \mathbf{x})$. This requires taking a Cholesky decomposition of this covariance, one for each datapoint in the minibatch. This presents a significant computational problem, as its size is $P \times P$, with $P$ being roughly the same as the number of patches in the input image. For MNIST, with $P = 576$, each such Cholesky has a cost comparable to the inversion of the inducing variable covariance $\mathbf{K}_{\mathbf{uu}}$, as $M$ is usually taken to be 500–1000. However, the total cost is much larger, as we only need to perform a single Cholesky of $\mathbf{K}_{\mathbf{uu}}$ per layer, but one of $\Sigma(\mathbf{x}, \mathbf{x})$ per datapoint. The deep convolutional GP model of Blomqvist et al. (2018) suffers from this problem as well, although it is not discussed; they avoid the computational cost by simply sampling from the marginals. In this work we also follow this approach, as it seems to work well in practice, despite not being strictly mathematically correct.
Inter-domain inducing patches
We have two types of convolutional layers which cause problems: first, the multi-output convolutional layers (MOCK), and second, the final convolutional layer following eq. 1, which performs sum-pooling. Using inducing points for the latter results in impractically large double sums over all patches when computing $\mathbf{k}_{\mathbf{u}}(\mathbf{x})$ and $\mathbf{K}_{\mathbf{uu}}$. For the MOCK case, we need some bookkeeping to avoid $\mathbf{u}$ being defined as all $P$ outputs in response to the inducing inputs $Z$. Making use of inter-domain inducing variables (Lázaro-Gredilla and Figueiras-Vidal, 2009) in both these cases solves the mathematical, organisational, and software problems. We follow van der Wilk et al. (2017) and define $\mathbf{u}$ for each layer as evaluations of the patch response function $g$, placing the inducing inputs $Z$ in the patch space $\mathbb{R}^{w \times h}$, rather than the image space $\mathbb{R}^{W \times H}$.
To apply this approximation, we need to find the appropriate covariances $\mathbf{K}_{\mathbf{uu}}$ and $\mathbf{k}_{\mathbf{u}}(\cdot)$, which can then be used in eq. 6 for the conditional mean and covariance. Choosing the inducing variables in this way greatly reduces the computational cost of the method, since we now only require covariances between the patches of the input image and the inducing patches $\mathbf{z}_m$.
The final layer performs sum-pooling, as in the original formulation of van der Wilk et al. (2017), removing the problem of correlations between output patches. We still use inducing patches to avoid having to calculate covariances between all pairs of patches, which results in the cross-covariance

$$[\mathbf{k}_{\mathbf{u}}(\mathbf{x})]_m = \mathrm{cov}\big(f(\mathbf{x}), u_m\big) = \sum_{p=1}^{P} k_g\big(\mathbf{x}^{[p]}, \mathbf{z}_m\big),$$

which is a vector of length $M$ and can directly be plugged into eq. 6.
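As a sketch of this cross-covariance, the sum-pooling layer's covariance with each inducing patch is just a patch-kernel evaluation summed over the image's patches. The names and the toy sizes (6×6 image, 3×3 patches, $M = 10$ inducing patches) are ours:

```python
import numpy as np

def extract_patches(image, w=3, h=3):
    """All overlapping h x w patches (stride 1), one flattened patch per row."""
    H, W = image.shape
    return np.stack([image[i:i + h, j:j + w].ravel()
                     for i in range(H - h + 1) for j in range(W - w + 1)])

def k_se(A, B, ls=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls ** 2)

def cross_cov(image, Z_patches):
    """cov(f(x), u) for the sum-pooling layer: the patch kernel summed over
    the image's patches, giving one entry per inducing patch."""
    P = extract_patches(image)                 # (P, w*h)
    return k_se(P, Z_patches).sum(axis=0)      # vector of length M

rng = np.random.default_rng(5)
img = rng.normal(size=(6, 6))
Z = rng.normal(size=(10, 9))                   # M = 10 inducing patches
print(cross_cov(img, Z).shape)
```

Note the single sum over patches: the double sum over patch pairs only appears in $k(\mathbf{x}, \mathbf{x})$, not in the cross-covariance with the inducing patches, which is what makes this parameterisation cheap.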
ConvGP vs. TICK-GP
The main difference between the two models lies in the kernel of the patch response function $g$. The kernel in the ConvGP acts solely on patches, while in the TICK-GP the kernel is constructed as in eq. 3, acting on patches and their corresponding locations. As a result, in the TICK-GP each inducing patch in every layer is accompanied by an inducing location, which is also optimised during training.
5 Experiments
Table 1: Comparison of the SE-GP, ConvGP, TICK-GP and CNN models on MNIST, FASHION-MNIST and grey CIFAR-10. Reported metrics: top-1, top-2 and top-3 error, NLL on the full test set, NLL on the misclassified images, and ECE.
In this section we present results using our translation insensitive convolutional kernel. We show that TICK-GP improves over ConvGP and achieves the highest reported classification results for a shallow GP model on MNIST, FASHION-MNIST and CIFAR-10. Crucially, we find that while the classification accuracy rivals that of CNNs, the uncertainty estimates are superior: the CNN is confidently wrong on some ambiguous cases, whereas TICK-GP provides appropriate uncertainty. We demonstrate further that this effect is even more pronounced in a transfer learning task, where the CNN predicts wrong labels with high confidence despite the distributional shift. We also demonstrate the benefits of translation insensitivity in a DCGP.
5.1 MNIST, FASHION-MNIST and CIFAR-10
We evaluate TICK-GP on three standard image benchmarks (MNIST, FASHION-MNIST and greyscale CIFAR-10) and compare its performance to an SE-GP, a ConvGP and a CNN. All GP models in this experiment are shallow and trained using the procedure outlined in section 4. We compare the GP models to a simple CNN architecture consisting of two convolutional layers followed by two dense layers. We use dropout with 50% keep probability to prevent overfitting. All other neural network settings can be found in appendix B. Although we acknowledge that this network is simple compared to other networks that may perform even better, we believe that it uses a representative collection of standard training techniques, and is therefore a reasonable comparison for assessing uncertainty quality. The SE-GP model is a vanilla Sparse Variational GP (SVGP) (Hensman et al., 2013) using an SE kernel defined directly on the images. For MNIST and FASHION-MNIST, we use the de facto standard split of the data: 60,000 images are used for training and 10,000 for testing. The CIFAR-10 dataset consists of 60,000 32×32 images, which we convert to greyscale: 50,000 training images and 10,000 test images. All datasets contain examples of $C = 10$ different classes.
For comparison's sake, we set up TICK-GP and ConvGP as similarly as possible. They are both configured to have 1,000 inducing 5×5 patches, which are initialised using randomly picked patches from the training examples. Further, we choose an SE kernel for the patch response function and follow van der Wilk et al. (2017), who multiply the patch response outputs with learned weights before summation. Finally, we initialise the inducing patch locations of TICK-GP to random values, and use a Matérn-3/2 kernel with lengthscale initialised to 3 for the location kernel from eq. 3.
All GP models use a minibatch size of 128 and are trained using the Adam optimiser (Kingma and Ba, 2014) with a decaying learning rate. The models are run until the ELBO converges, on a single GeForce GTX 1070 GPU.
We are dealing with a multi-class classification problem, so we use the softmax likelihood with 10 latent GPs. As the softmax likelihood is not conjugate to the variational posterior, we evaluate the predictive distribution using Monte Carlo estimates, $p(y^* \,|\, \mathbf{x}^*) \approx \frac{1}{S} \sum_{s=1}^{S} p\big(y^* \,|\, f^{(s)}(\mathbf{x}^*)\big)$, where $f^{(s)} \sim q(f)$.
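The Monte Carlo predictive can be sketched as follows, assuming (for illustration only) independent Gaussian marginals for the latent GPs at a test point; `mc_predict` and the toy means/variances are our own. The sketch also shows the qualitative effect that matters for calibration: larger latent variance pulls the averaged prediction towards uniform.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(mean, var, S=1000, rng=None):
    """Monte Carlo softmax predictive: average the softmax of S samples drawn
    from (assumed independent) Gaussian marginals of the latent GPs."""
    if rng is None:
        rng = np.random.default_rng(0)
    samples = mean + np.sqrt(var) * rng.standard_normal((S,) + mean.shape)
    return softmax(samples).mean(axis=0)

mean = np.array([2.0, 0.0, -1.0])      # toy latent means for 3 classes
probs_confident = mc_predict(mean, var=np.full(3, 0.01))
probs_uncertain = mc_predict(mean, var=np.full(3, 25.0))
print(probs_confident, probs_uncertain)
# the same means give a much flatter prediction when the latent variance grows
```

Averaging the softmax of samples, rather than taking the softmax of the averaged latents, is what lets the latent-function uncertainty propagate into the class probabilities.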
Table 1 reports the error rate, negative log-likelihood (NLL) and Expected Calibration Error (ECE) (Naeini et al., 2015). We use the NLL as our main metric for calibration, as it is a proper scoring rule (Gneiting and Raftery, 2007) and has a useful relationship to the returns obtained from bets placed on the basis of the predicted beliefs (Roulston and Smith, 2002). We see that TICK-GP outperforms the other models in terms of NLL, both on the complete test set and on the misclassified images, while still being competitive with the CNN in terms of error rate. The shallow TICK-GP also sets new records for classification with GP models on the listed datasets.
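For reference, both metrics are straightforward to compute from predicted class probabilities. The ECE implementation below follows the common equal-width binning variant of Naeini et al. (2015); the function names and the three-point toy example are ours.

```python
import numpy as np

def nll(probs, labels):
    """Average negative log-likelihood of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: |accuracy - confidence| per confidence bin,
    weighted by the fraction of points in the bin (Naeini et al., 2015)."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])   # all three correct
labels = np.array([0, 0, 1])
print(nll(probs, labels), ece(probs, labels))
```

Because the NLL scores the full predicted distribution rather than only the argmax, it penalises confident mistakes heavily, which is exactly the failure mode observed for the CNN.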
In fig. 4 we show the predictive probabilities for a few randomly selected misclassified images, demonstrating the better calibrated probabilities of the GP-based models compared to the parametric NN model, and the improvements of the newly presented TICK-GP over the ConvGP. In appendix C we show the complete set of misclassified images.
Table 2: Out-of-distribution evaluation: models trained on MNIST and evaluated on the Semeion dataset. Reported metrics for ConvGP, CNN and TICK-GP: top-1, top-2 and top-3 error, NLL on the full test set, and NLL on the misclassified images.
5.2 Out-of-distribution test set
In this experiment we test the generalisation capacity of the models presented in section 5.1. In particular, we are interested in studying their behaviour when a distribution shift occurs in the test set. This is an important setting, as most machine learning models will eventually be used in domains broader than the one covered by their training dataset. It is therefore crucial that the models are able to detect this change of environment, and adjust their uncertainty levels so that appropriate actions can be taken.
The models in table 2 are trained on MNIST, but the reported metrics, error rate and NLL, are calculated on the Semeion digit dataset. The Semeion dataset (UCI) contains 1,593 images of 16×16 pixels. To be able to reuse the MNIST-trained models, we pad the Semeion images with zero pixels to match the MNIST size. The table shows that TICK-GP outperforms the CNN, and to a lesser extent the ConvGP, in terms of NLL, and performs comparably to the CNN in terms of accuracy. In appendix D, in the same way as in fig. 4, we show the predictive probabilities of the models for a few randomly selected misclassified images. The figure clearly illustrates that the CNN makes wrong predictions with very high certainty, explaining its high NLL values.

5.3 Deep Convolutional GPs
In this final experiment we show that the translation insensitivity of TICK-GP can be incorporated into deep convolutional Gaussian processes and improve their performance. In table 3 we list the results of a deep ConvGP and a deep TICK-GP on MNIST and CIFAR-10. We train models with one, two and three layers. We configure all models identically: each layer uses 384 inducing 5×5 patches (initialised using random patches from the training images), an identity Conv2D mean function for the hidden layers, and an SE kernel for the patch response function. The hidden layers of the L=2 and L=3 models are identical for both the deep ConvGP and deep TICK-GP, as the translation insensitivity is only added to the final layer. We use minibatches of size 32 and 64 for MNIST and CIFAR-10, respectively. All models are optimised using Adam with an exponentially decaying learning rate, starting at 0.01 and decreased by a factor of 4 every 50,000 optimisation steps. We run all models for the same number of iterations (300,000) and plot their error rates on MNIST as a function of time in fig. 5.
For the initialisation of the hidden layers’ variational parameters we follow Salimbeni and Deisenroth (2017) and set q(u) = N(0, 10^-5 I). The zero mean and small covariance turn off the nonlinear GP behaviour of the hidden layers, making them practically deterministic and completely determined by their identity mean function. In the final layer we use the same initialisation as for the single-layer models in section 5.1. For the initialisation of the three-layer models we set the first and last layers to the trained values of the two-layer model, as was done in Blomqvist et al. (2018). This is why in fig. 5 the optimisation curves of the three-layer models start after those of the two-layer models.
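This initialisation can be sketched as follows, with `q_mu` the variational mean and `q_sqrt` a Cholesky factor of the variational covariance per output, in the style of GPflow SVGP layers. The helper name is ours, and we assume the final layer keeps the standard N(0, I) initialisation:

```python
import numpy as np

def init_variational_params(num_inducing, num_outputs, final_layer):
    """Initialise q(u) = N(q_mu, S) for one layer of the deep model.

    Hidden layers get a zero mean and a tiny covariance S = 1e-5 * I
    (following Salimbeni and Deisenroth, 2017), which switches off their
    nonlinear GP behaviour at the start of training; for the final layer
    we assume the standard unit covariance.
    """
    q_mu = np.zeros((num_inducing, num_outputs))
    scale = 1.0 if final_layer else 1e-5
    # One Cholesky factor of the (diagonal) covariance per output dimension.
    q_sqrt = np.stack([np.sqrt(scale) * np.eye(num_inducing)] * num_outputs)
    return q_mu, q_sqrt
```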
                               MNIST               CIFAR-10
 # layers | metric            | ConvGP | TICK-GP | ConvGP | TICK-GP
 L = 1    | top error         |        |         |        |
          | NLL full test set |        |         |        |
          | NLL misclassified |        |         |        |
          | negative ELBO     |        |         |        |
 L = 2    | top error         |        |         |        |
          | NLL full test set |        |         |        |
          | NLL misclassified |        |         |        |
          | negative ELBO     |        |         |        |
 L = 3    | top error         |        |         |        |
          | NLL full test set |        |         |        |
          | NLL misclassified |        |         |        |
          | negative ELBO     |        |         |        |
Table 3 lists the performance of the deep models on MNIST and CIFAR-10. We observe that 1) our model with TICK-GP as the final layer outperforms a vanilla deep ConvGP in terms of both accuracy and NLL, 2) adding depth improves the performance and uncertainty quantification of both models, and 3) there is a strong correlation between the ELBO and the model’s performance, making it possible to use the ELBO for model selection. The modelling improvement that comes with adding translation insensitivity in the final layer is also clearly visible in fig. 5, where the TICK-GP curves (solid) lie consistently below the ConvGP curves (dashed).
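The link between overconfident errors and poor NLL, also seen in section 5.2, can be made concrete with a small numerical example (the probabilities below are illustrative, not taken from the experiments): the NLL of a single prediction is the negative log of the probability assigned to the true class.

```python
import math

def nll(p_true_class):
    """Negative log-likelihood of one prediction: -log p(true class)."""
    return -math.log(p_true_class)

# A model that is confidently wrong is penalised far more heavily than
# one that is wrong but uncertain:
round(nll(0.001), 2)  # 6.91 -- almost all mass on the wrong classes
round(nll(0.2), 2)    # 1.61 -- wrong, but with honest uncertainty
round(nll(0.9), 2)    # 0.11 -- a confident correct prediction
```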
6 Conclusion
Overall, we believe this work is a step towards bringing the advantages of Gaussian processes into deep learning. Deep and convolutional structures, once the preserve of deep learning models, are now applicable within GP models. In this work we have demonstrated a clear advantage of the Bayesian framework: we critiqued a modelling assumption (translation invariance) and adjusted the model accordingly. We have shown that the resulting TICK kernel closes the performance gap on several benchmarks.
Acknowledgements
We have greatly appreciated valuable discussions with Marc Deisenroth and Zhe Dong in the preparation of this work. We would like to thank Fergus Simpson, Hugh Salimbeni, ST John, Victor Picheny, and anonymous reviewers for helpful feedback on the manuscript.
References
 Alcorn et al. (2018) Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553, 2018.
 Alvarez et al. (2012) Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
 Blomqvist et al. (2018) Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional Gaussian processes. arXiv preprint arXiv:1810.03052, 2018.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/blundell15.html.
 Damianou and Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, 2013.
 Durrande et al. (2012) Nicolas Durrande, David Ginsbourger, and Olivier Roustant. Additive covariance kernels for high-dimensional Gaussian process modeling. In Annales de la Faculté de Sciences de Toulouse, volume 21, pages 481–499, 2012.
 Duvenaud et al. (2011) David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in neural information processing systems, pages 226–234, 2011.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
 Garriga-Alonso et al. (2018) Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes, 2018.
 Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Hensman et al. (2013) James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, 2013.
 Hensman et al. (2015) James Hensman, Alexander G de G Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015.
 Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 2013.
 Hron et al. (2018) Jiri Hron, Alex Matthews, and Zoubin Ghahramani. Variational Bayesian dropout: pitfalls and fixes. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, 2015.
 Kumar et al. (2018) Vinayak Kumar, Vaibhav Singh, PK Srijith, and Andreas Damianou. Deep Gaussian processes with convolutional kernels. arXiv preprint arXiv:1806.01655, 2018.
 Lázaro-Gredilla and Figueiras-Vidal (2009) Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.
 LeCun et al. (1989) Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 MacKay (1998) D. J. C. MacKay. Introduction to Gaussian processes. In Neural Networks and Machine Learning, NATO ASI Series. Kluwer Academic Press, 1998.

 Matthews et al. (2016) Alexander Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Artificial Intelligence and Statistics, 2016.
 Matthews et al. (2017) Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, apr 2017. URL http://jmlr.org/papers/v18/16-537.html.
 Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, pages 2901–2907, 2015.
 Neal (1996) Radford M. Neal. Bayesian learning for neural networks, volume 118. Springer, 1996.
 Novak et al. (2018) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. 2018.
 Raj et al. (2017) A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), 2017.
 Rasmussen and Williams (2006) Carl E Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 Rasmussen and Ghahramani (2001) Carl Edward Rasmussen and Zoubin Ghahramani. Occam’s razor. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 294–300. MIT Press, 2001. URL http://papers.nips.cc/paper/1925-occams-razor.pdf.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning, 2014.
 Roulston and Smith (2002) Mark S Roulston and Leonard A Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6):1653–1660, 2002.
 Salimbeni and Deisenroth (2017) Hugh Salimbeni and Marc P Deisenroth. Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems, 2017.
 (33) TensorFlow. Deep MNIST for experts. Available from https://www.tensorflow.org/get_started/mnist/pros.
 Titsias (2009) Michalis Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. Artificial Intelligence and Statistics, 2009.
 (35) UCI. Semeion handwritten digit data set. Available from https://archive.ics.uci.edu/ml/datasets/semeion+handwritten+digit.
 van der Wilk (2019) Mark van der Wilk. Sparse Gaussian Process Approximations and Applications. PhD thesis, University of Cambridge, 2019.
 van der Wilk et al. (2017) Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian Processes. In Advances in Neural Information Processing Systems, 2017.
 van der Wilk et al. (2018) Mark van der Wilk, Matthias Bauer, ST John, and James Hensman. Learning invariances using the marginal likelihood. Advances in Neural Information Processing Systems 31, 2018.
 van der Wilk et al. (2019) Mark van der Wilk, Vincent Dutordoir, ST John, Artem Artemev, Vincent Adam, and James Hensman. A framework for interdomain and multioutput Gaussian processes. Technical report, PROWLER.io, Feb 2019.
Appendix A MNIST 2 and 7 classification example
Appendix B CNN architectures
The convolutional neural network (CNN) used in the classification experiments consists of two convolutional layers with 32 and 64 kernels, respectively, a kernel size of 5x5 and a stride of 1. Both convolutional layers are followed by max pooling with stride and pool size equal to 2. The flattened output of the second max-pooling layer is fed into a fully connected layer of 1024 units with ReLU activation, the result of which is passed through a dropout layer with rate 0.5. The final fully connected layer has 10 units with a softmax nonlinearity. We initialised the convolutional and fully connected weights with a truncated normal with standard deviation 0.1, and the biases with a constant 0.1. The CNN is trained using the Adam optimiser [Kingma and Ba, 2014] with a constant learning rate. We followed the architecture used in the TensorFlow tutorial (TensorFlow).
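The architecture above can be sketched with tf.keras; this is an illustrative reconstruction rather than the exact training script, and the optimiser's learning-rate value is not restated here:

```python
import tensorflow as tf

def build_cnn(num_classes=10):
    """Two-layer CNN from appendix B: 5x5 convolutions with 32 and 64
    kernels and stride 1, each followed by 2x2 max pooling, then a
    1024-unit ReLU layer, dropout with rate 0.5, and a softmax output."""
    init_w = tf.keras.initializers.TruncatedNormal(stddev=0.1)
    init_b = tf.keras.initializers.Constant(0.1)
    conv = dict(kernel_size=5, strides=1, padding="same", activation="relu",
                kernel_initializer=init_w, bias_initializer=init_b)
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, **conv),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(64, **conv),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="relu",
                              kernel_initializer=init_w, bias_initializer=init_b),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax",
                              kernel_initializer=init_w, bias_initializer=init_b),
    ])

model = build_cnn()
# Adam with a constant learning rate, as described above.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

On 28x28 MNIST inputs the two pooling stages reduce the feature maps to 7x7x64 before the fully connected layer.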