Deep Gaussian Processes with Importance-Weighted Variational Inference

05/14/2019
by   Hugh Salimbeni, et al.

Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modelling real-world data, and can be generated from the DGP by incorporating uncorrelated variables into the model. Previous work on DGP models has introduced noise additively and used variational inference with a combination of sparse Gaussian processes and mean-field Gaussians for the approximate posterior. Additive noise attenuates the signal, and the Gaussian form of the variational distribution may lead to an inaccurate posterior. We instead incorporate noisy variables as latent covariates, and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. Our results demonstrate that the importance-weighted objective works well in practice and consistently outperforms classical variational inference, especially for deeper models.


1 Introduction

Gaussian processes are powerful and popular models with widespread use, but the joint Gaussian assumption on the latent function values can be limiting in many situations. This is for at least two reasons: firstly, not all prior knowledge can be expressed solely in terms of a mean and covariance, and secondly, Gaussian marginals are not sufficient for many applications. The deep Gaussian process (DGP) (Damianou and Lawrence, 2013) can potentially overcome both of these limitations.

We consider the very general problem of conditional density estimation. A Gaussian process model with a Gaussian likelihood is a poor model unless the true data marginals are Gaussian. Even if the marginals are Gaussian, a Gaussian process can still be a poor model if the mapping is not well captured by a known mean and covariance. For example, changes of lengthscale at unknown locations cannot be captured with a fixed covariance function (Neal, 1998).

Figure 1: Posterior density from a two-layer model, illustrating non-Gaussian marginals and non-smooth input dependence.

As an example, consider a 1D density estimation task formed from the letters ‘DGP’, where the inputs are the horizontal coordinates and the outputs are the vertical coordinates of points placed uniformly within the glyphs. The marginals are multimodal and piece-wise constant, and also change non-smoothly in some regions of the space (for example, changing discontinuously from a bi- to a tri-modal density mid-way through the ‘G’). Figure 1 shows the posterior of a two-layer DGP on this data.

In this work, we develop the DGP model by revisiting the original construction of Damianou and Lawrence (2013), which includes variables that are a priori independent for each data point. Independent variables are important for modelling non-Gaussian marginals, as otherwise function values with the same input will be perfectly correlated. Differently from Damianou and Lawrence (2013), we introduce the independent variables to the model as latent covariates (that is, as additional inputs to the GP) (Wang and Neal, 2012) rather than additively as process noise. As we have a clean separation between variables that are correlated and variables that are independent, we can apply appropriate inference for each. For the correlated variables we use a sparse variational GP to represent the posterior (Titsias, 2009; Matthews et al., 2016). For the uncorrelated variables we use a mean-field approach. We first derive a straightforward combination of the stochastic variational approach of Salimbeni and Deisenroth (2017) with a mean-field Gaussian approximation for the latent variables. We then propose a novel importance-weighted scheme that improves upon the Gaussian approach and trades additional computation for accuracy while retaining analytic results for the GP parts.

Our results show that the deep Gaussian process with latent variables is an effective model for real data. We investigate a large number of datasets and demonstrate that highly non-Gaussian marginals occur in practice, and that they are not well modelled by the noise-free approach of Salimbeni and Deisenroth (2017). We also show that our importance-weighted scheme is always an improvement over variational inference, especially for the deeper models.

2 Model

Our DGP model is built from two components (or layers) that can be freely combined to create a family of models. The two types of layer have orthogonal uses: one models epistemic uncertainty (also known as model or reducible uncertainty, where the output depends deterministically on the input but there is uncertainty due to a lack of observations), and the other models aleatoric uncertainty (also known as irreducible uncertainty, where the output is inherently random). Both layers are Gaussian processes, but we use the term latent variable when the process has a white-noise covariance, and Gaussian process when the covariance has no noise component. Each layer takes an input and returns an output. The model is constructed by applying the layers sequentially to the input $x$ to obtain a density over the output $y$.

2.1 Gaussian process (GP) layer

The Gaussian process layer defines a set of continuously indexed random variables, which are a priori jointly Gaussian, with mean and covariance that depend on the indices. We write $f$ for the full set of variables (or ‘function’) and $f(x)$ for the particular variable with index $x$. The notation $f \sim \mathcal{GP}(\mu, k)$ states that for any finite set of indices $\{x_i\}_{i=1}^N$, the variables $\{f(x_i)\}_{i=1}^N$ are jointly Gaussian, with $\mathbb{E}[f(x_i)] = \mu(x_i)$ and $\mathrm{cov}[f(x_i), f(x_j)] = k(x_i, x_j)$, where $\mu$ is a mean function and $k$ is a positive semi-definite covariance function. In this work, we always use ‘noise free’ kernels, which satisfy $\lim_{x' \to x} k(x, x') = k(x, x)$ (that is, kernels with no white-noise component).

The GP prior is defined for all inputs, not just the ones associated with data. When we notate the GP function in a graphical model (see Figure 2), all the variables appear together as a single node $f$. We include an infinity symbol in the graphical model to indicate that the GP node represents an infinite collection of random variables, one for every possible input.

2.2 Latent-variable (LV) layer

The latent-variable layer introduces variables to the model that are independent for each data point. As we wish to interpret these variables as unobserved covariates, we introduce them through concatenation rather than through addition as in previous work (see, for example, Dai et al., 2015; Bui et al., 2016; Damianou and Lawrence, 2013). We use the notation $[x, w]$ to denote the concatenation of $x$ with $w$. Throughout this work, each component of $w$ is distributed as $\mathcal{N}(0, 1)$. For an input $x$, the output of the latent-variable layer is $[x, w]$.

2.3 Example model: LV-GP-GP

For the purpose of demonstrating the details of inference in the next section, we focus on a particular model. The model has an initial layer of latent variables followed by two Gaussian process layers. We refer to this model as ‘LV-GP-GP’. The graphical model indicating the a priori statistical relationship between the variables is shown in Figure 2.

Figure 2: Graphical model for LV-GP-GP

Writing $f^1$ and $f^2$ for the functions of the two GP layers, the likelihood of this model is $p(y_n \mid f^1, f^2, w_n; x_n)$, with

$p(y_n \mid f^1, f^2, w_n; x_n) = \mathcal{N}\big(y_n \mid f^2(f^1([x_n, w_n])),\, \sigma^2\big).$  (1)

The priors are

$w_n \sim \mathcal{N}(0, 1), \qquad f^1 \sim \mathcal{GP}(\mu_1, k_1), \qquad f^2 \sim \mathcal{GP}(\mu_2, k_2).$
The model is appropriate for continuously valued data, and can model heavy and asymmetric tails, as well as multimodality, as we later demonstrate. We discuss choices for the mean and covariance functions in section 2.6.

2.4 Model variants

The LV-GP-GP model can be extended by adding more layers, and by adding more latent variables. For example, we could insert an additional GP layer, giving the conditional mean $f^3(f^2(f^1([x_n, w_n])))$, where $f^3 \sim \mathcal{GP}(\mu_3, k_3)$. We refer to this model as LV-GP-GP-GP. Alternatively, we could place the latent variables between the GPs and use $f^2([f^1(x_n), w_n])$ for the conditional mean instead. We refer to this model as GP-LV-GP. Other models are defined analogously. For example, GP-GP is as above but with no latent variable. We refer to models without latent variables as ‘noise-free’.

Figure 7: A single sample from the prior for 4 illustrative models. The GP and GP-GP models have no latent variables, so their samples appear as deterministic functions. See the supplementary material for more examples.

Prior samples from models with one and two GP layers are shown in Figure 7 to illustrate their properties. All the GP layers have 1D outputs and use the RBF kernel with unit lengthscale. Further illustrations with additional samples and additional models are shown in the supplementary material, Figures 33-66. For the latent variable models, the number of GPs above the latent variable layer determines the complexity of the marginals, but the number of GPs in total determines the complexity of the functional mapping.
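As a concrete illustration of this construction, the following is a minimal numpy sketch (our own, not the paper's GPflow code) that draws one prior sample from a GP, an LV-GP and an LV-GP-GP layer stack with an RBF kernel of unit lengthscale; the jitter value and the omission of the mean functions of Section 2.6 are simplifications.

import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix between the rows of A (N x D) and B (M x D)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sample_gp(X, rng, jitter=1e-6):
    """One draw from a zero-mean GP with an RBF kernel, evaluated at the rows of X."""
    K = rbf(X, X) + jitter * np.eye(len(X))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(X))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)[:, None]                      # 1D inputs on a grid

f_gp = sample_gp(x, rng)                                  # GP: a deterministic-looking function of x

w = rng.standard_normal((len(x), 1))                      # LV layer: a fresh w_n ~ N(0, 1) per input
f_lvgp = sample_gp(np.concatenate([x, w], axis=1), rng)   # LV-GP: no longer deterministic in x

f_lvgpgp = sample_gp(f_lvgp[:, None], rng)                # LV-GP-GP: pass through a second GP layer
# (The linear mean functions of Section 2.6 are omitted here for brevity.)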

2.5 Multiple outputs

All variables in our model (including the model inputs and outputs) may be vector valued. The ‘multiple output GP’ is a GP with a covariance function defined between outputs, $k_{ij}(x, x')$, where $i, j$ index the output dimension and $x, x'$ index the input. In this notation, the independent-output approach of Damianou and Lawrence (2013) can be written as $k_{ij}(x, x') = k(x, x')\,\delta_{ij}$. For all our models, we consider a linear model of covariance between outputs, $k_{ij}(x, x') = (W W^\top)_{ij}\, k(x, x')$ (Alvarez et al., 2012). This is equivalent to assuming a Kronecker-structured covariance between the outputs stacked into a single vector (i.e. with covariance $W W^\top \otimes k(X, X)$), and is also equivalent to multiplying independent GPs stacked into a vector by the matrix $W$.
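The equivalence stated above can be checked directly. The following numpy sketch (our own notation; W, the numbers of GPs and outputs, and the sample count are arbitrary choices) multiplies independent GP draws by a matrix W and compares the empirical covariance of the stacked outputs with the Kronecker form W Wᵀ ⊗ K.

import numpy as np

rng = np.random.default_rng(1)
N, L, P = 50, 5, 3                         # N inputs, L independent GPs, P outputs
X = rng.standard_normal((N, 1))
W = rng.standard_normal((P, L))            # linear mixing matrix

def rbf(A, B, ls=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

K = rbf(X, X) + 1e-6 * np.eye(N)
chol = np.linalg.cholesky(K)

S = 20000
eps = rng.standard_normal((S, N, L))
G = np.einsum('nm,sml->snl', chol, eps)    # S independent draws of the L GPs
F = G @ W.T                                # S draws of the P correlated outputs

# Stack the outputs into a single vector and compare the empirical covariance
# with the Kronecker-structured covariance (W W^T) kron K.
vecF = F.transpose(0, 2, 1).reshape(S, P * N)
emp_cov = np.cov(vecF, rowvar=False)
print(np.abs(emp_cov - np.kron(W @ W.T, K)).max())   # close, up to Monte Carlo error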

2.6 Mean and covariance functions

The DGP model suffers from a pathology if used with zero-mean GPs at each layer (Duvenaud et al., 2014; Dunlop et al., 2018). To remedy this problem, Duvenaud et al. (2014) propose concatenating the inputs with the outputs at each GP layer, whereas Salimbeni and Deisenroth (2017) use a linear mean function; we follow the latter approach. The inference we present in the next section is agnostic to the choice of covariance function, so we are free to make any choice. In our experiments, we use an RBF kernel for each layer, sharing the kernel over the different outputs.
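To make the pathology and the mean-function remedy concrete, here is a small numpy simulation (our own illustration, not code from the paper): it composes several zero-mean GP layers and, for comparison, the same composition with an identity (‘linear’) mean function added at each layer.

import numpy as np

def rbf(A, B, ls=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def sample_layer(H, rng, jitter=1e-6):
    """One draw of a zero-mean GP layer evaluated at the hidden values H."""
    K = rbf(H, H) + jitter * np.eye(len(H))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(H))

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 100)[:, None]

h_zero, h_lin = x[:, 0].copy(), x[:, 0].copy()
for _ in range(5):
    h_zero = sample_layer(h_zero[:, None], rng)          # zero-mean composition
    h_lin = h_lin + sample_layer(h_lin[:, None], rng)    # identity ('linear') mean function

# Plotting h_zero and h_lin against x illustrates the effect described above:
# the zero-mean composition tends to lose input dependence with depth,
# whereas the mean-function version does not (cf. Duvenaud et al., 2014).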

3 Inference in the LV-GP-GP model

In this section, we present two approaches for approximate inference: the first with variational inference and the second with importance-weighted variational inference. Both schemes can optionally amortize the optimization of the local variational parameters (often referred to as ‘auto-encoding’), are scalable through data sub-sampling, and can exploit natural gradients for the variational parameters of the final layer. While the variational approach is a straightforward extension of the doubly-stochastic method of Salimbeni and Deisenroth (2017), the importance-weighted approach is more subtle and requires a careful choice of variational distribution to retain the analytic results for the final layer.

3.1 Variational inference

Variational inference seeks an approximate posterior that is close to the true posterior in terms of KL divergence. The approximate posterior is typically restricted to some tractable family of distributions, and an optimization problem is formed by minimizing the KL divergence from the approximate posterior to the true posterior. Equivalently, the same objective can be obtained by applying Jensen’s inequality to an importance-weighted expression for the marginal likelihood (Domke and Sheldon, 2018). For the approximate posterior, we follow Damianou and Lawrence (2013) and use a mean-field Gaussian distribution $q(w_n)$ for the latent variables, with a per-datapoint mean and variance, and independent processes $q(f^1)$ and $q(f^2)$ for the functions. The posterior density then has the same structure as the prior: $q(f^1)\,q(f^2)\,\prod_n q(w_n)$. (We abuse notation and write a process as if it has a density; the derivation can be made rigorous as the densities only appear as expectations of ratios. See Matthews et al. (2016) for details.) We begin by writing the (exact) marginal likelihood as

$p(\mathbf{y}) = \mathbb{E}_{q(f^1)\,q(f^2)\,\prod_n q(w_n)}\!\left[\frac{p(f^1)\,p(f^2)}{q(f^1)\,q(f^2)} \prod_n \frac{p(y_n \mid f^1, f^2, w_n)\, p(w_n)}{q(w_n)}\right],$  (2)

where the expectations are taken with respect to the variational distributions. Applying Jensen’s inequality to the logarithm of both sides of (2), we obtain

$\log p(\mathbf{y}) \geq \sum_n \mathcal{L}_n - \sum_n \mathrm{KL}[q(w_n)\,\|\,p(w_n)] - \mathrm{KL}[q(f^1)\,\|\,p(f^1)] - \mathrm{KL}[q(f^2)\,\|\,p(f^2)],$  (3)

where we have used the short-hand $\mathbb{E}_q$ for $\mathbb{E}_{q(f^1)\,q(f^2)\,q(w_n)}$, and $\mathcal{L}_n$ is given by

$\mathcal{L}_n = \mathbb{E}_q\big[\log p(y_n \mid f^1, f^2, w_n)\big].$  (4)

By considering a variational distribution over the entire functions $f^1$ and $f^2$ we avoid the difficulty of representing the indeterminately indexed inner-layer variables. We require only that $\mathrm{KL}[q(f^1)\,\|\,p(f^1)]$ and $\mathrm{KL}[q(f^2)\,\|\,p(f^2)]$ are finite, which is possible if we construct $q(f^1)$ and $q(f^2)$ as measures with respect to their priors. We follow Hensman et al. (2013) and form these posteriors by conditioning the prior on some finite set of ‘inducing’ points with a parameterized Gaussian distribution. We defer the details to the supplementary material (Section B) as the results are widely known (see, for example, Matthews et al., 2016; Cheng and Boots, 2016; Dutordoir et al., 2018). The priors and variational posteriors over $f^1$ and $f^2$ are independent (the coupling is only in the likelihood), so the results from the single-layer GP apply to each layer.

To evaluate $\mathcal{L}_n$ exactly is intractable except in the single-layer case (and then only with a certain kernel; see Titsias and Lawrence (2010) for details), so we rely on a Monte Carlo estimate. To obtain unbiased samples we first reparameterize the expectations over $w_n$ and over the inner-layer output $f^1([x_n, w_n])$, writing each as its mean plus its standard deviation multiplied by a standard normal variable. We then obtain an estimate of $\mathcal{L}_n$ by sampling these standard normal variables, which is the ‘reparameterization trick’ popularized by Rezende et al. (2014) and Kingma and Welling (2014). After these two expectations have been approximated with a Monte Carlo estimate, we can take the expectation over the final-layer output analytically, as the likelihood is Gaussian.
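For concreteness, the following sketch (our own, with placeholder numbers for the variational parameters and marginals) shows the two ingredients used here: the reparameterization of a Gaussian sample, and the closed-form expectation of a Gaussian log-likelihood under a Gaussian marginal for the final-layer output.

import numpy as np

def gauss_expected_log_lik(y, mean, var, noise_var):
    """E_{f ~ N(mean, var)}[ log N(y | f, noise_var) ], available in closed form."""
    return -0.5 * np.log(2 * np.pi * noise_var) - ((y - mean) ** 2 + var) / (2 * noise_var)

def reparam_sample(mean, std, rng):
    """'Reparameterization trick': the sample is a deterministic function of standard normal noise."""
    return mean + std * rng.standard_normal(np.shape(mean))

rng = np.random.default_rng(0)
w = reparam_sample(mean=0.1, std=0.5, rng=rng)     # one Monte Carlo sample of w_n from q(w_n)
h = reparam_sample(mean=1.3, std=0.2, rng=rng)     # one sample of the inner-layer output at [x_n, w]
m2, v2 = 0.9, 0.05                                  # final-layer marginal mean/variance at h (placeholders)
print(gauss_expected_log_lik(y=1.0, mean=m2, var=v2, noise_var=0.01))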

3.2 Importance-weighted variational inference

Jensen’s inequality is tighter if the quantity inside the expectation is concentrated about its mean (Domke and Sheldon, 2018). To decrease the variance inside (2) while preserving its value, we can replace each term involving $w_n$ with a sample average of $K$ terms,

$p(\mathbf{y}) = \mathbb{E}\!\left[\frac{p(f^1)\,p(f^2)}{q(f^1)\,q(f^2)} \prod_n \frac{1}{K}\sum_{k=1}^K \frac{p(y_n \mid f^1, f^2, w_{nk})\, p(w_{nk})}{q(w_{nk})}\right], \qquad w_{nk} \sim q(w_n).$  (5)

The expression inside the expectation in (5) is known as the importance-sampled estimate (with respect to the proposals $q(w_n)$), motivating the term ‘importance-weighted variational inference’ when we take the logarithm of both sides and form a lower bound using Jensen’s inequality. Applying Jensen’s inequality to (5), we have the lower bound

$\log p(\mathbf{y}) \geq \sum_n \mathcal{L}^{\mathrm{IW}}_n - \mathrm{KL}[q(f^1)\,\|\,p(f^1)] - \mathrm{KL}[q(f^2)\,\|\,p(f^2)],$  (6)

where the data-fit term is given by

$\mathcal{L}^{\mathrm{IW}}_n = \mathbb{E}\!\left[\log \frac{1}{K}\sum_{k=1}^K \frac{p(y_n \mid f^1, f^2, w_{nk})\, p(w_{nk})}{q(w_{nk})}\right].$  (7)

This approach provides a strictly tighter bound (Burda et al., 2016; Domke and Sheldon, 2018), but the expression for $\mathcal{L}^{\mathrm{IW}}_n$ is less convenient to estimate than $\mathcal{L}_n$, as we cannot apply the analytic result for the final-layer expectation due to the non-linearity of the logarithm. We must therefore resort to sampling for the expectation over the final layer as well as over $w_{nk}$ and $f^1$, incurring potentially high variance. This seems like an unacceptable price to pay, as we have not gained any additional flexibility over the final-layer approximation $q(f^2)$. In the next section, we show how we can retain the analytic expectation for the final layer while exploiting importance-weighted sampling for the latent variables.
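For intuition, a naive Monte Carlo estimate of the data-fit term (7) can be written as follows (a hedged sketch: the deterministic mapping g stands in for a single draw of the composed GP layers, and the variational parameters and noise level are placeholders). The log-sum-exp trick keeps the average of the K importance ratios numerically stable.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def iw_data_term(y, x, K, rng, q_mean=0.0, q_std=1.0, noise_std=0.1):
    """One-sample Monte Carlo estimate of log (1/K) sum_k p(y|x,w_k) p(w_k) / q(w_k)."""
    g = lambda x, w: np.sin(x + w)                          # stand-in for a draw of f2(f1([x, w]))
    w = q_mean + q_std * rng.standard_normal(K)             # K samples from q(w_n)
    log_lik = norm.logpdf(y, loc=g(x, w), scale=noise_std)  # log p(y | x, w_k)
    log_w_ratio = norm.logpdf(w, 0.0, 1.0) - norm.logpdf(w, q_mean, q_std)  # log p(w_k)/q(w_k)
    return logsumexp(log_lik + log_w_ratio) - np.log(K)

rng = np.random.default_rng(0)
print(iw_data_term(y=0.8, x=0.5, K=5, rng=rng))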

3.3 Analytic final layer

The expression in (7) does not exploit the conjugacy of the Gaussian likelihood with the final layer. In this section, we present a novel two-stage approach to obtain a bound that has all the analytic properties of the variational bound (3), but with improved accuracy. Our aim is to obtain a modified version of (7) but with the expectation taken over the logarithm of the likelihood, since this expectation is tractable. We will show that we can find a bound that replaces the term $p(y_n \mid f^1, f^2, w_{nk})$ in (7) with $\exp\big(\mathbb{E}_{q(f^2)} \log p(y_n \mid f^1, f^2, w_{nk})\big)$, which we can compute analytically. It is worth noting that since the likelihood is Gaussian we could analytically integrate out the final layer, though doing so precludes the use of minibatches and incurs $\mathcal{O}(N^3)$ complexity. It is not surprising, then, that it is possible to approximately ‘collapse’ the bound over the final layer. Our approach is inspired by the partially collapsed approaches in Hensman et al. (2012). We begin by applying Jensen’s inequality to the expectation over $q(f^2)$ in (2):

$\log \mathbb{E}_{q(f^2)}\!\left[\frac{p(f^2)}{q(f^2)} \prod_n p(y_n \mid f^1, f^2, w_n)\right] \geq \sum_n \mathcal{G}_n - \mathrm{KL}[q(f^2)\,\|\,p(f^2)],$  (8)

where the data term is given by

$\mathcal{G}_n = \mathbb{E}_{q(f^2)} \log p(y_n \mid f^1, f^2, w_n).$

The expression for $\mathcal{G}_n$ is available in closed form as the conditional likelihood is Gaussian (see, for example, Titsias, 2009). Applying the exponential function to both sides of (8) gives the bound

$\mathbb{E}_{q(f^2)}\!\left[\frac{p(f^2)}{q(f^2)} \prod_n p(y_n \mid f^1, f^2, w_n)\right] \geq \exp\!\Big(\sum_n \mathcal{G}_n\Big)\, \exp\!\big(-\mathrm{KL}[q(f^2)\,\|\,p(f^2)]\big).$  (9)

Returning again to (2), the exact marginal likelihood can be expressed equivalently with $f^2$ marginalized as

$p(\mathbf{y}) = \mathbb{E}_{q(f^1)\prod_n q(w_n)}\!\left[\frac{p(f^1)}{q(f^1)} \prod_n \frac{p(w_n)}{q(w_n)}\; \mathbb{E}_{q(f^2)}\!\left[\frac{p(f^2)}{q(f^2)} \prod_n p(y_n \mid f^1, f^2, w_n)\right]\right].$  (10)

We can now use the bound on the inner expectation from (9) and substitute into (10) to obtain

$p(\mathbf{y}) \geq e^{-\mathrm{KL}[q(f^2)\|p(f^2)]}\; \mathbb{E}_{q(f^1)}\!\left[\frac{p(f^1)}{q(f^1)} \prod_n \mathbb{E}_{q(w_n)}\!\left[\frac{p(w_n)}{q(w_n)}\, e^{\mathcal{G}_n}\right]\right].$

Next we use Jensen’s inequality for the expectation over $q(f^1)$. After some rearranging the bound is given by

$\log p(\mathbf{y}) \geq \sum_n \mathbb{E}_{q(f^1)} \log \mathbb{E}_{q(w_n)}\!\left[\frac{p(w_n)}{q(w_n)}\, e^{\mathcal{G}_n}\right] - \mathrm{KL}[q(f^1)\,\|\,p(f^1)] - \mathrm{KL}[q(f^2)\,\|\,p(f^2)].$

To bound the remaining intractable term, we first reduce the variance of the quantity inside the expectation using the sample average as before, to tighten the subsequent use of Jensen’s inequality. For any $K$, the quantity $\mathbb{E}_{q(w_n)}\big[\frac{p(w_n)}{q(w_n)} e^{\mathcal{G}_n}\big]$ is (exactly) equal to

$\mathbb{E}\!\left[\frac{1}{K}\sum_{k=1}^K \frac{p(w_{nk})}{q(w_{nk})}\, e^{\mathcal{G}_{nk}}\right],$

where $w_{nk}$ are independent samples from $q(w_n)$ and $\mathcal{G}_{nk} = \mathbb{E}_{q(f^2)} \log p(y_n \mid f^1, f^2, w_{nk})$. We now make a final use of Jensen for this expectation, and the final objective is then given by

$\log p(\mathbf{y}) \geq \sum_n \mathbb{E}\!\left[\log \frac{1}{K}\sum_{k=1}^K \frac{p(w_{nk})}{q(w_{nk})}\, \exp(\mathcal{G}_{nk})\right] - \mathrm{KL}[q(f^1)\,\|\,p(f^1)] - \mathrm{KL}[q(f^2)\,\|\,p(f^2)].$  (11)

This bound can be evaluated using Monte Carlo sampling for $f^1$ and $w_{nk}$, both with the reparameterization described in Section 3.1.

The terms inside the sum over $k$ in (11) must be sampled with a single draw from $q(f^1)$, and not independently. This is not an insurmountable problem, however, as we can draw the $K$ samples jointly using the full covariance and reparameterize using its Cholesky factor at $\mathcal{O}(K^3)$ cost. Note that the decomposition over the data points is a consequence of our choice of proposal distributions and the factorization of the likelihood, so we do not incur $\mathcal{O}(N^3)$ complexity and can sample each term in the sum over $n$ independently.
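The following sketch (our own, using a zero-mean GP prior in place of the sparse variational posterior q(f¹)) illustrates the point about correlated samples: the K inner-layer values for one data point are drawn jointly from a single function draw via the Cholesky factor of the full K×K covariance.

import numpy as np

def rbf(A, B, ls=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def correlated_layer_samples(x_n, w_nk, rng, jitter=1e-6):
    """Jointly sample the layer at the K inputs [x_n, w_nk], k = 1..K, under one function draw.

    A GP prior is used here for illustration; in the model the mean and covariance
    would come from the sparse variational posterior over the layer instead.
    """
    Z = np.concatenate([np.repeat(x_n[None, :], len(w_nk), axis=0),
                        w_nk[:, None]], axis=1)            # K x (D+1) inputs sharing x_n
    K_full = rbf(Z, Z) + jitter * np.eye(len(w_nk))        # full K x K covariance
    L = np.linalg.cholesky(K_full)                         # O(K^3), with K small (e.g. 5)
    return L @ rng.standard_normal(len(w_nk))              # one correlated draw, not K independent ones

rng = np.random.default_rng(0)
print(correlated_layer_samples(np.array([0.3]), rng.standard_normal(5), rng))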

3.4 The posterior over the latent variables

Variational inference simultaneously finds a lower bound and an approximate posterior. The importance-weighted approach also minimizes a KL divergence to the posterior, but the approximate posterior is an implicit one over an augmented space. Samples from this posterior are obtained by first sampling $w_{n1}, \ldots, w_{nK}$ from $q(w_n)$ and then selecting one of the $K$ samples with probabilities proportional to the importance weights. We refer to Domke and Sheldon (2018) for details. In this work, we are not concerned with the posterior over the latent variables themselves, but rather with prediction at test points, where we sample from the prior.

3.5 Further inference details

We can amortize the optimization of the local variational parameters by making them parameterized functions of the data points $(x_n, y_n)$. As the bound is a sum over the data, we can use data subsampling. We can also use natural gradients for the variational parameters of the final layer, following Salimbeni et al. (2018) and Dutordoir et al. (2018). Our bound is modular in both the GP and LV layers, so it extends straightforwardly to the other model variants.

4 Results

Figure 12: Posteriors for 1D synthetic data. The models without latent variables cannot capture the bimodality or heteroscedasticity. See Figure 1 for the LV-GP-GP-GP model.

We use a density estimation task to establish the utility of the DGP models with and without latent variables. For the latent variable models we also compare our importance-weighted (IW) inference (11) against the variational inference (VI) approach (3). Our central interest is how well the models balance the complexity of the marginals with the complexity of the mapping from inputs to outputs.

4.1 1D example

To illustrate the inductive bias of our models we show a 1D example with a conditional density that is non-Gaussian, heteroscedastic and changes discontinuously in some regions of the space. We form a dataset from the letters ‘DGP’ by taking a fine grid of points inside the glyphs of a cropped image and using the horizontal coordinates as the input and the vertical coordinates as the output. Posteriors are shown in Figure 12 for four models (the same models as in Figure 7), with the data plotted in the first panel. The GP model obviously cannot model this data, but neither can the GP-GP, as the approximate posterior has no way to separate out points with the same input but different outputs. The LV-GP model can fit the data to some extent, and the LV-GP-GP model more closely captures the shapes of the letters.
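A dataset of this kind can be constructed along the following lines (a sketch using Pillow's default bitmap font; the paper's exact image, font and number of points are not specified here):

import numpy as np
from PIL import Image, ImageDraw, ImageFont

# Rasterize the letters and keep the coordinates of the pixels inside the glyphs.
img = Image.new("L", (300, 120), color=0)
ImageDraw.Draw(img).text((10, 10), "DGP", fill=255, font=ImageFont.load_default())
mask = np.array(img) > 0
rows, cols = np.nonzero(mask)

# Horizontal coordinate -> input, vertical coordinate -> output (flipped so 'up' is positive),
# both rescaled to zero mean and unit standard deviation as in the experiments.
X = (cols - cols.mean()) / cols.std()
Y = -(rows - rows.mean()) / rows.std()
print(X.shape, Y.shape)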

We see that the LV-GP tends towards smoother densities than the LV-GP-GP, and also a smoother response as a function of the input. Between the ‘G’ and ‘P’ the LV-GP-GP model extrapolates with high confidence, whereas the LV-GP model maintains a bi-modal distribution connecting the two letters. Posteriors from further models are shown in the supplementary material.

4.2 UCI datasets

GP GP-GP GP-GP-GP LV-GP LV-GP-GP LV-GP-GP-GP
Dataset N D VI VI VI VI IWVI VI IWVI VI IWVI
challenger 23 4 -3.22 (0.07) -2.04 (0.02) -2.10 (0.03) -0.67 (0.01) -0.68 (0.00) -9.99 (3.60) -9.47 (2.19) -1.21 (0.01) -2.45 (0.04)
fertility 100 9 -2.51 (0.13) -1.43 (0.01) -1.40 (0.01) -1.41 (0.01) -1.40 (0.01) -1.42 (0.01) -1.69 (0.01) -1.32 (0.00) -1.31 (0.01)
concreteslump 103 7 1.91 (0.01) 1.55 (0.00) 1.45 (0.00) 1.93 (0.00) 1.94 (0.00) 1.53 (0.00) 1.75 (0.00) 1.43 (0.00) 1.62 (0.00)
autos 159 25 -0.36 (0.00) -0.14 (0.01) -0.14 (0.03) -0.36 (0.01) -0.35 (0.00) -0.06 (0.04) -0.34 (0.01) -0.25 (0.05) -0.33 (0.02)
servo 167 4 -0.17 (0.00) -0.19 (0.01) -0.17 (0.01) -0.07 (0.01) -0.08 (0.01) -0.19 (0.01) -0.07 (0.01) -0.25 (0.01) 0.03 (0.03)
breastcancer 194 33 -1.32 (0.00) -1.36 (0.00) -1.34 (0.00) -1.31 (0.00) -1.31 (0.00) -1.43 (0.01) -1.59 (0.10) -1.38 (0.00) -1.80 (0.19)
machine 209 7 -0.70 (0.01) -0.65 (0.02) -0.61 (0.01) -0.75 (0.02) -0.71 (0.02) -0.63 (0.01) -0.51 (0.01) -0.59 (0.01) -0.52 (0.03)
yacht 308 6 1.32 (0.02) 1.84 (0.01) 2.00 (0.06) 1.64 (0.00) 1.65 (0.00) 1.69 (0.09) 1.71 (0.01) 1.77 (0.07) 2.05 (0.02)
autompg 392 7 -0.44 (0.01) -0.57 (0.02) -0.48 (0.03) -0.32 (0.01) -0.33 (0.01) -0.38 (0.04) -0.24 (0.01) -0.65 (0.11) -0.28 (0.03)
boston 506 13 0.02 (0.00) 0.07 (0.00) 0.03 (0.01) -0.04 (0.00) -0.07 (0.00) 0.06 (0.00) 0.14 (0.00) -0.04 (0.01) 0.17 (0.00)
forest 517 12 -1.38 (0.00) -1.37 (0.00) -1.37 (0.00) -0.99 (0.00) -1.00 (0.00) -0.91 (0.00) -1.02 (0.02) -0.92 (0.01) -0.91 (0.01)
stock 536 11 -0.22 (0.00) -0.29 (0.00) -0.26 (0.00) -0.22 (0.00) -0.22 (0.00) -0.29 (0.00) -0.19 (0.00) -0.23 (0.01) -0.36 (0.02)
pendulum 630 9 -0.14 (0.00) 0.22 (0.01) 0.12 (0.08) -0.14 (0.00) -0.14 (0.00) 0.22 (0.01) 0.33 (0.08) 0.25 (0.00) 0.20 (0.03)
energy 768 8 1.71 (0.00) 1.85 (0.00) 2.07 (0.01) 1.86 (0.01) 1.92 (0.01) 2.00 (0.04) 1.96 (0.01) 2.07 (0.01) 2.28 (0.01)
concrete 1030 8 -0.43 (0.00) -0.45 (0.00) -0.35 (0.02) -0.32 (0.00) -0.32 (0.00) -0.35 (0.00) -0.21 (0.00) -0.22 (0.01) -0.12 (0.00)
solar 1066 10 -1.75 (0.08) -1.21 (0.02) -1.20 (0.03) 0.04 (0.07) 0.07 (0.01) 0.54 (0.01) 0.22 (0.01) 0.54 (0.01) 0.20 (0.02)
airfoil 1503 5 -0.79 (0.05) 0.08 (0.02) 0.14 (0.03) -0.44 (0.03) -0.36 (0.01) 0.07 (0.00) 0.30 (0.00) -0.02 (0.03) 0.34 (0.01)
winered 1599 11 -1.09 (0.00) -1.11 (0.00) -1.08 (0.00) -1.07 (0.00) -1.06 (0.00) -1.10 (0.00) -0.84 (0.01) -1.06 (0.03) -1.31 (0.40)
gas 2565 128 1.07 (0.00) 1.60 (0.05) 1.69 (0.05) 1.69 (0.08) 1.56 (0.06) 0.70 (0.10) 1.57 (0.16) 1.30 (0.14) 1.61 (0.12)
skillcraft 3338 19 -0.94 (0.00) -0.94 (0.00) -0.94 (0.00) -0.91 (0.00) -0.91 (0.00) -0.92 (0.00) -0.93 (0.00) -0.92 (0.00) -0.94 (0.00)
sml 4137 26 1.53 (0.00) 1.72 (0.01) 1.83 (0.01) 1.52 (0.00) 1.52 (0.00) 1.79 (0.00) 1.91 (0.00) 1.92 (0.01) 1.97 (0.05)
winewhite 4898 11 -1.14 (0.00) -1.14 (0.00) -1.14 (0.00) -1.13 (0.00) -1.13 (0.00) -1.13 (0.00) -1.10 (0.00) -1.13 (0.00) -1.09 (0.00)
parkinsons 5875 20 1.99 (0.00) 2.61 (0.01) 2.75 (0.02) 1.79 (0.02) 1.82 (0.02) 2.36 (0.02) 2.76 (0.02) 2.71 (0.07) 3.12 (0.05)
kin8nm 8192 8 -0.29 (0.00) -0.01 (0.00) -0.00 (0.00) -0.29 (0.00) -0.29 (0.00) -0.02 (0.00) -0.00 (0.00) -0.00 (0.00) 0.03 (0.00)
pumadyn32nm 8192 32 0.08 (0.00) 0.11 (0.01) 0.11 (0.00) -1.44 (0.00) -1.44 (0.00) 0.10 (0.01) -0.62 (0.25) 0.11 (0.00) 0.11 (0.00)
power 9568 4 -0.65 (0.04) -0.75 (0.03) -0.80 (0.02) -0.39 (0.04) -0.23 (0.02) -0.36 (0.04) -0.28 (0.05) -0.25 (0.06) -0.11 (0.06)
naval 11934 14 4.52 (0.02) 4.43 (0.03) 4.35 (0.03) 4.19 (0.01) 4.24 (0.01) 4.36 (0.01) 4.52 (0.02) 4.27 (0.02) 4.41 (0.02)
pol 15000 26 0.48 (0.00) 1.51 (0.01) 1.45 (0.01) 0.37 (0.08) -0.50 (0.00) 2.34 (0.02) 2.63 (0.01) 1.47 (0.01) 2.72 (0.10)
elevators 16599 18 -0.44 (0.00) -0.41 (0.01) -0.40 (0.00) -0.37 (0.00) -0.36 (0.00) -0.29 (0.00) -0.27 (0.00) -0.28 (0.00) -0.27 (0.00)
bike 17379 17 0.82 (0.01) 3.49 (0.01) 3.68 (0.01) 2.48 (0.01) 2.66 (0.01) 3.48 (0.00) 3.75 (0.01) 3.72 (0.01) 3.95 (0.01)
kin40k 40000 8 0.02 (0.00) 0.84 (0.00) 1.17 (0.00) 0.04 (0.00) 0.05 (0.00) 0.87 (0.01) 0.93 (0.00) 1.15 (0.00) 1.27 (0.00)
protein 45730 9 -1.06 (0.00) -0.98 (0.00) -0.95 (0.00) -0.85 (0.00) -0.80 (0.00) -0.70 (0.00) -0.61 (0.00) -0.68 (0.00) -0.57 (0.00)
tamielectric 45781 3 -1.43 (0.00) -1.43 (0.00) -1.43 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00)
keggdirected 48827 20 0.16 (0.01) 0.21 (0.01) 0.26 (0.02) 1.24 (0.06) 2.03 (0.01) 1.84 (0.05) 2.23 (0.01) 1.92 (0.02) 2.26 (0.01)
slice 53500 385 0.86 (0.00) 1.80 (0.00) 1.86 (0.00) 0.85 (0.00) 0.90 (0.00) 1.80 (0.00) 1.88 (0.00) 1.86 (0.00) 2.02 (0.00)
keggundirected 63608 27 0.06 (0.01) 0.06 (0.01) 0.07 (0.01) -75 (4) -4.21 (0.33) -64 (6) 2.37 (0.29) -116 (17) 2.98 (0.21)
3droad 434874 3 -0.79 (0.00) -0.61 (0.01) -0.58 (0.01) -0.71 (0.00) -0.65 (0.00) -0.69 (0.00) -0.55 (0.01) -0.61 (0.00) -0.48 (0.00)
song 515345 90 -1.19 (0.00) -1.18 (0.00) -1.18 (0.00) -1.15 (0.00) -1.14 (0.00) -1.13 (0.00) -1.10 (0.00) -1.12 (0.00) -1.07 (0.00)
buzz 583250 77 -0.24 (0.00) -0.15 (0.00) -0.15 (0.00) -0.22 (0.01) -0.23 (0.01) -0.09 (0.01) -0.04 (0.03) -0.09 (0.01) 0.04 (0.00)
nytaxi 1420068 8 -1.42 (0.01) -1.44 (0.01) -1.42 (0.01) -1.09 (0.03) -0.90 (0.04) -1.57 (0.04) -1.03 (0.04) -1.74 (0.05) -0.95 (0.07)
houseelectric 2049280 11 1.37 (0.00) 1.41 (0.00) 1.47 (0.01) 1.65 (0.03) 1.82 (0.01) 1.66 (0.05) 2.01 (0.00) 1.77 (0.06) 2.03 (0.00)
Median difference from GP baseline 0 0.06 0.09 0.04 0.05 0.12 0.26 0.18 0.32
Average ranks 2.72 (0.14) 4.01 (0.14) 4.84 (0.14) 4.19 (0.17) 4.60 (0.16) 4.84 (0.14) 6.69 (0.15) 5.62 (0.16) 7.49 (0.16)
Table 1: Average test log-likelihood results for five splits (standard error). Bold font indicates overlapping error bars with the best-performing method. Italics indicate a significantly non-Gaussian posterior, computed by the Shapiro–Wilk test (see Table 3). Colours indicate datasets demonstrating prominent examples of complex marginals, mappings, or both.

We use 41 publicly available datasets with 1D targets (the full datasets with the splits and pre-processing can be found at github.com/hughsalimbeni/bayesian_benchmarks). The datasets range in size from 23 points to over 2 million. In each case we reserve 10% of the data for evaluating a test log-likelihood, repeating the experiment five times with different splits. We use five samples for the importance-weighted models, 128 inducing points, and five GP outputs for the inner layers. Hyperparameters and initializations are the same for all models and datasets and are fully detailed in the supplementary material. Results for test log-likelihood are reported in Table 1 for the GP models. Table 2 in the supplementary material shows additional results for conditional VAE models (Sohn et al., 2015), and results from deep Gaussian process models using stochastic gradient Hamiltonian Monte Carlo (Havasi et al., 2018). To assess the non-Gaussianity of the predictive distribution we compute the Shapiro–Wilk test statistic on the test-point marginals. The test statistics are shown in Table 3 in the supplementary material. Full code to reproduce our results is available at https://github.com/hughsalimbeni/DGPs_with_IWVI. We draw five conclusions from the results.

1) Latent variables improve performance. For a given depth of Gaussian process mappings (for example GP-GP compared to LV-GP-GP) the latent variable model generally has equal or better performance, and often considerably better. This is true both using IW and VI.

2) IW outperforms VI. The difference is more pronounced for the deeper models, which is to be expected as the deeper models are likely to have more complicated posteriors, so the VI approximation is likely to be more severe. There are also a few datasets (‘keggundirected’, ‘servo’ and ‘autompg’) where the VI latent variable model has very poor performance, but the IW approach performs well. This suggests that the IW approach might be less prone to local optima than VI.

3) Some datasets require latent variables for any improvement over the single-layer model. For several datasets (for example ‘forest’ and ‘power’) the inclusion of latent variables makes a very pronounced difference, whereas depth alone cannot improve over the single-layer model. These datasets are highlighted in blue. For all these datasets the marginals of the latent variable models are non-Gaussian (see Table 3 in the supplementary material). Conversely, for some datasets (e.g., ‘kin40k’ and ‘sml’) we observe a much greater benefit from the deep noise-free models than from the single-layer model with latent variables. These datasets are highlighted in orange.

4) Some datasets benefit from both depth and latent variables. For the ‘bike’ and ‘keggdirected’ data, for example, we see that both the LV-GP and GP-GP model improve over the single layer model, suggesting that complex input dependence and non-Gaussian marginals are beneficial for modelling the data. Four examples of these datasets are highlighted in green. For these datasets the marginals are non-Gaussian for the latent variable models, and in each case the LV-GP-GP-GP model is the best performing.

5) On average, the LV-GP-GP-GP model is best performing. This indicates that the inductive bias of this model is suitable over the broad range of datasets considered. We also compare to Conditional VAE models (Sohn et al., 2015) (see Table 2 in the supplementary material) and find that these models overfit considerably for the smaller datasets and even on the larger datasets do not outperform our deep latent variable models. The marginals of the CVAE models are more Gaussian than our models (see Table 3 in the supplementary material), indicating that these models explain the data through the input mapping and not the complexity of the marginal densities. See Figures 67 and 68 in the supplementary material for the plots of posterior marginals.

5 Related Work

If the GP layers are replaced by neural networks, our models are equivalent to (conditional) variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) when used with the variational inference of Section 3.1, or to importance-weighted autoencoders (IWAEs) (Burda et al., 2016) when used with the importance-weighted inference of Section 3.3. Our model can be thought of as a VAE (or IWAE) but with a deep GP for the ‘encoder’ mapping. Compared to the conditional VAE, the deep GP approach has several advantages: it incorporates model (epistemic) uncertainty, is automatically regularized, and allows fine-grained control of the properties of the mapping. For example, the lengthscale corresponding to the latent variable can be tuned to favour complex marginals (a small value) or near-Gaussian marginals (a large value). Note that our IW inference is not a simple adaptation of Burda et al. (2016), as we have to perform additional inference over the GPs.

The LV-GP model was proposed by Wang and Neal (2012), and was then developed in Dutordoir et al. (2018) and used for meta learning in Sæmundsson et al. (2018). A version with discrete latent variables was presented in Bodin et al. (2017). The LV-GP without any inputs is known as the Gaussian process latent variable model (Lawrence, 2004); a semi-supervised variant, which is also equivalent to the LV-GP, was used in Damianou and Lawrence (2015).

The deep model of Damianou and Lawrence (2013) is closely related to our model but incorporates the latent variables additively rather than through concatenation (that is, $f(x_n) + w_n$ rather than $f([x_n, w_n])$). In principle, it would be possible to recover our model using the approach of Damianou and Lawrence (2013) in a certain setting of the hyperparameters, but in all previous work the kernel hyperparameters were tied within each layer, so this limit was not achievable. Bui et al. (2016) also used this model, with a form of expectation propagation for approximate inference.

The variational inference we have presented (without importance weighting) is not equivalent to that of Damianou and Lawrence (2013). In Damianou and Lawrence (2013) a mean-field variational posterior is used for the noisy corruptions, which may be a poor approximation as there are a priori correlations between these outputs. This mean-field assumption also forces independence between the inputs and outputs of the layers, whereas we make no such assumption.

6 Discussion

On a broad range of 1D density estimation tasks we find that our DGP with latent variables outperforms the single-layer and noise-free models, sometimes considerably. Closer investigation reveals that non-Gaussian marginals are readily found by our model, and that the importance-weighted objective improves performance in practice.

Conditional density estimation must balance the complexity of the density with the complexity of the input dependence. The inductive bias of our LV-GP-GP-GP model appears to be suitable for the broad range of datasets we have considered. An advantage of our approach is that the deep models all contain the shallower models as special cases. A layer can be ‘turned off’ by setting a single scalar hyperparameter (the kernel variance) to zero. This is a consequence of the ResNet-inspired (He et al., 2016) use of mean functions, and may explain why we observe empirically that adding depth rarely hurts performance. The latent variables can similarly be ‘turned off’ if the appropriate lengthscale is large, which may explain why the latent variable models are rarely outperformed by the noise-free models, even when the marginals are Gaussian.

There are very few hyperparameters in our model to optimize (the kernel parameters and the likelihood variance). This allows us to use the same model across a wide range of data. In the small and medium data regimes we have considered, an unregularized mapping (for example, a neural network as in the conditional VAE model (Sohn et al., 2015)) is likely to overfit, as indeed we have observed (see Table 2 in the supplementary material).

6.1 Limitations

As depth increases it becomes increasingly difficult to reason about the DGP priors. Our approach is unlikely to recover interpretable features, such as periodicity, and the approaches of Lloyd et al. (2014); Sun et al. (2018) may be more appropriate if an interpretable model is required. The latent variables are also difficult to interpret as their effect is coupled with the GP mappings.

We have only considered 1D latent variables and 1D outputs, and while the extension to higher dimensions is straightforward for training, difficulties may arise in evaluating posterior expectations as our model does not provide a closed-form predictive density and Monte Carlo sampling may have unacceptably high variance. The task of estimating test likelihoods with high-dimensional latent variables is challenging, and techniques described in Wu et al. (2016) may be necessary in higher dimensions.

The inference we have presented is limited by cubic scaling in both $K$ and the number of inducing points. The importance-weighting approach may also suffer from a vanishing signal-to-noise ratio in the gradients for the parameters of $q(w_n)$, as discussed in Rainforth et al. (2018). The doubly reparameterized gradient estimator of Tucker et al. (2019) could be used to alleviate this problem.

7 Conclusion

We have presented a novel inference scheme for the deep Gaussian process with latent variables, combining importance weighting with partially collapsed variational inference. We have also developed a variant of the deep Gaussian process model where uncorrelated variables are introduced as latent covariates rather than as process noise. We have shown empirically that deep models with latent variables outperform the noise-free deep GP on a range of data, and also that our importance-weighted inference delivers an advantage over variational inference in practice.

Acknowledgements

We gratefully acknowledge the contributions of anonymous reviewers, and thank Sanket Kamthe, Victor Picheny and Alan Saul for feedback on the manuscript.

References

  • M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. (2012) Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning. Cited by: §2.5.
  • E. Bodin, N. D. Campbell, and C. H. Ek (2017) Latent Gaussian Process Regression. arXiv:1707.05534. Cited by: §5.
  • T. Bui, D. Hernández-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner (2016) Deep gaussian processes for regression using approximate expectation propagation. International Conference on Machine Learning. Cited by: §2.2, §5.
  • Y. Burda, R. Grosse, and R. Salakhutdinov (2016) Importance weighted autoencoders. International Conference on Learning Representations. Cited by: §3.2, §5.
  • C. Cheng and B. Boots (2016) Incremental variational sparse Gaussian process regression. Advances in Neural Information Processing Systems. Cited by: §3.1.
  • Z. Dai, A. Damianou, J. González, and N. Lawrence (2015) Variational auto-encoded deep Gaussian processes. International Conference on Learning Representations. Cited by: §2.2.
  • A. Damianou and N. D. Lawrence (2015) Semi-described and Semi-supervised Learning with Gaussian Processes. arXiv:1509.01168. Cited by: §5.
  • A. Damianou and N. Lawrence (2013) Deep gaussian processes. Artificial Intelligence and Statistics. Cited by: §1, §1, §2.2, §2.5, §3.1, §5, §5.
  • J. Domke and D. Sheldon (2018) Importance weighting and variational inference. Advances in Neural Information Processing Systems. Cited by: §3.1, §3.2, §3.4.
  • M. M. Dunlop, M. Girolami, A. M. Stuart, and A. L. Teckentrup (2018) How Deep Are Deep Gaussian Processes?. Journal of Machine Learning Research. Cited by: §2.6.
  • V. Dutordoir, H. Salimbeni, M. P. Deisenroth, and J. Hensman (2018) Gaussian Process Conditional Density Estimation. Advances in Neural Information Processing Systems. Cited by: §3.1, §3.5, §5.
  • D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani (2014) Avoiding pathologies in very deep networks. Artificial Intelligence and Statistics. Cited by: §2.6.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. Artificial Intelligence and Statistics. Cited by: Appendix A, Appendix A.
  • M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes (2018) Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. Advances in Neural Information Processing Systems. Cited by: Table 2, §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. The IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §6.
  • A. Hebbal, L. Brevault, M. Balesdent, E. Talbi, and N. Melab (2019) Bayesian Optimization using Deep Gaussian Processes. arXiv:1905.03350. Cited by: Appendix A.
  • J. Hensman, N. Fusi, and N. D. Lawrence (2013) Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence. Cited by: Appendix B, Appendix B, §3.1.
  • J. Hensman, M. Rattray, and N. D. Lawrence (2012) Fast variational inference in the conjugate exponential family. Advances in neural information processing systems. Cited by: §3.3.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: Appendix A.
  • D. P. Kingma and M. Welling (2014) Auto-encoding Variational Bayes. International Conference for Learning Representations. Cited by: §3.1, §5.
  • N. D. Lawrence (2004) Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data. Advances in Neural Information Processing Systems. Cited by: §5.
  • J. R. Lloyd, D. K. Duvenaud, R. B. Grosse, J. B. Tenenbaum, and Z. Ghahramani (2014) Automatic construction and natural-language description of nonparametric regression models.. The AAAI Conference on Artificial Intelligence. Cited by: §6.1.
  • A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman (2017) GPflow: a Gaussian process library using TensorFlow. The Journal of Machine Learning Research. Cited by: Appendix A.
  • A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani (2016) On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. Artificial Intelligence and Statistics. Cited by: §1, §3.1, footnote 1.
  • R. M. Neal (1998) Regression and classification using Gaussian process priors. Bayesian Statistics. Cited by: §1.
  • T. Rainforth, A. R. Kosiorek, T. A. Le, C. J. Maddison, M. Igl, F. Wood, and Y. W. Teh (2018) Tighter variational bounds are not necessarily better. International Conference on Machine Learning. Cited by: §6.1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning. Cited by: §3.1, §5.
  • S. Sæmundsson, K. Hofmann, and M. P. Deisenroth (2018) Meta Reinforcement Learning with Latent Variable Gaussian Processes. Uncertainty in Artificial Intelligence. Cited by: §5.
  • H. Salimbeni and M. P. Deisenroth (2017) Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems. Cited by: Table 2, §1, §1, §2.6, §3.
  • H. Salimbeni, S. Eleftheriadis, and J. Hensman (2018) Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. Artificial Intelligence and Statistics. Cited by: §3.5.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, Cited by: §4.2, §4.2, §6.
  • S. Sun, G. Zhang, C. Wang, W. Zeng, J. Li, and R. Grosse (2018) Differentiable compositional kernel learning for gaussian processes. International Conference on Machine Learning. Cited by: §6.1.
  • M. Titsias and N. D. Lawrence (2010) Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence and Statistics. Cited by: §3.1.
  • M. Titsias (2009) Variational Learning of Inducing Variables in Sparse Gaussian Processes. Artificial Intelligence and Statistics. Cited by: Appendix B, §1, §3.3.
  • G. Tucker, D. Lawson, S. Gu, and C. J. Maddison (2019) Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives. International Conference on Learning Representations. Cited by: §6.1.
  • C. Wang and R. Neal (2012) Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals. arXiv:1212.6246. Cited by: §1, §5.
  • Y. Wu, Y. Burda, R. Salakhutdinov, and R. Grosse (2016) On the Quantitative Analysis of Decoder-based Generative Models. International Conference on Learning Representations. Cited by: §6.1.

Supplementary material

Appendix A Hyperparameters and training settings

For all our models we used the following settings in the UCI experiments:

Kernels. All kernels were RBF, with a lengthscale parameter per input dimension, initialized to the square root of the input dimension.

Inducing points. For data with more than 128 training points, the inducing point locations were chosen using kmeans2 from the scipy package, with 128 points. Otherwise they were set to the data points themselves. For latent variable layers, the GP layer above has extra input dimensions; the inducing points for these extra dimensions were padded with random draws from a standard normal variable.

Linear projections between layers. We implemented the linear mean functions and multioutput structure using a linear projection of 5 independent GPs concatenated with their inputs. We initialized the projection matrix to the first 5 principal components of the data concatenated with the identity matrix.

Amortization of the variational parameters. We used a three-layer fully connected network with skip connections between each layer and 10 units per inner layer. We used the weight initialization from [Glorot and Bengio, 2010] and the exponential function to enforce positivity of the standard deviation parameters. We added a bias of -5 in the final layer to ensure the standard deviations were small at the start of optimization.
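A minimal sketch of such an amortization network is given below (our own numpy illustration; the activation function, exact unit counts and skip-connection placement are assumptions rather than the paper's exact architecture). It maps a data point to the mean and standard deviation of q(w_n), using Glorot initialization, an exponential link for the standard deviation, and a bias of -5 on the log-standard-deviation output.

import numpy as np

def glorot(shape, rng):
    """Glorot/Xavier uniform initialization."""
    limit = np.sqrt(6.0 / sum(shape))
    return rng.uniform(-limit, limit, size=shape)

class AmortizedLatentPosterior:
    """Maps (x_n, y_n) to the mean and standard deviation of q(w_n)."""

    def __init__(self, d_in, d_hidden=10, n_layers=3, rng=None):
        rng = rng or np.random.default_rng(0)
        self.Ws = [glorot((d_in if i == 0 else d_hidden, d_hidden), rng)
                   for i in range(n_layers)]
        self.bs = [np.zeros(d_hidden) for _ in range(n_layers)]
        self.W_out = glorot((d_hidden, 2), rng)          # columns: [mean, log-std]
        self.b_out = np.array([0.0, -5.0])               # -5 bias keeps initial std small

    def __call__(self, xy):
        h = xy
        for i, (W, b) in enumerate(zip(self.Ws, self.bs)):
            a = np.tanh(h @ W + b)                       # tanh is an assumption here
            h = a if i == 0 else h + a                   # skip connections after the first layer
        out = h @ self.W_out + self.b_out
        return out[..., 0], np.exp(out[..., 1])          # mean, std (exp enforces positivity)

net = AmortizedLatentPosterior(d_in=3)                    # e.g. 2-D input x concatenated with 1-D y
mean, std = net(np.array([0.1, -0.4, 1.2]))
print(mean, std)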

Likelihood. The likelihood variance was initialized to 0.01.

Parameterizations. All positive model parameters were constrained using a clipped softplus function. The variational parameters for the sparse GP layers were parameterized by the mean and the square root of the covariance.

Optimization. The final GP layers were optimized using natural gradients on the natural parameters, with an initial step size of 0.01. All other parameters were optimized using the Adam optimizer [Kingma and Ba, 2015] with all parameters set to their TensorFlow default values except the initial step size of 0.005. Optimizing just the final layer using natural gradients and the inner layers with the Adam optimizer in this way is shown by Hebbal et al. [2019] to be an effective strategy in practice. We used a batch size of 512 and trained for 100K iterations, annealing the learning rate of both the Adam and natural gradient steps by a factor of 0.98 per 1000 iterations. The mean functions and kernel projection matrices (the matrices that correlate the outputs) were not optimized.

Importance weights. We used 5 importance weights for all the latent variables models with importance-weighted variational inference.

Predictions. We used 2000 samples for prediction, sampling from the prior for the latent variables. We then used a kernel density estimator (with Silverman's rule to set the bandwidth) to estimate the density.
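For example, the per-point predictive density can be estimated from samples as follows (a sketch with placeholder samples standing in for draws from the model):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Placeholder for 2000 predictive samples of y at one test input; in the paper these
# come from pushing prior draws of w through the (approximate posterior) layers.
samples = np.concatenate([rng.normal(-1.0, 0.3, 1000), rng.normal(1.5, 0.5, 1000)])

kde = gaussian_kde(samples, bw_method="silverman")   # Silverman's rule for the bandwidth
y_test = 1.2
print(np.log(kde(np.atleast_1d(y_test))[0]))         # per-point test log-likelihood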

Splits and preprocessing. We used 90% splits on the shuffled data, and rescaled the inputs and outputs to unit standard deviation and zero mean. For datasets with more than 10K test points we used a random sample of 10K test points (the same sample for each split). Note that we have reported the test log-likelihoods without restoring the scaling, to facilitate comparisons between datasets. The splits were the same for all models. We implemented our models using GPflow [Matthews et al., 2017].

Comparison of methods. The error bars in Table 1 are standard errors over the five splits. These error bars are likely to overestimate the uncertainty for comparisons between methods, however, as there may be correlations between the performance of methods over the splits (as the splits are the same for each method). The average ranks were computed for each split separately and then averaged over the splits for each method. To mitigate the effect of correlations between splits we report the median difference in test log-likelihood from the single-layer GP.

VAE models. The conditional VAE models used activations and weight initializations from [Glorot and Bengio, 2010]. Optimization was performed using the Adam optimizer, with the same batch size and learning rate schedule as the GP models.

Appendix B Variational posterior derivation

Here we derive the sparse GP posterior and its KL divergence from the prior, following Hensman et al. [2013]. This derivation applies to all the GP layers in the DGP model.

We begin with the approach of Titsias [2009] and form a variational distribution over the function $f$ by conditioning the prior on a set of inducing inputs $Z = \{z_m\}_{m=1}^M$. We write $\mathbf{u}$ for the vector with $m$th component $f(z_m)$. In Titsias [2009], $q(f)$ is defined as

$q(f) = \int p(f \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u},$  (12)

where $q(\mathbf{u})$ is unspecified. In the single-layer case, analytic results can be used to show that $q(\mathbf{u})$ has an optimal form that is Gaussian. Instead, we follow Hensman et al. [2013] and assume that

$q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S}).$  (13)

As this choice is conjugate to the prior conditional $p(f \mid \mathbf{u})$, we can use standard Gaussian identities to show that

$q(f(x)) = \mathcal{N}\big(f(x) \mid \tilde{\mu}(x),\, \tilde{\Sigma}(x, x)\big),$  (14)

where

$\tilde{\mu}(x) = \mathbf{k}_x^\top \mathbf{K}_{ZZ}^{-1} \mathbf{m},$  (15)
$\tilde{\Sigma}(x, x') = k(x, x') - \mathbf{k}_x^\top \mathbf{K}_{ZZ}^{-1} (\mathbf{K}_{ZZ} - \mathbf{S}) \mathbf{K}_{ZZ}^{-1} \mathbf{k}_{x'},$  (16)

with $[\mathbf{k}_x]_m = k(z_m, x)$ and $[\mathbf{K}_{ZZ}]_{mm'} = k(z_m, z_{m'})$. The KL divergence is given by

$\mathrm{KL}[q(f)\,\|\,p(f)] = \mathrm{KL}[q(\mathbf{u})\,\|\,p(\mathbf{u})] = \tfrac{1}{2}\Big(\mathrm{tr}(\mathbf{K}_{ZZ}^{-1}\mathbf{S}) + \mathbf{m}^\top \mathbf{K}_{ZZ}^{-1}\mathbf{m} - M + \log\tfrac{|\mathbf{K}_{ZZ}|}{|\mathbf{S}|}\Big),$  (17)–(20)

which is finite.

The parameters $\mathbf{m}$ and $\mathbf{S}$, together with the inducing inputs $Z$, are variational parameters.

For multiple outputs, we use the same matrix $W$ for correlating the outputs as in the prior. In practice, we implement the correlated-output model by multiplying independent GPs by the matrix $W$, so we use independent outputs for the variational distribution.
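A numpy sketch of these standard results (our own variable names; cf. (14)-(16) and the KL expression above) is:

import numpy as np

def rbf(A, B, ls=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def sparse_gp_posterior(Xtest, Z, m, S, jitter=1e-6):
    """Marginal mean and covariance of q(f) at Xtest for q(u) = N(m, S)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf(Xtest, Z)
    Kxx = rbf(Xtest, Xtest)
    A = np.linalg.solve(Kzz, Kxz.T)                  # Kzz^{-1} Kzx
    mean = Kxz @ np.linalg.solve(Kzz, m)
    cov = Kxx - A.T @ (Kzz - S) @ A
    return mean, cov

def kl_q_p(m, S, Kzz):
    """KL[ N(m, S) || N(0, Kzz) ] for the inducing outputs."""
    M = len(m)
    trace_term = np.trace(np.linalg.solve(Kzz, S))
    mahal = m @ np.linalg.solve(Kzz, m)
    _, logdet_K = np.linalg.slogdet(Kzz)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahal - M + logdet_K - logdet_S)

rng = np.random.default_rng(0)
Z = np.linspace(-2, 2, 8)[:, None]                   # inducing inputs
m = rng.standard_normal(8)                           # variational mean
L = 0.1 * np.tril(rng.standard_normal((8, 8))) + 0.3 * np.eye(8)
S = L @ L.T                                          # variational covariance (PSD by construction)
Xtest = np.linspace(-3, 3, 5)[:, None]
mean, cov = sparse_gp_posterior(Xtest, Z, m, S)
print(mean.shape, cov.shape, kl_q_p(m, S, rbf(Z, Z) + 1e-6 * np.eye(8)))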

Appendix C Further tables and figures

Figure 17: Posteriors for further models for the ‘DGP’ data.
Figure 33: Prior draws from one-, two- and three-layer models without latent variables.
Figure 39: Prior draws from the LV-GP model.
Figure 50: Prior draws from the LV-GP-GP and GP-LV-GP models.
Figure 66: Prior draws from the LV-GP-GP-GP, GP-LV-GP-GP and GP-GP-LV-GP models.
Linear CVAE 50 CVAE GP GP-GP GP-GP-GP GP GP-GP GP-GP-GP LV-GP LV-GP-GP LV-GP-GP-GP
Dataset N D VI VI VI VI VI VI SGHMC SGHMC SGHMC VI IWVI VI IWVI VI IWVI
challenger 23 4 -1.12 (0.02) -0.59 (0.03) -3.22 (0.07) -2.04 (0.02) -2.10 (0.03) -654 (337) -0.67 (0.01) -0.68 (0.00) -9.99 (3.60) -9.47 (2.19) -1.21 (0.01) -2.45 (0.04)
fertility 100 9 -1.32 (0.01) -2.51 (0.13) -1.43 (0.01) -1.40 (0.01) -2.19 (0.02) -1.67 (0.03) -1.35 (0.02) -1.41 (0.01) -1.40 (0.01) -1.42 (0.01) -1.69 (0.01) -1.32 (0.00) -1.31 (0.01)
concreteslump 103 7 0.71 (0.01) 1.91 (0.01) 1.55 (0.00) 1.45 (0.00) 0.59 (0.04) 0.10 (0.03) -0.14 (0.03) 1.93 (0.00) 1.94 (0.00) 1.53 (0.00) 1.75 (0.00) 1.43 (0.00) 1.62 (0.00)
autos 159 25 -0.43 (0.00) -0.36 (0.00) -0.14 (0.01) -0.14 (0.03) -10 (1) -15 (2) -23 (4) -0.36 (0.01) -0.35 (0.00) -0.06 (0.04) -0.34 (0.01) -0.25 (0.05) -0.33 (0.02)
servo 167 4 -0.86 (0.00) -385 (220) -0.17 (0.00) -0.19 (0.01) -0.17 (0.01) -1.22 (0.21) -27 (10) -326 (44) -0.07 (0.01) -0.08 (0.01) -0.19 (0.01) -0.07 (0.01) -0.25 (0.01) 0.03 (0.03)
breastcancer 194 33 -1.49 (0.01) -1.32 (0.00) -1.36 (0.00) -1.34 (0.00) -10 (1) -28 (10) -462 (100) -1.31 (0.00) -1.31 (0.00) -1.43 (0.01) -1.59 (0.10) -1.38 (0.00) -1.80 (0.19)
machine 209 7 -0.71 (0.02) -114 (78) -0.70 (0.01) -0.65 (0.02) -0.61 (0.01) -2.21 (0.28) -14 (4) -74 (13) -0.75 (0.02) -0.71 (0.02) -0.63 (0.01) -0.51 (0.01) -0.59 (0.01) -0.52 (0.03)
yacht 308 6 -0.14 (0.10) -7.91 (4.34) 1.32 (0.02) 1.84 (0.01) 2.00 (0.06) -0.26 (0.79) -9.70 (1.35) -16 (2) 1.64 (0.00) 1.65 (0.00) 1.69 (0.09) 1.71 (0.01) 1.77 (0.07) 2.05 (0.02)
autompg 392 7 -0.48 (0.00) -71 (9) -0.44 (0.01) -0.57 (0.02) -0.48 (0.03) -2.35 (0.22) -43 (4) -131 (26) -0.32 (0.01) -0.33 (0.01) -0.38 (0.04) -0.24 (0.01) -0.65 (0.11) -0.28 (0.03)
boston 506 13 -0.56 (0.00) 0.02 (0.00) 0.07 (0.00) 0.03 (0.01) 0.14 (0.00) -2.70 (0.70) -21 (5) -0.04 (0.00) -0.07 (0.00) 0.06 (0.00) 0.14 (0.00) -0.04 (0.01) 0.17 (0.00)
forest 517 12 -1.36 (0.00) -47 (15) -1.38 (0.00) -1.37 (0.00) -1.37 (0.00) -1.75 (0.02) -3.24 (0.48) -43 (5) -0.99 (0.00) -1.00 (0.00) -0.91 (0.00) -1.02 (0.02) -0.92 (0.01) -0.91 (0.01)
stock 536 11 -0.16 (0.00) -0.22 (0.00) -0.29 (0.00) -0.26 (0.00) -0.22 (0.01) -2.46 (0.41) -97 (13) -0.22 (0.00) -0.22 (0.00) -0.29 (0.00) -0.19 (0.00) -0.23 (0.01) -0.36 (0.02)
pendulum 630 9 -1.16 (0.00) -0.14 (0.00) 0.22 (0.01) 0.12 (0.08) -0.11 (0.01) 0.33 (0.06) -5.43 (1.52) -0.14 (0.00) -0.14 (0.00) 0.22 (0.01) 0.33 (0.08) 0.25 (0.00) 0.20 (0.03)
energy 768 8 -0.06 (0.00) -0.44 (0.31) 1.71 (0.00) 1.85 (0.00) 2.07 (0.01) 1.75 (0.00) 1.97 (0.01) 1.12 (0.33) 1.86 (0.01) 1.92 (0.01) 2.00 (0.04) 1.96 (0.01) 2.07 (0.01) 2.28 (0.01)
concrete 1030 8 -0.97 (0.00) -26 (10) -0.43 (0.00) -0.45 (0.00) -0.35 (0.02) -0.33 (0.01) -0.88 (0.38) -1.85 (1.01) -0.32 (0.00) -0.32 (0.00) -0.35 (0.00) -0.21 (0.00) -0.22 (0.01) -0.12 (0.00)
solar 1066 10 -1.54 (0.03) 0.71 (0.02) 0.26 (0.34) -1.75 (0.08) -1.21 (0.02) -1.20 (0.03) -1.56 (0.05) -1.58 (0.08) -1.65 (0.06) 0.04 (0.07) 0.07 (0.01) 0.54 (0.01) 0.22 (0.01) 0.54 (0.01) 0.20 (0.02)
airfoil 1503 5 -1.11 (0.00) -0.64 (0.32) -0.79 (0.05) 0.08 (0.02) 0.14 (0.03) -0.90 (0.05) -0.19 (0.20) 0.32 (0.09) -0.44 (0.03) -0.36 (0.01) 0.07 (0.00) 0.30 (0.00) -0.02 (0.03) 0.34 (0.01)
winered 1599 11 -1.14 (0.00) -9.64 (1.40) -1.86 (0.85) -1.09 (0.00) -1.11 (0.00) -1.08 (0.00) -1.11 (0.00) -1.09 (0.01) -1.63 (0.09) -1.07 (0.00) -1.06 (0.00) -1.10 (0.00) -0.84 (0.01) -1.06 (0.03) -1.31 (0.40)
gas 2565 128 0.23 (0.00) -241 (77) 1.07 (0.00) 1.60 (0.05) 1.69 (0.05) 0.93 (0.01) 1.20 (0.02) 1.46 (0.05) 1.69 (0.08) 1.56 (0.06) 0.70 (0.10) 1.57 (0.16) 1.30 (0.14) 1.61 (0.12)
skillcraft 3338 19 -0.95 (0.00) -5.80 (0.29) -0.94 (0.00) -0.94 (0.00) -0.94 (0.00) -0.95 (0.00) -0.99 (0.02) -1.10 (0.03) -0.91 (0.00) -0.91 (0.00) -0.92 (0.00) -0.93 (0.00) -0.92 (0.00) -0.94 (0.00)
sml 4137 26 0.32 (0.00) -3.79 (0.87) 1.53 (0.00) 1.72 (0.01) 1.83 (0.01) 1.46 (0.00) 1.71 (0.04) 1.93 (0.01) 1.52 (0.00) 1.52 (0.00) 1.79 (0.00) 1.91 (0.00) 1.92 (0.01) 1.97 (0.05)
winewhite 4898 11 -1.22 (0.00) -1.37 (0.04) -1.14 (0.00) -1.14 (0.00) -1.14 (0.00) -1.13 (0.00) -1.13 (0.00) -1.14 (0.01) -1.13 (0.00) -1.13 (0.00) -1.13 (0.00) -1.10 (0.00) -1.13 (0.00) -1.09 (0.00)
parkinsons 5875 20 -1.30 (0.00) -1.14 (0.56) 1.99 (0.00) 2.61 (0.01) 2.75 (0.02) 1.87 (0.02) 2.38 (0.05) 2.41 (0.04) 1.79 (0.02) 1.82 (0.02) 2.36 (0.02) 2.76 (0.02) 2.71 (0.07) 3.12 (0.05)
kin8nm 8192 8 -1.12 (0.00) -0.01 (0.01) -0.29 (0.00) -0.01 (0.00) -0.00 (0.00) -0.30 (0.00) -0.03 (0.01) -0.00 (0.00) -0.29 (0.00) -0.29 (0.00) -0.02 (0.00) -0.00 (0.00) -0.00 (0.00) 0.03 (0.00)
pumadyn32nm 8192 32 -1.44 (0.00) -2.10 (0.09) 0.08 (0.00) 0.11 (0.01) 0.11 (0.00) 0.11 (0.00) 0.11 (0.00) 0.10 (0.00) -1.44 (0.00) -1.44 (0.00) 0.10 (0.01) -0.62 (0.25) 0.11 (0.00) 0.11 (0.00)
power 9568 4 -0.43 (0.01) -0.44 (0.08) -2.14 (0.06) -0.65 (0.04) -0.75 (0.03) -0.80 (0.02) -0.58 (0.03) -1.41 (0.00) -1.41 (0.00) -0.39 (0.04) -0.23 (0.02) -0.36 (0.04) -0.28 (0.05) -0.25 (0.06) -0.11 (0.06)
naval 11934 14 -0.63 (0.00) 3.07 (0.08) 3.39 (0.12) 4.52 (0.02) 4.43 (0.03) 4.35 (0.03) 3.17 (0.03) -1.41 (0.00) -1.41 (0.00) 4.19 (0.01) 4.24 (0.01) 4.36 (0.01) 4.52 (0.02) 4.27 (0.02) 4.41 (0.02)
pol 15000 26 -1.10 (0.00) -0.05 (0.08) -411 (19) 0.48 (0.00) 1.51 (0.01) 1.45 (0.01) 0.46 (0.00) 1.32 (0.03) 1.31 (0.07) 0.37 (0.08) -0.50 (0.00) 2.34 (0.02) 2.63 (0.01) 1.47 (0.01) 2.72 (0.10)
elevators 16599 18 -0.74 (0.00) -0.28 (0.00) -33 (1) -0.44 (0.00) -0.41 (0.01) -0.40 (0.00) -0.46 (0.00) -0.40 (0.01) -0.39 (0.01) -0.37 (0.00) -0.36 (0.00) -0.29 (0.00) -0.27 (0.00) -0.28 (0.00) -0.27 (0.00)
bike 17379 17 -0.76 (0.00) 3.90 (0.01) 4.47 (0.02) 0.82 (0.01) 3.49 (0.01) 3.68 (0.01) 1.06 (0.01) 1.21 (0.95) 2.77 (0.10) 2.48 (0.01) 2.66 (0.01) 3.48 (0.00) 3.75 (0.01) 3.72 (0.01) 3.95 (0.01)
kin40k 40000 8 -1.43 (0.00) 0.65 (0.01) 1.24 (0.13) 0.02 (0.00) 0.84 (0.00) 1.17 (0.00) -0.02 (0.00) -1.43 (0.00) -1.43 (0.00) 0.04 (0.00) 0.05 (0.00) 0.87 (0.01) 0.93 (0.00) 1.15 (0.00) 1.27 (0.00)
protein 45730 9 -1.25 (0.00) -0.78 (0.02) -0.57 (0.00) -1.06 (0.00) -0.98 (0.00) -0.95 (0.00) -1.08 (0.00) -1.10 (0.07) -0.95 (0.00) -0.85 (0.00) -0.80 (0.00) -0.70 (0.00) -0.61 (0.00) -0.68 (0.00) -0.57 (0.00)
tamielectric 45781 3 -1.43 (0.00) -1.31 (0.00) -1.31 (0.00) -1.43 (0.00) -1.43 (0.00) -1.43 (0.00) -1.43 (0.00) -1.43 (0.00) -1.43 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00) -1.31 (0.00)
keggdirected 48827 20 -0.34 (0.15) 0.65 (0.32) 1.59 (0.14) 0.16 (0.01) 0.21 (0.01) 0.26 (0.02) 0.26 (0.02) 0.45 (0.02) 0.53 (0.04) 1.24 (0.06) 2.03 (0.01) 1.84 (0.05) 2.23 (0.01) 1.92 (0.02) 2.26 (0.01)
slice 53500 385 -0.47 (0.00) -27 (1) -358 (15) 0.86 (0.00) 1.80 (0.00) 1.86 (0.00) 1.02 (0.00) 1.84 (0.02) 1.93 (0.01) 0.85 (0.00) 0.90 (0.00) 1.80 (0.00) 1.88 (0.00) 1.86 (0.00) 2.02 (0.00)
keggundirected 63608 27 0.28 (0.00) -344 (39) 0.06 (0.01) 0.06 (0.01) 0.07 (0.01) -0.03 (0.02) -0.14 (0.23) 0.10 (0.01) -75 (4) -4.21 (0.33) -64 (6) 2.37 (0.29) -116 (17) 2.98 (0.21)
3droad 434874 3 -1.41 (0.00) -0.85 (0.00) -0.53 (0.01) -0.79 (0.00) -0.61 (0.01) -0.58 (0.01) -1.15 (0.06) -1.42 (0.00) -1.42 (0.00) -0.71 (0.00) -0.65 (0.00) -0.69 (0.00) -0.55 (0.01) -0.61 (0.00) -0.48 (0.00)
song 515345 90 -1.28 (0.00) -1.11 (0.00) -1.09 (0.00) -1.19 (0.00) -1.18 (0.00) -1.18 (0.00) -1.19 (0.00) -1.17 (0.00) -1.17 (0.00) -1.15 (0.00) -1.14 (0.00) -1.13 (0.00) -1.10 (0.00) -1.12 (0.00) -1.07 (0.00)
buzz 583250 77 -1.56 (0.04) 0.01 (0.00) 0.07 (0.00) -0.24 (0.00) -0.15 (0.00) -0.15 (0.00) -0.92 (0.02) -0.48 (0.01) -0.43 (0.03) -0.22 (0.01) -0.23 (0.01) -0.09 (0.01) -0.04 (0.03) -0.09 (0.01) 0.04 (0.00)
nytaxi 1420068 8 -1.53 (0.01) -1.64 (0.04) -1.90 (0.04) -1.42 (0.01) -1.44 (0.01) -1.42 (0.01) -1.46 (0.01) -1.78 (0.00) -1.77 (0.00) -1.09 (0.03) -0.90 (0.04) -1.57 (0.04) -1.03 (0.04) -1.74 (0.05) -0.95 (0.07)
houseelectric 2049280 11 -0.52 (0.00) 1.41 (0.00) 1.48 (0.01) 1.37 (0.00) 1.41 (0.00) 1.47 (0.01) 0.01 (0.04) -1.43 (0.00) -1.43 (0.00) 1.65 (0.03) 1.82 (0.01) 1.66 (0.05) 2.01 (0.00) 1.77 (0.06) 2.03 (0.00)
Median difference from GP baseline -0.38 -2.06 -6016.67 0 0.06 0.09 -0.02 -0.02 -0.28 0.04 0.05 0.12 0.26 0.18 0.32
Average ranks 4.99 (0.25) 5.32 (0.31) 4.35 (0.38) 6.80 (0.21) 8.65 (0.20) 9.63 (0.20) 5.57 (0.19) 5.57 (0.24) 5.21 (0.28) 8.94 (0.22) 9.36 (0.22) 9.82 (0.17) 12.04 (0.18) 10.71 (0.19) 13.03 (0.20)
Table 2: Test log-likelihood values over 5 splits (standard errors), including the CVAE models and models with stochastic gradient HMC (Havasi et al. [2018]). Large negative results are reported as rounded integers. The numbers after the CVAE models are the numbers of hidden units. The SGHMC results were run with models identical to the corresponding VI methods. Hyperparameter optimization was performed using a random sample from the last 100 iterations, as described in [Havasi et al., 2018]. Posterior samples were taken with 2000 samples, with a thinning factor of 5 (i.e. a chain of length 10000). We also included a single-layer GP using the SGHMC approach for comparison. The discrepancy between the variational approach and SGHMC is attributable to several factors (note that the variational approach is optimal in this case): different batch sizes (512 in this work compared to 10000 in Havasi et al. [2018]); different hyperparameter initializations (for example, a much longer initial lengthscale is used in Havasi et al. [2018]); different learning rate schedules (decaying learning rate in this work compared to fixed in Havasi et al. [2018]); the use of natural gradients in this work for the final layer, as opposed to optimizing with the Adam optimizer, as done in Havasi et al. [2018] and Salimbeni and Deisenroth [2017]; and insufficient mixing in the posterior chain.
Linear CVAE 50 CVAE GP GP-GP GP-GP-GP GP GP-GP GP-GP-GP LV-GP LV-GP-GP LV-GP-GP-GP
Dataset N D VI VI VI VI VI VI SGHMC SGHMC SGHMC VI IWVI VI IWVI VI IWVI
challenger 23 4 0.5191 0.9990 0.8870 0.7739 0.9823 0.9099 0.7001 0.7115 0.8466 0.8484 0.7187
fertility 100 9 0.9408 0.8353 0.9598 0.9341 0.9928 0.9969
concreteslump 103 7 0.9970 0.9922
autos 159 25 0.9823 0.8798 0.9903 0.9874 0.9990 0.9990
servo 167 4 0.8358 0.9989 0.9975 0.9990 0.9990 0.9988
breastcancer 194 33 0.9787 0.9814 0.9911 0.9943
machine 209 7 0.9181 0.8967 0.9974 0.9989
yacht 308 6 0.9987 0.9989 0.9990
autompg 392 7 0.9097 0.9948 0.9989 0.9987 0.9990 0.9988 0.9987 0.9985 0.9988
boston 506 13 0.9916 0.9989 0.9989 0.9979 0.9989 0.9980 0.9981
forest 517 12 0.8453 0.7145 0.9981 0.9964 0.7034 0.7074 0.7470 0.7398 0.7608 0.7642
stock 536 11
pendulum 630 9 0.9989 0.9988
energy 768 8 0.9987 0.9990
concrete 1030 8 0.9986 0.9782 0.9990 0.9989
solar 1066 10 0.3028 0.3228 0.3380 0.3509 0.2728 0.2945 0.2557 0.2883
airfoil 1503 5 0.9989 0.9986 0.9988 0.9988
winered 1599 11 0.9790 0.6980 0.9234 0.9751 0.8473
gas 2565 128 0.9984 0.9983 0.9935 0.9989 0.9983 0.9990 0.9986
skillcraft 3338 19 0.9989 0.9983 0.9989 0.9988 0.9990 0.9989
sml 4137 26
winewhite 4898 11 0.9990 0.9990
parkinsons 5875 20 0.9989
kin8nm 8192 8
pumadyn32nm 8192 32 0.9770
power 9568 4 0.9964 0.9963 0.9926 0.9964 0.9852 0.9963 0.9855
naval 11934 14 0.9649 0.9332
pol 15000 26 0.9970 0.8482 0.7214 0.9839 0.9508 0.9980
elevators 16599 18 0.9979 0.9986 0.9977 0.9987 0.9976 0.9988 0.9979
bike 17379 17 0.9815 0.9657 0.9337 0.9990 0.9990
kin40k 40000 8 0.9990
protein 45730 9 0.8839 0.8800 0.9117 0.9041 0.8904 0.8761 0.8840 0.8691
tamielectric 45781 3 0.9551 0.9502 0.9558 0.9562 0.9550 0.9549 0.9551 0.9549
keggdirected 48827 20 0.8279 0.5350 0.7498 0.5445 0.6839 0.5187 0.7039 0.5001
slice 53500 385 0.9990 0.9983 0.9988 0.9988
keggundirected 63608 27 0.4804 0.9882 0.5287 0.3281 0.5511 0.2049 0.9295 0.2322
3droad 434874 3 0.9421 0.9479 0.9432 0.9271 0.9650 0.9467 0.9744 0.9464
song 515345 90 0.9936 0.9957 0.9983 0.9932 0.9969 0.9893 0.9963 0.9835
buzz 583250 77 0.9952 0.9909 0.9988 0.9938 0.9984 0.9838
nytaxi 1420068 8 0.9608 0.9422 0.9698 0.9274 0.9515 0.9019 0.9467 0.8981
houseelectric 2049280 11 0.9830 0.9676 0.9962 0.9671 0.9933 0.9743
Table 3: Median Shapiro–Wilk test statistic for the test points. Blanks have values of one (perfectly Gaussian). Smaller values indicate more evidence for non-Gaussian marginal distributions.
Figure 67: Posterior marginals for the ‘forest’, ‘solar’, ‘power’ and ‘keggdirected’ datasets under the CVAE and latent-variable GP models, re-scaled to zero mean and unit standard deviation, with the Gaussian density marked as a dotted curve. Colors are according to the principal component of the inputs. These are the four datasets highlighted in the main text as prominent examples benefiting from latent variables. In these cases the additional depth of model leads to more non-Gaussian marginals.
Figure 68: As Figure 67, but for the ‘keggundirected’, ‘houseelectric’, ‘pol’ and ‘bike’ datasets, highlighted in the main text as benefiting from both latent variables and depth. Note that, unlike in Figure 67, for the ‘pol’ and ‘bike’ data the deeper models actually have simpler marginals. This is because the deep model has more complexity in the mapping, so it can more closely fit the data, whereas the simple model must explain the data with a complex noise distribution.