Deep Gaussian Processes with Importance-Weighted Variational Inference
Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modelling real-world data, and can be generated from the DGP by incorporating uncorrelated variables into the model. Previous work on DGP models has introduced noise additively and used variational inference with a combination of sparse Gaussian processes and mean-field Gaussians for the approximate posterior. Additive noise attenuates the signal, and the Gaussian form of the variational distribution may lead to an inaccurate posterior. We instead incorporate noisy variables as latent covariates, and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. Our results demonstrate that the importance-weighted objective works well in practice and consistently outperforms classical variational inference, especially for deeper models.
Gaussian processes are powerful and popular models with widespread use, but the joint Gaussian assumption on the latent function values can be limiting in many situations, for at least two reasons: firstly, not all prior knowledge can be expressed solely in terms of a mean and covariance, and secondly, Gaussian marginals are not sufficient for many applications. The deep Gaussian process (DGP) (Damianou and Lawrence, 2013) can potentially overcome both these limitations.
We consider the very general problem of conditional density estimation. A Gaussian process model with a Gaussian likelihood is a poor model unless the true data marginals are Gaussian. Even if the marginals are Gaussian, a Gaussian process can still be a poor model if the mapping is not well captured by a known mean and covariance. For example, changes of lengthscale at unknown locations cannot be captured with a fixed covariance function (Neal, 1998).

As an example, consider a 1D density estimation task formed from the letters ‘DGP’, where the inputs are the horizontal coordinates and the outputs are the vertical coordinates of points distributed uniformly in the glyphs. The marginals are multimodal and piece-wise constant, and also change non-smoothly in some regions of the space (for example, changing discontinuously from a bi- to tri-modal density mid-way through the ‘G’). Figure 1 shows the posterior of a two-layer DGP with this data.
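The construction of such a dataset can be sketched as follows, with a toy binary mask standing in for the rasterized ‘DGP’ glyphs (the mask here is hypothetical; the paper uses the actual letters):

```python
import numpy as np

# Toy stand-in for the 'DGP' image: a boolean mask whose True pixels play the
# role of points inside the glyphs.
mask = np.zeros((20, 30), dtype=bool)
mask[5:15, 2:6] = True      # a block standing in for one glyph
mask[8:12, 10:20] = True    # a block standing in for another

rows, cols = np.nonzero(mask)
x = cols / mask.shape[1]    # inputs: horizontal coordinates
y = rows / mask.shape[0]    # outputs: vertical coordinates
```

With the real glyphs, points at the same horizontal coordinate fall in several disconnected vertical intervals, which is exactly the multimodal conditional density described above.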
In this work, we develop the DGP model by revisiting the original construction of Damianou and Lawrence (2013), which includes variables that are a priori independent for each data point. Independent variables are important for modelling non-Gaussian marginals, as otherwise function values with the same input will be perfectly correlated. Differently from Damianou and Lawrence (2013), we introduce the independent variables to the model as latent covariates (that is, as additional inputs to the GP) (Wang and Neal, 2012) rather than additively as process noise. As we have a clean separation between variables that are correlated and variables that are independent, we can apply appropriate inference for each. For the correlated variables we use a sparse variational GP to represent the posterior (Titsias, 2009; Matthews et al., 2016). For the uncorrelated variables we use a mean-field approach. We first derive a straightforward combination of the stochastic variational approach of Salimbeni and Deisenroth (2017) with a mean-field Gaussian approximation for the latent variables. We then propose a novel importance-weighted scheme that improves upon the Gaussian approach and trades additional computation for accuracy while retaining analytic results for the GP parts.
Our results show that the deep Gaussian process with latent variables is an effective model for real data. We investigate a large number of datasets and demonstrate that highly non-Gaussian marginals occur in practice, and that they are not well modelled by the noise-free approach of Salimbeni and Deisenroth (2017). We also show that our importance-weighted scheme is always an improvement over variational inference, especially for the deeper models.
Our DGP model is built from two components (or layers) that can be freely combined to create a family of models. The two types of layers have orthogonal uses: one is for modelling epistemic uncertainty (also known as model or reducible uncertainty, when the output depends deterministically on the input but there is uncertainty due to lack of observations) and the other is for modelling aleatoric uncertainty (also known as irreducible uncertainty, when the output is inherently random). Both layers are Gaussian processes, but we use the term ‘latent variable’ when the process has the white-noise covariance, and ‘Gaussian process’ when the covariance has no noise component. Each layer takes an input and returns an output. The model is constructed by applying the layers sequentially to the input $x$ to get a density over an output $y$.

The Gaussian process layer defines a set of continuously indexed random variables, which are a priori jointly Gaussian, with mean and covariance that depend on the indices. We write $f$ for the full set of variables (or ‘function’) and $f(x)$ for the particular variable with index $x$. The notation $f \sim \mathcal{GP}(\mu, k)$ states that for any finite set $\{x_i\}_{i=1}^M$, the variables $\{f(x_i)\}_{i=1}^M$ are jointly Gaussian, with $\mathbb{E}[f(x_i)] = \mu(x_i)$ and $\operatorname{cov}[f(x_i), f(x_j)] = k(x_i, x_j)$, where $\mu$ is a mean function and $k$ is a positive semi-definite covariance function. In this work, we always use ‘noise-free’ kernels, i.e. kernels with no white-noise component. The GP prior is defined for all inputs, not just the ones associated with data. When we notate the GP function in a graphical model (see Figure 2), all the variables appear together as $f$. We include an infinity symbol in the graphical model to indicate that the GP node represents an infinite collection of random variables, one for every possible input.
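The finite-marginal property above can be illustrated numerically. The following is a minimal sketch (not the paper's code) of drawing one joint prior sample over a finite index set, assuming a zero mean function and an RBF kernel:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Noise-free RBF kernel: k(x, x') = variance * exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, kernel=rbf, jitter=1e-6, seed=0):
    """Draw one joint sample of {f(x_i)} for a finite index set x (zero mean)."""
    rng = np.random.default_rng(seed)
    K = kernel(x, x) + jitter * np.eye(len(x))  # jitter for numerical stability
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

x = np.linspace(0.0, 5.0, 100)
f = sample_gp_prior(x)
```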
The latent-variable layer introduces variables to the model that are independent for each data point. As we wish to interpret these variables as unobserved covariates, we introduce them through concatenation rather than through addition, as in previous work (see, for example, Dai et al., 2015; Bui et al., 2016; Damianou and Lawrence, 2013). We use the notation $[x, w]$ to denote the concatenation of $x$ with $w$. Throughout this work each component of $w$ is distributed as $\mathcal{N}(0, 1)$. For input $x_n$, the output of the latent variable layer is $[x_n, w_n]$.
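Concretely, the concatenation (as opposed to additive noise) can be sketched as:

```python
import numpy as np

def latent_variable_layer(x, rng):
    """Concatenate each input x_n with an independent latent draw w_n ~ N(0, 1)."""
    w = rng.standard_normal((x.shape[0], 1))
    return np.concatenate([x, w], axis=1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)[:, None]
xw = latent_variable_layer(x, rng)  # each row is [x_n, w_n]
```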
For the purpose of demonstrating the details of inference in the next section, we focus on a particular model. The model has an initial layer of latent variables followed by two Gaussian process layers. We refer to this model as ‘LV-GP-GP’. The graphical model indicating the a priori statistical relationship between the variables is shown in Figure 2.
Writing $\mathbf{y}$ and $\mathbf{w}$ for $\{y_n\}_{n=1}^N$ and $\{w_n\}_{n=1}^N$, the likelihood of this model is $p(\mathbf{y} \mid f^1, f^2, \mathbf{w}) = \prod_{n=1}^N p(y_n \mid f^1, f^2, w_n)$, with

$$p(y_n \mid f^1, f^2, w_n) = \mathcal{N}\big(y_n \,\big|\, f^2(f^1([x_n, w_n])),\, \sigma^2\big). \qquad (1)$$

The priors are $w_n \sim \mathcal{N}(0, 1)$, $f^1 \sim \mathcal{GP}(\mu_1, k_1)$ and $f^2 \sim \mathcal{GP}(\mu_2, k_2)$.
The model is appropriate for continuously valued data, and can model heavy and asymmetric tails, as well as multimodality, as we later demonstrate. We discuss choices for the mean and covariance functions in section 2.6.
The LV-GP-GP model can be extended by adding more layers, and by adding more latent variables. For example, we could insert an additional GP layer, so that the conditional mean becomes $f^3(f^2(f^1([x_n, w_n])))$, where $f^3 \sim \mathcal{GP}(\mu_3, k_3)$. We refer to this model as LV-GP-GP-GP. Alternatively we could place the latent variables between the GPs, and have $f^2([f^1(x_n), w_n])$ instead for the conditional mean. We refer to this model as GP-LV-GP. Other models are analogously defined. For example, GP-GP is as above but with no latent variable. We refer to models without latent variables as ‘noise-free’.
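To make the compositions concrete, here is a hedged sketch of drawing prior function values from the LV-GP-GP construction (zero-mean RBF layers for simplicity; the paper uses linear mean functions):

```python
import numpy as np

def rbf_gram(z, lengthscale=1.0, jitter=1e-6):
    """RBF Gram matrix over the rows of z, with jitter for stable Cholesky."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(z))

def sample_layer(z, rng):
    """One joint draw of a zero-mean GP layer evaluated at the rows of z."""
    return np.linalg.cholesky(rbf_gram(z)) @ rng.standard_normal(len(z))

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 80)[:, None]
w = rng.standard_normal((80, 1))            # LV layer: latent covariates
h = sample_layer(np.hstack([x, w]), rng)    # f1 evaluated at [x_n, w_n]
y = sample_layer(h[:, None], rng)           # f2 evaluated at f1's outputs
```

Because $w$ differs between points with the same $x$, the resulting marginals over $y$ are non-Gaussian even though each layer is a GP.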
Prior samples from models with one and two GP layers are shown in Figure 7 to illustrate their properties. All the GP layers have 1D outputs and use the RBF kernel with unit lengthscale. Further illustrations with additional samples and additional models are shown in the supplementary material, Figures 33-66. For the latent variable models, the number of GPs above the latent variable layer determines the complexity of the marginals, but the number of GPs in total determines the complexity of the functional mapping.
All variables in our model (including the model inputs and outputs) may be vector valued. The ‘multiple output GP’ is a GP with a covariance function $k_{dd'}(x, x')$ defined between outputs, where $d$ indexes the output dimension and $x$ indexes the input. In this notation, the independent output approach of Damianou and Lawrence (2013) can be written as $k_{dd'}(x, x') = k(x, x')\,\delta_{dd'}$. For all our models, we consider a linear model of covariance between outputs: $k_{dd'}(x, x') = (WW^\top)_{dd'}\, k(x, x')$ (Alvarez et al., 2012). This is equivalent to assuming a Kronecker structured covariance between the outputs stacked into a single vector (i.e. with covariance $WW^\top \otimes K$), and is also equivalent to multiplying independent GPs stacked into a vector by the matrix $W$.

The DGP model suffers from a pathology if used with zero mean GPs at each layer (Duvenaud et al., 2014; Dunlop et al., 2018). To remedy this problem, Duvenaud et al. (2014) propose concatenating the inputs with the outputs at each GP layer, whereas Salimbeni and Deisenroth (2017) use a linear mean function to address this issue. We follow the latter approach. The inference we present in the next section is agnostic to the covariance function, so we are free to make any choice. In our experiments, we use an RBF kernel for each layer, sharing the kernel over the different outputs.
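The Kronecker structure of the linear model of covariance between outputs can be sketched as follows (an illustration under the stated assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, E = 6, 3, 2                      # inputs, outputs, independent GPs
x = np.linspace(0.0, 1.0, N)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # shared input kernel
W = rng.standard_normal((D, E))        # mixing matrix

# Covariance of the stacked outputs [f_1(x); ...; f_D(x)] under the linear
# model k_{dd'}(x, x') = (W W^T)_{dd'} k(x, x') is the Kronecker product:
stacked_cov = np.kron(W @ W.T, K)
```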
In this section, we present two approaches for approximate inference: the first with variational inference and a second with importance-weighted variational inference. Both schemes can optionally amortize the optimization of the local variational parameters (often referred to as ‘auto-encoding’), are scalable through data sub-sampling, and can exploit natural gradients of the variational parameters for the final layer. While the variational approach is a straightforward extension of the doubly-stochastic method by Salimbeni and Deisenroth (2017), the importance-weighted approach is more subtle and requires a careful choice of variational distribution to retain the analytic results for the final layer.
Variational inference seeks an approximate posterior that is close to the true posterior in terms of KL divergence. The posterior is typically restricted to some tractable family of distributions, and an optimization problem is formed by minimizing the KL divergence from an approximate posterior to the true posterior. Equivalently, the same objective can be obtained by applying Jensen’s inequality to an importance-weighted expression for the marginal likelihood (Domke and Sheldon, 2018). For an approximate posterior, we follow Damianou and Lawrence (2013) and use a mean-field Gaussian distribution for the latent variables, $q(w_n) = \mathcal{N}(w_n \mid a_n, b_n)$ with $b_n > 0$, and independent processes for the functions. The posterior density (we abuse notation and write a process as if it has a density; the derivation can be made rigorous as the densities only appear as expectations of ratios, see Matthews et al. (2016) for details) then has the same structure as the prior: $q(f^1)\, q(f^2) \prod_n q(w_n)$. We begin by writing the (exact) marginal likelihood as

$$p(\mathbf{y}) = \mathbb{E}\left[\frac{p(\mathbf{y} \mid f^1, f^2, \mathbf{w})\, p(f^1)\, p(f^2)\, p(\mathbf{w})}{q(f^1)\, q(f^2)\, q(\mathbf{w})}\right], \qquad (2)$$
where the expectations are taken with respect to the variational distributions. Applying Jensen’s inequality to the logarithm of both sides of (2), we obtain

$$\log p(\mathbf{y}) \ge \sum_{n=1}^N \mathcal{L}_n - \mathrm{KL}[q(f^1) \| p(f^1)] - \mathrm{KL}[q(f^2) \| p(f^2)] - \sum_{n=1}^N \mathrm{KL}[q(w_n) \| p(w_n)], \qquad (3)$$

where we have used the short-hand $\mathrm{KL}[q \| p]$ for the Kullback–Leibler divergence, and $\mathcal{L}_n$ is given by

$$\mathcal{L}_n = \mathbb{E}_{q(f^1)\, q(f^2)\, q(w_n)}\big[\log p(y_n \mid f^1, f^2, w_n)\big]. \qquad (4)$$
By considering a variational distribution over the entire functions $f^1$ and $f^2$ we avoid the difficulty of representing the indeterminately indexed inner-layer variables. We require only that $\mathrm{KL}[q(f^1) \| p(f^1)]$ and $\mathrm{KL}[q(f^2) \| p(f^2)]$ are finite, which is possible if we construct $q(f^1)$ and $q(f^2)$ as measures with densities with respect to their priors. We follow Hensman et al. (2013) and form these posteriors by conditioning the prior on some finite set of ‘inducing’ points with a parameterized Gaussian distribution. We defer the details to the supplementary material (Section B) as the results are widely known (see, for example, Matthews et al., 2016; Cheng and Boots, 2016; Dutordoir et al., 2018). The priors and variational posteriors over $f^1$ and $f^2$ are independent (the coupling is only in the likelihood), so the results from the single-layer GP apply to each layer.
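The latent-variable KL terms in the bound have the standard closed form for a Gaussian against the $\mathcal{N}(0, 1)$ prior; as a small sketch:

```python
import numpy as np

def kl_q_w(a, b):
    """KL[ N(a, b) || N(0, 1) ] for q(w_n) with mean a and variance b > 0."""
    return 0.5 * (b + a**2 - 1.0 - np.log(b))
```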
To evaluate $\mathcal{L}_n$ exactly is intractable except in the single-layer case (and then only with a certain kernel; see Titsias and Lawrence (2010) for details), so we rely on a Monte Carlo estimate. To obtain unbiased samples we first reparameterize the expectations over $w_n$ and $f^1$, writing $w_n = a_n + b_n^{1/2}\epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, 1)$, and similarly for the inner-layer function values. We then obtain an estimate of $\mathcal{L}_n$ by sampling the noise variables, which is the ‘reparameterization trick’ popularized by Rezende et al. (2014); Kingma and Welling (2014). After these two expectations have been approximated with a Monte Carlo estimate we can take the expectation over $f^2$ analytically, as the likelihood is Gaussian.
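The analytic final step can be sketched in a scalar case: if the final-layer marginal at a sampled inner-layer input is $q(f) = \mathcal{N}(m, s)$, the Gaussian likelihood gives the expected log-likelihood in closed form, which a reparameterized Monte Carlo estimate recovers (illustrative code, not the paper's):

```python
import numpy as np

def expected_log_lik(y, m, s, sigma2):
    """E_{f ~ N(m, s)}[ log N(y | f, sigma2) ]
    = log N(y | m, sigma2) - s / (2 sigma2)."""
    return (-0.5 * np.log(2.0 * np.pi * sigma2)
            - 0.5 * ((y - m) ** 2 + s) / sigma2)

# Monte Carlo check with reparameterized samples f = m + sqrt(s) * eps
rng = np.random.default_rng(0)
y, m, s, sigma2 = 1.0, 0.3, 0.5, 0.1
f = m + np.sqrt(s) * rng.standard_normal(400_000)
mc = np.mean(-0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * (y - f) ** 2 / sigma2)
```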
Jensen’s inequality is tighter if the quantity inside the expectation is concentrated about its mean (Domke and Sheldon, 2018). To decrease the variance inside (2) while preserving its value, we can replace each importance-weight term with a sample average of $K$ terms,

$$p(\mathbf{y}) = \mathbb{E}\left[\frac{p(f^1)\, p(f^2)}{q(f^1)\, q(f^2)} \prod_{n=1}^N \frac{1}{K} \sum_{k=1}^K \frac{p(y_n \mid f^1, f^2, w_n^k)\, p(w_n^k)}{q(w_n^k)}\right], \qquad (5)$$

where $w_n^k$ are independent samples from $q(w_n)$.
The expression inside the expectation in (5) is known as the importance sampled estimate (with respect to $q(w_n)$), motivating the term ‘importance-weighted variational inference’ when we take the logarithm of both sides and form a lower bound using Jensen’s inequality. Applying Jensen’s inequality to (5), we have the lower bound

$$\log p(\mathbf{y}) \ge \sum_{n=1}^N \mathcal{L}_n^{\mathrm{IW}} - \mathrm{KL}[q(f^1) \| p(f^1)] - \mathrm{KL}[q(f^2) \| p(f^2)], \qquad (6)$$

where the data-fit term $\mathcal{L}_n^{\mathrm{IW}}$ is given by

$$\mathcal{L}_n^{\mathrm{IW}} = \mathbb{E}\left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(y_n \mid f^1, f^2, w_n^k)\, p(w_n^k)}{q(w_n^k)}\right]. \qquad (7)$$
This approach provides a strictly tighter bound (Burda et al., 2016; Domke and Sheldon, 2018), but the expression for $\mathcal{L}_n^{\mathrm{IW}}$ is less convenient to estimate than $\mathcal{L}_n$, as we cannot apply the analytic result for the expectation over $f^2$ due to the non-linearity of the logarithm. We must therefore resort to sampling for the expectation over $f^2$ as well as $f^1$ and $w_n$, incurring potentially high variance. This seems like an unacceptable price to pay, as we have not gained any additional flexibility over the final-layer posterior $q(f^2)$. In the next section, we show how we can retain the analytic expectation for $f^2$ while exploiting importance-weighted sampling for $w_n$.
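When terms of this form are estimated in practice, the log of an average of importance weights is best computed from log-weights with the log-sum-exp trick. A small sketch (an assumed utility, not from the paper):

```python
import numpy as np

def log_mean_exp(log_w):
    """Numerically stable log( mean( exp(log_w) ) ) over the last axis."""
    m = np.max(log_w, axis=-1, keepdims=True)
    return np.squeeze(m, -1) + np.log(np.mean(np.exp(log_w - m), axis=-1))
```

By Jensen's inequality, `log_mean_exp(log_w) >= np.mean(log_w)` always holds, which is the sense in which the $K$-sample estimate tightens the $K = 1$ (VI) bound.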
The expression in (7) does not exploit the conjugacy of the Gaussian likelihood with the final layer. In this section, we present a novel two-stage approach to obtain a bound that has all the analytic properties of the variational bound (3), but with improved accuracy. Our aim is to obtain a modified version of (7), but with the expectation taken over the logarithm of the likelihood, since this expectation is tractable. We will show that we can find a bound that replaces the likelihood term in (7) with $\exp\,\mathbb{E}_{q(f^2)}\big[\log p(y_n \mid f^1, f^2, w_n^k)\big]$, which we can compute analytically. It is worth noting that since the likelihood is Gaussian we could analytically integrate out $f^2$, though doing so precludes the use of minibatches and incurs unfavourable computational complexity. It is not surprising, then, that it is possible to ‘collapse’ the bound over $f^2$ approximately. Our approach is inspired by the partially collapsed approaches in Hensman et al. (2012). We begin by applying Jensen’s inequality to the inner expectation over $q(f^2)$ in (2):
$$\log \mathbb{E}_{q(f^2)}\left[\frac{p(\mathbf{y} \mid f^1, f^2, \mathbf{w})\, p(f^2)}{q(f^2)}\right] \ge \sum_{n=1}^N \mathcal{G}_n - \mathrm{KL}[q(f^2) \| p(f^2)], \qquad (8)$$

where the data term is given by $\mathcal{G}_n = \mathbb{E}_{q(f^2)}\big[\log p(y_n \mid f^1, f^2, w_n)\big]$. The expression for $\mathcal{G}_n$ is available in closed form as the conditional likelihood is Gaussian (see, for example, Titsias, 2009). Applying the exponential function to both sides of (8) gives the bound

$$\mathbb{E}_{q(f^2)}\left[\frac{p(\mathbf{y} \mid f^1, f^2, \mathbf{w})\, p(f^2)}{q(f^2)}\right] \ge \exp\left\{\sum_{n=1}^N \mathcal{G}_n - \mathrm{KL}[q(f^2) \| p(f^2)]\right\}. \qquad (9)$$
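The inequality in (9) can be checked numerically in a scalar case: for Gaussian $q(f) = \mathcal{N}(m, s)$ and a Gaussian likelihood, both $\exp\,\mathbb{E}[\log p(y \mid f)]$ and the exact $\mathbb{E}[p(y \mid f)] = \mathcal{N}(y \mid m, \sigma^2 + s)$ are available in closed form (an illustrative check, not the paper's code):

```python
import numpy as np

def gauss_logpdf(y, mu, var):
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mu) ** 2 / var

m, s, sigma2, y = 0.3, 0.5, 0.1, 1.0
lower = np.exp(gauss_logpdf(y, m, sigma2) - 0.5 * s / sigma2)  # exp(E[log p])
exact = np.exp(gauss_logpdf(y, m, sigma2 + s))                 # E[p(y | f)]
```

The gap between `lower` and `exact` is the Jensen gap that the importance weighting over $w$ subsequently reduces.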
Returning again to (2), the exact marginal likelihood can be expressed equivalently with $f^2$ marginalized as

$$p(\mathbf{y}) = \mathbb{E}_{q(f^1)\, q(\mathbf{w})}\left[\mathbb{E}_{q(f^2)}\left[\frac{p(\mathbf{y} \mid f^1, f^2, \mathbf{w})\, p(f^2)}{q(f^2)}\right] \frac{p(f^1)\, p(\mathbf{w})}{q(f^1)\, q(\mathbf{w})}\right]. \qquad (10)$$

We can now use the bound on the inner expectation from (9) and substitute into (10) to obtain

$$p(\mathbf{y}) \ge \mathbb{E}_{q(f^1)\, q(\mathbf{w})}\left[\exp\left\{\sum_{n=1}^N \mathcal{G}_n\right\} \frac{p(f^1)\, p(\mathbf{w})}{q(f^1)\, q(\mathbf{w})}\right] e^{-\mathrm{KL}[q(f^2) \| p(f^2)]}.$$

Next we use Jensen’s inequality for the $q(f^1)$ expectation. After some rearranging, the bound is given by

$$\log p(\mathbf{y}) \ge \sum_{n=1}^N \mathbb{E}_{q(f^1)}\left[\log \mathbb{E}_{q(w_n)}\left[\exp(\mathcal{G}_n)\, \frac{p(w_n)}{q(w_n)}\right]\right] - \mathrm{KL}[q(f^1) \| p(f^1)] - \mathrm{KL}[q(f^2) \| p(f^2)].$$
To bound the remaining intractable term, we first reduce the variance of the quantity inside the expectation using the sample average as before, to tighten the subsequent use of Jensen’s inequality. For any $K$, $\mathbb{E}_{q(w_n)}\big[\exp(\mathcal{G}_n)\, p(w_n)/q(w_n)\big]$ is (exactly) equal to

$$\mathbb{E}\left[\frac{1}{K} \sum_{k=1}^K \exp(\mathcal{G}_n^k)\, \frac{p(w_n^k)}{q(w_n^k)}\right],$$

where $w_n^k$ are independent samples from $q(w_n)$ and $\mathcal{G}_n^k = \mathbb{E}_{q(f^2)}\big[\log p(y_n \mid f^1, f^2, w_n^k)\big]$. We now make a final use of Jensen for the expectation, and the final objective is then given by

$$\mathcal{L}_{\mathrm{IWVI}} = \sum_{n=1}^N \mathbb{E}\left[\log \frac{1}{K} \sum_{k=1}^K \exp(\mathcal{G}_n^k)\, \frac{p(w_n^k)}{q(w_n^k)}\right] - \mathrm{KL}[q(f^1) \| p(f^1)] - \mathrm{KL}[q(f^2) \| p(f^2)]. \qquad (11)$$
This bound can be evaluated using Monte Carlo sampling for $f^1$ and $w_n$, both with the reparameterization described in Section 3.1.
The terms inside the sum over $k$ in (11) must be computed with a single joint draw from $q(f^1)$ at the $K$ inputs, and not with independent draws. This is not an insurmountable problem, however, as we can draw the samples using the full $K \times K$ covariance and reparameterize using its Cholesky factor at $\mathcal{O}(K^3)$ cost. Note that the decomposition over the data points is a consequence of our choice of proposal distributions and the factorization of the likelihood, so we do not incur cost cubic in $N$ and can sample each term in the sum over $n$ independently.
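The joint sampling step can be sketched as follows (with an assumed RBF kernel standing in for the posterior covariance of $f^1$): the $K$ values share one function draw, so they are reparameterized with the Cholesky factor of the $K \times K$ covariance, an $\mathcal{O}(K^3)$ operation.

```python
import numpy as np

rng = np.random.default_rng(0)
K_samples = 5
x_n = 0.7
# The K inputs [x_n, w_n^k] share the same x_n but differ in the latent draw.
inputs = np.column_stack([np.full(K_samples, x_n),
                          rng.standard_normal(K_samples)])
d2 = ((inputs[:, None, :] - inputs[None, :, :]) ** 2).sum(-1)
cov = np.exp(-0.5 * d2) + 1e-6 * np.eye(K_samples)  # jitter for stability
f1_at_inputs = np.linalg.cholesky(cov) @ rng.standard_normal(K_samples)
```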
Variational inference simultaneously finds a lower bound and an approximate posterior. The importance-weighted approach also minimizes a KL divergence to the posterior, but the posterior is an implicit one over an augmented space. Samples from this posterior are obtained by first sampling $w_n^1, \dots, w_n^K$ and then selecting one of them with probabilities proportional to the importance weights. We refer to Domke and Sheldon (2018) for details. In this work, we are not concerned with the posterior over the latent variables themselves, but rather with prediction at test points, where we sample from the prior.

We can amortize the optimization of the local variational parameters $(a_n, b_n)$ by making them parameterized functions of the data. As the bound is a sum over the data we can use data subsampling. We can also use natural gradients for the variational parameters of the final layer, following Salimbeni et al. (2018); Dutordoir et al. (2018). Our bound is modular in both the GP and LV layers, so it extends straightforwardly to the other model variants.
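Amortization can be sketched as a parameterized map from each data point to its local variational parameters; the linear map below is a hypothetical stand-in for whatever function is used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((2, 2))   # weights of the amortization map
c = np.zeros(2)

def amortize(x_n, y_n):
    """Map a data point to the parameters (a_n, b_n) of q(w_n) = N(a_n, b_n)."""
    h = A @ np.array([x_n, y_n]) + c
    return h[0], np.exp(h[1])           # exp keeps the variance positive

a_n, b_n = amortize(0.5, -1.2)
```

With amortization, the number of variational parameters no longer grows with $N$, which is what makes data subsampling straightforward.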
Posteriors for 1D synthetic data. The models without latent variables cannot capture the bimodality or heteroscedasticity. See Figure 1 for the LV-GP-GP-GP model.

We use a density estimation task to establish the utility of the DGP models with and without latent variables. For the latent variable models we also compare our importance-weighted (IW) inference (11) against the variational inference (VI) approach (3). Our central interest is how well the models balance the complexity of the marginals with the complexity of the mapping from inputs to outputs.
To illustrate the inductive bias of our models we show a 1D example with a conditional density that is non-Gaussian, heteroscedastic and changes discontinuously in some regions of the space. We form a dataset from the letters ‘DGP’ by taking a fine grid of points inside the glyphs of a cropped image and using the horizontal coordinates as the input and the vertical coordinates as the output. Posteriors are shown in Figure 12 for four models (the same models as in Figure 7), with the data plotted in the first figure. The GP model obviously cannot model this data, but neither can the GP-GP, as the approximate posterior has no way to separate out the points with the same input but different outputs. The LV-GP model can fit the data to some extent, and the LV-GP-GP model more closely captures the shapes of the letters.
We see that the LV-GP tends towards smoother densities than the LV-GP-GP, and also a smoother response as a function of the input. Between the ‘G’ and ‘P’ the LV-GP-GP model extrapolates with high confidence, whereas the LV-GP model maintains a bi-modal distribution connecting the two letters. Posteriors from further models are shown in the supplementary material.
Dataset | N | D | GP (VI) | GP-GP (VI) | GP-GP-GP (VI) | LV-GP (VI) | LV-GP (IWVI) | LV-GP-GP (VI) | LV-GP-GP (IWVI) | LV-GP-GP-GP (VI) | LV-GP-GP-GP (IWVI)
challenger | 23 | 4 | -3.22 (0.07) | -2.04 (0.02) | -2.10 (0.03) | -0.67 (0.01) | -0.68 (0.00) | -9.99 (3.60) | -9.47 (2.19) | -1.21 (0.01) | -2.45 (0.04) |
fertility | 100 | 9 | -2.51 (0.13) | -1.43 (0.01) | -1.40 (0.01) | -1.41 (0.01) | -1.40 (0.01) | -1.42 (0.01) | -1.69 (0.01) | -1.32 (0.00) | -1.31 (0.01) |
concreteslump | 103 | 7 | 1.91 (0.01) | 1.55 (0.00) | 1.45 (0.00) | 1.93 (0.00) | 1.94 (0.00) | 1.53 (0.00) | 1.75 (0.00) | 1.43 (0.00) | 1.62 (0.00) |
autos | 159 | 25 | -0.36 (0.00) | -0.14 (0.01) | -0.14 (0.03) | -0.36 (0.01) | -0.35 (0.00) | -0.06 (0.04) | -0.34 (0.01) | -0.25 (0.05) | -0.33 (0.02) |
servo | 167 | 4 | -0.17 (0.00) | -0.19 (0.01) | -0.17 (0.01) | -0.07 (0.01) | -0.08 (0.01) | -0.19 (0.01) | -0.07 (0.01) | -0.25 (0.01) | 0.03 (0.03) |
breastcancer | 194 | 33 | -1.32 (0.00) | -1.36 (0.00) | -1.34 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.43 (0.01) | -1.59 (0.10) | -1.38 (0.00) | -1.80 (0.19) |
machine | 209 | 7 | -0.70 (0.01) | -0.65 (0.02) | -0.61 (0.01) | -0.75 (0.02) | -0.71 (0.02) | -0.63 (0.01) | -0.51 (0.01) | -0.59 (0.01) | -0.52 (0.03) |
yacht | 308 | 6 | 1.32 (0.02) | 1.84 (0.01) | 2.00 (0.06) | 1.64 (0.00) | 1.65 (0.00) | 1.69 (0.09) | 1.71 (0.01) | 1.77 (0.07) | 2.05 (0.02) |
autompg | 392 | 7 | -0.44 (0.01) | -0.57 (0.02) | -0.48 (0.03) | -0.32 (0.01) | -0.33 (0.01) | -0.38 (0.04) | -0.24 (0.01) | -0.65 (0.11) | -0.28 (0.03) |
boston | 506 | 13 | 0.02 (0.00) | 0.07 (0.00) | 0.03 (0.01) | -0.04 (0.00) | -0.07 (0.00) | 0.06 (0.00) | 0.14 (0.00) | -0.04 (0.01) | 0.17 (0.00) |
forest | 517 | 12 | -1.38 (0.00) | -1.37 (0.00) | -1.37 (0.00) | -0.99 (0.00) | -1.00 (0.00) | -0.91 (0.00) | -1.02 (0.02) | -0.92 (0.01) | -0.91 (0.01) |
stock | 536 | 11 | -0.22 (0.00) | -0.29 (0.00) | -0.26 (0.00) | -0.22 (0.00) | -0.22 (0.00) | -0.29 (0.00) | -0.19 (0.00) | -0.23 (0.01) | -0.36 (0.02) |
pendulum | 630 | 9 | -0.14 (0.00) | 0.22 (0.01) | 0.12 (0.08) | -0.14 (0.00) | -0.14 (0.00) | 0.22 (0.01) | 0.33 (0.08) | 0.25 (0.00) | 0.20 (0.03) |
energy | 768 | 8 | 1.71 (0.00) | 1.85 (0.00) | 2.07 (0.01) | 1.86 (0.01) | 1.92 (0.01) | 2.00 (0.04) | 1.96 (0.01) | 2.07 (0.01) | 2.28 (0.01) |
concrete | 1030 | 8 | -0.43 (0.00) | -0.45 (0.00) | -0.35 (0.02) | -0.32 (0.00) | -0.32 (0.00) | -0.35 (0.00) | -0.21 (0.00) | -0.22 (0.01) | -0.12 (0.00) |
solar | 1066 | 10 | -1.75 (0.08) | -1.21 (0.02) | -1.20 (0.03) | 0.04 (0.07) | 0.07 (0.01) | 0.54 (0.01) | 0.22 (0.01) | 0.54 (0.01) | 0.20 (0.02) |
airfoil | 1503 | 5 | -0.79 (0.05) | 0.08 (0.02) | 0.14 (0.03) | -0.44 (0.03) | -0.36 (0.01) | 0.07 (0.00) | 0.30 (0.00) | -0.02 (0.03) | 0.34 (0.01) |
winered | 1599 | 11 | -1.09 (0.00) | -1.11 (0.00) | -1.08 (0.00) | -1.07 (0.00) | -1.06 (0.00) | -1.10 (0.00) | -0.84 (0.01) | -1.06 (0.03) | -1.31 (0.40) |
gas | 2565 | 128 | 1.07 (0.00) | 1.60 (0.05) | 1.69 (0.05) | 1.69 (0.08) | 1.56 (0.06) | 0.70 (0.10) | 1.57 (0.16) | 1.30 (0.14) | 1.61 (0.12) |
skillcraft | 3338 | 19 | -0.94 (0.00) | -0.94 (0.00) | -0.94 (0.00) | -0.91 (0.00) | -0.91 (0.00) | -0.92 (0.00) | -0.93 (0.00) | -0.92 (0.00) | -0.94 (0.00) |
sml | 4137 | 26 | 1.53 (0.00) | 1.72 (0.01) | 1.83 (0.01) | 1.52 (0.00) | 1.52 (0.00) | 1.79 (0.00) | 1.91 (0.00) | 1.92 (0.01) | 1.97 (0.05) |
winewhite | 4898 | 11 | -1.14 (0.00) | -1.14 (0.00) | -1.14 (0.00) | -1.13 (0.00) | -1.13 (0.00) | -1.13 (0.00) | -1.10 (0.00) | -1.13 (0.00) | -1.09 (0.00) |
parkinsons | 5875 | 20 | 1.99 (0.00) | 2.61 (0.01) | 2.75 (0.02) | 1.79 (0.02) | 1.82 (0.02) | 2.36 (0.02) | 2.76 (0.02) | 2.71 (0.07) | 3.12 (0.05) |
kin8nm | 8192 | 8 | -0.29 (0.00) | -0.01 (0.00) | -0.00 (0.00) | -0.29 (0.00) | -0.29 (0.00) | -0.02 (0.00) | -0.00 (0.00) | -0.00 (0.00) | 0.03 (0.00) |
pumadyn32nm | 8192 | 32 | 0.08 (0.00) | 0.11 (0.01) | 0.11 (0.00) | -1.44 (0.00) | -1.44 (0.00) | 0.10 (0.01) | -0.62 (0.25) | 0.11 (0.00) | 0.11 (0.00) |
power | 9568 | 4 | -0.65 (0.04) | -0.75 (0.03) | -0.80 (0.02) | -0.39 (0.04) | -0.23 (0.02) | -0.36 (0.04) | -0.28 (0.05) | -0.25 (0.06) | -0.11 (0.06) |
naval | 11934 | 14 | 4.52 (0.02) | 4.43 (0.03) | 4.35 (0.03) | 4.19 (0.01) | 4.24 (0.01) | 4.36 (0.01) | 4.52 (0.02) | 4.27 (0.02) | 4.41 (0.02) |
pol | 15000 | 26 | 0.48 (0.00) | 1.51 (0.01) | 1.45 (0.01) | 0.37 (0.08) | -0.50 (0.00) | 2.34 (0.02) | 2.63 (0.01) | 1.47 (0.01) | 2.72 (0.10) |
elevators | 16599 | 18 | -0.44 (0.00) | -0.41 (0.01) | -0.40 (0.00) | -0.37 (0.00) | -0.36 (0.00) | -0.29 (0.00) | -0.27 (0.00) | -0.28 (0.00) | -0.27 (0.00) |
bike | 17379 | 17 | 0.82 (0.01) | 3.49 (0.01) | 3.68 (0.01) | 2.48 (0.01) | 2.66 (0.01) | 3.48 (0.00) | 3.75 (0.01) | 3.72 (0.01) | 3.95 (0.01) |
kin40k | 40000 | 8 | 0.02 (0.00) | 0.84 (0.00) | 1.17 (0.00) | 0.04 (0.00) | 0.05 (0.00) | 0.87 (0.01) | 0.93 (0.00) | 1.15 (0.00) | 1.27 (0.00) |
protein | 45730 | 9 | -1.06 (0.00) | -0.98 (0.00) | -0.95 (0.00) | -0.85 (0.00) | -0.80 (0.00) | -0.70 (0.00) | -0.61 (0.00) | -0.68 (0.00) | -0.57 (0.00) |
tamielectric | 45781 | 3 | -1.43 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) |
keggdirected | 48827 | 20 | 0.16 (0.01) | 0.21 (0.01) | 0.26 (0.02) | 1.24 (0.06) | 2.03 (0.01) | 1.84 (0.05) | 2.23 (0.01) | 1.92 (0.02) | 2.26 (0.01) |
slice | 53500 | 385 | 0.86 (0.00) | 1.80 (0.00) | 1.86 (0.00) | 0.85 (0.00) | 0.90 (0.00) | 1.80 (0.00) | 1.88 (0.00) | 1.86 (0.00) | 2.02 (0.00) |
keggundirected | 63608 | 27 | 0.06 (0.01) | 0.06 (0.01) | 0.07 (0.01) | -75 (4) | -4.21 (0.33) | -64 (6) | 2.37 (0.29) | -116 (17) | 2.98 (0.21) |
3droad | 434874 | 3 | -0.79 (0.00) | -0.61 (0.01) | -0.58 (0.01) | -0.71 (0.00) | -0.65 (0.00) | -0.69 (0.00) | -0.55 (0.01) | -0.61 (0.00) | -0.48 (0.00) |
song | 515345 | 90 | -1.19 (0.00) | -1.18 (0.00) | -1.18 (0.00) | -1.15 (0.00) | -1.14 (0.00) | -1.13 (0.00) | -1.10 (0.00) | -1.12 (0.00) | -1.07 (0.00) |
buzz | 583250 | 77 | -0.24 (0.00) | -0.15 (0.00) | -0.15 (0.00) | -0.22 (0.01) | -0.23 (0.01) | -0.09 (0.01) | -0.04 (0.03) | -0.09 (0.01) | 0.04 (0.00) |
nytaxi | 1420068 | 8 | -1.42 (0.01) | -1.44 (0.01) | -1.42 (0.01) | -1.09 (0.03) | -0.90 (0.04) | -1.57 (0.04) | -1.03 (0.04) | -1.74 (0.05) | -0.95 (0.07) |
houseelectric | 2049280 | 11 | 1.37 (0.00) | 1.41 (0.00) | 1.47 (0.01) | 1.65 (0.03) | 1.82 (0.01) | 1.66 (0.05) | 2.01 (0.00) | 1.77 (0.06) | 2.03 (0.00) |
Median difference from GP baseline | | | 0 | 0.06 | 0.09 | 0.04 | 0.05 | 0.12 | 0.26 | 0.18 | 0.32
Average ranks | 2.72 (0.14) | 4.01 (0.14) | 4.84 (0.14) | 4.19 (0.17) | 4.60 (0.16) | 4.84 (0.14) | 6.69 (0.15) | 5.62 (0.16) | 7.49 (0.16) |
Table 1: Average test log-likelihood results for five splits (standard error). Bold font indicates overlapping error bars with the best performing method. Italic indicates a significantly non-Gaussian posterior, computed by the Shapiro–Wilk test (see Table 3). Colours indicate datasets demonstrating prominent examples of complex marginals, mappings, or both.

We use 41 publicly available datasets with 1D targets (the full datasets with the splits and pre-processing can be found at github.com/hughsalimbeni/bayesian_benchmarks). The datasets range in size from 23 points to over two million. In each case we reserve 10% of the data for evaluating a test log-likelihood, repeating the experiment five times with different splits. We use five samples for the importance-weighted models and five GP outputs for the inner layers; the number of inducing points, hyperparameters and initializations are the same for all models and datasets and are fully detailed in the supplementary material. Results for test log-likelihood are reported in Table 1 for the GP models. Table 2 in the supplementary material shows additional results for conditional VAE models (Sohn et al., 2015), and results from deep Gaussian process models using stochastic gradient Hamiltonian Monte Carlo (Havasi et al., 2018). To assess the non-Gaussianity of the predictive distribution we compute the Shapiro–Wilk test statistic on the test point marginals. The test statistics are shown in Table 3 in the supplementary material. Full code to reproduce our results is available online at github.com/hughsalimbeni/DGPs_with_IWVI.

We draw five conclusions from the results.

1) Latent variables improve performance. For a given depth of Gaussian process mappings (for example GP-GP compared to LV-GP-GP) the latent variable model generally has equal or better performance, and often considerably better. This is true both using IW and VI.
2) IW outperforms VI. The difference is more pronounced on the deeper models, which is to be expected as the deeper models are likely to have more complicated posteriors, so the VI approximation is likely to be more severe. There are also a few datasets (‘keggundirected’, ‘servo’ and ‘autompg’) where the VI latent variable model has very poor performance, but the IW approach performs well. This suggests that the IW approach might be less prone to local optima than VI.
3) Some datasets require latent variables for any improvement over the single-layer model. For several datasets (for example ‘forest’ and ‘power’) the inclusion of latent variables makes a very pronounced difference, whereas depth alone cannot improve over the single-layer model. These datasets are highlighted in blue. For all these datasets the marginals for the latent variable models are non-Gaussian (see Table 3 in the supplementary material). Conversely, for some datasets (e.g., ‘kin40k’ and ‘sml’) we observe much greater benefit from the deep noise-free models than from the single-layer model with latent variables. These datasets are highlighted in orange.
4) Some datasets benefit from both depth and latent variables. For the ‘bike’ and ‘keggdirected’ data, for example, we see that both the LV-GP and GP-GP model improve over the single layer model, suggesting that complex input dependence and non-Gaussian marginals are beneficial for modelling the data. Four examples of these datasets are highlighted in green. For these datasets the marginals are non-Gaussian for the latent variable models, and in each case the LV-GP-GP-GP model is the best performing.
5) On average, the LV-GP-GP-GP model is best performing. This indicates that the inductive bias of this model is suitable over the broad range of datasets considered. We also compare to Conditional VAE models (Sohn et al., 2015) (see Table 2 in the supplementary material) and find that these models overfit considerably for the smaller datasets and even on the larger datasets do not outperform our deep latent variable models. The marginals of the CVAE models are more Gaussian than our models (see Table 3 in the supplementary material), indicating that these models explain the data through the input mapping and not the complexity of the marginal densities. See Figures 67 and 68 in the supplementary material for the plots of posterior marginals.
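The Shapiro–Wilk check used above can be sketched with scipy on toy predictive samples (illustrative data, not the paper's):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
gaussian_marginal = rng.standard_normal(500)
bimodal_marginal = np.concatenate([rng.normal(-3.0, 0.5, 250),
                                   rng.normal(3.0, 0.5, 250)])

# The Shapiro-Wilk statistic is close to 1 for Gaussian samples and
# noticeably smaller for strongly non-Gaussian (here bimodal) samples.
w_gauss = shapiro(gaussian_marginal).statistic
w_bimodal = shapiro(bimodal_marginal).statistic
```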
If the GP layers are replaced by neural networks, our models are equivalent to (conditional) variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) when used with VI (Section 3.1), or importance-weighted autoencoders (IWAEs) (Burda et al., 2016) when used with the IW inference (Section 3.3). Our model can be thought of as a VAE (or IWAE) but with a deep GP for the ‘encoder’ mapping. Compared to the conditional VAE, the deep GP approach has several advantages: it incorporates model (epistemic) uncertainty, is automatically regularized, and allows fine-grained control of the properties of the mapping. For example, the lengthscale corresponding to the latent variable can be tuned to favour complex marginals (a small value) or near-Gaussian marginals (a large value). Note that our IW inference is not a simple adaptation of Burda et al. (2016), as we have to perform additional inference over the GPs.

The LV-GP model was proposed by Wang and Neal (2012), and then developed in Dutordoir et al. (2018) and used for meta-learning in Sæmundsson et al. (2018). A version with discrete latent variables was presented in Bodin et al. (2017). The LV-GP model without any inputs is known as the Gaussian process latent variable model (Lawrence, 2004), which was used in a semi-supervised setting in Damianou and Lawrence (2015); that semi-supervised model is also equivalent to the LV-GP.
The deep model of Damianou and Lawrence (2013) is closely related to ours but incorporates the latent variables additively rather than through concatenation (that is, additive noise on the layer outputs rather than additional inputs). In principle, it would be possible to recover our model using the approach of Damianou and Lawrence (2013) with a certain setting of hyperparameters, but in all previous work the kernel hyperparameters were tied within each layer, so this limit was not achievable. Bui et al. (2016) also used this model, with a form of expectation propagation for approximate inference.
The variational inference we have presented (without importance weighting) is not equivalent to that of Damianou and Lawrence (2013). In Damianou and Lawrence (2013) a mean-field variational posterior is used for the noisy corruptions, which may be a poor approximation as there are a priori correlations between these outputs. This mean-field assumption also forces independence between the inputs and outputs of the layers, whereas we make no such assumption.
On a broad range of 1D density estimation tasks we find that our DGP with latent variables outperforms the single-layer and noise-free models, sometimes considerably. Closer investigation reveals that non-Gaussian marginals are readily found by our model, and that the importance-weighted objective improves performance in practice.
Conditional density estimation must balance the complexity of the density with the complexity of the input dependency. The inductive bias of our LV-GP-GP-GP model appears to be suitable for the broad range of datasets we have considered. An advantage of our approach is that the deeper models all contain the shallower models as special cases: a layer can be ‘turned off’ by setting a single scalar hyperparameter (the kernel variance) to zero. This is a consequence of the ResNet-inspired (He et al., 2016) use of mean functions, and may explain why we observe empirically that adding depth rarely hurts performance. The latent variables can similarly be ‘turned off’ if the corresponding lengthscale is large, which may explain why the latent-variable models are rarely outperformed by the noise-free models, even when the marginals are Gaussian.
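To illustrate the 'turning off' mechanism: a layer with an identity mean function computes x plus a zero-mean GP draw, so when the kernel variance is zero it reduces exactly to the identity. A toy one-dimensional sketch (not the paper's implementation; the jitter term is a standard numerical convention):

```python
import numpy as np

def sample_residual_gp_layer(X, variance, lengthscale, rng):
    """Draw one sample from a GP layer with identity mean function:
    output = X + f(X), where f ~ GP(0, variance * RBF(lengthscale)).
    With variance == 0 the layer is (numerically) the identity."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = variance * np.exp(-0.5 * d2 / lengthscale ** 2)
    # tiny jitter for a valid covariance when variance == 0
    f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-10 * np.eye(len(X)))
    return X + f
```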
There are very few hyperparameters in our model to optimize (the kernel parameters and the likelihood variance). This allows us to use the same model across a wide range of data. In the small and medium data regimes we have considered, an unregularized mapping (for example, a neural network as in the conditional VAE model (Sohn et al., 2015)) is likely to overfit, as indeed we have observed (see Table 2 in the supplementary material).
As depth increases it becomes increasingly difficult to reason about the DGP priors. Our approach is unlikely to recover interpretable features, such as periodicity, and the approaches of Lloyd et al. (2014) and Sun et al. (2018) may be more appropriate if an interpretable model is required. The latent variables are also difficult to interpret, as their effect is coupled with the GP mappings.
We have only considered 1D latent variables and 1D outputs, and while the extension to higher dimensions is straightforward for training, difficulties may arise in evaluating posterior expectations as our model does not provide a closed-form predictive density and Monte Carlo sampling may have unacceptably high variance. The task of estimating test likelihoods with high-dimensional latent variables is challenging, and techniques described in Wu et al. (2016) may be necessary in higher dimensions.
The inference we have presented is limited by cubic scaling in the number of inducing points. The importance-weighting approach may also suffer from vanishing gradient signal for the parameters of the proposal distribution, as discussed in Rainforth et al. (2018). The doubly reparameterized gradient estimator of Tucker et al. (2019) could be used to alleviate this problem.
We have presented a novel inference scheme for the deep Gaussian process with latent variables, combining importance weighting with partially collapsed variational inference. We have also developed a variant of the deep Gaussian process model where uncorrelated variables are introduced as latent covariates rather than as process noise. We have shown empirically that deep latent-variable models outperform the noise-free deep GP on a range of data, and that our importance-weighted inference delivers an advantage over variational inference in practice.
We gratefully acknowledge the contributions of anonymous reviewers, and thank Sanket Kamthe, Victor Picheny and Alan Saul for feedback on the manuscript.
References

A. Damianou and N. Lawrence (2015). Semi-described and semi-supervised learning with Gaussian processes. arXiv:1509.01168.
K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition.
N. D. Lawrence (2004). Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems.
A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman (2017). GPflow: a Gaussian process library using TensorFlow. Journal of Machine Learning Research.
A. G. de G. Matthews, J. Hensman, R. E. Turner, and Z. Ghahramani (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Artificial Intelligence and Statistics.
D. J. Rezende, S. Mohamed, and D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning.
S. Sæmundsson, K. Hofmann, and M. P. Deisenroth (2018). Meta reinforcement learning with latent variable Gaussian processes. Uncertainty in Artificial Intelligence.

For all our models we used the following settings in the UCI experiments:
Kernels. All kernels were RBF, with one lengthscale parameter per input dimension, initialized to the square root of the input dimension.
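A sketch of this kernel setup with numpy (the helper function is ours; GPflow's built-in RBF kernel would be used in practice):

```python
import numpy as np

def rbf(X1, X2, lengthscales, variance=1.0):
    """ARD RBF kernel with one lengthscale per input dimension."""
    diff = X1[:, None, :] / lengthscales - X2[None, :, :] / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

D = 4
lengthscales = np.full(D, np.sqrt(D))  # the initialization described above
```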
Inducing points. For data with more than 128 training points, the inducing point locations were chosen using kmeans2 from the scipy package, with 128 points; otherwise they were set to the data. For latent variable layers, the GP layer above has extra input dimensions, and the inducing points for these extra dimensions were padded with random draws from a standard normal variable.
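This initialization might be sketched as follows (the function name and seed handling are ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_inducing_points(X_train, n_inducing=128, n_latent_dims=1, seed=0):
    """kmeans2 centroids when there is enough data, otherwise the data
    itself; latent-variable dimensions padded with standard-normal draws."""
    rng = np.random.default_rng(seed)
    if len(X_train) > n_inducing:
        Z, _ = kmeans2(X_train, n_inducing, minit='points', seed=seed)
    else:
        Z = X_train.copy()
    pad = rng.standard_normal((len(Z), n_latent_dims))
    return np.concatenate([Z, pad], axis=1)
```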
Linear projections between layers. We implemented the linear mean functions and multioutput structure using a linear projection of 5 independent GPs concatenated with their inputs. We initialized the projection matrix to the first 5 principal components of the data concatenated with the identity matrix.
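One plausible reading of this initialization, sketched with numpy (the exact shape conventions in the actual implementation may differ):

```python
import numpy as np

def init_projection(X, n_gps=5):
    """First n_gps principal directions of the centred inputs,
    concatenated with the identity, as described above."""
    Xc = X - X.mean(axis=0)
    # principal components via SVD of the centred data matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_gps]                                        # (n_gps, D)
    return np.concatenate([P, np.eye(X.shape[1])], axis=0)  # (n_gps + D, D)
```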
Amortization of the variational parameters. We used a three-layer fully connected network with skip connections between each layer, with 10 units and a pointwise non-linearity in the inner layers. We used the weight initialization from Glorot and Bengio [2010] and the exponential function to enforce positivity of the standard deviation parameters. We added a bias of -5 in the final layer to ensure the standard deviations were small at the start of optimization.
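A sketch of such an amortization network in numpy; the tanh non-linearity is an assumption (the text does not name it), while the Glorot initialization, skip connections, exponential positivity, and -5 output bias follow the description above:

```python
import numpy as np

def glorot(rng, n_in, n_out):
    # Glorot/Xavier uniform initialization
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

class AmortizationNet:
    """Three hidden layers of 10 units with skip connections, outputting a
    mean and a standard deviation (exp of a pre-activation with a -5 bias
    so the initial standard deviations are small)."""
    def __init__(self, d_in, hidden=10, seed=0):
        rng = np.random.default_rng(seed)
        dims = [d_in, hidden, hidden, hidden]
        self.W = [glorot(rng, dims[i], dims[i + 1]) for i in range(3)]
        self.b = [np.zeros(dims[i + 1]) for i in range(3)]
        self.W_out = glorot(rng, hidden, 2)   # mean and log-std heads
        self.b_out = np.array([0.0, -5.0])    # -5 bias on the log-std

    def __call__(self, x):
        h = x
        for W, b in zip(self.W, self.b):
            a = np.tanh(h @ W + b)
            h = a + h if a.shape == h.shape else a  # skip connection
        out = h @ self.W_out + self.b_out
        return out[..., 0], np.exp(out[..., 1])   # mean, std
```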
Likelihood. The likelihood variance was initialized to 0.01.
Parameterizations. All positive model parameters were constrained to be positive using the softplus function, clipped below for numerical stability. The variational parameters for the sparse GP layers were parameterized by the mean and the square root of the covariance.
Optimization. The final GP layers were optimized using natural gradients on the natural parameters, with an initial step size of 0.01. All other parameters were optimized using the Adam optimizer [Kingma and Ba, 2015] with all parameters set to their tensorflow defaults, except the initial step size, which was 0.005. Optimizing just the final layer with natural gradients and the inner layers with Adam in this way is shown by Hebbal et al. [2019] to be an effective strategy in practice. We used a batch size of 512 and trained for 100K iterations, annealing the learning rates of both the Adam and natural gradient steps by a factor of 0.98 every 1000 iterations. The mean functions and kernel projection matrices (the matrices that correlate the outputs) were not optimized.
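The annealing schedule amounts to a step-wise decay (a sketch; the function name is ours):

```python
def annealed_lr(base_lr, iteration, decay=0.98, every=1000):
    """Learning rate after `iteration` steps: multiplied by `decay`
    once every `every` iterations, as described above."""
    return base_lr * decay ** (iteration // every)
```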
Importance weights. We used 5 importance weights for all the latent-variable models with importance-weighted variational inference.
Predictions. We used 2000 samples for prediction, sampling the latent variables from the prior for the latent-variable models. We then used a kernel density estimator (with Silverman’s rule to set the bandwidth) to estimate the density.
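This prediction step can be sketched with scipy's KDE (the function name is ours):

```python
import numpy as np
from scipy.stats import gaussian_kde

def predictive_density(samples, y_test):
    """Estimate the predictive density at y_test from posterior output
    samples, using a Gaussian KDE with Silverman's rule for the bandwidth."""
    kde = gaussian_kde(samples, bw_method='silverman')
    return kde(np.atleast_1d(y_test))
```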
Splits and preprocessing. We used 90% train splits on the shuffled data, and rescaled the inputs and outputs to zero mean and unit standard deviation. For datasets with more than 10K test points we used a random sample of 10K test points (the same sample for each split). Note that we report the test log-likelihoods without restoring the scaling, to facilitate comparisons between datasets. The splits were the same for all the models. We implemented our models using GPflow [Matthews et al., 2017].
Comparison of methods. The error bars in Table 1 are standard errors over the five splits. These error bars are likely to overestimate the uncertainty of comparisons between methods, however, as there may be correlations between the performance of methods across splits (the splits are the same for each method). The average ranks were computed for each split separately and then averaged over the splits and methods. To mitigate the effect of correlations between splits, we report the median difference in test log-likelihood from the single-layer GP.
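The per-split ranking might be computed as follows (a sketch; a higher test log-likelihood receives a higher rank, matching the convention in the table):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(loglik):
    """Rank methods within each split by test log-likelihood, then
    average the ranks over splits.
    loglik: array of shape (n_splits, n_methods)."""
    ranks = np.apply_along_axis(rankdata, 1, loglik)
    return ranks.mean(axis=0)
```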
VAE models. The conditional VAE models used the activations and weight initializations from Glorot and Bengio [2010]. Optimization was performed using the Adam optimizer, with the same batch size and learning rate schedule as for the GP models.
Here we derive the sparse GP posterior and its KL divergence from the prior, following Hensman et al. [2013]. This derivation applies to all the GP layers in the DGP model.

We begin with the approach of Titsias [2009] and form a variational distribution over the function by conditioning the prior on a set of inducing inputs $Z = \{z_i\}_{i=1}^M$. We write $u$ for the vector with $i$th component $u_i = f(z_i)$. In Titsias [2009], the variational distribution is defined as

$$q(f(\cdot)) = \int p(f(\cdot) \mid u)\,\tilde q(u)\,\mathrm{d}u, \tag{12}$$

where $\tilde q(u)$ is unspecified. In the single layer case, analytic results can be used to show that $\tilde q(u)$ has an optimal form that is Gaussian. Instead, we follow Hensman et al. [2013] and assume that

$$\tilde q(u) = q(u) = \mathcal N(u \mid m, S). \tag{13}$$

As this choice is conjugate to the prior conditional $p(f(\cdot) \mid u)$, we can use standard Gaussian identities to show that

$$q(f(\cdot)) = \mathcal{GP}\big(\mu(\cdot), \Sigma(\cdot,\cdot)\big), \tag{14}$$

where

$$\mu(x) = k_x^\top K_{ZZ}^{-1} m, \tag{15}$$
$$\Sigma(x, x') = k(x, x') - k_x^\top K_{ZZ}^{-1}\big(K_{ZZ} - S\big) K_{ZZ}^{-1} k_{x'}, \tag{16}$$

with $[k_x]_i = k(x, z_i)$ and $[K_{ZZ}]_{ij} = k(z_i, z_j)$. The KL divergence is given by

$$\mathrm{KL}\big[q(f(\cdot)) \,\|\, p(f(\cdot))\big] = \mathbb E_{q(f(\cdot))}\log\frac{p(f(\cdot)\mid u)\,q(u)}{p(f(\cdot)\mid u)\,p(u)} \tag{17}$$
$$= \mathbb E_{q(u)}\log\frac{q(u)}{p(u)} \tag{18}$$
$$= \mathrm{KL}\big[\mathcal N(m, S) \,\|\, \mathcal N(0, K_{ZZ})\big] \tag{19}$$
$$= \tfrac12\Big(\mathrm{tr}(K_{ZZ}^{-1} S) + m^\top K_{ZZ}^{-1} m - M + \log|K_{ZZ}| - \log|S|\Big), \tag{20}$$

which is finite.

The parameters $m$ and $S$, together with the inducing inputs $Z$, are variational parameters.

For multiple outputs, we use the same matrix for correlating the outputs as in the prior. In practice, we implement the correlated output model by multiplying independent GPs by this matrix, so we use independent outputs for the variational distribution.
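As a sanity check, the closed-form KL between the variational distribution and the prior over the inducing outputs can be evaluated with two Cholesky factorizations; a minimal numpy sketch (function name is ours):

```python
import numpy as np

def kl_gaussian_to_prior(m, S, Kzz):
    """KL[ N(m, S) || N(0, Kzz) ] via Cholesky factors."""
    M = len(m)
    L = np.linalg.cholesky(Kzz)
    Ls = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L, m)          # L^{-1} m
    A = np.linalg.solve(L, Ls)             # L^{-1} L_S
    trace_term = np.sum(A ** 2)            # tr(Kzz^{-1} S)
    mahal = alpha @ alpha                  # m^T Kzz^{-1} m
    logdet = 2 * (np.sum(np.log(np.diag(L))) - np.sum(np.log(np.diag(Ls))))
    return 0.5 * (trace_term + mahal - M + logdet)
```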
Linear | CVAE 50 | CVAE | GP | GP-GP | GP-GP-GP | GP | GP-GP | GP-GP-GP | LV-GP | LV-GP-GP | LV-GP-GP-GP | ||||||
Dataset | N | D | VI | VI | VI | VI | VI | VI | SGHMC | SGHMC | SGHMC | VI | IWVI | VI | IWVI | VI | IWVI
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
challenger | 23 | 4 | -1.12 (0.02) | -0.59 (0.03) | -3.22 (0.07) | -2.04 (0.02) | -2.10 (0.03) | -654 (337) | -0.67 (0.01) | -0.68 (0.00) | -9.99 (3.60) | -9.47 (2.19) | -1.21 (0.01) | -2.45 (0.04) | |||
fertility | 100 | 9 | -1.32 (0.01) | -2.51 (0.13) | -1.43 (0.01) | -1.40 (0.01) | -2.19 (0.02) | -1.67 (0.03) | -1.35 (0.02) | -1.41 (0.01) | -1.40 (0.01) | -1.42 (0.01) | -1.69 (0.01) | -1.32 (0.00) | -1.31 (0.01) | ||
concreteslump | 103 | 7 | 0.71 (0.01) | 1.91 (0.01) | 1.55 (0.00) | 1.45 (0.00) | 0.59 (0.04) | 0.10 (0.03) | -0.14 (0.03) | 1.93 (0.00) | 1.94 (0.00) | 1.53 (0.00) | 1.75 (0.00) | 1.43 (0.00) | 1.62 (0.00) | ||
autos | 159 | 25 | -0.43 (0.00) | -0.36 (0.00) | -0.14 (0.01) | -0.14 (0.03) | -10 (1) | -15 (2) | -23 (4) | -0.36 (0.01) | -0.35 (0.00) | -0.06 (0.04) | -0.34 (0.01) | -0.25 (0.05) | -0.33 (0.02) | ||
servo | 167 | 4 | -0.86 (0.00) | -385 (220) | -0.17 (0.00) | -0.19 (0.01) | -0.17 (0.01) | -1.22 (0.21) | -27 (10) | -326 (44) | -0.07 (0.01) | -0.08 (0.01) | -0.19 (0.01) | -0.07 (0.01) | -0.25 (0.01) | 0.03 (0.03) | |
breastcancer | 194 | 33 | -1.49 (0.01) | -1.32 (0.00) | -1.36 (0.00) | -1.34 (0.00) | -10 (1) | -28 (10) | -462 (100) | -1.31 (0.00) | -1.31 (0.00) | -1.43 (0.01) | -1.59 (0.10) | -1.38 (0.00) | -1.80 (0.19) | ||
machine | 209 | 7 | -0.71 (0.02) | -114 (78) | -0.70 (0.01) | -0.65 (0.02) | -0.61 (0.01) | -2.21 (0.28) | -14 (4) | -74 (13) | -0.75 (0.02) | -0.71 (0.02) | -0.63 (0.01) | -0.51 (0.01) | -0.59 (0.01) | -0.52 (0.03) | |
yacht | 308 | 6 | -0.14 (0.10) | -7.91 (4.34) | 1.32 (0.02) | 1.84 (0.01) | 2.00 (0.06) | -0.26 (0.79) | -9.70 (1.35) | -16 (2) | 1.64 (0.00) | 1.65 (0.00) | 1.69 (0.09) | 1.71 (0.01) | 1.77 (0.07) | 2.05 (0.02) | |
autompg | 392 | 7 | -0.48 (0.00) | -71 (9) | -0.44 (0.01) | -0.57 (0.02) | -0.48 (0.03) | -2.35 (0.22) | -43 (4) | -131 (26) | -0.32 (0.01) | -0.33 (0.01) | -0.38 (0.04) | -0.24 (0.01) | -0.65 (0.11) | -0.28 (0.03) | |
boston | 506 | 13 | -0.56 (0.00) | 0.02 (0.00) | 0.07 (0.00) | 0.03 (0.01) | 0.14 (0.00) | -2.70 (0.70) | -21 (5) | -0.04 (0.00) | -0.07 (0.00) | 0.06 (0.00) | 0.14 (0.00) | -0.04 (0.01) | 0.17 (0.00) | ||
forest | 517 | 12 | -1.36 (0.00) | -47 (15) | -1.38 (0.00) | -1.37 (0.00) | -1.37 (0.00) | -1.75 (0.02) | -3.24 (0.48) | -43 (5) | -0.99 (0.00) | -1.00 (0.00) | -0.91 (0.00) | -1.02 (0.02) | -0.92 (0.01) | -0.91 (0.01) | |
stock | 536 | 11 | -0.16 (0.00) | -0.22 (0.00) | -0.29 (0.00) | -0.26 (0.00) | -0.22 (0.01) | -2.46 (0.41) | -97 (13) | -0.22 (0.00) | -0.22 (0.00) | -0.29 (0.00) | -0.19 (0.00) | -0.23 (0.01) | -0.36 (0.02) | ||
pendulum | 630 | 9 | -1.16 (0.00) | -0.14 (0.00) | 0.22 (0.01) | 0.12 (0.08) | -0.11 (0.01) | 0.33 (0.06) | -5.43 (1.52) | -0.14 (0.00) | -0.14 (0.00) | 0.22 (0.01) | 0.33 (0.08) | 0.25 (0.00) | 0.20 (0.03) | ||
energy | 768 | 8 | -0.06 (0.00) | -0.44 (0.31) | 1.71 (0.00) | 1.85 (0.00) | 2.07 (0.01) | 1.75 (0.00) | 1.97 (0.01) | 1.12 (0.33) | 1.86 (0.01) | 1.92 (0.01) | 2.00 (0.04) | 1.96 (0.01) | 2.07 (0.01) | 2.28 (0.01) | |
concrete | 1030 | 8 | -0.97 (0.00) | -26 (10) | -0.43 (0.00) | -0.45 (0.00) | -0.35 (0.02) | -0.33 (0.01) | -0.88 (0.38) | -1.85 (1.01) | -0.32 (0.00) | -0.32 (0.00) | -0.35 (0.00) | -0.21 (0.00) | -0.22 (0.01) | -0.12 (0.00) | |
solar | 1066 | 10 | -1.54 (0.03) | 0.71 (0.02) | 0.26 (0.34) | -1.75 (0.08) | -1.21 (0.02) | -1.20 (0.03) | -1.56 (0.05) | -1.58 (0.08) | -1.65 (0.06) | 0.04 (0.07) | 0.07 (0.01) | 0.54 (0.01) | 0.22 (0.01) | 0.54 (0.01) | 0.20 (0.02) |
airfoil | 1503 | 5 | -1.11 (0.00) | -0.64 (0.32) | -0.79 (0.05) | 0.08 (0.02) | 0.14 (0.03) | -0.90 (0.05) | -0.19 (0.20) | 0.32 (0.09) | -0.44 (0.03) | -0.36 (0.01) | 0.07 (0.00) | 0.30 (0.00) | -0.02 (0.03) | 0.34 (0.01) | |
winered | 1599 | 11 | -1.14 (0.00) | -9.64 (1.40) | -1.86 (0.85) | -1.09 (0.00) | -1.11 (0.00) | -1.08 (0.00) | -1.11 (0.00) | -1.09 (0.01) | -1.63 (0.09) | -1.07 (0.00) | -1.06 (0.00) | -1.10 (0.00) | -0.84 (0.01) | -1.06 (0.03) | -1.31 (0.40) |
gas | 2565 | 128 | 0.23 (0.00) | -241 (77) | 1.07 (0.00) | 1.60 (0.05) | 1.69 (0.05) | 0.93 (0.01) | 1.20 (0.02) | 1.46 (0.05) | 1.69 (0.08) | 1.56 (0.06) | 0.70 (0.10) | 1.57 (0.16) | 1.30 (0.14) | 1.61 (0.12) | |
skillcraft | 3338 | 19 | -0.95 (0.00) | -5.80 (0.29) | -0.94 (0.00) | -0.94 (0.00) | -0.94 (0.00) | -0.95 (0.00) | -0.99 (0.02) | -1.10 (0.03) | -0.91 (0.00) | -0.91 (0.00) | -0.92 (0.00) | -0.93 (0.00) | -0.92 (0.00) | -0.94 (0.00) | |
sml | 4137 | 26 | 0.32 (0.00) | -3.79 (0.87) | 1.53 (0.00) | 1.72 (0.01) | 1.83 (0.01) | 1.46 (0.00) | 1.71 (0.04) | 1.93 (0.01) | 1.52 (0.00) | 1.52 (0.00) | 1.79 (0.00) | 1.91 (0.00) | 1.92 (0.01) | 1.97 (0.05) | |
winewhite | 4898 | 11 | -1.22 (0.00) | -1.37 (0.04) | -1.14 (0.00) | -1.14 (0.00) | -1.14 (0.00) | -1.13 (0.00) | -1.13 (0.00) | -1.14 (0.01) | -1.13 (0.00) | -1.13 (0.00) | -1.13 (0.00) | -1.10 (0.00) | -1.13 (0.00) | -1.09 (0.00) | |
parkinsons | 5875 | 20 | -1.30 (0.00) | -1.14 (0.56) | 1.99 (0.00) | 2.61 (0.01) | 2.75 (0.02) | 1.87 (0.02) | 2.38 (0.05) | 2.41 (0.04) | 1.79 (0.02) | 1.82 (0.02) | 2.36 (0.02) | 2.76 (0.02) | 2.71 (0.07) | 3.12 (0.05) | |
kin8nm | 8192 | 8 | -1.12 (0.00) | -0.01 (0.01) | -0.29 (0.00) | -0.01 (0.00) | -0.00 (0.00) | -0.30 (0.00) | -0.03 (0.01) | -0.00 (0.00) | -0.29 (0.00) | -0.29 (0.00) | -0.02 (0.00) | -0.00 (0.00) | -0.00 (0.00) | 0.03 (0.00) | |
pumadyn32nm | 8192 | 32 | -1.44 (0.00) | -2.10 (0.09) | 0.08 (0.00) | 0.11 (0.01) | 0.11 (0.00) | 0.11 (0.00) | 0.11 (0.00) | 0.10 (0.00) | -1.44 (0.00) | -1.44 (0.00) | 0.10 (0.01) | -0.62 (0.25) | 0.11 (0.00) | 0.11 (0.00) | |
power | 9568 | 4 | -0.43 (0.01) | -0.44 (0.08) | -2.14 (0.06) | -0.65 (0.04) | -0.75 (0.03) | -0.80 (0.02) | -0.58 (0.03) | -1.41 (0.00) | -1.41 (0.00) | -0.39 (0.04) | -0.23 (0.02) | -0.36 (0.04) | -0.28 (0.05) | -0.25 (0.06) | -0.11 (0.06) |
naval | 11934 | 14 | -0.63 (0.00) | 3.07 (0.08) | 3.39 (0.12) | 4.52 (0.02) | 4.43 (0.03) | 4.35 (0.03) | 3.17 (0.03) | -1.41 (0.00) | -1.41 (0.00) | 4.19 (0.01) | 4.24 (0.01) | 4.36 (0.01) | 4.52 (0.02) | 4.27 (0.02) | 4.41 (0.02) |
pol | 15000 | 26 | -1.10 (0.00) | -0.05 (0.08) | -411 (19) | 0.48 (0.00) | 1.51 (0.01) | 1.45 (0.01) | 0.46 (0.00) | 1.32 (0.03) | 1.31 (0.07) | 0.37 (0.08) | -0.50 (0.00) | 2.34 (0.02) | 2.63 (0.01) | 1.47 (0.01) | 2.72 (0.10) |
elevators | 16599 | 18 | -0.74 (0.00) | -0.28 (0.00) | -33 (1) | -0.44 (0.00) | -0.41 (0.01) | -0.40 (0.00) | -0.46 (0.00) | -0.40 (0.01) | -0.39 (0.01) | -0.37 (0.00) | -0.36 (0.00) | -0.29 (0.00) | -0.27 (0.00) | -0.28 (0.00) | -0.27 (0.00) |
bike | 17379 | 17 | -0.76 (0.00) | 3.90 (0.01) | 4.47 (0.02) | 0.82 (0.01) | 3.49 (0.01) | 3.68 (0.01) | 1.06 (0.01) | 1.21 (0.95) | 2.77 (0.10) | 2.48 (0.01) | 2.66 (0.01) | 3.48 (0.00) | 3.75 (0.01) | 3.72 (0.01) | 3.95 (0.01) |
kin40k | 40000 | 8 | -1.43 (0.00) | 0.65 (0.01) | 1.24 (0.13) | 0.02 (0.00) | 0.84 (0.00) | 1.17 (0.00) | -0.02 (0.00) | -1.43 (0.00) | -1.43 (0.00) | 0.04 (0.00) | 0.05 (0.00) | 0.87 (0.01) | 0.93 (0.00) | 1.15 (0.00) | 1.27 (0.00) |
protein | 45730 | 9 | -1.25 (0.00) | -0.78 (0.02) | -0.57 (0.00) | -1.06 (0.00) | -0.98 (0.00) | -0.95 (0.00) | -1.08 (0.00) | -1.10 (0.07) | -0.95 (0.00) | -0.85 (0.00) | -0.80 (0.00) | -0.70 (0.00) | -0.61 (0.00) | -0.68 (0.00) | -0.57 (0.00) |
tamielectric | 45781 | 3 | -1.43 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.43 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) | -1.31 (0.00) |
keggdirected | 48827 | 20 | -0.34 (0.15) | 0.65 (0.32) | 1.59 (0.14) | 0.16 (0.01) | 0.21 (0.01) | 0.26 (0.02) | 0.26 (0.02) | 0.45 (0.02) | 0.53 (0.04) | 1.24 (0.06) | 2.03 (0.01) | 1.84 (0.05) | 2.23 (0.01) | 1.92 (0.02) | 2.26 (0.01) |
slice | 53500 | 385 | -0.47 (0.00) | -27 (1) | -358 (15) | 0.86 (0.00) | 1.80 (0.00) | 1.86 (0.00) | 1.02 (0.00) | 1.84 (0.02) | 1.93 (0.01) | 0.85 (0.00) | 0.90 (0.00) | 1.80 (0.00) | 1.88 (0.00) | 1.86 (0.00) | 2.02 (0.00) |
keggundirected | 63608 | 27 | 0.28 (0.00) | -344 (39) | 0.06 (0.01) | 0.06 (0.01) | 0.07 (0.01) | -0.03 (0.02) | -0.14 (0.23) | 0.10 (0.01) | -75 (4) | -4.21 (0.33) | -64 (6) | 2.37 (0.29) | -116 (17) | 2.98 (0.21) | |
3droad | 434874 | 3 | -1.41 (0.00) | -0.85 (0.00) | -0.53 (0.01) | -0.79 (0.00) | -0.61 (0.01) | -0.58 (0.01) | -1.15 (0.06) | -1.42 (0.00) | -1.42 (0.00) | -0.71 (0.00) | -0.65 (0.00) | -0.69 (0.00) | -0.55 (0.01) | -0.61 (0.00) | -0.48 (0.00) |
song | 515345 | 90 | -1.28 (0.00) | -1.11 (0.00) | -1.09 (0.00) | -1.19 (0.00) | -1.18 (0.00) | -1.18 (0.00) | -1.19 (0.00) | -1.17 (0.00) | -1.17 (0.00) | -1.15 (0.00) | -1.14 (0.00) | -1.13 (0.00) | -1.10 (0.00) | -1.12 (0.00) | -1.07 (0.00) |
buzz | 583250 | 77 | -1.56 (0.04) | 0.01 (0.00) | 0.07 (0.00) | -0.24 (0.00) | -0.15 (0.00) | -0.15 (0.00) | -0.92 (0.02) | -0.48 (0.01) | -0.43 (0.03) | -0.22 (0.01) | -0.23 (0.01) | -0.09 (0.01) | -0.04 (0.03) | -0.09 (0.01) | 0.04 (0.00) |
nytaxi | 1420068 | 8 | -1.53 (0.01) | -1.64 (0.04) | -1.90 (0.04) | -1.42 (0.01) | -1.44 (0.01) | -1.42 (0.01) | -1.46 (0.01) | -1.78 (0.00) | -1.77 (0.00) | -1.09 (0.03) | -0.90 (0.04) | -1.57 (0.04) | -1.03 (0.04) | -1.74 (0.05) | -0.95 (0.07) |
houseelectric | 2049280 | 11 | -0.52 (0.00) | 1.41 (0.00) | 1.48 (0.01) | 1.37 (0.00) | 1.41 (0.00) | 1.47 (0.01) | 0.01 (0.04) | -1.43 (0.00) | -1.43 (0.00) | 1.65 (0.03) | 1.82 (0.01) | 1.66 (0.05) | 2.01 (0.00) | 1.77 (0.06) | 2.03 (0.00) |
Median difference from GP baseline | -0.38 | -2.06 | -6016.67 | 0 | 0.06 | 0.09 | -0.02 | -0.02 | -0.28 | 0.04 | 0.05 | 0.12 | 0.26 | 0.18 | 0.32 | ||
Average ranks | 4.99 (0.25) | 5.32 (0.31) | 4.35 (0.38) | 6.80 (0.21) | 8.65 (0.20) | 9.63 (0.20) | 5.57 (0.19) | 5.57 (0.24) | 5.21 (0.28) | 8.94 (0.22) | 9.36 (0.22) | 9.82 (0.17) | 12.04 (0.18) | 10.71 (0.19) | 13.03 (0.20) |
Linear | CVAE 50 | CVAE | GP | GP-GP | GP-GP-GP | GP | GP-GP | GP-GP-GP | LV-GP | LV-GP-GP | LV-GP-GP-GP | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dataset | N | D | VI | VI | VI | VI | VI | VI | SGHMC | SGHMC | SGHMC | VI | IWVI | VI | IWVI | VI | IWVI |
challenger | 23 | 4 | 0.5191 | 0.9990 | 0.8870 | 0.7739 | 0.9823 | 0.9099 | 0.7001 | 0.7115 | 0.8466 | 0.8484 | 0.7187 | ||||
fertility | 100 | 9 | 0.9408 | 0.8353 | 0.9598 | 0.9341 | 0.9928 | 0.9969 | |||||||||
concreteslump | 103 | 7 | 0.9970 | 0.9922 | |||||||||||||
autos | 159 | 25 | 0.9823 | 0.8798 | 0.9903 | 0.9874 | 0.9990 | 0.9990 | |||||||||
servo | 167 | 4 | 0.8358 | 0.9989 | 0.9975 | 0.9990 | 0.9990 | 0.9988 | |||||||||
breastcancer | 194 | 33 | 0.9787 | 0.9814 | 0.9911 | 0.9943 | |||||||||||
machine | 209 | 7 | 0.9181 | 0.8967 | 0.9974 | 0.9989 | |||||||||||
yacht | 308 | 6 | 0.9987 | 0.9989 | 0.9990 | ||||||||||||
autompg | 392 | 7 | 0.9097 | 0.9948 | 0.9989 | 0.9987 | 0.9990 | 0.9988 | 0.9987 | 0.9985 | 0.9988 | ||||||
boston | 506 | 13 | 0.9916 | 0.9989 | 0.9989 | 0.9979 | 0.9989 | 0.9980 | 0.9981 | ||||||||
forest | 517 | 12 | 0.8453 | 0.7145 | 0.9981 | 0.9964 | 0.7034 | 0.7074 | 0.7470 | 0.7398 | 0.7608 | 0.7642 | |||||
stock | 536 | 11 | |||||||||||||||
pendulum | 630 | 9 | 0.9989 | 0.9988 | |||||||||||||
energy | 768 | 8 | 0.9987 | 0.9990 | |||||||||||||
concrete | 1030 | 8 | 0.9986 | 0.9782 | 0.9990 | 0.9989 | |||||||||||
solar | 1066 | 10 | 0.3028 | 0.3228 | 0.3380 | 0.3509 | 0.2728 | 0.2945 | 0.2557 | 0.2883 | |||||||
airfoil | 1503 | 5 | 0.9989 | 0.9986 | 0.9988 | 0.9988 | |||||||||||
winered | 1599 | 11 | 0.9790 | 0.6980 | 0.9234 | 0.9751 | 0.8473 | ||||||||||
gas | 2565 | 128 | 0.9984 | 0.9983 | 0.9935 | 0.9989 | 0.9983 | 0.9990 | 0.9986 | ||||||||
skillcraft | 3338 | 19 | 0.9989 | 0.9983 | 0.9989 | 0.9988 | 0.9990 | 0.9989 | |||||||||
sml | 4137 | 26 | |||||||||||||||
winewhite | 4898 | 11 | 0.9990 | 0.9990 | |||||||||||||
parkinsons | 5875 | 20 | 0.9989 | ||||||||||||||
kin8nm | 8192 | 8 | |||||||||||||||
pumadyn32nm | 8192 | 32 | 0.9770 | ||||||||||||||
power | 9568 | 4 | 0.9964 | 0.9963 | 0.9926 | 0.9964 | 0.9852 | 0.9963 | 0.9855 | ||||||||
naval | 11934 | 14 | 0.9649 | 0.9332 | |||||||||||||
pol | 15000 | 26 | 0.9970 | 0.8482 | 0.7214 | 0.9839 | 0.9508 | 0.9980 | |||||||||
elevators | 16599 | 18 | 0.9979 | 0.9986 | 0.9977 | 0.9987 | 0.9976 | 0.9988 | 0.9979 | ||||||||
bike | 17379 | 17 | 0.9815 | 0.9657 | 0.9337 | 0.9990 | 0.9990 | ||||||||||
kin40k | 40000 | 8 | 0.9990 | ||||||||||||||
protein | 45730 | 9 | 0.8839 | 0.8800 | 0.9117 | 0.9041 | 0.8904 | 0.8761 | 0.8840 | 0.8691 | |||||||
tamielectric | 45781 | 3 | 0.9551 | 0.9502 | 0.9558 | 0.9562 | 0.9550 | 0.9549 | 0.9551 | 0.9549 | |||||||
keggdirected | 48827 | 20 | 0.8279 | 0.5350 | 0.7498 | 0.5445 | 0.6839 | 0.5187 | 0.7039 | 0.5001 | |||||||
slice | 53500 | 385 | 0.9990 | 0.9983 | 0.9988 | 0.9988 | |||||||||||
keggundirected | 63608 | 27 | 0.4804 | 0.9882 | 0.5287 | 0.3281 | 0.5511 | 0.2049 | 0.9295 | 0.2322 | |||||||
3droad | 434874 | 3 | 0.9421 | 0.9479 | 0.9432 | 0.9271 | 0.9650 | 0.9467 | 0.9744 | 0.9464 | |||||||
song | 515345 | 90 | 0.9936 | 0.9957 | 0.9983 | 0.9932 | 0.9969 | 0.9893 | 0.9963 | 0.9835 | |||||||
buzz | 583250 | 77 | 0.9952 | 0.9909 | 0.9988 | 0.9938 | 0.9984 | 0.9838 | |||||||||
nytaxi | 1420068 | 8 | 0.9608 | 0.9422 | 0.9698 | 0.9274 | 0.9515 | 0.9019 | 0.9467 | 0.8981 | |||||||
houseelectric | 2049280 | 11 | 0.9830 | 0.9676 | 0.9962 | 0.9671 | 0.9933 | 0.9743 |
[Figure: posterior marginal density plots for the CVAE, CVAE, LV-GP (IW), LV-GP (IW), and LV-GP-GP (IW) models on the forest, solar, power, and keggdirected datasets.]

[Figure: posterior marginal density plots for the same models on the keggundirected, houseelectric, pol, and bike datasets.]