Moment-matching approximation of deep GP
For many problems, relevant data are plentiful but explicit knowledge is not. Predictions about target variables may be informed by data sources that are noisy but plentiful, or data which the target variable is merely some function of. Intrepretable and flexible machine learning methods capable of fusing data across sources are lacking. We generalize the Deep Gaussian Processes so that GPs in intermediate layers can represent the posterior distribution summarizing the data from a related source. We model the prior-posterior stacking DGP with a single GP. The exact second moment of DGP is calculated analytically, and is taken as the kernel function for GP. The result is a kernel that captures effective correlation through function composition, reflects the structure of the observations from other data sources, and can be used to inform prediction based on limited direct observations. Therefore, the approximation of prior-posterior DGP can be considered a novel kernel composition which blends the kernels in different layers and have explicit dependence on the data. We consider two synthetic multi-source prediction problems: a) predicting a target variable that is merely a function of the source data and b) predicting noise-free data using a kernel trained on noisy data. Our method produces better prediction and tighter uncertainty on the synthetic data when comparing with standard GP and other DGP method, suggesting that our data-informed approximate DGPs are a powerful tool for integrating data across sources.READ FULL TEXT VIEW PDF
Moment-matching approximation of deep GP
Gaussian Process (GP) (Rasmussen and Williams, 2006)
is a flexible model that imposes a prior distribution over continuous functions in Bayesian framework. For regression tasks, the ability to estimate uncertainty is the main advantage of GP over the deterministic models such as deep neural networks. This advantage originates in the assumption that the finite set of latent random variables relating to both observed and unobserved data are collectively subject to the multivariate Gaussian distribution. As such, one can elegantly formulate prediction and calibrate uncertainty within the Bayesian framework.
Two main weakness of GP are the need to choose a kernel and the lack of explicit dependence on the observations in the predictive covariance. The kernel function encodes the similarity between data and , and hence the prior distribution over functions. Typically, this must be chosen by the modeler, presumably based on knowledge of the domain. But explicit knowledge is sometimes not available, and even when it is, it can be challenging to translate knowledge into an appropriate choice of kernel. The second weakness is the lack of explicit dependence on the observations, ’s, in the predictive covariance. In standard GP, uncertainty about a predicted value is completely determined by closeness of prior observations to in the input space, with no influence of the actual predicted values for those prior observations. This left unjustified in most research on GPs, presumably because there are not ready methods to incorporate such a dependence.
Research has attempted to ameliorate limitations of existing kernels through composition. For example, basic operations including addition, multiplication, and convolution allow composite kernels that are more flexible then the kernels they combine (Rasmussen and Williams, 2006; Duvenaud et al., 2013). More recently, research has investigated composition of kernels. For example, one may pass the input through some deterministic transformation, , which then generates the new kernel, , with new properties within the the standard GP framework Hinton and Salakhutdinov (2008). More generally, one may pass the input through a probabilistic transformation, for example taking the output of one GP as input into another, which gives rise to Deep GPs. Through the connection to kernel methods and Deep neural networks (Wilson et al., 2016; Raissi and Karniadakis, 2016), Deep Gaussian Processes (DGPs) (Damianou and Lawrence, 2013) and warped GPs (Snelson et al., 2004; Lázaro-Gredilla, 2012) can be regarded as a probabilistic generalization of kernel composition. Indeed, the Deep GP prior is non-Gaussian, consistent with increased expressiveness as compared to GPs, which leads to challenges for inference. Although increased expressiveness is useful in allowing application across domains, it is not the same as incorporating knowledge about a domain, which enables strong inference from limited observations.
A less obvious benefit from DGP is that the stacking structure is suitable for fusing the collection of data coming from different sources before making prediction (Perdikaris et al., 2017; Cutajar et al., 2019). For example, one may consider the regression task in which the accurate observations are rare but very close to the ground truth while there is large amount of less precise data available. Ideally, the plentiful data could be used to produce more informative prior while the more precise data could help reduce the uncertainty. More generally, one may consider a regression task in which the desired observations are rare, and one has access to large amount of relevant data. Relevant data may inform or constrain inference without being sampled from the target function itself. By training on this relevant data, one could construct a domain specific kernel that leverages very general knowledge about the domain—that two variables are related—and the kernel would then encode this knowledge in a way that could inform predictions of new variables based on limited observations.
In this paper, we propose a novel DGP structure in which the hidden layer connecting the input emits the random function sampled from the posterior distribution summarizing the data from the source variable. The next hidden layer then emits random function sampled from standard GP, which connects with the output. To tackle the intractability of inference, we follow the method of moments. Thus, we approximate our prior-posterior DGP with GP Lu et al. (2019). We shall show that the present approach can address the two weakness of standard GP mentioned above. The effective kernel captures long-range and multi-scale correlation via marginalized composition, and it also obtains explicit dependence of the observations through posterior mean. In comparison with directly applying GP on union of noisy and noise-free data, our multi-source GP shows reduced uncertainty and greater accuracy when predicting based on limited data. Experiments on synthetic multi-fidelity data also present better accuracy and tighter uncertainty in comparison with other DGP model and direct application of RBF-GP GPy (2012) with the rare data.
The machine learning tasks with data from multiple sources have been studied in the contexts of multi-fidelity learning (Kennedy and O’Hagan, 2000; Raissi and Karniadakis, 2016; Perdikaris et al., 2017; Cutajar et al., 2019) and multi-output learning (Álvarez and Lawrence, 2011; Moreno-Muñoz et al., 2018; Kaiser et al., 2018; Kazlauskaite et al., 2019; Janati et al., 2019). The kernel learning appeared in the literature of deep neural networks (Neal, 2012; Williams, 1997; Cho and Saul, 2009; Poole et al., 2016; Lee et al., 2017; Daniely et al., 2016; Agrawal et al., 2020) and neural tangent kernel (Jacot et al., 2018; Karakida et al., 2019; Yang, 2019). Hinton and Salakhutdinov (2008); Wilson et al. (2016); Raissi and Karniadakis (2016) demonstrated learning of the composition in kernel function through deep models.
The approach in our paper is different from the multi-fidelity DGP (Cutajar et al., 2019) which relies on inducing points (Snelson and Ghahramani, 2006; Titsias, 2009; Titsias and Lawrence, 2010) and variational inference Salimbeni and Deisenroth (2017). In the prediction stage, Cutajar et al. (2019) relies on the Monte Carlo sampling but our method does not.
Student-t process is the only stochastic model known, to our knowledge, to possess the capacity of outputting the predictive covariance which depends on the observations Shah et al. (2014).
Gaussian Process (GP) is the continuum generalization of multivariate Gaussian distribution over a set of random variables denoted by
. As the joint distribution
is specified by the mean vectorof size and kernel matrix of size , the distribution over the function is labeled as where, due to the marginal property of Gaussian distribution, the mean function and kernel function
generalize their counterparts in discrete case. GP regression is related to kernel ridge regression. One can regard the set of kernel functionsevaluated at the training input
as a basis in standard linear regression with regularization(Steinke and Schölkopf, 2008). For the zero-mean case, the conditional distribution associated with the function value at new input is still a Gaussian,
, with the predictive mean and variance,
respectively. The above can be easily adapted to connect with noisy observations
, and the hyperparameters in the kernel are determined by optimizing the marginal likelihood for the training observations.
The kernel function is key to the properties of function sampled from GP. Squared exponential kernel is a common model for smooth functions whereas the non-stationary Brownian motion kernel can generate stochastic non-differentiable continuous functions. New kernels can be generated by a simple trick in which one composes a deterministic transformation with a standard GP kernel. For example, passing the data through before an SE kernel allows functions of multi-length scale.
Deep Gaussian Process (DGP) Damianou and Lawrence (2013) generalizes GP the composition trick by passing the input through a GP instead of a deterministic transformation. Marginalizing the random transformation in the latent layers, the DGP prior distribution reads,
where the intractable marginalization arises because the random variables appear in the inverse of kernel matrix in the GP connecting to final layer . Deeper models can be straightforwardly obtained by inserting more latent layers in Equation 3.
The variational inference method Hensman et al. (2013); Salimbeni and Deisenroth (2017) tackles the intractable marginalization by introducing inducing inputs in each layer and assuming the corresponding function values are from some tractable distribution. Although the approximate inference is effective, the approximate posterior distributions over these latent functions did not capture some non-Gaussian properties such as the multi-modalness Havasi et al. (2018); Lu et al. (2019).
Although the form of distribution in Equation 3 is not known exactly due to the intractable marginalization over latent function layers, Lu et al. (2019) studied the second and fourth moments of the DGP models where squared exponential and squared cosine covariance functions are employed in the final layer connecting to observations. Their study suggested that the two-layer DGP distribution is heavier-tailed and the joint distribution has the invariant symmetry under if both conditional distributions in Equation 3 are zero-mean. Such zero-mean assumption is justified because the latent function variables, e.g. , are not connected with any observation.
In the context of multi-source data regression, GPs are very useful for modeling the general relation between mutually dependent observations from different sources (Kaiser et al., 2018; Cutajar et al., 2019). Instead of selecting some predetermined mean function for intermediate GP in the warped GP approach (Snelson et al., 2004; Kaiser et al., 2018), we follow the DGP structure and direct part of the observations to connect with the latent function (Cutajar et al., 2019) so that the conditional distribution in Equation 3 can be posterior distribution given the partial data. More explicitly, referring to Fig. 1, we consider the collection of plentiful and rare data denoted by and , and extend the DGP model to take into account that the latent layer represents the random functions from the posterior distribution given ,
The DGP distribution in Equation 4 can be considered as prior over the composite functions taking into account the partial data. For convenience, we extend the notation in Lu et al. (2019) to represent the two-layer DGP. For instance, SE[NN] stands for the structure associated with Equation 4 where the second layer connecting with output is zero-mean GP with SE kernel function while the first layer connecting with input is posterior distribution given and kernel function is the neural network (NN) covariance function (Williams, 1997). In the followings, we shall extract the second and fourth moments of two families of DGP models, SE and SC where the first layer GP can use arbitrary kernel function.
Here we shall evaluate the expectation values and associated with the DGP distribution in Equation 4. For the second moment, the marginal property of Gaussian and the integration with respect to the exposed terms ’s allow us not to confront with marginalizing out the ’s being in the inverse of covariance matrix in . Instead, only the tractable integral one needs to perform is . It is convenient to write the square in quadratic form, where the notation for the two-entry column vector and the two-by-two matrix have been used. Therefore, the SE covariance function in the layer connecting with output now reads,
The marginal property of Gaussian again simplifies the integration into
where the two-entry posterior mean vector
and the two-by-two covariance matrix,
are obtained from the posterior mean and covariance matrix, respectively, given the partial data .
The second moment of SE in Equation 6 can be shown to have the following form,
where the symbol represents the determinant of the matrix .
Without loss of generality, we assume the hyperparameters in the output layer. The analytic form for the second moment is given in (Lu et al., 2019),
where the matrix . It can be shown that the involved 2-by-2 matrices satisfy the following identity,
with which we can proceed to show and the exponent in above reads . ∎
The second moment captures the effective covariance generated through the marginalized composition in DGP distribution. Unlike the discussions in (Lu et al., 2019) where the GPs in latent layers are zero-mean, the presence of Gaussian posterior in DGP allows the effective covariance function in Equation 7 to depend on the posterior mean given the partial data, which is a novel method of data driven composition of deep multi-scale kernels. The factor reproduces the kernel for zero-mean case in which the length scales in and give rise to multi-scale correlation. Another novelty of this kernel is that the difference between the posterior mean from first GP appears in the second factor allows the partial training observations to enter the covariance function and the ensuing predictive covariance. Standard GPs does not have the capacity of providing observation-dependent predictive covariance (Shah et al., 2014). In addition, the exponent of this factor is scaled by input-dependent length scale . Lastly, one can collapse the first GP into a deterministic mapping, i.e. , by setting the signal magnitude , which leads to a transformed SE kernel function .
The statistics of DGP are not solely determined by the second and first moments, however. We can show that the fourth moment of interest can be derived in a similar manner. The calculation of higher moments for multivariate Gaussian distribution is faciliated by the theorem in (Isserlis, 1918) with which the fourth moment with the summation being over all three distinct ways decomposing the quartet into two doublets (for instance and ).
The fourth moment of SE is given by the sum over distinct doublet decomposition and corresponding expectation values,
with and the expressions,
We start with rewriting the product of the covariance function where the row vector and the matrix
The above zeros stand for 2-by-2 zero matrices in the off-diagonal blocks. The procedure of obtaining expectation value with respect to the 4-variable multivariate Gaussian distribution is similar to the previous one in obtaining the second moment. Namely,
in which the calculation of inverse of 4-by-4 matrix and its determinant is quite tedious but tractable. ∎
In the present case, the second GP in Figure 1 is also zero-mean but employs the squared cosine kernel function, i.e. . After writing , the same trick can be applied to obtain the second moment of SC DGP.
The second moment of SC DGP distribution is
where the symbol represents the posterior covariance from the first-layer GP.
The above follows from the expectation value identity along with the identification of vector . ∎
Similarly, the fourth moment of SC also has closed form in the following lemma.
The fourth moment is a sum over all three partition of quartet into pair of doublets, and the relation with the second moment is given by,
The proof can be found in Appendix.
One can determine whether a univariate distribution is heavy-tailed or light-tailed by the sign of its excessive kurtosis, and Gaussian distribution corresponds to zero. The fourth moments along with the second moments in the zero-mean two-layer DGP studied inLu et al. (2019) suggest that both the SE and SC compositions result in heavy-tailed statistics. Nonzero posterior mean in the present two-layer DGP enriches the statistical property. First, in both SE and SC compositions, the present DGP can be a GP if the signal in the first layer, which results in zero covariance and . Moreover, one can show that the fourth moment, for instance, , which is signature of multivariate Gaussian distribution.
However, the presence of mean function in second and fourth moments complicates the consideration for general cases. We first consider the SC composition and the simpler fourth moment . Since we are interested in approximating DGP with GP, the fourth moment in GP approximation shall read . The true fourth moment, compared with the GP approximation, contains additional contribution proportional to,
where we denote and . Therefore, this fourth moment is underestimated if one approximates this DGP with GP. In general, one could either prove that the most general fourth moment is always underestimated given any posterior mean function, or one could prove the opposite by finding a posterior mean such that the true fourth moment is smaller than the one obtained from the approximate GP. As for the SE DGP, the situation is even more complicated as the posterior mean and covariance both appear in the exponent in Equation 7 and 9, which makes the demonstration even more challenging.
The previous derivations show construction of effective kernel function in the posterior-prior stacking DGP, SE (Equation 7) and SC (Equation 10). The kernel functions depend on the input ’s implicitly through the posterior mean and covariance given data . One has the freedom to include explicit dependence of input by multiplying with SE kernel Duvenaud et al. (2014),
Given a set of rare observation , the multi-source regression task is to infer the function, which has following functional dependence,
and the hidden function can be inferred from the plentiful data . For instance, could represent the amount of sun exposure while could represent temperature at an array of locations. The hierarchy of two-level inference suits with the prior-posterior DGP shown in Figure 1. Moreover, marginalization over the hidden function
is the Bayesian spirit, instead of selecting the most probable
. As shown in previous section, our approach is to approximate the DGP with a GP in which the employed kernel function represents the nontrivial correlation and dependence on plentiful data out of the intractable marginalization over the hidden function. In passing, we note that the non-Gaussian character not captured by this approximation is related to whether outlier samples should appear more (heavy-tailed) or less (light-tailed) frequently than the multivariate Gaussian distribution can generate. Below, we shall describe our approximate inference with the effective GP model shown in Figure2.
The first objective is to infer the hidden function from . Namely, we are interested in and evaluated at and
, respectively, and they follow the joint multivariate normal distribution,
given the information from the plentiful data. Consequently, the posterior distribution can be obtained by standard GP procedure. The relevant information regarding within the posterior mean and and covariance can lead to the kernel matrix in Figure 2 at input and . Then optimization over the evidence for results in the corresponding hyperparameters for the effective GP. Algorithm 1 lists the summary of calculation procedure for multi-source DGP kernel learning with the rare data.
Here, we shall demonstrate the results from our multi-source GP inference with synthetic rare and plentiful data. There are two demonstrations of data here. In the first one, the relation between the two sources of data is fixed but implicit. This type of problem is the same as that studied in multi-fidelity simulation Perdikaris et al. (2017); Cutajar et al. (2019)
. On the other hand, in the second type of problem the relation between noisy data and noise-free data is not through a fixed mapping. Thus, it is interesting to note that the multi-source GP inference may be analogous with the denoising autoencoderAlain and Bengio (2014); Bengio et al. (2013).
We first show the example with the plentiful data generated from (30 data) and the rare data from (10 data). We apply the standard GPy regression package with RBF kernel GPy (2012) to infer the distribution over hidden function from plentiful data . With the predictive mean and covariance matrix , we can obtain the effective kernel matrix via Equation 7. The left top panel in Figure 3 displays the heatmap for the corresponding pathology kernel in Equation 12 evaluated at the grid of 30 points between . The left bottom panel in Figure 3 shows the sample functions generated from the effective GP as a prior using the pathology kernel learned from the plentiful data. Here, we remark that sampling function from DGP prior models is non-trivial as we did not find many instances in the literature except (Duvenaud et al., 2014; Dunlop et al., 2018). In the end, the regression with the rare data can be done with optimizing the evidence of . In the present experiment, we set the in the pathology kernel and employ grid search for the optimal values for and . The panel (a) in Figure 4 shows the regression result where the predictive uncertainty substantially improves upon the result from directly applying GP with the rare data, which can be found in the Supplemental Material (Fig.6 right panel). In particular, one may note that there is no data for in the region but our prediction mean is still close to the truth and the uncertainty is small. The result from direct application of standard GP sees similar predictive mean but the uncertainty is very large for .
For comparison with the simulation results in (Cutajar et al., 2019), we follow the nonlinear-A example there and generate 30 plentiful data from the function and 10 rare data from . Learning from the plentiful data leads to the effective pathology kernel in the right top panel in Figure 3. Apparently, the learned covariance has a very different structure than that learned from the previous case, which shows the flexibility of our kernel learning. Two random functions are sampled from the corresponding GP and are displayed in the right bottom panel. It can be seen that the sampled functions possess the length scale learned from the low-fidelity function, and additional length scales which may originate from the deep model can be found too. Finally, we use the rare data for our effective GP regression, and the prediction is shown in the panel (b) of Figure 4. In the present simulation, there are 30 random data as and 10 as , and the corresponding result using the code of Cutajar et al. (2019) can be found in panel (c) of Figure 4. The two methods show similar accuracy and tight uncertainty near the region where the rare data is available. In the region where no rare data is seen, on the other hand, our method demonstrates more accurate prediction (black curve) and tighter uncertain region between the blue dashed curves.
Unlike the above cases where the plentiful data has a fixed relation with the rare data, the noise-free data can not be obtained by sending the noisy data through some predetermined transformation. Nevertheless, if we regard the marginalization over in Equation 13 as a kind of model averaging Goodfellow et al. (2016), we shall expect tighter variance from the present regression algorithm.
In figure 5, we demonstrate the regression result with 40 noisy and random data in from the high-fidelity function in the nonlinear-A example, i,e, with , and 10 noise-free data from the same function. The black cross symbols represent the noisy data while the red circles for the noise-free ones. The light-blue curve and the shadow region represent the predictive mean and uncertainty from applying RBF-GP to the union of both data. The relatively large noise and the lack of data in some regions result in significant uncertainty around the prediction. The predictive mean and uncertainty are presented by the black curve and blue dashed curves, respectively. The predictions from both methods are similar, but the uncertainty in our method is much tighter, which is expected from the model averaging perspective.
In this paper we propose a novel kernel learning inspired from the multi-source Deep Gaussian Process structure. Our approach addresses two limitations of prior research on GPs: the need to choose a kernel and the lack of explicit dependence on the observations in the predictive covariance. We resolve limitations associated with reliance on experts to choose kernels, introducing new data-dependent kernels together with effective approximate inference. Our results show that the method is effective, and we prove that our moment-matching approximation retains some of multi-scale and long-ranged correlation that are characteristic of deep models. We resolve limitations associated with lack of explicit dependence on observations in the predictive covariance by introducing the data-driven kernel. Our results show the benefits of joint dependence on the input and the predicted variables in reduced uncertainty in regions with sparse observations (e.g. Figure 4).
Central to the allure of Bayesian methods, including Gaussian Processes, is the ability to calibrate model uncertainty through marginalization over hidden variables. The power and promise of deep GP is in allowing rich composition of functions while maintaining the Bayesian character of inference over unobserved functions. Whereas most approaches are based on variational approximations for inference and Monte Carlo sampling in prediction stage, our approach uses a moment-based approximation in which deep GP is a analytically approximated with a GP. For both, the full implications of these approximations are unknown. Through analysis of higher moments, we can show that our approach retains some of important signatures of deep models, while avoiding the need for further optimization or sample-based approximation. Continued research is required to understand the full strengths and limitations of each approach.
From the true DGP distribution given in the main text, is can be shown with the theorem in Isserlis (1918) that the fourth moment is the sum,
where the distribution for the quartet is a multivariate Gaussian distribution, specified by its mean vector and covariance matrix ,
We shall focus on the first term in the sum,
which can be expressed as the exponential form,
where there are a total of 4 similar terms involving in the first bracket. In the second bracket, there are also 4 similar terms with different sign combinations. Next we can apply the identity valid for both two-dimension and four-dimension cases. One can show that
where we denote the exponential cross term , and the plus/minus is associated with the sign in the exponent of in . Now we can first collect the terms and obtain the second moment,
where the two-indices symbol . Similarly, the fourth moment is given by,
Comparing with the expression for the second moment, the difference now reads,
Note that and can be positive or negative. The remaining two terms in the sum Eq. (1) follow identical derivation, thus we complete the proof of Lemma 4.
Here we present the comparison with the results obtained by directly applying RBF-GP (GPy package) GPy (2012). We generate randomly 30 points as and 10 points as using np.random.rand function in numpy with random seed of 59. The data (black cross) serves as relevant information to the target knowledge (red line) which is partially known through the rare data (red circle). The predictive mean and variance are displayed with black solid line and blue dashed lines, respectively. We also use the examples from the nonlinear-A (Fig. 7) and nonlinear-B (Fig. 8) cases in Cutajar et al. (2019).
Kernel methods for deep learning.In Advances in neural information processing systems, pages 342–350.