Gaussian process models are widely used in many areas, such as computer experiments, stochastic optimization via simulation, and machine learning. The explosion of interest in big data has brought huge datasets into the spotlight and triggered demand for more sophisticated statistical modeling techniques for large-scale Gaussian processes. When making inference from a large-scale dataset with a Gaussian process model, using maximum likelihood estimation (MLE) for the parameters and the best linear unbiased predictor (BLUP) for prediction, researchers inevitably face computations on a huge covariance matrix. Although these standard methods, the MLE and the BLUP, are statistically more efficient than the alternatives, they require intractable computation on a huge covariance matrix when the Gaussian process model is of large scale. There is therefore increasing demand for approximations to these standard methods that are computationally convenient yet lose little statistical efficiency.
Several methods have been proposed to address this computational problem. One strategy is to replace the covariance matrix with a more easily manipulated structure. For example, covariance tapering was used by Furrer, Genton, and Nychka (2006) and Kaufman, Schervish, and Nychka (2008) to yield a sparse covariance matrix for computational convenience; low-rank models represent the Gaussian process in a lower-dimensional subspace so that only calculations on a much smaller covariance matrix are required (Banerjee et al., 2008; Cressie and Johannesson, 2008; Stein, 2008). Despite its merits, using a simplified covariance matrix for Gaussian processes may lead to unnecessary efficiency loss (for example, covariance tapering ignores long-range dependence), and this strategy also has limited applications.
To retain the benefits of the MLE and the BLUP, more efficient approximations to the full likelihood are needed, and thus another strategy is to use the composite likelihood (Lindsay, 1988; Varin, Reid and Firth, 2011) for parameter estimation and prediction. The composite likelihood was first proposed to address computational issues in scenarios where the full likelihood fails for parameter estimation. Vecchia (1988) proposed the composite conditional likelihood to approximate the full likelihood, which was further developed by Stein, Chi and Welty (2004) to approximate the restricted likelihood, using blocks of observations to improve statistical efficiency. The composite marginal likelihood was also used to approximate the full likelihood by Heagerty and Lele (1998), Curriero and Lele (1999) and Caragea and Smith (2007). A non-negligible source of efficiency loss in these composite likelihood methods is that their component likelihoods may not form an optimal combination for approximating the full likelihood.
In this paper we adopt the composite conditional likelihood for parameter estimation. A key difference between our composite conditional likelihood and previous ones is that we propose a sound combination of well-selected component likelihoods to approximate the full likelihood, according to the chain rule of conditional probability. Another critical difference is that we take into account the dependence among component likelihoods, which is intentionally ignored in conventional composite likelihoods, in order to improve the approximation to the full likelihood. Because of these two critical differences, the statistical efficiency of the proposed composite likelihood increases substantially. The proposed method puts as much data as possible into each component conditional likelihood while still preserving the divide-and-conquer nature of composite likelihood. It is well known that maximum composite likelihood estimates are consistent and asymptotically normal (Lindsay, 1988; Varin, Reid and Firth, 2011) under some regularity conditions on the composite likelihood, similar to those in Mardia and Marshall (1984) under increasing-domain asymptotics.
Recently, the composite likelihood was also used to mitigate the computational burden of prediction in large-scale Gaussian process models by Eidsvik et al. (2014), who developed predictions based on a block composite likelihood. However, this method fails to give optimal weights for the component likelihoods whose product forms the composite likelihood used for prediction, and the composite likelihood framework itself offers no tool for calculating these weights. Instead, we transform the BLUP, which needs the inverse of a huge covariance matrix, into a convex optimization that only requires the inverses of small-scale covariance matrices, accounting for the dependence between conditional densities in order to calculate the optimal weights. We call this method “composite inference”.
This method makes full use of the information by bringing all the block subsets together into the convex optimization. Furthermore, the original covariance matrix, rather than an approximate one, is used for prediction, so the proposed method is more accurate than methods based on approximate covariance matrices. In addition, the prediction method based on the composite likelihood (Eidsvik et al., 2014) turns out to be a sub-solution of the proposed convex optimization. Finally, the proposed method is compatible with parallel computing in both parameter estimation and prediction, making it even more computationally efficient for today’s ever-growing datasets.
The remainder of this paper is organized as follows. Section 2 introduces the notation for Gaussian processes used throughout the paper. The composite likelihood methods are reviewed and the proposed method is developed in Section 3. Numerical examples are given in Section 4. Finally, Section 5 concludes the paper.
2 Gaussian Process
Before introducing the proposed method, we first present a very brief review of Gaussian processes (Santner, Williams, and Notz, 2003). Denote the n distinct input points by x_1, ..., x_n, and the corresponding responses by y = (y_1, ..., y_n)^T. The Gaussian process model is defined as

Y(x) = f(x)^T beta + Z(x), (1)

where f(x) = (f_1(x), ..., f_p(x))^T is a vector of pre-specified regression functions and beta = (beta_1, ..., beta_p)^T are unknown coefficients to be estimated. The term Z(x) is a univariate Gaussian process with zero mean, variance sigma^2, and correlation function R(., .; theta) with unknown correlation parameters theta. A common choice is the squared exponential correlation function R(x, x'; theta) = exp{-sum_k theta_k (x_k - x'_k)^2}, where theta_k > 0. Denote the correlation matrix by R = [R(x_i, x_j; theta)]_{i,j=1,...,n} and the regression matrix by F = (f(x_1), ..., f(x_n))^T. The log-likelihood (up to an additive constant) of the Gaussian process model is given by

l(beta, sigma^2, theta) = -(n/2) log sigma^2 - (1/2) log |R| - (y - F beta)^T R^{-1} (y - F beta) / (2 sigma^2). (2)

The parameters beta, sigma^2 and theta can be obtained by the MLE, and the BLUP at an input point x_0 is

y_hat(x_0) = f(x_0)^T beta_hat + r(x_0)^T R^{-1} (y - F beta_hat), (3)

where r(x_0) = (R(x_0, x_1; theta), ..., R(x_0, x_n; theta))^T.

Note that both parameter estimation by the MLE and prediction by the BLUP require computing R^{-1} and/or |R|, which is of O(n^3) complexity, making them computationally intensive. What is more, it is impossible to load the entire covariance matrix into the memory of an ordinary desktop computer when n is large; for example, a 100000 x 100000 matrix of doubles in MATLAB requires about 74.5 GB of memory. These two concerns have triggered demand for more computationally efficient and tractable methods to replace the MLE and the BLUP for large-scale Gaussian processes.
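To make the computational burden concrete, the following sketch (Python/NumPy, assuming a squared exponential correlation and a zero constant mean; all function names are illustrative, not from the paper) evaluates the Gaussian process log-likelihood and the BLUP directly. Every evaluation solves against the full n x n correlation matrix, which is exactly the O(n^3) bottleneck described above.

```python
import numpy as np

def sq_exp_corr(X1, X2, theta):
    """Squared exponential correlation R(x, x') = exp(-theta * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def gp_loglik(y, X, sigma2, theta):
    """Zero-mean GP log-likelihood up to an additive constant; O(n^3) per call."""
    n = len(y)
    R = sq_exp_corr(X, X, theta)
    _, logdet = np.linalg.slogdet(R)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet \
           - 0.5 * y @ np.linalg.solve(R, y) / sigma2

def blup(x0, y, X, theta):
    """BLUP at x0: r(x0)' R^{-1} y, again requiring a full n x n solve."""
    R = sq_exp_corr(X, X, theta)
    r0 = sq_exp_corr(x0[None, :], X, theta).ravel()
    return r0 @ np.linalg.solve(R, y)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 1))
y = np.sin(4 * X[:, 0])
pred = blup(X[3], y, X, theta=50.0)   # at a design point the BLUP interpolates
```

As a sanity check, the BLUP at a design point reproduces the observed response, since r(x_i)^T R^{-1} y picks out y_i.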
3 Composite Inference for Gaussian Processes
Maximum likelihood estimation is generally the first choice for parameter estimation, but repeated exact computation of the full likelihood is prohibitive when the sample size is very large. To address this computational problem, the composite likelihood is adopted. The general principle of the composite likelihood is to simplify complex dependence relationships by computing marginal or conditional likelihoods of subsets of the variables, and then multiplying them together to form an estimating function.
However, there are numerous combinations of component likelihoods that could form the composite likelihood, and it is unclear which combination is best. In addition, the dependence among these component likelihoods is intentionally ignored. These two factors make the composite likelihood less statistically efficient than the full likelihood. To increase its statistical efficiency, we must reconsider the dependence among the component likelihoods and select a reasonable combination, while still preserving the divide-and-conquer nature of the composite likelihood. Next we present some building blocks that capture the dependence among component likelihoods for Gaussian processes, which is the key to the proposed methods for estimating unknown parameters and making predictions.
Assume the whole dataset is decomposed into block subsets of roughly the same size, each containing data points for , where . The sliced Latin hypercube design (SLHD) by Qian (2012) can be an option to generate these subsets. Denote conditional on
by another random variable, i.e. . The question is how and are correlated. To answer this question, we first present a theorem (the proof is given in the appendix) showing that a conditional random variable can be represented by another random variable without conditioning, followed by a corollary applied to Gaussian processes, which gives the correlation between and .
For a random vector where each component is i.i.d. standard normal, and a matrix such that is nonsingular, then
The Theorem simplifies the conditional probability into a tractable probability without conditioning, making further operations on the conditional probability possible. We therefore have the following Corollary (the proof is also given in the appendix), which gives the joint distribution of . The Corollary can be used in both parameter estimation and prediction with the composite likelihood to increase statistical efficiency, because the dependence between the conditional densities is taken into account; it thus motivates the proposed method. The remarks that follow give the best weights for the conditional densities so that their weighted sum achieves the minimum variance. This weighted sum is used to approximate in the proposed method for both parameter estimation and prediction.
For a Gaussian process defined in (1) with covariance function given by and define , then follows a multivariate normal distribution and
Denote and , then , where , reaches its minimum variance, , when . We call the distribution of with optimal weights the best composite conditional distribution.
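The minimum-variance weights in the remark can be illustrated numerically. The sketch below (with an arbitrary made-up positive-definite covariance, not the paper's specific matrix) computes the weights proportional to Sigma^{-1} 1 that minimize the variance of a weighted sum constrained to have weights summing to one:

```python
import numpy as np

def optimal_weights(Sigma):
    """argmin_w w' Sigma w subject to sum(w) = 1, i.e. Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(Sigma.shape[0])
    s = np.linalg.solve(Sigma, ones)
    return s / (ones @ s)

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)   # an arbitrary positive-definite covariance
w = optimal_weights(Sigma)
var_opt = w @ Sigma @ w           # never exceeds the variance of any other valid weights
```

Any other weight vector summing to one, including equal weights, gives a variance at least as large.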
3.1 Parameter Estimation Using The Composite Likelihood
The composite likelihood is a computationally efficient estimating function that approximates the full joint likelihood by the product of a collection of component likelihoods. Besag (1975), on the analysis of spatial models, was among the first to study composite likelihood, working on composite conditional likelihoods, notably the pseudo-likelihood. Lindsay (1988) coined the term composite likelihood for such products of likelihoods. More details on composite likelihood methods can be found in Varin, Reid and Firth (2011).
We first give some notation. Denote , where denotes the sample in , and , where the notation denotes a column vector . According to the Theorem, is normally distributed and
where for . According to the remarks of the Corollary, the best weight is , where is the covariance matrix of .
Two classes of composite likelihoods are commonly used in the literature: the composite conditional likelihood (CCL) and the composite marginal likelihood (CML). For example, the composite conditional likelihood is given by
A similar composite likelihood method was proposed by Vecchia (1988) and Stein, Chi and Welty (2004). The composite marginal likelihood was also studied by Heagerty and Lele (1998), Caragea and Smith (2007) and Eidsvik et al. (2014), and is given, for example, by
All of these composite likelihoods ignore, to some extent, the correlation between component likelihoods, and there is no guarantee that the combinations of component likelihoods in these composite likelihoods are better than other combinations. The key to selecting the optimal combination lies in the chain rule of conditional probability. Recall that, by the chain rule, the full likelihood can be written as
where denotes the sample in , and . The best choice would be to use this combination directly, but the computation of some component likelihoods therein is still prohibitive.
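The chain-rule factorization can be checked numerically. The sketch below (standard multivariate-normal conditioning, with made-up blocks; names are illustrative) verifies that the sum of block conditional Gaussian log-densities recovers the full joint log-density exactly:

```python
import numpy as np

def mvn_logpdf(y, mu, S):
    """Multivariate normal log-density."""
    r = y - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

def chain_rule_logpdf(y, mu, S, blocks):
    """Sum over blocks of log p(y_k | y_1, ..., y_{k-1}) via Gaussian conditioning."""
    total, past = 0.0, []
    for b in blocks:
        if past:
            p = [i for blk in past for i in blk]
            Spp = S[np.ix_(p, p)]
            Skp = S[np.ix_(b, p)]
            gain = np.linalg.solve(Spp, y[p] - mu[p])
            mu_c = mu[b] + Skp @ gain                              # conditional mean
            S_c = S[np.ix_(b, b)] - Skp @ np.linalg.solve(Spp, Skp.T)  # conditional cov
            total += mvn_logpdf(y[b], mu_c, S_c)
        else:
            total += mvn_logpdf(y[b], mu[b], S[np.ix_(b, b)])
        past.append(b)
    return total

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
S = A @ A.T + 6 * np.eye(6)   # a positive-definite covariance
mu = rng.normal(size=6)
y = rng.normal(size=6)
```

For any ordering of the blocks the chain-rule sum equals the full log-density; the computational difficulty discussed above is that later blocks condition on ever larger sets.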
The term , called the full conditional distribution, is therefore replaced by the best composite conditional distribution , which is much more computationally convenient, making full use of the joint density function of to capture the correlation between and . Note that we drop the conditioning on for simplicity, but it can be taken into account if necessary.
The component likelihood takes all the variables of into consideration in order to increase statistical efficiency, yet only a much smaller covariance matrix is needed to calculate it. This may not be possible in previously proposed composite likelihoods using either marginal or conditional likelihoods. Moreover, it does not suffer from the difficulty of selecting observation and/or conditioning sets, a common problem in previously proposed composite likelihoods.
Therefore the proposed composite likelihood is
To simplify this composite likelihood, first denote and , and thus . Then recall that and , where and . The proposed composite log-likelihood, up to an additive constant, then becomes
By setting the partial derivatives of the composite log-likelihood with respect to and to zero, we obtain the estimates of and conditional on ,
By plugging and into in (6), we have
Therefore can be finally optimized by
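This two-step scheme, closed-form mean and variance estimates plugged into a concentrated likelihood that is then optimized over the correlation parameters, can be sketched for the full (non-composite) constant-mean likelihood as below. The squared exponential correlation, the small nugget for numerical stability, and the grid search are all illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def profile_neg_loglik(theta, y, X):
    """Concentrated negative log-likelihood: plug in the closed-form
    generalized least squares mean and variance estimates, leaving only theta."""
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    R = np.exp(-theta * d2) + 1e-8 * np.eye(n)   # tiny nugget for stability
    Ri1 = np.linalg.solve(R, np.ones(n))
    beta = (Ri1 @ y) / Ri1.sum()                 # GLS estimate of the constant mean
    r = y - beta
    sigma2 = r @ np.linalg.solve(R, r) / n       # plug-in variance estimate
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * (n * np.log(sigma2) + logdet)

rng = np.random.default_rng(5)
X = rng.uniform(size=(25, 1))
y = np.sin(4 * X[:, 0]) + 1.0
grid = np.linspace(0.5, 50.0, 60)
theta_hat = grid[np.argmin([profile_neg_loglik(t, y, X) for t in grid])]
```

In practice the grid search would be replaced by a numerical optimizer, and in the proposed method the concentrated objective is the composite log-likelihood rather than the full one.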
3.2 Asymptotics for The Maximum Composite Likelihood
Denote the score function by and the Hessian matrix by . Let and and denote the positive square root of a positive matrix by , i.e. .
The consistency and asymptotic normality of maximum composite likelihood estimators are in general ensured under the same regularity conditions as for the usual maximum likelihood estimators (Lindsay, 1988), for example, the continuity, growth and convergence conditions in Mardia and Marshall (1984) and Sweeting (1980), in the context of increasing-domain asymptotics for Gaussian processes. The idea behind the analysis of the asymptotic distribution is that, under those regularity conditions and by a Taylor expansion similar to that used for the maximum likelihood, is asymptotically normally distributed:
3.3 Prediction Using The Composite Likelihood
The composite likelihood is popularly used for parameter estimation, but it can also be used in the Gaussian process model to approximate the BLUP (3). In the Gaussian process model (1), the joint distribution of and is
As shown by Jones et al. (1998), the maximum likelihood estimator of is identical to the BLUP in (3). In a similar vein, the maximum composite likelihood predictor was developed by Eidsvik et al. (2014) to approximate the maximum likelihood estimator.
Throughout the remainder of this section, we assume the true values of , and are known; in practice, , and are used instead. The weighted composite likelihood at an unobserved location is
where is the weight of component likelihood and . By differentiating the composite likelihood and setting its first derivatives to zero, the prediction of is
where . A similar result was proposed by Eidsvik et al. (2014), where equal weights were used, i.e. .
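Concretely, this block predictor with fixed weights can be sketched as below (Python/NumPy, zero-mean GP with an assumed squared exponential correlation; equal weights give an Eidsvik et al. style predictor; all names are illustrative):

```python
import numpy as np

def sq_exp(X1, X2, theta=50.0):
    """Squared exponential correlation between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

def block_predictions(x0, X_blocks, y_blocks, theta=50.0):
    """BLUP of Y(x0) computed from each block separately (zero-mean GP)."""
    preds = []
    for Xk, yk in zip(X_blocks, y_blocks):
        Rk = sq_exp(Xk, Xk, theta)
        rk = sq_exp(x0[None, :], Xk, theta).ravel()
        preds.append(rk @ np.linalg.solve(Rk, yk))   # small per-block solve
    return np.array(preds)

rng = np.random.default_rng(3)
X = rng.uniform(size=(30, 1))
y = np.sin(4 * X[:, 0])
X_blocks, y_blocks = np.split(X, 3), np.split(y, 3)
p = block_predictions(np.array([0.5]), X_blocks, y_blocks)
pred_equal = p.mean()   # equal weights w_k = 1/m
```

Each block only requires factorizing its own small correlation matrix, which is the computational appeal; the open question addressed next is how to choose the weights.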
In general, the weights can be specified manually according to some criterion, but what are their optimal values? The composite likelihood framework does not answer this question. To do so, we propose “composite inference”, which gives analytical optimal solutions. In fact, we will see that the predictor in (9) is a sub-solution of the proposed method.
3.4 Composite Inference
For making predictions from large-scale Gaussian process models, the composite likelihood sounds like a good option, but it still cannot answer the question raised in the last subsection. To address this problem, we propose composite inference. Denote again and , where is an unobserved location. According to the Corollary, the expectation of is and the covariance matrix of is .
The linear unbiased block predictor of should be a linear combination of , and thus we have
The rest is to calculate the weights such that the prediction has the minimum variance, via the following convex optimization
By the same argument as in the remarks of the Corollary, the optimal solution for the weights is
In fact, by maximizing the full likelihood of with respect to , the same prediction of can be obtained, i.e., . Clearly this does not fall within the framework of the composite likelihood. Indeed, the composite likelihood developed by Eidsvik et al. (2014) is actually intended to approximate the full likelihood of . In addition, it is not straightforward to see that maximizing the full likelihood of achieves the minimum variance.
Note that it cannot be guaranteed that is always nonsingular, for example, when is exactly one of . Moreover, will be ill-conditioned if is close to any of . Therefore the optimization problem (11) is transformed into another convex optimization
If the elements of are distinct, it can be proved that the covariance matrix is positive definite (Santner, Williams, and Notz, 2003). According to the Proposition in the appendix, is also positive definite and hence is nonsingular. Using a Lagrange multiplier, one obtains
Clearly when is exactly one of , and hence . It is easy to see that is unbiased and achieves the minimum variance given the partition , so we call it the best linear unbiased block predictor (BLUBP). In addition, the composite predictor in (9) is in fact a sub-solution of the proposed predictor and may not be optimal.
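The BLUBP construction can be sketched numerically (zero-mean GP, assumed squared exponential correlation; the matrix C below is the covariance, up to sigma^2, of the block prediction errors Y(x0) minus each block BLUP, and the Lagrange-multiplier solution gives weights proportional to C^{-1} 1; all names are illustrative):

```python
import numpy as np

def sq_exp(X1, X2, theta=50.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

def blubp(x0, X_blocks, y_blocks, theta=50.0):
    """Best linear unbiased block predictor: block BLUPs combined with the
    minimum-variance weights w = C^{-1} 1 / (1' C^{-1} 1)."""
    m = len(X_blocks)
    coeffs, preds = [], np.empty(m)
    for k, (Xk, yk) in enumerate(zip(X_blocks, y_blocks)):
        Rk = sq_exp(Xk, Xk, theta)
        rk = sq_exp(x0[None, :], Xk, theta).ravel()
        ak = np.linalg.solve(Rk, rk)          # BLUP coefficients for block k
        coeffs.append((Xk, rk, ak))
        preds[k] = ak @ yk
    # covariance (up to sigma^2) of the errors Y(x0) - preds[k]:
    # cov(e_j, e_k) = 1 - a_j' r_j - a_k' r_k + a_j' R_jk a_k
    C = np.empty((m, m))
    for j in range(m):
        Xj, rj, aj = coeffs[j]
        for k in range(m):
            Xk, rk, ak = coeffs[k]
            C[j, k] = 1.0 - aj @ rj - ak @ rk + aj @ sq_exp(Xj, Xk, theta) @ ak
    ones = np.ones(m)
    w = np.linalg.solve(C, ones)
    w /= ones @ w
    return w @ preds, w, C

rng = np.random.default_rng(4)
X = rng.uniform(size=(30, 1))
y = np.sin(4 * X[:, 0])
pred, w, C = blubp(np.array([0.5]), np.split(X, 3), np.split(y, 3))
```

Only small per-block matrices are factorized, yet the cross-block correlations enter through the off-diagonal entries of C; equal weights (the composite predictor) correspond to skipping the C^{-1} 1 step and are therefore generally suboptimal.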
Moreover, as the density of the observations increases to infinity, the proposed predictor converges to the BLUP, and hence preserves the same infill asymptotic properties as the optimal predictor. This is because each converges to the BLUP, and hence so does their weighted average , as the density of the observations increases to infinity.
4 Numerical Examples
Numerical examples are given in this section to demonstrate the performance of the proposed method for both parameter estimation and prediction, compared with the methods using the maximum likelihood and the conventional composite likelihoods. In the first part, simulation studies are conducted: the design points are generated by the SLHD with subgroups, each with design points, and the corresponding responses are sampled from a given Gaussian process with specified parameters. In the second part, a numerical example on the Schwefel function shows the performance of the proposed method in large-scale applications. Note that all the methods compared use the same settings and the same partition of the whole dataset.
We first compare the composite conditional distribution, the distribution of the weighted sum of , with the full conditional distribution . Set , and . 16 design points are generated by the SLHD, and the Gaussian process with the specified parameters is used to make predictions for untried points by each method under comparison. Note that in this simulation the true parameter values (rather than parameter estimates) are used for prediction in order to remove other distracting factors. A typical example when is shown in Figure 1, where the black line is the prediction and the dashed lines form its 3-sigma confidence interval. It can be seen that the prediction from the proposed method is very close to that from the maximum likelihood (the BLUP), and is much better than that from the conventional composite likelihood using equation (9).
We then increase from 4 to 8, i.e. each subgroup has 2 points, while keeping the other settings unchanged, and similar results are obtained, as shown in Figure 2. It can be seen that the proposed method approximates the BLUP extremely well. This suggests that the best composite conditional distribution approximates the full conditional distribution very well. Simulations also show that, for the same total number of design points, the approximation to the BLUP by the proposed method with larger is consistently better than with smaller . In general, the accuracy of the predictor is determined by the way the dataset is partitioned, and information loss occurs roughly where the predictor has larger uncertainty.
Next we compare parameter estimation using the proposed composite likelihood (CI) with that using the maximum likelihood (ML) and the conventional composite likelihoods, the block versions of the CML and CCL in equations (4) and (5). The simulation is repeated 10000 times. In each simulation, 100 design points in the domain are randomly generated by the SLHD with 10 subgroups, 1000 equally spaced testing points are generated, and their responses are sampled from the given Gaussian process with , and . The bias and mean squared error (MSE) of the parameter estimates for the 4 methods are shown in Table 1.
It can be seen from Table 1 that the proposed composite likelihood gives almost the same bias and MSE of parameter estimates as the maximum likelihood; in fact, in most simulations it gives exactly the same parameter estimates. However, the MSEs of the conventional composite likelihoods are much worse, because they ignore the dependence between component likelihoods and their combinations of component likelihoods are not well selected. This means that the proposed composite likelihood can be a competitive alternative to the maximum likelihood even for small-scale Gaussian processes. For large-scale Gaussian processes, where the maximum likelihood is infeasible, we recommend the proposed composite likelihood for parameter estimation, especially when parallel computing is available.
With the optimized parameter estimates, predictions are made for the 1000 equally spaced testing points in each simulation, and the RMSE of predictive accuracy is recorded, as shown in Figure 3. The first boxplot uses the standard method: maximum likelihood for parameter estimation and the BLUP for prediction. The proposed method for parameter estimation and prediction is used in the second boxplot, where the predictive accuracy is almost the same as that of the standard method. The block versions of the CML and CCL are used for parameter estimation, respectively, and the composite likelihood in equation (9) for prediction, in the last two boxplots, where the RMSE is much worse than that of the proposed method. This suggests that the proposed method performs almost the same as the standard method, and much better than the other alternatives.
A simulation study in a two dimensional case is also conducted and similar conclusions can be drawn. In each simulation, 100 design points with 10 subgroups are randomly generated by the SLHD in the 2-D domain , with 1600 equally spaced testing points, and their responses are generated from the Gaussian process with , and . The simulation is also repeated 10000 times. The bias and MSE of parameter estimates of each method are given in Table 2, and the predictive accuracy is given in Figure 4.
The parameter estimates from the proposed composite likelihood are also extremely close to those from the full likelihood, though the MSE of the roughness parameters is slightly larger. The predictions by the proposed method are also very close to those of the standard method. In contrast, parameter estimation and prediction by the conventional composite likelihoods are much worse.
Finally, a case study on the Schwefel function shows the performance of the proposed method in a large-scale application compared with the other methods. The Schwefel function used here is given by
This function is very complex, and hence 100000 data points are used to fit a Gaussian process model. The dataset is divided by the SLHD into 200 subsets, each with 500 points, and 200000 testing points are generated by a Latin hypercube design. The maximum likelihood fails to give parameter estimates because it runs out of memory when calculating the full likelihood. Therefore we compare the proposed method with the conventional composite likelihood methods. The mean squared prediction error of the proposed method is 0.1605, while that of the composite likelihood methods is 0.7864 (CML) and 0.7863 (CCL), respectively. Thus the proposed method greatly outperforms its conventional counterparts.
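For reference, the commonly used form of the Schwefel test function is sketched below; the paper's exact variant did not survive extraction, so this standard definition, f(x) = 418.9829 d - sum_i x_i sin(sqrt(|x_i|)) on [-500, 500]^d, is an assumption:

```python
import numpy as np

def schwefel(X):
    """Standard Schwefel function; global minimum near x_i = 420.9687 in each coordinate."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    return 418.9829 * d - (X * np.sin(np.sqrt(np.abs(X)))).sum(axis=1)

f0 = schwefel(np.zeros((1, 2)))[0]              # 2 * 418.9829 at the origin
fstar = schwefel(np.full((1, 2), 420.9687))[0]  # approximately 0
```

Its many local optima are what make it a demanding target for a fitted Gaussian process surrogate.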
5 Conclusion
We have presented an intuitively appealing and practically useful method for parameter estimation and prediction in Gaussian processes. The proposed composite likelihood systematically addresses the difficulties of approximating the full likelihood with composite likelihoods by accounting for the dependence among well-selected component likelihoods. It is much more statistically efficient than its counterparts and approximates the full likelihood extremely well even for small-scale Gaussian processes. The proposed composite inference gives the optimal prediction (the BLUBP) for a given partition of the dataset, which is also extremely close to the prediction by the BLUP. The proposed method can be a useful and convenient alternative to the standard method when the latter is not applicable.
The proposed composite likelihood for parameter estimation is more statistically efficient but less computationally efficient than the conventional ones, so it can be replaced by other more computationally efficient composite likelihoods in some scenarios where statistical efficiency is less important than computational efficiency. On the other hand, for making prediction, the proposed BLUBP is strongly recommended.
The way the whole dataset is partitioned affects the prediction accuracy of the proposed method, but it is not clear which partition is best; the best way of partitioning will be explored in future work. In addition, implementation details on computational and numerical issues, along with further examples, will be reported in a subsequent article. Finally, the proposed method in the Bayesian context will be developed and reported elsewhere.
Appendix
Theorem. For a random vector where each component is i.i.d. standard normal, and a matrix such that is nonsingular, then
By singular value decomposition, we have , where , and hence and . Denote , which is split into two parts
where is of size . Then we have
For a Gaussian process defined in (1) with covariance function given by and define , then follows a multivariate normal distribution and
Let , . Since
thus there exist and , , such that
By the Theorem we have
and thus follows a multivariate normal distribution and
Denote , which is divided into subsets: . If is the correlation function of two points and , such that is positive definite, then is positive definite, where