Regression is a fundamental problem in machine learning, signal processing, and system identification. In general, an input-output paircan be described by
where is an unknown regression function and is a zero-mean error term. Given a set of input-output pairs, the goal is to model and infer the value of at some test points.
Gaussian Processes (GPs) are a family of nonlinear and non-parametric models(Rasmussen & Williams, 2006). Inferences based on GPs are conceptually straightforward and the model class provides an internal measure of uncertainties of the inferred quantities. Therefore GPs are widely applied in machine learning (Bishop, 2006), time series analysis (Shumway & Stoffer, 2011), spatial statistics (Kroese & Botev, 2015), and control systems (Deisenroth et al., 2015).
A GP model of is specified by its mean and covariance functions, which are learned from data. After GP learning, the latent values of can be inferred from observed data. Both GP learning and inference rely on evaluating a likelihood function with available data. In case of a large set of data, this evaluation becomes computational prohibitive due to the requirement of inverting covariance matrices, which has a typical runtime on the order , where is the size of training data. Thus the direct use of GP has been restricted to moderately sized data sets.
Considerable research efforts have investigated scalable GP models or approximations which can be divided into four broad categories: subset of data, low rank approximation of covariance matrices (Williams & Seeger, 2001), sparse GPs (Quiñonero Candela & Rasmussen, 2005; Hensman et al., 2013), and product of experts (or Bayesian committee machines) (Tresp, 2000; Deisenroth & Ng, 2015)
. Many methods rely on fast heuristic searches(Seeger et al., 2003; Titsias, 2009) for an optimal subset of data or dimensions to, for instance, maximize the information gain, which has been shown to be an NP-hard problem (Krause et al., 2008).
Sparse GPs use a small and special subset of data, namely inducing variables, and apply two key ideas: an assumption of conditional independence between training data and testing data given inducing variables; and the approximated conditional distribution for training data given inducing variables, for example, the fully or partially independent training conditional (FITC or PITC) (Quiñonero Candela & Rasmussen, 2005; Snelson & Ghahramani, 2006, 2007; Bijl et al., 2015). Based on these assumptions, sparse GPs reduce the complexity to where is the number of inducing variables. By contrast, the product of experts (PoEs) or Bayesian committee machines (BCM) (Tresp, 2000; Cao & Fleet, 2014; Deisenroth & Ng, 2015)
does not require to find a special set of variables. Those methods divide the large training data into segments and uses each with the GP model as a ‘local expert’. The final result are formed by multiplying the results given by each local experts, usually with certain weights in order to be robust against outliers.
Several important questions remain for scalable GP methods. First, how are these different methods connected to each other? Second, how do we quantify the difference between an approximated GP posterior distribution and a full GP posterior distribution? This work aims to address both questions. First, we show that various approximated GP posteriors can be derived by applying a general belief updating framework (Bissiri et al., 2016). Using composite likelihoods (Varin et al., 2011), we obtain a scalable composite GP model, which can be implemented in a recursive fashion. Second, we analyze how the posterior of the composite GPs differs from that of a full GP with respect to predictive performance and its representation of uncertainty (Bernardo, 1979; Cover & Thomas, 1991).
2 Problem Formulation
The joint mean and covariance matrix are functions of the test points . Using a training data set
our goal is to update the belief distribution over so as to produce a prediction with a dispersion measure for uncertainty.
This specifies a conditional data model , that is Gaussian with mean and covariance
. The standard Bayesian inference framework then updates the prior belief distributioninto the posterior via Bayes’ rule.
The inference of
depends, moreover, on a specified mean and covariance model, which can be learned in several ways. When it is parameterized by a vector, the maximum marginal likelihood approach is a popular learning approach that aims to solve the problem
which requires several matrix inversions in itself. Finally, the posterior
is evaluated at . Both (4) and (5) require a runtime on the order and a storage that scales as , which renders standard GP training and inference intractable for large (Quiñonero Candela & Rasmussen, 2005).
In this work, our goal is to firstly formulate an alternative update of the belief distribution, secondly provide a scalable training and inference method and finally present an analysis on the performance of approximation in terms of MSE and information loss.
3 Updating Belief Distributions
The posterior above can be thought of as a special case of updating the belief distribution into a new distribution using the data . A more general belief updating framework was formulated in (Bissiri et al., 2016)
. Using this framework, we first define a loss functionand then find the distribution which minimizes the average loss
regularized by the Kullback–Leibler divergence (KLD)(Cover & Thomas, 1991). The first term in (6) fits to the data, while the second term constrains it to the prior belief distribution. The updated belief distribution is then obtained as
It is readily seen that for the loss function
the minimizer of is the posterior , cf. (Bissiri et al., 2016) for more details. It is also interesting to point out that the above optimal belief updating framework has the same structure as the KLD optimization problem in variational Bayesian inferences (Fox & Roberts, 2012). Here, however, the problem is considered from a different angle: in (Fox & Roberts, 2012)
, the challenge is to tackle the joint distribution of high-dimensionalby factorizing it into single variable factors; in this paper the challenge is to tackle the distribution of large data sets .
To alleviate the computational requirements using the GP loss (8), we formulate a different loss function based on marginal blocks of the full data distribution similar to the composite likelihood approach, cf. (Varin et al., 2011). Specifically, we choose
where the data set has been divided into segments
with samples each. As we show below, this cost function enables scalable and online processing for large data set.
Theorem 1 (Composite GP (CGP) update).
The proof follows by recognizing that (6) is equivalent to the divergence
which attains the minimum 0 only when . Then the result is obtained by noting that is a distribution that integrates to unity. The recursive computation of the CGP posterior in (11) can be tackled using standard tools (Särkkä, 2013) as discussed in Section 4 below.
Here it is instructive to compare the CGP with the SGP approach (Snelson & Ghahramani, 2006) which is formulated using a set of latent ‘inducing variables’ with a joint distribution . By assuming that is independent of when given , the joint distribution can be factorized into . Then the implicit loss function used in SGP is based on marginalizing out from the joint distribution:
using a segmented model . The SGP posterior is
This formulation reproduces the FITC and PITC approximations depending on the size of the segments , cf. (Quiñonero Candela & Rasmussen, 2005; Snelson & Ghahramani, 2007). The inducing variables are targeting a compressed representation of the data and the information is transferred to the belief distribution of via . Thus must be carefully selected so as to transfer the maximum amount of information about the training data
. The optimal placement of inducing variables involves a challenging combinatorial optimization problem and but can be tackled using greedy search heuristics(Seeger et al., 2003; Krause et al., 2008; Titsias, 2009).
4 Recursive Computation
The model parameters are fixed unknown quantities and typically learned using the maximum likelihood (ML) method. Once this is completed, the prediction of along with its dispersion is computed. In this section, we present recursive computations for both CGP learning and prediction.
The Fisher information matrix (FIM), , quantifies the information about the model parameters contained in data when assuming a model
. In case of Gaussian distributed data, the FIM can be obtained by the Slepian-Bangs formula. We refer readers to (B.3.3) in(Stoica & Moses, 1997)
for this formula. The larger the FIM in the Löwner order sense, the lower errors we may achieve when learning the optimal model parameters. Indeed, the inverse of the FIM, provides an estimate of the variance of an efficient estimator(Kay, 1993; Van Trees et al., 2013).
Each data segment yields its own maximum likelihood estimate with an approximate error covariance matrix . When the segments are obtained from the same data generating process, the model is applicable to each segment and we combine the estimates to provide a refined model parameter at segment :
The quantities are computed recursively as
where and , cf. (Zachariah et al., 2017).
Figure 1 illustrates the recursive learning of a CGP learning for a model with zero-mean function and the squared exponential (SE) covariance function () is shown in Figure 1. In the example we split the training data () into a number of segments. Three cases with different segmented data length (, , and ) are shown. Note that the ML estimator is asymptotically efficient estimator and thus the accuracy of and improves as becomes larger. Therefore there is a trade-off between accuracy and computational cost. Setting , the accuracy of the weighted combination (15) is satisfying, while the computational complexity is still maintained at a relatively low level comparing to the full GP learning.
It is worth pointing out that, the above idea of segmenting data and weighting the contributions from each module is similar to the approach found in the recent work by Jacob, et al. (2017). The modularized inference increases robustness to misspecifications of the joint data model.
We now present the mean and covariance of the updated belief distribution for CGP, cf. Theorem 1. Specifically, given the CGP posterior distribution and a new data segment , the posterior is updated recursively, cf. (Särkkä, 2013).
First, a prior is constructed. For , the GP prior in (2) is used as the prior distribution for ; for , previous posterior is used as the new prior.
Second, the conditional distribution is obtained with mean and covariance matrix
Finally, the posterior has a mean and covariance:
The matrices and are defined as
Together the equations form a recursive computation where is the basis for obtaining after observing .
5 Performance Analysis of Composite GPs
Given the appealing computational properties of CGP, a natural question is how good is its posterior as compared with that of GP? Specifically, we are interested in the predictive performance and the ability to represent uncertainty about the latent state , which will be addressed in the following subsections. Both aspects are related to the amount of information that the training data provides about the latent state, i.e., (Cover & Thomas, 1991).
5.1 Data-Averaged MSE
We begin by considering an arbitrary test point, so that is scalar and is any predictor. Given the prior belief distribution of with known variance , the data-averaged mean squared error (MSE) is lower bounded by the mutual information between and the training data :
under fairly general conditions, cf. Theorem 17.3.2 in (Cover & Thomas, 1991). When the marginal data distribution obtained from GP is well-specified, the bound (24) equals the posterior variance of the GP and is attained by setting . Thus also represents the reduced uncertainty of the updated belief distribution for GP.
In this scenario, what is the additional MSE incurred when using the CGP posterior mean as a predictor ? In general, the data-averaged MSE equals
where the second term is the additional MSE given by the difference between the GP and CGP posterior means, respectively. To obtain closed-form expressions of this term, we consider the case of segments.
Theorem 2 (Excess MSE of CGP).
The additional MSE of CGP when equals
where is the covariance matrix of the -th and -th data segments. The vectors and are given by
where is the column vector of covariances between the testing data and the -th data segment; and the two block matrices ( and ) are coefficient matrices for the full GP and CGP predictions, respectively.
The full derivation is omitted here due to page limitations. We present a sketch of the proof in the following. It can be shown that the coefficient matrix for GP prediction is the block inverse of the covariance matrix of data :
To derive the corresponding coefficient matrix for CGP, we first define the approximated covariance matrix between data block and :
which means the segmented observations are indirectly connected by the testing data. Thereafter, the approximated conditional covariance matrix of given can be expressed as
After a few steps of algebraic manipulation, the block coefficients matrix for CGP can be obtained as
where is the posterior variance of latent variable using the first data block .
By comparing the expressions for the coefficients matrices of GP and CGP (), we can make the following observations. First, for CGP, the covariance matrix is replaced by an approximation , which relies on the testing points to ‘connect’ the two blocks, cf. (30). Second, the CGP prediction effectively assumes that the segmented training data blocks are conditionally independent, when given testing data. Therefore, the coefficients matrix is simply the inverse of first data block’s covariance matrix, and . Thus Theorem 2 provides a means of probing the sources of the additional MSE incurred when using CGP compared to GP.
5.2 Data-Averaged KLD
Another way to compare different updated belief distributions is to quantify how much they differ from the prior distribution . Specifically, we use the Kullback–Leibler divergence and average it over all realizations under the marginal data model. For GP, it is straight-forward to show the identity , which is the average information gain about provided by the data and the selected model, cf. (Bernardo, 1979).
Theorem 3 (Difference between average information gains).
The data-averaged KL divergences of the posterior belief distributions differ by
where is the mutual information of the latent variable and all observations , and is the sum of mutual information of and observation block .
The data-averaged KLD between the CGP posterior and the prior is given by
Note that the second last holds because , , and
Remark 1 (Redundancy and synergy).
The difference between information gains in Theorem 3 can be positive or negative. If the difference is negative the segmented data is said to be redundant and there is overlapping information in the data segments about the latent variable; otherwise the segmented data are said to be synergistic in which case there is more information about the latent state by jointly considering all segmented data (Barrett, 2015; Timme et al., 2014).
Remark 2 (Under- and overestimation of uncertainty).
When the marginal data distribution obtained from GP is correctly specified, provides the correct measure of uncertainty about the latent variable, cf. (24). In this scenario, the CGP belief distribution will either under- or overestimate the uncertainty depending on whether the data segments are redundant or synergistic in nature.
In this section, we present examples of CGPs for processing large data sets. The first two examples are based on synthetic data. In the third example, we demonstrate the CGP for NOx predictions based on real-world data.
6.1 Synthetic Time Series Data
Considering the GP model with a linear mean function , and a covariance function with periodic patterns , where the period . In total 4224 data points are simulated: the first 4096 points are used as observations, and the last 128 points need to be predicted. Assuming the hyper-parameters are known, the GP, CGP, and SGP predictive posterior distributions are illustrated in Figure 2. In the CGP case, the observations are sequentially divided into four segments (each segment has the length of 1024); in the SGP (FITC) case, 128 inducing variables are placed uniformly across the space of observation inputs.
Several observations are made from this example. First, the recursive computations render the CGP very fast comparing to the GP or the SGP methods. Running on a standard laptop, runtime of CGP i less than a second to process 5000 data points and the run time grows linearly with respect to . Second, in this particular case, the predictive variances are underestimated with the composite GP. This is consistent with redundant information in the data segments as per Remark 2.
6.2 Synthetic Spatial Data
In this example, we consider a two-dimensional Gaussian random field (GRF) model: , where . This GRF model can be efficiently simulated via circulant embedding (Kroese & Botev, 2015). The contour plot of a realization of the GRF () is illustrated in Figure 4. This grid of GRF points are partitioned into points for prediction as observations and test points.
The predictions are illustrated in Figure 3. Visually, CGP is most similar to GP, and the posterior variances of the CGP representing the uncertainty is slightly higher than that of GP. By contrast, SGP and the GP exhibit notable differences and the posterior variance of the former is significantly higher than the latter. The additional MSE (25) incurred by using these posterior distributions is visualized for CGP ( and ) and SGP in Figure 5, using 100 different realizations of the GRF. When the number of data segments varies from 4 to 16, the computation time of CGP is reduced but the additional MSE increases, which is consistent with our analysis in the previous section. By contrast, SGP yields a notably higher additional MSE over GP.
6.3 A Real-world Case: NOx Prediction
Next we demonstrate the use of CGP in a real application for predicting NOx (nitrogen oxides). The data set (available online: http://slb.nu/slbanalys/) includes more than 10 years hourly NOx measurements for a city with 1 million population. Figure 6(a)
shows that the NOx data are far from Gaussian: the measurements are non-negative, highly skewed towards to lower values, and have a long tail in high values. Therefore, we apply the logarithm transformation(Shumway & Stoffer, 2011)
to adjust the NOx measurements to a normal distributed data. The result after transformation is shown in Figure6(b).
An example of 24-hour NOx prediction is illustrated in Figure 7
. In this example, previous two years measurements (17472 hourly measurements) are used by the CGP model to produce the posterior of NOx levels on March 12, 2014. The produced posterior is a log-normal process: the predictions are non-negative; the distribution is skewed to the lower values and has a long tail in higher values; and the width of credibility intervals is significantly bigger for rush hours comparing to late and early hours in working days.
Finally, we show the root-mean square error (RMSE) and runtime results for the CGP and the SGP, with 365 24-hour predictions over the year of 2014. The GP prediction is used as a reference for comparison. As GP prediction is not very scalable, we limit the maximum number of weeks used for one prediction (for instance the prediction shown in Figure 7) to be 50. Thereafter, we vary the number of observations per batch for the CGP and the number of inducing variables for the SGP from 1 to 25 weeks. The resulting RMSE () and runtime (seconds) are shown in Figure 8. The RMSE of GP with 50 weeks of data is slightly above 75, which is provided as the lower bound when comparing the CGP and the SGP results. Figure 8 shows a trade-off between the runtime and the RMSE with CGPs and SGPs for this particular data set when the total data used for one prediction is fixed (50 weeks). The SGP with 5 weeks of inducing variable achieves lower RMSE than the CGP with 10 batches and 5 weeks per batch. However, the runtime of the CGP is significantly lower than the SGP. We must notice that when the number of observations per batch or number of inducing variables increases, the difference of runtimes becomes larger and the gap between RMSEs are closing, which shows a nice scalablility and accuracy of the CGP model.
In this work, we addressed the problem of learning GP models and predicting an underlying process when in scenarios where the data sets are large. Using a general belief update framework, we applied a composite likelihood approach and derived the CGP posterior distribution. It can be both be updated and learned recursively with a runtime that is linear in the number of data segments. We show that GP, CGP, and SGP posteriors can be all recovered in this framework using different likelihood models.
Furthermore, we compared the CGP posterior with that of GP. We obtained closed-form expressions of the additional prediction MSE incurred by CGP and the differences in information gains under both models. The results can be used as a conceptual as well as computational tool for designing segmentation schemes and evaluating errors induced when using the scalable CGP in different applications. The design of segmentation schemes based on the derived quantities is a topic of future research.
- Barrett (2015) Barrett, Adam B. Exploration of synergistic and redundant information sharing in static and dynamical gaussian systems. Phys. Rev. E, 91:052802, May 2015. doi: 10.1103/PhysRevE.91.052802.
- Bernardo (1979) Bernardo, Jose M. Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):113–147, 1979. ISSN 00359246.
- Bijl et al. (2015) Bijl, Hildo, Wingerden, Jan-Willem, Schön, Thomas B., and Verhaegen, Michel. Online sparse gaussian process regression using fitc and pitc approximations. IFAC-PapersOnLine, 48(28):703 – 708, 2015. ISSN 2405-8963. 17th IFAC Symposium on System Identification SYSID 2015.
- Bishop (2006) Bishop, Christopher M. Pattern recognition and machine learning. Springer, New York, NY, 2006. ISBN 9780387310732;0387310738;.
- Bissiri et al. (2016) Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. ISSN 1467-9868. doi: 10.1111/rssb.12158.
- Cao & Fleet (2014) Cao, Yanshuai and Fleet, David J. Generalized product of experts for automatic and principled fusion of gaussian process predictions. CoRR, abs/1410.7827, 2014.
- Cover & Thomas (1991) Cover, Thomas M. and Thomas, Joy A. Elements of information theory. Wiley, New York, 1991. ISBN 9780471062592;0471062596;.
- Deisenroth et al. (2015) Deisenroth, M. P., Fox, D., and Rasmussen, C. E. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, Feb 2015. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.218.
- Deisenroth & Ng (2015) Deisenroth, Marc and Ng, Jun Wei. Distributed gaussian processes. In Bach, Francis and Blei, David (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1481–1490, Lille, France, 07–09 Jul 2015. PMLR.
- Fox & Roberts (2012) Fox, Charles W. and Roberts, Stephen J. A tutorial on variational bayesian inference. Artificial Intelligence Review, 38(2):85–95, Aug 2012. ISSN 1573-7462. doi: 10.1007/s10462-011-9236-8.
- Hensman et al. (2013) Hensman, James, Fusi, Nicoló, and Lawrence, Neil D. Gaussian processes for big data. CoRR, abs/1309.6835, 2013.
- Jacob et al. (2017) Jacob, P. E., Murray, L. M., Holmes, C. C., and Robert, C. P. Better together? Statistical learning in models made of modules. ArXiv e-prints, August 2017.
- Kay (1993) Kay, Steven M. Fundamentals of statistical signal processing: Vol. 1, Estimation theory. Prentice Hall PTR, Upper Saddle River, N.J, 1993. ISBN 0133457117;9780133457117;.
- Krause et al. (2008) Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. J. Mach. Learn. Res., 9:235–284, June 2008. ISSN 1532-4435.
- Kroese & Botev (2015) Kroese, Dirk P. and Botev, Zdravko I. Spatial Process Simulation, pp. 369–404. Springer International Publishing, 2015. ISBN 978-3-319-10064-7. doi: 10.1007/978-3-319-10064-7˙12.
- Quiñonero Candela & Rasmussen (2005) Quiñonero Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, December 2005. ISSN 1532-4435.
- Rasmussen & Williams (2006) Rasmussen, Carl E. and Williams, Christopher K. I. Gaussian processes for machine learning. MIT Press, Cambridge, Massachusetts, 2006. ISBN 9780262182539;026218253X;.
- Särkkä (2013) Särkkä, Simo. Bayesian Filtering and Smoothing. Cambridge University Press, New York, NY, USA, 2013. ISBN 1107619289, 9781107619289.
- Seeger et al. (2003) Seeger, Matthias, Williams, Christopher, and Lawrence, Neil. Fast Forward Selection to Speed Up Sparse Gaussian Process Regression. In Artificial Intelligence and Statistics 9, 2003.
- Shumway & Stoffer (2011) Shumway, Robert H. and Stoffer, David S. Time series analysis and its applications: with R examples. Springer, New York, 3rd edition, 2011. ISBN 144197864X;9781441978646;.
- Snelson & Ghahramani (2006) Snelson, Edward and Ghahramani, Zoubin. Sparse gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, P. B., and Platt, J. C. (eds.), Advances in Neural Information Processing Systems 18, pp. 1257–1264. MIT Press, 2006.
- Snelson & Ghahramani (2007) Snelson, Edward and Ghahramani, Zoubin. Local and global sparse gaussian process approximations. In Meila, Marina and Shen, Xiaotong (eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pp. 524–531, San Juan, Puerto Rico, 21–24 Mar 2007. PMLR.
- Stoica & Moses (1997) Stoica, Petre and Moses, Randolph L. Introduction to spectral analysis. Prentice Hall, Upper Saddle River, 1997. ISBN 9780132584197;0132584190;.
- Timme et al. (2014) Timme, Nicholas, Alford, Wesley, Flecker, Benjamin, and Beggs, John M. Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective. Journal of Computational Neuroscience, 36(2):119–140, Apr 2014. ISSN 1573-6873. doi: 10.1007/s10827-013-0458-4.
- Titsias (2009) Titsias, Michalis. Variational learning of inducing variables in sparse gaussian processes. In van Dyk, David and Welling, Max (eds.), Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
- Tresp (2000) Tresp, Volker. A bayesian committee machine. Neural Comput., 12(11):2719–2741, November 2000. ISSN 0899-7667. doi: 10.1162/089976600300014908.
- Van Trees et al. (2013) Van Trees, Harry L., Bell, Kristine L., and Tian, Zhi. Detection estimation and modulation theory. John Wiley & Sons, Inc, Hoboken, N.J, second edition, 2013. ISBN 9780470542965;0470542969;.
- Varin et al. (2011) Varin, Cristiano, Reid, Nancy, and Firth, David. An overview of composite likelihood methods. Statistica Sinica, 21(1):5–42, 2011. ISSN 10170405, 19968507.
- Williams & Seeger (2001) Williams, Christopher K. I. and Seeger, Matthias. Using the nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press, 2001.
- Zachariah et al. (2017) Zachariah, D., Dwivedi, S., Händel, P., and Stoica, P. Scalable and passive wireless network clock synchronization in los environments. IEEE Transactions on Wireless Communications, 16(6):3536–3546, June 2017. ISSN 1536-1276. doi: 10.1109/TWC.2017.2683486.