1 Introduction
GPs are a nonlinear regression method that capture function smoothness across inputs through a response covariance function (Williams and Rasmussen, 1996)
. GPs extend to multioutput regression, where the objective is to build a probabilistic regression model over vectorvalued observations by identifying latent crossoutput processes. Multioutput GPs frequently appear in timeseries and geostatistical contexts, such as the problem of imputing missing temperature readings for sensors in different locations or missing foreign exchange rates and commodity prices given rates and prices for other goods over time
(Osborne et al., 2008; Álvarez et al., 2010). Efficient model estimation would enable researchers to quickly explore large spaces of parameterizations to find an appropriate one for their task.For input points, exact GP inference requires maintaining an
matrix of covariances between response variables at each input and performing
inversions with that matrix (Williams and Rasmussen, 1996). Some singleoutput GP methods exploit structure in this matrix to reduce runtime (Wilson et al., 2015). In the multioutput setting, the same structure does not exist. Approximations developed for multioutput methods instead reduce the dimensionality of the GP estimation problem from to , but still require to scale with to retain accuracy (Nguyen et al., 2014). The polynomial dependence on is still cubic: the matrix inversion underlying the stateoftheart multioutput GP estimation method ignores LMC’s structure. Namely, the crosscovariance between two outputs is determined by a linear combination of stationary subkernels. On a grid of inputs, each subkernel induces matrix structure, so viewing the LMC kernel as a linear combination of structured matrices we can avoid direct matrix inversion.Our paper is organized as follows. In Sec. 2 we give background on singleoutput and multioutput GPs, as well as some history in exploiting structure for matrix inversions in GPs. Sec. 3 details both related work that was built upon in LLGP and existing methods for multioutput GPs, followed by Sec. 4 describing our contributions. Sec. 5 describes our method. Then, in Sec. 6 we compare the performance of LLGP to existing methods and offer concluding remarks in Sec. 7.
2 Background
2.1 Gaussian processes (GPs)
A GP is a set of random variables (RVs)
indexed by , with the property that, for any finite collection of , the RVs are jointly Gaussian with zero mean without loss of generality and a prespecified covariance , , where and (Williams and Rasmussen, 1996). Given observations of , inference at a single point of an valued RV is performed by conditioning (Williams and Rasmussen, 1996). Predictive accuracy is sensitive to a particular parameterization of our kernel, and model estimation is performed by maximizing data log likelihood with respect to parameters of , . Gradientbased optimization methods then require the gradient with respect to every parameter of . Fixing :(1) 
2.2 Multioutput linear GPs
We build multioutput GP models as instances of general GPs, where a multioutput model explicitly represents correlations between outputs through a shared input space (Álvarez et al., 2012). Here, for outputs, we write our indexing set as , a point from a shared domain coupled with an output index. Then, if we make observations at for output , we can set:
An LMC kernel is of the form:
(2) 
where are stationary kernels on . Typically, the positive semidefinite (PSD) matrices formed by are parameterized as , with and a preset rank. Importantly, even though each is stationary, is only stationary on if is Toeplitz. In practice, where we wish to capture covariance across outputs as a dimensional latent process, is not Toeplitz, so .
The LMC kernel provides a flexible way of specifying multiple additive contributions to the covariance between two inputs for two different outputs. The contribution of the th kernel to the covariance between the th and th outputs is then specified by the multiplicative factor . By choosing to have rank , the corresponding LMC model can have any contribution between two outputs that best fits the data, so long as remains PSD. By reducing the rank , the interactions of the outputs have lowerrank latent processes, with rank 0 indicating no interaction (i.e., if , then we have an independent GP for each output).
2.3 Structured covariance matrices
If we can identify structure in the covariance , then we can develop fast inmemory representations and efficient matrixvector multiplications (MVMs) for —this has been used in the past to accelerate GP model estimation (Gilboa et al., 2015; Cunningham et al., 2008). The Kronecker product of matrices of order is a block matrix of order with th block . We can represent the product by keeping representations of and separately, rather than the product. Then, the corresponding MVMs can be computed in time , where is the runtime of a MVM. For GPs on uniform dimension grids, this reduces the runtime of finding from to (Gilboa et al., 2015).
Symmetric Toeplitz matrices are constant along their diagonals and fully determined by their top row , yielding an representation. Such matrices arise naturally when we examine the covariance matrix induced by a stationary kernel applied to a onedimensional grid of inputs. Since the difference in adjacent inputs is the same for all , we have the Toeplitz property that . Furthermore, we can embed in the upperleft corner of a circulant matrix of twice its size, in time. This approach has been used for fast inference in singleoutput GP time series with uniformly spaced inputs (Cunningham et al., 2008). When applied to grids of more than one dimension, the resulting covariance becomes blockToeplitz with Toeplitz blocks (BTTB) (Wilson et al., 2015). Consider a twodimensional grid with separations . For a fixed pair of , this grid contains a onedimensional subgrid over varying values. The pairwise covariances for a stationary kernel in this subgrid also exhibit Toeplitz structure, since we still have . Ordering points in lexicographic order, so that the covariance matrix has order blocks with pairwise covariances between varying values between a pair of fixed values, the previous sentence implies are Toeplitz. By similar reasoning, since for any , , we have , thus the block structure of is itself Toeplitz, and hence is BTTB (Fig. 1). For higher dimensions, one can imagine a recursive blockToeplitz structure. BTTB matrices themselves admit runtime for MVMs, with being the total grid size, or in the two dimensional case, and if they are symmetric they can be represented with only their top row as well. The MVM runtime constant scales exponentially in the input dimension, however, so this approach is only applicable to lowinputdimension problems.
3 Related work
3.1 Approximate inference methods
Inducing point approaches create a tractable model to approximate the exact GP. For example, the deterministic training conditional (DTC) for a singleoutput GP fixes inducing points and estimates kernel hyperparameters for (QuiñoneroCandela and Rasmussen, 2005). This approach is agnostic to kernel stationarity, so one may use inducing points for all outputs , with the model equal to the exact GP model when (Álvarez et al., 2010). Computationally, these approaches resemble making rank approximations to the covariance matrix.
In Collaborative Multioutput Gaussian Processes (COGP), multioutput GP algorithms further share an internal representation of the covariance structure among all outputs at once (Nguyen et al., 2014). COGP fixes inducing points for some sized and puts a GP prior on with a restricted LMC kernel that matches the Semiparametric Latent Factor Model (SLFM) (Seeger et al., 2005). Applying the COGP prior to corresponds to an LMC kernel (Eq. 2) where is set to 0 and . Moreover, SLFM and COGP models include an independent GP for each output, represented in LMC as additional kernels , where and . COGP uses its shared structure to derive efficient expressions for variational inference (VI) for parameter estimation.
3.2 Structured Kernel Interpolation (SKI)
SKI abandons the inducingpoint approach: instead of using an intrinsically sparse model, SKI approximates the original directly (Wilson and Nickisch, 2015). To do this efficiently, SKI relies on the differentiability of . For within a grid , , and
as the cubic interpolation weights
(Keys, 1981), . The simultaneous interpolation then yields the SKI approximation: . has only nonzero entries, with . Even without relying on structure, SKI reduces the representation of to an rank matrix.Massively Scalable Gaussian Processes (MSGP) exploits structure as well: the kernel on a grid has Kronecker and Toeplitz matrix structure (Wilson et al., 2015). Drawing on previous work on structured GPs (Cunningham et al., 2008; Gilboa et al., 2015), MSGP uses linear conjugate gradient descent as a method for evaluating efficiently for Eq. 1. In addition, an efficient eigendecomposition that carries over to the SKI kernel for the remaining term in Eq. 1 has been noted previously (Wilson et al., 2014).
Although evaluating is not feasible in the LMC setting because the LMC sum breaks Kronecker and Toeplitz structure, the approach of creating structure with SKI carries over to LLGP.
4 Contributions of LLGP
First, we identify a BTTB structure induced by the LMC kernel evaluated on a grid. Next, we show how an LMC kernel can be decomposed into two block diagonal components, one of which has structure similar to that of SLFM (Seeger et al., 2005). Both of these structures coordinate for fast matrixvector multiplication with the covariance matrix for any vector z.
With multiple outputs, LMC crosscovariance interactions violate the assumptions of SKI’s cubic interpolation, which require full stationarity (invariance to translation in the indexing space ) and differentiability. We show a modification to SKI is compatible with the piecewisedifferentiable LMC kernel, which is only invariant to translation along the indexing subspace (the subkernels are stationary). This partial stationarity induces the previouslyidentified BTTB structure of the LMC kernel and enables acceleration of GP estimation for nonuniform inputs.
For lowdimensional inputs, the above contributions offer an asymptotic and practical runtime improvement in hyperparameter estimation while also expanding the feasible kernel families to any differentiable LMC kernels, relative to COGP (Tab. 1) (Nguyen et al., 2014). LLGP is also, to the author’s knowledge, first in experimentally validating the viability of SKI for input dimensions higher than one, which addresses a large class of GP applications that stand to benefit from tractable jointoutput modeling.
5 Large Linear GP
We propose a linear model of coregionalization (LMC) method based on recent structurebased optimizations for GP estimation instead of variational approaches. Critically, the accuracy of the method need not be reduced by keeping the number of interpolation points low, because its reliance on structure allows better asymptotic performance. For simplicity, our work focuses on multidimensional outputs, onedimensional inputs, and Gaussian likelihoods.
For a given , we construct an operator which approximates MVMs with the exact covariance matrix, which for brevity we overload as , so . Using only the action of MVM with the covariance operator, we derive . Critically, we cannot access itself, only , so we choose AdaDelta as a gradientonly highlevel optimization routine for (Zeiler, 2012).
5.1 Gradient construction
Gibbs and MacKay (1996) describe the algorithm for GP model estimation in terms of only MVMs with the covariance matrix. In particular, we can solve for satisfying in Eq. 1 using linear conjugate gradient descent (LCG). Moreover, Gibbs and MacKay develop a stochastic approximation by introducing RV r with :
(3) 
This approximation improves as the size of increases, so, as in other work (Cutajar et al., 2016), we let and take a fixed number samples from r.
We depart from Gibbs and MacKay in two important ways (Algorithm 1). First, we do not construct , but instead keep a lowmemory representation (Sec. 5.2). Second, we use minres instead of LCG as the Krylovsubspace inversion method used to compute inverses from MVMs. Iterative minres solutions to numerically semidefinite matrices monotonically improve in practice, as opposed to LCG (Fong and Saunders, 2012). This is essential in GP optimization, where the diagonal noise matrix , iid for each output, shrinks throughout learning. Inversionbased methods then require additional iterations because , the spectral condition number of , increases as becomes less diagonally dominant (Fig. 2).
Every AdaDelta iteration (invoking Algorithm 1) then takes total time (Raykar and Duraiswami, 2007). This analysis holds as long as the error in the gradients is fixed and we can compute MVMs with the matrix for each at least as fast as . Indeed, we assume a differentiable kernel and then recall that applying the linear operator will maintain the structure of .
For a gradientonly stopping heuristic, we maintain the running maximum gradient
norm. If gradient norms drop below a proportion of the running max norm more than a preset number of times, we terminate (Fig. 2). This heuristic is avoidable since it is possible to evaluate with only MVMs (Han et al., 2015), but using the heuristic proved sufficient and results in a simpler gradientonly optimization routine.5.2 Fast MVMs and parsimonious kernels
The bottleneck of Algorithm 1 is the iterative MVM operations in minres. Since only enters computation as an MVM operator, the required memory is dictated by its representation , which need not be dense as long as we can perform MVM with any vector to arbitrary, fixed precision.
When LMC kernels are evaluated on a grid of points for each output, so , the simultaneous covariance matrix equation without noise over (Eq. 4) holds for BTTB matrices formed by the stationary kernels evaluated the shared interpolating points for all outputs:
(4) 
To adapt SKI to our context of multioutput GPs, we build a grid out of a common subgrid that extends to all outputs with . Since the LMC kernel evaluated between two sets of outputs is differentiable, as long as spans a range larger than each , the corresponding SKI approximation (Eq. 5) holds with the same asymptotic convergence cubic in .
(5) 
We cannot fold the Gaussian noise into the interpolated term since it does not correspond to a differentiable kernel. However, since is diagonal, it is efficient to represent and multiply with. MVM with requires MVM by the sparse matrices , which all take space and time.
We consider different representations of (Eq. 5) to reduce the memory and runtime requirements for performing the multiplication in the following sections.
5.2.1 Sum representation
In sum, we represent with a length list. At each index , is a dense matrix of order and is a BTTB matrix of order , represented using only the top row. In turn, multiplication is performed by multiplying each matrix in the list with z and summing the results. The Kronecker MVM may be expressed as fast BTTB MVMs with and dense MVMs with . In turn, assuming , the runtime for each of the terms is .
5.2.2 BlockToeplitz representation
In bt, we note that is a block matrix with blocks :
On a grid , these matrices are BTTB because they are linear combinations of BTTB matrices. bt requires sized rows to represent each . Then, using the usual block matrix multiplication, an MVM takes time since each inner block MVM is accelerated due to BTTB structure.
On a grid of inputs with , the SKI interpolation becomes . In this case, using bt alone leads to a faster algorithm—applying the Chan BTTB preconditioner reduces the number of MVMs necessary to find an inverse (Chan and Olkin, 1994).
5.2.3 SLFM representation
For the rankbased slfm representation, let be the average rank, , and rewrite the kernel:
Note is rank . Under some reindexing , which flattens the double summation such that each corresponds to a unique , the term may be rewritten as
where with a matrix of horizontally stacked column vectors (Seeger et al., 2005). Next, we rearrange the remaining term as , where is BTTB. Thus, slfm represents as the sum of two block diagonal matrices of block order and , where each block is a BTTB order matrix; thus, MVMs run in .
Note that bt and slfm each have a faster run time than the other depending on whether . An algorithm that uses this condition to decide between representations can minimize runtime (Tab. 1).
Method  Upfront cost for  Additional cost per hyperparameter 

Exact  
COGP  
LLGP 
5.3 GP mean and variance prediction
The predictive mean can be computed in time by (Wilson et al., 2015).
The full predictive covariance estimate requires finding a new term . This is done by solving the linear system in a matrixfree manner onthefly; in particular, is computed via minres for every new test point . Over several test points, this process is parallelizable.
6 Results
We evaluate the methods on held out data by using standardized mean square error (SMSE) of the test points with the predicted mean, and the negative log predictive density (NLPD) of the Gaussian likelihood of the inferred model. NLPD takes confidence into account, while SMSE only evaluates the mean prediction. In both cases, lower values represent better performance. We evaluated the performance of our representations of the kernel, sum, bt, and slfm, by computing exact gradients using the standard Cholesky algorithm over a variety of different kernels (see the supplement).^{1}^{1}1Hyperparameters, data, code, and benchmarking scripts are available in https://github.com/vlad17/runlmc.
Predictive performance on heldout data with SMSE and NLPD offers an applestoapples comparison with COGP, which was itself proposed on the basis of these two metrics. Training data log likelihood would unfairly favor LLGP, which optimizes directly. Note that predicting a constant value equal to each output’s holdout mean results in a baseline SMSE of .
Finally, we attempted to run hyperparameter optimization with both an exact GP and a variational DTC approximation as provided by the GPy package, but both runtime and predictive performance were already an order of magnitude worse than both LLGP and COGP on our smallest dataset from Sec. 6.2 (GPy, since 2012).
6.1 Synthetic dataset
First, we evaluate raw learning performance on a synthetic dataset. We fix the GP index as and generate a fixed SLFM RBF kernel with , fixed lengthscales, and covariance hyperparameters . We set the output dimension to be , so this synthetic dataset might resemble a geospatial GP model for subterranean mineral density, where the various minerals would be different outputs. Sampling GP values from the unit square, we hold out approximately of them, corresponding to the values for the final output in the upperright quadrant of the sampling square.
We consider the problem of estimating the prefixed GP hyperparameters (starting from randomlyinitialized ones) with LLGP and COGP. We evaluate the fit based on imputation performance on the heldout values (Tbl. 2). For COGP, we use hyperparameter settings applied to the Sarcos dataset from the COGP paper, a dataset of approximately the same size, which has the number of inducing points . However, COGP failed to have abovebaseline SMSE performance with learned inducing points, even on a range of up to and various learning rates. Using fixed inducing points allowed for moderate improvement in COGP, which we used for comparison. For LLGP, we use a grid of size on the unit square with no learning rate tuning. LLGP was able to estimate more predictive hyperparameter values in about the same amount of time it took COGP to learn significantly worse values in terms of prediction.
Metric  LLGP  COGP  COGP+ 
seconds  ()  101 (0)  () 
SMSE  0.12 (0.00)  ()  () 
NLPD  0.28 (0.00)  ()  () 

did not improve performance. Since COGP does not provide a terminating condition for its optimization, we also show its performance when permitted to train longer, labelled COGP+. All trials were run 3 times, with parenthesized values representing standard error shown below.
6.2 Foreign exchange rates (FX2007)
We replicate the mediumsized dataset from COGP as an application to evaluate LLGP performance. The dataset includes ten foreign currency exchange rates—CAD, EUR, JPY, GBP, CHF, AUD, HKD, NZD, KRW, and MXN—and three precious metals—XAU, XAG, and XPT—implying that . In LLGP, we set , as recommended for LMC models on this dataset (Álvarez et al., 2010). COGP roughly corresponds to the the SLFM model, which has a total of 94 hyperparameters, compared to 53 for LLGP. All kernels are squared exponential. The data used in this example are from 2007, and include training points and test points. The test points include 50 contiguous points extracted from each of the CAD, JPY, and AUD exchanges. For this application, LLGP uses
interpolating points. We used the COGP settings from the paper. LLGP outperforms COGP in terms of predictive mean and variance estimation as well as runtime (Tab.
3).Metric  LLGP  COGP 

seconds  69 (8)  () 
SMSE  0.21 (0.00)  () 
NLPD  3.62 (0.03)  () 

6.3 Weather dataset
Next, we replicate results from a weather dataset, a large time series used to validate COGP. Here, weather sensors Bramblemet, Sotonmet, Cambermet, and Chimet record air temperature over five days in five minute intervals, with some dropped records due to equipment failure. Parts of Cambernet and Chimet are dropped for imputation, yielding training measurements and test measurements.
We use the default COGP parameters.^{2}^{2}2https://github.com/trungngv/cogp We tested LLGP models on and interpolating points.
Metric 


COGP  

seconds  73 (12)  ()  ()  
SMSE  ()  ()  0.08 (0.00)  
NLPD  ()  1.69 (0.19)  ()  

LLGP performed slightly worse than COGP in SMSE, but both NLPD and runtime indicate significant improvements (Tab. 4, Fig. 4). Varying the number of interpolating points from to demonstrates the runtime versus NLPD tradeoff. While NLPD improvement diminishes as increases, LLGP still improves upon COGP for a wide range of by an order of magnitude in runtime and almost two orders of magnitude in NLPD.
7 Conclusion
In this paper, we present LLGP, which we show adapts and accelerates SKI (Wilson and Nickisch, 2015) for the problem of multioutput GP regression. LLGP exploits structure unique to LMC kernels, enabling a parsimonious representation of the covariance matrix, and gradient computations in .
LLGP provides an efficient means to approximate the loglikelihood gradients using interpolation. We have shown on several datasets that this can be done in a way that is faster and leads to more accurate results than variational approximations. Because LLGP scales well with increases in , capturing complex interactions in the covariance with an accurate interpolation is cheap, as demonstrated by performance on a variety of datasets (Tab. 2, Tab. 3, Tab. 4).
Future work could extend LLGP to accept large input dimensions, though most GP use cases are covered by lowdimensional inputs. Finally, an extension to nonGaussian noise and use of LLGP as a preconditioner for finetuned exact GP models is also feasible in a manner following prior work (Cutajar et al., 2016).
Acknowledgments
The authors would like to thank Princeton University and University of California, Berkeley, for providing the computational resources necessary to evaluate our method. Further, the authors owe thanks to Professors Joseph Gonzalez, Ion Stoica, Andrew Wilson, and John Cunningham for reviewing drafts of the paper and offering insightful advice about directions to explore.
References

Álvarez et al. (2012)
Mauricio Álvarez, Lorenzo Rosasco, Neil Lawrence, et al.
Kernels for vectorvalued functions: A review.
Foundations and Trends® in Machine Learning
, 4(3):195–266, 2012.  Wilson et al. (2015) Andrew Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.
 Williams and Rasmussen (1996) Christopher Williams and Carl Rasmussen. Gaussian processes for regression. Advances in neural information processing systems, pages 514–520, 1996.
 Osborne et al. (2008) Michael Osborne, Stephen Roberts, Alex Rogers, Sarvapali Ramchurn, and Nicholas Jennings. Towards realtime information processing of sensor network data using computationally efficient multioutput Gaussian processes. In 7th international conference on Information processing in sensor networks, pages 109–120. IEEE Computer Society, 2008.
 Álvarez et al. (2010) Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil D Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, volume 9, pages 25–32, 2010.
 Nguyen et al. (2014) Trung Nguyen, Edwin Bonilla, et al. Collaborative multioutput Gaussian processes. In UAI, pages 643–652, 2014.
 Gilboa et al. (2015) Elad Gilboa, Yunus Saatçi, and John Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 37(2):424–436, 2015.
 Cunningham et al. (2008) John Cunningham, Krishna Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In 25th international conference on Machine learning, pages 192–199. ACM, 2008.
 QuiñoneroCandela and Rasmussen (2005) Joaquin QuiñoneroCandela and Carl Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Seeger et al. (2005)
Matthias Seeger, YeeWhye Teh, and Michael Jordan.
Semiparametric latent factor models.
In
Eighth Conference on Artificial Intelligence and Statistics
, 2005.  Wilson and Nickisch (2015) Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (kissgp). In The 32nd International Conference on Machine Learning, pages 1775–1784, 2015.
 Keys (1981) Robert Keys. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.
 Wilson et al. (2014) Andrew Wilson, Elad Gilboa, John Cunningham, and Arye Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.
 Zeiler (2012) Matthew Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Gibbs and MacKay (1996) Mark Gibbs and David MacKay. Efficient implementation of Gaussian processes, 1996.
 Cutajar et al. (2016) Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In ICML, pages 2529–2538, 2016.
 Fong and Saunders (2012) David Fong and Michael Saunders. CG versus MINRES: an empirical comparison. SQU Journal for Science, 17(1):44–62, 2012.
 Raykar and Duraiswami (2007) Vikas Raykar and Ramani Duraiswami. Fast large scale Gaussian process regression using approximate matrixvector products. In Learning workshop, 2007.
 Han et al. (2015) Insu Han, Dmitry Malioutov, and Jinwoo Shin. Largescale logdeterminant computation through stochastic Chebyshev expansions. In ICML, pages 908–917, 2015.
 Chan and Olkin (1994) Tony Chan and Julia Olkin. Circulant preconditioners for toeplitzblock matrices. Numerical Algorithms, 6(1):89–101, 1994.
 GPy (since 2012) GPy. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
Comments
There are no comments yet.