Large Linear Multi-output Gaussian Process Learning

05/30/2017
by   Vladimir Feinberg, et al.

Gaussian processes (GPs), or distributions over arbitrary functions in a continuous domain, can be generalized to the multi-output case: a linear model of coregionalization (LMC) is one approach. LMCs estimate and exploit correlations across the multiple outputs. While model estimation can be performed efficiently for single-output GPs under a stationarity assumption, the cross-covariance interaction in the multi-output case is not stationary. We propose Large Linear GP (LLGP), which circumvents the need for stationarity by inducing structure in the LMC kernel through a common grid of inputs shared between outputs, enabling optimization of GP hyperparameters for multi-dimensional outputs and low-dimensional inputs. When applied to synthetic two-dimensional and real time series data, we find our theoretical improvement relative to the current solutions for multi-output GPs is realized, with LLGP reducing training time while improving or maintaining predictive mean accuracy. Moreover, by using a direct likelihood approximation rather than a variational one, model confidence estimates are significantly improved.


1 Introduction

GPs are a nonlinear regression method that captures function smoothness across inputs through a response covariance function (Williams and Rasmussen, 1996). GPs extend to multi-output regression, where the objective is to build a probabilistic regression model over vector-valued observations by identifying latent cross-output processes. Multi-output GPs frequently appear in time-series and geostatistical contexts, such as the problem of imputing missing temperature readings for sensors in different locations, or missing foreign exchange rates and commodity prices given rates and prices for other goods over time (Osborne et al., 2008; Álvarez et al., 2010). Efficient model estimation would enable researchers to quickly explore large spaces of parameterizations to find an appropriate one for their task.

For $n$ input points, exact GP inference requires maintaining an $n \times n$ matrix of covariances between response variables at each input and performing $O(n^3)$ inversions with that matrix (Williams and Rasmussen, 1996). Some single-output GP methods exploit structure in this matrix to reduce runtime (Wilson et al., 2015). In the multi-output setting, the same structure does not exist. Approximations developed for multi-output methods instead reduce the dimensionality of the GP estimation problem from the $n$ observed inputs to $m$ inducing points, but still require $m$ to scale with $n$ to retain accuracy (Nguyen et al., 2014). The polynomial dependence on $m$ is still cubic: the matrix inversion underlying the state-of-the-art multi-output GP estimation method ignores LMC's structure. Namely, the cross-covariance between two outputs is determined by a linear combination of stationary subkernels. On a grid of inputs, each subkernel induces matrix structure, so by viewing the LMC kernel as a linear combination of structured matrices we can avoid direct matrix inversion.

Our paper is organized as follows. In Sec. 2 we give background on single-output and multi-output GPs, as well as some history in exploiting structure for matrix inversions in GPs. Sec. 3 details both related work that was built upon in LLGP and existing methods for multi-output GPs, followed by Sec. 4 describing our contributions. Sec. 5 describes our method. Then, in Sec. 6 we compare the performance of LLGP to existing methods and offer concluding remarks in Sec. 7.

2 Background

2.1 Gaussian processes (GPs)

A GP is a set of random variables (RVs) $\{f_x\}$ indexed by $x \in \mathcal{X}$, with the property that, for any finite collection of indices $X \subset \mathcal{X}$, the RVs $f_X$ are jointly Gaussian with zero mean (without loss of generality) and a prespecified covariance $K$, where $\operatorname{Cov}(f_x, f_{x'}) = K(x, x')$ and $[K_{XX}]_{ij} = K(x_i, x_j)$ (Williams and Rasmussen, 1996). Given noisy observations $\mathbf{y}$ of $f_X$, inference of the real-valued RV $f_{x_*}$ at a single point $x_*$ is performed by conditioning (Williams and Rasmussen, 1996). Predictive accuracy is sensitive to the particular parameterization of our kernel, and model estimation is performed by maximizing the data log likelihood $\mathcal{L} = \log p(\mathbf{y} \mid \boldsymbol{\theta})$ with respect to the parameters $\boldsymbol{\theta}$ of $K$. Gradient-based optimization methods then require the gradient with respect to every parameter $\theta_j$ of $K$. Fixing $\boldsymbol{\alpha} = K_\epsilon^{-1}\mathbf{y}$, with $K_\epsilon$ the covariance of $\mathbf{y}$ including the Gaussian noise term:

$$\partial_{\theta_j} \mathcal{L} = \frac{1}{2}\left(\boldsymbol{\alpha}^\top (\partial_{\theta_j} K_\epsilon)\, \boldsymbol{\alpha} - \operatorname{tr}\!\left(K_\epsilon^{-1}\, \partial_{\theta_j} K_\epsilon\right)\right) \tag{1}$$
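For concreteness, the following is a minimal NumPy sketch (not the paper's code) of Eq. 1 for a toy single-output RBF kernel with one lengthscale hyperparameter; the names are illustrative, and this dense $O(n^3)$ computation is exactly what the rest of the paper works to avoid.

```python
import numpy as np

def rbf_kernel(X, lengthscale):
    """Toy RBF kernel matrix; `lengthscale` is the single hyperparameter."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def loglik_and_grad(X, y, lengthscale, noise=1e-2):
    """Exact data log likelihood and its lengthscale gradient (Eq. 1)."""
    n = len(X)
    K = rbf_kernel(X, lengthscale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K^{-1} y
    loglik = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    d2 = (X[:, None] - X[None, :]) ** 2
    dK = rbf_kernel(X, lengthscale) * d2 / lengthscale**3      # dK / d(lengthscale)
    Kinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    grad = 0.5 * (alpha @ dK @ alpha - np.trace(Kinv @ dK))    # Eq. 1
    return loglik, grad

X = np.linspace(0, 1, 50)
y = np.sin(6 * X) + 0.1 * np.random.randn(50)
print(loglik_and_grad(X, y, lengthscale=0.3))
```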

2.2 Multi-output linear GPs

We build multi-output GP models as instances of general GPs, where a multi-output model explicitly represents correlations between outputs through a shared input space (Álvarez et al., 2012). Here, for $D$ outputs, we write our indexing set as $\mathcal{X} = \mathcal{D} \times [D]$: a point from a shared domain $\mathcal{D}$ coupled with an output index. Then, if we make $n_i$ observations $X_i$ for output $i$, we can set $X = \bigcup_i X_i \times \{i\}$, with $n = \sum_i n_i$ total observations.

An LMC kernel is of the form:

$$K\big((x, i), (x', j)\big) = \sum_{q} B^{(q)}_{ij}\, k_q(x - x'), \tag{2}$$

where the $k_q$ are stationary kernels on $\mathcal{D}$. Typically, the positive semi-definite (PSD) matrices $B^{(q)}$ formed by the entries $B^{(q)}_{ij}$ are parameterized as $B^{(q)} = A_q A_q^\top + \operatorname{diag}\boldsymbol{\kappa}_q$, with $A_q \in \mathbb{R}^{D \times R_q}$ and $R_q$ a preset rank. Importantly, even though each $k_q$ is stationary, $K$ is only stationary on $\mathcal{X}$ if every $B^{(q)}$ is Toeplitz. In practice, where we wish to capture covariance across outputs as a $D$-dimensional latent process, $B^{(q)}$ is not Toeplitz, so $K$ is not stationary.

The LMC kernel provides a flexible way of specifying multiple additive contributions to the covariance between two inputs for two different outputs. The contribution of the $q$th kernel to the covariance between the $i$th and $j$th outputs is then specified by the multiplicative factor $B^{(q)}_{ij}$. By choosing $B^{(q)}$ to have rank up to $D$, the corresponding LMC model can have any contribution between two outputs that best fits the data, so long as $B^{(q)}$ remains PSD. By reducing the rank $R_q$, the interactions of the outputs are captured by lower-rank latent processes, with rank 0 indicating no interaction (i.e., if every $A_q = 0$, then we have an independent GP for each output).
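To illustrate Eq. 2, the sketch below builds the dense LMC covariance for a shared set of one-dimensional inputs; the function and variable names are hypothetical, and a practical implementation would never form this matrix explicitly (that is the point of Sec. 5).

```python
import numpy as np

def lmc_covariance(X, A_list, kappa_list, kernels):
    """Dense LMC covariance (Eq. 2) over D outputs sharing the inputs X.

    B_q = A_q A_q^T + diag(kappa_q) weights the q-th stationary kernel k_q.
    The result is ordered output-major: block (i, j) holds Cov(f_i(X), f_j(X)).
    """
    n, D = len(X), A_list[0].shape[0]
    K = np.zeros((n * D, n * D))
    for A_q, kappa_q, k_q in zip(A_list, kappa_list, kernels):
        B_q = A_q @ A_q.T + np.diag(kappa_q)      # D x D coregionalization matrix
        K_q = k_q(X[:, None] - X[None, :])         # n x n stationary part
        K += np.kron(B_q, K_q)
    return K

# Two outputs, one rank-1 RBF component plus small independent per-output terms.
X = np.linspace(0, 1, 40)
rbf = lambda ell: (lambda d: np.exp(-0.5 * d**2 / ell**2))
A = [np.array([[1.0], [0.5]])]
kappa = [np.array([0.1, 0.1])]
K = lmc_covariance(X, A, kappa, [rbf(0.2)])
print(K.shape)  # (80, 80)
```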

2.3 Structured covariance matrices

If we can identify structure in the covariance $K_{XX}$, then we can develop fast in-memory representations and efficient matrix-vector multiplications (MVMs) for it; this has been used in the past to accelerate GP model estimation (Gilboa et al., 2015; Cunningham et al., 2008). The Kronecker product $A \otimes B$ of matrices of order $a$ and $b$ is a block matrix of order $ab$ with $(i, j)$th block $A_{ij}B$. We can represent the product by keeping $A$ and $B$ separately, rather than storing the order-$ab$ product. The corresponding MVMs can then be computed in time $O(a\, T_B + b\, T_A)$, where $T_M$ is the runtime of an MVM with $M$. For GPs on uniform dimension-$P$ grids, this reduces the runtime of computing $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ from $O(n^3)$ to $O(P n^{(P+1)/P})$ (Gilboa et al., 2015).

Symmetric Toeplitz matrices are constant along their diagonals and fully determined by their top row, yielding an $O(n)$ representation. Such matrices arise naturally when we examine the covariance matrix induced by a stationary kernel applied to a one-dimensional grid of inputs $\{x_i\}$: since the difference between adjacent inputs is the same for all $i$, we have the Toeplitz property $K_{ij} = k(x_i - x_j) = k(x_{i+1} - x_{j+1}) = K_{(i+1)(j+1)}$. Furthermore, we can embed a Toeplitz matrix in the upper-left corner of a circulant matrix of twice its size, which admits MVMs in $O(n \log n)$ time via the FFT. This approach has been used for fast inference in single-output GP time series with uniformly spaced inputs (Cunningham et al., 2008). When applied to grids of more than one dimension, the resulting covariance becomes block-Toeplitz with Toeplitz blocks (BTTB) (Wilson et al., 2015). Consider a two-dimensional grid $\{(a_i, b_j)\}_{i,j}$ with fixed separations along each axis. For a fixed pair $(i, i')$, this grid contains a one-dimensional subgrid over the varying $b_j$ values. The pairwise covariances for a stationary kernel in this subgrid also exhibit Toeplitz structure, since we still have $k\big((a_i, b_j) - (a_{i'}, b_{j'})\big) = k\big((a_i, b_{j+1}) - (a_{i'}, b_{j'+1})\big)$. Ordering points lexicographically, so that the covariance matrix has blocks $K^{(i i')}$ holding the pairwise covariances between varying $b$ values for a fixed pair $(a_i, a_{i'})$, the previous sentence implies each block $K^{(i i')}$ is Toeplitz. By similar reasoning, since $a_i - a_{i'} = a_{i+1} - a_{i'+1}$ for any $i, i'$, we have $K^{(i i')} = K^{((i+1)(i'+1))}$; thus the block structure is itself Toeplitz, and hence the covariance matrix is BTTB (Fig. 1). For higher dimensions, one can imagine a recursive block-Toeplitz structure. BTTB matrices themselves admit $O(m \log m)$ runtime for MVMs, with $m$ being the total grid size ($m = m_1 m_2$ in the two-dimensional case), and if they are symmetric they can be represented with only their top row as well. The MVM runtime constant scales exponentially in the input dimension, however, so this approach is only applicable to low-input-dimension problems.

Figure 1: An illustration of a stationary kernel evaluated on a two-dimensional input grid with fixed separations along each axis, resulting in BTTB structure. Identical blocks are colored the same.
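The two structural tools of this section, circulant embedding of a symmetric Toeplitz matrix and Kronecker-factored MVMs, can be sketched as follows; the helper names are illustrative, not the paper's code, and a one-dimensional grid stands in for the general BTTB case.

```python
import numpy as np

def toeplitz_mvm(top_row, z):
    """Symmetric-Toeplitz MVM in O(n log n): embed the matrix in the
    upper-left corner of a circulant matrix of twice the size and use FFTs."""
    n = len(top_row)
    c = np.concatenate([top_row, [0.0], top_row[:0:-1]])   # circulant first column
    zpad = np.concatenate([z, np.zeros(n)])
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(zpad), n=2 * n)[:n]

def kron_mvm(A, B_mvm, m_B, z):
    """(A kron B) z using a dense A and only an MVM routine for B."""
    Z = z.reshape(A.shape[0], m_B)              # one length-m_B block per row of A
    BZ = np.stack([B_mvm(row) for row in Z])    # apply B to every block
    return (A @ BZ).reshape(-1)

# Sanity check against a dense construction on a tiny grid.
x = np.linspace(0, 1, 6)
row = np.exp(-np.abs(x - x[0]))                 # stationary kernel on a 1-D grid
T = np.array([[row[abs(i - j)] for j in range(6)] for i in range(6)])
A = np.random.randn(3, 3)
z = np.random.randn(18)
dense = np.kron(A, T) @ z
fast = kron_mvm(A, lambda v: toeplitz_mvm(row, v), 6, z)
print(np.allclose(dense, fast))                 # True
```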

3 Related work

3.1 Approximate inference methods

Inducing point approaches create a tractable model to approximate the exact GP. For example, the deterministic training conditional (DTC) for a single-output GP fixes $m$ inducing points and estimates kernel hyperparameters for the resulting approximate model (Quiñonero-Candela and Rasmussen, 2005). This approach is agnostic to kernel stationarity, so one may use shared inducing points for all outputs of a multi-output GP, with the model equal to the exact GP model when the inducing points coincide with the observed inputs (Álvarez et al., 2010). Computationally, these approaches resemble making rank-$m$ approximations to the covariance matrix.

In Collaborative Multi-output Gaussian Processes (COGP), multi-output GP algorithms further share an internal representation of the covariance structure among all outputs at once (Nguyen et al., 2014). COGP fixes a set of $m$ inducing points and puts a GP prior on the latent processes with a restricted LMC kernel that matches the Semiparametric Latent Factor Model (SLFM) (Seeger et al., 2005). The COGP prior corresponds to an LMC kernel (Eq. 2) where each $\boldsymbol{\kappa}_q$ is set to 0 and $R_q = 1$. Moreover, SLFM and COGP models include an independent GP for each output, represented in LMC as $D$ additional kernels $k_{Q+1}, \ldots, k_{Q+D}$, where $A_{Q+i} = 0$ and $\boldsymbol{\kappa}_{Q+i}$ is supported only on output $i$. COGP uses its shared structure to derive efficient expressions for variational inference (VI) for parameter estimation.

3.2 Structured Kernel Interpolation (SKI)

SKI abandons the inducing-point approach: instead of using an intrinsically sparse model, SKI approximates the original kernel matrix directly (Wilson and Nickisch, 2015). To do this efficiently, SKI relies on the differentiability of $K$. For $x$ within a grid $U$ of $m$ points, $K_{xX} \approx W_x K_{UX}$, with $W_x$ the cubic interpolation weights (Keys, 1981). Interpolating both arguments simultaneously then yields the SKI approximation $K_{XX} \approx W K_{UU} W^\top$, where the sparse matrix $W$ has only $O(n)$ nonzero entries. Even without relying on structure in $K_{UU}$, SKI reduces the representation of $K_{XX}$ to a rank-$m$ matrix.
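A minimal sketch of the SKI idea follows, using linear rather than the cubic (Keys, 1981) interpolation weights for brevity; all names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def interp_weights(x, grid):
    """Sparse interpolation matrix W with K_xU ~ W_x K_UU: two nonzero
    weights per row here (linear); SKI uses four (cubic, Keys 1981)."""
    rows, cols, vals = [], [], []
    for r, xi in enumerate(x):
        j = int(np.clip(np.searchsorted(grid, xi) - 1, 0, len(grid) - 2))
        w = (xi - grid[j]) / (grid[j + 1] - grid[j])
        rows += [r, r]; cols += [j, j + 1]; vals += [1 - w, w]
    return csr_matrix((vals, (rows, cols)), shape=(len(x), len(grid)))

rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / 0.3**2)
x = np.sort(np.random.rand(200))           # non-uniform inputs
grid = np.linspace(0, 1, 50)               # m interpolating points spanning the inputs
W = interp_weights(x, grid).toarray()      # densified only for this check
K_ski = W @ rbf(grid, grid) @ W.T          # SKI: K_XX ~ W K_UU W^T
print(np.abs(K_ski - rbf(x, x)).max())     # small interpolation error
```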

Massively Scalable Gaussian Processes (MSGP) exploits structure as well: the kernel on a grid has Kronecker and Toeplitz matrix structure (Wilson et al., 2015). Drawing on previous work on structured GPs (Cunningham et al., 2008; Gilboa et al., 2015), MSGP uses linear conjugate gradient descent to evaluate $K_\epsilon^{-1}\mathbf{y}$ efficiently for Eq. 1. In addition, an efficient eigendecomposition that carries over to the SKI kernel for the remaining trace term in Eq. 1 has been noted previously (Wilson et al., 2014).

Although this eigendecomposition-based evaluation is not feasible in the LMC setting, because the LMC sum breaks Kronecker and Toeplitz structure, the approach of creating structure with SKI carries over to LLGP.

4 Contributions of LLGP

First, we identify a BTTB structure induced by the LMC kernel evaluated on a grid. Next, we show how an LMC kernel can be decomposed into two block diagonal components, one of which has structure similar to that of SLFM (Seeger et al., 2005). Both of these structures combine to give fast matrix-vector multiplication $\mathbf{z} \mapsto K_\epsilon \mathbf{z}$ with the covariance matrix for any vector $\mathbf{z}$.

With multiple outputs, LMC cross-covariance interactions violate the assumptions of SKI's cubic interpolation, which requires full stationarity (invariance to translation in the indexing space $\mathcal{X}$) and differentiability. We show that a modification of SKI is compatible with the piecewise-differentiable LMC kernel, which is only invariant to translation along the indexing subspace $\mathcal{D}$ (the subkernels $k_q$ are stationary). This partial stationarity induces the previously-identified BTTB structure of the LMC kernel and enables acceleration of GP estimation for non-uniform inputs.

For low-dimensional inputs, the above contributions offer an asymptotic and practical runtime improvement in hyperparameter estimation while also expanding the feasible kernel families to any differentiable LMC kernels, relative to COGP (Tab. 1) (Nguyen et al., 2014). LLGP is also, to the authors' knowledge, the first to experimentally validate the viability of SKI for input dimensions higher than one, which addresses a large class of GP applications that stand to benefit from tractable joint-output modeling.

5 Large Linear GP

We propose a linear model of coregionalization (LMC) method based on recent structure-based optimizations for GP estimation instead of variational approaches. Critically, accuracy need not be sacrificed by keeping the number of interpolation points low, because the method's reliance on structure allows better asymptotic performance. For simplicity, our work focuses on multi-dimensional outputs, low-dimensional inputs, and Gaussian likelihoods.

For a given $\boldsymbol{\theta}$, we construct an operator which approximates MVMs with the exact covariance matrix; for brevity we overload this operator as $K_\epsilon$, so that $\mathbf{z} \mapsto K_\epsilon \mathbf{z}$ denotes its action. Using only the action of MVMs with the covariance operator, we derive $\nabla_{\boldsymbol{\theta}}\mathcal{L}$. Critically, we cannot access $\mathcal{L}$ itself, only $\nabla_{\boldsymbol{\theta}}\mathcal{L}$, so we choose AdaDelta as a gradient-only high-level optimization routine for $\boldsymbol{\theta}$ (Zeiler, 2012).

5.1 Gradient construction

Gibbs and MacKay (1996) describe an algorithm for GP model estimation in terms of only MVMs with the covariance matrix. In particular, we can solve for the $\boldsymbol{\alpha}$ satisfying $K_\epsilon \boldsymbol{\alpha} = \mathbf{y}$ in Eq. 1 using linear conjugate gradient descent (LCG). Moreover, Gibbs and MacKay develop a stochastic approximation of the trace term by introducing an RV $\mathbf{r}$ with $\operatorname{Cov}\mathbf{r} = I$:

$$\operatorname{tr}\!\left(K_\epsilon^{-1}\, \partial_{\theta_j} K_\epsilon\right) = \mathbb{E}_{\mathbf{r}}\!\left[\mathbf{r}^\top K_\epsilon^{-1}\, (\partial_{\theta_j} K_\epsilon)\, \mathbf{r}\right] \tag{3}$$

This approximation improves as the number of sampled probe vectors increases, so, as in other work (Cutajar et al., 2016), we take a fixed number of samples from $\mathbf{r}$.
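Eq. 3 is the standard stochastic trace estimator; a small self-contained sketch, using random sign probes as one common choice of $\mathbf{r}$ with $\operatorname{Cov}\mathbf{r} = I$, looks like this.

```python
import numpy as np

def hutchinson_trace(mvm, n, num_samples=64, seed=0):
    """Estimate tr(M) from MVMs alone; Eq. 3 applies this with
    M = K_eps^{-1} (dK_eps/dtheta_j).  Random sign probes have Cov r = I."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        r = rng.choice([-1.0, 1.0], size=n)
        total += r @ mvm(r)
    return total / num_samples

M = np.diag(np.arange(1.0, 101.0))             # toy matrix with trace 5050
print(hutchinson_trace(lambda v: M @ v, 100))  # 5050.0 (exact for diagonal M)
```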

We depart from Gibbs and MacKay in two important ways (Algorithm 1). First, we do not construct $K_\epsilon$ explicitly, but instead keep a low-memory representation (Sec. 5.2). Second, we use minres instead of LCG as the Krylov-subspace method used to compute inverses from MVMs. Iterative minres solutions to numerically semidefinite systems improve monotonically in practice, as opposed to LCG (Fong and Saunders, 2012). This is essential in GP optimization, where the diagonal noise, iid for each output, shrinks throughout learning. Inversion-based methods then require additional iterations because the spectral condition number of $K_\epsilon$ increases as $K_\epsilon$ becomes less diagonally dominant (Fig. 2).

Figure 2: Trace of (a) the number of MVMs that minres must perform to invert $K_\epsilon$ and (b) the gradient $\ell_2$-norm, given at each optimization iteration for a GP applied to the dataset in Sec. 6.2. In (b), in green, we show 20% of the rolling maximum $\ell_2$-norm of previous gradients.
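A matrix-free solve of the kind described above can be sketched with SciPy's minres and a LinearOperator; here the kernel matrix is formed densely only for brevity, whereas LLGP would supply the structured MVMs of Sec. 5.2.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

n = 500
x = np.sort(np.random.rand(n))
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2)
noise = 1e-3                                   # small noise: poorly conditioned K_eps

# K_eps enters only through its action on a vector; the dense K here is a
# stand-in for the structured representations of Sec. 5.2.
K_eps = LinearOperator((n, n), matvec=lambda v: K @ v + noise * v)
y = np.sin(8 * x)
alpha, info = minres(K_eps, y)                 # info == 0 on convergence
print(info, np.abs(K_eps.matvec(alpha) - y).max())
```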

Every AdaDelta iteration (invoking Algorithm 1) then takes total time dominated by the MVMs it performs (Raykar and Duraiswami, 2007). This analysis holds as long as the error in the gradients is fixed and we can compute MVMs with each derivative matrix $\partial_{\theta_j} K_\epsilon$ at least as fast as with $K_\epsilon$ itself. Indeed, we assume a differentiable kernel and then recall that applying $\partial_{\theta_j}$ maintains the structure of $K_\epsilon$.

For a gradient-only stopping heuristic, we maintain the running maximum gradient $\ell_2$-norm. If gradient norms drop below a fixed proportion of the running maximum more than a pre-set number of times, we terminate (Fig. 2). This heuristic is avoidable, since it is possible to evaluate $\mathcal{L}$ itself with only MVMs (Han et al., 2015), but the heuristic proved sufficient and results in a simpler gradient-only optimization routine.

1: procedure LLGP($\boldsymbol{\theta}$, $\mathbf{y}$, $N$)
2:     Sample probe vectors $\mathbf{r}_1, \ldots, \mathbf{r}_N$ with $\operatorname{Cov}\mathbf{r}_i = I$.
3:     for $\mathbf{z}$ in $\{\mathbf{y}, \mathbf{r}_1, \ldots, \mathbf{r}_N\}$, in parallel do
4:         Compute $K_\epsilon^{-1}\mathbf{z}$ with minres.
5:     end for
6:     $\boldsymbol{\alpha} \leftarrow K_\epsilon^{-1}\mathbf{y}$
7:     for $\theta_j$ in $\boldsymbol{\theta}$ do    ▷ Compute $\partial_{\theta_j}\mathcal{L}$
8:         $t_j \leftarrow \frac{1}{N}\sum_i (K_\epsilon^{-1}\mathbf{r}_i)^\top (\partial_{\theta_j} K_\epsilon)\, \mathbf{r}_i$    ▷ Eq. 3
9:         $g_j \leftarrow \frac{1}{2}\left(\boldsymbol{\alpha}^\top (\partial_{\theta_j} K_\epsilon)\, \boldsymbol{\alpha} - t_j\right)$    ▷ Eq. 1
10:     end for
11:     return $\mathbf{g}$
12: end procedure
Algorithm 1: Compute an approximation of $\nabla_{\boldsymbol{\theta}}\mathcal{L}$. Assume minres is the inversion routine. We also assume we have access to linear operators representing the matrices $K_\epsilon$ and $\partial_{\theta_j} K_\epsilon$.
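Putting the pieces together, the following is a compact sketch of Algorithm 1 under the assumption that matrix-free MVM routines for $K_\epsilon$ and each $\partial_{\theta_j} K_\epsilon$ are available; it is a hypothetical helper, not the runlmc API.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def llgp_grad(K_mvm, dK_mvms, y, num_probes=8, seed=0):
    """Sketch of Algorithm 1: approximate each dL/dtheta_j from MVMs alone.
    K_mvm multiplies by K_eps; dK_mvms[j] multiplies by dK_eps/dtheta_j."""
    n = len(y)
    rng = np.random.default_rng(seed)
    A = LinearOperator((n, n), matvec=K_mvm)
    probes = [rng.choice([-1.0, 1.0], size=n) for _ in range(num_probes)]
    # Solve K_eps^{-1} z for z in {y, r_1, ..., r_N}; these solves are independent.
    solves = [minres(A, z)[0] for z in [y] + probes]
    alpha, inv_probes = solves[0], solves[1:]
    grad = []
    for dK in dK_mvms:
        trace = np.mean([ir @ dK(r) for ir, r in zip(inv_probes, probes)])  # Eq. 3
        grad.append(0.5 * (alpha @ dK(alpha) - trace))                      # Eq. 1
    return np.array(grad)
```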

5.2 Fast MVMs and parsimonious kernels

The bottleneck of Algorithm 1 is the iterated MVM operations inside minres. Since $K_\epsilon$ only enters computation as an MVM operator, the required memory is dictated by its representation, which need not be dense as long as we can perform MVMs with any vector to arbitrary, fixed precision.

When LMC kernels are evaluated on a grid of points shared by every output, so that the evaluation points are $U \times [D]$, the covariance matrix without noise over that grid decomposes (Eq. 4) into BTTB matrices $K^{(q)}_{UU}$ formed by the stationary kernels evaluated on the shared interpolating points for all outputs:

$$K_{UU,\mathrm{LMC}} = \sum_q B^{(q)} \otimes K^{(q)}_{UU}. \tag{4}$$

To adapt SKI to our context of multi-output GPs, we build the grid out of a common subgrid $U \subset \mathcal{D}$ that extends to all outputs. Since the LMC kernel evaluated between two fixed outputs is differentiable, as long as $U$ spans a range larger than each $X_i$, the corresponding SKI approximation (Eq. 5) holds with the same asymptotic convergence, cubic in the grid spacing:

$$K_\epsilon \approx W\, K_{UU,\mathrm{LMC}}\, W^\top + \boldsymbol{\epsilon}. \tag{5}$$

We cannot fold the Gaussian noise $\boldsymbol{\epsilon}$ into the interpolated term, since it does not correspond to a differentiable kernel. However, since $\boldsymbol{\epsilon}$ is diagonal, it is efficient to represent and multiply with. MVM with the SKI approximation additionally requires MVMs by the sparse matrices $W$ and $W^\top$, which take $O(n)$ space and time.

In the following sections, we consider different representations of $K_{UU,\mathrm{LMC}}$ (Eq. 5) to reduce the memory and runtime requirements for performing the multiplication.

5.2.1 Sum representation

In sum, we represent $K_{UU,\mathrm{LMC}}$ with a $Q$-length list of pairs. At each index $q$, $B^{(q)}$ is a dense matrix of order $D$ and $K^{(q)}_{UU}$ is a BTTB matrix of order $m$, represented using only its top row. Multiplication is performed by multiplying each term in the list with $\mathbf{z}$ and summing the results. Each Kronecker MVM may be expressed as $D$ fast BTTB MVMs with $K^{(q)}_{UU}$ and $m$ dense MVMs with $B^{(q)}$. In turn, assuming $D \le m$, the runtime for each of the $Q$ terms is $O(D m \log m + m D^2)$.
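The sum representation's MVM can be sketched as below, with SciPy's FFT-based Toeplitz products (available in recent SciPy versions) standing in for the BTTB MVMs on a one-dimensional grid; the names are illustrative.

```python
import numpy as np
from scipy.linalg import matmul_toeplitz, toeplitz

def sum_mvm(B_list, top_rows, z):
    """MVM with sum_q kron(B_q, T_q): the `sum` representation.  Each T_q is
    symmetric Toeplitz (BTTB in general), stored only as its top row."""
    D, m = B_list[0].shape[0], len(top_rows[0])
    Z = z.reshape(D, m)                              # one length-m block per output
    out = np.zeros_like(Z)
    for B_q, row in zip(B_list, top_rows):
        TZ = matmul_toeplitz((row, row), Z.T).T      # D fast Toeplitz MVMs
        out += B_q @ TZ                              # mix the blocks with B_q
    return out.reshape(-1)

# Check against the dense Kronecker sum on a small 1-D grid.
grid = np.linspace(0, 1, 30)
rows = [np.exp(-0.5 * (grid - grid[0])**2 / ell**2) for ell in (0.1, 0.3)]
Bs = [np.array([[1.0, 0.6], [0.6, 1.0]]), np.eye(2)]
z = np.random.randn(60)
dense = sum(np.kron(B, toeplitz(r)) for B, r in zip(Bs, rows))
print(np.allclose(dense @ z, sum_mvm(Bs, rows, z)))  # True
```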

5.2.2 Block-Toeplitz representation

In bt, we note that $K_{UU,\mathrm{LMC}}$ is a block matrix with $D^2$ blocks, one per output pair $(i, j)$, each equal to $\sum_q B^{(q)}_{ij} K^{(q)}_{UU}$. On a grid, these blocks are BTTB because they are linear combinations of BTTB matrices. bt requires $D^2$ top rows of size $m$ to represent the blocks. Then, using the usual block matrix multiplication, an MVM takes $O(D^2 m \log m)$ time, since each inner block MVM is accelerated due to BTTB structure.

On a grid of inputs, where every $X_i$ lies on the interpolating grid $U$, the SKI interpolation becomes exact. In this case, using bt alone leads to a faster algorithm: applying the Chan BTTB preconditioner reduces the number of MVMs necessary to find an inverse (Chan and Olkin, 1994).

5.2.3 SLFM representation

For the rank-based slfm representation, let $\bar R = \frac{1}{Q}\sum_q R_q$ be the average rank and re-write the kernel as

$$K_{UU,\mathrm{LMC}} = \sum_q \left(A_q A_q^\top\right) \otimes K^{(q)}_{UU} + \sum_q \operatorname{diag}\boldsymbol{\kappa}_q \otimes K^{(q)}_{UU}.$$

Note $A_q A_q^\top$ has rank $R_q$. Under some re-indexing $s \mapsto (q, r)$, which flattens the double summation $\sum_q \sum_{r=1}^{R_q}$ such that each $s$ corresponds to a unique column $\mathbf{a}_s$ of some $A_q$, the first term may be rewritten as

$$\left(\bar A \otimes I_m\right)\, \operatorname{blockdiag}\!\left(K^{(q(s))}_{UU}\right)_{s}\, \left(\bar A \otimes I_m\right)^\top,$$

where $\bar A$ is the $D \times Q\bar R$ matrix of horizontally stacked column vectors $\mathbf{a}_s$ (Seeger et al., 2005). Next, we rearrange the remaining term as a block diagonal matrix whose $i$th block is $\sum_q \kappa_{qi} K^{(q)}_{UU}$, which is BTTB. Thus, slfm represents $K_{UU,\mathrm{LMC}}$ through two block diagonal matrices of block order $Q\bar R$ and $D$, where each block is a BTTB matrix of order $m$; thus, MVMs run in $O\!\left((Q\bar R + D)\, m \log m + Q\bar R D\, m\right)$ time.
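A sketch of the slfm MVM under the decomposition above, with dense $K^{(q)}_{UU}$ blocks standing in for BTTB MVMs so the result can be checked against the dense LMC sum; the names are illustrative.

```python
import numpy as np

def slfm_mvm(A_cols, kappa_list, K_blocks, q_of, z):
    """MVM with sum_q (A_q A_q^T + diag(kappa_q)) kron K_q in the `slfm` form.

    A_cols: D x P matrix of all stacked columns a_s; q_of[s] gives the kernel
    index of column s.  Dense K_q are used for brevity; LLGP uses BTTB MVMs."""
    D, P = A_cols.shape
    m = K_blocks[0].shape[0]
    Z = z.reshape(D, m)
    # Low-rank part: (A kron I) blockdiag(K_{q(s)}) (A kron I)^T z
    S = A_cols.T @ Z                                    # P x m
    S = np.stack([K_blocks[q_of[s]] @ S[s] for s in range(P)])
    out = A_cols @ S                                    # D x m
    # Diagonal part: block i applies sum_q kappa_q[i] K_q to z_i
    for i in range(D):
        out[i] += sum(kappa[i] * (Kq @ Z[i])
                      for kappa, Kq in zip(kappa_list, K_blocks))
    return out.reshape(-1)

# Small correctness check against the dense LMC sum.
D, m = 3, 20
K_blocks = [np.exp(-0.5 * np.subtract.outer(np.arange(m), np.arange(m))**2 / s**2)
            for s in (2.0, 5.0)]
A_list = [np.random.randn(D, 2), np.random.randn(D, 1)]
kappas = [np.abs(np.random.randn(D)) for _ in range(2)]
A_cols, q_of = np.hstack(A_list), [0, 0, 1]
dense = sum(np.kron(A @ A.T + np.diag(k), Kq)
            for A, k, Kq in zip(A_list, kappas, K_blocks))
z = np.random.randn(D * m)
print(np.allclose(dense @ z, slfm_mvm(A_cols, kappas, K_blocks, q_of, z)))  # True
```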

Note that bt and slfm each have a faster runtime than the other depending on whether $D^2$ exceeds $Q\bar R$. An algorithm that uses this condition to decide between representations can minimize runtime (Tab. 1).

Method | Up-front cost for $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ | Additional cost per hyperparameter
Exact
COGP
LLGP
Table 1: Asymptotic Runtimes. For both LLGP and COGP, $m$ is a configurable parameter that can be increased up to $n$ to improve accuracy. $Q$, $\bar R$, and $D$ depend on the LMC kernel, which determines the number of hyperparameters (Eq. 2). COGP is only independent of $\bar R$ because it cannot represent models with $R_q > 1$. Computing $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ requires an up-front cost in addition to the per-hyperparameter cost for each $\theta_j$. Multiplicative log terms in $m$ are hidden, as are exponential dependencies on the input dimension.

5.3 GP mean and variance prediction

The predictive mean can be computed cheaply per test point by precomputing $K_{UU,\mathrm{LMC}} W^\top \boldsymbol{\alpha}$ once and interpolating it onto the test inputs with their sparse weights (Wilson et al., 2015).

The full predictive covariance estimate requires finding the additional term $K_{*X} K_\epsilon^{-1} K_{X*}$. This is done by solving the linear system in a matrix-free manner on the fly; in particular, $K_\epsilon^{-1} K_{X*}$ is computed via minres for every new test point $x_*$. Over several test points, this process is parallelizable.
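A minimal matrix-free prediction sketch for a single test point, under the assumption that $\boldsymbol{\alpha}$ and an MVM routine for $K_\epsilon$ are already available; this is illustrative, not the runlmc API.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def predict_one(K_mvm, k_star, k_star_star, alpha, n):
    """Matrix-free GP prediction at one test point: mean k_*^T alpha and
    variance k_** - k_*^T K_eps^{-1} k_*; the solve is done on the fly."""
    A = LinearOperator((n, n), matvec=K_mvm)
    v = minres(A, k_star)[0]                 # K_eps^{-1} k_*, via MVMs only
    return k_star @ alpha, k_star_star - k_star @ v
```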

6 Results

We evaluate the methods on held-out data using the standardized mean square error (SMSE) of the test points against the predicted mean, and the negative log predictive density (NLPD) of the test points under the inferred Gaussian likelihood. NLPD takes confidence into account, while SMSE only evaluates the mean prediction. In both cases, lower values represent better performance. We evaluated the performance of our representations of the kernel (sum, bt, and slfm) by computing exact gradients using the standard Cholesky algorithm over a variety of different kernels (see the supplement; hyperparameters, data, code, and benchmarking scripts are available at https://github.com/vlad17/runlmc).

Predictive performance on held-out data with SMSE and NLPD offers an apples-to-apples comparison with COGP, which was itself proposed on the basis of these two metrics. Training data log likelihood would unfairly favor LLGP, which optimizes it directly. Note that predicting a constant value equal to each output's holdout mean results in a baseline SMSE of 1.
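For reference, the two metrics can be written down directly; this is a small sketch, and the exact per-output standardization used in the experiments may differ.

```python
import numpy as np

def smse(y_true, y_pred):
    """Standardized MSE: MSE divided by the variance of the held-out targets,
    so a constant prediction at the holdout mean scores 1.0."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def nlpd(y_true, mean, var):
    """Mean negative log predictive density under a Gaussian likelihood."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + (y_true - mean) ** 2 / (2 * var))
```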

Finally, we attempted to run hyperparameter optimization with both an exact GP and a variational DTC approximation as provided by the GPy package, but both runtime and predictive performance were already an order of magnitude worse than both LLGP and COGP on our smallest dataset from Sec. 6.2 (GPy, since 2012).

6.1 Synthetic dataset

First, we evaluate raw learning performance on a synthetic dataset. We fix the GP index set over the unit square and generate observations from a fixed SLFM RBF kernel with fixed lengthscales and covariance hyperparameters. With multiple outputs, this synthetic dataset might resemble a geospatial GP model for subterranean mineral density, where the various minerals would be different outputs. Sampling GP values at inputs from the unit square, we hold out those corresponding to the values for the final output in the upper-right quadrant of the sampling square.

We consider the problem of estimating the pre-fixed GP hyperparameters (starting from randomly-initialized ones) with LLGP and COGP. We evaluate the fit based on imputation performance on the held-out values (Tab. 2). For COGP, we use the hyperparameter settings applied to the Sarcos dataset in the COGP paper, a dataset of approximately the same size, which fixes the number of inducing points. However, COGP failed to achieve above-baseline SMSE performance with learned inducing points, even over a wide range of inducing-point counts and various learning rates. Using fixed inducing points allowed for moderate improvement in COGP, which we used for comparison. For LLGP, we use a grid on the unit square with no learning-rate tuning. LLGP was able to estimate more predictive hyperparameter values in about the same amount of time it took COGP to learn values that were significantly worse in terms of prediction.

Metric LLGP COGP COGP+
seconds () 101 (0) ()
SMSE 0.12 (0.00) () ()
NLPD 0.28 (0.00) () ()

Table 2: Predictive Performance versus Training Time Tradeoffs on the Synthetic Dataset. We evaluate the learned LLGP model with the grid described above. COGP was evaluated with the inducing-point settings used on a similar-sized dataset from the COGP paper, and increasing the number of inducing points did not improve performance. Since COGP does not provide a terminating condition for its optimization, we also show its performance when permitted to train longer, labelled COGP+. All trials were run 3 times, with parenthesized values representing standard error.

6.2 Foreign exchange rates (FX2007)

We replicate the medium-sized dataset from COGP as an application to evaluate LLGP performance. The dataset includes ten foreign currency exchange rates (CAD, EUR, JPY, GBP, CHF, AUD, HKD, NZD, KRW, and MXN) and three precious metals (XAU, XAG, and XPT), implying that $D = 13$. In LLGP, we use the kernel configuration recommended for LMC models on this dataset (Álvarez et al., 2010). COGP roughly corresponds to the SLFM model, which has a total of 94 hyperparameters, compared to 53 for LLGP. All kernels are squared exponential. The data used in this example are from 2007. The test points include 50 contiguous points extracted from each of the CAD, JPY, and AUD exchanges; the remaining observations form the training set. For this application, LLGP uses a grid of interpolating points spanning the full input range. We used the COGP settings from the paper. LLGP outperforms COGP in terms of predictive mean and variance estimation as well as runtime (Tab. 3).

Metric LLGP COGP
seconds 69 (8) ()
SMSE 0.21 (0.00) ()
NLPD -3.62 (0.03) ()

Table 3: Average Predictive Performance and Training Time Over Repeated Runs for LLGP and COGP on the FX2007 Dataset. Parenthesized values are standard error. LLGP was run with the LMC kernel and interpolating-point settings described above. COGP used the settings from its paper, including its default number of inducing points.
Figure 3: Test outputs for the FX2007 dataset. COGP mean is black, with 95% confidence intervals shaded in grey. LLGP mean is a solid red curve, with light green 95% confidence intervals. Magenta points are in the training set, while blue ones are in the test set. Notice LLGP variance corresponds to an appropriate level of uncertainty on the test set and certainty on the training set, as opposed to the uniform variance from COGP.

6.3 Weather dataset

Figure 4: Test outputs for the Weather dataset. COGP mean is black, with 95% confidence intervals shaded in grey. LLGP mean is a solid red curve, with light green 95% confidence intervals. Magenta points are in the training set, while blue ones are in the test set. Like Fig. 3, the training run was not cherry-picked.

Next, we replicate results from a weather dataset, a large time series used to validate COGP. Here, four weather sensors (Bramblemet, Sotonmet, Cambermet, and Chimet) record air temperature over five days at five-minute intervals, with some records dropped due to equipment failure. Parts of the Cambermet and Chimet series are additionally dropped for imputation, yielding the training and test measurements.

We use the default COGP parameters (https://github.com/trungngv/cogp). We tested LLGP models at two settings of the number of interpolating points.

Metric LLGP LLGP COGP
seconds 73 (12) () ()
SMSE () () 0.08 (0.00)
NLPD () 1.69 (0.19) ()

Table 4: Average Predictive Performance and Training Time Over Repeated Runs of LLGP and COGP on the Weather Dataset. Parenthesized values are standard error. Both LLGP and COGP trained the SLFM model. We show LLGP at the two interpolating-point settings and COGP with its default inducing points.

LLGP performed slightly worse than COGP in SMSE, but both NLPD and runtime indicate significant improvements (Tab. 4, Fig. 4). Varying the number of interpolating points demonstrates the runtime versus NLPD tradeoff. While the NLPD improvement diminishes as the number of interpolating points increases, LLGP still improves upon COGP over a wide range of settings by an order of magnitude in runtime and almost two orders of magnitude in NLPD.

7 Conclusion

In this paper, we present LLGP, which we show adapts and accelerates SKI (Wilson and Nickisch, 2015) for the problem of multi-output GP regression. LLGP exploits structure unique to LMC kernels, enabling a parsimonious representation of the covariance matrix and fast, matrix-free gradient computations.

LLGP provides an efficient means of approximating the log-likelihood gradients using interpolation. We have shown on several datasets that this can be done in a way that is faster and leads to more accurate results than variational approximations. Because LLGP scales well with increases in the number of interpolating points, capturing complex interactions in the covariance with an accurate interpolation is cheap, as demonstrated by performance on a variety of datasets (Tab. 2, Tab. 3, Tab. 4).

Future work could extend LLGP to accept large input dimensions, though most GP use cases are covered by low-dimensional inputs. Finally, an extension to non-Gaussian noise and the use of LLGP as a preconditioner for fine-tuned exact GP models are also feasible, following prior work (Cutajar et al., 2016).

Acknowledgments

The authors would like to thank Princeton University and University of California, Berkeley, for providing the computational resources necessary to evaluate our method. Further, the authors owe thanks to Professors Joseph Gonzalez, Ion Stoica, Andrew Wilson, and John Cunningham for reviewing drafts of the paper and offering insightful advice about directions to explore.

References

  • Álvarez et al. (2012) Mauricio Álvarez, Lorenzo Rosasco, Neil Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
  • Wilson et al. (2015) Andrew Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.
  • Williams and Rasmussen (1996) Christopher Williams and Carl Rasmussen. Gaussian processes for regression. Advances in neural information processing systems, pages 514–520, 1996.
  • Osborne et al. (2008) Michael Osborne, Stephen Roberts, Alex Rogers, Sarvapali Ramchurn, and Nicholas Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In 7th international conference on Information processing in sensor networks, pages 109–120. IEEE Computer Society, 2008.
  • Álvarez et al. (2010) Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil D Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, volume 9, pages 25–32, 2010.
  • Nguyen et al. (2014) Trung Nguyen, Edwin Bonilla, et al. Collaborative multi-output Gaussian processes. In UAI, pages 643–652, 2014.
  • Gilboa et al. (2015) Elad Gilboa, Yunus Saatçi, and John Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 37(2):424–436, 2015.
  • Cunningham et al. (2008) John Cunningham, Krishna Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In 25th international conference on Machine learning, pages 192–199. ACM, 2008.
  • Quiñonero-Candela and Rasmussen (2005) Joaquin Quiñonero-Candela and Carl Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
  • Seeger et al. (2005) Matthias Seeger, Yee-Whye Teh, and Michael Jordan. Semiparametric latent factor models. In Eighth Conference on Artificial Intelligence and Statistics, 2005.
  • Wilson and Nickisch (2015) Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In The 32nd International Conference on Machine Learning, pages 1775–1784, 2015.
  • Keys (1981) Robert Keys. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.
  • Wilson et al. (2014) Andrew Wilson, Elad Gilboa, John Cunningham, and Arye Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.
  • Zeiler (2012) Matthew Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Gibbs and MacKay (1996) Mark Gibbs and David MacKay. Efficient implementation of Gaussian processes, 1996.
  • Cutajar et al. (2016) Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In ICML, pages 2529–2538, 2016.
  • Fong and Saunders (2012) David Fong and Michael Saunders. CG versus MINRES: an empirical comparison. SQU Journal for Science, 17(1):44–62, 2012.
  • Raykar and Duraiswami (2007) Vikas Raykar and Ramani Duraiswami. Fast large scale Gaussian process regression using approximate matrix-vector products. In Learning workshop, 2007.
  • Han et al. (2015) Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In ICML, pages 908–917, 2015.
  • Chan and Olkin (1994) Tony Chan and Julia Olkin. Circulant preconditioners for Toeplitz-block matrices. Numerical Algorithms, 6(1):89–101, 1994.
  • GPy (since 2012) GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.