1 Introduction
Gaussian processes (GPs) provide a powerful approach to regression and extrapolation, with applications as varied as time series analysis (Wilson and Adams, 2013; Duvenaud et al., 2013), blackbox optimization (Jones et al., 1998; Snoek et al., 2012), and personalized medicine and counterfactual prediction (Dürichen et al., 2015; Schulam and Saria, 2015; Herlands et al., 2016; Gardner et al., 2015). Historically, one of the key limitations of Gaussian process regression has been the computational intractability of inference when dealing with more than a few thousand data points. This complexity stems from the need to solve linear systems and compute log determinants involving an symmetric positive definite covariance matrix . This task is commonly performed by computing the Cholesky decomposition of (Rasmussen and Williams, 2006), incurring complexity. To reduce this complexity, inducing point methods make use of a small set of points to form a rank approximation of (QuiñoneroCandela and Rasmussen, 2005; Snelson and Ghahramani, 2006; Hensman et al., 2013; Titsias, 2009). Using the matrix inversion and determinant lemmas, inference can be performed in time (Snelson and Ghahramani, 2006).
Recently, however, an alternative class of inference techniques for Gaussian processes have emerged based on iterative numerical linear algebra techniques (Wilson and Nickisch, 2015; Dong et al., 2017). Rather than explicitly decomposing the full covariance matrix, these methods leverage Krylov subspace methods (Golub and Van Loan, 2012) to perform linear solves and log determinants using only matrixvector multiples (MVMs) with the covariance matrix. Letting denote the time complexity of computing given a vector , these methods provide excellent approximations to linear solves and log determinants in time, where is typically some small constant Golub and Van Loan (2012).^{1}^{1}1 In practice, depends on the conditioning of , but is independent of . This approach has led to scalable GP methods that differ radically from previous approaches – the goal shifts from computing efficient Cholesky decompositions to computing efficient MVMs. Structured kernel interpolation (SKI) (Wilson and Nickisch, 2015) is a recently proposed inducing point method that, given a regular grid of inducing points, allows for MVMs to be performed in an impressive time.
These MVM approaches have two fundamental drawbacks. First, Wilson and Nickisch (2015) use Kronecker factorizations for SKI to take advantage of fast MVMs, constraining the number of inducing points to grow exponentially with the dimensionality of the inputs, limiting the applicability of SKI to problems with fewer than about input dimensions. Second, the computational benefits of iterative MVM inference methods come at the cost of reduced modularity. If all we know about a kernel is that it decomposes as , it is not obvious how to efficiently perform MVMs with , even if we have access to fast MVMs with both and . In order for MVM inference to be truly modular, we should be able to perform inference equipped with nothing but the ability to perform MVMs with . One of the primary advantages of GPs is the ability to construct very expressive kernels by composing simpler ones (Rasmussen and Williams, 2006; Gönen and Alpaydın, 2011; Durrande et al., 2011; Duvenaud et al., 2013; Wilson, 2014). One of the most common kernel compositions is the elementwise product of kernels. This composition can encode different functional properties for each input dimension (e.g., Rasmussen and Williams, 2006; Gönen and Alpaydın, 2011; Duvenaud et al., 2013; Wilson, 2014), or express correlations between outputs in multitask settings (MacKay, 1998; Bonilla et al., 2008; Álvarez and Lawrence, 2011). Moreover, the RBF and ARD kernels – arguably the most popular kernels in use – decompose into product kernels.
In this paper, we propose a single solution which addresses both of these limitations of iterative methods – improving modularity while simultaneously alleviating the curse of dimensionality. In particular:

[wide, labelwidth=!, labelindent=0pt]

We demonstrate that MVMs with product kernels can be approximated efficiently by computing the Lanczos decomposition of each component kernel. If MVMs with a kernel can be performed in time, then MVMs with the elementwise product of kernels can be approximated in time, where is typically a very small constant.

Our fast productkernel MVM algorithm, entitled SKIP, enables the use of structured kernel interpolation with product kernels without resorting to the exponential complexity of Kronecker products. SKIP can be applied even when the product kernels use different interpolation grids, and enables GP inference and learning in for products of kernels.

We apply SKIP to highdimensional regression problems by expressing dimensional kernels as the product of onedimensional kernels. This formulation affords an exponential improvement over the standard SKI complexity of , and achieving state of the art performance over popular inducing point methods (Hensman et al., 2013; Titsias, 2009).

We demonstrate that SKIP can reduce the complexity of multitask GPs (MTGPs) to for a problem with tasks. We exploit this fast inference, developing a model that discovers cluster of tasks using Gibbs sampling.

We make our GPU implementations available as easy to use code as part of a new package for Gaussian processes, GPyTorch, available at https://github.com/cornelliusgp/gpytorch.
2 Background
In this section, we provide a brief review of Gaussian process regression and an overview of iterative inference techniques for Gaussian processes based on matrixvector multiplies.
2.1 Gaussian Processes
A Gaussian process generalizes multivariate normal distributions to distributions over functions that are specified by a prior
mean function and a prior covariance function . By definition, the function values of a GP at any finite set of inputsare jointly Gaussian distributed:
where and . Generally, denotes a matrix of crosscovariances between the sets and .
Under a Gaussian noise observation model, , the predictive distribution at given data is
(1)  
(2) 
where and
. All kernel matrices implicitly depend on hyperparameters
. The log marginal likelihood of the data, conditioned only on these hyperparameters, is given by(3) 
which provides a utility function for kernel learning.
2.2 Inference with matrixvector multiplies
In order to compute the predictive mean in (1), the predictive covariance in (2), and the marginal log likelihood in (3), we need to perform linear solves (i.e. ) and log determinants (i.e. ). Traditionally, these operations are achieved using the Cholesky decomposition of (Rasmussen and Williams, 2006). Computing this decomposition requires operations and storing the result requires space. Given the Cholesky decomposition, linear solves can be computed in time and log determinants in time.
There exist alternative approaches (e.g. Wilson and Nickisch, 2015) that require only matrixvector multiplies (MVMs) with . To compute linear solves, we use the method of conjugate gradients (CG). This technique exploits that the solution to is the unique minimizer of the quadratic function , which can be found by iterating a simple three term recurrence. Each iteration requires a single MVM with the matrix (Shewchuk et al., 1994). Letting denote the time complexity of an MVM with , iterations of CG requires time. If is , then CG is exact when . However, the linear solve can often be approximated by iterations, since the magnitude of the residual often decays exponentially. In practice the value of required for convergence to high precision is a small constant that depends on the conditioning of rather than (Golub and Van Loan, 2012). A similar technique known as stochastic Lanczos quadrature exists for approximating log determinants in time (Dong et al., 2017; Ubaru et al., 2017). In short, inference and learning for GP regression can be done in time using these iterative approaches.
Critically, if the kernel matrices admit fast MVMs – either through the structure of the data (Saatçi, 2012; Cunningham et al., 2008) or the structure of a general purpose kernel approximation (Wilson and Nickisch, 2015) – this iterative approach offers massive scalability gains over conventional Choleskybased methods.
2.3 Structured kernel interpolation
Structured kernel interpolation (SKI) (Wilson and Nickisch, 2015) replaces a userspecified kernel with an approximate kernel that affords very fast matrixvector multiplies. Assume we are given a set of inducing points that we will use to approximate kernel values. Instead of computing kernel values between data points directly, SKI computes kernel values between inducing points and interpolates these kernel values to approximate the true data kernel values. This leads to the approximate SKI kernel:
(4) 
where is a sparse vector that contains interpolation weights. For example, when using local cubic interpolation (Keys, 1981), contains four nonzero elements. Applying this approximation for all data points in the training set, we see that:
(5) 
With arbitrary inducing points , matrixvector multiplies with require time. In one dimension, we can reduce this running time by instead choosing to be a regular grid, which results in being Toeplitz. In higher dimensions, a multidimensional grid results in being the Kronecker product of Toeplitz matrices. This decompositions enables matrixvector multiplies in at most time, and storage. However, a Kronecker decomposition of leads to an exponential time complexity in , the dimensionality of the inputs (Wilson and Nickisch, 2015).
3 MVMs WITH PRODUCT KERNELS
In this section we derive an approach to exploit product kernel structure for fast MVMs, towards alleviating the curse of dimensionality in SKI. Suppose a kernel separates as a product as follows:
(6) 
Given a training data set , the kernel matrix resulting from the product of kernels in (6) can be expressed as , where represents elementwise multiplication. In other words:
(7) 
The key limitation we must deal with is that, unlike a sum of matrices, vector multiplication does not distribute over the elementwise product:
(8) 
We will assume we have access to fast MVMs for each component kernel matrix . Without fast MVMs, there is a trivial solution to computing the elementwise matrix vector product: explicitly compute the kernel matrix in time and then compute . We further assume that admits a low rank approximation, following prior work on inducing point methods following prior work on inducing point methods (Snelson and Ghahramani, 2006; Titsias, 2009; Wilson and Nickisch, 2015; Hensman et al., 2013).
A naive algorithm for a twokernel product.
We initially assume for simplicity that there are only components kernels in the product. We will then show how to extend the two kernel case to arbitrarily sized product kernels. We seek to perform matrix vector multiplies:
(9) 
Eq. (9) may be expressed in terms of matrixmatrix multiplication using the following identity:
(10) 
where is a diagonal matrix whose elements are (Figure 1), and denotes the diagonal of . Because is an matrix, computing the entries of naively requires matrixvector multiplies with and . The time complexity to compute (10) is therefore . This reformulation does not naively offer any time savings.
Exploiting lowrank structure.
Suppose however that we have access to rank approximations of and :
where , are and , are (Figure 1). This rank decomposition makes the MVM significantly cheaper to compute. Plugging these decompositions in to (10), we derive:
(11) 
We prove the following key lemma in the supplementary materials about (11):
Lemma .
Suppose that and , where and are matrices and and are . Then can be computed with (11) in time.
Therefore, if we can efficiently compute lowrank decompositions of and , then we immediately apply Section 3 to perform fast MVMs.
Computing lowrank structure.
With Section 3, we have reduced the problem of computing MVMs with to that of constructing lowrank decompositions for and . Since we are assuming we can take fast MVMs with these kernel matrices, we now turn to the Lanczos decomposition (Lanczos, 1950; Paige, 1972). The Lanczos decomposition is an iterative algorithm that takes a symmetric matrix and probe vector and returns and such that , with orthogonal.
This decomposition is exact after iterations. However, if we only compute columns of , then is an effective lowrank approximation of (Nickisch et al., 2009; Simon and Zha, 2000)
. Unlike standard low rank approximations (such as the singular value decomposition), the algorithm for computing the Lanczos decomposition
requires only MVMs, leading to the following lemma:Lemma .
Suppose that MVMs with can be computed in time. Then the rank Lanczos decomposition can be computed in time.
The above discussion motivates the following algorithm for computing , which is summarized by Figure 1: First, compute the rank Lanczos decomposition of each matrix; then, apply (11). Lemmas 3 and 3 together imply that this takes time.
Extending to product kernels with three components.
Now consider a kernel that decomposes as the product of three components, . An MVM with this kernel is given by Define and . Then
(12) 
reducing the three component problem back to two components. To compute the Lanczos decomposition of , we use the method described above for computing MVMs with .
Extending to product kernels with many components.
The approach for the three component setting leads naturally to a divide and conquer strategy. Given a kernel matrix we define
(13)  
(14) 
which lets us rewrite . By applying this splitting recursively, we can compute matrixvector multiplies with , leading to the following running time complexity:
Theorem 3.1.
Suppose that , and that computing a matrixvector multiply with any requires operations. Computing an MVM with requires time, where is the rank of the Lanczos decomposition used.
Sequential MVMs.
If we are computing many MVMs with the same matrix, then we can further reduce this complexity by caching the Lanczos decomposition. The terms represent the time to construct the Lanczos decomposition. However, note that this decomposition is not dependent on the vector that we wish to multiply with. Therefore, if we save the decomposition for future computation, we have the following corollary:
Corollary .
Any subsequent MVMs with require time.
If matrixvector multiplications with can be performed with significantly fewer than operations, this results in a significant complexity improvement over explicitly computing the full kernel matrix .
3.1 Structured kernel interpolation for products (SKIP)
So far we have assumed access to fast MVMs with each constituent kernel matrix of an elementwise (Hadamard) product: . To achieve this, we apply the SKI approximation (Section 2.3) to each component:
(15) 
When using SKI approximations, the running time of our product kernel inference technique with iterations of CG becomes . The running time of SKIP is compared to that of other inference techniques in Table 2.
4 Mvm Accuracy and Scalability
We first evaluate the accuracy of our proposed approach with product kernels in a controlled synthetic setting. We draw data points in dimensions from and compute an RBF kernel matrix with lengthscale over these data points. We evaluate the relative error of SKIP compared exact MVMs as a function of – the number of Lanczos iterators. We perform this test for for 4, 8, and 12 dimensional data, resulting in a product kernel with 4, 8, and 12 components respectively. The results, averaged over 100 trials, are shown in Figure 2 (left). Even in the 12 dimensional setting, an extremely small value of is sufficient to get very accurate MVMs: less than 1% error is achieved when . For a discussion of increasing error with dimensionality, see Section 7. In future experiments, we set the maximum number of Lanczos iterations to , but note that the convergence criteria is typically met far sooner. In the right side of Figure 2, we demonstrate the improved scaling of our method with the number of inducing points per dimension over KISSGP. To do this, we use the dimensional Power dataset from the UCI repository, and plot inference step time as a function of . While our method clearly scales better with than both KISSGP and SGPR, we also note that because SKIP only applies the inducing point approximation to onedimensional kernels, we anticipate ultimately needing significantly fewer inducing points than either SGPR or KISSGP which need to cover the full dimensional space with inducing points.
Dataset  Metric 








Pumadyn (, )  Test MAE  0.721  0.766  0.766  0.766  –  0.766  
Train Time ()  4  28  67  235  –  65  
Elevators (, )  Test MAE  0.072  0.157  0.157  0.157  –  0.072  
Train Time ()  12  46  122  425  –  23  
Precipitation (, )  Test MAE  –  14.79  –  –  9.81  14.08  
Train Time ()  –  1432  –  –  615  34.16  
KEGG (, )  Test MAE  –  0.101  0.093  0.087  –  0.065  
Train Time ()  –  116  299  9926  –  66  
Protein (, )  Test MAE  –  7.219  4.97  4.72  –  1.97  
Train Time ()  –  139  397  1296  –  35  
Video (, )  Test MAE  –  6.836  6.463  6.270  –  5.621  
Train Time ()  –  113  334  1125  –  57 
5 Application 1: An Exponential Improvement to Ski
Method  Complexity of 1 Inference Step 
GP (Chol)  
GP (MVM)  
SVGP  
KISSGP  
SKIP 
Wilson and Nickisch (2015) use a Kronecker decomposition of to apply SKI for dimensions, which requires a fully connected multidimensional grid of inducing points . Thus if we wish to have distinct inducing point values for each dimension, the grid requires inducing points – i.e. MVMs with the SKI approximate require time. It is therefore computationally infeasible to apply SKI with a Kronecker factorization, referred to in Wilson and Nickisch (2015) as KISSGP, to more than about five dimensions. However, using the proposed SKIP method of Section 3, we can reduce the running time complexity of SKI in dimensions from exponential to linear ! If we express a dimensional kernel as the product of onedimensional kernels, then each component kernel requires only grid points, rather than . For the RBF and ARD kernels, decomposing the kernel in this way yields the same kernel function.
Datasets.
We evaluate SKIP on six benchmark datasets. The precipitation dataset contains hourly rainfall measurements from hundreds of stations around the country. The remaining datasets are taken from the UCI machine learning dataset repository. KISSGP (SKI with a Kronecker factorization) is not applicable when
, and the full GP is not applicable on the four largest datasets.Methods.
We compare against the popular sparse variational Gaussian processes (SGPR) (Titsias, 2009; Hensman et al., 2013) implemented in GPflow (Matthews et al., 2017). We also compare to our GPU implementation of KISSGP where possible, as well as our GPU implementation of the full GP on the two smallest datasets. All experiments were run on an NVIDIA Titan Xp. We evaluate SGPR using 200, 400 and 800 inducing points. All models use the RBF kernel and a constant prior mean function. We optimize hyperparameters with ADAM using default optimization parameters.
Discussion.
The results of our experiments are shown in Table 1. On the two smallest datasets, the Full GP model outperforms all other methods in terms of speed. This is due to the overhead added by inducing point methods significantly outweighing simple solves with conjugate gradients with such little data. SKIP is able to match the error of the full GP model on elevators, and all methods have comparable error on the Pumadyn dataset.
On the precipitation dataset, inference with standard KISSGP is still tractable due to the low dimensionality, and KISSGP is both fast and accurate. Using SKIP results in higher error than KISSGP, because we were able to use significantly fewer Lanczos iterations for our approximate MVMs than on other datasets due to the space complexity. We discuss the space complexity limitation further in the discussion section. Nevertheless, SKIP still performs better than SGPR. SGPR results with 400 and 800 inducing points are unavailable due to GPU memory constraints. On the remaining datasets, SKIP is able to achieve comparable or better overall error than SGPR, but with a significantly lower runtime.
6 Application 2: MultiTask Learning
We demonstrate how the fast elementwise matrix vector products with SKIP can also be applied to accelerate multitask Gaussian processes (MTGPs). Additionally, because SKIP provides cheap marginal likelihood computations, we extend standard MTGPs to construct an interpretable and robust multitask GP model which discovers latent clusters among tasks using Gibbs’ sampling. We apply this model to a particularly consequential child development dataset from the Gates foundation.
Motivating problem.
The Gates foundation has collected an aggregate longitudinal dataset of child development, from studies performed around the world. We are interested in predicting the future development for a given child (as measured by weight) using a limited number of existing measurements. Children in the dataset have a varying number of measurements (ranging from 5 to 30), taken at irregular times throughout their development. We therefore model this problem with a multitask approach, where we treat each child’s development as a task. This approach is the basis of several medical forecasting models (Alaa et al., 2017; Cheng et al., 2017; Xu et al., 2016).
Multitask learning with GPs.
The common multitask setup involves datasets corresponding to a set of different tasks, . The multitask Gaussian process (MTGP) of Bonilla et al. (2008) extends standard GP regression to share information between several related tasks. MTGPs assume that the covariance between data points factors as the product of kernels over (e.g. spatial or temporal) inputs and tasks. Specifically, given data points and from tasks and , the MTGP kernel is given by
(16) 
where is a kernel over inputs, and – the coregionalization kernel – is commonly parameterized by a lowrank covariance matrix that encodes pairwise correlations between all pairs of tasks. The entries of are learned by maximizing (3). We can express the covariance matrix for all measurements as
where is an matrix with onehot rows: if the observation belongs to task . We can apply SKIP to multitask problems by using a SKI approximation of and computing its Lanczos decomposition. If is rank, with , then we do not need to decompose since the matrix affords MVMs.^{2}^{2}2 MVMs are because has nonzero elements and is an matrix. For onedimensional inputs, the time complexity of an MVM with is – a substantial improvement over standard inducingpoint methods with MTGPs, which typically require at least time (Bonilla et al., 2008; Álvarez and Lawrence, 2011). For , SKIP speeds up marginal likelihood computations by a factor of .
Learning clusters of tasks.
Motivated by the work of Rasmussen and Ghahramani (2002), Shi et al. (2005), Schulam and Saria (2015), Hensman et al. (2015), and Xu et al. (2016), we propose a modification to the standard MTGP framework. We hypothesize that similarities between tasks can be better expressed through latent subpopulations, or clusters, rather than through pairwise associations. We place an independent uniform categorical prior over , the cluster assignment for task . Given measurements for tasks and , we propose a kernel consisting of product and sum structure that captures clusterlevel trends and individuallevel trends:
Here, and are both Matérn kernels () operating on , and the terms represent indicator functions. Both terms can be easily expressed as product kernels. We infer the posterior distribution of cluster assignments through Gibbs sampling. Given , the cluster assignments for all tasks except the , we sample an assignment for the task from the marginal posterior distribution
Drawing a sample for the full vector requires calculations of (3), an operation made relatively inexpensive by applying SKIP to the underlying model.
Results.
We compare the clusterbased MTGP against two baselines: 1) a singletask GP baseline, which treats all available data as a single task, and 2) the standard MTGP. In Figure 4, we measure the extrapolation accuracy for 25 children as additional children (tasks) are added to the model. As the models are supplied with data from additional children, they are able to refine the extrapolations on all children. The predictions of the cluster model slightly outperform the standard MTGP, and significantly outperform the singletask model. Perhaps the key advantage of the clustering approach is interpretability: in Figure 3 (left), we see three distinct development types: aboveaverage, average, and below average. We then demonstrate that as more data is observed when we apply the model to a new child with limited measurements, the model becomes increasingly certain that the child belongs to the aboveaverage subpopulation.
7 Discussion
It is our hope that this work highlights a question of foundational importance for scalable GP inference: given the ability to compute and quickly for matrices and , how do we compute efficiently? We have shown an answer to this question can exponentially improve the scalability and general applicability of MVMbased methods for fast Gaussian processes.
Stochastic diagonal estimation.
Our method relies primarily on quickly computing the diagonal in Equation (10
). Techniques exist for stochastic diagonal estimation
(Fitzsimons et al., 2016; Hutchinson, 1990; Selig et al., 2012; Bekas et al., 2007). We found that these techniques converged slower than our method in practice, but they may be more appropriate for kernels with high rank structure.Higherorder product kernels.
A fundamental property of the Hadamard product is that suggesting that we may need higher rank approximations with increasing dimension. In the limit, the SKI approximation can be used in place of the Lanczos decomposition in equation (10), resulting in an exact algorithm with runtime: simply set and MVMs require time instead of , and set and MVMs now require instead of . This adaptation is rarely necessary, as the accuracy of MVMs with SKIP increases exponetially in in practice.
Space complexity.
To perform the matrixvector multiplication algorithm described above, we must store the Lanczos decomposition of each component kernel matrix and intermediate matrices in the merge step for storage. This is better storage than the storage required in full GP regression, or storage for standard inducing point methods, but worse than the linear storage requirements of SKI. In practice, we note that GPU memory is indeed often the major limitation of our method, as storing even or copies of a dataset in GPU memory can be expensive.
Acknowledgments
JRG, GP, and KQW are supported in part by grants from the National Science Foundation (III1525919, IIS1550179, IIS1618134, S&AS 1724282, and CCF1740822), the Office of Naval Research DOD (N000141712175), and the Bill and Melinda Gates Foundation. AGW is supported by NSF IIS1563887.
References
 Alaa et al. (2017) Alaa, A. M., Yoon, J., Hu, S., and van der Schaar, M. (2017). Personalized risk scoring for critical care prognosis using mixtures of gaussian processes. IEEE Transactions on Biomedical Engineering.
 Álvarez and Lawrence (2011) Álvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500.
 Bekas et al. (2007) Bekas, C., Kokiopoulou, E., and Saad, Y. (2007). An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11):1214–1229.
 Bonilla et al. (2008) Bonilla, E. V., Chai, K. M., and Williams, C. (2008). Multitask Gaussian process prediction. In NIPS, pages 153–160.
 Cheng et al. (2017) Cheng, L.F., Darnell, G., Chivers, C., Draugelis, M. E., Li, K., and Engelhardt, B. E. (2017). Sparse multioutput gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112.
 Cunningham et al. (2008) Cunningham, J. P., Shenoy, K. V., and Sahani, M. (2008). Fast gaussian process methods for point process intensity estimation. In Proceedings of the 25th international conference on Machine learning, pages 192–199. ACM.
 Dong et al. (2017) Dong, K., Eriksson, D., Nickisch, H., Bindel, D., and Wilson, A. G. (2017). Scalable log determinants for gaussian process kernel learning. In NIPS.
 Dürichen et al. (2015) Dürichen, R., Pimentel, M. A., Clifton, L., Schweikard, A., and Clifton, D. A. (2015). Multitask gaussian processes for multivariate physiological timeseries analysis. IEEE Transactions on Biomedical Engineering, 62(1):314–322.
 Durrande et al. (2011) Durrande, N., Ginsbourger, D., and Roustant, O. (2011). Additive kernels for gaussian process modeling. arXiv preprint arXiv:1103.4023.
 Duvenaud et al. (2013) Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the 30th International Conference on Machine Learning, pages 1166–1174.
 Fitzsimons et al. (2016) Fitzsimons, J. K., Osborne, M. A., Roberts, S. J., and Fitzsimons, J. F. (2016). Improved stochastic trace estimation using mutually unbiased bases. arXiv preprint arXiv:1608.00117.

Gardner et al. (2015)
Gardner, J. R., Song, X., Weinberger, K. Q., Barbour, D. L., and Cunningham,
J. P. (2015).
Psychophysical detection testing with bayesian active learning.
In UAI, pages 286–295.  Golub and Van Loan (2012) Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU Press.
 Gönen and Alpaydın (2011) Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268.
 Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
 Hensman et al. (2015) Hensman, J., Rattray, M., and Lawrence, N. D. (2015). Fast nonparametric clustering of structured timeseries. IEEE transactions on pattern analysis and machine intelligence, 37(2):383–393.
 Herlands et al. (2016) Herlands, W., Wilson, A., Nickisch, H., Flaxman, S., Neill, D., Van Panhuis, W., and Xing, E. (2016). Scalable gaussian processes for characterizing multidimensional change surfaces. In Artificial Intelligence and Statistics, pages 1013–1021.
 Hutchinson (1990) Hutchinson, M. F. (1990). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in StatisticsSimulation and Computation, 19(2):433–450.
 Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive blackbox functions. Journal of Global optimization, 13(4):455–492.
 Keys (1981) Keys, R. (1981). Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160.

Lanczos (1950)
Lanczos, C. (1950).
An iteration method for the solution of the eigenvalue problem of linear differential and integral operators
. United States Governm. Press Office Los Angeles, CA.  MacKay (1998) MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, chapter 11, pages 133–165. SpringerVerlag.

Matthews et al. (2017)
Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas,
A., LeónVillagrá, P., Ghahramani, Z., and Hensman, J. (2017).
Gpflow: A gaussian process library using tensorflow.
Journal of Machine Learning Research, 18(40):1–6.  Nickisch et al. (2009) Nickisch, H., Pohmann, R., Schölkopf, B., and Seeger, M. (2009). Bayesian experimental design of magnetic resonance imaging sequences. In NIPS, pages 1441–1448.
 Paige (1972) Paige, C. C. (1972). Computational variants of the lanczos method for the eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381.
 QuiñoneroCandela and Rasmussen (2005) QuiñoneroCandela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959.
 Rasmussen and Ghahramani (2002) Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of gaussian process experts. In NIPS, pages 881–888.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning, volume 1. MIT press Cambridge.
 Saatçi (2012) Saatçi, Y. (2012). Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge.
 Schulam and Saria (2015) Schulam, P. and Saria, S. (2015). A framework for individualizing predictions of disease trajectories by exploiting multiresolution structure. In NIPS, pages 748–756.
 Selig et al. (2012) Selig, M., Oppermann, N., and Enßlin, T. A. (2012). Improving stochastic estimates with inference methods: Calculating matrix diagonals. Physical Review E, 85(2):021134.
 Shewchuk et al. (1994) Shewchuk, J. R. et al. (1994). An introduction to the conjugate gradient method without the agonizing pain.
 Shi et al. (2005) Shi, J. Q., MurraySmith, R., and Titterington, D. (2005). Hierarchical gaussian process mixtures for regression. Statistics and computing, 15(1):31–41.
 Simon and Zha (2000) Simon, H. D. and Zha, H. (2000). Lowrank matrix approximation using the lanczos bidiagonalization process with applications. SIAM Journal on Scientific Computing, 21(6):2257–2274.
 Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudoinputs. In NIPS, pages 1257–1264.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
 Titsias (2009) Titsias, M. K. (2009). Variational learning of inducing variables in sparse gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574.
 Ubaru et al. (2017) Ubaru, S., Chen, J., and Saad, Y. (2017). Fast estimation of tr (f (a)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099.
 Wilson and Adams (2013) Wilson, A. and Adams, R. (2013). Gaussian process kernels for pattern discovery and extrapolation. In ICML, pages 1067–1075.
 Wilson (2014) Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with gaussian processes. University of Cambridge.
 Wilson and Nickisch (2015) Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured gaussian processes (kissgp). In ICML, pages 1775–1784.
 Xu et al. (2016) Xu, Y., Xu, Y., and Saria, S. (2016). A bayesian nonparametric approach for estimating individualized treatmentresponse curves. In Machine Learning for Healthcare Conference, pages 282–300.
S1 Proof of Section 3
Letting denote the row of and denote the row of , we can express the entry , as:
To evaluate this for all , we first once compute the matrix:
This can be done in time. and can each be computed in time, as the matrices are and the matrices are . Multiplying one of the results by takes time as it is diagonal. Finally, multiplying the two resulting matrices together takes time.
After computing , we can compute each element of the matrixvector multiply as:
Because is , each of these takes time to compute. Since there are entries to evaluate in the MVM in total, the total time requirement after computing is time. Thus, given low rank structure, we can compute in time total.
S2 Proof of Theorem 3.1
Given the Lanczos decompositions of and , we can compute matrixvector multiplies with in time each. This lets us compute the Lanczos decomposition of in time.
For clarity, suppose first that , i.e., . We first Lanczos decompose , and . Assuming for simplicty that MVMs with each matrix take the same amount of time, This takes time total. We then use these Lanczos decompositions to compute matrixvector multiples with in time each. This allows us to Lanczos decompose it in time total. We can then compute matrixvector multiplications in time.
In the most general setting where , we first Lanczos decompose the component matrices in and then perform merges as described above, each of which takes time. After computing all necessary Lanczos decompositions, matrixvector multiplications with can be performed in time.
As a result, a single matrixvector multiply with takes time. With the Lanczos decompositions precomputed, multiple MVMs in a row can be performed significantly faster. For example, running iterations of conjugate gradients with takes time.