1 Introduction
In this paper, we propose a general framework for classification of sparse and irregularlysampled time series. An irregularlysampled time series is a sequence of samples with irregular intervals between their observation times. These intervals can be large when the time series are also sparsely sampled. Such time series data are studied in various areas including climate science (Schulz and Stattegger, 1997), ecology (Clark and Bjørnstad, 2004), biology (Ruf, 1999), medicine (Marlin et al., 2012) and astronomy (Scargle, 1982). Classification in this setting is challenging both because the data cases are not naturally defined in a fixeddimensional feature space due to irregular sampling and variable numbers of samples, and because there can be substantial uncertainty about the underlying temporal processes due to the sparsity of observations.
Recently, Li and Marlin (2015)
introduced the mixture of expected Gaussian kernels (MEG) framework, an uncertaintyaware kernel for classifying sparse and irregularly sampled time series. Classification with MEG kernels is shown to outperform models that ignore uncertainty due to sparse and irregular sampling. On the other hand, various deep learning models including convolutional neural networks
(LeCun et al., 2004)have been successfully applied to fields such as computer vision and natural language processing, and have been shown to achieve stateoftheart results on various tasks. Some of these models have desirable properties for time series classification, but cannot be directly applied to sparse and irregularly sampled time series.
Inspired by the MEG kernel, we propose an uncertaintyaware classification framework that enables learning blackbox classification models from sparse and irregularly sampled time series data. This framework is based on the use of a computational layer that we refer to as the Gaussian process (GP) adapter. The GP adapter uses Gaussian process regression to transform the irregular time series data into a uniform representation, allowing sparse and irregularly sampled data to be fed into any blackbox classifier learnable using gradient descent while preserving uncertainty. However, the time and space of exact GP regression makes the GP adapter prohibitively expensive when scaling up to large time series.
To address this problem, we show how to speed up the key computation of sampling from a GP posterior based on combining the structured kernel interpolation (SKI) framework that was recently proposed by Wilson and Nickisch (2015) with Lanczos methods for approximating matrix functions (Chow and Saad, 2014). Using the proposed sampling algorithm, the GP adapter can run in linear time and space in terms of the length of the time series, and time when inducing points are used.
We also show that GP adapter can be trained endtoend together with the parameters of the chosen classifier by backpropagation through the iterative Lanczos method. We present results using logistic regression, fullyconnected feedforward networks, convolutional neural networks and the MEG kernel. We show that endtoend discriminative training of the GP adapter outperforms a variety of baselines in terms of classification performance, including models based only on GP mean interpolation, or with GP regression trained separately using marginal likelihood.
2 Gaussian processes for sparse and irregularlysampled time series
Our focus in this paper is on time series classification in the presence of sparse and irregular sampling. In this problem, the data contain independent tuples consisting of a time series and a label . Thus, . Each time series is represented as a list of time points , and a list of corresponding values . We assume that each time series is observed over a common time interval . However, different time series are not necessarily observed at the same time points (i.e. in general). This implies that the number of observations in different time series is not necessary the same (i.e. in general). Furthermore, the time intervals between observation within a single time series are not assumed to be uniform.
Learning in this setting is challenging because the data cases are not naturally defined in a fixeddimensional feature space due to the irregular sampling. This means that commonly used classifiers that take fixedlength feature vectors as input are not applicable. In addition, there can be substantial uncertainty about the underlying temporal processes due to the sparsity of observations.
To address these challenges, we build on ideas from the MEG kernel (Li and Marlin, 2015) by using GP regression (Rasmussen, 2006) to provide an uncertaintyaware representation of sparse and irregularly sampled time series. We fix a set of reference time points and represent a time series in terms of its posterior marginal distribution at these time points. We use GP regression with a zeromean GP prior and a covariance function
parameterized by kernel hyperparameters
. Letbe the independent noise variance of the GP regression model. The GP parameters are
.Under this model, the marginal posterior GP at
is Gaussian distributed with the mean and covariance given by
(1)  
(2) 
where denotes the covariance matrix with . We note that it takes time to exactly compute the posterior mean , and time to exactly compute the full posterior covariance matrix , where and .
3 The GP adapter and uncertaintyaware time series classification
In this section we describe our framework for time series classification in the presence of sparse and irregular sampling. Our framework enables any blackbox classifier learnable by gradientbased methods to be applied to the problem of classifying sparse and irregularly sampled time series.
3.1 Classification frameworks and the Gaussian process adapter
In Section 2 we described how we can represent a time series through the marginal posterior it induces under a Gaussian process regression model at any set of reference time points . By fixing a common set of reference time points for all time series in a data set, every time series can be transformed into a common representation in the form of a multivariate Gaussian with being the random vector distributed according to the posterior GP marginalized over the time points .^{1}^{1}1 The notation explicitly expresses that both and are functions of the GP parameters . Besides, they are also functions of as shown in (1) and (2). Here we assume that the GP parameters are shared across the entire data set.
If the values were observed, we could simply apply a blackbox classifier. A classifier can be generally defined by a mapping function parameterized by
, associated with a loss function
where is a label value from the output space . However, in our caseis a Gaussian random variable, which means
is now itself a random variable given a label . Therefore, we use the expectation as the overall loss between the label and a time series given its Gaussian representation . The learning problem becomes minimizing the expected loss over the entire data set:(3) 
Once we have the optimal parameters and , we can make predictions on unseen data. In general, given an unseen time series and its Gaussian representation , we can predict its label using (4), although in many cases this can be simplified into a function of with the expectation taken on or inside of .
(4) 
We name the above approach the UncertaintyAware Classification (UAC) framework. Importantly, this framework propagates the uncertainty in the GP posterior induced by each time series all the way through to the loss function. Besides, we call the transformation the Gaussian process adapter, since it provides a uniform representation to connect the raw irregularly sampled time series data to a blackbox classifier.
Variations of the UAC framework can be derived by taking the expectation at various position of where . Taking the expectation at an earlier stage simplifies the computation, but the uncertainty information will be integrated out earlier as well.^{2}^{2}2 For example, the loss of the expected output of the classifier .
In the extreme case, if the expectation is computed immediately followed by the GP adapter transformation, it is equivalent to using a plugin estimate
for in the loss function,. We refer to this as the IMPutation (IMP) framework. The IMP framework discards the uncertainty information completely, which further simplifies the computation. This simplified variation may be useful when the time series are more densely sampled, where the uncertainty is less of a concern.
3.2 Learning with the GP adapter
In the previous section, we showed that the UAC framework can be trained using (3
). In this paper, we use stochastic gradient descent to scalably optimize (
3) by updating the model using a single time series at a time, although it can be easily modified for batch or minibatch updates. From now on, we will focus on the optimization problem where are the output of the GP adapter given a time series and its label . For many classifiers, the expected loss cannot be analytically computed. In such cases, we use the Monte Carlo average to approximate the expected loss:(5) 
To learn the parameters of both the classifier and the Gaussian process regression model jointly under the expected loss, we need to be able to compute the gradient of the expectation given in (5). To achieve this, we reparameterize the Gaussian random variable using the identity where and satisfies (Kingma and Welling, 2014). The gradients under this reparameterization are given below, both of which can be approximated using Monte Carlo sampling as in (5). We will focus on efficiently computing the gradient shown in (7) since we assume that the gradient of the base classifier can be computed efficiently.
(6)  
(7) 
There are several choices for that satisfy . One common choice of is the Cholesky factor, a lower triangular matrix, which can be computed using Cholesky decomposition in for a covariance matrix (Golub and Van Loan, 2012). We instead use the symmetric matrix square root . We will show that this particular choice of leads to an efficient and scalable approximation algorithm in Section 4.2.
4 Fast sampling from posterior Gaussian processes
The computation required by the GP adapter is dominated by the time needed to draw samples from the marginal GP posterior using . In Section 2 we noted that the time complexity of exactly computing the posterior mean and covariance is and , respectively. Once we have both and we still need to compute the square root of , which requires an additional time to compute exactly. In this section, we show how to efficiently generate samples of .
4.1 Structured kernel interpolation for approximating GP posterior means
The main idea of the structured kernel interpolation (SKI) framework recently proposed by Wilson and Nickisch (2015) is to approximate a stationary kernel matrix by the approximate kernel defined below where is a collection of evenlyspaced inducing points.
(8) 
Letting and , is a sparse interpolation matrix where each row contains only a small number of nonzero entries. We use local cubic convolution interpolation (cubic interpolation for short) (Keys, 1981) as suggested in Wilson and Nickisch (2015). Each row of the interpolation matrices has at most four nonzero entries. Wilson and Nickisch (2015) showed that when the kernel is locally smooth (under the resolution of ), cubic interpolation results in accurate approximation. This can be justified as follows: with cubic interpolation, the SKI kernel is essentially the twodimensional cubic interpolation of using the exact regularly spaced samples stored in , which corresponds to classical bicubic convolution. In fact, we can show that asymptotically converges to as increases by following the derivation in Keys (1981).
Plugging the SKI kernel into (1), the posterior GP mean evaluated at can be approximated by
(9) 
The inducing points are chosen to be evenlyspaced because
forms a symmetric Toeplitz matrix under a stationary covariance function. A symmetric Toeplitz matrix can be embedded into a circulant matrix to perform matrix vector multiplication using fast Fourier transforms
(Golub and Van Loan, 2012).Further, one can use the conjugate gradient method to solve for which only involves computing the matrixvector product . In practice, the conjugate gradient method converges within only a few iterations. Therefore, approximating the posterior mean using SKI takes only time to compute. In addition, since a symmetric Toeplitz matrix can be uniquely characterized by its first column, and can be stored as a sparse matrix, approximating requires only space.
4.2 The Lanczos method for covariance square rootvector products
With the SKI techniques, although we can efficiently approximate the posterior mean , computing is still challenging. If computed exactly, it takes time to compute and time to take the square root. To overcome the bottleneck, we apply the SKI kernel to the Lanczos method, one of the Krylov subspace approximation methods, to speed up the computation of as shown in Algorithm 1. The advantage of the Lanczos method is that neither nor needs to be computed explicitly. Like the conjugate gradient method, another example of the Krylov subspace method, it only requires the computation of matrixvector products with as the matrix.
The idea of the Lanczos method is to approximate in the Krylov subspace . The iteration in Algorithm 1
, usually referred to the Lanczos process, essentially performs the GramSchmidt process to transform the basis
into an orthonormal basis for the subspace .The optimal approximation of in the Krylov subspace that minimizes the norm of the error is the orthogonal projection of onto as . Since we choose , the optimal projection can be written as where is the first column of the identify matrix.
One can show that the tridiagonal matrix defined in Algorithm 1 satisfies (Saad, 2003). Also, we have
since the eigenvalues of
approximate the extremal eigenvalues of (Saad, 1980). Therefore we have .The error bound of the Lanczos method is analyzed in Ilić et al. (2009). Alternatively one can show that the Lanczos approximation converges superlinearly (Parlett, 1980). In practice, for a covariance matrix , the approximation is sufficient for our sampling purpose with . As is now a matrix, we can use any standard method to compute its square root in time (Björck and Hammarling, 1983), which is considered when is chosen to be a small constant. Now the computation of the Lanczos method for approximating is dominated by the matrixvector product during the Lanczos process.
Here we apply the SKI kernel trick again to efficiently approximate by
(10) 
Similar to the posterior mean, can be approximated in time and linear space. Therefore, for basis vectors, the entire Algorithm 1 takes time and space, which is also the complexity to draw a sample from the posterior GP.
To reduce the variance when estimating the expected loss (5), we can draw multiple samples from the posterior GP: where . Since all of the samples are associated with the same covariance matrix , we can use the block Lanczos process (Golub and Underwood, 1977), an extension to the singlevector Lanczos method presented in Algorithm 1, to simultaneously approximate for all random vectors . Similarly, during the block Lanczos process, we use the block conjugate gradient method (Feng et al., 1995; Dubrulle, 2001) to simultaneously solve the linear equation for multiple .
5 Endtoend learning with the GP adapter
The most common way to train GP parameters is through maximizing the marginal likelihood (Rasmussen, 2006)
(11) 
If we follow this criterion, training the UAC framework becomes a twostage procedure: first we learn GP parameters by maximizing the marginal likelihood. We then compute and given each time series and the learned GP parameters . Both and are then fixed and used to train the classifier using (6).
In this section, we describe how to instead train the GP parameters discriminatively endtoend using backpropagation. As mentioned in Section 3, we train the UAC framework by jointly optimizing the GP parameters and the parameters of the classifier according to (6) and (7).
The most challenging part in (7) is to compute .^{3}^{3}3 For brevity, we drop from the gradient notation in this section. For , we can derive the gradient of the approximating posterior mean (9) as given in Appendix A. Note that the gradient can be approximated efficiently by repeatedly applying fast Fourier transforms and the conjugate gradient method in the same time and space complexity as computing (9).
On the other hand, can be approximated by backpropagating through the Lanczos method described in Algorithm 1. To carry out backpropagation, all operations in the Lanczos method must be differentiable. For the approximation of during the Lanczos process, we can similarly compute the gradient of (10) efficiently using the SKI techniques as in computing (see Appendix A).
The gradient for the last step of Algorithm 1 can be derived as follows. From , we have . This is known as the Sylvester equation, which has the form of where are matrices and is the unknown matrix to solve for. We can compute the gradient by solving the Sylvester equation using the BartelsStewart algorithm (Bartels and Stewart, 1972) in time for a matrix , which is considered for a small constant .
Overall, training the GP adapter using stochastic optimization with the aforementioned approach takes time and space for inducing points, observations in the time series, and features generated by the GP adapter.
6 Related work
The recently proposed mixtures of expected Gaussian kernels (MEG) (Li and Marlin, 2015)
for classification of irregular time series is probably the closest work to ours. The random feature representation of the MEG kernel is in the form of
, which the algorithm described in Section 4 can be applied to directly. However, by exploiting the spectral property of Gaussian kernels, the expected random feature of the MEG kernel is shown to be analytically computable by . With the SKI techniques, we can efficiently approximate both and in the same time and space complexity as the GP adapter. Moreover, the random features of the MEG kernel can be viewed as a stochastic layer in the classification network, with no trainable parameters. All are randomly initialized once in the beginning and associated with the output of the GP adapter in a nonlinear way described above.Moreover, the MEG kernel classification is originally a twostage method: one first estimates the GP parameters by maximizing the marginal likelihood and then uses the optimized GP parameters to compute the MEG kernel for classification. Since the random feature is differentiable, with the approximation of and described in Section 5, we can form a similar classification network that can be efficiently trained endtoend using the GP adapter. In Section 7.2, we will show that training the MEG kernel endtoend leads to better classification performance.
7 Experiments
In this section, we present experiments and results exploring several facets of the GP adapter framework including the quality of the approximations and the classification performance of the framework when combined with different base classifiers.
7.1 Quality of GP sampling approximations
The key to scalable learning with the GP adapter relies on both fast and accurate approximation for drawing samples from the posterior GP. To assess the approximation quality, we first generate a synthetic sparse and irregularlysampled time series by sampling from a zeromean Gaussian process at random time points. We use the squared exponential kernel with randomly chosen hyperparameters. We then infer and at some reference given . Let denote our approximation of . In this experiment, we set the output size to be , that is, . We evaluate the approximation quality by assessing the error computed with a fixed random vector .
The leftmost plot in Figure 1 shows the approximation error under different numbers of inducing points with Lanczos iterations. The middle plot compares the approximation error as the number of Lanczos iterations varies, with inducing points. These two plots show that the approximation error drops as more inducing points and Lanczos iterations are used. In both plots, the three lines correspond to different sizes for : 1000 (bottom line), 2000 (middle line), 3000 (top line). The separation between the curves is due to the fact that the errors are compared under the same number of inducing points. Longer time series leads to lower resolution of the inducing points and hence the higher approximation error.
Note that the approximation error comes from both the cubic interpolation and the Lanczos method. Therefore, to achieve a certain normalized approximation error across different data sizes, we should simultaneously use more inducing points and Lanczos iterations as the data grows. In practice, we find that is sufficient for estimating the expected loss for classification.
The rightmost plot in Figure 1 compares the time to draw a sample using exact computation versus the approximation method described in Section 4 (exact and Lanczos in the figure). We also compare the time to compute the gradient with respect to the GP parameters by both the exact method and the proposed approximation (exact BP and Lanczos BP in the figure) because this is the actual computation carried out during training. In this part of the experiment, we use and . The plot shows that Lanczos approximation with the SKI kernel yields speedups of between 1 and 3 orders of magnitude. Interestingly, for the exact approach, the time for computing the gradient roughly doubles the time of drawing samples. (Note that time is plotted in log scale.) This is because computing gradients requires both forward and backward propagation, whereas drawing samples corresponds to only the forward pass. Both the forward and backward passes take roughly the same computation in the exact case. However, the gap is relatively larger for the approximation approach due to the recursive relationship of the variables in the Lanczos process. In particular, is defined recursively in terms of all of , which makes the backpropagation computation more complicated than the forward pass.
7.2 Classification with GP adapter
LogReg  MLP  ConvNet  MEG kernel  

Marginal likelihood  IMP  77.90  85.49  87.61  – 
UAC  78.23  87.05  88.17  84.82  
Endtoend  IMP  79.12  86.49  89.84  – 
UAC  79.24  87.95  91.41  86.61 
In this section, we evaluate the performance of classifying sparse and irregularlysampled time series using the UAC framework. We test the framework on the uWave data set,^{4}^{4}4 The data set UWaveGestureLibraryAll is available at http://timeseriesclassification.com. a collection of gesture samples categorized into eight gesture patterns (Liu et al., 2009). The data set has been split into 3582 training instances and 896 test instances. Each time series contains 945 fully observed samples. Following the data preparation procedure in the MEG kernel work (Li and Marlin, 2015), we randomly sample 10% of the observations from each time series to simulate the sparse and irregular sampling scenario. In this experiment, we use the squared exponential covariance function for . Together with the independent noise parameter , the GP parameters are . To bypass the positive constraints on the GP parameters, we reparameterize them by such that , , and .
To demonstrate that the GP adapter is capable of working with various classifiers, we use the UAC framework to train three different classifiers: a multiclass logistic regression (LogReg), a fullyconnected feedforward network (MLP), and a convolutional neural network (ConvNet). The detailed architecture of each model is described in Appendix C.
We use inducing points, features output by the GP adapter, Lanczos iterations, and samples. We split the training set into two partitions: for training and
for validation. We jointly train the classifier with the GP adapter using stochastic gradient descent with Nesterov momentum. We apply early stopping based on the validation set. We also compare to classification with the MEG kernel implemented using our GP adapter as described in Section
6. We use random features trained with multiclass logistic regression.Table 1 shows that among all three classifiers, training GP parameters discriminatively always leads to better accuracy than maximizing the marginal likelihood. This claim also holds for the results using the MEG kernel. Further, taking the uncertainty into account by sampling from the posterior GP always outperforms training using only the posterior means. Finally, we can also see that the classification accuracy improves as the model gets deeper.
8 Conclusions and future work
We have presented a general framework for classifying sparse and irregularlysampled time series and have shown how to scale up the required computations using a new approach to generating approximate samples. We have validated the approximation quality, the computational speedups, and the benefit of the proposed approach relative to existing baselines.
There are many promising directions for future work including investigating more complicated covariance functions like the spectral mixture kernel (Wilson and Adams, 2013), different classifiers including the encoder LSTM (Sutskever et al., 2014)
, and extending the framework to multidimensional time series and GPs with multidimensional index sets (e.g., for spatial data). Lastly, the GP adapter can also be applied to other problems such as dimensionality reduction by combining it with an autoencoder.
Acknowledgements
This work was supported by the National Science Foundation under Grant No. 1350522.
References
 Bartels and Stewart [1972] Richard H. Bartels and GW Stewart. Solution of the matrix equation . Communications of the ACM, 15(9):820–826, 1972.
 Björck and Hammarling [1983] Åke Björck and Sven Hammarling. A Schur method for the square root of a matrix. Linear algebra and its applications, 52:127–140, 1983.
 Chow and Saad [2014] Edmond Chow and Yousef Saad. Preconditioned krylov subspace methods for sampling multivariate gaussian distributions. SIAM Journal on Scientific Computing, 36(2):A588–A608, 2014.
 Clark and Bjørnstad [2004] J.S. Clark and O.N. Bjørnstad. Population time series: process variability, observation errors, missing values, lags, and hidden states. Ecology, 85(11):3140–3150, 2004.
 Dubrulle [2001] Augustin A Dubrulle. Retooling the method of block conjugate gradients. Electronic Transactions on Numerical Analysis, 12:216–233, 2001.
 Feng et al. [1995] YT Feng, DRJ Owen, and D Perić. A block conjugate gradient method applied to linear systems with multiple righthand sides. Computer methods in applied mechanics and engineering, 1995.
 Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
 Golub and Underwood [1977] Gene Howard Golub and Richard Underwood. The block Lanczos method for computing eigenvalues. Mathematical software, 3:361–377, 1977.
 Ilić et al. [2009] M Ilić, Ian W Turner, and Daniel P Simpson. A restarted Lanczos approximation to functions of a symmetric matrix. IMA journal of numerical analysis, page drp003, 2009.
 Keys [1981] Robert G Keys. Cubic convolution interpolation for digital image processing. Acoustics, Speech and Signal Processing, IEEE Transactions on, 29(6):1153–1160, 1981.
 Kingma and Welling [2014] Diederik P Kingma and Max Welling. Autoencoding variational bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

LeCun et al. [2004]
Yann LeCun, Fu Jie Huang, and Leon Bottou.
Learning methods for generic object recognition with invariance to
pose and lighting.
In
Proceedings of Computer Vision and Pattern Recognition (CVPR)
, 2004. 
Li and Marlin [2015]
Steven ChengXian Li and Benjmain M. Marlin.
Classification of sparse and irregularly sampled time series with
mixtures of expected Gaussian kernels and random features.
In
31st Conference on Uncertainty in Artificial Intelligence
, 2015.  Liu et al. [2009] Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan. uwave: Accelerometerbased personalized gesture recognition and its applications. Pervasive and Mobile Computing, 2009.
 Marlin et al. [2012] Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398, 2012.
 Parlett [1980] Beresford N Parlett. The symmetric eigenvalue problem, volume 7. SIAM, 1980.

Rasmussen [2006]
Carl Edward Rasmussen.
Gaussian processes for machine learning.
2006.  Ruf [1999] T. Ruf. The lombscargle periodogram in biological rhythm research: analysis of incomplete and unequally spaced timeseries. Biological Rhythm Research, 30(2):178–201, 1999.
 Saad [1980] Yousef Saad. On the rates of convergence of the Lanczos and the blockLanczos methods. SIAM Journal on Numerical Analysis, 17(5):687–706, 1980.
 Saad [2003] Yousef Saad. Iterative methods for sparse linear systems. Siam, 2003.
 Scargle [1982] Jeffrey D Scargle. Studies in astronomical time series analysis. iistatistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, 263:835–853, 1982.
 Schulz and Stattegger [1997] M. Schulz and K. Stattegger. Spectrum: Spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences, 23(9):929–945, 1997.
 Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 Wilson and Adams [2013] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 Wilson and Nickisch [2015] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISSGP). In Proceedings of the 32nd International Conference on Machine Learning, 2015.
Appendix A Gradients for GP approximation
a.1 Gradients of the approximate posterior GP covariancevector product
Throughout we denote the independent noise variance as for clarity. Let be the approximate posterior covariance derived by the SKI kernel, and be one of the GP hyperparameters. For any vector , the gradient is given below. Note that during the Lanczos process, is a function of , which should be properly handled in backpropagation.
To reduce redundant computations, we introduce the following variables:
The gradient with respect to is therefore .
The gradient with respect to the noise variance is given by
a.2 Gradients of the approximate posterior GP mean
Let denote the approximate posterior mean derived by the SKI kernel. The gradient with respect to the GP hyperparameter is given by
To reduce redundant computations, we introduce the following variables:
The gradient with respect to is therefore .
The gradient with respect to the noise variance is given by
Appendix B Cubic interpolation in backpropagation
The choice of cubic convolution interpolation proposed by Keys (1981) is preferable over other interpolation methods such as spline interpolation when training the GP parameters. If spline interpolation is used to construct the SKI kernel , the interpolation matrix depends not only on and but also on the kernel , which depends on the GP parameters . As a result, the gradient needs to be computed and thus introduces a huge overhead in backpropagation. On the other hand, the interpolation matrix based on the cubic convolution interpolation depends only on and , which are fixed once the data are given. Therefore, with cubic convolution interpolation, both and are constant matrices throughout the entire training process.
Appendix C Architectures used in the experiment
The architecture of each classifier compared in Section 7.2 are described as follows. The fullyconnected network consists of two fullyconnected layers, each of which contains units. The convolutional network contains a total of five layers: the first and the third layer are both onedimensional convolutional layers with four filters of size
. The second and the fourth layer are onedimensional maxpooling layers of size
. The last layer is a fullyconnected layer with units. We apply rectified linear activation to all of the convolutional and fullyconnected layers. Each classifier takes input features produced by the GP adapter.