Log In Sign Up

A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification

We present a general framework for classification of sparse and irregularly-sampled time series. The properties of such time series can result in substantial uncertainty about the values of the underlying temporal processes, while making the data difficult to deal with using standard classification methods that assume fixed-dimensional feature spaces. To address these challenges, we propose an uncertainty-aware classification framework based on a special computational layer we refer to as the Gaussian process adapter that can connect irregularly sampled time series data to any black-box classifier learnable using gradient descent. We show how to scale up the required computations based on combining the structured kernel interpolation framework and the Lanczos approximation method, and how to discriminatively train the Gaussian process adapter in combination with a number of classifiers end-to-end using backpropagation.


page 1

page 2

page 3

page 4


Learning to Detect Sepsis with a Multitask Gaussian Process RNN Classifier

We present a scalable end-to-end classifier that uses streaming physiolo...

A Robust Two-Sample Test for Time Series data

We develop a general framework for hypothesis testing with time series d...

Probabilistic structure discovery in time series data

Existing methods for structure discovery in time series data construct i...

Analysis of Nonstationary Time Series Using Locally Coupled Gaussian Processes

The analysis of nonstationary time series is of great importance in many...

Heteroscedastic Temporal Variational Autoencoder For Irregularly Sampled Time Series

Irregularly sampled time series commonly occur in several domains where ...

Neural Inference of Gaussian Processes for Time Series Data of Quasars

The study of quasar light curves poses two problems: inference of the po...

1 Introduction

In this paper, we propose a general framework for classification of sparse and irregularly-sampled time series. An irregularly-sampled time series is a sequence of samples with irregular intervals between their observation times. These intervals can be large when the time series are also sparsely sampled. Such time series data are studied in various areas including climate science (Schulz and Stattegger, 1997), ecology (Clark and Bjørnstad, 2004), biology (Ruf, 1999), medicine (Marlin et al., 2012) and astronomy (Scargle, 1982). Classification in this setting is challenging both because the data cases are not naturally defined in a fixed-dimensional feature space due to irregular sampling and variable numbers of samples, and because there can be substantial uncertainty about the underlying temporal processes due to the sparsity of observations.

Recently, Li and Marlin (2015)

introduced the mixture of expected Gaussian kernels (MEG) framework, an uncertainty-aware kernel for classifying sparse and irregularly sampled time series. Classification with MEG kernels is shown to outperform models that ignore uncertainty due to sparse and irregular sampling. On the other hand, various deep learning models including convolutional neural networks

(LeCun et al., 2004)

have been successfully applied to fields such as computer vision and natural language processing, and have been shown to achieve state-of-the-art results on various tasks. Some of these models have desirable properties for time series classification, but cannot be directly applied to sparse and irregularly sampled time series.

Inspired by the MEG kernel, we propose an uncertainty-aware classification framework that enables learning black-box classification models from sparse and irregularly sampled time series data. This framework is based on the use of a computational layer that we refer to as the Gaussian process (GP) adapter. The GP adapter uses Gaussian process regression to transform the irregular time series data into a uniform representation, allowing sparse and irregularly sampled data to be fed into any black-box classifier learnable using gradient descent while preserving uncertainty. However, the time and space of exact GP regression makes the GP adapter prohibitively expensive when scaling up to large time series.

To address this problem, we show how to speed up the key computation of sampling from a GP posterior based on combining the structured kernel interpolation (SKI) framework that was recently proposed by Wilson and Nickisch (2015) with Lanczos methods for approximating matrix functions (Chow and Saad, 2014). Using the proposed sampling algorithm, the GP adapter can run in linear time and space in terms of the length of the time series, and time when inducing points are used.

We also show that GP adapter can be trained end-to-end together with the parameters of the chosen classifier by backpropagation through the iterative Lanczos method. We present results using logistic regression, fully-connected feedforward networks, convolutional neural networks and the MEG kernel. We show that end-to-end discriminative training of the GP adapter outperforms a variety of baselines in terms of classification performance, including models based only on GP mean interpolation, or with GP regression trained separately using marginal likelihood.

2 Gaussian processes for sparse and irregularly-sampled time series

Our focus in this paper is on time series classification in the presence of sparse and irregular sampling. In this problem, the data contain independent tuples consisting of a time series and a label . Thus, . Each time series is represented as a list of time points , and a list of corresponding values . We assume that each time series is observed over a common time interval . However, different time series are not necessarily observed at the same time points (i.e.  in general). This implies that the number of observations in different time series is not necessary the same (i.e.  in general). Furthermore, the time intervals between observation within a single time series are not assumed to be uniform.

Learning in this setting is challenging because the data cases are not naturally defined in a fixed-dimensional feature space due to the irregular sampling. This means that commonly used classifiers that take fixed-length feature vectors as input are not applicable. In addition, there can be substantial uncertainty about the underlying temporal processes due to the sparsity of observations.

To address these challenges, we build on ideas from the MEG kernel (Li and Marlin, 2015) by using GP regression (Rasmussen, 2006) to provide an uncertainty-aware representation of sparse and irregularly sampled time series. We fix a set of reference time points and represent a time series in terms of its posterior marginal distribution at these time points. We use GP regression with a zero-mean GP prior and a covariance function

parameterized by kernel hyperparameters

. Let

be the independent noise variance of the GP regression model. The GP parameters are


Under this model, the marginal posterior GP at

is Gaussian distributed with the mean and covariance given by


where denotes the covariance matrix with . We note that it takes time to exactly compute the posterior mean , and time to exactly compute the full posterior covariance matrix , where and .

3 The GP adapter and uncertainty-aware time series classification

In this section we describe our framework for time series classification in the presence of sparse and irregular sampling. Our framework enables any black-box classifier learnable by gradient-based methods to be applied to the problem of classifying sparse and irregularly sampled time series.

3.1 Classification frameworks and the Gaussian process adapter

In Section 2 we described how we can represent a time series through the marginal posterior it induces under a Gaussian process regression model at any set of reference time points . By fixing a common set of reference time points for all time series in a data set, every time series can be transformed into a common representation in the form of a multivariate Gaussian with being the random vector distributed according to the posterior GP marginalized over the time points .111 The notation explicitly expresses that both and are functions of the GP parameters . Besides, they are also functions of as shown in (1) and (2). Here we assume that the GP parameters are shared across the entire data set.

If the values were observed, we could simply apply a black-box classifier. A classifier can be generally defined by a mapping function parameterized by

, associated with a loss function

where is a label value from the output space . However, in our case

is a Gaussian random variable, which means

is now itself a random variable given a label . Therefore, we use the expectation as the overall loss between the label and a time series given its Gaussian representation . The learning problem becomes minimizing the expected loss over the entire data set:


Once we have the optimal parameters and , we can make predictions on unseen data. In general, given an unseen time series and its Gaussian representation , we can predict its label using (4), although in many cases this can be simplified into a function of with the expectation taken on or inside of .


We name the above approach the Uncertainty-Aware Classification (UAC) framework. Importantly, this framework propagates the uncertainty in the GP posterior induced by each time series all the way through to the loss function. Besides, we call the transformation the Gaussian process adapter, since it provides a uniform representation to connect the raw irregularly sampled time series data to a black-box classifier.

Variations of the UAC framework can be derived by taking the expectation at various position of where . Taking the expectation at an earlier stage simplifies the computation, but the uncertainty information will be integrated out earlier as well.222 For example, the loss of the expected output of the classifier .

In the extreme case, if the expectation is computed immediately followed by the GP adapter transformation, it is equivalent to using a plug-in estimate

for in the loss function,

. We refer to this as the IMPutation (IMP) framework. The IMP framework discards the uncertainty information completely, which further simplifies the computation. This simplified variation may be useful when the time series are more densely sampled, where the uncertainty is less of a concern.

In practice, we can train the model using the UAC objective (3) and predict instead by IMP. In that case, the predictions would be deterministic and can be computed efficiently without drawing samples from the posterior GP as described later in Section 4.

3.2 Learning with the GP adapter

In the previous section, we showed that the UAC framework can be trained using (3

). In this paper, we use stochastic gradient descent to scalably optimize (

3) by updating the model using a single time series at a time, although it can be easily modified for batch or mini-batch updates. From now on, we will focus on the optimization problem where are the output of the GP adapter given a time series and its label . For many classifiers, the expected loss cannot be analytically computed. In such cases, we use the Monte Carlo average to approximate the expected loss:


To learn the parameters of both the classifier and the Gaussian process regression model jointly under the expected loss, we need to be able to compute the gradient of the expectation given in (5). To achieve this, we reparameterize the Gaussian random variable using the identity where and satisfies (Kingma and Welling, 2014). The gradients under this reparameterization are given below, both of which can be approximated using Monte Carlo sampling as in (5). We will focus on efficiently computing the gradient shown in (7) since we assume that the gradient of the base classifier can be computed efficiently.


There are several choices for that satisfy . One common choice of is the Cholesky factor, a lower triangular matrix, which can be computed using Cholesky decomposition in for a covariance matrix (Golub and Van Loan, 2012). We instead use the symmetric matrix square root . We will show that this particular choice of leads to an efficient and scalable approximation algorithm in Section 4.2.

4 Fast sampling from posterior Gaussian processes

The computation required by the GP adapter is dominated by the time needed to draw samples from the marginal GP posterior using . In Section 2 we noted that the time complexity of exactly computing the posterior mean and covariance is and , respectively. Once we have both and we still need to compute the square root of , which requires an additional time to compute exactly. In this section, we show how to efficiently generate samples of .

4.1 Structured kernel interpolation for approximating GP posterior means

The main idea of the structured kernel interpolation (SKI) framework recently proposed by Wilson and Nickisch (2015) is to approximate a stationary kernel matrix by the approximate kernel defined below where is a collection of evenly-spaced inducing points.


Letting and , is a sparse interpolation matrix where each row contains only a small number of non-zero entries. We use local cubic convolution interpolation (cubic interpolation for short) (Keys, 1981) as suggested in Wilson and Nickisch (2015). Each row of the interpolation matrices has at most four non-zero entries. Wilson and Nickisch (2015) showed that when the kernel is locally smooth (under the resolution of ), cubic interpolation results in accurate approximation. This can be justified as follows: with cubic interpolation, the SKI kernel is essentially the two-dimensional cubic interpolation of using the exact regularly spaced samples stored in , which corresponds to classical bicubic convolution. In fact, we can show that asymptotically converges to as increases by following the derivation in Keys (1981).

Plugging the SKI kernel into (1), the posterior GP mean evaluated at can be approximated by


The inducing points are chosen to be evenly-spaced because

forms a symmetric Toeplitz matrix under a stationary covariance function. A symmetric Toeplitz matrix can be embedded into a circulant matrix to perform matrix vector multiplication using fast Fourier transforms

(Golub and Van Loan, 2012).

Further, one can use the conjugate gradient method to solve for which only involves computing the matrix-vector product . In practice, the conjugate gradient method converges within only a few iterations. Therefore, approximating the posterior mean using SKI takes only time to compute. In addition, since a symmetric Toeplitz matrix can be uniquely characterized by its first column, and can be stored as a sparse matrix, approximating requires only space.

4.2 The Lanczos method for covariance square root-vector products

With the SKI techniques, although we can efficiently approximate the posterior mean , computing is still challenging. If computed exactly, it takes time to compute and time to take the square root. To overcome the bottleneck, we apply the SKI kernel to the Lanczos method, one of the Krylov subspace approximation methods, to speed up the computation of as shown in Algorithm 1. The advantage of the Lanczos method is that neither nor needs to be computed explicitly. Like the conjugate gradient method, another example of the Krylov subspace method, it only requires the computation of matrix-vector products with as the matrix.

The idea of the Lanczos method is to approximate in the Krylov subspace . The iteration in Algorithm 1

, usually referred to the Lanczos process, essentially performs the Gram-Schmidt process to transform the basis

into an orthonormal basis for the subspace .

The optimal approximation of in the Krylov subspace that minimizes the -norm of the error is the orthogonal projection of onto as . Since we choose , the optimal projection can be written as where is the first column of the identify matrix.

One can show that the tridiagonal matrix defined in Algorithm 1 satisfies (Saad, 2003). Also, we have

since the eigenvalues of

approximate the extremal eigenvalues of (Saad, 1980). Therefore we have .

The error bound of the Lanczos method is analyzed in Ilić et al. (2009). Alternatively one can show that the Lanczos approximation converges superlinearly (Parlett, 1980). In practice, for a covariance matrix , the approximation is sufficient for our sampling purpose with . As is now a matrix, we can use any standard method to compute its square root in time (Björck and Hammarling, 1983), which is considered when is chosen to be a small constant. Now the computation of the Lanczos method for approximating is dominated by the matrix-vector product during the Lanczos process.

Input: covariance matrix , dimension of the Krylov subspace , random vector and for  to  do   return   // Algorithm 1 Lanczos method for approximating

Here we apply the SKI kernel trick again to efficiently approximate by


Similar to the posterior mean, can be approximated in time and linear space. Therefore, for basis vectors, the entire Algorithm 1 takes time and space, which is also the complexity to draw a sample from the posterior GP.

To reduce the variance when estimating the expected loss (5), we can draw multiple samples from the posterior GP: where . Since all of the samples are associated with the same covariance matrix , we can use the block Lanczos process (Golub and Underwood, 1977), an extension to the single-vector Lanczos method presented in Algorithm 1, to simultaneously approximate for all random vectors . Similarly, during the block Lanczos process, we use the block conjugate gradient method (Feng et al., 1995; Dubrulle, 2001) to simultaneously solve the linear equation for multiple .

5 End-to-end learning with the GP adapter

The most common way to train GP parameters is through maximizing the marginal likelihood (Rasmussen, 2006)


If we follow this criterion, training the UAC framework becomes a two-stage procedure: first we learn GP parameters by maximizing the marginal likelihood. We then compute and given each time series and the learned GP parameters . Both and are then fixed and used to train the classifier using (6).

In this section, we describe how to instead train the GP parameters discriminatively end-to-end using backpropagation. As mentioned in Section 3, we train the UAC framework by jointly optimizing the GP parameters and the parameters of the classifier according to (6) and (7).

The most challenging part in (7) is to compute .333 For brevity, we drop from the gradient notation in this section. For , we can derive the gradient of the approximating posterior mean (9) as given in Appendix A. Note that the gradient can be approximated efficiently by repeatedly applying fast Fourier transforms and the conjugate gradient method in the same time and space complexity as computing (9).

On the other hand, can be approximated by backpropagating through the Lanczos method described in Algorithm 1. To carry out backpropagation, all operations in the Lanczos method must be differentiable. For the approximation of during the Lanczos process, we can similarly compute the gradient of (10) efficiently using the SKI techniques as in computing (see Appendix A).

The gradient for the last step of Algorithm 1 can be derived as follows. From , we have . This is known as the Sylvester equation, which has the form of where are matrices and is the unknown matrix to solve for. We can compute the gradient by solving the Sylvester equation using the Bartels-Stewart algorithm (Bartels and Stewart, 1972) in time for a matrix , which is considered for a small constant .

Overall, training the GP adapter using stochastic optimization with the aforementioned approach takes time and space for inducing points, observations in the time series, and features generated by the GP adapter.

6 Related work

The recently proposed mixtures of expected Gaussian kernels (MEG) (Li and Marlin, 2015)

for classification of irregular time series is probably the closest work to ours. The random feature representation of the MEG kernel is in the form of

, which the algorithm described in Section 4 can be applied to directly. However, by exploiting the spectral property of Gaussian kernels, the expected random feature of the MEG kernel is shown to be analytically computable by . With the SKI techniques, we can efficiently approximate both and in the same time and space complexity as the GP adapter. Moreover, the random features of the MEG kernel can be viewed as a stochastic layer in the classification network, with no trainable parameters. All are randomly initialized once in the beginning and associated with the output of the GP adapter in a nonlinear way described above.

Moreover, the MEG kernel classification is originally a two-stage method: one first estimates the GP parameters by maximizing the marginal likelihood and then uses the optimized GP parameters to compute the MEG kernel for classification. Since the random feature is differentiable, with the approximation of and described in Section 5, we can form a similar classification network that can be efficiently trained end-to-end using the GP adapter. In Section 7.2, we will show that training the MEG kernel end-to-end leads to better classification performance.

7 Experiments

In this section, we present experiments and results exploring several facets of the GP adapter framework including the quality of the approximations and the classification performance of the framework when combined with different base classifiers.

7.1 Quality of GP sampling approximations

Figure 1: Left: Sample approximation error versus the number of inducing points. Middle: Sample approximation error versus the number of Lanczos iterations. Right: Running time comparisons (in seconds). BP denotes computing the gradient of the sample using backpropagation.

The key to scalable learning with the GP adapter relies on both fast and accurate approximation for drawing samples from the posterior GP. To assess the approximation quality, we first generate a synthetic sparse and irregularly-sampled time series by sampling from a zero-mean Gaussian process at random time points. We use the squared exponential kernel with randomly chosen hyperparameters. We then infer and at some reference given . Let denote our approximation of . In this experiment, we set the output size to be , that is, . We evaluate the approximation quality by assessing the error computed with a fixed random vector .

The leftmost plot in Figure 1 shows the approximation error under different numbers of inducing points with Lanczos iterations. The middle plot compares the approximation error as the number of Lanczos iterations varies, with inducing points. These two plots show that the approximation error drops as more inducing points and Lanczos iterations are used. In both plots, the three lines correspond to different sizes for : 1000 (bottom line), 2000 (middle line), 3000 (top line). The separation between the curves is due to the fact that the errors are compared under the same number of inducing points. Longer time series leads to lower resolution of the inducing points and hence the higher approximation error.

Note that the approximation error comes from both the cubic interpolation and the Lanczos method. Therefore, to achieve a certain normalized approximation error across different data sizes, we should simultaneously use more inducing points and Lanczos iterations as the data grows. In practice, we find that is sufficient for estimating the expected loss for classification.

The rightmost plot in Figure 1 compares the time to draw a sample using exact computation versus the approximation method described in Section 4 (exact and Lanczos in the figure). We also compare the time to compute the gradient with respect to the GP parameters by both the exact method and the proposed approximation (exact BP and Lanczos BP in the figure) because this is the actual computation carried out during training. In this part of the experiment, we use and . The plot shows that Lanczos approximation with the SKI kernel yields speed-ups of between 1 and 3 orders of magnitude. Interestingly, for the exact approach, the time for computing the gradient roughly doubles the time of drawing samples. (Note that time is plotted in log scale.) This is because computing gradients requires both forward and backward propagation, whereas drawing samples corresponds to only the forward pass. Both the forward and backward passes take roughly the same computation in the exact case. However, the gap is relatively larger for the approximation approach due to the recursive relationship of the variables in the Lanczos process. In particular, is defined recursively in terms of all of , which makes the backpropagation computation more complicated than the forward pass.

7.2 Classification with GP adapter

LogReg MLP ConvNet MEG kernel
Marginal likelihood IMP 77.90 85.49 87.61
UAC 78.23 87.05 88.17 84.82
End-to-end IMP 79.12 86.49 89.84
UAC 79.24 87.95 91.41 86.61
Table 1: Comparison of classification accuracy (in percent). IMP and UAC refer to the loss functions for training described in Section 3.1, and we use IMP predictions throughout. Although not belonging to the UAC framework, we put the MEG kernel in UAC since it is also uncertainty-aware.

In this section, we evaluate the performance of classifying sparse and irregularly-sampled time series using the UAC framework. We test the framework on the uWave data set,444 The data set UWaveGestureLibraryAll is available at a collection of gesture samples categorized into eight gesture patterns (Liu et al., 2009). The data set has been split into 3582 training instances and 896 test instances. Each time series contains 945 fully observed samples. Following the data preparation procedure in the MEG kernel work (Li and Marlin, 2015), we randomly sample 10% of the observations from each time series to simulate the sparse and irregular sampling scenario. In this experiment, we use the squared exponential covariance function for . Together with the independent noise parameter , the GP parameters are . To bypass the positive constraints on the GP parameters, we reparameterize them by such that , , and .

To demonstrate that the GP adapter is capable of working with various classifiers, we use the UAC framework to train three different classifiers: a multi-class logistic regression (LogReg), a fully-connected feedforward network (MLP), and a convolutional neural network (ConvNet). The detailed architecture of each model is described in Appendix C.

We use inducing points, features output by the GP adapter, Lanczos iterations, and samples. We split the training set into two partitions: for training and

for validation. We jointly train the classifier with the GP adapter using stochastic gradient descent with Nesterov momentum. We apply early stopping based on the validation set. We also compare to classification with the MEG kernel implemented using our GP adapter as described in Section 

6. We use random features trained with multi-class logistic regression.

Table 1 shows that among all three classifiers, training GP parameters discriminatively always leads to better accuracy than maximizing the marginal likelihood. This claim also holds for the results using the MEG kernel. Further, taking the uncertainty into account by sampling from the posterior GP always outperforms training using only the posterior means. Finally, we can also see that the classification accuracy improves as the model gets deeper.

8 Conclusions and future work

We have presented a general framework for classifying sparse and irregularly-sampled time series and have shown how to scale up the required computations using a new approach to generating approximate samples. We have validated the approximation quality, the computational speed-ups, and the benefit of the proposed approach relative to existing baselines.

There are many promising directions for future work including investigating more complicated covariance functions like the spectral mixture kernel (Wilson and Adams, 2013), different classifiers including the encoder LSTM (Sutskever et al., 2014)

, and extending the framework to multi-dimensional time series and GPs with multi-dimensional index sets (e.g., for spatial data). Lastly, the GP adapter can also be applied to other problems such as dimensionality reduction by combining it with an autoencoder.


This work was supported by the National Science Foundation under Grant No. 1350522.


  • Bartels and Stewart [1972] Richard H. Bartels and GW Stewart. Solution of the matrix equation . Communications of the ACM, 15(9):820–826, 1972.
  • Björck and Hammarling [1983] Åke Björck and Sven Hammarling. A Schur method for the square root of a matrix. Linear algebra and its applications, 52:127–140, 1983.
  • Chow and Saad [2014] Edmond Chow and Yousef Saad. Preconditioned krylov subspace methods for sampling multivariate gaussian distributions. SIAM Journal on Scientific Computing, 36(2):A588–A608, 2014.
  • Clark and Bjørnstad [2004] J.S. Clark and O.N. Bjørnstad. Population time series: process variability, observation errors, missing values, lags, and hidden states. Ecology, 85(11):3140–3150, 2004.
  • Dubrulle [2001] Augustin A Dubrulle. Retooling the method of block conjugate gradients. Electronic Transactions on Numerical Analysis, 12:216–233, 2001.
  • Feng et al. [1995] YT Feng, DRJ Owen, and D Perić. A block conjugate gradient method applied to linear systems with multiple right-hand sides. Computer methods in applied mechanics and engineering, 1995.
  • Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
  • Golub and Underwood [1977] Gene Howard Golub and Richard Underwood. The block Lanczos method for computing eigenvalues. Mathematical software, 3:361–377, 1977.
  • Ilić et al. [2009] M Ilić, Ian W Turner, and Daniel P Simpson. A restarted Lanczos approximation to functions of a symmetric matrix. IMA journal of numerical analysis, page drp003, 2009.
  • Keys [1981] Robert G Keys. Cubic convolution interpolation for digital image processing. Acoustics, Speech and Signal Processing, IEEE Transactions on, 29(6):1153–1160, 1981.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
  • LeCun et al. [2004] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In

    Proceedings of Computer Vision and Pattern Recognition (CVPR)

    , 2004.
  • Li and Marlin [2015] Steven Cheng-Xian Li and Benjmain M. Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In

    31st Conference on Uncertainty in Artificial Intelligence

    , 2015.
  • Liu et al. [2009] Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan. uwave: Accelerometer-based personalized gesture recognition and its applications. Pervasive and Mobile Computing, 2009.
  • Marlin et al. [2012] Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398, 2012.
  • Parlett [1980] Beresford N Parlett. The symmetric eigenvalue problem, volume 7. SIAM, 1980.
  • Rasmussen [2006] Carl Edward Rasmussen.

    Gaussian processes for machine learning.

  • Ruf [1999] T. Ruf. The lomb-scargle periodogram in biological rhythm research: analysis of incomplete and unequally spaced time-series. Biological Rhythm Research, 30(2):178–201, 1999.
  • Saad [1980] Yousef Saad. On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM Journal on Numerical Analysis, 17(5):687–706, 1980.
  • Saad [2003] Yousef Saad. Iterative methods for sparse linear systems. Siam, 2003.
  • Scargle [1982] Jeffrey D Scargle. Studies in astronomical time series analysis. ii-statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, 263:835–853, 1982.
  • Schulz and Stattegger [1997] M. Schulz and K. Stattegger. Spectrum: Spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences, 23(9):929–945, 1997.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Wilson and Adams [2013] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • Wilson and Nickisch [2015] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Appendix A Gradients for GP approximation

a.1 Gradients of the approximate posterior GP covariance-vector product

Throughout we denote the independent noise variance as for clarity. Let be the approximate posterior covariance derived by the SKI kernel, and be one of the GP hyperparameters. For any vector , the gradient is given below. Note that during the Lanczos process, is a function of , which should be properly handled in backpropagation.

To reduce redundant computations, we introduce the following variables:

The gradient with respect to is therefore .

The gradient with respect to the noise variance is given by

a.2 Gradients of the approximate posterior GP mean

Let denote the approximate posterior mean derived by the SKI kernel. The gradient with respect to the GP hyperparameter is given by

To reduce redundant computations, we introduce the following variables:

The gradient with respect to is therefore .

The gradient with respect to the noise variance is given by

Appendix B Cubic interpolation in backpropagation

The choice of cubic convolution interpolation proposed by Keys (1981) is preferable over other interpolation methods such as spline interpolation when training the GP parameters. If spline interpolation is used to construct the SKI kernel , the interpolation matrix depends not only on and but also on the kernel , which depends on the GP parameters . As a result, the gradient needs to be computed and thus introduces a huge overhead in backpropagation. On the other hand, the interpolation matrix based on the cubic convolution interpolation depends only on and , which are fixed once the data are given. Therefore, with cubic convolution interpolation, both and are constant matrices throughout the entire training process.

Appendix C Architectures used in the experiment

The architecture of each classifier compared in Section 7.2 are described as follows. The fully-connected network consists of two fully-connected layers, each of which contains units. The convolutional network contains a total of five layers: the first and the third layer are both one-dimensional convolutional layers with four filters of size

. The second and the fourth layer are one-dimensional max-pooling layers of size

. The last layer is a fully-connected layer with units. We apply rectified linear activation to all of the convolutional and fully-connected layers. Each classifier takes input features produced by the GP adapter.