kerasgp
Keras + Gaussian Processes: Learning scalable deep and recurrent kernels.
view repo
Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closedform kernel functions for Gaussian processes. The resulting model, GPLSTM, fully encapsulates the inductive biases of long shortterm memory (LSTM) recurrent networks, while retaining the nonparametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semistochastic gradient procedure and exploit the structure of these kernels for scalable training and prediction. This approach provides a practical representation for Bayesian LSTMs. We demonstrate stateoftheart performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GPLSTM are uniquely valuable.
READ FULL TEXT VIEW PDF
We introduce scalable deep kernels, which combine the structural propert...
read it
Deep kernel learning combines the nonparametric flexibility of kernel
m...
read it
Deep Gaussian processes (DGPs) provide a Bayesian nonparametric alterna...
read it
Gaussian processes are powerful models for probabilistic machine learnin...
read it
The application of Gaussian processes (GPs) to large data sets is limite...
read it
Gaussian processes are rich distributions over functions, with generaliz...
read it
Neural processes (NPs) learn stochastic processes and predict the
distri...
read it
Keras + Gaussian Processes: Learning scalable deep and recurrent kernels.
There exists a vast array of machine learning applications where the underlying datasets are sequential. Applications range from the entirety of robotics, to speech, audio and video processing. While neural network based approaches have dealt with the issue of
representation learningfor sequential data, the important question of modeling and propagating uncertainty across time has rarely been addressed by these models. For a robotics application such as a selfdriving car, however, it is not just desirable, but essential to have complete predictive densities for variables of interest. When trying to stay in lane and keep a safe following distance from the vehicle front, knowing the uncertainty associated with lanes and lead vehicles is as important as the point estimates.
Recurrent models with long shortterm memory (LSTM) (Hochreiter and Schmidhuber, 1997) have recently emerged as the leading approach to modeling sequential structure. The LSTM is an efficient gradientbased method for training recurrent networks. LSTMs use a memory cell inside each hidden unit and a special gating mechanism that stabilizes the flow of the backpropagated errors, improving the learning process of the model. While the LSTM provides stateoftheart results on speech and text data (Graves et al., 2013; Sutskever et al., 2014), quantifying uncertainty or extracting full predictive distributions from deep models is still an area of active research (Gal and Ghahramani, 2016a).
In this paper, we quantify the predictive uncertainty of deep models by following a Bayesian nonparametric approach. In particular, we propose kernel functions which fully encapsulate the structural properties of LSTMs, for use with Gaussian processes. The resulting model enables Gaussian processes to achieve stateoftheart performance on sequential regression tasks, while also allowing for a principled representation of uncertainty and nonparametric flexibility. Further, we develop a provably convergent semistochastic optimization algorithm that allows minibatch updates of the recurrent kernels. We empirically demonstrate that this semistochastic approach significantly improves upon the standard nonstochastic firstorder methods in runtime and in the quality of the converged solution. For additional scalability, we exploit the algebraic structure of these kernels, decomposing the relevant covariance matrices into Kronecker products of circulant matrices, for training time and test predictions (Wilson et al., 2015; Wilson and Nickisch, 2015). Our model not only can be interpreted as a Gaussian process with a recurrent kernel, but also as a deep recurrent network with probabilistic outputs, infinitely many hidden units, and a utility function robust to overfitting.
Throughout this paper, we assume basic familiarity with Gaussian processes (GPs). We provide a brief introduction to GPs in the background section; for a comprehensive reference, see, e.g., Rasmussen and Williams (2006). In the following sections, we formalize the problem of learning from sequential data, provide background on recurrent networks and the LSTM, and present an extensive empirical evaluation of our model. Specifically, we apply our model to a number of tasks, including system identification, energy forecasting, and selfdriving car applications. Quantitatively, the model is assessed on the data ranging in size from hundreds of points to almost a million with various signaltonoise ratios demonstrating stateoftheart performance and linear scaling of our approach. Qualitatively, the model is tested on consequential selfdriving applications: lane estimation and lead vehicle position prediction. Indeed, the main focus of this paper is on achieving stateoftheart performance on consequential applications involving sequential data, following straightforward and scalable approaches to building highly flexible Gaussian process.
We release our code as a library at: http://github.com/alshedivat/kerasgp. This library implements the ideas in this paper as well as deep kernel learning (Wilson et al., 2016a) via a Gaussian process layer that can be added to arbitrary
deep architectures and deep learning frameworks, following the Keras API specification. More tutorials and resources can be found at
https://people.orie.cornell.edu/andrew/code.We consider the problem of learning a regression function that maps sequences to realvalued target vectors. Formally, let
be a collection of sequences, , each with corresponding length, , where , and is an arbitrary domain. Let , , be a collection of the corresponding realvalued target vectors. Assuming that only the most recent steps of a sequence are predictive of the targets, the goal is to learn a function, , from some family, , based on the available data.As a working example, consider the problem of estimating position of the lead vehicle at the next time step from LIDAR, GPS, and gyroscopic measurements of a selfdriving car available for a number of previous steps. This task is a classical instance of the sequencetoreals regression, where a temporal sequence of measurements is regressed to the future position estimates. In our notation, the sequences of inputs are vectors of measurements, , are indexed by time and would be of growing lengths. Typically, input sequences are considered up to a finitetime horizon, , that is assumed to be predictive for the future targets of interest. The targets, , are twodimensional vectors that encode positions of the lead vehicle in the egocentric coordinate system of the selfdriving car.
Note that the problem of learning a mapping, , is challenging. While considering whole sequences of observations as input features is necessary for capturing longterm temporal correlations, it virtually blows up the dimensionality of the problem. If we assume that each measurement is dimensional, i.e., , and consider previous steps as distinct features, the regression problem will become dimensional. Therefore, to avoid overfitting and be able to extract meaningful signal from a finite amount of data, it is crucial to exploit the sequential nature of observations.
Recurrent models. One of the most successful ways to exploit sequential structure of the data is by using a class of recurrent models. In the sequencetoreals regression scenario, such a model expresses the mapping in the following general recurrent form:
(1) 
where is an input observation at time , is a corresponding latent representation, and is a target vector. Functions and specify model transitions and emissions, respectively, and and are additive noises. While and can be arbitrary, they are typically timeinvariant. This strong but realistic assumption incorporated into the structure of the recurrent mapping significantly reduces the complexity of the family of functions, , regularizes the problem, and helps to avoid severe overfitting.
Recurrent models can account for various patterns in sequences by memorizing internal representations of their dynamics via adjusting and
. Recurrent neural networks (RNNs) model recurrent processes by using linear parametric maps followed by nonlinear activations:
(2) 
where are weight matrices to be learned^{1}^{1}1The bias terms are omitted for clarity of presentation. and and
here are some fixed elementwise functions. Importantly and contrary to the standard hidden Markov models (HMMs),
the state of an RNN at any time is distributed and effectively represented by an entire hidden sequence, . A major disadvantage of the vanilla RNNs is that their training is nontrivial due to the socalled vanishing gradient problem (Bengio et al., 1994): the error backpropagated through time steps diminishes exponentially which makes learning longterm relationships nearly impossible.LSTM. To overcome vanishing gradients, Hochreiter and Schmidhuber (1997) proposed a long shortterm memory (LSTM) mechanism that places a memory cell
into each hidden unit and uses differentiable gating variables. The update rules for the hidden representation at time
have the following form (here and are elementwise sigmoid and hyperbolic tangent functions, respectively):Gaussian processes.
The Gaussian process (GP) is a Bayesian nonparametric model that generalizes the Gaussian distributions to functions. We say that a random function
is drawn from a GP with a mean function and a covariance kernel , , if for any vector of inputs, , the corresponding vector of function values is Gaussian:with mean , such that , and covariance matrix that satisfies . GPs can be seen as distributions over the reproducing kernel Hilbert space (RKHS) of functions which is uniquely defined by the kernel function, (Schölkopf and Smola, 2002). GPs with RBF kernels are known to be universal approximators with prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006).
Assuming additive Gaussian noise, , and a GP prior on , given training inputs and training targets , the predictive distribution of the GP evaluated at an arbitrary test point is:
(4) 
where
(5)  
Here, , , , and are matrices that consist of the covariance function, , evaluated at the corresponding points, and , and is the mean function evaluated at . GPs are fit to the data by optimizing the evidence
—the marginal probability of the data given the model—with respect to kernel hyperparameters. The evidence has the form:
(6) 
where we use a shorthand for , and implicitly depends on the kernel hyperparameters. This objective function consists of a model fit and a complexity penalty term that results in an automatic Occam’s razor for realizable functions (Rasmussen and Ghahramani, 2001). By optimizing the evidence with respect to the kernel hyperparameters, we effectively learn the the structure of the space of functional relationships between the inputs and the targets. For further details on Gaussian processes and relevant literature we refer interested readers to the classical book by Rasmussen and Williams (2006).
Turning back to the problem of learning from sequential data, it seems natural to apply the powerful GP machinery to modeling complicated relationships. However, GPs are limited to learning only pairwise correlations between the inputs and are unable to account for longterm dependencies, often dismissing complex temporal structures. Combining GPs with recurrent models has potential to addresses this issue.
The problem of learning from sequential data, especially from temporal sequences, is well known in the control and dynamical systems literature. Stochastic temporal processes are usually described either with generative autoregressive models (AM) or with statespace models (SSM) (Van Overschee and De Moor, 2012). The former approach includes nonlinear autoregressive models with exogenous inputs (NARX) that are constructed by using, e.g., neural networks (Lin et al., 1996) or Gaussian processes (Kocijan et al., 2005). The latter approach additionally introduces unobservable variables, the state, and constructs autoregressive dynamics in the latent space. This construction allows to represent and propagate uncertainty through time by explicitly modeling the signal (via the state evolution) and the noise. Generative SSMs can be also used in conjunction with discriminative models via the Fisher kernel (Jaakkola and Haussler, 1999).
Modeling time series with GPs is equivalent to using linearGaussian autoregressive or SSM models (Box et al., 1994). Learning and inference are efficient in such models, but they are not designed to capture longterm dependencies or correlations beyond pairwise. Wang et al. (2005) introduced GPbased statespace models (GPSSM) that use GPs for transition and/or observation functions. These models appear to be more general and flexible as they account for uncertainty in the state dynamics, though require complicated approximate training and inference, which are hard to scale (Turner et al., 2010; Frigola et al., 2014).
Perhaps the most recent relevant work to our approach is recurrent Gaussian processes (RGP) (Mattos et al., 2015)
. RGP extends the GPSSM framework to regression on sequences by using a recurrent architecture with GPbased activation functions. The structure of the RGP model mimics the standard RNN, where every parametric layer is substituted with a Gaussian process. This procedure allows one to propagate uncertainty throughout the network for an additional cost. Inference is intractable in RGP, and efficient training requires a sophisticated
approximation procedure, the socalled recurrent variational Bayes. In addition, the authors have to turn to RNNbased approximation of the variational mean functions to battle the growth of the number of variational parameters with the size of data. While technically promising, RGP seems problematic from the application perspective, especially in its implementation and scalability aspects.Our model has several distinctions with prior work aiming to regress sequences to reals. Firstly, one of our goals is to keep the model as simple as possible while being able to represent and quantify predictive uncertainty. We maintain an analytical objective function and refrain from complicated and difficulttodiagnose inference schemes. This simplicity is achieved by giving up the idea of propagating signal through a chain GPs connected in a recurrent fashion. Instead, we propose to directly learn kernels with recurrent structure via joint optimization of a simple functional composition of a standard GP with a recurrent model (e.g., LSTM), as described in detail in the following section. Similar approaches have recently been explored and proved to be fruitful for nonrecurrent deep networks (Wilson et al., 2016a, b; Calandra et al., 2016). We remark that combinations of GPs with nonlinear functions have also been considered in the past in a slightly different setting of warped regression targets (Snelson et al., 2003; Wilson and Ghahramani, 2010; LázaroGredilla, 2012)
. Additionally, uncertainty over the recurrent parts of our model is represented via dropout, which is computationally cheap and turns out to be equivalent to approximate Bayesian inference in a deep Gaussian process
(Damianou and Lawrence, 2013) with particular intermediate kernels (Gal and Ghahramani, 2016a, b). Finally, one can also view our model as a standalone flexible Gaussian process, which leverages learning techniques that scale to massive datasets (Wilson and Nickisch, 2015; Wilson et al., 2015).Gaussian processes with different kernel functions correspond to different structured probabilistic models. For example, some special cases of the Matérn class of kernels induce models with Markovian structure (Stein, 1999). To construct deep kernels with recurrent structure we transform the original input space with an LSTM network and build a kernel directly in the transformed space, as shown in Figure 0(b).
In particular, let be an arbitrary deterministic transformation of the input sequences into some latent space, . Next, let be a realvalued kernel defined on . The decomposition of the kernel and the transformation, , is defined as
(7) 
It is trivial to show that is a valid kernel defined on (MacKay, 1998, Ch. 5.4.3). In addition, if is represented by a neural network, the resulting model can be viewed as the same network, but with an additional GPlayer and the negative log marginal likelihood (NLML) objective function used instead of the standard mean squared error (MSE).
Input embedding is wellknown in the Gaussian process literature (e.g. MacKay, 1998; Hinton and Salakhutdinov, 2008). Recently, Wilson et al. (2016a, b) have successfully scaled the approach and demonstrated strong results in regression and classification tasks for kernels based on feedforward and convolutional architectures. In this paper, we apply the same technique to learn kernels with recurrent structure by transforming input sequences with a recurrent neural network that acts as . In particular, a (multilayer) LSTM architecture is used to embed steps of the input time series into a single vector in the hidden space, . For the embedding, as common, we use the last hidden vector produced by the recurrent network. Note that however, any variations to the embedding (e.g., using other types of recurrent units, adding 1dimensional pooling layers, or attention mechanisms) are all fairly straightforward^{2}^{2}2Details on the particular architectures used in our empirical study are discussed in the next section.. More generally, the recurrent transformation can be random itself (Figure 1), which would enable direct modeling of uncertainty within the recurrent dynamics, but would also require inference for (e.g., as in Mattos et al., 2015). In this study, we limit our consideration of random recurrent maps to only those induced by dropout.
Unfortunately, once the the MSE objective is substituted with NLML, it no longer factorizes over the data. This prevents us from using the wellestablished stochastic optimization techniques for training our recurrent model. In the case of feedforward and convolutional networks, Wilson et al. (2016a) proposed to pretrain the input transformations and then finetune them by jointly optimizing the GP marginal likelihood with respect to hyperparameters and the network weights using fullbatch algorithms. When the transformation is recurrent, stochastic updates play a key role. Therefore, we propose a semistochastic blockgradient optimization procedure which allows minibatching weight updates and fully joint training of the model from scratch.
The negative log marginal likelihood of the Gaussian process has the following form:
(8) 
where () is the Gram kernel matrix, , is computed on and implicitly depends on the base kernel hyperparameters, , and the parameters of the recurrent neural transformation, , denoted and further referred as the transformation hyperparameters. Our goal is to optimize with respect to both and .
The derivative of the NLML objective with respect to is standard and takes the following form (Rasmussen and Williams, 2006):
(9) 
where is depends on the kernel function, , and usually has an analytic form. The derivative with respect to the th transformation hyperparameter, , is as follows:^{3}^{3}3Stepbystep derivations are given in Appendix B.
(10) 
where corresponds to the latent representation of the the th data instance. Once the derivatives are computed, the model can be trained with any firstorder or quasiNewton optimization routine. However, application of the stochastic gradient method—the de facto standard optimization routine for deep recurrent networks—is not straightforward: neither the objective, nor its derivatives factorize over the data^{4}^{4}4Cannot be represented as sums of independent functions of each data point. due to the kernel matrix inverses, and hence convergence is not guaranteed.
Semistochastic alternating gradient descent. Observe that once the kernel matrix, , is fixed, the expression, , can be precomputed on the full data and fixed. Subsequently, Eq. (10) turns into a weighted sum of independent functions of each data point. This observation suggests that, given a fixed kernel matrix, one could compute a stochastic update for on a minibatch of training points by only using the corresponding submatrix of . Hence, we propose to optimize GPs with recurrent kernels in a semistochastic fashion, alternating between updating the kernel hyperparameters, , on the full data first, and then updating the weights of the recurrent network, , using stochastic steps. The procedure is given in Algorithm 1.
Semistochastic alternating gradient descent is a special case of blockstochastic gradient iteration (Xu and Yin, 2015). While the latter splits the variables into arbitrary blocks and applies Gauss–Seidel type stochastic gradient updates to each of them, our procedure alternates between applying deterministic updates to and stochastic updates to of the form and ^{5}^{5}5In principle, stochastic updates of are also possible. As we will see next, we choose in practice to keep the kernel matrix fixed while performing stochastic updates. Due to sensitivity of the kernel to even small changes in , convergence of the fully stochastic scheme is fragile.. The corresponding Algorithm 1 is provably convergent for convex and nonconvex problems under certain conditions. The following theorem adapts results of Xu and Yin (2015) to our optimization scheme.
[informal] Semistochastic alternating gradient descent converges to a fixed point when the learning rate, , decays as for any .
Applying alternating gradient to our case has a catch: the kernel matrix (and its inverse) has to be updated each time and are changed, i.e., on every minibatch iteration (marked red in Algorithm 1). Computationally, this updating strategy defeats the purpose of stochastic gradients because we have to use the entire data on each step. To deal with the issue of computational efficiency, we use ideas from asynchronous optimization.
Asynchronous techniques. One of the recent trends in parallel and distributed optimization is applying updates in an asynchronous fashion (Agarwal and Duchi, 2011). Such strategies naturally require some tolerance to delays in parameter updates (Langford et al., 2009). In our case, we modify Algorithm 1 to allow delayed kernel matrix updates.
The key observation is very intuitive: when the stochastic updates of are small enough, does not change much between minibatches, and hence we can perform multiple stochastic steps for before recomputing the kernel matrix, , and still converge. For example, may be updated once at the end of each pass through the entire data (see Algorithm 2). To ensure convergence of the algorithm, it is important to strike the balance between (a) the learning rate for and (b) the frequency of the kernel matrix updates. The following theorem provides convergence results under certain conditions.
[informal] Semistochastic gradient descent with
delayed kernel updates converges to a fixed point when the learning rate, , decays as for any .Why stochastic optimization? GPs with recurrent kernels can be also trained with fullbatch gradient descent, as proposed by Wilson et al. (2016a). However, stochastic gradient methods have been proved to attain better generalization (Hardt et al., 2016) and often demonstrate superior performance in deep and recurrent architectures (Wilson and Martinez, 2003). Moreover, stochastic methods are ‘online’, i.e., they update model parameters based on subsets of an incoming data stream, and hence can scale to very large datasets. In our experiments, we demonstrate that GPs with recurrent kernels trained with Algorithm 2 converge faster (i.e., require fewer passes through the data) and attain better performance than if trained with fullbatch techniques.
Stochastic variational inference. Stochastic variational inference (SVI) in Gaussian processes (Hensman et al., 2013) is another viable approach to enabling stochastic optimization for GPs with recurrent kernels. Such method would optimize a variational lower bound on the original objective that factorizes over the data by construction. Recently, Wilson et al. (2016b) developed such a stochastic variational approach in the context of deep kernel learning. Note that unlike all previous existing work, our proposed approach does not require a variational approximation to the marginal likelihood to perform minibatch training of Gaussian processes.
Learning and inference with Gaussian processes requires solving a linear system involving an kernel matrix, , and computing a log determinant over . These operations typically require computations for training data points, and storage. In our approach, scalability is achieved through semistochastic training and structureexploiting inference. In particular, asynchronous semistochastic gradient descent reduces both the total number of passes through the data required for the model to converge and the number of calls to the linear system solver; exploiting the structure of the kernels significantly reduces the time and memory complexities of the linear algebraic operations.
More precisely, we replace all instances of the covariance matrix with , where
is a sparse interpolation matrix, and
is the covariance matrix evaluated over latent inducing points, which decomposes into a Kronecker product of circulant matrices (Wilson and Nickisch, 2015; Wilson et al., 2015). This construction makes inference and learning scale as and test predictions be , while preserving model structure. For the sake of completeness, we provide an overview of the underlying algebraic machinery in Appendix A.At a high level, because is sparse and is structured it is possible to take extremely fast matrix vector multiplications (MVMs) with the approximate covariance matrix
. One can then use methods such as linear conjugate gradients, which only use MVMs, to efficiently solve linear systems. MVM or scaled eigenvalue approaches
(Wilson and Nickisch, 2015; Wilson et al., 2015) can also be used to efficiently compute the log determinant and its derivatives. Kernel interpolation (Wilson et al., 2015) also enables fast predictions, as we describe further in the Appendix.Dataset  Task  # time steps  # dim  # outputs  Abs. corr.  SNR 
Drives  system ident.  500  1  1  0.7994  25.72 
Actuator  1,024  1  1  0.0938  12.47  
GEF  power load  38,064  11  1  0.5147  89.93 
wind power  130,963  16  1  0.1731  4.06  
Car  speed  932,939  6  1  0.1196  159.33 
gyro yaw  6  1  0.0764  3.19  
lanes  26  16  0.0816  —  
lead vehicle  9  2  0.1099  — 
Statistics for the data used in experiments. SNR was determined by assuming a certain degree of smoothness of the signal, fitting kernel ridge regression with RBF kernel to predict the targets from the input time series, and regarding the residuals as the noise. Tasks with low average correlation between inputs and targets and lower SNR are harder prediction problems.
We compare the proposed Gaussian processes with recurrent kernels based on RNN and LSTM architectures (GPRNN/LSTM) with a number of baselines on datasets of various complexity and ranging in size from hundreds to almost a million of time points. For the datasets with more than a few thousand points, we use a massively scalable version of GPs (see Section 4.2) and demonstrate its scalability during inference and learning. We carry out a number of experiments that help to gain empirical insights about the convergence properties of the proposed optimization procedure with delayed kernel updates. Additionally, we analyze the regularization properties of GPRNN/LSTM and compare them with other techniques, such as dropout. Finally, we apply the model to the problem of lane estimation and lead vehicle position prediction, both critical in autonomous driving applications.
Below, we describe each dataset we used in our experiments and the associated prediction tasks. The essential statistics for the datasets are summarized in Table 1.
System identification. In the first set of experiments, we used publicly available nonlinear system identification datasets: Actuator^{6}^{6}6http://www.iau.dtu.dk/nnbook/systems.html (Sjöberg et al., 1995) and Drives^{7}^{7}7http://www.it.uu.se/research/publications/reports/2010020/NonlinearData.zip. (Wigren, 2010). Both datasets had one dimensional input and output time series. Actuator had the size of the valve opening as the input and the resulting change in oil pressure as the output. Drives was from a system with motors that drive a pulley using a flexible belt; the input was the sum of voltages applied to the motors and the output was the speed of the belt.
Smart grid data^{8}^{8}8The smart grid data were taken from Global Energy Forecasting Kaggle competitions organized in 2012.. We considered the problem of forecasting for the smart grid that consisted of two tasks (Figure 2). The first task was to predict power load from the historical temperature data. The data had 11 input time series coming from hourly measurements of temperature on 11 zones and an output time series that represented the cumulative hourly power load on a U.S. utility. The second task was to predict power generated by wind farms from the wind forecasts. The data consisted of 4 different hourly forecasts of the wind and hourly values of the generated power by a wind farm. Each wind forecast was a 4element vector that corresponded to zonal component, meridional component, wind speed and wind angle. In our experiments, we concatenated the 4 different 4element forecasts, which resulted in a 16dimensional input time series.
Selfdriving car dataset^{9}^{9}9The dataset is proprietary. It was released in part for public use under the Creative Commons Attribution 3.0 license: http://archive.org/details/commadataset. More about the selfdriving car: http://www.bloomberg.com/features/2015georgehotzselfdrivingcar/. . One of the main target applications of the proposed model is prediction for autonomous driving. We considered a large dataset coming from sensors of a selfdriving car that was recorded on two car trips with discretization of . The data featured two sets of GPS ECEF locations, ECEF velocities, measurements from a fiberoptic gyro compass, LIDAR, and a few more time series from a variety of IMU sensors. Additionally, locations of the left and right lanes were extracted from a video stream for each time step as well as the position of the lead vehicle from the LIDAR measurements. We considered the data from the first trip for training and from the second trip for validation and testing. A visualization of the car routes with 25 second discretization in the ENU coordinates are given in Figure 3. We consider four tasks, the first two of which are more of proofofconcept type variety, while the final two are fundamental to good performance for a selfdriving car:
[itemsep=1pt, topsep=1pt]
Speed prediction from noisy GPS velocity estimates and gyroscopic inputs.
Prediction of the angular acceleration of the car from the estimates of its speed and steering angle.
Pointwise prediction of the lanes from the estimates at the previous time steps, and estimates of speed, gyroscopic and compass measurements.
Prediction of the lead vehicle location from its location at the previous time steps, and estimates of speed, gyroscopic and compass measurements.
We provide more specific details on the smart grid data and selfdriving data in Appendix D.
Models and metrics. We used a number of classical baselines: NARX (Lin et al., 1996), GPNARX models (Kocijan et al., 2005), and classical RNN and LSTM architectures. The kernels of our models, GPNARX/RNN/LSTM, used the ARD base kernel function and were structured by the corresponding baselines.^{10}^{10}10GPNARX is a special instance of our more general framework and we trained it using the proposed semistochastic algorithm. As the primary metric, we used root mean squared error (RMSE) on a held out set and additionally negative log marginal likelihood (NLML) on the training set for the GPbased models.
We train all the models to perform onestepahead prediction in an autoregressive setting, where targets at the future time steps are predicted from the input and target values at a fixed number of past time steps. For the system identification task, we additionally consider the nonautoregressive scenario (i.e., mapping only input sequences to the future targets), where we are performing prediction in the free simulation mode, and included recurrent Gaussian processes (Mattos et al., 2015) in our comparison. In this case, none of the future targets are available and the models have to reuse their own past predictions to produce future forecasts).
A note on implementation. Recurrent parts of each model were implemented using Keras^{11}^{11}11http://www.keras.io library. We extended Keras with the Gaussian process layer and developed a backed engine based on the GPML library^{12}^{12}12http://www.gaussianprocess.org/gpml/code/matlab/doc/. Our approach allows us to take full advantage of the functionality available in Keras and GPML, e.g., use automatic differentiation for the recurrent part of the model. Our code is available at http://github.com/alshedivat/kgp/.
This section discusses quantitative and qualitative experimental results. We only briefly introduce the model architectures and the training schemes used in each of the experiments. We provide a comprehensive summary of these details in Appendix E.
To address the question of whether stochastic optimization of recurrent kernels is necessary and to assess the behavior of the proposed optimization scheme with delayed kernel updates, we conducted a number of experiments on the Actuator dataset (Figure 4).
First, we constructed two GPLSTM models with 1 recurrent hidden layer and 16 or 64 hidden units and trained them with (nonstochastic) fullbatch iterative procedure (similar to the proposal of Wilson et al. (2016a)) and with our semistochastic optimizer with delayed kernel updates (Algorithm 2). The convergence results are given on the first two charts. Both in terms of the error on a held out set and the NLML on the training set, the models trained with minibatches converged faster and demonstrated better final performance.
Next, we compared the two optimization schemes on the same GPLSTM architecture with different sizes of the hidden layer ranging from 2 to 32. It is clear from the third chart that, even though fullbatch approach seemed to find a better optimum when the number of hidden units was small, the stochastic approach was clearly superior for larger hidden layers.
Finally, we compared the behavior of Algorithm 2
with different number of minibatches used for each epoch (equivalently, the number of steps between the kernel matrix updates) and different learning rates. The results are give on the last chart. As expected, there is a fine balance between the number of minibatches and the learning rate: if the number of minibatches is large (
i.e., the delay between the kernel updates becomes too long) while the learning rate is high enough, optimization does not converge; at the same time, an appropriate combination of the learning rate and the minibatch size leads better generalization than the default batch approach of Wilson et al. (2016a).In this set of experiments, our main goal is to provide a comparison between three different modes of onestepahead prediction, referred to as (i) regression, (ii) autoregression, and (iii) free simulation, and compare performance of our models with RGP—a classical RNN with every parametric layer substituted with a Gaussian process (Mattos et al., 2015)—on the Actuator and Drives datasets. The difference between the prediction modes consists in whether and how the information about the past targets is used. In the regression scenario, inputs and targets are separate time series and the model learns to map input values at a number of past time points to a target value at a future point in time. Autoregression, additionally, uses the true past target values as inputs; in the free simulation mode, the model learns to map past inputs and its own past predictions to a future target.
In the experiments in autoregression and free simulation modes, we used short time lags, , as suggested by Mattos et al. (2015). In the regression mode, since the model does not build the recurrent relationships based on the information about the targets (or their estimates), it generally requires larger time lags that can capture the state of the dynamics. Hence we increased the time lag to 32 in the regression mode. More details are given in Appendix E.
Drives  Actuator  
regression  autoregression  free simulation  regression  autoregression  free simulation  
NARX  0.33 (0.02)  0.19 (0.03)  0.38 (0.03)  0.49 (0.05)  0.18 (0.01)  0.57 (0.04) 
RNN  0.53 (0.02)  0.17 (0.04)  0.56 (0.03)  0.56 (0.03)  0.17 (0.01)  0.68 (0.05) 
LSTM  0.29 (0.02)  0.14 (0.02)  0.40 (0.02)  0.40 (0.03)  0.19 (0.01)  0.44 (0.03) 
GPNARX  0.28 (0.02)  0.16 (0.04)  0.28 (0.02)  0.46 (0.03)  0.14 (0.01)  0.63 (0.04) 
GPRNN  0.37 (0.04)  0.16 (0.03)  0.45 (0.03)  0.49 (0.02)  0.15 (0.01)  0.55 (0.04) 
GPLSTM  0.25 (0.02)  0.13 (0.02)  0.32 (0.03)  0.36 (0.01)  0.14 (0.01)  0.43 (0.03) 
RGP  —  —  0.249  —  —  0.368 
Average performance of the models in terms of RMSE on the system identification tasks. The averages were computed over 5 runs; the standard deviation is given in the parenthesis. Results for the RGP model are as reported by
Mattos et al. (2015), available only for the free simulation.We present the results in Table 2
. We note that GPbased architectures consistently yielded improved predictive performance compared to their vanilla deep learning counterparts on both of the datasets, in each mode. Given the small size of the datasets, we attribute such behavior to better regularization properties of the negative log marginal likelihood loss function. We also found out that when GPbased models were initialized with weights of pretrained neural networks, they tended to overfit and give overly confident predictions on these tasks. The best performance was achieved when the models were trained from a random initialization
(contrary to the findings of Wilson et al., 2016a). In free simulation mode RGP performs best of the compared models. This result is expected—RGP was particularly designed to represent and propagate uncertainty through a recurrent process. Our framework focuses on using recurrence to build expressive kernels for regression on sequences.The suitability of each prediction mode depends on the task at hand. In many applications where the future targets become readily available as the time passes (e.g., power estimation or stock market prediction), the autoregression mode is preferable. We particularly consider autoregressive prediction in the further experiments.
For both smart grid prediction tasks we used LSTM and GPLSTM models with 48 hour time lags and were predicting the target values one hour ahead. LSTM and GPLSTM were trained with one or two layers and 32 to 256 hidden units. The best models were selected on 25% of the training data used for validation. For autonomous driving prediction tasks, we used the same architectures but with 128 time steps of lag (1.28 ). All models were regularized with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016b). On both GEF and selfdriving car datasets, we used the scalable version of Gaussian process (MSGP) (Wilson et al., 2015). Given the scale of the data and the challenge of nonlinear optimization of the recurrent models, we initialized the recurrent parts of GPRNN and GPLSTM with pretrained weights of the corresponding neural networks. Finetuning of the models was performed with Algorithm 2. The quantitative results are provided in Table 3 and demonstrate that GPs with recurrent kernels attain the stateoftheart performance.
GEF  Car  
power load  wind power  speed  gyro yaw  lanes  lead vehicle  
NARX  0.54 (0.02)  0.84 (0.01)  0.114 (0.010)  0.19 (0.01)  0.13 (0.01)  0.41 (0.02) 
RNN  0.61 (0.02)  0.81 (0.01)  0.152 (0.012)  0.22 (0.01)  0.33 (0.02)  0.44 (0.03) 
LSTM  0.45 (0.01)  0.77 (0.01)  0.027 (0.008)  0.13 (0.01)  0.08 (0.01)  0.40 (0.01) 
GPNARX  0.78 (0.03)  0.83 (0.02)  0.125 (0.015)  0.23 (0.02)  0.10 (0.01)  0.34 (0.02) 
GPRNN  0.24 (0.02)  0.79 (0.01)  0.089 (0.013)  0.24 (0.01)  0.46 (0.08)  0.41 (0.02) 
GPLSTM  0.17 (0.02)  0.76 (0.01)  0.019 (0.006)  0.08 (0.01)  0.06 (0.01)  0.32 (0.02) 


Additionally, we investigated convergence and regularization properties of LSTM and GPLSTM models on the GEFpower dataset. The first two charts of Figure 6 demonstrate that GPbased models are less prone to overfitting, even when the data is not enough. The third panel shows that architectures with a particular number of hidden units per layer attain the best performance on the power prediction task. An additional advantage of the GPlayers over the standard recurrent networks is that the best architecture could be identified based on the negative log likelihood of the model as shown on the last chart.
Finally, Figure 5 qualitatively demonstrates the difference between the predictions given by LSTM vs. GPLSTM on pointwise lane estimation (Figure 4(a)) and the front vehicle tracking (Figure 4(b)) tasks. We note that GPLSTM not only provides a more robust fit, but also estimates the uncertainty of its predictions. Such information can be further used in downstream predictionbased decision making, e.g., such as whether a selfdriving car should slow down and switch to a more cautious driving style when the uncertainty is high.
Following Wilson et al. (2015), we performed a generic scalability analysis of the MSGPLSTM model on the car sensors data. The LSTM architecture was the same as described in the previous section: it was transforming multidimensional sequences of inputs to a twodimensional representation. We trained the model for 10 epochs on 10%, 20%, 40%, and 80% of the training set with 100, 200, and 400 inducing points per dimension and measured the average training time per epoch and the average prediction time per testing point. The measured time was the total time spent on both LSTM optimization and MSGP computations. The results are presented in Figure 7.
The training time per epoch (one full pass through the entire training data) grows linearly with the number of training examples and depends linearly on the number of inducing points (Figure 7, two left charts). Thus, given a fixed number of inducing points per dimension, the time complexity of MSGPLSTM learning and inference procedures is linear in the number of training examples. The prediction time per testing data point is virtually constant and does not depend on neither on the number of training points, nor on the number of inducing points (Figure 7, two right charts).
We proposed a method for learning kernels with recurrent long shortterm memory structure on sequences. Gaussian processes with such kernels, termed the GPLSTM, have the structure and learning biases of LSTMs, while retaining a probabilistic Bayesian nonparametric representation. The GPLSTM outperforms a range of alternatives on several sequencetoreals regression tasks. The GPLSTM also works on data with low and high signaltonoise ratios, and can be scaled to very large datasets, all with a straightforward, practical, and generally applicable model specification. Moreover, the semistochastic scheme proposed in our paper is provably convergent and efficient in practical settings, in conjunction with structure exploiting algebra. In short, the GPLSTM provides a natural mechanism for Bayesian LSTMs, quantifying predictive uncertainty while harmonizing with the standard deep learning toolbox. Predictive uncertainty is of high value in robotics applications, such as autonomous driving, and could also be applied to other areas such as financial modeling and computational biology.
There are several exciting directions for future research. The GPLSTM quantifies predictive uncertainty but does not model the propagation of uncertainty in the inputs through a recurrent structure. Treating free simulation as a structured prediction problem and using online corrective algorithms, e.g., DAGGER (Ross et al., 2011), are likely to improve performance of GPLSTM in the free prediction mode. This approach would not require explicitly modeling and propagating uncertainty through the recurrence and would maintain the high computational efficiency of our method.
Alternatively, it would be exciting to have a probabilistic treatment of all parameters of the GPLSTM kernel, including all LSTM weights. Such an extension could be combined with stochastic variational inference, to enable both classification and nonGaussian likelihoods as in Wilson et al. (2016b), but also open the doors to stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014) (SGHMC) for efficient inference over kernel parameters. Indeed, SGHMC has recently been used for efficient inference over network parameters in the Bayesian GAN (Saatchi and Wilson, 2017). A Bayesian approach to marginalizing the weights of the GPLSTM kernel would also provide a principled probabilistic mechanism for learning model hyperparameters.
One could relax several additional assumptions. We modeled each output dimension with independent GPs that shared a recurrent transformation. To capture the correlations between output dimensions, it would be promising to move to a multitask formulation. In the future, one could also learn the time horizon in the recurrent transformation, which could lead to major additional performance gains.
Finally, the semistochastic learning procedure naturally complements research in asynchronous optimization (e.g., Deisenroth and Ng, 2015). In combination with stochastic variational inference, the semistochastic approach could be used for parallel kernel learning, sidestepping the independence assumptions in prior work. We envision that such efforts for Gaussian processes will harmonize with current progress in Bayesian deep learning.
The authors thank Yifei Ma for helpful discussions and the anonymous reviewers for the valuable comments that helped to improve the paper. This work was supported in part by NIH R01GM114311, AFRL/DARPA FA87501220324, and NSF IIS1563887.
Massively scalable Gaussian processes (MSGP) (Wilson et al., 2015) is a significant extension of the kernel interpolation framework originally proposed by Wilson and Nickisch (2015). The core idea of the framework is to improve scalability of the inducing point methods (QuinoneroCandela and Rasmussen, 2005) by (1) placing the virtual points on a regular grid, (2) exploiting the resulting Kronecker and Toeplitz structures of the relevant covariance matrices, and (3) do local cubic interpolation to go back to the kernel evaluated at the original points. This combination of techniques brings the complexity down to for training and for each test prediction. Below, we overview the methodology. We remark that a major difference in philosophy between MSGP and many classical inducing point methods is that the points are selected and fixed rather than optimized over. This allows to use significantly more virtual points which typically results in a better approximation of the true kernel.
Given a set of inducing points, the crosscovariance matrix, , between the training inputs, , and the inducing points, , can be approximated as using a (potentially sparse) matrix of interpolation weights, . This allows to approximate for an arbitrary set of inputs as . For any given kernel function, , and a set of inducing points, , structured kernel interpolation (SKI) procedure (Wilson and Nickisch, 2015) gives rise to the following approximate kernel:
(A.1) 
which allows to approximate . Wilson and Nickisch (2015) note that standard inducing point approaches, such as subset of regression (SoR) or fully independent training conditional (FITC), can be reinterpreted from the SKI perspective. Importantly, the efficiency of SKIbased MSGP methods comes from, first, a clever choice of a set of inducing points that allows to exploit algebraic structure of , and second, from using very sparse local interpolation matrices. In practice, local cubic interpolation is used (Keys, 1981).
If inducing points, , form a regularly spaced dimensional grid, and we use a stationary product kernel (e.g., the RBF kernel), then decomposes as a Kronecker product of Toeplitz matrices:
(A.2) 
The Kronecker structure allows to compute the eigendecomposition of by separately decomposing , each of which is much smaller than . Further, to efficiently eigendecompose a Toeplitz matrix, it can be approximated by a circulant matrix^{13}^{13}13Wilson et al. (2015) explored 5 different approximation methods known in the numerical analysis literature.
which eigendecomposes by simply applying discrete Fourier transform (DFT) to its first column. Therefore, an approximate eigendecomposition of each
is computed via the fast Fourier transform (FFT) and requires only time.To perform inference, we need to solve ; kernel learning requires evaluating . The first task can be accomplished by using an iterative scheme—linear conjugate gradients—which depends only on matrix vector multiplications with . The second is done by exploiting the Kronecker and Toeplitz structure of for computing an approximate eigendecomposition, as described above.
To achieve constant time prediction, we approximate the latent mean and variance of
by applying the same SKI technique. In particular, for a set of testing points, , we have(A.3)  
where and , and and are and sparse interpolation matrices, respectively. Since is precomputed at training time, at test time, we only multiply the latter with matrix which results which costs operations leading to operations per test point. Similarly, approximate predictive variance can be also estimated in operations (Wilson et al., 2015).
Note that the fast prediction methodology can be readily applied to any trained Gaussian process model as it is agnostic to the way inference and learning were performed.
GPs with deep recurrent kernels are trained by minimizing the negative log marginal likelihood objective function. Below we derive the update rules.
By applying the chain rule, we get the following first order derivatives:
(B.4) 
The derivative of the log marginal likelihood w.r.t. to the kernel hyperparameters, , and the parameters of the recurrent map, , are generic and take the following form (Rasmussen and Williams, 2006, Ch. 5, Eq. 5.9):
(B.5)  
(B.6) 
The derivative is also standard and depends on the form of a particular chosen kernel function, . However, computing each part of is a bit subtle, and hence we elaborate these derivations below.
Consider the th entry of the kernel matrix, . We can think of as a matrixvalued function of all the data vectors in dimensional transformed space which we denote by . Then is a scalarvalued function of and its derivative w.r.t. the th parameter of the recurrent map, , can be written as follows:
(B.7) 
Notice that is a derivative of a scalar w.r.t. to a matrix and hence is a matrix; is a derivative of a matrix w.r.t. to a scalar which is taken elementwise and also gives a matrix. Also notice that is a function of , but it only depends the th and th elements for which the kernel is being computed. This means that will have only nonzero th row and th column and allows us to rewrite (B.7) as follows:
(B.8)  
Since the kernel function has two arguments, the derivatives must be taken with respect of each of them and evaluated at the corresponding points in the hidden space, and . When we plug this into (B.6), we arrive at the following expression:
(B.9) 
The same expression can be written in a more compact form using the Einstein notation:
(B.10) 
where indexes the dimensions of the and indexes the dimensions of .
In practice, deriving a computationally efficient analytical form of might be too complicated for some kernels (e.g., the spectral mixture kernels (Wilson and Adams, 2013)), especially if the gridbased approximations of the kernel are enabled. In such cases, we can simply use a finite difference approximation of this derivative. As we remark in the following section, numerical errors that result from this approximation do not affect convergence of the algorithm.
Convergence results for the semistochastic alternating gradient schemes with and without delayed kernel matrix updates are based on (Xu and Yin, 2015). There are a few notable differences between the original setting and the one considered in this paper:
Xu and Yin (2015) consider a stochastic program that minimizes the expectation of the objective w.r.t. some distribution underlying the data:
(C.11) 
where every iteration a new is sampled from the underlying distribution. In our case, the goal is to minimize the negative log marginal likelihood on a particular given dataset. This is equivalent to the original formulation (C.11), but with the expectation taken w.r.t. the empirical distribution that corresponds to the given dataset.
The optimization procedure of Xu and Yin (2015) has access to only a single random point generated from the data distribution at each step. Our algorithm requires having access to the entire training data each time the kernel matrix is computed.
For a given sample, Xu and Yin (2015) propose to loop over a number of coordinate blocks and apply Gauss–Seidel type gradient updates to each block. Our semistochastic scheme has only two parameter blocks, and , where is updated deterministically on the entire dataset while is updated with stochastic gradient on samples from the empirical distribution.
Noting these differences, we first adapt convergence results for the smooth nonconvex case (Xu and Yin, 2015, Theorem 2.10) to our scenario, and then consider the variant with delaying kernel matrix updates.
As shown in Algorithm 1, we alternate between updating and . At step , we get a minibatch of size , , which is just a selection of points from the full set, . Define the gradient errors for and at step as follows:
(C.12) 
where and are the true gradients and and are estimates^{14}^{14}14Here, we consider the gradients and their estimates scaled by the number of full data points, , and the minibatch size, , respectively. These constant scaling is introduced for sake of having cleaner proofs.. We first update and then , and hence the expressions for the gradients take the following form:
(C.13)  
(C.14)  
(C.15) 
Note that as we have shown in Appendix B, when the kernel matrix is fixed, and factorize over and , respectively. Further, we denote all the minibatches sampled before as .
For any step , . First, , and hence is trivial. Next, by definition of , we have . Consider the following:
Consider : is a deterministic function of and . is being updated deterministically using , and hence only depends on and . Therefore, it is independent of , which means that .
Now, consider : Noting that the expectation is taken w.r.t. the empirical distribution and does not depend on the current minibatch, we can write:
(C.16)  
Finally, .
In other words, semistochastic gradient descent that alternates between updating and
computes unbiased estimates of the gradients on each step.
Note that in case if is computed approximately, Lemma C.1 still holds since both and will contain exactly the same numerical errors.
For any step , and .
Lemma C.1 and Assumption C.1 result into a stronger condition than the original assumption given by Xu and Yin (2015). This is due to the semistochastic nature of the algorithm, it simplifies the analysis, though it is not critical. Assumptions C.1 and C.1 are straightforwardly adapted from the original paper.
The objective function, , is lower bounded and its partial derivatives w.r.t. and are uniformly Lipschitz with constant :
Comments
There are no comments yet.