Warped Input Gaussian Processes for Time Series Forecasting

12/05/2019 · David Tolpin et al.

We introduce a Gaussian process-based model for handling non-stationarity. The warping is achieved non-parametrically, by imposing a prior on the relative change of distance between subsequent observation inputs. The model allows the use of general gradient optimization algorithms for training and incurs only a small computational overhead on training and prediction. The model finds its application in forecasting in non-stationary time series with gradually varying volatility, change points, or a combination thereof. We evaluate the model on synthetic and real-world time series data, comparing against both baseline and known state-of-the-art approaches, and show that the model exhibits state-of-the-art forecasting performance at a lower implementation and computation cost.







1 Introduction

Gaussian processes [24] possess properties that make them the approach of choice in time series forecasting:

  • A Gaussian process works with as little or as much data as available.

  • Non-uniformly sampled observations, missing observations, and observation noise are handled organically.

  • Uncertainty of a future observation is predicted along with the mean.

In a basic setting though, a Gaussian process models a stationary time series with homoscedastic noise. When either the covariance between observations or the noise varies with observation inputs or outputs, predictions produced by a Gaussian process with a stationary kernel and constant noise variance will be either biased or overly uncertain; handling non-stationarity and heteroscedasticity is therefore crucial in many applications. Non-stationarity often arises in financial time series, where market volatility, which affects forecasting uncertainty, changes with time [37, 21]; heteroscedastic noise is common in vital signs monitoring of patients in intensive care units, where the noise depends on patient activity and medical interventions [6]; both non-stationarity and heteroscedasticity are characteristic of time series of sensor readings in mobile robotics [15].

Various approaches have been proposed for handling non-stationarity with Gaussian processes. When a time series is piecewise stationary, change point detection is an appropriate model, with a stationary homoscedastic Gaussian process modelling the stretches between consecutive change points [10, 26, 4]. In cases where the covariance or noise changes gradually and smoothly, it is common to introduce a non-parametric dependency of kernel and noise parameters on the inputs [12, 22, 16, 31, 15, 36]. However, this makes structural modelling of time series, an important advantage of Gaussian processes that facilitates the introduction of prior knowledge into the model, more challenging.

Another popular way to handle both abrupt and gradual changes in time series is through mapping of the input space [28, 18, 8, 32, 2, 19, 3]. Covariance between observations depends on observation inputs as well as on kernel parameters, and non-stationarity can be modelled by smoothly modifying observation inputs (warping the input space). Several methods have been proposed to learn the input space transformation, and a number of other approaches can be viewed as employing input space warping to handle non-stationarity. However, many such approaches either meet difficulties in practical application or require elaborate inference algorithms [33, 2, 27, 7], which may compromise the simplicity of use of Gaussian processes.

In this work we introduce a model for non-parametric warping of the input space of a Gaussian process, suitable in particular for time series forecasting but also applicable to other domains. The model is easy to implement, imposes only a small computational overhead on training and prediction, and allows the whole arsenal of Gaussian process kernels to be used for modelling time series structure with prior knowledge. We provide a reference implementation of the model and evaluate it on synthetic and real-world data, comparing forecasting performance with both baseline and state-of-the-art approaches, and show that the model exhibits state-of-the-art forecasting performance at a lower implementation and computation cost.

This work brings the following contributions:

  • A novel approach to handling non-stationarity in Gaussian processes.

  • A Gaussian process model for forecasting in non-stationary time series.

  • A reference implementation of the model within a probabilistic programming framework.

2 Preliminaries

A Gaussian process is a collection of random variables, any finite number of which have a (consistent) joint Gaussian distribution. A Gaussian process is fully specified by its mean function $m(x)$ and its covariance, or kernel, function $k(x, x')$, and defines a distribution over functions. The mean function is often set to zero, $m(x) = 0$. A Gaussian process defines a distribution over functions:

$$f \sim \mathcal{GP}(m, k). \qquad (1)$$

Any finite set of values $\boldsymbol{f} = (f(x_1), \dots, f(x_n))$ of the function at inputs $\boldsymbol{x} = (x_1, \dots, x_n)$ follows the multivariate normal distribution

$$\boldsymbol{f} \sim \mathcal{N}(m(\boldsymbol{x}), K(\boldsymbol{x}, \boldsymbol{x}))$$

with mean $m(\boldsymbol{x})$ and covariance matrix $K_{ij} = k(x_i, x_j)$.

Posterior inference in Gaussian processes can be performed analytically. Let $\boldsymbol{y}$ be the observations at inputs $\boldsymbol{x}$. Then the posterior distribution of values $\boldsymbol{f}^*$ at inputs $\boldsymbol{x}^*$ is

$$\boldsymbol{f}^* | \boldsymbol{x}^*, \boldsymbol{x}, \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma),$$
$$\boldsymbol{\mu} = K(\boldsymbol{x}^*, \boldsymbol{x}) K(\boldsymbol{x}, \boldsymbol{x})^{-1} \boldsymbol{y},$$
$$\Sigma = K(\boldsymbol{x}^*, \boldsymbol{x}^*) - K(\boldsymbol{x}^*, \boldsymbol{x}) K(\boldsymbol{x}, \boldsymbol{x})^{-1} K(\boldsymbol{x}, \boldsymbol{x}^*), \qquad (2)$$

where $K(\boldsymbol{x}^*, \boldsymbol{x})$ is the covariance matrix between $\boldsymbol{x}^*$ and $\boldsymbol{x}$.
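The posterior equations above can be computed with a few lines of linear algebra. The following is a minimal sketch (not the paper's Go reference implementation), using an RBF kernel and a small noise term for numerical stability:

```python
import numpy as np

def rbf(a, b, l=1.0):
    # RBF kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)) for 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * l**2))

def gp_posterior(x, y, x_star, l=1.0, noise=1e-6):
    # Posterior mean and covariance at x_star given observations (x, y),
    # following mu = K(x*, x) K(x, x)^{-1} y and the matching covariance
    K = rbf(x, x, l) + noise * np.eye(len(x))
    K_s = rbf(x_star, x, l)
    K_ss = rbf(x_star, x_star, l)
    mu = K_s @ np.linalg.solve(K, y)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mu, cov

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
mu, cov = gp_posterior(x, y, np.array([1.0]))
# at an observed input, the posterior mean reproduces the observation
# and the posterior variance is close to zero
```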

Kernel functions normally have hyperparameters; we shall write $k(\cdot, \cdot; \theta)$ to denote that the kernel function has hyperparameter $\theta$, possibly multidimensional, or omit the hyperparameters when they are clear from the context. Training a Gaussian process involves choosing $\theta$ based on the observations. For example, the Gaussian, or RBF, kernel has the form

$$k(x, x'; l) = \exp\left(-\frac{(x - x')^2}{2 l^2}\right) \qquad (3)$$

and is parameterized by a single hyperparameter, the length scale $l$.

A straightforward way to choose $\theta$ is to maximize the log marginal likelihood $L$ of the observations $\boldsymbol{y}$:

$$\log L = -\frac{1}{2} \boldsymbol{y}^\top K^{-1} \boldsymbol{y} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi, \qquad (4)$$

where $n$ is the number of observations. There is no closed-form solution for maximizing $\log L$ in general; however, gradient-based optimization methods allow an approximate solution to be obtained efficiently.
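The log marginal likelihood is cheap to sketch directly. In the sketch below, a coarse grid search over the length scale stands in for the gradient-based optimization mentioned in the text (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def log_marginal_likelihood(x, y, l, noise=1e-2):
    # log L = -1/2 y^T K^{-1} y - 1/2 log|K| - (n/2) log(2 pi),
    # with an RBF kernel of length scale l plus homoscedastic noise
    n = len(x)
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * l**2)) + noise * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))

x = np.linspace(0.0, 4.0, 20)
y = np.sin(x)
# a smooth function favours a moderate length scale over a tiny one
grid = np.linspace(0.2, 3.0, 30)
best_l = max(grid, key=lambda l: log_marginal_likelihood(x, y, l))
```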

3 Warped Input Gaussian Process Model

A major advantage of Gaussian process regression in general, and in application to time series in particular, is the explicit inclusion of uncertainty in the model: both the mean and the variance are predicted at unobserved inputs. However, perhaps somewhat counterintuitively, the variance, given the kernel and the kernel's hyperparameters, does not depend on the observed outputs. Indeed, the covariance matrix in equation (2) does not depend on $\boldsymbol{y}$.

One way to circumvent this limitation of Gaussian processes is to introduce non-stationarity into the kernel function, such that the covariance depends both on the distance between inputs and on the inputs themselves. In some kernels, such as the dot product kernel $k(x, x') = x \cdot x'$, non-stationarity is fixed in the kernel design. In other kernels, non-stationarity comes through a dependency of kernel hyperparameters on the inputs, and the dependency itself can be learned from data [11, 22, 19]. Related to varying kernel hyperparameters with inputs is the idea of warping the input space [28]. A stationary kernel depends on both the distance between inputs and the hyperparameters. Consider, for example, the RBF kernel (3). Increasing the hyperparameter $l$, customarily called the length scale, has the same effect on the covariance as decreasing the distance between $x$ and $x'$ by the same relative amount. Moving points away from each other effectively decreases the length scale and the covariance between the points. Warping of the input space has an intuitive interpretation for time series: time runs faster in areas with high output volatility and slower where the output is stable.
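The equivalence between scaling the length scale and scaling distances, and the built-in non-stationarity of the dot product kernel, are both easy to check numerically (a small sketch; the `rbf` helper is introduced here for illustration):

```python
import numpy as np

def rbf(x, x2, l=1.0):
    # the RBF kernel: depends only on the distance and the length scale
    return np.exp(-(x - x2) ** 2 / (2 * l**2))

# halving both the distance and the length scale leaves the covariance unchanged
a = rbf(0.0, 1.0, l=1.0)
b = rbf(0.0, 0.5, l=0.5)

# the dot product kernel k(x, x') = x * x' is non-stationary by design:
# the prior variance k(x, x) = x^2 grows with the input
v1 = 1.0 * 1.0
v2 = 2.0 * 2.0
```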

A research problem addressed by different warping methods is how the warping is represented and what objective should be maximized to learn the optimal warping for a given problem instance. In what follows, we introduce warping of the input space of a one-dimensional Gaussian process by imposing a prior on the distances between adjacent inputs. We train the process by maximizing the combined log marginal likelihood of the observations under the prior and under the Gaussian process. The model is trivially extended to a multi-dimensional Gaussian process where only a single dimension is warped, such as in the case of a time series with multiple predictors where only the time needs to be warped to account for temporal non-stationarity.

3.1 Model

In a Gaussian process model that handles non-stationarity through displacement of observation inputs, a key choice is the form of the prior imposed on the inputs. One option is to impose a Gaussian process prior on the inputs. This is a rich prior allowing complex structured non-stationarity to be modelled; deep Gaussian processes [7] are a realization of such a prior. However, inference in the presence of such a prior requires special techniques and is computationally expensive. On the other extreme is imposing an independent Gaussian prior on each input, which is related to the modelling of input uncertainty [20, 9]. [20] show, though, that independent input noise may be reduced to independent output noise, and as such is not sufficiently expressive for modelling non-stationarity for forecasting. Here, we propose a prior that is just a single step away from an independent prior on each input, namely one which corresponds to a tridiagonal covariance matrix $\Sigma$, with $\Sigma_{ij} \ne 0$ only if $|i - j| \le 1$, which is equivalent to imposing independent priors on the distances between adjacent inputs. The intuition behind this prior is that the distance between adjacent locations increases, and the effective length scale decreases, in areas with high volatility, and vice versa in areas with low volatility. For convenience of inference, we formulate the prior in terms of the relative change of distance between inputs. We exploit the structure of this prior to specify the model compactly, without having to manipulate the full covariance matrix of the prior.

Formally, let $f \sim \mathcal{GP}(0, k)$ be a one-dimensional Gaussian process. Let also $D$ be a distribution on the relative changes of distances between adjacent inputs. Then, given inputs $x_1, \dots, x_n$, the generative model for outputs $\boldsymbol{y}$ is

$$\delta_i \sim D, \quad i = 1, \dots, n - 1,$$
$$\tilde{x}_1 = x_1, \quad \tilde{x}_{i+1} = \tilde{x}_i + \delta_i (x_{i+1} - x_i),$$
$$\boldsymbol{y} \sim \mathcal{N}(0, K(\tilde{\boldsymbol{x}}, \tilde{\boldsymbol{x}})). \qquad (5)$$

In words, inputs $\boldsymbol{x}$ are transformed (warped) into $\tilde{\boldsymbol{x}}$ by stretching or compressing the distances between adjacent inputs by relative amounts $\delta_i$ drawn from $D$. For brevity, we call the introduced model WGP in the rest of the paper. $D$ serves as a prior belief on the distances between adjacent inputs, relative to the original distances. Without loss of generality, the mean of $D$ can be assumed to be 1, so that the mean of the prior belief is that no warping is applied.
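The warping step of the generative model, stretching or compressing each gap between adjacent inputs by a relative amount drawn from D, can be sketched as follows (a log-normal D, centred near 1, is an illustrative assumption, not necessarily the paper's choice):

```python
import numpy as np

def warp_inputs(x, deltas):
    # x_tilde_1 = x_1; x_tilde_{i+1} = x_tilde_i + delta_i * (x_{i+1} - x_i)
    gaps = np.diff(x) * deltas
    return np.concatenate(([x[0]], x[0] + np.cumsum(gaps)))

rng = np.random.default_rng(0)
x = np.arange(5.0)                                   # equidistant inputs 0..4
deltas = rng.lognormal(mean=0.0, sigma=0.3, size=4)  # positive, centred near 1
x_warped = warp_inputs(x, deltas)
# each warped gap equals the original gap scaled by its delta
```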

3.2 Training

Training of a WGP model is performed by maximizing the log marginal likelihood $\log L_{WGP}$ of the observations and of the relative distance changes:

$$\log L_{WGP} = \log L_{GP}(\boldsymbol{y} \,|\, \tilde{\boldsymbol{x}}; \theta) + \sum_{i=1}^{n-1} \log p_D(\delta_i) + C, \qquad (6)$$

where $C$ is a normalization constant that does not depend on either the hyperparameters or the observations and is not required for training. As with kernel hyperparameters, derivatives of $\log L_{WGP}$ with respect to both the hyperparameters and the transformed inputs are readily obtainable analytically or through algorithmic differentiation [14].
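A sketch of the combined objective, assuming an RBF kernel and a log-normal prior D with median 1 (both illustrative choices, not necessarily those of the reference implementation):

```python
import numpy as np

def wgp_log_likelihood(deltas, x, y, l=1.0, sigma_d=0.3, noise=1e-2):
    # GP log marginal likelihood evaluated at the warped inputs ...
    xw = np.concatenate(([x[0]], x[0] + np.cumsum(np.diff(x) * deltas)))
    n = len(x)
    d = xw[:, None] - xw[None, :]
    K = np.exp(-d**2 / (2 * l**2)) + noise * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    gp_term = (-0.5 * y @ np.linalg.solve(K, y)
               - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))
    # ... plus the log density of the deltas under a log-normal prior
    # with median 1 (up to an additive constant)
    log_d = np.log(deltas)
    prior_term = np.sum(-log_d - 0.5 * (log_d / sigma_d) ** 2)
    return gp_term + prior_term

x = np.arange(5.0)
y = np.sin(x)
# extreme warping is penalized by the prior on relative distance changes
```

In practice, this objective is maximized jointly over the deltas and the kernel hyperparameters with a gradient-based optimizer.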

3.3 Forecasting

After training, forecasting is virtually identical to that of a regular Gaussian process, with one exception: for prediction at a new location $x_{n+1}$, the warped image $\tilde{x}_{n+1}$ of the location must be obtained for substitution into (2). The possible options are:

  • Choosing $\delta_n$ that maximizes the combined log marginal likelihood, and setting $\tilde{x}_{n+1} = \tilde{x}_n + \delta_n (x_{n+1} - x_n)$.

  • Setting $\delta_n = 1$ and, consequently, $\tilde{x}_{n+1} = \tilde{x}_n + (x_{n+1} - x_n)$.

  • Setting $\delta_n = \delta_{n-1}$ and $\tilde{x}_{n+1} = \tilde{x}_n + \delta_{n-1} (x_{n+1} - x_n)$.

The first option is best aligned with log marginal likelihood maximization during training but is computationally expensive. The last option expresses a smoothness assumption: the length scale is likely to be similar at adjacent inputs. We experimented with the three options and found that, empirically, on synthetic and real-world datasets, the predictive accuracy of the third option is virtually indistinguishable from that of the first one. In the empirical evaluation, we computed the warped location for forecasting as $\tilde{x}_{n+1} = \tilde{x}_n + \delta_{n-1} (x_{n+1} - x_n)$.
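The third option, reusing the last inferred relative change, amounts to a single line; the trained-state values below are hypothetical:

```python
def next_warped_input(x_last, x_warped_last, delta_last, x_new):
    # x_tilde_{n+1} = x_tilde_n + delta_{n-1} * (x_{n+1} - x_n)
    return x_warped_last + delta_last * (x_new - x_last)

# hypothetical trained state: last input 2.0 warped to 2.5, last delta 1.5
x_tilde_new = next_warped_input(2.0, 2.5, 1.5, 3.0)
# -> 2.5 + 1.5 * 1.0 = 4.0
```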

3.4 Modelling Seasonality

Time series are often modelled by combining trend and seasonality, that is, similarity between nearby observations on one hand and between observations at similar phases of a period on the other hand. In Gaussian processes, kernels based on the periodic kernel [18] are used to model seasonality. Warping of the input space, used to handle non-stationarity, would interfere with the dependencies induced by the periodic kernel. Consider monitoring of vital sign time series in an intensive care unit [5]: while the volatility of the time series may evolve over time, and warping the time may be adequate for modelling non-stationarity, observations at the same astronomical time of day tend to be similar.

A solution for warping the trend time while keeping the seasonality time unwarped is to include both the original and the warped dimension in the input space. This way, the kernel features modelling the trend, and thus affected by non-stationarity, are made dependent on the warped time, and those modelling seasonality on the original time. Generative model (7) extends (5) by passing both $\boldsymbol{x}$ and $\tilde{\boldsymbol{x}}$ as input to the Gaussian process:

$$\boldsymbol{y} \sim \mathcal{N}(0, K((\boldsymbol{x}, \tilde{\boldsymbol{x}}), (\boldsymbol{x}, \tilde{\boldsymbol{x}}))). \qquad (7)$$
Consider, for example, the following kernel, composed of a locally periodic term and a trend term:

$$k(x, x') = k_{per}(x, x')\, k_{loc}(x, x') + k_{trend}(x, x'). \qquad (8)$$

In this kernel, the $k_{loc}$ and $k_{trend}$ components reflect local dependencies between inputs and hence should be affected by input warping. The $k_{per}$ component, however, expresses dependencies between points at similar phases of different periods, with the period length normally known upfront and staying fixed. Thus, in the presence of warping, the modified kernel must receive both the warped and the original inputs and pass the appropriate inputs to each of the components:

$$k((x, \tilde{x}), (x', \tilde{x}')) = k_{per}(x, x')\, k_{loc}(\tilde{x}, \tilde{x}') + k_{trend}(\tilde{x}, \tilde{x}'). \qquad (9)$$
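The idea of passing warped time to the local components and original time to the periodic component can be sketched as follows (the kernel forms and parameters are illustrative, not the paper's exact choices):

```python
import numpy as np

def k_rbf(t, s, l):
    # RBF term: depends only on the distance between its inputs
    return np.exp(-(t - s) ** 2 / (2 * l**2))

def k_per(t, s, period=1.0, l=0.5):
    # periodic term: compares phases, with a fixed period
    return np.exp(-2 * np.sin(np.pi * (t - s) / period) ** 2 / l**2)

def k_combined(p, q):
    # each input is a pair (original time t, warped time tw);
    # the periodic factor reads the original times, so the season length
    # stays fixed, while the local and trend factors read the warped times
    t, tw = p
    s, sw = q
    return k_per(t, s) * k_rbf(tw, sw, l=0.8) + k_rbf(tw, sw, l=5.0)

# points exactly one period apart stay maximally correlated by the periodic
# factor, regardless of how much the trend time was warped
c = k_combined((0.0, 0.0), (1.0, 1.5))
```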
4 Empirical Evaluation

dataset         no warping      warped          warped, periodic  deep GP
trend           0.2734±0.0464   0.2384±0.0649   0.2387±0.0620     0.6110±0.0511
trend+seasonal  -0.2575±0.0273  -0.3278±0.0288  -0.3683±0.0312    0.1236±0.0659

Table 1: Negative log predictive density on synthetic datasets.

Figure 1: Forecasting on an instance from the synthetic dataset. (a) no warping, NLPD = 0.1887; (b) warped, NLPD = -0.1620; (c) deep GP, NLPD = -0.0912.
dataset     no warping  warped   deep GP
LIDAR       0.2543      0.2290   0.2370
Marathon    0.1887      -0.1620  -0.2183
Motorcycle  1.9320      0.8063   1.1919

Table 2: Negative log predictive density on real-world datasets.

Figure 2: Forecasting on the Marathon dataset. (a) no warping, NLPD = 0.1887; (b) warped, NLPD = -0.1620; (c) deep GP, NLPD = -0.2183.

Figure 3: Forecasting on the Motorcycle dataset. (a) no warping, NLPD = 1.9320; (b) warped, NLPD = 0.8063; (c) deep GP, NLPD = 1.1919.

The empirical evaluation relies on modelling and inference capabilities provided by differentiable probabilistic programming [14, 13]. We implemented WGP using Infergo [34], a probabilistic programming facility, and GoGP [35], a library for probabilistic programming with Gaussian processes. The source code, data, and detailed results of empirical evaluations are available at https://bitbucket.org/dtolpin/wigp. An implementation of LBFGS [17] provided by Gonum [1] was used for inferring hyperparameters. As a state-of-the-art algorithm for non-stationary Gaussian process regression, we used an implementation of deep Gaussian processes from https://github.com/SheffieldML/PyDeepGP.

We evaluated the model on synthetic and real world data. Two kernels were employed in the empirical evaluation:

  1. A Matérn(5/2) kernel, used with both synthetic and real-world data.

  2. A weighted sum of a Matérn(5/2) kernel and a periodic kernel, used with synthetic data.

The latter kernel was applied to synthetic data generated both with and without a seasonal component, to evaluate the influence of prior structural knowledge on one hand, and the possible adverse effect of model misspecification (a periodic component where there is no seasonality in the data) on the other hand. A parameterized homoscedastic noise term was added to the kernel in all evaluations. Vague log-normal priors were imposed on the kernel hyperparameters.

4.1 Synthetic datasets

Synthetic data was generated by sampling 100 instances from Gaussian processes with a Matérn(5/2) kernel, and with a sum of a Matérn(5/2) and a periodic kernel to emulate seasonality in the data. To emulate non-stationarity, log distances between inputs were sampled from a Gaussian process with an RBF kernel, and the inputs were then unwarped into equidistant positions. Samples from the periodic kernel component were drawn for equidistant inputs, in accordance with the assumption that the seasonality period is fixed.

Table 1 provides the negative log predictive density (NLPD) for the regular (unwarped) Gaussian process, the warped Gaussian process with and without the periodic component, and the deep Gaussian process on the synthetic dataset. Smaller NLPD means better forecasting accuracy. WGP outperforms both the regular and the deep Gaussian process by a wide margin on the synthetic dataset. Using a kernel with a periodic component on seasonal data improves forecasting, but accounting for non-stationarity through warping always results in much better accuracy.

Figure 1 shows a typical forecast by each of the models on a single instance from the synthetic dataset.

4.2 Real-world datasets

We used three real-world datasets for the evaluation: LIDAR [29], Marathon, and Motorcycle [30].

Table 2 compares the performance of the regular Gaussian process, WGP, and the deep Gaussian process on these datasets. WGP shows the best predictive performance on the LIDAR and Motorcycle data. On the Marathon time series, the deep Gaussian process performs slightly better, apparently due to the smoothness of the data. Figures 2 and 3 show forecasting with each of the models on the Marathon and Motorcycle datasets.

5 Related Work

Work related to this research is concerned with Gaussian processes for time series forecasting, non-stationarity in Gaussian process regression, and warping of the input space for representing non-stationarity, in the order of narrowing focus. [25] gives an introduction to Gaussian processes for time series modelling, including handling of non-stationarity through change point detection.

Non-stationarity in Gaussian processes is attributed to either heteroscedasticity, that is, varying observation noise, or to non-stationarity of the covariance, or to both. Heteroscedasticity is addressed by modelling dependency of noise on the input and, possibly, output [12, 16, 15, 36, 9]. Non-stationarity of the covariance is represented through change points [10, 26, 4], non-stationary kernels [11, 22, 23] or input and output space transformations (warping) [28, 3, 19, 8, 7].

The present work uses warping of the input space to represent non-stationarity. However, unlike previous research, only the observation inputs are transformed rather than the whole input space, allowing for a simpler representation and more efficient inference. Due to the non-parametric nature of the transformation employed in this work, the introduced model is applicable to time series both with change points and with smooth non-stationarities.

6 Conclusion

We introduced a Gaussian process-based model where non-stationarity is handled through non-parametric warping of observation inputs. In application to time series, the model facilitates forecasting of future observations with variances depending on outputs, as well as inputs, of past observations, while staying within the framework of ‘standard’ Gaussian process inference. When the data is known to possess periodic properties or non-local correlations, these correlations can be encoded in the model while still handling non-stationarity through warping. The introduced approach to input warping can be used with existing Gaussian process libraries and algorithms, and there is room for compromise between accuracy of modelling non-stationarity and computation time.

It remains an open question to what extent a more expressive warping may improve the quality of forecasting. Combining the introduced model with change-point detection may be beneficial in cases of abrupt changes in process parameters. Still, in cases where simplicity of implementation and robustness in the face of variability in time series are of high priority, the introduced model appears to provide a practical and efficient solution.


  • [1] Gonum authors. Gonum numerical packages. http://gonum.org/, 2017.
  • [2] Thang D. Bui, José Miguel Hernández-Lobato, Daniel Hernández-Lobato, Yingzhen Li, and Richard E. Turner, ‘Deep Gaussian processes for regression using approximate expectation propagation’, in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1472–1481. JMLR.org, (2016).
  • [3] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth, ‘Manifold Gaussian processes for regression’, in International Joint Conference on Neural Networks (IJCNN), pp. 3338–3345. IEEE, (July 2016).
  • [4] Varun Chandola and Ranga Raju Vatsavai, ‘A Gaussian process based online change detection algorithm for monitoring periodic time series’, in Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 95–106, (2011).
  • [5] L. Clifton, D. A. Clifton, M. A. F. Pimentel, P. J. Watkinson, and L. Tarassenko, ‘Gaussian process regression in vital-sign early warning systems’, in 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6161–6164, (Aug 2012).
  • [6] GW Colopy, SJ Roberts, and DA Clifton, ‘Gaussian processes for personalized interpretable volatility metrics in the step-down ward’, IEEE Journal of Biomedical and Health Informatics, 23(3), 949–959, (2019).
  • [7] Andreas Damianou. Deep Gaussian processes and variational propagation of uncertainty, 2018.
  • [8] Andreas Damianou and Neil Lawrence, ‘Deep Gaussian processes’, in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, eds., Carlos M. Carvalho and Pradeep Ravikumar, volume 31 of Proceedings of Machine Learning Research, pp. 207–215, Scottsdale, Arizona, USA, (29 Apr–01 May 2013). PMLR.
  • [9] Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence, ‘Variational inference for latent variables and uncertain inputs in Gaussian processes’, J. Mach. Learn. Res., 17(1), 1425–1486, (January 2016).
  • [10] Roman Garnett, Michael A. Osborne, and Stephen J. Roberts, ‘Sequential Bayesian prediction in the presence of changepoints’, in Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 345–352, New York, NY, USA, (2009). ACM.
  • [11] M.N. Gibbs, Bayesian Gaussian Processes for Classification and Regression, Ph.D. dissertation, University of Cambridge, Cambridge, U.K., 1997.
  • [12] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop, ‘Regression with input-dependent noise: A Gaussian process treatment’, in Advances in Neural Information Processing Systems 10, eds., M. I. Jordan, M. J. Kearns, and S. A. Solla, 493–499, MIT Press, (1998).
  • [13] Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani, ‘Probabilistic programming’, in International Conference on Software Engineering (ICSE, FOSE track), (2014).
  • [14] Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edn., 2008.
  • [15] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, ‘Most likely heteroscedastic Gaussian process regression’, in International Conference on Machine Learning (ICML), Corvallis, Oregon, USA, (March 2007).
  • [16] Quoc V. Le, Alex J. Smola, and Stéphane Canu, ‘Heteroscedastic Gaussian process regression’, in Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pp. 489–496, New York, NY, USA, (2005). ACM.
  • [17] Dong C. Liu and Jorge Nocedal, ‘On the limited memory BFGS method for large scale optimization’, Mathematical Programming, 45(1), 503–528, (Aug 1989).
  • [18] David JC MacKay, ‘Introduction to Gaussian processes’, NATO ASI Series F Computer and Systems Sciences, 168, 133–166.
  • [19] Sébastien Marmin, David Ginsbourger, Jean Baccou, and Jacques Liandrat, ‘Warped Gaussian processes and derivative-based sequential designs for functions with heterogeneous variations’, SIAM/ASA Journal on Uncertainty Quantification, 6(3), 991–1018, (2018).
  • [20] Andrew Mchutchon and Carl E. Rasmussen, ‘Gaussian process training with input noise’, in Advances in Neural Information Processing Systems 24, eds., J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, 1341–1349, Curran Associates, Inc., (2011).
  • [21] Rajbir-Singh Nirwan and Nils Bertschinger, ‘Applications of Gaussian process latent variable models in finance’, in IntelliSys, (2018).
  • [22] Christopher J. Paciorek and Mark J. Schervish, ‘Nonstationary covariance functions for Gaussian process regression’, in Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, pp. 273–280, Cambridge, MA, USA, (2003). MIT Press.
  • [23] Christian Plagemann, Kristian Kersting, and Wolfram Burgard, ‘Nonstationary Gaussian process regression using point estimates of local smoothness’, in Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECMLPKDD’08, pp. 204–219, Berlin, Heidelberg, (2008). Springer-Verlag.
  • [24] Carl Edward Rasmussen and Christopher K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2005.
  • [25] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain, ‘Gaussian processes for time-series modelling’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984), 20110550, (2013).
  • [26] Yunus Saatchi, Ryan Turner, and Carl Rasmussen, ‘Gaussian process change point models’, in Proceedings of the 27th Annual International Conference on Machine Learning, ICML ’10, pp. 927–934, (08 2010).
  • [27] Hugh Salimbeni and Marc Deisenroth, ‘Doubly stochastic variational inference for deep Gaussian processes’, in Advances in Neural Information Processing Systems 30, eds., I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4588–4599, Curran Associates, Inc., (2017).
  • [28] Paul D. Sampson and Peter Guttorp, ‘Nonparametric estimation of nonstationary spatial covariance structure’, Journal of the American Statistical Association, 87(417), pp. 108–119, (1992).
  • [29] M. Sigrist, ‘Air monitoring by spectroscopic techniques’, Chemical Analysis Series, 197, (1994).
  • [30] B.W. Silverman, ‘Some aspects of the spline smoothing approach to non-parametric curve fitting.’, Journal of the Royal Statistical Society, B, 47, 1–52, (1985).
  • [31] Edward Snelson and Zoubin Ghahramani, ‘Variable noise and dimensionality reduction for sparse Gaussian processes’, in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pp. 461–468, Arlington, Virginia, United States, (2006). AUAI Press.
  • [32] Jasper Snoek, Kevin Swersky, Richard Zemel, and Ryan P. Adams, ‘Input warping for Bayesian optimization of non-stationary functions’, in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1674–II–1682. JMLR.org, (2014).
  • [33] Felipe Tobar, Thang D Bui, and Richard E Turner, ‘Learning stationary time series using Gaussian processes with nonparametric kernels’, in Advances in Neural Information Processing Systems 28, eds., C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 3501–3509, Curran Associates, Inc., (2015).
  • [34] David Tolpin, ‘Deployable probabilistic programming’, in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2019, pp. 1–16, New York, NY, USA, (2019). ACM.
  • [35] David Tolpin. GoGP, a library for probabilistic programming with Gaussian processes. http://bitbucket.org/dtolpin/gogp, 2019.
  • [36] Chunyi Wang and Radford M. Neal. Gaussian process regression with heteroscedastic or non-Gaussian residuals, 2012.
  • [37] Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani, ‘Gaussian process volatility model’, in Advances in Neural Information Processing Systems 27, eds., Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 1044–1052, Curran Associates, Inc., (2014).