Parametric Gaussian Process Regression for Big Data

by   Maziar Raissi, et al.
Brown University

This work introduces the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes parametric. Parametric Gaussian processes, by construction, are designed to operate in "big data" regimes where one is interested in quantifying the uncertainty associated with noisy data. The proposed methodology circumvents the well-established need for stochastic variational inference, a scalable algorithm for approximating posterior distributions. The effectiveness of the proposed approach is demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately 6 million records.




1 Introduction

Gaussian processes (see Rasmussen06gaussianprocesses; murphy2012machine) are a non-parametric Bayesian machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, and has a fully probabilistic work-flow that returns robust posterior variance estimates, which quantify uncertainty in a natural way. Moreover, Gaussian processes are among a class of methods known as kernel machines (see vapnik2013nature; scholkopf2002learning; tipping2001sparse) and are analogous to regularization approaches (see tikhonov1963solution; Tikhonov/Arsenin/77; poggio1990networks). They can also be viewed as a prior on one-layer feed-forward Bayesian neural networks with an infinite number of hidden units. Non-parametric models such as Gaussian processes need to "remember" the full dataset in order to be trained and make predictions. Therefore, the complexity of non-parametric models grows with the size of the dataset. For instance, when applying a Gaussian process to a dataset of size $n$, exact inference has computational complexity $\mathcal{O}(n^3)$ with storage demands of $\mathcal{O}(n^2)$. In recent years, we have been witnessing a tremendous amount of effort (see e.g., snelson2007local; urtasun2008sparse) to reduce these complexities. Such efforts generally lead to a computational complexity of $\mathcal{O}(nm^2)$ and storage demands of $\mathcal{O}(nm)$, where $m \ll n$ is a user-specified parameter governing the number of "inducing variables" (see e.g., csato2002sparse; seeger2003fast; quinonero2005unifying; titsias2009variational). However, as pointed out in hensman2013gaussian, even these reduced storage demands are prohibitive for "big data". In hensman2013gaussian, the authors combine the idea of inducing variables with recent advances in variational inference (see e.g., hensman2012fast; hoffman2013stochastic) to develop a practical algorithm for fitting Gaussian processes using stochastic variational inference.
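To make the scaling bottleneck concrete, the following NumPy sketch (not from the paper; the kernel, data, and noise level are illustrative assumptions) shows exact GP regression: the $n \times n$ Gram matrix drives the $\mathcal{O}(n^2)$ storage and its Cholesky factorization the $\mathcal{O}(n^3)$ time.

```python
import numpy as np

def sq_exp(A, B, sigma_f=1.0, ell=1.0):
    """Squared exponential kernel for 1-D inputs."""
    return sigma_f**2 * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

def exact_gp_predict(X, y, X_star, noise=0.1):
    """Exact GP regression: O(n^3) time (Cholesky), O(n^2) memory (Gram matrix)."""
    n = len(X)
    K = sq_exp(X, X) + noise**2 * np.eye(n)   # n x n Gram matrix: O(n^2) storage
    L = np.linalg.cholesky(K)                 # O(n^3) flops: the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = sq_exp(X_star, X)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = sq_exp(X_star, X_star).diagonal() - np.sum(v**2, axis=0)
    return mean, var

# Toy problem: every prediction must touch all n = 200 training points.
X = np.linspace(-3, 3, 200)
y = np.sin(X)
mean, var = exact_gp_predict(X, y, np.array([0.0, 1.0]))
```

Doubling $n$ here roughly octuples the Cholesky cost, which is exactly what the sparse and parametric approaches discussed next try to avoid.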

In contrast, the current work avoids stochastic variational inference and attempts to present an alternative approach to the one proposed in hensman2013gaussian. The seemingly self-contradictory idea is to make Gaussian processes parametric. The key feature of parametric models in general, and the current work in particular, is that predictions are conditionally independent of the observed data given the parameters. In other words, the data is distilled into the parameters and any subsequent prediction does not make use of the original dataset. This is very convenient as it enables efficient mini-batch training procedures. However, this is not without drawbacks since choosing a model from a particular parametric class constrains its flexibility. Therefore, it is of great importance to devise models that are aware of their imperfections and are capable of properly quantifying the uncertainty in their predictions associated with such limitations.

2 Methodology

Let us start by making the prior assumption that

$$f(x) \sim \mathcal{GP}\big(0, k(x, x'; \theta)\big) \quad (1)$$

is a zero-mean Gaussian process Rasmussen06gaussianprocesses with covariance function $k(x, x'; \theta)$, which depends on the hyper-parameters $\theta$. Moreover, let us postulate the existence of some hypothetical dataset $\{Z, u\}$ with

$$u \sim \mathcal{N}(m, S). \quad (2)$$

Here, $Z$ denotes the locations of the hypothetical data points and $u = f(Z)$ the corresponding hypothetical values. Let us define a parametric Gaussian process by the resulting conditional distribution

$$f(x) \mid u \sim \mathcal{N}\big(k(x, Z) K^{-1} u,\; k(x, x') - k(x, Z) K^{-1} k(Z, x')\big), \qquad K := k(Z, Z). \quad (3)$$




The parameters $m$ and $S$ of a parametric Gaussian process (3) will play a crucial role; the data will be distilled into these parameters, and any subsequent predictions will not make use of the original dataset. This is very convenient as it enables the efficient mini-batch training procedure outlined in the following. Taking advantage of the favorable form (3) of a parametric Gaussian process, the mean $m$ and covariance matrix $S$ of the hypothetical dataset (2) can be updated by employing the posterior distribution resulting from conditioning on an observed mini-batch of data $\{X, y\}$ of size $M$; i.e.,

$$m \leftarrow m + S K^{-1} k(Z, X)\, \Psi^{-1} \big(y - k(X, Z) K^{-1} m\big), \qquad S \leftarrow S - S K^{-1} k(Z, X)\, \Psi^{-1} k(X, Z) K^{-1} S, \quad (4)$$

where $\Psi := k(X, Z) K^{-1} S K^{-1} k(Z, X) + k(X, X) - k(X, Z) K^{-1} k(Z, X) + \sigma^2 I$. It is worth mentioning that these updates are nothing but the posterior mean $\mathbb{E}[u \mid y]$ and covariance $\mathrm{Cov}[u \mid y]$ obtained from standard Gaussian conditioning. The information corresponding to the mini-batch is now distilled in the parameters $m$ and $S$. The hyper-parameters $\theta$ and noise variance parameter $\sigma^2$ can be updated by taking a step proportional to the gradient of the negative log marginal likelihood

$$-\log p(y \mid X) = \tfrac{1}{2} \log |\Psi| + \tfrac{1}{2} \big(y - k(X, Z) K^{-1} m\big)^{\top} \Psi^{-1} \big(y - k(X, Z) K^{-1} m\big) + \tfrac{M}{2} \log 2\pi, \quad (5)$$

which follows from the marginal distribution $y \sim \mathcal{N}\big(k(X, Z) K^{-1} m, \Psi\big)$. The training procedure is initialized by setting $m = 0$ and $S = k(Z, Z; \theta_0)$, where $\theta_0$ is some initial set of hyper-parameters. Having trained the hyper-parameters and parameters of the model, one can use equation (3) to predict the mean $k(x^*, Z) K^{-1} m$ of the solution at a new test point $x^*$. Moreover, the predicted variance is given by $k(x^*, x^*) - k(x^*, Z) K^{-1} (K - S) K^{-1} k(Z, x^*) + \sigma^2$, where $S$ is obtained from equation (2).
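As a minimal illustration of the workflow above, the following NumPy sketch (a hypothetical implementation, not the authors' code) distills mini-batches into $(m, S)$ via standard Gaussian conditioning as in update (4), with fixed hyper-parameters for simplicity; the gradient step on the negative log marginal likelihood (5) is omitted.

```python
import numpy as np

def sq_exp(A, B, sigma_f=1.0, ell=1.0):
    # k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 ell^2)), 1-D inputs
    return sigma_f**2 * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

class ParametricGP:
    """Hypothetical dataset {Z, u} with u ~ N(m, S); hyper-parameters held fixed."""

    def __init__(self, Z, sigma_f=1.0, ell=1.0, noise=0.1):
        self.Z, self.sigma_f, self.ell, self.noise = Z, sigma_f, ell, noise
        self.K = sq_exp(Z, Z, sigma_f, ell) + 1e-8 * np.eye(len(Z))  # jitter
        self.m = np.zeros(len(Z))   # initialize m = 0
        self.S = self.K.copy()      # initialize S = k(Z, Z)

    def update(self, X, y):
        """Distill a mini-batch {X, y} into (m, S) by Gaussian conditioning."""
        KzX = sq_exp(self.Z, X, self.sigma_f, self.ell)
        A = np.linalg.solve(self.K, KzX).T                       # k(X,Z) K^{-1}
        C = (sq_exp(X, X, self.sigma_f, self.ell) - A @ KzX
             + self.noise**2 * np.eye(len(X)))
        Psi = A @ self.S @ A.T + C
        gain = self.S @ A.T @ np.linalg.inv(Psi)
        self.m = self.m + gain @ (y - A @ self.m)
        self.S = self.S - gain @ A @ self.S

    def predict(self, Xs):
        """Predictive mean and latent variance; the raw data is never revisited."""
        B = np.linalg.solve(self.K, sq_exp(self.Z, Xs, self.sigma_f, self.ell)).T
        mean = B @ self.m
        cov = sq_exp(Xs, Xs, self.sigma_f, self.ell) - B @ (self.K - self.S) @ B.T
        return mean, np.diag(cov)

# Stream the data in two mini-batches; predictions only touch (Z, m, S).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)
y = np.sin(X) + 0.05 * rng.normal(size=50)
gp = ParametricGP(Z=np.linspace(-3, 3, 8))
for batch in np.split(np.arange(50), 2):
    gp.update(X[batch], y[batch])
mean, var = gp.predict(np.array([0.0, 1.5]))
```

Note that for fixed hyper-parameters, conditioning sequentially on mini-batches is mathematically equivalent to conditioning once on their union, which is what makes the streaming procedure consistent.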

3 Experiments

Parametric Gaussian process regression is entirely agnostic to the size of the dataset and can effectively handle datasets with millions or billions of records. The effectiveness of the proposed methodology will be demonstrated using an illustrative example with simulated data and a benchmark dataset in the literature on Gaussian processes and big data.

Figure 1: Illustrative example: (A) Plotted are the 6000 training data generated by random perturbations of a one-dimensional function $f(x)$. (B) Depicted is the resulting prediction of the model. The blue solid line represents the true data generating function $f(x)$, while the dashed red line depicts the predicted mean. The shaded orange region illustrates the two-standard-deviations band around the mean. The red circles depict the resulting mean values $m$ for the 8 hypothetical data points after a single pass through the entire dataset, with mini-batches of size one employed per iteration of the training algorithm. It is remarkable how the training procedure places the mean of the hypothetical dataset on the underlying function $f(x)$.

3.1 Illustrative example

To demonstrate the proposed framework, let us begin with a simple dataset generated by random perturbations of a one-dimensional function $f(x)$. The training data are depicted in panel (A) of figure 1. The Gaussian process prior (1) used for this example is assumed to have a squared exponential Rasmussen06gaussianprocesses covariance function, i.e.,

$$k(x, x'; \theta) = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right),$$

where $\sigma_f^2$ is a variance parameter and $\theta = \{\sigma_f, \ell\}$ are the hyper-parameters. The model employs a hypothetical dataset (see equation (2)) of size 8. The locations $Z$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer kingma2014adam with default settings and mini-batches of size one. After one pass through the entire training data, it is remarkable how the parameters $m$ and $S$ of the hypothetical dataset enable us to summarize the actual training data. The red circles in figure 1 denote the pairs $(Z, m)$ of the hypothetical data. The resulting prediction of the model is plotted in figure 1.
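The choice of the locations $Z$ via k-means can be sketched as follows (a plain Lloyd's-algorithm stand-in for illustration; in practice a library routine such as scikit-learn's `KMeans` would typically be used, and the training inputs here are a uniform stand-in for the actual data):

```python
import numpy as np

def kmeans_1d(x, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers of 1-D data."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False)  # initialize from the data
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            pts = x[labels == j]
            if len(pts) > 0:
                centers[j] = pts.mean()
    return np.sort(centers)

x_train = np.linspace(-3, 3, 6000)  # stand-in for the 6000 training inputs
Z = kmeans_1d(x_train, k=8)         # locations of the hypothetical dataset
```

The centers spread out to cover the regions where training inputs are dense, which is precisely why k-means is a reasonable heuristic for placing the hypothetical data points.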

3.2 Airline delays

The US flight delay prediction example, originally proposed in hensman2013gaussian, has reached the status of a standard benchmark (see e.g., hensman2016variational; gal2014distributed; DBLP:conf/icml/DeisenrothN15; samo2016string; adam2016scalable) in Gaussian process regression, partly because of the massive size of the dataset, with nearly 6 million records, and partly because of its large-scale non-stationary nature. The dataset consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. Each record is complemented with details on the flight and the aircraft. The aim is to predict the delay of the aircraft at landing, in minutes. The eight covariates are the same as in hensman2013gaussian, namely the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. Two thirds of the entire dataset, which totals approximately 6 million records, are used for training and one third for testing. The output data are normalized by subtracting the training sample mean from the outputs and dividing the results by the sample standard deviation. The input data are normalized to a common interval. The Gaussian process prior (1) used for this example is assumed to have a squared exponential Rasmussen06gaussianprocesses covariance function, i.e.,

$$k(x, x'; \theta) = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{8} w_d \,(x_d - x_d')^2\right),$$

where $\sigma_f^2$ is a variance parameter, $x = (x_1, \ldots, x_8)$ is the vector of covariates, and $\theta = \{\sigma_f, w_1, \ldots, w_8\}$ are the hyper-parameters. Moreover, anisotropy across input dimensions is handled by the Automatic Relevance Determination (ARD) weights $w_d$. From a theoretical point of view, each kernel gives rise to a Reproducing Kernel Hilbert Space aronszajn1950theory; saitoh1988theory; berlinet2011reproducing that defines a class of functions that can be represented by this kernel. In particular, the squared exponential covariance function chosen above implies smooth approximations. More complex function classes can be accommodated by appropriately choosing kernels. The model employs a hypothetical dataset (see equation (2)). The locations $Z$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer kingma2014adam with default settings and mini-batch training. The resulting predictive mean squared error (MSE) on the normalized test data is within the range reported in the literature (see e.g., table 2 in hensman2016variational). The MSE over the normalized data can be interpreted as a fraction of the sample variance of airline arrival delays; thus, an MSE of 1.0 would be no better than using the training mean as the predictor. In order to further reduce the MSE, one could increase the size of the hypothetical dataset, increase the batch size, and/or choose a more accommodating covariance function. Moreover, to get a better idea of the relevance of the different features available in this dataset, figure 2 plots the automatic relevance determination parameters $w_d$. The most relevant variable turns out to be the airtime. The month and the time of departure of the flight are also two important features in predicting flight delays.

Figure 2: Airline delays example: Automatic relevance determination parameters $w_d$ for the features used for predicting flight delays.

4 Related works

Despite some subtle differences, it is generally safe to recognize the input-output pairs $\{Z, u\}$ (see equation (2)) as the so-called "inducing points", a frequently used term in the literature on sparse approximations to Gaussian process priors (see e.g., quinonero2005unifying for a comprehensive review). However, it is not advisable to interpret $m$ and $S$ (see equation (2)) as variational parameters hensman2013gaussian, since no (stochastic) variational inference is carried out in the current work. Furthermore, to highlight the subtle differences between "inducing points" and what this work calls the hypothetical dataset (2), it is worth observing that in the literature on sparse approximations to Gaussian processes the values of $m$ and $S$ are determined in closed form by the full dataset. Indeed, initializing with the prior $m = 0$ and $S = k(Z, Z)$ and using equations (4) and (2) to condition once on the entire dataset $\{X, y\}$, one obtains $m = k(Z, X)\big(k(X, X) + \sigma^2 I\big)^{-1} y$ and $S = k(Z, Z) - k(Z, X)\big(k(X, X) + \sigma^2 I\big)^{-1} k(X, Z)$. In other words, in the sparse Gaussian processes framework, the distribution $\mathcal{N}(m, S)$ of equation (2) and the exact posterior distribution of $u = f(Z)$ given the data are essentially identical. In contrast, this work treats $m$ and $S$ as parameters of the model responsible for encoding the history of the observed data. In this regard, the current work is similar to hensman2013gaussian. However, unlike hensman2013gaussian, the parameters $m$ and $S$ are not variational parameters of some variational distribution.
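The closed-form conditioning behind this comparison can be checked numerically. Under the standard Gaussian-conditioning reading of update (4) (sketched here with an assumed squared exponential kernel and toy data), a single step starting from the prior $m = 0$, $S = k(Z, Z)$ should reproduce the exact GP posterior over $u = f(Z)$ given the batch:

```python
import numpy as np

def sq_exp(A, B):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2)

rng = np.random.default_rng(1)
Z = np.linspace(-2, 2, 5)
X = np.sort(rng.uniform(-2, 2, 30))
y = np.sin(X) + 0.1 * rng.normal(size=30)
noise = 0.1

K = sq_exp(Z, Z)
KzX = sq_exp(Z, X)
Kxx = sq_exp(X, X)

# One conditioning step starting from the prior: m = 0, S = K.
A = np.linalg.solve(K, KzX).T                                  # k(X,Z) K^{-1}
Psi = A @ K @ A.T + (Kxx - A @ KzX) + noise**2 * np.eye(30)
m1 = K @ A.T @ np.linalg.solve(Psi, y)
S1 = K - K @ A.T @ np.linalg.solve(Psi, A @ K)

# Exact GP posterior over u = f(Z) given the full batch {X, y}.
G = np.linalg.solve(Kxx + noise**2 * np.eye(30), np.eye(30))
m_exact = KzX @ G @ y
S_exact = K - KzX @ G @ KzX.T
```

The two pairs agree because, starting from the prior, the low-rank and correction terms in $\Psi$ cancel and $\Psi$ collapses to $k(X, X) + \sigma^2 I$.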

5 Concluding remarks

Modern datasets are rapidly growing in size and complexity, and there is a pressing need to develop new statistical methods and machine learning techniques to harness this wealth of data. This work presented a novel regression framework for encoding massive amounts of data into a small number of hypothetical data points. While being effective, the resulting model is conceptually very simple: it is based on the idea of making Gaussian processes parametric, and it takes only a handful of mathematical formulas to explain every single detail of the algorithm. This simplicity is extremely important, especially when it comes to deploying machine learning algorithms on big data flow engines (see e.g., meng2016mllib) such as MapReduce dean2008mapreduce and Apache Spark zaharia2012resilient. Moreover, Gaussian processes are a powerful tool for probabilistic inference over functions. They offer desirable properties such as uncertainty estimates, automatic discovery of important dimensions, robustness to over-fitting, and principled ways of tuning hyper-parameters. Thus, scaling Gaussian processes to big datasets and deploying them on big data flow engines is, and will remain, an active area of research.


This work received support from the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013.