### ParametricGP


This work introduces the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes parametric. Parametric Gaussian processes, by construction, are designed to operate in "big data" regimes where one is interested in quantifying the uncertainty associated with noisy data. The proposed methodology circumvents the well-established need for stochastic variational inference, a scalable algorithm for approximating posterior distributions. The effectiveness of the proposed approach is demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately 6 million records.

Gaussian process regression (see Rasmussen06gaussianprocesses; murphy2012machine) is a non-parametric Bayesian machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, and has a fully probabilistic work-flow that returns robust posterior variance estimates, which quantify uncertainty in a natural way. Moreover, Gaussian processes are among a class of methods known as kernel machines (see

vapnik2013nature; scholkopf2002learning; tipping2001sparse) and are analogous to regularization approaches (see tikhonov1963solution; Tikhonov/Arsenin/77; poggio1990networks). They can also be viewed as a prior on one-layer feed-forward Bayesian neural networks with an infinite number of hidden units

neal2012bayesian. Non-parametric models such as Gaussian processes need to “remember” the full dataset in order to be trained and make predictions. Therefore, the complexity of non-parametric models grows with the size of the dataset. For instance, when applying a Gaussian process to a dataset of size

N, exact inference has a computational complexity of O(N³) with storage demands of O(N²). In recent years, we have witnessed a tremendous amount of effort (see e.g., snelson2007local; urtasun2008sparse) to reduce these complexities. Such efforts generally lead to a computational complexity of O(NM²) and storage demands of O(NM), where M is a user-specified parameter governing the number of "inducing variables" (see e.g., csato2002sparse; seeger2003fast; quinonero2005unifying; titsias2009variational). However, as is rightly pointed out in hensman2013gaussian, even these reduced storage demands are prohibitive for "big data". In hensman2013gaussian, the authors combine the idea of inducing variables with recent advances in variational inference (see e.g., hensman2012fast; hoffman2013stochastic) to develop a practical algorithm for fitting Gaussian processes using stochastic variational inference.

In contrast, the current work avoids stochastic variational inference and presents an alternative to the approach proposed in hensman2013gaussian. The seemingly self-contradictory idea is to make Gaussian processes *parametric*. The key feature of parametric models in general, and of the current work in particular, is that predictions are conditionally independent of the observed data given the parameters. In other words, the data are distilled into the parameters, and any subsequent prediction does not make use of the original dataset. This is very convenient, as it enables efficient *mini-batch* training procedures. However, it is not without drawbacks, since choosing a model from a particular parametric class constrains its flexibility. It is therefore of great importance to devise models that are aware of their imperfections and capable of properly quantifying the uncertainty in their predictions associated with such limitations.

Let us start by making the prior assumption that

u(x) ∼ GP(0, k(x, x'; θ))  (1)

is a zero-mean Gaussian process Rasmussen06gaussianprocesses with covariance function k(x, x'; θ), which depends on the hyper-parameters θ. Moreover, let us postulate the existence of some *hypothetical dataset* {Z, u} with

u ∼ N(m, S).  (2)

Here, m ∈ ℝ^M and S ∈ ℝ^{M×M} denote the mean and covariance of the hypothetical outputs u at the M hypothetical input locations Z. Let us define a *parametric Gaussian process* by the resulting conditional distribution

u(x) | Z, m, S ∼ GP(μ(x), Σ(x, x')),  (3)

where

μ(x) = k(x, Z) K⁻¹ m,  (4)

Σ(x, x') = k(x, x') − k(x, Z) K⁻¹ (K − S) K⁻¹ k(Z, x'),  (5)

and K := k(Z, Z).
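The mean and covariance functions above can be sketched in a few lines of NumPy. This is a minimal illustration under an assumed one-dimensional squared-exponential kernel; the function names (`kernel`, `pgp_moments`) and all parameter values are ours, not the paper's:

```python
import numpy as np

def kernel(x, xp, sigma_f=1.0, ell=0.3):
    """Squared-exponential covariance k(x, x'; theta) in one dimension."""
    return sigma_f**2 * np.exp(-(x[:, None] - xp[None, :])**2 / (2.0 * ell**2))

def pgp_moments(x, Z, m, S, jitter=1e-8):
    """Mean mu(x) and covariance Sigma(x, x) of the parametric GP, eqs. (4)-(5)."""
    K = kernel(Z, Z) + jitter * np.eye(len(Z))
    A = np.linalg.solve(K, kernel(Z, x))          # K^{-1} k(Z, x)
    mu = A.T @ m                                  # k(x, Z) K^{-1} m
    Sigma = kernel(x, x) - A.T @ (K - S) @ A      # k - k K^{-1} (K - S) K^{-1} k
    return mu, Sigma

Z = np.linspace(0.0, 1.0, 5)                      # hypothetical input locations
m = np.sin(2.0 * np.pi * Z)                       # hypothetical mean
S = 0.1 * np.eye(5)                               # hypothetical covariance

mu, Sigma = pgp_moments(Z, Z, m, S)
# Evaluated at Z itself, the parametric GP recovers the parameters (m, S).
assert np.allclose(mu, m, atol=1e-3)
assert np.allclose(Sigma, S, atol=1e-3)
```

The final two assertions illustrate the consistency property of the construction: at the hypothetical input locations the model reproduces m and S exactly (up to the numerical jitter).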

The parameters m and S of a parametric Gaussian process (3) will play a crucial role: the data will be distilled into these parameters, and any subsequent predictions will not make use of the original dataset. This is very convenient, as it enables the efficient mini-batch training procedure outlined in the following. Taking advantage of the favorable form (3) of a *parametric Gaussian process*, the mean m and covariance matrix S of the hypothetical dataset (2) can be updated by employing the posterior distribution resulting from conditioning on an observed mini-batch {X, y} of data of size n; i.e.,

m ← m + S Ψᵀ C⁻¹ (y − Ψ m),  (6)

S ← S − S Ψᵀ C⁻¹ Ψ S.  (7)
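One such mini-batch update can be sketched as follows, again under an assumed one-dimensional squared-exponential kernel; the names and settings are illustrative, not from the paper:

```python
import numpy as np

def kernel(x, xp, ell=0.5):
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2.0 * ell**2))

def update(m, S, Z, X, y, sigma_n=0.1, jitter=1e-8):
    """One mini-batch update of (m, S), following eqs. (6)-(7)."""
    K = kernel(Z, Z) + jitter * np.eye(len(Z))
    Psi = np.linalg.solve(K, kernel(Z, X)).T                 # k(X, Z) K^{-1}
    C = (kernel(X, X) - Psi @ (K - S) @ Psi.T
         + sigma_n**2 * np.eye(len(X)))                      # Sigma(X, X) + sigma^2 I
    G = np.linalg.solve(C, Psi @ S).T                        # S Psi^T C^{-1}
    return m + G @ (y - Psi @ m), S - G @ Psi @ S            # eqs. (6) and (7)

Z = np.linspace(0.0, 1.0, 4)
m0, S0 = np.zeros(4), kernel(Z, Z) + 1e-8 * np.eye(4)        # m = 0, S = k(Z, Z)
X = np.array([0.3, 0.6])
y = np.sin(2.0 * np.pi * X)                                  # a toy mini-batch
m1, S1 = update(m0, S0, Z, X, y)
# Conditioning on data can only reduce the uncertainty encoded in S:
assert np.trace(S1) <= np.trace(S0)
```

Note that each update costs O(M³ + n M²) regardless of how much data has already been processed, which is what makes the streaming procedure scale.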

It is worth mentioning that Ψ := k(X, Z) K⁻¹ and C := Σ(X, X) + σ² I, where Σ(X, X) is given by equation (5) and σ² is the noise variance. The information corresponding to the mini-batch {X, y} is now distilled in the parameters m and S. The hyper-parameters θ and noise variance parameter σ² can be updated by taking a step proportional to the gradient of the *negative log marginal likelihood*

−log p(y | X) = ½ log |C| + ½ (y − Ψ m)ᵀ C⁻¹ (y − Ψ m) + (n/2) log(2π).  (8)

The training procedure is initialized by setting m = 0 and S = k(Z, Z; θ₀), where θ₀ is some initial set of hyper-parameters. Having trained the hyper-parameters and parameters of the model, one can use equation (4) to predict the mean of the solution at a new test point x*. Moreover, the predicted variance is given by Σ(x*, x*) + σ², where Σ(x*, x*) is obtained from equation (5).
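Putting the pieces together, the streaming procedure can be sketched end-to-end. This sketch holds the hyper-parameters fixed (i.e., it omits the gradient step on the negative log marginal likelihood) as a simplification for brevity; the kernel, the target function, and all settings are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, xp, ell=0.2):
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2.0 * ell**2))

sigma_n = 0.1
Z = np.linspace(0.0, 1.0, 8)                     # hypothetical input locations
K = kernel(Z, Z) + 1e-8 * np.eye(len(Z))
m, S = np.zeros(len(Z)), K.copy()                # initialization: m = 0, S = k(Z, Z)

f = lambda x: np.sin(2.0 * np.pi * x)            # latent function to be learned
for _ in range(200):                             # stream mini-batches of size 10
    X = rng.uniform(0.0, 1.0, 10)
    y = f(X) + sigma_n * rng.normal(size=10)
    Psi = np.linalg.solve(K, kernel(Z, X)).T
    C = kernel(X, X) - Psi @ (K - S) @ Psi.T + sigma_n**2 * np.eye(10)
    G = np.linalg.solve(C, Psi @ S).T
    m, S = m + G @ (y - Psi @ m), S - G @ Psi @ S

# Prediction at a new test point: mean from eq. (4), variance from eq. (5).
x_star = np.array([0.25])
a = np.linalg.solve(K, kernel(Z, x_star))
mu_star = (a.T @ m).item()
var_star = (kernel(x_star, x_star) - a.T @ (K - S) @ a).item() + sigma_n**2
assert abs(mu_star - f(0.25)) < 0.2
assert var_star > 0.0
```

After streaming 2,000 noisy observations, the eight parameters (m, S) summarize the data well enough to predict the latent function at unseen locations, without ever revisiting the raw observations.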

Parametric Gaussian process regression is entirely agnostic to the size of the dataset and can effectively handle datasets with millions or billions of records. The effectiveness of the proposed methodology will be demonstrated using an illustrative example with simulated data and a benchmark dataset in the literature on Gaussian processes and big data.

To demonstrate the proposed framework, let us begin with a simple dataset generated by random perturbations of a one-dimensional function. The training data are depicted in panel (A) of figure 1. The Gaussian process prior (1) used for this example is assumed to have a squared exponential Rasmussen06gaussianprocesses covariance function, i.e.,

k(x, x'; θ) = σ_f² exp(−(x − x')² / (2ℓ²)),

where σ_f² is a variance parameter and θ = (σ_f, ℓ) are the hyper-parameters. The model employs a *hypothetical dataset* (see equation (2)) of size M. The locations Z of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer

kingma2014adam with default settings and mini-batches of size one. After one pass through the entire training data, it is remarkable how well the parameters m and S of the hypothetical dataset enable us to summarize the actual training data. The red circles in figure 1 denote the pairs of the hypothetical data. The resulting prediction of the model is also plotted in figure 1.

The US flight delay prediction example, originally proposed in hensman2013gaussian, has reached the status of a standard benchmark dataset (see e.g., hensman2016variational; gal2014distributed; DBLP:conf/icml/DeisenrothN15; samo2016string; adam2016scalable) in Gaussian process regression, partly because of the massive size of the dataset, with nearly 6 million records, and partly because of its large-scale non-stationary nature. The dataset (available at http://stat-computing.org/dataexpo/2009/) consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. Each record is complemented with details on the flight and the aircraft. The aim is to predict the delay of the aircraft at landing, in minutes. The eight covariates are the same as in hensman2013gaussian, namely the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. Two thirds of the entire dataset, which totals approximately 6 million records, is used for training and one third for testing. The output data are normalized by subtracting the training sample mean from the outputs and dividing the results by the sample standard deviation. The input data are normalized to a common interval. The Gaussian process prior (1) used for this example is assumed to have a squared exponential Rasmussen06gaussianprocesses covariance function, i.e.,

k(x, x'; θ) = σ_f² exp(−½ Σ_{d=1}^{8} w_d (x_d − x'_d)²),

where σ_f² is a variance parameter, x = (x_1, …, x_8) is the vector of covariates, and θ = (σ_f, w_1, …, w_8) are the hyper-parameters. Moreover, anisotropy across input dimensions is handled by the Automatic Relevance Determination (ARD) weights w_d. From a theoretical point of view, each kernel gives rise to a Reproducing Kernel Hilbert Space aronszajn1950theory; saitoh1988theory; berlinet2011reproducing that defines a class of functions that can be represented by this kernel. In particular, the squared exponential covariance function chosen above implies smooth approximations. More complex function classes can be accommodated by appropriately choosing kernels. The model employs a *hypothetical dataset* (see equation (2)) of size M.

Despite some subtle differences, it is generally safe to recognize the input-output pairs {Z, u} (see equation (2)) as the so-called "inducing points", a frequently used term in the literature on sparse approximations to Gaussian process priors (see e.g., quinonero2005unifying for a comprehensive review). However, it is not advisable to interpret m and S (see equation (2)) as variational parameters hensman2013gaussian, since no (stochastic) variational inference is carried out in the current work. Furthermore, to highlight the subtle differences between "inducing points" and what this work calls the *hypothetical dataset* (2), it is worth observing that in the literature on sparse approximations to Gaussian processes the inducing values are identified with exact evaluations of the latent function, u = u(Z), so that the counterpart of S carries no additional degrees of freedom. Under this assumption, equations (4) and (5) reduce to the standard conditional mean k(x, Z) K⁻¹ u and conditional covariance k(x, x') − k(x, Z) K⁻¹ k(Z, x'). In other words, in the sparse Gaussian processes framework the inducing points and their values fully determine the predictive distribution. In contrast, this work treats m and S as parameters of the model responsible for encoding the history of observed data. In this regard, the current work is similar to hensman2013gaussian. However, unlike hensman2013gaussian, the parameters m and S are not variational parameters of some variational distribution.
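The ARD covariance above can be sketched as follows; the helper name and the specific weight values are ours, chosen to illustrate how a zero weight switches a covariate off:

```python
import numpy as np

def ard_kernel(X, Xp, sigma_f=1.0, w=None):
    """k(x, x'; theta) = sigma_f^2 exp(-0.5 * sum_d w_d (x_d - x'_d)^2)."""
    w = np.ones(X.shape[1]) if w is None else np.asarray(w)
    diff = X[:, None, :] - Xp[None, :, :]                # pairwise differences per dimension
    return sigma_f**2 * np.exp(-0.5 * np.einsum('ijd,d->ij', diff**2, w))

X = np.random.default_rng(0).normal(size=(5, 8))         # 5 points, 8 covariates
w = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
K = ard_kernel(X, X, w=w)
# Covariates with zero ARD weight are effectively ignored by the kernel:
K_reduced = ard_kernel(X[:, :4], X[:, :4], w=w[:4])
assert np.allclose(K, K_reduced)
```

In this sense, learning the weights w_d amounts to an automatic discovery of the input dimensions that matter for prediction.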

Modern datasets are rapidly growing in size and complexity, and there is a pressing need to develop new statistical methods and machine learning techniques to harness this wealth of data. This work presented a novel regression framework for encoding massive amounts of data into a small number of *hypothetical* data points. While being effective, the resulting model is conceptually very simple, is based on the idea of making Gaussian processes *parametric*, and requires only a handful of mathematical formulas to explain every single detail of the algorithm. This simplicity is especially important when it comes to deploying machine learning algorithms on big data flow engines (see e.g., meng2016mllib) such as MapReduce dean2008mapreduce and Apache Spark zaharia2012resilient. Moreover, Gaussian processes are a powerful tool for probabilistic inference over functions. They offer desirable properties such as uncertainty estimates, automatic discovery of important dimensions, robustness to over-fitting, and principled ways of tuning hyper-parameters. Thus, scaling Gaussian processes to big datasets and deploying them on big data flow engines is, and will remain, an active area of research.

This work received support from the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013.
