Dimension adaptability of Gaussian process models with variable selection and projection

12/04/2011 · Surya T. Tokdar, et al.

It is now known that an extended Gaussian process model equipped with rescaling can adapt to different smoothness levels of a function-valued parameter in many nonparametric Bayesian analyses, offering a posterior convergence rate that is optimal (up to logarithmic factors) for the smoothness class the true function belongs to. This optimal rate also depends on the dimension of the function's domain, and one could potentially obtain a faster rate of convergence by casting the analysis in a lower-dimensional subspace that does not amount to any loss of information about the true function. In general such a subspace is not known a priori, but it can be explored by equipping the model with variable selection or linear projection. We demonstrate that for nonparametric regression, classification, density estimation and density regression, a rescaled Gaussian process model equipped with variable selection or linear projection offers a posterior convergence rate that is optimal (up to logarithmic factors) for the lowest dimension in which the analysis could be cast without any loss of information about the true function. Theoretical exploration of such dimension reduction features appears novel for Bayesian nonparametric models with or without Gaussian processes.


1 Introduction

Gaussian processes are widely used in Bayesian analyses for specifying prior distributions over function-valued parameters. Examples include spatio-temporal modeling (Handcock and Stein, 1993; Kim et al., 2005; Banerjee et al., 2008), computer emulation (Sacks et al., 1989; Kennedy and O'Hagan, 2001; Oakley and O'Hagan, 2002; Gramacy and Lee, 2008), nonparametric regression and classification (Neal, 1998; Csató et al., 2000; Rasmussen and Williams, 2006; Short et al., 2007), density estimation (Lenk, 1988; Tokdar, 2007), density and quantile regression (Tokdar et al., 2010; Tokdar and Kadane, 2011), functional data analysis (Shi and Wang, 2008; Petrone et al., 2009) and image analysis (Sudderth and Jordan, 2009). Rasmussen and Williams (2006) give a thorough overview of likelihood-based exploration of Gaussian process models, including Bayesian treatments.

Theoretical properties of many Bayesian Gaussian process models have been well researched (see Tokdar and Ghosh, 2007; Choi and Schervish, 2007; Ghosal and Roy, 2006; van der Vaart and van Zanten, 2008, 2009; de Jonge and van Zanten, 2010; Castillo, 2011, and the references therein). In particular, van der Vaart and van Zanten (2009) present a remarkable adaptation property of such models for nonparametric regression, classification and density estimation. They show that a common Gaussian process (GP) prior specification equipped with a suitable rescaling parameter offers posterior convergence at near-optimal minimax asymptotic rates across many classes of finitely and infinitely differentiable true functions. The rescaling parameter is a stochastic counterpart of a global bandwidth parameter commonly seen in smoothing-based non-Bayesian methodology. However, a single prior distribution on the rescaling parameter is enough to ensure near-optimal convergence across all these classes of functions.

In this article we explore additional adaptation properties of GP models that are also equipped with variable selection or linear projection. To appreciate the practical utility of this exercise, consider a nonparametric (mean) regression model $y_i = f(x_i) + \epsilon_i$, $i = 1, \ldots, n$, where the $x_i$'s are $d$-dimensional and the $\epsilon_i$'s are independent draws from a zero mean normal distribution. When $f$ is assigned a suitable GP prior distribution equipped with a rescaling parameter and the true conditional mean function $f_0$ is Hölder $\beta$-smooth (Section 2.1), the posterior distribution on $f$ converges to $f_0$ at a rate $n^{-\beta/(2\beta+d)}(\log n)^{\kappa}$. This rate, without the $(\log n)^{\kappa}$ term, is optimal for such an $f_0$ in a minimax asymptotic sense (Stone, 1982). Now suppose $f_0(x)$ depends on $x$ only through its first two coordinates $(x_1, x_2)$. If this information were known, we could cast the model as $y_i = g(x_{i1}, x_{i2}) + \epsilon_i$ and assign $g$ a GP prior distribution with rescaling to obtain a faster convergence rate of $n^{-\beta/(2\beta+2)}(\log n)^{\kappa}$. If in addition we knew that $f_0$ depends only on the difference of the first two coordinates of $x$, then we would instead cast the model as $y_i = h(x_{i1} - x_{i2}) + \epsilon_i$ and, with a rescaled GP prior on $h$, obtain an even faster convergence rate of $n^{-\beta/(2\beta+1)}(\log n)^{\kappa}$.
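
To make the gain from dimension reduction concrete, the polynomial parts of these rates can be compared numerically. The sketch below is ours, not the paper's; the helper name `rate` is purely illustrative. It evaluates $n^{-\beta/(2\beta+d)}$ for a full ten-dimensional analysis, a two-coordinate reduction and a one-dimensional projection.

```python
# Minimal sketch: compare the polynomial part n^{-beta/(2*beta + d)} of the
# minimax rate (Stone, 1982) across effective dimensions d.  The function
# name `rate` is ours, purely for illustration.

def rate(n: int, beta: float, d: int) -> float:
    """Polynomial part of the optimal rate for a Hölder beta-smooth
    function of d effective variables."""
    return n ** (-beta / (2 * beta + d))

if __name__ == "__main__":
    n, beta = 10_000, 2.0
    for d in (10, 2, 1):  # full dimension, two coordinates, one linear combination
        print(f"d = {d:2d}:  n^(-beta / (2 beta + d)) = {rate(n, beta, d):.4f}")
    # Smaller values mean faster posterior contraction; the d = 1 projection
    # gives the fastest rate in this example.
```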

In practice, we do not know what sort of lower-dimensional projections of $x$ perfectly explain the dependence of $y$ on $x$. But this could be explored by extending the GP model to include selection of variables (Linkletter et al., 2006) or linear projection onto lower-dimensional subspaces (Tokdar et al., 2010). The questions we seek to answer are as follows. Do GP models equipped with rescaling and variable selection offer a posterior convergence rate of $n^{-\beta/(2\beta+d_0)}(\log n)^{\kappa}$ when the true $f_0$ is a Hölder $\beta$-smooth function that depends only on $d_0$ many coordinates of its argument? More generally, do GP models equipped with rescaling and linear projection offer a posterior convergence rate of $n^{-\beta/(2\beta+d_0)}(\log n)^{\kappa}$ when the true $f_0$ is a Hölder $\beta$-smooth function such that $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection of $x$? We demonstrate that the answer to either question is affirmative for extensions of the so-called square exponential GP models in nonparametric mean regression, classification, density estimation and density regression.

Although projection or selection based dimension reduction is routinely employed in a variety of Bayesian nonparametric models with or without the use of Gaussian processes (see, for example, Rodriguez and Dunson, 2011), their theoretical implications have not been fully explored. The best results so far demonstrate posterior consistency (Tokdar et al., 2010; Pati et al., 2011), which already holds without these advanced features. Our results indicate that there is indeed an added advantage in terms of possibly faster posterior convergence rates. These results, with the necessary details, are presented in Section 2, which, we hope, can be appreciated by all readers interested in Bayesian nonparametric models, with or without technical knowledge about Gaussian processes. Section 3 presents a set of deeper and more fundamental results, with non-trivial extensions of results presented in van der Vaart and van Zanten (2009). However, we have tried our best to make our calculations easily accessible to other researchers interested in studying extensions of GP models with additional adaptation features. We conclude in Section 4 with remarks on density regression versus density estimation and on recent, unpublished work on a similar topic by Bhattacharya et al. (2011).

2 Main results

2.1 Extending a rescaled GP with variable selection or projection

We will restrict ourselves to nonparametric models where a function-valued parameter $f$, to be modeled by a GP or its extensions, is defined over a compact subset of $\mathbb{R}^d$ for some $d \ge 1$. Without loss of generality we can assume this set to be equal to $B_d = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, the unit disc centered at the origin. If the actual domain of $f$ is not elliptic, such as a rectangle giving bounds on each coordinate of the argument $x$, we will simply shift and scale it to fit inside $B_d$. Working on the larger domain $B_d$ poses no technical difficulties.
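
As a concrete illustration of this preprocessing step (our own sketch, not code from the paper), a rectangular domain $[l_1, u_1] \times \cdots \times [l_d, u_d]$ can be mapped affinely into the unit disc as follows.

```python
import numpy as np

def to_unit_disc(x, lower, upper):
    """Affinely map points from the rectangle [lower, upper] into the unit
    disc: send the rectangle to [-1, 1]^d, then shrink by sqrt(d) so that
    every corner has Euclidean norm at most 1.  Illustrative helper only."""
    x, lower, upper = map(np.asarray, (x, lower, upper))
    z = 2.0 * (x - lower) / (upper - lower) - 1.0   # now in [-1, 1]^d
    return z / np.sqrt(x.shape[-1])                 # now inside the unit disc

# Example: covariates recorded on [0, 10] x [20, 80]
x = np.array([[0.0, 20.0], [10.0, 80.0], [5.0, 50.0]])
print(to_unit_disc(x, lower=[0.0, 20.0], upper=[10.0, 80.0]))
```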

Let $W = (W_t : t \in \mathbb{R}^d)$ be a separable, zero mean Gaussian process with an isotropic, square exponential covariance function $\mathrm{Cov}(W_s, W_t) = \exp(-\|s - t\|^2)$. For any $a > 0$, $\gamma \in \{0, 1\}^d$ and $d \times d$ orthogonal matrix $B$, define $W^{a,\gamma,B} = (W^{a,\gamma,B}_x : x \in B_d)$ by

$$W^{a,\gamma,B}_x = W_{a\,\mathrm{diag}(\gamma) B x}, \qquad x \in B_d, \tag{1}$$

where for any vector $v$, $\mathrm{diag}(v)$ denotes the diagonal matrix with the elements of $v$ on its diagonal. Note that $\mathrm{diag}(\gamma) B x = \mathrm{diag}(\gamma) B x'$ if and only if $P x = P x'$, where $P$ is the orthogonal projection matrix $B^{\mathrm{T}} \mathrm{diag}(\gamma) B$. Therefore the law of $W^{a,\gamma,B}$ defines a probability measure on functions $f : B_d \to \mathbb{R}$ such that $f(x)$ depends on $x$ only through the projection $P x$. Also note that with $B = I_d$, the $d$-dimensional identity matrix, $W^{a,\gamma,I_d}$ simply projects along the axes selected by $\gamma$.

Let $|\gamma|$ denote the number of ones in a $\gamma \in \{0,1\}^d$. Suppose $(A, \gamma, B)$ are distributed as

$$A \sim g, \qquad \gamma \sim \nu, \qquad B \sim \mathrm{Unif}(\mathcal{O}_d), \qquad \text{mutually independently}, \tag{2}$$

independently of $W$, where $\mathcal{O}_d$ is the group of $d \times d$ orthogonal matrices, $\nu$ is a strictly positive probability mass function on $\{0,1\}^d$ and $g$ is a strictly positive probability density function on $(0, \infty)$. When $d = 1$, we simply take $B$ to be degenerate at 1. The law of the process $W^{A,\gamma,B}$, which extends the square exponential GP law by equipping it with rescaling and linear projection, will be denoted $\Pi_{\mathrm{proj}}$. Similarly, the law of the process $W^{A,\gamma,I_d}$, which extends the square exponential GP law by equipping it with rescaling and variable selection, will be denoted $\Pi_{\mathrm{sel}}$.
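
The following simulation sketch is ours, built on the construction above: the Bernoulli and Gamma choices standing in for $\nu$ and $g$, and the function name `sample_extended_gp`, are illustrative assumptions rather than the paper's specification. It draws one path of the variable-selection extension (the $B = I_d$ case): sample $(\gamma, A)$, project and rescale the inputs, then sample the square exponential GP at those inputs through its covariance matrix.

```python
import numpy as np

def se_cov(s, t):
    """Isotropic square exponential covariance exp(-||s - t||^2)."""
    d2 = ((s[:, None, :] - t[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

def sample_extended_gp(x, p_incl=0.5, rng=None):
    """Draw one sample path of the rescaled, variable-selection extension
    (the B = I_d case of equation (1)): choose gamma and A, then evaluate
    the GP at the projected, rescaled inputs A * (gamma * x)."""
    rng = np.random.default_rng(rng)
    n, d = x.shape
    gamma = rng.binomial(1, p_incl, size=d)   # variable selection indicator
    a = rng.gamma(shape=2.0, scale=1.0)       # rescaling (inverse bandwidth) draw
    t = a * x * gamma                         # projected, rescaled inputs
    k = se_cov(t, t) + 1e-8 * np.eye(n)       # jitter for numerical stability
    return gamma, a, np.linalg.cholesky(k) @ rng.standard_normal(n)

x = np.random.default_rng(0).uniform(-1, 1, size=(50, 3)) / np.sqrt(3)
gamma, a, path = sample_extended_gp(x, rng=1)
print(gamma, round(a, 2), path[:3])
```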

In the sequel, a function $f$ defined on a compact subset $\mathcal{X}$ of $\mathbb{R}^k$ is called Hölder $\beta$-smooth for some $\beta > 0$ if it has bounded continuous derivatives (in the interior of $\mathcal{X}$) up to order $\underline{\beta}$, the largest integer smaller than $\beta$, with all its $\underline{\beta}$-th order partial derivatives being Hölder continuous with exponent $\beta - \underline{\beta}$.
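
For readers who prefer a displayed version, the standard form of this smoothness class (our restatement, consistent with van der Vaart and van Zanten, 2009, and not verbatim from this paper) is:

```latex
% Hölder beta-smooth class on a compact X in R^k, with \underline{\beta}
% the largest integer strictly smaller than beta (our restatement):
\[
  C^{\beta}(\mathcal{X}) =
  \Bigl\{ f : \mathcal{X} \to \mathbb{R} \;:\;
    \max_{|m| \le \underline{\beta}} \sup_{x} |D^{m} f(x)|
    + \max_{|m| = \underline{\beta}} \sup_{x \neq y}
      \frac{|D^{m} f(x) - D^{m} f(y)|}{\|x - y\|^{\beta - \underline{\beta}}}
    < \infty \Bigr\}.
\]
```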

2.2 Mean regression with Gaussian errors

Nonparametric regression of a response variable $y$ on a vector of covariates $x$ with Gaussian errors comes in two flavors, depending on how the design points, i.e., the covariate values, are obtained. They could either be fixed in an experimental study or measured as part of an observational study. The notion of posterior convergence differs slightly across the two contexts; a brief overview is given below.

Fixed design regression.

Suppose real-valued observations $y_1, y_2, \ldots$ are modeled as $y_i = f(x_i) + \epsilon_i$ for a given sequence of points $x_1, x_2, \ldots$ from $B_d$, with independent, $N(0, \sigma^2)$ errors $\epsilon_i$. Assume $(f, \sigma)$ is assigned a prior distribution $\Pi \times \pi$, where $\Pi$ is either $\Pi_{\mathrm{sel}}$ or $\Pi_{\mathrm{proj}}$ and $\pi$ is a probability measure with a compact support inside $(0, \infty)$ that has a Lebesgue density strictly positive on this support.

Let $\Pi_n(\cdot \mid y_1, \ldots, y_n)$ denote the posterior distribution of $(f, \sigma)$ given only the first $n$ observations $y_1, \ldots, y_n$, i.e.,

$$\Pi_n(A \mid y_1, \ldots, y_n) = \frac{\int_A \prod_{i=1}^n \sigma^{-1}\varphi\bigl((y_i - f(x_i))/\sigma\bigr)\, \Pi(df)\, \pi(d\sigma)}{\int \prod_{i=1}^n \sigma^{-1}\varphi\bigl((y_i - f(x_i))/\sigma\bigr)\, \Pi(df)\, \pi(d\sigma)},$$

where $\varphi$ denotes the standard normal density.

For every $n$, define a design-dependent metric $\|\cdot\|_n$ on $C(B_d)$ as $\|f\|_n^2 = n^{-1} \sum_{i=1}^n f(x_i)^2$. Let $\epsilon_n$ be a sequence of positive numbers with $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$. For any fixed $f_0 \in C(B_d)$ and $\sigma_0 > 0$ we say the posterior converges at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\Pi_n\bigl((f, \sigma) : \|f - f_0\|_n + |\sigma - \sigma_0| > M \epsilon_n \mid y_1, \ldots, y_n\bigr) \stackrel{P}{\to} 0$$

whenever $y_i = f_0(x_i) + \epsilon_i$ with independent $\epsilon_i \sim N(0, \sigma_0^2)$. Here “$\stackrel{P}{\to}$” indicates convergence in probability.
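
The design-dependent metric just introduced is easy to compute; the following sketch (ours, with hypothetical functions) evaluates $\|f - f_0\|_n$ on a set of design points.

```python
import numpy as np

def empirical_l2(f, f0, x):
    """Design-dependent metric ||f - f0||_n = sqrt( n^{-1} sum_i (f(x_i) - f0(x_i))^2 )
    used to measure posterior contraction in the fixed design setting."""
    diffs = f(x) - f0(x)
    return float(np.sqrt(np.mean(diffs ** 2)))

# Hypothetical example on design points inside the unit disc in R^2
x = np.random.default_rng(0).uniform(-0.7, 0.7, size=(100, 2))
f0 = lambda z: np.sin(3 * z[:, 0])          # truth depends on the first coordinate only
f  = lambda z: np.sin(3 * z[:, 0]) + 0.05   # a nearby candidate
print(empirical_l2(f, f0, x))               # equals 0.05 here
```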

Random design regression.

In the random design setting we have observations $(x_1, y_1), (x_2, y_2), \ldots$, which are partially modeled as $y_i = f(x_i) + \epsilon_i$ with errors $\epsilon_i \sim N(0, \sigma^2)$. The design points $x_1, x_2, \ldots$ are assumed to be independent observations from an unknown probability distribution $Q$ on $B_d$, and the $x_i$'s and $\epsilon_i$'s are assumed independent. However, inference on $Q$ is not deemed important. Assume $(f, \sigma)$ is assigned a prior distribution $\Pi \times \pi$ as in the previous subsection, and the corresponding posterior distribution based on the first $n$ observations is denoted $\Pi_n(\cdot \mid x_1, y_1, \ldots, x_n, y_n)$.

Let $\|\cdot\|_{2,Q}$ denote the $L_2$-metric with respect to $Q$, i.e., $\|f\|_{2,Q}^2 = \int f^2\, dQ$. Consider a sequence $\epsilon_n$ as before. For any fixed $f_0 \in C(B_d)$ and $\sigma_0 > 0$ we say the posterior converges at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\Pi_n\bigl((f, \sigma) : \|f - f_0\|_{2,Q} + |\sigma - \sigma_0| > M \epsilon_n \mid x_1, y_1, \ldots, x_n, y_n\bigr) \stackrel{P}{\to} 0$$

whenever $y_i = f_0(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma_0^2)$ independently across $i$.

Note that in either setting convergence at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ also implies convergence of the (marginal) posterior distribution on $f$ to $f_0$ at the same rate (or faster). For either setting we can state the following dimension adaptation result.

Theorem 1.

Assume $f_0$ is a Hölder $\beta$-smooth function on $B_d$ and $\sigma_0$ is inside the support of $\pi$. If $f_0(x)$ depends on $x$ only through $d_0$ many coordinates of $x$ and $\Pi = \Pi_{\mathrm{sel}}$, the posterior converges at $(f_0, \sigma_0)$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$. Furthermore, if $d \ge 2$, $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection of $x$ and $\Pi = \Pi_{\mathrm{proj}}$, the posterior converges at $(f_0, \sigma_0)$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$.

2.3 Classification

Suppose observations $(x_1, y_1), (x_2, y_2), \ldots$, with $y_i \in \{0, 1\}$, are (partially) modeled as $\Pr(y_i = 1 \mid x_i) = \Psi(f(x_i))$, independently across $i$, where $\Psi$ is the standard normal or the logistic cumulative distribution function, with the $x_i$'s assumed to be independent draws from a probability distribution $Q$ on $B_d$. Assume $f$ is assigned a prior distribution $\Pi$ which is either $\Pi_{\mathrm{sel}}$ or $\Pi_{\mathrm{proj}}$, and let $\Pi_n(\cdot \mid x_1, y_1, \ldots, x_n, y_n)$ denote the corresponding posterior distribution based on the first $n$ observations, i.e.,

$$\Pi_n(A \mid x_1, y_1, \ldots, x_n, y_n) = \frac{\int_A \prod_{i=1}^n \Psi(f(x_i))^{y_i}\,\{1 - \Psi(f(x_i))\}^{1 - y_i}\, \Pi(df)}{\int \prod_{i=1}^n \Psi(f(x_i))^{y_i}\,\{1 - \Psi(f(x_i))\}^{1 - y_i}\, \Pi(df)}.$$

Consider a sequence $\epsilon_n$ as before. For any fixed $f_0 \in C(B_d)$ we say the posterior converges at $f_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\Pi_n\bigl(f : \|f - f_0\|_{2,Q} > M \epsilon_n \mid x_1, y_1, \ldots, x_n, y_n\bigr) \stackrel{P}{\to} 0$$

whenever $x_i \sim Q$ and $y_i \mid x_i \sim \mathrm{Bernoulli}(\Psi(f_0(x_i)))$, independently across $i$.

Theorem 2.

Let $f_0$ be a Hölder $\beta$-smooth function on $B_d$. If $f_0(x)$ depends on $x$ only through $d_0$ many coordinates of $x$ and $\Pi = \Pi_{\mathrm{sel}}$, the posterior converges at $f_0$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$. Furthermore, if $d \ge 2$, $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection of $x$ and $\Pi = \Pi_{\mathrm{proj}}$, the posterior converges at $f_0$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$.

2.4 Density or point pattern intensity estimation

Consider observations $x_1, x_2, \ldots$, modeled as independent draws from a probability density $p$ on $B_d$ that can be written as

$$p(x) = \frac{\psi(x)\, e^{f(x)}}{\int_{B_d} \psi(t)\, e^{f(t)}\, dt}$$

for some fixed, non-negative function $\psi$ and some unknown $f \in C(B_d)$. Models of this type also arise in analyzing spatial point pattern data with non-homogeneous Poisson process models, where the intensity function is expressed as $\psi(x)\, e^{f(x)}$. Assume $f$ is assigned the prior distribution $\Pi$ which is either $\Pi_{\mathrm{sel}}$ or $\Pi_{\mathrm{proj}}$, and let $\Pi^*$ denote the induced prior distribution on $p$. The corresponding posterior distribution based on $x_1, \ldots, x_n$ is given by

$$\Pi^*_n(A \mid x_1, \ldots, x_n) = \frac{\int_A \prod_{i=1}^n p(x_i)\, \Pi^*(dp)}{\int \prod_{i=1}^n p(x_i)\, \Pi^*(dp)}.$$
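
A minimal numerical illustration of forming a density proportional to $\psi\, e^{f}$ (our own sketch; the one-dimensional grid, the constant $\psi$ and the particular $f$ are illustrative assumptions, and grid normalization is not the paper's computational method):

```python
import numpy as np

# Sketch: form the density p(x) proportional to psi(x) * exp(f(x)) on a grid.
grid = np.linspace(-1.0, 1.0, 401)
dx = grid[1] - grid[0]
psi = np.ones_like(grid)              # base function: constant on [-1, 1]
f = np.sin(4 * grid)                  # a hypothetical draw of the GP-type function
unnorm = psi * np.exp(f)
p = unnorm / (unnorm.sum() * dx)      # normalize so p integrates to (approximately) one
print(p.sum() * dx)                   # ~1.0
```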

Let $h$ denote the Hellinger metric, $h(p_1, p_2) = \{\int (\sqrt{p_1} - \sqrt{p_2})^2\}^{1/2}$.

Consider a sequence $\epsilon_n$ as before. For any fixed density $p_0$ on $B_d$, we say the posterior converges at $p_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\Pi^*_n\bigl(p : h(p, p_0) > M \epsilon_n \mid x_1, \ldots, x_n\bigr) \stackrel{P}{\to} 0$$

whenever the $x_i$'s are independent draws from $p_0$.

Theorem 3.

Let $p_0$ be a probability density on $B_d$ satisfying $p_0(x) \propto \psi(x)\, e^{f_0(x)}$ for some Hölder $\beta$-smooth $f_0$. If $f_0(x)$ depends on $x$ only through $d_0$ many coordinates of $x$ and $\Pi = \Pi_{\mathrm{sel}}$, the posterior converges at $p_0$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$. Furthermore, if $d \ge 2$, $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection of $x$ and $\Pi = \Pi_{\mathrm{proj}}$, the posterior converges at $p_0$ at a rate $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$.

In the above theorem, the conditions on $p_0$ are equivalent to saying that $p_0(x)/\psi(x)$ varies in $x$ only along a linear subspace of the variable $x$. In the context of two-dimensional point pattern models, this implies that the intensity function, relative to $\psi$, is constant over the spatial domain or constant along a certain direction.

2.5 Density regression

Consider again observations $(x_1, y_1), (x_2, y_2), \ldots$, where we want to develop a regression model between the $y_i$'s and the $x_i$'s. In density regression the entire conditional density, and not just the conditional mean, of $y$ given $x$ is modeled nonparametrically. Tokdar et al. (2010) consider the model $y_i \mid x_i \sim p(\cdot \mid x_i)$, independently across $i$, where the conditional densities $p(\cdot \mid x)$, $x \in B_d$, are given by point by point logistic transforms of a function $w$ on $B_d \times [0, 1]$:

$$p(y \mid x) = \frac{\exp\{w(x, \Phi_0(y))\}\, \phi_0(y)}{\int_0^1 \exp\{w(x, u)\}\, du} \tag{3}$$

for some fixed probability density $\phi_0$ on $\mathbb{R}$ with cumulative distribution function $\Phi_0$. To construct a suitable prior distribution for $w$, we consider an extension of the process introduced in Section 2.1.
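
The sketch below is ours; the surface $w$, the standard normal base density and the grid normalization are illustrative assumptions layered on our reading of the transform in (3). It computes a conditional density $p(\cdot \mid x)$ from a given $w$.

```python
import numpy as np
from scipy.stats import norm

def conditional_density(w, x, y_grid, base=norm):
    """Sketch of the point-by-point logistic transform in (3): map a surface
    w(x, u), u in [0, 1], to a conditional density p(y | x), using `base`
    (standard normal here) as the fixed density with cdf Phi_0."""
    u = base.cdf(y_grid)                       # transform y to the unit interval
    vals = np.exp(w(x, u)) * base.pdf(y_grid)  # unnormalized conditional density
    du = np.gradient(y_grid)
    return vals / np.sum(vals * du)            # normalize over the y grid

# Hypothetical surface: the location of y shifts with the first coordinate of x
w = lambda x, u: 3.0 * x[0] * (u - 0.5)
y_grid = np.linspace(-4.0, 4.0, 801)
p = conditional_density(w, x=np.array([0.8, -0.2]), y_grid=y_grid)
print(np.sum(p * np.gradient(y_grid)))         # ~1.0
```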

Let $\tilde{W} = (\tilde{W}_{(t,u)} : (t, u) \in \mathbb{R}^d \times \mathbb{R})$ be a separable, zero mean Gaussian process with isotropic, square-exponential covariance function $\mathrm{Cov}(\tilde{W}_{(s,u)}, \tilde{W}_{(t,v)}) = \exp\{-\|s - t\|^2 - (u - v)^2\}$. Define $\tilde{W}^{a,\gamma,B} = (\tilde{W}^{a,\gamma,B}_{(x,u)} : x \in B_d, u \in [0,1])$ as

$$\tilde{W}^{a,\gamma,B}_{(x,u)} = \tilde{W}_{(a\,\mathrm{diag}(\gamma) B x,\; a u)}. \tag{4}$$

Let $\tilde{\Pi}_{\mathrm{proj}}$ and $\tilde{\Pi}_{\mathrm{sel}}$, respectively, denote the laws of the processes $\tilde{W}^{A,\gamma,B}$ and $\tilde{W}^{A,\gamma,I_d}$ where $(A, \gamma, B)$ are distributed as in (2) and independently of $\tilde{W}$.

Now suppose $w$ in (3) is assigned the prior distribution $\tilde{\Pi}$ which is either $\tilde{\Pi}_{\mathrm{sel}}$ or $\tilde{\Pi}_{\mathrm{proj}}$, and denote the induced prior distribution on the conditional densities $p(\cdot \mid \cdot)$ by $\tilde{\Pi}^*$. The corresponding posterior distribution based on $(x_1, y_1), \ldots, (x_n, y_n)$ is given by

$$\tilde{\Pi}^*_n(A \mid x_1, y_1, \ldots, x_n, y_n) = \frac{\int_A \prod_{i=1}^n p(y_i \mid x_i)\, \tilde{\Pi}^*(dp)}{\int \prod_{i=1}^n p(y_i \mid x_i)\, \tilde{\Pi}^*(dp)}.$$

Let $h_Q$ denote the integrated Hellinger metric for a probability distribution $Q$ on $B_d$, i.e., $h_Q(p_1, p_2)^2 = \int h\bigl(p_1(\cdot \mid x), p_2(\cdot \mid x)\bigr)^2\, Q(dx)$.

Consider a sequence $\epsilon_n$ as before. For any fixed $p_0(\cdot \mid \cdot)$, we say the posterior converges at $p_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\tilde{\Pi}^*_n\bigl(p : h_Q(p, p_0) > M \epsilon_n \mid x_1, y_1, \ldots, x_n, y_n\bigr) \stackrel{P}{\to} 0$$

whenever $x_i \sim Q$ and $y_i \mid x_i \sim p_0(\cdot \mid x_i)$, independently across $i$.

Theorem 4.

Let $p_0(\cdot \mid \cdot)$ satisfy (3) with $w = w_0$ for a $w_0$ on $B_d \times [0, 1]$ that is Hölder $\beta$-smooth. If $w_0(x, u)$ depends on $x$ only through $d_0$ many coordinates of $x$ and $\tilde{\Pi} = \tilde{\Pi}_{\mathrm{sel}}$, the posterior converges at $p_0$ at a rate $n^{-\beta/(2\beta + d_0 + 1)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$. Furthermore, if $d \ge 2$, $w_0(x, u)$ depends on $x$ only through a rank-$d_0$ linear projection of $x$ and $\tilde{\Pi} = \tilde{\Pi}_{\mathrm{proj}}$, the posterior converges at $p_0$ at a rate $n^{-\beta/(2\beta + d_0 + 1)} (\log n)^{\kappa}$ for every sufficiently large $\kappa$.

3 Adaptation properties of GP extensions

Ghosal et al. (2000), later refined by Ghosal and van der Vaart (2007), provide a set of three sufficient conditions that can be used to establish posterior convergence rates for Bayesian nonparametric models for independent observations. One of these conditions relates to prior concentration at the true function, and the other two relate to the existence of a sequence of compact sets which have relatively small sizes but receive large probabilities from the prior distribution.

For the results stated in Theorems 1, 2 and 3, relating respectively to mean regression, classification or density estimation, these three sufficient conditions map one to one (van der Vaart and van Zanten, 2008) to the following conditions on an extended GP $W$ with law $\Pi_{\mathrm{sel}}$ or $\Pi_{\mathrm{proj}}$ as appropriate, the true function $f_0$ and the desired rate $\epsilon_n$: there exist sets $B_n \subset C(B_d)$ and a sequence $\bar{\epsilon}_n \ge \epsilon_n$ with $n \bar{\epsilon}_n^2 \to \infty$, such that for all sufficiently large $n$,

$$\Pr\bigl(\|W - f_0\|_\infty \le \epsilon_n\bigr) \ge e^{-n \epsilon_n^2}, \tag{5}$$
$$\Pr\bigl(W \notin B_n\bigr) \le e^{-4 n \bar{\epsilon}_n^2}, \tag{6}$$
$$\log N\bigl(\bar{\epsilon}_n, B_n, \|\cdot\|_\infty\bigr) \le n \bar{\epsilon}_n^2, \tag{7}$$

where $N(\epsilon, B, \rho)$ denotes the minimum number of balls of radius $\epsilon$ (with respect to a metric $\rho$) needed to cover a set $B$. For the density regression results stated in Theorem 4, the sufficient conditions also map one to one to the above, but with $W$ now following either $\tilde{\Pi}_{\mathrm{sel}}$ or $\tilde{\Pi}_{\mathrm{proj}}$ and with $f_0$ replaced by $w_0$. This can be proved along the lines of Theorem 3.2 of van der Vaart and van Zanten (2008), by looking at the joint density of $(x, y)$ determined by $Q$ and $p(\cdot \mid \cdot)$.

We verify these conditions in the following subsections by extending the calculations presented in van der Vaart and van Zanten (2009). A fundamental ingredient of these calculations is the reproducing kernel Hilbert space (RKHS) associated with a Gaussian process. The RKHS of $W$ is defined as the set of functions that can be represented as $t \mapsto \mathrm{E}(W_t H)$ for some $H$ in the $L_2$-closure of the linear span of $\{W_t : t \in \mathbb{R}^d\}$. Similar definitions apply to the processes $W^{a,\gamma,B}$ and $\tilde{W}^{a,\gamma,B}$ with domains $B_d$ and $B_d \times [0, 1]$ instead of $\mathbb{R}^d$.

In Lemma 4.1 of van der Vaart and van Zanten (2009), the RKHS of $W$ is identified as the set of functions $h$ such that $h(t) = \mathrm{Re} \int e^{i \lambda^{\mathrm{T}} t}\, \psi(\lambda)\, \mu(d\lambda)$ for some $\psi \in L_2(\mu)$, where $\mathrm{Re}(z)$ denotes the real part of a complex number $z$, $i$ is the square root of $-1$ and $\mu$ is the (unique) spectral measure on $\mathbb{R}^d$ of $W$, satisfying $\mathrm{Cov}(W_s, W_t) = \int e^{i \lambda^{\mathrm{T}} (s - t)}\, \mu(d\lambda)$. For the isotropic, square exponential GP $W$, the spectral measure $\mu$ is the $d$-dimensional Gaussian probability measure with mean zero and variance matrix $2 I_d$. The RKHS norm of such an $h$ is precisely the minimum of $\|\psi\|_{L_2(\mu)}$ over all $\psi$ representing $h$ as above.
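
A quick numerical sanity check of this spectral representation (our own sketch): drawing frequencies from the $N(0, 2 I_d)$ spectral measure, the Monte Carlo average of $\cos\{\lambda^{\mathrm{T}}(s - t)\}$ should approximate $\exp(-\|s - t\|^2)$.

```python
import numpy as np

# Sketch: check exp(-||s - t||^2) = E[cos(lambda' (s - t))] with
# lambda ~ N(0, 2 I_d), the spectral measure of the square exponential kernel.
rng = np.random.default_rng(0)
d = 3
s, t = rng.uniform(-0.5, 0.5, d), rng.uniform(-0.5, 0.5, d)
lam = rng.normal(0.0, np.sqrt(2.0), size=(200_000, d))   # spectral draws
mc = np.cos(lam @ (s - t)).mean()                        # Monte Carlo estimate
exact = np.exp(-np.sum((s - t) ** 2))
print(mc, exact)                                         # the two should be close
```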

By a simple change of variables it follows that the RKHS of $W^{a,\gamma,B}$, for any $a > 0$, $\gamma \in \{0,1\}^d$ and orthogonal $B$, is given by functions $h$ on $B_d$ such that $h(x) = \mathrm{Re} \int e^{i \lambda^{\mathrm{T}} x}\, \psi(\lambda)\, \mu^{a,\gamma,B}(d\lambda)$ for some $\psi \in L_2(\mu^{a,\gamma,B})$, with RKHS norm equal to the minimum of $\|\psi\|_{L_2(\mu^{a,\gamma,B})}$ over all such representations, where $\mu^{a,\gamma,B}$ is the $d$-dimensional Gaussian probability measure with mean 0 and variance matrix $2 a^2 B^{\mathrm{T}} \mathrm{diag}(\gamma) B$. In the rest of the paper this RKHS is denoted $\mathbb{H}^{a,\gamma,B}$ and $\mathbb{H}^{a,\gamma,B}_1$ is used to denote its unit ball at the origin. Also, $C(B_d)$ is used to denote the Banach space of continuous functions on $B_d$ equipped with the supremum norm $\|\cdot\|_\infty$. The unit ball at the origin of this space is denoted $C_1(B_d)$.

3.1 Variable selection extension

To start with, let $W^{a,\gamma}$ denote $W^{a,\gamma,B}$ with $B$ fixed at $I_d$, the $d$-dimensional identity matrix, and let $\mathbb{H}^{a,\gamma}$ stand for the corresponding RKHS $\mathbb{H}^{a,\gamma,I_d}$. For any $\gamma \in \{0,1\}^d$ and $x \in B_d$, let $x_\gamma$ denote the $|\gamma|$-dimensional vector of coordinates of $x$ selected by $\gamma$. Also, for any $k \ge 1$ and any $\beta > 0$, let $C^{\beta}(B_k)$ denote the class of Hölder $\beta$-smooth, real functions on $B_k$. Notice that if a Hölder $\beta$-smooth $f_0$ depends only on $d_0$ many coordinates of its argument then there exist $\gamma_0 \in \{0,1\}^d$ with $|\gamma_0| = d_0$ and $g_0 \in C^{\beta}(B_{d_0})$ such that $f_0(x) = g_0(x_{\gamma_0})$ for all $x \in B_d$.

Theorem 5.

Let $f_0$ satisfy $f_0(x) = g_0(x_{\gamma_0})$ for some $\gamma_0 \in \{0,1\}^d$ with $|\gamma_0| = d_0$ and some $g_0 \in C^{\beta}(B_{d_0})$. Then for every sufficiently large $\kappa > 0$, there exist measurable subsets $B_n \subset C(B_d)$ and a constant $C > 0$ such that (5)-(7) hold with $W \sim \Pi_{\mathrm{sel}}$, $\epsilon_n = C n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$ and $\bar{\epsilon}_n = C n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa}$.

Proof.

Define $\iota_{\gamma_0} : B_{d_0} \to B_d$ by $\iota_{\gamma_0}(z) = z^{*}$, where $z^{*}$ denotes the unique zero-insertion expansion of $z$ to a $d$-dimensional vector such that $(z^{*})_{\gamma_0} = z$. For any $a > 0$, for every $x \in B_d$ with $x_{\gamma_0} = z$ we have $W^{a,\gamma_0}_x = W^{a,\gamma_0}_{\iota_{\gamma_0}(z)}$ and $f_0(x) = g_0(z)$. So

$$\Pr\bigl(\|W^{A,\gamma} - f_0\|_\infty \le \epsilon_n\bigr) \ge \nu(\gamma_0)\, \Pr\Bigl(\sup_{z \in B_{d_0}} \bigl|W^{A,\gamma_0}_{\iota_{\gamma_0}(z)} - g_0(z)\bigr| \le \epsilon_n\Bigr).$$

From calculations presented in Section 5.1 of van der Vaart and van Zanten (2009) it follows that this last probability is at least $e^{-n \delta_n^2}$ for $\delta_n$ a large multiple of $n^{-\beta/(2\beta + d_0)} (\log n)^{\kappa_0}$, with $\kappa_0$ depending only on $\beta$ and $d_0$, and all sufficiently large $n$. This leads to (5) because $\epsilon_n$ is larger than any such $\delta_n$ by a power of $\log n$.

It follows from the proof of Theorem 3.1 of van der Vaart and van Zanten (2009) that for some universal constants, and for every sufficiently large $a$, $M$ and sufficiently small $\epsilon > 0$, the set

$$B_{a,M,\epsilon} = M\, \mathbb{H}^{a,\gamma_0}_1 + \epsilon\, C_1(B_d)$$

satisfies an exponential bound (8) on the prior probability $\Pr(W^{a,\gamma_0} \notin B_{a,M,\epsilon})$ and a bound (9) on the covering number $\log N(2\epsilon, B_{a,M,\epsilon}, \|\cdot\|_\infty)$ that grows polynomially in $a$ and logarithmically in $M/\epsilon$, for universal constants. For any $n$ define $B_n = B_{a_n, M_n, \bar{\epsilon}_n}$ where $M_n$ is a large multiple of $\sqrt{n}\, \bar{\epsilon}_n$, $a_n$ is a large multiple of $n^{1/(2\beta + d_0)}$ and $\bar{\epsilon}_n$ is as in the statement of the theorem. Then by the above inequalities, and