 # A Theoretical Framework for Bayesian Nonparametric Regression: Orthonormal Random Series and Rates of Contraction

We develop a unifying framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated L_2-distance without assuming the regression function space to be uniformly bounded. The framework is built upon orthonormal random series in a flexible manner. A general theorem for deriving rates of contraction for Bayesian nonparametric regression is provided under the proposed framework. As specific applications, we obtain the near-parametric rate of contraction for the squared-exponential Gaussian process when the true function is analytic, the adaptive rates of contraction for the sieve prior, and the adaptive-and-exact rates of contraction for the un-modified block prior when the true function is α-smooth. Extensions to wavelet series priors and fixed-design regression problems are also discussed.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the standard nonparametric regression problem , , where the predictors are referred to as design (points) and take values in , ’s are independent and identically distributed (i.i.d.) mean-zero Gaussian noises with , and ’s are the responses. We follow the popular Bayesian approach by assigning a carefully-selected prior, and perform inference tasks by finding the posterior distribution of given the observations .

We develop a theoretical framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated -distance

 ∥f−g∥2={∫X[f(x)−g(x)]2dx}1/2.

The regression function is assumed to be represented using a set of appropriate orthonormal basis functions : . The coefficients are allowed to range over the entire real line and the space of regression functions is allowed to be unbounded, including the renowned Gaussian process priors as special examples.

Rates of contraction of posterior distributions for Bayesian nonparametric priors have been studied extensively. Following the earliest framework on generic rates of contraction theorems with i.i.d. data proposed by 

, specific examples for density estimation via Dirichlet process mixture models

[4, 13, 16, 37] and location-scale mixture models [22, 46] are discussed. For nonparametric regression, the rates of contraction had not been discussed until , who develop a generic framework for fixed-design to study rates of contraction with respect to the empirical -distance. There are extensive studies for various priors that fall into this framework, including location-scale mixture priors 

, conditional Gaussian tensor-product splines

, and Gaussian processes [42, 44], among which adaptive rates are obtained in [9, 10, 44].

Although it is interesting to achieve adaptive rates of contraction with respect to the empirical -distance for nonparametric regression, this might be restrictive since the empirical -distance quantifies the convergence of functions only at the given design points. In nonparametric regression, one also expects that the error between the estimated function and the true function can be globally small over the whole design space , i.e., small mean-squared error for out-of-sample prediction. Therefore the integrated -distance is a natural choice. For Gaussian processes, ,  provide contraction rates for nonparametric regression with respect to the integrated , and -distance, respectively. A novel spike-and-slab wavelet series prior is constructed in  to achieve adaptive contraction with respect to the stronger -distance. These examples however, take advantages of their respective prior structures and may not be easily generalized. A closely related reference is , which discusses the rates of contraction of the rescaled-Gaussian process prior for nonparametric random-design regression with respect to the integrated -distance, which is weaker than the integrated -distance. Although a generic framework for the integrated -distance is presented in , the prior there is imposed on a uniformly bounded function space and hence rules out some popular priors, e.g., the popular Gaussian process prior .

It is therefore natural to ask the following fundamental question: for Bayesian nonparametric regression, can one build a unifying framework to study rates of contraction for various priors with respect to the integrated -distance without assuming the uniform boundedness of the regression function space? In this paper we provide a positive answer to this question. Inspired by , we build the general framework upon series expansion of functions with respect to a set of orthonormal basis functions. The prior is not necessarily supported on a uniformly bounded function space. A general rate of contraction theorem with respect to the integrated -distance for Bayesian nonparametric regression is provided under the proposed framework. Examples of applications falling into this framework include the finite random series prior [30, 35], the (un-modified) block prior , and the classical squared-exponential Gaussian process prior . In particular, for the block prior regression, rather than modifying the block prior by conditioning on a truncated function space as in  with a known upper bound for the unknown true regression function, we prove that the un-modified block prior automatically yields rate-exact Bayesian adaptation for nonparametric regression without such a truncation. We further extend the proposed framework to sparse additive models in high dimensions. The analyses of the above applications and extension under the proposed framework also generalize their respective results in the literature. This is made clear in Sections 3 and 4.

The literatures regarding rates of contraction for Bayesian nonparmatric models are largely based on the cutting-edge prior-concentration-and-tesing procedure proposed by  and . The major challenge of establishing a general framework for Bayesian nonparametric regression is that unlike the Hellinger distance for the density estimation and the empirical -distance for fixed-regression problem, the integrated -distance for random-design regression problem is not locally testable, complicating the construction of certain test functions. The exact definition of local testability of a distance is provided later, but a locally testable distance is the key ingredient for the prior-concentration-and-tesing procedure to work. It turns out that by modifying the testing procedure, we only require the integrated -distance to be locally testable over an appropriate class of functions. The above arguments are made precise in Section 2.

The layout of this paper is as follows. In Section 2 we introduce the random series framework for Bayesian nonparametric regression and present the main result concerning rates of contraction. As applications of the main result, we derive the rates of contraction of various renowned priors for nonparametric regression in the literature with substantial improvements in Section 3. Section 4 elaborates on extension of the proposed framework to sparse additive models in high dimensions. The technical proofs of the main results are deferred to Section 5.

### Notations

For , we use to denote both the -norm on any finite dimensional Euclidean space and the integrated -norm of a measurable function (with respect to the Lebesgue measure). In particular, for any function , we use to denote the integrated -norm defined to be . We follow the convention that when , the subscript is omitted, i.e., . The Hilbert space denotes the space of sequences that are squared-summable. We use to denote the maximal integer no greater than , and to denote the minimum integer no less than . The notations and denote the inequalities up to a positive multiplicative constant, and we write if and . Throughout capital letters are used to denote generic positive constants and their values might change from line to line unless particularly specified, but are universal and unimportant for the analysis.

We refer to as a statistical model if it consists of a class of densities on a sample space with respect to some underlying

-finite measure. Given a (frequentist) statistical model

and the i.i.d. data from some , the prior and the posterior distribution on are always denoted by and , respectively. Given a function , we use to denote the empirical measure of , and to denote the empirical process of , given the i.i.d. data . With a slight abuse of notations, when applying to a set of design points , we also denote and to be the empirical measure and empirical process, even when the design points are deterministic. In particular,

denotes the probability density function of the (univariate) standard normal distribution, and we use the shorthand notation

to denote the density of . For a metric space , for any , the -covering number of , denoted by , is defined to be the minimum number of -balls of the form that are needed to cover .

## 2 The framework and main results

Consider the nonparametric regression model: , where are i.i.d. mean-zero Gaussian noises with , and are design points taking values in . Unless otherwise stated, the design points are assumed to be independently and uniformly sampled for simplicity throughout the paper. The framework presented in this paper naturally adapts to the case where the design points are independently sampled from a density function that is bounded away from and . We assume that the responses ’s are generated from for some unknown , thus the data can be regarded as i.i.d. samples from a distribution with joint density

. Throughout we assume that the variance

is known, but the framework can be easily extended to the case where is unknown by placing a prior on that is supported on a compact interval contained in with a density bounded away from and .

The regression model is parametrized by the function , and we follow the standard nonparametric Bayes procedure by assigning a prior distribution on as follows. Let be a set of orthonormal basis functions in , where is a countably infinite index set. Since , we also impose the prior on a sub-class of and write in terms of the orthonormal series expansion , where the coefficients are imposed a prior distribution such that a.s.. The prior on can be very general as long as it yields realizations that are squared-integrable with probability one. We shall also assume that the true function yields a series expansion , where , i.e., . We require that . The widely used tensor-product Fourier basis satisfies this requirement:

 ψk1⋯kp(x1,⋯,xp)=p∏j=1ψ1kj(xj),

where , , , and the index set is the -times cartesian product of the set of all positive integers .

###### Remark 2.1.

The prior is supported on a function space that may not be uniformly bounded and the setup here is more general than that in . The proposed framework includes the well-known Gaussian process prior as a special case. Let

be a positive definite covariance function that yields an eigenfunction expansion

, where is a set of orthonormal eigenfunctions of , and

are the corresponding eigenvalues. For a Gaussian process prior

on , the Karhunen-Loève theorem  yields the following representation of : , where independently.

Before presenting the main result for studying convergence of Bayesian nonparametric regression under the proposed framework, let us first recall the extensively used prior-concentration-and-testing procedure for deriving rates of contraction proposed by . The general setup is as follows. Suppose are i.i.d. data sampled from a distribution that yields a density with respect to some underlying -finite measure. Let be the parameter space (typically infinite dimensional for nonparametric problems) such that for some . By imposing a prior on the parameter (equipped with a suitable measurable structure), the posterior distribution can be written as

 Π(A∣Dn)=∫Aexp[Λn(θ∣Dn)]Π(dθ)∫Θexp[Λn(θ∣Dn)]Π(dθ),

for any measurable in , where is the log-likelihood ratio

 Λn(θ∣Dn)=n∑i=1logpθ(wi)p0(wi).

An appropriate distance on is crucial to study rates of contraction, since typically different distances would yield different rates, and when is infinite dimensional, different distances are possibly not equivalent. In order that the posterior distribution contracts to at rate with respect to a distance on , i.e., in -probability for some large constant , the prior-concentration-and-testing procedure requires that the following hold with some constants for sufficiently large :

1. The prior concentration condition holds:

 Π(E0(logp0pθ)≤ϵ2n,E0[(logp0pθ)2]≤ϵ2n)≥e−Dnϵ2n. (2.1)
2. There exists a sequence of subsets of (often referred to as the sieves) and test functions such that

 Π(Θcn)≤e−(D+4)nϵ2n, E0ϕn→0, and supθ∈Θn∩{d(θ,θ0)>Mϵn}Eθ(1−ϕn)≤e−D′Mnϵ2n.

The major technical barrier for establishing a general framework to study rates of contraction for nonparametric regression with respect to lies in the construction of suitable aforementioned test functions . Typically, the existence of suitable test functions can be guaranteed by elaborating on the covering numbers of certain function classes (see, for example, [12, 13, 16, 42] among others), provided that the metrics of interest satisfies the local testability condition (see Section 7 of ). Formally, we say that a distance is locally testable, if there exists a sequence of test functions such that the following holds for some constant when is sufficiently large whenever :

where is a fixed constant. When the parameter space itself is a class of densities, i.e., is parametrized by itself, it is proved that the Hellinger distance is locally testable . The widely-used empirical -distance for fixed-design nonparametric regression is also locally testable (see, for example, ). The integrated -distance is, nonetheless, not locally testable when the space of functions is not uniformly bounded. To resolve this issue, we modify the general procedure above by imposing more assumptions on the sieves.

Since is a countable set, we may assume without loss of generality that . For a fixed integer and a positive constant , we denote by a class of functions in that satisfies the following property:

 Fm(δ)⊂{f(x)=∞∑k=1βkψk(x):∞∑k=m+1|βk−β0k|≤δ}. (2.2)

The sieves are constructed by letting for certain sequences and .

Rather than considering the supremum over as in , in the proposed framework we consider the supremum over the smaller function class and obtain the following weaker result analogous to the local testability condition.

###### Lemma 2.1.

Let satisfies (2.2). Then for any with , there exists a test function such that

 E0ϕn ≤exp(−Cn∥f1−f0∥22), sup{f∈Fm(δ):∥f−f1∥22≤ξ2∥f0−f1∥22}Ef(1−ϕn) ≤exp(−Cn∥f1−f0∥22) +2exp(−Cn∥f1−f0∥22m∥f1−f0∥22+δ2)

for some constant and .

Based on Lemma 2.1, we are able to establish the following lemma that tests against the large composite alternative . In particular, it is instructive to the construction of aforementioned suitable test functions with respect to the integrated -distance .

###### Lemma 2.2.

Suppose that satisfies (2.2) for an and a . Let be a sequence with . Then there exists a sequence of test functions such that

 E0ϕn ≤∞∑j=MNnjexp(−Cnj2ϵ2n), sup{f∈Fm(δ):∥f−f0∥2>Mϵn}Ef(1−ϕn) ≤exp(−CM2nϵ2n)+2exp(−CM2nϵ2nmM2ϵ2n+δ2),

where is the covering number of

 Snj(ϵn)={f∈Fm(δ):jϵn<∥f−f0∥2≤(j+1)ϵn},

and is some positive constant.

The prior concentration condition (2.1) is a very important ingredient in the study of Bayes theory. It guarantees that the denominator appearing in the posterior distribution can be lower bounded by for some constant with large probability (see, for example, lemma 8.1 in 

). In the context of normal regression, the Kullback-Leibler divergence is proportional to the integrated

-distance between two regression functions. Motivated by this observation, we establish the following lemma that yields an exponential lower bound for the denominator under the proposed framework.

###### Lemma 2.3.

Denote

 B(m,ϵ,ω)={f(x)=∞∑k=1βkψk(x):∥f−f0∥2<ϵ,∞∑k=m+1|βk−β0k|≤ω}

for any and . Suppose sequences and satisfy , , , and is some constant. Then for any constant ,

 P0(∫exp(Λn)Π(df)≤Π(B(kn,ϵn,ω))exp[−(C+1σ2)nϵ2n])→0.

In some cases it is also straightforward to consider the prior concentration with respect to the stronger -norm. For example, for a wide class of Gaussian process priors, the prior concentration has been extensively studied (see, for example, [15, 42, 44] for more details).

###### Lemma 2.4.

Suppose the sequence satisfies and . Then for any constant ,

 P0(∫exp(Λn)Π(df)≤Π(∥f−f0∥∞<ϵn)exp[−(C+1σ2)nϵ2n])→0.

Now we present the main result regarding the rates of contraction for Bayesian nonparametric regression under the orthonormal random series framework. The proof is based on the modification of the prior-concentration-and-testing procedure.

###### Theorem 2.1 (Generic Contraction).

Let and be sequences such that as , with . Assume that the sieve satisfies (2.2) with for some constant . In addition, assume that there exists another sequence such that . Suppose the following conditions hold for some constant and sufficiently large and :

 ∞∑j=MNnjexp(−Dnj2ϵ2n)→0, (2.3) Π(Fcmn(δ))≲exp[−(2D+1σ2)nϵ–2n], (2.4) Π(B(kn,ϵ–n,ω))≥exp(−Dnϵ–2n), (2.5)

where is the covering number of

 Snj={f∈Fmn(δ):jϵn<∥f−f0∥2≤(j+1)ϵn},

and is defined in Lemma 2.3. Then

###### Remark 2.2.

In light of Lemma 2.4, by exploiting the proof of theorem 2.1 we remark that when the assumptions and conditions in theorem 2.1 hold with (2.5) replaced by , the same rates of contraction also hold: for sufficiently large .

## 3 Applications

In this section we consider three concrete priors on for the nonparametric regression problem ,

. For simplicity the design points are assumed to independently follow the one-dimensional uniform distribution

. For some of the examples, the results presented in this section can be easily generalized to the case where the design points are multi-dimensional by considering the tensor-product Fourier basis. The results for these applications under the proposed framework also generalize their respective counterparts in the literature.

### 3.1 Finite random series regression with adaptive rate

In this subsection the true regression function is assumed to be in the -Hölder ball

 Cα(Q)={f(x)=∞∑k=1βkψk(x):∞∑k=1kα|βk|≤Q},

where is the smoothness level, and is some positive constant. In particular, We do not assume that the smoothness level of is known a priori. In the literature of Bayes theory, such a procedure is referred to as adaptive. We shall also assume that a lower bound for is known: .

The finite random series prior [1, 30, 35, 45]

is a popular prior in the literature of Bayesian nonparametric theory. It is a class of hierarchical priors that first draw an integer-valued random variable serving as the number of “terms” to be used in a finite sum, and then sample the “term-wise” parameters given the number of “terms.” The finite random series prior typically does not depend on the smoothness level of the true function, often yields minimax-optimal rates of contraction (up to a logarithmic factor) in many other nonparametric problems (

e.g., density estimation [30, 35] and fixed-design regression ), and hence is fully adaptive. However, the adaptive rates of contraction for the finite random series prior in the random-design regression with respect to the integrated -distance has not been established. In this subsection we address this issue within the orthonormal random series framework proposed in Section 2.

We now formally construct the finite random series prior for nonparametric regression. Let be represented by a set of orthonormal basis functions in : . The coefficients are imposed a prior distribution as follows: first sample an integer-valued random variable from a density function (with respect to the counting measure on ), and then given the coefficients ’s are independently sampled according to

 Π(dβk∣N=m)={kγg(kγβk)dβk,if 1≤k≤m,δ0(dβk),if k≥m,

where is an exponential power density for some  and . We further require that

 πN(m)≥exp(−b0mlogm)and∞∑N=m+1πN(m)≤exp(−b1mlogm) (3.1)

for some constants ,

. For instance, the zero-truncated Poisson distribution

satisfies condition (3.1) .

The following theorem shows that the finite random series prior constructed above is adaptive and the rate of contraction with respect to the integrated -distance is minimax-optimal up to a logarithmic factor .

###### Theorem 3.1.

Suppose the true regression function for some and , and is imposed the prior given above. Then there exists some sufficiently large constant such that for any .

### 3.2 Block prior regression with adaptive and exact rate

In the literature of adaptive Bayesian procedure, the minimax-optimal rates of contraction are often obtained with an extra logarithmic factor. It typically requires extra work to obtain the exact minimax-optimal rate. Gao and Zhou  elegantly construct a modified block prior that yields rate-adaptive (i.e., the prior does not depend on the smoothness level) and rate-exact contraction for a wide class of nonparametric problems, without the logarithmic factor. Nevertheless, for nonparametric regression,  modifies the block prior by conditioning on the space of uniformly bounded functions. Requiring a known upper bound for the unknown when constructing the prior is restrictive since it eliminates the popular Gaussian process priors. Besides the theoretical concern, the block prior itself is also a conditional Gaussian process and such a modification is inconvenient for implementation. In this section, we address this issue by showing that for nonparametric regression such a modification is not necessary. Namely, the un-modified block prior itself also yields the exact minimax-optimal rate of contraction for in the -Sobolev ball

 Hα(Q)={f(x)=∞∑k=1βkψk(x):∞∑k=1k2αβ2k≤Q},

and it does not depend on the smoothness level of the true regression function.

We now describe the block prior. Given a sequence in the squared-summable sequence space , define the th block to be the integer index set and , where . We use to denote the coefficients with index lying in the th block . Let be the Fourier basis, i.e., , , and , . The block prior is a prior distribution on induced by the following distribution on the Fourier coefficients :

 βℓ∣Aℓ∼N(0,AℓInℓ),Aℓ∼gℓ,independently % for each ℓ,

where is a sequence of densities satisfying the following properties:

1. There exists such that for any and ,

 gℓ(t)≥exp(−c1eℓ). (3.2)
2. There exists such that for any ,

 ∫∞0tgℓ(t)dt≤4exp(−c2ℓ2). (3.3)
3. There exists such that for any ,

 ∫∞e−ℓ2gℓ(t)dt≤exp(−c3eℓ). (3.4)

The existence of a sequence of densities satisfying (3.2), (3.3), and (3.4) is verified in  (see proposition 2.1 in ).

Our major improvement for the block prior regression is the following theorem, which shows that the (un-modified) block prior yields rate-exact Bayesian adaptation for nonparametric regression.

###### Theorem 3.2.

Suppose the true regression function for some and , and is imposed the block prior as described above. Then for some sufficiently large constant .

Rather than using the sieve proposed in theorem 2.1 in , which does not necessarily satisfy (2.2), we construct in a slightly different fashion:

 Fmn(Q)={f(x)=∞∑k=1βkψk(x):∞∑k=1(βk−β0k)2k2α≤Q2}

with and . The covering number can be bounded using the metric entropy for Sobolev balls (for example, see lemma 6.4 in ), and the rest conditions in 2.1 can be verified using similar techniques as in .

As discussed in Section 4.2 in , the block prior can be easily extended to the wavelet basis functions and wavelet series. The wavelet basis functions are another widely-used class of orthonormal basis functions for . Let be an orthonormal basis of compactly supported wavelets for , with referring to the so-called “resolution level,” and to the “dilation” (see Section E.3 in ). We adopt the convention that the index set for the th resolution level runs through . The exact definition and specific formulas for the wavelet basis are not of great interest to us, and for a complete and thorough review of wavelets from a statistical perspective, we refer to . We shall assume that the wavelet basis ’s are appropriately selected such that for any , the following inequalities hold [6, 7, 19]:

 ∥f∥2≤∞∑j=0⎛⎝∑k∈Ijβ2jk⎞⎠1/2and∥f∥∞≤∞∑j=02j/2maxk∈Ij|βjk|.

It is worth noticing that unlike the Fourier basis, the wavelet basis functions are typically not uniformly bounded in : . However, the supremum norm of the wavelet series can be upper bounded in terms of the wavelet coefficients, which allows us to modify the framework in Section 2 for wavelet basis functions.

Given a function with wavelet series expansion , the block prior for the wavelet series is introduced through the wavelet coefficients ’s as follows:

 βj∣Aj ∼N(0,AkInk),Aj∼gj,independently for each j,

where , , and is given by

 gj(t) =⎧⎪ ⎪⎨⎪ ⎪⎩ej2log2(e−2jlog2−Tj)t+Tj,0≤t≤e−j2log2,e−2jlog2,e−j2log2e−jlog2, Tj =exp[(1+j2)log2]−exp[(−2j+j2−j)log2]+e−2jlog2.

We further assume that lies in the -Besov ball defined as follows  for some and :

 Bα2,2(Q)=⎧⎨⎩f(x)=∞∑j=0∑k∈Ijβjkψjk(x):∞∑j=022αj∑k∈Ijβ2jk≤Q2⎫⎬⎭,

which turns out to be equivalent to the aforementioned -Sobolev ball. For the block prior regression via wavelet series, the rate-exact Bayesian adaptation also holds.

###### Theorem 3.3.

Suppose the true regression function for some and , and is imposed the block prior for wavelet series as described above. Then there exists some sufficiently large constant such that .

### 3.3 Squared-exponential Gaussian process regression with fixed design

So far, the design points in this paper for the nonparametric regression problem are assumed to be randomly sampled over . This is referred to as the random-design regression problem. There are, however, many cases where the design points are fixed and can be controlled. One of the examples is the design and analysis of computer experiments [8, 34]. To emulate a computer model, the design points are typically manipulated so that they are reasonably spread. In some physical experiments the design points can also be required to be fixed .

In this subsection we consider one of the most widely used and perhaps the most popular Gaussian processes () with the covariance function of the squared-exponential form, for the fixed-design nonparametric regression problem. We show that optimal rates of contraction with respect to the integrated -distance is also attainable in such a scenario when the design points are reasonably selected, in contrast to the most Bayesian literatures that obtain rates of contraction with respect to the empirical -distance. This can be done by slightly extending the framework in Section 2.

We first present an assumption for the design points. Suppose that the design points are one-dimensional and are fixed instead randomly sampled. Intuitively, the design points need to be relatively “spread” so that the global behavior of the true signal can be recovered as much as possible. Formally, we require that the design points satisfy [51, 52]

 supx∈[0,1]∣∣ ∣∣1nn∑i=11(xi≤x)−x∣∣ ∣∣=O(1n). (3.5)

A simple example of such design is the univariate equidistance design, i.e., . It clearly satisfies (3.5) (see, for example, ).

Now we extend the orthonormal random series framework in Section 2 to the (one-dimensional) fixed-design regression problem with Fourier basis: Assume that yields Fourier series expansion , where , , and , . We also assume that lies in the -Hölder space with . Consider the sieve

 Fmn(δ)⊂{f(x)=∞∑k=1βkψk