Posterior Contraction Rates for Graph-Based Semi-Supervised Classification

This paper studies Bayesian nonparametric estimation of a binary regression function in a semi-supervised setting. We assume that the features are supported on a hidden manifold, and use unlabeled data to construct a sequence of graph-based priors over the regression function restricted to the given features. We establish contraction rates for the corresponding graph-based posteriors, interpolated to be supported over regression functions on the underlying manifold. Minimax optimal contraction rates are achieved under certain conditions. Our results provide novel understanding on why and how unlabeled data are helpful in Bayesian semi-supervised classification.


1 Introduction

This paper investigates the semi-supervised learning problem of inferring a regression function $f_0$ using labeled data $\{(X_i,Y_i)\}_{i=1}^n$ and unlabeled data $\{X_i\}_{i=n+1}^{N_n}$. We focus on binary classification, where the label $Y$ takes values in $\{0,1\}$ and $f_0(x)$ represents the probability with which a feature $x$ belongs to the class labeled by $1$. We make a standard manifold assumption [5, 30, 2, 25, 10] and suppose that $X$ takes values on a hidden manifold $M$. Using the given features we construct, without knowledge of $M$, a sequence (indexed by $n$) of priors over the restriction of $f_0$ to the features. Our main contribution is to study the contraction of the corresponding interpolated posteriors. In doing so, we lay a frequentist foundation for Bayesian semi-supervised classification, give theoretical insight into the choice of data-driven prior models, and provide novel understanding of why and how unlabeled data are helpful in Bayesian formulations of semi-supervised learning.

The approach to semi-supervised learning that we analyze belongs to the broad class of graph-based methods [32]. The unifying idea behind these methods is to employ a graph Laplacian built from the features to uncover the geometry of $M$ and regularize the inference problem. In the Bayesian perspective that we adopt, the graph Laplacian is used to define the covariance operator of a Gaussian field prior over the features $\{X_i\}_{i=1}^{N_n}$, which is transformed by a link function to set a prior on $f_0$ restricted to the given features. Combining the prior with a likelihood function that incorporates the labeled data, we obtain a posterior distribution on regression functions over the $X_i$'s, which allows inference on the labels of the unlabeled features. The main contribution of this paper is to study the contraction of this graph-based posterior around $f_0$ as $n$ increases. Since $f_0$ is a function on $M$, this naturally suggests pushing forward the graph-based posterior to a measure over functions on $M$, which can be achieved by an interpolation map that extends functions on $\{X_i\}_{i=1}^{N_n}$ to $M$. We shall study contraction rates of the pushforward (interpolated) graph-based posterior to gain theoretical understanding of the graph-based Bayesian approach to semi-supervised classification.

Our analysis is set in the general posterior contraction framework of [12] and consists of two parts. First, we assume perfect knowledge of $M$, in which case the unlabeled data are not needed and the problem reduces to a standard binary regression problem on $M$. This setting can be thought of as a limiting regime in which $M$ has been fully recovered from the unlabeled data. We set a Matérn-type Gaussian field prior (see e.g. [20]), which is the continuum limit of the graph-based priors in the previous paragraph, and obtain posterior contraction rates for Sobolev-type truths. As in [24, 23], we show that the minimax optimal convergence rate is attained only if the prior regularity matches the regularity of the target function. The novelty of this first partial result lies in the study of posterior contraction on manifolds with Matérn-type priors, complementing [5], which studies posterior contraction with heat kernel priors in a manifold setting. Second, we return to the semi-supervised problem, where partial knowledge of $M$ is acquired through the features $\{X_i\}_{i=1}^{N_n}$. We show that when $N_n$ grows at a certain polynomial rate with $n$, the interpolated graph posteriors have the same rate of contraction as the posteriors obtained with full knowledge of $M$. These results imply that optimal contraction rates for semi-supervised learning can be attained provided that sufficiently many unlabeled data are available.

An important related work is [15], which studies fully-supervised function estimation on large graphs without a continuum limit structure, assuming that the truth changes with the size of the graph. In contrast, we investigate posterior contraction with a fixed truth defined on the underlying manifold $M$, by analyzing the continuum limit of graph-based priors. Another related line of work is [8, 11], which established the continuum limit of posterior distributions as the size of the unlabeled data set grows, without increasing the size of the labeled data set. We point out that these papers did not address the question of whether graph-based posteriors contract around the truth. The recent paper [1] studied posterior consistency for a fixed sample size in the small noise limit, whereas we consider the large $n$ limit and further establish posterior contraction rates. Rates of convergence for optimization rather than Bayesian formulations of semi-supervised learning have been established in [4].

Several works have investigated whether unlabeled data improve the performance of semi-supervised learning [17], and both positive [19] and negative [2, 25] conclusions have been reached under different settings. Our results provide qualitative and quantitative understanding of why and by how much unlabeled data can improve the performance of Bayesian semi-supervised learning under a manifold assumption: a continuum prior over regression functions on $M$ that achieves optimal contraction rates can be approximated using the unlabeled data, still yielding rate-optimal convergence provided that $N_n$ grows sufficiently fast with $n$.

The rest of this paper is organized as follows. Section 2 formalizes our setting and provides the necessary background. Section 3 contains the first part of our analysis, concerning binary regression on a known manifold. Our main results on semi-supervised classification are in Section 4. To streamline our presentation, in Sections 3 and 4 we work under the assumption that the features are uniformly distributed on the underlying manifold. Section 5 shows how to generalize our results to a nonuniform marginal density, and Section 6 closes with a discussion of several research directions that stem from our work.

2 Setting and Background

Let $(X, Y)$ be a random vector with $X$ taking values in $\mathbb{R}^d$ and $Y$ in $\{0,1\}$. The goal of semi-supervised classification is to estimate the binary regression function

$$f_0(x) := \mathbb{P}(Y = 1 \mid X = x)$$

given labeled data $\{(X_i,Y_i)\}_{i=1}^n$ and unlabeled data $\{X_i\}_{i=n+1}^{N_n}$, where the unlabeled features are independent from the labeled pairs. In applications, unlabeled data are often cheaper to collect and, for this reason, typically $N_n \gg n$.

We adopt a manifold assumption [19], and suppose that $\mu$, the marginal distribution of $X$, is supported on an $m$-dimensional smooth, connected, compact manifold $M$ without boundary, embedded in $\mathbb{R}^d$, with the absolute value of the sectional curvature bounded and with Riemannian metric inherited from the embedding. We further assume that $\mu$ is absolutely continuous with respect to the volume form on $M$, with a differentiable density that is bounded above and below by positive constants.

Our analysis in Section 3 sits on the continuum space and builds on the seminal work on posterior contraction with Gaussian field priors [24], which we review here succinctly in our manifold setting. Let $\Phi$ be a link function that is differentiable and invertible, with $\Phi'$ uniformly bounded. These assumptions are satisfied, for instance, by the logistic function. We then put a prior on $f_0$ of the form $f_W$, where $f_W$ is defined as

$$f_W(x) := \Phi\big(W(x)\big),$$

and $W$ is a Gaussian process on $M$ taking values in some Banach space $(\mathbb{B}, \|\cdot\|_{\mathbb{B}})$. Practical implementations of this model are overviewed in [28]. The posterior contraction rates can be characterized in terms of the concentration function of $W$, defined as

$$\varphi_{w_0}(\varepsilon) := \inf_{h \in \mathbb{H}:\, \|h - w_0\|_{\mathbb{B}} < \varepsilon} \|h\|_{\mathbb{H}}^2 \;-\; \log \mathbb{P}\big(\|W\|_{\mathbb{B}} < \varepsilon\big), \tag{2.1}$$

where $w_0 := \Phi^{-1}(f_0)$ and $\|\cdot\|_{\mathbb{H}}$ is the norm of the reproducing kernel Hilbert space $\mathbb{H}$ for $W$. The main result of [24] states that if $w_0$ belongs to the closure of $\mathbb{H}$ in $\mathbb{B}$ and $\varepsilon_n$ satisfies $\varphi_{w_0}(\varepsilon_n) \le n\varepsilon_n^2$, then the posterior contracts around $f_0$ at rate $\varepsilon_n$. Precisely, for every sufficiently large $M'$,

$$E_{f_0}\,\Pi\big(f : d_n(f, f_0) \ge M'\varepsilon_n \,\big|\, \{(X_i, Y_i)\}_{i=1}^n\big) \xrightarrow{\;n\to\infty\;} 0, \tag{2.2}$$

where $d_n$ is a suitable discrepancy measure and the expectation is understood to be over the joint distribution of $\{(X_i,Y_i)\}_{i=1}^n$ determined by $f_0$ and $\mu$. Furthermore, [24, Theorem 2.2] implies that if $W_n$ is a sequence of Gaussian fields taking values in $\mathbb{B}$ such that $W_n$ converges to $W$ sufficiently fast, then the sequence of posteriors with respect to $\Phi(W_n)$ contracts around $f_0$ at the same rate as above. Our analysis exploits these two results and can be summarized as follows. In Section 3 we establish posterior contraction rates for a Matérn-type Gaussian prior, which is approximated by the sequence of graph-based priors constructed in Section 4 at a rate fast enough that the same posterior contraction rates are attained. To achieve the approximation rate, $N_n$ needs to scale polynomially with $n$. For the purpose of this paper, we shall take $\mathbb{B}$ to be the space $L^2(\mu)$ and $\|\cdot\|_{\mathbb{B}}$ the $L^2(\mu)$-norm.

3 Binary Regression on M

Now we describe the choice of Matérn-type prior under the assumption that $M$ is known. In this case the unlabeled data are unnecessary and the problem reduces to a standard binary regression problem on $M$. For ease of exposition, we shall assume that $\mu$ is the uniform distribution on $M$; the generalization to the nonuniform case will be addressed in Section 5.

We set the prior distribution of $W$ to be the Gaussian measure

$$\pi = N(0, \mathcal{C}_s), \qquad \mathcal{C}_s = (I - \Delta)^{-s}, \tag{3.1}$$

where $\Delta$ is the Laplace–Beltrami operator on $M$, $s$ parametrizes the regularity of prior draws, and the fractional-order operator $\mathcal{C}_s$ is defined spectrally. A random function $W \sim \pi$ admits a Karhunen–Loève expansion

$$W = \sum_{i=1}^{\infty} (1 + \lambda_i)^{-\frac{s}{2}} \xi_i \psi_i, \qquad \xi_i \overset{\text{i.i.d.}}{\sim} N(0,1), \tag{3.2}$$

where $\{(\lambda_i, \psi_i)\}_{i=1}^{\infty}$ are eigenpairs from the spectral decomposition of $-\Delta$, with the $\lambda_i$'s in increasing order. We see that a larger $s$ leads to faster decay of the coefficients and hence more regular sample paths. By Weyl's law, $\lambda_i \asymp i^{2/m}$, so setting $s > m/2$ makes $\pi$ a well-defined measure on $L^2(\mu)$. Such priors are closely related to Gaussian fields with Matérn covariance function. An important characterization by Whittle [26, 27] is that a Gaussian Matérn field on $\mathbb{R}^m$ is the statistically stationary solution to the following stochastic partial differential equation:

$$\big(\ell^{-2} I - \Delta\big)^{\frac{\alpha}{2} + \frac{m}{4}}\, u(x) = \mathcal{W}(x), \qquad x \in \mathbb{R}^m,$$

where $\mathcal{W}$ is a Gaussian white noise on $\mathbb{R}^m$. The parameters $\ell$ and $\alpha$ specify the length scale and regularity, respectively. Therefore $\pi$ as defined in (3.1) can be interpreted as the law of a Gaussian Matérn field on $M$ with $\ell = 1$ and $\alpha = s - \frac{m}{2}$, whose sample paths are $(s - \frac{m}{2})$-regular almost surely. Using the series representation (3.2), the reproducing kernel Hilbert space associated with $\pi$ has the following characterization:

$$\mathbb{H} = \Big\{ h = \sum_{i=1}^{\infty} h_i \psi_i : \sum_{i=1}^{\infty} (1 + \lambda_i)^{s}\, h_i^2 < \infty \Big\} = \Big\{ h = \sum_{i=1}^{\infty} h_i \psi_i : \sum_{i=1}^{\infty} i^{\frac{2s}{m}}\, h_i^2 < \infty \Big\},$$

where the second equality (as an equivalence of norms) is due to Weyl's law.
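For intuition, the series (3.2) can be sampled directly whenever the spectrum of $-\Delta$ is known in closed form. The following sketch (illustrative code, not from the paper) does so on the unit circle, where $-\Delta$ has eigenvalues $k^2$ with sine and cosine eigenfunctions, and then applies the logistic link to produce a prior draw of the regression function; eigenfunction normalization constants are omitted:

```python
import numpy as np

def sample_matern_prior(theta, s=2.0, n_terms=200, seed=0):
    """Truncated Karhunen-Loeve draw of W = sum_i (1+lambda_i)^{-s/2} xi_i psi_i
    on the unit circle (m = 1), where -Delta has eigenvalues k^2 with
    eigenfunctions cos(k.), sin(k.); normalization constants are omitted."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal() * np.ones_like(theta)  # constant mode, lambda = 0
    for k in range(1, n_terms + 1):
        a, b = rng.standard_normal(2)
        W += (1 + k**2) ** (-s / 2) * (a * np.cos(k * theta) + b * np.sin(k * theta))
    return W

theta = np.linspace(0, 2 * np.pi, 512, endpoint=False)
W = sample_matern_prior(theta, s=2.0)
f_W = 1.0 / (1.0 + np.exp(-W))  # logistic link: a prior draw of the regression function
```

Larger $s$ damps the high-frequency modes more aggressively, producing smoother sample paths, in line with the discussion above.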

Now we are ready to state our first result.

Theorem 3.1.

Consider the prior $f_W = \Phi(W)$, where $W$ is defined in (3.2) with $s > m/2$. If $w_0 = \Phi^{-1}(f_0) \in F^{\beta,R}$, where $F^{\beta,R} := \{ h = \sum_i h_i \psi_i : \sum_i (1+\lambda_i)^{\beta} h_i^2 \le R^2 \}$ with $\beta \le s - \frac{m}{2}$, then, for $\varepsilon_n \asymp n^{-\frac{\beta}{2s}}$ and every sufficiently large $M'$, the contraction statement (2.2) holds with $d_n$ the $L^2(\mu)$-distance.

Proof.

As noted above, it suffices to find $\varepsilon_n$ so that $\varphi_{w_0}(\varepsilon_n) \le n\varepsilon_n^2$, and we proceed by bounding both terms in (2.1). By (3.2), we have

$$-\log \mathbb{P}\big(\|W\|_{L^2(\mu)} < \varepsilon\big) \lesssim \varepsilon^{-\frac{2m}{2s-m}}, \tag{3.3}$$

where the last inequality follows from [6, Corollary 4.3]. Now, in order to approximate $w_0 = \sum_{i=1}^{\infty} w_i \psi_i$, consider the truncated series $h := \sum_{i=1}^{N} w_i \psi_i$. We have

$$\|h - w_0\|_{L^2(\mu)}^2 = \sum_{i=N+1}^{\infty} w_i^2 = \sum_{i=N+1}^{\infty} w_i^2\, i^{\frac{2\beta}{m}}\, i^{-\frac{2\beta}{m}} \le N^{-\frac{2\beta}{m}} R^2.$$

This suggests the choice $N \asymp \varepsilon^{-\frac{m}{\beta}}$. Since $h$ is a truncated series, $h \in \mathbb{H}$ and we have

$$\|h\|_{\mathbb{H}}^2 = \sum_{i=1}^{N} (1+\lambda_i)^{s} w_i^2 \lesssim \sum_{i=1}^{N} i^{\frac{2(s-\beta)}{m}}\, i^{\frac{2\beta}{m}} w_i^2 \le N^{\frac{2(s-\beta)}{m}} R^2 \lesssim \varepsilon^{-\frac{2(s-\beta)}{\beta}} R^2,$$

where we have used the assumption that $\beta \le s$ in the second-to-last step. This together with (3.3) gives

$$\varphi_{w_0}(\varepsilon) \lesssim \varepsilon^{-\frac{2(s-\beta)}{\beta}} + \varepsilon^{-\frac{2m}{2s-m}},$$

which suggests the choice $\varepsilon_n \asymp n^{-\frac{\beta}{2s}}$. The result then follows from [24, Theorems 2.1 and 3.2(i)]. ∎

Sobolev balls with different bases have been studied in [29, 7]. By Weyl's law we see that $F^{\beta,R}$ is the set of functions $\sum_i w_i \psi_i$ that satisfy $\sum_i i^{\frac{2\beta}{m}} w_i^2 \lesssim R^2$, representing $\beta$-regular functions in the Sobolev sense. It is well known that the minimax optimal rate for estimating a $\beta$-regular function is $n^{-\frac{\beta}{2\beta+m}}$. However, we have not found in the literature a result for binary regression problems over the Sobolev ball $F^{\beta,R}$ with eigenfunctions of the Laplace–Beltrami operator as the basis. To make our presentation complete and self-contained, we will show in Appendix B the following minimax lower bound.

Theorem 3.2.

Assume that the $L^2(\mu)$-normalized eigenfunctions of $-\Delta$ are uniformly bounded and that $\beta$ is sufficiently large. Then, for $n$ large enough,

$$\inf_{\hat f}\ \sup_{f\,:\,\Phi^{-1}(f) \in F^{\beta,R}} E_f \|\hat f - f\|_{L^2(\mu)} \gtrsim n^{-\frac{\beta}{2\beta+m}},$$

where the infimum is taken over all estimators $\hat f$ based on the labeled data $\{(X_i,Y_i)\}_{i=1}^n$.

Theorem 3.2 requires uniform boundedness of the eigenfunctions of the Laplace–Beltrami operator, which holds for example for flat manifolds [21], and that the target function is not too rough. In such cases, Theorem 3.1 implies that optimal rates of posterior contraction are attained only if $s = \beta + \frac{m}{2}$. Meanwhile, draws from $\pi$ are $(s - \frac{m}{2})$-regular in the above sense. This can be seen by observing that a typical sample path as in (3.2) has coefficients $(1+\lambda_i)^{-\frac{s}{2}}\xi_i$ and satisfies, for any $\alpha < s - \frac{m}{2}$,

$$E_\xi \sum_{i=1}^{\infty} (1+\lambda_i)^{-s}\, \xi_i^2\, i^{\frac{2\alpha}{m}} \lesssim \sum_{i=1}^{\infty} i^{-\frac{2s-2\alpha}{m}} < \infty.$$

Hence, optimal rates are attained only if the almost sure regularity of prior draws matches that of $w_0$. Similar observations have been made in [24, 23] for, e.g., Gaussian Matérn fields on Euclidean domains. Since $\pi$ is the natural analog of Gaussian Matérn fields on manifolds [18, 20], our findings are intuitively expected.

We have presented results for $\beta \le s - \frac{m}{2}$, and the case $\beta > s - \frac{m}{2}$ can be treated similarly. Indeed, in this case $w_0 \in \mathbb{H}$, and inspection of the proof shows that the contraction rate is $n^{-\frac{s - m/2}{2s}}$, which is always suboptimal since $s - \frac{m}{2} < \beta$. That is, the prior is always rougher than the truth.

4 Semi-Supervised Classification on $\{X_i\}_{i=1}^{N_n}$

Now we go back to the semi-supervised setting, where the manifold $M$ is only known through the features $\{X_i\}_{i=1}^{N_n}$, which are used to approximate Gaussian processes on $M$. In particular, we will construct a sequence of data-driven priors that approximate $\pi$ as defined in (3.1) and study contraction of the corresponding posteriors. This will be achieved by approximating the covariance operator of $\pi$ using graph Laplacians and defining a suitable interpolation procedure, as will be made precise in what follows.

Recall that for each $n$ we are given labeled data $\{(X_i,Y_i)\}_{i=1}^n$ and unlabeled data $\{X_i\}_{i=n+1}^{N_n}$. Define a similarity matrix $H \in \mathbb{R}^{N_n \times N_n}$ by

$$H_{ij} := \frac{2(m+2)}{N_n \nu_m \zeta_{N_n}^{m+2}}\, \mathbf{1}\big\{|X_i - X_j| < \zeta_{N_n}\big\}, \tag{4.1}$$

where $|\cdot|$ is the Euclidean distance in $\mathbb{R}^d$, $\nu_m$ is the volume of the $m$-dimensional unit ball, and $\zeta_{N_n}$ is the connectivity of the graph, to be determined later. Let $\Delta_{N_n} := D - H$, where $D$ is the diagonal matrix with entries $D_{ii} = \sum_{j} H_{ij}$. The matrix $\Delta_{N_n}$ is the unnormalized graph Laplacian, which approximates the Laplace–Beltrami operator (see e.g. [9, 20] and Section 5). Graph Laplacians have been widely used in semi-supervised learning to regularize the inference problem [31, 32]. Now consider the Gaussian distribution $N\big(0, (I + \Delta_{N_n})^{-s}\big)$ on $\mathbb{R}^{N_n}$, whose samples admit a Karhunen–Loève expansion

$$w_n = \sum_{i=1}^{N_n} \big(1 + \lambda_i^{(N_n)}\big)^{-\frac{s}{2}} \xi_i\, \psi_i^{(N_n)}, \qquad \xi_i \overset{\text{i.i.d.}}{\sim} N(0,1), \tag{4.2}$$

where $\{(\lambda_i^{(N_n)}, \psi_i^{(N_n)})\}_{i=1}^{N_n}$ are eigenpairs of $\Delta_{N_n}$. Since $\Delta_{N_n}$ is symmetric and positive semidefinite, its eigenvalues are non-negative. If we enumerate the eigenpairs of $\Delta_{N_n}$ and $-\Delta$ so that the eigenvalues are in increasing order, we will see later (Theorems A.2 and A.3) that the spectral approximations are only accurate for the first several of them. In other words, $\lambda_i^{(N_n)}$ and $\psi_i^{(N_n)}$ give poor approximations to $\lambda_i$ and $\psi_i$ for $i$ large. This motivates considering the following truncated version of (4.2):

$$w_n = \sum_{i=1}^{k_{N_n}} \big(1 + \lambda_i^{(N_n)}\big)^{-\frac{s}{2}} \xi_i\, \psi_i^{(N_n)}, \tag{4.3}$$

where $k_{N_n}$ is a threshold for accurate approximations, to be determined later. We define our graph-based prior as $\Pi_n^{\mathrm{disc}} := \mathcal{L}(\Phi(w_n))$, to be viewed as a measure over $L^2(\mu_{N_n})$, where $\mu_{N_n}$ is the empirical measure of $\{X_i\}_{i=1}^{N_n}$, so that $\Phi(w_n)$ is also considered as a function over the point cloud. By inspecting (3.2) and (4.3), it is expected that $w_n$ approximates $W$ given good control on spectral convergence. These data-driven Gaussian field priors have been used within various intrinsic approaches to Bayesian semi-supervised classification; see e.g. [8, 11]. Note that the above construction does not require any knowledge of the underlying manifold other than its dimension $m$. In the case of unknown dimension, various dimensionality estimation methods have been studied, and [30] proposed a plug-in procedure that leads to optimal contraction rates; we believe it can be applied to our setting and leave this for future work.
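As a concrete illustration, the construction above can be sketched in a few lines (a hypothetical implementation; `graph_laplacian` and `sample_graph_prior` are our own illustrative helpers, and the eigenvectors are normalized in the Euclidean rather than the $L^2(\mu_{N_n})$ sense):

```python
import numpy as np
from math import gamma, pi

def graph_laplacian(X, zeta, m):
    """Unnormalized graph Laplacian Delta_N = D - H with the weights of (4.1):
    H_ij = 2(m+2) / (N nu_m zeta^{m+2}) * 1{|X_i - X_j| < zeta},
    where nu_m is the volume of the m-dimensional unit ball."""
    N = X.shape[0]
    nu_m = pi ** (m / 2) / gamma(m / 2 + 1)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    H = (2 * (m + 2)) / (N * nu_m * zeta ** (m + 2)) * (dist < zeta)
    np.fill_diagonal(H, 0.0)  # self-loops would cancel in D - H anyway
    return np.diag(H.sum(axis=1)) - H

def sample_graph_prior(L, s, k_trunc, seed=0):
    """Truncated expansion (4.3): keep the k smallest eigenpairs of L, which
    are the ones that approximate the continuum spectrum, then apply a
    logistic link to obtain a prior draw of f on the point cloud."""
    rng = np.random.default_rng(seed)
    lam, psi = np.linalg.eigh(L)              # eigenvalues in increasing order
    lam, psi = lam[:k_trunc], psi[:, :k_trunc]
    w = psi @ ((1 + lam) ** (-s / 2) * rng.standard_normal(k_trunc))
    return 1.0 / (1.0 + np.exp(-w))

# Point cloud on the unit circle embedded in R^2 (hidden manifold, m = 1)
t = np.random.default_rng(1).uniform(0, 2 * pi, 300)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
L = graph_laplacian(X, zeta=0.3, m=1)
f_n = sample_graph_prior(L, s=2.0, k_trunc=20)
```

Note that, as in the text, the only geometric input is the intrinsic dimension $m$ (through $\nu_m$ and the weight scaling); the manifold itself is never used.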

We remark that there are two sources of randomness in our definition (4.3), coming from both the $\xi_i$'s and the $X_i$'s. It is therefore natural to think of our graph-based prior as defined conditionally on the $X_i$'s. In other words, $\Pi_n^{\mathrm{disc}}$ should be interpreted as

$$\Pi_n^{\mathrm{disc}}\big(\cdot \mid \{X_i\}_{i=1}^{N_n}\big) = \mathcal{L}\big(\Phi(w_n) \mid \{X_i\}_{i=1}^{N_n}\big).$$

The corresponding graph-based posterior should also take into account the randomness of the $X_i$'s and has the form

$$\Pi_n^{\mathrm{disc}}\big(B \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big) = \frac{\int_B \prod_{i=1}^n L_{Y_i|X_i}(f_n)\, d\Pi_n^{\mathrm{disc}}\big(f_n \mid \{X_i\}_{i=1}^{N_n}\big)}{\int_{L^2(\mu_{N_n})} \prod_{i=1}^n L_{Y_i|X_i}(f_n)\, d\Pi_n^{\mathrm{disc}}\big(f_n \mid \{X_i\}_{i=1}^{N_n}\big)}, \tag{4.4}$$

where

$$L_{Y_i|X_i}(f_n) := f_n(X_i)^{Y_i}\,\big(1 - f_n(X_i)\big)^{1 - Y_i} \tag{4.5}$$

is the conditional likelihood of $Y_i$ given $X_i$. The sequence of posteriors (4.4) allows one to infer labels for the unlabeled data, and we are interested in analyzing their contraction around the truth. But notice that (4.4) is again a measure over $L^2(\mu_{N_n})$, whereas $f_0$ belongs to $L^2(\mu)$. One possible solution is to study the contraction around the restriction of $f_0$ onto $\{X_i\}_{i=1}^{N_n}$, which however makes interpretation difficult, as the sequence of truths will then change with $n$. Therefore a more natural route is to push forward the graph-based posteriors to the continuum as measures over $L^2(\mu)$, so that we can study their contraction around $f_0$. This can be achieved by defining an interpolation map $\mathcal{I}$ that extends a function over $\{X_i\}_{i=1}^{N_n}$ to a function over $M$, and considering the pushforward measure $\mathcal{I}_\sharp \Pi_n^{\mathrm{disc}}(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n})$. For the purpose of this paper, we shall consider the one-nearest-neighbor interpolation [9, 11], defined for a function $u_n$ on $\{X_i\}_{i=1}^{N_n}$ as

$$\mathcal{I}u_n(x) := \sum_{i=1}^{N_n} u_n(X_i)\, \mathbf{1}_{V_i}(x), \qquad x \in M, \tag{4.6}$$

where

$$V_i := \Big\{x \in M : |x - X_i| = \min_{j=1,\ldots,N_n} |x - X_j|\Big\}.$$

Up to a set of ambiguity of $\mu$-measure 0, $\mathcal{I}u_n(x) = u_n(X_{i^*(x)})$, where $X_{i^*(x)}$ is the closest point to $x$ in Euclidean distance among $\{X_i\}_{i=1}^{N_n}$, and $\mathcal{I}u_n$ can be thought of as a piecewise constant function on $M$. We remark that other choices of $\mathcal{I}$ are possible, but the one above is easily computed and does not require full knowledge of $M$ if one is only interested in certain given points outside the point cloud, which is favorable for practical purposes.
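The map (4.6) thus reduces to a nearest-neighbor lookup, which can be sketched as follows (a hypothetical helper, assuming query points are given in ambient Euclidean coordinates):

```python
import numpy as np

def interpolate_1nn(X_cloud, u_cloud, X_query):
    """(I u)(x) = u(X_i*), where X_i* is the Euclidean-nearest cloud point
    to x; i.e., a piecewise constant extension over the Voronoi cells V_i
    of (4.6)."""
    d = np.linalg.norm(X_query[:, None, :] - X_cloud[None, :, :], axis=-1)
    return u_cloud[np.argmin(d, axis=1)]

X_cloud = np.array([[0.0, 0.0], [1.0, 0.0]])
u_cloud = np.array([0.2, 0.9])
vals = interpolate_1nn(X_cloud, u_cloud, np.array([[0.1, 0.0], [0.8, 0.1]]))
# vals[0] comes from the first cloud point, vals[1] from the second
```

For large point clouds a k-d tree (e.g. `scipy.spatial.cKDTree`) would replace the brute-force distance matrix, but the brute-force version matches (4.6) directly.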

Analyzing directly the contraction rates of the interpolated posteriors seems not straightforward due to the interpolation map $\mathcal{I}$. However, the following observation suggests an alternative route: we can first push forward the prior to the continuum and then compute the posterior.

Lemma 4.1.

Let $\Pi_n^{\mathrm{cont}} := \mathcal{I}_\sharp \Pi_n^{\mathrm{disc}}$ be the pushforward of the graph-based prior. Then

$$\Pi_n^{\mathrm{cont}}\big(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big) = \mathcal{I}_\sharp\big[\Pi_n^{\mathrm{disc}}\big(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big)\big].$$
Proof.

By definition of the pushforward measure, it suffices to show that

$$\Pi_n^{\mathrm{cont}}\big(B \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big) = \Pi_n^{\mathrm{disc}}\big(\mathcal{I}^{-1}(B) \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big),$$

for any measurable $B \subset L^2(\mu)$. The left-hand side equals

$$\Pi_n^{\mathrm{cont}}\big(B \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\big) = \frac{\int_B \prod_{i=1}^n L_{Y_i|X_i}(f)\, d\,\mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}\big(f \mid \{X_i\}_{i=1}^{N_n}\big)}{\int_{L^2(\mu)} \prod_{i=1}^n L_{Y_i|X_i}(f)\, d\,\mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}\big(f \mid \{X_i\}_{i=1}^{N_n}\big)}, \tag{4.7}$$

where $L_{Y_i|X_i}(f) := f(X_i)^{Y_i}(1 - f(X_i))^{1-Y_i}$. Note that pointwise values of $f$ are well-defined, since $\mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}$ is supported on piecewise constant functions of the form (4.6). By the change-of-variables formula for pushforward measures,

$$(4.7) = \frac{\int_{\mathcal{I}^{-1}(B)} \prod_{i=1}^n L_{Y_i|X_i} \circ \mathcal{I}(f_n)\, d\Pi_n^{\mathrm{disc}}\big(f_n \mid \{X_i\}_{i=1}^{N_n}\big)}{\int_{L^2(\mu_{N_n})} \prod_{i=1}^n L_{Y_i|X_i} \circ \mathcal{I}(f_n)\, d\Pi_n^{\mathrm{disc}}\big(f_n \mid \{X_i\}_{i=1}^{N_n}\big)},$$

which equals (4.4) with $B$ replaced by $\mathcal{I}^{-1}(B)$, by noticing that $L_{Y_i|X_i} \circ \mathcal{I}$ is exactly the conditional likelihood as in (4.5). The result follows. ∎

In other words, we obtain the same distribution regardless of whether the graph-based posterior is first computed and then pushed forward to the continuum or the other way around. Formally we have the following commutative diagram.

$$\begin{array}{ccc} \Pi_n^{\mathrm{disc}} & \xrightarrow{\ \ D\ \ } & \Pi_n^{\mathrm{disc}}(\cdot \mid D) \\[2pt] \big\downarrow \mathcal{I}_\sharp & & \big\downarrow \mathcal{I}_\sharp \\[2pt] \mathcal{I}_\sharp\Pi_n^{\mathrm{disc}} & \xrightarrow{\ \ D\ \ } & \mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}(\cdot \mid D) = \mathcal{I}_\sharp\big[\Pi_n^{\mathrm{disc}}(\cdot \mid D)\big] \end{array}$$
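To make the Bayes formula (4.4) concrete, the following toy sketch (hypothetical code, not the paper's experiments) approximates the posterior mean by self-normalized importance sampling, weighting prior draws on the point cloud by the Bernoulli likelihood (4.5); the prior draws here are placeholders:

```python
import numpy as np

def posterior_mean_mc(prior_draws, labeled_idx, Y):
    """Toy Monte Carlo version of (4.4): weight prior draws f_n (rows) by
    the Bernoulli likelihood (4.5) at the labeled indices, normalize, and
    average to get the posterior mean on the whole point cloud."""
    F = prior_draws[:, labeled_idx]                  # values at labeled points
    lik = np.prod(F**Y * (1 - F)**(1 - Y), axis=1)   # product over labeled data
    w = lik / lik.sum()                              # self-normalized weights
    return w @ prior_draws

rng = np.random.default_rng(0)
# Placeholder prior draws on a cloud of 10 points (2000 samples)
draws = 1.0 / (1.0 + np.exp(-rng.standard_normal((2000, 10))))
Y = np.array([1, 1, 0])
fbar = posterior_mean_mc(draws, labeled_idx=np.array([0, 1, 2]), Y=Y)
```

In practice one would use MCMC (e.g. a preconditioned Crank–Nicolson scheme) rather than naive weighting, which degenerates as $n$ grows; the sketch only illustrates the structure of numerator and denominator in (4.4).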

Therefore it suffices to study contraction rates of $\Pi_n^{\mathrm{cont}}(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n})$, where we can apply general results from [13, 24] by analyzing the concentration properties of the priors $\Pi_n^{\mathrm{cont}}$. This turns out to be manageable, since $\Pi_n^{\mathrm{cont}}$ is supported on the same space as the continuum prior and approximates it. To see this, first notice that $\Pi_n^{\mathrm{cont}} = \mathcal{L}\big(\mathcal{I}(\Phi(w_n))\big)$, which follows from the fact that

$$\mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}(B) = \Pi_n^{\mathrm{disc}}\big(\mathcal{I}^{-1}(B)\big) = \mathbb{P}\big(\Phi(w_n) \in \mathcal{I}^{-1}(B)\big) = \mathbb{P}\big(\mathcal{I}(\Phi(w_n)) \in B\big).$$

Then observe that $\mathcal{I}(\Phi(w_n)) = \Phi(\mathcal{I}(w_n))$, since $\mathcal{I}$ only depends on the geometry. Indeed, for $x \in M$ and $X_i$ its nearest neighbor, we have

$$\mathcal{I}(\Phi(w_n))(x) = \Phi(w_n)(X_i) = \Phi\big(w_n(X_i)\big) = \Phi\big(\mathcal{I}(w_n)(x)\big) = \Phi\big(\mathcal{I}(w_n)\big)(x).$$

Lastly, since $\mathcal{I}$ is linear, we see that

$$W_n := \mathcal{I}(w_n) = \sum_{i=1}^{k_{N_n}} \big(1 + \lambda_i^{(N_n)}\big)^{-\frac{s}{2}} \xi_i\, \mathcal{I}\psi_i^{(N_n)}, \tag{4.8}$$

and therefore $\Pi_n^{\mathrm{cont}} = \mathcal{L}(\Phi(W_n))$, where $W_n$ is now a Gaussian field on $M$ that approximates $W$. This differs from the graph-based prior in that $W_n$ lives in the same space as $W$, whence we can bound the $L^2(\mu)$-norm between them and apply [24, Theorem 2.2], which formally states that if $\|W_n - W\|_{L^2(\mu)}$ vanishes sufficiently fast relative to the contraction rate, then the sequence of posteriors with respect to $\Phi(W_n)$ will contract at the same rate as if the prior were fixed as $\Phi(W)$.

The following result gives a high-probability bound (with respect to the randomness of the $X_i$'s) on the approximation error of $W$ by $W_n$ (with respect to the randomness of the $\xi_i$'s) under suitable scaling of the graph connectivity $\zeta_{N_n}$ and truncation parameter $k_{N_n}$.

Lemma 4.2.

Suppose $s > m/2$ and $\delta > 0$ is arbitrary. Then, for

$$\zeta_{N_n} \asymp (\log N_n)^{\frac{p_m}{2}}\, N_n^{-\frac{1}{2m}}, \qquad k_{N_n} \asymp N_n^{\frac{1}{(8+\delta)m+2}},$$

where $p_m = 3/4$ for $m = 2$ and $p_m = 1/m$ otherwise, we have

$$E_\xi \|W_n - W\|_{L^2(\mu)} \lesssim (\log N_n)^{\frac{m p_m}{4}}\, N_n^{\frac{m - 2s}{2m\left((8+\delta)m+2\right)}},$$

with probability at least $1 - o(1)$ with respect to the $X_i$'s.

The proof, which we defer to Appendix A, is based on spectral convergence results for graph Laplacians. We illustrate the main idea here. The high-probability event is that the point cloud approximates the underlying manifold well, in terms of the $\infty$-OT distance $\rho_{N_n}$ between $\mu$ and $\mu_{N_n}$. Precisely, it is shown in [9, Theorem 2] that, with high probability,

$$\rho_{N_n} \lesssim (\log N_n)^{p_m}\, N_n^{-\frac{1}{m}}.$$

Conditioning on this event, it will be shown that the approximation error is dominated by the following quantity:

$$E_\xi \|W_n - W\|_{L^2(\mu)} \lesssim k_{N_n}^{\frac{1}{2} - \frac{s}{m}} + \sum_{i=1}^{k_{N_n}} (1 + \lambda_i)^{-\frac{s}{2}}\, \big\|\mathcal{I}\psi_i^{(N_n)} - \psi_i\big\|_{L^2(\mu)}, \tag{4.9}$$

and the eigenfunction approximation error is bounded, up to logarithmic factors, by

$$\big\|\mathcal{I}\psi_i^{(N_n)} - \psi_i\big\|_{L^2(\mu)} \lesssim \bigg(\sqrt{\frac{\rho_{N_n}}{\zeta_{N_n}}} + \zeta_{N_n}\bigg)\, i^{\frac{3}{2} + \frac{1}{2m}}. \tag{4.10}$$

For a fixed $i$ and increasing $N_n$, setting $\zeta_{N_n} \asymp \rho_{N_n}^{1/2}$ we see that the eigenfunction approximation has, up to logarithmic factors, the rate

$$\big\|\mathcal{I}\psi_i^{(N_n)} - \psi_i\big\|_{L^2(\mu)} \lesssim N_n^{-\frac{1}{4m}},$$

and hence it is expected that (4.9) has the same rate. However, this is only true when $s$ is large. The reason lies in the fact that we need to control the bound (4.10) for all $i \le k_{N_n}$, and hence cannot treat $i$ as fixed. Letting $\delta > 0$, (4.9) and (4.10) together with Weyl's law imply

$$E_\xi \|W_n - W\|_{L^2(\mu)} \lesssim k_{N_n}^{\frac{1}{2} - \frac{s}{m}} + \bigg(\sqrt{\frac{\rho_{N_n}}{\zeta_{N_n}}} + \zeta_{N_n}\bigg) \sum_{i=1}^{k_{N_n}} i^{-\frac{s}{m} + \frac{3}{2} + \frac{1}{2m}} \lesssim k_{N_n}^{\frac{1}{2} - \frac{s}{m}} + k_{N_n}^{-\frac{s}{m} + \frac{5}{2} + \delta + \frac{1}{2m}} \bigg(\sqrt{\frac{\rho_{N_n}}{\zeta_{N_n}}} + \zeta_{N_n}\bigg) \sum_{i=1}^{k_{N_n}} i^{-1-\delta}.$$

In other words, the polynomially growing factor $i^{\frac{3}{2} + \frac{1}{2m}}$ can be counteracted if $s$ is large: if $s \ge \big(\frac{5}{2} + \delta + \frac{1}{2m}\big)m$, then the above reduces to $k_{N_n}^{\frac{1}{2} - \frac{s}{m}} + \sqrt{\rho_{N_n}/\zeta_{N_n}} + \zeta_{N_n}$, and we get the rate $N_n^{-\frac{1}{4m}}$ (up to logarithmic factors) by setting $\zeta_{N_n} \asymp \rho_{N_n}^{1/2}$ and choosing $k_{N_n}$ correspondingly.

Now, solving for $N_n$ so that the approximation error in Lemma 4.2 is small enough for [24, Theorem 2.2] to apply, we get our main result.

Theorem 4.3.

Consider the sequence of priors $\Pi_n^{\mathrm{disc}} = \mathcal{L}(\Phi(w_n))$, where $w_n$ is defined in (4.3) with $s > m/2$. Suppose the scalings of $\zeta_{N_n}$ and $k_{N_n}$ are the same as in Lemma 4.2 and

$$N_n \asymp (\log n)^{\frac{m^2 p_m \left((8+\delta)m+2\right)}{4s - 2m}}\; n^{\frac{m\left((8+\delta)m+2\right)}{2s - m}},$$

where $\delta > 0$ is arbitrary. If $w_0 = \Phi^{-1}(f_0) \in F^{\beta,R}$ with $\beta \le s - \frac{m}{2}$, then, for $\varepsilon_n \asymp n^{-\frac{\beta}{2s}}$ and every sufficiently large $M'$,

$$E_{f_0}\, \mathcal{I}_\sharp\Pi_n^{\mathrm{disc}}\Big(f : \|f - f_0\|_{L^2(\mu)} \ge M'\varepsilon_n \,\Big|\, \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n}\Big) \xrightarrow{\;n\to\infty\;} 0,$$

where the expression is understood as the measure, under $\mathcal{I}_\sharp\big[\Pi_n^{\mathrm{disc}}(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n})\big]$, of the set of $f$ at $L^2(\mu)$-distance at least $M'\varepsilon_n$ from $f_0$.

Proof.

By Lemma 4.1, it suffices to show the analogous convergence for $\Pi_n^{\mathrm{cont}}(\cdot \mid \{X_i\}_{i=1}^{N_n}, \{Y_i\}_{i=1}^{n})$. Denoting by $F_n$ the mass that this posterior assigns to the complement of the $M'\varepsilon_n$-ball around $f_0$, we have

$$E_{f_0} F_n = E_{f_0}[F_n \mid A_n]\, \mathbb{P}_{f_0}(A_n) + E_{f_0}[F_n \mid A_n^c]\, \mathbb{P}_{f_0}(A_n^c) \le E_{f_0}[F_n \mid A_n] + \mathbb{P}_{f_0}(A_n^c),$$

where we have used the fact that $F_n \le 1$, and $A_n$ is the high-probability event in Lemma 4.2. With the above scaling of $N_n$, we see that, conditioning on $A_n$, the approximation error of Lemma 4.2 is sufficiently small, and hence by [24, Theorem 2.2] we get posterior contraction at the same rate as in Theorem 3.1, i.e., $E_{f_0}[F_n \mid A_n] \to 0$. The result follows since $\mathbb{P}_{f_0}(A_n^c) \to 0$. ∎

Theorem 4.3 shows that when sufficiently many unlabeled data are available, the interpolated graph-based posteriors contract at the same rate obtained in Theorem 3.1, where the prior is constructed with perfect knowledge of $M$. The idea is that if the geometry of $M$ is recovered from the unlabeled data at a sufficiently fast rate, we are essentially back in the case where $M$ is known. The requirement that $N_n$ grows polynomially with $n$ suffices to guarantee such a fast recovery rate. As in Section 3, optimal posterior contraction rates can be attained, but now under the additional condition that $s$ be large enough to ensure the convergence of $W_n$ towards $W$; a similar restriction was required in [8]. This implies that optimal rates can only be attained for functions with sufficiently high regularity $\beta$.

The number of unlabeled data required grows polynomially with respect to the number of labeled data, where the power depends on the intrinsic dimension $m$. Since $\delta$ can be chosen arbitrarily small, the required sample size $N_n$ is largest when $s$ is close to its lower bound and decreases as $s$ grows. This implies that when $M$ is unknown, we need many more unlabeled data than labeled ones to achieve the same rate of posterior contraction as when $M$ is known. We remark that our analysis only gives an upper bound on the sample complexity required, and the spectral approximation bounds from [3], which give better rates in certain regimes, can be applied to Lemma 4.2 and Theorem 4.3 for improvements. However, we believe that a polynomial dependence, with the intrinsic dimension in the leading power, is necessary to ensure the spectral convergence that our posterior contraction results rely on.

5 Generalization to Nonuniform Marginal Density

We have presented our results under the setting where $\mu$ is the uniform distribution on $M$, and now we show formally how to generalize to nonuniform $\mu$. This time we shall start with the graph-based prior and identify its continuum limit.

Suppose $d\mu = q\, dV$, with a differentiable density $q$ that is bounded above and below by positive constants. To simplify our presentation, consider a similarity matrix defined as in (4.1) with

$$H_{ij} = N^{-1}\, \mathbf{1}\big\{|X_i - X_j| < \zeta_N\big\},$$

where we have only kept the necessary ingredients. Let $\Delta_N = D - H$ be the unnormalized graph Laplacian as above, and we have, for a function $u$ on the point cloud,

$$\Delta_N u(X_i) = \sum_{j=1}^{N} H_{ij}\, \big[u(X_i) - u(X_j)\big].$$

Notice that we can in fact extend $\Delta_N$ to act on functions on $M$ by defining, for $f : M \to \mathbb{R}$ and $x \in M$,

$$\Delta_N f(x) = N^{-1} \sum_{j=1}^{N} \mathbf{1}\big\{|x - X_j| < \zeta_N\big\}\, \big[f(x) - f(X_j)\big].$$
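A minimal numerical sketch of this extended operator (illustrative code; the test function and points are arbitrary):

```python
import numpy as np

def extended_graph_laplacian(f, x, X, zeta):
    """Extended operator of Section 5 at an arbitrary point x:
    (Delta_N f)(x) = N^{-1} sum_j 1{|x - X_j| < zeta} [f(x) - f(X_j)].
    With x equal to a cloud point X_i, this reduces to the graph
    Laplacian row sum applied to f restricted to the cloud."""
    near = np.linalg.norm(X - x, axis=1) < zeta
    return np.sum(f(x) - np.array([f(Xj) for Xj in X[near]])) / X.shape[0]

f = lambda p: p[0]                   # test function: first coordinate
X = np.array([[0.0], [0.5], [2.0]])  # point cloud in R^1
val = extended_graph_laplacian(f, np.array([0.0]), X, zeta=1.0)
# neighbors of x = 0 within zeta = 1 are 0.0 and 0.5, so val = (0 + (0 - 0.5)) / 3
```

Averaging this quantity over the randomness of the $X_j$'s yields exactly the integral (5.1) below, which is the starting point of the Taylor-expansion argument.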

Now, taking the expectation of the above quantity with respect to the $X_j$'s, we have

$$E\, \Delta_N f(x) = \int_{|x - y| < \zeta_N} \big[f(x) - f(y)\big]\, q(y)\, dV(y). \tag{5.1}$$

Since $M$ is locally homeomorphic to $\mathbb{R}^m$, to simplify our presentation even further we consider the above integral as if it were defined over $\mathbb{R}^m$. By Taylor expanding both $f$ and $q$ around $x$, we have