# Semi-supervised Inference: General Theory and Estimation of Means

We propose a general semi-supervised inference framework focused on the estimation of the population mean. We consider both the ideal semi-supervised setting where infinitely many unlabeled samples are available, as well as the ordinary semi-supervised setting in which only a finite number of unlabeled samples is available. As usual in semi-supervised settings, there exists an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses ("labels"). Otherwise the formulation is "assumption-lean" in that no major conditions are imposed on the statistical or functional form of the data. Estimators are proposed along with corresponding confidence intervals for the population mean. Theoretical analysis of both the asymptotic behavior and the ℓ_2-risk of the proposed procedures is given. Surprisingly, the proposed estimators, based on a simple form of the least squares method, outperform the ordinary sample mean. The method is further extended to a nonparametric setting, in which the oracle rate can be achieved asymptotically. The proposed estimators are further illustrated by simulation studies and a real data example involving estimation of the homeless population.

## 1 Introduction

Semi-supervised learning arises naturally in statistics and machine learning when the labels are more difficult or more expensive to acquire than the unlabeled data. While numerous algorithms have been proposed for semi-supervised learning, they are mostly focused on classification, where the labels are discrete values representing the classes to which the samples belong (see, e.g., zhu2005semi; ando2007two; zhu2009introduction; wang2009efficient). The analyses typically rely on two types of assumptions, distribution-based and margin-based. The margin-based analysis (see vapnik2013nature; wang2007large; wang2008probability; wang2009efficient) generally assumes that the samples with different labels have some separation, and the additional unlabeled samples can help enhance the separation and achieve a better classification result. The distributional approach (see blum1998combining; ando2005framework; ando2007two) usually relies on some assumptions of a particular type of relation between labels and samples. These assumptions can be difficult to verify in practice. The setting with continuous-valued responses has also been discussed in the literature, see, e.g., johnson2008graph, wasserman2007statistical and chakrobortty2016efficient. For a survey of recent developments in semi-supervised learning, readers are referred to zhu2009introduction and the references therein.

The general semi-supervised model can be formulated as follows. Let $(Y, X)$ be a $(p+1)$-dimensional random vector following an unknown joint distribution $P$. Denote by $P_X$ the marginal distribution of $X$. Suppose one observes $n$ "labeled" samples from $P$,

$$[\mathbf{Y}, \mathbf{X}] = \{Y_k, X_{k1}, X_{k2}, \cdots, X_{kp}\}_{k=1}^{n}, \tag{1}$$

and, in addition, $m$ "unlabeled" samples from the marginal distribution $P_X$,

$$[\mathbf{X}] = \{X_{k1}, X_{k2}, \cdots, X_{kp}\}_{k=n+1}^{n+m}. \tag{2}$$

In this paper, we focus on estimation and statistical inference for one of the simplest features, namely the population mean $\theta = \mathbb{E}Y$. No specific distributional or marginal assumptions relating $Y$ and $X$ are made.

This inference of the population mean under the general semi-supervised learning framework has a variety of applications. We discuss the estimation of the average treatment effect (ATE) in Section 4.1 and a prototypical example involving survey data in Section 4.2. It is noteworthy that some other problems that do not at first look like mean estimation can be recast as mean estimation, possibly after an appropriate transformation. Examples include estimation of the variance of $Y$ or the covariance between $Y$ and a given covariate. In work that builds on a portion of the present paper, azeriel2016semi considers construction of linear predictors in semi-supervised learning settings.

To estimate $\theta$, the most straightforward estimator is the sample average $\bar{Y}$. Surprisingly, as we show later, a simple least-squares-based estimator, which exploits the unknown association of $Y$ and $X$, outperforms $\bar{Y}$. We first consider an ideal setting where there are infinitely many unlabeled samples, i.e., $m = \infty$. This is equivalent to the case of known marginal distribution $P_X$. We refer to this case as ideal semi-supervised inference. In this case, our proposed estimator is

$$\hat{\theta} = \bar{Y} - \hat{\beta}_{(2)}^{\top}(\bar{X} - \mu), \tag{3}$$

where $\hat{\beta}_{(2)}$ is the $p$-dimensional least squares estimator for the regression slopes and $\mu = \mathbb{E}X$ is the population mean of $X$. This estimator is analyzed in detail in Section 2.2. We then consider the more realistic setting where there are a finite number of unlabeled samples, i.e., $m < \infty$. Here one has only partial information about $P_X$. We call this case ordinary semi-supervised inference. In this setting, we propose to estimate $\theta$ by

$$\hat{\theta} = \bar{Y} - \hat{\beta}_{(2)}^{\top}(\bar{X} - \hat{\mu}), \tag{4}$$

where $\hat{\mu}$ denotes the sample average of both the labeled and unlabeled $X$'s. The detailed analysis of this estimator is given in Section 2.3.
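As a concrete sketch, both estimators can be computed with ordinary least squares; the function names and the numpy implementation below are ours, not from the paper.

```python
import numpy as np

def theta_ls(y, x, mu):
    """Ideal-setting estimator (3): Ybar - beta_hat_(2)^T (Xbar - mu),
    where mu is the known population mean of X."""
    n = len(y)
    design = np.hstack([np.ones((n, 1)), x])           # add intercept column
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    beta2 = beta_hat[1:]                               # slope part beta_hat_(2)
    return y.mean() - beta2 @ (x.mean(axis=0) - mu)

def theta_ssls(y, x_labeled, x_unlabeled):
    """Ordinary-setting estimator (4): replace mu by the pooled mean mu_hat
    over the labeled and unlabeled X's."""
    mu_hat = np.vstack([x_labeled, x_unlabeled]).mean(axis=0)
    return theta_ls(y, x_labeled, mu_hat)
```

With no unlabeled samples (`x_unlabeled` empty), the second estimator reduces to the sample mean, consistent with the discussion in Section 2.3.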

We will investigate the properties of these estimators and in particular establish their asymptotic distributions and risk bounds. Both the case of a fixed number of covariates and the case of a growing number of covariates are considered. The basic asymptotic theory in Section 2 begins with a setting in which the dimension $p$ of $X$ is fixed and $n \to \infty$ (see Theorem 2.1). For ordinary semi-supervised learning, the asymptotic results are of non-trivial interest whenever the unlabeled sample size $m$ is positive (see Theorem 2.3(i)). We then formulate and prove asymptotic results in the setting where $p$ also grows with $n$. In general, these results require an additional condition on the growth of $p$ relative to $n$ (see Theorems 2.2 and 2.3(ii)). The limiting distribution results allow us to construct an asymptotically valid confidence interval based on the proposed estimators that is shorter than the traditional sample-mean-based confidence interval.

In Section LABEL:sec.nonparametric we propose a methodology for improving the results of Section 2 by introducing additional covariates as functions of those given in the original problem. We show the proposed estimator achieves an oracle rate asymptotically. This can be viewed as a nonparametric regression estimation procedure.

There are results in the sample-survey literature that are qualitatively related to what we propose. The earliest citation we are aware of is cochran1953sampling. See also deng1987estimation and more recently lohr2009sampling. In these references one collects a finite sample, without replacement, from a (large) finite population. There is a response $Y$ and a single, real covariate $X$. The distribution of $X$ within the finite population is known. The sample-survey target of estimation is the mean of $Y$ within the full population. In the case in which the size of this population is infinitely large, sampling without replacement and sampling with replacement are indistinguishable. In that case the results from this sampling theory literature coincide with our results for the ideal semi-supervised scenario with $p = 1$, both in terms of the proposed estimator and its asymptotic variance. Otherwise the sample-survey theory results differ from those within our formulation, although there is a conceptual relationship. In particular the theoretical population mean that is our target is different from the finite population mean that is the target of the sample-survey methods. In addition we allow $p > 1$ and as noted above, we also have asymptotic results for $p$ growing with $n$. Most notably, our formulation includes the possibility of semi-supervised learning. We believe it should be possible, and sometimes of practical interest, to include semi-supervised sampling within a sampling survey framework, but we do not do so in the present treatment.

The rest of the paper is organized as follows. We introduce the fixed covariate procedures in Section 2. Specifically, ideal semi-supervised learning and ordinary semi-supervised learning are considered respectively in Sections 2.2 and 2.3, where we analyze the asymptotic properties for both estimators. We further give the ℓ_2-risk upper bounds for the two proposed estimators in Section 2.4. We extend the analysis in Section LABEL:sec.nonparametric to the nonparametric regression model, where we show the proposed procedure achieves an oracle rate asymptotically. Simulation results are reported in Section 3. Application to the estimation of the Average Treatment Effect is discussed in Section 4.1, and Section 4.2 describes a real data illustration involving estimation of the homeless population in a geographical region. The proofs of the main theorems are given in Section 5 and additional technical results are proved in the Appendix.

## 2 Procedures

We propose in this section a least squares estimator for the population mean in the semi-supervised inference framework. To better characterize the problem, we begin with a brief introduction of the random design regression model. More details of the model can be found in, e.g., bujamodels.

### 2.1 A Random Design Regression Model

Let $(Y, X) = (Y, X_1, \ldots, X_p)$ represent the population response and predictors. Assume all second moments are finite. Denote $\vec{X} = (1, X_1, \ldots, X_p)^{\top}$ as the predictor with intercept. The following is a linear analysis, even though no corresponding linearity assumption is made about the true distribution $P$ of $(X, Y)$. Some notation and definitions are needed. Let

$$\beta = \arg\min_{\gamma \in \mathbb{R}^{p+1}} \mathbb{E}\big(Y - \vec{X}^{\top}\gamma\big)^2. \tag{5}$$

Here $\beta$ is referred to as the vector of population slopes, and $\delta := Y - \vec{X}^{\top}\beta$ is called the total deviation. We also denote

$$\tau^2 := \mathbb{E}\delta^2, \quad \mu := \mathbb{E}X \in \mathbb{R}^p, \quad \vec{\mu} := \mathbb{E}\vec{X} = (1, \mu^{\top})^{\top}, \quad \vec{\Xi} = \mathbb{E}\vec{X}\vec{X}^{\top}. \tag{6}$$

Some basic facts about the regression slope and total deviation are summarized in the following lemma.

###### Lemma 2.1

Let $(Y, X)$ have finite second moments, and let the matrix $\vec{\Xi}$ be non-singular. Then

$$\beta = \vec{\Xi}^{-1}\big(\mathbb{E}\vec{X}Y\big), \quad \mathbb{E}\delta = 0, \quad \mathbb{E}\delta X = 0, \quad \theta = \vec{\mu}^{\top}\beta.$$

It should be noted that under our general model, there is no independence assumption between $\delta$ and $X$.

For a sample of $n$ observations $(Y_k, X_k)$, $k = 1, \ldots, n$, let $\vec{X}_k = (1, X_{k1}, \ldots, X_{kp})^{\top}$ and denote the design matrix as follows

$$\vec{\mathbf{X}} := \begin{bmatrix} \vec{X}_1^{\top} \\ \vdots \\ \vec{X}_n^{\top} \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}.$$

In our notation, $\vec{\cdot}$ means that the vector/matrix contains the intercept term; boldface indicates that the symbol is related to a multiple sample of observations. Meanwhile, denote the sample response and deviation as $\mathbf{Y} = (Y_1, \ldots, Y_n)^{\top}$ and $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_n)^{\top}$. Now $\mathbf{Y}$ and $\vec{\mathbf{X}}$ are connected by a regression model:

$$\mathbf{Y} = \vec{\mathbf{X}}\beta + \boldsymbol{\delta}, \quad \text{and} \quad Y_k = \vec{X}_k^{\top}\beta + \delta_k, \quad k = 1, \cdots, n. \tag{7}$$

Let $\hat{\beta}$ be the usual least squares estimator, i.e.

$$\hat{\beta} = \big(\vec{\mathbf{X}}^{\top}\vec{\mathbf{X}}\big)^{-1}\vec{\mathbf{X}}^{\top}\mathbf{Y}. \tag{8}$$

Then $\vec{\mu}^{\top}\hat{\beta}$ provides a straightforward estimator for $\theta$. $\beta$ and $\hat{\beta}$ can be further split into two parts,

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_{(2)} \end{bmatrix}, \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_{(2)} \end{bmatrix}, \quad \beta_1, \hat{\beta}_1 \in \mathbb{R}, \quad \beta_{(2)}, \hat{\beta}_{(2)} \in \mathbb{R}^p. \tag{9}$$

$\beta_1$ and $\beta_{(2)}$ play different roles in the analysis as we will see later. The risk of the sample average $\bar{Y}$ about the population mean $\theta$ has the following decomposition.

###### Proposition 2.1

$\bar{Y}$ is an unbiased estimator of $\theta$, and

$$n\mathbb{E}(\bar{Y} - \theta)^2 = n\,\mathrm{Var}(\bar{Y}) = \tau^2 + \beta_{(2)}^{\top}\mathbb{E}\big((X-\mu)(X-\mu)^{\top}\big)\beta_{(2)}. \tag{10}$$

From (10), we can see that as long as $\beta_{(2)} \neq 0$, i.e., there is a significant linear relationship between $Y$ and $X$, the risk of $\bar{Y}$ will be significantly greater than $\tau^2/n$.
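The decomposition (10) can be checked with a quick Monte Carlo experiment; the Gaussian design and parameter values below are illustrative choices of ours, with $\delta$ taken independent of $X$ for simplicity.

```python
import numpy as np

# Check n*Var(Ybar) ~ tau^2 + beta_(2)^T Sigma beta_(2), as in (10),
# in a Gaussian linear model with delta independent of X.
rng = np.random.default_rng(1)
beta2 = np.array([1.5, -0.5])
tau2 = 1.0
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])             # Sigma = Cov(X)
theory = tau2 + beta2 @ sigma @ beta2                  # right-hand side of (10)

n, reps = 50, 20000
chol = np.linalg.cholesky(sigma)
ybars = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=(n, 2)) @ chol.T               # X ~ N(0, Sigma)
    y = x @ beta2 + np.sqrt(tau2) * rng.normal(size=n)
    ybars[r] = y.mean()
print(n * ybars.var())                                 # close to `theory`
```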

In the next two subsections, we discuss estimation of $\theta$ separately under the ideal semi-supervised setting and the ordinary semi-supervised setting.

### 2.2 Improved Estimator under the Ideal Semi-supervised Setting

We first consider the ideal setting where there are infinitely many unlabeled samples, or equivalently $P_X$ is known. To improve $\bar{Y}$, we propose the least squares estimator,

$$\hat{\theta}_{LS} := \vec{\mu}^{\top}\hat{\beta} = \hat{\beta}_1 + \mu^{\top}\hat{\beta}_{(2)} = \bar{Y} - \hat{\beta}_{(2)}^{\top}(\bar{X} - \mu), \tag{11}$$

where $\hat{\beta}$ is defined in (8).

The following theorem provides the asymptotic distribution of the least squares estimator under the minimal conditions that $(Y, X)$ have finite second moments, $\vec{\Xi}$ be non-singular, and $\tau^2 > 0$.

###### Theorem 2.1 (Asymptotic Distribution, fixed p)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. copies from $P$, and assume that $(Y, X)$ has finite second moments, $\vec{\Xi}$ is non-singular and $\tau^2 > 0$. Then, under the setting that $p$ is fixed and $n$ grows to infinity,

$$\frac{\hat{\theta}_{LS} - \theta}{\tau/\sqrt{n}} \xrightarrow{d} N(0, 1), \tag{12}$$

and

$$\mathrm{MSE}/\tau^2 \xrightarrow{d} 1, \quad \text{where } \mathrm{MSE} := \frac{\sum_{i=1}^{n}\big(Y_i - \vec{X}_i^{\top}\hat{\beta}\big)^2}{n - p - 1}. \tag{13}$$

In the more general setting where $p$ varies and grows, we need stronger conditions to analyze the asymptotic behavior of $\hat{\theta}_{LS}$. Suppose the covariance matrix $\Sigma = \mathbb{E}(X-\mu)(X-\mu)^{\top}$ is non-singular; we consider the standardization of $X$ as

$$Z \in \mathbb{R}^p, \quad Z = \Sigma^{-1/2}(X - \mu). \tag{14}$$

Clearly, $\mathbb{E}Z = 0$ and $\mathrm{Cov}(Z) = I_p$. For this setting we assume that $Z$ and $\delta$ satisfy the following moment conditions:

$$\text{for some } \kappa > 0, \quad \frac{\mathbb{E}\delta^{2+2\kappa}}{(\mathbb{E}\delta^2)^{1+\kappa}} \le M_1; \tag{15}$$
$$\forall v \in \mathbb{R}^p \text{ with } \|v\|_2 = 1, \quad \mathbb{E}|\langle v, Z\rangle|^{2+\kappa} \le M_2; \tag{16}$$
$$\frac{\mathbb{E}\big(\|Z\|_2^2\,\delta^2\big)}{\big(\mathbb{E}\|Z\|_2^2\big)\cdot\big(\mathbb{E}\delta^2\big)} \le M_3. \tag{17}$$
###### Theorem 2.2 (Asymptotic result, growing p)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. copies from $P = P_n$, where $p = p_n$ may grow with $n$. Assume that the matrix $\vec{\Xi}$ of the second moments of $\vec{X}$ exists and is non-singular, and that the standardized random variable $Z$ given in (14) satisfies (15), (16) and (17). Then the asymptotic results (12) and (13) still hold.

Based on Theorems 2.1 and 2.2, we can construct the asymptotic $(1-\alpha)$-level confidence interval for $\theta$ as

$$\left[\hat{\theta}_{LS} - z_{1-\alpha/2}\sqrt{\frac{\mathrm{MSE}}{n}},\ \hat{\theta}_{LS} + z_{1-\alpha/2}\sqrt{\frac{\mathrm{MSE}}{n}}\right]. \tag{18}$$
###### Remark 2.1

It is not difficult to see that, under the setting in Theorem 2.2,

$$\mathrm{MSE} \xrightarrow{d} \tau^2, \quad \hat{\sigma}^2_Y \xrightarrow{d} \mathrm{Var}(Y) = \tau^2 + \beta_{(2)}^{\top}\mathbb{E}\big((X-\mu)(X-\mu)^{\top}\big)\beta_{(2)}.$$

Then the traditional $z$-interval for the mean of $Y$,

$$\left[\bar{Y} - z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}^2_Y}{n}},\ \bar{Y} + z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}^2_Y}{n}}\right], \tag{19}$$

is asymptotically wider than (18), which implies that the proposed least squares estimator is asymptotically more accurate than the sample mean.
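The two intervals can be compared numerically; this is a sketch of ours with the $\alpha = 0.05$ normal quantile hard-coded.

```python
import numpy as np

Z975 = 1.959964  # z_{0.975}, i.e. alpha = 0.05

def ls_interval(y, x, mu):
    """Interval (18) around theta_hat_LS, with MSE as in (13)."""
    n, p = x.shape
    design = np.hstack([np.ones((n, 1)), x])
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    theta_hat = y.mean() - beta_hat[1:] @ (x.mean(axis=0) - mu)
    mse = np.sum((y - design @ beta_hat) ** 2) / (n - p - 1)
    half = Z975 * np.sqrt(mse / n)
    return theta_hat - half, theta_hat + half

def mean_interval(y):
    """Traditional z-interval (19) based on the sample mean."""
    half = Z975 * np.sqrt(y.var(ddof=1) / len(y))
    return y.mean() - half, y.mean() + half
```

When the linear signal $\beta_{(2)}$ is non-zero, (18) comes out visibly shorter than (19), in line with Remark 2.1.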

### 2.3 Improved Estimator under the Ordinary Semi-supervised Inference Setting

In the last section, we discussed the estimation of $\theta$ based on full observations with infinitely many unlabeled samples (or equivalently with known marginal distribution $P_X$). However, having $P_X$ known is rare in practice. A more realistic practical setting would assume that the distribution $P_X$ is unknown and we only have finitely many additional i.i.d. samples of $X$ without corresponding responses. This problem relates to the one in the previous section since we are able to obtain partial information about $P_X$ from the additional unlabeled samples.

When $P_X$ (and hence $\mu$) is unknown, we estimate $\mu$ by

$$\hat{\mu} = \frac{1}{n+m}\sum_{k=1}^{n+m} X_k, \quad \hat{\vec{\mu}} = (1, \hat{\mu}^{\top})^{\top}. \tag{20}$$

Recall that $\hat{\beta}$ in (8) is the ordinary least squares estimator. Now, we propose the semi-supervised least squares estimator $\hat{\theta}_{SSLS}$,

$$\hat{\theta}_{SSLS} = \hat{\vec{\mu}}^{\top}\hat{\beta} = \bar{Y} - \hat{\beta}_{(2)}^{\top}\left(\frac{\sum_{i=1}^{n} X_i}{n} - \frac{\sum_{i=1}^{n+m} X_i}{n+m}\right). \tag{21}$$

$\hat{\theta}_{SSLS}$ has the following properties:

• when $m = \infty$, $\hat{\mu} = \mu$. Then $\hat{\theta}_{SSLS}$ exactly equals $\hat{\theta}_{LS}$ in (11);

• when $m = 0$, $\hat{\theta}_{SSLS}$ exactly equals $\bar{Y}$. As there are no additional samples of $X$, no extra information for $\mu$ is available, and it is natural to use $\bar{Y}$ to estimate $\theta$;

• in the last term of (21), it is important to use $\frac{1}{n+m}\sum_{i=1}^{n+m} X_i$ rather than $\frac{1}{m}\sum_{i=n+1}^{n+m} X_i$, in spite of the fact that the latter might seem more natural because it is independent of the term that precedes it.

Under the same conditions as Theorems 2.1 and 2.2, we can show the following asymptotic results for $\hat{\theta}_{SSLS}$, which relate to the ordinary semi-supervised setting described in the introduction. The labeled sample size is $n$, the unlabeled sample size is $m$, and the distribution $P$ is fixed (but unknown), which, in particular, implies that $p$ is a fixed dimension, not dependent on $n$. Let

$$\nu^2 = \tau^2 + \frac{n}{n+m}\beta_{(2)}^{\top}\Sigma\beta_{(2)}, \quad \Sigma = \mathbb{E}(X-\mu)(X-\mu)^{\top}.$$
###### Theorem 2.3 (Asymptotic distribution of ^θSSLS, fixed p)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. labeled samples from $P$, and let $X_{n+1}, \ldots, X_{n+m}$ be additional unlabeled samples from $P_X$. Suppose $\vec{\Xi}$ is non-singular and $\tau^2 > 0$. If $p$ is fixed and $n \to \infty$, then

$$\frac{\sqrt{n}\big(\hat{\theta}_{SSLS} - \theta\big)}{\nu} \xrightarrow{d} N(0, 1), \tag{22}$$

and

$$\frac{\hat{\nu}^2}{\nu^2} \xrightarrow{d} 1, \tag{23}$$

where $\hat{\nu}^2 = \mathrm{MSE} + \frac{n}{n+m}\hat{\beta}_{(2)}^{\top}\hat{\Sigma}\hat{\beta}_{(2)}$, with $\mathrm{MSE}$ defined in (13) and $\hat{\Sigma}$ the sample covariance matrix of the $X$'s.

Based on Theorems 2.3 and 2.4, the $(1-\alpha)$-level asymptotic confidence interval for $\theta$ can be written as

$$\left[\hat{\theta}_{SSLS} - z_{1-\alpha/2}\frac{\hat{\nu}}{\sqrt{n}},\ \hat{\theta}_{SSLS} + z_{1-\alpha/2}\frac{\hat{\nu}}{\sqrt{n}}\right]. \tag{24}$$

Since $\nu^2 \le \mathrm{Var}(Y)$ asymptotically (with equality only when $\beta_{(2)} = 0$), whenever $m > 0$ the asymptotic CI in (24) is shorter than the traditional sample-mean-based CI (19).
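A plug-in sketch of the point estimate (21) together with interval (24); the variance estimate `nu2_hat` below (MSE plus the plug-in slope term) is our reading of the theorem's $\hat{\nu}^2$, not a verbatim formula from the paper.

```python
import numpy as np

def ssls_interval(y, x_lab, x_unlab, z=1.959964):
    """theta_hat_SSLS from (21) with interval (24), using the plug-in
    nu2_hat = MSE + (n/(n+m)) * beta_hat_(2)^T Sigma_hat beta_hat_(2)."""
    n, p = x_lab.shape
    m = x_unlab.shape[0]
    x_all = np.vstack([x_lab, x_unlab])
    design = np.hstack([np.ones((n, 1)), x_lab])
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    theta_hat = y.mean() - beta_hat[1:] @ (x_lab.mean(axis=0) - x_all.mean(axis=0))
    mse = np.sum((y - design @ beta_hat) ** 2) / (n - p - 1)
    sigma_hat = np.cov(x_all, rowvar=False)            # pooled sample covariance
    nu2_hat = mse + n / (n + m) * beta_hat[1:] @ sigma_hat @ beta_hat[1:]
    half = z * np.sqrt(nu2_hat / n)
    return theta_hat - half, theta_hat + half
```

With $m > 0$ and a non-trivial slope, the resulting interval is shorter than the sample-mean z-interval, as the discussion above predicts.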

The following statement refers to a setting in which $p$ and $P$ may depend on $n$ as $n \to \infty$. Consequently, $\beta$, $\tau^2$, and $Z$ (defined at (14)) may also depend on $n$.

###### Theorem 2.4 (Asymptotic distribution of ^θSSLS, growing p)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. samples from $P = P_n$, with $X_{n+1}, \ldots, X_{n+m}$ additional unlabeled samples from $P_X$, and $n \to \infty$. Suppose $\vec{\Xi}$ is non-singular, and the standardized random variable $Z$ satisfies (15), (16) and (17). Then (22) and (23) hold.

### 2.4 ℓ2 Risk for the Proposed Estimators

In this subsection, we analyze the risk for both $\hat{\theta}_{LS}$ and $\hat{\theta}_{SSLS}$. Since the calculation of the proposed estimators involves the unstable process of inverting the Gram matrix $\vec{\mathbf{X}}^{\top}\vec{\mathbf{X}}$, for the merely theoretical purpose of obtaining the risks we again consider the refinement

$$\hat{\theta}^1_{LS} := \mathrm{Trun}_Y\big(\hat{\theta}_{LS}\big), \quad \text{and} \quad \hat{\theta}^1_{SSLS} := \mathrm{Trun}_Y\big(\hat{\theta}_{SSLS}\big), \tag{25}$$

where

$$\mathrm{Trun}_Y(x) = \begin{cases} (n+1)y_{\max} - n y_{\min}, & \text{if } x > (n+1)y_{\max} - n y_{\min}, \\ x, & \text{if } \left|x - \frac{y_{\max}+y_{\min}}{2}\right| \le \left(n+\frac{1}{2}\right)(y_{\max}-y_{\min}), \\ (n+1)y_{\min} - n y_{\max}, & \text{if } x < (n+1)y_{\min} - n y_{\max}, \end{cases} \tag{26}$$

with $y_{\max} := \max_{1 \le k \le n} Y_k$ and $y_{\min} := \min_{1 \le k \le n} Y_k$. We emphasize that this refinement is mainly for theoretical reasons and is often not necessary in practice.
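The truncation (26) is just a clip of the estimate to an interval determined by the observed range of the responses; a minimal sketch:

```python
import numpy as np

def trun_y(x, y):
    """Trun_Y from (26): clip x to [(n+1)*ymin - n*ymax, (n+1)*ymax - n*ymin],
    where n = len(y). The middle case of (26) is exactly this interval."""
    n = len(y)
    ymin, ymax = float(np.min(y)), float(np.max(y))
    lo = (n + 1) * ymin - n * ymax
    hi = (n + 1) * ymax - n * ymin
    return min(max(float(x), lo), hi)
```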

The regularization assumptions we need for analyzing the risk are formally stated below.

1. (Moment condition on $\delta$) There exists a constant $M_4$ such that

$$\mathbb{E}\delta^4 = \mathbb{E}\delta_n^4 \le M_4; \tag{27}$$
2. (Sub-Gaussian condition) Suppose $Z_n$ is the standardization of $X_n$,

$$Z_n \in \mathbb{R}^p, \quad Z_n = \Sigma_n^{-1/2}(X_n - \mu_n), \quad \Sigma_n = \mathbb{E}(X_n - \mu_n)(X_n - \mu_n)^{\top},$$

which satisfies

$$\forall u \in \{u \in \mathbb{R}^{p} : \|u\|_2 = 1\}, \quad \big\|u^{\top} Z_n\big\|_{\psi_2} \le M_5. \tag{28}$$

Here $\|W\|_{\psi_2} := \sup_{q \ge 1} q^{-1/2}\big(\mathbb{E}|W|^q\big)^{1/q}$ for any random variable $W$.

2'. (Bounded condition) The standardization $Z_n$ satisfies

$$\|Z_n\|_{\infty} \le M_5, \quad \text{almost surely}. \tag{29}$$

Under the regularization assumptions above, we provide the risks for $\hat{\theta}^1_{LS}$ and $\hat{\theta}^1_{SSLS}$ respectively in the next two theorems.

###### Theorem 2.5 (ℓ2 Risk of ^θ1LS)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. copies from $P = P_n$. Assume that Assumptions 1+2 ((27), (28)) or 1+2' ((27), (29)) hold. Recall that $\tau^2 = \tau_n^2$ depends on $n$. Then we have the following estimate for the risk of $\hat{\theta}^1_{LS}$,

$$n\mathbb{E}\big(\hat{\theta}^1_{LS} - \theta\big)^2 = \tau_n^2 + s_n, \tag{30}$$

where

$$s_n = \frac{p^2}{n}A_{n,p} + \frac{p^2}{n^{5/4}}B_{n,p}, \quad \max\big(|A_{n,p}|, |B_{n,p}|\big) \le C \tag{31}$$

for a constant $C$ that depends only on the constants in the assumptions. The formula for $A_{n,p}$ is

$$A_{n,p} = \frac{1}{p^2}\Big( \big[\mathrm{tr}\big(\Sigma^{-1}\Sigma_{\delta 1}\big)\big]^2 + 3\big\|\Sigma^{-1}\Sigma_{\delta 1}\big\|_F^2 - \mathrm{tr}\big(\Sigma^{-1}\Sigma_{\delta 2}\big) + 2\,\mathbb{E}\big(\delta^2(X-\mu)^{\top}\big)\cdot\mathbb{E}\big(\Sigma^{-1}(X-\mu)(X-\mu)^{\top}\Sigma^{-1}(X-\mu)\big) + 2p\tau^2 \Big). \tag{32}$$
###### Theorem 2.6 (ℓ2 risk of ^θ1SSLS)

Let $(Y_k, X_k)$, $k = 1, \ldots, n$, be i.i.d. labeled samples from $P = P_n$, and let $X_{n+1}, \ldots, X_{n+m}$ be additional unlabeled samples from $P_X$. If Assumptions 1+2 or 1+2' in (27)-(29) hold, we have the following estimate of the risk for $\hat{\theta}^1_{SSLS}$,

$$n\mathbb{E}\big(\hat{\theta}^1_{SSLS} - \theta\big)^2 = \tau_n^2 + \frac{n}{n+m}\beta_{(2),n}^{\top}\Sigma_n\beta_{(2),n} + s_{n,m}, \tag{33}$$

where

$$|s_{n,m}| \le C\frac{p^2}{n} \tag{34}$$

for a constant $C$ that depends only on the constants in Assumptions (27)-(29).

We consider further improvement in the remainder of this section. Before we illustrate how the improved estimator works, it is helpful to take a look at the oracle risk for estimating the mean $\theta$, which can serve as a benchmark for the performance of the improved estimator.

### 2.5 Oracle Estimator and Risk

Define $\xi(x) := \mathbb{E}(Y \mid X = x)$ as the response surface and suppose

$$\xi(x) = \xi_0(x) + c$$

for some unknown constant $c$. Given samples $(Y_k, X_k)$, $k = 1, \ldots, n$, our goal is to estimate $\theta = \mathbb{E}Y$. Now assume an oracle has knowledge of $\xi_0(\cdot)$, but not of $c$, $\theta$, nor the distribution of $(Y, X)$. In this case, the model can be written as

$$Y_k - \xi_0(X_k) = c + \varepsilon_k, \quad k = 1, \cdots, n, \quad \text{where } \mathbb{E}\varepsilon_k = 0; \quad \theta = \mathbb{E}\xi_0(X) + c. \tag{35}$$

Under the ideal semi-supervised setting, $P_X$, and hence $\mathbb{E}\xi_0(X)$, is known. To estimate $\theta$, the natural idea is to use the following estimator

$$\hat{\theta}^* = \bar{Y} - \bar{\xi}_0 + \mathbb{E}\xi_0(X) = \frac{1}{n}\sum_{k=1}^{n}\big(Y_k - \xi_0(X_k)\big) + \mathbb{E}\xi_0(X). \tag{36}$$

Clearly $\hat{\theta}^*$ is an unbiased estimator of $\theta$, while

$$n\mathbb{E}\big(\hat{\theta}^* - \theta\big)^2 = n\,\mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \xi_0(X_i)\big)\Big) = \mathrm{Var}\big(Y_i - \xi(X_i)\big) = \mathbb{E}_X\big(\mathbb{E}_Y\big((Y - \xi(X))^2 \mid X\big)\big) := \sigma^2. \tag{37}$$

This defines the oracle risk for population mean estimation under the ideal semi-supervised setting as $\sigma^2$.

For the ordinary semi-supervised setting, where $P_X$ is unknown but $m$ additional unlabeled samples are available, we propose the semi-supervised oracle estimator as

$$\hat{\theta}^*_{ss} = \bar{Y} - \frac{1}{n}\sum_{k=1}^{n}\xi_0(X_k) + \frac{1}{n+m}\sum_{k=1}^{n+m}\xi_0(X_k).$$

Then one can calculate that

$$n\mathbb{E}\big(\hat{\theta}^*_{ss} - \theta\big)^2 = \sigma^2 + \frac{n}{n+m}\mathrm{Var}_{P_X}\big(\xi(X)\big). \tag{38}$$

The detailed calculation of (38) is provided in the Appendix.

The preceding motivation for $\sigma^2$ and $\sigma^2 + \frac{n}{n+m}\mathrm{Var}_{P_X}(\xi(X))$ as the oracle risks is partly heuristic, based on the arguments in (36) and (37). But it corresponds to a formal minimax statement, as follows.

###### Proposition 2.2 (Oracle Lower Bound)

Let $\xi_0(\cdot)$ and $\sigma^2$ be given, and define

$$\mathcal{P}_{\xi_0(\cdot), \sigma^2} = \left\{P : \xi_0(x) = \mathbb{E}(Y \mid X = x) - c,\ \sigma^2 = \mathbb{E}_X\big(\mathbb{E}_Y(Y - \mathbb{E}(Y \mid X))^2\big)\right\}.$$

Then, based on observations $(Y_k, X_k)$, $k = 1, \ldots, n$, and known marginal distribution $P_X$,

$$\inf_{\tilde{\theta}} \sup_{P \in \mathcal{P}_{\xi_0, \sigma^2}} \Big[\mathbb{E}_P\big(n(\tilde{\theta} - \theta)^2\big)\Big] = \sigma^2. \tag{39}$$

Similarly, let $\xi_0(\cdot)$, $\sigma^2_\xi$ and $\sigma^2$ be given, and define

$$\mathcal{P}^{ss}_{\xi_0, \sigma^2_\xi, \sigma^2} = \left\{P : \xi_0(x) = \mathbb{E}(Y \mid X = x) - c,\ \sigma^2_\xi = \mathrm{Var}\big(\xi(X)\big),\ \sigma^2 = \mathbb{E}_X\big(\mathbb{E}_Y(Y - \mathbb{E}(Y \mid X))^2\big)\right\}.$$

Then, based on observations $(Y_k, X_k)$, $k = 1, \ldots, n$, and $X_{n+1}, \ldots, X_{n+m}$,

$$\inf_{\tilde{\theta}} \sup_{P \in \mathcal{P}^{ss}_{\xi_0, \sigma^2_\xi, \sigma^2}} \Big[\mathbb{E}_P\big(n(\tilde{\theta} - \theta)^2\big)\Big] = \sigma^2 + \frac{n}{n+m}\sigma^2_\xi. \tag{40}$$

### 2.6 Improved Procedure

In order to approach oracle optimality we propose to augment the set of covariates with $q$ additional covariates $g_1(X), \ldots, g_q(X)$. (Of course these additional covariates need to be chosen without knowledge of $\xi_0$. We will discuss their choice later in this section.) In all there are now $p + q$ covariates, say

$$X^{\bullet} = \big(X^{\bullet}_1, \ldots, X^{\bullet}_p, X^{\bullet}_{p+1}, \ldots, X^{\bullet}_{p+q}\big) = \big(X_1, \ldots, X_p, g_1(X), \ldots, g_q(X)\big).$$

For both ideal and ordinary semi-supervision we propose to let $q \to \infty$ as $n \to \infty$, and to use the corresponding estimators $\hat{\theta}^{\bullet}_{LS}$ and $\hat{\theta}^{\bullet}_{SSLS}$ built from the augmented covariates. For the merely theoretical purpose of bounding risks we consider the refinement again,

$$\hat{\theta}^{\bullet 1}_{LS} = \mathrm{Trun}_Y\big(\hat{\theta}^{\bullet}_{LS}\big) \quad \text{and} \quad \hat{\theta}^{\bullet 1}_{SSLS} = \mathrm{Trun}_Y\big(\hat{\theta}^{\bullet}_{SSLS}\big),$$

where $\mathrm{Trun}_Y$ is defined in (26). The previous theorems then apply for asymptotic distributions and moments. For convenience of statement and proof we assume that the support of $X$ is compact, $\xi$ is bounded and $\delta$ is sub-Gaussian. These assumptions can each be somewhat relaxed at the cost of additional technical assumptions and complications. Here is a formal statement of the result.

###### Theorem 2.7

Assume the support of $X$ is compact, $\xi$ is bounded, and $\delta$ is sub-Gaussian. Consider asymptotics as $n \to \infty$ for the case of both ideal and ordinary semi-supervision. Assume also that either (i) $\xi$ is continuous or (ii) $P_X$ is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^p$. Let $\{g_j\}$ be a bounded basis for the continuous functions on the support of $X$ in case (i), and a bounded basis for the ordinary Hilbert space $L^2(P_X)$ in case (ii). There exists a sequence $q = q_n \to \infty$ such that:

• the estimator $\hat{\theta}^{\bullet 1}_{LS}$ for the problem with observations $(Y_k, X_k)$, $k = 1, \ldots, n$, asymptotically achieves the ideal oracle risk, i.e.

$$\lim_{n \to \infty} n\mathbb{E}\big(\hat{\theta}^{\bullet 1}_{LS} - \theta\big)^2 = \sigma^2. \tag{41}$$
• now suppose $n/(n+m) \to \rho$ for some fixed value $\rho$, and apply the estimator $\hat{\theta}^{\bullet 1}_{SSLS}$ to the problem with observations $(Y_k, X_k)$, $k = 1, \ldots, n$, and $X_{n+1}, \ldots, X_{n+m}$. Then

$$\lim_{n \to \infty} n\mathbb{E}\big(\hat{\theta}^{\bullet 1}_{SSLS} - \theta\big)^2 = \sigma^2 + \rho\,\mathrm{Var}_{P_X}\big(\xi(X)\big). \tag{42}$$

Finally, $\hat{\theta}^{\bullet 1}_{LS}$ and $\hat{\theta}^{\bullet 1}_{SSLS}$ are asymptotically unbiased and normal with the corresponding variances.

(41) and (42) show that the proposed estimators asymptotically achieve the oracle values in (39) and (40).

## 3 Simulation Results

In this section, we investigate the numerical performance of the proposed estimators in various settings in terms of estimation errors and coverage probability as well as length of confidence intervals. All the simulations are repeated 1000 times.

We analyze the linear least squares estimators $\hat{\theta}_{LS}$ and $\hat{\theta}_{SSLS}$ proposed in Section 2 in the following three settings.

1. (Gaussian $X$ and quadratic $\xi$) We generate the design parameters $\mu$, $\Sigma$, $\beta$, and then draw i.i.d. samples as

$$X_k \sim N(\mu, \Sigma), \quad Y_k = \xi(X_k) + \varepsilon_k,$$

where

$$\xi(X_k) = \big(\|X_k\|_2^2 - p\big) + \vec{X}_k^{\top}\beta, \quad \varepsilon_k \sim N\big(0,\, 2\|X_k\|_2^2/p\big).$$

It is easy to calculate $\theta = \mathbb{E}\xi(X_k)$ in this setting.
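A data generator for this setting can be sketched as follows; since the excerpt does not show the paper's exact $\mu$, $\Sigma$, $\beta$, the values below are placeholders of ours.

```python
import numpy as np

def generate_setting1(n, p, rng):
    """Draw (X_k, Y_k) with Gaussian X and quadratic xi, as in setting 1.
    mu = 0, Sigma = I, beta = (1, ..., 1) are placeholder parameter choices,
    under which E(||X||^2 - p) = 0 and theta = E Y = 1."""
    beta = np.ones(p + 1)                              # (intercept, slopes)
    x = rng.normal(size=(n, p))                        # X ~ N(0, I_p)
    xvec = np.hstack([np.ones((n, 1)), x])
    xi = (np.sum(x ** 2, axis=1) - p) + xvec @ beta
    eps = rng.normal(scale=np.sqrt(2 * np.sum(x ** 2, axis=1) / p))
    return x, xi + eps
```

Note the heteroscedastic noise: `eps` has conditional variance $2\|X_k\|_2^2/p$, exactly as specified above.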

2. (Heavy-tailed $X$ and $\varepsilon$) We randomly generate

$$\{X_{ki}\}_{1 \le k \le n,\, 1 \le i \le p} \overset{iid}{\sim} P_3, \quad Y_k = \sum_{i=1}^{p}\big(\sin(X_{ki}) + X_{ki}\big) + 0.5 \cdot \varepsilon_k, \quad \varepsilon_k \overset{iid}{\sim} P_3,$$

where $P_3$ denotes a heavy-tailed distribution that has no third or higher moments.

3. (Poisson $X$ and $Y$) We also consider a setting where

$$\{X_{ki}\}_{1 \le k \le n,\, 1 \le i \le p} \overset{iid}{\sim} \mathrm{Poisson}(10), \quad Y_k \mid X_k \overset{iid}{\sim} \mathrm{Poisson}(10 X_{k1}).$$

In this case,