1 Introduction
Semisupervised learning arises naturally in statistics and machine learning when the labels are more difficult or more expensive to acquire than the unlabeled data. While numerous algorithms have been proposed for semisupervised learning, they are mostly focused on classification, where the labels are discrete values representing the classes to which the samples belong (see, e.g., zhu2005semi; ando2007two; zhu2009introduction; wang2009efficient). The analyses typically rely on two types of assumptions, distribution-based and margin-based. The margin-based analysis (see vapnik2013nature; wang2007large; wang2008probability; wang2009efficient) generally assumes that the samples with different labels have some separation, and that the additional unlabeled samples can help enhance the separation and achieve a better classification result. The distribution-based approach (see blum1998combining; ando2005framework; ando2007two) usually relies on assumptions of a particular type of relation between labels and samples. These assumptions can be difficult to verify in practice. The setting with continuous-valued labels has also been discussed in the literature; see, e.g., johnson2008graph, wasserman2007statistical and chakrobortty2016efficient. For a survey of recent developments in semisupervised learning, readers are referred to zhu2009introduction and the references therein.
The general semisupervised model can be formulated as follows. Let be a
dimensional random vector following an unknown joint distribution
. Denote by the marginal distribution of . Suppose one observes “labeled” samples from ,
(1) 
and, in addition, “unlabeled” samples from the marginal distribution
(2) 
In this paper, we focus on estimation and statistical inference for one of the simplest features, namely the population mean . No specific distributional or marginal assumptions relating and are made.
This inference of the population mean under the general semisupervised learning framework has a variety of applications. We discuss the estimation of the average treatment effect (ATE) in Section 4.1 and a prototypical example involving survey data in Section 4.2. It is noteworthy that some other problems that do not at first look like mean estimation can be recast as mean estimation, possibly after an appropriate transformation. Examples include estimation of the variance of or of the covariance between and a given . In work that builds on a portion of the present paper, azeriel2016semi considers the construction of linear predictors in semisupervised learning settings. To estimate , the most straightforward estimator is the sample average . Surprisingly, as we show later, a simple least-squares-based estimator, which exploits the unknown association between and , outperforms . We first consider an ideal setting in which there are infinitely many unlabeled samples, i.e., . This is equivalent to the case of a known marginal distribution . We refer to this case as ideal semisupervised inference. In this case, our proposed estimator is
(3) 
where is the dimensional least squares estimator for the regression slopes and is the population mean of . This estimator is analyzed in detail in Section 2.2. We then consider the more realistic setting where there are a finite number of unlabeled samples, i.e., . Here one has only partial information about . We call this case ordinary semisupervised inference. In this setting, we propose to estimate by
(4) 
where denotes the sample average of both the labeled and unlabeled ’s. The detailed analysis of this estimator is given in Section 2.3.
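The two estimators just described can be sketched in code. This is a minimal illustration on our part, not the paper's implementation; all variable names are our own, since the paper's notation did not survive extraction.

```python
import numpy as np

def ideal_ss_mean(y, x, mu_x):
    """Sketch of the ideal semisupervised estimator (3): the sample mean of y,
    adjusted by the fitted regression slopes times the gap between the sample
    mean of x and its KNOWN population mean mu_x."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), x])           # design with intercept
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]    # ordinary least squares fit
    return y.mean() - beta[1:] @ (x.mean(axis=0) - mu_x)

def ordinary_ss_mean(y, x_lab, x_unlab):
    """Sketch of the ordinary semisupervised estimator (4): the unknown
    population mean of x is replaced by the average over the labeled
    AND unlabeled covariates."""
    x_bar_all = np.vstack([x_lab, x_unlab]).mean(axis=0)
    return ideal_ss_mean(y, x_lab, x_bar_all)
```

The adjustment term removes the part of the sampling error of the naive average that is explained by the covariates, which is why the estimator can outperform the sample mean.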
We will investigate the properties of these estimators and, in particular, establish their asymptotic distributions and risk bounds. Both the case of a fixed number of covariates and the case of a growing number of covariates are considered. The basic asymptotic theory in Section 2 begins with a setting in which the dimension, , of , is fixed and (see Theorem 2.1). For ordinary semisupervised learning, the asymptotic results are of nontrivial interest whenever (see Theorem 2.3(i)). We then formulate and prove asymptotic results in the setting where also grows with . In general, these results require the assumption that (see Theorems 2.2 and 2.3(ii)). The limiting distribution results allow us to construct an asymptotically valid confidence interval based on the proposed estimators that is shorter than the traditional sample-mean-based confidence interval.
In Section LABEL:sec.nonparametric we propose a methodology for improving the results of Section 2 by introducing additional covariates as functions of those given in the original problem. We show the proposed estimator achieves an oracle rate asymptotically. This can be viewed as a nonparametric regression estimation procedure.
There are results in the sample-survey literature that are qualitatively related to what we propose. The earliest citation we are aware of is cochran1953sampling. See also deng1987estimation and, more recently, lohr2009sampling. In these references one collects a finite sample, without replacement, from a (large) finite population. There is a response and a single, real covariate, . The distribution of within the finite population is known. The sample-survey target of estimation is the mean of within the full population. In the case in which the size of this population is infinitely large, sampling without replacement and sampling with replacement are indistinguishable. In that case the results from this sampling-theory literature coincide with our results for the ideal semisupervised scenario with , both in terms of the proposed estimator and its asymptotic variance. Otherwise the sample-survey results differ from those within our formulation, although there is a conceptual relationship. In particular, the theoretical population mean that is our target is different from the finite population mean that is the target of the sample-survey methods. In addition, we allow and, as noted above, we also have asymptotic results for growing with . Most notably, our formulation includes the possibility of semisupervised learning. We believe it should be possible, and sometimes of practical interest, to include semisupervised sampling within a sampling-survey framework, but we do not do so in the present treatment.
The rest of the paper is organized as follows. We introduce the fixed-covariate procedures in Section 2. Specifically, ideal semisupervised learning and ordinary semisupervised learning are considered in Sections 2.2 and 2.3 respectively, where we analyze the asymptotic properties of both estimators. We further give risk upper bounds for the two proposed estimators in Section 2.4. In Section LABEL:sec.nonparametric we extend the analysis to the nonparametric regression model, where we show the proposed procedure achieves an oracle rate asymptotically. Simulation results are reported in Section 3. Application to the estimation of the average treatment effect is discussed in Section 4.1, and Section 4.2 describes a real-data illustration involving estimation of the homeless population in a geographical region. The proofs of the main theorems are given in Section 5 and additional technical results are proved in the Appendix.
2 Procedures
We propose in this section a least squares estimator for the population mean in the semisupervised inference framework. To better characterize the problem, we begin with a brief introduction of the random design regression model. More details of the model can be found in, e.g., bujamodels.
2.1 A Random Design Regression Model
Let
represent the population response and predictors. Assume all second moments are finite. Denote
as the predictor with intercept. The following is a linear analysis, even though no corresponding linearity assumption is made about the true distribution P of (X, Y). Some notation and definitions are needed. Let
(5) 
Here are referred to as the population slopes, and is called the total deviation. We also denote
(6) 
Some basic facts about the regression slope and total deviation are summarized in the following lemma.
Lemma 2.1
Let have finite second moment, and let the matrix be nonsingular. Then
It should be noted that under our general model, there is no independence assumption between and .
For a sample of observations , , let and denote the design matrix as follows
In our notation, means that the vector/matrix contains the intercept term; boldface indicates that the symbol relates to a multiple sample of observations. Meanwhile, denote the sample response and deviation as and . Now and are connected by the regression model:
(7) 
Let be the usual least squares estimator, i.e.
(8) 
Then provides a straightforward estimator for . and can be further split into two parts,
(9) 
and play different roles in the analysis as we will see later. The risk of the sample average about the population mean has the following decomposition.
Proposition 2.1
From (10), we can see that as long as , i.e., as long as there is a significant linear relationship between and , the risk of will be significantly greater than .
In the next two subsections, we discuss separately under the ideal semisupervised setting and the ordinary semisupervised setting.
2.2 Improved Estimator under the Ideal Semisupervised Setting
We first consider the ideal setting where there are infinitely many unlabeled samples, or equivalently is known. To improve , we propose the least squares estimator,
(11) 
where is defined in (8).
The following theorem provides the asymptotic distribution of the least squares estimator under the minimal conditions that has finite second moments, is nonsingular, and .
Theorem 2.1 (Asymptotic Distribution, fixed )
Let be i.i.d. copies from , and assume that has finite second moments, is nonsingular and . Then, under the setting that is fixed and grows to infinity,
(12) 
and
(13) 
In the more general setting where varies and grows, we need stronger conditions to analyze the asymptotic behavior of . Suppose . We consider the standardization of as
(14) 
Clearly, . For this setting we assume that satisfy the following moment conditions:
(15) 
(16) 
(17) 
Theorem 2.2 (Asymptotic result, growing )
2.3 Improved Estimator under the Ordinary Semisupervised Inference Setting
In the last section, we discussed the estimation of based on full observations with infinitely many unlabeled samples (or equivalently with a known marginal distribution ). However, knowing is rare in practice. A more realistic setting assumes that the distribution is unknown and that we only have finitely many i.i.d. samples without the corresponding . This problem relates to the one in the previous section since we can obtain partial information about from the additional unlabeled samples.
When or is unknown, we estimate by
(20) 
Recall that
is the ordinary least squares estimator. Now, we propose the
semisupervised least squares estimator ,
(21) 
has the following properties:

when , . Then exactly equals in (11);

when , exactly equals . As there are no additional samples of , no extra information about is available, and it is natural to use to estimate .

In the last term of (21), it is important to use rather than , in spite of the fact that the latter might seem more natural because it is independent of the term that precedes it.
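The first property above (with no unlabeled samples, the estimator reduces exactly to the sample mean) can be verified numerically. The sketch below is ours, with our own variable names, since the paper's symbols were lost in extraction.

```python
import numpy as np

# With m = 0 unlabeled samples, the average over "all" covariates equals the
# labeled-sample average, so the adjustment term in (21) vanishes identically
# and the semisupervised estimator coincides with the sample mean of y.
rng = np.random.default_rng(42)
n = 60
x = rng.normal(size=(n, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X1, y, rcond=None)[0]
x_all = np.vstack([x, np.empty((0, 3))])       # m = 0: no unlabeled rows
ss_est = y.mean() - beta[1:] @ (x.mean(axis=0) - x_all.mean(axis=0))
assert np.isclose(ss_est, y.mean())            # adjustment term is exactly 0
```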
Under the same conditions as in Theorems 2.1 and 2.2, we can show the following asymptotic results for , which relate to the ordinary semisupervised setting described in the introduction. The labeled sample size is , the unlabeled sample size is , and the distribution is fixed (but unknown), which in particular implies that is a fixed dimension, not dependent on . Let
Theorem 2.3 (Asymptotic distribution of , fixed )
Let be i.i.d. labeled samples from and let be additional unlabeled samples from . Suppose is nonsingular and . If is fixed and , then
(22) 
and
(23) 
where with and .
Based on Theorems 2.3 and 2.4, the level asymptotic confidence interval for can be written as
(24) 
Since asymptotically (with equality only when ), the asymptotic CI in (24) is shorter than the traditional sample-mean-based CI (19) when .
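A sketch of such an interval is below. The variance estimate used here (residual variance plus a share, shrinking in the unlabeled sample size, of the variance explained by the linear fit) follows the structure the text describes, but the displayed formula was lost in extraction, so treat it as our assumption rather than the paper's exact expression; all names are ours.

```python
import numpy as np
from statistics import NormalDist

def ss_mean_ci(y, x_lab, x_unlab, alpha=0.05):
    """Sketch of an asymptotic CI in the spirit of (24) for the
    semisupervised least squares mean estimator."""
    n, m = len(y), len(x_unlab)
    X1 = np.column_stack([np.ones(n), x_lab])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    fitted = X1 @ beta
    resid = y - fitted
    x_bar_all = np.vstack([x_lab, x_unlab]).mean(axis=0)
    est = y.mean() - beta[1:] @ (x_lab.mean(axis=0) - x_bar_all)
    # Assumed variance estimate: residual variance plus an n/(n+m) share
    # of the explained variance.
    var_hat = resid.var() + n / (n + m) * fitted.var()
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(var_hat / n)
    return est - half, est + half
```

Because the in-sample residuals and fitted values are uncorrelated, this variance estimate never exceeds the sample variance of the responses, so with predictive covariates and unlabeled data the interval is shorter than the sample-mean interval.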
The following statement refers to a setting in which and may depend on as . Consequently, , and (defined at (14)) may also depend on .
2.4 Risk for the Proposed Estimators
In this subsection, we analyze the risk of both and . Since the calculation of the proposed estimators involves the potentially unstable inversion of the Gram matrix , purely for the theoretical purpose of obtaining the risks we again consider the refinement
(25) 
where
(26).
We emphasize that this refinement is mainly for theoretical reasons and is often not necessary in practice.
The regularization assumptions we need for analyzing the risk are formally stated below.

(Moment conditions on ) There exist such that
(27) 
(Sub-Gaussian condition) Suppose is the standardization of
which satisfies
(28) 
Here is defined as for any random variable .

(Bounded condition) The standardization satisfies
(29)
We also note , , . Under the regularization assumptions above, we provide the risks for and respectively in the next two theorems.
Theorem 2.5 ( Risk of )
Theorem 2.6 ( risk of )
Let be i.i.d. labeled samples from and let be additional unlabeled samples from . If Assumptions 1+2 or 1+2’ in (27)–(29) hold and , we have the following estimate of the risk of ,
(33) 
where
(34) 
where the constant depends only on and in Assumptions (27)–(29).
2.5 Oracle Estimator and Risk
We consider further improvement in this section. Before we illustrate how the improved estimator works, it is helpful to take a look at the oracle risk for estimating the mean , which can serve as a benchmark for the performance of the improved estimator.
Define as the response surface and suppose
for some unknown constant . Given samples , our goal is to estimate . Now assume an oracle has knowledge of , but not of , , or of the distribution of . In this case, the model can be written as
(35) 
Under the ideal semisupervised setting, , and are known. To estimate , a natural idea is to use the following estimator
(36) 
Clearly is an unbiased estimator of , while
(37) 
This defines the oracle risk for population mean estimation under the ideal semisupervised setting as .
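The ideal-setting oracle estimator is simple enough to sketch directly. This is our own illustration with our own names; the quadratic response surface and its population mean in the example are assumptions for demonstration only.

```python
import numpy as np

def oracle_mean(y, x, f, Ef):
    """Sketch of the oracle estimator (36): an oracle knowing the response
    surface f and its population mean E f(X) averages the residuals
    y_i - f(x_i) and adds E f(X) back.  The estimator is unbiased with
    variance Var(Y - f(X)) / n, which is the oracle risk."""
    return np.mean(y - f(x)) + Ef
```

For example, with a standard Gaussian covariate and f(x) = x², one would pass Ef = 1, and the estimator's risk is driven only by the noise around the response surface.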
For the ordinary semisupervised setting, where is unknown but additional unlabeled samples are available, we propose the semisupervised oracle estimator as
Then one can calculate that
(38) 
The detailed calculation of (38) is provided in the Appendix.
The preceding motivation for and as the oracle risks is partly heuristic, based on the arguments in (36) and (37). But it corresponds to a formal minimax statement, as follows.
Proposition 2.2 (Oracle Lower Bound)
Let ,
Then based on observations and known marginal distribution ,
(39) 
Let , be a linear function of ,
based on observations and ,
(40) 
2.6 Improved Procedure
In order to approach oracle optimality we propose to augment the set of covariates with additional covariates . (Of course these additional covariates need to be chosen without knowledge of . We will discuss their choice later in this section.) In all there are now covariates, say
For both ideal and ordinary semisupervision we propose to let as , and to use the estimators and . For the purely theoretical purpose of bounding the risks we again consider the refinement
where is defined in (26). The previous theorems for asymptotic distributions and moments then apply. For convenience of statement and proof we assume that the support of is compact, is bounded, and is sub-Gaussian. These assumptions can each be somewhat relaxed at the cost of additional technical conditions and complications. Here is a formal statement of the result.
Theorem 2.7
Assume the support of is compact, is bounded, and is sub-Gaussian. Consider asymptotics as for the case of both ideal and ordinary semisupervision. Assume also either (i) that is continuous or (ii) that is absolutely continuous with respect to Lebesgue measure on . Let be a bounded basis for the continuous functions on in case (i), and a bounded basis for the ordinary Hilbert space on in case (ii). There exists a sequence of such that , and

the estimator for the problem with observations asymptotically achieves the ideal oracle risk, i.e.
(41) 
Now suppose for some fixed value . Apply the estimator to the problem with observations and . Then
(42)
Finally, and are asymptotically unbiased and normal with the corresponding variances.
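The augmentation idea of Section 2.6 can be illustrated with a polynomial basis. This is one concrete basis choice assumed by us for the sketch; the theorem only requires a bounded basis on the compact support, and all names here are ours.

```python
import numpy as np

def augment(x, q):
    """Augment a scalar covariate with the polynomial basis x, x^2, ..., x^q."""
    return np.column_stack([x ** k for k in range(1, q + 1)])

# As the basis grows, the least squares fit absorbs more of the response
# surface, so the residual variance (which drives the estimator's risk)
# shrinks toward the oracle level.  Here the surface is quadratic.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(-1, 1, size=n)
y = x ** 2 + 0.2 * rng.normal(size=n)

def resid_var(q):
    X1 = np.column_stack([np.ones(n), augment(x, q)])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    return np.var(y - X1 @ beta)

assert resid_var(2) < resid_var(1)   # adding x^2 captures the curvature
```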
3 Simulation Results
In this section, we investigate the numerical performance of the proposed estimators in various settings, in terms of estimation error, coverage probability, and length of confidence intervals. All simulations are repeated 1000 times.
We analyze the linear least squares estimators and proposed in Section 2 in the following three settings.

(Gaussian and quadratic ) We generate the design and parameters as follows, , , , . Then we draw i.i.d. samples as
where
It is easy to calculate that in this setting.

(Heavy tailed and ) We randomly generate
where has density , . Here, the distribution has no third or higher moments. In this case, , .

(Poisson and ) Then we also consider a setting where
In this case,
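Since the exact parameter values of the settings above were lost in extraction, the sketch below runs a generic stand-in for the first setting (Gaussian design with a quadratic component in the response) and compares the Monte Carlo mean squared errors of the sample mean and the semisupervised least squares estimator; all names and parameter values are our own assumptions.

```python
import numpy as np

def one_trial(rng, n=100, m=400):
    """One replication: draw n labeled and m unlabeled samples, return the
    naive sample mean and the ordinary semisupervised estimator."""
    x = rng.normal(size=(n + m, 2))
    y_all = x[:, 0] + x[:, 1] ** 2 + rng.normal(size=n + m)
    x_lab, y = x[:n], y_all[:n]                      # labels observed for first n
    X1 = np.column_stack([np.ones(n), x_lab])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    ss = y.mean() - beta[1:] @ (x_lab.mean(axis=0) - x.mean(axis=0))
    return y.mean(), ss

rng = np.random.default_rng(1)
truth = 1.0                                          # E[X1 + X2^2 + eps] = 1
naive, ss = zip(*(one_trial(rng) for _ in range(1000)))
mse = lambda v: float(np.mean((np.asarray(v) - truth) ** 2))
assert mse(ss) < mse(naive)                          # the adjustment reduces MSE
```

The reduction comes from the linear component of the response: the quadratic part is uncorrelated with the covariates here and is untouched by the linear adjustment, which is exactly what the augmentation of Section 2.6 is designed to address.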