In recent years, semi-supervised learning (SSL) has emerged as an exciting new area of research in statistics and machine learning. A detailed discussion of SSL, including its practical relevance, its primary question of interest, and the relevant existing literature, can be found in Chapelle, Schölkopf and Zien (2006) and Zhu (2008). A typical semi-supervised (SS) setting is characterized by two types of available data: (i) a small or moderate sized ‘labeled’ dataset containing observations for both an outcome and a set of covariates of interest, and (ii) an ‘unlabeled’ dataset of much larger size but having observations only for the covariates. By virtue of its large size, the unlabeled data essentially gives us the distribution of the covariates. Such a setting arises naturally whenever the covariates are easily available so that unlabeled data is plentiful, but the outcome is costly or difficult to obtain, thereby limiting the size of the labeled data. This scenario is directly relevant to a variety of practical problems, especially in the modern ‘big data’ era, with massive unlabeled datasets (often electronically recorded) becoming increasingly available and tractable. A few familiar examples include machine learning problems like text mining, web page classification, speech recognition and natural language processing.
Among biomedical applications, a particularly interesting problem where SSL can be of great use is the statistical analysis of electronic medical records (EMR) data. Endowed with a wealth of de-identified clinical and phenotype data for large patient cohorts, EMR linked with bio-repositories are increasingly gaining popularity as rich resources of data for discovery research (Kohane, 2011). Such large scale datasets, obtained in a cost-effective and timely manner, are of great importance in modern medical research for addressing important questions such as the biological role of genetic variants in disease susceptibility and progression (Kohane, 2011). However, one major bottleneck impeding EMR driven research is the difficulty in obtaining validated phenotype information (Liao et al., 2010), which is labor intensive or expensive to obtain. Thus, gold standard labels and genomic measurements are typically available only for a small subset nested within a large cohort. In contrast, digitally recorded data on the clinical variables are often available on all subjects, highlighting the necessity and utility of developing robust SSL methods that can leverage such a rich source of auxiliary information to improve phenotype definition and estimation precision.
SSL primarily distinguishes itself from standard supervised methods by making use of the unlabeled data, information that is ignored by the latter. The ultimate question of interest in SSL is to investigate if and when the information on the covariate distribution contained in the unlabeled data can be exploited to improve efficiency over a given supervised approach. In recent years, several graph based non-parametric SSL approaches have been proposed (Zhu, 2005; Belkin, Niyogi and Sindhwani, 2006) for regression or classification. These approaches essentially target non-parametric SS estimation of the conditional mean of the outcome given the covariates and therefore, for provable improvement guarantees, must rely implicitly or explicitly on assumptions relating the covariate distribution to the conditional distribution of the outcome given the covariates, as duly noted and characterized more formally in Lafferty and Wasserman (2007). For non-parametric classification problems, the theoretical underpinnings of SSL, including its scope and the consequences of using the unlabeled data, were also studied earlier by Castelli and Cover (1995, 1996). More parametric SS approaches, still aimed mostly at prediction, have also been studied for classification, including the ‘generative model’ approach (Nigam et al., 2000; Nigam, 2001), which is based on modeling the joint distribution of the outcome and the covariates as an identifiable mixture of parametric models, thereby implicitly relating the conditional distribution of the outcome to the covariate distribution. However, these approaches depend strongly on the validity of the assumed mixture model, violation of which can actually degrade their performance compared to the supervised approach (Cozman and Cohen, 2001; Cozman, Cohen and Cirelo, 2003).
However, SS estimation problems, especially from a semi-parametric point of view, have been somewhat less studied in SSL. Such problems are generally aimed at estimating some (finite-dimensional) parameter of the joint distribution of the outcome and the covariates, and the key to the potential usefulness of the unlabeled data in improving its estimation lies in understanding when the parameter relates to the covariate distribution. For simple parameters like the mean of the outcome, unless the conditional mean of the outcome given the covariates is a constant, the parameter clearly depends on the covariate distribution and hence, improved SS estimation is possible compared to the supervised estimator, the sample mean of the outcome based on the labeled data. The situation is however more subtle for other choices of the parameter, especially those where it is the target parameter corresponding to an underlying parametric working model for the conditional distribution of the outcome given the covariates. This includes the least squares parameter, as studied in this paper, targeted by a working linear model for the conditional mean of the outcome given the covariates. Such models are often adopted due to their appealing simplicity and interpretability.
In general, for such cases, if the adopted working model for the conditional distribution of the outcome given the covariates is correct and the target parameter is not related to the covariate distribution, then one cannot possibly gain through SSL by using the knowledge of the covariate distribution (Zhang and Oles, 2000; Seeger, 2002). On the other hand, under model mis-specification, the target parameter may inherently depend on the covariate distribution, which implies the potential utility of the unlabeled data in improving the estimation. However, inappropriate use of the unlabeled data may lead to degradation of the estimation precision. This therefore signifies the need for robust and efficient SS estimators that are adaptive to model mis-specification, so that they are as efficient as the supervised estimator under the correct model and more efficient under model mis-specification. To the best of our knowledge, work done along these lines is relatively scarce in the SSL literature, one notable exception being the recent work of Kawakita and Kanamori (2013), where they use a very different approach based on density ratio estimation, building on the more restrictive approach of Sokolovska, Cappé and Yvon (2008). However, as we observe in our simulation studies, the extent of the efficiency gain actually achieved by these approaches can be quite incremental, at least in finite samples. Further, the seemingly unclear choice of the ideal (nuisance) model to be used for density ratio estimation can also have a significant impact on the performance, both finite sample and asymptotic, of these estimators.
We propose here a class of Efficient and Adaptive Semi-Supervised Estimators (EASE) in the context of linear regression problems. We essentially adopt a semi-parametric perspective wherein the adopted linear ‘working’ model can be potentially mis-specified, and the goal is to obtain efficient and adaptive SS estimators of the regression parameter through robust usage of the unlabeled data. The EASE are two-step estimators with a simple and scalable construction based on a first step of ‘semi-non-parametric’ (SNP) imputation which includes a smoothing step and a follow-up ‘refitting’ step. In the second step, we regress the imputed outcomes against the covariates using the unlabeled data to obtain our SNP imputation based SS estimator, and then further combine it optimally with the supervised estimator to obtain the final EASE. Dimension reduction methods are also employed in the smoothing step to accommodate higher dimensional covariates, if necessary. Further, we extensively adopt cross-validation (CV) techniques in the imputation, leading to some useful theoretical properties (apart from practical benefits) typically not observed for smoothing based two-step estimators. We demonstrate that EASE is guaranteed to be efficient and adaptive in the sense discussed above, and also achieves semi-parametric optimality whenever the SNP imputation is ‘sufficient’ or the linear model holds. We also provide data adaptive methods to optimally select the directions for smoothing when dimension reduction is desired, and tools for inference with EASE.
The rest of this paper is organized as follows. In Section 2, we formulate the SS linear regression problem. In Section 3, we construct a family of SS estimators based on SNP imputation and establish all their properties, and further propose the EASE as a refinement of these estimators. For all our proposed estimators, we also address their associated inference procedures based on ‘double’ CV methods. In Section 4, we discuss a kernel smoothing based implementation of the SNP imputation and establish all its properties. In Section 5, we discuss SS dimension reduction techniques, useful for implementing the SNP imputation. Simulation results and an application to an EMR study are shown in Section 6, followed by concluding discussions in Section 7. Proofs of all theoretical results and associated technical materials, and further numerical results and discussions are distributed in the Appendix and the Supplementary Material [Chakrabortty and Cai (2017)].
2 Problem Set-up
denote the outcome random variable and denote the covariate vector, where is fixed, and let . Then the entire data available for analysis can be represented as , where consists of independent and identically distributed (i.i.d.) observations from the joint distribution of , consists of i.i.d. observations from , and . Throughout, for notational convenience, we use the subscript ‘’ to denote the unlabeled observations, and re-index without loss of generality (w.l.o.g.) the observations in as: .
Assumption 2.1 (Basic Assumptions).
(a) We assume that has finite moments and is positive definite, denoted as . We also assume, for simplicity, that has a compact support .
(b) We assume i.e. as , and and arise from the same underlying distribution, i.e. for all subjects in .
Let , where , . Let denote the space of all -valued measurable functions of having finite norm with respect to (w.r.t.) , and for any , let denote the matrix . Lastly, let denote the vector norm, and for any integer , let denote the identity matrix of order , and denote the -variate Gaussian distribution with mean and covariance matrix .
Assumption 2.1 (b) lists some fundamental characteristics of SS settings. Indeed, the condition of the labeled and the unlabeled data being equally distributed has usually been an integral part of the definition of SS settings (Chapelle, Schölkopf and Zien, 2006; Kawakita and Kanamori, 2013). Interpreted in missing data terminology, it entails that the outcomes in the unlabeled data are ‘missing completely at random’ (MCAR), with the missingness/labeling being typically by design. Interestingly, the crucial assumption of MCAR, although commonly required, has often stayed implicit in the SSL literature (Lafferty and Wasserman, 2007). It is important to note that while the SS set-up can be viewed as a missing data problem, it is quite different from standard ones since, with the unlabeled sample far larger than the labeled one, the proportion of observed outcomes in the entire data tends to 0 in SSL. Hence, the ‘positivity assumption’ typical in missing data theory, requiring this proportion to be bounded away from 0, is violated here. It is also worth noting that owing to such violations, the analysis of SS settings under more general missingness mechanisms such as ‘missing at random’ (MAR) is considerably more complicated and to our knowledge, the literature for SS estimation problems under such settings is virtually non-existent. Furthermore, for such problems, the traditional goal in SSL, that of improving upon a ‘supervised’ estimator, can become unclear without MCAR, unless an appropriately weighted version of the supervised estimator is considered. Given these subtleties and the traditional assumptions (often implicit) in SSL, the MCAR condition is assumed for most of this paper, although a brief discussion on possible extensions of our proposed SS estimators to MAR settings is provided in the Supplementary Material.
2.1 The Target Parameter and Its Supervised Estimator
We consider the linear regression working model given by:
where, is an unknown regression parameter. Accounting for the potential mis-specification of the working model (2.1), we define the target parameter of interest as a model free parameter, as follows:
The target parameter for linear regression may be defined as the solution to the normal equations: in , or equivalently, .
Existence and uniqueness of in 2.1 is clear. Further, is the projection of onto the subspace of all linear functions of and hence, is the best linear predictor of given . The linear model (2.1) is correct (else, mis-specified) if and only if lies in this space (in which case, ). When the model is correct, depends only on , not on . Hence, improved estimation of through SSL is impossible in this case unless further assumptions relating to are made. On the other hand, under model mis-specification, the normal equations defining inherently depend on , thereby implying the potential utility of SSL in improving the estimation of in this case.
The usual supervised estimator of is the OLS estimator , the solution in to the equation: , the normal equations based on . Under Assumption 2.1 (a), it is well known that as ,
where and .
Our primary goal is to obtain efficient SS estimators of using the entire data and compare their efficiencies to that of . It is worth noting that the estimation efficiency of also relates to the predictive performance of the fitted linear model since its out-of-sample prediction error is directly related to the mean squared error (w.r.t. the metric) of the parameter estimate.
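As a concrete reference point, the supervised estimator described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ols_normal_equations`, the simulated data and the variable names are assumptions made here, not part of the paper. The simulated regression function is deliberately non-linear, so the linear working model is mis-specified and the estimator targets the best linear predictor rather than the true regression function.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve the sample normal equations of a linear working model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    # Normal equations: (X1' X1) theta = X1' y.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Hypothetical labeled sample (assumption: 200 observations, 2 covariates).
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(200, 2))
# Non-linear truth, so the linear working model is mis-specified.
y_lab = 1.0 + X_lab[:, 0] + X_lab[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

theta_hat = ols_normal_equations(X_lab, y_lab)  # intercept first, then slopes
```

By construction, the fitted residuals are orthogonal to the intercept and to every covariate in the labeled sample, which is exactly what the sample normal equations require.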
3 A Family of Imputation Based Semi-Supervised Estimators
If the outcomes in the unlabeled data were actually observed, then one would simply fit the working model to the entire data for estimating the target parameter. Our general approach is precisely motivated by this intuition. We first attempt to impute the missing outcomes in the unlabeled data based on suitable training on the labeled data in step (I). Then in step (II), we fit the linear model (2.1) to the unlabeled data with the imputed outcomes. Clearly, the imputation is critical. Inaccurate imputation would lead to a biased estimate of the target parameter, while inadequate imputation would result in loss of efficiency. We next consider SS estimators constructed under two imputation strategies for step (I): a fully non-parametric imputation based on kernel smoothing (KS), and a semi-non-parametric (SNP) imputation that involves a smoothing step and a follow-up ‘refitting’ step. Although the construction of the final EASE is based on the SNP imputation strategy, it is helpful to begin with a discussion of the first strategy in order to appropriately motivate and elucidate the discussion on EASE and the SNP imputation strategy.
3.1 A Simple SS Estimator via Fully Non-Parametric Imputation
We present here an estimator based on a fully non-parametric imputation involving KS when is small. For simplicity, we shall assume here that is continuous with a density . Let and . Consider the local constant KS estimator of ,
Here and throughout in our constructions of SS estimators, the labeled data, with either the true or the imputed outcomes, is not included in the final fitting step. This is mostly due to technical convenience in the asymptotic analysis of our estimators, and also due to the fact that the contribution of the labeled data, included in any form, in the final fitting step is asymptotically negligible since the labeled sample size is of smaller order than the unlabeled one.
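The two-step construction of this section (local constant kernel smoothing on the labeled data, followed by a linear fit on the unlabeled data with imputed outcomes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, a Gaussian product kernel is used purely for convenience (the theory discussed here works with bounded-support kernels of a given order), and the bandwidth `h` is left to the user rather than chosen by the under-smoothing rule.

```python
import numpy as np

def nw_impute(X_lab, y_lab, X_new, h):
    """Local constant (Nadaraya-Watson) estimate of the regression function at X_new."""
    # Pairwise squared distances between new points and labeled points.
    d2 = ((X_new[:, None, :] - X_lab[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / h ** 2)        # Gaussian kernel weights (assumption)
    return (w @ y_lab) / w.sum(axis=1)    # weighted average of labeled outcomes

def fit_linear(X, y):
    """Least squares fit of the linear working model with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def ss_estimator_np(X_lab, y_lab, X_unl, h):
    """Step (I): impute outcomes on the unlabeled data; step (II): refit the linear model."""
    y_imp = nw_impute(X_lab, y_lab, X_unl, h)
    # Only the unlabeled covariates enter the final fit, as described in the text.
    return fit_linear(X_unl, y_imp)
```

A typical call would be `ss_estimator_np(X_lab, y_lab, X_unl, h=0.5)`, with `X_lab` and `X_unl` given as two-dimensional arrays having the same number of columns.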
In order to study the properties of , we require uniform (in norm) convergence of to , a problem that has been extensively studied in the non-parametric statistics literature (Newey, 1994; Andrews, 1995; Masry, 1996; Hansen, 2008) under fairly general settings and assumptions. In particular, we would assume the following regularity conditions to hold:
(i) is a symmetric order kernel for some integer .
(ii) is bounded, Lipschitz continuous and has a bounded support .
(iii) for some . and are bounded on .
(iv) is bounded away from on .
(v) and are times continuously differentiable with bounded derivatives on some open set .
(vi) For any , let denote the set . Then, for small enough , almost surely (a.s.).
Conditions (i)-(v) are fairly standard in the literature. In (v), the set is needed mostly to make the notion of differentiability well-defined, with both and understood to have been analytically extended over . Condition (vi) implicitly controls the tail behaviour of , requiring that perturbations of in the form of with (bounded) and small enough, belong to a.s. . We now present our result on .
Suppose and as , and let . Then, under Assumption 3.1,
Theorem 3.1 establishes the efficient and adaptive nature of . The asymptotic variance of satisfies and the inequality is strict unless a.s. . Hence, is asymptotically optimal among the class of all regular and asymptotically linear (RAL) estimators of with influence function (IF) of the form: with . In particular, is more efficient than whenever (2.1) is mis-specified, and equally efficient when (2.1) is correct i.e. . Further, it can also be shown that is the ‘efficient’ IF for estimating under the semi-parametric model . Thus, also globally achieves the semi-parametric efficiency bound under . Lastly, note that at any parametric sub-model in that corresponds to (2.1) being correct, also achieves optimality, thus showing that under , it is not possible to improve upon if the linear model is correct.
The asymptotic results in Theorem 3.1 require a kernel of order and smaller in order than the ‘optimal’ bandwidth order . This under-smoothing requirement, often encountered in two-step estimators involving a first-step smoothing (Newey, Hsieh and Robins, 1998), generally results in sub-optimal performance of . The optimal under-smoothed bandwidth order for Theorem 3.1 is given by: .
3.2 SS Estimators Based on Semi-Non-Parametric (SNP) Imputation
The simple and intuitive imputation strategy in Section 3.1, based on a fully non-parametric KS over the full covariate vector, is however often undesirable in practice owing to the curse of dimensionality. In order to accommodate a larger covariate dimension, we now propose a more flexible SNP imputation method involving a dimension reduction, if needed, followed by a non-parametric calibration. An additional ‘refitting’ step is proposed to reduce the impact of bias from non-parametric estimation and possibly inadequate imputation due to dimension reduction. We also introduce some flexibility in terms of the smoothing methods, apart from KS, that can be used for the non-parametric calibration.
Let be a fixed positive integer and let be any rank transformation matrix. Let . Given , we may now consider approximating the regression function by smoothing over the dimensional instead of the original . In general, can be user-defined and data dependent. A few reasonable choices of are discussed in Section 5. If depends only on the distribution of , it may be assumed to be known given the SS setting considered. If also depends on the distribution of , then it needs to be estimated from and the smoothing needs to be performed using the estimated .
For approximating , we may consider any reasonable smoothing technique . Some examples of include KS, kernel machine regression and smoothing splines. Let denote the ‘target function’ for smoothing over using . For notational simplicity, the dependence of and other quantities on is suppressed throughout. For KS, the appropriate target function is given by: , where . For basis function expansion based methods, will typically correspond to the projection of onto the functional space spanned by the basis functions associated with . The results in this section apply to any choice of that satisfies the required conditions. In Section 4, we provide more specific results for the implementation of our methods using KS.
Note that we do not assume anywhere, and hence the name ‘semi-non-parametric’. Clearly, with and KS, it reduces to a fully non-parametric approach. We next describe the two sub-steps involved in step (I) of the SNP imputation: (Ia) smoothing, and (Ib) refitting.
(Ia) Smoothing Step
With and as defined above, let and respectively denote their estimators based on . In order to address potential overfitting issues in the subsequent steps, we further consider generalized versions of these estimators based on -fold CV for a given fixed integer . For any , let denote a random partition of into disjoint subsets of equal sizes, , with index sets . Let denote the set excluding with size and respective index set . Let and denote the corresponding estimators based on . Further, for notational consistency, we define for , ; ; ; and .
(Ib) Refitting Step
In this step, we fit the linear model to using as predictors and the estimated as an offset. To motivate this, we recall that the fully non-parametric imputation given in Section 3.1 consistently estimates , the projection onto a space that always contains the working model space, i.e. the linear span of . This need not be true for the SNP imputation, since we do not assume necessarily. The refitting step essentially ‘adjusts’ for this so that the final imputation, combining the predictions from these two steps, targets a space that contains the working model space. In particular, for KS with , this step is critical to remove potential bias due to inadequate imputation.
Interestingly, it turns out that the refitting step should always be performed, even when . It plays a crucial role in reducing the bias of the resulting SS estimator due to the inherent bias from non-parametric curve estimation. In particular, for KS with any , it ensures that a bandwidth of the optimal order can be used, thereby eliminating the under-smoothing issue as encountered in Section 3.1. The target parameter for the refitting step is simply the regression coefficient obtained from regressing the residual on and may be defined as: , the solution in to the equation: . For any , we estimate as , the solution in to the equation:
For , the estimate of to be used as an offset is obtained from that is based on data in . For , with , the residuals are thus estimated in a cross-validated manner. For however, is estimated using the entire which can lead to considerable underestimation of the true residuals owing to over-fitting and consequently, substantial finite sample bias in the resulting SS estimator of . This bias can be effectively reduced by using the CV approach with . We next estimate the target function for the SNP imputation given by:
where . For notational simplicity, we suppress throughout the inherent dependence of itself on and . Note that similar to , we also do not assume . Apart from the geometric motivation for the refitting step and its technical role in bias reduction, it also generally ensures the condition: , regardless of the true underlying . This condition is a key requirement for the asymptotic expansions, in Theorem 3.2, of our resulting SS estimators. Using , we now construct our final SS estimator as follows.
SS Estimator from SNP Imputation
In step (II), we fit the linear model to the SNP imputed unlabeled data: and obtain a SS estimator of given by:
For convenience of further discussion, let us define: ,
where denotes expectation w.r.t. . The dependence of on and is suppressed here for notational simplicity. We now present our main result summarizing the properties of .
Suppose that satisfies: (i) and (ii) for some . With as in (3.9), define . Then, for any ,
where and . Further, for any fixed , , so that
which converges in distribution to .
If the imputation is ‘sufficient’ so that , then , for any , enjoys the same set of optimality properties as those noted in Remark 3.1 for (while requiring less stringent assumptions about and , if KS is used). If , then it is however unclear whether is always more efficient than . This will be addressed in Section 3.3 where we develop the final EASE.
Apart from the fairly mild condition (i), Theorem 3.2 only requires uniform consistency of w.r.t. for establishing the -consistency and asymptotic normality (CAN) of for any . The uniform consistency typically holds for a wide range of smoothing methods under fairly general conditions. For KS in particular, we provide explicit results in Section 4 under mild regularity conditions that allow the use of any kernel order and the associated optimal bandwidth order. This is a notable relaxation from the stringent requirements for Theorem 3.1 that necessitate under-smoothing and the use of higher order kernels.
The CAN property of has not yet been established. The term in (3.10) behaves quite differently when , compared to when it has a nice structure due to the inherent ‘cross-fitting’ involved, and can be controlled easily, and quite generally, under mild conditions as noted in Remark 3.4. For however, is simply a centered empirical process devoid of any such structure and in general, controlling it requires stronger conditions and the use of empirical process theory (see for instance Van der Vaart (2000) for relevant results). We derive the properties of for the case of KS in Theorem 4.2 using a different approach however, specialized for KS estimators, in order to control .
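The smoothing step (Ia), the refitting step (Ib) and the final fitting step (II) of this section can be sketched together in Python. This is one possible reading of the construction, written under several assumptions that are not taken from the paper: the function names are hypothetical, the smoother is a Gaussian-kernel local constant estimator, the transformation matrix `P` is supplied by the user as a fixed array, and the imputation on the unlabeled data averages the fold-wise smoothers before adding the refitted linear term. The sketch covers only the cross-fitted case with at least two folds.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(X.shape[0]), X])

def nw_smooth(Z_train, y_train, Z_new, h):
    """Local constant kernel smoother (Gaussian kernel is an assumption)."""
    d2 = ((Z_new[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / h ** 2)
    return (w @ y_train) / w.sum(axis=1)

def snp_ss_estimator(X_lab, y_lab, X_unl, P, h, K=5, seed=0):
    """Cross-fitted SNP imputation followed by a linear fit on the unlabeled data (K >= 2)."""
    n = len(y_lab)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # random partition of the labeled data
    Z_lab, Z_unl = X_lab @ P, X_unl @ P             # reduced covariates used for smoothing
    offset = np.empty(n)
    m_unl = np.zeros(len(X_unl))
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)
        # (Ia) smoother trained without fold idx, evaluated on fold idx (cross-fitted offset).
        offset[idx] = nw_smooth(Z_lab[train], y_lab[train], Z_lab[idx], h)
        # Average of the fold-wise smoothers on the unlabeled data (assumption).
        m_unl += nw_smooth(Z_lab[train], y_lab[train], Z_unl, h) / K
    # (Ib) refitting: regress the cross-fitted residuals on the covariates.
    X1_lab = add_intercept(X_lab)
    eta = np.linalg.solve(X1_lab.T @ X1_lab, X1_lab.T @ (y_lab - offset))
    # (II) fit the linear working model to the imputed unlabeled data.
    X1_unl = add_intercept(X_unl)
    mu_unl = m_unl + X1_unl @ eta
    theta = np.linalg.solve(X1_unl.T @ X1_unl, X1_unl.T @ mu_unl)
    return theta, eta
```

Here `P` is expected to have one row per covariate and one column per smoothing direction. Passing the identity matrix corresponds to smoothing over all covariates, while a matrix with fewer columns performs the dimension reduction discussed above.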
3.3 Efficient and Adaptive Semi-Supervised Estimators (EASE)
To ensure adaptivity even when , we now define the final EASE as an optimal linear combination of and . Specifically, for any fixed matrix , is a CAN estimator of whenever and are, and an optimal can be selected easily to minimize the asymptotic variance of the combined estimator. For simplicity, we focus here on being a diagonal matrix with . Then the EASE is defined as with being any consistent estimator (see Section 3.4 for details) of the minimizer , where ,
and for any vector , denotes its component. Note that in (3.12), the and the limit outside are included to formally account for the case: a.s. , when we define for identifiability.
It is straightforward to show that and are asymptotically equivalent, so that is a RAL estimator of satisfying:
as , where and Note that when either the linear model holds or the SNP imputation is sufficient, then , so that is asymptotically optimal in the sense of Remark 3.1. Further, when neither cases hold, is no longer optimal, but is still efficient and adaptive compared to . Lastly, if the imputation is certain to be sufficient (for example, if and KS), we may simply define .
It can be shown that under , defined in Remark 3.1, the class of all possible IFs achievable by RAL estimators of is given by: . The IFs achieved by , and are clearly members of this class. The SNP imputation, for various choices of the imputation function , therefore equips us with a family of RAL estimator pairs for estimating . The IF of is further guaranteed to dominate that of , and when , it also dominates all other IFs .
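The componentwise combination underlying EASE can be illustrated with a short sketch. It assumes that estimated influence function values are available for both the supervised and the SS estimator (one row per labeled observation, one column per coefficient), and it uses the generic minimum-variance rule for combining two estimators of the same quantity: the variance of a + d(b - a) is minimized at d = -Cov(a, b - a) / Var(b - a). The function names, the tolerance `eps` and the convention of setting the weight to zero when the two influence functions coincide are assumptions made for this illustration.

```python
import numpy as np

def optimal_diag_weights(psi_sup, psi_ss, eps=1e-12):
    """Componentwise weights minimizing the variance of psi_sup + d * (psi_ss - psi_sup)."""
    diff = psi_ss - psi_sup
    diff_c = diff - diff.mean(axis=0)
    sup_c = psi_sup - psi_sup.mean(axis=0)
    cov = (sup_c * diff_c).mean(axis=0)   # Cov(psi_sup, psi_ss - psi_sup), per component
    var = (diff_c ** 2).mean(axis=0)      # Var(psi_ss - psi_sup), per component
    safe_var = np.where(var > eps, var, 1.0)
    # Weight 0 when the two influence functions coincide (identifiability convention).
    return np.where(var > eps, -cov / safe_var, 0.0)

def combine(theta_sup, theta_ss, delta):
    """Diagonal linear combination of the supervised and the SS estimator."""
    return theta_sup + delta * (theta_ss - theta_sup)
```

With weights chosen this way, the empirical variance of each combined influence function component is no larger than that of either input, since a weight of 0 recovers the supervised estimator and a weight of 1 recovers the SS estimator.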
3.4 Inference for EASE and the SNP Imputation Based SS Estimators
We now provide procedures for making inference about based on and obtained using . We also employ a ‘double’ CV to overcome bias in variance estimation due to over-fitting. A key step involved in the variance estimation is to obtain reasonable estimates of . Although in (3.4) was constructed via CV, the corresponding estimate, in (3.6), of is likely to be over-fitted for . To construct bias corrected estimates of , we first obtain separate doubly cross-validated estimates of , , with , for each , being the solution in to , where
For each and , is constructed such that used for obtaining is independent of that is based on . Then, for each and , we may estimate as:
We exclude in the construction of to reduce over-fitting bias in the residuals which we now use for estimating the IFs.
For each and , we estimate and , the corresponding IFs of and , respectively as:
where denotes any consistent estimator of from and/or (for example, based on , or based on ). Then, in (3.11) may be consistently estimated as:
To estimate the combination matrix