1 Introduction
Semi-supervised learning for classification exploits a large amount of unlabeled data together with a relatively small amount of labeled data to build better classifiers. This approach can potentially achieve higher accuracy with a limited budget for obtaining labeled data. Various methods have been proposed, including expectation-maximization (EM) algorithms, transductive support vector machines (SVMs), and regularized methods (e.g., Chapelle et al. 2006; Zhu 2008).
For supervised classification, there is a range of objective functions which are Fisher consistent in the following sense: optimization of the population, nonparametric version of a loss function leads to the true conditional probability function of labels given features, as for the logistic loss, or to the Bayes classifier, as for the hinge loss (Lin 2002; Bartlett et al. 2006). In contrast, a perplexing issue we notice for semi-supervised classification is that existing objective functions are in general not Fisher consistent, except in the degenerate case where unlabeled data are ignored and only labeled data are used. Examples include the objective functions in transductive SVMs (Vapnik 1998; Joachims 1999) and various regularized methods (Grandvalet & Bengio 2005; Mann & McCallum 2007; Krishnapuram et al. 2005). The lack of Fisher consistency may contribute to the unstable performance of existing semi-supervised classifiers (e.g., Li & Zhou 2015). Another restriction in existing methods is that the class proportions in labeled and unlabeled data are typically assumed to be the same.
We develop a semi-supervised extension of logistic regression based on exponential tilt mixture models (Qin 1999; Zou et al. 2002; Tan 2009), without restricting the class proportions in the unlabeled data to be the same as in the labeled data. The development is motivated by a statistical equivalence between logistic regression for the conditional probability of a label given features and exponential tilt modeling for the density ratio between the feature distributions within different labels (Anderson 1972; Prentice & Pyke 1979). Our work makes two main contributions: (i) we derive novel objective functions which are shown not only to be Fisher consistent but also to lead to asymptotically more efficient estimation than estimation based on labeled data only, and (ii) we propose regularized estimation and construct computationally and conceptually desirable EM algorithms. In numerical experiments, our methods achieve a substantial advantage over existing methods when the class proportions in unlabeled data differ from those in labeled data. A possible explanation is that while the class proportions in unlabeled data are estimated as unknown parameters in our methods, they are implicitly assumed to be the same as in labeled data by existing methods including transductive SVMs (Joachims 1999) and entropy regularization (Grandvalet & Bengio 2005).
A simple, informative example is provided in the Supplement (Section II) to highlight the comparison between the new and existing methods mentioned above.
2 Background: logistic regression and exponential tilt model
For supervised classification, the training data consist of a sample $\{(x_i, y_i): i = 1, \ldots, n\}$, where $x_i$ and $y_i$ represent a feature vector and an associated label respectively. Consider a logistic regression model
(1) $P(y = 1 \mid x) = \dfrac{\exp(\alpha^c + x^{\mathrm{T}} \beta^c)}{1 + \exp(\alpha^c + x^{\mathrm{T}} \beta^c)}$,
where $\beta^c$ is a coefficient vector associated with $x$, and $\alpha^c$ is an intercept, with superscript $c$ indicating classification or conditional probability of $y$ given $x$. The maximum likelihood estimator (MLE) $(\hat\alpha^c, \hat\beta^c)$ is defined as a maximizer of the average log (conditional) likelihood:
(2) $\ell(\alpha^c, \beta^c) = n^{-1} \sum_{i=1}^n \left[ y_i (\alpha^c + x_i^{\mathrm{T}} \beta^c) - \log\{1 + \exp(\alpha^c + x_i^{\mathrm{T}} \beta^c)\} \right]$.
In general, nonlinear functions of $x$ can be used in place of $x$, and a penalty term can be incorporated into the log-likelihood, such as the ridge penalty on $\beta^c$ or the squared norm of a reproducing kernel Hilbert space of functions of $x$. We discuss these issues later in Sections 3.3 and 6.
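As a concrete illustration of maximizing a ridge-penalized version of (2), the following sketch fits the logistic model by plain gradient ascent on simulated data; the function name, learning rate, and data are illustrative and not part of the paper.

```python
import numpy as np

def fit_logistic_ridge(X, y, lam=0.1, lr=0.1, iters=2000):
    """Maximize the average log-likelihood (2) minus a ridge penalty
    lam * ||beta||^2 by gradient ascent; the intercept is unpenalized."""
    n, p = X.shape
    alpha, beta = 0.0, np.zeros(p)
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))  # P(y=1|x) under (1)
        resid = y - prob
        alpha += lr * resid.mean()
        beta += lr * (X.T @ resid / n - 2.0 * lam * beta)
    return alpha, beta

# Toy check on two overlapping Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.repeat([0.0, 1.0], 50)
alpha, beta = fit_logistic_ridge(X, y)
```

Since class 1 is shifted positively in both coordinates, the fitted coefficients come out positive and the fitted probabilities separate the two classes.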
Interestingly, logistic regression of $y$ on $x$ can be made equivalent to an exponential tilt model of $x$ on $y$ (Anderson 1972; Prentice & Pyke 1979; Qin 1998). Denote by $G_1$ or $G_0$ the conditional distribution of $x$ given $y = 1$ or $y = 0$ respectively, and $\pi = P(y = 1)$. By the Bayes rule, model (1) is equivalent to the exponential tilt model
(3) $\dfrac{dG_1}{dG_0}(x) = \exp(\alpha^* + x^{\mathrm{T}} \beta^c)$,
where $dG_1/dG_0$ denotes the density ratio between $G_1$ and $G_0$ with respect to a dominating measure, and $\alpha^* = \alpha^c + \log\{(1-\pi)/\pi\}$. Model (3) is explicitly a semiparametric model, where $G_0$ is an infinite-dimensional parameter and $(\alpha^*, \beta^c)$ are finite-dimensional parameters. In fact, logistic model (1) is also semiparametric, where the marginal distribution of $x$ is an infinite-dimensional parameter, and $(\alpha^c, \beta^c)$ are finite-dimensional parameters. Furthermore, the MLE $(\hat\alpha^c, \hat\beta^c)$ in model (1) can be related to the following estimator in model (3) by the method of nonparametric likelihood (Kiefer & Wolfowitz 1956) or empirical likelihood (Owen 2001). Formally, $(\hat\alpha^*, \hat\beta^*, \hat G_0)$ are defined as a maximizer of the log-likelihood
(4) $\sum_{i=1}^n \left[ y_i (\alpha^* + x_i^{\mathrm{T}} \beta^c) + \log dG_0(x_i) \right]$,
over all possible $G_0$ such that $G_0$ is a probability measure supported on the pooled data $\{x_1, \ldots, x_n\}$ with $\int \exp(\alpha^* + x^{\mathrm{T}} \beta^c) \, dG_0 = 1$, so that $G_1$ determined by (3) is also a probability measure. Analytically, it can be shown that $\hat\beta^* = \hat\beta^c$ and $\hat\alpha^* = \hat\alpha^c - \log(n_1/n_0)$, where $n_1 = \sum_{i=1}^n y_i$ and $n_0 = n - n_1$. See Qin (1998) and references therein.
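This equivalence can be checked numerically. In the sketch below (an illustration, not code from the paper), two equal-variance univariate Gaussian classes give a true density ratio $\exp(\alpha^* + \beta x)$ with $\beta = \mu_1 - \mu_0 = 1$ and $\alpha^* = (\mu_0^2 - \mu_1^2)/2 = -0.5$, and logistic regression fitted by Newton's method recovers these after the intercept shift.

```python
import numpy as np

# Two equal-variance Gaussian classes: g1(x)/g0(x) = exp(alpha* + beta x)
# with beta = mu1 - mu0 and alpha* = (mu0^2 - mu1^2)/2 when sigma = 1.
rng = np.random.default_rng(1)
n0, n1 = 20000, 20000
x = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(1.0, 1.0, n1)])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

# Newton's method for the two-parameter logistic log-likelihood (2).
X = np.column_stack([np.ones_like(x), x])
theta = np.zeros(2)                       # (alpha_c, beta_c)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (y - p)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    theta += np.linalg.solve(hess, grad)
alpha_c, beta_c = theta

alpha_star = alpha_c - np.log(n1 / n0)    # intercept shift from the text
```

With 40,000 observations, `beta_c` should be close to 1 and `alpha_star` close to −0.5, matching the tilt parameters.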
By the foregoing discussion, we see that there are two statistically distinct but equivalent approaches to supervised classification: logistic regression and exponential tilt modeling. It is precisely this relationship that we aim to exploit in developing a new method for semi-supervised classification.
3 Theory and methods
For semi-supervised classification, the training data consist of a labeled sample $\{(x_i, y_i): i = 1, \ldots, n\}$ and an unlabeled sample $\{x_i: i = n+1, \ldots, n+m\}$, for which the associated labels are unobserved. Typically for existing methods, including transductive SVMs, the two samples are assumed to be drawn from a common population of $(x, y)$. However, we allow the labeled and unlabeled samples to be drawn from different populations, with the same conditional distributions of $x$ given $y$, but possibly different marginal probabilities of $y = 1$.
3.1 Exponential tilt mixture model
Although it seems difficult at first glance to extend logistic model (1) to semi-supervised learning, we realize that both the labeled sample and the unlabeled sample can be accounted for by a natural extension of the exponential tilt model (3), called an exponential tilt mixture (ETM) model (Qin 1999; Zou et al. 2002; Tan 2009). Denote by $n_0$ and $n_1$ the numbers of labeled observations with $y = 0$ and $y = 1$ respectively, so that the data form three samples: the labeled observations with $y = 0$, those with $y = 1$, and the $m$ unlabeled observations.
An exponential tilt mixture model for the three samples postulates that
(5) $x \sim G_0$ in the labeled sample with $y = 0$,
(6) $x \sim G_1$ in the labeled sample with $y = 1$,
(7) $x \sim G = (1 - \pi) G_0 + \pi G_1$ in the unlabeled sample,
where $G_1$ or $G_0$ represents the conditional distribution of $x$ given $y = 1$ or $y = 0$ respectively in both the labeled and unlabeled data such that
(8) $\dfrac{dG_1}{dG_0}(x) = \exp(\alpha^* + x^{\mathrm{T}} \beta)$,
and $\pi$ is the proportion of $y = 1$ underlying the unlabeled data. While Eqs (5)–(6) merely give definitions of $G_0$ and $G_1$, Eq (7) says that the feature distribution in the unlabeled sample is a mixture of $G_0$ and $G_1$, which follows from the structural assumption that the conditional distribution of $x$ given $y$ is invariant between the labeled and unlabeled samples. Eq (8) imposes a functional restriction on the density ratio between $G_1$ and $G_0$, similarly as in (3).
The ETM model, defined by (5)–(8), is a semiparametric model, with an infinite-dimensional parameter $G_0$ and finite-dimensional parameters $(\alpha^*, \beta)$ and $\pi$. We briefly summarize maximum nonparametric likelihood estimation as previously studied (Qin 1999; Zou et al. 2002; Tan 2009). For notational convenience, rewrite the pooled sample as $\{x_j : j = 1, \ldots, n+m\}$, where the first $n_0$ observations are labeled with $y = 0$, the next $n_1$ are labeled with $y = 1$, and the remaining $m$ are unlabeled. Eqs (5)–(7) then say that the three subsamples are drawn from $G_0$, $G_1$, and $G = (1 - \pi) G_0 + \pi G_1$ respectively. For any fixed $\pi$, the average profile log-likelihood of $(\alpha^*, \beta)$ is defined by maximizing the nonparametric log-likelihood
(9) 
over all possible $G_0$ which is a probability measure supported on the pooled data $\{x_1, \ldots, x_{n+m}\}$. Denote the resulting profiling objective, as a function of $(\alpha^*, \beta)$ and an auxiliary variable,
which can easily be shown to be concave in $(\alpha^*, \beta)$ and convex in the auxiliary variable. Then Proposition 1 in Tan (2009) leads to the following result.
Lemma 1.
The average profile log-likelihood of $(\alpha^*, \beta)$ can be determined as the minimized value of the above function, where the minimizer over the auxiliary variable satisfies the stationary condition (free of $G_0$)
(10) 
The maximum likelihood estimator of $(\pi, \alpha^*, \beta)$ is then defined by maximizing the profile log-likelihood. From Lemma 1, we notice that the estimators jointly solve the saddle-point problem:
(11) 
Large-sample theory of the MLE has been studied in Qin (1999) under standard regularity conditions as the sample sizes tend to infinity in fixed proportions. The theory shows the existence of a local maximizer of the profile log-likelihood which is consistent and asymptotically normal, provided the ETM model (5)–(8) is correctly specified. However, subtle questions remain. It seems unclear whether the population version of the average profile log-likelihood attains a global maximum at the true parameter values under a correctly specified ETM model. Moreover, what property can be deduced for the MLE under a misspecified ETM model?
3.2 Semisupervised logistic regression
We derive a new classification model with parameters $(\pi, \alpha^*, \beta)$ for the three samples such that an MLE of $(\pi, \alpha^*, \beta)$ in the new model coincides with an MLE in the ETM model, and vice versa. Assign each observation a label indicating which of the three samples it belongs to: labeled with $y = 0$, labeled with $y = 1$, or unlabeled. Consider a conditional probability model for predicting this label from $x$:
(12)  
(13)  
(14) 
where the normalizing term ensures that the three conditional probabilities (12)–(14) sum to one. The model, defined by (12)–(14), will be called a semi-supervised logistic regression (SLR) model. The average log-likelihood function of $(\pi, \alpha^*, \beta)$ with the data in model (12)–(14) can be written, up to an additive constant free of $(\pi, \alpha^*, \beta)$, as follows.
Proposition 1.
Proposition 1 shows an equivalence between maximum nonparametric likelihood estimation in ETM model (5)–(8) and the usual maximum likelihood estimation in SLR model (12)–(14), even though the two objective functions are not equivalent. This differs from the equivalence between logistic regression (1) and exponential tilt model (3) with labeled data only, where the log-likelihood (2) and the profile log-likelihood from (4) are equivalent (Prentice & Pyke 1979). From another angle, this result says that the saddle-point problem (11) can be equivalently solved by directly maximizing the SLR log-likelihood. This transformation is nontrivial, because a saddle-point problem in general cannot be converted into optimization with a closed-form objective.
By the identification of the SLR objective as a usual log-likelihood function, we show that the two objective functions, with the linear predictor replaced by an arbitrary function of $x$, are Fisher consistent nonparametrically, i.e., maximization of their population versions leads to the true values. This seems to be the first time Fisher consistency of a loss function has been established for semi-supervised classification. With some abuse of notation, denote
Proposition 2.
Proposition 2 fills existing gaps in understanding maximum likelihood estimation in ETM model (5)–(8), through its equivalence with that in SLR model (12)–(14). If the ETM model is correctly specified, then the population version of the objective has a global maximum at the true values of $(\pi, \alpha^*, \beta)$, and hence a global maximizer is consistent under suitable regularity conditions. If the ETM model is misspecified, then by the theory of estimation with misspecified models (Manski 1988; White 1982), the MLE converges in probability to a limit which minimizes the discrepancy between the model and the truth. As shown in the Supplement (Section IV.2), this discrepancy is the expected Kullback–Leibler divergence
where the fitted probability vector is the conditional probability (12)–(14), $K(q, p)$ is the Kullback–Leibler divergence between two probability vectors $q$ and $p$, and the expectation is taken with respect to the true feature distribution.
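For reference, the Kullback–Leibler divergence between probability vectors used above can be computed as follows (a generic utility, not code from the paper):

```python
import numpy as np

def kl(q, p):
    """K(q, p) = sum_k q_k log(q_k / p_k) for probability vectors q, p.
    Terms with q_k = 0 contribute 0 by convention."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))
```

The divergence is nonnegative and equals zero exactly when the two probability vectors coincide, which is what makes the limit in the misspecified case a best approximation to the truth.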
Finally, we point out another interesting property of SLR model (12)–(14). If $\pi$ is fixed as the proportion of $y = 1$ in the labeled sample, then the conditional probability (14) reduces to a constant, and the objective function can easily be shown to be equivalent to the profile log-likelihood derived from (4) in the exponential tilt model based on the labeled data only, or equivalently the log-likelihood (2) from logistic regression based on the labeled data only, after an intercept shift. We show that the MLE from ETM model (5)–(8), or equivalently SLR model (12)–(14), is asymptotically more efficient than that from logistic regression based on the labeled data only.
Proposition 3.
Denote by $(\tilde\alpha, \tilde\beta)$ the estimator of $(\alpha^*, \beta)$ obtained by maximizing the objective with $\pi$ so fixed, or equivalently by logistic regression based on the labeled data only. Then the asymptotic variance matrix of the MLE from ETM model (5)–(8) is no greater (in the usual order on positive-definite matrices) than that of $(\tilde\alpha, \tilde\beta)$ under standard regularity conditions.

3.3 Regularized estimation and EM algorithm
The results in Section 3.2 provide theoretical support for the use of the two objective functions. In real applications, however, the MLE may not behave as satisfactorily as predicted by standard asymptotic theory, for various reasons. The labeled sample size may not be sufficiently large. The dimension of the feature vector, or the complexity of functions of features, may be too high compared with the labeled and unlabeled data sizes. We therefore propose regularized estimation by adding suitable penalties to the objective functions.
For the coefficient vector $\beta$, we employ a ridge penalty, although alternative penalties, including a Lasso penalty, can also be accommodated. For the mixture proportion $\pi$, we use a penalty in the form of the log density of a Beta distribution, with hyperparameters determined by a “center” and a “scale”. This choice is motivated by conceptual and computational simplicity in the EM algorithm to be discussed. Combining these penalties with the profile objective gives the following penalized objective function:
(15)
Similarly, the penalized objective function based on the direct criterion is
(16) 
Maximization of (15) or (16) will be called profile or direct SLR respectively. The two methods in general lead to different estimates when the penalties are active, although they can be shown to be equivalent, similarly as in Proposition 1, when no penalty is imposed. In fact, as the scale of the Beta penalty tends to infinity (i.e., $\pi$ is fixed at the center), the estimator of $(\alpha^*, \beta)$ from profile SLR is known to be asymptotically more efficient than that from direct SLR (Tan 2009).
We construct EM algorithms (Dempster et al. 1977) to numerically maximize (15) and (16). Of particular interest is that these algorithms shed light on the effect of the regularization introduced. Various other optimization techniques can also be exploited, because the direct objective is of a closed form, whereas the profile objective is defined only after a univariate minimization.
We describe some details of the EM algorithm for profile SLR; see the Supplement (Section III) for the corresponding algorithm for direct SLR. We return to the nonparametric log-likelihood (9) and introduce the following data augmentation. For each unlabeled observation, let a latent label indicate which of $G_1$ and $G_0$ the observation is drawn from. The labels of the labeled observations are observed and hence fixed. Denote by $\mathcal{P}$ the penalty term in (15) or (16).
E-step. The expectation of the augmented objective given the current estimates is
(17) 
where the weights denote the posterior probabilities of the latent labels given the features at the current estimates.
M-step. The next estimates are obtained as a maximizer of the expected objective (17) with $G_0$ profiled out, that is, maximized over all possible $G_0$ which is a probability measure supported on the pooled data. In correspondence with the current estimates, denote
Instead of maximizing this objective directly, we find a simple scheme for computing the next estimates.
Proposition 4.
Let $\tilde\pi$ be defined by
(18) 
A point is a local (or global) maximizer of the former objective if and only if the corresponding point, with the mixture proportion given by (18), is a local (or respectively global) maximizer of the latter.
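A toy illustration of the weighted-average structure of the update (18): in the sketch below the two component densities are held fixed and known, a simplifying assumption for illustration only (in profile SLR they involve the estimated tilt parameters), and the M-step for the mixture proportion interpolates between the prior center and the empirical E-step estimate.

```python
import numpy as np

def gauss(x, mu):
    """N(mu, 1) density."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def em_pi_penalized(x, mu0, mu1, rho, lam, iters=200):
    """EM for the mixture proportion with a Beta-type penalty: the M-step
    is a weighted average between the prior center rho (weight lam) and
    the empirical E-step estimate (weight m = len(x))."""
    m = len(x)
    pi = rho
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 1.
        w = pi * gauss(x, mu1) / (pi * gauss(x, mu1) + (1 - pi) * gauss(x, mu0))
        # M-step: penalized update, (m * mean(w) + lam * rho) / (m + lam).
        pi = (w.sum() + lam * rho) / (m + lam)
    return pi

rng = np.random.default_rng(3)
m = 4000
z = rng.random(m) < 0.7                         # true unlabeled proportion 0.7
x = np.where(z, rng.normal(2.0, 1.0, m), rng.normal(0.0, 1.0, m))

pi_small = em_pi_penalized(x, 0.0, 2.0, rho=0.2, lam=10.0)   # weak penalty
pi_large = em_pi_penalized(x, 0.0, 2.0, rho=0.2, lam=1e6)    # scale -> infinity
```

With a weak penalty the update approaches the empirical proportion; as the scale tends to infinity the proportion is clamped at the prior center, matching the two extremes described next.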
Proposition 4 is useful both computationally and conceptually. First, $\tilde\pi$ is of a closed form: a weighted average, with the weight depending on the scale of the Beta penalty, between the prior center and the empirical estimate, which would be obtained as the scale tends to infinity or to zero respectively. Moreover, the remaining estimates can be equivalently computed by maximizing the objective function
(19)
which is concave and of a similar form to the log-likelihood (2) with a ridge penalty for logistic regression. Each imputed probability serves as a pseudo response. In our implementation, the prior center is fixed as the proportion of $y = 1$ in the labeled sample, and the scales are treated as tuning parameters, to be selected by cross-validation. Numerically, this procedure allows an adaptive interpolation between the two extremes: a fixed choice of the mixture proportion or an empirical estimate by maximum likelihood. For direct SLR (but not profile SLR), our adaptive procedure reduces to, and hence accommodates, logistic regression with labeled data only at one extreme. See the Supplement (Section III) for further discussion.

4 Related work
There is a vast literature on semi-supervised learning; see, for example, Chapelle et al. (2006) and Zhu (2008). For space limitations, we only discuss work directly related to ours.
Generative models and EM. A generative model can be postulated for $(x, y)$ jointly such that the joint density is the product of the label proportion and a parametric feature density given the label (e.g., Nigam et al. 2000). In our notation, a generative model corresponds to Eqs (5)–(7), but with both $G_0$ and $G_1$ parametrically specified. For training by EM algorithms, the expected objective in the E-step is similar to (17), except that the imputed probabilities are computed from the parametric densities for $y = 0$ or $1$. The performance of generative modeling can be sensitive to whether the model assumptions are correct or not (Cozman et al. 2003). In this regard, our approach based on ETM models is attractive in specifying only a parametric form (8) for the density ratio between $G_1$ and $G_0$ while leaving $G_0$ nonparametric.
Logistic regression and EM. There have been various efforts to extend logistic regression in an EM style for semi-supervised learning. Notably, Amini & Gallinari (2002) proposed a classification EM algorithm using logistic regression (1), which can be described as follows:

E-step: Compute the fitted probabilities on the unlabeled data at the current estimates; the labeled responses are kept fixed.

C-step: Convert each fitted probability on the unlabeled data to a hard label, 1 if the probability is at least 1/2 and 0 otherwise; the labeled responses are kept fixed.

M-step: Compute the next estimates by maximizing the log-likelihood of logistic regression with the observed and imputed labels.
Although convergence of classification EM was studied for clustering (Celeux & Govaert 1992), it seems unclear what objective function is optimized by the preceding algorithm. A worrisome phenomenon we notice is that if soft classification is used instead of hard classification, then the algorithm merely optimizes the log-likelihood of logistic regression with the labeled data only. By comparing (19) and (20), this modified algorithm can be shown to reduce to our EM algorithm with the mixture proportion clamped at the proportion of $y = 1$ in the labeled sample.
Proposition 5.
If the objective in the M-step is modified with the hard labels replaced by the fitted probabilities as
(20) 
then the estimates converge, as the number of iterations tends to infinity, to the MLE of logistic regression based on the labeled data only.
We note that the conclusion also holds if (20) is replaced by the cost function proposed in Wang et al. (2009), Eq (2), when the logistic loss is used as the cost function on labeled data.
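The fixed-point behavior behind Proposition 5 can be checked numerically: starting the soft-label variant at the labeled-only MLE leaves the estimates unchanged. The sketch below (all setup is illustrative) implements this check with a Newton solver that accepts soft labels.

```python
import numpy as np

def newton_logistic(X, y, iters=50):
    """Newton's method for the logistic log-likelihood; y may contain
    soft labels in [0, 1], since the score X^T (y - p) keeps the same form."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta += np.linalg.solve(hess, X.T @ (y - p))
    return theta

rng = np.random.default_rng(4)
xl = rng.normal(0.0, 1.0, 200)
Xl = np.column_stack([np.ones(200), xl])                  # labeled design
yl = (rng.random(200) < 1.0 / (1.0 + np.exp(-xl))).astype(float)
Xu = np.column_stack([np.ones(500), rng.normal(0.0, 1.0, 500)])  # unlabeled

theta0 = newton_logistic(Xl, yl)              # labeled-only MLE
theta = theta0.copy()
for _ in range(20):                           # soft-label EM iterations
    q = 1.0 / (1.0 + np.exp(-Xu @ theta))     # E-step: soft pseudo labels
    theta = newton_logistic(np.vstack([Xl, Xu]),
                            np.concatenate([yl, q]))  # M-step: joint fit
```

At the labeled-only MLE, the unlabeled score term vanishes identically when the pseudo labels equal the fitted probabilities, so the iterations do not move the estimate.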
Regularized methods. Various methods have been proposed by adding, to the log-likelihood of logistic regression with labeled data, a regularizer that depends on the unlabeled data. Examples include entropy regularization (Grandvalet & Bengio 2005), expectation regularization (Mann & McCallum 2007), and graph-based priors (Krishnapuram et al. 2005). An important difference from our methods is that these penalized objective functions appear to be Fisher consistent only when they reduce to the log-likelihood of logistic regression with labeled data alone, regardless of unlabeled data. For another difference, the class proportions in unlabeled data are implicitly assumed to be the same as in labeled data in entropy regularization, and need to be explicitly estimated from labeled data or external knowledge in the case of label regularization (Mann & McCallum 2007).
5 Numerical experiments
We report experiments on 15 benchmark datasets, including 11 UCI datasets and 4 SSL benchmark datasets. We compare our methods, profile SLR (pSLR) and direct SLR (dSLR), with two supervised methods, ridge logistic regression (RLR) and SVM, and two semi-supervised methods, entropy regularization (ER) (Grandvalet & Bengio 2005) and transductive SVM (TSVM) (Joachims 1999). For each method, only linear predictors are studied. All tuning parameters are selected by 5-fold cross-validation. See the Supplement (Section V) for details about the datasets and implementations.
For each dataset except SPAM, a training set is obtained as follows: labeled data are sampled at a certain size (25 or 100) with fixed class proportions, and then unlabeled data are sampled such that the labeled and unlabeled data combined are 2/3 of the original dataset. The remaining 1/3 of the dataset is used as a test set. For SPAM, the preceding procedure is applied to a subsample of size 750 from the original dataset. To allow different class proportions between labeled and unlabeled data, we consider two schemes: the class proportions in the labeled data are close to those of the original dataset (“Homo Prop”), or larger (or smaller) than the latter by an odds ratio of 4 (“Flip Prop”) if the odds of positive versus negative labels is less than 1 (or respectively greater than 1) in the original dataset. Hence the class balance constraint as used in TSVM is misspecified in the second scheme.

Care is needed to define classifiers on test data. In the Homo Prop scheme, the 4 existing methods are applied as usual, and accordingly the classifiers from our methods use intercepts determined by the class sizes in the labeled training data. In the Flip Prop scheme, the classifiers from RLR, ER, and SVM, as well as those from our methods, are defined with the intercepts of the linear predictors adjusted by assuming 1:1 class proportions in the test data. This assumption is often invalid in our experiments, but seems neutral when the actual class proportions in test data are unknown. The “linear predictor” is converted by logit from class probabilities for SVM, but this is currently unavailable for TSVM. Alternatively, class weights can be used in SVM, but this technique has not been developed for TSVM.
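The intercept adjustment used here is the standard prior-shift correction for logistic-type linear predictors; a minimal sketch (the function name and values are illustrative):

```python
import numpy as np

def adjust_intercept(eta, train_odds, test_odds):
    """Shift a logistic linear predictor from training to test class priors:
    under invariance of the class-conditional feature distributions,
    eta'(x) = eta(x) - log(train_odds) + log(test_odds)."""
    return eta - np.log(train_odds) + np.log(test_odds)

# A predictor trained with 1:4 positive:negative class proportions,
# evaluated under an assumption of 1:1 class proportions in test data.
eta = np.array([-0.2, 0.1, 1.3])
eta_adj = adjust_intercept(eta, train_odds=1 / 4, test_odds=1.0)
```

With these values the whole predictor is shifted upward by log 4, moving the decision boundary toward the formerly minority class.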
Table 1 presents the results with labeled data size 100. See the Supplement for those with labeled data size 25 and for AUC results. In the Homo Prop scheme, the logistic-type methods, RLR, ER, pSLR, and dSLR, perform similarly to each other, and noticeably better than SVM and TSVM in terms of accuracy achieved within 1% of the highest (in bold). While unstable performance of SVM and TSVM has been previously noticed (e.g., Li & Zhou 2015), such good performance of RLR and ER on these benchmark datasets appears not to have been reported before. In the Flip Prop scheme, our methods, dSLR and pSLR, achieve the best two performances, sometimes with considerable margins of improvement over the other methods. In this case, all methods except TSVM are applied with the intercept adjustment described above. Because the proportion scheme in effect may be unknown in practice, results with intercept adjustment in the Homo Prop scheme are also reported in the Supplement; our methods still achieve close to the best performance among the methods studied.
6 Conclusion
We have developed an extension of logistic regression for semi-supervised learning, with strong support from statistical theory, algorithms, and numerical results. There are various questions of interest for future work. Our approach can be readily extended by employing nonlinear predictors such as kernel representations or neural networks. Further experiments with such extensions are desirable, as well as applications to more complex text and image classification tasks.
References
Amini, M.R. & Gallinari, P. (2002) Semi-supervised logistic regression. Proceedings of the 15th European Conference on Artificial Intelligence, 390–394.
Bartlett, P., Jordan, M. & McAuliffe, J. (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138–156.
Celeux, G. & Govaert, G. (1992) A classification EM algorithm and two stochastic versions. Computational Statistics and Data Analysis, 14, 315–332.
Chapelle, O., Schölkopf, B. & Zien, A. (2006) Semi-Supervised Learning. MIT Press.
Cozman, F., Cohen, I. & Cirelo, M. (2003) Semi-supervised learning of mixture models. Proceedings of the 20th International Conference on Machine Learning, 99–106.
Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22.
Grandvalet, Y. & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems 17, 529–536.
Joachims, T. (1999) Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 200–209.
Kiefer, J. & Wolfowitz, J. (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–906.
Krishnapuram, B., Williams, D., Xue, Y., Carin, L., Figueiredo, M. & Hartemink, A.J. (2005) On semi-supervised classification. Advances in Neural Information Processing Systems 17, 721–728.
Li, Y.F. & Zhou, Z.H. (2015) Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 175–188.
Lin, Y. (2002) Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259–275.
Mann, G.S. & McCallum, A. (2007) Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th International Conference on Machine Learning, 593–600.
Manski, C.F. (1988) Analog Estimation Methods in Econometrics, Chapman & Hall.
Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Owen, A.B. (2001) Empirical Likelihood. Chapman & Hall/CRC.
Prentice, R.L. & Pyke, R. (1979) Logistic disease incidence models and casecontrol studies. Biometrika, 66, 403–411.
Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630.
Qin, J. (1999) Empirical likelihood ratio based confidence intervals for mixture proportions. Annals of Statistics, 27, 1368–1384.
Tan, Z. (2009) A note on profile likelihood for exponential tilt mixture models. Biometrika, 96, 229–236.
Vapnik, V. (1998) Statistical Learning Theory. Wiley-Interscience.
Wang, J., Shen, X. & Pan, W. (2009) On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10, 719–742.
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Zhu, X.J. (2008) Semi-supervised learning literature survey. Technical Report, University of Wisconsin–Madison, Department of Computer Sciences.
Zou, F., Fine, J.P. & Yandell, B.S. (2002) On empirical likelihood for a semiparametric mixture model. Biometrika, 89, 61–75.
Supplementary Material for
“Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models”
Xinwei Zhang & Zhiqiang Tan 
Department of Statistics, Rutgers University, USA 
xinwei.zhang@rutgers.edu, ztan@stat.rutgers.edu 
I Introduction
We provide additional material to support the content of the paper. All equation and proposition numbers referred to are from the paper, except S1, S2, etc.
II Illustration
We provide a simple example to highlight the comparison between the new and existing methods. A labeled sample of size 100 is drawn, where 20 observations are from one bivariate Gaussian with a diagonal variance matrix and 80 are from another bivariate Gaussian with a different mean and diagonal variance matrix. An unlabeled sample of size 1000 is drawn from the same two Gaussians but in different (1:1) proportions, and the labels are then removed. This is similar to the Flip Prop scheme in the numerical experiments in Section 5, where the class proportions in unlabeled data differ from those in labeled data. The training set, including both labeled and unlabeled data, is then rescaled such that the root mean square of each feature is 1, as shown in Figure S1.
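The construction of this example can be sketched as follows; the Gaussian means below are placeholders, since the exact means and variances are not restated here.

```python
import numpy as np

# Sketch of the illustrative data: a labeled sample of 100 (20 positives,
# 80 negatives) and an unlabeled sample of 1000 drawn with 1:1 class
# proportions. The means mu1, mu0 are assumed values for illustration.
rng = np.random.default_rng(5)
mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])   # assumed means
X_lab = np.vstack([rng.normal(mu1, 1.0, (20, 2)),
                   rng.normal(mu0, 1.0, (80, 2))])
y_lab = np.repeat([1, 0], [20, 80])
X_unl = np.vstack([rng.normal(mu1, 1.0, (500, 2)),
                   rng.normal(mu0, 1.0, (500, 2))])       # labels removed

# Rescale so that the root mean square of each feature is 1 over the
# combined training set, as described above.
rms = np.sqrt((np.vstack([X_lab, X_unl]) ** 2).mean(axis=0))
X_lab, X_unl = X_lab / rms, X_unl / rms
```

After the rescaling, each feature of the combined training set has root mean square exactly 1 by construction.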
Figure S2 (rows 1 to 5) shows the decision lines from, respectively, ridge logistic regression (RLR), entropy regularization (ER), SVM, TSVM, and direct SLR (dSLR). In the left column are the decision lines without intercept adjustment (corresponding to an assumption of 1:4 class proportions in test data, as in the labeled training data), and in the right column are those with intercept adjustment (corresponding to an assumption of 1:1 class proportions in test data, as in the unlabeled training data), as described in Section 5. In practice, the class proportions in test data may be unknown and hence some assumption is needed. (The assumption of 1:1 class proportions in test data is used to define classifiers in the Flip Prop scheme in Section 5, even though this assumption is violated for a majority of the datasets studied; see Table S1.) For ease of comparison, the intercept adjustment is applied directly to the SVM and TSVM decision functions, instead of the “linear predictor” converted by logit from class probabilities (if available), which would yield a nonlinear decision boundary. Alternatively, class weights can be used in SVM to account for differences in class proportions between training and test data, but this technique has not been developed for TSVM. For each method, eight decision lines (black or blue) are plotted, using 8 values of a tuning parameter; some of the lines may fall outside the plot region. The blue lines correspond to the least amount of penalization used, that is, the smallest penalty parameters or the largest cost parameter. See Section V later for a description of the tuning parameters involved. For RLR, the ridge parameter is varied uniformly over the grid. For ER, the entropy weight is varied while the ridge parameter is fixed at 0, to isolate the effect of entropy regularization. For SVM and TSVM, the cost parameter is varied uniformly. For TSVM, the class-balance parameter is automatically tuned (Joachims 1999). For dSLR, the penalty scale is varied while the ridge parameter is fixed at 0.
Two oracle lines are drawn in each plot. The red line is computed by logistic regression and the purple line by SVM, each from an independent labeled sample of size 4000 with 1:4 class proportions (left column) or 1:1 class proportions (right column), transformed by the same scaling as the original training set. The red and purple oracle lines differ only slightly in the left column, and are virtually identical in the right column. It should be noted that these oracle lines are not the optimal Bayes decision boundary, because the log density ratio between the classes is linear in one feature but nonlinear in the other, due to the different variances of the two Gaussians.
From these plots, we see the following. First, the least penalized line (blue) from our method dSLR is much closer to the oracle lines (red and purple) than those from the other methods, whether or not intercept adjustment is applied. This provides numerical support for the Fisher consistency of our method, given that the labeled size 100 and unlabeled size 1000 are reasonably large compared with the feature dimension 2. On the other hand, in spite of the relatively large labeled size, the lines from unpenalized logistic regression and SVM based on labeled data alone still differ noticeably from the oracle lines. Hence this also shows that our method can exploit unlabeled data together with labeled data to achieve a better approximation to the oracle lines.
Second, with suitable choices of tuning parameters, some of the decision lines from existing methods can be reasonably close to the oracle lines. In fact, such cases of good approximation can be found from the supervised methods RLR and SVM, but not from the semi-supervised methods ER and TSVM. This indicates potentially unstable performance of ER and TSVM, particularly in the current setting where the class proportions in unlabeled data differ from those in labeled data. Moreover, SVM seems to perform noticeably worse in the right column, possibly due to intercept adjustment, than in the left column, where the class proportions in test data underlying the oracle lines are identical to those in labeled training data (hence a more favorable setting).