Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

06/19/2019
by Xinwei Zhang and Zhiqiang Tan, Rutgers University

Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than just using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.

1 Introduction

Semi-supervised learning for classification involves exploiting a large amount of unlabeled data and a relatively small amount of labeled data to build better classifiers. This approach can potentially be used to achieve higher accuracy, with a limited budget for obtaining labeled data. Various methods have been proposed, including expectation-maximization (EM) algorithms, transductive support vector machines (SVMs), and regularized methods (e.g., Chapelle et al. 2006; Zhu 2008).

For supervised classification, there are a range of objective functions which are Fisher consistent in the following sense: optimization of the population, nonparametric version of a loss function leads to the true conditional probability function of labels given features as for the logistic loss, or to the Bayes classifier as for the hinge loss (Lin 2002; Bartlett et al. 2006). In contrast, a perplexing issue we notice for semi-supervised classification is that existing objective functions are in general not Fisher consistent, except in the degenerate case where unlabeled data are ignored and only labeled data are used. Examples include the objective functions in transductive SVMs (Vapnik 1998; Joachims 1999) and various regularized methods (Grandvalet & Bengio 2005; Mann & McCallum 2007; Krishnapuram et al. 2005). The lack of Fisher consistency may contribute to unstable performance of existing semi-supervised classifiers (e.g., Li & Zhou 2015). Another restriction in existing methods is that the class proportions in labeled and unlabeled data are typically assumed to be the same.

We develop a semi-supervised extension of logistic regression based on exponential tilt mixture models (Qin 1999; Zou et al. 2002; Tan 2009), without restricting the class proportions in the unlabeled data to be the same as in the labeled data. The development is motivated by a statistical equivalence between logistic regression for the conditional probability of a label given features and exponential tilt modeling for the density ratio between the feature distributions within different labels (Anderson 1972; Prentice & Pyke 1979). Our work involves two main contributions: (i) we derive novel objective functions which are shown not only to be Fisher consistent but also to lead to asymptotically more efficient estimation than estimation based on labeled data only, and (ii) we propose regularized estimation and construct computationally and conceptually desirable EM algorithms. From numerical experiments, our methods achieve a substantial advantage over existing methods when the class proportions in unlabeled data differ from those in labeled data. A possible explanation is that while the class proportions in unlabeled data are estimated as unknown parameters in our methods, they are implicitly assumed to be the same as in labeled data for existing methods including transductive SVMs (Joachims 1999) and entropy regularization (Grandvalet & Bengio 2005).

A simple, informative example is provided in the Supplement (Section II) to highlight comparison between new and existing methods mentioned above.

2 Background: logistic regression and exponential tilt model

For supervised classification, the training data consist of a sample of , where and represent a feature vector and an associated label, respectively. Consider a logistic regression model

(1)

where is a coefficient vector associated with , and is an intercept, with superscript indicating classification or conditional probability of given . The maximum likelihood estimator (MLE) is defined as a maximizer of the log (conditional) likelihood:

(2)

In general, nonlinear functions of can be used in place of , and a penalty term can be incorporated into the log-likelihood such as the ridge penalty or the squared norm of a reproducing kernel Hilbert space of functions of . We discuss these issues later in Sections  3.3 and 6.

Interestingly, logistic regression on can be made equivalent to an exponential tilt model on (Anderson 1972; Prentice & Pyke 1979; Qin 1998). Denote by or the conditional distribution or respectively, and . By the Bayes rule, model (1) is equivalent to the exponential tilt model

(3)

where denotes the density ratio between and with respect to a dominating measure, and . Model (3) is explicitly a semi-parametric model, where is an infinite-dimensional parameter and are finite-dimensional parameters. In fact, logistic model (1) is also semi-parametric, where the marginal distribution of is an infinite-dimensional parameter, and are finite-dimensional parameters. Furthermore, the MLE in model (1) can be related to the following estimator in model (3) by the method of nonparametric likelihood (Kiefer & Wolfowitz 1956) or empirical likelihood (Owen 2001). Formally, are defined as a maximizer of the log-likelihood,

(4)

over all possible such that is a probability measure supported on the pooled data with . Analytically, it can be shown that , , where . See Qin (1998) and references therein.
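To make this equivalence concrete, the following Python sketch (our own notation and a hypothetical helper name, not code from the paper) fits ordinary logistic regression on labeled data and recovers exponential tilt parameters by an intercept shift of log(n1/n0), assuming the standard parameterization dG1/dG0(x) = exp(alpha + beta'x).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_exponential_tilt(X, y):
    """Sketch of the Prentice-Pyke / Qin (1998) style equivalence: estimate the
    exponential tilt dG1/dG0(x) = exp(alpha + beta'x) via logistic regression.
    Notation and helper name are ours, not the paper's."""
    n1, n0 = int(np.sum(y == 1)), int(np.sum(y == 0))
    # large C => essentially unpenalized maximum likelihood
    clf = LogisticRegression(C=1e8, max_iter=10_000).fit(X, y)
    beta = clf.coef_.ravel()            # tilt slope equals the logistic slope
    alpha_c = clf.intercept_[0]         # classification (logistic) intercept
    alpha = alpha_c - np.log(n1 / n0)   # tilt intercept via the intercept shift
    return alpha, beta
```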

By the foregoing discussion, we see that there are two statistically distinct but equivalent approaches for supervised classification: logistic regression or exponential tilt models. It is such a relationship that we aim to exploit in developing a new method for semi-supervised classification.

3 Theory and methods

For semi-supervised classification, the training data consist of a labeled sample and an unlabeled sample , for which the associated labels are unobserved. Typically for existing methods including transductive SVMs, the two samples and are assumed to be from a common population of . However, we allow that and may be drawn from different populations, with the same conditional distribution , but possibly different marginal probabilities and .

3.1 Exponential tilt mixture model

Although it seems difficult at first look to extend logistic model (1) for semi-supervised learning, we realize that both the labeled sample and the unlabeled sample can be taken into account by a natural extension of the exponential tilt model (3), called an exponential tilt mixture (ETM) model (Qin 1999; Zou et al. 2002; Tan 2009). Denote

An exponential tilt mixture model for the three samples postulates that

(5)
(6)
(7)

where or represents the conditional distribution of given or respectively in both the labeled and unlabeled data such that

(8)

and is the proportion of underlying the unlabeled data. While Eqs (5)–(6) merely give definitions of and , Eq (7) says that the feature distribution in the unlabeled sample is a mixture of and , which follows from the structural assumption that the conditional distribution is invariant between the labeled and unlabeled samples. Eq (8) imposes a functional restriction on the density ratio between and , similarly as in (3).
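Since the displays above are not reproduced, the following LaTeX block records one standard way of writing (5)–(8), following Qin (1999); the symbols $G_0$, $G_1$, $\pi$, $\alpha$, $\beta$ are our own notation, so this should be read as a hedged reconstruction rather than the paper's exact equations.

```latex
% Hedged reconstruction of (5)-(8) in our notation, following Qin (1999).
\begin{align}
x \mid (\text{labeled},\, y=0) &\sim G_0, \tag{5}\\
x \mid (\text{labeled},\, y=1) &\sim G_1, \tag{6}\\
x \mid \text{unlabeled} &\sim (1-\pi)\,G_0 + \pi\,G_1, \tag{7}\\
\frac{\mathrm{d}G_1}{\mathrm{d}G_0}(x) &= \exp\{\alpha + \beta^{\mathsf T} x\}. \tag{8}
\end{align}
```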

The ETM model, defined by (5)–(8), is a semi-parametric model, with an infinite-dimensional parameter and finite-dimensional parameters and . We briefly summarize maximum nonparametric likelihood estimation previously studied (Qin 1999; Zou et al. 2002; Tan 2009). For notational convenience, rewrite the sample as , where , , and . Eqs (5)–(7) can be expressed as

where , , and . For any fixed , the average profile log-likelihood of is defined as with

(9)

over all possible which is a probability measure supported on the pooled data with . Denote

which can be easily shown to be concave in and convex in . Then Proposition 1 in Tan (2009) leads to the following result.

Lemma 1.

The average profile log-likelihood of can be determined as , where is a minimizer of over , satisfying the stationary condition (free of )

(10)

The maximum likelihood estimator of is then defined by maximizing the profile log-likelihood, that is, . From Lemma 1, we notice that the estimators jointly solve the saddle-point problem:

(11)

Large sample theory of has been studied in Qin (1999) under standard regularity conditions as and with some constant for . The theory shows the existence of a local maximizer of , which is consistent and asymptotically normal provided the ETM model (5)–(8) is correctly specified. However, there remain subtle questions. It seems unclear whether the population version of the average profile log-likelihood attains a global maximum at the true values of under a correctly specified ETM model. Moreover, what property can be deduced for under a misspecified ETM model?

3.2 Semi-supervised logistic regression

We derive a new classification model with parameters for the three samples such that an MLE of in the new model coincides with an MLE in the ETM model, and vice versa. Let if and if . Consider a conditional probability model for predicting the label from :

(12)
(13)
(14)

where , which ensures that . The model, defined by (12)–(14), will be called a semi-supervised logistic regression (SLR) model. The average log-likelihood function of with the data in model (12)–(14) can be written, up to an additive constant free of , as

Proposition 1.

A parameter value is a local (or respectively global) maximizer of the average log-likelihood in SLR model (12)–(14) if and only if it is a local (or respectively global) maximizer of the average profile log-likelihood in ETM model (5)–(8).

Proposition 1 shows an equivalence between maximum nonparametric likelihood estimation in ETM model (5)–(8) and usual maximum likelihood estimation in SLR model (12)–(14), even though the objective functions and are not equivalent. This differs from the equivalence between logistic regression (1) and exponential tilt model (3) with labeled data only, where the log-likelihood (2) and the profile log-likelihood from (4) are equivalent (Prentice & Pyke 1979). From another angle, this result says that saddle-point problem (11) can be equivalently solved by directly maximizing . This transformation is nontrivial, because a saddle-point problem in general cannot be converted into optimization with a closed-form objective.
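As an illustration of what a direct-SLR-type objective looks like, here is a hedged Python sketch of one plausible reconstruction (not the paper's exact display): the conditional probability that a pooled observation came from the labeled y=0 sample, the labeled y=1 sample, or the unlabeled sample is taken proportional to n0, n1·exp(a + b'x), and n2·((1−π) + π·exp(a + b'x)), respectively; the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logsumexp

def direct_slr_negloglik(params, X, s, n0, n1, n2):
    """Negative average log-likelihood of a direct-SLR-style model (a sketch
    in our own notation).  s[i] in {0, 1, 2} records which sample (labeled
    y=0, labeled y=1, or unlabeled) observation i belongs to."""
    a, logit_pi = params[0], params[1]
    b = params[2:]
    pi = expit(logit_pi)                      # keep pi in (0, 1)
    tilt = a + X @ b                          # assumed log density ratio
    log_w = np.column_stack([
        np.full(X.shape[0], np.log(n0)),
        np.log(n1) + tilt,
        np.log(n2) + np.logaddexp(np.log(1.0 - pi), np.log(pi) + tilt),
    ])
    log_prob = log_w - logsumexp(log_w, axis=1, keepdims=True)
    return -np.mean(log_prob[np.arange(X.shape[0]), s])

# usage sketch: X is the pooled feature matrix, s the sample-membership labels
# res = minimize(direct_slr_negloglik, x0=np.zeros(X.shape[1] + 2),
#                args=(X, s, n0, n1, n2), method="BFGS")
```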

By the identification of as a usual log-likelihood function, we show that the objective functions and , with the linear predictor replaced by an arbitrary function , are Fisher consistent nonparametrically, i.e., maximization of their population versions leads to the true values. This seems to be the first time Fisher consistency of a loss function is established for semi-supervised classification. By some abuse of notation, denote

Proposition 2.

Suppose that is drawn from in (5)–(7) for , with and for some fixed value and function . Denote . For any and function , we have

where both equalities hold if and . Hence the population objective functions and are maximized at the true value and function .

Proposition 2 fills existing gaps in understanding maximum likelihood estimation in ETM model (5)–(8), through its equivalence with that in SLR model (12)–(14). If the ETM model is correctly specified, then the population version of has a global maximum at the true values of , and hence a global maximizer is consistent under suitable regularity conditions. If the ETM model is misspecified, then by the theory of estimation with misspecified models (Manski 1988; White 1982), the MLE converges in probability to a limit value which minimizes the difference between with and . This difference, as shown in the Supplement (Section IV.2), is the expected Kullback–Leibler divergence

where is the conditional probability (12)–(14) for , is the Kullback–Leibler divergence between two probability vectors and , and denotes the expectation with respect to .
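For reference, the Kullback–Leibler divergence between two probability vectors $q$ and $p$, as referred to above, is the standard quantity:

```latex
\[
\mathrm{KL}(q \,\|\, p) \;=\; \sum_{k=1}^{K} q_k \log\frac{q_k}{p_k},
\qquad q = (q_1,\dots,q_K),\; p = (p_1,\dots,p_K).
\]
```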

Finally, we point out another interesting property of SLR model (12)–(14). If is fixed as , the proportion of in the labeled sample, then . In this case, the conditional probability (14) reduces to a constant, and the objective function can be easily shown to be equivalent to the profile log-likelihood of derived from (4) in the exponential tilt model based on the labeled data only or equivalently the log-likelihood of as (2) from logistic regression based on the labeled data only, after the intercept shift . We show that the MLE from ETM model (5)–(8) or equivalently SLR model (12)–(14) is asymptotically more efficient than that from logistic regression based on the labeled data only.

Proposition 3.

Denote by the estimator of obtained by maximizing , or equivalently by logistic regression based on the labeled data only. Then the asymptotic variance matrix of the MLE from ETM model (5)–(8) is no greater (in the usual order on positive-definite matrices) than that of under standard regularity conditions.

3.3 Regularized estimation and EM algorithm

The results in Section 3.2 provide theoretical support for the use of the objective functions and . In real applications, the MLE may not behave satisfactorily as predicted by standard asymptotic theory for various reasons. The labeled sample size may not be sufficiently large. The dimension of the feature vector or the complexity of functions of features may be too high, compared with the labeled and unlabeled data sizes. Therefore, we propose regularized estimation by adding suitable penalties to the objective functions.

For the coefficient vector , we employ a ridge penalty , although alternative penalties can also be allowed, including a Lasso penalty. For the mixture proportion , we use a penalty in the form of the log density of a Beta distribution, , where and for a "center" and a "scale" . This choice is motivated by conceptual and computational simplicity in the EM algorithm to be discussed. Combining these penalties with gives the following penalized objective function

(15)

Similarly, the penalized objective function based on is

(16)

Maximization of (15) or (16) will be called profile or direct SLR respectively. The two methods in general lead to different estimates of when , although they can be shown to be equivalent similarly as in Proposition 1 when . In fact, as (i.e., is fixed as ), the estimator of from profile SLR is known to be asymptotically more efficient than that from direct SLR (Tan 2009).

We construct EM algorithms (Dempster et al. 1977) to numerically maximize (15) and (16). Of particular interest is that these algorithms shed light on the effect of the regularization introduced. Various other optimization techniques can also be exploited, because is directly of a closed form, and is defined only after univariate minimization in .

We describe some details about the EM algorithm for profile SLR. See the Supplement (Section III) for the corresponding algorithm for direct SLR. We return to the nonparametric log-likelihood (9) and introduce the following data augmentation. For , let such that and . Recall that the group memberships of the labeled observations are observed and hence fixed. Denote the penalty term in (15) or (16) as .

E-step. The expectation of the augmented objective given the current estimates is

(17)

where .

M-step. The next estimates are obtained as a maximizer of the expected objective (17) with profiled out, that is, over all possible which is a probability measure supported on the pooled data with . In correspondence to , denote

Instead of maximizing directly, we find a simple scheme for computing .

Proposition 4.

Let

(18)

Then is a local (or respectively global) maximizer of if and only if is a local (or respectively global) maximizer of .

Proposition 4 is useful both computationally and conceptually. First, is in closed form: a weighted average, with the weight depending on the scale , between the prior center and the empirical estimate , which would be obtained with or respectively. Moreover, can be equivalently computed by maximizing the objective function

(19)

which is concave in and of a similar form to the log-likelihood (2) with a ridge penalty for logistic regression. Each imputed probability serves as a pseudo-response.

In our implementation, the prior center is fixed as , the proportion of in the labeled sample, and the scales are treated as tuning parameters, to be selected by cross validation. Numerically, this procedure allows an adaptive interpolation between the two extremes: a fixed choice or an empirical estimate by maximum likelihood. For direct SLR (but not profile SLR), our adaptive procedure reduces to and hence accommodates logistic regression with labeled data only at one extreme with . See the Supplement (Section III) for further discussion.

4 Related work

There is a vast literature on semi-supervised learning. See, for example, Chapelle et al. (2006) and Zhu (2008). Due to space limitations, we only discuss work directly related to ours.

Generative models and EM. A generative model can be postulated for jointly such that , where denotes the label proportion and denotes the parameters associated with the feature distributions given labels (e.g., Nigam et al. 2000). In our notation, a generative model corresponds to Eqs (5)–(7), but with both and parametrically specified. For training by EM algorithms, the expected objective in the E-step is similar to in (17), except that is replaced by for or 1. The performance of generative modeling can be sensitive to whether the model assumptions are correct or not (Cozman et al. 2003). In this regard, our approach based on ETM models is attractive in only specifying a parametric form (8) for the density ratio between and while leaving the distribution nonparametric.

Logistic regression and EM. There are various efforts to extend logistic regression in an EM-style for semi-supervised learning. Notably, Amini & Gallinari (2002) proposed a classification EM algorithm using logistic regression (1), which can be described as follows:

  • E-step: Compute . Fix and .

  • C-step: Let if and 0 otherwise. Fix and .

  • M-step: Compute by maximizing the objective .

Although convergence of classification EM was studied for clustering (Celeux & Govaert 1992), it seems unclear what objective function is optimized by the preceding algorithm. A worrisome phenomenon we notice is that if soft classification is used instead of hard classification, then the algorithm merely optimizes the log-likelihood of logistic regression with the labeled data only. By comparing (19) and (20), this modified algorithm can be shown to reduce to our EM algorithm with and clamped at , the proportion of in the labeled sample.

Proposition 5.

If the objective in the M-step is modified with replaced by as

(20)

then converges, as , to the MLE of logistic regression based on the labeled data only.

We notice that the conclusion also holds if (20) is replaced by the cost function proposed in Wang et al. (2009), Eq (2), when the logistic loss is used as the cost function on labeled data.

Regularized methods. Various methods have been proposed by introducing a regularizer depending on unlabeled data to the log-likelihood of logistic regression with labeled data. Examples include entropy regularization (Grandvalet & Bengio 2005), expectation regularization (Mann & McCallum 2007), and graph-based priors (Krishnapuram et al. 2005). An important difference from our methods is that these penalized objective functions seem to be Fisher consistent only when they reduce to the log-likelihood of logistic regression with labeled data alone, regardless of unlabeled data. For another difference, the class proportions in unlabeled data are implicitly assumed to be the same as in labeled data in entropy regularization, and need to be explicitly estimated from labeled data or external knowledge in the case of label regularization (Mann & McCallum 2007).

5 Numerical experiments

We report experiments on 15 benchmark datasets including 11 UCI datasets and 4 SSL benchmark datasets. We compare our methods, profile SLR (pSLR) and direct SLR (dSLR), with 2 supervised methods, ridge logistic regression (RLR) and SVM, and 2 semi-supervised methods, entropy regularization (ER) (Grandvalet & Bengio 2005) and transductive SVM (TSVM) (Joachims 1999). For each method, only linear predictors are studied. All tuning parameters are selected by 5-fold cross validation. See the Supplement (Section V) for details about the datasets and implementations.

For each dataset except SPAM, a training set is obtained as follows: labeled data are sampled with a certain size (25 or 100) and fixed class proportions, and then unlabeled data are sampled such that the labeled and unlabeled data combined constitute 2/3 of the original dataset. The remaining 1/3 of the dataset is used as a test set. For SPAM, the preceding procedure is applied to a subsample of size 750 from the original dataset. To allow different class proportions between labeled and unlabeled data, we consider two schemes: the class proportions in the labeled data are close to those of the original dataset ("Homo Prop"), or larger (or smaller) than the latter by an odds ratio of 4 ("Flip Prop") if the odds of positive versus negative labels is (or respectively ) in the original dataset. Hence the class balance constraint as used in TSVM is misspecified in the second scheme.
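For concreteness, a minimal sketch of the Flip Prop labeled-data sampling is given below. The helper name and the exact direction of the odds flip are our assumptions, since the condition symbols are not reproduced above.

```python
import numpy as np

def sample_flip_prop_labels(y, n_lab, odds_ratio=4.0, rng=None):
    """Sketch of 'Flip Prop' labeled-data sampling: tilt the class odds of the
    labeled subsample by a factor of `odds_ratio` relative to the full dataset.
    The direction of the flip (increase when positives are the minority) is an
    assumption; we also assume each class has enough examples to sample from."""
    rng = np.random.default_rng(rng)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    odds = len(pos) / len(neg)
    new_odds = odds * odds_ratio if odds < 1 else odds / odds_ratio
    n_pos = int(round(n_lab * new_odds / (1.0 + new_odds)))
    n_neg = n_lab - n_pos
    idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                          rng.choice(neg, n_neg, replace=False)])
    return idx    # indices of the labeled training points
```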

Homo Prop                RLR            ER             pSLR           dSLR           SVM            TSVM
AUSTRA                   85.37 ± 2.00   85.50 ± 1.94   85.43 ± 2.07   85.33 ± 2.03   85.37 ± 1.96   85.15 ± 1.79
BCW                      95.76 ± 1.04   95.64 ± 1.08   95.71 ± 1.07   95.80 ± 1.07   96.13 ± 1.04   96.44 ± 0.92
GERMAN                   72.16 ± 2.69   72.22 ± 2.60   72.39 ± 2.71   72.12 ± 2.68   70.65 ± 2.85   69.01 ± 3.94
HEART                    80.94 ± 3.56   81.04 ± 4.23   81.46 ± 3.41   80.94 ± 3.78   80.00 ± 4.35   80.36 ± 5.00
INON                     84.38 ± 2.29   84.17 ± 2.16   85.00 ± 1.21   84.38 ± 2.29   83.92 ± 2.51   83.33 ± 3.18
LIVER                    65.57 ± 4.37   65.83 ± 4.20   66.35 ± 4.65   66.22 ± 4.22   67.87 ± 3.22   64.83 ± 5.00
PIMA                     74.71 ± 2.99   75.06 ± 3.07   75.02 ± 2.90   74.71 ± 3.04   74.65 ± 2.69   72.58 ± 3.97
SPAM                     87.98 ± 2.70   88.04 ± 2.95   87.90 ± 2.49   87.94 ± 2.70   85.74 ± 3.90   87.04 ± 5.37
VEHICLE                  93.10 ± 2.73   92.59 ± 2.80   92.45 ± 2.83   93.24 ± 2.82   92.38 ± 3.31   93.03 ± 3.39
VOTES                    93.66 ± 2.55   93.59 ± 2.59   93.66 ± 2.32   93.45 ± 2.59   94.17 ± 2.57   94.03 ± 2.92
WDBC                     95.92 ± 1.65   95.61 ± 1.70   95.89 ± 1.48   95.92 ± 1.65   95.67 ± 1.55   96.06 ± 1.31
BCI                      66.50 ± 4.06   65.83 ± 3.80   65.86 ± 4.89   65.86 ± 4.40   68.46 ± 5.01   67.48 ± 5.20
COIL                     78.95 ± 3.15   78.96 ± 3.24   79.07 ± 3.89   78.70 ± 3.42   80.10 ± 2.53   81.39 ± 2.38
DIGIT1                   89.90 ± 1.11   89.29 ± 2.70   90.00 ± 1.18   89.87 ± 1.16   89.30 ± 1.33   89.73 ± 1.45
USPS                     85.39 ± 2.38   85.62 ± 2.26   85.97 ± 1.91   85.54 ± 2.45   85.50 ± 2.13   84.71 ± 2.27
Average accuracy         83.35          83.27          83.48          83.33          83.34          83.01
# within 1% of highest   12/15          12/15          12/15          12/15          10/15          7/15

Flip Prop                RLR            ER             pSLR           dSLR           SVM            TSVM
AUSTRA                   84.98 ± 2.18   85.00 ± 2.43   85.07 ± 2.86   85.46 ± 2.35   85.11 ± 2.32   71.78 ± 7.05
BCW                      96.27 ± 1.39   96.07 ± 1.50   96.64 ± 1.46   96.47 ± 1.36   96.00 ± 1.76   95.80 ± 2.80
GERMAN                   68.77 ± 2.36   68.35 ± 2.27   68.05 ± 2.42   69.61 ± 2.25   68.30 ± 2.27   57.57 ± 4.00
HEART                    80.36 ± 4.75   80.10 ± 4.62   81.82 ± 3.60   81.98 ± 3.61   79.06 ± 4.73   62.71 ± 4.34
INON                     83.96 ± 3.52   82.75 ± 3.90   82.81 ± 3.94   83.75 ± 2.64   80.96 ± 5.02   59.22 ± 8.40
LIVER                    60.70 ± 6.60   60.30 ± 6.71   62.00 ± 6.91   62.83 ± 5.80   59.70 ± 8.82   54.30 ± 2.19
PIMA                     71.39 ± 3.53   71.50 ± 3.38   72.03 ± 2.95   71.84 ± 3.28   71.66 ± 3.28   61.80 ± 2.86
SPAM                     88.52 ± 2.38   89.06 ± 2.97   88.60 ± 2.75   88.44 ± 2.34   86.68 ± 4.28   87.22 ± 5.30
VEHICLE                  89.93 ± 6.31   88.79 ± 5.70   91.55 ± 5.55   93.34 ± 1.97   88.66 ± 4.02   70.62 ± 4.63
VOTES                    92.72 ± 2.04   92.66 ± 2.25   93.31 ± 1.55   93.31 ± 1.58   92.72 ± 2.92   81.03 ± 6.02
WDBC                     96.20 ± 1.59   96.33 ± 1.40   96.71 ± 1.58   97.22 ± 1.68   95.97 ± 1.43   80.23 ± 4.81
BCI                      62.78 ± 3.81   62.71 ± 4.10   65.86 ± 5.13   66.35 ± 4.91   65.04 ± 4.21   60.83 ± 3.83
COIL                     71.86 ± 6.59   72.36 ± 6.93   72.20 ± 7.72   73.45 ± 6.56   69.01 ± 9.81   66.28 ± 2.90
DIGIT1                   87.93 ± 2.11   87.08 ± 4.38   87.68 ± 2.60   88.89 ± 2.73   87.16 ± 2.39   73.10 ± 2.03
USPS                     82.05 ± 3.42   82.29 ± 3.29   83.74 ± 3.21   83.65 ± 3.17   81.16 ± 3.06   64.95 ± 1.90
Average accuracy         81.23          81.02          81.87          82.44          80.48          69.83
# within 1% of highest   8/15           6/15           10/15          15/15          4/15           1/15

Table 1: Classification accuracy in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 100. Subscript indicates that intercept adjustment is applied (see the text).

Care is needed to define classifiers on test data. In the Homo Prop scheme, the 4 existing methods are applied as usual, and accordingly the classifiers from our methods are the sign of , where are the class sizes in the labeled training data. In the Flip Prop scheme, the classifiers from RLR, ER, and SVM are the sign of , and those from our methods are the sign of . Hence the intercepts of linear predictors are adjusted by assuming 1:1 class proportions in the test data. This assumption is often invalid in our experiments, but seems neutral when the actual class proportions in test data are unknown. The "linear predictor" is converted by logit from class probabilities for SVM, but this is currently unavailable for TSVM. Alternatively, class weights can be used in SVM, but this technique has not been developed for TSVM.
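A small hedged sketch of the intercept adjustment just described, based on our reading (shifting a linear decision value from the labeled training class odds n1/n0 to assumed 1:1 test proportions); it is not the paper's exact formula.

```python
import numpy as np

def adjust_intercept(decision_values, n1, n0, test_odds=1.0):
    """Shift linear decision values so that a classifier trained with class
    counts (n1, n0) is thresholded as if the test class odds were `test_odds`
    (1.0 corresponds to assumed 1:1 test proportions).  Our reading of the
    adjustment, stated here as an assumption."""
    return decision_values - np.log(n1 / n0) + np.log(test_odds)

# usage sketch: predicted labels on a test set given decision values f(x)
# y_hat = (adjust_intercept(f_test, n1, n0) > 0).astype(int)
```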

Table 1 presents the results with labeled data size 100. See the Supplement for those with labeled data size 25 and AUC results. In the Homo Prop scheme, the logistic-type methods, RLR, ER, pSLR, and dSLR, perform similarly to each other, and noticeably better than SVM and TSVM in terms of accuracy achieved within 1% of the highest (in bold). While unstable performances of SVM and TSVM have been previously noticed (e.g., Li & Zhou 2015), such good performances of RLR and ER on these benchmark datasets appear not to have been reported before. In the Flip Prop scheme, our methods, dSLR and pSLR, achieve the best two performances, sometimes with considerable margins of improvement over other methods. In this case, all methods except TSVM are applied with intercept adjustment as described above. Because it may be unknown in practice which proportion scheme holds, the results with intercept adjustment in the Homo Prop scheme are reported in the Supplement. Our methods still achieve close to the best performance among the methods studied.

6 Conclusion

We develop an extension of logistic regression for semi-supervised learning, with strong support from statistical theory, algorithms, and numerical results. There are various questions of interest for future work. Our approach can be readily extended by employing nonlinear predictors such as kernel representations or neural networks. Further experiments with such extensions are desired, as well as applications to more complex text and image classification.

References

Amini, M.R. & Gallinari, P. (2002) Semi-supervised logistic regression. Proceedings of the 15th European Conference on Artificial Intelligence, 390–394.

Bartlett, P., Jordan, M., & McAuliffe, J.  (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138–156.

Celeux, G. & Govaert, G.  (1992) A classification EM algorithm and two stochastic versions. Computational Statistics and Data Analysis, 14, 315–332.

Chapelle, O., Zien, A. & Schölkopf, B.  (2006) Semi-Supervised Learning. MIT Press.

Cozman, F., Cohen, I. & Cirelo, M. (2003) Semi-supervised learning of mixture models. Proceedings of the 20th International Conference on Machine Learning, 99–106.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22.

Grandvalet, Y., & Bengio, Y.  (2005) Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems 17, 529–536.

Joachims, T. (1999) Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 200–209.

Kiefer, J. & Wolfowitz, J. (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–906.

Krishnapuram, B., Williams, D., Xue, Y., Carin, L., Figueiredo, M. & Hartemink, A.J.  (2005) On semi-supervised classification. Advances in Neural Information Processing Systems 17, 721–728.

Li, Y.-F. & Zhou, Z.-H. (2015) Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 175–188.

Lin, Y. (2002) Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259–275.

Mann, G.S. & McCallum, A. (2007) Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th International Conference on Machine Learning, 593–600.

Manski, C.F.  (1988) Analog Estimation Methods in Econometrics, Chapman & Hall.

Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Owen, A.B.  (2001) Empirical Likelihood. Chapman & Hall/CRC.

Prentice, R.L. & Pyke, R.  (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.

Qin, J.  (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630.

Qin, J. (1999) Empirical likelihood ratio based confidence intervals for mixture proportions. Annals of Statistics, 27, 1368–1384.

Tan, Z.  (2009) A note on profile likelihood for exponential tilt mixture models. Biometrika, 96, 229–236.

Vapnik, V. (1998) Statistical Learning Theory. Wiley-Interscience.

Wang, J., Shen, X. & Pan, W. (2009) On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10, 719–742.

White, H.  (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.

Zhu, X.J.  (2008) Semi-supervised learning literature survey. Technical Report, University of Wisconsin-Madison, Department of Computer Sciences.

Zou, F., Fine, J.P. & Yandell, B.S.  (2002) On empirical likelihood for a semiparametric mixture model. Biometrika, 89, 61–75.

 

Supplementary Material for
"Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models"

Xinwei Zhang & Zhiqiang Tan
Department of Statistics, Rutgers University, USA
xinwei.zhang@rutgers.edu, ztan@stat.rutgers.edu

I Introduction

We provide additional material to support the content of the paper. All equation and proposition numbers referred to are from the paper, except S1, S2, etc.

II Illustration

We provide a simple example to highlight comparison between new and existing methods. A labeled sample of size 100 is drawn, where 20 are from bivariate Gaussian, , with mean and diagonal variance matrix , and 80 are from bivariate Gaussian, , with mean and diagonal variance matrix . An unlabeled sample of size 1000 is drawn, where are from and from and then the labels are removed. This is similar to the Flip Prop scheme in numerical experiments in Section 5, where the class proportions in unlabeled data differ from those in labeled data. The training set including both labeled and unlabeled data is then rescaled such that the root mean square of each feature is 1, as shown in Figure S1.
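A minimal Python sketch of this construction is given below; the Gaussian means and variances are placeholders (the actual values are not reproduced above), and the 1:1 split of the unlabeled sample is an assumption. The final step rescales each feature to unit root mean square, as described.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder Gaussian parameters (the actual means/variances are not shown above).
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sd0, sd1 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
# Labeled sample: 20 from class 1, 80 from class 0 (as described).
X_lab = np.vstack([rng.normal(mu1, sd1, size=(20, 2)),
                   rng.normal(mu0, sd0, size=(80, 2))])
y_lab = np.concatenate([np.ones(20), np.zeros(80)])
# Unlabeled sample of size 1000; the 500/500 class split is an assumption.
X_unl = np.vstack([rng.normal(mu1, sd1, size=(500, 2)),
                   rng.normal(mu0, sd0, size=(500, 2))])
# Rescale so that the root mean square of each feature is 1 (as described).
X_all = np.vstack([X_lab, X_unl])
rms = np.sqrt((X_all ** 2).mean(axis=0))
X_lab, X_unl = X_lab / rms, X_unl / rms
```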


Figure S1: Training data from bivariate Gaussian


Figure S2: Decision lines with bivariate Gaussian data

Figure S2 (rows 1 to 5) shows the decision lines from, respectively, ridge logistic regression (RLR), entropy regularization (ER), SVM, TSVM, and direct SLR (dSLR). In the left column are the decision lines without intercept adjustment (corresponding to an assumption of 1:4 class proportions in test data, as in the labeled training data), and in the right column are those with intercept adjustment (corresponding to an assumption of 1:1 class proportions in test data, as in the unlabeled training data), as described in Section 5. In practice, the class proportions in test data may be unknown and hence some assumption is needed. (The assumption of 1:1 class proportions in test data is used to define classifiers in the Flip Prop scheme in Section 5, even though this assumption is violated for a majority of the datasets studied; see Table S1.) For ease of comparison, the intercept adjustment is directly applied to for SVM and TSVM, instead of the "linear predictor" converted by logit from class probabilities (if available), which would yield a nonlinear decision boundary. Alternatively, class weights can be used in SVM to account for differences in class proportions between training and test data. But this technique has not been developed for TSVM.

For each method, eight decision lines (black or blue) are plotted, using 8 values of a tuning parameter. Some of the lines may fall outside the plot region. The blue lines correspond to the least amount of penalization used, that is, the smallest and , and the largest . See Section V later for a description of the tuning parameters involved. For RLR, is varied uniformly from . For ER, is varied uniformly from to 1, while is fixed at 0, to isolate the effect of entropy regularization. For SVM and TSVM, is varied uniformly from . For TSVM, the parameter is automatically tuned when using SVM (Joachims 1999). For dSLR, is varied uniformly from , while is fixed at 0.

Two oracle lines are drawn in each plot. The red line is computed by logistic regression and the purple line is computed by SVM with , from an independent labeled sample of size 4000 with 1:4 class proportions (left column) or 1:1 class proportions (right column), which is transformed by the same scale as the original training set. The red and purple oracle lines differ only slightly in the left column, but are virtually identical in the right column. It should be noted that these oracle lines are not the optimal, Bayes decision boundary, because the log density ratio between the classes is linear in but nonlinear in due to the different variances of .

From these plots, we see the following comparison. First, the least penalized line (blue) from our method dSLR is much closer to the oracle lines (red and purple) than those from the other methods, whether or not intercept adjustment is applied. This gives numerical support for Fisher consistency of our method, given that the labeled size 100 and unlabeled size 1000 are reasonably large compared with the feature dimension 2. On the other hand, in spite of the relatively large labeled size, the lines from non-penalized logistic regression and SVM based on labeled data alone still differ noticeably from the oracle lines. Hence this also shows that our method can exploit unlabeled data together with labeled data to achieve a better approximation to the oracle lines.

Second, with suitable choices of tuning parameters, some of the decision lines from existing methods can be reasonably close to the oracle lines. In fact, such cases of good approximation can be found from the supervised methods RLR and SVM, but not from the semi-supervised methods ER and TSVM. This indicates potentially unstable performances of ER and TSVM, particularly in the current setting where the class proportions in unlabeled data differ from those in labeled data. Moreover, SVM seems to perform noticeably worse in the right column, possibly due to intercept adjustment, than in the left column, where the class proportions in test data underlying the oracle lines are identical to those in the labeled training data (hence a more favorable setting).

Iii EM algorithm for direct SLR

We present an EM algorithm to numerically maximize (16) for direct SLR, based on the SLR model defined by (12)–(14). We introduce the following data augmentation. Given the pooled data , let

(S1)

Equivalently, can be denoted as , such that