In this paper, we analyze statistical properties of semi-supervised learning. In standard supervised learning, only labeled data is observed, and the goal is to estimate the relation between the input and the output. In semi-supervised learning [chapelle06:_semi_super_learn], unlabeled data is also obtained in addition to the labeled data. In real-world data such as text data, we can often obtain both labeled and unlabeled data. A typical example is the case where the input is the text of an article and the output is the tag of the article. Tagging articles demands a lot of effort. Hence, labeled data is scarce, while unlabeled data is abundant. In semi-supervised learning, how to exploit the unlabeled data is an important issue.
In standard semi-supervised learning, statistical models of the joint probability, i.e., generative models, are often used to incorporate the information in the unlabeled data into the estimation. For example, under a parametric statistical model, the information in the unlabeled data is used to estimate the model parameter via the marginal probability of the input. The amount of information in unlabeled samples is studied in [castelli96, dillon10:_asymp_analy_gener_semi_super_learn, sinha07:_value_label_unlab_examp_model_imper]. This approach has been developed to deal with various data structures. For example, semi-supervised learning with the manifold assumption or the cluster assumption has been studied along this line [belkin04:_semi_super_learn_rieman_manif, DBLP:conf/nips/LaffertyW07]. Under some assumptions on generative models, it has been shown that unlabeled data is useful for improving the prediction accuracy.
Statistical models of the conditional probability, i.e., discriminative models, are also used in semi-supervised learning. It seems that the unlabeled data is not very useful for the estimation of the conditional probability, since the marginal probability of the input carries no information about the conditional probability [lasserre06:_princ_hybrid_gener_discr_model, seeger01:_learn, zhang00]. Indeed, the maximum likelihood estimator using a parametric model of the conditional probability is not affected by the unlabeled data. Sokolovska et al. [sokolovska08], however, proved that even under discriminative models, unlabeled data is still useful for improving the prediction accuracy of the learning method that uses only labeled data.
Semi-supervised learning methods generally work well under some assumptions on the population distribution and the statistical models. However, it has also been reported that semi-supervised learning can degrade the estimation accuracy, especially when a misspecified model is applied [cozman03:_semi, grandvalet05:_semi, nigam99:_text_class_label_unlab_docum_em]. Hence, safe semi-supervised learning methods are desired. The learning algorithms proposed by Sokolovska et al. [sokolovska08] and Li and Zhou [li11:_towar_makin_unlab_data_never_hurt] have a theoretical guarantee that the unlabeled data does not degrade the estimation accuracy.
In this paper, we extend the study of [sokolovska08]. To incorporate the information in the unlabeled data into the estimator, Sokolovska et al. [sokolovska08] used the weighted estimator. In the estimation of the weight function, a well-specified model for the marginal probability was assumed. This is a strong assumption for semi-supervised learning. To overcome this drawback, we apply a density-ratio estimator to the estimation of the weight function [sugiyama12:_machin_learn_non_station_envir, sugiyama12:_densit_ratio_estim_machin_learn]. We prove that semi-supervised learning with density-ratio estimation improves on standard supervised learning. Our method is applicable not only to classification problems but also to regression problems, whereas many semi-supervised learning methods focus on binary classification problems.
This paper is organized as follows. In Section 2, we show the problem setup. In Section 3, we introduce the weighted estimator investigated by Sokolovska et al. [sokolovska08]. In Section 4, we briefly explain the density-ratio estimation. In Section 5, the asymptotic variance of the estimators under consideration is studied. Section 6 is devoted to proving that the weighted estimator using labeled and unlabeled data outperforms supervised learning using only labeled data. In Section 7, numerical experiments are presented. We conclude in Section 8.
2 Problem Setup
We introduce the problem setup. We suppose that the probability distribution of training samples is given as
where is the conditional probability of given , and and are the marginal probabilities on . Here, is regarded as the probability in the testing phase, i.e., the test data is distributed from the joint probability , and the estimation accuracy is evaluated under the test probability. The paired sample is called “labeled data”, and the unpaired sample is called “unlabeled data”. Our goal is to estimate the conditional probability or the conditional expectation based on the labeled and unlabeled data in (1). When is a finite set, the problem is called the classification problem. For , the estimation of is referred to as the regression problem.
We describe the assumption on the marginal distributions, and in (1). In the context of the covariate shift adaptation [JSPI:Shimodaira:2000], the assumption that is employed in general. The weighted estimator with the weight function is used to correct the estimation bias induced by the covariate shift; see [sugiyama12:_machin_learn_non_station_envir, sugiyama12:_densit_ratio_estim_machin_learn] for details. Hence, the estimation of the weight function is important to achieve a good estimation accuracy. On the other hand, in the semi-supervised learning [chapelle06:_semi_super_learn], the equality is assumed, and often is much larger than . This setup is also quite practical. For example, in the text data mining, the labeled data is scarce, while the unlabeled data is abundant. In this paper, we assume that the equality
We define the following semiparametric model,
for the estimation of the conditional probability , where is the set of all probability densities of the covariate . The parameter of interest is , and is the nuisance parameter. The model does not necessarily include the true probability , i.e., there may not exist the parameter such that holds. This is the significant condition, when we consider the improvement of the inference with the labeled and unlabeled data. Our target is to estimate the parameter satisfying
in which denotes the expectation with respect to the population distribution. If the model includes the true probability, we have
due to the non-negativity of the Kullback-Leibler divergence [cover06:_elemen_of_infor_theor_wiley]. In the misspecified setup, however, the equality is not guaranteed.
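The argument behind this equality can be spelled out as follows. The notation is assumed here for illustration: $p(y|x)$ is the true conditional density, $p(y|x;\theta)$ is the model, and $r(x)$ is the marginal density of the covariate under which the expectation $E$ is taken.

```latex
\begin{align*}
E[\log p(y|x)] - E[\log p(y|x;\theta)]
  &= \int r(x) \left\{ \int p(y|x) \log\frac{p(y|x)}{p(y|x;\theta)}\, dy \right\} dx \\
  &= \int r(x)\, \mathrm{KL}\bigl(p(\cdot|x) \,\|\, p(\cdot|x;\theta)\bigr)\, dx \;\ge\; 0 .
\end{align*}
```

Equality holds if and only if $p(\cdot|x;\theta) = p(\cdot|x)$ for almost all $x$ with $r(x) > 0$. Hence, a maximizer of the expected log-likelihood recovers the true conditional density whenever the model contains it, and no such conclusion is available when the model is misspecified.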
3 Weighted Estimator in Semi-supervised Learning
We introduce the weighted estimator. For the estimation of under the model (3), we consider the maximum likelihood estimator (MLE). For the statistical model , let be the score function
where denotes the gradient with respect to the model parameter. Then, for any , we have
In addition, the extremal condition of (4) leads to
Hence, we can estimate the conditional density by , where is a solution of the estimation equation
Under the regularity condition, the MLE has the statistical consistency to the parameter in (4); see [vaart98:_asymp_statis] for details. In addition, the score function is an optimal choice among Z-estimators [vaart98:_asymp_statis], when the true probability density is included in the model . This implies that the efficient score of the semiparametric model is the same as the score function of the model . This is because, in the semiparametric model , the tangent space of the parameter of interest is orthogonal to that of the nuisance parameter. Here, the asymptotic variance matrix of the estimated parameter is employed to compare the estimation accuracy.
Next, we consider the setup of the semi-supervised learning. When the model is specified, we find that the estimator (5) using only the labeled data is efficient. This is obtained from the results of numerous studies about the semiparametric inference with missing data; see [nan09:_asymp, robins94:_estim] and references therein.
Suppose that the model is misspecified. Then, it is possible to improve the MLE in (5) by using the weighted MLE [sokolovska08]. The weighted MLE is defined as a solution of the equation,
where is a weight function. Suppose that . Then the law of large numbers leads to the probabilistic convergence,
Hence the estimator based on (6) will provide a good estimator of under the marginal probability . This indicates that is expected to approximate over the region on which is large. The weight function serves to adjust the bias of the estimator under the covariate shift [JSPI:Shimodaira:2000]. In the setup of semi-supervised learning, however, holds, and it is known beforehand. Hence, one may think that there is no need to estimate the weight function. Sokolovska et al. [sokolovska08] showed that estimation of the weight function is useful, even though it is already known in semi-supervised learning.
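A minimal sketch of a weighted estimating equation, assuming a Gaussian linear regression model: the score is then proportional to the residual times the basis vector, and the weighted MLE reduces to weighted least squares. The function name `weighted_least_squares` and the design matrix `B` are hypothetical and are introduced only for illustration.

```python
import numpy as np

def weighted_least_squares(B, y, w):
    """Solve sum_i w_i * (y_i - B_i @ theta) * B_i = 0 for theta.

    B : (n, d) basis functions evaluated at the labeled covariates
    y : (n,) responses
    w : (n,) weights w(x_i), known or estimated
    """
    B = np.asarray(B, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    BtW = B.T * w  # (d, n): column i of B.T scaled by w_i
    return np.linalg.solve(BtW @ B, BtW @ y)
```

Setting all weights to one gives back the ordinary least-squares estimator, which corresponds to the unweighted estimating equation.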
We briefly introduce the result in [sokolovska08]. Let the set be finite. Then, is a finite dimensional parametric model. Suppose that the sample size of the unlabeled data is enormous, and that the probability function on is known with a high degree of accuracy. The probability is estimated by the maximum likelihood estimator based on the samples in the labeled data. Then, Sokolovska, et al. [sokolovska08] showed that the weighted MLE (6) with the estimated weight function improves the naive MLE, when the model is misspecified, i.e., .
Shimodaira [JSPI:Shimodaira:2000] pointed out that the weighted MLE using the exact density ratio has the statistical consistency to the target parameter , when the covariate shift occurs. Under the regularity condition, it is rather straightforward to see that the weighted MLE using the estimated weight function also converges to in probability, since converges to in probability. Sokolovska’s result implies that when holds, the weighted MLE using the estimated weight function improves the weighted MLE using the true density ratio in the sense of the asymptotic variance of the estimator.
The phenomenon above is similar to the statistical paradox analyzed by [henmi04, henmi07:_impor_sampl_via_estim_sampl]. In semi-parametric estimation, Henmi and Eguchi [henmi04] pointed out that the estimation accuracy of the parameter of interest can be improved by estimating the nuisance parameter, even when the nuisance parameter is known beforehand. Hirano et al. [hirano03:_effic_estim_averag_treat_effec] also pointed out that the estimator with the estimated propensity score is more efficient than the estimator using the true propensity score in the estimation of the average treatment effects. Here, the propensity score corresponds to the weight function in our context. The degree of improvement is described by using the projection of the score function onto the subspace defined by the efficient score for the semi-parametric model. In our analysis, the projection of the score function also plays an important role, as shown in Section 6.
For the estimation of the weight function in (6), we apply the density-ratio estimator [sugiyama12:_machin_learn_non_station_envir, sugiyama12:_densit_ratio_estim_machin_learn] instead of estimating the probability densities separately. We show that the density-ratio estimator provides a practical method for the semi-supervised learning. In the next section, we introduce the density-ratio estimation.
4 Density-Ratio Estimation
Density-ratio estimators are available to estimate the weight function . Recently, methods of the direct estimation for density-ratios have been developed in the machine learning community [sugiyama12:_machin_learn_non_station_envir, sugiyama12:_densit_ratio_estim_machin_learn]. We apply the density-ratio estimator to estimate the weight function instead of using the estimator of each probability density.
We briefly introduce the density-ratio estimator according to [Biometrika:Qin:1998]. Suppose that the following training samples are observed,
Our goal is to estimate the density-ratio . The -dimensional parametric model for the density-ratio is defined by
where is assumed. For any function which may depend on the parameter , one has the equality
Hence, the empirical approximation of the above equation is expected to provide an estimation equation of the density-ratio. The empirical approximation of the above equality under the parametric model of is given as
This estimation equation provides a direct estimator of the density-ratio based on moment matching with the chosen moment function.
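The identity underlying moment matching can be derived in one line. The notation is assumed here for illustration: the first sample $x_1,\dots,x_n$ follows the density $q$, the second sample $x'_1,\dots,x'_{n'}$ follows $p$, $w = p/q$ is the density ratio, and $\eta$ is the moment function.

```latex
\begin{gather*}
\int \eta(x)\, w(x)\, q(x)\, dx
  = \int \eta(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \int \eta(x)\, p(x)\, dx , \\
\text{so that}\qquad
\frac{1}{n}\sum_{i=1}^{n} \eta(x_i)\, w(x_i;\alpha)
  \;\approx\; \frac{1}{n'}\sum_{j=1}^{n'} \eta(x'_j)
\end{gather*}
```

Replacing the approximation by an equality and using as many moment functions as there are parameters in $\alpha$ gives an estimating equation for the density-ratio model.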
Qin [Biometrika:Qin:1998] proved that the optimal choice of is given as
where . By using above, the asymptotic variance matrix of is minimized among the set of moment matching estimators, when is realized by the model . Hence, (10) is regarded as the counterpart of the score function for parametric probability models.
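A minimal sketch of a moment-matching density-ratio estimator, under assumptions that are not taken from the text: the model is the exponential form `w(x; alpha) = exp(phi(x) @ alpha)` with a constant included in the basis, and the moment function is simply the basis vector `phi` rather than the optimal choice (10). The function name `fit_density_ratio` is hypothetical. With these choices the estimating equation is the gradient of a convex function, so Newton's method with backtracking is sufficient.

```python
import numpy as np

def fit_density_ratio(phi_q, phi_p, n_iter=50, tol=1e-10):
    """Moment-matching fit of w(x; alpha) = exp(phi(x) @ alpha).

    phi_q : (n, k) basis functions at samples from the denominator density q
    phi_p : (m, k) basis functions at samples from the numerator density p
    Solves mean_q[w(x; alpha) * phi(x)] = mean_p[phi(x)] for alpha.
    """
    phi_q = np.asarray(phi_q, dtype=float)
    phi_p = np.asarray(phi_p, dtype=float)
    target = phi_p.mean(axis=0)
    alpha = np.zeros(phi_q.shape[1])

    def objective(a):
        # Convex function whose gradient is the moment-matching residual.
        return np.exp(phi_q @ a).mean() - target @ a

    for _ in range(n_iter):
        w = np.exp(phi_q @ alpha)
        grad = (phi_q * w[:, None]).mean(axis=0) - target
        if np.linalg.norm(grad) < tol:
            break
        hess = (phi_q * w[:, None]).T @ phi_q / len(phi_q)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Backtracking keeps the Newton iteration from overshooting.
        while objective(alpha - t * step) > objective(alpha) and t > 1e-8:
            t *= 0.5
        alpha = alpha - t * step
    return alpha

# Usage: two Gaussian samples whose true ratio lies in the model.
rng = np.random.default_rng(0)
x_q = rng.normal(0.0, 1.0, size=400)
x_p = rng.normal(0.5, 1.0, size=600)
basis = lambda x: np.column_stack([np.ones_like(x), x])
alpha_hat = fit_density_ratio(basis(x_q), basis(x_p))
w_hat = np.exp(basis(x_q) @ alpha_hat)  # estimated ratio at the q-sample
```

Because the basis contains a constant, the fitted weights average to one over the denominator sample, which is the empirical counterpart of the ratio integrating to one against `q`.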
5 Semi-Supervised Learning with Density-Ratio Estimation
We study the asymptotics of the weighted MLE (6) using the estimated density-ratio. The estimation equation is given as
Here, the statistical models (3) and (8) are employed. The first equation is used for the estimation of the parameter of the model , and the second equation is used for the estimation of the density-ratio . The estimator defined by (11) is referred to as density-ratio estimation based on semi-supervised learning, or DRESS for short.
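The two-equation structure can be sketched for least-squares regression as follows. This is an illustration under stated assumptions, not the exact estimator (11): the density-ratio model is taken to be linear in a basis, `w(x) = ratio_basis(x) @ alpha`, with the basis itself as moment function, so that the second equation has a closed-form solution, and the first equation becomes weighted least squares. The linear model does not enforce non-negative weights. All function and variable names are hypothetical.

```python
import numpy as np

def dress_linear_regression(x_lab, y_lab, x_unlab, reg_basis, ratio_basis):
    """Two-step sketch of a DRESS-style estimator for least-squares regression.

    Step 1: fit w(x) = ratio_basis(x) @ alpha by matching moments between
            the labeled and the unlabeled covariates.
    Step 2: solve the weighted least-squares equations with weights w(x_i).
    """
    phi_lab = ratio_basis(x_lab)        # (n, k)
    phi_unlab = ratio_basis(x_unlab)    # (m, k)
    gram = phi_lab.T @ phi_lab / len(phi_lab)
    alpha = np.linalg.solve(gram, phi_unlab.mean(axis=0))
    w = phi_lab @ alpha                 # estimated weights at labeled points

    B = reg_basis(x_lab)                # (n, d)
    BtW = B.T * w
    theta = np.linalg.solve(BtW @ B, BtW @ y_lab)
    return theta, w

# Usage: a linear model fitted to data with a quadratic component,
# i.e. a misspecified regression model.
rng = np.random.default_rng(0)
x_lab = rng.normal(size=100)
x_unlab = rng.normal(size=1000)
y_lab = x_lab + 0.3 * x_lab**2 + 0.1 * rng.normal(size=100)

reg_basis = lambda x: np.column_stack([np.ones_like(x), x])
ratio_basis = lambda x: np.column_stack([np.ones_like(x), x, x**2])
theta, w = dress_linear_regression(x_lab, y_lab, x_unlab, reg_basis, ratio_basis)
```

Since the labeled and unlabeled covariates share the same distribution in this usage example, the fitted weights fluctuate around one; the weighted fit differs from the unweighted one only through these sample-dependent fluctuations.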
In Sokolovska et al. [sokolovska08], the marginal probability density is estimated by using a well-specified parametric model. Clearly, preparing a well-specified parametric model is not practical when is not a finite set. On the other hand, it is easy to prepare a specified model of the density-ratio , whenever holds in (1). The model (8) is an example. Indeed, holds. Hence, the assumption that the true weight function is realized by the model is not an obstacle in semi-supervised learning.
and be the parameter such that , i.e., . We prepare some notations: , . The Jacobian of the score function with respect to the parameter is denoted as , i.e., the by matrix whose element is given as . The variance matrix and the covariance matrix under the probability are denoted as and , respectively. Without loss of generality, we assume that at is represented as
where is an arbitrary function orthogonal to , i.e., holds. If does not have any component which is represented as a linear transformation of , the estimator would be degenerate. Under the regularity condition, the estimated parameters, and , converge to and , respectively. The asymptotic expansion of (11) around leads to
Hence, we have
Therefore, we obtain the asymptotic variance,
On the other hand, the variance of the naive MLE, , defined as a solution of (5) is given as
6 Maximum Improvement by Semi-Supervised Learning
Given the model for the density-ratio , we compare the asymptotic variance matrices of the estimators, and . First, let us define
i.e., is the projection of the score function onto the subspace consisting of all functions depending only on , where the inner product is defined by the expectation under the joint probability . Note that the equality holds. Let the matrix be
Then, a simple calculation yields that the difference of the variance matrix between and is equal to
In the second equality, we supposed that converges to a positive constant. When is positive definite, the estimator using the labeled and unlabeled data improves the estimator using only the labeled data. It is straightforward to see that the improvement is not attained if holds. In general, the score function satisfies , if the model is specified. When the model of the conditional probability is misspecified, however, there is a possibility that the proposed estimator (11) outperforms the MLE .
We derive the optimal moment function for the estimation of the parameter . The optimal can be different from (10). We prepare some notations. Let be the -valued function on , each element of which is the projection of each element of onto the subspace spanned by . Here, the inner product is defined by the expectation under the marginal probability . In addition, let be the projection of onto the orthogonal complement of the subspace, i.e., .
We assume that the model of the density-ratio is defined as
with the basis functions satisfying . Suppose that is invertible, and that the rank of is equal to the dimension of the parameter , i.e., row full rank. We assume that the moment function at is represented as
where is a function orthogonal to , i.e., holds. Then, an optimal is given as
For the optimal choice of , the maximum improvement is given as
Due to , one has and . Hence, one has . Our goal is to find which minimizes in (12) in the sense of positive definiteness. The orthogonal decomposition leads to
because of the orthogonality between and , and the equality . Hence, satisfying
is an optimal choice. Since the matrix is row full rank, a solution of the above equation is given by
We obtain the maximum improvement of by using the equalities and . ∎
Suppose that the optimal moment function presented in Theorem 1 is used with the score function . Then, the improvement (15) is maximized when is minimized. Hence, the model with the lower dimensional parameter is preferable as long as the assumption in Theorem 1 is satisfied. This is intuitively understandable, because the statistical perturbation of the density-ratio estimator is minimized, when the smallest model is employed.
Suppose that the basis functions, , are nearly orthogonal to , i.e., is close to the null matrix. Then, the improvement is close to . As a result, we have in which the supremum is taken over the basis of the density-ratio model satisfying the assumption in Theorem 1. However, basis functions satisfying the exact equality are useless, because the equality leads to and thus the equality (12) is reduced to
This result implies that there is the singularity at the basis function such that .
It is not practical to apply the optimal function defined by (14). The optimal moment function depends on , and one needs information on the probability to obtain the explicit form of . The estimation of needs non-parametric estimation, since the model misspecification of is significant in our setup. Thus, we consider a more practical estimator for the density ratio. Suppose that holds for the moment function . For example, the optimal moment function (10) satisfies at , i.e., . For the density-ratio model with and the moment function satisfying , a brief calculation yields that
Hence, the improvement is attained, when holds. As an interesting fact, we see that the larger model attains the better improvement in (16). Indeed, gets close to , when the density-ratio model becomes large. Hence, the non-parametric estimation of the density-ratio may be a good choice to achieve a large improvement for the estimation of the conditional probability. This is totally different from the case that the optimal presented in Theorem 1 is used in the density-ratio estimation. The relation between using the optimal and with is illustrated in Figure 1. In the limit of the dimension of , both variance matrices converge to monotonically.
Let be the score function of the model , where is the vector consisting of basis functions and is a known parameter. Then, one has . Suppose that the true conditional probability leads to the regression function , where for all . Then, one has and . Hence, the upper bound of the improvement is governed by the degree of the model misspecification . According to Theorem 1, an optimal moment function is given as
at , where .
7 Numerical Experiments
We show numerical experiments to compare the standard supervised learning and the semi-supervised learning using DRESS. Both regression problems and classification problems are presented.
7.1 Regression problems
We consider the regression problem with the -dimensional covariate variable shown below.
- labeled data:
- unlabeled data:
- regression model:
- score function:
The parameter in (17) implies the degree of the model misspecification. Let be the target function, , and define
which represents the squared distance from the true function to the linear regression model. On the other hand, the mean square error of the naive least mean square (LMS) estimator, i.e., , is asymptotically equal to , when the model is specified. We use the ratio
as the normalized measure of the model misspecification. When holds, the misspecification of the model can be statistically detected.
First, we use a parametric model for density ratio estimation. For any positive integer , let be the -dimensional vector . The density-ratio model is defined as
having dimensional parameter . We apply the estimator (10) presented by Qin [Biometrika:Qin:1998]. Note that the estimator (10) satisfies at . Hence, the improvement is asymptotically given by (16). Under the setup of and , we compute the mean square errors for LMS estimator and DRESS . The difference of test errors,
is evaluated for each and each dimension of the density ratio, , where the expectation is evaluated over the test samples. The mean square error is calculated by the average over 500 iterations.
Figure 2 shows the results. When the model is specified, i.e., , the LMS estimator performs better than DRESS. Under a practical setup such as , however, DRESS outperforms the LMS estimator. The dependency on the dimension of the density-ratio model is not clearly detected in this experiment. Overall, larger density-ratio models give rather unstable results. Indeed, for DRESS with a large density-ratio model, say the bottom-right panel in Figure 2, the mean square error of DRESS can be large, i.e., the improvement is negative, even when the model misspecification is large.
Next, we compare the LMS estimator and DRESS with a nonparametric estimator of the density-ratio. Here, we use KuLSIF [kanamori12:_statis] as the density-ratio estimator. KuLSIF is a non-parametric estimator of the density-ratio based on the kernel method. Regularization is used to suppress the degrees of freedom of the nonparametric model. In KuLSIF, the kernel function of the reproducing kernel Hilbert space corresponds to the basis function.
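A minimal sketch of a kernel-based, regularized least-squares density-ratio estimator. It follows the uLSIF-style formulation (a ridge-penalized squared loss with Gaussian kernels centered at the numerator samples) and is not the exact KuLSIF estimator of [kanamori12:_statis]; the function names, the kernel width `sigma`, and the regularization parameter `lam` are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n, p) and b (m, p)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_ls_density_ratio(x_de, x_nu, sigma=1.0, lam=0.1):
    """Regularized least-squares fit of w(x) = sum_l alpha_l k(x, c_l).

    x_de : (n, p) samples from the denominator density (labeled covariates)
    x_nu : (m, p) samples from the numerator density (unlabeled covariates)
    The kernel centers c_l are the numerator samples (an assumption here).
    Minimizes 0.5 * mean_de[w^2] - mean_nu[w] + 0.5 * lam * ||alpha||^2.
    Returns a function that evaluates the fitted ratio at new points.
    """
    centers = x_nu
    K_de = gaussian_kernel(x_de, centers, sigma)   # (n, m)
    K_nu = gaussian_kernel(x_nu, centers, sigma)   # (m, m)
    H = K_de.T @ K_de / len(x_de)
    h = K_nu.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: gaussian_kernel(x, centers, sigma) @ alpha

# Usage: both samples come from the same distribution, as in the
# semi-supervised setup, so the true ratio is identically one.
rng = np.random.default_rng(0)
x_de = rng.normal(size=(60, 2))
x_nu = rng.normal(size=(80, 2))
ratio = kernel_ls_density_ratio(x_de, x_nu)
weights = ratio(x_de)  # weights to plug into the weighted estimator
```

In practice `sigma` and `lam` would be chosen by cross-validation on the least-squares criterion; the regularization shrinks the fitted ratio, so the weights are biased toward zero for large `lam`.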
Under the setup of and , we compute the mean square errors by the average over 100 iterations. In Figure 3, the square roots of the mean square errors for the LMS estimator and DRESS are plotted as functions of , i.e., (model error)/(statistical error). When is around , it is statistically hard to detect the model misspecification from training data of the size . When the model is specified, the LMS estimator performs better than DRESS. Under a practical setup such as , however, DRESS with KuLSIF outperforms the LMS estimator. As shown in the asymptotic analysis, the sample size of the unlabeled data affects the estimation accuracy of DRESS. The numerical results show that DRESS with large attains a smaller error than DRESS with small , especially when holds. In the numerical experiment, even DRESS with and slightly outperforms the LMS estimator. This is not supported by the asymptotic analysis. Hence, a more involved theoretical study of the statistical features of semi-supervised learning is needed.
is the standard deviation of the noise involved in the dependent variable.
7.2 Classification problems
As a classification task, we use the spam dataset in the R package “kernlab” [karatzoglou04:_kernl]. The dataset includes 4601 samples. The covariate is 57-dimensional, and its elements represent statistical features of each document. The output is assigned to “spam” or “nonspam”.
For the binary classification problem, we use the logistic model,
where is the dimension of the covariate used in the logistic model. In numerical experiments, varies from 10 to 57; hence, the dimension of the model parameter varies from 11 to 58. We tested DRESS with KuLSIF [kanamori12:_statis] and MLE with randomly chosen labeled training samples and unlabeled training samples. The remaining samples serve as the test data. The score function is used for the estimation.
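A minimal sketch of the weighted MLE for a logistic model with an intercept, assuming labels coded as 0/1 and weights that have already been estimated. The function name `weighted_logistic_mle` and the small `ridge` term (added to the Hessian only, for numerical stability) are assumptions made for illustration. The weighted score equation is solved by Newton's method.

```python
import numpy as np

def weighted_logistic_mle(X, y, w, n_iter=100, tol=1e-10, ridge=1e-8):
    """Newton's method for sum_i w_i * (y_i - sigmoid(z_i @ theta)) * z_i = 0.

    X : (n, d) covariates; an intercept is prepended, z_i = (1, x_i)
    y : (n,) labels coded as 0/1
    w : (n,) non-negative weights
    Returns theta of length d + 1 (intercept first).
    """
    Z = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (w * (y - prob))
        # Negative Hessian of the weighted log-likelihood; the ridge term
        # only stabilizes the linear solve and leaves the root unchanged.
        hess = (Z.T * (w * prob * (1.0 - prob))) @ Z + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

With all weights equal to one this is the ordinary logistic MLE; for linearly separable data the iteration does not converge, so some form of regularization would be needed in that case.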
Table 1 shows the prediction errors with the standard deviation. We also show the p-value of the one-tailed paired t-test for the prediction errors of DRESS and MLE. Small p-values indicate the superiority of DRESS. We notice that the p-value is small when the dimension is not large. In other words, the numerical results agree with the asymptotic theory in Section 6. For relatively high-dimensional models, the prediction error of MLE is smaller than that of DRESS; see the row of in Table 1. The size of the unlabeled data, , also affects the results. Indeed, the p-value becomes small for large . This result is supported by the asymptotic analysis presented in Section 6.