1 Introduction
Semi-supervised learning is a class of machine learning methods that occupy the middle ground between supervised learning, in which all training data are labeled, and unsupervised learning, in which no training data are labeled. Specifically, in addition to the labeled training data, there exist unlabeled inputs. Under certain assumptions on the geometric structure of the input data, such as the cluster assumption or the low-dimensional manifold assumption [7], the use of both labeled and unlabeled data can achieve better prediction accuracy than supervised learning that uses only the labeled data.
Semi-supervised learning has become popular since the acquisition of unlabeled data is relatively inexpensive. A large number of methods have been developed under the framework of semi-supervised learning. For example, [18] proposed that combining labeled and unlabeled data improves the prediction accuracy under the assumption of mixture models. The self-training method [19] and the co-training method [13] were soon applied to semi-supervised learning when mixture models are not assumed. [23] described an approach to semi-supervised clustering based on hidden Markov random fields (HMRFs) that can combine multiple approaches in a unified probabilistic framework. [1] proposed a probabilistic framework for semi-supervised learning incorporating a K-means type hard partition clustering algorithm (HMRF-Kmeans).
[20] proposed the transductive support vector machines (TSVMs), which use the idea of transductive learning by including unlabeled data in the computation of the margin. Transductive learning is a variant of semi-supervised learning that focuses on inferring the correct labels for the given unlabeled data rather than inferring a general rule.
[4] used a convex relaxation of the optimization problem, called semidefinite programming, as a different approach to the TSVMs.
In this article, we focus on a particular semi-supervised method: graph-based semi-supervised learning. In this method, the geometric structure of the input data is represented by a graph, where nodes represent the inputs and edges represent the similarities between them. The similarities are given in a square symmetric similarity matrix (also called a kernel matrix) with one row and one column per input. A larger entry implies that the two corresponding inputs are more similar. Responses are observed only for the labeled data.
[25] proposed the following graph-based learning method,
(1) 
subject to .
Its solution is called the estimated scores. The objective function (1) (named the “hard criterion” hereafter) requires all the estimated scores to be exactly the same as the responses for the labeled data. [8] relaxed this requirement by proposing a soft version (named the “soft criterion” hereafter). We follow an equivalent form given in [26],
(2) 
The soft criterion belongs to the “loss+penalty” paradigm: it searches for the minimizer that achieves a small training error and meanwhile imposes smoothness on the estimated scores through a penalty based on the similarity matrix. It can easily be seen that when the tuning parameter is 0, the soft criterion is equivalent to the hard criterion.

The tuning parameter being 0 in the soft criterion (2) is understood in the following sense: the squared loss has infinite weight, and thereby the estimated scores equal the responses for all labeled data. But the similarity-based penalty still plays a crucial role when it has no conflict with the hard constraints on the labeled data, that is, it provides links between the estimated scores on the labeled and unlabeled data. Therefore, the soft criterion (2) with the tuning parameter equal to 0 becomes the hard criterion (1).
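The trade-off between the two criteria can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions: it assumes the standard form of the penalty from [25], namely the sum of similarity times squared score difference over all pairs of nodes, and it assumes the soft criterion adds the squared loss on the labeled nodes to one half of the tuning parameter times that penalty. The function names, the toy similarity matrix and the factor one half are hypothetical choices, not taken from the text.

```python
def smoothness_penalty(W, f):
    """Sum of w_ij * (f_i - f_j)^2 over all ordered pairs of nodes."""
    size = len(f)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(size) for j in range(size))


def soft_objective(W, f, y, lam):
    """Squared loss on the labeled nodes plus (lam / 2) times the penalty.

    The first len(y) nodes are treated as labeled. The factor 1/2 is an
    assumption of this sketch.
    """
    loss = sum((y[i] - f[i]) ** 2 for i in range(len(y)))
    return loss + 0.5 * lam * smoothness_penalty(W, f)


def meets_hard_constraint(f, y, tol=1e-12):
    """True when the scores match the responses on every labeled node."""
    return all(abs(f[i] - y[i]) <= tol for i in range(len(y)))


# Toy graph (hypothetical): nodes 0 and 1 are labeled, node 2 is unlabeled.
W = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
y = [1.0, 0.0]
f_interpolating = [1.0, 0.0, 0.5]  # fits the labels exactly
f_constant = [0.5, 0.5, 0.5]       # perfectly smooth, ignores the labels

for lam in (0.01, 100.0):
    print(lam,
          soft_objective(W, f_interpolating, y, lam),
          soft_objective(W, f_constant, y, lam))
```

With a small tuning parameter the interpolating scores have the lower objective value, while with a large one the constant scores win, which is the tension between the hard and the soft criterion discussed in the text.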
Researchers have proposed different variants of graph-based learning methods, such as [24] and [2]. We only focus on (1) and (2) in this article.
The theoretical properties of graph-based learning have been studied in the computer science and statistics literature. [5] derived the limit of the Laplacian regularizer when the sample size of unlabeled data goes to infinity. [11] considered the convergence of the Laplacian regularizer on Riemannian manifolds. [3] reinterpreted the graph Laplacian as a measure of intrinsic distances between inputs on a manifold and reformulated the problem as a functional optimization in a reproducing kernel Hilbert space. [16] pointed out that the hard criterion can yield a completely noninformative solution when the size of unlabeled data goes to infinity and the labeled data are finite, that is, the solution can give a perfect fit on the labeled data but remain 0 on the unlabeled data. [14] obtained the asymptotic mean squared error of a different version of the graph-based learning criterion. [2] gave a bound on the generalization error for a slightly different version of (2).
But to the best of our knowledge, no result is available in the literature on a very fundamental question, the consistency of graph-based learning, which is the main focus of this article. Specifically, we want to answer the question of under what conditions the estimated scores converge to the regression function on the unlabeled data, where the regression function is the true probability of a positive label given the input if responses are binary, and the conditional mean of the response given the input if responses are continuous. We will always call this target the regression function for simplicity.
Most of the literature discussed above considered a “functional version” of (1) and (2): a functional optimization problem, whose optimizer is a function, was used as an approximation of the original problem, whose optimizer is a vector, and the behavior of the limit of the graph Laplacian and of the solution was studied. We do not adopt this framework but use a more direct approach. We focus on the original problem and study the relation between the estimated scores and the regression function under the general nonparametric setting. Our approach essentially belongs to the framework of transductive learning, which focuses on the prediction on the given unlabeled data, not the general mapping from inputs to responses. By establishing a link between the optimizer of (1) and the Nadaraya-Watson estimator [15, 21] for kernel regression, we will prove the consistency of the hard criterion. The theorem allows both the labeled and the unlabeled sample sizes to grow. On the other hand, we show that the soft criterion is inconsistent for a sufficiently large tuning parameter. To the best of our knowledge, this is the first result that explicitly distinguishes the hard criterion and the soft criterion of graph-based learning from a theoretical perspective and shows that they have very different asymptotic behaviors.
The rest of the article is organized as follows. In Section 2, we state the consistency result for the hard criterion and give the counterexample for the soft criterion. We prove the consistency result in Section 3. Numerical studies in Section 4 support our theoretical findings. Section 5 concludes with a summary and discussion of future research directions.
2 Main Results
We begin with basic notation and setup. Let the input-response pairs be independently and identically distributed. Here each input is a finite-dimensional vector, and the responses are either binary, labeled as 1 and 0 (the classification case), or continuous (the regression case). The responses of the unlabeled inputs are unobserved.
[25] used a fixed point algorithm to solve the hard criterion (1), which is
(3) 
Note that (3) is not a closed-form solution but an updating formula for the iterative algorithm, since its right-hand side depends on unknown quantities.
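A fixed-point iteration of this kind can be sketched as follows. This is a minimal illustration, assuming the usual harmonic update associated with [25]: each unlabeled score is replaced by the similarity-weighted average of all current scores, while the labeled scores stay fixed at the responses. The function name `harmonic_fixed_point`, the in-place (Gauss-Seidel style) update order, the number of sweeps and the toy graph are assumptions of this sketch.

```python
def harmonic_fixed_point(W, y, n_sweeps=200):
    """Iterate the weighted-average update on the unlabeled nodes.

    The first len(y) nodes are labeled and stay clamped to y. Every
    unlabeled node is assumed to have a positive total similarity.
    """
    n_labeled = len(y)
    n_total = len(W)
    f = list(y) + [0.0] * (n_total - n_labeled)
    for _ in range(n_sweeps):
        for a in range(n_labeled, n_total):
            degree = sum(W[a])
            # The right-hand side uses the current scores, including
            # scores of other unlabeled nodes that are still unknown.
            f[a] = sum(W[a][j] * f[j] for j in range(n_total)) / degree
    return f


# Toy path graph (hypothetical): 0 - 2 - 3 - 1, nodes 0 and 1 labeled.
W = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
y = [1.0, 0.0]
scores = harmonic_fixed_point(W, y)
print(scores[2], scores[3])
```

On this path graph the unlabeled scores settle at values that interpolate linearly between the two labeled endpoints.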
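In order to obtain a closed-form solution for (1), we begin by solving the soft version (2) and then let the tuning parameter tend to 0. Recall the similarity matrix introduced above. Let the degree matrix be the diagonal matrix whose diagonal entries are the row sums of the similarity matrix, and let the unnormalized graph Laplacian be the degree matrix minus the similarity matrix (see [17] for more details). Soft criterion (2) can be written in matrix form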
(4) 
where . Further, let be an by matrix defined as
Then by taking the derivative of (4) with respect to and setting equal to zero, we obtain the solution as follows,
What we are interested in are the estimated scores on the unlabeled data, i.e. . In order to obtain an explicit form for , we use a formula for inverse of a block matrix (see standard textbooks on matrix algebra such as [12] for more details): For any nonsingular square matrix
Write and as block matrices,
By the formula above,
(5) 
Letting the tuning parameter tend to 0, we obtain the solution for the hard criterion (1),
(6) 
[2] obtained a similar formula for a slightly different objective function.
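As a numerical companion to the closed-form solution, the sketch below solves a small linear system for the unlabeled scores. It assumes the standard harmonic solution associated with [25]: the unlabeled block of the graph Laplacian (degree matrix minus similarity matrix) applied to the unlabeled scores equals the unlabeled-by-labeled block of the similarity matrix applied to the responses. The helper names `solve_linear` and `hard_criterion_scores` and the toy graph are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("system is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def hard_criterion_scores(W, y):
    """Unlabeled scores from the assumed harmonic closed form.

    The first len(y) nodes are labeled. Degrees are row sums of W over
    all nodes, labeled and unlabeled.
    """
    n_labeled, n_total = len(y), len(W)
    unlabeled = range(n_labeled, n_total)
    degrees = [sum(row) for row in W]
    # Unlabeled block of the Laplacian: degree on the diagonal minus W.
    A = [[(degrees[a] if a == b else 0.0) - W[a][b] for b in unlabeled]
         for a in unlabeled]
    # Similarity-weighted sum of the labeled responses for each unlabeled node.
    rhs = [sum(W[a][i] * y[i] for i in range(n_labeled)) for a in unlabeled]
    return solve_linear(A, rhs)


# Toy path graph (hypothetical): 0 - 2 - 3 - 1, nodes 0 and 1 labeled.
W = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
closed_form = hard_criterion_scores(W, [1.0, 0.0])
print(closed_form)
```

A direct solve is used here instead of forming an explicit matrix inverse, which is a design choice of the sketch and not something the text prescribes.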
Clearly, the form of (6) is closely related to the Nadaraya-Watson estimator [15, 21] for kernel regression, which is
(7) 
The Nadaraya-Watson estimator is well studied under the nonparametric framework. We can construct the similarity matrix by a kernel function, that is, each similarity is obtained by applying a nonnegative kernel function to the difference of two inputs divided by a positive constant controlling the bandwidth of the kernel. The target is the true regression function. The consistency of the Nadaraya-Watson estimator was first proved by [21] and [15]. Many other researchers, such as [9] and [6], studied its asymptotic properties under different assumptions. Here we follow the result in [10]. Under the bandwidth conditions of that result as the labeled sample size grows, and if the kernel satisfies:

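(i) the kernel is bounded by a finite constant;
(ii) the support of the kernel is compact;
(iii) the kernel is bounded below by a positive constant on some closed ball centered at the origin and having positive radius,
then the Nadaraya-Watson estimator converges to the regression function in probability at each unlabeled input.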
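For concreteness, a Nadaraya-Watson estimate at a query point is the kernel-weighted average of the labeled responses. The sketch below is an illustration only: the Epanechnikov-type kernel is a hypothetical choice made because it is bounded, has compact support and is bounded below by a positive constant near the origin, and the function names, the bandwidth value and the toy data are assumptions.

```python
def epanechnikov(u):
    """Kernel max(0, 1 - ||u||^2): bounded, compactly supported."""
    squared_norm = sum(c * c for c in u)
    return max(0.0, 1.0 - squared_norm)


def nadaraya_watson(x_query, X, y, bandwidth, kernel=epanechnikov):
    """Kernel-weighted average of the labeled responses at x_query.

    Returns None when no labeled input falls inside the kernel support,
    since the weighted average is then undefined.
    """
    weights = [kernel([(q - v) / bandwidth for q, v in zip(x_query, xi)])
               for xi in X]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * yi for w, yi in zip(weights, y)) / total


# Toy one-dimensional data (hypothetical).
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 0.0]
estimate = nadaraya_watson([0.5], X, y, bandwidth=1.0)
print(estimate)
```

In this toy call the query point is equally close to the first two inputs and outside the kernel support of the third, so the estimate is the plain average of the first two responses.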
By establishing a connection between the solution of the hard criterion and the Nadaraya-Watson estimator, we prove the following main theorem:
Theorem 1.
Suppose that are independently and identically distributed with being bounded; and satisfy the above conditions. Further, let be the difference of two independent ’s, i.e.
, with probability density function
. Assume that there exists , such that for any ,(8) 
Then, for , given in (5) converges to in probability, for .
The proof will be given in Section 3.

Theorem 1 establishes the consistency of the hard criterion under the standard nonparametric framework with two additional assumptions. Firstly, both the labeled data and the unlabeled data are allowed to grow, but the size of the unlabeled data grows more slowly than the size of the labeled data. We conjecture, based on the simulation studies in Section 4, that graph-based semi-supervised learning may not be consistent when the unlabeled data grow faster than the labeled data. [16] also suggested that the method may not work when the unlabeled data grow too fast. Secondly, we assume that the density function of the difference of two independent inputs is strictly positive near the origin, which is a mild technical condition valid for commonly used density functions.
Theorem 1 provides some surprising insights about the hard criterion of graph-based learning. At first glance, the hard criterion makes an impractical assumption that requires the responses to be noiseless, while the soft criterion seems to be a more natural choice. But according to our theoretical analysis, the hard criterion is consistent under the standard nonparametric framework, where the responses on the training data are of course allowed to be random and noisy.
We now consider the soft criterion with a nonzero tuning parameter.
Proposition 2.
Suppose that are independently and identically distributed with . Further, suppose that represents a connected graph. Then for sufficiently large , the soft criterion (2) is inconsistent.
Proof.
Consider another extreme case of the soft criterion (2), . When represents a connected graph, the objective function becomes
(9) 
subject to .
It is easy to check that the solution of (9), denoted by , is given by
By the law of large numbers,
Clearly,
since the right-hand side is a random variable. This implies that for a sufficiently large tuning parameter, the soft criterion is inconsistent. ∎
3 Proof of the Main Theorem
We give the proof of Theorem 1 in this section.
Recall that
We first focus on . Clearly,
For any positive integer , define
Our goal is to prove that the limit of exists with probability approaching 1, and thus we can have
with probability approaching 1 [22].
By definition,
where
for . Thus we have
Since , there exist , such that holds for every . Thus by the assumption in (8), for and ,
where denotes the volume of a dimensional ball with radius , and is a constant independent of . Since , the above inequality implies . On the other hand, since .
Further,
By Chebyshev’s Inequality, for any , since ,
(10) 
This further implies
We now continue to study the property of . Consider each element of this matrix. For ,
by condition (i) and (iii). For simplicity of notation, let
where is a nonnegative function depending on . Thus is an upper bound of every element in the matrix . By (10), we have
which implies
and
Since , we have
(11) 
where . Note that is a constant independent of and .
For the sake of simplicity, we say a matrix has tiny elements, if
with probability approaching 1, where . And denotes the th row of . Then has tiny elements by (11). Moreover,
holds with probability approaching 1. By induction,
with probability approaching 1. Therefore,
Thus exists with probability approaching 1 since , and also has tiny elements. Therefore,
with probability approaching 1.
We now go back to the solution of the hard criterion of graph-based semi-supervised learning,
(12) 
with probability approaching 1. For , equals the th row of , i.e.,
(13)  
with probability approaching 1.
By assumption, ’s are bounded. Without loss of generality, assume . For , define
We have
with probability approaching 1 as . This implies
since for any we can find such that and
Finally, for each ,
Since has tiny elements,
which implies in probability. The theorem then holds by the consistency of the Nadaraya-Watson estimator.
4 Numerical Studies
In this section, we compare the performance of the hard criterion and the soft criterion with different tuning parameters under a linear and nonlinear model.
The inputs
are generated independently from a truncated multivariate normal distribution. Specifically, let
follow a dimensional multivariate normal distribution with the mean, and the variance-covariance matrix
We set . For and , let if and otherwise, where and are the th component of and , respectively.
Let
be the Gaussian radial basis function (RBF) kernel, that is,
where . Note that has compact support since ’s are truncated, and the choice of satisfies the condition in Theorem 1.
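A similarity matrix built from a Gaussian RBF kernel can be sketched as follows. This is an illustration under assumptions: the exact parameterization of the kernel is not recoverable from the text, so the sketch uses the common form exp(-||x - x'||^2 / (2 h^2)) with a hypothetical bandwidth `h`, and the function name and toy inputs are likewise hypothetical.

```python
import math


def rbf_similarity_matrix(X, h):
    """Pairwise Gaussian RBF similarities exp(-||a - b||^2 / (2 h^2)).

    The parameterization and the bandwidth h are assumptions of this
    sketch, not values taken from the text.
    """
    def similarity(a, b):
        squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
        return math.exp(-squared_distance / (2.0 * h * h))

    return [[similarity(a, b) for b in X] for a in X]


# Toy inputs (hypothetical): three points on the real line.
X = [[0.0], [1.0], [3.0]]
W = rbf_similarity_matrix(X, h=1.0)
for row in W:
    print([round(v, 4) for v in row])
```

The resulting matrix is symmetric with ones on the diagonal, and similarities decay as the inputs move apart.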
We consider two models in simulation studies. In Model 1, the responses
’s follow a logistic regression with
for . Model 2 uses a nonlinear logit function,
for .
We compare the performance of graph-based learning methods with four different tuning parameters, and 5. The performance is measured by the root mean squared error (RMSE) on the unlabeled data, that is,
Each simulation is repeated 1000 times and the average RMSEs are reported.
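The error measure can be sketched as below. The sketch assumes that the RMSE compares the estimated scores on the unlabeled data with reference values supplied by the caller, for example the values of the true regression function at the unlabeled inputs; which reference the original formula uses is not recoverable from the text, so that choice, the function names and the toy numbers are assumptions.

```python
import math


def rmse(estimated, reference):
    """Root mean squared difference between two equally long sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_square = sum((e - r) ** 2
                      for e, r in zip(estimated, reference)) / len(estimated)
    return math.sqrt(mean_square)


def average_rmse(runs):
    """Average the RMSE over repeated simulation runs.

    Each run is a pair (estimated scores, reference values).
    """
    return sum(rmse(est, ref) for est, ref in runs) / len(runs)


# Toy numbers (hypothetical): two runs with two unlabeled points each.
runs = [([0.5, 0.5], [1.0, 0.0]),
        ([1.0, 0.0], [1.0, 0.0])]
print(average_rmse(runs))
```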
Figure 1 shows the RMSEs under Model 1 when the sample size of unlabeled data is fixed as 30 and the sample size of labeled data , 30, 50, 100, 200, 300, 500, 800, 1000 and 1500. As the labeled sample size increases, the RMSEs of all methods decrease as expected. More importantly, the RMSE increases as the tuning parameter increases. In particular, the hard criterion always outperforms the soft criterion, which is consistent with our theoretical results.
Figure 2 shows the RMSEs under Model 1 when the sample size of labeled data is fixed as 100 and the sample size of unlabeled data , 60, 100, 300, 500 and 1000. As before, the RMSE always increases as the tuning parameter increases. Moreover, the RMSEs of all methods increase as the unlabeled sample size increases, which suggests that the hard criterion may not be consistent when the unlabeled data grow faster than the labeled data. For the nonlinear logit function, Figures 3 and 4 show the same patterns as in Figures 1 and 2, respectively, which also support our theoretical results.
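The dependence on the tuning parameter can also be explored on a toy graph. The sketch below is an illustration under assumptions: it takes the soft criterion to be the squared loss on the labeled nodes plus the tuning parameter times the quadratic form of the unnormalized graph Laplacian, so that setting the gradient to zero gives a linear system with matrix "labeled indicator plus tuning parameter times Laplacian". The function names, the toy graph and the two tuning parameter values are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("system is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def soft_criterion_scores(W, y, lam):
    """Scores on all nodes from the assumed stationarity condition.

    The first len(y) nodes are labeled. Requires lam > 0 and a connected
    graph so that the system matrix is nonsingular.
    """
    n_labeled, n_total = len(y), len(W)
    degrees = [sum(row) for row in W]
    A = [[lam * ((degrees[i] if i == j else 0.0) - W[i][j])
          + (1.0 if i == j and i < n_labeled else 0.0)
          for j in range(n_total)] for i in range(n_total)]
    rhs = [y[i] if i < n_labeled else 0.0 for i in range(n_total)]
    return solve_linear(A, rhs)


# Toy path graph (hypothetical): 0 - 2 - 3 - 1, nodes 0 and 1 labeled.
W = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
y = [1.0, 0.0]
small_lam = soft_criterion_scores(W, y, 1e-6)
large_lam = soft_criterion_scores(W, y, 1e6)
print(small_lam)
print(large_lam)
```

With a very small tuning parameter the unlabeled scores are close to the hard-criterion interpolation between the labels, while with a very large one every score is close to the mean of the labeled responses, in line with the argument used for Proposition 2.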
5 Summary
In this article, we proved the consistency of graph-based semi-supervised learning when the tuning parameter of the graph Laplacian is zero (the hard criterion) and showed that the method can be inconsistent when the tuning parameter is nonzero (the soft criterion). Moreover, the numerical studies also suggest that the hard criterion outperforms the soft criterion in terms of the RMSE. These results provide a better understanding of the statistical properties of graph-based semi-supervised learning. Of course, the accuracy of prediction can be measured by other indicators such as the area under the receiver operating characteristic curve (AUC). The hard criterion may not always be the best choice in terms of these indicators. Further theoretical properties such as rank consistency will be explored in future research. Moreover, we would also like to investigate the behavior of these methods when the unlabeled data grow faster than the labeled data.
References
[1] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the International Conference on Machine Learning, pages 19–26, 2002.
[2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In Proceedings of the Seventeenth Annual Conference on Computational Learning Theory, pages 624–638, Banff, Canada, 2004.
[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, 2006.
[4] T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems, 16:73–80, 2004.
[5] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. NIPS, 16, 2004.
[6] Z. Cai. Weighted Nadaraya-Watson regression estimation. Statistics & Probability Letters, 51:307–318, 2001.
[7] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2006.
[8] O. Delalleau, Y. Bengio, and N. Le Roux. Efficient nonparametric function induction in semi-supervised learning. In Artificial Intelligence and Statistics, 2005.
[9] L. P. Devroye. The uniform convergence of the Nadaraya-Watson regression function estimate. The Canadian Journal of Statistics, 6:179–191, 1978.
[10] L. P. Devroye and T. J. Wagner. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Annals of Statistics, 8(2):231–239, 1980.
[11] M. Hein. Uniform convergence of adaptive graph-based regularization. COLT: Learning Theory, pages 50–64, 2006.
[12] M. D. Intriligator and Z. Griliches. Handbook of Econometrics, volume 1. North-Holland Publishing Company, 1988.
[13] R. Jones. Learning to extract entities from labeled and unlabeled text. PhD Thesis, 2005.
[14] J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. NIPS, 20, 2008.
[15] E. A. Nadaraya. On estimating regression. Theor. Probability Appl., 9:141–142, 1964.
[16] B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. NIPS, 2009.
[17] M. E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
[18] J. Ratsaby and S. S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (COLT ’95), pages 412–417, 1995.
[19] C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. In Proceedings of Seventh IEEE Workshop on Applications of Computer Vision, 2005.
[20] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[21] G. S. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359–372, 1964.
[22] D. Werner. Functional Analysis (in German). Springer Verlag, 2005.
[23] Y. Zhang, M. Brady, and S. Smith. Hidden Markov random field model and segmentation of brain MR images. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.
[24] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In S. Thrun, L. Saul, and B. Schölkopf, editors. MIT Press, Cambridge, MA, 2004.
[25] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML, (118), 2003.
[26] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers, 2009.