On Consistency of Graph-based Semi-supervised Learning

03/17/2017, by Chengan Du et al.

Graph-based semi-supervised learning is one of the most popular methods in machine learning. Some of its theoretical properties, such as bounds for the generalization error and the convergence of the graph Laplacian regularizer, have been studied in the computer science and statistics literature. However, a fundamental statistical property, the consistency of the estimator from this method, has not been proved. In this article, we study the consistency problem under a non-parametric framework. We prove the consistency of graph-based learning in the case where the estimated scores are required to equal the observed responses on the labeled data. The sample sizes of both labeled and unlabeled data are allowed to grow in this result. When the estimated scores are not required to equal the observed responses, a tuning parameter is used to balance the loss function and the graph Laplacian regularizer. We give a counterexample demonstrating that the estimator in this case can be inconsistent. The theoretical findings are supported by numerical studies.


1 Introduction

Semi-supervised learning is a class of machine learning methods that stand in the middle ground between supervised learning, in which all training data are labeled, and unsupervised learning, in which no training data are labeled. Specifically, in addition to the labeled training data (X_1, Y_1), …, (X_n, Y_n), there exist unlabeled inputs X_{n+1}, …, X_{n+m}. Under certain assumptions on the geometric structure of the input data, such as the cluster assumption or the low-dimensional manifold assumption [7], the use of both labeled and unlabeled data can achieve better prediction accuracy than supervised learning that only uses the labeled data.

Semi-supervised learning has become popular since the acquisition of unlabeled data is relatively inexpensive. A large number of methods have been developed under the framework of semi-supervised learning. For example, [18] proposed that the combination of labeled and unlabeled data can improve prediction accuracy under the assumption of mixture models. The self-training method [19] and the co-training method [13] were soon applied to semi-supervised learning when mixture models are not assumed. [23] described an approach to semi-supervised clustering based on hidden Markov random fields (HMRFs) that can combine multiple approaches in a unified probabilistic framework. [1] proposed a probabilistic framework for semi-supervised learning incorporating a K-means type hard partition clustering algorithm (HMRF-Kmeans).

[20] proposed transductive support vector machines (TSVMs), which use the idea of transductive learning by including unlabeled data in the computation of the margin. Transductive learning is a variant of semi-supervised learning which focuses on the inference of the correct labels for the given unlabeled data rather than the inference of a general rule.

[4] used a convex relaxation of the optimization problem, called semi-definite programming, as a different approach to the TSVMs.

In this article, we focus on a particular semi-supervised method: graph-based semi-supervised learning. In this method, the geometric structure of the input data is represented by a graph G = (V, E), where nodes represent the inputs and edges represent the similarities between them. The similarities are given by an (n+m) by (n+m) symmetric similarity matrix (also called a kernel matrix) W = (w_{ij}), where w_{ij} ≥ 0. A larger w_{ij} implies that X_i and X_j are more similar. Further, let y = (y_1, …, y_n)^T be the responses of the labeled data.

[25] proposed the following graph-based learning method,

(1)   min_{f ∈ R^{n+m}}  (1/2) Σ_{i,j=1}^{n+m} w_{ij} (f_i − f_j)^2,

subject to f_i = y_i for i = 1, …, n.

Its solution f̂ = (f̂_1, …, f̂_{n+m})^T is called the estimated scores. The objective function (1) (named the "hard criterion" hereafter) requires all the estimated scores on the labeled data to be exactly the same as the observed responses. [8] relaxed this requirement by proposing a soft version (named the "soft criterion" hereafter). We follow an equivalent form given in [26],

(2)   min_{f ∈ R^{n+m}}  Σ_{i=1}^{n} (f_i − y_i)^2 + (λ/2) Σ_{i,j=1}^{n+m} w_{ij} (f_i − f_j)^2,

where λ ≥ 0 is a tuning parameter.

The soft criterion belongs to the "loss + penalty" paradigm: it searches for the minimizer f̂ which achieves a small training error, and in the meanwhile imposes smoothness on f̂ through a penalty based on the similarity matrix. It can easily be seen that when λ = 0 the soft criterion is equivalent to the hard criterion.

  • The tuning parameter λ being 0 in the soft criterion (2) is understood in the following sense: the squared loss has infinite weight and thereby f̂_i = y_i for all labeled data. But the graph Laplacian regularizer still plays a crucial role when it has no conflict with the hard constraints on the labeled data; that is, it provides links between the f_i's on the labeled and unlabeled data. Therefore, the soft criterion (2) at λ = 0 becomes the hard criterion (1), as the display below makes explicit.
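In the notation of (1) and (2), this limiting interpretation can be written as follows; the display is a sketch under the penalty normalization used above and assumes that every connected component of the graph contains at least one labeled node, so that both minimizers are well defined.

```latex
% Limiting interpretation of \lambda = 0 in the soft criterion (sketch):
\hat f(\lambda) \;=\; \operatorname*{arg\,min}_{f \in \mathbb{R}^{n+m}}
  \Bigl\{ \sum_{i=1}^{n} (f_i - y_i)^2
        + \frac{\lambda}{2} \sum_{i,j=1}^{n+m} w_{ij} (f_i - f_j)^2 \Bigr\},
\qquad
\lim_{\lambda \to 0^{+}} \hat f(\lambda)
  \;=\; \operatorname*{arg\,min}_{\substack{f \in \mathbb{R}^{n+m} \\ f_i = y_i,\; i = 1,\dots,n}}
  \; \frac{1}{2} \sum_{i,j=1}^{n+m} w_{ij} (f_i - f_j)^2 .
```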

Researchers have proposed different variants of graph-based learning methods, such as [24] and [2]. We only focus on (1) and (2) in this article.

The theoretical properties of graph-based learning have been studied in the computer science and statistics literature. [5] derived the limit of the graph Laplacian regularizer when the sample size of unlabeled data goes to infinity. [11] considered the convergence of the Laplacian regularizer on Riemannian manifolds. [3] reinterpreted the graph Laplacian as a measure of intrinsic distances between inputs on a manifold and reformulated the problem as a functional optimization in a reproducing kernel Hilbert space. [16] pointed out that the hard criterion can yield a completely noninformative solution when the size of the unlabeled data goes to infinity while the labeled data remain finite; that is, the solution can give a perfect fit on the labeled data but remain 0 on the unlabeled data. [14] obtained the asymptotic mean squared error of a different version of the graph-based learning criterion. [2] gave a bound on the generalization error for a slightly different version of (2).

But to the best of our knowledge, no result is available in the literature on a very fundamental question, the consistency of graph-based learning, which is the main focus of this article. Specifically, we want to answer the question of under what conditions f̂_j will converge to f*(X_j) on the unlabeled data, where f*(x) = P(Y = 1 | X = x) is the true probability of a positive label given X = x if the responses are binary, and f*(x) = E(Y | X = x) is the regression function if the responses are continuous. We will always call f* the regression function for simplicity.

Most of the works discussed above considered a "functional version" of (1) and (2). They used a functional optimization problem, with the optimizer being a function, as an approximation of the original problem, whose optimizer is a vector, and they studied the behavior of the limit of the graph Laplacian and of the corresponding solution. We do not adopt this framework but use a more direct approach. We focus on the original problem and study the relation between f̂_j and f*(X_j) under a general non-parametric setting. Our approach essentially belongs to the framework of transductive learning, which focuses on the prediction on the given unlabeled data X_{n+1}, …, X_{n+m}, not the general mapping from inputs to responses. By establishing a link between the optimizer of (1) and the Nadaraya-Watson estimator [15, 21] for kernel regression, we will prove the consistency of the hard criterion. The theorem allows both the labeled sample size n and the unlabeled sample size m to grow. On the other hand, we show that the soft criterion is inconsistent for sufficiently large λ. To the best of our knowledge, this is the first result that explicitly distinguishes the hard criterion and the soft criterion of graph-based learning from a theoretical perspective and shows that they have very different asymptotic behaviors.

The rest of the article is organized as follows. In Section 2, we state the consistency result for the hard criterion and give the counterexample for the soft criterion. We prove the consistency result in Section 3. Numerical studies in Section 4 support our theoretical findings. Section 5 concludes with a summary and discussion of future research directions.

2 Main Results

We begin with basic notation and setup. Let (X_1, Y_1), …, (X_{n+m}, Y_{n+m}) be independently and identically distributed pairs. Here each X_i is a d-dimensional vector and the Y_i are binary responses labeled as 1 and 0 (the classification case) or continuous responses (the regression case). The last m responses, Y_{n+1}, …, Y_{n+m}, are unobserved.

[25] used a fixed-point algorithm to solve the hard criterion (1), which is

(3)   f_j = ( Σ_{i=1}^{n+m} w_{ij} f_i ) / ( Σ_{i=1}^{n+m} w_{ij} ),   for j = n+1, …, n+m,   with f_i = y_i for i = 1, …, n.

Note that (3) is not a closed-form solution but an updating formula for the iterative algorithm, since its right-hand side depends on unknown quantities, namely the f_i's on the unlabeled data.
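The update can be sketched in a few lines of code; the function name, the Jacobi-style sweep over the unlabeled nodes, and the stopping rule below are our choices, not part of [25].

```python
import numpy as np

def harmonic_fixed_point(W, y_labeled, n_iter=1000, tol=1e-8):
    """Fixed-point iteration for the hard criterion (sketch).

    W         : (n+m) x (n+m) symmetric similarity matrix, labeled nodes first.
    y_labeled : length-n vector of observed responses.
    Each unlabeled score is repeatedly replaced by the similarity-weighted
    average of the current scores, while labeled scores stay clamped at the
    observed responses. Returns the scores on the m unlabeled nodes.
    """
    n, N = len(y_labeled), W.shape[0]
    f = np.zeros(N)
    f[:n] = y_labeled                   # clamp labeled scores
    d = W.sum(axis=1)                   # node degrees
    for _ in range(n_iter):
        f_new = f.copy()
        for j in range(n, N):           # unlabeled nodes only
            f_new[j] = W[j] @ f / d[j]  # weighted average of current scores
        if np.max(np.abs(f_new - f)) < tol:
            return f_new[n:]
        f = f_new
    return f[n:]
```

In practice the sweep over unlabeled nodes can be vectorized as a single matrix-vector product per iteration.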

In order to obtain a closed-form solution for (1), we begin by solving the soft version (2) and then let λ → 0. Recall that W is the similarity matrix. Let L = D − W, where D = diag(d_1, …, d_{n+m}) with d_i = Σ_{j=1}^{n+m} w_{ij}, L being the unnormalized graph Laplacian (see [17] for more details). The soft criterion (2) can be written in matrix form

(4)   (f − ỹ)^T J (f − ỹ) + λ f^T L f,

where ỹ = (y_1, …, y_n, 0, …, 0)^T. Further, let J be an (n+m) by (n+m) diagonal matrix defined as

J = diag(1, …, 1, 0, …, 0),

with the first n diagonal entries equal to 1 and the last m equal to 0. Then by taking the derivative of (4) with respect to f and setting it equal to zero, we obtain the solution as follows,

f̂ = (J + λL)^{−1} J ỹ.
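For completeness, a sketch of the first-order condition behind this solution, in the matrix notation assumed here:

```latex
% Gradient of (4) with respect to f, set to zero:
\nabla_f \bigl[ (f - \tilde y)^{\top} J (f - \tilde y) + \lambda f^{\top} L f \bigr]
  = 2 J (f - \tilde y) + 2 \lambda L f = 0
\;\;\Longrightarrow\;\;
(J + \lambda L)\, \hat f = J \tilde y
\;\;\Longrightarrow\;\;
\hat f = (J + \lambda L)^{-1} J \tilde y .
```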

What we are interested in are the estimated scores on the unlabeled data, i.e., f̂_U = (f̂_{n+1}, …, f̂_{n+m})^T. In order to obtain an explicit form for f̂_U, we use a formula for the inverse of a block matrix (see standard textbooks on matrix algebra such as [12] for more details): for any non-singular square matrix

M = [ A  B ; C  E ],

with A and E − C A^{−1} B non-singular, the lower-left block of M^{−1} is −(E − C A^{−1} B)^{−1} C A^{−1}.

Write J + λL as a block matrix according to the labeled/unlabeled partition,

J + λL = [ I_n + λL_{11}   λL_{12} ; λL_{21}   λL_{22} ],

where L_{11} is n by n, L_{22} is m by m, and L_{12} = L_{21}^T. By the formula above,

(5)   f̂_U = −( L_{22} − λ L_{21} (I_n + λL_{11})^{−1} L_{12} )^{−1} L_{21} (I_n + λL_{11})^{−1} y.

Letting λ → 0, we obtain the solution for the hard criterion (1),

(6)   f̂_U = −L_{22}^{−1} L_{21} y = (D_{22} − W_{22})^{−1} W_{21} y,

where D_{22}, W_{22} and W_{21} are the blocks of D and W corresponding to the same partition.

[2] obtained a similar formula for a slightly different objective function.
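These closed forms can be checked numerically. The following sketch, written in the matrix notation of this section (the toy random inputs, sample sizes, and similarity function are our choices for illustration), computes the soft solution (J + λL)^{-1} J ỹ and compares its unlabeled block with the hard solution (6) as λ becomes small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                                   # labeled / unlabeled sizes (toy example)
X = rng.normal(size=(n + m, 2))
W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # toy similarities
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W                                     # unnormalized graph Laplacian
J = np.diag([1.0] * n + [0.0] * m)            # indicator of labeled nodes
y = rng.normal(size=n)
y_tilde = np.concatenate([y, np.zeros(m)])

# Hard-criterion (harmonic) solution on the unlabeled block, as in (6)
f_hard = np.linalg.solve(D[n:, n:] - W[n:, n:], W[n:, :n] @ y)

for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    f_soft = np.linalg.solve(J + lam * L, J @ y_tilde)   # soft solution
    gap = np.max(np.abs(f_soft[n:] - f_hard))
    print(f"lambda={lam:g}  max|soft - hard| on unlabeled = {gap:.2e}")
```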

Clearly, the form of (6) is closely related to the Nadaraya-Watson estimator [15, 21] for kernel regression, which is

(7)   f̂_{NW}(x) = ( Σ_{i=1}^{n} K((x − X_i)/h_n) Y_i ) / ( Σ_{i=1}^{n} K((x − X_i)/h_n) ).

The Nadaraya-Watson estimator is well studied under the non-parametric framework. We can construct W by a kernel function, that is, let w_{ij} = K((X_i − X_j)/h_n), where K is a nonnegative function on R^d, and h_n is a positive constant controlling the bandwidth of the kernel. Let f*(x) = E(Y | X = x) be the true regression function. The consistency of the Nadaraya-Watson estimator was first proved by [21] and [15], and many other researchers such as [9] and [6] studied its asymptotic properties under different assumptions. Here we follow the result in [10]. If h_n → 0 and n h_n^d → ∞ as n → ∞, and K satisfies:

  • (i) K is bounded by some constant k_max < ∞;

  • (ii) the support of K is compact;

  • (iii) K ≥ b for some b > 0 on some closed ball centered at the origin and having positive radius r,

then f̂_{NW}(x) converges to f*(x) in probability, for almost every x with respect to the distribution of X.
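For concreteness, a minimal sketch of the Nadaraya-Watson estimator with a kernel satisfying conditions (i)-(iii); the uniform kernel, the toy regression function, and all numerical choices here are ours for illustration.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Nadaraya-Watson estimate of E[Y | X = x0].

    X : (n, d) labeled inputs, Y : (n,) responses, h : bandwidth.
    Uses the uniform kernel K(u) = 1{||u|| <= 1}, which is bounded,
    compactly supported, and bounded below on a ball around the origin.
    """
    K = (np.linalg.norm((X - x0) / h, axis=1) <= 1.0).astype(float)
    denom = K.sum()
    if denom == 0.0:            # no labeled point within distance h of x0
        return 0.0
    return (K @ Y) / denom

# toy check: Y = f*(X) + noise with f*(x) = sin(x_1)
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(2000, 2))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
print(nadaraya_watson(np.array([1.0, 0.0]), X, Y, h=0.3), np.sin(1.0))
```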

By establishing a connection between the solution of the hard criterion and the Nadaraya-Watson estimator, we prove the following main theorem:

Theorem 1.

Suppose that (X_1, Y_1), …, (X_{n+m}, Y_{n+m}) are independently and identically distributed with Y being bounded, that h_n and K satisfy the above conditions, and that the size m of the unlabeled data grows slower than the size n of the labeled data. Further, let Z be the difference of two independent X's, i.e., Z = X − X′, with probability density function g. Assume that there exist δ > 0 and b_0 > 0 such that, for any z with ‖z‖ ≤ δ,

(8)   g(z) ≥ b_0.

Then, for λ = 0, f̂_j given in (5) converges to f*(X_j) in probability, for j = n+1, …, n+m.

The proof will be given in Section 3.

  • Theorem 1 establishes the consistency of the hard criterion under the standard non-parametric framework with two additional assumptions. Firstly, both the labeled data and the unlabeled data are allowed to grow, but the size m of the unlabeled data grows slower than the size n of the labeled data. We conjecture that when m grows faster than n, graph-based semi-supervised learning may not be consistent, based on the simulation studies in Section 4. [16] also suggested that the method may not work when the unlabeled data grow too fast. Secondly, we assume that the density function of the difference of two independent inputs is strictly positive near the origin, which is a mild technical condition valid for commonly used density functions.

Theorem 1 provides some surprising insight into the hard criterion of graph-based learning. At first glance, the hard criterion makes an impractical assumption that requires the responses to be noiseless, while the soft criterion seems to be a more natural choice. But according to our theoretical analysis, the hard criterion is consistent under the standard non-parametric framework, where the responses on the training data are of course allowed to be random and noisy.

We now consider the soft criterion with λ > 0.

Proposition 2.

Suppose that (X_1, Y_1), …, (X_{n+m}, Y_{n+m}) are independently and identically distributed with f*(X) non-degenerate. Further, suppose that W represents a connected graph. Then for sufficiently large λ, the soft criterion (2) is inconsistent.

Proof.

Consider another extreme case of the soft criterion (2), λ = ∞. When W represents a connected graph, the objective function becomes

(9)   min_f Σ_{i=1}^{n} (f_i − y_i)^2,

subject to f_1 = f_2 = ⋯ = f_{n+m}.

It is easy to check that the solution of (9), denoted by f̄, is given by

f̄_j = ȳ = (1/n) Σ_{i=1}^{n} y_i,   for j = 1, …, n+m.

By the law of large numbers,

ȳ → E(Y) in probability as n → ∞.

Clearly,

E(Y) ≠ f*(X_j) with positive probability,

since the right-hand side is a random variable. This implies that for sufficiently large λ, the soft criterion is inconsistent. ∎
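The counterexample can also be seen numerically: as λ grows, the estimated scores on the unlabeled data flatten toward the average of the labeled responses rather than tracking f*(X_j). The data-generating choices in the sketch below are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 10
X = rng.uniform(-2, 2, size=(n + m, 1))
f_star = np.sin(2 * X[:, 0])                 # a non-degenerate regression function
Y = f_star + 0.1 * rng.normal(size=n + m)

W = np.exp(-(X[:, None, 0] - X[None, :, 0]) ** 2 / 0.1)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
J = np.diag([1.0] * n + [0.0] * m)
y_tilde = np.concatenate([Y[:n], np.zeros(m)])

for lam in [0.01, 1.0, 100.0, 10000.0]:
    f_hat = np.linalg.solve(J + lam * L, J @ y_tilde)   # soft solution
    rmse_fstar = np.sqrt(np.mean((f_hat[n:] - f_star[n:]) ** 2))
    gap_to_mean = np.max(np.abs(f_hat[n:] - Y[:n].mean()))
    print(f"lambda={lam:g}: RMSE to f* = {rmse_fstar:.3f}, "
          f"max distance to labeled mean = {gap_to_mean:.3f}")
```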

3 Proof of the Main Theorem

We give the proof of Theorem 1 in this section.

Recall that

We first focus on . Clearly,

For any positive integer , define

Our goal is to prove that the limit of exists with probability approaching 1, and thus we can have

with probability approaching 1 [22].
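In the block notation of Section 2, the series in question is a matrix geometric (Neumann) series; the following display is a sketch of the identity being used, under the assumption that the spectral radius of D_{22}^{-1} W_{22} is strictly below one so that the series converges.

```latex
% Neumann-series form of the hard-criterion solution (6) (sketch):
\hat f_U
  = (D_{22} - W_{22})^{-1} W_{21}\, y
  = \bigl(I_m - D_{22}^{-1} W_{22}\bigr)^{-1} D_{22}^{-1} W_{21}\, y
  = \sum_{k=0}^{\infty} \bigl(D_{22}^{-1} W_{22}\bigr)^{k} D_{22}^{-1} W_{21}\, y ,
% whose leading (k = 0) term has j-th entry
% \sum_{i=1}^{n} w_{n+j,\,i}\, y_i \,/\, d_{n+j},
% closely resembling the Nadaraya--Watson estimator (7).
```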

By definition,


where

for . Thus we have

Since , there exist , such that holds for every . Thus by the assumption in (8), for and ,

where denotes the volume of a d-dimensional ball with radius , and is a constant independent of . Since , the above inequality implies . On the other hand, since .

Further,

By Chebyshev’s Inequality, for any , since ,

(10)

This further implies

We now continue to study the property of . Consider each element of this matrix. For ,

by conditions (i) and (iii). For simplicity of notation, let

where is a nonnegative function depending on . Thus is an upper bound of every element in the matrix . By (10), we have

which implies

and

Since , we have

(11)

where . Note that is a constant independent of and .

For the sake of simplicity, we say that a matrix has tiny elements if

with probability approaching 1, where , and denotes the -th row of . Then has tiny elements by (11). Moreover,

holds with probability approaching 1. By induction,

with probability approaching 1. Therefore,

Thus exists with probability approaching 1 since , and also has tiny elements. Therefore,

with probability approaching 1.

We now go back to the solution of the hard criterion of graph-based semi-supervised learning,

(12)

with probability approaching 1. For , equals the corresponding row of , i.e.,

(13)

with probability approaching 1.

By assumption, the Y_i's are bounded. Without loss of generality, assume . For , define

We have

with probability approaching 1 as . This implies

since for any we can find such that and

Finally, for each ,

Since has tiny elements,

which implies in probability. The theorem then holds by the consistency of the Nadaraya-Watson estimator.

4 Numerical Studies

In this section, we compare the performance of the hard criterion and the soft criterion with different tuning parameters under a linear and a non-linear model.

The inputs X_i are generated independently from a truncated multivariate normal distribution. Specifically, let Z_i follow a d-dimensional multivariate normal distribution with mean μ and variance-covariance matrix Σ. The k-th component of X_i is then obtained by truncating the k-th component of Z_i.

Let K be the Gaussian radial basis function (RBF) kernel, that is,

w_{ij} = exp( −‖X_i − X_j‖^2 / (2 h_n^2) ),

where h_n > 0 is the bandwidth. Note that the kernel has compact support in effect, since the X_i's are truncated, and the choice of h_n satisfies the conditions of Theorem 1.

We consider two models in the simulation studies. In Model 1, the responses Y_i follow a logistic regression model with a linear logit function. Model 2 uses a non-linear logit function.

We compare the performance of the graph-based learning methods for four different values of the tuning parameter λ, the largest being 5. The performance is measured by the root mean squared error (RMSE) of the estimated scores against the regression function on the unlabeled data, that is,

RMSE = sqrt( (1/m) Σ_{j=n+1}^{n+m} ( f̂_j − f*(X_j) )^2 ).

Each simulation is repeated 1000 times and the average RMSEs are reported.
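A condensed sketch of one simulation replicate is given below; the dimension, logit function, truncation, kernel bandwidth, and the positive λ values are stand-ins for the settings described above, while the overall structure (generate labeled and unlabeled data, compute the estimated scores for several values of λ, and record the RMSE against the regression function) follows this section.

```python
import numpy as np

def soft_solution(W, y, n, lam):
    """Soft-criterion scores (J + lam*L)^{-1} J y_tilde on the unlabeled block;
    lam = 0 is handled by the hard (harmonic) solution (6)."""
    N = W.shape[0]
    D = np.diag(W.sum(axis=1))
    if lam == 0.0:
        return np.linalg.solve(D[n:, n:] - W[n:, n:], W[n:, :n] @ y)
    L = D - W
    J = np.diag([1.0] * n + [0.0] * (N - n))
    y_tilde = np.concatenate([y, np.zeros(N - n)])
    return np.linalg.solve(J + lam * L, J @ y_tilde)[n:]

def one_replicate(n, m, lambdas, h=0.5, d=2, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.normal(size=(n + m, d))
    X = np.clip(Z, -2.0, 2.0)                 # stand-in truncation
    logit = X[:, 0] - 0.5 * X[:, 1]           # stand-in (linear) logit
    f_star = 1.0 / (1.0 + np.exp(-logit))     # true regression function
    Y = rng.binomial(1, f_star).astype(float)

    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * h ** 2))          # Gaussian RBF similarities
    np.fill_diagonal(W, 0.0)

    return [np.sqrt(np.mean((soft_solution(W, Y[:n], n, lam) - f_star[n:]) ** 2))
            for lam in lambdas]

# lambda = 0 is the hard criterion; the positive values are illustrative
lams = [0.0, 0.1, 1.0, 5.0]
reps = [one_replicate(100, 30, lams) for _ in range(200)]
print(dict(zip(lams, np.mean(reps, axis=0))))
```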

Figure 1 shows the RMSEs under Model 1 when the sample size of the unlabeled data is fixed at m = 30 and the sample size of the labeled data n takes the values 30, 50, 100, 200, 300, 500, 800, 1000 and 1500. As n increases, the RMSEs of all methods decrease, as expected. More importantly, the RMSE increases as λ increases. In particular, the hard criterion always outperforms the soft criterion, which is consistent with our theoretical results.

Figure 2 shows the RMSEs under Model 1 when n is fixed at 100 and m takes the values 60, 100, 300, 500 and 1000. As before, the RMSE always increases as λ increases. Moreover, the RMSEs of all methods increase as m increases, which suggests that the hard criterion may not be consistent when m grows faster than n. For the non-linear logit function, Figures 3 and 4 show the same patterns as Figures 1 and 2, respectively, which also supports our theoretical results.

5 Summary

In this article, we proved the consistency of graph-based semi-supervised learning when the tuning parameter of the graph Laplacian regularizer is zero (the hard criterion) and showed that the method can be inconsistent when the tuning parameter is nonzero (the soft criterion). Moreover, the numerical studies suggest that the hard criterion outperforms the soft criterion in terms of the RMSE. These results provide a better understanding of the statistical properties of graph-based semi-supervised learning. Of course, the accuracy of prediction can be measured by other indicators, such as the area under the receiver operating characteristic curve (AUC), and the hard criterion may not always be the best choice in terms of these indicators. Further theoretical properties such as rank consistency will be explored in future research. Moreover, we would also like to investigate the behavior of these methods when the unlabeled data grow faster than the labeled data.

Figure 1: Average RMSEs when m = 30 under Model 1
Figure 2: Average RMSEs when n = 100 under Model 1
Figure 3: Average RMSEs when m = 30 under Model 2
Figure 4: Average RMSEs when n = 100 under Model 2

References

  • [1] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the International Conference on Machine Learning, pages 19–26, 2002.
  • [2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In Proceedings of the Seventeenth Annual Conference on Computational Learning Theory, pages 624–638, Banff, Canada, 2004.
  • [3] M. Belkin, P. Niyogi, and S. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
  • [4] T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems, 16:73–80, 2004.
  • [5] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In Advances in Neural Information Processing Systems, 16, 2004.
  • [6] Z. Cai. Weighted Nadaraya-Watson regression estimation. Statistics & Probability Letters, 51:307–318, 2001.
  • [7] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. The MIT Press, 2006.
  • [8] O. Delalleau, Y. Bengio, and N. Le Roux. Efficient non-parametric function induction in semi-supervised learning. In Artificial Intelligence and Statistics, 2005.
  • [9] L. P. Devroye. The uniform convergence of the Nadaraya-Watson regression function estimate. The Canadian Journal of Statistics, 6:179–191, 1978.
  • [10] L. P. Devroye and T. J. Wagner. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Annals of Statistics, 8(2):231–239, 1980.
  • [11] M. Hein. Uniform convergence of adaptive graph-based regularization. In COLT: Learning Theory, pages 50–64, 2006.
  • [12] M. D. Intriligator and Z. Griliches. Handbook of Econometrics, volume 1. North-Holland Publishing Company, 1988.
  • [13] R. Jones. Learning to Extract Entities from Labeled and Unlabeled Text. PhD thesis, 2005.
  • [14] J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems, 20, 2008.
  • [15] E. A. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9:141–142, 1964.
  • [16] B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Advances in Neural Information Processing Systems, 2009.
  • [17] M. E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
  • [18] J. Ratsaby and S. S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (COLT), pages 412–417, 1995.
  • [19] C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. In Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision, 2005.
  • [20] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • [21] G. S. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359–372, 1964.
  • [22] D. Werner. Functional Analysis (in German). Springer Verlag, 2005.
  • [23] Y. Zhang, M. Brady, and S. Smith. Hidden Markov random field model and segmentation of brain MR images. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.
  • [24] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul, and B. Schölkopf, editors). MIT Press, Cambridge, MA, 2004.
  • [25] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
  • [26] X. Zhu and A. B. Goldberg. Introduction to Semi-supervised Learning. Morgan & Claypool Publishers, 2009.