A Distribution Dependent and Independent Complexity Analysis of Manifold Regularization

06/14/2019 · Alexander Mey, et al. · Delft University of Technology

Manifold regularization is a commonly used technique in semi-supervised learning. It guides the learning process by enforcing that the classification rule we find is smooth with respect to the data-manifold. In this paper we present sample and Rademacher complexity bounds for this method. We first derive distribution independent sample complexity bounds by analyzing the general framework of adding a data dependent regularization term to a supervised learning process. We conclude that for these types of methods one can expect that the sample complexity improves at most by a constant, which depends on the hypothesis class. We then derive Rademacher complexities bounds which allow for a distribution dependent complexity analysis. We illustrate how our bounds can be used for choosing an appropriate manifold regularization parameter. With our proposed procedure there is no need to use an additional labeled validation set.


1 Introduction

In many applications, such as image or text classification, gathering unlabeled data is easier than gathering labeled data. Semi-supervised methods try to extract information from the unlabeled data to obtain better classification results than purely supervised methods. A well-known technique to incorporate unlabeled data into a learning process is manifold regularization (MR) [6, 16]. This procedure adds a data-dependent penalty term to the loss function that penalizes classification rules which behave non-smoothly with respect to the data distribution. This paper presents a sample complexity and a Rademacher complexity analysis for this procedure and illustrates how our Rademacher complexity bounds can be used to choose a suitable manifold regularization parameter. We organize our paper as follows. In Sections 2 and 3 we discuss related work and introduce the semi-supervised setting.

In Section 4 we formalize the idea of adding a distribution-dependent penalty term to a loss function. Algorithms such as manifold, entropy or co-regularization [6, 13, 18] follow this idea. Our formalization is inspired by Balcan and Blum [3] and allows for a similar sample complexity analysis.

Section 5 reviews the work from Balcan and Blum [3] and generalizes a sample complexity bound from their paper. We then show how this bound can be used to derive sample complexity bounds for the proposed framework, and thus in particular for MR. In the same section we discuss what our findings mean in terms of the performance gap between supervised and semi-supervised methods. We conclude that if our hypothesis set has finite pseudo-dimension, any semi-supervised learner (SSL) that falls in our framework can have at most a constant improvement in terms of sample complexity when compared to all supervised learners. This relates to work of Darnstädt et al. [11] and Ben-David et al. [7] and we argue that we generalize their results.

While the results of the previous sections hold uniformly over all distributions and are thus distribution independent, we show in Section 6 how one can obtain distribution dependent complexity bounds for MR. We review a kernel formulation of MR [19] and show how it can be used to estimate Rademacher complexities for specific datasets.

In Section 7 we illustrate on an artificial dataset how the distribution dependent bounds can be used for choosing the regularization parameter of MR. This is particularly useful as the analysis does not need an additional labeled validation set, and in semi-supervised settings labeled samples are often assumed to be scarce.

In Section 8 we discuss our results.

We summarize the three core contributions of this paper.

  1. We formalize the concept of distribution-dependent regularization for semi-supervised learning and derive distribution independent sample complexity bounds for it. This includes in particular sample complexity bounds for MR.

  2. We compare the derived upper bounds with the lower bounds of purely supervised learners. We show that for all SSL that fall in our framework, the performance gap, when compared to their supervised counterpart, can be at most a constant. The constant depends on the hypothesis class used.

  3. We derive a computationally feasible method to compute distribution dependent Rademacher complexity bounds for MR. We illustrate how these complexity bounds can be used to find a suitable regularization parameter for MR, removing the need for a labeled validation set.

2 Related Work

There are currently two related analyses of MR that show, to some extent, that a SSL can learn efficiently if it knows the true underlying manifold, while a fully supervised learner may not. Globerson et al. [12] investigate a setting where they essentially restrict themselves to distributions over the input space that correspond to unions of a fixed number of irreducible algebraic sets, where each algebraic set is labeled entirely with a single class. A SSL that knows the true marginal distribution can identify the algebraic sets and reduce the hypothesis space to all possible label combinations on those sets. As we are then left with finitely many hypotheses, we can learn efficiently, while they show that every supervised learner is left with a hypothesis space of infinite VC dimension.

A similar type of analysis was done by Niyogi [16]. Niyogi considers manifolds that arise as embeddings from a circle, where the labeling over the circle is (up to the decision boundary) smooth. Niyogi then shows that a learner that has knowledge of the manifold can learn efficiently while for every fully supervised learner one can find an embedding and a distribution for which this is not possible.

The relation to our paper is as follows. They provide examples where the sample complexity gap between a semi-supervised and a supervised learner can be arbitrarily large, while we explore the general difference in sample complexity between a supervised method and MR.

3 The Semi-Supervised Setting

We work in the statistical learning framework: we assume we are given a feature domain and a label space, together with an unknown probability distribution over their product. Additionally we are given a loss function which is convex in the first argument and, in practice, often a surrogate for the 0-1 loss, as for example the hinge loss. A hypothesis is a function from the feature domain to the label space, and we consider a random feature-label pair distributed according to the unknown distribution, with lowercase letters denoting realizations. Our goal is to find a hypothesis, within a restricted class, such that the expected loss is small. Throughout this paper we call models that minimize the expected loss discriminative models, as the loss depends only on the output of the model and not on the input. In the standard supervised setting we choose a hypothesis based on an i.i.d. sample drawn from the distribution, and we define the empirical risk of a model with respect to the loss as measured on this sample. For ease of notation we sometimes omit the loss and sample from the risk notation. Given a learning problem and a labeled sample, one way to choose a hypothesis is by the empirical risk minimization principle

(1)
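The ERM principle (1) can be sketched on a toy problem; the threshold class, the data and the true decision boundary below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Minimal ERM sketch: pick, from a restricted class (thresholds on the
# first feature), the hypothesis with the smallest empirical risk on the
# labeled sample.
rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sign(X[:, 0] - 0.2)                  # labels from a true threshold at 0.2

thresholds = np.linspace(-1, 1, 201)        # finite hypothesis class

def emp_risk(t):                            # 0-1 empirical risk of threshold t
    return np.mean(np.sign(X[:, 0] - t) != y)

h_erm = thresholds[np.argmin([emp_risk(t) for t in thresholds])]
```

With enough labeled points the empirical minimizer lands close to the true threshold, which is exactly the behavior the sample complexity bounds later in the paper quantify.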

We refer to this as the supervised solution, as it does not depend on the unlabeled sample. In SSL we additionally have samples with unknown labels: we assume all samples are drawn independently according to the same distribution, but the labels have not been observed for the last part of the sample. We call the first part the labeled samples and the remainder the unlabeled samples, and together the feature vectors of both parts contain all our available information about the feature distribution.

4 Proposed Framework for Semi-Supervised Learning

We propose to include the unlabeled sample in the following manner. We first introduce a second convex loss function that depends only on the input feature and a hypothesis. We refer to it as the unsupervised loss, as it does not depend on any labels. We add the unlabeled data through this loss function, as a penalty term on top of the supervised loss, to obtain the semi-supervised solution

(2)

where the regularization parameter controls the trade-off between the supervised and the unsupervised loss. For ease of notation we use shorthand for the two empirical loss terms in the following. We do not claim novelty for the idea of adding an unsupervised loss for regularization; a different framework can be found in [10, Chapter 10]. We are, however, not aware of this particular formulation of the framework or of a sample complexity analysis as presented here. As we are particularly interested in the class of MR schemes, we first remark that this method indeed fits our framework.
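The semi-supervised objective (2) can be sketched numerically; the linear model, squared supervised loss, toy convex unsupervised loss and trade-off value below are all illustrative assumptions:

```python
import numpy as np

# Sketch of objective (2): empirical risk plus a weighted unsupervised
# penalty, minimized by plain gradient descent on a convex objective.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(10, 2))
y_lab = np.sign(X_lab[:, 0])               # labeled sample
X_unl = rng.normal(size=(50, 2))           # unlabeled sample
mu = 0.5                                   # trade-off parameter

def objective(w):
    sup = np.mean((X_lab @ w - y_lab) ** 2)   # supervised loss
    unsup = np.mean((X_unl @ w) ** 2)         # toy unsupervised loss (convex)
    return sup + mu * unsup

w = np.zeros(2)
for _ in range(500):
    grad_sup = 2 * X_lab.T @ (X_lab @ w - y_lab) / len(X_lab)
    grad_unsup = 2 * X_unl.T @ (X_unl @ w) / len(X_unl)
    w -= 0.05 * (grad_sup + mu * grad_unsup)
```

Because both loss terms are convex in the hypothesis, the combined objective stays convex, which is the property the equivalence result (Lemma 1) below relies on.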

Example: Manifold Regularization

Overloading the notation, we now also write the distribution for its restriction to the input space. In MR one assumes that the input distribution has support on a compact manifold, and that the predictor varies smoothly in the geometry of this manifold [6]. There are several regularization terms that can enforce this smoothness, one of which is the squared norm of the gradient of the predictor along the manifold, integrated against the input distribution. Belkin and Niyogi [5] show that this term may be approximated with a finite sample drawn from the input distribution. Given such a sample, one first defines a weight matrix W of pairwise similarities. One then sets L to be the Laplacian matrix L = D − W, where D is the diagonal matrix with D_ii = Σ_j W_ij. Let furthermore f be the evaluation vector of a hypothesis on the sample. The expression f^T L f then converges, suitably normalized, to the smoothness penalty [5]. Thus we can set the unsupervised loss via f^T L f, and this is indeed a convex function of the evaluation vector.
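The graph-Laplacian construction above can be sketched as follows; the Gaussian weights, the bandwidth and the toy data are assumptions for illustration:

```python
import numpy as np

# Gaussian weight matrix W, Laplacian L = D - W, and the smoothness
# penalty f^T L f on the evaluation vector f of a hypothesis.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 2))        # unlabeled points on the "manifold"
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * 0.5 ** 2))            # Gaussian weights, bandwidth 0.5
L = np.diag(W.sum(axis=1)) - W              # graph Laplacian

f_smooth = X[:, 0]                          # varies smoothly over X
f_rough = rng.normal(size=30)               # ignores the geometry

penalty = lambda f: f @ L @ f               # the unsupervised loss f^T L f
```

A function that varies smoothly along the data incurs a much smaller penalty than one that ignores the geometry, which is exactly the behavior MR rewards.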

5 Analysis of the Framework

In this section we analyze the properties of the semi-supervised solution found in Equation (2). We give sample complexity bounds for this procedure and compare them to sample complexities for the supervised case. We derive our results from Balcan and Blum [3], who introduced the concept of unsupervised loss functions. The use of the unsupervised loss function in their work, however, differs from ours: while they use it to restrict the hypothesis space directly, we use it as a regularization tool in the empirical risk minimization, as is usually done in practice. To switch between the constrained optimization formulation and our formulation (2) we use the following classical result from convex optimization [14, Theorem 1].

Lemma 1.

Let the objective and the penalty term be convex functions of the hypothesis. Then the following two optimization problems are equivalent:

(3)
(4)

where equivalence means that for each choice of the regularization parameter in one problem we can find a constraint level for the other such that both problems have the same solution, and vice versa.
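As a sanity check, Lemma 1 can be verified numerically on a one-dimensional quadratic; the objective, penalty and constants below are illustrative assumptions:

```python
import numpy as np

# Penalized problem: min (h-2)^2 + lam * h^2 has the closed-form
# minimizer h* = 2 / (1 + lam). Lemma 1 says the constrained problem
# min (h-2)^2 s.t. h^2 <= tau, with tau = (h*)^2, has the same solution.
lam = 0.5
h_star = 2.0 / (1.0 + lam)                 # penalized solution
tau = h_star ** 2                          # matching constraint level

# grid search over the feasible set of the constrained problem
grid = np.linspace(-3, 3, 600001)
feasible = grid[grid ** 2 <= tau]
h_constrained = feasible[np.argmin((feasible - 2) ** 2)]
```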

The next subsection introduces the sample complexity bound and shows how it can be used to give theoretical guarantees for the presented framework.

5.1 Sample Complexity Bounds

Sample complexity bounds for supervised learning typically use a notion of complexity of the hypothesis space to bound the worst-case difference between the estimated and the true risk. One well-known capacity notion is the VC-dimension [20], which allows for a distribution free, worst-case analysis of the complexity of the hypothesis space. Balcan and Blum [3] use this notion to derive sample complexity bounds for SSL. While their bound was given for the classification error and boolean unsupervised loss functions, we adapt it to be valid for all bounded supervised and unsupervised loss functions. To do that we use the notion of pseudo-dimension, an extension of the VC-dimension to real-valued loss functions and hypotheses [20, 15]. We briefly review the concept of pseudo-dimension.

Given a hypothesis, a threshold value and a labeled point, consider the indicator of whether the loss at that point exceeds the threshold; ignoring the label, we can define the analogous indicator for the unsupervised loss. This indicator is the 0-1 error we observe if we try to classify the point with the given hypothesis by thresholding the loss function. For a given sample, the following set then captures all possible 0-1 errors (and thus all possible labellings) when using these indicators for classification:

(5)

For brevity we use shorthand for these indicator families in the following. The pseudo-dimension of a hypothesis class with respect to a loss function is then defined as the VC-dimension of the corresponding family of threshold indicators.

These definitions let us state our first main theorem, which is a generalization of Balcan and Blum [3, Theorem 10] to bounded loss functions.

Theorem 1.

Let . Assume is a measurable loss function such that there exists a with for all and and let be a distribution. Furthermore let . Then an unlabeled sample of size

and a labeled sample of size

is sufficient to ensure that with probability at least the classifier that minimizes subject to satisfies

(6)
Proof.

The proof can be found in the supplementary material.

In the next subsection we show how to use this theorem to derive sample complexity bounds for MR. Before that, we remark on the assumption that the loss function is globally bounded. If the hypothesis class lies in a reproducing kernel Hilbert space, point evaluations of hypotheses are bounded in terms of their RKHS norm. If we restrict this norm by introducing a regularization term with respect to it, and assume that the loss is continuous, we can conclude that the loss is globally bounded on the class. In a way this can be seen as a justification to use an intrinsic regularization of the norm in addition to the regularization by the unsupervised loss, as only then do the guarantees of Theorem 1 apply. Using this bound together with Lemma 1 we can state the following corollary, which gives a PAC-style guarantee for our proposed framework.
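The boundedness argument can be illustrated numerically: for a function in the span of Gaussian kernel sections (where k(x, x) = 1), the reproducing property gives |f(x)| ≤ ||f||_H · sqrt(k(x, x)). A minimal sketch, with arbitrarily chosen data:

```python
import numpy as np

# For f = sum_i a_i k(x_i, .) in an RKHS, Cauchy-Schwarz on the
# reproducing property bounds |f(x)| by the RKHS norm of f times
# sqrt(k(x, x)); with a Gaussian kernel k(x, x) = 1, so a norm bound
# makes f globally bounded.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
a = rng.normal(size=20)

def k(u, v):                                # Gaussian kernel, bandwidth 1
    return np.exp(-((u - v) ** 2).sum() / 2)

K = np.array([[k(xi, xj) for xj in X] for xi in X])
norm_f = np.sqrt(a @ K @ a)                 # RKHS norm of f

def f(x):
    return sum(ai * k(xi, x) for ai, xi in zip(a, X))

tests = rng.normal(size=(100, 2))           # check the bound at random points
vals = np.array([abs(f(x)) for x in tests])
```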

Corollary 1.

Let and be convex supervised and an unsupervised loss function that fulfill the assumptions of Theorem 1. The semi-supervised solution as found by Equation (2) then satisfies the guarantees given in Theorem 1, so fulfills Inequality (6) when we plug it in for .

As our paper focuses mostly on MR, two remarks are in order. First, recall that in this setting the unsupervised loss is defined on pairs of points, so we gather unlabeled samples from the product space rather than from the input space, and accordingly need more unlabeled samples for the same bound. Second, the unsupervised loss does not necessarily take values in a fixed bounded interval, but it is clear that we can adapt the bounds for bounded unsupervised loss functions in the same way we did for bounded supervised loss functions.

5.2 Comparison to the Supervised Solution

In the SSL community it is well known that using SSL does not come without risk [10, Chapter 4]. It is therefore of particular interest how semi-supervised methods compare to purely supervised schemes. There are, however, many potential supervised methods to compare against. In the work of Ben-David et al. [7], Darnstädt et al. [11] and Globerson et al. [12] this problem is avoided by comparing to all possible supervised schemes. The framework introduced in this paper allows for a more fine-grained analysis, as the semi-supervision happens on top of an already existing supervised method. Thus for our framework it is natural to compare the sample complexity of the semi-supervised solution with that of its supervised counterpart. To do so we again use the language of sample complexities and state a simple lower bound on the sample complexity of the semi-supervised solution.

Theorem 2 ([17, p. 73]).

Let . Assume is a measurable loss function such that there exists a with for all and . Furthermore let

Then there exists a constant such that a labeled sample of size

is necessary to ensure that for all distributions with probability at least the classifier that minimizes subject to satisfies

We want to point out two things about this lower bound. First, note that Theorem 1 provided an upper bound on the sample complexity, i.e., a sample size that is sufficient for the guarantees, whereas the lower bound of Theorem 2 gives the number of labeled samples that is necessary for the same guarantees. Second, we place no requirement on the unlabeled sample size, because Theorem 2 is stated under the assumption that we have infinitely many unlabeled samples; as the theorem holds in that case, it certainly also holds when only a finite unlabeled sample is available. The following corollary captures the comparison to the supervised solution.

Corollary 2.

Assume the setting of Theorem 2 and assume that and are convex. There exists a universal constant such that the semi-supervised solution found in (2) needs at least a labeled sample size of

to guarantee the same performance as the supervised solution found in (1) which has access to a labeled sample of size

Proof.

The lower bound on the semi-supervised sample complexity is exactly Theorem 2, while the upper bound on the supervised solution is a classical result as found in Shalev-Shwartz and Ben-David [17, p. 73]. ∎

We can also interpret the corollary as follows. The worst-case difference in labeled sample consumption between the supervised and semi-supervised solutions depends only on the two universal constants and on the pseudo-dimensions of the two classes. In particular, the sample complexity of the supervised method exceeds that of the semi-supervised method by at most a constant factor, and this factor is finite whenever the pseudo-dimension is finite.

5.3 The Limits of Manifold Regularization

We now relate our result to the conjecture published by Shalev-Shwartz and Ben-David [17]: a SSL cannot learn faster by more than a constant (which may depend on the hypothesis class and the loss) than the supervised learner. Darnstädt et al. [11, Theorems 2 and 3] showed that this conjecture is essentially true for classes with finite VC-dimension and SSLs that do not make any distributional assumptions. Corollary 2 and the text thereafter show that this statement also holds for every SSL that falls in our proposed framework, and thus extends this result. Our result holds explicitly for SSLs that do make assumptions about the distribution: MR assumes that the labeling function behaves smoothly with respect to the underlying manifold.

6 Rademacher Complexity for Manifold Regularization

The analysis of the previous sections was distribution independent. We saw that the sample complexity difference between manifold regularized methods and purely supervised methods is at most a constant, which depends on the hypothesis class and the regularization parameter. We believe that in order to find out in which scenarios semi-supervised learning can help, it is useful to also look at distribution dependent complexity measures. For this we derive in this section computationally feasible upper and lower bounds on the Rademacher complexity of MR. We first review the work of Sindhwani et al. [19], who construct a kernel that corresponds to MR. Having this kernel we can use standard upper and lower bounds on the Rademacher complexity for RKHSs, as found for example in [9]. The analysis is thus similar to that of [18], who consider a co-regularization setting. In particular, [19] show the following, here informally stated, theorem.

Theorem 3 ([19, Propositions 2.1, 2.2]).

Let an RKHS with its inner product be given. As before, consider the unlabeled sample and the evaluation vectors of hypotheses on it, and let any inner product on these evaluation vectors be given. Define a new space containing the same functions, but with the inner product augmented by this data-dependent term. Then the new space is again an RKHS.

Assume now that the second inner product is induced by a positive definite matrix acting on the evaluation vectors. By choosing this matrix as the Laplacian matrix explained in Section 4, the norm of the new space automatically regularizes with respect to the data manifold given by the sample. We furthermore know the exact form of the kernel of the new space.

Theorem 4 ([19, Proposition 2.2]).

Let the base kernel, the gram matrix it induces on the sample, and the identity matrix of matching dimension be given. The kernel of the new space can then be written in closed form in terms of these quantities and the deforming matrix.
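A sketch of the modified gram matrix behind Theorems 3 and 4, reconstructed from [19] under the assumption that the deforming matrix is M = μL; the data, kernel and parameter below are illustrative, not the paper's code:

```python
import numpy as np

# On the sample itself, the deformed gram matrix takes the form
#   K_tilde = K - K (I + M K)^{-1} M K,
# which algebraically equals K (I + M K)^{-1}: the MR penalty shrinks
# the kernel along directions that vary over the data graph.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                        # base Gaussian gram matrix

W = np.exp(-sq / 2)                        # graph weights (same kernel here)
L = np.diag(W.sum(1)) - W                  # graph Laplacian, positive semidefinite
mu = 1.0
M = mu * L                                 # assumed deforming matrix

K_tilde = K - K @ np.linalg.solve(np.eye(40) + M @ K, M @ K)
```

The correction term is positive semidefinite, so the deformed gram matrix never exceeds the original one; this is what drives the complexity reduction in Corollary 3.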

This interpretation of MR is useful for deriving computationally feasible upper and lower bounds on the empirical Rademacher complexity, and thus distribution dependent complexity bounds. First, recall that the empirical Rademacher complexity of a hypothesis class, measured on the labeled input features, is defined as

where the signs are i.i.d. Rademacher random variables, i.e. uniform on {−1, +1}.

Theorem 5 ([9, p. 333]).

Let an RKHS with kernel be given and consider the ball of functions whose norm is at most a fixed radius. Given a sample, we can bound the empirical Rademacher complexity of this ball by

(7)
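For an RKHS ball the supremum inside the Rademacher complexity has a closed form for each fixed sign vector, which makes the trace-type bound (7) easy to check by Monte Carlo; the data, radius and Gaussian kernel are assumptions:

```python
import numpy as np

# For the ball {||f||_H <= B}, the sup over f for a fixed sign vector
# sigma is (B/n) * sqrt(sigma^T K sigma), so Jensen's inequality gives
#   R_hat <= (B/n) * sqrt(trace(K)).
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                        # Gaussian gram matrix
n, B = 25, 1.0

sigma = rng.choice([-1.0, 1.0], size=(2000, n))     # Rademacher draws
per_draw = (B / n) * np.sqrt(np.einsum('si,ij,sj->s', sigma, K, sigma))
rademacher_mc = per_draw.mean()            # Monte Carlo estimate of R_hat
upper_bound = (B / n) * np.sqrt(np.trace(K))
```

Replacing K by the deformed gram matrix of the previous theorems yields the corresponding bound for MR.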

With the previous two theorems we can give upper bounds on the Rademacher complexity of MR; in particular, we can also bound the maximal complexity reduction over supervised learning.

Corollary 3.

Let an RKHS be given and define the modified inner product as before, where the deforming matrix is positive definite and scaled by a regularization parameter. Then

(8)

Similarly we can also obtain a lower bound according to Inequality (7).

The corollary allows us to compute upper bounds on the Rademacher complexity for MR, and shows in particular that the difference between the Rademacher complexities of the supervised and the semi-supervised method is given by the data-dependent correction term in (8). This can be used for example to compute generalization bounds [15, Chapter 3]. We could also use the kernel to compute local Rademacher complexities, which may yield tighter generalization bounds [4]. We, however, will instead use the Rademacher complexity bounds to choose the regularization parameter without the need for an additional labeled validation set.

7 Experiment: Concentric circles

Figure 1: The behavior of the Rademacher complexity when using manifold regularization on the circle dataset with different regularization values.

We now illustrate how one can use the upper bound (8) on the Rademacher complexity for model selection; in particular, we can get an initial idea of how to choose the regularization parameter. The idea is to plot the Rademacher complexity bound against the regularization parameter, as in Figure 1. For increasing values the Rademacher complexity decreases, so how do we choose the optimal value? We propose a heuristic often used in clustering, the so-called 'elbow criterion' [8]: we look for a value such that increasing it further does not result in much additional reduction of the complexity. We test this idea on a dataset consisting of two concentric circles with 500 datapoints in the plane, 250 per circle; see also Figure 2. We use a Gaussian base kernel, and for the MR matrix we use the Laplacian matrix as before, where the weights are also computed with a Gaussian kernel. Note that the bandwidths have to be set carefully to capture the structure of the dataset, but this is not the concern of this paper; we assume we have already found a reasonable choice. We furthermore add a small L2-regularization that ensures that the norm radius in Inequality (8) is finite; its precise value plays a secondary role, as the behavior of the curve in Figure 1 remains the same. Details on the data generation process can be found in our code. Looking at Figure 1 we observe that for small regularization values the curve still drops steeply, while afterwards it starts to flatten out. We thus plot the resulting kernels for one value from each regime in Figure 2, showing the isolines of the kernel around a point of class one (the red dot in the figure). We indeed observe that for the small value we do not capture much structure yet, while for the larger value the two concentric circles are almost completely separated by the kernel.
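The selection procedure can be sketched as follows; the deformed-gram formula (reconstructed from [19] with M = μL), the toy data and the 5% drop threshold for the elbow are all assumptions for illustration:

```python
import numpy as np

# Compute the Rademacher upper bound (8) over a grid of regularization
# values mu and pick mu with an elbow-style rule: the last value whose
# step still reduces the bound noticeably.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                        # base gram matrix
W = np.exp(-sq / 2)
L = np.diag(W.sum(1)) - W                  # graph Laplacian

def bound(mu, B=1.0):
    M = mu * L
    K_t = K - K @ np.linalg.solve(np.eye(len(K)) + M @ K, M @ K)
    return (B / len(K)) * np.sqrt(max(np.trace(K_t), 0.0))

mus = np.logspace(-3, 2, 30)
curve = np.array([bound(mu) for mu in mus])

# elbow: last mu whose one-step relative drop still exceeds 5%
drops = -np.diff(curve) / curve[:-1]
steep = drops >= 0.05
elbow_mu = mus[1:][steep][-1] if steep.any() else mus[0]
```

Only unlabeled features enter this computation, which is why the procedure needs no labeled validation set.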

(a)
(b)
Figure 2: The resulting kernel when we use manifold regularization with two different values of the regularization parameter.

8 Discussion

We presented a distribution dependent and a distribution independent complexity analysis for manifold regularization. We note that improvements in terms of sample or Rademacher complexity do not immediately imply an improvement in performance, because the performance difference between supervised and semi-supervised learning depends on two things: first, on how the approximation error of the regularized hypothesis class compares to that of the original class, and second, on how much we reduce the complexity by switching from the original class to the regularized one. Our analysis addressed the second part, and we argue that our proposed framework is well suited for this, as the resulting SSL schemes have a natural supervised counterpart. The first part depends on a notion the literature often refers to as a semi-supervised assumption, which essentially states that we can learn as well with the regularized class as with the original one. Regarding our example of the two concentric circles, this would mean that each circle indeed corresponds to one class. The problem is that it is typically not known how to test efficiently whether the assumption holds without prior knowledge. The issue becomes clearer once one realizes that the assumption is a property of the joint distribution over features and labels, so one would need labeled samples to verify it. We speculate that this verification process would negate the sample savings one can achieve with SSL. We are not aware of much research in this direction, besides the work of Balcan et al. [2], which discusses the sample consumption of testing the so-called cluster assumption, and the work of Azizyan et al. [1], which analyzes the overhead of cross-validating the extra hyperparameter of their proposed semi-supervised learner.

Furthermore we note that one condition for the main finding in Corollary 1 is that the unsupervised loss function is convex. This is not the case, for example, for entropy regularization, which in fact has a concave unsupervised loss function. It is an open question whether we can give similar bounds for such functions; we are not aware of any simple way to adapt the presented theory to non-convex losses.

References

  • Azizyan et al. [2012] M. Azizyan, A. Singh, and L. A. Wasserman. Density-sensitive semisupervised inference. CoRR, abs/1204.1685, 2012.
  • Balcan et al. [2011] M. Balcan, E. Blais, A. Blum, and L. Yang. Active testing. CoRR, abs/1111.0897, 2011.
  • Balcan and Blum [2010] M.-F. Balcan and A. Blum. A discriminative model for semi-supervised learning. J. ACM, 57(3):19:1–19:46, Mar. 2010. ISSN 0004-5411. doi: 10.1145/1706591.1706599.
  • Bartlett et al. [2005] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. Ann. Statist., 33(4):1497–1537, 08 2005. doi: 10.1214/009053605000000282. URL https://doi.org/10.1214/009053605000000282.
  • Belkin and Niyogi [2008] M. Belkin and P. Niyogi. Towards a theoretical foundation for laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289 – 1308, 2008. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2007.08.006. Learning Theory 2005.
  • Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, Dec. 2006. ISSN 1532-4435.
  • Ben-David et al. [2008] S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? worst-case analysis of the sample complexity of semi-supervised learning. In Proceedings of the The 21st Annual Conference on Learning Theory (COLT-08), Helsinki, Finland, 2008.
  • Bholowalia and Kumar [2014] P. Bholowalia and A. Kumar. EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9):17–24, November 2014.
  • Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances, 2005.
  • Chapelle et al. [2010] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125, 9780262514125.
  • Darnstädt et al. [2013] M. Darnstädt, H. U. Simon, and B. Szörényi. Unlabeled data does provably help. In STACS, pages 185–196, 2013.
  • Globerson et al. [2017] A. Globerson, R. Livni, and S. Shalev-Shwartz. Effective semisupervised learning on manifolds. In S. Kale and O. Shamir, editors, COLT, volume 65 of JMLR Workshop and Conference Proceedings, pages 978–1003. JMLR.org, 2017.
  • Grandvalet and Bengio [2005] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In F. Denis, editor, CAP, pages 281–296. PUG, 2005.
  • Kloft et al. [2009] M. Kloft, U. Brefeld, P. Laskov, K.-R. Müller, A. Zien, and S. Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 997–1005. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning.pdf.
  • Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X, 9780262018258.
  • Niyogi [2013] P. Niyogi. Manifold regularization and semi-supervised learning: Some theoretical analyses. J. Mach. Learn. Res., 14(1):1229–1250, May 2013. ISSN 1532-4435.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135.
  • Sindhwani and Rosenberg [2008] V. Sindhwani and D. S. Rosenberg. An rkhs for multi-view learning and manifold co-regularization. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 976–983, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390279. URL http://doi.acm.org/10.1145/1390156.1390279.
  • Sindhwani et al. [2005] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semi-supervised learning. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 824–831, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102455. URL http://doi.acm.org/10.1145/1102351.1102455.
  • Vapnik [1998] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.