Discriminative Similarity for Clustering and Semi-Supervised Learning

09/05/2017 ∙ by Yingzhen Yang et al. ∙ National University of Singapore, Microsoft, University of Illinois at Urbana-Champaign

Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, so the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning. The proposed framework learns a classifier from each hypothetical labeling and searches for the optimal labeling by minimizing the generalization error of the classifier associated with that hypothetical labeling. A kernel classifier is employed in our framework. By generalization analysis via Rademacher complexity, the generalization error bound for the kernel classifier learned from a hypothetical labeling is expressed as the sum of pairwise similarity between the data from different classes, parameterized by the weights of the kernel classifier. Such pairwise similarity serves as the discriminative similarity for clustering and semi-supervised learning, and a discriminative similarity of similar form can also be induced by the integrated squared error bound for kernel density classification. Based on the discriminative similarity induced by the kernel classifier, we propose new clustering and semi-supervised learning methods.


1 Introduction

Similarity-based clustering and semi-supervised learning methods segment the data based on a similarity measure between data points. Regarding similarity-based clustering, spectral clustering (Ng et al., 2001) identifies clusters of complex shapes lying on low-dimensional manifolds via the normalized graph Laplacian of a data similarity matrix. The pairwise clustering method (Shental et al., 2003) uses a message-passing algorithm to infer cluster labels in a pairwise undirected graphical model whose pairwise potential function is constructed from a similarity matrix. K-means (Hartigan and Wong, 1979) searches for data clusters at a local minimum of the sum of within-cluster dissimilarities.

The representative similarity-based semi-supervised learning algorithm is label propagation (Zhu et al., 2003; Zhou et al., 2003), which is effective and widely used. With a predefined similarity graph, it determines the labels of unlabeled data by minimizing an objective defined over the similarity graph as the sum of the products of pairwise similarity and squared label difference. Label propagation therefore encourages local smoothness of the labels according to the edge weights of the similarity graph. The typical label propagation algorithm (Zhu et al., 2003) yields a harmonic solution that can also be interpreted in terms of random walks from unlabeled data to labeled data.
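As a concrete illustration, the following is a minimal sketch of the harmonic solution on a fixed similarity graph, assuming a dense symmetric similarity matrix and a one-hot label matrix for the labeled points; the function name and interface are hypothetical.

```python
import numpy as np

def harmonic_label_propagation(W, Y_labeled):
    """Harmonic solution of label propagation (Zhu et al., 2003) on a similarity graph.

    W          : (n, n) symmetric nonnegative similarity matrix; the first
                 Y_labeled.shape[0] points are the labeled ones.
    Y_labeled  : (l, c) one-hot label matrix for the labeled points.
    Returns an (n - l, c) matrix of soft label scores for the unlabeled points.
    """
    l = Y_labeled.shape[0]
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # unnormalized graph Laplacian
    L_uu = L[l:, l:]                  # unlabeled-unlabeled block
    W_ul = W[l:, :l]                  # unlabeled-labeled block
    # Minimizing f^T L f with the labeled rows clamped to Y_labeled
    # gives the harmonic solution f_U = L_uu^{-1} W_ul Y_L.
    return np.linalg.solve(L_uu, W_ul @ Y_labeled)
```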

The success of similarity-based learning methods highly depends on the underlying pairwise similarity over the data, which in most cases is constructed empirically, e.g. by a Gaussian kernel or a K-Nearest-Neighbor (KNN) graph. In this paper, we present a discriminative similarity learning framework for clustering and semi-supervised learning wherein the discriminative similarity is derived from the generalization error bound for a kernel classifier learned from hypothetical labeling. When the popular Support Vector Machines (SVMs) are used in this framework, the unsupervised SVM (Xu et al., 2004) is recovered in the unsupervised case, and Semi-Supervised or Transductive SVMs (Vapnik, 1998; Joachims, 1999; Chapelle et al., 2008) are recovered in the semi-supervised case. A kernel classifier motivated by similarity learning (Balcan et al., 2008; Cortes et al., 2013) is used in our framework. By generalization analysis via Rademacher complexity, the generalization error bound for the kernel classifier learned from hypothetical labeling is expressed as the sum of pairwise similarity between the data from different classes. Such pairwise similarity, parameterized by the weights of the learned kernel classifier, serves as the discriminative similarity induced by this generalization bound for clustering and semi-supervised learning. Although similarity is often used to quantify the local affinity between data, the term “discriminative similarity” here refers to a learned similarity that improves the discriminative capability of some classification method, such as the aforementioned kernel classifier. Moreover, we prove that a discriminative similarity of the same form can also be induced by the error bound for the integrated squared error of kernel density classification.

Our discriminative similarity learning framework is related to a class of discriminative clustering methods which classify unlabeled data by optimizing various measures defined on discriminative unsupervised classifiers, such as the generalization error (Xu et al., 2004) or the entropy of the posterior distribution of the label (Gomes et al., 2010). Discriminative clustering methods (Xu et al., 2004) predict the labels of unlabeled data by minimizing the generalization error bound for the unsupervised classifier. (Xu et al., 2004) propose Unsupervised SVMs, which learn a binary classifier that partitions unlabeled data with the maximum margin between different clusters. The theoretical properties of unsupervised SVMs are analyzed in (Karnin et al., 2012). (Gomes et al., 2010) learn a kernel logistic regression classifier regularized by the entropy of the posterior distribution of the class label.

The paper is organized as follows. We introduce the formulation of the discriminative similarity learning framework in Section 2, and derive the generalization error bound for the kernel classifier learned from hypothetical labeling in Section 3, where the discriminative similarity is induced by the error bound. Section 4 shows that a discriminative similarity can also be induced by kernel density classification, and Section 5 extends the analysis to general similarity-based classifiers. The application of the discriminative similarity learning framework to data clustering and semi-supervised learning is presented in Section 6, and we conclude the paper in Section 7. Throughout this paper, the term kernel stands for a positive semi-definite (PSD) kernel unless otherwise noted.

2 Discriminative Similarity Framework

2.1 Discriminative Similarity Framework for Clustering

The discriminative clustering literature (Xu et al., 2004; Gomes et al., 2010) has demonstrated the potential of multi-class classification for the clustering problem. Inspired by this natural connection between clustering and classification, we propose a framework of learning discriminative similarity for clustering by unsupervised classification, which models clustering as a multi-class classification problem: a classifier is learned from the training data built by a hypothetical labeling, which can be any possible cluster labeling. The optimal hypothetical labeling is taken to be the one whose associated classifier has the minimum generalization error bound. To study the generalization bound for the classifier learned from a hypothetical labeling, the concept of a classification model is needed. Given unlabeled data , a classification model is constructed for any hypothetical labeling as below:

Definition 1

The classification model corresponding to the hypothetical labeling for clustering is defined as . are the data labeled by the hypothetical labeling, and are assumed to be i.i.d. samples drawn from the joint distribution over the data and their class labels . is the family comprised of all such distributions, i.e. . is a classifier learned using the training data . The generalization error of the classification model is defined as the generalization error of the classifier in .

The optimal hypothetical labeling minimizes the generalization error bound for the classification model. With being different classifiers, different discriminative clustering models can be derived. When SVMs are used in the above discriminative model, the unsupervised SVM (Xu et al., 2004) is obtained. When two nonparametric classifiers, namely the nearest neighbor classifier and the plug-in classifier, are used in this framework, the unsupervised classification method in (Yang et al., 2014) is recovered.

2.2 Discriminative Similarity Framework for Semi-Supervised Learning

Similar to the case of clustering, the discriminative similarity framework for semi-supervised learning also searches for the optimal hypothetical labeling such that the associated classifier has minimum generalization error bound.

Suppose the data are comprised of a labeled set and an unlabeled set: the first points have labels for , and is the number of classes. In the following text, is the label of for , and for . Given any hypothetical labeling for the unlabeled data, the classification model for semi-supervised learning is defined as below:

Definition 2

The classification model corresponding to the hypothetical labeling for semi-supervised learning is defined as . are the data labeled by the hypothetical labeling, and are assumed to be i.i.d. samples drawn from the joint distribution over the data and their class labels . is the family comprised of all such distributions, i.e. . is a classifier learned using the training data . The generalization error of the classification model is defined as the generalization error of the classifier in .

Our discriminative similarity learning framework for semi-supervised learning searches for the optimal hypothetical labeling that minimizes the generalization error bound for the classification model defined above. Different discriminative semi-supervised learning models can be derived with being different classifiers. For example, when SVMs are used, a model with the same optimization problem as the well-known Semi-Supervised or Transductive SVMs (Vapnik, 1998; Joachims, 1999; Chapelle et al., 2008) is obtained. (Note that Transductive SVMs, despite the name, also learn an inductive rule.)

2.3 Discriminative Similarity induced by Kernel Classifier

We employ a kernel classifier in the discriminative similarity learning framework proposed in Section 2.1 and Section 2.2, so as to induce a discriminative similarity parameterized by the learnable weights of the kernel classifier. The kernel classifier is designed based on similarity learning methods. Balcan et al. (2008) propose a classification method using general similarity functions: the classification rule measures the similarity of a test point to each class and assigns the test point to the class for which the weighted average similarity between the test point and the training data of that class is maximal. In (Cortes et al., 2013), a kernel function is used as the similarity function, and the generalization error of a kernel classifier is derived using a properly defined kernel margin, where the classifier uses the average instead of a weighted average when computing the point-to-class similarity. Inspired by these similarity learning methods, we propose the hypothesis to measure the similarity of a datum to a class , using the kernel as the similarity function:

(1)

where is the isotropic Gaussian kernel with bandwidth (with the constant that makes the kernel integrate to one omitted), and are the nonnegative weights that sum up to . In contrast to the expectation-based similarity in (Balcan et al., 2008), the hypothesis is a finite-sample-based similarity between the datum and the class , which leads to a tractable optimization problem, as shown in the next section. The similarity-based kernel classifier predicts the label of a datum as the one for which the point-to-class similarity is maximized, i.e. .
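For illustration, a minimal sketch of this point-to-class similarity and the resulting prediction rule is given below, assuming the hypothesis is a weighted sum of Gaussian kernel values over the training points of each class; the normalization and variable names are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def gaussian_kernel(x, xj, sigma):
    # Isotropic Gaussian kernel with bandwidth sigma (normalizing constant omitted).
    return np.exp(-np.sum((x - xj) ** 2) / (2.0 * sigma ** 2))

def point_to_class_similarity(x, X_train, y_train, alpha, sigma, cls):
    # Point-to-class similarity in the spirit of Eq. (1): a weighted sum of kernel
    # similarities between x and the training points hypothetically labeled `cls`.
    idx = np.where(y_train == cls)[0]
    return sum(alpha[j] * gaussian_kernel(x, X_train[j], sigma) for j in idx)

def kernel_classifier_predict(x, X_train, y_train, alpha, sigma):
    # Predict the class whose point-to-class similarity is largest.
    classes = np.unique(y_train)
    scores = [point_to_class_similarity(x, X_train, y_train, alpha, sigma, c)
              for c in classes]
    return classes[int(np.argmax(scores))]
```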

The generalization error bound for the kernel classifier is derived in the following section. Note that we only need to derive the error bound for the kernel classifier in the discriminative similarity framework for clustering: the error bound for clustering also applies to semi-supervised learning, since a certain number of given labels does not affect the derivation of the generalization bound.

3 Discriminative Similarity from Generalization Bound for Kernel Classifier

In this section, the generalization error bound for the classification model in Definition 1 and Definition 2 with the kernel classifier is derived as a sum of discriminative similarity between the data from different classes.

To analyze the generalization bound for the kernel classifier , the following notations are introduced. Let be the nonzero weights that sum up to , and let be a column vector representing the weights belonging to class , such that is if and otherwise. The margin of the labeled sample is defined as ; the sample is classified correctly if . We then derive the generalization error bound for using the Rademacher complexity of the function class comprised of all the possible margin functions . The Rademacher complexity (Bartlett and Mendelson, 2003; Koltchinskii, 2001) of a function class is defined below:

Definition 3

Let be i.i.d. random variables such that . The Rademacher complexity of a function class is defined as

(2)
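For reference, one standard form of the Rademacher complexity, following Bartlett and Mendelson (2003), is the following; the constants in the paper's Eq. (2) may differ.

```latex
\mathcal{R}_n(\mathcal{F})
  \;=\;
  \mathbb{E}_{\boldsymbol{\sigma},\,\mathbf{x}}
  \Bigl[\,
    \sup_{f \in \mathcal{F}}
    \frac{2}{n}\sum_{i=1}^{n} \sigma_i f(\mathbf{x}_i)
  \Bigr],
  \qquad
  \Pr[\sigma_i = +1] = \Pr[\sigma_i = -1] = \tfrac{1}{2}.
```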

Let be the gram matrix of the data by the kernel with . Based on the generalization analysis of Koltchinskii and Panchenko (2002), Theorem 1 presents the generalization error bound for the unsupervised kernel classifier in terms of the empirical error and the Rademacher complexity of the function class . Inspired by the regularization on multiclass kernel-based vector machines (Crammer and Singer, 2001), we propose the regularization term , which is required to be bounded by for some . Denote by the space of all the hypotheses associated with label , i.e.

(3)

and define the hypothesis space . Lemma 1 shows that the Rademacher complexity of this function class is bounded by the regularization term with high probability:

Lemma 1

Define . When where is a positive constant, with probability at least over the data , the Rademacher complexity of the function space satisfies

(4)

With the bounded Rademacher complexity of the function class , Theorem 1 presents the generalization error bound for the kernel classifier in the discriminative similarity learning framework.

Theorem 1

(Error of the Kernel Classifier) Given the classification model in Definition 1 or Definition 2, if , then with probability over the labeled data with respect to any distribution in , the generalization error of the kernel classifier satisfies

(5)

where is the empirical error of on the labeled data, is a constant and is defined as

(6)

Moreover, if , the empirical error is

(7)

With the error bound in Theorem 1, searching for the optimal hypothetical labeling amounts to solving the following optimization problem

(8)

where is the feasible set for the weights of the kernel classifier, and is a weighting parameter. Similar to the optimization problem of SVMs (Vapnik, 1998; Joachims, 1999), the balancing parameter balances the empirical error and the regularization term. Substituting into (8),

(9)

where

(10)

is tuned such that , e.g. . The first term of the objective function (9) is , which is the sum of similarity between the data from different classes, with being the discriminative similarity between and induced by the generalization error bound (5). In the next section, it is shown that a discriminative similarity of the same form as that induced by our generalization analysis, namely (10), can also be induced from the perspective of kernel density classification by kernel density estimators with nonuniform weights, which further supports the theoretical justification of the induced discriminative similarity (10).
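For concreteness, the between-class similarity term, i.e. the first term of objective (9), can be evaluated as follows for a given similarity matrix and hypothetical labeling; the sketch assumes the similarity matrix has already been computed from the classifier weights.

```python
import numpy as np

def between_class_similarity(S, y):
    """Sum of pairwise similarity S_ij over pairs (i, j) carrying different labels,
    i.e. the first term of objective (9) for a given hypothetical labeling.

    S : (n, n) similarity matrix (e.g. the discriminative similarity of Eq. (10)).
    y : (n,) hypothetical cluster / class labels.
    """
    different = (y[:, None] != y[None, :])   # True where i and j have different labels
    return float(np.sum(S[different]))

# Equivalently, with an (n, c) one-hot membership matrix F for the labeling, the same
# quantity equals np.sum(S) - np.trace(F.T @ S @ F), which is the graph-cut form
# underlying the relaxation in Section 6.
```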

4 A Kernel Density Classification Perspective

The discriminative similarity can also be induced from kernel density classification with varying weights on the data; binary classification is considered in this section. For any classification model with hypothetical labeling and the labeled data , suppose the joint distribution over has probability density function . Let be the induced marginal distribution over the data with probability density function . Robust kernel density estimation methods (Girolami and He, 2003; Kim and Scott, 2008; Mahapatruni and Gray, 2011; Kim and Scott, 2012) suggest the following kernel density estimator, where the kernel contributions of different data points are reflected by different nonnegative weights that sum up to :

(11)

where . Based on (11), it is straightforward to obtain the following kernel density estimator of the density function :

(12)

The kernel density classifier is learned from the labeled data and constructed from the kernel density estimators (12). It resembles the Bayes classifier: it classifies the test data based on the conditional label distribution , or equivalently, is assigned to class if and otherwise to class . Intuitively, it is preferable that the decision function be close to the true Bayes decision function . (Girolami and He, 2003; Kim and Scott, 2008) propose the Integrated Squared Error (ISE) as the metric measuring the distance between the kernel density estimators and their true counterparts, and an oracle inequality is obtained in (Kim and Scott, 2008) relating the performance of the classifier to the best possible performance of a kernel density classifier in the same category. ISE is adopted in our analysis of kernel density classification, and the ISE between the decision function and the true Bayes decision function is defined as

(13)
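A minimal sketch of the weighted kernel density estimators (11)-(12) and the resulting binary kernel density classifier follows; whether the decision rule additionally weights the class-conditional estimates by class priors is an assumption here, and the helper names are hypothetical.

```python
import numpy as np

def weighted_kde(x, X, w, sigma):
    # Weighted Gaussian kernel density estimate at x (cf. Eq. (11)):
    # a weighted combination of kernels centered at the points in X.
    d = X.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2)) / norm
    return float(np.dot(w, k))

def kernel_density_classify(x, X, y, w, sigma):
    # Binary kernel density classifier built from the class-wise estimators (12):
    # assign x to class +1 if its weighted density estimate dominates, else -1.
    f_pos = weighted_kde(x, X[y == +1], w[y == +1], sigma)
    f_neg = weighted_kde(x, X[y == -1], w[y == -1], sigma)
    return +1 if f_pos >= f_neg else -1
```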

The upper bound for the ISE also induces a discriminative similarity between the data from different classes, as presented in the following theorem.

Theorem 2

Let and . With probability at least over the labeled data , the ISE between the decision function and the true Bayes decision function satisfies

(14)

where

(15)
(16)

and is the gram matrix evaluated on the data with the kernel .

Let be a weighting parameter; then the cost function , designed according to the empirical term and the regularization term in the ISE error bound (14), can be expressed as

wherein the first term is comprised of the sum of similarity between data from different classes with similarity , and is the discriminative similarity induced by the ISE bound for kernel density classification. Note that has the same form as the discriminative similarity (10) induced by the kernel classifier in our discriminative similarity learning framework, up to a scaling constant and the choice of the balancing parameter.

5 Extension to General Similarity-Based Classifier

We now consider using a general symmetric and continuous function , not necessarily a PSD kernel, as the similarity function in the classification model in Definition 1 or Definition 2. The hypothesis becomes . In order to analyze the generalization property of the classification rule using the general similarity function, we first investigate the properties of a general similarity function and its relationship to PSD kernels in terms of the eigenvalues and eigenfunctions of the associated integral operator. The integral operator is well defined. It can be verified that is a compact operator since is continuous. According to the spectral theorem in operator theory, there exists an orthogonal basis of comprised of the eigenfunctions of , where is the space of measurable functions defined over that are square Lebesgue integrable. is an eigenfunction of with eigenvalue if . The following lemma shows that, under certain assumptions on the eigenvalues and eigenfunctions of , a general symmetric and continuous similarity can be decomposed into two PSD kernels.

Lemma 2

Suppose is a symmetric continuous function, and and are the eigenvalues and eigenfunctions of respectively. Suppose for some constant . Then for any , and it can be decomposed as the difference between two positive semi-definite kernels, namely with

(17)
Remark 1

If is a kernel, i.e. all of the eigenvalues of are nonnegative, then is bounded by a constant (Cucker and Zhou, 2007). Also, when is square Lebesgue integrable, . Lemma 2 allows for a general similarity function with negative eigenvalues as long as is bounded.
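A finite-sample illustration of the decomposition in Lemma 2: for a symmetric similarity matrix evaluated on the data, splitting the spectrum into its positive and negative parts yields the two PSD components. This is a sketch of the idea at the matrix level, not the operator-level statement proved in the lemma.

```python
import numpy as np

def split_indefinite_similarity(S):
    """Write a symmetric similarity matrix S as S = S_plus - S_minus with both
    parts positive semi-definite, by separating positive and negative eigenvalues."""
    S = (S + S.T) / 2.0                        # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    pos = np.clip(eigvals, 0.0, None)          # positive part of the spectrum
    neg = np.clip(-eigvals, 0.0, None)         # magnitude of the negative part
    S_plus = eigvecs @ np.diag(pos) @ eigvecs.T
    S_minus = eigvecs @ np.diag(neg) @ eigvecs.T
    return S_plus, S_minus                     # S equals S_plus - S_minus (up to numerics)
```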

Resembling the case in Section 3 where the kernel serves as the similarity function, we use the regularization term to bound the Rademacher complexity for the classification rule using a general similarity function. Let and with and . Similar to the analysis of the kernel classifier, the space of all the hypotheses associated with label is defined as

(18)

with positive numbers and bounding and respectively. Let the margin function be and the hypothesis space be ; the general similarity-based classifier predicts the label of a datum by . We then present the main results of this section, which bound the Rademacher complexity of the hypothesis space, i.e. , and the generalization error of the unsupervised general similarity-based classifier .

Lemma 3

Suppose the assumptions in Lemma 2 hold. Define and . When , for positive constants and , , for some , then with probability at least over the data , the Rademacher complexity of the class satisfies

(19)
Theorem 3

(Error of the General Similarity-Based Classifier) Given the classification model in Definition 1 or Definition 2, if , , then with probability over the labeled data with respect to any distribution in , under the assumptions of Lemma 2 and Lemma 3 on , and , the generalization error of the general classifier satisfies

(20)

where is the empirical error of on the labeled data, is a constant and is defined in (6). Moreover, if , the empirical error is

(21)
Remark 2

When is a kernel, e.g. , it can be verified that , , so we can let , , and it follows that the dominant term in the error bound (20) for a general similarity function reduces to , the dominant term in the bound (5) for the kernel classifier (with ).

Remark 3

When the decomposition exists with and being PSD kernels, is the kernel of some Reproducing Kernel Kreĭn Space (RKKS) (Mary, 2003). (Ong et al., 2004; Loosli et al., 2016) analyze the problem of learning SVM-style classifiers with indefinite kernels in a Kreĭn space. However, their work shows neither when and how a general indefinite similarity function admits a PSD decomposition, nor a generalization analysis for a similarity-based classifier using such a general indefinite function as the similarity measure. Our analysis addresses both problems in Lemma 2 and Theorem 3.

Minimizing the above bound for a general continuous similarity leads to the formulation that minimizes , i.e.

(22)

where is the weighting parameter for the regularization term , and

(23)

is the discriminative similarity between data from different classes, which is induced by the generalization error bound for the general similarity-based classifier . When is a kernel, , , then reduces to in (10), the similarity induced by the kernel classifier.

Remark 4 (Similarity Machines: SVM-Type Classifier with General Similarity Function)

Traditional Kernel Support Vector Machines (Kernel SVMs), one of the most representative kernel machine methods, map the data into an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) associated with the chosen kernel and learn a max-margin linear classifier in that RKHS. In most cases the kernel is a Mercer kernel, i.e. symmetric, continuous and positive semi-definite. The past several decades have witnessed the great success of SVMs, built on the solid theoretical foundation of statistical learning theory, with broad applications across machine learning and pattern recognition. However, the requirement of positive semi-definiteness substantially restricts SVMs from learning max-margin classifiers with a general similarity function that is not necessarily a Mercer kernel. Based on Lemma 2, we can propose Similarity Machines as a generalization of Kernel SVMs: a framework for learning a maximum-margin classifier with a general similarity function that is symmetric but not necessarily positive semi-definite. Similarity Machines have a generalization error bound that reduces to the canonical error bound for Kernel SVMs when the similarity function is in fact a PSD kernel. The parameters of Similarity Machines can be obtained by minimizing an objective function based on this error bound. More details about the generalization analysis of Similarity Machines are included in the appendix of this paper.

6 Applications

In this section, we present new clustering and semi-supervised learning methods using the discriminative similarity induced by the kernel classifier.

6.1 Application to Data Clustering

We propose a novel data clustering method named Clustering by Discriminative Similarity via Kernel classification (CDSK), which is based on our discriminative similarity learning framework with the kernel classifier. CDSK aims to minimize (9). However, problem (9) involves minimization with respect to discrete cluster labels, which is NP-hard. In addition, it can produce a trivial solution that puts all the data in a single cluster, due to the lack of constraints on cluster balance. Therefore, (9) is relaxed to the proposed optimization problem for CDSK below:

(24)

where , , is the graph Laplacian computed with , is a diagonal matrix with each diagonal element being the sum of the corresponding row of : , and is an identity matrix. Note that when each column of is a binary membership indicator vector for the corresponding cluster, . Similar to spectral clustering (Ng et al., 2001), the constraint prevents imbalanced data clusters.

Problem (24) is optimized by coordinate descent. In each iteration, the optimization with respect to is performed with fixed , which is exactly the spectral clustering problem, whose solution is formed by the smallest eigenvectors of the normalized graph Laplacian ; then the optimization with respect to is performed with fixed , which is a standard constrained quadratic programming problem. The coordinate descent iterations proceed until convergence or until the maximum number of iterations is reached.
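A hedged sketch of this coordinate descent is given below. The parameterization of the discriminative similarity (proportional to the product of the two weights and the kernel value) and the form of the α-step regularizer are assumptions standing in for Eq. (10) and the quadratic program of problem (24); the F-step follows the spectral-clustering solution described above.

```python
import numpy as np
from scipy.optimize import minimize

def discriminative_similarity(K, alpha):
    # Hypothetical parameterization of the similarity in Eq. (10):
    # S_ij proportional to alpha_i * alpha_j * K_ij (the paper's exact scaling may differ).
    return np.outer(alpha, alpha) * K

def normalized_laplacian(S):
    # Symmetrically normalized graph Laplacian L = I - D^{-1/2} S D^{-1/2}.
    d = np.maximum(S.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(S.shape[0]) - D_inv_sqrt @ S @ D_inv_sqrt

def cdsk(K, n_clusters, lam=1.0, n_iters=10):
    """Coordinate-descent sketch for problem (24): alternate a spectral F-step
    and a constrained update of the kernel weights alpha."""
    n = K.shape[0]
    alpha = np.full(n, 1.0 / n)                  # weights on the simplex
    F = None
    for _ in range(n_iters):
        # F-step (alpha fixed): smallest eigenvectors of the normalized Laplacian,
        # exactly as in spectral clustering.
        L = normalized_laplacian(discriminative_similarity(K, alpha))
        _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
        F = vecs[:, :n_clusters]
        # alpha-step (F fixed): a stand-in for the paper's constrained quadratic
        # program; minimize tr(F^T L(alpha) F) + lam * ||alpha||^2 over the simplex.
        def objective(a):
            La = normalized_laplacian(discriminative_similarity(K, a))
            return np.trace(F.T @ La @ F) + lam * float(np.dot(a, a))
        constraints = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
        result = minimize(objective, alpha, method='SLSQP',
                          bounds=[(0.0, None)] * n, constraints=constraints)
        alpha = result.x
    return F, alpha
```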

6.2 Application to Semi-Supervised Learning

We also propose a new semi-supervised learning method, Label Propagation by Discriminative Similarity via Kernel classification (LPDSK). The formulation of LPDSK is

(25)

where is a matrix of with its elements set by the given labels, i.e. if has label for , otherwise .

Similar to the case of clustering, problem (25) is also optimized by coordinate descent. In each iteration, the optimization with respect to is performed with fixed . With the block representation , , where and are of size , is of size and is of size , it can be verified that this subproblem admits a closed-form solution when is invertible; the optimization with respect to is the same as in the clustering problem (24). The coordinate descent iterations proceed until convergence or until the maximum number of iterations is reached.
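The F-subproblem admits the same harmonic-style closed form sketched in Section 1; the following assumes the LPDSK objective is the label-propagation quadratic over the learned similarity graph, with the block partition described above, and uses hypothetical names.

```python
import numpy as np

def lpdsk_f_step(S, Y_labeled):
    """Closed-form F-subproblem of LPDSK, assuming a label-propagation quadratic
    over the learned similarity S (a sketch; the paper's exact objective may differ).

    S          : (n, n) discriminative similarity matrix for the current weights.
    Y_labeled  : (l, c) label matrix fixed by the given labels (first l points labeled).
    Returns the (n - l, c) block of F for the unlabeled points.
    """
    l = Y_labeled.shape[0]
    L = np.diag(S.sum(axis=1)) - S            # graph Laplacian of S
    L_uu, L_ul = L[l:, l:], L[l:, :l]         # block partition described above
    # F_U = -L_uu^{-1} L_ul Y_L, valid whenever L_uu is invertible.
    return -np.linalg.solve(L_uu, L_ul @ Y_labeled)
```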

7 Conclusions

We propose a novel discriminative similarity learning framework wherein the discriminative similarity is induced by the generalization error bound for the classifier learned from hypothetical labeling, and the optimal hypothetical labeling is pursued by minimizing the generalization bound for the associated classifier. A kernel classifier is employed in the proposed framework. The learnable weights in the discriminative similarity induced by the kernel classifier allow for an adaptive similarity that accommodates the local variation of the data. We also analyze the generalization property of a similarity-based classifier with a general continuous similarity function rather than a PSD kernel. We present new clustering and semi-supervised learning methods based on the discriminative similarity learning framework with the kernel classifier, namely Clustering by Discriminative Similarity via Kernel classification (CDSK) and Label Propagation by Discriminative Similarity via Kernel classification (LPDSK).

References

  • Balcan et al. [2008] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008. ISSN 0885-6125. doi: 10.1007/s10994-008-5059-5.
  • Bartlett and Mendelson [2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, March 2003. ISSN 1532-4435.
  • Chapelle et al. [2008] Olivier Chapelle, Vikas Sindhwani, and Sathiya S. Keerthi. Optimization techniques for semi-supervised support vector machines. J. Mach. Learn. Res., 9:203–233, June 2008. ISSN 1532-4435.
  • Cortes et al. [2013] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 46–54, 2013.
  • Crammer and Singer [2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
  • Cucker and Zhou [2007] Felipe Cucker and Ding-Xuan Zhou. Learning theory : an approximation theory viewpoint. Cambridge monographs on applied and computational mathematics. Cambridge University Press, Cambridge, New York, 2007. ISBN 978-0-521-86559-3.
  • Girolami and He [2003] Mark Girolami and Chao He. Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell., 25(10):1253–1264, 2003. doi: 10.1109/TPAMI.2003.1233899.
  • Gomes et al. [2010] Ryan Gomes, Andreas Krause, and Pietro Perona. Discriminative clustering by regularized information maximization. In NIPS, pages 775–783, 2010.
  • Hartigan and Wong [1979] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
  • Joachims [1999] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2.
  • Karnin et al. [2012] Zohar Karnin, Edo Liberty, Shachar Lovett, Roy Schwartz, and Omri Weinstein. Unsupervised SVMs: on the complexity of the furthest hyperplane problem. Journal of Machine Learning Research - Proceedings Track, 23:2.1–2.17, 2012.
  • Kim and Scott [2008] JooSeuk Kim and Clayton D. Scott. Performance analysis for l_2 kernel classification. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 833–840, 2008.
  • Kim and Scott [2012] JooSeuk Kim and Clayton D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, 2012.
  • Koltchinskii and Panchenko [2002] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1):1–50, 02 2002. doi: 10.1214/aos/1015362183.
  • Koltchinskii [2001] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001. doi: 10.1109/18.930926.
  • Koltchinskii and Panchenko [2000] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In Evarist Giné, David M. Mason, and Jon A. Wellner, editors, High Dimensional Probability II, volume 47 of Progress in Probability, pages 443–457. Birkhäuser Boston, 2000. ISBN 978-1-4612-7111-6. doi: 10.1007/978-1-4612-1358-1_29.
  • Loosli et al. [2016] Gaëlle Loosli, Stéphane Canu, and Cheng Soon Ong. Learning SVM in kreĭn spaces. IEEE Trans. Pattern Anal. Mach. Intell., 38(6):1204–1216, 2016. doi: 10.1109/TPAMI.2015.2477830.
  • Mahapatruni and Gray [2011] Ravi Sastry Ganti Mahapatruni and Alexander G. Gray. CAKE: convex adaptive kernel density estimation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 498–506, 2011.
  • Mary [2003] Xavier Mary. Hilbertian subspaces, subdualities and applications. Ph.D. Dissertation, Institut National des Sciences Appliquees Rouen, 2003.
  • Ng et al. [2001] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.
  • Ong et al. [2004] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with non-positive kernels. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, 2004. doi: 10.1145/1015330.1015443.
  • Shental et al. [2003] Noam Shental, Assaf Zomet, Tomer Hertz, and Yair Weiss. Pairwise clustering and graphical models. In NIPS, 2003.
  • Vapnik [1998] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
  • Xu et al. [2004] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In NIPS, 2004.
  • Yang et al. [2014] Yingzhen Yang, Feng Liang, Shuicheng Yan, Zhangyang Wang, and Thomas S. Huang. On a theory of nonparametric pairwise similarity for clustering: Connecting clustering to classification. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 145–153, 2014.
  • Zhou et al. [2003] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], 2003.
  • Zhu et al. [2003] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 912–919, 2003.

8 Appendix

8.1 Proof of Lemma 1

Inspired by Koltchinskii and Panchenko (2002), we first prove that the Rademacher complexity of the function class formed by the maximum of several hypotheses is bounded by twice the sum of the Rademacher complexities of the function classes that these hypotheses belong to, i.e.

(26)

where for . If no confusion arises, the notations are omitted in the subscript of the expectation operator in the following text, i.e. is abbreviated to . According to Theorem of Koltchinskii and Panchenko (2002), it can be verified that

Therefore,