Multiple Kernel k-Means Clustering by Selecting Representative Kernels

11/01/2018 ∙ by Yaqiang Yao, et al. ∙ USTC

To cluster data that are not linearly separable in the original feature space, k-means clustering has been extended to a kernel version. However, the performance of kernel k-means clustering largely depends on the choice of kernel function. To mitigate this problem, multiple kernel learning has been introduced into k-means clustering to obtain an optimal kernel combination for clustering. Despite the success of multiple kernel k-means clustering in various scenarios, few of the existing works update the combination coefficients based on the diversity of kernels, with the result that the selected kernels contain high redundancy, which degrades the clustering performance and efficiency. In this paper, we propose a simple but efficient strategy that selects a diverse subset of the pre-specified kernels as the representative kernels, and then incorporates the subset selection process into the framework of multiple kernel k-means clustering. The representative kernels are indicated by significant combination weights. Due to the non-convexity of the resulting objective function, we develop an alternating minimization method that optimizes the combination coefficients of the selected kernels and the cluster membership alternately. We evaluate the proposed approach on several benchmark and real-world datasets. The experimental results demonstrate the competitiveness of our approach in comparison with state-of-the-art methods.


Introduction

As one of the major topics in the machine learning and data mining communities, clustering aims to group a set of samples into several clusters such that samples within the same cluster are more similar to each other than samples from different clusters [Hartigan1975]. The most commonly used clustering methods in practice are k-means and its soft version, i.e., the Gaussian mixture model. In particular, after the cluster centers are initialized, k-means clustering alternates between two steps, membership assignment of samples and update of cluster centers, until convergence is reached. Due to its simplicity, efficiency, and interpretability, k-means clustering has been greatly developed in recent years in both computational and theoretical aspects [Ding et al.2015, Newling and Fleuret2016, Georgogiannis2016].
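To make the two alternating steps concrete, here is a minimal NumPy sketch of a Lloyd-style k-means loop; the random-sampling initialization and the iteration cap are illustrative choices for this sketch, not part of any particular reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate membership assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # random initial centers
    for _ in range(n_iter):
        # Step 1: assign each sample to its nearest center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        # Step 2: recompute each center as the mean of its members.
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):                    # converged
            break
        centers = new_centers
    return labels, centers
```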

As with most machine learning algorithms, k-means clustering has been extended to a kernel version by mapping data into a high-dimensional feature space with the kernel trick [Girolami2002]. In this way, kernel k-means can handle data that are not linearly separable in the original feature space. The cluster structure obtained with k-means and its kernel version is closely related to the initialization, and inappropriate initial cluster centers may trap the sum-of-squares minimization in a poor local minimum. Fortunately, the original optimization problem can be formulated as a constrained trace minimization problem and optimized with the eigenvalue decomposition of the associated matrix [Schölkopf, Smola, and Müller1998, Ding and He2004]. On the other hand, similar to other kernel methods, the performance of kernel k-means clustering largely depends on the choice of the kernel function. However, the most suitable kernel for a particular task is usually unknown in advance.

In most real-world applications, samples are characterized by features from multiple groups. For example, flowers can be classified based on three different features: shape, color, and texture [Nilsback and Zisserman2006], and web pages can be represented by their content and the texts of inbound links [Bickel and Scheffer2004]. These features differ in attributes, scales, etc., and provide complementary views for the representation of datasets. Therefore, rather than concatenating different views into one or simply using a single view, it is preferable to integrate the distinctive views optimally through learning algorithms, which is known as multi-view learning or multiple kernel learning. In the clustering literature, the existing work on combination strategies for data integration falls into two categories: multi-view clustering and multiple kernel clustering.

Related Work

Multi-view clustering attempts to obtain consistent cluster structures from different views [Bickel and Scheffer2004, Chaudhuri et al.2009, Kumar and Daumé2011, Wang, Nie, and Huang2013, Chao, Sun, and Bi2017]. In [Bickel and Scheffer2004], multi-view versions of clustering approaches, including k-means, expectation maximization, and hierarchical agglomerative methods, are studied for document clustering to demonstrate their advantages over single-view counterparts. The work in [Kumar and Daumé2011] proposes to constrain the similarity graph from one view with the spectral embedding from the other view in the framework of spectral clustering, following the idea of co-training. Based on canonical correlation analysis, [Chaudhuri et al.2009] presents a simple subspace learning method for multi-view clustering under the natural assumption that different views are uncorrelated given the cluster label. In view of the limitation that most existing work on data fusion assumes the same weight for all features from one source, [Wang, Nie, and Huang2013] provides a novel framework for multi-view clustering that learns a weight for each individual feature via structured sparsity regularization.

Following the central idea of multiple kernel learning, in which multiple kernels with different similarity measurements are combined to obtain an optimal linear or non-linear kernel combination [Gönen and Alpaydın2011], multiple kernel clustering applies the combined kernel to clustering tasks with multi-view data, since each kernel naturally corresponds to one view [Zhao, Kwok, and Zhang2009, Huang, Chuang, and Chen2012, Lu et al.2014, Liu et al.2016, Wang et al.2017, Zhu et al.2018]. For example, [Zhao, Kwok, and Zhang2009] proposes a multiple kernel version of maximum margin clustering, which searches for the cluster labeling, the maximum margin hyperplane, and the optimal combined kernel simultaneously; the resulting non-convex optimization problem is solved with a variant of the cutting plane algorithm. Based on a kernel evaluation measure, centered kernel alignment, [Lu et al.2014] integrates the clustering task into the framework of multiple kernel learning. Considering the correlation between different kernels, the work in [Liu et al.2016] adds a matrix-induced regularization term to the objective of multiple kernel clustering to reduce the redundancy of kernels. In [Wang et al.2017], a deep neural network is utilized to approximate the generation of multiple kernels and the optimization process, which makes multiple kernel clustering applicable to large-scale problems.

We focus on multiple kernel clustering in this paper. Although considerable effort has been devoted over the past years to improving the efficiency and robustness of multiple kernel clustering, two major problems remain in the existing work. First, few methods consider the dissimilarity between kernels; in other words, the combination coefficients of the kernels are updated independently, so the selected kernels may contain high redundancy. Second, none of them models the sparsity of the combination coefficients based on the diversity of kernels. Owing to the norm constraint imposed on the combination weights, the coefficients of kernels with low dissimilarity would be reduced undesirably, which could exaggerate the importance of inappropriate kernels. Selecting a diverse subset of the pre-specified kernels mitigates both problems and enhances the quality of the combined kernel.

Our Contributions

Motivated by the representatives used in dissimilarity-based sparse subset selection [Zhou and Zhao2016, Elhamifar, Sapiro, and Sastry2016], we propose a new approach to multiple kernel clustering with representative kernels. A subset of the base kernels, termed representative kernels, is selected and integrated to construct the optimal kernel combination. The key insight of the proposed approach is that all pre-specified kernels can be characterized by the representative kernels. In particular, if one kernel selects another kernel as its representative, the similarity measurements encoded in the two kernels are closely related. By imposing the constraint that only some of the kernels are selected, together with a diversity regularization, we obtain a subset of kernels that is smaller than the set of pre-specified kernels. In addition, the number of representative kernels is determined by the training data automatically. In contrast to the previous work in [Liu et al.2016], which imposes a matrix-induced regularization to reduce the risk of simultaneously assigning large weights to pairs of highly correlated kernels, our approach introduces a new strategy in which each base kernel can be encoded (represented) by the other kernels, and minimizes the total encoding cost. As a result, the obtained representative kernels form a sparse and diverse subset of the pre-specified kernels, owing to the implicit sparsity constraint on the combination coefficients. In summary, the contributions of our work are:

  • A representative kernel selection method is introduced to construct a diverse subset of the pre-specified kernels for multiple kernel clustering.

  • The strategy of representative kernel selection is incorporated seamlessly into the objective function of multiple kernel k-means clustering.

  • An alternating minimization method is developed to optimize the cluster membership and the combination coefficients alternately.

  • Experimental results on several benchmark and real-world multiple kernel learning datasets demonstrate the effectiveness of the proposed approach.

The rest of this paper is organized as follows. We first introduce the proposed approach, including the preliminaries on multiple kernel k-means clustering, representative kernel selection, multiple kernel clustering with representative kernels, and the alternating optimization, and then evaluate our approach on several datasets in comparison with state-of-the-art methods. Finally, we conclude the paper and give some directions for future work.

The Proposed Approach

This section presents multiple kernel clustering by selecting representative kernels. We first review the preliminaries on multiple kernel k-means clustering, and then introduce the strategy for representative kernel selection. Next, we incorporate this strategy into the objective function of multiple kernel k-means clustering. Finally, an alternating minimization method is developed to optimize the combination coefficients and the cluster membership alternately.

Multiple Kernel k-Means Clustering

Given a set of samples $\{x_i\}_{i=1}^{n}$, kernel k-means clustering aims to minimize the sum-of-squares loss over the cluster indicator matrix $Z \in \{0,1\}^{n \times k}$, which is formulated as the following optimization problem,

$$\min_{Z} \sum_{i=1}^{n}\sum_{c=1}^{k} Z_{ic}\,\big\lVert \phi(x_i) - \mu_c \big\rVert^{2} \quad (1)$$
$$\text{s.t.} \quad \sum_{c=1}^{k} Z_{ic} = 1, \qquad Z_{ic} \in \{0,1\},$$

where $\phi(\cdot)$ is a function that maps the original features onto a reproducing kernel Hilbert space $\mathcal{H}$, and $\mu_c = \frac{1}{n_c}\sum_{i=1}^{n} Z_{ic}\,\phi(x_i)$ and $n_c = \sum_{i=1}^{n} Z_{ic}$ are the centroid and the size of the $c$-th cluster, respectively.

The optimization problem in Eq. (1) can be rewritten in the following matrix-vector form,

$$\min_{Z} \operatorname{Tr}(K) - \operatorname{Tr}\!\left(L^{1/2} Z^{\top} K Z L^{1/2}\right) \quad (2)$$
$$\text{s.t.} \quad Z \mathbf{1}_{k} = \mathbf{1}_{n},$$

where $K$ is the kernel matrix with the $(i,j)$-th element $K_{ij} = \phi(x_i)^{\top}\phi(x_j)$, $\mathbf{1}_{k}$ and $\mathbf{1}_{n}$ are column vectors with all elements equal to 1, and $L = \operatorname{diag}\!\left(n_1^{-1}, n_2^{-1}, \dots, n_k^{-1}\right)$. It is difficult to solve the above optimization problem due to the discrete variable $Z$ in Eq. (2). Fortunately, the optimization problem can be approximated by relaxing $Z$ with $H = Z L^{1/2}$ (where $L^{1/2}$ is obtained by taking the square root of the diagonal elements in $L$) and allowing $H$ to take real values. In this way, we obtain a relaxed version of the optimization problem,

$$\min_{H \in \mathbb{R}^{n \times k}} \operatorname{Tr}\!\left(K\left(I_{n} - H H^{\top}\right)\right) \quad (3)$$
$$\text{s.t.} \quad H^{\top} H = I_{k},$$

where $I_{n}$ and $I_{k}$ are identity matrices of size $n$ and $k$, respectively.

In the framework of multiple kernel learning, each sample has several feature representations associated with a group of feature mappings $\{\phi_p(\cdot)\}_{p=1}^{m}$. In particular, each sample is represented as $\phi_\beta(x) = \left[\beta_1 \phi_1(x)^{\top}, \beta_2 \phi_2(x)^{\top}, \dots, \beta_m \phi_m(x)^{\top}\right]^{\top}$, where $\beta = [\beta_1, \dots, \beta_m]^{\top}$ denotes the weights of the base kernels and needs to be learned during optimization. Therefore, the $(i,j)$-th element of the combined kernel over the above mapping function can be formulated as

$$K_\beta(x_i, x_j) = \phi_\beta(x_i)^{\top} \phi_\beta(x_j) = \sum_{p=1}^{m} \beta_p^{2}\, \kappa_p(x_i, x_j). \quad (4)$$

By replacing the single kernel $K$ in Eq. (3) with the combined kernel $K_\beta$ (the matrix with entries $K_\beta(x_i, x_j)$), we obtain the optimization objective of multiple kernel k-means clustering as follows,

$$\min_{H,\,\beta} \operatorname{Tr}\!\left(K_\beta\left(I_{n} - H H^{\top}\right)\right) \quad (5)$$
$$\text{s.t.} \quad H^{\top} H = I_{k}, \quad \beta^{\top}\mathbf{1}_{m} = 1, \quad \beta_p \ge 0,\ p = 1, \dots, m.$$

As will be detailed hereinafter, this optimization problem can be solved by alternately updating $H$ and $\beta$.
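As a small illustration of how the combined kernel of Eq. (4) is assembled, the sketch below weights a list of base kernel matrices by the squared coefficients; the toy kernels and the uniform weights are purely for demonstration.

```python
import numpy as np

def combined_kernel(base_kernels, beta):
    """K_beta = sum_p beta_p^2 K_p, following the weighting of Eq. (4)."""
    Ks = np.asarray(base_kernels)                 # shape (m, n, n)
    beta = np.asarray(beta)                       # shape (m,)
    return np.einsum('p,pij->ij', beta ** 2, Ks)

# Toy usage: three random PSD kernels on n = 5 samples, uniform weights summing to one.
rng = np.random.default_rng(0)
Ks = [A @ A.T for A in (rng.standard_normal((5, 4)) for _ in range(3))]
K_beta = combined_kernel(Ks, np.full(3, 1.0 / 3))
```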

Representative Kernel Selection

Given a collection of base kernels $\mathcal{K} = \{K_1, K_2, \dots, K_m\}$, our goal is to find a diverse subset of $\mathcal{K}$, dubbed the representative kernels, that can represent the whole collection [Elhamifar, Sapiro, and Vidal2012, Elhamifar, Sapiro, and Sastry2016].

Dissimilarity between Kernels

Assume that the pairwise dissimilarity between base kernels $K_p$ and $K_q$ is given by $d_{pq}$, which indicates how well $K_p$ represents $K_q$. Specifically, the smaller the dissimilarity value is, the better the $p$-th base kernel $K_p$ represents the $q$-th base kernel $K_q$. To reduce redundancy and select a subset of base kernels as the representatives, we first define a measurement that characterizes the dissimilarity between pairs of kernels. Such a dissimilarity can be computed directly from the Euclidean distance or the inner products between base kernel matrices; here we adopt the measurement used in [Liu et al.2016],

$$d_{pq} = \operatorname{dis}(K_p, K_q), \quad p, q = 1, \dots, m. \quad (6)$$

A larger $d_{pq}$ means a higher dissimilarity between $K_p$ and $K_q$, while a smaller value implies that their dissimilarity is low. More advanced dissimilarity measurements, such as the Bregman matrix divergence [Kulis, Sustik, and Dhillon2009], will be discussed in future work. The dissimilarities can be arranged into a matrix $D \in \mathbb{R}^{m \times m}$ whose $(p,q)$-th entry is $d_{pq}$, with $d_p$ denoting the $p$-th row of $D$.
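The sketch below computes a pairwise kernel dissimilarity matrix. The exact measurement of Eq. (6) follows [Liu et al.2016] and is not reproduced here; a normalized Frobenius inner product turned into a distance is used as an illustrative stand-in.

```python
import numpy as np

def dissimilarity_matrix(base_kernels):
    """Pairwise dissimilarities d_pq between base kernels; larger means more dissimilar.

    Illustrative choice: one minus the normalized Frobenius inner product
    <K_p, K_q>_F / (||K_p||_F ||K_q||_F), not necessarily the measurement of Eq. (6).
    """
    Ks = [np.asarray(K) for K in base_kernels]
    m = len(Ks)
    D = np.zeros((m, m))
    for p in range(m):
        for q in range(m):
            inner = np.sum(Ks[p] * Ks[q])
            D[p, q] = 1.0 - inner / (np.linalg.norm(Ks[p]) * np.linalg.norm(Ks[q]))
    return D
```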

Constrained Linear Optimization

We consider an optimization program on unknown variables $c_{pq}$ associated with the dissimilarities $d_{pq}$. All the variables can be arranged into a matrix $C \in \{0,1\}^{m \times m}$, with $c_p$ denoting the $p$-th row of $C$. The $(p,q)$-th element $c_{pq}$ is interpreted as the indicator of $K_p$ representing $K_q$: $c_{pq} = 1$ if the $p$-th base kernel is the representative of the $q$-th base kernel, and $c_{pq} = 0$ otherwise. To ensure that each base kernel is represented by one representative kernel, we impose the constraint $\sum_{p=1}^{m} c_{pq} = 1$ for every $q$.

Define the cost of encoding $K_q$ with $K_p$ as $d_{pq}\, c_{pq}$; then the cost of encoding $K_q$ with the whole collection $\mathcal{K}$ and the cost of encoding $\mathcal{K}$ itself are $\sum_{p=1}^{m} d_{pq}\, c_{pq}$ and $\sum_{q=1}^{m}\sum_{p=1}^{m} d_{pq}\, c_{pq}$, respectively. The goal of selecting a representative subset from $\mathcal{K}$ is that the selected representative kernels should encode $\mathcal{K}$ well according to the dissimilarities, i.e., the encoding cost should be as small as possible. Therefore, we have the following equality-constrained minimization program,

$$\min_{C} \sum_{q=1}^{m}\sum_{p=1}^{m} d_{pq}\, c_{pq} \quad (7)$$
$$\text{s.t.} \quad \sum_{p=1}^{m} c_{pq} = 1,\ \forall q, \qquad c_{pq} \in \{0, 1\},\ \forall p, q,$$

where the objective function corresponds to the total cost of encoding $\mathcal{K}$ via the representatives. Owing to the constraints, there will be zero rows in $C$, which means that some base kernels are not the representative of any kernel in $\mathcal{K}$. Accordingly, the nonzero rows of $C$ correspond to the representative kernels.

Convex Relaxation

The constraints in Eq. (7) contain binary variables $c_{pq} \in \{0, 1\}$, which makes the optimization non-convex and NP-hard in general. To make the optimization convex, we relax the program: in particular, the binary constraints are relaxed to $c_{pq} \in [0, 1]$, so that $c_{pq}$ can be viewed as the probability that $K_p$ is the representative of $K_q$. Thus, we have the following convex minimization program,

$$\min_{C} \sum_{q=1}^{m}\sum_{p=1}^{m} d_{pq}\, c_{pq} \quad (8)$$
$$\text{s.t.} \quad \sum_{p=1}^{m} c_{pq} = 1,\ \forall q, \qquad c_{pq} \ge 0,\ \forall p, q.$$

In this way, we obtain a soft assignment of representatives, i.e., $c_{pq} \in [0, 1]$.
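Taken on its own, the relaxed program of Eq. (8) is a small linear program. The sketch below solves it with SciPy, assuming only the column-sum and non-negativity constraints stated above; in the full method this term is coupled with the clustering objective through Eq. (13), so the standalone solution is only meant to show how zero rows of C (non-representatives) arise.

```python
import numpy as np
from scipy.optimize import linprog

def select_representatives(D):
    """Relaxed selection of Eq. (8): min <D, C> s.t. each column of C sums to 1, C >= 0."""
    m = D.shape[0]
    cost = D.flatten()                       # row-major: variable p*m + q is C[p, q]
    A_eq = np.zeros((m, m * m))
    for q in range(m):
        A_eq[q, q::m] = 1.0                  # sum_p C[p, q] = 1 for every column q
    res = linprog(cost, A_eq=A_eq, b_eq=np.ones(m), bounds=(0, 1), method="highs")
    C = res.x.reshape(m, m)
    representatives = np.flatnonzero(C.sum(axis=1) > 1e-8)   # indices of nonzero rows
    return C, representatives
```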

Multiple Kernel k-Means Clustering by Selecting Representative Kernels

To reduce the redundancy of kernels by selecting representative kernels in the process of multiple kernel clustering, we integrate the representative kernel selection strategy into the objective function of multiple kernel k-means, and associate $c_{pq}$, the probability of $K_p$ representing $K_q$, with the weight of each base kernel. In particular, we define the weight $\beta_p$ of the base kernel $K_p$ as the average probability of $K_p$ representing all the base kernels,

$$\beta_p = \frac{1}{m} \sum_{q=1}^{m} c_{pq}. \quad (9)$$

Since $\sum_{p=1}^{m} c_{pq} = 1$ for every $q$, we have

$$\sum_{p=1}^{m} \beta_p = \frac{1}{m} \sum_{q=1}^{m} \sum_{p=1}^{m} c_{pq} = 1, \quad (10)$$

which indicates that the weights are valid combination coefficients of the base kernels.
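To make Eqs. (9) and (10) concrete, the weights are simply the row means of C, and they automatically sum to one whenever each column of C sums to one; the 3 x 3 matrix below is a toy example.

```python
import numpy as np

C = np.array([[0.7, 0.6, 0.5],     # toy representation matrix: every column sums to 1
              [0.2, 0.3, 0.4],
              [0.1, 0.1, 0.1]])
beta = C.mean(axis=1)               # Eq. (9): average over the columns q for each row p
assert np.isclose(beta.sum(), 1.0)  # Eq. (10): valid combination coefficients
```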

Therefore, the optimization objective of multiple kernel k-means clustering in Eq. (5) can be written in the following form,

$$\min_{H,\,C} \operatorname{Tr}\!\left(K_{C}\left(I_{n} - H H^{\top}\right)\right) \quad (11)$$
$$\text{s.t.} \quad H^{\top} H = I_{k}, \quad C^{\top} \mathbf{1}_{m} = \mathbf{1}_{m}, \quad C \ge \mathbf{0}_{m \times m},$$

where $\mathbf{1}_{m}$ denotes a column vector whose elements are all equal to one, $\mathbf{0}_{m \times m}$ is the zero matrix of size $m \times m$, and the combined kernel $K_{C}$ is defined as follows,

$$K_{C} = \sum_{p=1}^{m} \beta_p^{2}\, K_p = \sum_{p=1}^{m} \left(\frac{1}{m} \sum_{q=1}^{m} c_{pq}\right)^{2} K_p. \quad (12)$$

Rewriting the objective of the representative kernel selection program Eq. (8) in matrix form as $\operatorname{Tr}(D^{\top} C)$ and integrating it into Eq. (11), we obtain the final optimization problem of the proposed algorithm,

$$\min_{H,\,C} \operatorname{Tr}\!\left(K_{C}\left(I_{n} - H H^{\top}\right)\right) + \lambda \operatorname{Tr}\!\left(D^{\top} C\right) \quad (13)$$
$$\text{s.t.} \quad H^{\top} H = I_{k}, \quad C^{\top} \mathbf{1}_{m} = \mathbf{1}_{m}, \quad C \ge \mathbf{0}_{m \times m},$$

where the parameter $\lambda$ controls the diversity of the representative kernels.

Alternating Optimization

1:  Input:    Base kernels $\{K_p\}_{p=1}^{m}$;    The number of clusters $k$;   Trade-off parameter $\lambda$;   Stop threshold $\epsilon$.
2:  Output:    Coefficients of base kernels $\beta$;    $k$-dimensional representations of the samples $H$.
3:  Compute the dissimilarity matrix $D$ with Eq. (6);
4:  Initialize the indicator matrix of kernel representation $C$;
5:  Compute $\beta$ with Eq. (9); {mean of each row of $C$}
6:  Initialize the objective function value $J^{(0)}$;
7:  $t = 0$;
8:  repeat
9:     $t = t + 1$;
10:     Update $H$ by solving Eq. (14);
11:     Update $C$ by solving Eq. (17);
12:     Compute $\beta$ with Eq. (9); {mean of each row of $C$}
13:     Compute the objective value $J^{(t)}$ with Eq. (13);
14:  until the decrease of the objective value falls below $\epsilon$.
Algorithm 1 Multiple Kernel k-Means Clustering by Selecting Representative Kernels
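The loop below is a high-level sketch of Algorithm 1. The helpers dissimilarity_matrix, update_H, and update_C stand for the routines sketched next to Eq. (6), Eq. (14), and Eq. (17), and the relative-decrease stopping rule is an assumption made here for concreteness.

```python
import numpy as np

def mkkm_rk(base_kernels, k, lam, eps=1e-4, max_iter=50):
    """Alternating minimization of Eq. (13) (sketch of Algorithm 1)."""
    Ks = np.asarray(base_kernels)                        # shape (m, n, n)
    m, n = Ks.shape[0], Ks.shape[1]
    D = dissimilarity_matrix(Ks)                         # Eq. (6), sketched earlier
    C = np.full((m, m), 1.0 / m)                         # start from a uniform soft assignment
    beta = C.mean(axis=1)                                # Eq. (9)
    prev_obj = np.inf
    for _ in range(max_iter):
        K_beta = np.einsum('p,pij->ij', beta ** 2, Ks)   # Eq. (12)
        H = update_H(K_beta, k)                          # Eq. (14): top-k eigenvectors
        C = update_C(Ks, H, D, lam)                      # Eq. (17): convex QP in C
        beta = C.mean(axis=1)                            # Eq. (9)
        K_beta = np.einsum('p,pij->ij', beta ** 2, Ks)
        obj = np.trace(K_beta @ (np.eye(n) - H @ H.T)) + lam * np.sum(D * C)   # Eq. (13)
        if abs(prev_obj - obj) <= eps * max(abs(obj), 1.0):   # relative decrease threshold
            break
        prev_obj = obj
    return beta, H
```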

Finally, we solve the optimization problem in Eq. (13). There are two blocks of variables, $H$ and $C$, in Eq. (13), which can be optimized with an alternating minimization method.

Given $C$, the optimization problem with respect to $H$ is a standard kernel k-means clustering problem, i.e., Eq. (3), and the optimal $H$ can be obtained by taking the eigenvectors that correspond to the $k$ largest eigenvalues of $K_{C}$. Specifically, Eq. (3) can be written as

$$\max_{H} \operatorname{Tr}\!\left(H^{\top} K_{C} H\right) \quad (14)$$
$$\text{s.t.} \quad H^{\top} H = I_{k}.$$

By interpreting the columns of $H$ as a collection of mutually orthonormal basis vectors $h_1, h_2, \dots, h_k$, the objective can then be written as

$$\operatorname{Tr}\!\left(H^{\top} K_{C} H\right) = \sum_{c=1}^{k} h_c^{\top} K_{C}\, h_c. \quad (15)$$

Choosing $h_1, \dots, h_k$ as the eigenvectors of $K_{C}$ associated with its $k$ largest eigenvalues, we obtain the maximal value of the objective [Welling2013].
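A minimal realization of this H-update, assuming the combined kernel matrix is symmetric positive semidefinite:

```python
import numpy as np

def update_H(K, k):
    """Eq. (14): H spans the eigenvectors of K associated with the k largest eigenvalues."""
    K = 0.5 * (K + K.T)                    # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    return eigvecs[:, -k:]                 # columns for the top-k eigenvalues
```

As is common for such spectral relaxations, the rows of $H$ can subsequently be clustered with ordinary k-means to recover discrete cluster labels.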

Given $H$, the optimization problem with respect to $C$ can be written in the following form,

$$\min_{C} \sum_{p=1}^{m} f_p\, \beta_p^{2} + \lambda \operatorname{Tr}\!\left(D^{\top} C\right) \quad (16)$$
$$\text{s.t.} \quad C^{\top} \mathbf{1}_{m} = \mathbf{1}_{m}, \quad C \ge \mathbf{0}_{m \times m},$$

where $f_p = \operatorname{Tr}\!\left(K_p\left(I_{n} - H H^{\top}\right)\right)$, $\beta_p = \frac{1}{m}\, c_p \mathbf{1}_{m}$, and $c_p$ is the $p$-th row of $C$. This optimization problem can be rewritten as

$$\min_{\mathbf{c}} \tfrac{1}{2}\, \mathbf{c}^{\top} Q\, \mathbf{c} + \lambda\, \mathbf{d}^{\top} \mathbf{c} \quad (17)$$
$$\text{s.t.} \quad A \mathbf{c} = \mathbf{1}_{m}, \quad \mathbf{c} \ge \mathbf{0},$$

where $\mathbf{c} = \operatorname{vec}(C)$, $\mathbf{d} = \operatorname{vec}(D)$, $Q$ collects the quadratic coefficients induced by the first term of Eq. (16), and $A$ encodes the column-sum constraints. It is obvious that Eq. (17) is a convex quadratic programming (QP) problem with $m^{2}$ decision variables, $m$ equality constraints, and $m^{2}$ inequality constraints. Therefore, we can solve it with a standard QP solver [Grant and Boyd2014], and the weights of the base kernels can then be computed with Eq. (9).
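The text prescribes a standard QP solver such as CVX [Grant and Boyd2014]; the sketch below uses CVXPY as a Python stand-in and reconstructs the objective of Eq. (17) from Eqs. (12) and (13), so its details may differ from the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def update_C(base_kernels, H, D, lam):
    """Given H, update the representation matrix C by a convex QP (sketch of Eq. (17))."""
    Ks = np.asarray(base_kernels)
    m, n = Ks.shape[0], Ks.shape[1]
    P = np.eye(n) - H @ H.T                           # projector onto the discarded subspace
    f = np.maximum([np.trace(K @ P) for K in Ks], 0)  # f_p = Tr(K_p (I - H H^T)) >= 0
    C = cp.Variable((m, m), nonneg=True)
    beta = cp.sum(C, axis=1) / m                      # Eq. (9) expressed inside the QP
    objective = cp.sum(cp.multiply(f, cp.square(beta))) + lam * cp.sum(cp.multiply(D, C))
    constraints = [cp.sum(C, axis=0) == 1]            # every column of C sums to one
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return np.asarray(C.value)
```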

The main algorithm of the proposed approach is summarized in Algorithm 1. Its computational complexity is composed of four main parts:

  1. In the beginning, the $m$ base kernel matrices need to be computed, whose cost is $O(m n^{2} d)$, where $d$ is the dimension of the original features.

  2. Then the computational complexity of the dissimilarity matrix $D$ with Eq. (6) is $O(m^{2} n^{2})$.

  3. Next, after obtaining the combined kernel $K_{C}$, the complexity of the eigendecomposition used to update $H$ with Eq. (14) is $O(n^{3})$ in each iteration.

  4. Finally, the standard QP solver used to update $C$ with Eq. (17) requires a cost that is polynomial in the number of variables $m^{2}$ in each iteration.

Assuming that $T$ is the number of iterations, the total complexity of the proposed approach consists of the kernel and dissimilarity pre-computation plus the per-iteration eigendecomposition and QP. Since $m \ll n$ in general (for example, $m = 12$ base kernels while $n$ ranges from a few hundred to several thousand samples in our experiments), the per-iteration cost is dominated by the eigendecomposition. Therefore, the final computational complexity is approximated by $O(T n^{3})$, which is equal to the complexity of the vanilla MKKM.

Experimental Studies

Datasets and Experimental Setup

The clustering algorithms are evaluated on seven benchmark datasets and two Flowers datasets that are frequently used for the performance evaluation of clustering methods. Three of the benchmark datasets are collected from text corpora, and the remaining four are image datasets. The Flowers datasets are available at http://www.robots.ox.ac.uk/~vgg/data/flowers/. Detailed descriptions of these datasets are presented in Table 1.

Name # Samples # Features # Classes
TR11 414 6429 9
TR41 878 7454 10
TR45 690 8261 10
JAFFE 213 676 10
ORL 400 1024 40
AR 840 768 20
COIL20 1440 768 20
Flowers17 1360 7 (# Kernel) 17
Flowers102 8189 4 (# Kernel) 102
Table 1: Details of Datasets
Table 2: Performance comparison of the clustering methods SB-KKM, A-MKKM, MKKM, LMKKM, RMKKM, MKKM-MR, and the proposed approach with respect to Acc, NMI, and Purity on the seven benchmark datasets and the two Flowers datasets. The best results are highlighted in boldface; the last row reports the computational complexity of each method.

Following the strategy adopted by most multiple kernel learning methods, twelve different kernel functions are employed to construct the base kernels for the seven benchmark datasets. Specifically, these kernel functions include one cosine kernel $\kappa(x_i, x_j) = \frac{x_i^{\top} x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}$; four polynomial kernels $\kappa(x_i, x_j) = (a + x_i^{\top} x_j)^{b}$ with two choices of the bias $a$ and two choices of the degree $b$; and seven radial basis function kernels $\kappa(x_i, x_j) = \exp\!\left(-\lVert x_i - x_j\rVert^{2} / (2\tau\delta^{2})\right)$ with seven values of the width parameter $\tau$, where $\delta$ is the maximum distance between pairwise samples $x_i$ and $x_j$. The kernel matrices for the Flowers datasets are pre-computed and downloaded directly from the above website. All of the constructed kernels are normalized and then scaled to the range $[0, 1]$.
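The sketch below builds such a bank of cosine, polynomial, and RBF base kernels. The specific polynomial and RBF parameter grids and the min-max rescaling to [0, 1] are illustrative assumptions, not necessarily the exact choices used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_base_kernels(X):
    """Construct twelve base kernels: 1 cosine, 4 polynomial, 7 RBF (illustrative grids)."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    kernels = [Xn @ Xn.T]                                   # cosine kernel
    G = X @ X.T
    for a in (0.0, 1.0):                                    # polynomial kernels (a + x^T y)^b
        for b in (2, 4):
            kernels.append((a + G) ** b)
    sq = cdist(X, X, metric="sqeuclidean")
    delta2 = sq.max()                                       # squared maximum pairwise distance
    for tau in (0.01, 0.05, 0.1, 1, 10, 50, 100):           # RBF widths (illustrative)
        kernels.append(np.exp(-sq / (2 * tau * delta2)))
    # Rescale every kernel matrix to [0, 1].
    return [(K - K.min()) / (K.max() - K.min() + 1e-12) for K in kernels]
```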

For all clustering methods and datasets, the number of clusters is set to the true number of classes, i.e., we assume the true number of clusters is known in advance. In addition, the parameters of the clustering methods are selected by grid search; in particular, the parameter search scopes of the comparative methods follow the suggestions in their original papers, and for the proposed approach the diversity parameter $\lambda$ is likewise tuned over a grid of values. Besides, three metrics are employed to evaluate the clustering results: clustering accuracy (Acc), normalized mutual information (NMI), and purity. Moreover, to reduce the influence of the random initialization in k-means, all experiments on the different clustering algorithms are repeated multiple times and the best results are reported.
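For reference, the three metrics can be computed as follows: accuracy uses the usual Hungarian matching between predicted clusters and ground-truth classes, NMI is taken from scikit-learn, and purity lets each cluster vote for its majority class.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_metrics(y_true, y_pred):
    """Return (Acc, NMI, Purity) for a predicted clustering against ground-truth labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows are predicted clusters, columns are true classes.
    table = np.array([[np.sum((y_pred == c) & (y_true == t)) for t in classes]
                      for c in clusters])
    row, col = linear_sum_assignment(-table)           # best one-to-one cluster/class matching
    acc = table[row, col].sum() / len(y_true)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    purity = table.max(axis=1).sum() / len(y_true)
    return acc, nmi, purity
```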

Comparative Approaches

To demonstrate the competitiveness of our approach, we compare it with the following recently proposed strategies for multiple kernel k-means clustering:

  • Single Best Kernel k-means (SB-KKM): This approach performs kernel k-means on every single kernel and reports the best result among them.

  • Average Multiple Kernel k-means (A-MKKM): In this case, the final kernel is constructed as an equal-weighted linear combination of the single kernels.

  • Multiple Kernel k-means (MKKM): As introduced above, MKKM conducts kernel k-means clustering and updates the kernel coefficients alternately [Yu et al.2012].

  • Localized Multiple Kernel k-means (LMKKM): LMKKM assigns each single kernel function a sample-specific weight such that the final kernel is a localized combination [Gönen and Margolin2014].

  • Robust Multiple Kernel k-means (RMKKM): To improve the robustness of MKKM, RMKKM replaces the squared Euclidean distance between the data point and the cluster center with the $\ell_{2,1}$-norm [Du et al.2015].

  • Multiple Kernel k-means with Matrix-induced Regularization (MKKM-MR): MKKM-MR augments the objective function of MKKM with a matrix-induced regularization term to reduce the redundancy between base kernels [Liu et al.2016].

Results and Discussion

The experimental results with respect to Acc, NMI, and Purity are reported in Table 2, in which the best results are boldfaced and the last row lists the computational complexity. From these results, we draw the following conclusions:

  • In comparison with the six competitive approaches, the proposed approach obtains the best results on eight out of the nine datasets with respect to Acc and NMI, and is only slightly inferior to MKKM-MR and RMKKM on datasets TR11 and TR45, respectively. As for Purity, our approach beats the competitive approaches on all nine datasets. Therefore, the proposed approach is superior to the comparative approaches.

  • The single best kernel k-means method performs better than multiple kernel k-means with equal weights on several datasets, which indicates that inappropriate kernel functions can degrade the performance of the kernel k-means algorithm and highlights the importance of kernel selection in multiple kernel k-means methods.

  • The performance of vanilla MKKM is slightly inferior to the single best kernel k-means in most cases. However, appropriate strategies for learning the kernel weights, such as LMKKM and RMKKM, improve multiple kernel k-means and usually obtain better performance than the single best kernel k-means.

  • The superior results obtained by MKKM-MR and our approach reveal that enhancing the diversity between pairwise base kernels has a beneficial effect on the performance of multiple kernel k-means. In addition, by characterizing the pre-specified kernels with representative kernels, the proposed approach improves upon MKKM-MR in terms of effectiveness.

In a nutshell, these observations demonstrate the advantages and effectiveness of the proposed approach.

Figure 1: The effect of the diversity regularization parameter $\lambda$ on datasets (a) ORL and (b) TR11.
Figure 2: Illustration of the indicator matrix $C$ with respect to different values of the diversity parameter $\lambda$ on dataset ORL. The labels on the x and y axes are the indices of the base kernels, and the color of the $(p,q)$-th element of $C$ denotes the probability that the $p$-th kernel is the representative of the $q$-th kernel.

Parameter Sensitivity and Convergence

The parameter $\lambda$ in the objective function of the proposed approach controls the diversity of the base kernels. To analyze the effect of $\lambda$ on the clustering performance, we illustrate the results on one image dataset, ORL, and one document dataset, TR11, in Figure 1(a) and Figure 1(b), respectively. As we can see, the performance on the image dataset ORL is stable with respect to $\lambda$. For dataset TR11, as $\lambda$ increases, the clustering performance first drops to its minimum and then remains stable.

In addition, we illustrate the obtained matrix $C$ on dataset ORL for different values of the diversity parameter $\lambda$ in Figure 2. It can be observed that when $\lambda$ is small, many base kernels select more than one kernel as their representatives with moderate probabilities (indicated by gray and black colors). However, as the value of $\lambda$ becomes large, more base kernels select just one kernel as their representative. In particular, for the largest value of $\lambda$ shown, only a few base kernels are selected as representatives (nonzero rows of $C$), and most of the probabilities are close to one.

Moreover, the effects of the regularization parameter on the number of selected kernels in the proposed approach and in MKKM-MR are shown in Figure 3(a) and Figure 3(b), respectively. In contrast to the trend in MKKM-MR, where the number of selected kernels first increases and then decreases as $\lambda$ grows, the number of selected kernels obtained by our approach fluctuates but tends to decrease in the long run on datasets ORL and TR11. These results indicate that the proposed approach is more interpretable, in line with the expectation that an algorithm with a larger regularization parameter should select fewer base kernels.

Figure 3: The number of selected kernels obtained by (a) the proposed approach and (b) MKKM-MR, with respect to the regularization parameter $\lambda$.
Figure 4: The objective value of the proposed approach at each iteration on datasets (a) ORL and (b) TR11.

Finally, the objective value of the proposed approach at each iteration is plotted in Figure 4, from which we can observe that our approach converges within a small number of iterations in most cases.

Conclusion

This paper presents a new approach to multiple kernel clustering that selects representative kernels to improve the quality of the combined kernel. More concretely, we first devise a strategy to select a diverse subset of the pre-specified kernels, and then incorporate this representative kernel selection strategy into the objective function of the multiple kernel k-means method. Finally, an alternating optimization method is developed to optimize the cluster membership and the kernel weights alternately. Experimental results on several benchmark and real-world datasets validate the advantages and effectiveness of the proposed approach. In future work, we plan to develop a customized optimization method for the proposed approach based on the alternating direction method of multipliers framework to reduce the computational complexity.

References

  • [Bickel and Scheffer2004] Bickel, S., and Scheffer, T. 2004. Multi-view clustering. In 2004 IEEE 4th International Conference on Data Mining, volume 4, 19–26. IEEE.
  • [Chao, Sun, and Bi2017] Chao, G.; Sun, S.; and Bi, J. 2017. A survey on multi-view clustering. arXiv preprint arXiv:1712.06246.
  • [Chaudhuri et al.2009] Chaudhuri, K.; Kakade, S. M.; Livescu, K.; and Sridharan, K. 2009. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th International Conference on Machine Learning, 129–136. ACM.
  • [Ding and He2004] Ding, C., and He, X. 2004. K-means clustering via principal component analysis. In Proceedings of the 21st International Conference on Machine Learning, 29. ACM.
  • [Ding et al.2015] Ding, Y.; Zhao, Y.; Shen, X.; Musuvathi, M.; and Mytkowicz, T. 2015. Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup. In Proceedings of the 32nd International Conference on Machine Learning, 579–587.
  • [Du et al.2015] Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y.-D. 2015. Robust multiple kernel k-means using l21-norm. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 3476–3482. AAAI Press.
  • [Elhamifar, Sapiro, and Sastry2016] Elhamifar, E.; Sapiro, G.; and Sastry, S. S. 2016. Dissimilarity-based sparse subset selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11):2182–2197.
  • [Elhamifar, Sapiro, and Vidal2012] Elhamifar, E.; Sapiro, G.; and Vidal, R. 2012. Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In Advances in Neural Information Processing Systems, 19–27.
  • [Georgogiannis2016] Georgogiannis, A. 2016. Robust k-means: a theoretical revisit. In Advances in Neural Information Processing Systems, 2891–2899.
  • [Girolami2002] Girolami, M. 2002. Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks 13(3):780–784.
  • [Gönen and Alpaydın2011] Gönen, M., and Alpaydın, E. 2011. Multiple kernel learning algorithms. Journal of Machine Learning Research 12(Jul):2211–2268.
  • [Gönen and Margolin2014] Gönen, M., and Margolin, A. A. 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In Advances in Neural Information Processing Systems, 1305–1313.
  • [Grant and Boyd2014] Grant, M., and Boyd, S. 2014. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx.
  • [Hartigan1975] Hartigan, J. A. 1975. Clustering Algorithms. John Wiley & Sons.
  • [Huang, Chuang, and Chen2012] Huang, H.-C.; Chuang, Y.-Y.; and Chen, C.-S. 2012. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems 20(1):120–134.
  • [Kulis, Sustik, and Dhillon2009] Kulis, B.; Sustik, M. A.; and Dhillon, I. S. 2009. Low-rank kernel learning with bregman matrix divergences. Journal of Machine Learning Research 10(Feb):341–376.
  • [Kumar and Daumé2011] Kumar, A., and Daumé, H. 2011. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning, 393–400.
  • [Liu et al.2016] Liu, X.; Dou, Y.; Yin, J.; Wang, L.; and Zhu, E. 2016. Multiple kernel k-means clustering with matrix-induced regularization. In AAAI, 1888–1894.
  • [Lu et al.2014] Lu, Y.; Wang, L.; Lu, J.; Yang, J.; and Shen, C. 2014. Multiple kernel clustering based on centered kernel alignment. Pattern Recognition 47(11):3656–3664.
  • [Newling and Fleuret2016] Newling, J., and Fleuret, F. 2016. Nested mini-batch k-means. In Advances in Neural Information Processing Systems, 1352–1360.
  • [Nilsback and Zisserman2006] Nilsback, M.-E., and Zisserman, A. 2006. A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 1447–1454. IEEE.
  • [Schölkopf, Smola, and Müller1998] Schölkopf, B.; Smola, A.; and Müller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5):1299–1319.
  • [Wang et al.2017] Wang, Y.; Liu, X.; Dou, Y.; and Li, R. 2017. Approximate large-scale multiple kernel k-means using deep neural network. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 3006–3012. AAAI Press.
  • [Wang, Nie, and Huang2013] Wang, H.; Nie, F.; and Huang, H. 2013. Multi-view clustering and feature learning via structured sparsity. In Proceedings of the 30th International Conference on Machine Learning, 352–360.
  • [Welling2013] Welling, M. 2013. Kernel k-means and spectral clustering.
  • [Yu et al.2012] Yu, S.; Tranchevent, L.; Liu, X.; Glanzel, W.; Suykens, J. A.; De Moor, B.; and Moreau, Y. 2012. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(5):1031–1039.
  • [Zhao, Kwok, and Zhang2009] Zhao, B.; Kwok, J. T.; and Zhang, C. 2009. Multiple kernel clustering. In Proceedings of the 2009 SIAM International Conference on Data Mining, 638–649. SIAM.
  • [Zhou and Zhao2016] Zhou, Q., and Zhao, Q. 2016. Flexible clustered multi-task learning by learning representative tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2):266–278.
  • [Zhu et al.2018] Zhu, X.; Liu, X.; Li, M.; Zhu, E.; Liu, L.; Cai, Z.; Yin, J.; and Gao, W. 2018. Localized incomplete multiple kernel k-means. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 3271–3277. AAAI Press.