Clustering with Similarity Preserving

05/21/2019 ∙ by Zhao Kang, et al. ∙ 12

Graph-based clustering has shown promising performance in many tasks. A key step of the graph-based approach is the construction of the similarity graph. In general, learning a graph in kernel space can enhance clustering accuracy due to the incorporation of nonlinearity. However, most existing kernel-based graph learning mechanisms are not similarity-preserving, and hence lead to sub-optimal performance. To overcome this drawback, we propose a more discriminative graph learning method which, for the first time, preserves the pairwise similarities between samples in an adaptive manner. Specifically, we require the learned graph to be close to a kernel matrix, which serves as a measure of similarity in the raw data. Moreover, the structure is adaptively tuned so that the number of connected components of the graph is exactly equal to the number of clusters. Finally, our method unifies clustering and graph learning, so that cluster indicators can be obtained directly from the graph itself without a further clustering step. The effectiveness of this approach is examined in both single and multiple kernel learning scenarios on several datasets.




1 Introduction

Discovering clusters in unlabeled data is one of the most fundamental scientific tasks, with an endless list of practical applications in data mining, pattern recognition, and machine learning jain1999data ; yang2017discrete ; huang2018self ; chen2012fgkm ; ren2019semi . It is well-known that labels are expensive to obtain, so clustering techniques are useful tools to process data and to reveal its underlying structure.

Over the past decades, a number of clustering techniques have been developed xu2005survey ; huang2019auto ; yang2018fast ; peng2018integrate . One main class of clustering methods is K-means and its various extensions. To some extent, these techniques are distance-based methods. K-means has been extensively investigated since its introduction in 1957 by Lloyd lloyd1982least , due to its simplicity and effectiveness. However, it is only suitable for data points that are evenly spread around some centroids yang2017towards ; chen2013twkm . To make it work under general circumstances, much effort has been spent on mapping the data to a suitable space. One representative approach is the kernel method. The first Kernel K-means algorithm was proposed in 1998 scholkopf1998nonlinear . Although some data points cannot be separated in the original data representation, they are linearly separable in kernel space.

Recently, the robust Kernel K-means (RKKM) method has been developed du2015robust . Different from other K-means algorithms, RKKM uses the $\ell_{2,1}$-norm to evaluate the fidelity term. Consequently, RKKM can alleviate the adverse effects of noise and outliers considerably. It has been shown that RKKM can achieve superior performance on a number of real-world data sets. However, its performance still depends on the choice of the kernel function.

Graph-based algorithms, as another main category of clustering methods, have been drawing growing attention. Among them, spectral clustering is a leading and highly popular method due to its ability to incorporate manifold information with good performance ng2002spectral ; kang2018unified . In particular, it embeds the data into the eigenspace of the Laplacian matrix, derived from the pairwise similarities between data points peluffo2016relationship . A commonly used similarity measure is the Gaussian kernel von2007tutorial . Nevertheless, it is challenging to select an appropriate scaling factor zelnik2005self . Kernel spectral clustering (KSC) alzate2010multiway and its variants langone2016kernel have also been proposed.

Recently, a novel approach which models graph construction as an optimization problem has been proposed nie2014clustering ; Cheng2010 ; elhamifar2013sparse ; liu2013robust ; tang2018learning . It works by either performing adaptive local structure learning or representing each data point as a weighted combination of other data points. The second approach can capture the global structure information and can be easily extended to kernel space kang2017kernel ; kang2019low . That is to say, one seeks to learn a high-quality graph from an artificially constructed kernel matrix. These methods are free of similarity metrics or kernel parameters, and are therefore more appealing for real-world applications.

Although the above approach has shown much better performance than traditional methods, it also causes some information loss. In particular, it learns the similarity graph from the data itself without considering other prior information. Consequently, some similarity information that would be helpful for graph learning might get lost haeffele2017structured ; Kang2019aa . On the other hand, preserving similarity information has been shown to be important for feature selection zhao2013similarity . In zhao2013similarity , a new feature vector $f$ is obtained by maximizing $f^T \hat{K} f$, where $\hat{K}$ is the refined similarity matrix derived from the original kernel matrix $K$. In this paper, we propose a way to preserve the similarity information between samples when we learn the graph and cluster labels. To the best of our knowledge, this is the first work that develops a similarity preserving strategy for graph learning.

It is necessary to point out that the key point of this paper is the similarity preserving concept. Though there are many similarity learning methods in the literature, they often fail to explicitly retain the structure information of the original data. Concretely, we expect our learned similarity matrix $Z$ to approximate the pre-defined kernel matrix $K$ to some extent. The quality of the similarity matrix is crucial to many tasks, such as graph embedding cai2018comprehensive , where the low-dimensional representation is expected to respect the neighborhood relation characterized by $Z$.

In addition, most existing graph-based clustering methods perform clustering in two separate steps ng2002spectral ; liu2013robust ; elhamifar2013sparse ; kang2019robust . Specifically, they first construct a graph. Then, the obtained graph is fed to the spectral clustering algorithm. In this approach, the quality of the graph is not guaranteed, and the graph might not be suitable for the subsequent clustering nie2014clustering ; kang2017twin . In this paper, the structure information of the graph is explicitly considered in our model, so that the number of connected components in the learned graph is equal to the number of clusters. Then, we can directly obtain cluster indicators from the graph itself without performing further graph cut or K-means clustering steps. Extensive experimental results validate the effectiveness of our proposed method.

The contributions of this paper are two-fold:

  1. Our proposed model has the capability of similarity preserving. This is the first attempt to preserve the sample’s similarity information when we construct the similarity graph. Consequently, the quality of the learned graph would be enhanced.

  2. Cluster structure is seamlessly incorporated into our objective function. As a result, the component number in the learned graph is equal to the number of clusters, such that the vertices in each connected component of the graph are partitioned into one cluster. Therefore, we directly obtain cluster indicators from the graph itself without performing further graph cut or K-means clustering steps.

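Notations. Given a data matrix $X \in \mathcal{R}^{m \times n}$ with $m$ features and $n$ samples, we denote its $(i,j)$-th element and $i$-th column as $x_{ij}$ and $X_i$, respectively. The $\ell_2$-norm of a vector $x$ is represented by $\|x\| = \sqrt{x^T x}$, where $x^T$ is the transpose of $x$. The squared Frobenius norm is defined as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. $I$ represents the identity matrix with the proper size. $Z \ge 0$ means all the elements of $Z$ are nonnegative. $\langle \cdot, \cdot \rangle$ denotes the inner product of two matrices.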

2 Preliminaries

In this section, we give a brief overview of two popular similarity learning techniques which have been developed recently.

2.1 Adaptive Local Structure Learning

Each data point $X_i$ can be connected to data point $X_j$ with probability $z_{ij}$. Closer points should have a larger probability, thus $z_{ij}$ characterizes the similarity between $X_i$ and $X_j$ niyogi2004locality ; nie2014clustering . Since $z_{ij}$ is negatively correlated with the distance between $X_i$ and $X_j$, $z_{ij}$ can be determined by solving the following problem:

$$\min_{Z_i} \sum_{j=1}^{n} \left( \|X_i - X_j\|^2 z_{ij} + \alpha z_{ij}^2 \right) \quad \text{s.t.} \quad Z_i^T \mathbf{1} = 1,\ 0 \le z_{ij} \le 1, \tag{1}$$

where $\alpha$ is the trade-off parameter. Here, $Z$ is adaptively learned from the data. This idea has recently been applied to a number of problems, such as nonnegative matrix factorization dacheng2017 ; huang2018adaptive , feature selection du2015unsupervised , and multi-view learning nie2017multi , just to name a few. One limitation of this method is that it can only capture the local structure information, and thus its performance might deteriorate.
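For a fixed point, the problem in Eq. (1) is a Euclidean projection onto the probability simplex: completing the square shows that the optimal vector is the projection of $-d / (2\alpha)$, where $d$ holds the squared distances to all points. The sketch below illustrates this under stated assumptions. It uses the standard sort-based simplex projection, the helper names `project_to_simplex` and `adaptive_neighbors` are hypothetical, and self-distances are not excluded, which the cited methods may handle differently.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) = 1}."""
    n = v.size
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target sum
    k = np.arange(1, n + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive gap
    tau = css[rho] / (rho + 1.0)              # shift that makes the result sum to 1
    return np.maximum(v - tau, 0.0)

def adaptive_neighbors(X, alpha):
    """X is m x n (features x samples); column i of the result solves Eq. (1)."""
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances between samples (columns of X)
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    n = X.shape[1]
    Z = np.zeros((n, n))
    for i in range(n):
        Z[:, i] = project_to_simplex(-dist[:, i] / (2.0 * alpha))
    return Z
```

A larger `alpha` spreads the weight of each column over more neighbors, while a small `alpha` concentrates it on the closest points.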

2.2 Adaptive Global Structure Learning

To explore the global structure information, methods based on self-expression have become increasingly popular in recent years Cheng2010 ; yang2014data . The basic idea is to encode each datum as a weighted combination of other samples, i.e., its direct neighbors and reachable indirect neighbors. If $X_i$ is quite similar to $X_j$, the coefficient $z_{ij}$, which denotes the contribution from $X_i$ to $X_j$, should be large. From this point of view, $z_{ij}$ can be viewed as the similarity between the data points. The corresponding optimization problem can be formulated as:

$$\min_{Z} \frac{1}{2}\|X - XZ\|_F^2 + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z \ge 0. \tag{2}$$

This has drawn significant attention and achieved impressive performance in a number of applications, including face recognition zhang2011sparse , motion segmentation liu2013robust ; elhamifar2013sparse , and semi-supervised learning zhuang2017label .

As a matter of fact, Eq. (2) is related to some dimension reduction methods. For example, in Locally Linear Embedding (LLE), the k nearest neighbors are first identified for each data point roweis2000nonlinear . Then each data point is reconstructed by a linear combination of its k nearest neighbors. By contrast, Eq. (2) uses all data points and determines the neighbors automatically according to the optimization result. Thus, it is supposed to capture the global structure information. Eq. (2) is different from Locality Preserving Projections (LPP), which tries to preserve the neighborhood structure during the dimension reduction process he2004locality . LPP uses a predefined similarity matrix to characterize the neighbor relations, while Eq. (2) learns this similarity matrix automatically from data. For Laplacian Eigenmaps (LE), a similarity graph matrix is also predefined belkin2002laplacian . On the other hand, Principal Component Analysis (PCA) aims to find a projection so that the variance is maximized in the low-dimensional space, which is less relevant to the similarity learning methods.

To capture the nonlinear structure information of data, Eq. (2) can be easily extended to kernel space, which gives

$$\min_{Z} \frac{1}{2}\mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z \ge 0, \tag{3}$$

where $\mathrm{Tr}(\cdot)$ is the trace operator and $K$ is the kernel matrix of $X$. This model recovers the linear relations among the data in the new representation, and thus the nonlinear relations in the original space. Eq. (3) is more general than Eq. (2) and reduces to Eq. (2) if a linear kernel function is applied.
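The reduction to the linear case can be checked numerically. The sketch below assumes the kernelized loss has the form $\frac{1}{2}\mathrm{Tr}(K - 2KZ + Z^T K Z)$ and verifies that, with $K = X^T X$, it equals $\frac{1}{2}\|X - XZ\|_F^2$. It also shows a simple heuristic solver: the unconstrained stationary point $(K + 2\alpha I)^{-1} K$ followed by clipping at zero. The clipping step is an assumption made for illustration, not an exact solver for the nonnegativity constraint, and the function names are hypothetical.

```python
import numpy as np

def kernel_self_expression_loss(K, Z):
    """0.5 * Tr(K - 2 K Z + Z^T K Z), the kernelized reconstruction loss."""
    return 0.5 * np.trace(K - 2.0 * K @ Z + Z.T @ K @ Z)

def kernel_self_expression(K, alpha):
    """Unconstrained minimizer of loss + alpha * ||Z||_F^2, then clipped at 0."""
    n = K.shape[0]
    # Setting the gradient -K + K Z + 2 alpha Z to zero gives a linear system
    Z = np.linalg.solve(K + 2.0 * alpha * np.eye(n), K)
    return np.maximum(Z, 0.0)  # heuristic handling of Z >= 0

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))       # 5 features, 8 samples
K_lin = X.T @ X                   # linear kernel
Z_demo = rng.random((8, 8))
linear_loss = 0.5 * np.linalg.norm(X - X @ Z_demo, "fro") ** 2
kernel_loss = kernel_self_expression_loss(K_lin, Z_demo)
```

Here `linear_loss` and `kernel_loss` agree up to floating-point error, which is the sense in which the kernel model generalizes the linear one.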

3 Similarity Preserving Clustering

The aforementioned two learning mechanisms lead to much better performance than traditional similarity measure based techniques in many real-world applications. However, they ignore some important information. Specifically, as they operate on the data itself, some data relation information might get lost haeffele2017structured . Since we seek to learn a high-quality similarity graph, data relation information would be crucial to our task. In this paper, we aim to retain this information.

Because the kernel matrix $K$ itself contains similarity information of data points, we expect $Z$ to be close to $K$. To this end, we optimize the following objective function

$$\min_{Z} \|Z - K\|_F^2. \tag{4}$$

Although we claim similarity preserving, Eq. (4) also keeps dissimilarity information. For example, if points $X_i$ and $X_j$ are from different clusters, $K_{ij} = 0$, then $z_{ij} = 0$ would hold. Note that we already have the term $-\mathrm{Tr}(KZ)$ in Eq. (3). Hence we can combine Eq. (4) and Eq. (3) by introducing a coefficient $\gamma$, which gives

$$\min_{Z} \frac{1}{2}\mathrm{Tr}(K - 2\gamma KZ + Z^T K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z \ge 0. \tag{5}$$

Although this is only a small modification to Eq. (3), it matters in practice. By tuning the parameter $\gamma$, we can control how much relation information we want to keep from the original kernel matrix. In particular, $\gamma$ can avoid conflicts between the pre-computed similarity $K$ and the learned similarity $Z$. If $K$ is not suitable to reveal the underlying relationships among samples, we just set $\gamma = 1$, which means that there is no similarity preserving effect. The influence of the parameter $\gamma$ is elaborated in Sec. 5.2.4.
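The combination step can be made explicit with a short derivation. It assumes that Eq. (4) is the squared Frobenius distance $\|Z - K\|_F^2$, that Eq. (3) has the kernelized self-expression form, and that $K$ is symmetric. The weight $\lambda \ge 0$ is a hypothetical symbol introduced only for this sketch.

```latex
\begin{aligned}
\|Z - K\|_F^2
  &= \mathrm{Tr}(Z^T Z) - 2\,\mathrm{Tr}(KZ) + \mathrm{Tr}(K^T K)
   = \|Z\|_F^2 - 2\,\mathrm{Tr}(KZ) + \text{const}, \\
\tfrac{1}{2}\mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 + \tfrac{\lambda}{2}\|Z - K\|_F^2
  &= \tfrac{1}{2}\mathrm{Tr}\big(K - 2(1+\lambda)KZ + Z^T K Z\big)
     + \big(\alpha + \tfrac{\lambda}{2}\big)\|Z\|_F^2 + \text{const}.
\end{aligned}
```

Under these assumptions, adding the similarity preserving term only rescales the cross term and the regularizer, so it can be absorbed into a single coefficient $\gamma = 1 + \lambda \ge 1$ and a redefined $\alpha$. This is consistent with $\gamma = 1$ corresponding to no similarity preservation.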

Eq. (5) provides a framework to learn the graph matrix $Z$ with similarity preservation. Further clustering is achieved by running spectral clustering and K-means clustering on the learned graph. These separate steps often lead to suboptimal solutions nie2014clustering , and K-means is sensitive to the initialization of cluster centers. We therefore propose to unify clustering with graph learning, so that the two tasks can be achieved simultaneously. Specifically, if there are $c$ clusters in the data, we hope to learn a graph with exactly $c$ connected components. Obviously, Eq. (5) can hardly satisfy such a constraint. To this end, we leverage the following theorem:

Theorem 1. mohar1991laplacian The number of connected components $c$ of the graph $Z$ is equal to the multiplicity of zero as an eigenvalue of its Laplacian matrix $L$.

Since $L$ is positive semi-definite, it has $n$ non-negative eigenvalues $0 \le \sigma_1 \le \sigma_2 \le \dots \le \sigma_n$. Theorem 1 indicates that if $\sum_{i=1}^{c} \sigma_i = 0$ is satisfied, the graph $Z$ would be ideal and the data points are already clustered into $c$ clusters. According to Fan's theorem fan1949theorem , we obtain

$$\sum_{i=1}^{c} \sigma_i = \min_{P \in \mathcal{R}^{n \times c},\ P^T P = I} \mathrm{Tr}(P^T L P), \tag{6}$$

where the Laplacian matrix is $L = D - \frac{Z^T + Z}{2}$, and $D$ is a diagonal matrix whose elements are the column sums of $\frac{Z^T + Z}{2}$. Combining Eq. (6) with Eq. (5), our proposed Similarity Preserving Clustering (SPC) is formulated as

$$\min_{Z, P} \frac{1}{2}\mathrm{Tr}(K - 2\gamma KZ + Z^T K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad P^T P = I,\ Z \ge 0. \tag{7}$$
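Theorem 1 and Fan's theorem can be illustrated on a toy graph. The sketch below assumes a symmetrized Laplacian built from $(Z^T + Z)/2$. It counts the near-zero eigenvalues and forms a minimizer of $\mathrm{Tr}(P^T L P)$ from the eigenvectors of the smallest eigenvalues. The function names and the tolerance are hypothetical choices.

```python
import numpy as np

def graph_laplacian(Z):
    """Laplacian of the symmetrized graph W = (Z + Z^T) / 2."""
    W = (Z + Z.T) / 2.0
    return np.diag(W.sum(axis=0)) - W

def count_components_spectral(Z, tol=1e-8):
    """Multiplicity of the zero eigenvalue of the Laplacian (Theorem 1)."""
    eigvals = np.linalg.eigvalsh(graph_laplacian(Z))
    return int(np.sum(eigvals < tol))

def fan_minimizer(Z, c):
    """Return P (eigenvectors of the c smallest eigenvalues) and their sum."""
    eigvals, eigvecs = np.linalg.eigh(graph_laplacian(Z))  # ascending order
    return eigvecs[:, :c], float(eigvals[:c].sum())

# Toy graph with two disconnected groups: samples {0, 1, 2} and {3, 4}
Z_toy = np.zeros((5, 5))
Z_toy[:3, :3] = 1.0
Z_toy[3:, 3:] = 1.0
n_components = count_components_spectral(Z_toy)
P_toy, smallest_sum = fan_minimizer(Z_toy, 2)
```

For this block-diagonal graph the Laplacian has exactly two zero eigenvalues, so the sum of the two smallest eigenvalues vanishes, matching the ideal case described above.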


By solving Eq. (7), we can obtain a structured graph $Z$, which has exactly $c$ connected components. By running the Matlab built-in function graphconncomp, we can obtain the component to which each sample belongs.
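Outside Matlab, the same component labels can be recovered with a plain breadth-first search over the learned graph. The sketch below is one possible Python stand-in, not the routine used in the paper. It treats the graph as undirected, and the threshold `tol` for deciding that an edge exists is an assumption.

```python
import numpy as np
from collections import deque

def component_labels(Z, tol=1e-10):
    """Label each sample with the index of its connected component."""
    # Undirected adjacency: an edge exists if either direction has weight
    W = (np.abs(Z) + np.abs(Z.T)) / 2.0 > tol
    n = W.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue                      # already assigned to a component
        labels[start] = current
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(W[u])[0]:
                if labels[v] < 0:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels
```

With a graph that has exactly as many components as clusters, the returned labels serve directly as cluster indicators.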

3.1 Optimization

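The problem (7) can be easily solved with an alternating optimization approach. When $Z$ is fixed, Eq. (7) becomes

$$\min_{P} \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad P^T P = I. \tag{8}$$

The solution is standard: $P$ is formed by the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.

When $P$ is fixed, Eq. (7) can be written column-wise as

$$\min_{Z_i} \frac{1}{2} Z_i^T K Z_i - \gamma K_{i,:} Z_i + \alpha Z_i^T Z_i + \frac{\beta}{2} d_i^T Z_i, \tag{9}$$

where $d_i \in \mathcal{R}^{n \times 1}$ has $j$-th entry $d_{ij} = \|P_{i,:} - P_{j,:}\|^2$, and we have used the equality $\sum_{i,j} \frac{1}{2}\|P_{i,:} - P_{j,:}\|^2 z_{ij} = \mathrm{Tr}(P^T L P)$. Setting the derivative to zero gives the closed-form solution

$$Z_i = (K + 2\alpha I)^{-1} \left( \gamma K_{i,:}^T - \frac{\beta d_i}{2} \right). \tag{10}$$

Once the parameter $\alpha$ is given, $(K + 2\alpha I)^{-1}$ becomes a constant. Therefore, we only perform the matrix inversion once. We summarize the steps in Algorithm 1. Our algorithm stops if the maximum iteration number 200 is reached or the relative change of $Z$ falls below a preset tolerance.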

Input: Kernel matrix $K$, parameters $\alpha$, $\beta$, $\gamma$.
Initialize: Random matrix $Z$.

REPEAT
1:  Update $P$ by solving (8).
2:  For each $i$, update the $i$-th column of $Z$ according to Eq. (10).
3:  Project $Z$ by $Z = \max(Z, 0)$.
UNTIL stopping criterion is met.

Algorithm 1 Similarity Preserving Clustering (SPC)
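The alternating scheme of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the Laplacian is built from the symmetrized graph, the column update is the stationary point $(K + 2\alpha I)^{-1}(\gamma K_{:,i} - \beta d_i / 2)$ applied to all columns at once, the projection is an element-wise clip at zero, and the tolerance `tol=1e-5`, the seed, and the function name `spc` are hypothetical.

```python
import numpy as np

def spc(K, c, alpha, beta, gamma, max_iter=200, tol=1e-5, seed=0):
    """Alternating updates of P and Z for a kernel matrix K and c clusters."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    Z = rng.random((n, n))                             # random initialization
    inv = np.linalg.inv(K + 2.0 * alpha * np.eye(n))   # computed only once
    for _ in range(max_iter):
        Z_old = Z
        # Step 1: P = eigenvectors of L for the c smallest eigenvalues
        W = (Z + Z.T) / 2.0
        L = np.diag(W.sum(axis=0)) - W
        _, vecs = np.linalg.eigh(L)                    # ascending eigenvalues
        P = vecs[:, :c]
        # Pairwise squared distances between rows of P: D[i, j] = d_ij
        sq = np.sum(P ** 2, axis=1)
        D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * P @ P.T, 0.0)
        # Step 2: closed-form update of every column of Z
        Z = inv @ (gamma * K - 0.5 * beta * D)
        # Step 3: projection onto the nonnegative orthant
        Z = np.maximum(Z, 0.0)
        # Stop when the relative change of Z is small
        change = np.linalg.norm(Z - Z_old) / max(np.linalg.norm(Z_old), 1e-12)
        if change < tol:
            break
    return Z, P
```

Because the inverse depends only on `K` and `alpha`, each iteration costs one eigendecomposition and one matrix product. Cluster labels would then be read from the connected components of the returned `Z`.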

4 Multiple Kernel Learning

Different kernels correspond to different notions of similarity and lead to different results, which makes a single-kernel method unreliable for practical applications. Multiple Kernel Learning (MKL) offers a principled way to encode complementary information and to automatically learn an optimal combination of distinct kernels sonnenburg2006large ; kang2018self . Instead of selecting a kernel heuristically, we learn a good combination of multiple kernels automatically.

Specifically, suppose there are in total $r$ kernels. We introduce a kernel weight for each kernel, e.g., $w_i$ for kernel $K^i$. We denote the combined kernel as $K_w = \sum_{i=1}^{r} w_i K^i$, and the weight distribution satisfies $\sum_{i=1}^{r} \sqrt{w_i} = 1$ gonen2011multiple . Finally, our multiple kernel learning based similarity preserving clustering (mSPC) method can be formulated as

$$\min_{Z, P, w} \frac{1}{2}\mathrm{Tr}(K_w - 2\gamma K_w Z + Z^T K_w Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad Z \ge 0,\ P^T P = I,\ K_w = \sum_{i=1}^{r} w_i K^i,\ \sum_{i=1}^{r} \sqrt{w_i} = 1,\ w_i \ge 0. \tag{11}$$

The problem (11) can be solved in a similar way as (7). Specifically, we repeat the following steps.

1) Updating $Z$ and $P$ when $w$ is fixed: We can directly obtain $K_w$, and the optimization problem (11) is identical to Eq. (7). We implement Algorithm 1 with $K_w$ as the input kernel matrix.

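2) Updating $w$ when $Z$ and $P$ are fixed: Solving Eq. (11) with respect to $w$ can be reformulated as

$$\min_{w} \sum_{i=1}^{r} w_i h_i \quad \text{s.t.} \quad \sum_{i=1}^{r} \sqrt{w_i} = 1,\ w_i \ge 0, \tag{12}$$

where

$$h_i = \mathrm{Tr}(K^i - 2\gamma K^i Z + Z^T K^i Z). \tag{13}$$

The Lagrange function of Eq. (12) is

$$\mathcal{J}(w) = w^T h + \lambda \Big( 1 - \sum_{i=1}^{r} \sqrt{w_i} \Big). \tag{14}$$

By utilizing the Karush-Kuhn-Tucker (KKT) condition with $\partial \mathcal{J}(w) / \partial w_i = 0$ and the constraint $\sum_{i=1}^{r} \sqrt{w_i} = 1$, we have $w$ as follows:

$$w_i = \Big( h_i \sum_{j=1}^{r} \frac{1}{h_j} \Big)^{-2}. \tag{15}$$
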
Input: A set of kernel matrices $\{K^i\}_{i=1}^{r}$, parameters $\alpha$, $\beta$, $\gamma$.
Initialize: Random matrix $Z$, $w_i = 1/r$.

REPEAT
1:  Calculate $K_w$.
2:  Update $P$ by performing singular value decomposition on $L$ and finding the eigenvectors associated with the $c$ smallest eigenvalues.
3:  Update $Z$ column-wise according to (10).
4:  Update $h$ by (13).
5:  Calculate $w$ according to (15).
UNTIL stopping criterion is met.

Algorithm 2 The algorithm of mSPC
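The kernel-weight step of Algorithm 2 can be sketched as below. It assumes the per-kernel cost $h_i = \mathrm{Tr}(K^i - 2\gamma K^i Z + Z^T K^i Z)$ and the closed form $w_i = (h_i \sum_j 1/h_j)^{-2}$ that follows from the constraint $\sum_i \sqrt{w_i} = 1$. The formula is only meaningful when every $h_i$ is positive, which this sketch does not enforce. All function names are hypothetical.

```python
import numpy as np

def kernel_costs(kernels, Z, gamma):
    """h_i = Tr(K^i - 2 gamma K^i Z + Z^T K^i Z) for each kernel K^i."""
    return np.array([
        np.trace(K - 2.0 * gamma * K @ Z + Z.T @ K @ Z) for K in kernels
    ])

def update_weights(h):
    """Closed-form weights; assumes every entry of h is strictly positive."""
    h = np.asarray(h, dtype=float)
    return (h * np.sum(1.0 / h)) ** -2.0

def combine_kernels(kernels, w):
    """Weighted sum K_w = sum_i w_i K^i."""
    return sum(wi * K for wi, K in zip(w, kernels))

h_demo = np.array([1.0, 2.0, 4.0])
w_demo = update_weights(h_demo)   # kernels with smaller cost get larger weight
```

For `h_demo`, the square roots of the weights are 4/7, 2/7, and 1/7, which sum to one as the constraint requires.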

5 Experiment

In this section, we perform extensive experiments to demonstrate the effectiveness of our proposed models.

Figure 1: The clustering results on synthetic data: (a) k-means; (b) SPC.

5.1 Experiment on Synthetic Data

We generate a synthetic data set with 300 points distributed in the pattern of two moons, where each moon is considered as a cluster. In Figure 1, we present the clustering results of our proposed SPC and standard k-means. A Gaussian kernel is used in our SPC model. We can observe that our method performs much better than k-means. We quantitatively assess the clustering performance in terms of accuracy (Acc), normalized mutual information (NMI), and Purity. For SPC, Acc, NMI, and Purity are 93%, 63.49%, and 93%, respectively. Correspondingly, k-means produces 73.67%, 16.87%, and 73.67%.
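Two of the three metrics can be computed from a contingency table alone, as sketched below. Purity assigns each predicted cluster to its majority class. For NMI, several normalizations are in use, and the geometric-mean form $I / \sqrt{H_{true} H_{pred}}$ chosen here is an assumption, since the text does not state which variant was used. Accuracy additionally needs an optimal one-to-one matching between clusters and classes and is omitted. The function names are hypothetical.

```python
import numpy as np

def contingency(y_true, y_pred):
    """Count matrix C[k, l] = number of samples in class k and cluster l."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    t, p = t.ravel(), p.ravel()
    C = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(C, (t, p), 1)
    return C

def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    C = contingency(y_true, y_pred)
    return C.max(axis=0).sum() / C.sum()

def nmi(y_true, y_pred):
    """Mutual information normalized by sqrt(H(true) * H(pred))."""
    pxy = contingency(y_true, y_pred)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return float(mi / np.sqrt(hx * hy)) if hx > 0 and hy > 0 else 0.0
```

Both metrics are invariant to a relabeling of the clusters, so a prediction that swaps the two cluster names still scores perfectly.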

5.2 Experiment on Real Data

5.2.1 Data sets

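| Data set | # instances | # features | # classes |
|---|---|---|---|
| YALE | 165 | 1024 | 15 |
| JAFFE | 213 | 676 | 10 |
| ORL | 400 | 1024 | 40 |
| YEAST | 1484 | 1470 | 10 |
| USPS | 1854 | 256 | 20 |
| TR11 | 414 | 6429 | 9 |
| TR41 | 878 | 7454 | 10 |
| TR45 | 690 | 8261 | 10 |

Table 1: Description of the data sets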
Figure 2: Sample images of JAFFE (a), YALE (b), and USPS (c).

We conduct our experiments with eight benchmark data sets, which are widely used in clustering experiments. We show the statistics of these data sets in Table 1.

These data sets are from different fields. Specifically, YALE, JAFFE, and ORL are three face databases. The images show different facial expressions or configurations due to time, illumination conditions, and glasses/no glasses. Figures 2(a) and 2(b) show some example images from the JAFFE and YALE databases, respectively. YEAST is a microarray data set. The USPS data set (tibs/ElemStatLearn/data.html) is obtained from the scanning of handwritten digits from envelopes by the U.S. Postal Service. Some sample digits are shown in Figure 2(c). The last three data sets in Table 1 are text data (han/data/tmdata.tar.gz).

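We manually construct 12 kernels. They consist of seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t\, d_{max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ denotes the maximal distance between data points; four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$; and a linear kernel $K(x, y) = x^T y$. Furthermore, all kernel matrices are normalized to the $[0, 1]$ range to avoid numerical inconsistency.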
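A pool of 12 kernels of this kind can be assembled as in the sketch below. The specific forms are assumptions made for illustration: Gaussian kernels $\exp(-\|x - y\|^2 / (t\, d_{max}^2))$ with seven values of $t$, polynomial kernels $(a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$, and a linear kernel. The min-max rescaling to $[0, 1]$ is also an assumption, since the text does not say how the normalization is done. The function name `build_kernels` is hypothetical.

```python
import numpy as np

def build_kernels(X):
    """X is m x n (features x samples); returns a list of 12 n x n kernels."""
    G = X.T @ X                                    # linear kernel x^T y
    sq = np.diag(G)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)
    d_max2 = dist2.max()                           # squared maximal distance
    # Seven Gaussian kernels with bandwidths scaled by the maximal distance
    kernels = [np.exp(-dist2 / (t * d_max2))
               for t in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    # Four polynomial kernels
    kernels += [(a + G) ** b for a in (0, 1) for b in (2, 4)]
    # One linear kernel
    kernels.append(G)
    # Rescale every kernel to [0, 1] (min-max scaling is an assumption)
    scaled = []
    for K in kernels:
        lo, hi = K.min(), K.max()
        scaled.append((K - lo) / (hi - lo) if hi > lo else np.zeros_like(K))
    return scaled
```

Each returned matrix can be passed to a single-kernel method on its own, or the whole list can be handed to a multiple kernel method that learns the weights.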

| Data | SC | RKKM | SSR | CAN | SPC1 | SPC | KSC | MKKM | AASC | RMKKM | mSPC1 | mSPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YALE | 49.42(40.52) | 48.09(39.71) | 54.55 | 58.79 | 55.29(45.07) | **60.60**(46.95) | 36.36(29.76) | 45.70 | 40.64 | 52.18 | 56.97 | **63.03** |
| JAFFE | 74.88(54.03) | 75.61(67.98) | 87.32 | **98.12** | 97.18(86.55) | 98.03(86.20) | 72.77(66.48) | 74.55 | 30.35 | 87.07 | 97.18 | **98.12** |
| ORL | 57.96(46.65) | 54.96(46.88) | 69.00 | 61.50 | 62.47(50.64) | **75.75**(52.48) | 37.00(32.70) | 47.51 | 27.20 | 55.60 | 65.25 | **75.43** |
| YEAST | 35.55(30.89) | 34.04(31.52) | 29.99 | 34.62 | 35.72(30.55) | **37.85**(31.65) | 31.19(28.36) | 13.04 | 35.38 | 31.63 | 36.08 | **39.15** |
| USPS | 35.18(26.90) | 65.16(55.72) | 64.83 | 62.84 | 65.94(56.38) | **67.25**(56.94) | 61.49(47.87) | 63.72 | 37.36 | 65.47 | 65.32 | **68.74** |
| TR11 | 50.98(43.32) | 53.03(45.04) | 41.06 | 38.89 | 70.88(54.25) | **78.26**(56.37) | 50.73(41.86) | 50.13 | 47.15 | 57.71 | 73.43 | **79.63** |
| TR41 | 63.52(44.80) | 56.76(46.80) | 63.78 | 62.87 | 67.28(52.75) | **72.89**(49.40) | 53.42(45.11) | 56.10 | 45.90 | 62.65 | 67.31 | **80.41** |
| TR45 | 57.39(45.96) | 58.13(45.69) | 71.45 | 56.96 | 73.59(53.06) | **75.07**(57.26) | 53.48(44.99) | 58.46 | 52.64 | 64.00 | 74.35 | **75.64** |

(a) Accuracy(%)

| Data | SC | RKKM | SSR | CAN | SPC1 | SPC | KSC | MKKM | AASC | RMKKM | mSPC1 | mSPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YALE | 52.92(44.79) | 52.29(42.87) | 57.26 | 57.67 | 56.37(45.00) | **61.32**(45.62) | 44.84(35.47) | 50.06 | 46.83 | 55.58 | 56.52 | **61.36** |
| JAFFE | 82.08(59.35) | 83.47(74.01) | 92.93 | 97.31 | 96.35(84.67) | **98.62**(83.30) | 73.67(68.39) | 79.79 | 27.22 | 89.37 | 95.61 | **97.36** |
| ORL | 75.16(66.74) | 74.23(63.91) | 84.23 | 76.59 | 79.36(63.98) | **86.06**(66.56) | 56.78(54.21) | 68.86 | 43.77 | 74.83 | 80.04 | **85.93** |
| YEAST | **21.38**(6.18) | 17.27(9.31) | 15.85 | 20.06 | 15.37(9.62) | 16.44(10.25) | 13.87(13.83) | 10.29 | **21.19** | 20.71 | 15.89 | 16.23 |
| USPS | 29.71(21.33) | 63.94(52.90) | 72.68 | 76.13 | 74.85(55.28) | **78.54**(58.93) | 48.45(40.27) | 62.25 | 29.81 | 63.60 | 75.29 | **79.88** |
| TR11 | 43.11(31.39) | 49.69(33.48) | 27.60 | 19.17 | 58.14(37.42) | **63.10**(35.94) | 46.22(36.51) | 44.56 | 39.39 | 56.08 | 60.15 | **63.90** |
| TR41 | 61.33(36.60) | 60.77(40.86) | 59.56 | 51.13 | 65.90(43.28) | **71.22**(38.54) | 44.21(39.19) | 57.75 | 43.05 | 63.47 | 65.11 | **70.50** |
| TR45 | 48.03(33.22) | 57.86(38.96) | 67.82 | 49.31 | 74.21(44.29) | **75.94**(46.28) | 43.41(42.95) | 56.17 | 41.94 | 62.73 | **74.97** | 74.57 |

(b) NMI(%)

| Data | SC | RKKM | SSR | CAN | SPC1 | SPC | KSC | MKKM | AASC | RMKKM | mSPC1 | mSPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YALE | 51.61(43.06) | 49.79(41.74) | 58.18 | 59.39 | 56.79(55.25) | **60.53**(56.28) | 44.85(34.67) | 47.52 | 42.33 | 53.64 | 60.00 | **66.67** |
| JAFFE | 76.83(56.56) | 79.58(71.82) | 96.24 | 98.12 | 97.85(96.03) | **98.25**(97.02) | 77.00(72.30) | 76.83 | 33.08 | 88.90 | 97.18 | **98.12** |
| ORL | 61.45(51.20) | 59.60(51.46) | 76.50 | 68.5 | 73.28(70.02) | **82.08**(76.56) | 41.00(36.73) | 52.85 | 31.56 | 60.23 | 77.00 | **82.69** |
| YEAST | 53.05(35.37) | 45.39(38.18) | 44.29 | 58.97 | 57.38(40.83) | **65.72**(44.75) | 32.21(31.14) | 32.58 | 52.71 | 33.21 | 60.27 | **66.51** |
| USPS | 37.48(33.27) | 72.49(62.49) | 75.84 | 71.89 | 76.90(63.78) | **77.54**(64.82) | 62.84(51.48) | 70.77 | 42.40 | 73.45 | 77.03 | **79.25** |
| TR11 | 58.79(50.23) | 67.93(56.40) | 85.02 | 44.20 | 81.79(80.12) | **90.10**(83.86) | 52.90(46.76) | 65.48 | 54.67 | 72.93 | 87.44 | **93.04** |
| TR41 | 73.68(56.45) | 74.99(60.21) | 75.40 | 67.54 | 73.05(71.13) | **80.67**(74.79) | 53.42(47.92) | 72.83 | 62.05 | 77.57 | 73.69 | **82.45** |
| TR45 | 61.25(50.02) | 68.18(53.75) | 83.62 | 60.87 | 78.74(77.82) | **86.32**(80.03) | 55.51(49.29) | 69.14 | 57.49 | 75.20 | 78.26 | **87.59** |

(c) Purity(%)

Table 2: Clustering results measured on benchmark data sets. The average performance over the 12 kernels is given in parentheses. For KSC, we run 10 times and report the best performance and the mean value. The best results for single and multiple kernel methods are highlighted in boldface.

5.2.2 Comparison Methods

To fully investigate the performance of our method on clustering, we choose a good set of methods to compare. In general, they can be classified into two categories: graph-based and kernel-based clustering methods.

  • Spectral Clustering (SC) ng2002spectral . We use the kernel matrix as its graph input. For our SPC, we learn the graph from kernels.

  • Robust Kernel K-means (RKKM). As an extension to the classical K-means clustering method, RKKM has the capability of dealing with nonlinear structures, noise, and outliers in the data. We also compare with its multiple kernel learning version: RMKKM.

  • Simplex Sparse Representation (SSR) huang2015new . Based on self-expression, SSR achieves satisfying performance in numerous data sets.

  • Clustering with Adaptive Neighbor (CAN) nie2014clustering . Based on adaptive local structure learning, CAN constructs the similarity graph by Eq. (1).

  • Kernel Spectral Clustering (KSC) alzate2010multiway . Based on a weighted kernel principal component analysis strategy, KSC performs multiway spectral clustering. Moreover, Balanced Line Fit (BLF) is proposed to obtain model parameters.

  • Our proposed SPC and mSPC. These are our single kernel and multiple kernel learning based similarity preserving clustering methods.

  • SPC1 and mSPC1. To observe the effect of similarity preserving, we let $\gamma = 1$ in SPC and name this method SPC1. Similarly, we have mSPC1. They are equivalent to the methods in kang2017twin .

  • Multiple Kernel K-means (MKKM). MKKM extends K-means to a multiple kernel setting. It imposes a different constraint on the kernel weight distribution.

  • Affinity Aggregation for Spectral Clustering (AASC). AASC is an extension of spectral clustering to the situation where multiple affinities exist.

For a fair comparison, we either use the recommended parameter settings in their respective papers or tune each method to obtain the best performance. In fact, the optimal performance for SC, RKKM, MKKM, AASC, and RMKKM methods can be easily obtained through implementing the package given in du2015robust . SC, SSR, and CAN are parameter-free models. KSC selects parameters based on Balanced Line Fit principle.

Figure 3: The visualization of similarity preserving effect.

5.2.3 Results

All results are summarized in Table 2. We can see that our methods SPC and mSPC outperform others in most cases. In particular, we have the following observations: 1) The improvement of SPC over SPC1 is considerable. Note that the only difference between SPC and SPC1 is that SPC explicitly considers the similarity preserving effect. In other words, SPC adds the proposed term Eq. (4), which aims to keep the learned graph matrix $Z$ close to the kernel matrix $K$, so that the similarity information carried by the kernel matrix is transferred to the learned graph matrix. This demonstrates the significance of similarity preserving in graph learning; 2) For multiple kernel methods, mSPC also performs better than mSPC1 in most experiments. This once again confirms the importance of similarity preserving; 3) Compared to the self-expression based method SSR, our advantage is also obvious. For example, on TR11, SPC raises the accuracy from 41.06% to 78.26%. Note that our basic objective function Eq. (3) is also derived from the self-expression idea; the difference is that our method operates in kernel space; 4) With respect to traditional spectral clustering, kernel spectral clustering, the recently proposed robust kernel K-means method, and the adaptive local structure graph learning method, the improvement is substantial on most data sets; 5) In terms of the multiple kernel learning approach, mSPC also achieves much better performance than other state-of-the-art techniques.

To better illustrate the effect of similarity preserving, we visualize the results on the YALE data in Figure 3. Specifically, Figure 3(a) plots the histogram of $K_w - Z$, i.e., the difference between the learned kernel in Eq. (11) and the similarity matrix. We can see that they are quite close for most elements, and the difference is the refinement brought by our learning algorithm. The manually constructed kernel matrix often fails to reflect the underlying relationships among samples due to inherent noise or the inappropriate use of a metric function. This is validated by the experimental results. Note that for the SC method, we directly treat the kernel matrix as the similarity matrix, while for our proposed SPC method, we use the learned similarity matrix to perform clustering. It can be seen that the results of SPC are much better than those of SC.

Figure 3(b) displays the difference between the original data $X$ and the reconstructed data $XZ$. Good reconstruction means that $Z$ represents the similarity well. The reconstruction error accounts for noise or outliers in the original data. As shown by Figure 3(b), our learned $Z$ reconstructs the original data with a small error. Therefore, our proposed approach can achieve a high-quality similarity matrix.

5.2.4 Parameter Analysis

Figure 4: The influence of parameters on YALE data set.

As shown in Eq. (11), there are three parameters in our model. As we mentioned previously, $\gamma$ is bigger than one. Taking the YALE data set as an example, we demonstrate the sensitivity of our model mSPC in Figure 4. We can see that it works well over a wide range of values. Note that the $\gamma = 1$ case is covered by the SPC1 and mSPC1 methods in Table 2. When $\gamma = 1$, Eq. (7) and (11) do not possess similarity preserving capability.

6 Conclusion

In this paper, we propose a clustering algorithm which can exploit similarity information of raw data. Furthermore, the structure information of a graph is also considered in our objective function. Comprehensive experimental results on real data sets well demonstrate the superiority of the proposed method on the clustering task. It has been shown that the performance of the proposed method is largely determined by the choice of the kernel function. To this end, we develop a multiple kernel learning method, which is capable of automatically learning an appropriate kernel from a pool of candidate kernels. In the future, we will examine the effectiveness of our framework on the semi-supervised learning task.


This paper was in part supported by Grants from the Natural Science Foundation of China (Nos. 61806045, 61572111, and 61872062), three Fundamental Research Fund for the Central Universities of China (Nos. ZYGX2017KYQD177, A03017023701012, and ZYGX2016J086), and a 985 Project of UESTC (No. A1098531023601041).



  • (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264–323.
  • (2) Y. Yang, F. M. Shen, Z. Huang, H. T. Shen, X. L. Li, Discrete nonnegative spectral clustering, IEEE Transactions on Knowledge and Data Engineering 29 (9) (2017) 1834–1845.
  • (3) S. Huang, Z. Kang, Z. Xu, Self-weighted multi-view clustering with soft capped norm, Knowledge-Based Systems 158 (2018) 1–8.
  • (4) X. Chen, Y. Ye, X. Xu, J. Z. Huang, A feature group weighting method for subspace clustering of high-dimensional data, Pattern Recognition 45 (1) (2012) 434–446.
  • (5) Y. Ren, K. Hu, X. Dai, L. Pan, S. C. Hoi, Z. Xu, Semi-supervised deep embedded clustering, Neurocomputing 325 (2019) 121–130.
  • (6) R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on neural networks 16 (3) (2005) 645–678.
  • (7) S. Huang, Z. Kang, I. W. Tsang, Z. Xu, Auto-weighted multi-view clustering via kernelized graph learning, Pattern Recognition 88 (2019) 174–184.
  • (8) X. Yang, W. Yu, R. Wang, G. Zhang, F. Nie, Fast spectral clustering learning with hierarchical bipartite graph for large-scale data, Pattern Recognition Letters.
  • (9) C. Peng, Z. Kang, S. Cai, Q. Cheng, Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction, ACM Transactions on Intelligent Systems and Technology (TIST) 9 (5) (2018) 57.
  • (10) S. Lloyd, Least squares quantization in pcm, IEEE transactions on information theory 28 (2) (1982) 129–137.
  • (11) B. Yang, X. Fu, N. D. Sidiropoulos, M. Hong, Towards k-means-friendly spaces: Simultaneous deep learning and clustering, in: International Conference on Machine Learning, 2017, pp. 3861–3870.
  • (12) X. Chen, X. Xu, Y. Ye, J. Z. Huang, TW-k-means: Automated Two-level Variable Weighting Clustering Algorithm for Multi-view Data, IEEE Transactions on Knowledge and Data Engineering 25 (4) (2013) 932–944.
  • (13) B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural computation 10 (5) (1998) 1299–1319.
  • (14) L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y.-D. Shen, Robust multiple kernel k-means using $\ell_{2,1}$-norm, in: Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3476–3482.
  • (15) A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 2 (2002) 849–856.
  • (16) Z. Kang, C. Peng, Q. Cheng, Z. Xu, Unified spectral clustering with optimal graph, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • (17) D. H. Peluffo-Ordóñez, M. A. Becerra, A. E. Castro-Ospina, X. Blanco-Valencia, J. C. Alvarado-Pérez, R. Therón, A. Anaya-Isaza, On the relationship between dimensionality reduction and spectral clustering from a kernel viewpoint, in: Distributed Computing and Artificial Intelligence, 13th International Conference, Springer, 2016, pp. 255–264.
  • (18) U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing 17 (4) (2007) 395–416.
  • (19) L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: Advances in neural information processing systems, 2005, pp. 1601–1608.
  • (20) C. Alzate, J. A. Suykens, Multiway spectral clustering with out-of-sample extensions through weighted kernel pca, IEEE transactions on pattern analysis and machine intelligence 32 (2) (2010) 335–347.
  • (21) R. Langone, R. Mall, C. Alzate, J. A. Suykens, Kernel spectral clustering and applications, in: Unsupervised Learning Algorithms, Springer, 2016, pp. 135–161.
  • (22) F. Nie, X. Wang, H. Huang, Clustering and projected clustering with adaptive neighbors, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 977–986.
  • (23) B. Cheng, J. Yang, S. Yan, Y. Fu, T. S. Huang, Learning with l1-graph for image analysis, Trans. Img. Proc. 19 (4) (2010) 858–866. doi:10.1109/TIP.2009.2038764.
  • (24) E. Elhamifar, R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE transactions on pattern analysis and machine intelligence 35 (11) (2013) 2765–2781.
  • (25) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 171–184.
  • (26) C. Tang, X. Zhu, X. Liu, M. Li, P. Wang, C. Zhang, L. Wang, Learning joint affinity graph for multi-view subspace clustering, IEEE Transactions on Multimedia.
  • (27) Z. Kang, C. Peng, Q. Cheng, Kernel-driven similarity learning, Neurocomputing 267 (2017) 210–219.
  • (28) Z. Kang, L. Wen, W. Chen, Z. Xu, Low-rank kernel learning for graph-based clustering, Knowledge-Based Systems 163 (2019) 510–517.
  • (29) B. D. Haeffele, R. Vidal, Structured low-rank matrix factorization: Global optimality, algorithms, and applications, arXiv preprint arXiv:1708.07850.
  • (30) Z. Kang, Y. Lu, Y. Su, C. Li, Z. Xu, Similarity learning via kernel preserving embedding, in: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). AAAI Press, 2019.
  • (31) Z. Zhao, L. Wang, H. Liu, J. Ye, On similarity preserving feature selection, IEEE Transactions on Knowledge and Data Engineering 25 (3) (2013) 619–632.
  • (32) H. Cai, V. W. Zheng, K. C.-C. Chang, A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering 30 (9) (2018) 1616–1637.
  • (33) Z. Kang, H. Pan, S. C. Hoi, Z. Xu, Robust graph learning from noisy data, IEEE transactions on cybernetics.
  • (34) Z. Kang, C. Peng, Q. Cheng, Twin learning for similarity and clustering: A unified kernel approach, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press, 2017.
  • (35) X. Niyogi, Locality preserving projections, in: Neural information processing systems, Vol. 16, MIT, 2004, p. 153.
  • (36) L. Zhang, Q. Zhang, B. Du, J. You, D. Tao, Adaptive manifold regularized matrix factorization for data clustering, in: Twenty-sixth international joint conference on artificial intelligence, 2017, pp. 3399–3405.
  • (37) S. Huang, Z. Xu, J. Lv, Adaptive local structure learning for document co-clustering, Knowledge-Based Systems 148 (2018) 74–84.
  • (38) L. Du, Y.-D. Shen, Unsupervised feature selection with adaptive structure learning, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 209–218.
  • (39) F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours., in: AAAI, 2017, pp. 2408–2414.
  • (40) Y. Yang, Z. Wang, J. Yang, J. Wang, S. Chang, T. S. Huang, Data clustering by laplacian regularized l1-graph., in: AAAI, 2014, pp. 3148–3149.
  • (41) L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: Which helps face recognition?, in: Computer vision (ICCV), 2011 IEEE international conference on, IEEE, 2011, pp. 471–478.
  • (42) L. Zhuang, Z. Zhou, S. Gao, J. Yin, Z. Lin, Y. Ma, Label information guided graph construction for semi-supervised learning, IEEE Transactions on Image Processing.
  • (43) S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science 290 (5500) (2000) 2323–2326.
  • (44) X. He, P. Niyogi, Locality preserving projections, in: Advances in neural information processing systems, 2004, pp. 153–160.
  • (45) M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in neural information processing systems, 2002, pp. 585–591.
  • (46) B. Mohar, Y. Alavi, G. Chartrand, O. Oellermann, The laplacian spectrum of graphs, Graph theory, combinatorics, and applications 2 (871-898) (1991) 12.
  • (47) K. Fan, On a theorem of weyl concerning eigenvalues of linear transformations i, Proceedings of the National Academy of Sciences 35 (11) (1949) 652–655.
  • (48) S. Sonnenburg, G. Rätsch, C. Schäfer, B. Schölkopf, Large scale multiple kernel learning, Journal of Machine Learning Research 7 (Jul) (2006) 1531–1565.
  • (49) Z. Kang, X. Lu, J. Yi, Z. Xu, Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 2312–2318.
  • (50) M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of machine learning research 12 (Jul) (2011) 2211–2268.
  • (51) J. Huang, F. Nie, H. Huang, A new simplex sparse learning model to measure data similarity for clustering., in: IJCAI, 2015, pp. 3569–3575.
  • (52) H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Multiple kernel fuzzy clustering, IEEE Transactions on Fuzzy Systems 20 (1) (2012) 120–134.
  • (53) H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 773–780.