With the benefits of simplicity and effectiveness, the k-means clustering algorithm is often adopted in various real-world problems. To deal with the nonlinear structure of many practical data sets, the kernel k-means (KKM) algorithm has been developed [Schölkopf et al.1998], where data points are mapped through a nonlinear transformation into a higher dimensional feature space in which the data points are linearly separable. KKM usually achieves better performance than the standard k-means. To cope with noise and outliers, the robust kernel k-means (RKKM) algorithm [Du et al.2015] has been proposed. In this approach, the squared $\ell_2$-norm of the error construction term is replaced by the $\ell_{2,1}$-norm. RKKM demonstrates superior performance on a number of benchmark data sets. The performance of such model-based methods heavily depends on whether the data fit the model. Unfortunately, in most cases, we do not know the distribution of data in advance. To some extent, this problem is alleviated by multiple kernel learning. Moreover, there is no theoretical result on how to choose the similarity graph [Von Luxburg2007].
Spectral clustering is another widely used clustering method [Kumar et al.2011]. It enjoys the advantage of exploring the intrinsic data structures by exploiting the different similarity graphs of data points [Yang et al.2015]. There are three kinds of similarity graph constructing strategies: k-nearest-neighborhood (knn), $\epsilon$-nearest-neighborhood, and the fully connected graph. Here, some open issues arise [Huang et al.2015]: 1) how to choose a proper neighbor number $k$ or radius $\epsilon$; 2) how to select an appropriate similarity metric to measure the similarity among data points; 3) how to counteract the adverse effect of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all of these issues heavily influence the clustering results [Zelnik-Manor and Perona2004]. Nowadays, many data are high dimensional, heterogeneous, and without prior knowledge, and it is therefore a fundamental challenge to define a pairwise similarity graph for effective spectral clustering.
Recently, [Zhu et al.2014] construct robust affinity graphs for spectral clustering by identifying discriminative features. Their method adopts a random forest approach based on the motivation that tree leaf nodes contain discriminative data partitions, which can be exploited to capture subtle and weak data affinity. This approach shows better performance than other state-of-the-art methods including the Euclidean-distance-based knn [Wang et al.2008], dominant neighbourhoods [Pavan and Pelillo2007], consensus of knn [Premachandran and Kakarala2013], and non-metric based unsupervised manifold forests [Pei et al.2013].
The second step of spectral clustering is to use the spectrum of the similarity graph to reveal the cluster structure of the data. Due to the discrete constraint on the cluster labels, this problem is NP-hard. To obtain a feasible approximate solution, spectral clustering solves a relaxed version of this problem, i.e., the discrete constraint is relaxed to allow continuous values. It first performs eigenvalue decomposition on the Laplacian matrix to generate an approximate indicator matrix with continuous values. Then, k-means is often applied to produce the final clustering labels [Huang et al.2013]. Although this approach has been widely used in practice, it may exhibit poor performance since the k-means method is well known to be sensitive to the initialization of cluster centers [Ng et al.2002].
To address the aforementioned problems, in this paper, we propose a unified spectral clustering framework. It jointly learns the similarity graph from the data and the discrete clustering labels by solving an optimization problem, in which the continuous clustering labels just serve as intermediate products. To the best of our knowledge, this is the first work that combines the three steps into a single optimization problem. As we show later, it is not trivial to unify them. The contributions of our work are as follows:
- Rather than using predefined similarity metrics, the similarity graph is adaptively learned from the data in a kernel space. By combining similarity learning with subsequent clustering into a unified framework, we can ensure the optimality of the learned similarity graph.
- Unlike existing spectral clustering methods that work in three separate steps, we simultaneously learn the similarity graph, continuous labels, and discrete cluster labels. By leveraging the inherent interactions between these three subtasks, each of them can be boosted by the others.
- Based on our single kernel model, we further extend it to learn the optimal combination of multiple kernels.
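Notations. Given a data set $\{x_1, x_2, \dots, x_n\}$, we denote $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ samples. Then the $i$-th sample and the $(i,j)$-th element of matrix $X$ are denoted by $x_i \in \mathbb{R}^{m \times 1}$ and $x_{ij}$, respectively. The $\ell_2$-norm of a vector $x$ is defined by $\|x\|^2 = x^T x$, where $T$ means transpose. The squared Frobenius norm is denoted by $\|X\|_F^2$. The $\ell_1$-norm of matrix $X$ is defined as the absolute summation of its entries, i.e., $\|X\|_1 = \sum_i \sum_j |x_{ij}|$. $I$ denotes the identity matrix. Tr($\cdot$) is the trace operator. $Z \ge 0$ means all the elements of $Z$ are nonnegative.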
Recently, sparse representation, which assumes that each data point can be reconstructed as a linear combination of the other data points, has shown its power in many tasks [Cheng et al.2010, Peng et al.2016]. It often solves the following problem:
$$\min_{Z} \|X - XZ\|_F^2 + \alpha \|Z\|_1 \quad \text{s.t.} \quad Z \ge 0,\ \mathrm{diag}(Z) = 0, \tag{1}$$
where $\alpha > 0$ is a balancing parameter. Eq. (1) simultaneously determines both the neighboring samples of a data point and the corresponding weights by the sparse reconstruction from the remaining samples. In principle, more similar points should receive bigger weights and the weights should be smaller for less similar points. Thus $Z$ is also called the similarity graph matrix [Kang et al.2015]. In addition, sparse representation enjoys some nice properties, e.g., robustness to noise and datum-adaptive ability [Huang et al.2015]. On the other hand, model (1) has a drawback, i.e., it does not consider nonlinear data sets where data points reside in a union of manifolds [Kang et al.2017a].
Spectral clustering requires the Laplacian matrix $L \in \mathbb{R}^{n \times n}$ as an input, which is computed as $L = D - (Z^T + Z)/2$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with the $i$-th diagonal element $\sum_j (z_{ij} + z_{ji})/2$. In traditional spectral clustering methods, the similarity graph $Z$ is often constructed in one of the three aforementioned ways. Supposing there are $c$ clusters in the data $X$, spectral clustering solves the following problem:
$$\min_{F} \mathrm{Tr}(F^T L F) \quad \text{s.t.} \quad F \in \mathrm{Idx}, \tag{2}$$
where $F = [f_1, f_2, \dots, f_n]^T \in \mathbb{R}^{n \times c}$ is the cluster indicator matrix and $F \in \mathrm{Idx}$ means that the clustering label vector of each point, $f_i \in \{0,1\}^{c \times 1}$, contains one and only one element “1” to indicate the group membership of $x_i$. Due to the discrete constraint on $F$, problem (2) is NP-hard. In practice, $F$ is relaxed to allow continuous values, and one solves
$$\min_{P} \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad P^T P = I, \tag{3}$$
where $P \in \mathbb{R}^{n \times c}$ is the relaxed continuous clustering label matrix, and the orthogonal constraint is adopted to avoid trivial solutions. The optimal solution is obtained from the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues. After obtaining $P$, a traditional clustering method, e.g., k-means, is applied to obtain discrete cluster labels [Huang et al.2013].
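The relaxed step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the symmetrized unnormalized Laplacian $L = D - (Z^T + Z)/2$, and the helper names `laplacian` and `spectral_embedding` as well as the toy graph are hypothetical.

```python
import numpy as np

def laplacian(Z):
    """Unnormalized Laplacian L = D - W of the symmetrized graph W = (Z + Z^T) / 2."""
    W = (Z + Z.T) / 2.0
    D = np.diag(W.sum(axis=1))
    return D - W

def spectral_embedding(Z, c):
    """Relaxed indicator matrix: eigenvectors of L for the c smallest eigenvalues."""
    L = laplacian(Z)
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return eigvals[:c], eigvecs[:, :c]

# Toy similarity graph (assumed for illustration): two disconnected triangles.
Z = np.zeros((6, 6))
for block in ([0, 1, 2], [3, 4, 5]):
    for i in block:
        for j in block:
            if i != j:
                Z[i, j] = 1.0

vals, P = spectral_embedding(Z, c=2)
```

On this toy graph the two smallest eigenvalues vanish and the rows of `P` are identical within each triangle, so any reasonable discretization of `P` separates the two groups.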
Although this three-step approach provides a feasible solution, it comes with two potential risks. First, since the similarity graph computation is independent of the subsequent steps, it may be far from optimal. As we discussed before, the clustering performance is largely determined by the similarity graph. Thus, final results may be degraded. Second, the final solution may unpredictably deviate from the ground-truth discrete labels [Yang et al.2016]. To address these problems, we propose a unified spectral clustering model.
Spectral Clustering with Single Kernel
One drawback of Eq. (1) is that it assumes that all the points lie in a union of independent or disjoint subspaces and are noiseless. In the presence of dependent subspaces, nonlinear manifolds, and/or data errors, it may select points from different structures to represent a data point, which makes the representation less informative [Elhamifar and Vidal2009]. It is recognized that nonlinear data may exhibit linearity when mapped to an implicit, higher-dimensional space via a kernel function. To fully exploit data information, we formulate Eq. (1) in a general manner with a kernelization framework.
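Let $\phi: \mathbb{R}^m \to \mathcal{H}$ be a kernel mapping the data samples from the input space to a reproducing kernel Hilbert space $\mathcal{H}$. Then $X$ is transformed to $\phi(X) = [\phi(x_1), \dots, \phi(x_n)]$. The kernel similarity between data samples $x_i$ and $x_j$ is defined through a predefined kernel as $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. By applying this kernel trick, we do not need to know the transformation $\phi$. In the new space, Eq. (1) becomes [Zhang et al.2010]
$$\min_{Z} \|\phi(X) - \phi(X)Z\|_F^2 + \alpha \|Z\|_1 \iff \min_{Z} \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_1 \quad \text{s.t.} \quad Z \ge 0,\ \mathrm{diag}(Z) = 0. \tag{4}$$
This model recovers the linear relations among the data in the new space, and thus the nonlinear relations in the original representation. Eq. (4) is more general than Eq. (1) and is supposed to learn arbitrarily shaped data structure. Moreover, Eq. (4) goes back to Eq. (1) when a linear kernel is applied.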
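The statement that the kernel formulation reduces to the linear one under a linear kernel can be checked numerically. The sketch below assumes samples are stored as columns of `X` and that the kernelized reconstruction error has the trace form $\mathrm{Tr}(K - 2KZ + Z^T K Z)$; the random data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 features, 8 samples stored as columns
Z = rng.random((8, 8))            # an arbitrary nonnegative coefficient matrix
np.fill_diagonal(Z, 0.0)          # no sample represents itself

K = X.T @ X                       # linear kernel: K_ij = <x_i, x_j>

# Reconstruction error in the input space ...
recon_error = np.linalg.norm(X - X @ Z, "fro") ** 2
# ... and the same quantity written purely in terms of the kernel matrix.
kernel_form = np.trace(K - 2 * K @ Z + Z.T @ K @ Z)
```

Because only `K` enters `kernel_form`, swapping in a Gaussian or polynomial kernel matrix requires no access to the feature map itself.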
To fulfill the clustering task, we propose our spectral clustering with single kernel (SCSK) model as follows:
$$\min_{Z,F,P,Q} \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_1 + \beta\, \mathrm{Tr}(P^T L P) + \gamma \|F - PQ\|_F^2 \quad \text{s.t.} \quad Z \ge 0,\ \mathrm{diag}(Z) = 0,\ P^T P = I,\ Q^T Q = I,\ F \in \mathrm{Idx}, \tag{5}$$
where $\alpha$, $\beta$, and $\gamma$ are penalty parameters, and $Q$ is a rotation matrix. Due to the spectral solution invariance property [Yu and Shi2003], for any solution $P$, $PQ$ is another solution. The purpose of the last term is to find a proper orthonormal $Q$ such that the resulting $PQ$ is close to the real discrete clustering labels $F$. In Eq. (5), the similarity graph and the final discrete clustering labels are automatically learned from the data. Ideally, whenever data points $i$ and $j$ belong to different clusters, we must have $z_{ij} = 0$, and vice versa. That is to say, we have $z_{ij} \neq 0$ if and only if data points $i$ and $j$ are in the same cluster, or, equivalently, $f_i = f_j$. Therefore, our unified framework Eq. (5) can exploit the correlation between the similarity matrix and the labels. Because the inferred labels feed back into the similarity matrix and vice versa, we say that our clustering framework has a self-taught property.
In fact, Eq. (5) is not a simple unification of the pipeline of steps. It learns a similarity graph with optimal structure for clustering. Ideally, $Z$ should have exactly $c$ connected components if there are $c$ clusters in the data set [Kang et al.2017b]. This is to say that the Laplacian matrix $L$ has $c$ zero eigenvalues [Mohar et al.1991], i.e., the summation of the $c$ smallest eigenvalues, $\sum_{i=1}^{c} \sigma_i(L)$, is zero. To ensure the optimality of the similarity graph, we can minimize $\sum_{i=1}^{c} \sigma_i(L)$. According to Ky Fan’s theorem [Fan1949], $\sum_{i=1}^{c} \sigma_i(L) = \min_{P^T P = I} \mathrm{Tr}(P^T L P)$. Therefore, the spectral clustering term, i.e., the second term in Eq. (5), will ensure that the learned $Z$ is optimal for clustering.
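Both facts used in this argument, that the number of zero Laplacian eigenvalues equals the number of connected components and that the trace minimum equals the sum of the smallest eigenvalues, can be verified on a small example. The graph below and the helper `laplacian` are assumptions made for illustration; the Laplacian is taken in the symmetrized unnormalized form.

```python
import numpy as np

def laplacian(Z):
    """Unnormalized Laplacian of the symmetrized graph (Z + Z^T) / 2."""
    W = (Z + Z.T) / 2.0
    return np.diag(W.sum(axis=1)) - W

# Toy graph with three connected components: {0, 1}, {2, 3, 4}, {5}.
Z = np.zeros((6, 6))
Z[0, 1] = Z[1, 0] = 1.0
Z[2, 3] = Z[3, 4] = Z[2, 4] = 0.5
L = laplacian(Z)

eigvals, eigvecs = np.linalg.eigh(L)          # ascending eigenvalues
num_zero = int(np.sum(eigvals < 1e-10))       # multiplicity of eigenvalue 0

c = 3
P = eigvecs[:, :c]                            # minimizer of Tr(P^T L P) with P^T P = I
ky_fan_value = np.trace(P.T @ L @ P)

# Any other orthonormal P can only give a larger (or equal) trace.
rng = np.random.default_rng(0)
P_rand, _ = np.linalg.qr(rng.standard_normal((6, c)))
random_value = np.trace(P_rand.T @ L @ P_rand)
```

Driving the trace term toward zero therefore pushes the learned graph toward having exactly `c` components.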
To efficiently and effectively solve Eq. (5), we design an alternated iterative method.
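Computation of Z: With $F$, $P$, and $Q$ fixed, the problem is reduced to
$$\min_{Z} \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_1 + \beta\, \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad Z \ge 0,\ \mathrm{diag}(Z) = 0. \tag{6}$$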
We introduce an auxiliary variable to make above objective function separable and solve the following equivalent problem:
This can be solved by using the augmented Lagrange multiplier (ALM) type of method. We turn to minimizing the following augmented Lagrangian function:
where is the penalty parameter and is the Lagrange multiplier. This problem can be minimized with respect to , , and alternatively, by fixing the other variables.
For , by letting , it can be updated element-wise as below:
For , by letting , it can be updated column-wise as:
where is a vector with the -th element being . It is easy to obtain by setting the derivative of Eq. (10) w.r.t. to be zero.
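The explicit update formulas are not reproduced here. For an $\ell_1$-penalized variable inside an augmented Lagrangian, the element-wise update typically takes the form of the soft-thresholding operator, so the following is a sketch under that assumption rather than the paper's exact rule. The function name `soft_threshold` and the threshold `tau` (which would play the role of the $\ell_1$ weight divided by the ALM penalty parameter) are hypothetical, and the nonnegativity and zero-diagonal constraints are not enforced.

```python
import numpy as np

def soft_threshold(H, tau):
    """Element-wise minimizer of tau * |s| + 0.5 * (s - h)^2 for each entry h of H."""
    return np.sign(H) * np.maximum(np.abs(H) - tau, 0.0)

# Entries smaller than tau in magnitude are set to zero; the rest shrink toward zero.
H = np.array([[0.9, -0.2],
              [0.05, -1.5]])
S = soft_threshold(H, tau=0.1)
```

The operator costs one pass over the matrix, which is why this part of the alternating scheme is cheap compared with the column-wise linear solves.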
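Computation of P: With $Z$, $F$, and $Q$ fixed, it is equivalent to solving
$$\min_{P} \beta\, \mathrm{Tr}(P^T L P) + \gamma \|F - PQ\|_F^2 \quad \text{s.t.} \quad P^T P = I. \tag{11}$$
The above problem with orthogonal constraint can be efficiently solved by the algorithm proposed by Wen and Yin [Wen and Yin2013].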
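Computation of Q: With $Z$, $F$, and $P$ fixed, we have
$$\min_{Q} \|F - PQ\|_F^2 \quad \text{s.t.} \quad Q^T Q = I. \tag{12}$$
It is the orthogonal Procrustes problem [Schönemann1966], which admits a closed-form solution. The solution is
$$Q = U V^T, \tag{13}$$
where $U$ and $V$ are the left and right parts of the SVD of $P^T F$.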
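A minimal sketch of the closed-form rotation update, assuming the subproblem has the standard orthogonal Procrustes form $\min_Q \|F - PQ\|_F^2$ with $Q^T Q = I$ and that $P$ has orthonormal columns. The function name `procrustes_rotation` and the synthetic data are illustrative, not taken from the authors' code.

```python
import numpy as np

def procrustes_rotation(P, F):
    """Orthogonal Q minimizing ||F - P Q||_F, obtained from the SVD of P^T F."""
    U, _, Vt = np.linalg.svd(P.T @ F)
    return U @ Vt

rng = np.random.default_rng(0)
n, c = 10, 3
P, _ = np.linalg.qr(rng.standard_normal((n, c)))       # orthonormal columns
Q_true, _ = np.linalg.qr(rng.standard_normal((c, c)))  # a known rotation
F = P @ Q_true                                         # target reachable by rotating P

Q = procrustes_rotation(P, F)
```

In this synthetic case the target is exactly a rotated copy of `P`, so the recovered `Q` coincides with `Q_true`; with a discrete `F` the same call returns the best orthogonal alignment in the least-squares sense.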
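Computation of F: With $Z$, $P$, and $Q$ fixed, the problem becomes
$$\min_{F} \|F - PQ\|_F^2 \quad \text{s.t.} \quad F \in \mathrm{Idx}. \tag{14}$$
Note that $\mathrm{Tr}(F^T F) = n$, so the above subproblem can be rewritten as below:
$$\max_{F} \mathrm{Tr}(F^T P Q) \quad \text{s.t.} \quad F \in \mathrm{Idx}. \tag{15}$$
The optimal solution can be easily obtained as follows:
$$F_{ij} = \begin{cases} 1, & j = \arg\max_{k} (PQ)_{ik}, \\ 0, & \text{otherwise}. \end{cases} \tag{16}$$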
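The label update amounts to a row-wise argmax: since each row of the indicator matrix holds a single one, maximizing the trace objective picks, for every sample, the column where the rotated continuous label is largest. The sketch below assumes that form; `discretize` and the example matrix `PQ` are hypothetical.

```python
import numpy as np

def discretize(PQ):
    """One-hot indicator matrix: each row selects the column of its largest entry."""
    n, c = PQ.shape
    F = np.zeros((n, c))
    F[np.arange(n), np.argmax(PQ, axis=1)] = 1.0
    return F

# Rotated continuous labels for four samples and three clusters (made-up values).
PQ = np.array([[0.9, 0.1, -0.2],
               [0.2, 0.7, 0.1],
               [-0.1, 0.3, 0.8],
               [0.6, 0.5, 0.0]])
F = discretize(PQ)
labels = F.argmax(axis=1)
```

No k-means call is involved, so the discrete labels are deterministic given the continuous labels and the rotation.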
The updates of $Z$, $P$, $Q$, and $F$ are coupled with each other, so we could reach an overall optimal solution. The details of our SCSK optimization are summarized in Algorithm 1.
With our optimization strategy, the updating of requires complexity. The quadratic program can be solved in polynomial time. The solution of involves SVD and its complexity is . To update , we need . The complexity for is . Note that the number of clusters is often a small number. Therefore, the main computation load is from solving , which involves matrix inversion. Fortunately, is solved in parallel.
Spectral Clustering with Multiple Kernels
Although the model in Eq. (5) can automatically learn the similarity graph matrix and discrete cluster labels, its performance will strongly depend on the choice of kernels. It is often impractical to exhaustively search for the most suitable kernel. Moreover, real-world data sets are often generated from different sources along with heterogeneous features. A single kernel method may not be able to fully utilize such information. Multiple kernel learning has the ability to integrate complementary information and identify a suitable kernel for a given task. Here we present a way to learn an appropriate consensus kernel from a convex combination of a number of predefined kernel functions.
Suppose there are a total number of different kernel functions . An augmented Hilbert space can be constructed by using the mapping of with different weights . Then the combined kernel can be represented as [Zeng and Cheung2011]
Note that the convex combination of the positive semi-definite kernel matrices is still a positive semi-definite kernel matrix. Thus the combined kernel still satisfies Mercer’s condition. Then our proposed method of spectral clustering with multiple kernels (SCMK) can be formulated as
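The positive semi-definiteness claim can be checked numerically. The sketch below uses three illustrative kernels and a weight vector that is nonnegative and sums to one; these weights are an assumption for the demonstration and do not reproduce the weight constraint of the SCMK model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 12))                 # 4 features, 12 samples as columns

G = X.T @ X                                      # Gram matrix of inner products
sq_norms = np.diag(G)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * G

kernels = [
    G,                                           # linear kernel
    (1.0 + G) ** 2,                              # polynomial kernel of degree 2
    np.exp(-sq_dists / sq_dists.max()),          # Gaussian kernel
]

w = np.array([0.2, 0.3, 0.5])                    # assumed weights: nonnegative, sum to one
K_w = sum(w_i * K_i for w_i, K_i in zip(w, kernels))

eigs = np.linalg.eigvalsh(K_w)
min_eig, max_eig = eigs.min(), eigs.max()
```

Up to round-off, `min_eig` is nonnegative, so `K_w` can be passed to any routine that expects a valid kernel matrix.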
The above model learns the similarity graph, the discrete clustering labels, and the kernel weights by itself. By iteratively updating the similarity graph, the clustering labels, and the kernel weights, each of them is refined according to the results of the others.
In this part, we show an efficient and effective algorithm to iteratively and alternatively solve Eq. (18).
Update the other variables with the kernel weights fixed: We can directly calculate the combined kernel, and the optimization problem is exactly Eq. (5). Thus we just need to use Algorithm 1 with the combined kernel as the input kernel matrix.
The Lagrange function of Eq. (19) is
By utilizing the Karush-Kuhn-Tucker (KKT) condition with and the constraint , we obtain the solution of as follows:
We can see that the kernel weight is closely related to the similarity matrix. Therefore, we could obtain both the optimal similarity matrix and the kernel weight. We summarize the optimization process of Eq. (18) in Algorithm 2.
[Table 1: statistics of the data sets; columns: # instances, # features, # classes]
There are altogether ten real benchmark data sets used in our experiments. Table 1 summarizes the statistics of these data sets. Among them, the first six are image data, and the other four are text corpora (http://www-users.cs.umn.edu/ han/data/tmdata.tar.gz, http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html).
The six image data sets consist of four famous face databases: ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), YALE (http://vision.ucsd.edu/content/yale-face-database), AR (http://www2.ece.ohio-state.edu/ aleix/ARdatabase.html), and JAFFE (http://www.kasrl.org/jaffe.html); a toy image database, COIL20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php); and a binary alpha digits data set, BA (http://www.cs.nyu.edu/ roweis/data.html). Specifically, COIL20 contains images of 20 objects. For each object, the images were taken five degrees apart as the object is rotating on a turntable. There are 72 images for each object. Each image is represented by a 1,024-dimensional vector. BA consists of digits of “0” through “9” and letters of capital “A” through “Z”. There are 39 examples for each class. YALE, ORL, AR, and JAFFE contain images of individuals. Each image has different facial expressions or configurations due to times, illumination conditions, and glasses/no glasses.
To assess the effectiveness of multiple kernel learning, we adopted 12 kernels. They include: seven Gaussian kernels of the form , where is the maximal distance between samples and varies over the set ; a linear kernel ; four polynomial kernels with and . Furthermore, all kernels are rescaled to by dividing each element by the largest pair-wise squared distance.
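A kernel pool of this kind (seven Gaussian, one linear, four polynomial) could be assembled as sketched below. The concrete parameter values are placeholders chosen for illustration and are not the paper's settings; the Gaussian form, with the squared distance divided by a multiple of the squared maximal distance, and the polynomial form $(a + x^T y)^b$ are likewise assumptions. The final rescaling step is omitted, and `build_kernels` is a hypothetical helper.

```python
import numpy as np

def build_kernels(X, bandwidths, poly_offsets, poly_degrees):
    """Return a list of n x n kernel matrices for samples stored as columns of X."""
    G = X.T @ X                                   # inner products x_i^T x_j
    sq = np.diag(G)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)
    d_max_sq = sq_dists.max()                     # squared maximal pairwise distance

    # Gaussian kernels: one per bandwidth multiplier t.
    kernels = [np.exp(-sq_dists / (t * d_max_sq)) for t in bandwidths]
    # Linear kernel.
    kernels.append(G)
    # Polynomial kernels (a + x^T y)^b for every offset/degree pair.
    kernels += [(a + G) ** b for a in poly_offsets for b in poly_degrees]
    return kernels

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))                  # 5 features, 20 samples
kernels = build_kernels(
    X,
    bandwidths=[0.01, 0.05, 0.1, 1, 10, 50, 100], # placeholder values (assumed)
    poly_offsets=[0, 1],                          # placeholder values (assumed)
    poly_degrees=[2, 4],                          # placeholder values (assumed)
)
```

Each matrix in `kernels` can be fed to a single kernel method on its own, or the whole list can be handed to a multiple kernel method.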
For single kernel methods, we run the downloaded implementations of kernel k-means (KKM) [Schölkopf et al.1998], spectral clustering (SC) [Ng et al.2002], and robust kernel k-means (RKKM) [Du et al.2015], as well as SCSK, on each kernel separately. To demonstrate the advantage of our unified framework, we also implement a three-separate-steps method (TSEP), i.e., learning the similarity matrix by Eq. (4), followed by spectral clustering and k-means (repeated 20 times). We report both the best and the average results over all these kernels.
In addition, we also implement the recent simplex sparse representation (SSR) [Huang et al.2015] method and robust affinity graph construction methods using the random forest approach: ClustRF-u and ClustRF-a [Zhu et al.2014]. ClustRF-u assumes all tree nodes are uniformly important, while ClustRF-a assigns an adaptive weight to each node. Note that these three methods can only process data in the original feature space. Moreover, ClustRF has a high demand for memory and cannot process high dimensional data directly. Thus we follow the authors’ strategy and perform PCA on TR11, TR41, and TR45 to reduce the dimension. We use different numbers of dominant components and report the best clustering results. Nevertheless, we still cannot handle the TDT2 data set with them.
For multiple kernel methods, we implement our proposed method and directly use the downloaded programs for the methods in comparison on a combination of these 12 kernels:
- MKKM (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/mkfc/code). The MKKM [Huang et al.2012b] extends k-means in a multiple kernel setting. However, it imposes a different constraint on the kernel weight distribution.
- AASC (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/aasc/code). The AASC [Huang et al.2012a] is an extension of spectral clustering to the situation when multiple affinities exist. It is different from our approach since our method tries to learn an optimal similarity graph.
- SCMK. Our proposed method of spectral clustering with multiple kernels. For the purpose of reproducibility, the code is publicly available (https://github.com/sckangz/AAAI18).
Our method needs to be run only once. For those methods that involve k-means, we follow the strategy suggested in [Yang et al.2010]; i.e., we repeat clustering 20 times and present the results with the best objective values. We set the number of clusters to the true number of classes for all clustering algorithms.
We present the clustering results of different methods on those benchmark data sets in Table 2. In terms of accuracy, NMI and Purity, our proposed methods obtain superior results. The big difference between the best and average results confirms that the choice of kernels has a huge influence on the performance of single kernel methods. This motivates our extended model for multiple kernel learning. Besides, our extended model for multiple kernel clustering usually improves the results over our model for single kernel clustering.
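Purity is not defined in the text. The sketch below uses the common definition, in which each predicted cluster is credited with its most frequent true class, and should be read as an assumption about the metric rather than the authors' evaluation script. Labels are assumed to be nonnegative integers, and the example labels are made up.

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority true class of their cluster."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for k in np.unique(pred_labels):
        members = true_labels[pred_labels == k]   # true classes inside cluster k
        total += np.bincount(members).max()       # size of the majority class
    return total / len(true_labels)

y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 0, 0, 2, 2]
score = purity(y_true, y_pred)
```

Here one sample of class 0 falls into the cluster dominated by class 1, so seven of the eight samples are counted as correctly grouped.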
Although the best results of the three separate steps approach are sometimes close to our proposed unified method, their average values are often lower than our method. We notice that random forest based affinity graph method achieves good performance on image data sets. This observation can be explained by the fact that ClustRF is suitable to handle ambiguous and unreliable features caused by variation in illumination, face expression or pose on those data sets. On the other hand, it is not effective for text data sets. In most cases, ClustRF-a behaves better than ClustRF-u. This justifies the importance of considering neighbourhood-scale-adaptive weighting on the nodes.
There are three parameters in our model: $\alpha$, $\beta$, and $\gamma$. We use the YALE data set as an example to demonstrate the sensitivity of our model SCMK to these parameters. As shown in Figure 1, our model is quite insensitive to $\alpha$, $\beta$, and $\gamma$ over wide ranges of values. In terms of NMI and Purity, we have similar observations.
In this work, we address two problems existing in most classical spectral clustering algorithms, i.e., constructing the similarity graph and relaxing discrete constraints to continuous ones. To alleviate performance degradation, we propose a unified spectral clustering framework which automatically learns the similarity graph and discrete labels from the data. To cope with complex data, we develop our method in kernel space. A multiple kernel approach is proposed to resolve the dependence on the choice of kernel. Extensive experiments on nine real data sets demonstrate the promising performance of our methods compared to existing clustering approaches.
This paper was in part supported by Grants from the Natural Science Foundation of China (No. 61572111), the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015408), a 985 Project of UESTC (No. A1098531023601041) and a Fundamental Research Fund for the Central Universities of China (No. A03017023701012).
- [Cai et al.2013] Xiao Cai, Feiping Nie, Weidong Cai, and Heng Huang. Heterogeneous image features integration via multi-modal semi-supervised learning model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1737–1744, 2013.
- [Cheng et al.2010] Bin Cheng, Jianchao Yang, Shuicheng Yan, Yun Fu, and Thomas S Huang. Learning with ℓ1-graph for image analysis. IEEE transactions on image processing, 19(4):858–866, 2010.
- [Du et al.2015] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen. Robust multiple kernel k-means using ℓ2,1-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3476–3482. AAAI Press, 2015.
- [Elhamifar and Vidal2009] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE, 2009.
- [Fan1949] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35(11):652, 1949.
- [Huang et al.2012a] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Affinity aggregation for spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 773–780. IEEE, 2012.
- [Huang et al.2012b] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems, 20(1):120–134, 2012.
- [Huang et al.2013] Jin Huang, Feiping Nie, and Heng Huang. Spectral rotation versus k-means in spectral clustering. In AAAI, 2013.
- [Huang et al.2015] Jin Huang, Feiping Nie, and Heng Huang. A new simplex sparse learning model to measure data similarity for clustering. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3569–3575. AAAI Press, 2015.
- [Huang et al.2017] Shudong Huang, Hongjun Wang, Tao Li, Tianrui Li, and Zenglin Xu. Robust graph regularized nonnegative matrix factorization for clustering. Data Mining and Knowledge Discovery, pages 1–21, 2017.
- [Kang et al.2015] Zhao Kang, Chong Peng, and Qiang Cheng. Robust subspace clustering via smoothed rank approximation. IEEE Signal Processing Letters, 22(11):2088–2092, 2015.
- [Kang et al.2017a] Zhao Kang, Chong Peng, and Qiang Cheng. Kernel-driven similarity learning. Neurocomputing, 2017.
- [Kang et al.2017b] Zhao Kang, Chong Peng, and Qiang Cheng. Twin learning for similarity and clustering: A unified kernel approach. In AAAI, pages 2080–2086, 2017.
- [Kumar et al.2011] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In Advances in neural information processing systems, pages 1413–1421, 2011.
- [Mohar et al.1991] Bojan Mohar, Y Alavi, G Chartrand, and OR Oellermann. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2(871-898):12, 1991.
- [Ng et al.2002] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002.
- [Pavan and Pelillo2007] Massimiliano Pavan and Marcello Pelillo. Dominant sets and pairwise clustering. IEEE transactions on pattern analysis and machine intelligence, 29(1):167–172, 2007.
- [Pei et al.2013] Yuru Pei, Tae-Kyun Kim, and Hongbin Zha. Unsupervised random forest manifold alignment for lipreading. In Proceedings of the IEEE International Conference on Computer Vision, pages 129–136, 2013.
- [Peng et al.2016] Chong Peng, Zhao Kang, Ming Yang, and Qiang Cheng. Feature selection embedded subspace clustering. IEEE Signal Processing Letters, 23(7):1018–1022, 2016.
- [Premachandran and Kakarala2013] Vittal Premachandran and Ramakrishna Kakarala. Consensus of k-nns for robust neighborhood selection on graph-based manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1594–1601, 2013.
- [Schölkopf et al.1998] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
- [Schönemann1966] Peter H Schönemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966.
- [Von Luxburg2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
- [Wang et al.2008] Jun Wang, Shih-Fu Chang, Xiaobo Zhou, and Stephen TC Wong. Active microscopic cellular image annotation by superposable graph transduction with imbalanced labels. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
- [Wen and Yin2013] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.
- [Yang et al.2010] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10):2761–2773, 2010.
- [Yang et al.2015] Yang Yang, Zhigang Ma, Yi Yang, Feiping Nie, and Heng Tao Shen. Multitask spectral clustering by exploring intertask correlation. IEEE transactions on cybernetics, 45(5):1083–1094, 2015.
- [Yang et al.2016] Yang Yang, Fumin Shen, Zi Huang, and Heng Tao Shen. A unified framework for discrete spectral clustering. In IJCAI, pages 2273–2279, 2016.
- [Yu and Shi2003] Stella X Yu and Jianbo Shi. Multiclass spectral clustering. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 313–319. IEEE, 2003.
- [Zelnik-Manor and Perona2004] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In NIPS, volume 17, page 16, 2004.
- [Zeng and Cheung2011] Hong Zeng and Yiu-ming Cheung. Feature selection and kernel learning for local learning-based clustering. IEEE transactions on pattern analysis and machine intelligence, 33(8):1532–1547, 2011.
- [Zhang et al.2010] Changshui Zhang, Feiping Nie, and Shiming Xiang. A general kernelization framework for learning algorithms based on kernel pca. Neurocomputing, 73(4):959–967, 2010.
- [Zhu et al.2014] Xiatian Zhu, Chen Change Loy, and Shaogang Gong. Constructing robust affinity graphs for spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1450–1457, 2014.