Introduction
Clustering is a fundamental topic in data mining and machine learning [Peng et al.2016]. It partitions data points into different groups, such that the objects within a group are similar to one another and different from those in other groups. Various methods have been proposed over the past decades. Some well-known algorithms include k-means clustering [MacQueen1967], spectral clustering [Ng et al.2002], and hierarchical clustering [Johnson1967].
Thanks to its simplicity and effectiveness, the k-means algorithm is widely used. However, it fails to identify arbitrarily shaped clusters. Kernel k-means [Schölkopf, Smola, and Müller1998] was developed to capture the nonlinear structure information hidden in data sets. Kernel-based learning methods require one to specify a kernel, which amounts to assuming a certain shape of the underlying data space. Thus the performance of kernel-based methods is largely affected by the choice of kernel.
Spectral clustering performs a low-dimensional embedding of the similarity matrix of the data before applying k-means clustering [Ng et al.2002]. The similarity between every pair of points, given as an input, allows this clustering model to leverage the manifold information of the data. Thus similarity-based clustering methods usually show better performance than the k-means algorithm. However, the performance of such methods is largely determined by the similarity matrix [Huang, Nie, and Huang2015]. Any variation in the similarity measurement, such as the metric, the neighborhood size, or the data scale, may lead to suboptimal performance.
Recently, self-expression has been successfully utilized in subspace recovery [Elhamifar and Vidal2009, Luo et al.2011], low-rank representation [Kang, Peng, and Cheng2015b, Kang, Peng, and Cheng2015a], and recommender systems [Kang and Cheng2016]. It represents each data point in terms of the other points. By solving an optimization problem, the similarity information is automatically learned from the data. This approach can not only reveal low-dimensional structure, but is also robust to noise and data scale [Huang, Nie, and Huang2015].
Contributions
In this paper, we perform clustering built upon the idea of using samples from the data to "express itself." In contrast to local structure learning [Nie, Wang, and Huang2014], this approach extracts the global structure of the data and can be extended to kernel spaces. Unlike existing clustering algorithms that work in two separate steps, we simultaneously learn the similarity matrix and the cluster indicators by imposing a rank constraint on the Laplacian matrix of the learned similarity matrix. By leveraging the intrinsic interplay between similarity learning and cluster-indicator learning, our proposed model seamlessly integrates them into a joint framework, where the result of one task is used to improve the other. To capture the nonlinear structure information inherent in many real-world data sets, we develop our method directly in a kernel space, which is well known for its ability to explore nonlinear relations. We design an efficient algorithm to find an optimal solution to our model, and provide a theoretical analysis of its connections to kernel k-means, k-means, and spectral clustering methods.
While effective, kernel methods depend heavily on the kernel in use. Unfortunately, the most suitable kernel for a specific task is usually unknown in advance. Exhaustive search over a user-defined pool of kernels is time-consuming and impractical when the sizes of the pool and the data become large [Zeng and Cheung2011]. Thus we further propose a multiple-kernel extension of our model. Another benefit of applying multiple kernels is that we can fully utilize information from different sources equipped with heterogeneous features [Yu et al.2012]. To reduce the effort of kernel construction and to integrate complementary information, we learn an appropriate consensus kernel from a linear combination of multiple input kernels. As a result, our joint model simultaneously learns the similarity information, the cluster indicator matrix, and the optimal combination of multiple kernels. Extensive empirical results on real-world benchmark data sets show that our method consistently outperforms other state-of-the-art methods.
Notations.
In this paper, matrices are written as uppercase letters and vectors are represented by boldface lowercase letters. The $i$-th column and the $(i,j)$-th element of a matrix $Z$ are denoted by $z_i$ and $z_{ij}$, respectively. The squared $\ell_2$-norm of a vector $x$ is defined as $\|x\|^2 = x^\top x$, where $^\top$ means transpose. $I$ denotes the identity matrix and $\mathbf{1}$ denotes a column vector with all elements equal to one. $\mathrm{Tr}(\cdot)$ is the trace operator. $0 \le Z \le 1$ means that all elements of $Z$ are in the range $[0, 1]$.
Clustering with Single Kernel
According to the self-expressive property [Elhamifar and Vidal2009], each data point can be represented in terms of the other data points:
x_i \approx X z_i, \quad \text{i.e.,} \quad X \approx XZ,   (1)
where $z_{ij}$ is the weight of the $j$-th sample in the representation of $x_i$. More similar data points should receive bigger weights and the weights should be smaller for less similar points. Thus $Z$ is also called the similarity matrix, which represents the global structure of the data. Note that (1) is in a similar spirit to Locally Linear Embedding (LLE) [Roweis and Saul2000], which assumes that the data points lie on a manifold and that each data point can be expressed as a linear combination of its nearest neighbors. The difference from LLE lies in the fact that we specify no neighborhood: it is automatically determined by our method.
To obtain $Z$, we solve the following problem:
\min_{Z} \|X - XZ\|_F^2 + \alpha \|Z\|_F^2, \quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1,   (2)
where the first term measures the reconstruction error, the second term is imposed to avoid the trivial solution $Z = I$, and $\alpha > 0$ is a trade-off parameter.
One drawback of (2) is that it assumes linear relations between samples. To recover the nonlinear relations between the data points, we extend (2) to kernel spaces by deploying a general kernelization framework [Zhang, Nie, and Xiang2010]. Define $\phi$ to be a kernel mapping of the data samples from the input space to a reproducing kernel Hilbert space $\mathcal{H}$. For $X$ containing $n$ samples, the transformation is $\phi(X) = [\phi(x_1), \dots, \phi(x_n)]$. The kernel similarity between data samples $x_i$ and $x_j$ is defined through a predefined kernel as $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. It is easy to observe that all similarities can be computed exclusively using the kernel function, and one does not need to know the transformation $\phi$ explicitly. This is known as the kernel trick, and it greatly simplifies the computations in the kernel space when the kernels are precomputed. Then (2) becomes
\min_{Z} \mathrm{Tr}(K - 2KZ + Z^\top K Z) + \alpha \|Z\|_F^2, \quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1.   (3)
By solving the above problem, we learn the linear sparse relations of $\phi(X)$, and thus the nonlinear relations among $X$. Note that (3) reduces to (2) if a linear kernel is adopted.
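As a concrete numerical illustration of the kernel trick (a minimal sketch of ours, not from the paper), a Gaussian kernel matrix can be computed from pairwise distances alone, without ever forming $\phi(x)$ explicitly:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), one sample per column.

    The feature map phi is never formed explicitly: only pairwise
    squared distances in the input space are needed (the kernel trick).
    """
    sq = np.sum(X**2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X.T @ X, 0)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))  # 5 features, 40 samples
K = gaussian_kernel(X)
```

The resulting matrix is symmetric positive semidefinite, as required of a valid kernel, and can be passed to any of the kernel formulations above.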
Ideally, we expect the number of connected components in $Z$ to be exactly $c$ if the given data set consists of $c$ clusters (that is, $Z$ is block diagonal under proper permutations). However, the solution from (3) might not satisfy this desired property. Therefore, we add another constraint based on the following theorem [Mohar et al.1991].
Theorem 1.
The multiplicity $c$ of the eigenvalue 0 of the Laplacian matrix $L$ of $Z$ is equal to the number of connected components in the graph associated with the similarity matrix $Z$.
Theorem 1 means that $\mathrm{rank}(L) = n - c$ if the similarity matrix $Z$ contains exactly $c$ connected components. Thus our new clustering model is to solve:
\min_{Z} \mathrm{Tr}(K - 2KZ + Z^\top K Z) + \alpha \|Z\|_F^2, \quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1,\; \mathrm{rank}(L) = n - c.   (4)
Problem (4) is not easy to tackle, since $L = D - \frac{Z^\top + Z}{2}$, where the degree matrix $D$ is diagonal with $i$-th diagonal element $\sum_j \frac{z_{ij} + z_{ji}}{2}$, also depends on the similarity matrix $Z$.
Here $L$ is positive semidefinite, so $\sigma_i(L) \ge 0$, where $\sigma_i(L)$ is the $i$-th smallest eigenvalue of $L$. The constraint $\mathrm{rank}(L) = n - c$ is equivalent to $\sum_{i=1}^{c} \sigma_i(L) = 0$. It is not easy to enforce this constraint, because an optimization problem with a rank constraint is known to be of combinatorial complexity. To mitigate the difficulty, [Wang et al.2015, Nie et al.2016] incorporate the rank constraint into the objective function as a regularizer. Motivated by this consideration, we relax the constraint and reformulate our model as
\min_{Z} \mathrm{Tr}(K - 2KZ + Z^\top K Z) + \alpha \|Z\|_F^2 + \beta \sum_{i=1}^{c} \sigma_i(L), \quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1.   (5)
The minimization will drive the regularizer $\sum_{i=1}^{c} \sigma_i(L)$ to zero if $\beta$ is large enough. Then the constraint $\mathrm{rank}(L) = n - c$ will be satisfied.
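Theorem 1 is easy to check numerically. The toy sketch below (ours, not from the paper) builds a similarity matrix whose graph has two connected components and verifies that its Laplacian has a zero eigenvalue of multiplicity two:

```python
import numpy as np

# Toy similarity matrix whose graph has two components: {0,1,2} and {3,4}.
Z = np.zeros((5, 5))
Z[0, 1] = Z[1, 0] = Z[1, 2] = Z[2, 1] = 1.0
Z[3, 4] = Z[4, 3] = 1.0

W = (Z + Z.T) / 2                  # symmetrized similarity
L = np.diag(W.sum(axis=1)) - W     # graph Laplacian L = D - W
eigvals = np.linalg.eigvalsh(L)    # ascending order
n_components = int(np.sum(eigvals < 1e-10))
```

Here `n_components` equals 2, matching the number of connected components, which is exactly the property the rank constraint enforces for $c$ clusters.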
Problem (5) is still challenging because of the last term. Fortunately, it can be handled by using Ky Fan's theorem [Fan1949], i.e.,
\sum_{i=1}^{c} \sigma_i(L) = \min_{P \in \mathbb{R}^{n \times c},\, P^\top P = I} \mathrm{Tr}(P^\top L P),   (6)
where $P$ is the cluster indicator matrix. The elements of its $i$-th row measure the membership of data point $x_i$ in the $c$ clusters. Finally, our model of twin learning for similarity and clustering with a single kernel (SCSK) is formulated as
\min_{Z, P} \mathrm{Tr}(K - 2KZ + Z^\top K Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(P^\top L P), \quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1,\; P^\top P = I.   (7)
By solving (7), we directly obtain the indicator matrix $P$; therefore, we no longer need to perform spectral clustering. By alternately updating $Z$ and $P$, they improve each other and jointly optimize (7).
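The Ky Fan identity used above can also be verified numerically (a sketch with a random graph, ours, not from the paper): stacking the eigenvectors of the $c$ smallest eigenvalues into $P$ attains the minimum of $\mathrm{Tr}(P^\top L P)$ over orthonormal $P$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 8))
W = (A + A.T) / 2                       # random symmetric similarity
L = np.diag(W.sum(axis=1)) - W          # its Laplacian

c = 3
vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
P = vecs[:, :c]                         # minimizer predicted by Ky Fan
ky_fan_value = np.trace(P.T @ L @ P)    # equals vals[:c].sum()
```

Any other orthonormal $P$ yields a trace at least as large, which is why the eigenvector update below is optimal.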
Optimization Algorithm
We use an alternating optimization strategy for (7). When $Z$ is fixed, (7) becomes
\min_{P} \mathrm{Tr}(P^\top L P), \quad \mathrm{s.t.}\; P^\top P = I.   (8)
The optimal solution $P$ is given by the eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.
When $P$ is fixed, (7) can be reformulated column-wisely as:
\min_{z_i} z_i^\top (K + \alpha I) z_i + \Big(\frac{\beta}{2} d_i - 2 K_{:,i}\Big)^{\!\top} z_i, \quad \mathrm{s.t.}\; z_i^\top \mathbf{1} = 1,\; 0 \le z_i \le 1,   (9)
where $d_i \in \mathbb{R}^{n \times 1}$ is a vector whose $j$-th element is $d_{ij} = \|p_i - p_j\|^2$, with $p_i$ denoting the $i$-th row of $P$. To obtain (9), the important equation in spectral analysis
\mathrm{Tr}(P^\top L P) = \frac{1}{2} \sum_{i,j} \|p_i - p_j\|^2 z_{ij}   (10)
is used. (9) can be further simplified as
\min_{z_i} \frac{1}{2} z_i^\top G z_i + h_i^\top z_i, \quad \mathrm{s.t.}\; z_i^\top \mathbf{1} = 1,\; 0 \le z_i \le 1, \quad \text{where } G = 2(K + \alpha I),\; h_i = \frac{\beta}{2} d_i - 2 K_{:,i}.   (11)
This problem can be solved by many existing quadratic programming packages. The complete algorithm is outlined in Algorithm 1.
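The alternating procedure can be sketched as follows (a hypothetical minimal implementation of ours, assuming a projected-gradient step onto the simplex in place of an off-the-shelf QP package; the names `scsk` and `proj_simplex` are illustrative, not the authors' code):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto {z : z >= 0, 1^T z = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0)

def scsk(K, c, alpha=0.1, beta=1.0, n_outer=15, n_inner=50):
    """Alternate the P-step (8) and the column-wise Z-step (11)."""
    n = K.shape[0]
    Z = np.full((n, n), 1.0 / n)               # feasible starting point
    G = 2 * (K + alpha * np.eye(n))            # Hessian of each column QP
    step = 1.0 / np.linalg.norm(G, 2)          # safe gradient step size
    for _ in range(n_outer):
        W = (Z + Z.T) / 2
        L = np.diag(W.sum(axis=1)) - W
        _, vecs = np.linalg.eigh(L)
        P = vecs[:, :c]                        # c smallest eigenvectors of L
        sq = np.sum(P**2, axis=1)              # d_ij = ||p_i - p_j||^2
        D2 = sq[:, None] + sq[None, :] - 2 * P @ P.T
        for i in range(n):                     # projected gradient per column
            h = beta / 2 * D2[:, i] - 2 * K[:, i]
            z = Z[:, i]
            for _ in range(n_inner):
                z = proj_simplex(z - step * (G @ z + h))
            Z[:, i] = z
    return Z, P

rng = np.random.default_rng(3)
X = np.hstack([rng.normal(0, 0.1, (2, 10)), rng.normal(3, 0.1, (2, 10))])
Z, P = scsk(X.T @ X, c=2)                      # linear kernel, two blobs
```

The learned `Z` stays column-stochastic and nonnegative by construction, and `P` is read off directly as the cluster indicator matrix.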
Theoretical Analysis of SCSK Model
In this section, we present a theoretical analysis of SCSK and its connections to kernel k-means, k-means, and spectral clustering (SC).
Connection to Kernel K-means and K-means
Here we introduce a theorem which states the equivalence of SCSK to kernel k-means and k-means under certain conditions.
Theorem 2.
When $\alpha \to \infty$, problem (4) is equivalent to the kernel k-means problem.
Proof.
The constraint in (4) makes the solution $Z$ block diagonal. Let $Z^j \in \mathbb{R}^{n_j \times n_j}$ denote the similarity matrix of the $j$-th component of $Z$, where $n_j$ is the number of data samples in the component. Problem (4) is equivalent to solving the following problem for each $Z^j$:
\min_{Z^j} \|\phi(X^j) - \phi(X^j) Z^j\|_F^2 + \alpha \|Z^j\|_F^2, \quad \mathrm{s.t.}\; (Z^j)^\top \mathbf{1} = \mathbf{1},\; 0 \le Z^j \le 1,   (12)
where $X^j$ consists of the samples corresponding to the $j$-th component of $Z$. When $\alpha \to \infty$, the above problem becomes
\min_{Z^j} \|Z^j\|_F^2, \quad \mathrm{s.t.}\; (Z^j)^\top \mathbf{1} = \mathbf{1},\; 0 \le Z^j \le 1.   (13)
The optimal solution is that all elements of $Z^j$ are equal to $\frac{1}{n_j}$.
Thus when $\alpha \to \infty$, the optimal solution to problem (4) is
Z^*_{uv} = \begin{cases} \frac{1}{n_j}, & \text{if } x_u \text{ and } x_v \text{ belong to the } j\text{-th component},\\ 0, & \text{otherwise}. \end{cases}   (14)
Denote the set of solutions of this form by $\mathcal{F}$. It is easy to observe that $\|Z^*\|_F^2 = c$ for every $Z^* \in \mathcal{F}$. Thus (4) becomes
\min_{Z \in \mathcal{F}} \|\phi(X) - \phi(X) Z\|_F^2 = \min_{C_1, \dots, C_c} \sum_{j=1}^{c} \sum_{x_i \in C_j} \Big\| \phi(x_i) - \frac{1}{n_j} \sum_{x_k \in C_j} \phi(x_k) \Big\|^2.   (15)
It is easy to deduce that $\frac{1}{n_j} \sum_{x_k \in C_j} \phi(x_k)$ is the mean of cluster $C_j$ in the kernel space. Therefore, (15) is exactly the kernel k-means problem. Thus our proposed algorithm solves the kernel k-means problem when $\alpha \to \infty$. ∎
Corollary 1.
When $\alpha \to \infty$ and a linear kernel is adopted, problem (4) is equivalent to the k-means problem.
Proof.
It is obvious when no transformation $\phi$ is applied to $X$ in (15). ∎
Connection to Spectral Clustering
With a predefined similarity matrix $W$, spectral clustering solves the following problem:
\min_{P} \mathrm{Tr}(P^\top L P), \quad \mathrm{s.t.}\; P^\top P = I,   (16)
where $L$ is the Laplacian matrix constructed from $W$. The optimal solution $P$ is given by the eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues. Generally, $P$ cannot be used directly for clustering, since the graph of $W$ does not have exactly $c$ connected components. To obtain the final clustering result, k-means or some other discretization procedure must be performed on $P$ [Huang, Nie, and Huang2013].
In our proposed algorithm, the similarity matrix $Z$ is not predefined as in the existing spectral clustering methods in the literature. Moreover, $Z$ is learned by taking the clustering task at hand into account, as opposed to the existing subspace clustering methods, which only focus on learning the similarity matrix without considering the effect of clustering on $Z$ [Peng et al.2015]. In our method, the graph with the learned $Z$ is partitioned into $c$ connected components by means of $P$. The optimal $P$ is formed by the eigenvectors, corresponding to the $c$ smallest eigenvalues, of the Laplacian $L$ built from the learned $Z$. Therefore, the proposed algorithm learns the similarity matrix and the cluster indicator matrix simultaneously, in a coupled way. Since it learns an adaptive graph for clustering, it achieves better results in real applications than existing spectral methods, as shown in our experiments.
Clustering with Multiple Kernels
Although model (7) can automatically learn the similarity matrix and the cluster indicator matrix, its performance largely depends on the choice of kernel. It is often impractical to exhaustively search for the most suitable kernel. Moreover, real-world data sets are often generated from different sources with heterogeneous features. A single-kernel method may not be able to fully utilize such information. Multiple kernel learning is capable of integrating complementary information and identifying a suitable kernel for a given task. Here we present a way to learn an appropriate consensus kernel from a convex combination of several predefined kernel matrices.
Suppose there are a total of $r$ different kernel functions $\{K^i\}_{i=1}^{r}$. Correspondingly, there are $r$ different kernel spaces, denoted $\{\mathcal{H}^i\}_{i=1}^{r}$. An augmented Hilbert space $\tilde{\mathcal{H}}$ can be constructed by concatenating all kernel spaces, using the mapping $\tilde{\phi}(x) = [\sqrt{w_1}\,\phi_1(x), \sqrt{w_2}\,\phi_2(x), \dots, \sqrt{w_r}\,\phi_r(x)]$ with different weights $\sqrt{w_i}$ ($w_i \ge 0$). Then the combined kernel $K_w$ can be represented as [Zeng and Cheung2011]
K_w(x, y) = \langle \tilde{\phi}(x), \tilde{\phi}(y) \rangle = \sum_{i=1}^{r} w_i K^i(x, y).   (17)
Note that a convex combination of positive semidefinite kernel matrices is still a positive semidefinite kernel matrix. Thus the combined kernel still satisfies Mercer's condition. We then propose our model of joint similarity learning and clustering with multiple kernels (SCMK), which can be written as
\min_{Z, P, w} \mathrm{Tr}(K_w - 2 K_w Z + Z^\top K_w Z) + \alpha \|Z\|_F^2 + \beta\, \mathrm{Tr}(P^\top L P),
\quad \mathrm{s.t.}\; Z^\top \mathbf{1} = \mathbf{1},\; 0 \le Z \le 1,\; P^\top P = I,\; K_w = \sum_{i=1}^{r} w_i K^i,\; \sum_{i=1}^{r} \sqrt{w_i} = 1,\; w_i \ge 0.   (18)
By iteratively updating $Z$, $P$, and $w$, each of them is adaptively refined according to the results of the other two.
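Forming the consensus kernel from precomputed base kernels is straightforward; the small sketch below (illustrative, not the authors' code) also checks that a convex combination of positive semidefinite kernels stays positive semidefinite:

```python
import numpy as np

def combine_kernels(kernels, w):
    """Consensus kernel K_w = sum_i w_i K^i for nonnegative weights w."""
    return sum(wi * Ki for wi, Ki in zip(w, kernels))

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 30))                  # 4 features, 30 samples
lin = X.T @ X                                     # linear kernel (Gram matrix)
sq = np.sum(X**2, axis=0)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * lin, 0)
gauss = np.exp(-d2 / d2.max())                    # a Gaussian kernel
Kw = combine_kernels([lin, gauss], [0.3, 0.7])    # convex combination
```

Because `Kw` remains positive semidefinite, it can be plugged into Algorithm 1 unchanged, which is exactly what the first update step below exploits.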
Optimization
Problem (18) can be solved by alternately updating $Z$, $P$, and $w$, while holding the other variables constant.
1) Optimizing with respect to $Z$ and $P$ when $w$ is fixed: We can directly calculate $K_w$, and the optimization problem is exactly (7). Thus we just need to run Algorithm 1 with $K_w$ as the input kernel matrix.
2) Optimizing with respect to $w$ when $Z$ and $P$ are fixed: Solving (18) with respect to $w$ can be rewritten as [Cai et al.2013]
\min_{w} \sum_{i=1}^{r} w_i h_i, \quad \mathrm{s.t.}\; \sum_{i=1}^{r} \sqrt{w_i} = 1,\; w_i \ge 0,   (19)
where
h_i = \mathrm{Tr}(K^i - 2 K^i Z + Z^\top K^i Z).   (20)
The Lagrange function of (19) is
\mathcal{J}(w) = \sum_{i=1}^{r} w_i h_i + \gamma \Big(1 - \sum_{i=1}^{r} \sqrt{w_i}\Big).   (21)
By utilizing the Karush-Kuhn-Tucker (KKT) condition $\partial \mathcal{J} / \partial w_i = 0$ and the constraint $\sum_{i=1}^{r} \sqrt{w_i} = 1$, we obtain the solution for $w$ as follows:
w_i = \Big( h_i \sum_{j=1}^{r} \frac{1}{h_j} \Big)^{-2}.   (22)
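The kernel-weight update is a closed-form one-liner; the sketch below (ours, illustrative) confirms that the resulting weights satisfy the simplex-type constraint on $\sqrt{w_i}$ and that kernels with smaller loss $h_i$ receive larger weights:

```python
import numpy as np

def update_weights(h):
    """Kernel-weight update w_i = (h_i * sum_j 1/h_j)^(-2) from the KKT
    conditions, where h_i = Tr(K^i - 2 K^i Z + Z^T K^i Z)."""
    h = np.asarray(h, dtype=float)
    return (h * np.sum(1.0 / h)) ** -2

w = update_weights([2.0, 4.0, 8.0])  # three hypothetical per-kernel losses
```

With these inputs, the square roots of the weights sum to one and the ordering of the weights is the reverse of the ordering of the losses.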
In Algorithm 2 we provide a complete algorithm for solving (18).
Experiments
Table 1: Description of the data sets.

Data set   # instances   # features   # classes
YALE       165           1024         15
JAFFE      213           676          10
ORL        400           1024         40
AR         840           768          120
BA         1404          320          36
TR11       414           6429         9
TR41       878           7454         10
TR45       690           8261         10
In this section, we demonstrate the effectiveness of the proposed method on real world benchmark data sets.
Data Sets
There are altogether eight benchmark data sets used in our experiments. Table 1 summarizes their statistics. Among them, five are image data sets, and the other three are text corpora (http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz). The five image data sets consist of four commonly used face databases (ORL: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, YALE: http://vision.ucsd.edu/content/yale-face-database, AR: http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html [Martinez and Benavente2007], and JAFFE: http://www.kasrl.org/jaffe.html) and a binary alpha digits data set, BA (http://www.cs.nyu.edu/~roweis/data.html).
Experiment Setup
To assess the effectiveness of multiple kernel learning, we design 12 kernels. These include: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, where $\sigma = t\, d_{\max}$ with $d_{\max}$ the maximal distance between samples and $t$ varying over a predefined set; a linear kernel $K(x, y) = x^\top y$; and four polynomial kernels $K(x, y) = (a + x^\top y)^b$ with different values of $a$ and $b$. Furthermore, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest pairwise squared distance.
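A kernel pool of this shape can be constructed as follows (a sketch of ours; the specific grids for $t$, $a$, and $b$ are hypothetical placeholders, and each kernel is rescaled here by its largest absolute entry for simplicity):

```python
import numpy as np

def build_kernel_pool(X):
    """12 kernels: 7 Gaussian, 1 linear, 4 polynomial (one sample per column).

    The parameter grids below are illustrative placeholders, not the
    exact values used in the experiments.
    """
    lin = X.T @ X
    sq = np.sum(X**2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * lin, 0)
    dmax = np.sqrt(d2.max())                     # maximal pairwise distance
    pool = [np.exp(-d2 / (2 * (t * dmax) ** 2))  # Gaussian, sigma = t * dmax
            for t in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    pool.append(lin)                             # linear kernel
    pool += [(lin + a) ** b for a in (0, 1) for b in (2, 4)]  # polynomial
    return [K / np.abs(K).max() for K in pool]   # rescale each kernel

rng = np.random.default_rng(5)
pool = build_kernel_pool(rng.standard_normal((6, 25)))
```

Each entry of the pool is an $n \times n$ matrix ready to be fed to the single- or multiple-kernel models above.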
For single-kernel methods, we run kernel k-means (KKM) [Schölkopf, Smola, and Müller1998], spectral clustering (SC) [Ng et al.2002], robust kernel k-means (RKKM) [Du et al.2015], and our proposed SCSK (https://github.com/sckangz/AAAI17) on each kernel separately. The methods in comparison were downloaded from their authors' websites. We report both the best and the average results over all these kernels.
For multiple kernel methods, we implement the following algorithms on a combination of the above kernels.
MKKM (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/mkfc/code). MKKM [Huang, Chuang, and Chen2012b] extends k-means to a multiple-kernel setting, but uses a different way to learn the kernel weights.
AASC (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/aasc/code). AASC [Huang, Chuang, and Chen2012a] extends spectral clustering to the situation where multiple affinity matrices exist.
RMKKM (https://github.com/csliangdu/RMKKM). RMKKM [Du et al.2015] adopts the ℓ2,1-norm to measure the loss of k-means.
SCMK. Our proposed method for joint similarity learning and clustering with multiple kernels.
For spectral-clustering-based methods such as SC and AASC, we run k-means on the spectral embedding to obtain the clustering results. To reduce the influence of initialization, we follow the strategy suggested in [Yang et al.2010, Du et al.2015]: we repeat clustering 20 times and report the results with the best objective values. We set the number of clusters to the true number of classes for all clustering algorithms.
To quantitatively evaluate the clustering performance, we adopt three widely used metrics: accuracy (Acc), normalized mutual information (NMI) [Cai et al.2009], and Purity.
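Of these metrics, Purity is the simplest to state precisely; a short reference implementation (ours, not from the paper) is:

```python
import numpy as np

def purity(y_true, y_pred):
    """Each predicted cluster is labeled with its majority true class;
    purity is the fraction of points that end up correctly labeled."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    hits = sum(np.bincount(y_true[y_pred == k]).max()
               for k in np.unique(y_pred))
    return hits / y_true.size
```

Purity is 1 when every cluster contains points from a single class, and degrades toward the largest class proportion when clusters are mixed.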
Clustering Results
Table 2 shows the clustering results in terms of accuracy, NMI, and Purity on all the data sets. It can be seen that the proposed SCSK and SCMK produce promising results. In particular, our method substantially improves the performance on the JAFFE, AR, BA, TR11, and TR45 data sets. The big gap between the best and the average results confirms that the choice of kernel has a huge influence on the performance of single-kernel methods; this gap is what motivates the development of multiple kernel learning methods. Moreover, multiple-kernel clustering approaches usually improve the results over single-kernel clustering methods.
Parameter Selection
There are two parameters, $\alpha$ and $\beta$, in our models, each varied over a wide range of values. Figure 1 shows how the clustering results of SCMK in terms of Acc, NMI, and Purity vary with $\alpha$ and $\beta$ on the JAFFE and YALE data sets. We observe that the performance of SCMK is very stable over a large range of values of one parameter, while being more sensitive to the other.
Conclusion
In this paper, we first propose a clustering method that simultaneously performs similarity learning and cluster indicator matrix construction. In our method, similarity learning and cluster indicator learning are integrated within one framework; the method can be easily extended to kernel spaces, so as to capture nonlinear structure information. The connections of the proposed method to kernel k-means, k-means, and spectral clustering are also established. To avoid an extensive search for the best kernel, we further incorporate multiple kernel learning into our model. Similarity learning, cluster indicator construction, and kernel weight learning each benefit from the results of the other two. Extensive experiments have been conducted on real-world benchmark data sets to demonstrate the superior performance of our method.
Acknowledgements
This work is supported by US National Science Foundation Grant IIS-1218712. Q. Cheng is the corresponding author.
References
[Cai et al.2009] Cai, D.; He, X.; Wang, X.; Bao, H.; and Han, J. 2009. Locality preserving nonnegative matrix factorization. In IJCAI, volume 9, 1010–1015.
[Cai et al.2013] Cai, X.; Nie, F.; Cai, W.; and Huang, H. 2013. Heterogeneous image features integration via multi-modal semi-supervised learning model. In Proceedings of the IEEE International Conference on Computer Vision, 1737–1744.
[Du et al.2015] Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y.-D. 2015. Robust multiple kernel k-means using ℓ2,1-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, 3476–3482. AAAI Press.
[Elhamifar and Vidal2009] Elhamifar, E., and Vidal, R. 2009. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2790–2797. IEEE.
[Fan1949] Fan, K. 1949. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America 35(11):652.
[Huang, Chuang, and Chen2012a] Huang, H.-C.; Chuang, Y.-Y.; and Chen, C.-S. 2012a. Affinity aggregation for spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), 773–780. IEEE.
[Huang, Chuang, and Chen2012b] Huang, H.-C.; Chuang, Y.-Y.; and Chen, C.-S. 2012b. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems 20(1):120–134.
[Huang, Nie, and Huang2013] Huang, J.; Nie, F.; and Huang, H. 2013. Spectral rotation versus k-means in spectral clustering. In AAAI.
[Huang, Nie, and Huang2015] Huang, J.; Nie, F.; and Huang, H. 2015. A new simplex sparse learning model to measure data similarity for clustering. In Proceedings of the 24th International Conference on Artificial Intelligence, 3569–3575. AAAI Press.
[Johnson1967] Johnson, S. C. 1967. Hierarchical clustering schemes. Psychometrika 32(3):241–254.
[Kang and Cheng2016] Kang, Z., and Cheng, Q. 2016. Top-N recommendation with novel rank approximation. In 2016 SIAM International Conference on Data Mining (SDM 2016), 126–134.
[Kang, Peng, and Cheng2015a] Kang, Z.; Peng, C.; and Cheng, Q. 2015a. Robust subspace clustering via smoothed rank approximation. IEEE Signal Processing Letters 22(11):2088–2092.
[Kang, Peng, and Cheng2015b] Kang, Z.; Peng, C.; and Cheng, Q. 2015b. Robust subspace clustering via tighter rank approximation. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 393–401. ACM.
[Luo et al.2011] Luo, D.; Nie, F.; Ding, C.; and Huang, H. 2011. Multi-subspace representation and discovery. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 405–420. Springer.
[MacQueen1967] MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 281–297. Oakland, CA, USA.
[Martinez and Benavente2007] Martinez, A., and Benavente, R. 2007. The AR face database, 1998. Computer Vision Center, Technical Report 3.
[Mohar et al.1991] Mohar, B.; Alavi, Y.; Chartrand, G.; and Oellermann, O. 1991. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications 2:871–898.
[Ng et al.2002] Ng, A. Y.; Jordan, M. I.; Weiss, Y.; et al. 2002. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 2:849–856.
[Nie et al.2016] Nie, F.; Wang, X.; Jordan, M. I.; and Huang, H. 2016. The constrained Laplacian rank algorithm for graph-based clustering. In Thirtieth AAAI Conference on Artificial Intelligence.
[Nie, Wang, and Huang2014] Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 977–986. ACM.
[Peng et al.2015] Peng, C.; Kang, Z.; Li, H.; and Cheng, Q. 2015. Subspace clustering using log-determinant rank approximation. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 925–934. ACM.
[Peng et al.2016] Peng, C.; Kang, Z.; Hu, Y.; Cheng, J.; and Cheng, Q. 2016. Nonnegative matrix factorization with integrated graph and feature learning. ACM Transactions on Intelligent Systems and Technology.
[Roweis and Saul2000] Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
[Schölkopf, Smola, and Müller1998] Schölkopf, B.; Smola, A.; and Müller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5):1299–1319.
[Wang et al.2015] Wang, X.; Liu, Y.; Nie, F.; and Huang, H. 2015. Discriminative unsupervised dimensionality reduction. In Proceedings of the 24th International Conference on Artificial Intelligence, 3925–3931. AAAI Press.
[Yang et al.2010] Yang, Y.; Xu, D.; Nie, F.; Yan, S.; and Zhuang, Y. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19(10):2761–2773.
[Yu et al.2012] Yu, S.; Tranchevent, L.; Liu, X.; Glanzel, W.; Suykens, J. A.; De Moor, B.; and Moreau, Y. 2012. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(5):1031–1039.
[Zeng and Cheung2011] Zeng, H., and Cheung, Y.-M. 2011. Feature selection and kernel learning for local learning-based clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1532–1547.
[Zhang, Nie, and Xiang2010] Zhang, C.; Nie, F.; and Xiang, S. 2010. A general kernelization framework for learning algorithms based on kernel PCA. Neurocomputing 73(4):959–967.