Twin Learning for Similarity and Clustering: A Unified Kernel Approach

05/01/2017 ∙ by Zhao Kang, et al. ∙ Southern Illinois University

Many similarity-based clustering methods work in two separate steps including similarity matrix computation and subsequent spectral clustering. However, similarity measurement is challenging because it is usually impacted by many factors, e.g., the choice of similarity metric, neighborhood size, scale of data, noise and outliers. Thus the learned similarity matrix is often not suitable, let alone optimal, for the subsequent clustering. In addition, nonlinear similarity often exists in many real world data which, however, has not been effectively considered by most existing methods. To tackle these two challenges, we propose a model to simultaneously learn cluster indicator matrix and similarity information in kernel spaces in a principled way. We show theoretical relationships to kernel k-means, k-means, and spectral clustering methods. Then, to address the practical issue of how to select the most suitable kernel for a particular clustering task, we further extend our model with a multiple kernel learning ability. With this joint model, we can automatically accomplish three subtasks of finding the best cluster indicator matrix, the most accurate similarity relations and the optimal combination of multiple kernels. By leveraging the interactions between these three subtasks in a joint framework, each subtask can be iteratively boosted by using the results of the others towards an overall optimal solution. Extensive experiments are performed to demonstrate the effectiveness of our method.





Clustering is a fundamental topic in data mining and machine learning [Peng et al.2016]. It partitions data points into different groups, such that the objects within a group are similar to one another and different from those in other groups. Various methods have been proposed over the past decades. Some well-known algorithms include k-means clustering [MacQueen1967], spectral clustering [Ng et al.2002], and hierarchical clustering [Johnson1967].


Thanks to its simplicity and effectiveness, the k-means algorithm is widely used. However, it fails to identify arbitrarily shaped clusters. Kernel k-means [Schölkopf, Smola, and Müller1998] has been developed to capture nonlinear structure information hidden in data sets. Kernel-based learning methods require one to specify a kernel, which means one assumes a certain shape of the underlying data space. Thus the performance of kernel-based methods is largely affected by the choice of kernel.

Spectral clustering computes a low-dimensional embedding of the similarity matrix of the data before performing k-means clustering [Ng et al.2002]. The similarity between every pair of points, as an input, leverages the manifold information in this clustering model. Thus similarity-based clustering methods usually show better performance than the k-means algorithm. However, the performance of this kind of method is largely determined by the similarity matrix [Huang, Nie, and Huang2015]. Any variation in the similarity measurement, such as the metric, neighborhood size, or data scale, may lead to suboptimal performance.

Recently, self-expression has been successfully utilized in subspace recovery [Elhamifar and Vidal2009, Luo et al.2011], low rank representation [Kang, Peng, and Cheng2015b, Kang, Peng, and Cheng2015a], and recommender systems [Kang and Cheng2016]. It represents each data point in terms of the other points. By solving an optimization problem, the similarity information is automatically learned from the data. This approach can not only reveal low-dimensional structure, but also be robust to noise and data scale [Huang, Nie, and Huang2015].


In this paper, we perform clustering built upon the idea of using samples from the data to “express itself”. Rather than local structure learning [Nie, Wang, and Huang2014], this approach extracts the global structure of data and can be extended to kernel spaces. Unlike existing clustering algorithms that work in two separate steps, we simultaneously learn the similarity matrix and the cluster indicators by imposing a rank constraint on the Laplacian matrix of the learned similarity matrix. By leveraging the intrinsic interactions between learning similarity and cluster indicators, our proposed model seamlessly integrates them into a joint framework, where the result of one task is used to improve the other one. To capture the nonlinear structure information inherent in many real world data sets, we directly develop our method in a kernel space, which is well known for its ability to explore nonlinear relations. We design an efficient algorithm to find an optimal solution to our model, and present a theoretical analysis of its connections to kernel k-means, k-means, and spectral clustering methods.

While kernel methods are effective, the kernel in use often has enormous influence on their performance. Unfortunately, the most suitable kernel for a specific task is usually unknown in advance. Exhaustive search on a user-defined pool of kernels is time-consuming and impractical when the sizes of the pool and data become large [Zeng and Cheung2011]. Thus we further propose a multiple kernel algorithm for our model. Another benefit of applying multiple kernels is that we can fully utilize information from different sources equipped with heterogeneous features [Yu et al.2012]. To alleviate the effort of kernel construction and to integrate complementary information, we learn an appropriate consensus kernel from a linear combination of multiple input kernels. As a result, our joint model can simultaneously learn the similarity information, cluster indicator matrix, and the optimal combination of multiple kernels. Extensive empirical results on real-world benchmark data sets show that our method consistently outperforms other state-of-the-art methods.


In this paper, matrices are written as upper case letters and vectors are represented by boldface lower-case letters. The $i$-th column and the $(i,j)$-th element of matrix $X$ are denoted by $X_i$ and $x_{ij}$, respectively. The $\ell_2$-norm of a vector $x$ is defined as $\|x\|_2 = \sqrt{x^T x}$, where $T$ means transpose. $I$ denotes the identity matrix and $\mathbf{1}$ denotes a column vector with all the elements as one. Tr($\cdot$) is the trace operator. $0 \le Z \le 1$ means all elements of $Z$ are in the range $[0, 1]$.

Clustering with Single Kernel

According to the self-expressive property [Elhamifar and Vidal2009], each data point can be expressed as a linear combination of the other points:

$$x_i \approx \sum_{j} x_j z_{ji}, \tag{1}$$

where $z_{ji}$ is the weight for the $j$-th sample. More similar data points should receive bigger weights and the weights should be smaller for less similar points. Thus $Z$ is also called the similarity matrix, which represents the global structure of data. Note that (1) is in a similar spirit to Locally Linear Embedding (LLE) [Roweis and Saul2000], which assumes that the data points lie on a manifold and each data point can be expressed as a linear combination of its nearest neighbors. The difference from LLE lies in the fact that we specify no neighborhood, which is automatically determined by our method.

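To obtain $Z$, we solve the following problem:

$$\min_{Z} \; \|X - XZ\|_F^2 + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1, \tag{2}$$

where the first term measures the reconstruction error, the second term is imposed to avoid the trivial solution $Z = I$, and $\alpha$ is a trade-off parameter.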

One drawback of (2) is that it assumes linear relations between samples. To recover the nonlinear relations between the data points, we extend (2) to kernel spaces by deploying a general kernelization framework [Zhang, Nie, and Xiang2010]. Define $\phi: \mathcal{R}^D \rightarrow \mathcal{H}$ to be a kernel mapping the data samples from the input space to a reproducing kernel Hilbert space $\mathcal{H}$. For $X = [x_1, \ldots, x_n]$ containing $n$ samples, the transformation is $\phi(X) = [\phi(x_1), \ldots, \phi(x_n)]$. The kernel similarity between data samples $x_i$ and $x_j$ is defined through a predefined kernel as $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. It is easy to observe that all similarities can be computed exclusively using the kernel function and one does not need to know the transformation $\phi$. This is known as the kernel trick and it greatly simplifies the computations in the kernel space when the kernels are precomputed. Then (2) becomes

$$\min_{Z} \; \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1, \tag{3}$$

where the trace term equals $\|\phi(X) - \phi(X) Z\|_F^2$. By solving the above problem, we learn the linear sparse relations of $\phi(X)$, and thus the nonlinear relations among $X$. Note that (3) goes back to (2) if a linear kernel is adopted.
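As a small numerical illustration of the kernel trick used here, the sketch below evaluates the reconstruction error in the trace form $\mathrm{Tr}(K - 2KZ + Z^T K Z)$ and compares it with $\|X - XZ\|_F^2$ for a linear kernel $K = X^T X$. The function name `kernel_reconstruction_error`, the random data, and the way `Z` is normalized are assumptions made for this example only; samples are assumed to be stored as columns of `X`.

```python
import numpy as np

def kernel_reconstruction_error(K, Z):
    """Tr(K - 2KZ + Z^T K Z), computed from the kernel matrix only."""
    return np.trace(K - 2.0 * K @ Z + Z.T @ K @ Z)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 features, 8 samples stored as columns
Z = rng.random((8, 8))
Z /= Z.sum(axis=0, keepdims=True)      # each column sums to one, entries in [0, 1]

K_lin = X.T @ X                        # linear kernel: K_ij = x_i^T x_j
err_kernel = kernel_reconstruction_error(K_lin, Z)
err_direct = np.linalg.norm(X - X @ Z, "fro") ** 2
print(err_kernel, err_direct)          # the two values agree up to rounding
```

With a nonlinear kernel the same function applies unchanged, since only `K` is needed and the mapping itself is never formed.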

Ideally, we expect the number of connected components in $Z$ to be exactly $c$ if the given data set consists of $c$ clusters (that is, $Z$ is block diagonal with proper permutations). However, the solution from (3) might not satisfy this desired property. Therefore, we will add another constraint based on the following theorem [Mohar et al.1991].

Theorem 1.

The multiplicity $c$ of the eigenvalue 0 of the Laplacian matrix $L$ of $Z$ is equal to the number of connected components in the graph with the similarity matrix $Z$.
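The statement of Theorem 1 can be checked numerically on a toy graph. The sketch below builds a block-diagonal similarity matrix with three blocks and counts the zero eigenvalues of its Laplacian. The helper names, the tolerance `tol`, and the symmetrization $(Z + Z^T)/2$ inside `laplacian` are assumptions of this example, not part of the original formulation.

```python
import numpy as np

def laplacian(Z):
    """Graph Laplacian D - W of the symmetrized similarity W = (Z + Z^T) / 2."""
    W = (Z + Z.T) / 2.0
    return np.diag(W.sum(axis=1)) - W

def zero_eig_multiplicity(L, tol=1e-10):
    """Number of eigenvalues of the symmetric matrix L that are numerically zero."""
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

# Block-diagonal similarity with three connected components of sizes 2, 3, 4.
sizes = [2, 3, 4]
n = sum(sizes)
Z = np.zeros((n, n))
start = 0
for s in sizes:
    Z[start:start + s, start:start + s] = 1.0 / s
    start += s

L = laplacian(Z)
mult = zero_eig_multiplicity(L)
rank = np.linalg.matrix_rank(L)
print(mult, rank)   # multiplicity of eigenvalue 0 and rank of L
```

Here the multiplicity equals the number of blocks and the rank equals $n$ minus that number, which is the property the rank constraint relies on.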

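Theorem 1 means that $\mathrm{rank}(L) = n - c$ if the similarity matrix $Z$ contains exactly $c$ connected components. Thus our new clustering model is to solve:

$$\min_{Z} \; \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1, \; \mathrm{rank}(L) = n - c. \tag{4}$$

Problem (4) is not easy to tackle, since $L = D - Z$, where $D$ is a diagonal matrix with the $i$-th diagonal element $\sum_j z_{ij}$, also depends on the similarity matrix $Z$.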

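Here $L$ is positive semi-definite, thus $\sigma_i(L) \ge 0$, where $\sigma_i(L)$ is the $i$-th smallest eigenvalue of $L$. $\mathrm{rank}(L) = n - c$ is equivalent to $\sum_{i=1}^{c} \sigma_i(L) = 0$. It is not easy to enforce this constraint because the optimization problem with a rank constraint is known to be of combinatorial complexity. To mitigate the difficulty, [Wang et al.2015, Nie et al.2016] incorporate the rank constraint into the objective function as a regularizer. Motivated by this consideration, we relax the constraint and reformulate our model as

$$\min_{Z} \; \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 + \beta \sum_{i=1}^{c} \sigma_i(L) \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1. \tag{5}$$

The minimization will make the regularizer $\sum_{i=1}^{c} \sigma_i(L) \rightarrow 0$ if $\beta$ is large enough. Then the constraint $\mathrm{rank}(L) = n - c$ will be satisfied.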

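Problem (5) is still a challenging problem because of the last term. Fortunately, it can be solved by using Ky Fan’s Theorem [Fan1949], i.e.,

$$\sum_{i=1}^{c} \sigma_i(L) = \min_{P \in \mathcal{R}^{n \times c}, \; P^T P = I} \mathrm{Tr}(P^T L P), \tag{6}$$

where $P$ is the indicator matrix. The $c$ elements of the $i$-th row $P_{i,:}$ measure the membership of data point $x_i$ in the $c$ clusters. Finally, our model of twin learning for similarity and clustering with a single kernel (SCSK) is formulated as

$$\min_{Z, P} \; \mathrm{Tr}(K - 2KZ + Z^T K Z) + \alpha \|Z\|_F^2 + \beta \, \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1, \; P^T P = I. \tag{7}$$

By solving (7), we directly obtain the indicator matrix $P$; therefore, we do not need to perform spectral clustering any more. By alternately updating $Z$ and $P$, they can improve each other and optimize (7).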

Optimization Algorithm

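We use an alternating optimization strategy for (7). When $Z$ is fixed, (7) becomes

$$\min_{P} \; \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad P^T P = I. \tag{8}$$

The optimal solution $P$ is obtained by the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.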
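The update of the indicator matrix can be sketched in a few lines of NumPy. The example below takes the eigenvectors of a Laplacian belonging to its smallest eigenvalues and checks numerically that $\mathrm{Tr}(P^T L P)$ then equals the sum of those eigenvalues, as Ky Fan's theorem states. The function name `update_P`, the random symmetric similarity `W`, and the comparison matrix `Q` are assumptions introduced for this illustration.

```python
import numpy as np

def update_P(L, c):
    """Return the c eigenvectors of symmetric L with the smallest eigenvalues."""
    vals, vecs = np.linalg.eigh(L)      # eigenvalues are returned in ascending order
    return vecs[:, :c], vals[:c]

rng = np.random.default_rng(0)
n, c = 10, 3
W = rng.random((n, n))
W = (W + W.T) / 2.0                     # symmetric similarity matrix
L = np.diag(W.sum(axis=1)) - W          # Laplacian L = D - W

P, smallest = update_P(L, c)
ky_fan_value = np.trace(P.T @ L @ P)    # attains the minimum over orthonormal P

# Any other matrix with orthonormal columns gives a trace that is at least as large.
Q, _ = np.linalg.qr(rng.normal(size=(n, c)))
other_value = np.trace(Q.T @ L @ Q)
print(ky_fan_value, smallest.sum(), other_value)
```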

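When $P$ is fixed, (7) can be reformulated column-wise as:

$$\min_{Z_i} \; K_{ii} - 2K_{i,:} Z_i + Z_i^T K Z_i + \alpha Z_i^T Z_i + \frac{\beta}{2} d_i^T Z_i \quad \text{s.t.} \quad Z_i^T \mathbf{1} = 1, \; 0 \le Z_i \le 1, \tag{9}$$

where $d_i \in \mathcal{R}^{n \times 1}$ is a vector with the $j$-th element being $d_{ij} = \|P_{i,:} - P_{j,:}\|_2^2$. To obtain (9), the important equation in spectral analysis

$$\sum_{i,j} \frac{1}{2} \|P_{i,:} - P_{j,:}\|_2^2 \, z_{ij} = \mathrm{Tr}(P^T L P) \tag{10}$$

is used. (9) can be further simplified as

$$\min_{Z_i} \; Z_i^T (\alpha I + K) Z_i + \left( \frac{\beta d_i^T}{2} - 2K_{i,:} \right) Z_i \quad \text{s.t.} \quad Z_i^T \mathbf{1} = 1, \; 0 \le Z_i \le 1. \tag{11}$$

This problem can be solved by many existing quadratic programming packages. The complete algorithm is outlined in Algorithm 1.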
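The column-wise subproblem is a quadratic program of the form $\min_z z^T(\alpha I + K)z + (\beta d_i/2 - 2K_{i,:})^T z$ over vectors with nonnegative entries summing to one. As a rough sketch, and instead of an off-the-shelf quadratic programming package, the code below uses projected gradient descent with a sort-based projection onto the simplex. This solver choice, the function names, the iteration count `n_iter`, and the random test data are all assumptions of this example. On the simplex every entry is automatically at most one, so the upper bound needs no separate handling.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def objective(z, A, q):
    """Quadratic objective z^T A z + q^T z."""
    return z @ A @ z + q @ z

def update_column(K, P, i, alpha, beta, n_iter=500):
    """Approximately solve the subproblem for column i by projected gradient."""
    n = K.shape[0]
    d = np.sum((P[i] - P) ** 2, axis=1)          # d_ij = ||P_i,: - P_j,:||^2
    A = alpha * np.eye(n) + K
    q = beta * d / 2.0 - 2.0 * K[i]
    step = 1.0 / (2.0 * np.linalg.eigvalsh(A)[-1])   # 1 / Lipschitz constant of the gradient
    z = np.full(n, 1.0 / n)                      # start from the uniform column
    for _ in range(n_iter):
        z = project_simplex(z - step * (2.0 * A @ z + q))
    return z, A, q

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 12))
K = X.T @ X                                      # linear kernel, positive semi-definite
P, _ = np.linalg.qr(rng.normal(size=(12, 3)))    # stand-in indicator matrix
z, A, q = update_column(K, P, i=0, alpha=1.0, beta=0.1)
z0 = np.full(12, 1.0 / 12)
print(z.round(3), objective(z, A, q), objective(z0, A, q))
```

A dedicated quadratic programming solver would typically be preferable for accuracy; the projected gradient loop is used here only to keep the sketch dependency-free.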

Input: Kernel matrix $K$, parameters $\alpha$, $\beta$.
Initialize: Random matrix $Z$.

REPEAT
1:   Update $P$, which is formed by the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.
2:   For each $i$, update the $i$-th column of $Z$ by solving problem (11).

UNTIL stopping criterion is met.

Algorithm 1 The algorithm of SCSK

Theoretical Analysis of SCSK Model

In this section, we present a theoretical analysis of SCSK and its connections to kernel k-means, k-means, and SC.

Connection to Kernel K-means and K-means

Here we introduce a theorem which states the equivalence of SCSK and kernel k-means (and k-means) under some condition.

Theorem 2.

When $\alpha \rightarrow \infty$, the problem (4) is equivalent to the kernel k-means problem.


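Proof. The constraint $\mathrm{rank}(L) = n - c$ in (4) makes the solution $Z$ block diagonal. Let $Z^i \in \mathcal{R}^{n_i \times n_i}$ denote the similarity matrix of the $i$-th component of $Z$, where $n_i$ is the number of data samples in the component. Problem (4) is equivalent to solving the following problem for each $Z^i$:

$$\min_{Z^i} \; \|\phi(X^i) - \phi(X^i) Z^i\|_F^2 + \alpha \|Z^i\|_F^2 \quad \text{s.t.} \quad (Z^i)^T \mathbf{1} = \mathbf{1}, \; 0 \le Z^i \le 1, \tag{12}$$

where $X^i$ consists of the samples corresponding to the $i$-th component of $Z$. When $\alpha \rightarrow \infty$, the above problem becomes

$$\min_{Z^i} \; \|Z^i\|_F^2 \quad \text{s.t.} \quad (Z^i)^T \mathbf{1} = \mathbf{1}, \; 0 \le Z^i \le 1. \tag{13}$$

The optimal solution is that all elements of $Z^i$ are equal to $1/n_i$.

Thus when $\alpha \rightarrow \infty$, the optimal solution to problem (4) is

$$z_{ij} = \begin{cases} 1/n_k & \text{if } x_i \text{ and } x_j \text{ are in the same } k\text{-th component,} \\ 0 & \text{otherwise.} \end{cases} \tag{14}$$

Denote the solution set of this form by $\mathcal{V}$. It is easy to observe that $\|Z\|_F^2 = c$ for every $Z \in \mathcal{V}$. Thus (4) becomes

$$\min_{Z \in \mathcal{V}} \; \|\phi(X) - \phi(X) Z\|_F^2. \tag{15}$$

It is easy to deduce that $\phi(X) Z_i$ is the mean of the cluster containing $x_i$ in the kernel space. Therefore, (15) is exactly the kernel k-means. Thus our proposed algorithm is to solve the kernel k-means problem when $\alpha \rightarrow \infty$. ∎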
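The argument of the proof can be illustrated numerically for a linear kernel. The sketch below builds the block-uniform similarity matrix (entries $1/n_k$ within a cluster, zero elsewhere) from a given labeling and checks that the kernelized reconstruction error equals the usual k-means within-cluster sum of squares, and that the squared Frobenius norm of the matrix equals the number of clusters. The labeling, the random data, and the function names are assumptions of this example.

```python
import numpy as np

def block_uniform_Z(labels):
    """z_ij = 1/n_k if samples i and j share cluster k, and 0 otherwise."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    sizes = np.bincount(labels)[labels]          # cluster size for each sample
    return same / sizes[None, :]

def kmeans_cost(X, labels):
    """Sum of squared distances of samples (columns of X) to their cluster mean."""
    labels = np.asarray(labels)
    cost = 0.0
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        cost += np.sum((Xk - Xk.mean(axis=1, keepdims=True)) ** 2)
    return cost

rng = np.random.default_rng(0)
labels = [0, 0, 1, 1, 1, 2, 2, 2, 2]
X = rng.normal(size=(4, len(labels)))            # samples stored as columns
Z = block_uniform_Z(labels)

K = X.T @ X                                      # linear kernel
kernel_cost = np.trace(K - 2.0 * K @ Z + Z.T @ K @ Z)
direct_cost = kmeans_cost(X, labels)
frob_sq = np.sum(Z ** 2)
print(kernel_cost, direct_cost, frob_sq)
```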

Corollary 1.

When $\alpha \rightarrow \infty$ and a linear kernel is adopted, the problem (4) is equivalent to the k-means problem.

Proof. It is obvious when one does not use any transformation on $X$ in (15). ∎

Connection to Spectral Clustering

With a predefined similarity $Z$, spectral clustering is to solve the following problem:

$$\min_{P} \; \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad P^T P = I. \tag{16}$$

The optimal solution is obtained by the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues. Generally, $P$ cannot be directly used for clustering since $Z$ does not have exactly $c$ connected components. To obtain the final clustering results, k-means or some other discretization procedure must be performed on $P$ [Huang, Nie, and Huang2013].

In our proposed algorithm, the similarity matrix $Z$ is not predefined, unlike in the existing spectral clustering methods in the literature. Also, the similarity matrix $Z$ is learned by taking account of the clustering task at hand, as opposed to the existing subspace clustering methods in the literature which only focus on learning the similarity matrix without considering the effect of clustering on $Z$ [Peng et al.2015]. In our method, the graph with the learned $Z$ will be partitioned into $c$ connected components by using $P$. The optimal solution $P$ is formed by the $c$ eigenvectors of $L$, which is defined by $Z$, corresponding to the $c$ smallest eigenvalues. Therefore, the proposed algorithm learns the similarity matrix and the cluster indicator matrix simultaneously in a coupled way, which leads to a better result in real applications than existing spectral methods as shown in our experiments, since it learns an adaptive graph for clustering.

Clustering with Multiple Kernels

Although model (7) can automatically learn the similarity matrix and cluster indicator matrix, its performance will largely be determined by the choice of kernel. It is often impractical to exhaustively search for the most suitable kernel. Moreover, real world data sets are often generated from different sources along with heterogeneous features. A single kernel method may not be able to fully utilize such information. Multiple kernel learning is capable of integrating complementary information and identifying a suitable kernel for a given task. Here we present a way to learn an appropriate consensus kernel from a convex combination of several predefined kernel matrices.

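Suppose there are a total of $r$ different kernel functions $\{K^i\}_{i=1}^{r}$. Correspondingly, there would be $r$ different kernel spaces denoted as $\{\mathcal{H}^i\}_{i=1}^{r}$. An augmented Hilbert space, $\tilde{\mathcal{H}} = \bigoplus_{i=1}^{r} \mathcal{H}^i$, can be constructed by concatenating all kernel spaces and by using the mapping of $\tilde{\phi}(x) = [\sqrt{w_1}\,\phi_1(x), \sqrt{w_2}\,\phi_2(x), \ldots, \sqrt{w_r}\,\phi_r(x)]^T$ with different weights $\sqrt{w_i}$ ($w_i \ge 0$). Then the combined kernel $K_w$ can be represented as [Zeng and Cheung2011]

$$K_w(x, y) = \langle \tilde{\phi}(x), \tilde{\phi}(y) \rangle = \sum_{i=1}^{r} w_i K^i(x, y). \tag{17}$$

Note that the convex combination of the positive semi-definite kernel matrices is still a positive semi-definite kernel matrix. Thus the combined kernel still satisfies Mercer’s condition. Then we propose our joint similarity learning and clustering with multiple kernel (SCMK) model which can be written as

$$\min_{Z, P, w} \; \mathrm{Tr}(K_w - 2K_w Z + Z^T K_w Z) + \alpha \|Z\|_F^2 + \beta \, \mathrm{Tr}(P^T L P) \quad \text{s.t.} \quad Z^T \mathbf{1} = \mathbf{1}, \; 0 \le Z \le 1, \; P^T P = I, \; K_w = \sum_{i=1}^{r} w_i K^i, \; \sum_{i=1}^{r} \sqrt{w_i} = 1, \; w_i \ge 0. \tag{18}$$

By iteratively updating $Z$, $P$, and $w$, each of them will be adaptively refined according to the results of the other two.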


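Problem (18) can be solved by alternately updating $Z$, $P$, and $w$, while holding the other variables constant.

1) Optimizing with respect to $Z$ and $P$ when $w$ is fixed: We can directly calculate $K_w$, and the optimization problem is exactly (7). Thus we just need to use Algorithm 1 with $K_w$ as the input kernel matrix.

2) Optimizing with respect to $w$ when $Z$ and $P$ are fixed: Solving (18) with respect to $w$ can be rewritten as [Cai et al.2013]

$$\min_{w} \; \sum_{i=1}^{r} w_i h_i \quad \text{s.t.} \quad \sum_{i=1}^{r} \sqrt{w_i} = 1, \; w_i \ge 0, \tag{19}$$

where

$$h_i = \mathrm{Tr}(K^i - 2K^i Z + Z^T K^i Z). \tag{20}$$

The Lagrange function of (19) is

$$\mathcal{J}(w) = w^T h + \gamma \left( 1 - \sum_{i=1}^{r} \sqrt{w_i} \right). \tag{21}$$

By utilizing the Karush-Kuhn-Tucker (KKT) condition with $\partial \mathcal{J}(w) / \partial w_i = 0$ and the constraint $\sum_{i=1}^{r} \sqrt{w_i} = 1$, we obtain the solution of $w$ as follows:

$$w_i = \left( h_i \sum_{j=1}^{r} \frac{1}{h_j} \right)^{-2}. \tag{22}$$

In Algorithm 2 we provide a complete algorithm for solving (18).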

Input: A set of kernel matrices $\{K^i\}_{i=1}^{r}$, parameters $\alpha$, $\beta$.
Initialize: Random matrix $Z$, $w_i = 1/r$.

REPEAT
1:   Calculate $K_w$ by (17).
2:   Update $P$ with the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.
3:   For each $i$, update the $i$-th column of $Z$ by (11).
4:   Calculate $h$ by (20).
5:   Update $w$ by (22).

UNTIL stopping criterion is met.

Algorithm 2 The algorithm of SCMK
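The kernel-weight steps of Algorithm 2 (computing the per-kernel costs and then the weights) can be sketched as follows. The closed form used in `update_weights` assumes that each cost $h_i = \mathrm{Tr}(K^i - 2K^iZ + Z^TK^iZ)$ is strictly positive and that the weights are constrained so that their square roots sum to one; under that assumption the weights are $w_i = (h_i \sum_j 1/h_j)^{-2}$. The three toy kernels, the uniform `Z`, and all function names are assumptions of this example.

```python
import numpy as np

def kernel_costs(kernels, Z):
    """h_i = Tr(K^i - 2 K^i Z + Z^T K^i Z) for every kernel matrix K^i."""
    return np.array([np.trace(K - 2.0 * K @ Z + Z.T @ K @ Z) for K in kernels])

def update_weights(h):
    """Closed-form weights w_i = (h_i * sum_j 1/h_j)^(-2); requires all h_i > 0."""
    return (h * np.sum(1.0 / h)) ** -2.0

def combine_kernels(kernels, w):
    """Consensus kernel K_w = sum_i w_i K^i."""
    return np.tensordot(w, np.stack(kernels), axes=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))                     # 10 samples stored as columns
G = X.T @ X                                      # Gram matrix (linear kernel)
sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G   # squared distances
kernels = [G, np.exp(-sq / sq.max()), (1.0 + G) ** 2]      # toy kernel pool

Z = np.full((10, 10), 0.1)                       # placeholder similarity matrix
h = kernel_costs(kernels, Z)
w = update_weights(h)
K_w = combine_kernels(kernels, w)
print(h, w, np.sqrt(w).sum())
```

Kernels with a smaller reconstruction cost receive a larger weight, which is how the weight update favors kernels that fit the current similarity matrix.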


Data set  # instances  # features  # classes
YALE      165          1024        15
JAFFE     213          676         10
ORL       400          1024        40
AR        840          768         120
BA        1404         320         36
TR11      414          6429        9
TR41      878          7454        10
TR45      690          8261        10
Table 1: Description of the data sets
Data   KKM     KKM-a   SC      SC-a    RKKM    RKKM-a  SCSK    SCSK-a  MKKM    AASC    RMKKM   SCMK
YALE   47.12   38.97   49.42   40.52   48.09   39.71   55.85   45.35   45.70   40.64   52.18   56.97
JAFFE  74.39   67.09   74.88   54.03   75.61   67.98   99.83   86.64   74.55   30.35   87.07   100.00
ORL    53.53   45.93   57.96   46.65   54.96   46.88   62.35   50.50   47.51   27.20   55.60   65.25
AR     33.02   30.89   28.83   22.22   33.43   31.20   56.79   41.35   28.61   33.23   34.37   62.38
BA     41.20   33.66   31.07   26.25   42.17   34.35   47.72   39.50   40.52   27.07   43.42   47.34
TR11   51.91   44.65   50.98   43.32   53.03   45.04   71.26   54.79   50.13   47.15   57.71   73.43
TR41   55.64   46.34   63.52   44.80   56.76   46.80   67.43   53.13   56.10   45.90   62.65   67.31
TR45   58.79   45.58   57.39   45.96   58.13   45.69   74.02   53.38   58.46   52.64   64.00   74.35
(a) Accuracy(%)

Data   KKM     KKM-a   SC      SC-a    RKKM    RKKM-a  SCSK    SCSK-a  MKKM    AASC    RMKKM   SCMK
YALE   51.34   42.07   52.92   44.79   52.29   42.87   56.50   45.07   50.06   46.83   55.58   56.52
JAFFE  80.13   71.48   82.08   59.35   83.47   74.01   99.35   84.67   79.79   27.22   89.37   100.00
ORL    73.43   63.36   75.16   66.74   74.23   63.91   78.96   63.55   68.86   43.77   74.83   80.04
AR     65.21   60.64   58.37   56.05   65.44   60.81   76.02   59.70   59.17   65.06   65.49   81.51
BA     57.25   46.49   50.76   40.09   57.82   46.91   63.04   52.17   56.88   42.34   58.47   62.94
TR11   48.88   33.22   43.11   31.39   49.69   33.48   58.60   37.58   44.56   39.39   56.08   60.15
TR41   59.88   40.37   61.33   36.60   60.77   40.86   65.50   43.18   57.75   43.05   63.47   65.11
TR45   57.87   38.69   48.03   33.22   57.86   38.96   74.24   44.36   56.17   41.94   62.73   74.97
(b) NMI(%)

Data   KKM     KKM-a   SC      SC-a    RKKM    RKKM-a  SCSK    SCSK-a  MKKM    AASC    RMKKM   SCMK
YALE   49.15   41.12   51.61   43.06   49.79   41.74   57.27   55.79   47.52   42.33   53.64   60.00
JAFFE  77.32   70.13   76.83   56.56   79.58   71.82   99.85   96.53   76.83   33.08   88.90   100.00
ORL    58.03   50.42   61.45   51.20   59.60   51.46   74.00   70.37   52.85   31.56   60.23   77.00
AR     35.52   33.64   33.24   25.99   35.87   33.88   63.45   62.37   30.46   34.98   36.78   82.62
BA     44.20   36.06   34.50   29.07   45.28   36.86   52.36   49.79   43.47   30.29   46.27   52.12
TR11   67.57   56.32   58.79   50.23   67.93   56.40   82.85   80.76   65.48   54.67   72.93   87.44
TR41   74.46   60.00   73.68   56.45   74.99   60.21   73.23   71.21   72.83   62.05   77.57   73.69
TR45   68.49   53.64   61.25   50.02   68.18   53.75   78.26   77.76   69.14   57.49   75.20   78.26
(c) Purity(%)
Table 2: Clustering results measured on benchmark data sets. '-a' denotes the average performance on those 12 kernels. Both the best results for single kernel and multiple kernel methods are highlighted in boldface.
Figure 1: The effect of parameters $\alpha$ and $\beta$ on the YALE and JAFFE data sets. Panels: (a) Acc, (b) NMI, (c) Purity.

In this section, we demonstrate the effectiveness of the proposed method on real world benchmark data sets.

Data Sets

There are altogether eight benchmark data sets used in our experiments. Table 1 summarizes the statistics of these data sets. Among them, five are image data sets, and the other three are text corpora (han/data/tmdata.tar.gz). The five image data sets consist of four commonly used face databases, ORL, YALE, AR (aleix/ARdatabase.html) [Martinez and Benavente2007], and JAFFE, and a binary alpha digits data set, BA (roweis/data.html).

Experiment Setup

To assess the effectiveness of multiple kernel learning, we design 12 kernels: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t d_{\max}^2))$, where $d_{\max}$ is the maximal distance between samples and $t$ varies over a set of seven values; a linear kernel $K(x, y) = x^T y$; and four polynomial kernels $K(x, y) = (a + x^T y)^b$ over four settings of $a$ and $b$. Furthermore, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest pair-wise squared distance.

For single kernel methods, we run kernel k-means (KKM) [Schölkopf, Smola, and Müller1998], spectral clustering (SC) [Ng et al.2002], robust kernel k-means (RKKM) [Du et al.2015], and our proposed SCSK on each kernel separately. The methods in comparison are downloaded from their authors’ websites. We report both the best and the average results over all these kernels.

For multiple kernel methods, we implement the following algorithms on a combination of the above kernels.

MKKM. MKKM [Huang, Chuang, and Chen2012b] extends k-means in a multiple-kernel setting. However, it uses a different way to learn the kernel weight.

AASC. AASC [Huang, Chuang, and Chen2012a] extends spectral clustering to the situation where multiple affinities exist.

RMKKM. RMKKM [Du et al.2015] adopts the $\ell_{2,1}$-norm to measure the loss of k-means.

SCMK. Our proposed method for joint similarity learning and clustering with multiple kernels.

For spectral clustering like SC and AASC, we run k-means on spectral embedding to obtain the clustering results. To reduce the influence of initialization, we follow the strategy suggested in [Yang et al.2010, Du et al.2015], and we repeat clustering 20 times and present the results with the best objective values. We set the number of clusters to the true number of classes for all clustering algorithms.

To quantitatively evaluate the clustering performance, we adopt the three widely used metrics, accuracy (Acc), normalized mutual information (NMI) [Cai et al.2009], and Purity.
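For reference, two of these metrics can be sketched with NumPy under their commonly used definitions: Purity assigns each cluster to its majority class, and Acc uses the best one-to-one matching between clusters and classes. The brute-force search over permutations below is an assumption made to keep the example dependency-free and is only practical for a handful of clusters; larger problems would normally use the Hungarian algorithm. The label vectors and function names are made up for illustration.

```python
import numpy as np
from itertools import permutations

def contingency(y_true, y_pred):
    """Count matrix M[cluster, class] of co-occurrences."""
    classes, t = np.unique(y_true, return_inverse=True)
    clusters, p = np.unique(y_pred, return_inverse=True)
    M = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(M, (p, t), 1)
    return M

def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    M = contingency(y_true, y_pred)
    return M.max(axis=1).sum() / M.sum()

def accuracy(y_true, y_pred):
    """Best one-to-one cluster-to-class matching, found by brute force."""
    M = contingency(y_true, y_pred)
    k = max(M.shape)
    S = np.zeros((k, k), dtype=int)              # pad to a square matrix
    S[:M.shape[0], :M.shape[1]] = M
    best = max(sum(S[r, perm[r]] for r in range(k))
               for perm in permutations(range(k)))
    return best / M.sum()

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2])
acc = accuracy(y_true, y_pred)
pur = purity(y_true, y_pred)
print(acc, pur)   # both equal 8/9 for this labeling
```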

Clustering Result

Table 2 shows the clustering results in terms of accuracy, NMI and Purity on all the data sets. It can be seen that the proposed SCSK and SCMK produce promising results. In particular, our method substantially improves the performance on the JAFFE, AR, BA, TR11, and TR45 data sets. The big difference between the best and average results confirms that the choice of kernel has a huge influence on the performance of single kernel methods. This difference motivates the development of multiple kernel learning methods. Besides, multiple kernel clustering approaches usually improve the results over single kernel clustering methods.

Parameter Selection

There are two parameters, $\alpha$ and $\beta$, in our models. We let each of them vary over a range of values. Figure 1 shows how the clustering results of SCMK in terms of Acc, NMI, and Purity vary with $\alpha$ and $\beta$ on the JAFFE and YALE data sets. We can observe that the performance of SCMK is very stable with respect to a large range of $\alpha$ values and it is more sensitive to the value of $\beta$.


In this paper, we first propose a clustering method to simultaneously perform similarity learning and the cluster indicator matrix construction. In our method, the similarity learning and the cluster indicator learning are integrated within one framework; the method can be easily extended to kernel spaces, so as to capture nonlinear structure information. The connections of the proposed method to kernel k-means, k-means, and spectral clustering are also established. To avoid extensive search of the best kernel, we further incorporate multiple kernel learning into our model. Similarity learning, cluster indicator construction, and kernel weight learning can be boosted by using the results of the other two. Extensive experiments have been conducted on real-world benchmark data sets to demonstrate the superior performance of our method.


This work is supported by US National Science Foundation Grants IIS 1218712. Q. Cheng is the corresponding author.


  • [Cai et al.2009] Cai, D.; He, X.; Wang, X.; Bao, H.; and Han, J. 2009. Locality preserving nonnegative matrix factorization. In IJCAI, volume 9, 1010–1015.
  • [Cai et al.2013] Cai, X.; Nie, F.; Cai, W.; and Huang, H. 2013. Heterogeneous image features integration via multi-modal semi-supervised learning model. In Proceedings of the IEEE International Conference on Computer Vision, 1737–1744.
  • [Du et al.2015] Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y.-D. 2015. Robust multiple kernel k-means using ℓ2,1-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, 3476–3482. AAAI Press.
  • [Elhamifar and Vidal2009] Elhamifar, E., and Vidal, R. 2009. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2790–2797.
  • [Fan1949] Fan, K. 1949. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America 35(11):652.
  • [Huang, Chuang, and Chen2012a] Huang, H.-C.; Chuang, Y.-Y.; and Chen, C.-S. 2012a. Affinity aggregation for spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 773–780. IEEE.
  • [Huang, Chuang, and Chen2012b] Huang, H.-C.; Chuang, Y.-Y.; and Chen, C.-S. 2012b. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems 20(1):120–134.
  • [Huang, Nie, and Huang2013] Huang, J.; Nie, F.; and Huang, H. 2013. Spectral rotation versus k-means in spectral clustering. In AAAI.
  • [Huang, Nie, and Huang2015] Huang, J.; Nie, F.; and Huang, H. 2015. A new simplex sparse learning model to measure data similarity for clustering. In Proceedings of the 24th International Conference on Artificial Intelligence, 3569–3575. AAAI Press.
  • [Johnson1967] Johnson, S. C. 1967. Hierarchical clustering schemes. Psychometrika 32(3):241–254.
  • [Kang and Cheng2016] Kang, Z., and Cheng, Q. 2016. Top-n recommendation with novel rank approximation. In 2016 SIAM Int. Conf. on Data Mining (SDM 2016), 126–134.
  • [Kang, Peng, and Cheng2015a] Kang, Z.; Peng, C.; and Cheng, Q. 2015a. Robust subspace clustering via smoothed rank approximation. IEEE Signal Processing Letters 22(11):2088–2092.
  • [Kang, Peng, and Cheng2015b] Kang, Z.; Peng, C.; and Cheng, Q. 2015b. Robust subspace clustering via tighter rank approximation. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 393–401. ACM.
  • [Luo et al.2011] Luo, D.; Nie, F.; Ding, C.; and Huang, H. 2011. Multi-subspace representation and discovery. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 405–420. Springer.
  • [MacQueen1967] MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, 281–297. Oakland, CA, USA.
  • [Martinez and Benavente2007] Martinez, A., and Benavente, R. 2007. The ar face database, 1998. Computer Vision Center, Technical Report 3.
  • [Mohar et al.1991] Mohar, B.; Alavi, Y.; Chartrand, G.; and Oellermann, O. 1991. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications 2(871-898):12.
  • [Ng et al.2002] Ng, A. Y.; Jordan, M. I.; Weiss, Y.; et al. 2002. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 2:849–856.
  • [Nie et al.2016] Nie, F.; Wang, X.; Jordan, M. I.; and Huang, H. 2016. The constrained laplacian rank algorithm for graph-based clustering. In Thirtieth AAAI Conference on Artificial Intelligence. Citeseer.
  • [Nie, Wang, and Huang2014] Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 977–986. ACM.
  • [Peng et al.2015] Peng, C.; Kang, Z.; Li, H.; and Cheng, Q. 2015. Subspace clustering using log-determinant rank approximation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 925–934. ACM.
  • [Peng et al.2016] Peng, C.; Kang, Z.; Hu, Y.; Cheng, J.; and Cheng, Q. 2016. Nonnegative matrix factorization with integrated graph and feature learning. ACM Transactions on Intelligent Systems and Technology.
  • [Roweis and Saul2000] Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
  • [Schölkopf, Smola, and Müller1998] Schölkopf, B.; Smola, A.; and Müller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10(5):1299–1319.
  • [Wang et al.2015] Wang, X.; Liu, Y.; Nie, F.; and Huang, H. 2015. Discriminative unsupervised dimensionality reduction. In Proceedings of the 24th International Conference on Artificial Intelligence, 3925–3931. AAAI Press.
  • [Yang et al.2010] Yang, Y.; Xu, D.; Nie, F.; Yan, S.; and Zhuang, Y. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19(10):2761–2773.
  • [Yu et al.2012] Yu, S.; Tranchevent, L.; Liu, X.; Glanzel, W.; Suykens, J. A.; De Moor, B.; and Moreau, Y. 2012. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(5):1031–1039.
  • [Zeng and Cheung2011] Zeng, H., and Cheung, Y.-m. 2011. Feature selection and kernel learning for local learning-based clustering. IEEE transactions on pattern analysis and machine intelligence 33(8):1532–1547.
  • [Zhang, Nie, and Xiang2010] Zhang, C.; Nie, F.; and Xiang, S. 2010. A general kernelization framework for learning algorithms based on kernel pca. Neurocomputing 73(4):959–967.