Similarity Learning via Kernel Preserving Embedding

03/11/2019 ∙ by Zhao Kang, et al. ∙ 8

Data similarity is a key concept in many data-driven applications, and many algorithms are sensitive to the similarity measure. To tackle this fundamental problem, automatic learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semi-supervised learning. However, self-expression only tries to reconstruct the original data, so some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework that minimizes the reconstruction error of kernel matrices, rather than the reconstruction error of the original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel preserving approach opens up a large number of possibilities to embed high-dimensional data into a low-dimensional space.




1 Introduction

Nowadays, high-dimensional data can be collected everywhere, either by low-cost sensors or from the internet [Chen et al.2012]. Extracting useful information from massive high-dimensional data is critical in different areas such as text, images, and videos. Data similarity is especially important since it is the input for a number of data analysis tasks, such as spectral clustering [Ng et al.2002, Chen et al.2018], nearest neighbor classification [Weinberger, Blitzer, and Saul2005], image segmentation [Li et al.2016], person re-identification [Hirzer et al.2012], image retrieval [Hoi, Liu, and Chang2008], dimension reduction [Passalis and Tefas2017], and graph-based semi-supervised learning [Kang et al.2018a]. Therefore, the similarity measure is crucial to the performance of many techniques and is a fundamental problem in the machine learning, pattern recognition, and data mining communities [Gao et al.2017, Towne, Rosé, and Herbsleb2016]. A variety of similarity metrics, e.g., cosine, Jaccard coefficient, Euclidean distance, and Gaussian function, are often used in practice for convenience. However, they are often data-dependent and sensitive to noise [Huang, Nie, and Huang2015]. Consequently, different metrics lead to a big difference in the final results. In addition, several other similarity measure strategies are popular in dimension reduction techniques. For example, in the widely used locally linear embedding (LLE) [Roweis and Saul2000], isometric feature mapping (ISOMAP) [Tenenbaum, De Silva, and Langford2000], and locality preserving projection (LPP) [Niyogi2004] methods, one has to construct an adjacency graph of neighbors. Then, $k$-nearest-neighborhood (knn) and $\epsilon$-nearest-neighborhood graph construction methods are often utilized. These approaches also have some inherent drawbacks, including 1) how to determine the neighbor number $k$ or radius $\epsilon$; 2) how to choose an appropriate similarity metric to define the neighborhood; 3) how to counteract the adverse effect of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all these factors heavily influence the subsequent tasks [Kang et al.2018b].

Recently, automatic learning of similarity information from data has drawn significant attention. In general, existing methods can be classified into two categories. The first one is the adaptive neighbors approach. It learns similarity information by assigning to each data point a probability of being the neighbor of another data point [Nie, Wang, and Huang2014]. It has been shown to be an effective way to capture the local manifold structure.

The other one is the self-expression approach. The basic idea is to represent every data point by a linear combination of other data points. In contrast, LLE reconstructs the original data by expressing each data point as a linear combination of its nearest neighbors only. By minimizing this reconstruction error, we can obtain a coefficient matrix, which is also called the similarity matrix. It has been widely applied in various representation learning tasks, including sparse subspace clustering [Elhamifar and Vidal2013, Peng et al.2016], low-rank representation [Liu et al.2013], multi-view learning [Tao et al.2017], semi-supervised learning [Zhuang et al.2017], and nonnegative matrix factorization (NMF) [Zhang et al.2017].

However, this approach just tries to reconstruct the original data and has no explicit mechanism to preserve manifold structure information about the data. In many applications, the data can display structures beyond simply being low-rank or sparse. It is well-accepted that it is essential to take into account structure information when we perform high-dimensional data analysis. For instance, LLE preserves the local structure information.

In view of this issue with the current approaches, we propose to learn the similarity information through reconstructing the original data kernel matrix, which is supposed to preserve overall relations. By doing so, we expect to obtain more accurate and complete data similarity. Considering clustering as a specific application of our proposed similarity learning method, we demonstrate that our framework provides impressive performance on several benchmark data sets. In summary, the main contributions of this paper are threefold:

  • Compared to other approaches, the use of kernel-based distances allows us to preserve the sets of overall relations rather than individual pairwise similarities.

  • Similarity preserving provides a fundamental building block to embed high-dimensional data into low-dimensional latent space. It is general enough to be applied to a variety of learning problems.

  • We evaluate the proposed approach in the clustering task. It shows that our algorithm enjoys superior performance compared to many state-of-the-art methods.

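Notations. Given a data set $\{x_1, \dots, x_n\}$, we denote $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ instances. Then the $(i,j)$-th element of matrix $X$ is denoted by $x_{ij}$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\| = \sqrt{x^T x}$, where $T$ denotes transpose. The $\ell_1$-norm of $X$ is defined as $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is represented as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The nuclear norm of $X$ is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ is the identity matrix with a proper size. $\mathbf{1}$ represents a column vector whose every element is one. $Z \ge 0$ means all the elements of $Z$ are nonnegative. Inner product $\langle x_i, x_j \rangle = x_i^T x_j$.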

2 Related Work

In this section, we provide a brief review of existing automatic similarity learning techniques.

2.1 Adaptive Neighbors Approach

In a similar spirit to LPP, for each data point $x_i$, all the data points $\{x_j\}_{j=1}^{n}$ can be regarded as the neighborhood of $x_i$ with probability $s_{ij}$. To some extent, $s_{ij}$ represents the similarity between $x_i$ and $x_j$ [Nie, Wang, and Huang2014]. The smaller the distance $\|x_i - x_j\|^2$ is, the greater the probability $s_{ij}$ is. Rather than prespecifying $S$ with the deterministic neighborhood relation as LPP does, one can adaptively learn $S$ from the data set by solving an optimization problem:

$$\min_{s_i} \sum_{j=1}^{n} \left( \|x_i - x_j\|^2 s_{ij} + \gamma s_{ij}^2 \right) \quad \text{s.t.} \quad s_i^T \mathbf{1} = 1, \; 0 \le s_{ij} \le 1, \tag{1}$$

where $\gamma$ is the regularization parameter and $s_i$ collects the probabilities $s_{ij}$ of $x_i$. Recently, a variety of algorithms have been developed by using Eq. (1) to learn a similarity matrix. Some applications are clustering [Nie, Wang, and Huang2014], NMF [Huang et al.2018], and feature selection [Du and Shen2015]. This approach can effectively capture the local structure information.
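As an illustration of how a row of the similarity matrix can be obtained under the adaptive neighbors model, the sketch below assumes the per-point objective $\sum_j (d_j s_j + \gamma s_j^2)$ over the probability simplex, where $d_j$ is the squared distance to the $j$-th candidate neighbor. Completing the square shows this is the Euclidean projection of $-d/(2\gamma)$ onto the simplex. The function names `project_simplex` and `adaptive_neighbors` are hypothetical, and the sort-based projection is a standard routine rather than the solver used in the cited work.

```python
def project_simplex(v):
    """Euclidean projection of a list v onto {s : s >= 0, sum(s) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:          # holds for a prefix of k; keep the last one
            theta = t
    return [max(x - theta, 0.0) for x in v]


def adaptive_neighbors(sq_dists, gamma):
    """Minimize sum_j (d_j * s_j + gamma * s_j**2) over the simplex.

    Equivalent to projecting -d / (2 * gamma) onto the simplex.
    """
    return project_simplex([-d / (2.0 * gamma) for d in sq_dists])


# Toy squared distances from one point to three candidate neighbors.
s = adaptive_neighbors([1.0, 2.0, 10.0], gamma=1.0)
print(s)  # [0.75, 0.25, 0.0]
```

Closer points receive larger probabilities and the distant third point is dropped entirely; a larger `gamma` spreads the mass over more neighbors.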

2.2 Self-expression Approach

The so-called self-expression is to approximate each data point as a linear combination of other data points, i.e., $x_i \approx \sum_j z_{ji} x_j$. The rationale here is that if $x_i$ and $x_j$ are similar, the weight $z_{ji}$ should be big. Therefore, $Z$ also behaves like the similarity matrix. This shares a similar spirit with LLE, except that we do not predetermine the neighborhood. Its corresponding learning problem is:

$$\min_Z \frac{1}{2} \|X - XZ\|_F^2 + \alpha \rho(Z), \tag{2}$$

where $\rho(Z)$ is a regularizer of $Z$ and $\alpha > 0$ balances the two terms. Two commonly used assumptions about $Z$ are low-rank and sparse. Hence, in many domains, we also call $Z$ the low-dimensional representation of $X$. Through this procedure, the individual pairwise similarity information hidden in the data is explored [Nie, Wang, and Huang2014] and the most informative “neighbors” for each data point are automatically chosen.

Moreover, this learned $Z$ can not only reveal the low-dimensional structure of data, but also be robust to data scale [Huang, Nie, and Huang2015]. Therefore, this approach has drawn significant attention and achieved impressive performance in a number of applications, including face recognition [Zhang, Yang, and Feng2011], subspace clustering [Liu et al.2013, Elhamifar and Vidal2013], and semi-supervised learning [Zhuang et al.2017]. In many real-world applications, data often present complex structures. Nevertheless, the first term in Eq. (2) simply minimizes the reconstruction error. Some important manifold structure information, such as overall relations, could be lost during this process. Preserving relation information has been shown to be important for feature selection [Zhao et al.2013]. In [Zhao et al.2013], a new feature vector $f$ is obtained by maximizing $f^T \hat{K} f$, where $\hat{K}$ is the refined similarity matrix derived from the original kernel matrix $K$ with element $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. In this paper, we propose a novel model to preserve the overall relations of the original data and simultaneously learn the similarity matrix.
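To make the self-expression idea concrete, here is a minimal NumPy sketch. It deliberately swaps the low-rank or sparse regularizer for a squared Frobenius penalty, because that variant has a one-line closed form; the function name `ridge_self_expression` and the parameter `alpha` are assumptions made for illustration only.

```python
import numpy as np


def ridge_self_expression(X, alpha=1.0):
    """Minimize 0.5*||X - XZ||_F^2 + 0.5*alpha*||Z||_F^2 over Z.

    Columns of X are samples. Setting the gradient
    X^T (XZ - X) + alpha * Z to zero gives (X^T X + alpha*I) Z = X^T X.
    """
    G = X.T @ X
    n = G.shape[0]
    return np.linalg.solve(G + alpha * np.eye(n), G)


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))   # 5 features, 20 samples (toy data)
Z = ridge_self_expression(X, alpha=0.5)

# The stationarity condition should hold up to round-off.
gradient = X.T @ (X @ Z - X) + 0.5 * Z
print(np.abs(gradient).max())
```

The resulting coefficient matrix is dense; the low-rank and sparse regularizers discussed in the text are what encourage a more structured similarity matrix.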

3 Proposed Methodology

Since our goal is to obtain similarity information, it is necessary to retain the overall relations among the data samples when we build a new representation. However, Eq. (2) just tries to reconstruct the original data and does not take overall relations information into account. Our objective is to find a new representation which preserves overall relations as much as possible.

Given a data matrix $X$, one of the most commonly used relation measures is the inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data $X$ and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space, i.e., $X$ is mapped by $\phi$ [Xu et al.2009]. We have

$$\min_Z \left\| \phi(X)^T \phi(X) - (\phi(X) Z)^T (\phi(X) Z) \right\|_F^2. \tag{3}$$

With the kernel matrix $K = \phi(X)^T \phi(X)$, (3) can be simplified as

$$\min_Z \left\| K - Z^T K Z \right\|_F^2. \tag{4}$$

With certain assumptions about the structure of $Z$, our proposed Similarity Learning via Kernel preserving Embedding (SLKE) framework can be formulated as

$$\min_Z \left\| K - Z^T K Z \right\|_F^2 + \gamma \rho(Z), \tag{5}$$

where $\gamma > 0$ is a tradeoff parameter and $\rho$ is a regularizer on $Z$. If we use the nuclear norm $\|\cdot\|_*$ to replace $\rho(\cdot)$, we have a low-rank representation. If the $\ell_1$-norm is adopted, we obtain a sparse representation. It is worth pointing out that Eq. (5) enjoys several nice properties:
1) The use of kernel-based distance preserves the sets of overall relations, which will benefit the subsequent tasks;
2) This learned low-dimensional representation or similarity matrix is general enough to be utilized to solve a variety of different tasks, where similarity information is needed;
3) The learned representation is particularly suitable to problems that are sensitive to data similarity, such as clustering [Kang et al.2018c], classification [Wright et al.2009], recommender systems [Kang, Peng, and Cheng2017a];
4) Its input is the kernel matrix. This is also desirable, as not all types of data can be represented in numerical feature vectors form [Xu et al.2010]. For instance, we need to group proteins in bioinformatics based on their structures and to divide users in social media based on their friendship relations.
In the following section, we will show a simple strategy to solve problem (5).
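The kernel-preserving data term can be checked numerically. The sketch below assumes the loss has the form $\|K - Z^T K Z\|_F^2$ and uses a linear kernel, for which the feature map is the identity, so the kernel form can be compared against the explicit inner products of the raw data $X$ and the reconstructed data $XZ$. The function name `kernel_preserving_loss` is hypothetical.

```python
import numpy as np


def kernel_preserving_loss(K, Z):
    """Squared Frobenius gap between K and the kernel of the reconstruction, Z^T K Z."""
    R = K - Z.T @ K @ Z
    return float(np.sum(R * R))


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 features, 8 samples as columns
Z = 0.1 * rng.standard_normal((8, 8))     # an arbitrary coefficient matrix
K = X.T @ X                               # linear kernel

# Explicit version: inner products of X versus inner products of XZ.
XZ = X @ Z
explicit = float(np.linalg.norm(X.T @ X - XZ.T @ XZ, "fro") ** 2)
implicit = kernel_preserving_loss(K, Z)
print(explicit, implicit)                 # the two values agree up to round-off
```

Because only `K` enters the loss, the same function applies to any kernel matrix, including cases where no explicit feature vectors are available.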

4 Optimization

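It is easy to see that Eq. (5) is a fourth-order function of $Z$. Directly solving it is not straightforward. To circumvent this problem, we first convert it to the following equivalent problem by introducing two auxiliary variables $J$ and $W$:

$$\min_{Z, J, W} \left\| K - J^T K W \right\|_F^2 + \gamma \rho(Z) \quad \text{s.t.} \quad J = Z, \; W = Z. \tag{6}$$

Now we resort to the alternating direction method of multipliers (ADMM) to solve (6). The corresponding augmented Lagrangian function is:

$$\mathcal{L}(J, W, Z, Y_1, Y_2) = \left\| K - J^T K W \right\|_F^2 + \gamma \rho(Z) + \frac{\mu}{2} \left( \left\| J - Z + \frac{Y_1}{\mu} \right\|_F^2 + \left\| W - Z + \frac{Y_2}{\mu} \right\|_F^2 \right), \tag{7}$$

where $\mu > 0$ is a penalty parameter and $Y_1$, $Y_2$ are the Lagrangian multipliers. The variables $J$, $W$, and $Z$ can be updated alternatingly, one at each step, while keeping the other two fixed.

To solve for $J$, we observe that the objective function (7) is a strongly convex quadratic function in $J$, which can be solved by setting its first derivative to zero. We have:

$$J = \left( 2 K W W^T K^T + \mu I \right)^{-1} \left( 2 K W K^T + \mu Z - Y_1 \right), \tag{8}$$

where $I$ is the identity matrix. Similarly, for $W$ we have:

$$W = \left( 2 K^T J J^T K + \mu I \right)^{-1} \left( 2 K^T J K + \mu Z - Y_2 \right). \tag{9}$$

For $Z$, we have the following subproblem:

$$\min_Z \gamma \rho(Z) + \mu \left\| Z - \frac{J + W + \frac{Y_1 + Y_2}{\mu}}{2} \right\|_F^2. \tag{10}$$

Depending on different regularization strategies, we have different closed-form solutions for $Z$. Define $D = \frac{1}{2} \left( J + W + \frac{Y_1 + Y_2}{\mu} \right)$; we can write its singular value decomposition (SVD) as $D = U \mathrm{diag}(\sigma) V^T$. Then, for low-rank representation, i.e., $\rho(Z) = \|Z\|_*$, we have

$$Z = U \mathrm{diag}\left( \max\left\{ \sigma - \frac{\gamma}{2\mu}, 0 \right\} \right) V^T. \tag{11}$$

For sparse representation, i.e., $\rho(Z) = \|Z\|_1$, we can update $Z$ element by element as

$$Z_{ij} = \max\left\{ |D_{ij}| - \frac{\gamma}{2\mu}, 0 \right\} \cdot \mathrm{sign}(D_{ij}). \tag{12}$$
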
For clarity, the complete procedures to solve the problem (5) are outlined in Algorithm 1.

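Input: Kernel matrix $K$, parameters $\gamma > 0$, $\mu > 0$.
Initialize: Random matrices $W$ and $Z$, $Y_1 = 0$ and $Y_2 = 0$.
REPEAT
1:  Calculate $J$ by (8).
2:  Update $W$ according to (9).
3:  Calculate $Z$ using (11) or (12).
4:  Update Lagrange multipliers $Y_1$ and $Y_2$ as $Y_1 = Y_1 + \mu (J - Z)$, $Y_2 = Y_2 + \mu (W - Z)$.
UNTIL stopping criterion is met.
Algorithm 1 The algorithm of SLKE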
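The alternating scheme can be sketched in NumPy as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes the split objective $\|K - J^T K W\|_F^2 + \gamma \rho(Z)$ with constraints $J = Z$, $W = Z$, a fixed penalty `mu`, and a simple stopping rule on the constraint residuals. All names (`slke_admm`, `svt`, `soft_threshold`, `tol`, `n_iter`) are hypothetical.

```python
import numpy as np


def svt(D, tau):
    """Singular value thresholding: prox of tau * nuclear norm at D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def soft_threshold(D, tau):
    """Element-wise shrinkage: prox of tau * l1 norm at D."""
    return np.sign(D) * np.maximum(np.abs(D) - tau, 0.0)


def slke_admm(K, gamma=0.1, mu=1.0, low_rank=False, n_iter=100, tol=1e-6, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.random((n, n)) / n              # random initialization
    Z = rng.random((n, n)) / n
    Y1 = np.zeros((n, n))
    Y2 = np.zeros((n, n))
    I = np.eye(n)
    tau = gamma / (2.0 * mu)                # threshold of the Z-subproblem
    for _ in range(n_iter):
        # J-step: (2 A A^T + mu I) J = 2 A K^T + mu Z - Y1, with A = K W.
        A = K @ W
        J = np.linalg.solve(2.0 * A @ A.T + mu * I, 2.0 * A @ K.T + mu * Z - Y1)
        # W-step: (2 B^T B + mu I) W = 2 B^T K + mu Z - Y2, with B = J^T K.
        B = J.T @ K
        W = np.linalg.solve(2.0 * B.T @ B + mu * I, 2.0 * B.T @ K + mu * Z - Y2)
        # Z-step: proximal map applied to D.
        D = 0.5 * (J + W + (Y1 + Y2) / mu)
        Z_new = svt(D, tau) if low_rank else soft_threshold(D, tau)
        # Multiplier updates.
        Y1 = Y1 + mu * (J - Z_new)
        Y2 = Y2 + mu * (W - Z_new)
        residual = max(np.abs(J - Z_new).max(), np.abs(W - Z_new).max(),
                       np.abs(Z_new - Z).max())
        Z = Z_new
        if residual < tol:
            break
    return Z


# Toy usage with a Gaussian kernel on random data (an assumed setup).
rng = np.random.default_rng(0)
X = rng.random((12, 4))                     # 12 samples as rows, 4 features
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sq.max())
Z_sparse = slke_admm(K, gamma=0.1, low_rank=False)
Z_lowrank = slke_admm(K, gamma=0.1, low_rank=True)
```

Using `np.linalg.solve` instead of forming an explicit inverse is a numerical design choice; both system matrices are positive definite because of the `mu * I` term. Since the underlying problem is nonconvex, the sketch makes no claim about convergence to a global minimizer.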

4.1 Complexity Analysis

With our optimization strategy, the complexity for $J$ is $O(n^3)$. Updating $W$ has the same complexity as $J$. Both $J$ and $W$ involve a matrix inverse. Fortunately, we can avoid it by resorting to approximation techniques when we face large-scale data sets. Depending on the choice of regularizer, we have different complexity for $Z$. For low-rank representation, it requires an SVD in every iteration, whose complexity is $O(n^3)$. Since we seek a low-rank matrix, we only need a few principal singular values. Packages like PROPACK can compute a rank-$k$ SVD with complexity $O(n^2 k)$ [Larsen2004]. To obtain a sparse solution of $Z$, we need $O(n^2)$ complexity. The updating of $Y_1$ and $Y_2$ costs $O(n^2)$.

5 Experiments

To assess the effectiveness of our proposed method, we apply the learned similarity matrix to do clustering.

5.1 Data Sets

Data set # instances # features # classes
YALE 165 1024 15
JAFFE 213 676 10
ORL 400 1024 40
COIL20 1440 1024 20
BA 1404 320 36
TR11 414 6429 9
TR41 878 7454 10
TR45 690 8261 10
TDT2 9394 36771 30
Table 1: Description of the data sets

We conduct our experiments with nine benchmark data sets, which are widely used in clustering experiments. We show the statistics of these data sets in Table 1. In summary, the number of data samples varies from 165 to 9,394 and feature number ranges from 320 to 36,771. The first five data sets are images, while the last four are text data.

Specifically, the five image data sets contain three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. For example, COIL20 consists of 20 objects, and each object was photographed from different angles. The BA data set contains images of the digits “0” through “9” and the capital letters “A” through “Z”. YALE, ORL, and JAFFE consist of face images of different persons. The images of each person show different facial expressions or configurations due to times, illumination conditions, and glasses/no glasses.

5.2 Data Preparation

Since the input for our proposed method is a kernel matrix, we design 12 kernels in total to fully examine its performance. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t d_{max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ denotes the maximal distance between data points; a linear kernel $K(x, y) = x^T y$; four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest element in its corresponding kernel matrix. These kernel types are commonly used in the literature, so we can well investigate the performance of our method.
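A pool of 12 kernels of this kind could be built roughly as follows. The parameter grids (`t_values`, `a_values`, `b_values`) are assumptions chosen to match the stated counts of seven Gaussian and four polynomial kernels, and the helper name `build_kernels` is hypothetical. Samples are stored as rows here for convenience.

```python
import numpy as np


def build_kernels(X, t_values=(0.01, 0.05, 0.1, 1, 10, 50, 100),
                  a_values=(0, 1), b_values=(2, 4)):
    """Return 7 Gaussian + 1 linear + 4 polynomial kernels, each divided by its max.

    X has shape (n_samples, n_features).
    """
    G = X @ X.T                                   # Gram matrix of inner products
    sq_norms = np.diag(G)
    # Squared pairwise distances; clip tiny negatives caused by round-off.
    D2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * G, 0.0)
    d_max2 = D2.max()                             # squared maximal distance
    kernels = [np.exp(-D2 / (t * d_max2)) for t in t_values]
    kernels.append(G.copy())                      # linear kernel
    kernels += [(a + G) ** b for a in a_values for b in b_values]
    # Dividing by the largest entry maps into [0, 1] only when all entries
    # are nonnegative, e.g. for nonnegative features such as pixel values.
    return [K / K.max() for K in kernels]


rng = np.random.default_rng(0)
X = rng.random((30, 6))                           # nonnegative toy features
kernels = build_kernels(X)
print(len(kernels))  # 12
```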

5.3 Comparison Methods

To fully examine the effectiveness of the proposed framework on clustering, we choose a good set of methods to compare. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.

  • Spectral Clustering (SC) [Ng et al.2002]: SC is a widely used clustering method. It enjoys the advantage of exploring the intrinsic data structures. Its input is the graph Laplacian, which is constructed from the similarity matrix. Here, we directly treat the kernel matrix as the similarity matrix for spectral clustering. For our proposed SLKE method, we employ the learned $Z$ to do spectral clustering. Thus, SC serves as a baseline method.

  • Robust Kernel K-means (RKKM) [Du et al.2015]: Based on the classical k-means clustering algorithm, RKKM has been developed to deal with nonlinear structures, noise, and outliers in the data. RKKM demonstrates superior performance on a number of real-world data sets.

  • Simplex Sparse Representation (SSR) [Huang, Nie, and Huang2015]: SSR method has been proposed recently. It is based on adaptive neighbors idea. Another appealing property of this method is that its model parameter can be calculated by assuming a maximum number of neighbors. Therefore, we don’t need to tune the parameter any more. In addition, it outperforms many other state-of-the-art techniques.

  • Low-Rank Representation (LRR) [Liu et al.2013]: Based on self-expression, subspace clustering with low-rank regularizer achieves great success on a number of applications, such as face clustering, motion segmentation.

  • Sparse Subspace Clustering (SSC) [Elhamifar and Vidal2013]: Similar to LRR, SSC assumes a sparse solution of $Z$. Both LRR and SSC learn the similarity matrix by reconstructing the original data. In this aspect, SC, LRR, and SSC are baseline methods w.r.t. our proposed algorithm.

  • Clustering with Adaptive Neighbor (CAN) [Nie, Wang, and Huang2014]. Based on the idea of adaptive neighbors, i.e., Eq.(1), CAN learns a local graph from raw data for clustering task.

  • Twin Learning for Similarity and Clustering (TLSC) [Kang, Peng, and Cheng2017b]: Recently, TLSC has been proposed and has shown promising results on real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space, but also has an optimal similarity graph guarantee. Besides, it has good theoretical properties, i.e., it is equivalent to kernel k-means and k-means under certain conditions.

  • SLKE: Our proposed similarity learning method with overall relations preserving capability. After obtaining the similarity matrix $Z$, we use spectral clustering to conduct clustering experiments. We test both low-rank and sparse regularizers and denote them as SLKE-R and SLKE-S, respectively.

5.4 Evaluation Metrics

To quantitatively and effectively assess the clustering performance, we utilize the two widely used metrics [Peng et al.2018], accuracy (Acc) and normalized mutual information (NMI).

Acc discovers the one-to-one relationship between clusters and classes. Let $l_i$ and $\hat{l}_i$ be the clustering result and the ground truth cluster label of $x_i$, respectively. Then the Acc is defined as

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta(\hat{l}_i, \mathrm{map}(l_i))}{n},$$

where $n$ is the sample size, the Kronecker delta function $\delta(x, y)$ equals one if and only if $x = y$ and zero otherwise, and map($\cdot$) is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.
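The accuracy metric can be illustrated with a small pure-Python sketch. For brevity it searches over all cluster-to-class assignments by brute force, which is a stand-in for the Kuhn-Munkres algorithm and is only practical for a handful of clusters. The function name `clustering_accuracy` is hypothetical.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples matched under the best one-to-one cluster-to-class map."""
    pair_counts = Counter(zip(cluster_labels, true_labels))
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    # Pad with dummy classes so every cluster can receive a distinct target.
    classes += [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(pair_counts[(c, k)] for c, k in zip(clusters, assigned))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]      # cluster ids are arbitrary; one sample is misplaced
acc = clustering_accuracy(truth, pred)
print(acc)                     # 5 of 6 samples match under the best mapping
```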

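Given two sets of clusters $L$ and $\hat{L}$, NMI is defined as

$$\mathrm{NMI}(L, \hat{L}) = \frac{\sum_{l \in L, \hat{l} \in \hat{L}} p(l, \hat{l}) \log \left( \frac{p(l, \hat{l})}{p(l) p(\hat{l})} \right)}{\max(H(L), H(\hat{L}))},$$

where $p(l)$ and $p(\hat{l})$ represent the marginal probability distribution functions of $L$ and $\hat{L}$, respectively, $p(l, \hat{l})$ is the joint probability function of $L$ and $\hat{L}$, and $H(\cdot)$ is the entropy function. A greater NMI means better clustering performance.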
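A minimal pure-Python sketch of NMI follows. It assumes the mutual information is normalized by the larger of the two entropies; other normalizations exist, so this choice is an assumption, as is the convention of returning 1.0 when both partitions are trivial. The function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Mutual information of two labelings divided by max(H(a), H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    denom = max(entropy_a, entropy_b)
    # Convention (an assumption): two single-cluster partitions count as identical.
    return mutual_info / denom if denom > 0 else 1.0


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
print(same, independent)
```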

5.5 Results

Data SC RKKM SSC LRR SSR CAN TLSC SLKE-S SLKE-R
YALE 49.42(40.52) 48.09(39.71) 38.18 61.21 54.55 58.79 55.85(45.35) 61.82(38.89) 66.24(51.28)
JAFFE 74.88(54.03) 75.61(67.89) 99.53 99.53 87.32 98.12 99.83(86.64) 96.71(70.77) 99.85(90.89)
ORL 58.96(46.65) 54.96(46.88) 36.25 76.50 69.00 61.50 62.35(50.50) 77.00(45.33) 74.75(59.00)
COIL20 67.60(43.65) 61.64(51.89) 73.54 68.40 76.32 84.58 72.71(38.03) 75.42(56.83) 84.03(65.65)
BA 31.07(26.25) 42.17(34.35) 24.22 45.37 23.97 36.82 47.72(39.50) 50.74(36.35) 44.37(35.79)
TR11 50.98(43.32) 53.03(45.04) 32.61 73.67 41.06 38.89 71.26(54.79) 69.32(46.87) 74.64(55.07)
TR41 63.52(44.80) 56.76(46.80) 28.02 70.62 63.78 62.87 65.60(43.18) 71.19(47.91) 74.37(53.51)
TR45 57.39(45.96) 58.13(45.69) 24.35 78.84 71.45 48.41 74.02(53.38) 78.55(50.59) 79.89(58.37)
TDT2 52.63(45.26) 48.35(36.67) 23.45 52.03 20.86 19.74 55.74(44.82) 59.61(25.40) 74.92(33.67)
(a) Accuracy(%)
Data SC RKKM SSC LRR SSR CAN TLSC SLKE-S SLKE-R
YALE 52.92(44.79) 52.29(42.87) 45.56 62.98 57.26 57.67 56.50(45.07) 59.47(40.38) 64.29(52.87)
JAFFE 82.08(59.35) 83.47(74.01) 99.17 99.16 92.93 97.31 99.35(84.67) 94.80(60.83) 99.49(81.56)
ORL 75.16(66.74) 74.23(63.91) 60.24 85.69 84.23 76.59 78.96(63.55) 86.35(58.84) 85.15(75.34)
COIL20 80.98(54.34) 74.63(63.70) 80.69 77.87 86.89 91.55 82.20(73.26) 80.61(65.40) 91.25(73.53)
BA 50.76(40.09) 57.82(46.91) 37.41 57.97 30.29 49.32 63.04(52.17) 63.58(55.06) 56.78(50.11)
TR11 43.11(31.39) 49.69(33.48) 02.14 65.61 27.60 19.17 58.60(37.58) 67.63(30.56) 70.93(45.39)
TR41 61.33(36.60) 60.77(40.86) 01.16 67.50 59.56 51.13 65.50(43.18) 70.89(34.82) 68.50(47.45)
TR45 48.03(33.22) 57.86(38.96) 01.61 77.01 67.82 49.31 74.24(44.36) 72.50(38.04) 78.12(50.37)
TDT2 52.23(27.16) 54.46(42.19) 13.09 64.36 02.44 03.97 58.35(46.37) 58.55(15.43) 68.21(28.94)
(b) NMI(%)
Table 2: Clustering results obtained on benchmark data sets. The average performance of those 12 kernels is put in parenthesis. The best results among those kernels are highlighted in boldface.

We report the extensive experimental results in Table 2. Except for SSC, LRR, CAN, and SSR, we run the other methods on each kernel matrix individually. As a result, we show both the best performance among those 12 kernels and the average results over those 12 kernels for them. Based on this table, we can see that our proposed SLKE achieves the best performance in most cases. To be specific, we have the following observations:
1) Compared to the classical k-means based RKKM and spectral clustering techniques, our proposed method SLKE has a big advantage in terms of accuracy and NMI. With respect to the recently proposed SSR and TLSC methods, SLKE always obtains better results.
2) SLKE-R and SLKE-S often outperform LRR and SSC, respectively. The accuracy increases by 8.92% and 8.76% on average, respectively. That is to say, the kernel-based distance approach indeed performs better than the original data reconstruction technique. This verifies the importance of retaining relation information when we learn a low-dimensional representation, especially for sparse representation.
3) With respect to the adaptive neighbors approach CAN, we also obtain better performance on all data sets except COIL20. For COIL20, our results are quite close to CAN’s. Therefore, compared to various similarity learning techniques, our method is very competitive.
4) Regarding low-rank and sparse representation, it is hard to conclude which one is better. It depends on the specific data.

Furthermore, we run t-SNE [Maaten and Hinton2008] algorithm on the JAFFE data and the reconstructed data from the best result of our SLKE-R. As shown by Figure 1, we can see that our method can well preserve the cluster structure of the data.

Figure 1: JAFFE data set visualized in two dimensions. (a) Original data. (b) Reconstructed data.
Figure 2: The effect of parameter $\gamma$ on the YALE data set. (a) Acc. (b) NMI.
Method Metric SC RKKM SSC LRR SSR CAN TLSC
SLKE-S Acc .0039 .0039 .0117 .2500 .0078 .0391 .0391
NMI .0078 .0039 .0195 .6523 .0391 .0547 .3008
SLKE-R Acc .0039 .0039 .0039 .0977 .0039 .0039 .0117
NMI .0039 .0078 .0039 .0742 .0039 .0078 .0391
Table 3: Wilcoxon Signed Rank Test on all Data sets.

To see the significance of improvements, we further apply the Wilcoxon signed rank test [Peng, Cheng, and Cheng2017] to Table 2. We show the $p$-values in Table 3. We note that the testing results are under 0.05 in most cases when comparing SLKE-S and SLKE-R to other methods. Therefore, SLKE-S and SLKE-R outperform SC, RKKM, SSC, and SSR with statistical significance.
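For readers who want to reproduce this kind of comparison, the sketch below computes an exact two-sided Wilcoxon signed-rank $p$-value by enumerating all sign patterns, which is feasible for nine paired data sets. The function name `wilcoxon_exact` is hypothetical, and the example assumes that the first and last accuracy columns of Table 2 correspond to SC and SLKE-R.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Zero differences are dropped; tied absolute differences get average ranks.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return extreme / 2 ** n


# Best accuracies from the first and last columns of Table 2(a).
first_col = [49.42, 74.88, 58.96, 67.60, 31.07, 50.98, 63.52, 57.39, 52.63]
last_col = [66.24, 99.85, 74.75, 84.03, 44.37, 74.64, 74.37, 79.89, 74.92]
p = wilcoxon_exact(last_col, first_col)
print(p)  # 0.00390625
```

When one method wins on all nine data sets, only two of the 512 sign patterns are as extreme, giving 2/512, which is consistent with the smallest entries (.0039) reported in Table 3.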

5.6 Parameter Analysis

In this subsection, we investigate the influence of our model parameter $\gamma$ on the clustering results. Taking a Gaussian kernel on the YALE and JAFFE data sets as examples, we plot our algorithm’s performance over a range of $\gamma$ values in Figures 2 and 3, respectively. We can see that our proposed methods work well for a wide range of $\gamma$.

Figure 3: The effect of parameter $\gamma$ on the JAFFE data set. (a) Acc. (b) NMI.

6 Conclusion

In this paper, we present a novel similarity learning framework relying on an embedding of kernel-based distance. Our model is flexible enough to obtain either a low-rank or a sparse representation of data. Comprehensive experimental results on real data sets demonstrate the superiority of the proposed method on the clustering task. It has great potential to be applied in a number of applications beyond clustering. The experiments also show that the performance of the proposed method is largely determined by the choice of kernel function. In the future, we plan to address this issue by developing a multiple kernel learning method, which is capable of automatically learning an appropriate kernel from a pool of input kernels.

7 Acknowledgment

This paper was supported in part by grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111), a 985 Project of UESTC (No. A1098531023601041), and two Fundamental Research Funds for the Central Universities of China (Nos. A03017023701012 and ZYGX2017KYQD177).


  • [Chen et al.2012] Chen, X.; Ye, Y.; Xu, X.; and Huang, J. Z. 2012. A feature group weighting method for subspace clustering of high-dimensional data. Pattern Recognition 45(1):434–446.
  • [Chen et al.2018] Chen, X.; Hong, W.; Nie, F.; He, D.; Yang, M.; and Huang, J. Z. 2018. Directly minimizing normalized cut for large scale data. In SIGKDD, 1206–1215.
  • [Du and Shen2015] Du, L., and Shen, Y.-D. 2015. Unsupervised feature selection with adaptive structure learning. In SIGKDD, 209–218. ACM.
  • [Du et al.2015] Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y.-D. 2015. Robust multiple kernel k-means using ℓ2,1-norm. In IJCAI, 3476–3482. AAAI Press.
  • [Elhamifar and Vidal2013] Elhamifar, E., and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence 35(11):2765–2781.
  • [Gao et al.2017] Gao, X.; Hoi, S. C.; Zhang, Y.; Zhou, J.; Wan, J.; Chen, Z.; Li, J.; and Zhu, J. 2017. Sparse online learning of image similarity. ACM Transactions on Intelligent Systems and Technology (TIST) 8(5):64.
  • [Hirzer et al.2012] Hirzer, M.; Roth, P. M.; Köstinger, M.; and Bischof, H. 2012. Relaxed pairwise learned metric for person re-identification. In ECCV, 780–793. Springer.
  • [Hoi, Liu, and Chang2008] Hoi, S. C.; Liu, W.; and Chang, S.-F. 2008. Semi-supervised distance metric learning for collaborative image retrieval. In CVPR, 1–7. IEEE.
  • [Huang et al.2018] Huang, S.; Wang, H.; Li, T.; Li, T.; and Xu, Z. 2018. Robust graph regularized nonnegative matrix factorization for clustering. Data Mining and Knowledge Discovery 32(2):483–503.
  • [Huang, Nie, and Huang2015] Huang, J.; Nie, F.; and Huang, H. 2015. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, 3569–3575.
  • [Kang et al.2018a] Kang, Z.; Lu, X.; Yi, J.; and Xu, Z. 2018a. Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification. In IJCAI, 2312–2318.
  • [Kang et al.2018b] Kang, Z.; Peng, C.; Cheng, Q.; and Xu, Z. 2018b. Unified spectral clustering with optimal graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press.
  • [Kang et al.2018c] Kang, Z.; Wen, L.; Chen, W.; and Xu, Z. 2018c. Low-rank kernel learning for graph-based clustering. Knowledge-Based Systems.
  • [Kang, Peng, and Cheng2017a] Kang, Z.; Peng, C.; and Cheng, Q. 2017a. Kernel-driven similarity learning. Neurocomputing 267:210–219.
  • [Kang, Peng, and Cheng2017b] Kang, Z.; Peng, C.; and Cheng, Q. 2017b. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press.
  • [Larsen2004] Larsen, R. M. 2004. Propack-software for large and sparse svd calculations. Available online. URL http://sun.stanford.edu/rmunk/PROPACK 2008–2009.
  • [Li et al.2016] Li, T.; Cheng, B.; Ni, B.; Liu, G.; and Yan, S. 2016. Multitask low-rank affinity graph for image segmentation and image annotation. ACM Transactions on Intelligent Systems and Technology (TIST) 7(4):65.
  • [Liu et al.2013] Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):171–184.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of machine learning research 9(Nov):2579–2605.
  • [Ng et al.2002] Ng, A. Y.; Jordan, M. I.; Weiss, Y.; et al. 2002. On spectral clustering: Analysis and an algorithm. NIPS 2:849–856.
  • [Nie, Wang, and Huang2014] Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In SIGKDD, 977–986. ACM.
  • [Niyogi2004] Niyogi, X. 2004. Locality preserving projections. In NIPS, volume 16, 153. MIT.
  • [Passalis and Tefas2017] Passalis, N., and Tefas, A. 2017. Dimensionality reduction using similarity-induced embeddings. IEEE transactions on neural networks and learning systems.
  • [Peng et al.2016] Peng, X.; Xiao, S.; Feng, J.; Yau, W.-Y.; and Yi, Z. 2016. Deep subspace clustering with sparsity prior. In IJCAI, 1925–1931.
  • [Peng et al.2018] Peng, C.; Kang, Z.; Cai, S.; and Cheng, Q. 2018. Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction. ACM Transactions on Intelligent Systems and Technology (TIST) 9(5):57.
  • [Peng, Cheng, and Cheng2017] Peng, C.; Cheng, J.; and Cheng, Q. 2017. A supervised learning model for high-dimensional and large-scale data. ACM Transactions on Intelligent Systems and Technology (TIST) 8(2):30.
  • [Roweis and Saul2000] Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
  • [Tao et al.2017] Tao, Z.; Liu, H.; Li, S.; Ding, Z.; and Fu, Y. 2017. From ensemble clustering to multi-view clustering. In IJCAI, 2843–2849.
  • [Tenenbaum, De Silva, and Langford2000] Tenenbaum, J. B.; De Silva, V.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323.
  • [Towne, Rosé, and Herbsleb2016] Towne, W. B.; Rosé, C. P.; and Herbsleb, J. D. 2016. Measuring similarity similarly: Lda and human perception. ACM Transactions on Intelligent Systems and Technology 8(1).
  • [Weinberger, Blitzer, and Saul2005] Weinberger, K. Q.; Blitzer, J.; and Saul, L. K. 2005. Distance metric learning for large margin nearest neighbor classification. In NIPS, 1473–1480.
  • [Wright et al.2009] Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence 31(2):210–227.
  • [Xu et al.2009] Xu, Z.; Jin, R.; King, I.; and Lyu, M. 2009. An extended level method for efficient multiple kernel learning. In NIPS, 1825–1832.
  • [Xu et al.2010] Xu, Z.; Jin, R.; Yang, H.; King, I.; and Lyu, M. R. 2010. Simple and efficient multiple kernel learning by group lasso. In ICML-10, 1175–1182. Citeseer.
  • [Zhang et al.2017] Zhang, L.; Zhang, Q.; Du, B.; You, J.; and Tao, D. 2017. Adaptive manifold regularized matrix factorization for data clustering. In IJCAI, 3399–3405.
  • [Zhang, Yang, and Feng2011] Zhang, L.; Yang, M.; and Feng, X. 2011. Sparse representation or collaborative representation: Which helps face recognition? In ICCV, 471–478. IEEE.
  • [Zhao et al.2013] Zhao, Z.; Wang, L.; Liu, H.; and Ye, J. 2013. On similarity preserving feature selection. IEEE Transactions on Knowledge and Data Engineering 25(3):619–632.
  • [Zhuang et al.2017] Zhuang, L.; Zhou, Z.; Gao, S.; Yin, J.; Lin, Z.; and Ma, Y. 2017. Label information guided graph construction for semi-supervised learning. IEEE Transactions on Image Processing 26(9):4182–4192.