Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatic learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semi-supervised learning. However, it just tries to reconstruct the original data, and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework by minimizing the reconstruction error of kernel matrices, rather than the reconstruction error of original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel preserving opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.
Nowadays, high-dimensional data can be collected everywhere, either by low-cost sensors or from the internet [Chen et al. 2012]. Extracting useful information from massive high-dimensional data is critical in different areas like text, images, videos and more. Data similarity is especially important since it is the input for a number of data analysis tasks, such as spectral clustering [Ng et al. 2002, Chen et al. 2018], nearest neighbor classification [Weinberger, Blitzer, and Saul 2005], image segmentation [Li et al. 2016], person re-identification [Hirzer et al. 2012][Hoi, Liu, and Chang 2008], dimension reduction [Passalis and Tefas 2017], and graph-based semi-supervised learning [Kang et al. 2018a]. Therefore, the similarity measure is crucial to the performance of many techniques and is a fundamental problem in the machine learning, pattern recognition, and data mining communities [Gao et al. 2017, Towne, Rosé, and Herbsleb 2016]. A variety of similarity metrics, e.g., Cosine, Jaccard coefficient, Euclidean distance, and Gaussian function, are often used in practice for convenience. However, they are often data-dependent and sensitive to noise [Huang, Nie, and Huang 2015]. Consequently, different metrics lead to a big difference in the final results. In addition, several other similarity measure strategies are popular in dimension reduction techniques. For example, in the widely used locally linear embedding (LLE) [Roweis and Saul 2000], isometric feature mapping (ISOMAP) [Tenenbaum, De Silva, and Langford 2000], and locality preserving projection (LPP) [Niyogi 2004] methods, one has to construct an adjacency graph of neighbors. Then, $k$-nearest-neighborhood (knn) and $\epsilon$-nearest-neighborhood graph construction methods are often utilized. These approaches also have some inherent drawbacks, including 1) how to determine the neighbor number or radius; 2) how to choose an appropriate similarity metric to define the neighborhood; 3) how to counteract the adverse effect of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all these factors heavily influence the subsequent tasks [Kang et al. 2018b].

Recently, automatic learning of similarity information from data has drawn significant attention. In general, it can be classified into two categories. The first one is the adaptive neighbors approach. It learns similarity information by assigning a probability for each data point as the neighborhood of another data point [Nie, Wang, and Huang 2014]. It has been shown to be an effective way to capture the local manifold structure.

The other one is the self-expression approach. The basic idea is to represent every data point by a linear combination of other data points. In contrast, LLE reconstructs the original data by expressing each data point as a linear combination of its nearest neighbors only. Through minimizing this reconstruction error, we can obtain a coefficient matrix, which is also named the similarity matrix. It has been widely applied in various representation learning tasks, including sparse subspace clustering [Elhamifar and Vidal 2013, Peng et al. 2016], low-rank representation [Liu et al. 2013], multi-view learning [Tao et al. 2017], semi-supervised learning [Zhuang et al. 2017], and nonnegative matrix factorization (NMF) [Zhang et al. 2017].
However, this approach just tries to reconstruct the original data and has no explicit mechanism to preserve manifold structure information about the data. In many applications, the data can display structures beyond simply being low-rank or sparse. It is well accepted that it is essential to take structure information into account when we perform high-dimensional data analysis. For instance, LLE preserves the local structure information.
In view of this issue with the current approaches, we propose to learn the similarity information through reconstructing the kernel matrix of the original data, which is supposed to preserve overall relations. By doing so, we expect to obtain more accurate and complete data similarity. Considering clustering as a specific application of our proposed similarity learning method, we demonstrate that our framework provides impressive performance on several benchmark data sets. In summary, the main contributions of this paper are threefold:
1) Compared to other approaches, the use of kernel-based distances makes it possible to preserve the sets of overall relations rather than individual pairwise similarities.
2) Similarity preserving provides a fundamental building block to embed high-dimensional data into a low-dimensional latent space. It is general enough to be applied to a variety of learning problems.
3) We evaluate the proposed approach on the clustering task. The results show that our algorithm enjoys superior performance compared to many state-of-the-art methods.
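Notations. Given a data set $\{x_1, x_2, \cdots, x_n\}$, we denote $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ instances. The $(i,j)$-th element of matrix $X$ is denoted by $x_{ij}$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\| = \sqrt{x^T x}$, where $T$ denotes transpose. The $\ell_1$-norm of $X$ is defined as $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is represented as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The nuclear norm of $X$ is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ is the identity matrix with a proper size. $\mathbf{1}$ represents a column vector whose every element is one. $Z \ge 0$ means all the elements of $Z$ are nonnegative. The inner product is $\langle x_i, x_j \rangle = x_i^T x_j$.

In this section, we provide a brief review of existing automatic similarity learning techniques.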
In a similar spirit to LPP, for each data point $x_i$, all the data points $\{x_j\}_{j=1}^{n}$ can be regarded as the neighborhood of $x_i$ with probability $z_{ij}$. To some extent, $z_{ij}$ represents the similarity between $x_i$ and $x_j$ [Nie, Wang, and Huang 2014]. The smaller the distance $\|x_i - x_j\|^2$ is, the greater the probability $z_{ij}$ is. Rather than prespecifying $Z$ with the deterministic neighborhood relation as LPP does, one can adaptively learn $Z$ from the data set by solving an optimization problem:

$$\min_{Z} \sum_{i,j=1}^{n} \left( \|x_i - x_j\|^2 z_{ij} + \gamma z_{ij}^2 \right) \quad \text{s.t.} \quad \sum_{j=1}^{n} z_{ij} = 1,\; 0 \le z_{ij} \le 1, \tag{1}$$

where $\gamma$ is the regularization parameter. Recently, a variety of algorithms have been developed by using Eq. (1) to learn a similarity matrix. Some applications are clustering [Nie, Wang, and Huang 2014], NMF [Huang et al. 2018], and feature selection [Du and Shen 2015]. This approach can effectively capture the local structure information.

The so-called self-expression is to approximate each data point as a linear combination of other data points, i.e., $x_i \approx \sum_{j} x_j z_{ji}$. The rationale here is that if $x_i$ and $x_j$ are similar, the weight $z_{ji}$ should be big. Therefore, $Z$ also behaves like the similarity matrix. This shares a similar spirit with LLE, except that we do not predetermine the neighborhood. Its corresponding learning problem is:

$$\min_{Z} \|X - XZ\|_F^2 + \alpha\,\rho(Z) \quad \text{s.t.} \quad Z \ge 0, \tag{2}$$

where $\alpha$ balances the two terms and $\rho(Z)$ is a regularizer of $Z$. Two commonly used assumptions about $Z$ are low-rank and sparse. Hence, in many domains, we also call $Z$ the low-dimensional representation of $X$. Through this procedure, the individual pairwise similarity information hidden in the data is explored [Nie, Wang, and Huang 2014] and the most informative “neighbors” for each data point are automatically chosen.
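To make the self-expression idea concrete, here is a minimal sketch in Python. It assumes a squared Frobenius-norm regularizer, $\rho(Z) = \|Z\|_F^2$, rather than the low-rank or sparse choices discussed above, because that choice admits a closed-form solution; it also ignores any constraint on $Z$. The function name `self_expression_ridge` and the parameter name `alpha` are illustrative only.

```python
import numpy as np


def self_expression_ridge(X, alpha):
    """Closed-form minimizer of ||X - XZ||_F^2 + alpha * ||Z||_F^2.

    X has shape (m, n): m features, n instances (one data point per column).
    Setting the gradient -2 X^T (X - XZ) + 2 alpha Z to zero gives
    (X^T X + alpha I) Z = X^T X.
    """
    gram = X.T @ X                      # n x n matrix of inner products
    n = gram.shape[0]
    return np.linalg.solve(gram + alpha * np.eye(n), gram)
```

With two groups of points lying in mutually orthogonal subspaces, the coefficients linking the two groups vanish, which is the behavior that lets the coefficient matrix act as a similarity matrix.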
Moreover, this learned $Z$ can not only reveal the low-dimensional structure of data, but also be robust to data scale [Huang, Nie, and Huang 2015]. Therefore, this approach has drawn significant attention and achieved impressive performance in a number of applications, including face recognition [Zhang, Yang, and Feng 2011], subspace clustering [Liu et al. 2013, Elhamifar and Vidal 2013], and semi-supervised learning [Zhuang et al. 2017]. In many real-world applications, data often present complex structures. Nevertheless, the first term in Eq. (2) simply minimizes the reconstruction error. Some important manifold structure information, such as overall relations, could be lost during this process. Preserving relation information has been shown to be important for feature selection [Zhao et al. 2013]. In [Zhao et al. 2013], a new feature vector $f$ is obtained by maximizing $f^T \hat{K} f$, where $\hat{K}$ is the refined similarity matrix derived from the original kernel matrix $K$, whose $(i,j)$-th element is $K(x_i, x_j)$. In this paper, we propose a novel model to preserve the overall relations of the original data and simultaneously learn the similarity matrix.

Since our goal is to obtain similarity information, it is necessary to retain the overall relations among the data samples when we build a new representation. However, Eq. (2) just tries to reconstruct the original data and does not take overall relation information into account. Our objective is to find a new representation which preserves overall relations as much as possible.
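Given a data matrix $X$, one of the most commonly used relation measures is the inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data $X$ and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space, i.e., $X$ is mapped by $\phi$ [Xu et al. 2009]. We have

$$\min_{Z} \left\| \phi(X)^T \phi(X) - (\phi(X) Z)^T (\phi(X) Z) \right\|_F^2. \tag{3}$$

Since $(\phi(X) Z)^T (\phi(X) Z) = Z^T \phi(X)^T \phi(X) Z$, with the kernel matrix $K = \phi(X)^T \phi(X)$, (3) can be simplified as

$$\min_{Z} \left\| K - Z^T K Z \right\|_F^2. \tag{4}$$

With certain assumption about the structure of $Z$, our proposed Similarity Learning via Kernel preserving Embedding (SLKE) framework can be formulated as

$$\min_{Z} \left\| K - Z^T K Z \right\|_F^2 + \gamma\,\rho(Z) \quad \text{s.t.} \quad Z \ge 0, \tag{5}$$

where $\gamma$ is a tradeoff parameter and $\rho(Z)$ is a regularizer on $Z$. If we use the nuclear norm $\|Z\|_*$ to replace $\rho(Z)$, we have a low-rank representation. If the $\ell_1$-norm $\|Z\|_1$ is adopted, we obtain a sparse representation.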
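The step from the inner-product form to the kernel form can be checked numerically. The sketch below assumes a linear feature map, so that the mapped data can be formed explicitly as a matrix `Phi` with one column per instance; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np


def inner_product_loss(Phi, Z):
    """|| Phi^T Phi - (Phi Z)^T (Phi Z) ||_F^2, using the mapped data explicitly."""
    recon = Phi @ Z
    residual = Phi.T @ Phi - recon.T @ recon
    return float(np.sum(residual * residual))


def kernel_preserving_loss(K, Z):
    """|| K - Z^T K Z ||_F^2, using only the kernel matrix K = Phi^T Phi."""
    residual = K - Z.T @ K @ Z
    return float(np.sum(residual * residual))
```

Both functions return the same value for any `Z` when `K = Phi.T @ Phi`, and the loss is zero when `Z` is the identity. This is why the kernel form needs only `K` and never the mapped data itself.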
It is worth pointing out that Eq. (5) enjoys several nice properties:
1) The use of kernel-based distance preserves the sets of overall relations, which will benefit the subsequent tasks;
2) The learned low-dimensional representation or similarity matrix is general enough to be utilized to solve a variety of different tasks where similarity information is needed;
3) The learned representation is particularly suitable for problems that are sensitive to data similarity, such as clustering [Kang et al. 2018c], classification [Wright et al. 2009], and recommender systems [Kang, Peng, and Cheng 2017a];
4) Its input is the kernel matrix. This is also desirable, as not all types of data can be represented in numerical feature vector form [Xu et al. 2010]. For instance, we need to group proteins in bioinformatics based on their structures and to divide users in social media based on their friendship relations.
In the following section, we will show a simple strategy to solve problem (5).
It is easy to see that Eq. (5) is a fourth-order function of $Z$. Directly solving it is not so straightforward. To circumvent this problem, we first convert it to the following equivalent problem by introducing two more auxiliary variables:
(6) 
Now we resort to the alternating direction method of multipliers (ADMM) to solve (6). The corresponding augmented Lagrangian function is:
(7) 
where the scalar coefficient is a penalty parameter and the two dual variables are the Lagrangian multipliers. The variable $Z$ and the two auxiliary variables can be updated alternately, one at each step, while keeping the other two fixed.
To solve for the first auxiliary variable, we observe that the objective function (7) is a strongly convex quadratic function of it, so the minimizer can be obtained by setting the first derivative to zero:
(8) 
where $I$ is the identity matrix.
Similarly,
(9) 
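As a rough illustration of the two auxiliary-variable updates, the sketch below assumes the splitting $J = Z$, $W = Z$ and the augmented Lagrangian $\|K - J^T K W\|_F^2 + \gamma\rho(Z) + \frac{\mu}{2}\left(\|J - Z + Y_1/\mu\|_F^2 + \|W - Z + Y_2/\mu\|_F^2\right)$. The names `J`, `W`, `Y1`, `Y2`, and `mu` are hypothetical, and this form is one plausible reading of the splitting described above, not necessarily the exact one used by the authors. Under that assumption, setting each gradient to zero yields a linear system:

```python
import numpy as np


def update_J(K, W, Z, Y1, mu):
    """Solve (2 A A^T + mu I) J = 2 A K^T + mu Z - Y1, with A = K W."""
    A = K @ W
    lhs = 2.0 * A @ A.T + mu * np.eye(K.shape[0])
    rhs = 2.0 * A @ K.T + mu * Z - Y1
    return np.linalg.solve(lhs, rhs)


def update_W(K, J, Z, Y2, mu):
    """Solve (2 B^T B + mu I) W = 2 B^T K + mu Z - Y2, with B = J^T K."""
    B = J.T @ K
    lhs = 2.0 * B.T @ B + mu * np.eye(K.shape[0])
    rhs = 2.0 * B.T @ K + mu * Z - Y2
    return np.linalg.solve(lhs, rhs)


def grad_J(K, J, W, Z, Y1, mu):
    """Gradient of the assumed augmented Lagrangian with respect to J."""
    A = K @ W
    return -2.0 * A @ (K - J.T @ A).T + mu * (J - Z) + Y1


def grad_W(K, J, W, Z, Y2, mu):
    """Gradient of the assumed augmented Lagrangian with respect to W."""
    B = J.T @ K
    return -2.0 * B.T @ (K - B @ W) + mu * (W - Z) + Y2
```

Using `np.linalg.solve` instead of forming an explicit inverse is a numerical convenience; each update still costs on the order of $n^3$ operations for an $n \times n$ kernel matrix.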
For $Z$, we have the following subproblem:
(10) 
Depending on the regularization strategy, we have different closed-form solutions for $Z$. We define an auxiliary matrix and write its singular value decomposition (SVD). Then, for low-rank representation, i.e., when the regularizer is the nuclear norm, we have
(11) 
For sparse representation, i.e., when the regularizer is the $\ell_1$-norm, we can update $Z$ element by element as
(12) 
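In their standard textbook form, the two closed-form updates for $Z$ are singular value thresholding for the nuclear norm and element-wise soft-thresholding for the $\ell_1$-norm. The sketch below assumes the $Z$-subproblem can be written as $\min_Z \tau\rho(Z) + \frac{1}{2}\|Z - D\|_F^2$; the names `D` and `tau` are hypothetical, and how they relate to the paper's penalty and tradeoff parameters depends on the exact form of the augmented Lagrangian. Any additional constraint on $Z$, such as nonnegativity, is not handled here.

```python
import numpy as np


def soft_threshold(D, tau):
    """Proximal operator of tau * ||Z||_1: shrink every entry toward zero by tau."""
    return np.sign(D) * np.maximum(np.abs(D) - tau, 0.0)


def singular_value_threshold(D, tau):
    """Proximal operator of tau * ||Z||_*: soft-threshold the singular values of D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Entries (or singular values) smaller than `tau` in magnitude are set to zero, which is what produces sparse or low-rank solutions.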
For clarity, the complete procedure to solve problem (5) is outlined in Algorithm 1.
With our optimization strategy, updating the first auxiliary variable has complexity $O(n^3)$, and updating the second auxiliary variable has the same complexity. Both updates involve a matrix inverse. Fortunately, we can avoid it by resorting to some approximation techniques when we face large-scale data sets. Depending on the choice of regularizer, we have different complexity for updating $Z$. For low-rank representation, it requires an SVD in every iteration and its complexity is $O(n^3)$. Since we seek a low-rank matrix, we only need a few principal singular values. Packages like PROPACK can compute a rank-$k$ SVD with complexity $O(n^2 k)$ [Larsen 2004]. To obtain a sparse solution of $Z$, we need $O(n^2)$ complexity. Updating the two Lagrangian multipliers costs $O(n^2)$.
To assess the effectiveness of our proposed method, we apply the learned similarity matrix to do clustering.
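Table 1: Statistics of the data sets.

| Data set | # instances | # features | # classes |
|---|---|---|---|
| YALE | 165 | 1024 | 15 |
| JAFFE | 213 | 676 | 10 |
| ORL | 400 | 1024 | 40 |
| COIL20 | 1440 | 1024 | 20 |
| BA | 1404 | 320 | 36 |
| TR11 | 414 | 6429 | 9 |
| TR41 | 878 | 7454 | 10 |
| TR45 | 690 | 8261 | 10 |
| TDT2 | 9394 | 36771 | 30 |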
We conduct our experiments with nine benchmark data sets, which are widely used in clustering experiments. We show the statistics of these data sets in Table 1. In summary, the number of data samples varies from 165 to 9,394 and feature number ranges from 320 to 36,771. The first five data sets are images, while the last four are text data.
Specifically, the five image data sets contain three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. For example, COIL20 consists of 20 objects, and each object was photographed from different angles. The BA data set contains images of the digits “0” through “9” and the capital letters “A” through “Z”. YALE, ORL, and JAFFE consist of face images of different persons. The images of each person show different facial expressions or configurations due to time, illumination conditions, and glasses/no glasses.
Since the input for our proposed method is a kernel matrix, we design 12 kernels in total to fully examine its performance. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t\, d_{max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ denotes the maximal distance between data points; a linear kernel $K(x, y) = x^T y$; four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest element in its corresponding kernel matrix. These kernel types are commonly used in the literature, so we can investigate the performance of our method well.
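The kernel pool described above can be sketched as follows. The data matrix is assumed to have one instance per column, and the default parameter grids (`t_values`, `a_values`, `b_values`) are illustrative assumptions that yield seven Gaussian, one linear, and four polynomial kernels; the function names are hypothetical. The sketch also assumes the data contain at least two distinct points, so that the maximal distance is nonzero.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between the columns of X (shape m x n)."""
    sq = np.sum(X * X, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.maximum(d2, 0.0)          # clip tiny negatives from round-off


def rescale(K):
    """Divide every entry by the largest entry of the kernel matrix."""
    return K / np.max(K)


def build_kernels(X, t_values=(0.01, 0.05, 0.1, 1, 10, 50, 100),
                  a_values=(0, 1), b_values=(2, 4)):
    """Return a dict of rescaled Gaussian, linear, and polynomial kernels."""
    d2 = pairwise_sq_dists(X)
    d_max_sq = np.max(d2)               # squared maximal pairwise distance
    gram = X.T @ X
    kernels = {}
    for t in t_values:
        kernels[f"gaussian_t={t}"] = rescale(np.exp(-d2 / (t * d_max_sq)))
    kernels["linear"] = rescale(gram)
    for a in a_values:
        for b in b_values:
            kernels[f"poly_a={a}_b={b}"] = rescale((a + gram) ** b)
    return kernels
```

Note that dividing by the largest element keeps every kernel at most one; the entries stay nonnegative for the Gaussian and even-degree polynomial kernels, and for the linear kernel when the features are nonnegative, as with image intensities or term counts.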
To fully examine the effectiveness of the proposed framework on clustering, we choose a representative set of methods to compare against. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.
Spectral Clustering (SC) [Ng et al. 2002]: SC is a widely used clustering method. It enjoys the advantage of exploring the intrinsic data structures. Its input is the graph Laplacian, which is constructed from the similarity matrix. Here, we directly treat the kernel matrix as the similarity matrix for spectral clustering. For our proposed SLKE method, we employ the learned $Z$ to do spectral clustering. Thus, SC serves as a baseline method.
Robust Kernel K-means (RKKM) [Du et al. 2015] (code: https://github.com/csliangdu/RMKKM): Based on the classical k-means clustering algorithm, RKKM has been developed to deal with nonlinear structures, noise, and outliers in the data. RKKM demonstrates superior performance on a number of real-world data sets.
Simplex Sparse Representation (SSR) [Huang, Nie, and Huang 2015]: SSR has been proposed recently. It is based on the adaptive neighbors idea. Another appealing property of this method is that its model parameter can be calculated by assuming a maximum number of neighbors. Therefore, we do not need to tune the parameter. In addition, it outperforms many other state-of-the-art techniques.
Low-Rank Representation (LRR) [Liu et al. 2013]: Based on self-expression, subspace clustering with a low-rank regularizer achieves great success in a number of applications, such as face clustering and motion segmentation.
Sparse Subspace Clustering (SSC) [Elhamifar and Vidal 2013]: Similar to LRR, SSC assumes a sparse solution of $Z$. Both LRR and SSC learn the similarity matrix by reconstructing the original data. In this respect, SC, LRR, and SSC are baseline methods w.r.t. our proposed algorithm.
Clustering with Adaptive Neighbor (CAN) [Nie, Wang, and Huang 2014]: Based on the idea of adaptive neighbors, i.e., Eq. (1), CAN learns a local graph from raw data for the clustering task.
Twin Learning for Similarity and Clustering (TLSC) [Kang, Peng, and Cheng 2017b]: Recently, TLSC has been proposed and has shown promising results on real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space, but also has an optimal similarity graph guarantee. Besides, it has good theoretical properties, i.e., it is equivalent to kernel k-means and k-means under certain conditions.
SLKE: Our proposed similarity learning method with overall relations preserving capability. After obtaining the similarity matrix $Z$, we use spectral clustering to conduct clustering experiments. We test both the low-rank and the sparse regularizer. We denote them as SLKER and SLKES, respectively (code: https://github.com/sckangz/SLKE).
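Several of the methods above feed a learned similarity matrix into spectral clustering. The sketch below shows the embedding step in the style of Ng et al.: normalize the affinity matrix, take the top $k$ eigenvectors, and normalize the rows. Symmetrizing the learned matrix as $(|Z| + |Z^T|)/2$ is an assumption made here for illustration, the function name is hypothetical, and the final k-means step on the embedded rows is omitted.

```python
import numpy as np


def spectral_embedding(Z, k):
    """Row-normalized top-k eigenvectors of D^{-1/2} A D^{-1/2}.

    A is a symmetric, nonnegative affinity built from Z (assumed symmetrization).
    Each row of the result is the embedding of one data point.
    """
    A = (np.abs(Z) + np.abs(Z.T)) / 2.0
    degree = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    U = vecs[:, -k:]                    # eigenvectors of the k largest eigenvalues
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, 1e-12)
```

For an ideal block-diagonal similarity matrix, points in the same block map to the same embedded row and points in different blocks map to orthogonal rows, so any reasonable clustering of the rows recovers the blocks.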
To quantitatively and effectively assess the clustering performance, we utilize two widely used metrics [Peng et al. 2018]: accuracy (Acc) and normalized mutual information (NMI).
Acc discovers the one-to-one relationship between clusters and classes. Let $l_i$ and $\hat{l}_i$ be the clustering result and the ground truth cluster label of $x_i$, respectively. Then the Acc is defined as

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta\big(\hat{l}_i, \mathrm{map}(l_i)\big)}{n},$$

where $n$ is the sample size, the Kronecker delta function $\delta(x, y)$ equals one if and only if $x = y$ and zero otherwise, and map($\cdot$) is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.
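A minimal sketch of the accuracy computation follows. For simplicity it searches all one-to-one mappings by brute force instead of running the Kuhn-Munkres algorithm; both give the same optimum, but brute force is only practical for a small number of clusters. The function name is hypothetical, and the sketch assumes there are no more clusters than classes.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Best fraction of points whose mapped cluster index equals the true label."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best_hits = 0
    # Try every injective assignment of cluster indices to class labels.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(1 for t, c in zip(true_labels, cluster_labels)
                   if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)
```

The metric is invariant to how the clusters are numbered: relabeling the clusters changes the mapping but not the score.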
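Given two sets of clusters $L$ and $\hat{L}$, NMI is defined as

$$\mathrm{NMI}(L, \hat{L}) = \frac{\sum_{l \in L,\, \hat{l} \in \hat{L}} p(l, \hat{l}) \log \frac{p(l, \hat{l})}{p(l)\, p(\hat{l})}}{\max\big(H(L), H(\hat{L})\big)},$$

where $p(l)$ and $p(\hat{l})$ represent the marginal probability distribution functions of $L$ and $\hat{L}$, respectively, $p(l, \hat{l})$ is the joint probability function of $L$ and $\hat{L}$, and $H(\cdot)$ is the entropy function. A greater NMI means better clustering performance.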
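A small sketch of the NMI computation from two label lists is given below. Normalizing by the larger of the two entropies is one common convention and is an assumption here (other conventions use the mean or the geometric mean); returning zero when that entropy is zero is likewise a choice made only for this sketch. The function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information divided by max(H(a), H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )
    denom = max(entropy(labels_a), entropy(labels_b))
    if denom == 0.0:
        return 0.0                      # degenerate case: a single cluster on both sides
    return mutual_info / denom
```

Two labelings that agree up to renaming score one, and two statistically independent labelings score zero.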


We report the extensive experimental results in Table 2. Except SSC, LRR, CAN, and SSR, we run other methods on each kernel matrix individually. As a result, we show both the best performance among those 12 kernels and the average results over those 12 kernels for them. Based on this table, we can see that our proposed SLKE achieves the best performance in most cases. To be specific, we have the following observations:
1) Compared to the classical k-means based RKKM and spectral clustering techniques, our proposed method SLKE has a big advantage in terms of accuracy and NMI. With respect to the recently proposed SSR and TLSC methods, SLKE always obtains better results.
2) SLKER and SLKES often outperform LRR and SSC, respectively. The accuracy increased by 8.92% and 8.76% on average, respectively. That is to say, the kernel-based distance approach indeed performs better than the original data reconstruction technique. This verifies the importance of retaining relation information when we learn a low-dimensional representation, especially for sparse representation.
3) With respect to adaptive neighbors approach CAN, we also obtain better performance on those datasets except COIL20. For COIL20, our results are quite close to CAN’s. Therefore, compared to various similarity learning techniques, our method is very competitive.
4) Regarding lowrank and sparse representation, it is hard to conclude which one is better. It totally depends on the specific data.
Furthermore, we run the t-SNE [Maaten and Hinton 2008] algorithm on the JAFFE data and the reconstructed data from the best result of our SLKER. As shown in Figure 1, our method preserves the cluster structure of the data well.
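Table 3: Wilcoxon signed rank test results.

| Method | Metric | SC | RKKM | SSC | LRR | SSR | CAN | TLSC |
|---|---|---|---|---|---|---|---|---|
| SLKES | Acc | .0039 | .0039 | .0117 | .2500 | .0078 | .0391 | .0391 |
| SLKES | NMI | .0078 | .0039 | .0195 | .6523 | .0391 | .0547 | .3008 |
| SLKER | Acc | .0039 | .0039 | .0039 | .0977 | .0039 | .0039 | .0117 |
| SLKER | NMI | .0039 | .0078 | .0039 | .0742 | .0039 | .0078 | .0391 |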
To see the significance of improvements, we further apply the Wilcoxon signed rank test [Peng, Cheng, and Cheng 2017] to Table 2. We show the $p$-values in Table 3. We note that the testing results are under 0.05 in most cases when comparing SLKES and SLKER to other methods. Therefore, SLKES and SLKER outperform SC, RKKM, SSC, and SSR with statistical significance.
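The reported values all appear to be multiples of $2/2^9 \approx .0039$, which is consistent with an exact two-sided signed rank test over nine paired data sets; this reading is an inference from the numbers, not something the text states. Under that assumption, a small enumeration-based sketch of the test is shown below. The function name is hypothetical, zero differences are dropped, and tied magnitudes receive average ranks.

```python
from itertools import product


def wilcoxon_exact_p(x, y):
    """Exact two-sided Wilcoxon signed rank p-value by enumerating sign patterns."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                        # assign average ranks to tied magnitudes
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    at_most = 0
    at_least = 0
    for signs in product((0, 1), repeat=n):   # all 2^n equally likely sign patterns
        w = sum(r for r, s in zip(ranks, signs) if s)
        at_most += w <= w_plus
        at_least += w >= w_plus
    return min(1.0, 2.0 * min(at_most, at_least) / 2 ** n)
```

With nine pairs in which one method wins every time, the function returns $2/512 \approx .0039$, the smallest value that such a test can produce for nine data sets.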
In this subsection, we investigate the influence of our model parameter $\gamma$ on the clustering results. Taking a Gaussian kernel on the YALE and JAFFE data sets as examples, we plot our algorithm's performance as $\gamma$ varies in Figures 2 and 3, respectively. We can see that our proposed methods work well for a wide range of $\gamma$.
In this paper, we present a novel similarity learning framework relying on an embedding of kernel-based distance. Our model is flexible enough to obtain either a low-rank or a sparse representation of data. Comprehensive experimental results on real data sets demonstrate the superiority of the proposed method on the clustering task. It has great potential to be applied in a number of applications beyond clustering. It has been shown that the performance of the proposed method is largely determined by the choice of kernel function. In the future, we plan to address this issue by developing a multiple kernel learning method, which is capable of automatically learning an appropriate kernel from a pool of input kernels.
This paper was supported in part by grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111), a 985 Project of UESTC (No. A1098531023601041), and two Fundamental Research Funds for the Central Universities of China (Nos. A03017023701012 and ZYGX2017KYQD177).
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press.
On spectral clustering: Analysis and an algorithm. NIPS 2:849–856.
IEEE Transactions on Neural Networks and Learning Systems.