Data similarity is a key concept in many data-driven applications, and many algorithms are sensitive to the choice of similarity measure. To tackle this fundamental problem, automatic learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semi-supervised learning. However, it just tries to reconstruct the original data, and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework by minimizing the reconstruction error of kernel matrices, rather than the reconstruction error of original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel preserving opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.
Nowadays, high-dimensional data can be collected everywhere, either by low-cost sensors or from the internet [Chen et al.2012]. Extracting useful information from massive high-dimensional data is critical in different areas such as text, images, and videos. Data similarity is especially important since it is the input for a number of data analysis tasks, such as spectral clustering [Ng et al.2002, Chen et al.2018], nearest neighbor classification [Weinberger, Blitzer, and Saul2005], image segmentation [Li et al.2016], person re-identification [Hirzer et al.2012, Hoi, Liu, and Chang2008], dimension reduction [Passalis and Tefas2017], and graph-based semi-supervised learning [Kang et al.2018a, Gao et al.2017, Towne, Rosé, and Herbsleb2016]. A variety of similarity metrics, e.g., cosine, Jaccard coefficient, Euclidean distance, and Gaussian function, are often used in practice for convenience. However, they are often data-dependent and sensitive to noise [Huang, Nie, and Huang2015]. Consequently, different metrics lead to big differences in the final results. In addition, several other similarity measure strategies are popular in dimension reduction techniques. For example, in the widely used locally linear embedding (LLE) [Roweis and Saul2000], isometric feature mapping (ISOMAP) [Tenenbaum, De Silva, and Langford2000], and locality preserving projection (LPP) [Niyogi2004] methods, one has to construct an adjacency graph of neighbors. For this purpose, $k$-nearest-neighborhood (knn) and $\epsilon$-nearest-neighborhood graph construction methods are often utilized. These approaches also have some inherent drawbacks, including 1) how to determine the neighbor number $k$ or radius $\epsilon$; 2) how to choose an appropriate similarity metric to define the neighborhood; 3) how to counteract the adverse effect of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all these factors heavily influence the subsequent tasks [Kang et al.2018b].
Recently, automatic learning of similarity information from data has drawn significant attention. In general, existing methods can be classified into two categories. The first one is the adaptive neighbors approach. It learns similarity information by assigning to each data point a probability of being a neighbor of another data point [Nie, Wang, and Huang2014]. It has been shown to be an effective way to capture the local manifold structure.
The other one is the self-expression approach. The basic idea is to represent every data point by a linear combination of other data points. In contrast, LLE reconstructs the original data by expressing each data point as a linear combination of its nearest neighbors only. By minimizing this reconstruction error, we obtain a coefficient matrix, which is also called the similarity matrix. It has been widely applied in various representation learning tasks, including sparse subspace clustering [Elhamifar and Vidal2013, Peng et al.2016], low-rank representation [Liu et al.2013], multi-view learning [Tao et al.2017], semi-supervised learning [Zhuang et al.2017], and nonnegative matrix factorization (NMF) [Zhang et al.2017].
However, this approach just tries to reconstruct the original data and has no explicit mechanism to preserve manifold structure information about the data. In many applications, the data can display structures beyond simply being low-rank or sparse. It is well-accepted that it is essential to take into account structure information when we perform high-dimensional data analysis. For instance, LLE preserves the local structure information.
In view of this issue with the current approaches, we propose to learn the similarity information through reconstructing the original data kernel matrix, which is supposed to preserve overall relations. By doing so, we expect to obtain more accurate and complete data similarity. Considering clustering as a specific application of our proposed similarity learning method, we demonstrate that our framework provides impressive performance on several benchmark data sets. In summary, the main contributions of this paper are threefold:
1) Compared to other approaches, the use of kernel-based distances makes it possible to preserve the sets of overall relations rather than individual pairwise similarities.
2) Similarity preserving provides a fundamental building block to embed high-dimensional data into a low-dimensional latent space. It is general enough to be applied to a variety of learning problems.
3) We evaluate the proposed approach on the clustering task. The results show that our algorithm enjoys superior performance compared to many state-of-the-art methods.
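Notations. Given a data set $\{x_1, x_2, \dots, x_n\}$, we denote $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ instances. The $(i,j)$-th element of matrix $X$ is denoted by $x_{ij}$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\| = \sqrt{x^T x}$, where $T$ denotes transpose. The $\ell_1$-norm of $X$ is defined as $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is represented as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The nuclear norm of $X$ is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ is the identity matrix with a proper size. $\mathbf{1}$ represents a column vector whose every element is one. $Z \ge 0$ means all the elements of $Z$ are nonnegative. Inner product $\langle x_i, x_j \rangle = x_i^T x_j$.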
In this section, we provide a brief review of existing automatic similarity learning techniques.
In a similar spirit to LPP, for each data point $x_i$, all the data points $\{x_j\}_{j=1}^{n}$ can be regarded as the neighborhood of $x_i$ with probability $z_{ij}$. To some extent, $z_{ij}$ represents the similarity between $x_i$ and $x_j$ [Nie, Wang, and Huang2014]. The smaller the distance $\|x_i - x_j\|^2$ is, the greater the probability $z_{ij}$ is. Rather than prespecifying $Z$ with the deterministic neighborhood relation as LPP does, one can adaptively learn $Z$ from the data set by solving an optimization problem:

$$\min_{z_i} \sum_{j=1}^{n} \left( \|x_i - x_j\|^2 z_{ij} + \alpha z_{ij}^2 \right) \quad \text{s.t.} \quad z_i^T \mathbf{1} = 1,\ 0 \le z_{ij} \le 1, \tag{1}$$

where $\alpha$ is the regularization parameter. Recently, a variety of algorithms have been developed by using Eq. (1) to learn a similarity matrix. Some applications are clustering [Nie, Wang, and Huang2014], NMF [Huang et al.2018], and feature selection [Du and Shen2015]. This approach can effectively capture the local structure information.
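As a minimal sketch of how the adaptive neighbors problem can be solved, assume the per-point objective is $\sum_j (\|x_i - x_j\|^2 z_{ij} + \alpha z_{ij}^2)$ under the simplex constraint $z_i^T \mathbf{1} = 1$, $z_{ij} \ge 0$. Completing the square shows that each row $z_i$ is the Euclidean projection of $-d_i / (2\alpha)$ onto the probability simplex, where $d_i$ holds the squared distances from $x_i$. The function names below are illustrative and not taken from the cited work.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) = 1}."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive gap
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def adaptive_neighbors(X, alpha):
    """X has shape (m, n) with samples as columns; returns an (n, n) matrix
    whose i-th row is the neighbor distribution z_i of sample i."""
    sq = np.sum(X ** 2, axis=0)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    Z = np.zeros_like(D)
    for i in range(D.shape[0]):
        # min_z d_i^T z + alpha * ||z||^2 on the simplex
        # is the projection of -d_i / (2 * alpha) onto the simplex.
        Z[i] = project_simplex(-D[i] / (2.0 * alpha))
    return Z

X = np.array([[0.0, 0.1, 5.0]])   # three 1-D samples, two of them close
Z = adaptive_neighbors(X, alpha=1.0)
```

Note that this sketch does not exclude a point from its own neighborhood, so $z_{ii}$ typically receives the largest weight; a larger $\alpha$ spreads the weights over more neighbors.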
The so-called self-expression is to approximate each data point as a linear combination of other data points, i.e., $x_i \approx \sum_j x_j z_{ji}$. The rationale here is that if $x_i$ and $x_j$ are similar, the weight $z_{ji}$ should be big. Therefore, $Z$ also behaves like the similarity matrix. This shares a similar spirit with LLE, except that we do not predetermine the neighborhood. Its corresponding learning problem is:

$$\min_{Z} \frac{1}{2} \|X - XZ\|_F^2 + \alpha \rho(Z) \quad \text{s.t.} \quad Z \ge 0, \tag{2}$$

where $\rho(Z)$ is a regularizer of $Z$. Two commonly used assumptions about $Z$ are low-rank and sparse. Hence, in many domains, we also call $Z$ the low-dimensional representation of $X$. Through this procedure, the individual pairwise similarity information hidden in the data is explored [Nie, Wang, and Huang2014] and the most informative “neighbors” for each data point are automatically chosen.
Moreover, this learned $Z$ can not only reveal the low-dimensional structure of data, but also be robust to data scale [Huang, Nie, and Huang2015]. Therefore, this approach has drawn significant attention and achieved impressive performance in a number of applications, including face recognition [Zhang, Yang, and Feng2011], subspace clustering [Liu et al.2013, Elhamifar and Vidal2013], and semi-supervised learning [Zhuang et al.2017]. In many real-world applications, data often present complex structures. Nevertheless, the first term in Eq. (2) simply minimizes the reconstruction error. Some important manifold structure information, such as overall relations, could be lost during this process. Preserving relation information has been shown to be important for feature selection [Zhao et al.2013]. In [Zhao et al.2013], a new feature vector $f$ is obtained by maximizing $f^T \hat{K} f$, where $\hat{K}$ is the refined similarity matrix derived from the original kernel matrix $K$ with element $K(x, y) = \phi(x)^T \phi(y)$. In this paper, we propose a novel model to preserve the overall relations of the original data and simultaneously learn the similarity matrix.
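To make the self-expression idea concrete, the following sketch solves a simplified variant of the reconstruction problem. It assumes the objective $\frac{1}{2}\|X - XZ\|_F^2 + \alpha \rho(Z)$, swaps in the squared Frobenius norm $\rho(Z) = \|Z\|_F^2$ in place of the low-rank or sparse regularizers discussed above, and drops the constraint $Z \ge 0$, so that a closed form exists: $Z = (X^T X + 2\alpha I)^{-1} X^T X$. The function name is hypothetical.

```python
import numpy as np

def self_expression_ridge(X, alpha):
    """Closed-form minimizer of 0.5 * ||X - X Z||_F^2 + alpha * ||Z||_F^2.

    X has shape (m, n) with samples as columns. The Frobenius regularizer and
    the missing nonnegativity constraint are simplifying assumptions.
    """
    G = X.T @ X                                   # Gram matrix of the samples
    n = G.shape[0]
    # Setting the gradient to zero gives (G + 2*alpha*I) Z = G.
    return np.linalg.solve(G + 2.0 * alpha * np.eye(n), G)

# Two samples on the first axis and two on the second axis.
X = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0]])
Z = self_expression_ridge(X, alpha=0.1)
affinity = 0.5 * (np.abs(Z) + np.abs(Z.T))        # symmetric similarity graph
```

In this toy case the samples lie in two orthogonal subspaces, so the coefficients linking the two groups vanish and the affinity matrix is block diagonal.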
Since our goal is to obtain similarity information, it is necessary to retain the overall relations among the data samples when we build a new representation. However, Eq. (2) just tries to reconstruct the original data and does not take overall relation information into account. Our objective is to find a new representation that preserves overall relations as much as possible.
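Given a data matrix $X$, one of the most commonly used relation measures is the inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data $X$ and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space, i.e., $X$ is mapped by $\phi$ [Xu et al.2009]. We have

$$\min_{Z} \left\| \phi(X)^T \phi(X) - (\phi(X) Z)^T (\phi(X) Z) \right\|_F^2. \tag{3}$$

With the kernel matrix $K = \phi(X)^T \phi(X)$, (3) can be simplified as

$$\min_{Z} \left\| K - Z^T K Z \right\|_F^2. \tag{4}$$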
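The simplification can be checked numerically. The sketch below assumes the two losses are $\|\phi(X)^T\phi(X) - (\phi(X)Z)^T(\phi(X)Z)\|_F^2$ and $\|K - Z^T K Z\|_F^2$ with $K = \phi(X)^T\phi(X)$, and takes $\phi$ to be the identity map (a linear kernel) so that both sides can be computed explicitly. The helper name is illustrative.

```python
import numpy as np

def kernel_preserving_loss(K, Z):
    """Squared Frobenius gap between K and the kernel of the reconstruction."""
    return np.linalg.norm(K - Z.T @ K @ Z, "fro") ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # m = 5 features, n = 8 samples (columns)
Z = rng.random((8, 8))            # an arbitrary coefficient matrix

K = X.T @ X                       # linear kernel: phi is the identity map
direct = np.linalg.norm(X.T @ X - (X @ Z).T @ (X @ Z), "fro") ** 2
via_kernel = kernel_preserving_loss(K, Z)
```

Because only $K$ enters the second form, the same loss can be evaluated for any kernel without ever forming $\phi(X)$.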
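With a certain assumption about the structure of $Z$, our proposed Similarity Learning via Kernel preserving Embedding (SLKE) framework can be formulated as

$$\min_{Z} \frac{1}{2} \left\| K - Z^T K Z \right\|_F^2 + \gamma \rho(Z) \quad \text{s.t.} \quad Z \ge 0, \tag{5}$$

where $\gamma > 0$ is a tradeoff parameter and $\rho$ is a regularizer on $Z$. If we use the nuclear norm $\|Z\|_*$ as $\rho(Z)$, we have a low-rank representation. If the $\ell_1$-norm $\|Z\|_1$ is adopted, we obtain a sparse representation.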
It is worth pointing out that Eq. (5) enjoys several nice properties:
1) The use of kernel-based distance preserves the sets of overall relations, which will benefit the subsequent tasks;
2) This learned low-dimensional representation or similarity matrix is general enough to be utilized to solve a variety of different tasks, where similarity information is needed;
3) The learned representation is particularly suitable for problems that are sensitive to data similarity, such as clustering [Kang et al.2018c], classification [Wright et al.2009], and recommender systems [Kang, Peng, and Cheng2017a];
4) Its input is the kernel matrix. This is also desirable, as not all types of data can be represented in numerical feature vector form [Xu et al.2010]. For instance, we need to group proteins in bioinformatics based on their structures and to divide users in social media based on their friendship relations.
In the following section, we will show a simple strategy to solve problem (5).
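It is easy to see that Eq. (5) is a fourth-order function of $Z$, so solving it directly is not straightforward. To circumvent this problem, we first convert it to the following equivalent problem by introducing two auxiliary variables $J$ and $W$:

$$\min_{Z, J, W} \frac{1}{2} \left\| K - J^T K W \right\|_F^2 + \gamma \rho(Z) \quad \text{s.t.} \quad Z \ge 0,\ J = Z,\ W = Z. \tag{6}$$

We then resort to the alternating direction method of multipliers (ADMM) to solve (6). The corresponding augmented Lagrangian function is:

$$\mathcal{L}(Z, J, W, Y_1, Y_2) = \frac{1}{2} \left\| K - J^T K W \right\|_F^2 + \gamma \rho(Z) + \frac{\mu}{2} \left( \left\| J - Z + \frac{Y_1}{\mu} \right\|_F^2 + \left\| W - Z + \frac{Y_2}{\mu} \right\|_F^2 \right), \tag{7}$$

where $\mu > 0$ is a penalty parameter and $Y_1$, $Y_2$ are the Lagrangian multipliers. The variables $Z$, $J$, and $W$ can be updated alternately, one at each step, while keeping the other two fixed.
To solve for $J$, we observe that the objective function (7) is a strongly convex quadratic function in $J$, which can be minimized by setting its first derivative to zero. We have:

$$J = (\mu I + K W W^T K^T)^{-1} (\mu Z - Y_1 + K W K^T),$$

where $I$ is the identity matrix. The same argument applied to $W$ gives $W = (\mu I + K^T J J^T K)^{-1} (\mu Z - Y_2 + K^T J K)$.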
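The first-order condition behind this update can be written out as a short worked derivation. It assumes that the augmented Lagrangian has the form $\frac{1}{2}\|K - J^T K W\|_F^2 + \gamma\rho(Z) + \frac{\mu}{2}(\|J - Z + Y_1/\mu\|_F^2 + \|W - Z + Y_2/\mu\|_F^2)$, where $J$ and $W$ are auxiliary copies of $Z$ and $Y_1$, $Y_2$ are the multipliers; these symbol names are assumptions wherever the text leaves them blank.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial J}
  &= -KW\,(K - J^{T}KW)^{T} + \mu\Bigl(J - Z + \tfrac{Y_1}{\mu}\Bigr) = 0 \\
\Longrightarrow\quad
(\mu I + KWW^{T}K^{T})\,J &= \mu Z - Y_1 + KWK^{T} \\
\Longrightarrow\quad
J &= (\mu I + KWW^{T}K^{T})^{-1}\,(\mu Z - Y_1 + KWK^{T})
\end{aligned}
```

Since $\mu > 0$ and $KWW^TK^T$ is positive semidefinite, the matrix $\mu I + KWW^TK^T$ is positive definite, so the inverse always exists.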
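For $Z$, we have the following subproblem:

$$\min_{Z} \gamma \rho(Z) + \frac{\mu}{2} \left( \left\| J - Z + \frac{Y_1}{\mu} \right\|_F^2 + \left\| W - Z + \frac{Y_2}{\mu} \right\|_F^2 \right).$$

Depending on the regularization strategy, we have different closed-form solutions for $Z$. Define $D = \frac{1}{2}\left(J + W + \frac{Y_1 + Y_2}{\mu}\right)$, so that the subproblem equals $\gamma\rho(Z) + \mu\|Z - D\|_F^2$ up to a constant, and write the singular value decomposition (SVD) of $D$ as $U \, \mathrm{diag}(\sigma) \, V^T$. Then, for the low-rank representation, i.e., $\rho(Z) = \|Z\|_*$, we have

$$Z = U \, \mathrm{diag}\left( \max\left\{ \sigma - \frac{\gamma}{2\mu}, 0 \right\} \right) V^T.$$

For the sparse representation, i.e., $\rho(Z) = \|Z\|_1$, we can update $Z$ element by element as

$$Z_{ij} = \max\left\{ |D_{ij}| - \frac{\gamma}{2\mu}, 0 \right\} \cdot \mathrm{sign}(D_{ij}).$$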
For clarity, the complete procedures to solve the problem (5) are outlined in Algorithm 1.
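The overall procedure can be sketched as follows. This is an illustrative reading of the ADMM scheme, not the authors' reference implementation (that is linked from the experimental section). It assumes the split $J = Z$, $W = Z$ with multipliers $Y_1$, $Y_2$, the loss $\frac{1}{2}\|K - J^T K W\|_F^2$, a random initialization, a fixed number of iterations, and a final clipping step as a heuristic way to keep $Z \ge 0$. All function and argument names are hypothetical.

```python
import numpy as np

def svt(D, tau):
    """Singular value thresholding: proximal map of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(D, tau):
    """Element-wise soft thresholding: proximal map of tau * l1-norm."""
    return np.sign(D) * np.maximum(np.abs(D) - tau, 0.0)

def slke_admm(K, gamma, mu=1.0, regularizer="low_rank", n_iter=50, seed=0):
    """ADMM sketch for min 0.5*||K - Z^T K Z||_F^2 + gamma*rho(Z), Z >= 0."""
    n = K.shape[0]
    I = np.eye(n)
    rng = np.random.default_rng(seed)
    Z = rng.random((n, n)) / n                 # assumed random initialization
    J, W = Z.copy(), Z.copy()
    Y1, Y2 = np.zeros((n, n)), np.zeros((n, n))
    prox = svt if regularizer == "low_rank" else soft
    for _ in range(n_iter):
        # J-step: strongly convex quadratic, solved via its normal equations.
        A = K @ W
        J = np.linalg.solve(mu * I + A @ A.T, mu * Z - Y1 + A @ K.T)
        # W-step: same structure with the roles of J and W exchanged.
        B = J.T @ K
        W = np.linalg.solve(mu * I + B.T @ B, mu * Z - Y2 + B.T @ K)
        # Z-step: proximal map at D with threshold gamma / (2 * mu).
        D = 0.5 * (J + W + (Y1 + Y2) / mu)
        Z = np.maximum(prox(D, gamma / (2.0 * mu)), 0.0)   # clip: heuristic
        # Dual ascent on the two equality constraints.
        Y1 += mu * (J - Z)
        Y2 += mu * (W - Z)
    return Z

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0.0, 0.1, (2, 5)), rng.normal(3.0, 0.1, (2, 5))])
K = X.T @ X
K = K / K.max()
Z_low_rank = slke_admm(K, gamma=0.1, regularizer="low_rank", n_iter=20)
Z_sparse = slke_admm(K, gamma=0.1, regularizer="sparse", n_iter=20)
```

Using `np.linalg.solve` avoids forming the matrix inverses explicitly; the problem is nonconvex in $(J, W)$ jointly, so a practical implementation would also monitor the residuals $\|J - Z\|$ and $\|W - Z\|$ as a stopping criterion.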
With our optimization strategy, the complexity for $J$ is $O(n^3)$. Updating $W$ has the same complexity as $J$. Both $J$ and $W$ involve a matrix inverse. Fortunately, we can avoid it by resorting to approximation techniques when we face large-scale data sets. Depending on the choice of regularizer, we have different complexity for $Z$. For the low-rank representation, it requires an SVD in every iteration, whose complexity is $O(n^3)$. Since we seek a low-rank matrix, we only need a few principal singular values. Packages like PROPACK can compute a rank-$k$ SVD with complexity $O(n^2 k)$ [Larsen2004]. To obtain a sparse solution of $Z$, we need $O(n^2)$ complexity. Updating $Y_1$ and $Y_2$ costs $O(n^2)$.
To assess the effectiveness of our proposed method, we apply the learned similarity matrix to do clustering.
Table 1: Statistics of the data sets (columns: # instances, # features, # classes).
We conduct our experiments with nine benchmark data sets, which are widely used in clustering experiments. We show the statistics of these data sets in Table 1. In summary, the number of data samples varies from 165 to 9,394 and feature number ranges from 320 to 36,771. The first five data sets are images, while the last four are text data.
Specifically, the five image data sets comprise three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. For example, COIL20 consists of 20 objects, each photographed from different angles. The BA data set contains images of the digits “0” through “9” and the capital letters “A” through “Z”. YALE, ORL, and JAFFE consist of images of people. Each image shows a different facial expression or configuration due to capture time, illumination conditions, and glasses/no glasses.
Since the input for our proposed method is the kernel matrix, we design 12 kernels in total to fully examine its performance. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t \, d_{max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ denotes the maximal distance between data points; a linear kernel $K(x, y) = x^T y$; and four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest element in its corresponding kernel matrix. These are commonly used kernel types in the literature, so we can thoroughly investigate the performance of our method.
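A kernel pool of this kind can be built as in the sketch below. The default parameter grids (`t_values`, `a_values`, `b_values`) are assumptions chosen to give seven Gaussian and four polynomial kernels, and the function name is hypothetical; they should be replaced by the values actually used in a given experiment.

```python
import numpy as np

def build_kernels(X, t_values=(0.01, 0.05, 0.1, 1, 10, 50, 100),
                  a_values=(0, 1), b_values=(2, 4)):
    """Return a list of (name, K) pairs for X of shape (m, n), samples as columns.

    Parameter grids are illustrative assumptions. Each kernel is divided by
    its largest entry, so the maximum of every returned matrix is 1.
    """
    G = X.T @ X                                          # linear kernel
    sq = np.diag(G)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)  # squared distances
    d_max2 = D2.max()                                    # squared maximal distance
    kernels = []
    for t in t_values:                                   # Gaussian kernels
        kernels.append((f"gaussian_t{t}", np.exp(-D2 / (t * d_max2))))
    kernels.append(("linear", G.copy()))
    for a in a_values:                                   # polynomial kernels
        for b in b_values:
            kernels.append((f"poly_a{a}_b{b}", (a + G) ** b))
    return [(name, K / K.max()) for name, K in kernels]

X = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0]])
kernels = build_kernels(X)
names = [name for name, _ in kernels]
```

Dividing by the largest element bounds every entry above by one; entries of the linear kernel can still be negative when the data contain negative inner products, so nonnegative features are assumed if a strict $[0, 1]$ range is required.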
To fully examine the effectiveness of the proposed framework on clustering, we choose a good set of methods to compare. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.
Spectral Clustering (SC) [Ng et al.2002]: SC is a widely used clustering method. It enjoys the advantage of exploring the intrinsic data structures. Its input is the graph Laplacian, which is constructed from the similarity matrix. Here, we directly treat the kernel matrix as the similarity matrix for spectral clustering. For our proposed SLKE method, we employ the learned $Z$ to do spectral clustering. Thus, SC serves as a baseline method.
Simplex Sparse Representation (SSR) [Huang, Nie, and Huang2015]: The SSR method has been proposed recently. It is based on the adaptive neighbors idea. Another appealing property of this method is that its model parameter can be calculated by assuming a maximum number of neighbors. Therefore, we do not need to tune the parameter. In addition, it outperforms many other state-of-the-art techniques.
Low-Rank Representation (LRR) [Liu et al.2013]: Based on self-expression, subspace clustering with a low-rank regularizer achieves great success in a number of applications, such as face clustering and motion segmentation.
Sparse Subspace Clustering (SSC) [Elhamifar and Vidal2013]: Similar to LRR, SSC assumes a sparse solution of $Z$. Both LRR and SSC learn the similarity matrix by reconstructing the original data. In this respect, SC, LRR, and SSC are baseline methods w.r.t. our proposed algorithm.
Twin Learning for Similarity and Clustering (TLSC) [Kang, Peng, and Cheng2017b]: Recently, TLSC has been proposed and has shown promising results on real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space, but also has an optimal similarity graph guarantee. Besides, it has good theoretical properties, i.e., it is equivalent to kernel k-means and k-means under certain conditions.
SLKE: Our proposed similarity learning method with overall relations preserving capability. After obtaining the similarity matrix $Z$, we use spectral clustering to conduct clustering experiments. We test both the low-rank and the sparse regularizer, and denote them as SLKE-R and SLKE-S, respectively (https://github.com/sckangz/SLKE).
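The step from a learned similarity matrix to cluster labels can be sketched as follows. The code assumes the normalized spectral clustering recipe of [Ng et al.2002] applied to the symmetrized matrix $(|Z| + |Z^T|)/2$. The symmetrization, the deterministic farthest-point initialization, and the small NumPy-only k-means are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def spectral_embedding(Z, c):
    """Row-normalized top-c eigenvectors of the normalized affinity matrix."""
    S = 0.5 * (np.abs(Z) + np.abs(Z.T))             # symmetric, nonnegative
    d = np.maximum(S.sum(axis=1), 1e-12)
    M = S / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]   # D^{-1/2} S D^{-1/2}
    _, vecs = np.linalg.eigh(M)                     # eigenvalues in ascending order
    F = vecs[:, -c:]                                # eigenvectors of the c largest
    norms = np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    return F / norms

def kmeans(F, c, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [F[0]]
    for _ in range(1, c):
        dist = np.min([np.sum((F - ctr) ** 2, axis=1) for ctr in centers], axis=0)
        centers.append(F[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d2, axis=1)
        for k in range(c):
            if np.any(labels == k):                 # keep empty clusters unchanged
                centers[k] = F[labels == k].mean(axis=0)
    return labels

def spectral_clustering(Z, c):
    return kmeans(spectral_embedding(Z, c), c)

# Toy similarity matrix with two strongly connected groups of three samples.
Z = np.full((6, 6), 0.01)
Z[:3, :3] = 1.0
Z[3:, 3:] = 1.0
labels = spectral_clustering(Z, 2)
```

The same routine applies unchanged to a kernel matrix, which corresponds to the SC baseline that treats the kernel directly as the similarity matrix.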
To quantitatively assess the clustering performance, we utilize two widely used metrics [Peng et al.2018], accuracy (Acc) and normalized mutual information (NMI).
Acc discovers the one-to-one relationship between clusters and classes. Let $l_i$ and $\hat{l}_i$ be the clustering result and the ground truth cluster label of $x_i$, respectively. Then the Acc is defined as

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta(\hat{l}_i, \mathrm{map}(l_i))}{n},$$

where $n$ is the sample size, the Kronecker delta function $\delta(x, y)$ equals one if and only if $x = y$ and zero otherwise, and map($\cdot$) is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.
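The accuracy metric can be illustrated with the short sketch below. To stay dependency-free it finds the best cluster-to-class mapping by brute force over all permutations rather than with the Kuhn-Munkres algorithm, so it is only practical for a handful of clusters; for larger problems an assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice. The function name is hypothetical.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(pred, truth):
    """Fraction of samples matched under the best one-to-one label mapping.

    Brute-force search over permutations (a stand-in for Kuhn-Munkres),
    feasible only for a small number of clusters.
    """
    pred, truth = np.asarray(pred), np.asarray(truth)
    clusters, classes = np.unique(pred), np.unique(truth)
    size = max(clusters.size, classes.size)
    counts = np.zeros((size, size), dtype=int)      # zero-padded contingency table
    for a, c in enumerate(clusters):
        for b, k in enumerate(classes):
            counts[a, b] = np.sum((pred == c) & (truth == k))
    best = max(sum(counts[a, perm[a]] for a in range(size))
               for perm in permutations(range(size)))
    return best / pred.size

acc_relabelled = clustering_accuracy([1, 1, 0, 0, 2, 2], [0, 0, 1, 1, 2, 2])
acc_one_error = clustering_accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

The first call scores a clustering that is correct up to a renaming of the cluster indices; the second has a single misassigned sample out of six.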
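Given two sets of clusters $L$ and $\hat{L}$, NMI is defined as

$$\mathrm{NMI}(L, \hat{L}) = \frac{\sum_{l \in L, \hat{l} \in \hat{L}} p(l, \hat{l}) \log\left( \frac{p(l, \hat{l})}{p(l) \, p(\hat{l})} \right)}{\max(H(L), H(\hat{L}))},$$

where $p(l)$ and $p(\hat{l})$ represent the marginal probability distribution functions of $L$ and $\hat{L}$, respectively, $p(l, \hat{l})$ is the joint probability function of $L$ and $\hat{L}$, and $H(\cdot)$ is the entropy function. A greater NMI means better clustering performance.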
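A direct implementation of this metric is sketched below. It assumes the normalization by $\max(H(L), H(\hat{L}))$ and estimates all probabilities from label counts; returning 1 when both partitions consist of a single cluster is a convention chosen here, and the function name is hypothetical.

```python
import numpy as np

def nmi(pred, truth):
    """Normalized mutual information with max-entropy normalization."""
    pred, truth = np.asarray(pred).ravel(), np.asarray(truth).ravel()
    _, pi = np.unique(pred, return_inverse=True)    # cluster indices 0..a-1
    _, ti = np.unique(truth, return_inverse=True)   # class indices 0..b-1
    joint = np.zeros((pi.max() + 1, ti.max() + 1))
    np.add.at(joint, (pi, ti), 1.0)                 # co-occurrence counts
    joint /= pred.size                              # joint distribution
    p, q = joint.sum(axis=1), joint.sum(axis=0)     # marginals (all positive)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p, q)[nz]))
    h_p = -np.sum(p * np.log(p))
    h_q = -np.sum(q * np.log(q))
    denom = max(h_p, h_q)
    return mi / denom if denom > 0 else 1.0         # convention for trivial case

nmi_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])          # identical up to renaming
nmi_indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])         # statistically independent
```

Unlike accuracy, this score needs no label matching, since it depends only on how the two partitions overlap.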
We report the extensive experimental results in Table 2. Except SSC, LRR, CAN, and SSR, we run other methods on each kernel matrix individually. As a result, we show both the best performance among those 12 kernels and the average results over those 12 kernels for them. Based on this table, we can see that our proposed SLKE achieves the best performance in most cases. To be specific, we have the following observations:
1) Compared to the classical k-means based RKKM and spectral clustering techniques, our proposed method SLKE has a big advantage in terms of accuracy and NMI. With respect to the recently proposed SSR and TLSC methods, SLKE always obtains better results.
2) SLKE-R and SLKE-S often outperform LRR and SSC, respectively. The accuracy increased by 8.92%, 8.76% on average, respectively. That is to say, kernel-based distance approach indeed performs better than original data reconstruction technique. This verifies the importance of retaining relation information when we learn a low-dimensional representation, especially for sparse representation.
3) With respect to adaptive neighbors approach CAN, we also obtain better performance on those datasets except COIL20. For COIL20, our results are quite close to CAN’s. Therefore, compared to various similarity learning techniques, our method is very competitive.
4) Regarding low-rank and sparse representation, it is hard to conclude which one is better. It totally depends on the specific data.
Furthermore, we run t-SNE [Maaten and Hinton2008] algorithm on the JAFFE data and the reconstructed data from the best result of our SLKE-R. As shown by Figure 1, we can see that our method can well preserve the cluster structure of the data.
To see the significance of improvements, we further apply the Wilcoxon signed rank test [Peng, Cheng, and Cheng2017] to Table 2. We show the $p$-values in Table 3. We note that the testing results are under 0.05 in most cases when comparing SLKE-S and SLKE-R to other methods. Therefore, SLKE-S and SLKE-R outperform SC, RKKM, SSC, and SSR with statistical significance.
In this subsection, we investigate the influence of our model parameter $\gamma$ on the clustering results. Taking a Gaussian kernel with a fixed $t$ on the YALE and JAFFE data sets as examples, we plot our algorithm’s performance over a range of $\gamma$ in Figures 2 and 3, respectively. As we can see, our proposed methods work well for a wide range of $\gamma$.
In this paper, we present a novel similarity learning framework relying on an embedding of kernel-based distance. Our model is flexible to obtain either low-rank or sparse representation of data. Comprehensive experimental results on real data sets well demonstrate the superiority of the proposed method on the clustering task. It has great potential to be applied in a number of applications beyond clustering. It has been shown that the performance of the proposed method is largely determined by the choice of kernel function. In the future, we plan to address this issue by developing a multiple kernel learning method, which is capable of automatically learning an appropriate kernel from a pool of input kernels.
This paper was in part supported by Grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111), a 985 Project of UESTC (No. A1098531023601041) and two Fundamental Research Fund for the Central Universities of China (Nos. A03017023701012 and ZYGX2017KYQD177).
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press.
On spectral clustering: Analysis and an algorithm. NIPS 2:849–856.
IEEE transactions on neural networks and learning systems.