With the advancements in information technology, high-dimensional data have become very common. However, it is difficult to deal with high-dimensional data due to challenges such as the curse of dimensionality and high storage and computation costs. Fortunately, in practice data are not unstructured. For example, samples usually lie around low-dimensional manifolds and are highly correlated with one another Liu et al. (2013); Kang et al. (2015). This phenomenon is validated by the widely used Principal Component Analysis (PCA), where the number of principal components is much smaller than the data dimension. It is also evidenced in nonlinear manifold learning Lee and Verleysen (2007). Since dimension is closely related to the rank of a matrix, the low-rank characteristic has been shown to be very effective in studying the low-dimensional structures in data Peng et al. (2015); Zhang et al. (2016).
Sparse models are mainly applied to first-order data, such as voices and feature vectors. As an extension of sparsity for first-order data, low-rankness is a measure of sparsity for second-order data, such as images Lin and Zhang (2017). Low-rank models can effectively capture the correlation among rows and columns of a matrix, as shown in robust PCA Candès et al. (2011), matrix completion Candès and Recht (2009); Kang et al. (2016), and so on. Recently, low-rank and sparse models have shown their effectiveness in processing high-dimensional data by extracting rich low-dimensional structures despite gross corruption and outliers. Unlike traditional manifold learning, this approach often enjoys good theoretical guarantees.
When data reside near multiple subspaces, a coefficient matrix $Z$ is introduced to enforce correlation among samples. Two typical models are low-rank representation (LRR) Liu et al. (2013) and sparse subspace clustering (SSC) Elhamifar and Vidal (2013). Both LRR and SSC aim to find a coefficient matrix $Z$ by trying to reconstruct each data point as a linear combination of all the other data points, which is called the self-expression property. $Z$ is assumed to be low-rank in LRR and sparse in SSC. In the literature, $Z$ is also called the similarity matrix since it measures the similarity between samples Chen et al. (2015). LRR and SSC have achieved impressive performance in face clustering, motion segmentation, etc. In these applications, they first learn a similarity matrix $Z$ from the data by minimizing the reconstruction error. After that, they implement spectral clustering by treating $Z$ as the similarity graph matrix Peng et al. (2017). The self-expression idea has inspired a lot of work along this line. Whenever similarity among samples or features is needed, it can be used. For instance, in recommender systems, we can use it to calculate the similarity among users and items Ning and Karypis (2011); in semi-supervised classification, we can utilize it to obtain the similarity graph Li et al. (2015); in multiview learning, we can use it to characterize the connection between different views Tao et al. (2017).
More importantly, there are several benefits to obtaining the similarity matrix through self-expression. First, the most informative “neighbors” of each data point are automatically chosen, and the global structure information hidden in the data is explored Nie et al. (2014). This avoids many drawbacks of the widely used $k$-nearest-neighborhood and $\epsilon$-nearest-neighborhood graph construction methods, such as the determination of the neighbor number $k$ or the radius $\epsilon$. Second, it is independent of similarity metrics, such as cosine, Euclidean distance, and Gaussian function, which are often data-dependent and sensitive to noise and outliers Kang et al. (2017); Tang et al. (2019). Third, this automatic similarity learning from data can tackle data with structures at different scales of size and density Huang et al. (2015). Therefore, similarity learning based on low-rank and sparse modeling can not only unveil low-dimensional structure but also be robust to the uncertainties of real-world data. It considerably reduces the risk that a poorly constructed graph heavily influences the subsequent tasks Zelnik-Manor and Perona (2004).
Nevertheless, the data in various real applications are usually very complicated and can display structures beyond simply being low-rank or sparse Haeffele et al. (2014). Hence, it is essential to learn a representation that embeds the rich structure information of the original data well. Existing methods usually employ simple models, which are generally less effective and can hardly capture the rich structural information that exists in real-world data. To combat this issue, in this paper we demonstrate that it is beneficial to preserve similarity information between samples when we perform structure learning, and we design a novel term for this task. This new term measures the inconsistency between two kernel matrices, one for the raw data and another for the reconstructed data, such that the reconstructed data preserve the rich structural information of the raw data. The advantage of this approach is demonstrated in three important problems: shallow clustering, semi-supervised classification, and deep clustering.
Compared with existing work in the literature, the main contributions of this paper are as follows:
Different from current low-dimensional structure learning methods, we explicitly model the data relation by preserving the pairwise similarity of the original data with a novel term. Our approach reduces the inconsistency between the structural information of raw and reconstructed data, which leads to enhanced performance.
Our proposed structure learning framework is also applied to a deep auto-encoder. This helps to achieve a more informative and discriminative latent representation.
The effectiveness of the proposed approach is evaluated on both shallow and deep models with tasks from image clustering, document clustering, face recognition, digit/letter recognition, to visual object recognition. Comprehensive experiments demonstrate the superiority of our technique over other state-of-the-art methods.
Our method can serve as a fundamental framework, which can be readily applied to other self-expression methods. Moreover, beyond clustering and classification applications, the proposed framework can be efficiently generalized to a variety of other learning tasks.
The rest of the paper is organized as follows. Section 2 gives a brief review about two popular algorithms. Section 3 introduces the proposed technique and discusses its typical applications to spectral clustering and semi-supervised classification tasks. After that, we present a deep neural network implementation of our technique in Section 4. Clustering and semi-supervised classification experimental results and analysis are presented in Section 5 and Section 6, respectively. Section 7 validates our proposed deep clustering model. Finally, Section 8 draws conclusions.
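Notations. Given a data matrix $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ samples, we denote its $(i,j)$-th element and $i$-th column as $x_{ij}$ and $X_i$, respectively. The $\ell_2$-norm of a vector $x$ is represented as $\|x\| = \sqrt{x^\top x}$, where $\top$ is the transpose operator. The $\ell_1$-norm of $X$ is denoted by $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is defined as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The definition of $X$'s nuclear norm is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ represents the identity matrix with proper size and $\mathbf{1}$ denotes a column vector with proper length whose elements are all ones. $Z \geq 0$ means all the elements of $Z$ are nonnegative. The inner product is denoted by $\langle \cdot, \cdot \rangle$. The trace operator is denoted by $\mathrm{Tr}(\cdot)$.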
2 Related Work
In this paper, we focus on learning a new representation that characterizes the relationship between samples, namely, the pairwise similarity information. It is well known that the similarity measure is a fundamental and crucial problem in machine learning, pattern recognition, computer vision, data mining, and so on Zuo et al. (2018); Liu et al. (2015). A number of traditional approaches are often utilized in practice for convenience. As mentioned above, they often suffer from different kinds of drawbacks. The adaptive neighbors approach can learn a similarity matrix from data, but it can only capture the local structure information, and thus its clustering performance might deteriorate Nie et al. (2014).
Self-expression, another strategy, has become increasingly popular in recent years Yang et al. (2014). The basic idea is to encode each datum as a weighted combination of other samples, i.e., its direct neighbors and reachable indirect neighbors. Similar to locally linear embedding (LLE) Roweis and Saul (2000), if $x_i$ and $x_j$ are similar, the weight coefficient $z_{ij}$ should be large. From this point of view, $Z$ also behaves like a similarity matrix. For convenience, we denote the reconstructed data as $\hat{X}$, where $\hat{X} = XZ$. The discrepancy between the original data and the reconstructed data is minimized by solving the following problem:
$$\min_{Z} \; \frac{1}{2}\|X - XZ\|_F^2 + \alpha \mathcal{R}(Z), \qquad (1)$$
where $\mathcal{R}(Z)$ is a regularizer on $Z$ and $\alpha > 0$ is used to balance the effects of the two terms. Thus, we can seek either a sparse representation or a low-rank representation of the data by adopting the $\ell_1$ norm or the nuclear norm of $Z$, respectively. Since this approach can capture the global structure information hidden in the data, it has drawn significant attention and achieved impressive performance in a number of applications, including face recognition Zhang et al. (2011), subspace clustering Yao et al. (2019); Elhamifar and Vidal (2013); Feng et al. (2014), semi-supervised learning Zhuang et al. (2012), dimension reduction Lu et al. (2016), and vision learning Li et al. (2017). To consider nonlinear or manifold structure information of data, some kernel-based methods Xiao et al. (2016); Patel and Vidal (2014) and manifold learning methods Zhuang et al. (2016); Liu et al. (2014) have been developed. However, these manifold-based methods depend on labels or a graph Laplacian, which are often not available.
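As a concrete illustration of self-expression, the sketch below solves the reconstruction problem for one particular regularizer that admits a closed form. It assumes a squared Frobenius penalty, which is an illustrative choice and not the sparse or low-rank regularizer discussed above; the function name `self_expression_ridge` and the toy data are hypothetical.

```python
import numpy as np

def self_expression_ridge(X, alpha):
    """Solve min_Z 0.5*||X - X Z||_F^2 + 0.5*alpha*||Z||_F^2 in closed form.

    X has shape (m, n): m features, n samples (one sample per column).
    The squared Frobenius regularizer is an assumption made for illustration.
    """
    n = X.shape[1]
    G = X.T @ X  # Gram matrix of the samples, shape (n, n)
    # Setting the gradient to zero gives (G + alpha * I) Z = G.
    return np.linalg.solve(G + alpha * np.eye(n), G)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # toy data: 5 features, 8 samples
Z = self_expression_ridge(X, alpha=0.1)
X_hat = X @ Z                     # reconstructed data
```

Each column of `Z` holds the weights with which the other samples reconstruct one sample, so `Z` can be read as a similarity matrix in the sense described above.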
Recently, Kang et al. proposed a twin learning for similarity and clustering (TLSC) method Kang et al. (2017). TLSC performs similarity learning and clustering in a unified framework. In particular, the similarity matrix is learned via self-expression in kernel space. Consequently, it shows impressive performance in clustering tasks.
However, all existing self-expression based methods only try to reconstruct the original data, so some valuable information is largely ignored. In practice, the low-dimensional manifold structure of real data is often very complicated and exhibits structure beyond being low-rank or sparse Haeffele et al. (2014). Exploiting data relations has been shown to be a promising means to discover the underlying structure in a number of techniques Tenenbaum et al. (2000); Roweis and Saul (2000). For instance, ISOMAP Tenenbaum et al. (2000) retains the geodesic distance between pairs of data points in the low-dimensional space. LLE Roweis and Saul (2000) learns a low-dimensional manifold by preserving the linear relation, i.e., each data point is a linear combination of its neighbors. To seek a low-dimensional manifold, Laplacian Eigenmaps Belkin and Niyogi (2002) minimizes the weighted pairwise distance in the projected space, where the weight characterizes the pairwise relation in the original space.
In this paper, we demonstrate how to integrate similarity information into the construction of new representation of data, resulting in a significant improvement on two fundamental tasks, i.e., clustering and semi-supervised classification. More importantly, the proposed idea can be readily applied to other self-expression methods such as smooth representation Hu et al. (2014), least squared representation Lu et al. (2012), and many applications, e.g., Occlusion Removing Qian et al. (2015), Saliency Detection Lang et al. (2012), Image Segmentation Cheng et al. (2011).
3 Proposed Formulation
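To make our framework more general, we build our model in kernel space. Eq. (1) can be easily extended to a kernel representation through a mapping $\phi$. By utilizing the kernel trick $K = \phi(X)^\top \phi(X)$, we have
$$\min_{Z} \; \frac{1}{2}\mathrm{Tr}\left(K - 2KZ + Z^\top K Z\right) + \alpha \mathcal{R}(Z). \qquad (2)$$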
In this paper, we aim to preserve the similarity information of the original data. To this end, we make use of the widely used inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data $X$ and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space. In other words, we have
$$\min_{Z} \; \left\|\phi(X)^\top \phi(X) - (\phi(X)Z)^\top (\phi(X)Z)\right\|_F^2. \qquad (3)$$
Eq. (3) can be simplified as
$$\min_{Z} \; \left\|K - Z^\top K Z\right\|_F^2. \qquad (4)$$
Comparing Eq. (4) to Eq. (2), we can see that Eq. (4) involves higher-order terms of $Z$. Thus, Eq. (4) captures high-order information of the original data. Although we describe our method as preserving similarity information, it also preserves dissimilarity, so it preserves the relations between samples in general. Combining (4) with (2), we obtain our Structure Learning with Similarity Preserving (SLSP) framework:
$$\min_{Z} \; \frac{1}{2}\mathrm{Tr}\left(K - 2KZ + Z^\top K Z\right) + \alpha \mathcal{R}(Z) + \beta \left\|K - Z^\top K Z\right\|_F^2, \qquad (5)$$
where $\beta > 0$ weights the similarity preserving term.
By solving this problem, we can obtain either a low-rank or a sparse matrix $Z$, which carries rich structure information of the original data. Besides this, SLSP enjoys several other nice properties:
(1) Our framework can not only capture global structure information but also preserve, in the embedding space, the pairwise similarities between the original data points. If a linear kernel function is adopted in (5), our framework can recover linear structure information hidden in the data.
(2) Our proposed technique is particularly suitable to problems that are sensitive to sample similarity, such as clustering Elhamifar and Vidal (2013), classification Zhuang et al. (2012), users/items similarity in recommender systems Ning and Karypis (2011), patient/drug similarity in healthcare informatics Zhu et al. (2016). We believe that our framework can effectively model and extract rich low-dimensional structures in high-dimensional data such as images, documents, and videos.
(3) The input is a kernel matrix. This is an appealing property, as not all types of real-world data can be represented as numerical feature vectors. For example, we often find clusters of proteins based on their structures and group users in social media according to their friendship relations.
(4) Generic similarity rather than inner product can also be used to construct (4) given that the resulting optimization problem is still solvable. It means that similarity measures that reflect domain knowledge such as Popescu et al. (2006) can be incorporated in SLSP directly. Even dissimilarity measures can be included in this algorithm. This flexibility extends the range of applicability of SLSP.
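To make the roles of the two data terms concrete, the following sketch evaluates an SLSP-style objective for a given kernel matrix and coefficient matrix. It assumes that the reconstruction term is the kernelized trace form, that the similarity preserving term is the squared Frobenius distance between `K` and `Z.T @ K @ Z`, and that the latter is weighted by `beta` without an extra constant factor; the function name and the weights used in the demo are hypothetical.

```python
import numpy as np

def slsp_objective(K, Z, alpha, beta, regularizer="nuclear"):
    """Return (total, reconstruction term, similarity preserving term)."""
    ZtKZ = Z.T @ K @ Z
    # Kernelized reconstruction error: 0.5 * Tr(K - 2 K Z + Z^T K Z).
    recon = 0.5 * np.trace(K - 2.0 * K @ Z + ZtKZ)
    if regularizer == "nuclear":      # low-rank choice
        reg = np.linalg.norm(Z, ord="nuc")
    elif regularizer == "l1":         # sparse choice
        reg = np.abs(Z).sum()
    else:
        raise ValueError("regularizer must be 'nuclear' or 'l1'")
    # Inconsistency between the kernel of raw and of reconstructed data.
    sim = np.linalg.norm(K - ZtKZ, ord="fro") ** 2
    return recon + alpha * reg + beta * sim, recon, sim

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))       # 4 features, 6 samples
K = X.T @ X                           # linear kernel
Z = rng.standard_normal((6, 6))
total, recon, sim = slsp_objective(K, Z, alpha=0.1, beta=0.5)
```

With a linear kernel, `recon` coincides with `0.5 * ||X - X Z||_F^2`, which is a quick way to check the kernelized form numerically.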
Although the SLSP problem can be solved in several different ways, we describe an approach based on the alternating direction method of multipliers (ADMM), which is easy to understand. Since the objective function in (5) is a fourth-order function of $Z$, ADMM can lower its order by introducing auxiliary variables.
First, we rewrite (5) in the following equivalent form by introducing three new variables:
Then its augmented Lagrangian function can be written as:
Here, the augmented Lagrangian involves a penalty parameter and three Lagrangian multipliers, one for each equality constraint. We can update the variables alternately, one at each step, while keeping the others fixed. This yields the following updating rules.
Updating : By removing the irrelevant terms, we arrive at:
It can be seen that it is a strongly convex quadratic function and can be solved by setting its first derivative to zero, so
Updating : For , we are to solve:
By setting its first derivative to zero, we obtain
Updating : We fix other variables except , the objective function becomes:
Similar to , it yields
Updating : For , the subproblem is:
where . Depending on regularization strategy, we have different closed-form solutions for
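the updated variable. Let us write the singular value decomposition (SVD) of the matrix to be thresholded as $U \mathrm{diag}(\sigma) V^\top$. Then, for the low-rank representation, i.e., $\mathcal{R}(\cdot) = \|\cdot\|_*$, we have Cai et al. (2010),
To obtain a sparse representation, i.e., $\mathcal{R}(\cdot) = \|\cdot\|_1$, we can update the variable element-wise as Beck and Teboulle (2009):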
For clarity, the complete algorithm to solve problem (6) is summarized in Algorithm 1. We stop the algorithm if the maximum iteration number of 300 is reached or the relative change of $Z$ falls below a preset tolerance.
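The two closed-form updates for the regularized subproblem are the standard proximal operators of the nuclear norm and of the $\ell_1$ norm. The sketch below implements them generically; the threshold `tau` and the input matrix `D` are placeholders, since how they are formed from the penalty parameter and the multipliers depends on the subproblem and is not restated here.

```python
import numpy as np

def singular_value_threshold(D, tau):
    """Minimizer of tau*||J||_* + 0.5*||J - D||_F^2 (singular value shrinkage)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Shrink every singular value by tau and clip at zero.
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(D, tau):
    """Minimizer of tau*||J||_1 + 0.5*||J - D||_F^2 (element-wise shrinkage)."""
    return np.sign(D) * np.maximum(np.abs(D) - tau, 0.0)

D = np.array([[3.0, -0.5],
              [-2.0, 1.0]])
J_sparse = soft_threshold(D, 1.0)
J_lowrank = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
```

Soft thresholding costs one pass over the entries, whereas singular value thresholding needs an SVD in every iteration, which is why the low-rank variant dominates the per-iteration cost.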
3.2 Complexity Analysis
First, the construction of the kernel matrix requires $O(n^2)$ kernel evaluations. The computational cost of Algorithm 1 is mainly determined by the three variable updates that involve matrix inversion and multiplication of $n \times n$ matrices, whose complexity is $O(n^3)$. For large-scale data sets, we might alleviate this by resorting to approximation techniques or tricks, e.g., the Woodbury matrix identity. In addition, the complexity of the remaining variable update depends on the choice of regularizer. For the low-rank representation, it requires an SVD in every iteration, whose complexity is $O(rn^2)$ if we employ a partial SVD ($r$ is the lowest rank we can find), which can be achieved by a package such as PROPACK. The complexity of obtaining a sparse solution is $O(n^2)$. Updating the three Lagrangian multipliers costs $O(n^2)$.
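As an illustration of how the Woodbury matrix identity can reduce the cost of an inversion, the sketch below inverts a matrix of the form `mu * I + B @ B.T`, where `B` has only `r` columns. That the matrices inverted in Algorithm 1 can be brought into this form (for instance through a low-rank approximation of the kernel matrix) is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np

def inv_scaled_identity_plus_lowrank(B, mu):
    """Inverse of mu*I + B B^T via the Woodbury identity.

    B has shape (n, r) with r << n, so only an r x r system is solved.
    """
    n, r = B.shape
    small = np.eye(r) + (B.T @ B) / mu            # r x r system matrix
    return np.eye(n) / mu - B @ np.linalg.solve(small, B.T) / mu**2

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 3))
mu = 2.0
fast = inv_scaled_identity_plus_lowrank(B, mu)
direct = np.linalg.inv(mu * np.eye(50) + B @ B.T)
```

The solve involves an `r x r` matrix instead of an `n x n` one, so the dominant cost drops from cubic in `n` to roughly `n^2 * r` for forming the dense result.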
3.3 Application of Similarity Matrix
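One typical application of $Z$ is spectral clustering, which builds the graph Laplacian based on pairwise similarities between data points. Specifically, $L = D - Z$, where $D$ is a diagonal matrix with $i$-th diagonal element $\sum_j z_{ij}$. Spectral clustering solves the following problem:
$$\min_{P} \; \mathrm{Tr}(P^\top L P) \quad \text{s.t.} \quad P^\top P = I, \qquad (17)$$
where $P$ is the cluster indicator matrix.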
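The relaxed spectral clustering problem is solved by the eigenvectors of the graph Laplacian associated with the smallest eigenvalues. The sketch below computes this embedding from a learned coefficient matrix; symmetrizing with absolute values before building the Laplacian is an assumption added here so that the affinity is symmetric and nonnegative, and the function name is hypothetical.

```python
import numpy as np

def spectral_embedding(Z, n_clusters):
    """Rows of the result are the spectral coordinates of the samples."""
    W = 0.5 * (np.abs(Z) + np.abs(Z.T))   # symmetric affinity (assumption)
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, :n_clusters]

# Toy coefficient matrix with two disconnected groups of three samples.
Z_demo = np.kron(np.eye(2), np.ones((3, 3)))
emb = spectral_embedding(Z_demo, n_clusters=2)
```

Cluster labels are then typically obtained by running $k$-means on the rows of `emb`; in this toy example the rows are identical within each group, so any reasonable clustering of the rows separates the two groups.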
Another classical task that makes use of $Z$ is semi-supervised classification. In the past decade, graph-based semi-supervised learning (GSSL) has attracted considerable attention due to its elegant formulation and low computational complexity Cheng et al. (2009). Similarity graph construction is one of the two fundamental components in GSSL and is critical to the quality of classification. Nevertheless, compared with label inference, graph construction has attracted much less attention until recent years Berton and De Andrade Lopes (2015).
After we obtain $Z$, we can adopt the popular local and global consistency (LGC) method as the classification framework Zhou et al. (2004). LGC finds a classification function $F \in \mathbb{R}^{n \times c}$ by solving the following problem:
where $c$ is the class number and $Y \in \mathbb{R}^{n \times c}$ is the label matrix, in which $y_{ij} = 1$ iff the $i$-th sample belongs to the $j$-th class, and $y_{ij} = 0$ otherwise.
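A minimal sketch of graph-based label inference in the spirit of LGC is given below. It assumes the objective `Tr(F^T L F) + gamma * ||F - Y||_F^2` with an unnormalized Laplacian, which may differ from the exact formulation used in the paper; the trade-off `gamma` and the function name are hypothetical.

```python
import numpy as np

def lgc_predict(W, Y, gamma=1.0):
    """Closed-form label inference on a similarity graph W.

    Minimizes Tr(F^T L F) + gamma * ||F - Y||_F^2 (assumed objective),
    whose stationarity condition is (L + gamma * I) F = gamma * Y.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    F = np.linalg.solve(L + gamma * np.eye(n), gamma * Y)
    return F.argmax(axis=1), F

# Two groups of three samples; only samples 0 and 3 carry a label.
W = np.kron(np.eye(2), np.ones((3, 3)))
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # sample 0 belongs to class 0
Y[3, 1] = 1.0   # sample 3 belongs to class 1
pred, F = lgc_predict(W, Y, gamma=1.0)
```

Rows of `Y` for unlabeled samples are all zero, so their predicted class is determined entirely by the labels that propagate to them through the graph.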
4 Extension to Deep Model
The proposed objective function in Eq. (5) can discover the structure in the input space. However, it has limited representational power. On the other hand, the deep auto-encoder Hinton and Salakhutdinov (2006) and its variants Vincent et al. (2010); Masci et al. (2011) can learn the structure of data in a nonlinear feature space, but they ignore the geometry of the data when learning representations. Learning useful representations for a specific task remains a key challenge Bengio et al. (2013). In this paper, we propose the idea of similarity preserving for structure learning. Therefore, it is appealing to get the best of both worlds by implementing our SLSP framework within an auto-encoder. As we show later, the proposed similarity preserving regularizer indeed enhances the performance of the auto-encoder.
4.1 Model Formulation
Ji et al. (2017) proposed a deep subspace clustering model with the capability of similarity learning. Inspired by it, we introduce a self-expression layer into the deep auto-encoder architecture. Without bias and activation function, this fully connected layer encodes the notion of self-expression. In other words, the weights of this layer form the self-expression coefficient matrix. In addition, kernel mapping is no longer needed since we transform the input data with a neural network. The architecture implementing our model is depicted in Figure 1. As we can see, the input data is first transformed into a latent representation, then self-expressed by a fully connected layer, and finally mapped back onto the original space.
The decoder outputs the recovered data. We take each data point as a node in the network. Let the network parameters consist of the encoder parameters, the self-expression layer parameters, and the decoder parameters. Then, the latent representation is a function of the encoder parameters, and the recovered data is a function of all three groups of parameters. Eventually, we reach our loss function for Deep SLSP (DSLSP) as:
The first term denotes the traditional reconstruction loss, which guarantees the recovery performance, so that the latent representation retains as much of the original information as possible. With the reconstruction performance guaranteed, the latent representation can be treated as a good representation of the input data. The second term is the self-expression as in Eq. (1). The fourth term is the key component, which functions as similarity preserving. For simplicity, it is implemented by dot product. This is also motivated by the fact that our input data points have experienced a series of highly non-linear transformations produced by the encoder.
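The structure of such a four-term loss can be sketched as follows, given the input, the decoder output, the latent codes, and the self-expression weights as plain arrays. The third term is assumed to be a norm regularizer on the self-expression weights, and the weights `lam1`, `lam2`, `lam3` as well as all names are hypothetical; a real implementation would express the same terms in an automatic differentiation framework.

```python
import numpy as np

def dslsp_loss(X, X_rec, H, C, lam1, lam2, lam3, norm="l2"):
    """Four-term loss: reconstruction, self-expression, regularizer, similarity.

    X, X_rec: input and decoder output, shape (d, n).
    H: latent codes, shape (p, n), one sample per column.
    C: self-expression weights, shape (n, n).
    """
    recon = 0.5 * np.sum((X - X_rec) ** 2)           # term 1
    HC = H @ C
    self_expr = np.sum((H - HC) ** 2)                # term 2
    if norm == "l2":                                 # term 3 (assumed form)
        reg = np.sum(C ** 2)
    else:
        reg = np.abs(C).sum()
    # Term 4: dot-product similarities before and after self-expression.
    sim = np.sum((H.T @ H - HC.T @ HC) ** 2)
    return recon + lam1 * self_expr + lam2 * reg + lam3 * sim

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
H = rng.standard_normal((3, 4))
loss = dslsp_loss(X, X, H, np.eye(4), lam1=1.0, lam2=0.5, lam3=1.0)
```

With a perfect reconstruction and identity self-expression weights, only the regularizer contributes, which illustrates why the regularizer is needed to rule out the trivial solution.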
5 Shallow Clustering Experiment
In this section, we conduct clustering experiments on images and documents with shallow models.
Table 1: Statistics of the data sets (columns: # instances, # features, # classes).
We conduct experiments on nine popular data sets. Their statistics are summarized in Table 1. Specifically, the first five data sets include three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. Tr11, Tr41, and Tr45 are derived from the NIST TREC Document Database. The TDT2 corpus has been among the ideal test sets for document clustering purposes.
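Following the setting in Du et al. (2015), we design 12 kernels. They are: seven Gaussian kernels of the form $K(x, y) = \exp\left(-\|x - y\|_2^2 / (t\, d_{\max}^2)\right)$ with seven different values of $t$, where $d_{\max}$ is the maximal distance between data points; a linear kernel $K(x, y) = x^\top y$; and four polynomial kernels of the form $K(x, y) = (a + x^\top y)^b$ with different values of $a$ and $b$. Besides, all kernels are normalized to the $[0, 1]$ range, which is done by dividing each element by the largest pairwise squared distance Du et al. (2015).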
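A kernel bank of this kind can be assembled as sketched below. The functional forms follow common practice, and the specific parameter grids in `ts` and `poly_params` are placeholder values chosen only so that the bank contains seven Gaussian, one linear, and four polynomial kernels; they are assumptions and are not taken from the text above. The normalization step is not included.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between the columns of X (shape (m, n))."""
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.maximum(D2, 0.0)        # clip tiny negatives from round-off

def build_kernel_bank(X,
                      ts=(0.01, 0.05, 0.1, 1, 10, 50, 100),       # placeholder grid
                      poly_params=((0, 2), (0, 4), (1, 2), (1, 4))):  # placeholder grid
    D2 = pairwise_sq_dists(X)
    d2_max = D2.max()                 # squared maximal pairwise distance
    G = X.T @ X
    kernels = [np.exp(-D2 / (t * d2_max)) for t in ts]   # Gaussian kernels
    kernels.append(G)                                    # linear kernel
    kernels += [(a + G) ** b for a, b in poly_params]    # polynomial kernels
    return kernels

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 10))      # 4 features, 10 samples
kernels = build_kernel_bank(X)
```

Each entry of `kernels` is an `n x n` matrix that can be passed to a kernel-based method in place of the raw data.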
5.2 Comparison Methods
To fully investigate the performance of our method on clustering, we choose a good set of methods to compare. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.
Spectral Clustering (SC) Ng et al. (2002): SC is a widely used clustering technique. It enjoys the advantage of exploring the intrinsic data structures. However, how to construct a good similarity graph is an open issue. Here, we directly use the kernel matrix as its input. For our proposed SLSP method, we obtain clustering results by performing spectral clustering with our learned $Z$.
Simplex Sparse Representation (SSR) Huang et al. (2015): Based on sparse representation, SSR achieves satisfying performance in numerous data sets.
Kernelized LRR (KLRR) Xiao et al. (2016): Based on self-expression, low-rank representation has achieved great success on a number of applications. Kernelized LRR deals with nonlinear data and demonstrates better performance than LRR in many tasks.
Kernelized SSC (KSSC) Patel and Vidal (2014): Kernelized version of SSC has also been proposed to capture nonlinear structure information in the input space. Since our framework is an extension of KLRR and KSSC to preserve similarity information, the difference in performance will shed light on the effects of similarity preserving.
Twin Learning for Similarity and Clustering (TLSC) Kang et al. (2017): Based on self-expression, TLSC has been proposed recently and has shown superior performance on a number of real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space but also has an optimal similarity graph guarantee. However, it fails to preserve similarity information.
SLKE-S and SLKE-R Kang et al. (2019): They are closely related to the method developed in this paper. However, they only have the similarity preserving term, which might lose some low-order information.
Our proposed SLSP: Our proposed structure learning framework with similarity preserving capability. After obtaining the similarity matrix $Z$, we perform spectral clustering based on Eq. (17). We examine both the low-rank and the sparse regularizer and denote the corresponding methods as SLSP-r and SLSP-s, respectively. The implementation of our algorithm is publicly available at https://github.com/sckangz/L2SP.
5.3 Evaluation Metrics
To quantify the effectiveness of our algorithm on the clustering task, we use two popular metrics, i.e., accuracy (Acc) and normalized mutual information (NMI) Cai et al. (2009).
As the most widely used clustering metric, Acc measures the one-to-one relationship between clusters and classes. If we use $\hat{y}_i$ and $y_i$ to represent the clustering partition and the ground truth label of sample $x_i$, respectively, then we can define Acc as
$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta\left(y_i, \mathrm{map}(\hat{y}_i)\right)}{n},$$
where $n$ is the total number of instances, $\delta(\cdot)$ is the delta function, which equals one if its two arguments are equal and zero otherwise, and map($\cdot$) maps each cluster index to a true class label based on the Kuhn-Munkres algorithm Chen et al. (2001).
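The accuracy computation can be sketched as follows. For brevity, the best cluster-to-class mapping is found here by brute force over all one-to-one assignments, which is a substitute for the Kuhn-Munkres algorithm and is only practical for a handful of clusters; the function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples matched under the best one-to-one cluster mapping."""
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    # Pad with dummy targets so every cluster can receive a distinct target.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
acc_perfect = clustering_accuracy(y_true, [1, 1, 0, 0, 2, 2])
acc_one_error = clustering_accuracy(y_true, [1, 1, 0, 2, 2, 2])
```

For larger numbers of clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the cluster-by-class contingency table yields the same optimum in polynomial time.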
5.4 Clustering Results
We report the experimental results in Table 2.
As we can see, our method outperforms the others in almost all experiments. Concretely, we can draw the following conclusions:
(i) The improvements of SLSP over SC verify the importance of a high-quality similarity measure. Rather than directly using the kernel matrix in SC, we use the learned $Z$ as the input of SC. Hence, the big improvement comes entirely from our high-quality similarity measure;
(ii) Comparing SLSP-s with KSSC and SLSP-r with KLRR, we can see the benefit of retaining similarity structure information. In particular, for the TDT2 data set, SLSP-s improves the accuracy of KSSC by 25.16%;
(iii) It is worth pointing out our big gain over the recently proposed method TLSC. Although both SLSP and TLSC are based on self-expression and kernel methods, TLSC fails to preserve similarity information, which might be lost during the reconstruction process;
(iv) With respect to SLKE-S and SLKE-R, which have the effect of similarity preserving, our method still outperforms them in most cases. This is attributed to the fact that the first term in Eq. (5) keeps some low-order information, which is missing in SLKE-S and SLKE-R. We can observe that SLSP-r improves on the accuracy of SLKE-R by more than 6% on the ORL, BA, and Tr11 data sets.
In summary, these results confirm the crucial role of similarity measure in clustering and the great benefit due to similarity preserving.
5.5 Parameter Analysis
There are two parameters in our model: $\alpha$ and $\beta$. Taking the YALE data set as an example, we demonstrate the sensitivity of SLSP-r and SLSP-s to $\alpha$ and $\beta$ in Figures 2 and 3. They illustrate that our methods are quite insensitive to $\alpha$ and $\beta$ over wide ranges of values.
6 Semi-supervised Classification Experiment
In this section, we show that our method performs well on semi-supervised classification task based on Eq.(18).
We perform experiments on different types of recognition tasks.
(1) Evaluation on Face Recognition: We examine the effectiveness of our similarity graph learning for face recognition on two frequently used face databases: YALE and JAFFE. The YALE face data set contains 15 individuals, and each person has 11 near-frontal images taken under different illuminations. Each image is resized to 32×32 pixels. The JAFFE face database consists of 10 individuals, and each subject has 7 different facial expressions (6 basic facial expressions + 1 neutral). The images are resized to 26×26 pixels.
(2) Evaluation on Digit/Letter Recognition: In this experiment, we address the digit/letter recognition problem on the BA database. The data set consists of the digits “0” through “9” and the capital letters “A” to “Z”, which leads to 36 classes. Moreover, each class has 39 samples.
(3) Evaluation on Visual Object Recognition: We conduct the visual object recognition experiment on the COIL20 database. The database consists of 20 objects and 72 images for each object. For each object, the images were taken 5 degrees apart as the object rotated on a turntable. The size of each image is 32×32 pixels.
To reduce the workload, we construct 7 kernels for each data set. They include four Gaussian kernels with different bandwidth parameters, a linear kernel, and two polynomial kernels.
6.2 Comparison Methods
We compare our method with several other state-of-the-art algorithms.
Local and Global Consistency (LGC) Zhou et al. (2004): LGC is a popular label propagation method. For this method, we use the kernel matrix as its similarity measure to compute the graph Laplacian.
Gaussian Field and Harmonic function (GFHF) Zhu et al. (2003): Different from LGC, GFHF is another mechanism that infers the unknown labels by propagating labels through the pairwise similarity.
Semi-supervised Classification with Adaptive Neighbours (SCAN) Nie et al. (2017): Based on the adaptive neighbors method, SCAN adds a rank constraint to ensure that the learned graph has exactly $c$ connected components. As a result, the similarity matrix and class indicator matrix are learned simultaneously. It shows much better performance than many other techniques.
A Unified Optimization Framework for Semi-supervised Learning Li et al. (2015): Li et al. propose a unified framework based on the self-expression approach. Similar to SCAN, the similarity matrix and class indicator matrix are updated alternately. By using low-rank and sparse regularizers, they obtain the SLRR and SR methods, respectively.
Our Proposed SLSP: After we obtain $Z$ from SLSP-r and SLSP-s, we plug it into the LGC algorithm to predict labels for unlabeled data points.
6.3 Experimental Setup
The commonly used evaluation measure, accuracy, is adopted here. It is defined as the number of correctly predicted samples divided by the total number of samples. We randomly choose a portion of the samples as labeled data and repeat this 20 times. In our experiment, 10%, 30%, and 50% of the samples in each class are randomly selected and labeled. The resulting classification accuracy and deviation are shown in Table 3. For GFHF, LGC, KLRR, KSSC, and our proposed SLSP method, the aforementioned seven kernels are tested and the best performance is reported. More importantly, for these methods the label information is only used in the label propagation stage. For SCAN, SLRR, and SR, the label prediction and similarity learning are conducted in a unified framework, which often leads to better performance.
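The evaluation protocol can be sketched as follows: a fixed fraction of each class is labeled at random, a classifier predicts the remaining labels, and the accuracy is averaged over repeated draws. The `predict` callable is a hypothetical stand-in for any label propagation method, and measuring accuracy on the unlabeled samples only is an assumption of this sketch.

```python
import random
import statistics

def sample_labeled_indices(labels, fraction, rng):
    """Pick `fraction` of the samples of every class (at least one per class)."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    chosen = []
    for idxs in by_class.values():
        k = max(1, round(fraction * len(idxs)))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

def repeated_accuracy(labels, fraction, predict, n_repeats=20, seed=0):
    """Mean and standard deviation of accuracy over random labeled subsets."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_repeats):
        labeled = set(sample_labeled_indices(labels, fraction, rng))
        pred = predict(labeled)       # predicted labels for all samples
        unlabeled = [i for i in range(len(labels)) if i not in labeled]
        correct = sum(pred[i] == labels[i] for i in unlabeled)
        accs.append(correct / len(unlabeled))
    return statistics.mean(accs), statistics.pstdev(accs)

labels = [0] * 10 + [1] * 10
# An oracle predictor is used only to exercise the protocol.
mean_acc, std_acc = repeated_accuracy(labels, 0.3, predict=lambda labeled: labels)
```

Sampling per class keeps the labeled set balanced across classes, which matters when only 10% of the samples are labeled.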
As expected, the classification accuracy for all methods monotonically increases with the increase of the percentage of labeled samples. As it can be observed, our SLSP method consistently outperforms other state-of-the-art methods. This confirms the effectiveness of our proposed method. Specifically, we have the following observations:
(i) By comparing the performance of our proposed SLSP with LGC, we can clearly see the importance of graph construction in semi-supervised learning. On the COIL20 data set, the average improvement of SLSP-s and SLSP-r over LGC is 11.67% and 10.31%, respectively. In our experiments, LGC directly uses the kernel matrix as input, while our method uses the learned similarity matrix in LGC instead. Hence, the improvements are attributed to our high-quality graph construction;
(ii) The superiority of SLSP-s and SLSP-r over KSSC and KLRR, respectively, derives from our consideration of the similarity preserving effect. The improvement is considerable especially when the portion of labeled samples is small, which suggests that our method is promising in real situations. With 10% labeling, for example, the average gain is 7.69% and 6% for the sparse and low-rank representation, respectively;
(iii) Although SCAN, SLRR, and SR can learn the similarity matrix and labels simultaneously, our two-step approach still reaches a higher recognition rate. This implies that our proposed method can produce a more accurate similarity graph than existing techniques without explicit similarity preserving capability.
7 Deep Clustering Experiment
To demonstrate the effect of the deep model DSLSP, we follow the settings in Ji et al. (2017) and perform clustering on the Extended Yale B (EYaleB), ORL, COIL20, and COIL40 datasets. We compare with LRR Liu et al. (2013), Low Rank Subspace Clustering (LRSC) Vidal and Favaro (2014), SSC Elhamifar and Vidal (2013), Kernel Sparse Subspace Clustering (KSSC) Patel and Vidal (2014), SSC by Orthogonal Matching Pursuit (SSC-OMP) You et al. (2016), Efficient Dense Subspace Clustering (EDSC) Ji et al. (2014), SSC with pre-trained convolutional auto-encoder features (AE+SSC), Deep Embedding Clustering (DEC) Xie et al. (2016), Deep $k$-means (DKM) Fard et al. (2018), Deep Subspace Clustering Network with $\ell_1$ norm (DSC-Net-L1) Ji et al. (2017), and Deep Subspace Clustering Network with $\ell_2$ norm (DSC-Net-L2) Ji et al. (2017). For a fair comparison with DSC-Nets, we adopt the $\ell_1$ and $\ell_2$ norms, respectively, using the same network architectures; the resulting models are denoted as DSLSP-L1 and DSLSP-L2. We adopt convolutional neural networks (CNNs) to implement the auto-encoder. Adam Kingma and Ba (2014) is employed for optimization. The full batch of the dataset is fed to our network. We pre-train the network without the self-expression layer. The details of the network structures are shown in Table 4.
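As a rough illustration of what the L1 and L2 variants change, the sketch below evaluates a generic self-expression objective on latent codes, 0.5·||H − CH||_F² + λ·R(C), with R chosen as either the ℓ1 norm or the squared ℓ2 (Frobenius) norm. The names `H`, `C`, `lam`, and both functions are hypothetical; the similarity preserving term of Eq. (19) is deliberately omitted because that equation is not reproduced in this section, so this is not the full DSLSP loss.

```python
import numpy as np

def self_expression_loss(H, C, lam=1.0, norm="l2"):
    """Generic self-expression objective on latent codes.

    H    : (n, d) latent representations, one sample per row.
    C    : (n, n) coefficient matrix; row i reconstructs sample i from all samples.
    lam  : regularization weight (illustrative).
    norm : "l1" for sum of absolute values, "l2" for squared Frobenius norm.
    """
    H = np.asarray(H, dtype=float)
    C = np.asarray(C, dtype=float)
    recon = 0.5 * np.sum((H - C @ H) ** 2)
    if norm == "l1":
        reg = np.abs(C).sum()
    elif norm == "l2":
        reg = np.sum(C ** 2)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return recon + lam * reg

def regularizer_grad(C, norm="l2"):
    """Gradient of the regularizer with respect to C.

    The l2 gradient 2C is defined everywhere. For l1, sign(C) is only a
    subgradient: at zero any value in [-1, 1] is valid, and np.sign picks 0.
    """
    C = np.asarray(C, dtype=float)
    return np.sign(C) if norm == "l1" else 2.0 * C

H = np.eye(3)
loss_zero = self_expression_loss(H, np.zeros((3, 3)), lam=1.0, norm="l2")
loss_identity = self_expression_loss(H, np.eye(3), lam=1.0, norm="l1")
print(loss_zero, loss_identity)
```

The trivial solution C = I reconstructs perfectly, which is why the regularizer (and, in ℓ1-based methods, often a zero-diagonal constraint) is needed.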
The clustering performance of different methods is provided in Table 5. We observe that DSLSP-L2 and DSLSP-L1 achieve very good performance. Specifically, we have the following observations:
The $\ell_2$ norm performs slightly better than the $\ell_1$ norm. This is consistent with the results in Ji et al. (2017). A possible cause is inaccurate optimization under the $\ell_1$ norm, since it is non-differentiable at zero.
Since they share the same network for latent representation learning, the improvement of DSLSP over DSC-Net is attributed to our similarity preserving mechanism. Note that the only difference between their objective functions is the additional similarity preserving term in Eq. (19). For example, on COIL20, DSLSP-L2 improves over DSC-Net-L2 by 3.89% and 3.32% in terms of accuracy and NMI, respectively. On COIL40, our method with the $\ell_2$ norm outperforms DSC-Net-L2 by 3.42% in accuracy and 3.26% in NMI.
to 0.8775 and 0.9757, respectively. Once again, this demonstrates the power of deep learning models. Furthermore, for these two datasets, our results in Table 2 are also better than those of the shallow methods and AE+SSC in Table 5. This further verifies the advantages of our similarity preserving approach.
Compared to DEC and DKM, our method improves the performance significantly. This is because our method is based on similarity, whereas those methods are based on Euclidean distance, which is not suitable for complex data.
In summary, the above observations indicate the superiority of our proposed similarity preserving term in both shallow and deep models.
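The accuracy and NMI values discussed above are invariant to how cluster indices are numbered. A minimal sketch of how such scores are commonly computed is given below, assuming best-map accuracy via Hungarian matching and NMI normalized by the geometric mean of the two entropies; this section does not state which NMI normalization is used, so that choice and all function names are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(y_true, y_pred):
    """Count matrix with predicted clusters as rows and true classes as columns."""
    _, t = np.unique(np.asarray(y_true), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred), return_inverse=True)
    counts = np.zeros((p.max() + 1, t.max() + 1), dtype=float)
    np.add.at(counts, (p, t), 1.0)
    return counts

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correctly assigned under the best one-to-one
    mapping between predicted clusters and true classes."""
    counts = contingency(y_true, y_pred)
    # Hungarian algorithm minimizes cost, so negate the counts to maximize matches.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / counts.sum()

def nmi(y_true, y_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed variant)."""
    joint = contingency(y_true, y_pred)
    joint /= joint.sum()
    p_pred = joint.sum(axis=1)
    p_true = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_pred, p_true)[nz]))
    h_true = -np.sum(p_true * np.log(p_true))
    h_pred = -np.sum(p_pred * np.log(p_pred))
    denom = np.sqrt(h_true * h_pred)
    if denom == 0:
        # Both partitions trivial: identical; only one trivial: no shared information.
        return 1.0 if (h_true == 0 and h_pred == 0) else 0.0
    return mi / denom

y_true = [0, 0, 1, 1, 2, 2]
perfect_but_permuted = [1, 1, 2, 2, 0, 0]
one_mistake = [0, 0, 1, 1, 2, 1]
acc_perfect = clustering_accuracy(y_true, perfect_but_permuted)
nmi_perfect = nmi(y_true, perfect_but_permuted)
acc_mistake = clustering_accuracy(y_true, one_mistake)
print(acc_perfect, nmi_perfect, acc_mistake)
```

A permuted but otherwise perfect clustering scores 1 on both metrics, which is why a matching step is needed before comparing predicted and ground-truth labels.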
In this paper, we introduce a new structure learning framework that is capable of obtaining a highly informative similarity graph for clustering and semi-supervised methods. Unlike existing low-dimensional structure learning techniques, it uses a novel term designed to take advantage of pairwise sample similarity information in the learning stage. In particular, by incorporating the similarity preserving term, which tends to keep the similarities between samples, into our objective function, our method consistently and significantly improves clustering and classification accuracy. Therefore, we conclude that our framework better captures the geometric structure of the data, resulting in a more informative and discriminative similarity graph. Moreover, our method can be easily extended to other self-expression based methods. In the future, we plan to further investigate efficient algorithms for constructing large-scale similarity graphs. In addition, current methods conduct label learning after graph construction. It would be interesting to develop a principled method that solves the graph construction and label learning problems at the same time.
This paper was supported in part by grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2017KYQD177).
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (1), pp. 183–202. Cited by: §3.1.
- Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pp. 585–591. Cited by: §2.
- Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §4.
- Graph construction for semi-supervised learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 4343–4344. Cited by: §3.3.
- Locality preserving nonnegative matrix factorization. In IJCAI, Vol. 9, pp. 1010–1015. Cited by: §5.3.
- A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), pp. 1956–1982. Cited by: §3.1.
- Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §1.
- Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717. Cited by: §1.
- Atomic decomposition by basis pursuit. SIAM review 43 (1), pp. 129–159. Cited by: §5.3.
- Similarity learning of manifold data. IEEE transactions on cybernetics 45 (9), pp. 1744–1756. Cited by: §1.
- Multi-task low-rank affinity pursuit for image segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2439–2446. Cited by: §2.
- Sparsity induced similarity measure for label propagation. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 317–324. Cited by: §3.3.
- Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §1.
- Robust multiple kernel k-means using l21-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3476–3482. Cited by: item 2, §5.1.
- Sparse and redundant representations: from theory to applications in signal and image processing. 1st edition, Springer Publishing Company, Incorporated. Cited by: §1.
- Sparse subspace clustering: algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2765–2781. Cited by: §1, §2, §3, §7.
- Deep $k$-means: jointly clustering with $k$-means and learning representations. arXiv preprint arXiv:1806.10069. Cited by: §7.
- Robust subspace segmentation with block-diagonal prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3818–3825. Cited by: §2.
- Structured low-rank matrix factorization: optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, Cited by: §1, §2.
- Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §4.
- Smooth representation clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3834–3841. Cited by: §2.
- A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pp. 3569–3575. Cited by: §1, item 3.
- Efficient dense subspace clustering. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pp. 461–468. Cited by: §7.
- Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pp. 24–33. Cited by: §4.1, item 1, §7.
- Similarity learning via kernel preserving embedding. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). AAAI Press, pp. 4057–4064. Cited by: item 7.
- Robust PCA via nonconvex rank approximation. In Data Mining (ICDM), 2015 IEEE International Conference on, pp. 211–220. Cited by: §1.
- Top-n recommender system via matrix completion. In AAAI, pp. 179–185. Cited by: §1.
- Twin learning for similarity and clustering: a unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press, Cited by: §1, §2, item 6.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §7.
- Saliency detection by multitask sparsity pursuit. IEEE Transactions on Image Processing 21 (3), pp. 1327–1338. Cited by: §2.
- Nonlinear dimensionality reduction. Springer Science & Business Media. Cited by: §1.
- Learning semi-supervised representation towards a unified optimization framework for semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2767–2775. Cited by: §1, item 4.
- Self-taught low-rank coding for visual learning. IEEE transactions on neural networks and learning systems. Cited by: §2.
- Low-rank models in visual analysis: theories, algorithms, and applications. 1st edition, Elsevier Science Publishing Co Inc. Cited by: §1.
- Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 171–184. Cited by: §1, §1, §7.
- Enhancing low-rank subspace clustering by manifold regularization. IEEE Transactions on Image Processing 23 (9), pp. 4022–4030. Cited by: §2.
- Absent multiple kernel learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §2.
- Robust and efficient subspace segmentation via least squares regression. Computer Vision–ECCV 2012, pp. 347–360. Cited by: §2.
- Low-rank preserving projections. IEEE transactions on cybernetics 46 (8), pp. 1900–1913. Cited by: §2.
- Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pp. 52–59. Cited by: §4.
- On spectral clustering: analysis and an algorithm. Advances in neural information processing systems 2, pp. 849–856. Cited by: item 1.
- Multi-view clustering and semi-supervised classification with adaptive neighbours. In AAAI, pp. 2408–2414. Cited by: item 3.
- Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 977–986. Cited by: §1, §2.
- Slim: sparse linear methods for top-n recommender systems. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 497–506. Cited by: §1, §3.
- Kernel sparse subspace clustering. In Image Processing (ICIP), 2014 IEEE International Conference on, pp. 2849–2853. Cited by: §2, item 5, item 5, §7.
- Subspace clustering using log-determinant rank approximation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 925–934. Cited by: §1.
- Constructing the l2-graph for robust subspace learning and subspace clustering. IEEE transactions on cybernetics 47 (4), pp. 1053–1066. Cited by: §1.
- Fuzzy measures on the gene ontology for gene product similarity. IEEE/ACM Transactions on computational biology and bioinformatics 3 (3), pp. 263–274. Cited by: §3.
- Robust nuclear norm regularized regression for face recognition with occlusion. Pattern Recognition 48 (10), pp. 3145–3159. Cited by: §2.
- Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §2, §2.
- Unsupervised feature selection via latent representation learning and manifold regularization. Neural Networks 117, pp. 163–178. Cited by: §1.
- From ensemble clustering to multi-view clustering. In IJCAI, pp. 2843–2849. Cited by: §1.
- A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §2.
- Low rank subspace clustering (lrsc). Pattern Recognition Letters 43, pp. 47–61. Cited by: §7.
- Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §4.
- Robust kernel low-rank representation. IEEE transactions on neural networks and learning systems 27 (11), pp. 2268–2281. Cited by: §2, item 4, item 5.
- Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §7.
- Data clustering by Laplacian regularized l1-graph. In AAAI, pp. 3148–3149. Cited by: §2.
- Multi-view multiple clustering. In IJCAI, Cited by: §2.
- Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3918–3927. Cited by: §7.
- Self-tuning spectral clustering. In NIPS, Vol. 17, pp. 16. Cited by: §1.
- Sparse representation or collaborative representation: which helps face recognition?. In Computer vision (ICCV), 2011 IEEE international conference on, pp. 471–478. Cited by: §2.
- Joint low-rank and sparse principal feature coding for enhanced robust representation and visual classification. IEEE Transactions on Image Processing 25 (6), pp. 2429–2443. Cited by: §1.
- Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §3.3, item 1.
- Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919. Cited by: item 2.
- Measuring patient similarities via a deep architecture with medical concept embedding. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pp. 749–758. Cited by: §3.
- Non-negative low rank and sparse graph for semi-supervised learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2328–2335. Cited by: §2, §3.
- Locality-preserving low-rank representation for graph construction from nonlinear manifolds. Neurocomputing 175, pp. 715–722. Cited by: §2.
- Guest editors’ introduction to the special section on large scale and nonlinear similarity learning for intelligent video analysis. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.