1 Introduction
With advances in information technology, high-dimensional data have become very common. However, such data are difficult to handle because of the curse of dimensionality and the associated storage and computation costs. Fortunately, in practice data are not unstructured. For example, samples usually lie around low-dimensional manifolds and are highly correlated with each other Liu et al. (2013); Kang et al. (2015). This phenomenon is validated by the widely used Principal Component Analysis (PCA), where the number of principal components is much smaller than the data dimension. It is also evidenced in nonlinear manifold learning Lee and Verleysen (2007). Since dimension is closely related to the rank of a matrix, the low-rank property has been shown to be very effective in studying the low-dimensional structures in data Peng et al. (2015); Zhang et al. (2016).
Another motivation for utilizing rank in data and signal analysis is the tremendous success of sparse representation Elad (2010) and compressed sensing Donoho (2006), which are mainly applied to first-order data, such as voices and feature vectors. As an extension of sparsity for first-order data, low-rankness is a measure of sparsity for second-order data, such as images Lin and Zhang (2017). Low-rank models can effectively capture the correlation among rows and columns of a matrix, as shown in robust PCA Candès et al. (2011), matrix completion Candès and Recht (2009); Kang et al. (2016), and so on. Recently, low-rank and sparse models have shown their effectiveness in processing high-dimensional data by extracting rich low-dimensional structures, despite gross corruption and outliers. Unlike traditional manifold learning, this approach often enjoys good theoretical guarantees.
When data reside near multiple subspaces, a coefficient matrix $Z$ is introduced to enforce correlation among samples. Two typical models are low-rank representation (LRR) Liu et al. (2013) and sparse subspace clustering (SSC) Elhamifar and Vidal (2013). Both LRR and SSC aim to find a coefficient matrix $Z$ by reconstructing each data point as a linear combination of all the other data points, which is called the self-expression property. $Z$ is assumed to be low-rank in LRR and sparse in SSC. In the literature, $Z$ is also called the similarity matrix since it measures the similarity between samples Chen et al. (2015). LRR and SSC have achieved impressive performance in face clustering, motion segmentation, etc. In these applications, they first learn a similarity matrix $Z$ from the data by minimizing the reconstruction error. After that, they perform spectral clustering by treating $Z$ as the similarity graph matrix Peng et al. (2017). The self-expression idea has inspired a lot of work along this line, and it can be used whenever similarity among samples or features is needed. For instance, in recommender systems, it can be used to calculate the similarity among users and items Ning and Karypis (2011); in semi-supervised classification, it can be used to obtain the similarity graph Li et al. (2015); in multi-view learning, it can be used to characterize the connection between different views Tao et al. (2017).
More importantly, obtaining the similarity matrix through self-expression has a variety of benefits. First, the most informative “neighbors” of each data point are automatically chosen and the global structure information hidden in the data is explored Nie et al. (2014). This avoids many drawbacks of the widely used $k$-nearest-neighborhood and $\epsilon$-nearest-neighborhood graph construction methods, such as determining the neighbor number $k$ or the radius $\epsilon$. Second, it is independent of similarity metrics, such as cosine similarity, Euclidean distance, and the Gaussian function, which are often data-dependent and sensitive to noise and outliers Kang et al. (2017); Tang et al. (2019). Third, this automatic similarity learning from data can tackle data with structures at different scales of size and density Huang et al. (2015). Therefore, similarity learning based on low-rank and sparse modeling can not only unveil low-dimensional structure, but also be robust to the uncertainties of real-world data. It dramatically reduces the chance of errors that might heavily influence the subsequent tasks Zelnik-Manor and Perona (2004).
Nevertheless, the data in various real applications are usually very complicated and can display structures beyond simply being low-rank or sparse Haeffele et al. (2014). Hence, it is essential to learn a representation that embeds the rich structure information of the original data well. Existing methods usually employ simple models, which are generally less effective and can hardly capture the rich structural information that exists in real-world data. To combat this issue, in this paper we demonstrate that it is beneficial to preserve similarity information between samples when performing structure learning, and we design a novel term for this task. This new term measures the inconsistency between two kernel matrices, one for the raw data and another for the reconstructed data, such that the reconstructed data preserve the rich structural information of the raw data well. The advantage of this approach is demonstrated on three important problems: shallow clustering, semi-supervised classification, and deep clustering.
Compared with existing work in the literature, the main contributions of this paper are as follows:

Different from current low-dimensional structure learning methods, we explicitly model the data relation by preserving the pairwise similarity of the original data with a novel term. Our approach reduces the inconsistency between the structural information of raw and reconstructed data, which leads to enhanced performance.

Our proposed structure learning framework is also applied to a deep autoencoder. This helps to achieve a more informative and discriminative latent representation.

The effectiveness of the proposed approach is evaluated on both shallow and deep models with tasks ranging from image clustering, document clustering, face recognition, and digit/letter recognition to visual object recognition. Comprehensive experiments demonstrate the superiority of our technique over other state-of-the-art methods.

Our method can serve as a fundamental framework, which can be readily applied to other self-expression methods. Moreover, beyond clustering and classification applications, the proposed framework can be efficiently generalized to a variety of other learning tasks.
The rest of the paper is organized as follows. Section 2 gives a brief review of two popular algorithms. Section 3 introduces the proposed technique and discusses its typical applications to spectral clustering and semi-supervised classification tasks. After that, we present a deep neural network implementation of our technique in Section 4. Clustering and semi-supervised classification experimental results and analysis are presented in Section 5 and Section 6, respectively. Section 7 validates our proposed deep clustering model. Finally, Section 8 draws conclusions.
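Notations. Given a data matrix $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ samples, we denote its $(i,j)$th element and $i$th column as $x_{ij}$ and $X_i$, respectively. The $\ell_2$ norm of a vector $x$ is represented as $\|x\| = \sqrt{x^T x}$, where $T$ is the transpose operator. The $\ell_1$ norm of $X$ is denoted by $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is defined as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The definition of $X$'s nuclear norm is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$th singular value of $X$. $I$ represents the identity matrix with proper size and $\mathbf{1}$ denotes a column vector with proper length where all elements are ones. $Z \geq 0$ means all the elements of $Z$ are nonnegative. Inner product is denoted by $\langle \cdot, \cdot \rangle$. Trace operator is denoted by $Tr(\cdot)$.
2 Related Work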
In this paper, we focus on learning a new representation that characterizes the relationship between samples, namely, the pairwise similarity information. It is well known that similarity measurement is a fundamental and crucial problem in machine learning, pattern recognition, computer vision, data mining and so on Zuo et al. (2018); Liu et al. (2015). A number of traditional approaches are often utilized in practice for convenience. As mentioned above, they often suffer from different kinds of drawbacks. The adaptive neighbors approach can learn a similarity matrix from data, but it can only capture the local structure information, and thus its clustering performance might deteriorate Nie et al. (2014).
Self-expression, another strategy, has become increasingly popular in recent years Yang et al. (2014). The basic idea is to encode each datum as a weighted combination of other samples, i.e., its direct neighbors and reachable indirect neighbors. Similar to locally linear embedding (LLE) Roweis and Saul (2000), if $X_i$ and $X_j$ are similar, the weight coefficient $Z_{ij}$ should be big. From this point of view, $Z$ also behaves like a similarity matrix. For convenience, we denote the reconstructed data as $\hat{X}$, where $\hat{X} = XZ$. The discrepancy between the original data and the reconstructed data is minimized by solving the following problem:
$$\min_Z \frac{1}{2}\|X - XZ\|_F^2 + \alpha R(Z), \tag{1}$$
where $R(Z)$ is a regularizer on $Z$, and $\alpha > 0$ is used to balance the effects of the two terms. Thus, we can seek either a sparse representation or a low-rank representation of the data by adopting the $\ell_1$ norm and nuclear norm of $Z$, respectively. Since this approach can capture the global structure information hidden in the data, it has drawn significant attention and achieved impressive performance in a number of applications, including face recognition Zhang et al. (2011), subspace clustering Yao et al. (2019); Elhamifar and Vidal (2013); Feng et al. (2014), semi-supervised learning Zhuang et al. (2012), dimension reduction Lu et al. (2016), and vision learning Li et al. (2017). To consider nonlinear or manifold structure information of data, some kernel-based methods Xiao et al. (2016); Patel and Vidal (2014) and manifold learning methods Zhuang et al. (2016); Liu et al. (2014) have been developed. However, these manifold-based methods depend on labels or a graph Laplacian, which are often not available.
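As a concrete illustration of the self-expression idea, the following minimal sketch solves a special case of problem (1) in which the regularizer is the squared Frobenius norm (the least squares representation setting), because that case admits a closed form. The function name `self_expression_ridge`, the factor 1/2 on the reconstruction term, and the value of `alpha` are assumptions made for illustration; sparse or low-rank regularizers require an iterative solver instead.

```python
import numpy as np

def self_expression_ridge(X, alpha=0.1):
    """Closed-form minimizer of 0.5*||X - X Z||_F^2 + alpha*||Z||_F^2.

    X holds one sample per column (features x samples).
    """
    n = X.shape[1]
    gram = X.T @ X  # n x n matrix of inner products between samples
    # Setting the gradient to zero gives (gram + 2*alpha*I) Z = gram.
    return np.linalg.solve(gram + 2.0 * alpha * np.eye(n), gram)

rng = np.random.default_rng(0)
# Toy data: two orthogonal one-dimensional subspaces in R^3, three samples each.
basis_a = np.array([[1.0], [0.0], [0.0]])
basis_b = np.array([[0.0], [1.0], [0.0]])
X = np.hstack([basis_a @ rng.uniform(1, 2, (1, 3)),
               basis_b @ rng.uniform(1, 2, (1, 3))])
Z = self_expression_ridge(X, alpha=0.01)
```

On this toy input the learned coefficients connect only samples from the same subspace, which is the behavior that makes the coefficient matrix usable as a similarity matrix.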
Recently, Kang et al. proposed the twin learning for similarity and clustering (TLSC) method Kang et al. (2017). TLSC performs similarity learning and clustering in a unified framework. In particular, the similarity matrix is learned via self-expression in kernel space. Consequently, it shows impressive performance in clustering tasks.
However, all existing self-expression based methods only try to reconstruct the original data, so some valuable information is largely ignored. In practice, the low-dimensional manifold structure of real data is often very complicated and goes beyond low-rank or sparse structure Haeffele et al. (2014). Exploiting data relations has proved to be a promising means to discover the underlying structure in a number of techniques Tenenbaum et al. (2000); Roweis and Saul (2000). For instance, ISOMAP Tenenbaum et al. (2000) retains the geodesic distance between pairwise data in the low-dimensional space. LLE Roweis and Saul (2000) learns a low-dimensional manifold by preserving the linear relation, i.e., each data point is a linear combination of its neighbors. To seek a low-dimensional manifold, Laplacian Eigenmaps Belkin and Niyogi (2002) minimizes the weighted pairwise distance in the projected space, where the weight characterizes the pairwise relation in the original space.
In this paper, we demonstrate how to integrate similarity information into the construction of a new representation of data, resulting in a significant improvement on two fundamental tasks, i.e., clustering and semi-supervised classification. More importantly, the proposed idea can be readily applied to other self-expression methods such as smooth representation Hu et al. (2014) and least squared representation Lu et al. (2012), and to many applications, e.g., occlusion removal Qian et al. (2015), saliency detection Lang et al. (2012), and image segmentation Cheng et al. (2011).
3 Proposed Formulation
To make our framework more general, we build our model in kernel space. Eq. (1) can be easily extended to a kernel representation through a mapping $\phi$. By utilizing the kernel trick $K(x, y) = \phi(x)^T \phi(y)$, we have
$$\min_Z \frac{1}{2} Tr(K - 2KZ + Z^T K Z) + \alpha R(Z). \tag{2}$$
By solving this problem, we can learn the nonlinear relations among $X$. Note that (2) becomes (1) if a linear kernel is adopted.
In this paper, we aim to preserve the similarity information of the original data. To this end, we make use of the widely used inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data $X$ and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space. In other words, we have
$$\min_Z \|\phi(X)^T \phi(X) - (\phi(X)Z)^T (\phi(X)Z)\|_F^2. \tag{3}$$
(3) can be simplified as
$$\min_Z \|K - Z^T K Z\|_F^2. \tag{4}$$
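The step from (3) to (4) relies on the fact that the inner products of the reconstructed data can be written purely in terms of the kernel matrix. The sketch below checks this numerically under the assumption of a linear kernel, for which the mapping is the identity; the function name `similarity_preserving_term` and the random test data are illustrative choices, not part of the original method description.

```python
import numpy as np

def similarity_preserving_term(K, Z):
    """Squared Frobenius gap between K and the kernel of the reconstructed data, Z^T K Z."""
    return np.linalg.norm(K - Z.T @ K @ Z, "fro") ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 features, 8 samples (one per column)
Z = rng.standard_normal((8, 8))   # arbitrary coefficient matrix
K = X.T @ X                       # linear kernel: phi is the identity map

# Direct evaluation in input space: inner products of raw vs. reconstructed samples.
X_rec = X @ Z
direct = np.linalg.norm(X.T @ X - X_rec.T @ X_rec, "fro") ** 2
via_kernel = similarity_preserving_term(K, Z)
```

Both quantities agree up to floating-point error, and the term vanishes when the coefficient matrix is the identity, i.e., when the reconstruction reproduces every pairwise inner product.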
Comparing Eq. (4) to (2), we can see that Eq. (4) involves a higher order of $Z$. Thus, our designed Eq. (4) captures high-order information of the original data. Although we claim that our method seeks to preserve similarity information, it also has a dissimilarity preserving effect, so it can preserve the relations between samples in general. Combining (4) with (2), we obtain our Structure Learning with Similarity Preserving (SLSP) framework:
$$\min_Z \frac{1}{2} Tr(K - 2KZ + Z^T K Z) + \alpha R(Z) + \beta \|K - Z^T K Z\|_F^2. \tag{5}$$
Through solving this problem, we can obtain either a low-rank or sparse matrix $Z$, which carries rich structure information of the original data. Besides this, SLSP enjoys several other nice properties:
(1) Our framework can not only capture global structure information but also preserve, in the embedding space, the pairwise similarities between the original data points. If a linear kernel function is adopted in (5), our framework can recover linear structure information hidden in the data.
(2) Our proposed technique is particularly suitable for problems that are sensitive to sample similarity, such as clustering Elhamifar and Vidal (2013), classification Zhuang et al. (2012), user/item similarity in recommender systems Ning and Karypis (2011), and patient/drug similarity in healthcare informatics Zhu et al. (2016). We believe that our framework can effectively model and extract rich low-dimensional structures in high-dimensional data such as images, documents, and videos.
(3) The input is a kernel matrix. This is an appealing property, as not all types of real-world data can be represented as numerical feature vectors. For example, we often find clusters of proteins based on their structures and group users in social media according to their friendship relations.
(4) Generic similarity rather than inner product can also be used to construct (4) given that the resulting optimization problem is still solvable. It means that similarity measures that reflect domain knowledge such as Popescu et al. (2006) can be incorporated in SLSP directly. Even dissimilarity measures can be included in this algorithm. This flexibility extends the range of applicability of SLSP.
3.1 Optimization
Although the SLSP problem can be solved in several different ways, we describe an approach based on the alternating direction method of multipliers (ADMM) [41], which is easy to understand. Since the objective function in (5) is a fourth-order function of $Z$, ADMM can lower its order by introducing auxiliary variables.
First, we rewrite (5) in the following equivalent form by introducing three new variables:
(6) 
Then its augmented Lagrangian function can be written as:
(7) 
where is a penalty parameter and , , are Lagrangian multipliers. We can update those variables alternatively, one at each step, while keeping the others fixed. Then, it yields the following updating rules.
Updating : By removing the irrelevant terms, we arrive at:
(8) 
It can be seen that it is a strongly convex quadratic function and can be solved by setting its first derivative to zero, so
(9) 
Updating : For , we are to solve:
(10) 
By setting its first derivative to zero, we obtain
(11) 
Updating : We fix other variables except , the objective function becomes:
(12) 
Similar to , it yields
(13) 
Updating : For , the subproblem is:
(14) 
where . Depending on regularization strategy, we have different closedform solutions for
. Let’s write the singular value decomposition (SVD) of
as . Then, for lowrank representation, i.e., , we have Cai et al. (2010),(15) 
To obtain a sparse representation, i.e., , we can update elementwisely as Beck and Teboulle (2009) :
(16) 
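The two closed-form updates described above are instances of standard proximal operators: singular value thresholding for the nuclear norm Cai et al. (2010) and element-wise soft thresholding for the $\ell_1$ norm Beck and Teboulle (2009). A minimal NumPy sketch follows. The function names and the threshold argument `tau` (which, in an ADMM setting, would typically be the regularization weight divided by the penalty parameter) are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def singular_value_threshold(D, tau):
    """Proximal operator of tau*||.||_*: shrink the singular values of D by tau."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(D, tau):
    """Proximal operator of tau*||.||_1: shrink every entry of D toward zero by tau."""
    return np.sign(D) * np.maximum(np.abs(D) - tau, 0.0)

# Singular values 3 and 0.5 are shrunk by 1, so the second one is cut to zero.
low_rank = singular_value_threshold(np.array([[3.0, 0.0], [0.0, 0.5]]), tau=1.0)
# Entries with magnitude below 0.5 are set to zero, the rest move toward zero by 0.5.
sparse = soft_threshold(np.array([[2.0, -0.3], [0.4, -1.5]]), tau=0.5)
```

The first operator lowers the rank of its input, while the second zeroes out small entries, which matches the low-rank and sparse variants of the regularizer.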
For clarity, the complete algorithm to solve problem (6) is summarized in Algorithm 1. We stop the algorithm if the maximum iteration number 300 is reached or the relative change of is less than .
3.2 Complexity Analysis
First, the construction of kernel matrix costs . The computational cost of Algorithm 1 is mainly determined by updating the variables , , and . All of them involve matrix inversion and multiplication of matrices, whose complexity is . For large scale data sets, we might alleviate this by resorting to some approximation techniques or tricks, e.g., Woodbury matrix identity. In addition, depending on the choice of regularizer, we have different complexity for . For lowrank representation, it requires an SVD for every iteration and its complexity is if we employ partial SVD ( is lowest rank we can find), which can be achieved by package like PROPACK. The complexity of obtaining a sparse solution is . The updating of , , and cost .
3.3 Application of Similarity Matrix
One typical application of $Z$ is spectral clustering, which builds the graph Laplacian $L$ based on pairwise similarities between data points. Specifically, $L = D - Z$, where $D$ is a diagonal matrix with $i$th diagonal element $\sum_j z_{ij}$. Spectral clustering solves the following problem:
$$\min_P Tr(P^T L P) \quad s.t. \quad P^T P = I, \tag{17}$$
where $P \in \mathbb{R}^{n \times c}$ is the cluster indicator matrix.
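The relaxed spectral clustering problem is solved by the eigenvectors of the graph Laplacian associated with its smallest eigenvalues. The sketch below illustrates this for a learned similarity matrix. The symmetrization step, the function name `spectral_embedding`, and the toy similarity matrix are assumptions added for illustration; in practice the rows of the embedding are then grouped with k-means.

```python
import numpy as np

def spectral_embedding(Z, n_clusters):
    """Return the n_clusters eigenvectors of L = D - W with the smallest eigenvalues."""
    # Assumption: symmetrize the learned coefficients so that W is a valid affinity.
    W = 0.5 * (np.abs(Z) + np.abs(Z).T)
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, :n_clusters]

# Toy similarity matrix with two fully connected groups of three samples each.
Z = np.zeros((6, 6))
Z[:3, :3] = 1.0
Z[3:, 3:] = 1.0
np.fill_diagonal(Z, 0.0)
P = spectral_embedding(Z, n_clusters=2)
```

For this block-structured input, samples in the same group receive identical embedding rows, so any reasonable clustering of the rows recovers the two groups.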
Another classical task that makes use of $Z$ is semi-supervised classification. In the past decade, graph-based semi-supervised learning (GSSL) has attracted considerable attention due to its elegant formulation and low computational complexity Cheng et al. (2009). Similarity graph construction is one of the two fundamental components of GSSL and is critical to the quality of classification. Nevertheless, compared with label inference, graph construction attracted much less attention until recent years Berton and De Andrade Lopes (2015).
After we obtain $L$, we can adopt the popular local and global consistency (LGC) as the classification framework Zhou et al. (2004). LGC finds a classification function $F \in \mathbb{R}^{n \times c}$ by solving the following problem:
$$\min_F Tr\{F^T L F + \gamma (F - Y)^T (F - Y)\}, \tag{18}$$
where $c$ is the class number, $Y \in \mathbb{R}^{n \times c}$ is the label matrix, in which $y_{ij} = 1$ iff the $i$th sample belongs to the $j$th class, and $y_{ij} = 0$ otherwise.
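Label propagation of this kind has a closed-form solution once the graph Laplacian is fixed. The sketch below assumes one common regularized form, namely minimizing a smoothness term on the unnormalized Laplacian plus a weighted squared fitting term to the known labels; the original LGC formulation of Zhou et al. (2004) uses a normalized Laplacian, so this is an illustrative variant rather than the exact procedure. The names `lgc_predict` and `gamma` and the toy graph are hypothetical.

```python
import numpy as np

def lgc_predict(W, Y, gamma=1.0):
    """Minimize Tr(F^T L F) + gamma*||F - Y||_F^2 and return the arg-max class per sample."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W   # unnormalized graph Laplacian (assumption)
    # Setting the gradient to zero gives (L + gamma*I) F = gamma*Y.
    F = np.linalg.solve(L + gamma * np.eye(n), gamma * Y)
    return F.argmax(axis=1)

# Toy graph: two fully connected groups of three samples, one labeled sample per group.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # sample 0 belongs to class 0
Y[3, 1] = 1.0   # sample 3 belongs to class 1
labels = lgc_predict(W, Y)
```

Each labeled sample spreads its label to the unlabeled samples it is connected to, which is why the quality of the similarity graph directly affects the classification result.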
4 Extension to Deep Model
The proposed objective function in Eq. (5) can discover the structure in the input space. However, it has limited power to represent the data. On the other hand, the deep autoencoder Hinton and Salakhutdinov (2006) and its variants Vincent et al. (2010); Masci et al. (2011) can learn the structure of data in a nonlinear feature space. However, they ignore the geometry of data when learning data representations. Learning useful representations for a specific task remains a key challenge Bengio et al. (2013). In this paper, we propose the idea of similarity preserving for structure learning. Therefore, it is alluring to get the best of both worlds by implementing our SLSP framework within an autoencoder. As we show later, the proposed similarity preserving regularizer indeed enhances the performance of the autoencoder.
4.1 Model Formulation
To implement Eq. (5) in an autoencoder, we first need to express $Z$. Recently, Ji et al. (2017) proposed a deep subspace clustering model with the capability of similarity learning. Inspired by it, we introduce a self-expression layer into the deep autoencoder architecture. Without bias and activation function, this fully connected layer encodes the notion of self-expression. In other words, the weights of this layer are the matrix $Z$
. In addition, kernel mapping is no longer needed since we transform the input data with a neural network. Then, the architecture to implement our model can be depicted as Figure 1. As we can see, input data is first transformed into a latent representation , selfexpressed by a fullyconnected layer, and again mapped onto the original space.Let denote the recovered data by decoder. We take each data point as a node in the network. Let the network parameters consist of encoder parameters , selfexpression layer parameters , and decoder parameters . Then, is a function of and is a function of
. Eventually, we reach our loss function for Deep SLSP (DSLSP) as:
(19) 
The first term denotes the traditional reconstruction loss, which guarantees the recovery performance, so that the latent representation retains as much of the original information as possible. With the reconstruction performance guaranteed, the latent representation can be treated as a good representation of the input data $X$. The second term is the self-expression as in Eq. (1). The fourth term is the key component, which functions as similarity preserving. For simplicity, it is implemented by dot product. This is also motivated by the fact that our input data points have experienced a series of highly nonlinear transformations produced by the encoder.
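To make the roles of the terms concrete, the following sketch evaluates a loss with the structure described above: a reconstruction term, a self-expression term on the latent representation, a regularizer on the self-expression weights, and a dot-product similarity preserving term. It is a NumPy illustration under stated assumptions, not the authors' network code: the function name `dslsp_loss`, the weights `w_self`, `w_reg`, `w_sim`, the choice of an $\ell_1$ regularizer, and the column-per-sample layout are all hypothetical.

```python
import numpy as np

def dslsp_loss(X, X_hat, H, C, w_self=1.0, w_reg=1.0, w_sim=1.0):
    """Four-term loss: reconstruction, self-expression, regularizer, similarity preserving.

    X, X_hat: input and decoder output (features x samples).
    H: latent representation from the encoder (latent_dim x samples).
    C: weights of the self-expression layer (samples x samples).
    """
    HC = H @ C                                  # self-expressed latent codes
    recon = 0.5 * np.sum((X - X_hat) ** 2)      # term 1: reconstruction loss
    self_expr = 0.5 * np.sum((H - HC) ** 2)     # term 2: self-expression in latent space
    reg = np.sum(np.abs(C))                     # term 3: l1 regularizer (assumed choice)
    sim = np.sum((H.T @ H - HC.T @ HC) ** 2)    # term 4: dot-product similarity preserving
    return recon + w_self * self_expr + w_reg * reg + w_sim * sim

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
H = rng.standard_normal((3, 4))
# With perfect reconstruction and identity self-expression only the regularizer remains.
trivial = dslsp_loss(X, X, H, np.eye(4), w_reg=0.5)
perturbed = dslsp_loss(X, X, H, np.eye(4) + 0.1, w_reg=0.5)
```

In an actual training loop the same expression would be written in an automatic differentiation framework so that the encoder, self-expression layer, and decoder can be optimized jointly.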
5 Shallow Clustering Experiment
In this section, we conduct clustering experiments on images and documents with shallow models.
5.1 Data
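Table 1: Statistics of the data sets.
Data set | # instances | # features | # classes
YALE | 165 | 1024 | 15
JAFFE | 213 | 676 | 10
ORL | 400 | 1024 | 40
COIL20 | 1440 | 1024 | 20
BA | 1404 | 320 | 36
TR11 | 414 | 6429 | 9
TR41 | 878 | 7454 | 10
TR45 | 690 | 8261 | 10
TDT2 | 9394 | 36771 | 30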
We conduct experiments on nine popular data sets, whose statistics are summarized in Table 1. Specifically, the first five data sets include three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. TR11, TR41, and TR45 are derived from the NIST TREC Document Database. The TDT2 corpus has been among the ideal test sets for document clustering purposes.
Following the setting in Du et al. (2015), we design 12 kernels. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t d_{max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{max}$ is the maximal distance between data points; a linear kernel $K(x, y) = x^T y$; four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$. Besides, all kernels are normalized to the $[0, 1]$ range, which is done by dividing each element by the largest pairwise squared distance Du et al. (2015).
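A kernel pool of this kind can be assembled in a few lines. The sketch below builds seven Gaussian, one linear, and four polynomial kernels for data stored one sample per column. The specific bandwidth grid `ts`, the polynomial parameters, and the helper names are assumptions chosen to match a commonly used multiple-kernel setup, and the final normalization step is left out.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between the columns of X."""
    sq = np.sum(X ** 2, axis=0)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)

def gaussian_kernel(X, t):
    """exp(-||x - y||^2 / (t * d_max^2)), with d_max the largest pairwise distance."""
    d2 = pairwise_sq_dists(X)
    return np.exp(-d2 / (t * d2.max()))

def polynomial_kernel(X, a, b):
    """(a + x^T y)^b evaluated for every pair of columns."""
    return (a + X.T @ X) ** b

def build_kernels(X, ts=(0.01, 0.05, 0.1, 1, 10, 50, 100)):
    """Return 7 Gaussian kernels, 1 linear kernel, and 4 polynomial kernels."""
    kernels = [gaussian_kernel(X, t) for t in ts]
    kernels.append(X.T @ X)  # linear kernel
    kernels += [polynomial_kernel(X, a, b) for a in (0, 1) for b in (2, 4)]
    return kernels

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # 4 features, 6 samples
kernels = build_kernels(X)
```

Each kernel matrix can then be passed to the kernelized methods in turn, so that results can be compared across kernels.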
5.2 Comparison Methods
To fully investigate the performance of our method on clustering, we choose a good set of methods to compare. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.

Spectral Clustering (SC) Ng et al. (2002): SC is a widely used clustering technique. It enjoys the advantage of exploring the intrinsic data structures. However, how to construct a good similarity graph is an open issue. Here, we directly use the kernel matrix as its input. For our proposed SLSP method, we obtain clustering results by performing spectral clustering with our learned $Z$.

Simplex Sparse Representation (SSR) Huang et al. (2015): Based on sparse representation, SSR achieves satisfying performance on numerous data sets.

Kernelized LRR (KLRR) Xiao et al. (2016): Based on self-expression, low-rank representation has achieved great success in a number of applications. Kernelized LRR deals with nonlinear data and demonstrates better performance than LRR in many tasks.

Kernelized SSC (KSSC) Patel and Vidal (2014): A kernelized version of SSC has also been proposed to capture nonlinear structure information in the input space. Since our framework is an extension of KLRR and KSSC to preserve similarity information, the difference in performance will shed light on the effects of similarity preserving.

Twin Learning for Similarity and Clustering (TLSC) Kang et al. (2017): Based on self-expression, TLSC has been proposed recently and has shown superior performance on a number of real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space but also has an optimal similarity graph guarantee. However, it fails to preserve similarity information.

SLKES and SLKER Kang et al. (2019): They are closely related to the method developed in this paper. However, they only have the similarity preserving term, which might lose some low-order information.

Our proposed SLSP: Our proposed structure learning framework with similarity preserving capability. After obtaining the similarity matrix $Z$, we perform spectral clustering based on Eq. (17). We examine both low-rank and sparse regularizers and denote the corresponding methods as SLSPr and SLSPs, respectively. The implementation of our algorithm is publicly available at https://github.com/sckangz/L2SP.


5.3 Evaluation Metrics
To quantify the effectiveness of our algorithm on the clustering task, we use two popular metrics, i.e., accuracy (Acc) and normalized mutual information (NMI) Cai et al. (2009).
As the most widely used clustering metric, Acc measures the one-to-one relationship between clusters and classes. If we use $\hat{y}_i$ and $y_i$ to represent the clustering partition and the ground truth label of sample $X_i$, respectively, then we can define Acc as
$$Acc = \frac{\sum_{i=1}^{n} \delta(y_i, map(\hat{y}_i))}{n},$$
where $n$ is the total number of instances, $\delta(\cdot)$ is the famous delta function, and map($\cdot$) maps each cluster index to a true class label based on the Kuhn-Munkres algorithm Chen et al. (2001).
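Because cluster indices are arbitrary, computing Acc requires finding the best one-to-one matching between clusters and classes first. The sketch below does this with the Kuhn-Munkres (Hungarian) algorithm as provided by `scipy.optimize.linear_sum_assignment`. The function name `clustering_accuracy` and the toy labels are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples matched under the best one-to-one cluster-to-class mapping."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes, true_idx = np.unique(y_true, return_inverse=True)
    clusters, pred_idx = np.unique(y_pred, return_inverse=True)
    # counts[k, c] = number of samples in cluster k whose true class is c.
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(counts, (pred_idx, true_idx), 1)
    # The solver minimizes cost, so negate the counts to maximize matched samples.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
acc_permuted = clustering_accuracy(y_true, [2, 2, 0, 0, 1, 1])   # same partition, renamed
acc_one_error = clustering_accuracy(y_true, [1, 1, 0, 0, 2, 0])  # one sample misplaced
```

A partition that differs from the ground truth only by a renaming of cluster indices therefore still receives a perfect score.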
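The NMI is defined as follows
$$NMI(Y, Y') = \frac{\sum_{y \in Y, y' \in Y'} p(y, y') \log \left( \frac{p(y, y')}{p(y) p(y')} \right)}{\max(H(Y), H(Y'))},$$
where $Y$ and $Y'$ denote two sets of clusters, $p(y)$ and $p(y')$ are the corresponding marginal probability distribution functions induced from the joint distribution $p(y, y')$, and $H(\cdot)$ represents the entropy function. A bigger NMI value indicates better clustering performance.
5.4 Clustering Results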
We report the experimental results in Table 2.
As we can see, our method can beat others in almost all experiments. Concretely, we can draw the following conclusions:
(i) The improvements of SLSP over SC verify the importance of a high-quality similarity measure. Rather than directly using the kernel matrix in SC, we use the learned $Z$ as the input of SC. Hence, the big improvement comes entirely from our high-quality similarity measure;
(ii) Comparing SLSPs with KSSC and SLSPr with KLRR, we can see the benefit of retaining similarity structure information. In particular, on the TDT2 data set, SLSPs improves the accuracy of KSSC by 25.16%;
(iii) It is worth pointing out our big gain over the recently proposed method TLSC. Although both SLSP and TLSC are based on self-expression and kernel methods, TLSC fails to consider preserving similarity information, which might be lost during the reconstruction process;
(iv) With respect to SLKES and SLKER, which also have a similarity preserving effect, our method still outperforms them in most cases. This is attributed to the fact that the first term in Eq. (5) can keep some low-order information, which is missing in SLKES and SLKER. We can observe that SLSPr improves the accuracy of SLKER by over 6% on the ORL, BA, and TR11 data sets.
In summary, these results confirm the crucial role of similarity measure in clustering and the great benefit due to similarity preserving.
5.5 Parameter Analysis
There are two parameters in our model: $\alpha$ and $\beta$. Taking the YALE data set as an example, we demonstrate the sensitivity of our models SLSPr and SLSPs to $\alpha$ and $\beta$ in Figures 2 and 3. They illustrate that our methods are quite insensitive to $\alpha$ and $\beta$ over wide ranges of values.
6 Semi-supervised Classification Experiment
In this section, we show that our method performs well on the semi-supervised classification task based on Eq. (18).
6.1 Data
We perform experiments on different types of recognition tasks.
(1) Evaluation on Face Recognition: We examine the effectiveness of our similarity graph learning for face recognition on two frequently used face databases: YALE and JAFFE. The YALE face data set contains 15 individuals, and each person has 11 near-frontal images taken under different illuminations. Each image is resized to 32×32 pixels. The JAFFE face database consists of 10 individuals, and each subject has 7 different facial expressions (6 basic facial expressions + 1 neutral). The images are resized to 26×26 pixels.
(2) Evaluation on Digit/Letter Recognition: In this experiment, we address the digit/letter recognition problem on the BA database. The data set consists of the digits “0” through “9” and the capital letters “A” to “Z”, which leads to 36 classes. Moreover, each class has 39 samples.
(3) Evaluation on Visual Object Recognition: We conduct the visual object recognition experiment on the COIL20 database. The database consists of 20 objects and 72 images for each object. For each object, the images were taken 5 degrees apart as the object rotated on a turntable. The size of each image is 32×32 pixels.
To reduce the work load, we construct 7 kernels for each data set. They include: four Gaussian kernels with varies over ; a linear kernel ; two polynomial kernels with .
6.2 Comparison Methods
We compare our method with several other state-of-the-art algorithms.

Local and Global Consistency (LGC) Zhou et al. (2004): LGC is a popular label propagation method. For this method, we use the kernel matrix as its similarity measure to compute $L$.

Gaussian Field and Harmonic function (GFHF) Zhu et al. (2003): Different from LGC, GFHF is another mechanism that infers the unknown labels by propagating labels through the pairwise similarity.

Semi-supervised Classification with Adaptive Neighbours (SCAN) Nie et al. (2017): Based on the adaptive neighbors method, SCAN adds a rank constraint to ensure that the learned graph has exactly $c$ connected components. As a result, the similarity matrix and class indicator matrix are learned simultaneously. It shows much better performance than many other techniques.

A Unified Optimization Framework for Semi-supervised Learning Li et al. (2015): Li et al. propose a unified framework based on the self-expression approach. Similar to SCAN, the similarity matrix and class indicator matrix are updated alternately. By using low-rank and sparse regularizers, they obtain the SLRR and SR methods, respectively.

Our Proposed SLSP: After we obtain $Z$ from SLSPr and SLSPs, we plug it into the LGC algorithm to predict labels for unlabeled data points.
Table 3: Classification accuracy (%) and standard deviation (mean ± deviation).
Data | Labeled Percentage (%) | GFHF | LGC | SR | SLRR | SCAN | KLRR | KSSC | SLSPs | SLSPr
YALE | 10 | 38.00±11.91 | 47.33±13.96 | 38.83±8.60 | 28.77±9.59 | 45.07±1.30 | 50.53±11.36 | 47.03±10.32 | 51.20±1.29 | 53.34±14.30
YALE | 30 | 54.13±9.47 | 63.08±2.20 | 58.25±4.25 | 42.58±5.93 | 60.92±4.03 | 62.67±3.38 | 70.08±3.39 | 70.71±3.13 | 71.13±2.88
YALE | 50 | 60.28±5.16 | 69.56±5.42 | 69.00±6.57 | 51.22±6.78 | 68.94±4.57 | 70.61±4.98 | 77.83±5.84 | 78.06±4.74 | 75.89±4.82
JAFFE | 10 | 92.85±7.76 | 96.68±2.76 | 97.33±1.51 | 94.38±6.23 | 96.92±1.68 | 95.29±3.27 | 91.22±2.46 | 98.59±1.07 | 98.99±0.83
JAFFE | 30 | 98.50±1.01 | 98.86±1.14 | 99.25±0.81 | 98.82±1.05 | 98.20±1.22 | 99.22±0.72 | 98.17±1.54 | 99.33±0.99 | 99.20±0.99
JAFFE | 50 | 98.94±1.11 | 99.29±0.94 | 99.82±0.60 | 99.47±0.59 | 99.25±5.79 | 99.86±0.32 | 99.38±0.65 | 99.91±0.27 | 99.91±0.99
BA | 10 | 45.09±3.09 | 48.37±1.98 | 25.32±1.14 | 20.10±2.51 | 55.05±1.67 | 46.29±2.33 | 49.13±2.06 | 56.71±1.71 | 58.18±1.27
BA | 30 | 62.74±0.92 | 63.31±1.03 | 44.16±1.03 | 43.84±1.54 | 68.84±1.09 | 62.82±1.47 | 66.51±1.15 | 68.86±1.71 | 67.37±1.01
BA | 50 | 68.30±1.31 | 68.45±1.32 | 54.10±1.55 | 52.49±1.27 | 72.20±1.44 | 67.74±1.44 | 70.69±1.25 | 73.40±1.06 | 73.82±1.24
COIL20 | 10 | 87.74±2.26 | 85.43±1.40 | 93.57±1.59 | 81.10±1.69 | 90.09±1.15 | 88.90±2.46 | 85.70±4.03 | 97.35±1.22 | 94.68±2.38
COIL20 | 30 | 95.48±1.40 | 87.82±1.03 | 96.52±0.68 | 87.69±1.39 | 95.27±0.93 | 96.75±1.49 | 96.17±1.65 | 99.46±1.55 | 98.50±0.85
COIL20 | 50 | 98.62±0.71 | 88.47±0.45 | 97.87±0.10 | 90.92±1.19 | 97.53±0.82 | 98.89±1.02 | 98.24±0.97 | 99.91±0.34 | 99.47±0.59
6.3 Experimental Setup
The commonly used evaluation measure accuracy is adopted here. It is defined as
$$Acc = \frac{N_c}{N},$$
where $N_c$ is the number of samples correctly predicted and $N$ is the total number of samples. We randomly choose a portion of the samples as labeled data and repeat the procedure 20 times. In our experiments, 10%, 30%, and 50% of the samples in each class are randomly selected and labeled. The classification accuracy and deviation are shown in Table 3. For GFHF, LGC, KLRR, KSSC, and our proposed SLSP method, the aforementioned seven kernels are tested and the best performance is reported. More importantly, for these methods the label information is only used in the label propagation stage. For SCAN, SLRR, and SR, the label prediction and similarity learning are conducted in a unified framework, which often leads to better performance.
6.4 Results
As expected, the classification accuracy of all methods increases monotonically with the percentage of labeled samples. As can be observed, our SLSP method consistently outperforms the other state-of-the-art methods, which confirms the effectiveness of our proposed method. Specifically, we have the following observations:
(i) By comparing the performance of our proposed SLSP with LGC, we can clearly see the importance of graph construction in semi-supervised learning. On the COIL20 data set, the average improvement of SLSPs and SLSPr over LGC is 11.67% and 10.31%, respectively. In our experiments, LGC directly uses the kernel matrix as input, whereas our method feeds the learned similarity matrix to LGC instead. Hence, the improvement is attributable to our high-quality graph construction;
(ii) The superiority of SLSPs and SLSPr over KSSC and KLRR, respectively, derives from our consideration of the similarity preserving effect. The improvement is especially considerable when the portion of labeled samples is small, which suggests that our method is promising in real-world settings where labels are scarce. With 10% labeling, for example, the average gain is 7.69% and 6% for sparse and low-rank representation, respectively;
(iii) Although SCAN, SLRR, and SR can learn the similarity matrix and labels simultaneously, our two-step approach still reaches a higher recognition rate. This implies that our proposed method can produce a more accurate similarity graph than existing techniques that lack an explicit similarity preserving capability.
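The average improvement quoted in observation (i) can be recomputed from the COIL20 rows of Table 3. The snippet below is a small arithmetic check rather than part of the method; it assumes that the second method column of Table 3 is LGC and that the last two columns are SLSPs and SLSPr, a mapping that is not restated here and is chosen because it reproduces the quoted figures.

```python
# COIL20 mean accuracies (%) from Table 3 at 10%, 30%, and 50% labeled samples.
# Assumed column mapping: 2nd method column = LGC, 8th = SLSPs, 9th = SLSPr.
coil20 = {
    "LGC":   [85.43, 87.82, 88.47],
    "SLSPs": [97.35, 99.46, 99.91],
    "SLSPr": [94.68, 98.50, 99.47],
}

def mean(values):
    return sum(values) / len(values)

def average_gain(table, method, baseline):
    """Difference of the mean accuracies, in percentage points."""
    return mean(table[method]) - mean(table[baseline])

gain_s = average_gain(coil20, "SLSPs", "LGC")
gain_r = average_gain(coil20, "SLSPr", "LGC")
print(f"SLSPs over LGC: {gain_s:.2f}")
print(f"SLSPr over LGC: {gain_r:.2f}")
```

The two differences are averages over the three labeling rates, so they are gains in percentage points rather than relative improvements.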
7 Deep Clustering Experiment
To demonstrate the effect of the deep model DSLSP, we follow the settings in Ji et al. (2017) and perform the clustering task on the Extended Yale B (EYaleB), ORL, COIL20, and COIL40 datasets. We compare with LRR Liu et al. (2013), Low Rank Subspace Clustering (LRSC) Vidal and Favaro (2014), SSC Elhamifar and Vidal (2013), Kernel Sparse Subspace Clustering (KSSC) Patel and Vidal (2014), SSC by Orthogonal Matching Pursuit (SSCOMP) You et al. (2016), Efficient Dense Subspace Clustering (EDSC) Ji et al. (2014), SSC with pretrained convolutional autoencoder features (AE+SSC), Deep Embedding Clustering (DEC) Xie et al. (2016), Deep k-means (DKM) Fard et al. (2018), Deep Subspace Clustering Network with $\ell_1$ norm (DSCNetL1) Ji et al. (2017), and Deep Subspace Clustering Network with $\ell_2$ norm (DSCNetL2) Ji et al. (2017). For a fair comparison with DSCNets, we adopt the $\ell_1$ and $\ell_2$ norms, respectively, using the same network architectures; the resulting models are denoted as DSLSPL1 and DSLSPL2. We adopt convolutional neural networks (CNNs) to implement the autoencoder. Adam Kingma and Ba (2014) is employed for optimization. The full batch of the dataset is fed to our network. We pretrain the network without the self-expression layer. The details of the network structures are shown in Table 4.
EYaleB  ORL  COIL20  COIL40
encoder  5×5@10  5×5@5  3×3@15  3×3@20
3×3@20  3×3@3  -  -
3×3@30  3×3@3  -  -
Z  2432×2432  400×400  1440×1440  2880×2880
decoder  3×3@30  3×3@3  3×3@15  3×3@20
3×3@20  3×3@3  -  -
5×5@10  5×5@5  -  -
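To make the sizes in Table 4 concrete, the following sketch writes the encoder settings as a configuration and counts weights. It rests on several assumptions that are not stated in the table: an entry such as 5×5@10 is read as 10 filters of size 5×5, the input images are taken to be single-channel, biases are ignored, and the names `NETWORKS`, `conv_weights`, and `self_expression_weights` are hypothetical.

```python
# Encoder settings from Table 4 as (kernel size, number of filters) pairs.
# The decoder mirrors the encoder, so only the encoder is listed.
NETWORKS = {
    "EYaleB": {"encoder": [(5, 10), (3, 20), (3, 30)], "n_samples": 2432},
    "ORL":    {"encoder": [(5, 5), (3, 3), (3, 3)],    "n_samples": 400},
    "COIL20": {"encoder": [(3, 15)],                   "n_samples": 1440},
    "COIL40": {"encoder": [(3, 20)],                   "n_samples": 2880},
}

def conv_weights(layers, in_channels=1):
    """Count kernel weights (no biases) of a stack of convolutional layers."""
    total = 0
    for kernel, out_channels in layers:
        total += kernel * kernel * in_channels * out_channels
        in_channels = out_channels  # output of this layer feeds the next one
    return total

def self_expression_weights(n_samples):
    """The coefficient matrix Z is n x n: one weight per ordered pair of samples."""
    return n_samples * n_samples

for name, cfg in NETWORKS.items():
    print(name, conv_weights(cfg["encoder"]), self_expression_weights(cfg["n_samples"]))
```

Under these assumptions the self-expression layer holds far more weights than the convolutional layers, and since Z couples every pair of samples, this is consistent with feeding the whole dataset to the network as a single batch.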
Dataset  Metric  SSC  KSSC  SSCOMP  EDSC  LRR  LRSC  AE+SSC  DEC  DKM  DSCNetL1  DSCNetL2  DSLSPL1  DSLSPL2 

EYaleB  Accuracy  0.7354  0.6921  0.7372  0.8814  0.8499  0.7931  0.7480  0.2303  0.1713  0.9681  0.9733  0.9757  0.9762 
NMI  0.7796  0.7359  0.7803  0.8835  0.8636  0.8264  0.7833  0.4258  0.2704  0.9687  0.9703  0.9668  0.9674  
ORL  Accuracy  0.7425  0.7143  0.7100  0.7038  0.8100  0.7200  0.7563  0.5175  0.4682  0.8550  0.8600  0.8700  0.8775 
NMI  0.8459  0.8070  0.7952  0.7799  0.8603  0.8156  0.8555  0.7449  0.7332  0.9023  0.9034  0.9237  0.9249  
COIL20  Accuracy  0.8631  0.7087  0.6410  0.8371  0.8118  0.7416  0.8711  0.7215  0.6651  0.9314  0.9368  0.9743  0.9757 
NMI  0.8892  0.8243  0.7412  0.8828  0.8747  0.8452  0.8990  0.8007  0.7971  0.9353  0.9408  0.9731  0.9740  
COIL40  Accuracy  0.7191  0.6549  0.4431  0.6870  0.6493  0.6327  0.4872  0.5812  0.1713  0.8003  0.8075  0.8389  0.8417 
NMI  0.8212  0.7888  0.6545  0.8139  0.7828  0.7737  0.8318  0.7417  0.7840  0.8852  0.8941  0.9262  0.9267 
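The clustering performance of different methods is provided in Table 5. We observe that DSLSPL2 and DSLSPL1 achieve the highest accuracy on all four datasets and the highest NMI on ORL, COIL20, and COIL40; on EYaleB their NMI is slightly below that of DSCNetL1 and DSCNetL2. Specifically, we have the following observations: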

(i) The $\ell_2$ norm performs slightly better than the $\ell_1$ norm. This is consistent with the results in Ji et al. (2017). A possible cause is inaccurate optimization with the $\ell_1$ norm, since it is non-differentiable at zero.
(ii) As they share the same network for latent representation learning, the improvement of DSLSP over DSCNet is attributed to our introduced similarity preserving mechanism. Note that the only difference between their objective functions is the additional similarity preserving term in Eq. (19). For example, on COIL20, DSLSPL2 improves over DSCNetL2 by 3.89% and 3.32% in terms of accuracy and NMI, respectively. On COIL40, our method with the $\ell_2$ norm outperforms DSCNetL2 by 3.42% in accuracy and 3.26% in NMI.
(iii) Both the ORL and COIL20 datasets are used in Tables 2 and 5. DSLSPL2 raises the accuracy from 0.81 and 0.8771 in Table 2 to 0.8775 and 0.9757, respectively. Once again, this demonstrates the power of deep learning models. Furthermore, for these two datasets, our results in Table 2 are also better than those of the shallow methods and AE+SSC in Table 5. This further verifies the advantages of our similarity preserving approach.
(iv) Compared to DEC and DKM, our method improves the performance significantly. This is because our method is based on similarity, whereas those methods are based on Euclidean distance, which is not suitable for complex data.
In summary, the above observations support the benefit of our proposed similarity preserving term in both shallow and deep models.
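The gains of DSLSPL2 over DSCNetL2 discussed above can be recomputed directly from Table 5. The snippet is only an arithmetic check on the reported numbers; the dictionary layout and the function name `gains` are illustrative choices, and the differences are expressed in percentage points.

```python
# (accuracy, NMI) pairs copied from Table 5 for the two L2 variants.
table5 = {
    "EYaleB": {"DSCNetL2": (0.9733, 0.9703), "DSLSPL2": (0.9762, 0.9674)},
    "ORL":    {"DSCNetL2": (0.8600, 0.9034), "DSLSPL2": (0.8775, 0.9249)},
    "COIL20": {"DSCNetL2": (0.9368, 0.9408), "DSLSPL2": (0.9757, 0.9740)},
    "COIL40": {"DSCNetL2": (0.8075, 0.8941), "DSLSPL2": (0.8417, 0.9267)},
}

def gains(dataset):
    """Return (accuracy gain, NMI gain) of DSLSPL2 over DSCNetL2 in percentage points."""
    base_acc, base_nmi = table5[dataset]["DSCNetL2"]
    ours_acc, ours_nmi = table5[dataset]["DSLSPL2"]
    return 100 * (ours_acc - base_acc), 100 * (ours_nmi - base_nmi)

for name in table5:
    acc_gain, nmi_gain = gains(name)
    print(f"{name}: accuracy {acc_gain:+.2f}, NMI {nmi_gain:+.2f}")
```

The signed output makes it easy to see where the similarity preserving term helps most and where, as for NMI on EYaleB, the difference is slightly negative.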
8 Conclusion
In this paper, we introduce a new structure learning framework, which is capable of obtaining a highly informative similarity graph for clustering and semi-supervised methods. Different from existing low-dimensional structure learning techniques, a novel term is designed to take advantage of pairwise sample similarity information in the learning stage. In particular, by incorporating the similarity preserving term, which tends to keep the similarities between samples, into our objective function, our method consistently and significantly improves clustering and classification accuracy. We therefore conclude that our framework better captures the geometric structure of the data, resulting in a more informative and discriminative similarity graph. Besides, our method can be easily extended to other self-expression-based methods. In the future, we plan to further investigate efficient algorithms for constructing large-scale similarity graphs. Also, current methods conduct label learning after graph construction. It would be interesting to develop a principled method that solves the graph construction and label learning problems at the same time.
Acknowledgment
This paper was supported in part by grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2017KYQD177).
References
 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (1), pp. 183–202. Cited by: §3.1.
 Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pp. 585–591. Cited by: §2.
 Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §4.
 Graph construction for semi-supervised learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 4343–4344. External Links: ISBN 9781577357384 Cited by: §3.3.
 Locality preserving nonnegative matrix factorization. In IJCAI, Vol. 9, pp. 1010–1015. Cited by: §5.3.
 A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), pp. 1956–1982. Cited by: §3.1.
 Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §1.
 Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717. Cited by: §1.
 Atomic decomposition by basis pursuit. SIAM review 43 (1), pp. 129–159. Cited by: §5.3.
 Similarity learning of manifold data. IEEE transactions on cybernetics 45 (9), pp. 1744–1756. Cited by: §1.
 Multi-task low-rank affinity pursuit for image segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2439–2446. Cited by: §2.
 Sparsity induced similarity measure for label propagation. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 317–324. Cited by: §3.3.
 Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §1.
 Robust multiple kernel k-means using $\ell_{2,1}$-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3476–3482. Cited by: item 2, §5.1.
 Sparse and redundant representations: from theory to applications in signal and image processing. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 144197010X, 9781441970107 Cited by: §1.
 Sparse subspace clustering: algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2765–2781. Cited by: §1, §2, §3, §7.
 Deep k-means: jointly clustering with k-means and learning representations. arXiv preprint arXiv:1806.10069. Cited by: §7.
 Robust subspace segmentation with block-diagonal prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3818–3825. Cited by: §2.
 Structured low-rank matrix factorization: optimality, algorithm, and applications to image processing. In International Conference on Machine Learning. Cited by: §1, §2.
 Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §4.
 Smooth representation clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3834–3841. Cited by: §2.
 A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pp. 3569–3575. Cited by: §1, item 3.
 Efficient dense subspace clustering. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pp. 461–468. Cited by: §7.
 Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pp. 24–33. Cited by: §4.1, item 1, §7.
 Similarity learning via kernel preserving embedding. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). AAAI Press, pp. 4057–4064. Cited by: item 7.
 Robust pca via nonconvex rank approximation. In Data Mining (ICDM), 2015 IEEE International Conference on, pp. 211–220. Cited by: §1.
 Top-n recommender system via matrix completion. In AAAI, pp. 179–185. Cited by: §1.
 Twin learning for similarity and clustering: a unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press. Cited by: §1, §2, item 6.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §7.
 Saliency detection by multi-task sparsity pursuit. IEEE Transactions on Image Processing 21 (3), pp. 1327–1338. Cited by: §2.
 Nonlinear dimensionality reduction. Springer Science & Business Media. Cited by: §1.
 Learning semi-supervised representation towards a unified optimization framework for semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2767–2775. Cited by: §1, item 4.
 Self-taught low-rank coding for visual learning. IEEE transactions on neural networks and learning systems. Cited by: §2.
 Low-rank models in visual analysis: theories, algorithms, and applications. 1st edition, Elsevier Science Publishing Co Inc. External Links: ISBN 9780128127315 Cited by: §1.
 Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 171–184. Cited by: §1, §1, §7.
 Enhancing low-rank subspace clustering by manifold regularization. IEEE Transactions on Image Processing 23 (9), pp. 4022–4030. Cited by: §2.
 Absent multiple kernel learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §2.
 Robust and efficient subspace segmentation via least squares regression. Computer Vision–ECCV 2012, pp. 347–360. Cited by: §2.
 Low-rank preserving projections. IEEE transactions on cybernetics 46 (8), pp. 1900–1913. Cited by: §2.
 Stacked convolutional autoencoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pp. 52–59. Cited by: §4.
 On spectral clustering: analysis and an algorithm. Advances in neural information processing systems 2, pp. 849–856. Cited by: item 1.
 Multi-view clustering and semi-supervised classification with adaptive neighbours. In AAAI, pp. 2408–2414. Cited by: item 3.
 Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 977–986. Cited by: §1, §2.
 Slim: sparse linear methods for top-n recommender systems. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 497–506. Cited by: §1, §3.
 Kernel sparse subspace clustering. In Image Processing (ICIP), 2014 IEEE International Conference on, pp. 2849–2853. Cited by: §2, item 5, item 5, §7.
 Subspace clustering using log-determinant rank approximation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 925–934. Cited by: §1.
 Constructing the l2-graph for robust subspace learning and subspace clustering. IEEE transactions on cybernetics 47 (4), pp. 1053–1066. Cited by: §1.
 Fuzzy measures on the gene ontology for gene product similarity. IEEE/ACM Transactions on computational biology and bioinformatics 3 (3), pp. 263–274. Cited by: §3.
 Robust nuclear norm regularized regression for face recognition with occlusion. Pattern Recognition 48 (10), pp. 3145–3159. Cited by: §2.
 Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §2, §2.
 Unsupervised feature selection via latent representation learning and manifold regularization. Neural Networks 117, pp. 163–178. Cited by: §1.
 From ensemble clustering to multi-view clustering. In IJCAI, pp. 2843–2849. Cited by: §1.
 A global geometric framework for nonlinear dimensionality reduction. science 290 (5500), pp. 2319–2323. Cited by: §2.
 Low rank subspace clustering (lrsc). Pattern Recognition Letters 43, pp. 47–61. Cited by: §7.
 Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §4.
 Robust kernel low-rank representation. IEEE transactions on neural networks and learning systems 27 (11), pp. 2268–2281. Cited by: §2, item 4, item 5.
 Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §7.
 Data clustering by laplacian regularized l1-graph. In AAAI, pp. 3148–3149. Cited by: §2.
 Multi-view multiple clustering. In IJCAI. Cited by: §2.
 Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3918–3927. Cited by: §7.
 Self-tuning spectral clustering. In NIPS, Vol. 17, pp. 16. Cited by: §1.
 Sparse representation or collaborative representation: which helps face recognition?. In Computer vision (ICCV), 2011 IEEE international conference on, pp. 471–478. Cited by: §2.
 Joint low-rank and sparse principal feature coding for enhanced robust representation and visual classification. IEEE Transactions on Image Processing 25 (6), pp. 2429–2443. Cited by: §1.
 Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §3.3, item 1.
 Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919. Cited by: item 2.
 Measuring patient similarities via a deep architecture with medical concept embedding. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pp. 749–758. Cited by: §3.
 Nonnegative low rank and sparse graph for semisupervised learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2328–2335. Cited by: §2, §3.
 Locality-preserving low-rank representation for graph construction from nonlinear manifolds. Neurocomputing 175, pp. 715–722. Cited by: §2.
 Guest editors’ introduction to the special section on large scale and nonlinear similarity learning for intelligent video analysis. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.