1 Introduction
Clustering is a fundamental unsupervised learning problem with wide applications. Hierarchical clustering, $K$-means clustering, and spectral clustering (SC) are widely used in practice [20]. It is known that interpreting the dendrogram in hierarchical clustering can be difficult in practice, especially for large datasets, and that $K$-means clustering, closely related to Lloyd's algorithm, is not guaranteed to find the optimal solution and performs poorly for nonlinearly separable or nonconvex clusters. SC [13, 35, 31] is a graph-based clustering method that provides a promising alternative for identifying locally connected clusters. Given the data matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of data points and $d$ is the feature dimension, SC constructs a symmetric affinity matrix $W \in \mathbb{R}^{n \times n}$, where $W_{ij}$ measures the pairwise similarity between two data samples $x_i$ and $x_j$ for $i, j = 1, \dots, n$. Denote by $D$ the diagonal matrix with $D_{ii} = \sum_{j=1}^n W_{ij}$. The main step of SC is to solve the following eigenvalue problem:
(1) $\min_{U \in \mathbb{R}^{n \times k}} \ \mathrm{Tr}(U^\top L U), \quad \text{s.t.}\ U^\top U = I_k,$
where $L = I_n - D^{-1/2} W D^{-1/2}$ is the normalized Laplacian matrix, $I_n$ denotes the $n \times n$ identity matrix, and $k$ is the number of clusters. The rows of $U$ can be regarded as an embedding of the data from $\mathbb{R}^d$ to $\mathbb{R}^k$. The cluster assignment is then decided by applying a standard clustering method such as $K$-means clustering to the estimated embedding matrix $U$ obtained by solving (1). Ideally, $U$ should be a sparse matrix such that $U_{ij} \neq 0$ if and only if the $i$th sample belongs to the $j$th cluster. Therefore $UU^\top$ should be a block diagonal matrix which is also sparse. To this end, sparse spectral clustering (SSC) [29, 28, 32] was proposed to impose sparsity on $UU^\top$, which leads to the following optimization problem:
(2) $\min_{U \in \mathbb{R}^{n \times k}} \ \mathrm{Tr}(U^\top L U) + \mu \|UU^\top\|_1, \quad \text{s.t.}\ U^\top U = I_k,$
where $\|UU^\top\|_1 = \sum_{i,j} |(UU^\top)_{ij}|$ is the entrywise $\ell_1$ norm of $UU^\top$ that promotes the sparsity of $UU^\top$, and $\mu > 0$ is a regularization parameter.
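As a concrete reference for the pipeline above, the spectral embedding step can be sketched in a few lines of NumPy (a minimal sketch; the function name and the numerical safeguard are our own, not from the papers cited):

```python
import numpy as np

def spectral_embedding(W, k):
    """Bottom-k eigenvectors of the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} built from a symmetric affinity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated nodes
    L = np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vecs[:, :k]           # n x k embedding matrix U
```

The rows of the returned matrix are then fed to a standard clustering routine such as $K$-means, as described above.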
In practice, the performance of SSC is sensitive to a single measure of similarity between data points, and there are no clear criteria to choose an optimal similarity measure. Moreover, for some very complex data such as single-cell RNA sequencing (scRNA-seq) data [24], one may benefit from considering multiple similarity matrices because they provide more information about the data. The next-generation sequencing technologies provide large detailed catalogs of the transcriptomes of massive numbers of cells to identify putative cell types. Clustering high-dimensional scRNA-seq data provides an informative step to disentangle the complex relationship between different cell types. For example, it is important to characterize the patterns of monoallelic gene expression across mammalian cell types [14], explore the mechanisms that control the progression of lung progenitors across distinct cell types [39], or study the functionally distinct lineage in the bone marrow across mouse conventional dendritic cell types [34]. To this end, Park and Zhao [32] suggest the following similarity matrices, which lead to multiple-kernel SSC (MKSSC):
where $\mathcal{N}_m(i)$ represents the set of sample indices that are the top $m$ nearest neighbors of the sample $i$. The parameters $m$ and $\sigma$ control the size and width of the neighborhoods. We use $\mathcal{P}_1$ and $\mathcal{P}_2$ to denote the sets of possible choices of $m$ and $\sigma$, respectively. Then the total number of similarity matrices is equal to $q = |\mathcal{P}_1| \cdot |\mathcal{P}_2|$. We denote the normalized Laplacian matrices corresponding to these similarity matrices as $L_\ell$, $\ell = 1, \dots, q$. MKSSC can be formulated as the following optimization problem:
(3) $\min_{U, w} \ \sum_{\ell=1}^{q} w_\ell \,\mathrm{Tr}(U^\top L_\ell U) + \mu \|UU^\top\|_1 + \rho \|w\|_2^2$
s.t. $U^\top U = I_k, \quad w \geq 0, \quad \textstyle\sum_{\ell=1}^{q} w_\ell = 1,$
where $w = (w_1, \dots, w_q)^\top$ are unknown weights of the kernels, $\|w\|_2^2$ serves as a regularization term, and $\mu, \rho > 0$ are two regularization parameters. Note that the term $\rho \|w\|_2^2$ yields a closed-form solution of the $w$-subproblem in the iterative algorithm we introduce later, which reduces the computational time.
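To make the kernel construction concrete, the following sketch builds a bank of candidate affinity matrices from Gaussian kernels restricted to nearest neighborhoods. The function names and the exact kernel form are illustrative assumptions; the precise formulas from [32] are not reproduced here:

```python
import numpy as np

def knn_gaussian_kernel(X, m, sigma):
    """Gaussian kernel truncated to each sample's m nearest neighbors,
    then symmetrized (illustrative; not the exact kernel of [32])."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    mask = np.zeros_like(K, dtype=bool)
    nbrs = np.argsort(-K, axis=1)[:, :m + 1]  # top-m neighbors plus self
    np.put_along_axis(mask, nbrs, True, axis=1)
    return np.where(mask | mask.T, K, 0.0)    # symmetric by construction

def kernel_bank(X, m_grid, sigma_grid):
    """All combinations of neighborhood sizes and widths: q = |m_grid| * |sigma_grid|."""
    return [knn_gaussian_kernel(X, m, s) for m in m_grid for s in sigma_grid]
```

Each matrix in the bank then yields one normalized Laplacian $L_\ell$ in the formulation above.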
Note that both SSC (2) and MKSSC (3) are nonconvex and nonsmooth with Riemannian manifold constraints. Therefore, they are both numerically challenging to solve. In this paper, we propose a manifold proximal linear (ManPL) method for solving SSC (2) in Section 2, and an alternating ManPL (AManPL) method for solving MKSSC (3) in Section 3. ManPL and AManPL enjoy convergence guarantees for solving these nonsmooth manifold optimization problems. Moreover, we present numerical experiments in Section 4 to demonstrate the performance of ManPL and AManPL on benchmark datasets, synthetic datasets, and real datasets in single-cell data analysis.
Our contributions are several-fold.

We propose the ManPL method for solving SSC (2) and the AManPL method for solving MKSSC (3).

We analyze the convergence and iteration complexity of both ManPL and AManPL.

We apply our proposed methods to the clustering of single-cell RNA sequencing data.
Notation. Throughout this paper, we use $\mathrm{St}(n, k) = \{U \in \mathbb{R}^{n \times k} : U^\top U = I_k\}$ to denote the Stiefel manifold. The smoothness, convexity, and Lipschitz continuity of a function are always interpreted with the function considered in the ambient Euclidean space. We use $\mathbb{S}^n_+$ to denote the set of $n \times n$ positive semidefinite matrices, and $\mathrm{Tr}(A)$ to denote the trace of a matrix $A$.
2 A Manifold Proximal Linear Method for SSC
Since SSC (2) is both nonsmooth and nonconvex, it is numerically challenging to solve. In the literature, convex relaxations and smooth approximations of (2) have been suggested. In particular, Lu et al. [29] proposed to replace $UU^\top$ with a positive semidefinite matrix $P$ and solve the following convex relaxation:
(4) $\min_{P} \ \langle L, P \rangle + \mu \|P\|_1, \quad \text{s.t.}\ 0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k.$
This convex problem (4) can be solved by classical optimization algorithms such as the ADMM. Denoting the solution of (4) by $P^*$, the solution of (2) can be approximated by the top $k$ eigenvectors of $P^*$. In another work, Lu et al. [28] proposed a nonconvex ADMM to solve the following smooth variant of (2):
(5) $\min_{U} \ \mathrm{Tr}(U^\top L U) + \mu f_\sigma(UU^\top), \quad \text{s.t.}\ U^\top U = I_k,$
where $f_\sigma$ is a smooth function that approximates the regularizer $\|\cdot\|_1$. In [28], the authors used the following smooth function:
(6) $f_\sigma(P) = \sum_{i,j} h_\sigma(P_{ij}), \quad h_\sigma(s) = \begin{cases} s^2/(2\sigma), & |s| \leq \sigma, \\ |s| - \sigma/2, & |s| > \sigma, \end{cases}$
where $\sigma > 0$ is a smoothing parameter. The nonconvex ADMM for solving (5) typically iterates as
(7a) $U^{j+1} = \arg\min_{U^\top U = I_k} \ \mathcal{L}_\beta(U, P^j; \Lambda^j),$
(7b) $P^{j+1} = \arg\min_{P} \ \mathcal{L}_\beta(U^{j+1}, P; \Lambda^j),$
(7c) $\Lambda^{j+1} = \Lambda^j - \beta \big(P^{j+1} - U^{j+1}(U^{j+1})^\top\big),$
where the augmented Lagrangian function is defined as
$\mathcal{L}_\beta(U, P; \Lambda) = \mathrm{Tr}(U^\top L U) + \mu f_\sigma(P) - \langle \Lambda, P - UU^\top \rangle + \frac{\beta}{2} \|P - UU^\top\|_F^2,$
and $\beta > 0$ is a penalty parameter. The two subproblems (7a) and (7b) are both relatively easy to solve. The reason to use the smooth function $f_\sigma$ to approximate $\|\cdot\|_1$ in (5) is the convergence guarantee. In [28], the authors proved that any limit point of the sequence generated by the nonconvex ADMM (7) is a stationary point of (5). This result relies on the fact that the function $f_\sigma$ is smooth. If one applies the ADMM to the original SSC (2), then no convergence guarantee is known.
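For reference, a Huber-type smoothing of the entrywise $\ell_1$ norm is one standard choice of such a smooth surrogate; whether it coincides exactly with the function used in [28] is an assumption here:

```python
import numpy as np

def huber_smoothed_l1(P, sigma):
    """Smooth surrogate of sum |P_ij|: quadratic for |P_ij| <= sigma,
    linear (shifted down by sigma/2) beyond, so its gradient is
    Lipschitz with constant 1/sigma."""
    A = np.abs(P)
    return float(np.sum(np.where(A <= sigma, A ** 2 / (2.0 * sigma), A - sigma / 2.0)))
```

As the smoothing parameter decreases to zero, this surrogate approaches the entrywise $\ell_1$ norm from below.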
Note that both the convex relaxation (4) and the smooth approximation (5) are only approximations to the original SSC (2). In this section, we introduce our ManPL algorithm that solves the original SSC (2) directly. For ease of presentation, we rewrite (2) as
(8) $\min_{U \in \mathcal{M}} \ F(U) := f(U) + h(c(U)),$
where $f(U) = \mathrm{Tr}(U^\top L U)$, $c(U) = UU^\top$, $h(\cdot) = \mu \|\cdot\|_1$, and $\mathcal{M} = \mathrm{St}(n, k)$ is the Stiefel manifold. Moreover, note that $f$ and $c$ are smooth mappings, and $h$ is nonsmooth but convex in the ambient Euclidean space. Therefore, (8) is a Riemannian optimization problem with a nonsmooth objective. Furthermore, throughout this paper, we use $L_f$, $L_c$, and $L_h$ to denote the Lipschitz constants of $\nabla f$, the Jacobian of $c$, and $h$, respectively. Riemannian optimization has drawn much attention recently, due to its wide applications, including low-rank matrix completion [4], phase retrieval [2, 37], phase synchronization [3, 27], and dictionary learning [12, 36]. Several important classes of algorithms for Riemannian optimization with a smooth objective function are covered in the monograph [1]. On the other hand, there had been a limited number of algorithms for Riemannian optimization with a nonsmooth objective until very recently. The most natural idea for this class of optimization problems is the Riemannian subgradient method (RSGM) [19, 21, 23]. Recently, Li et al. [26] studied the RSGM for Riemannian optimization with a weakly convex objective. In particular, they showed that the number of iterations needed by RSGM to obtain an $\epsilon$-stationary point is $O(\epsilon^{-4})$. Motivated by the proximal gradient method for solving composite minimization in Euclidean space, Chen et al. [10] proposed a manifold proximal gradient method (ManPG) for solving the following Riemannian optimization problem, which includes sparse PCA [43] as a special example:
(9) $\min_{X \in \mathcal{M}} \ g(X) + r(X),$
where $\mathcal{M} = \mathrm{St}(n, k)$ is the Stiefel manifold, $g$ is a smooth function, and $r$ is a nonsmooth and convex function. The iteration complexity of ManPG is proved to be $O(\epsilon^{-2})$ for obtaining an $\epsilon$-stationary point [10], which is better than the $O(\epsilon^{-4})$ complexity of RSGM [26]. Variants of ManPG have been designed for different applications, such as the alternating ManPG (AManPG) [11] for sparse PCA and sparse CCA, the manifold proximal point algorithm (ManPPA) [8, 9] for robust subspace recovery and orthogonal dictionary learning, and the stochastic ManPG [40] for online sparse PCA. Motivated by the success of ManPG and its variants, we propose a manifold proximal linear algorithm for solving SSC (8).
The proximal linear method has recently drawn great research attention. It targets optimization problems of the form (8) without the manifold constraint, i.e.,
(10) $\min_{x} \ F(x) := f(x) + h(c(x)),$
where $f$ and $c$ are smooth mappings and $h$ is convex and nonsmooth. The proximal linear method for solving (10) iterates as follows:
(11) $x^{j+1} = \arg\min_{x} \ f(x^j) + \langle \nabla f(x^j), x - x^j \rangle + h\big(c(x^j) + J_c(x^j)(x - x^j)\big) + \frac{1}{2t} \|x - x^j\|^2,$
where $J_c$ is the Jacobian of $c$, and $t > 0$ is a step size. Note that since $h$ is convex, the update (11) is a convex problem. This method has been studied recently in [25, 15, 18] and applied to many important applications such as robust phase retrieval [17], robust matrix recovery/completion [6], and robust blind deconvolution [7].
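In the illustrative special case where $c$ is the identity map and $h$ is a scaled $\ell_1$ norm, the convex subproblem (11) has a closed-form soft-thresholding solution, and the prox-linear step reduces to a proximal gradient step. The sketch below is under these simplifying assumptions, not the general inner solver used in the works cited:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (entrywise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_linear_step(x, grad_f, t, mu):
    """One update of (11) when c(x) = x and h = mu * ||.||_1: minimize the
    linearized f plus mu*||.||_1 plus the proximal term, which amounts to
    soft-thresholding the gradient step x - t * grad_f at level t * mu."""
    return soft_threshold(x - t * grad_f, t * mu)
```

For a general smooth mapping $c$, the subproblem no longer has a closed form and is solved by an inner convex solver instead.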
Due to the nonconvex constraint $U \in \mathcal{M}$, solving (8) is more difficult than solving (10). Motivated by ManPG and the proximal linear method (11), we propose a ManPL algorithm for solving (8). A typical iteration of the ManPL algorithm for solving (8) is:
(12a) $V^k = \arg\min_{V \in \mathrm{T}_{U^k}\mathcal{M}} \ \langle \nabla f(U^k), V \rangle + h\big(c(U^k) + J_c(U^k)[V]\big) + \frac{1}{2t} \|V\|_F^2,$
(12b) $U^{k+1} = \mathrm{Retr}_{U^k}(\alpha_k V^k),$
where $\mathrm{T}_U\mathcal{M}$ denotes the tangent space of $\mathcal{M}$ at $U$, $t$ and $\alpha_k$ are step sizes, and Retr denotes the retraction operation. For the Stiefel manifold, $\mathrm{T}_U\mathcal{M} = \{V \in \mathbb{R}^{n \times k} : U^\top V + V^\top U = 0\}$, and in this paper we choose the retraction operation to be the one based on the polar decomposition (see more details in Appendix A). Equation (12a) computes the descent direction $V^k$ by minimizing a convex function over the tangent space at $U^k$. The objective function is a linearization of the smooth mappings $f$ and $c$, plus a quadratic proximal term. The difference between (12a) and (11) is the constraint $V \in \mathrm{T}_{U^k}\mathcal{M}$ in (12a), which is needed in the Riemannian optimization setting. The retraction step (12b) brings the iterate back to the manifold $\mathcal{M}$.
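The tangent-space constraint is easy to work with because, for the Stiefel manifold, the projection onto the tangent space has a simple closed form. The sketch below uses a standard formula; the function name is ours:

```python
import numpy as np

def proj_tangent_stiefel(U, G):
    """Project an ambient matrix G onto the tangent space of St(n, k) at U:
    Proj(G) = G - U * sym(U^T G), where sym(A) = (A + A^T) / 2."""
    S = U.T @ G
    return G - U @ ((S + S.T) / 2.0)
```

The output V satisfies the tangency condition U^T V + V^T U = 0 by construction.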
The complete description of ManPL for solving SSC (8) is given in Algorithm 1. The step (13) is a line search step to find a step size $\alpha_k$ that yields a sufficient decrease in the function $F$:
(13) $F\big(\mathrm{Retr}_{U^k}(\alpha_k V^k)\big) \leq F(U^k) - \frac{\alpha_k}{2t} \|V^k\|_F^2.$
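A backtracking implementation of such a manifold line search might look as follows. This is a sketch: the retraction here is QR-based for brevity, and the sufficient-decrease constant is an assumed choice rather than the exact one in Algorithm 1:

```python
import numpy as np

def qr_retraction(X, V):
    """QR-based retraction onto the Stiefel manifold (sign-fixed Q factor)."""
    Q, R = np.linalg.qr(X + V)
    return Q * np.sign(np.diag(R))

def line_search(F, X, V, t, alpha0=1.0, beta=0.5, max_backtracks=50):
    """Shrink alpha until F(Retr_X(alpha V)) <= F(X) - alpha/(2t) * ||V||_F^2."""
    F0, v2 = F(X), float((V ** 2).sum())
    alpha = alpha0
    for _ in range(max_backtracks):
        Xn = qr_retraction(X, alpha * V)
        if F(Xn) <= F0 - alpha / (2.0 * t) * v2:
            return Xn, alpha
        alpha *= beta
    return X, 0.0  # fall back to the current iterate
```

For a descent direction, a sufficiently small step always passes the test, so the loop terminates with a positive step size in practice.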
3 An Alternating ManPL Method for Multiple-Kernel SSC
In this section, we consider the multiple-kernel SSC (3). Park and Zhao [32] considered solving the following relaxation of (3), obtained by letting $P = UU^\top$:
(14) $\min_{P, w} \ \Big\langle \sum_{\ell=1}^q w_\ell L_\ell, P \Big\rangle + \mu \|P\|_1 + \rho \|w\|_2^2$
s.t. $0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k, \ w \geq 0, \ \textstyle\sum_{\ell=1}^q w_\ell = 1.$
Park and Zhao [32] suggested an alternating minimization algorithm (AMA) to solve (14). Note that this method is named MPSSC in [32]. In the $\tau$th iteration of AMA, one first fixes $w$ as $w^\tau$ and solves the resulting problem with respect to $P$ to obtain $P^{\tau+1}$, and then fixes $P$ as $P^{\tau+1}$ and solves the resulting problem with respect to $w$. In particular, when $w$ is fixed as $w^\tau$, problem (14) reduces to
(15) $\min_{P} \ \Big\langle \sum_{\ell=1}^q w_\ell^\tau L_\ell, P \Big\rangle + \mu \|P\|_1, \quad \text{s.t.}\ 0 \preceq P \preceq I, \ \mathrm{Tr}(P) = k,$
which is a convex problem and can be solved via a convex ADMM algorithm. When $P$ is fixed as $P^{\tau+1}$, problem (14) reduces to
(16) $\min_{w} \ \langle c^\tau, w \rangle + \rho \|w\|_2^2, \quad \text{s.t.}\ w \geq 0, \ \textstyle\sum_{\ell=1}^q w_\ell = 1,$
where $c^\tau = (c_1^\tau, \dots, c_q^\tau)^\top$ with $c_\ell^\tau = \langle L_\ell, P^{\tau+1} \rangle$. This is also a convex problem, and it can be easily verified that (16) admits a closed-form solution given by
(17) $w^{\tau+1} = \mathrm{Proj}_{\Delta_q}\big(-c^\tau / (2\rho)\big),$
where $\mathrm{Proj}_{\Delta_q}$ denotes the Euclidean projection onto the probability simplex $\Delta_q = \{w \geq 0 : \sum_\ell w_\ell = 1\}$. In summary, a typical iteration of the AMA algorithm proposed by Park and Zhao [32] is as follows:
(18) $P^{\tau+1}$ solves (15) with $w = w^\tau$; $\ w^{\tau+1}$ is given by (17).
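If the weight subproblem takes the quadratic-over-simplex form described above (an assumed form consistent with the text), its closed-form solution is a simplex projection, which can be sketched as:

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex
    {w : w >= 0, sum w = 1}, via the standard sort-based algorithm."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[idx] / (idx + 1.0)
    return np.maximum(v - theta, 0.0)

def weight_update(c, rho):
    """Assumed w-subproblem: min <c, w> + rho * ||w||^2 over the simplex.
    Completing the square shows the minimizer is the simplex projection
    of -c / (2 * rho)."""
    return simplex_projection(-c / (2.0 * rho))
```

Note that kernels with large objective contribution receive small (possibly zero) weight, while a uniform contribution yields uniform weights.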
Another approach to approximate (3) is to combine the idea of the AMA (18) with the nonconvex ADMM for solving the smooth problem (5). In particular, one can solve the following smooth variant of (3):
(19) $\min_{U, w} \ \sum_{\ell=1}^q w_\ell \,\mathrm{Tr}(U^\top L_\ell U) + \mu f_\sigma(UU^\top) + \rho \|w\|_2^2, \quad \text{s.t.}\ U^\top U = I_k, \ w \geq 0, \ \textstyle\sum_\ell w_\ell = 1,$
where $f_\sigma$ is the smooth approximation to $\|\cdot\|_1$ defined in (6). When $w$ is fixed, (19) is of the same form as the smoothed SSC (5), so it can be solved by the nonconvex ADMM (7). When $U$ is fixed, (19) is of the same form as (16) and admits the closed-form solution (17). In summary, the AMA+ADMM algorithm for solving (19) alternates between these two updates:
(20) $U^{\tau+1}$ obtained by the nonconvex ADMM (7) with $L = \sum_\ell w_\ell^\tau L_\ell$; $\ w^{\tau+1}$ given by (17).
By exploiting the structure of (3), we propose to solve (3) by an alternating ManPL algorithm (AManPL). More specifically, in the $\tau$th iteration of AManPL, we first fix $w$ as $w^\tau$; then (3) reduces to
(21) $\min_{U \in \mathcal{M}} \ \mathrm{Tr}\Big(U^\top \Big(\sum_{\ell=1}^q w_\ell^\tau L_\ell\Big) U\Big) + \mu \|UU^\top\|_1,$
which is of the same form as (8) with $L$ in (8) replaced by $\sum_\ell w_\ell^\tau L_\ell$. Therefore, (21) can also be solved by ManPL. Here we adopt one step of ManPL, i.e., (12), to obtain $U^{\tau+1}$. More specifically, $U^{\tau+1}$ is computed by the following two steps:
(22a) $V^\tau = \arg\min_{V \in \mathrm{T}_{U^\tau}\mathcal{M}} \ \langle \nabla f_\tau(U^\tau), V \rangle + h\big(c(U^\tau) + J_c(U^\tau)[V]\big) + \frac{1}{2t} \|V\|_F^2,$
(22b) $U^{\tau+1} = \mathrm{Retr}_{U^\tau}(\alpha_\tau V^\tau),$
where $f_\tau(U) = \mathrm{Tr}(U^\top (\sum_\ell w_\ell^\tau L_\ell) U)$, $c(U) = UU^\top$, and $h(\cdot) = \mu \|\cdot\|_1$. We then fix $U$ in (3) as $U^{\tau+1}$; then (3) reduces to
(23) $\min_{w} \ \langle c^\tau, w \rangle + \rho \|w\|_2^2, \quad \text{s.t.}\ w \geq 0, \ \textstyle\sum_\ell w_\ell = 1,$
where $c^\tau = (c_1^\tau, \dots, c_q^\tau)^\top$ with $c_\ell^\tau = \mathrm{Tr}\big((U^{\tau+1})^\top L_\ell U^{\tau+1}\big)$. This optimization problem has the same form as (16), and again admits the closed-form solution given by (17). The AManPL is described in Algorithm 2.
We have the following convergence and iteration complexity results for AManPL applied to MKSSC (3). The proof is given in Appendix B.2.
Theorem 2
Assume $F$ in (3) is lower bounded by $F^* > -\infty$. Any limit point of the sequence generated by AManPL (Algorithm 2) is a stationary point of problem (3). Moreover, to obtain an $\epsilon$-stationary point, the number of iterations needed by AManPL is $O(\epsilon^{-2})$.
4 Numerical Experiments
In this section, we compare our proposed methods ManPL and AManPL with some existing methods. In particular, for SSC (2), we compare ManPL (Algorithm 1) with the convex ADMM [29] (denoted by CADMM; codes downloaded from https://github.com/canyilu/LibADMM/blob/master/algorithms/sparsesc.m) for solving (4) and the nonconvex ADMM [28] (denoted by NADMM) for solving (5). We also include spectral clustering (denoted by SC) in the comparison. For MKSSC (3), we compare AManPL (Algorithm 2) with MPSSC (i.e., AMA+CADMM; codes downloaded from https://github.com/ishspsy/project/tree/master/MPSSC) [32] and AMA+NADMM (20). Default settings of the parameters for the existing methods were adopted. All the algorithms were terminated when the absolute change of the objective value fell below a given tolerance, which indicates that the algorithm was not making much progress.
Problem  Algorithm  Parameters 

Convex SSC (4)  CADMM [29]  
Smoothed SSC (5)  NADMM (7) [28]  
Original SSC (2)  ManPL (Algorithm 1)  
MKSSC (14)  AMA+CADMM (18) [32]  , 
Smoothed MKSSC (19)  AMA+NADMM (20)  
Original MKSSC (3)  AManPL (Algorithm 2)  , 
4.1 UCI Datasets
We first compare the clustering performance of different methods on three benchmark datasets from the UCI machine learning repository [16]. We list the parameters used in the algorithms in Table 1. We follow [32] to construct the similarity matrices and record the normalized mutual information (NMI) to measure the performance of the clustering. Note that higher NMI scores indicate better clustering performance. The NMI scores are reported in Table 2. From Table 2 we see that the three algorithms for SSC usually outperform SC, and among the three algorithms for SSC, ManPL usually outperforms the other two methods. We also see that the MKSSC model usually generates higher NMI scores than SSC. Moreover, among all three algorithms for SSC and three algorithms for MKSSC, the AManPL for MKSSC always has the largest NMI score. This indicates the great potential of our AManPL method for solving MKSSC. Moreover, we show the heatmap of $UU^\top$ for the Wine dataset in Fig. 1. From this figure we see that for SSC, NADMM and ManPL generate much better results than CADMM in terms of recovering the block matrices, and for MKSSC, AMA+NADMM and AManPL generate much better results than AMA+CADMM. The heatmaps for the other two datasets are similar, so we omit them for brevity.
SSC  MKSSC

Datasets  SC  CADMM  NADMM  ManPL  AMA+CADMM  AMA+NADMM  AManPL  C 
Wine  0.8650  0.8650  0.8782  0.8782  0.8854  0.8854  0.8926  3 
Iris  0.7496  0.7582  0.7582  0.7582  0.7665  0.7601  0.7705  3 
Glass  0.3165  0.3418  0.2047  0.3471  0.2656  0.3315  0.3644  6 
4.2 Synthetic Data
In this subsection, we follow [32] to evaluate the clustering performance of different methods on two synthetic datasets. To compare ManPL, AManPL, and the existing methods, we select the regularization parameter $\mu$ using an independent tuning dataset and set the other parameters as in Table 1. Specifically, we generate the tuning datasets using the same data-generating process, and we select the parameter that maximizes the average NMI score over 50 independent repetitions. The two synthetic datasets are generated as follows.

Synthetic data 1. We randomly generate points in a low-dimensional latent space spanning a circle as the centers of the clusters. For each cluster, we randomly generate points by adding independent noise to its center. We project these latent data to a higher-dimensional space using a linear projection matrix and then add heterogeneous noise to obtain the data matrix $X$. The noise level is a fraction of the radius of the circle in the embedded space.

Synthetic data 2. We randomly generate a mean matrix by drawing its entries independently from a uniform distribution. We randomly assign the cluster labels. We then generate the data matrix $X$ by adding to the cluster means an error matrix with independent, normally distributed entries.
Figure 2 visualizes one realization of the simulated data for these two settings. From Fig. 2 we see that different clusters mix together and the variability between clusters varies. Since we found that MKSSC handles the heterogeneous noise better than SSC in both settings, we only focus on the comparison of AMA+CADMM, AMA+NADMM, and AManPL for MKSSC. In Table 3, we report the NMI scores of the three algorithms for solving MKSSC; the NMI scores are averaged over 50 independent runs, and the numbers in the parentheses are the standard deviations of the NMI scores. From Table 3 we see that AManPL consistently outperforms the other two methods in all tested instances. This is not surprising, because AManPL solves the original MKSSC, while the other two methods only solve its approximations. Moreover, we show the heatmaps of $UU^\top$ for synthetic data 1 in Figure 3 (synthetic data 2 is similar). We see that the block diagonal structure is well recovered by AManPL, which shows much better performance than the other two methods.
Method  AMA+CADMM  AMA+NADMM  AManPL

Synthetic data 1  
100  250  0.9852 (1.2e3)  0.9852 (2.4e3)  0.9933 (1.7e4) 
100  300  0.9789 (2.1e2)  0.9844 (1.4e2)  0.9917 (7.9e4) 
100  500  0.9787 (2.0e3)  0.9834 (1.5e3)  0.9956 (1.2e4) 
200  250  0.9821 (1.8e3)  0.9803 (1.9e3)  0.9955 (6.6e5) 
200  300  0.9830 (1.3e3)  0.9833 (1.3e3)  0.9844 (1.6e4) 
200  500  0.9607 (2.8e3)  0.9606 (2.8e3)  0.9867 (3.6e4) 
Synthetic data 2  
100  250  0.6491 (5.8e3)  0.7163 (3.9e3)  0.7334 (7.6e3) 
100  300  0.6304 (1.3e3)  0.7466 (2.4e3)  0.7881 (1.5e3) 
100  500  0.6253 (2.3e3)  0.7289 (1.3e3)  0.7308 (1.1e3) 
200  250  0.7977 (2.1e3)  0.1371 (2.2e3)  0.9182 (3.2e4) 
200  300  0.7380 (1.1e3)  0.1034 (1.2e3)  0.8773 (1.2e3) 
200  500  0.7130 (9.6e3)  0.1220 (2.1e3)  0.8199 (4.5e3) 
4.3 Single-Cell Data Analysis
Clustering cells and identifying subgroups are important topics in high-dimensional scRNA-seq data analysis. The multiple-kernel learning approach is vital, as clustering scRNA-seq data is usually sensitive to the choice of the number of neighbors and the scaling parameter. Recently, [32] showed that AMA+CADMM for MKSSC provides a promising clustering result and outperforms several state-of-the-art methods such as SC, SSC, t-SNE [30], and SIMLR [41]. In what follows, we focus on the numerical comparison of AMA+CADMM, AMA+NADMM, and AManPL for clustering high-dimensional scRNA-seq data on six real datasets. These six real datasets represent several types of important dynamic processes, such as cell differentiation, and they include information about single-cell types. We follow the procedure of [32] to specify multiple kernels for clustering scRNA-seq data and to choose the proper tuning parameters $\mu$ and $\rho$. The six datasets and the NMI scores of the three algorithms, AMA+CADMM, AMA+NADMM, and AManPL for solving MKSSC, are summarized in Table 4. From Table 4 we observe that AManPL always achieves the highest NMI score and consistently outperforms the other two methods on all six real datasets. We also demonstrate the performance of AManPL by showing the two-dimensional embedding estimated by AManPL in Figure 4. From this figure, we see that AManPL yielded clear and meaningful clusters, even for the Ginhoux dataset [34], which was known to be very challenging. This again demonstrates the great practical potential of our AManPL method for analyzing scRNA-seq data.
Datasets  AMA+CADMM  AMA+NADMM  AManPL  

Deng [14]  0.7319  0.7389  0.7464  7 
Ting [38]  0.9283  0.9524  0.9755  5 
Treutlein [39]  0.7674  0.7229  0.8817  5 
Buettner [5]  0.7929  0.8744  0.8997  3 
Ginhoux [34]  0.6206  0.6398  0.6560  3 
Pollen [33]  0.9439  0.9372  0.9631  11 
5 Conclusion
Motivated by the recent need for analyzing single-cell RNA sequencing data, we considered sparse spectral clustering and multiple-kernel sparse spectral clustering in this paper. We proposed the manifold proximal linear (ManPL) method for solving SSC, and the alternating manifold proximal linear (AManPL) method for solving MKSSC. The convergence and iteration complexity of the proposed methods were analyzed. Numerical results on synthetic data and single-cell RNA sequencing data demonstrated the great potential of our proposed methods.
Appendix A Preliminaries
A.1 Optimality Condition of Manifold Optimization
Definition 1
(Generalized Clarke subdifferential [22]) For a locally Lipschitz function $F$ on $\mathcal{M}$, the Riemannian generalized directional derivative of $F$ at $X \in \mathcal{M}$ in the direction $V \in \mathrm{T}_X\mathcal{M}$ is defined by
(25) $F^\circ(X; V) = \limsup_{Y \to X, \ s \downarrow 0} \frac{F\big(\phi^{-1}(\phi(Y) + s\, \mathrm{D}\phi(Y)[V])\big) - F(Y)}{s},$
where $(\phi, \mathcal{U})$ is a coordinate chart at $X$ and $\mathrm{D}\phi$ denotes the Jacobian of $\phi$. The generalized gradient or the Clarke subdifferential of $F$ at $X$, denoted by $\partial F(X)$, is given by
(26) $\partial F(X) = \big\{ \xi \in \mathrm{T}_X\mathcal{M} : \langle \xi, V \rangle \leq F^\circ(X; V), \ \forall V \in \mathrm{T}_X\mathcal{M} \big\}.$
Definition 2
For a smooth function $g$ over a Riemannian submanifold $\mathcal{M}$, if the metric on the manifold is induced by the Euclidean inner product in the ambient space, then we know that $\mathrm{grad}\, g(X) = \mathrm{Proj}_{\mathrm{T}_X\mathcal{M}}(\nabla g(X))$. Here $\mathrm{grad}\, g$ denotes the Riemannian gradient of $g$, and $\mathrm{Proj}_{\mathrm{T}_X\mathcal{M}}$ denotes the projection onto $\mathrm{T}_X\mathcal{M}$. According to Lemma 5.1 in [42], for a regular function (see the definition in [42]; a convex function is regular, the sum of regular functions is regular, and the generalized subdifferential of a regular function is the projection of its convex subdifferential), the Clarke subdifferential is the projection of the Euclidean subdifferential onto the tangent space. Moreover, the function $F$ in problem (8) is weakly convex and thus regular according to Lemma 5.1 in [42]. By Theorem 4.1 in [42], the first-order optimality condition of problem (8) is given by
(27) $0 \in \mathrm{Proj}_{\mathrm{T}_U\mathcal{M}}\big(\partial F(U)\big).$
Definition 3
Definition 4
A retraction on a differentiable manifold $\mathcal{M}$ is a smooth mapping Retr from the tangent bundle $\mathrm{T}\mathcal{M}$ onto $\mathcal{M}$ satisfying the following two conditions (here $\mathrm{Retr}_X$ denotes the restriction of Retr onto $\mathrm{T}_X\mathcal{M}$):

$\mathrm{Retr}_X(0) = X$ for all $X \in \mathcal{M}$, where $0$ denotes the zero element of $\mathrm{T}_X\mathcal{M}$.

For any $X \in \mathcal{M}$, it holds that $\lim_{V \to 0} \|\mathrm{Retr}_X(V) - (X + V)\|_F / \|V\|_F = 0.$

The retraction onto the Euclidean space is simply the identity mapping: $\mathrm{Retr}_X(V) = X + V$. Common retractions onto the Stiefel manifold include the polar decomposition:
$\mathrm{Retr}_X(V) = (X + V)\big((X + V)^\top (X + V)\big)^{-1/2};$
the QR decomposition:
$\mathrm{Retr}_X(V) = \mathrm{qf}(X + V),$
where $\mathrm{qf}(A)$ is the $Q$ factor of the QR factorization $A = QR$; and the Cayley transformation:
$\mathrm{Retr}_X(V) = \Big(I - \frac{1}{2} W(V)\Big)^{-1} \Big(I + \frac{1}{2} W(V)\Big) X,$
where
$W(V) = \Big(I - \frac{1}{2} X X^\top\Big) V X^\top - X V^\top \Big(I - \frac{1}{2} X X^\top\Big).$
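The polar and Cayley retractions can be sketched as follows (standard formulas; the function names are ours):

```python
import numpy as np

def retr_polar(X, V):
    """Polar retraction: (X + V) ((X + V)^T (X + V))^{-1/2}."""
    Y = X + V
    S, Q = np.linalg.eigh(Y.T @ Y)             # Gram matrix is k x k and SPD
    return Y @ (Q * (1.0 / np.sqrt(S))) @ Q.T  # Y (Y^T Y)^{-1/2}

def retr_cayley(X, V):
    """Cayley retraction: (I - W/2)^{-1} (I + W/2) X with the skew-symmetric
    matrix W = P V X^T - X V^T P, where P = I - X X^T / 2."""
    n = X.shape[0]
    P = np.eye(n) - 0.5 * X @ X.T
    Wm = P @ V @ X.T - X @ V.T @ P
    return np.linalg.solve(np.eye(n) - 0.5 * Wm, (np.eye(n) + 0.5 * Wm) @ X)
```

Since the Cayley transform of a skew-symmetric matrix is orthogonal, both maps return matrices with orthonormal columns.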
Lemma 1
Lemma 2
The function $c(U) = UU^\top$ in (8) is Lipschitz continuous, and its Jacobian $J_c$ is Lipschitz continuous, with Lipschitz constants $2\sqrt{k}$ and $2$, respectively.
Proof
Since $c(U) - c(V) = U(U - V)^\top + (U - V)V^\top$, where $\|U\|_F = \|V\|_F = \sqrt{k}$ for $U, V \in \mathrm{St}(n, k)$, we immediately get that $c$ is Lipschitz continuous with constant $2\sqrt{k}$. Secondly, observing that $J_c(U)[V] = UV^\top + VU^\top$ for any $V$, we have
$\big\|\big(J_c(U) - J_c(W)\big)[V]\big\|_F = \|(U - W)V^\top + V(U - W)^\top\|_F \leq 2 \|U - W\|_F \|V\|_F.$
Hence, $J_c$ is Lipschitz continuous with constant $2$.
Appendix B Proofs
B.1 Convergence Analysis of ManPL (Algorithm 1)
Before we present the results on ManPL, we need the following useful lemma on the proximal linear algorithm.
Now, we start to analyze the convergence and iteration complexity of our ManPL algorithm. The following lemma states that the minimizer of (12a) is a descent direction on the tangent space for the local model.
Lemma 4
Proof
Since $V^k$ is the minimizer of (12a), we have, for any $V \in \mathrm{T}_{U^k}\mathcal{M}$:
which implies that
Using the convexity of , we have
Letting , we get
Finally, from the convexity of we get
which completes the proof.
The following lemma suggests that the new point has a lower function value.
Lemma 5
Given any $t > 0$, let $V^k$ be the optimal solution of (12a). Denote $U^{k+1} = \mathrm{Retr}_{U^k}(\alpha V^k)$; then there exists a constant $\bar{\alpha} > 0$ such that
(32) 
Proof
We prove the claim by induction. Denote $U_\alpha = \mathrm{Retr}_{U^k}(\alpha V^k)$. For $\alpha \in (0, 1]$, by using the convexity of $h$ and the Lipschitz continuity of $c$, we can show that:
(33) 
Since $\nabla f$ is Lipschitz continuous, we obtain