A Manifold Proximal Linear Method for Sparse Spectral Clustering with Application to Single-Cell RNA Sequencing Data Analysis

Zhongruo Wang, et al. (07/18/2020)
Spectral clustering is one of the fundamental unsupervised learning methods widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity on spectral clustering, which improves the interpretability of the model. This paper considers a widely adopted model for SSC, which can be formulated as an optimization problem over the Stiefel manifold with a nonsmooth and nonconvex objective. Such an optimization problem is very challenging to solve. Existing methods usually solve its convex relaxation or need to smooth its nonsmooth part using certain smoothing techniques. In this paper, we propose a manifold proximal linear method (ManPL) that solves the original SSC formulation. We also extend the algorithm to the multiple-kernel SSC problem, for which an alternating ManPL algorithm is proposed. Convergence and iteration complexity results of the proposed methods are established. We demonstrate the advantage of our proposed methods over existing methods via single-cell RNA sequencing data analysis.


1 Introduction

Clustering is a fundamental unsupervised learning problem with wide applications. Hierarchical clustering, k-means clustering, and spectral clustering (SC) are widely used in practice [20]. It is known that interpreting the dendrogram produced by hierarchical clustering can be difficult in practice, especially for large datasets, and that k-means clustering, which is closely related to Lloyd's algorithm, is not guaranteed to find the optimal solution and performs poorly for non-linearly separable or non-convex clusters. SC [13, 35, 31] is a graph-based clustering method and provides a promising alternative for identifying locally connected clusters.

Given the data matrix X ∈ R^{n×d}, where n is the number of data points and d is the feature dimension, SC constructs a symmetric affinity matrix W ∈ R^{n×n}, where W_{ij} measures the pairwise similarity between data samples i and j. Denote by D the diagonal matrix with D_{ii} = Σ_j W_{ij}. The main step of SC is to solve the following eigenvalue problem:

(1)   min_{U ∈ R^{n×k}} ⟨L, UU^⊤⟩,   s.t. U^⊤U = I_k,

where L = I_n − D^{−1/2} W D^{−1/2} is the normalized Laplacian matrix, I_n denotes the identity matrix, and k is the number of clusters; the solution of (1) is given by the k eigenvectors of L associated with its k smallest eigenvalues. The rows of U can be regarded as an embedding of the data from R^d to R^k. The cluster assignment is then decided by applying a standard clustering method, such as k-means clustering, to the estimated embedding matrix U obtained by solving (1). Ideally, U should be a sparse matrix such that U_{ij} ≠ 0 if and only if sample i belongs to the j-th cluster. Therefore UU^⊤ should be a block diagonal matrix which is also sparse. To this end, sparse spectral clustering (SSC) [29, 28, 32] was proposed to impose sparsity on UU^⊤, which leads to the following optimization problem:

(2)   min_{U ∈ R^{n×k}} ⟨L, UU^⊤⟩ + λ ‖UU^⊤‖_1,   s.t. U^⊤U = I_k,

where ‖UU^⊤‖_1 is the entry-wise ℓ1 norm of UU^⊤ that promotes the sparsity of UU^⊤, and λ > 0 is a regularization parameter.
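To make the SC pipeline around (1) concrete, here is a minimal illustrative sketch in Python. The Gaussian affinity, the bandwidth sigma, and the use of scikit-learn's KMeans are our own illustrative choices, not necessarily the similarity or clustering routine used in this paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X (n x d) into k groups via normalized spectral clustering."""
    n = X.shape[0]
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))   # Gaussian affinity matrix
    np.fill_diagonal(W, 0.0)                                   # drop self-similarity
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt                # normalized Laplacian
    # U collects the k eigenvectors of L with the smallest eigenvalues,
    # i.e., the minimizer of <L, U U^T> over U^T U = I_k.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels, U
```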

In practice, the performance of SSC is sensitive to the choice of the similarity measure between data points, and there are no clear criteria for choosing an optimal one. Moreover, for very complex data such as single-cell RNA sequencing (scRNA-seq) data [24], one may benefit from considering multiple similarity matrices because they provide more information about the data. Next-generation sequencing technologies provide large, detailed catalogs of the transcriptomes of massive numbers of cells for identifying putative cell types. Clustering high-dimensional scRNA-seq data is an informative step toward disentangling the complex relationships between different cell types. For example, it is important to characterize the patterns of monoallelic gene expression across mammalian cell types [14], explore the mechanisms that control the progression of lung progenitors across distinct cell types [39], or study the functionally distinct lineages in the bone marrow across mouse conventional dendritic cell types [34]. To this end, Park and Zhao [32] suggest the following similarity matrices, which lead to multiple-kernel SSC (MKSSC):

where represents a set of sample indices that are the top nearest neighbors of the sample . The parameters and control the width of the neighborhoods. We use and to denote the sets of possible choices of and , respectively. Then the total number of similarity matrices is equal to . We denote the normalized Laplacian matrices corresponding to these similarity matrices as , . The MKSSC can be formulated as the following optimization problem:

(3)
s.t.

where the weights are unknown weightings of the kernels, the last term serves as a regularization term on the weights, and there are two regularization parameters. Note that this regularization term yields a closed-form solution for the kernel weights in the iterative algorithm we introduce later, which reduces the computational time.
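To illustrate how a family of kernels parameterized by a neighborhood size and a width parameter can be generated, here is a rough sketch in the spirit of [32]. The specific local-bandwidth rule, the candidate grids knn_list and sigma_list, and all constants are our own assumptions; the exact kernel formula is the one of Park and Zhao [32].

```python
import itertools
import numpy as np
from scipy.spatial.distance import cdist

def build_laplacians(X, knn_list=(10, 15, 20), sigma_list=(1.0, 1.5, 2.0)):
    """Return one normalized Laplacian per (neighborhood size, width) pair."""
    n = X.shape[0]
    D2 = cdist(X, X, "sqeuclidean")
    laplacians = []
    for knn, sigma in itertools.product(knn_list, sigma_list):
        # local bandwidth: average distance to the `knn` nearest neighbors, scaled by sigma
        nn_dist = np.sort(np.sqrt(D2), axis=1)[:, 1:knn + 1].mean(axis=1)
        bw = sigma * 0.5 * (nn_dist[:, None] + nn_dist[None, :]) + 1e-12
        W = np.exp(-D2 / (2 * bw**2))
        np.fill_diagonal(W, 0.0)
        W = 0.5 * (W + W.T)                         # symmetrize the affinity
        d = np.maximum(W.sum(axis=1), 1e-12)
        Dinv = np.diag(1.0 / np.sqrt(d))
        laplacians.append(np.eye(n) - Dinv @ W @ Dinv)
    return laplacians                               # len(knn_list) * len(sigma_list) matrices
```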

Note that both SSC (2) and MKSSC (3) are nonconvex and nonsmooth with Riemannian manifold constraints. Therefore, they are both numerically challenging to solve. In this paper, we propose a manifold proximal linear (ManPL) method for solving SSC (2) in Section 2, and an alternating ManPL (AManPL) method for solving MKSSC (3) in Section 3. ManPL and AManPL enjoy convergence guarantees for solving these nonsmooth manifold optimization problems. Moreover, we present numerical experiments in Section 4 to demonstrate the performance of ManPL and AManPL on benchmark datasets, synthetic datasets, and real datasets in single-cell data analysis.

Our contributions are threefold.

  1. We propose the ManPL method for solving SSC (2) and the AManPL method for solving MKSSC (3).

  2. We analyze the convergence and iteration complexity of both ManPL and AManPL.

  3. We apply the proposed methods to the clustering of single-cell RNA sequencing data.

Notation. Throughout this paper, we use St(n, k) := {U ∈ R^{n×k} : U^⊤U = I_k} to denote the Stiefel manifold. The smoothness, convexity, and Lipschitz continuity of a function are always interpreted with the function viewed in the ambient Euclidean space. We use S_+^n to denote the set of n×n positive semidefinite matrices, and tr(A) to denote the trace of a matrix A.

2 A Manifold Proximal Linear Method for SSC

Since SSC (2) is both nonsmooth and nonconvex, it is numerically challenging to solve. In the literature, convex relaxations and smooth approximations of (2) have been suggested. In particular, Lu et al. [29] proposed to replace UU^⊤ with a positive semidefinite matrix and solve the following convex relaxation:

(4)

This convex problem (4) can be solved by classical optimization algorithms such as ADMM. The solution of (2) can then be approximated by the top k eigenvectors of the solution of (4). In another work, Lu et al. [28] proposed a nonconvex ADMM to solve the following smooth variant of (2):

(5)

where a smooth function is used to approximate the ℓ1 regularizer. In [28], the authors used the following smooth function:

(6)

where μ > 0 is a smoothing parameter. The nonconvex ADMM for solving (5) typically iterates as

(7a)
(7b)
(7c)

where the augmented Lagrangian function is defined as

and ρ > 0 is a penalty parameter. The two subproblems (7a) and (7b) are both relatively easy to solve. The reason to use a smooth function to approximate the ℓ1 regularizer in (5) is to enable a convergence guarantee. In [28], the authors proved that any limit point of the sequence generated by the nonconvex ADMM (7) is a stationary point of (5). This result relies on the fact that the smoothed regularizer is smooth. If one applies ADMM to the original SSC (2), then no convergence guarantee is known.
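For intuition on the smoothing step, the sketch below uses a Huber-type smooth surrogate of the entry-wise ℓ1 norm. This is only an illustration: the exact smooth function used in [28] is the one given in (6) and may differ; the argument mu plays the role of the smoothing parameter μ.

```python
import numpy as np

def smoothed_l1(Z, mu):
    """Huber-type smooth surrogate of sum(|Z_ij|); it agrees with |Z_ij| - mu/2 when |Z_ij| >= mu."""
    A = np.abs(Z)
    return np.where(A <= mu, A**2 / (2 * mu), A - mu / 2).sum()

def smoothed_l1_grad(Z, mu):
    """Gradient of the surrogate; it approaches sign(Z) entrywise as mu -> 0."""
    return np.clip(Z / mu, -1.0, 1.0)
```

The gradient of this surrogate is (1/mu)-Lipschitz, which is the kind of smoothness the convergence analysis of the nonconvex ADMM relies on.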

Note that both the convex relaxation (4) and the smooth approximation (5) are only approximations to the original SSC (2). In this section, we introduce our ManPL algorithm that solves the original SSC (2) directly. For ease of presentation, we rewrite (2) as

(8)   min_{U ∈ M} F(U) := f(U) + h(c(U)),

where f(U) := ⟨L, UU^⊤⟩, c(U) := UU^⊤, h(·) := λ‖·‖_1, and M := St(n, k) is the Stiefel manifold. Moreover, note that c and f are smooth mappings, and h is nonsmooth but convex in the ambient Euclidean space. Therefore, (8) is a Riemannian optimization problem with a nonsmooth objective. Furthermore, throughout this paper, we use , , to denote the Lipschitz constants of , , and , respectively. Riemannian optimization has drawn much attention recently, due to its wide applications, including low rank matrix completion [4], phase retrieval [2, 37], phase synchronization [3, 27], and dictionary learning [12, 36]. Several important classes of algorithms for Riemannian optimization with a smooth objective function are covered in the monograph [1]. On the other hand, there had been a limited number of algorithms for Riemannian optimization with a nonsmooth objective until very recently. The most natural idea for this class of optimization problems is the Riemannian subgradient method (RSGM) [19, 21, 23]. Recently, Li et al. [26] studied the RSGM for Riemannian optimization with a weakly convex objective. In particular, they showed that the number of iterations needed by RSGM to obtain an ε-stationary point is O(ε^{-4}). Motivated by the proximal gradient method for solving composite minimization in Euclidean space, Chen et al. [10] proposed a manifold proximal gradient method (ManPG) for solving the following Riemannian optimization problem, which includes sparse PCA [43] as a special example:

(9)   min_{X ∈ M} f(X) + h(X),

where M is the Stiefel manifold, f is a smooth function, and h is a nonsmooth convex function. The iteration complexity of ManPG for obtaining an ε-stationary point is proved to be O(ε^{-2}) [10], which is better than the complexity of RSGM [26]. Variants of ManPG have been designed for different applications, such as alternating ManPG (A-ManPG) [11] for sparse PCA and sparse CCA, the manifold proximal point algorithm (ManPPA) [8, 9] for robust subspace recovery and orthogonal dictionary learning, and stochastic ManPG [40] for online sparse PCA. Motivated by the success of ManPG and its variants, we propose a manifold proximal linear algorithm for solving SSC (8).

The proximal linear method has recently drawn considerable research attention. It targets optimization problems of the form of (8) without the manifold constraint, i.e.,

(10)   min_x f(x) + h(c(x)),

where c and f are smooth mappings and h is convex and nonsmooth. The proximal linear method for solving (10) iterates as follows:

(11)   x_{k+1} := argmin_x ⟨∇f(x_k), x − x_k⟩ + h(c(x_k) + J_c(x_k)(x − x_k)) + (1/(2t)) ‖x − x_k‖²,

where J_c denotes the Jacobian of c, and t > 0 is a step size. Note that since h is convex, the update (11) is a convex problem. This method has been studied recently in [25, 15, 18] and applied to many important applications such as robust phase retrieval [17], robust matrix recovery/completion [6], and robust blind deconvolution [7].

Due to the nonconvex constraint U ∈ M, solving (8) is more difficult than solving (10). Motivated by ManPG and the proximal linear method (11), we propose a ManPL algorithm for solving (8). A typical iteration of the ManPL algorithm for solving (8) is:

(12a)   V_k := argmin_{V ∈ T_{U_k}M} ⟨∇f(U_k), V⟩ + h(c(U_k) + J_c(U_k)[V]) + (1/(2t)) ‖V‖_F²,
(12b)   U_{k+1} := Retr_{U_k}(α_k V_k),

where T_{U_k}M denotes the tangent space of M at U_k, t > 0 and α_k > 0 are step sizes, and Retr denotes the retraction operation. For the Stiefel manifold, T_U M = {V ∈ R^{n×k} : U^⊤V + V^⊤U = 0}, and in this paper we choose the retraction operation to be the one based on the polar decomposition (see more details in Appendix A). Step (12a) computes the descent direction V_k by minimizing a convex function over the tangent space of M at U_k. The objective function in (12a) is a linearization of the smooth mappings c and f, plus a quadratic proximal term. The difference between (12a) and (11) is the tangent-space constraint in (12a), which is needed in the Riemannian optimization setting. The retraction step (12b) brings the iterate back to the manifold M.

The complete description of ManPL for solving SSC (8) is given in Algorithm 1. Step (13) is a line search that finds a step size guaranteeing a sufficient decrease of the objective function F.

  Input: initial point , step size , parameter
  for  do
     Calculate by solving (12a)
     Let be the smallest nonnegative integer such that
(13)
     Let and compute by (12b)
  end for
Algorithm 1 The ManPL Method for Solving SSC (8)
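To summarize the structure of Algorithm 1, here is a Python skeleton (a sketch, not the authors' implementation). The convex tangent-space subproblem (12a) is delegated to a user-supplied routine solve_subproblem (in practice one would use a dedicated convex solver), and the sufficient-decrease test is a generic Armijo-type condition whose constants eta and delta are our own placeholders rather than the exact condition (13).

```python
import numpy as np

def polar_retraction(U, V):
    """Retr_U(V): polar factor of U + V, which lies on the Stiefel manifold."""
    W, _, Zt = np.linalg.svd(U + V, full_matrices=False)
    return W @ Zt

def manpl(U0, f, grad_f, c, h, solve_subproblem,
          t=1.0, eta=0.5, delta=1e-4, max_iter=100, tol=1e-8):
    """Sketch of the ManPL loop: direction from (12a), line search, retraction (12b)."""
    def F(X):
        return f(X) + h(c(X))

    U = U0
    for _ in range(max_iter):
        # solve_subproblem must return the minimizer V_k of the convex
        # tangent-space subproblem (12a) at the current iterate U.
        V = solve_subproblem(U, grad_f(U), c(U), t)
        if np.linalg.norm(V) < tol:            # V_k close to zero: (near-)stationarity
            break
        alpha, F_old = 1.0, F(U)
        # backtracking line search: shrink alpha until a sufficient decrease holds
        while F(polar_retraction(U, alpha * V)) > F_old - delta * alpha * np.linalg.norm(V) ** 2:
            alpha *= eta
            if alpha < 1e-12:
                break
        U = polar_retraction(U, alpha * V)
    return U
```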

The main convergence result of ManPL (Algorithm 1) is given as Theorem 1. Its proof is given in Appendix B.1.

Theorem 1

Assume the objective of (8) is lower bounded. Any limit point of the sequence generated by ManPL (Algorithm 1) is a stationary point of (8). Moreover, ManPL returns an ε-stationary point of (8) in O(ε^{-2}) iterations.

3 An Alternating ManPL Method for Multiple-Kernel SSC

In this section, we consider the multiple-kernel SSC (3). Park and Zhao [32] consider solving the following relaxation of (3):

(14)
s.t.

Park and Zhao [32] suggested using an alternating minimization algorithm (AMA) to solve (14); this method is named MPSSC in [32]. In each iteration of AMA, one first fixes the kernel weights and solves the resulting problem with respect to the embedding, and then fixes the embedding and solves the resulting problem with respect to the weights. In particular, when the weights are fixed, problem (14) reduces to

(15)

which is a convex problem and can be solved via a convex ADMM algorithm. When the embedding is fixed, problem (14) reduces to

(16)

where , . This is also a convex problem and it can be easily verified that (16) admits a closed-form solution given by

(17)

In summary, a typical iteration of the AMA algorithm proposed by Park and Zhao [32] is as follows:

(18)

Another approach to approximate (3) is to combine the idea of AMA (18) and the nonconvex ADMM for solving the smooth problem (5). In particular, one can solve the following smooth variant of (3):

(19)

where the ℓ1 regularizer is replaced by the smooth approximation defined in (6). When the kernel weights are fixed, (19) is of the same form as the smoothed SSC (5), so it can be solved by the nonconvex ADMM (7). When the embedding is fixed, (19) is of the same form as (16) and admits the closed-form solution (17). In summary, the AMA+NADMM algorithm for solving (19) works as follows:

(20)

By exploiting the structure of (3), we propose to solve (3) by an alternating ManPL algorithm (AManPL). More specifically, in each iteration of AManPL, we first fix the kernel weights; then (3) reduces to

(21)

which is of the same form as (8), with the Laplacian in (8) replaced by the weighted combination of the Laplacians. Therefore, (21) can also be solved by ManPL. Here we adopt one step of ManPL, i.e., (12), to obtain the new iterate. More specifically, it is computed by the following two steps:

(22a)
(22b)

where the notation is as in (12) with the weighted Laplacian in place of L. We then fix the embedding in (3) at the new iterate, and (3) reduces to

(23)

where the coefficients are defined analogously to those in (16). This optimization problem has the same form as (16) and again admits a closed-form solution, given by (17). The AManPL method is described in Algorithm 2.
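As an illustration of why this step is cheap, the sketch below computes the weight update in closed form under the assumption that the weight regularizer in (3) is the negative entropy rho * sum_m w_m log w_m over the simplex. This assumption is consistent with the closed-form solution mentioned after (3), but the exact coefficients and the formula (17) should be taken from the paper.

```python
import numpy as np

def update_weights(U, laplacians, rho):
    """Entropy-regularized kernel-weight update on the simplex (illustrative).

    Minimizes sum_m w_m * a_m + rho * sum_m w_m * log(w_m) subject to
    sum_m w_m = 1, w_m >= 0, where a_m = <L_m, U U^T>.
    """
    UUt = U @ U.T
    a = np.array([np.sum(L * UUt) for L in laplacians])
    z = np.exp(-(a - a.min()) / rho)      # subtract a.min() for numerical stability
    return z / z.sum()
```

The minimizer of a linear function plus a negative-entropy term over the simplex is a softmax, which is why the weight update never needs an iterative solver.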

  Input: initial point , step size , parameter , let be the optimal solution to (23) for
  for  do
     Calculate by solving (22a)
     Let be the smallest nonnegative integer such that
(24)
     Let and compute by (22b)
     Update by (17) with , .
  end for
Algorithm 2 The AManPL Method for Solving MKSSC (3)

We have the following convergence and iteration complexity analysis for AManPL for solving MKSSC (3). Its proof is given in Appendix B.2.

Theorem 2

Assume the objective of (3) is lower bounded. Any limit point of the sequence generated by AManPL (Algorithm 2) is a stationary point of problem (3). Moreover, to obtain an ε-stationary point, the number of iterations needed by AManPL is O(ε^{-2}).

4 Numerical Experiments

In this section, we compare our proposed methods, ManPL and AManPL, with some existing methods. In particular, for SSC (2), we compare ManPL (Algorithm 1) with the convex ADMM [29] (denoted by CADMM; code downloaded from https://github.com/canyilu/LibADMM/blob/master/algorithms/sparsesc.m) for solving (4) and the nonconvex ADMM [28] (denoted by NADMM) for solving (5). We also include spectral clustering (denoted by SC) in the comparison. For MKSSC (3), we compare AManPL (Algorithm 2) with MPSSC (i.e., AMA+CADMM; code downloaded from https://github.com/ishspsy/project/tree/master/MPSSC) [32] and AMA+NADMM (20). Default settings of the parameters for the existing methods were adopted. All algorithms were terminated when the absolute change of the objective value became smaller than a given tolerance, which indicates that the algorithm was no longer making much progress.

Problem Algorithm Parameters
Convex SSC (4) CADMM [29]
Smoothed SSC (5) NADMM (7) [28]
Original SSC (2) ManPL (Algorithm 1)
MKSSC (14) AMA+CADMM (18) [32] ,
Smoothed MKSSC (19) AMA+NADMM (20)
Original MKSSC (3) AManPL (Algorithm 2) ,
Table 1: Algorithms and their parameters

4.1 UCI Datasets

We first compare the clustering performance of different methods on three benchmark datasets from the UCI Machine Learning Repository [16]. We list the parameters used in the algorithms in Table 1. We follow [32] to construct the similarity matrices and record the normalized mutual information (NMI) to measure the clustering performance; higher NMI scores indicate better clustering performance. The NMI scores are reported in Table 2. From Table 2 we see that the three algorithms for SSC usually outperform SC, and among the three algorithms for SSC, ManPL usually outperforms the other two methods. We also see that the MKSSC model usually generates higher NMI scores than SSC. Moreover, among the three algorithms for SSC and the three algorithms for MKSSC, AManPL for MKSSC always has the largest NMI score. This indicates the great potential of our AManPL method for solving MKSSC. In addition, we show the heatmap of UU^⊤ for the Wine dataset in Fig. 1. From this figure we see that, for SSC, NADMM and ManPL generate much better results than CADMM in terms of recovering the block structure, and for MKSSC, AMA+NADMM and AManPL generate much better results than AMA+CADMM. The heatmaps for the other two datasets are similar, so we omit them for brevity.
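For reference, the NMI score can be computed with scikit-learn; the toy labels below are only an example.

```python
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground-truth cluster labels
labels_pred = [1, 1, 0, 0, 2, 2]   # labels produced by a clustering algorithm
# NMI is invariant to permutations of the label names; scores lie in [0, 1].
print(normalized_mutual_info_score(labels_true, labels_pred))  # prints 1.0
```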

Datasets SC | SSC: CADMM NADMM ManPL | MKSSC: AMA+CADMM AMA+NADMM AManPL | #Clusters
Wine 0.8650 0.8650 0.8782 0.8782 0.8854 0.8854 0.8926 3
Iris 0.7496 0.7582 0.7582 0.7582 0.7665 0.7601 0.7705 3
Glass 0.3165 0.3418 0.2047 0.3471 0.2656 0.3315 0.3644 6
Table 2: Comparison of the NMI scores on the UCI data sets.
Figure 1: The heatmaps of UU^⊤ on the Wine dataset estimated by (a) SC, (b) CADMM for SSC, (c) NADMM for SSC, (d) ManPL for SSC, (e) AMA+CADMM for MKSSC, (f) AMA+NADMM for MKSSC, and (g) AManPL for MKSSC.

4.2 Synthetic Data

In this subsection, we follow [32] to evaluate the clustering performance of different methods on two synthetic datasets. To compare ManPL, AManPL, and the existing methods, we select the regularization parameter using an independent tuning dataset and set the other parameters as in Table 1. Specifically, we generate the tuning datasets using the same data generating process, and we select the parameter that maximizes the average NMI score over 50 independent repetitions. The two synthetic datasets are generated as follows.

  • Synthetic data 1. We randomly generate points on a circle in a low-dimensional latent space as the centers of the clusters. For each cluster, we randomly generate points by adding independent noise to its center. We project these latent points to a higher-dimensional space using a linear projection matrix and then add heterogeneous noise to obtain the data matrix X. The noise level is a fraction of the radius of the circle in the embedded space. (An illustrative sketch of this generating process is given after this list.)

  • Synthetic data 2. We randomly generate a matrix by drawing its entries independently from a uniform distribution, and we randomly assign the cluster labels. The data matrix is then generated by adding an error matrix with independent normally distributed entries.
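A minimal illustrative sketch of the first generating process is as follows; all dimensions, the number of clusters, the radius, and the noise level are placeholder values rather than the settings used in our experiments.

```python
import numpy as np

def synthetic_data_1(n_per_cluster=50, k=4, ambient_dim=100, noise=0.1, radius=1.0, seed=0):
    """Clusters centered on a circle in a 2-D latent space, embedded into a
    high-dimensional space by a random linear map, with additive noise."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=k)
    centers = radius * np.column_stack([np.cos(angles), np.sin(angles)])   # k x 2 centers
    Z = np.vstack([c + noise * radius * rng.standard_normal((n_per_cluster, 2))
                   for c in centers])                                      # latent points
    labels = np.repeat(np.arange(k), n_per_cluster)
    A = rng.standard_normal((2, ambient_dim))                              # linear projection
    X = Z @ A + noise * radius * rng.standard_normal((Z.shape[0], ambient_dim))
    return X, labels
```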

Figure 2 visualizes one realization of the simulated data for these two settings. From Fig. 2 we see that different clusters mix together and the variability between clusters varies. Since we found that MKSSC handles the heterogeneous noise better than SSC in both settings, we only focus on the comparison of AMA+CADMM, AMA+NADMM, and AManPL for MKSSC. In Table 3, we report the NMI scores of the three algorithms for solving MKSSC; the NMI scores are averaged over 50 independent runs, and the numbers in parentheses are the standard deviations of the NMI scores. From Table 3 we see that AManPL consistently outperforms the other two methods in all tested instances. This is not surprising because AManPL solves the original MKSSC, while the other two methods only solve its approximations. Moreover, we show the heatmaps of UU^⊤ for synthetic data 1 in Figure 3 (the results for synthetic data 2 are similar). We see that the block diagonal structure is well recovered by AManPL, which shows much better performance than the other two methods.

Figure 2: Illustration of one realization of the synthetic data: (a) Synthetic data 1; (b) Synthetic data 2.
Method AMA+CADMM AMA+NADMM AManPL
Synthetic data 1
100 250 0.9852 (1.2e-3) 0.9852 (2.4e-3) 0.9933 (1.7e-4)
100 300 0.9789 (2.1e-2) 0.9844 (1.4e-2) 0.9917 (7.9e-4)
100 500 0.9787 (2.0e-3) 0.9834 (1.5e-3) 0.9956 (1.2e-4)
200 250 0.9821 (1.8e-3) 0.9803 (1.9e-3) 0.9955 (6.6e-5)
200 300 0.9830 (1.3e-3) 0.9833 (1.3e-3) 0.9844 (1.6e-4)
200 500 0.9607 (2.8e-3) 0.9606 (2.8e-3) 0.9867 (3.6e-4)
Synthetic data 2
100 250 0.6491 (5.8e-3) 0.7163 (3.9e-3) 0.7334 (7.6e-3)
100 300 0.6304 (1.3e-3) 0.7466 (2.4e-3) 0.7881 (1.5e-3)
100 500 0.6253 (2.3e-3) 0.7289 (1.3e-3) 0.7308 (1.1e-3)
200 250 0.7977 (2.1e-3) 0.1371 (2.2e-3) 0.9182 (3.2e-4)
200 300 0.7380 (1.1e-3) 0.1034 (1.2e-3) 0.8773 (1.2e-3)
200 500 0.7130 (9.6e-3) 0.1220 (2.1e-3) 0.8199 (4.5e-3)
Table 3: NMI of three algorithms for solving MKSSC for synthetic data.
Figure 3: The heatmaps of UU^⊤ on synthetic data 1 estimated by (a) AMA+CADMM, (b) AMA+NADMM, and (c) AManPL for MKSSC.

4.3 Single-Cell Data Analysis

Clustering cells and identifying subgroups are important topics in high-dimensional scRNA-seq data analysis. The multiple-kernel learning approach is vital because clustering scRNA-seq data is usually sensitive to the choice of the number of neighbors and the scaling parameter. Recently, [32] showed that AMA+CADMM for MKSSC provides promising clustering results and outperforms several state-of-the-art methods such as SC, SSC, t-SNE [30], and SIMLR [41]. In what follows, we focus on the numerical comparison of AMA+CADMM, AMA+NADMM, and AManPL for clustering high-dimensional scRNA-seq data on six real datasets. These six real datasets represent several types of important dynamic processes, such as cell differentiation, and they include known cell type labels. We follow the procedure of [32] to specify multiple kernels for clustering scRNA-seq data and to choose the proper tuning parameters. The six datasets and the NMI scores of the three algorithms (AMA+CADMM, AMA+NADMM, and AManPL) for solving MKSSC are summarized in Table 4. From Table 4 we observe that AManPL always achieves the highest NMI scores and consistently outperforms the other two methods on all six real datasets. We also demonstrate the performance of AManPL by showing the two-dimensional embeddings estimated by AManPL in Figure 4. From this figure, we see that AManPL yielded clear and meaningful clusters, even for the Ginhoux dataset [34], which is known to be very challenging. This again demonstrates the great practical potential of our AManPL method for analyzing scRNA-seq data.

Datasets AMA+CADMM AMA+NADMM AManPL #Clusters
Deng [14] 0.7319 0.7389 0.7464 7
Ting [38] 0.9283 0.9524 0.9755 5
Treutlein [39] 0.7674 0.7229 0.8817 5
Buettner [5] 0.7929 0.8744 0.8997 3
Ginhoux [34] 0.6206 0.6398 0.6560 3
Pollen [33] 0.9439 0.9372 0.9631 11
Table 4: NMI scores of three algorithms for MKSSC on six real scRNA-seq datasets.
Figure 4: Visualization of the cells in the 2-D embedded space using AManPL for MKSSC on two real scRNA-seq datasets: (a) Ting [38]; (b) Ginhoux [34].

5 Conclusion

Motivated by the recent need for analyzing single-cell RNA sequencing data, we considered sparse spectral clustering (SSC) and multiple-kernel sparse spectral clustering (MKSSC) in this paper. We proposed the manifold proximal linear (ManPL) method for solving SSC and the alternating manifold proximal linear (AManPL) method for solving MKSSC. The convergence and iteration complexity of the proposed methods were analyzed. Numerical results on synthetic data and single-cell RNA sequencing data demonstrated the great potential of the proposed methods.

Appendix A Preliminaries

A.1 Optimality Condition of Manifold Optimization

Definition 1

(Generalized Clarke subdifferential [22]) For a locally Lipschitz function on , the Riemannian generalized directional derivative of at in direction is defined by

(25)

where is a coordinate chart at and denotes the Jacobian of . The generalized gradient or the Clarke subdifferential of at , denoted by , is given by

(26)

Definition 2

([42]) A function is said to be regular at along if

  • for all , exists, and

  • for all , .

For a smooth function over a Riemannian submanifold, if the metric on the manifold is induced by the Euclidean inner product in the ambient space, then the Riemannian gradient is the projection of the Euclidean gradient onto the tangent space. According to Lemma 5.1 in [42], for a regular function (see the definition in [42]: a convex function is regular, the sum of regular functions is regular, and the Riemannian generalized subdifferential of such a function is the projection of its convex subdifferential onto the tangent space), the Riemannian generalized subdifferential can be computed accordingly. Moreover, the objective function in problem (8) is weakly convex and thus is regular according to Lemma 5.1 in [42]. By Theorem 4.1 in [42], the first-order optimality condition of problem (8) is given by

(27)
Definition 3

A point is called a stationary point of problem (8) if it satisfies the first-order optimality condition (27).

Definition 4

A retraction on a differentiable manifold M is a smooth mapping Retr from the tangent bundle TM onto M satisfying the following two conditions (here Retr_x denotes the restriction of Retr to the tangent space at x):

  • , , where denotes the zero element of .

  • For , it holds that

The retraction onto the Euclidean space is simply the identity mapping: Retr_x(v) = x + v. Common retractions on the Stiefel manifold include the one based on the polar decomposition,

Retr_X(V) = (X + V)(I_k + V^⊤V)^{−1/2};

the one based on the QR decomposition,

Retr_X(V) = qf(X + V),

where qf(A) denotes the Q factor of the QR factorization of A; and the Cayley transformation, which maps V to the manifold via a Cayley transform of a skew-symmetric matrix determined by X and V.
Lemma 1

[10] For all and there exist constants and such that the following two inequalities hold:

(28)
(29)

Lemma 2

The function in (8) is -Lipschitz continuous and the Jacobian is -Lipschitz continuous, where and .

Proof

Since where , we immediately get that is -Lipschitz continuous. Secondly, observing that for any , we have

Hence, is -Lipschitz continuous.

Appendix B Proofs

B.1 Convergence Analysis of ManPL (Algorithm 1)

Before presenting the results for ManPL, we need the following useful lemma for the proximal linear algorithm.

Lemma 3

[15] Under the assumptions on and , we have the following result:

(30)
Proof

It follows from Lemma 2 that

Now, we start to analyze the convergence and iteration complexity of our ManPL algorithm. The following lemma states that the minimizer of (12a) is a descent direction on the tangent space for the local model.

Lemma 4

Let V_k be the minimizer of (12a); then the following holds for any , with any ,

(31)

Proof

Since V_k is the minimizer of (12a), we have, for any :

which implies that

Using the convexity of , we have

Letting , we get

Finally, from the convexity of we get

which completes the proof.

The following lemma suggests that the new point has a lower function value.

Lemma 5

Given any , let V_k be the optimal solution of (12a). Denote ; then there exists a constant such that

(32)

Proof

We prove the result by induction. Denote . For , using the convexity of and the Lipschitz continuity of , we can show that:

(33)

Since is Lipschitz continuous, we obtain