1 Introduction
The problem of finding a subspace that captures the features of a given dataset and possesses certain properties is at the heart of many machine learning applications. One commonly encountered formulation of the problem, which is motivated largely by sparsity or robustness considerations, is given by
(1.1) 
where is a given matrix and denotes the
norm of a vector. To better understand how problem (
1.1) arises in applications, let us consider two representative examples.
Orthogonal Dictionary Learning (ODL). The goal of ODL is to find an orthonormal basis that can compactly represent a given set of () data points . Such a problem arises in many signal and image processing applications; see, e.g., [4, 23] and the references therein. By letting
, the problem can be understood as finding an orthogonal matrix
and a sparse matrix such that . Noting that this means should be sparse, one approach is to find a collection of sparse vectors from the row space of and apply some postprocessing procedure to the collection to form the orthogonal matrix . This has been pursued in various works; see, e.g., [24, 18, 25, 2]. In particular, the work [2] considers the formulation (1.1) and shows that under a standard generative model of the data, one can recover from certain local minimizers of problem (1.1). 
Robust Subspace Recovery (RSR). RSR is a fundamental problem in machine learning and data mining [11]
. It is concerned with fitting a linear subspace to a dataset corrupted by outliers. Specifically, given a dataset
, where the columns of are the inlier points spanning a dimensional subspace of (), the columns of are outlier points without a linear structure, and is an unknown permutation, the goal is to recover the inlier subspace , or equivalently, to cluster the points into inliers and outliers. One recently proposed approach for solving this problem is the socalled dual principal component pursuit (DPCP) [27, 31]. A key task in DPCP is to find a hyperplane that contains all the inliers. Such a task can be tackled by solving problem (
1.1). In fact, it has been shown in [27, 31] that under certain conditions on the inliers and outliers, any global minimizer of problem (1.1) yields a vector that is orthogonal to the inlier subspace .
Despite its attractive theoretical properties in various applications, problem (1.1) is numerically challenging to solve due to its nonsmooth objective and nonconvex constraint. Nevertheless, the manifold structure of the constraint set suggests that problem (1.1) could be amenable to manifold optimization techniques [1]. One approach is to apply smoothing to the nonsmooth objective in (1.1) and use existing algorithms for Riemannian smooth optimization to solve the resulting problem. For instance, when tackling the ODL problem, Sun et al. [25, 26] and Gilboa et al. [10] proposed to replace the absolute value function by the smooth surrogate with being a smoothing parameter, while Qu et al. [19] proposed to replace the norm with the norm. They then solve the resulting smoothed problems by either the Riemannian trustregion method [25, 26] or the Riemannian gradient descent method [10, 19]. Although it can be shown that these methods will yield the desired orthonormal basis under a standard generative model of the data, the smoothing approach can introduce significant analytic and computational difficulties [2]. Another approach, which avoids smoothing the objective, is to solve (1.1) directly using Riemannian nonsmooth optimization techniques. For instance, in the recent work [2], Bai et al. proposed to solve (1.1) using the Riemannian subgradient method (RSGM), which generates the iterates via
(1.2) 
Here, is the step size; denotes the Riemannian subdifferential of and is given by
where is the identity matrix and is the usual subdifferential of the convex function [29, Section 5]. Bai et al. [2]
showed that for the ODL problem, the RSGM with a suitable initialization will converge at a sublinear rate to a basis vector with high probability under a standard generative model of the data. Moreover, by running the RSGM
times, each time with an independent random initialization, one can recover the entire orthonormal basis with high probability. Around the same time, Zhu et al. [31] proposed a projected subgradient method (PSGM) for solving (1.1). The method generates the iterates via(1.3) 
The updates (1.2) and (1.3) differ in the choice of the direction —the former uses a Riemannian subgradient of at , while the latter uses a usual Euclidean subgradient. For the DPCP formulation of the RSR problem, Zhu et al. [31] showed that under certain assumptions on the data, the PSGM with suitable initialization and piecewise geometrically diminishing step sizes will converge at a linear rate to a vector that is orthogonal to the inlier subspace . The step sizes take the form , where and satisfy certain conditions. In practice, however, the parameters are difficult to determine. Therefore, Zhu et al. [31] also proposed a PSGM with modified backtracking line search (PSGMMBLS), which works well in practice but has no convergence guarantee.
1.1 Motivations for this Work
Although the results in [2, 31] demonstrate, both theoretically and computationally, the efficacy of RSGM and PSGM for solving instances of (1.1) that arise from the ODL and RSR problems, respectively, two fundamental questions remain. First, while the PSGM can be shown to achieve a linear convergence rate on the DPCP formulation of the RSR problem [31], only a sublinear convergence rate has been established for the RSGM on the ODL problem [2]. Given the similarity of the updates (1.2) and (1.3), it is natural to ask whether the slower convergence rate of the RSGM is an artifact of the analysis or due to the inherent structure of the ODL problem. Second, the convergence analyses in [2, 31] focus only on the ODL and RSR problems. In particular, they do not shed light on the performance of the RSGM or PSGM when tackling general instances of problem (1.1). It would be of interest to fill this gap by identifying or developing practically fast methods that have more general convergence guarantees, especially since different applications may give rise to instances of problem (1.1) with different structures. In a recent attempt to address these questions, Li et al. [13] showed, among other things, that the RSGM will converge at the sublinear rate of (here, is the iteration counter) when applied to a general instance of problem (1.1) and at a linear rate when applied to a socalled sharp instance of problem (1.1). Informally, an optimization problem is said to possess the sharpness property if the objective function grows linearly with the distance to a set of local minima [5]. Such a property is key to establishing fast convergence guarantees for a host of iterative methods; see, e.g., [5, 13, 15] and also [17, 30, 16] for related results. Since the ODL problem and the DPCP formulation of the RSR problem are known to possess the sharpness property under certain assumptions on the data [2, 31], the results in [13] imply that the RSGM will converge linearly on these problems.
1.2 Our Contributions
In this paper, we depart from the subgradienttype approaches (such as the RSGM (1.2) and PSGM (1.3)) and present another method called the manifold proximal point algorithm (ManPPA) to tackle problem (1.1). At each iterate , ManPPA computes a search directon by minimizing the sum of and a proximal term defined in terms of the Euclidean distance over the tangent space to at . This should be contrasted with other existing PPAs on manifolds (see, e.g., [8, 3]), in which the proximal term is defined in terms of the Riemannian distance. Such a difference is important. Indeed, although the search direction defined in ManPPA does not admit a closedform formula, it can be computed in a highly efficient manner by exploiting the structure of problem (1.1); see Section 2.2. However, the search direction defined in the existing PPAs on manifolds can be as difficult to compute as a solution to the original problem. Consequently, the applicability of those methods is rather limited.
We now summarize our contributions as follows:

We show that ManPPA will converge at the sublinear rate of when applied to a general instance of problem (1.1) and at a quadratic rate when applied to a sharp instance of problem (1.1). Although the sublinear rate result follows from the results in [6], the quadratic rate result is new. Moreover, both rates are superior to those of the RSGM established in [13].

We propose a stochastic version of ManPPA called StManPPA to tackle problem (1.1). StManPPA is well suited for the setting where the number of the data points is extremely large, as each iteration involves only a simple closedform update. We also analyze the convergence behavior of StManPPA. In particular, we show that it converges at the sublinear rate of when applied to a general instance of problem (1.1). Although the convergence rate of StManPPA is the same as that of RSGM established in [13], StManPPA is more efficient in practice, as it has a cheaper periteration update.

Using ManPPA as a building block, we develop a new approach for solving the following matrix analog of problem (1.1):
(1.4) Our interest in problem (1.4) stems from the observation that it provides alternative formulations of the ODL and RSR problems. Indeed, for the ODL problem, one can recover the entire orthonormal basis all at once by solving problem (1.4) with . For the RSR problem, if one knows the dimension of the inlier subspace , then one can recover it by solving problem (1.4) with . We show that a good feasible solution to problem (1.4
) can be found in a columnbycolumn manner by suitably modifying ManPPA. Although the proposed method is only a heuristic, our extensive numerical experiments show that it yields solutions of comparable quality to but is significantly faster than existing methods on the ODL and RSR problems.
1.3 Organization and Notation
The rest of the paper is organized as follows. In Section 2, we present ManPPA for solving problem (1.1) and describe a highly efficient method for solving the subproblem that arises in each iteration of ManPPA. We also analyze the convergence behavior of ManPPA. In Section 3, we propose StManPPA, a stochastic version of ManPPA that is well suited for largescale computation, and analyze its convergence behavior. In Section 4 we discuss an extension of ManPPA for solving the matrix analog (1.4) of problem (1.1). In Section 5, we apply ManPPA to solve the ODL problem and the DPCP formulation of the RSR problem and compare its performance with some existing methods. We draw our conclusions in Section 6.
Besides the notation introduced earlier, we use to denote the Lipschitz constant of ; i.e., for all (note that ). Given a closed set , we use to denote the projection of onto and to denote the distance between and . Given a proper lower semicontinuous function , its proximal mapping is given by . Given two vectors , we use or to denote their usual inner product. Other notation is standard.
2 A Manifold Proximal Point Algorithm
Since problem (1.1) is nonconvex, our goal is to compute a stationary point of (1.1), which is a point that satisfies the firstorder optimality condition
(see [29]). In the recent work [6], Chen et al. considered the more general problem of minimizing the sum of a smooth function and a nonsmooth convex function over the Stiefel manifold and developed a manifold proximal gradient method (ManPG) for finding a stationary point of it. When specialized to solve problem (1.1), the method generates the iterates via
(2.1) 
where the search direction is given by
(2.2) 
and and are step sizes. As the reader may readily recognize, without the constraint , the subproblem (2.2) is simply computing the proximal mapping of at and coincides with the update of the classic proximal point algorithm (PPA) [21]. The constraint in (2.2), which states that the search direction should lie on the tangent space to at , is introduced to account for the manifold constraint in problem (1.1) and ensures that the next iterate achieves sufficient decrease in objective value. Motivated by the above discussion, we call the method obtained by specializing ManPG to the setting of problem (1.1) ManPPA and present its details below:
Naturally, ManPPA inherits the properties of ManPG established in [6]. However, due to the structure of problem (1.1), many of the developments in [6] can be refined and/or simplified for ManPPA. In particular, by observing that (2.2) is a linearly constrained strongly convex quadratic minimization problem, we show in Section 2.2 how line 3 of Algorithm 1 can be implemented in a provably and practically efficient manner. Moreover, we can prove the following result, which shows that the line search step in line 4 of Algorithm 1 is well defined. It simplifies [6, Lemma 5.2] and yields sharper constants. The proof can be found in Appendix B.
Proposition 2.1.
Let be the sequence generated by Algorithm 1. Define . For any , we have
(2.3) 
As a result, we have for any in Algorithm 1, which implies that the line search step terminates after at most iterations. In particular, if , then we have , which implies that we can take in line 4 of Algorithm 1; i.e., no line search is needed.
2.1 Convergence Analysis of ManPPA
In this subsection, we study the convergence behavior of ManPPA. Recall from [6, Lemma 5.3] that if in (2.2), then is a stationary point of problem (1.1). This motivates us to call an stationary point of problem (1.1) with if the solution to (2.2) satisfies . By specializing the convergence results in [6, Theorem 5.5] for ManPG to ManPPA, we obtain the following theorem:
Theorem 2.2.
Theorem 2.2 shows that the rate at which ManPPA converges to a stationary point of problem (1.1) is , which is superior to the rate of RSGM established in [13].
Now, let us analyze the convergence rate of ManPPA in the setting where problem (1.1) possesses the sharpness property. Such a setting is highly relevant in applications, as both the ODL problem and DPCP formulation of the RSR problem give rise to sharp instances of problem (1.1); see [2, Proposition C.8] and [13, Proposition 4]. To proceed, we first introduce the notion of sharpness.
Definition 2.3 (Sharpness; see, e.g., [5]).
We say that is a set of weak sharp minima for the function with parameters (where ) if for any , we have
(2.4) 
From the definition, we see that if is a set of weak sharp minima of , then it is the set of minimizers of over . Moreover, the function value grows linearly with the distance to . In the presence of such a regularity property, ManPPA can be shown to converge at a much faster rate. The following result, which has not appeared in the literature before and is thus new, constitutes the first main contribution of this paper.
Theorem 2.4.
Suppose that is a set of weak sharp minima for the function with parameters . Let be the sequence generated by Algorithm 1 with and . Then, we have
(2.5) 
(2.6) 
2.2 Solving the Subproblem (2.2)
Although Theorem 2.2 shows that ManPPA converges to a stationary point of problem (1.1) within iterations, each iteration requires solving the subproblem (2.2) to obtain the search direction. Thus, the efficiency of ManPPA depends crucially on how fast we can solve the subproblem (2.2). Since (2.2) is a linearly constrained strongly convex quadratic minimization problem, one approach is to adapt the semismooth Newton (SSN) augmented Lagrangian method (ALM) originally developed in [14] for LASSOtype problems to solve it. As we shall see, such an approach yields to a highly efficient method for solving the subproblem (2.2).
To set the stage for our development, let us drop the index from (2.2) for simplicity and set . Then, we see that (2.2) can be equivalently written as
(2.7) 
Problem (2.7) can be solved by an inexact ALM. To describe the algorithm, we first write down the augmented Lagrangian function corresponding to (2.7):
(2.8) 
Here, and are Lagrange multipliers (dual variables) associated with the constraints in (2.7) and is a penalty parameter. Then, the inexact ALM for solving (2.7) can be described as follows [20, 14]:
(2.9) 
Since the subproblem (2.9) can only be solved inexactly in general, we adopt the following stopping criteria, which are standard in the literature (see [20, 14]): equationparentequation
(2.10a)  
(2.10b)  
(2.10c) 
Here, is the optimal value of (2.9). Conditions (2.10a)–(2.10c) ensure that starting from any initial point , the inexact ALM (Algorithm 2) will converge at a superlinear rate to an optimal solution to problem (2.7). The proof can be found in Appendix D.
Now, it remains to discuss how to solve the subproblem (2.9) in an efficient manner. Again, let us drop the index in (2.9) for simplicity. By simple manipulation, we have
Consider the function . Upon letting and using the definition of the proximal mapping of , we have
It follows that if and only if
Using [22, Theorem 2.26] and the Moreau decomposition , where is the conjugate function of , it can be deduced that is strongly convex and continuously differentiable with
Thus, we can find by solving the nonsmooth equation
(2.11) 
Towards that end, we apply an SSN method, which finds the solution by successive linearization of the map . To implement the method, we first need to compute the generalized Jacobian of [7, Definition 2.6.1], denoted by
. By the chain rule
[7, Corollary of Theorem 2.6.6] and the Moreau decomposition, each element takes the form(2.12) 
where . Using the definition of , it can be shown that the diagonal matrix with
is an element of [14, Section 3.3] and hence can be used to define an element via (2.12). Note that the matrix so defined is positive definite. As such, the following generic iteration of the SSN method for solving (2.11) is well defined: equationparentequation
(2.13a)  
(2.13b) 
Here, is the step size. Moreover, since is a diagonal matrix whose entries are either 0 or 1, the matrix can be assembled in a very efficient manner; again, see [14]. The detailed implementation of the SSN method for solving (2.11) is given below:
3 Stochastic Manifold Proximal Point Algorithm
In this section, we propose, for the first time, a stochastic ManPPA (StManPPA) for solving problem (1.1), which is well suited for the setting where (typically representing the number of data points) is much larger than (typically representing the ambient dimension of the data points). To begin, observe that problem (1.1) has the finitesum structure
where is the th column of . When is extremely large, computing the matrixvector product can be expensive. To circumvent this difficulty, in each iteration of StManPPA, a column of , say , is randomly chosen and the search direction is given by
(3.1) 
The key advantage of StManPPA is that the subproblem (3.1) admits a closedform solution that is very easy to compute.
Proposition 3.1.
Let . Then, the solution to (3.1) is given by
Proof.
The firstorder optimality conditions of (3.1) are equationparentequation
(3.2a)  
(3.2b) 
Suppose that is a solution to (3.2). If , then , and (3.2a) implies that . This, together with (3.2b) and the fact that , gives . It follows that and hence is equivalent to . This establishes the first case in (3.1). The other two cases in (3.1), which correspond to and , can be derived using a similar argument. ∎
We now present the details of StManPPA below:
3.1 Convergence Analysis of StManPPA
In this section, we present our convergence results for StManPPA. Let us begin with some preparations. Define to be the function and let denote the Lipschitz constant of , where . Set . Furthermore, define the Moreau envelope and proximal mapping on by equationparentequation
(3.3a)  
(3.3b) 
respectively. The proximal mapping is well defined since the constraint set is compact. As it turns out, the proximal mapping can be used to define an alternative notion of stationarity for problem (1.1). Indeed, for any and , if we denote , then the optimality condition of (3.3) yields
Since is a projection operator and hence nonexpansive, we obtain
In particular, if , then (i) is stationary in the sense that and (ii) is close to the stationary point . This motivates us to use
as a stationarity measure for problem (1.1). We call an nearly stationary point of problem (1.1) if . It is worth noting that such a notion has also been used in [13] to study the stochastic RSGM.
We are now ready to establish the convergence rate of StManPPA, which constitutes the second main contribution of this paper. The proof can be found in Appendix E.
Theorem 3.2.
For any , the point output by Algorithm 4 satisfies
where the expectation is taken over all random choices made by the algorithm. In particular, if the step sizes satisfy and , then . Moreover, if we take for , then the number of iterations needed by StManPPA to obtain a point satisfying is .
4 Extension to Stiefel Manifold Constraint
In this section, we consider the matrix analog (1.4) of problem (1.1), which also arises in many applications such as certain “oneshot” formulations of the ODL and RSR problems (see Section 1.2). Currently, there are two existing approaches for solving problem (1.4
), namely a sequential linear programming (SLP) approach and an iteratively reweighted least squares (IRLS) approach
[27]. In the SLP approach, the columns of are extracted one at a time. Suppose that we have already obtained the first columns of () and arrange them in the matrix (with ). Then, the st column of is obtained by solvingThis is achieved by the alternating linearization and projection (ALP) method, which generates the iterates via
Note that the subproblem is a linear program, which can be efficiently solved by offtheshelf solvers.
In the IRLS approach, the following variant of problem (1.4), which favors rowwise sparsity of and has also been studied by Lerman et al. in [12], is considered:
(4.1) 
Here, denotes the sum of the Euclidean norms of the rows of . The IRLS method for solving (4.1) generates the iterates via
where and is a perturbation parameter to prevent the denominator from being 0. The solution can be obtained via an SVD and is thus easy to implement. However, there has been no convergence guarantee for the IRLS method so far.
Recently, Wang et al. [28] proposed a proximal alternating maximization method for solving a maximization version of (1.4), which arises in the socalled PCA problem (see [11]). However, the method cannot be easily adapted to solve problem (1.4).
The similarity between problems (1.1) and (1.4) suggests that the latter can also be solved by ManPPA. This is indeed the case. However, the SSN method for solving the resulting nonsmooth equation (i.e., the matrix analog of (2.11)) can be slow, as the dimension of the linear system (2.13) is high. To overcome this difficulty, we propose an alternative method called sequential ManPPA, which solves problem (1.4) in a columnbycolumn manner and constitutes the third main contribution of this paper. The method is based on the observation that the objective function in (1.4) is separable in the columns of . To find , we simply solve
using ManPPA as it is an instance of problem (1.1). Now, suppose that we have found the first columns of () and arrange them in the matrix (with ). Then, we find the st column by solving
(4.2) 
The difference between problems (1.1) and (4.2) lies in the extra linear equality constraint . Due to this constraint, neither the PSGM nor the RSGM can be easily applied to solve (4.2). However, our ManPPA can still be adapted to solve (4.2) with very minor modifications. To simplify notation, let us drop the index and denote , . In the th iteration of ManPPA, we update the iterate by (2.1), where the search direction is computed by