Manifold Proximal Point Algorithms for Dual Principal Component Pursuit and Orthogonal Dictionary Learning

05/05/2020 ∙ by Shixiang Chen et al. ∙ University of California, Davis ∙ The Chinese University of Hong Kong ∙ Texas A&M University

We consider the problem of minimizing the ℓ_1 norm of a linear map over the sphere, which arises in various machine learning applications such as orthogonal dictionary learning (ODL) and robust subspace recovery (RSR). The problem is numerically challenging due to its nonsmooth objective and nonconvex constraint, and its algorithmic aspects have not been well explored. In this paper, we show how the manifold structure of the sphere can be exploited to design fast algorithms for tackling this problem. Specifically, our contribution is threefold. First, we present a manifold proximal point algorithm (ManPPA) for the problem and show that it converges at a sublinear rate. Furthermore, we show that ManPPA can achieve a quadratic convergence rate when applied to the ODL and RSR problems. Second, we propose a stochastic variant of ManPPA called StManPPA, which is well suited for large-scale computation, and establish its sublinear convergence rate. Both ManPPA and StManPPA have provably faster convergence rates than existing subgradient-type methods. Third, using ManPPA as a building block, we propose a new approach to solving a matrix analog of the problem, in which the sphere is replaced by the Stiefel manifold. The results from our extensive numerical experiments on the ODL and RSR problems demonstrate the efficiency and efficacy of our proposed methods.


1 Introduction

The problem of finding a subspace that captures the features of a given dataset and possesses certain properties is at the heart of many machine learning applications. One commonly encountered formulation of the problem, which is motivated largely by sparsity or robustness considerations, is given by

$$\min_{x \in \mathbb{R}^n} \ \|A^\top x\|_1 \quad \text{subject to} \quad \|x\|_2 = 1, \tag{1.1}$$

where $A \in \mathbb{R}^{n \times m}$ is a given matrix and $\|\cdot\|_1$ denotes the $\ell_1$ norm of a vector. Throughout, we write $S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$ for the unit sphere and $f(x) = \|A^\top x\|_1$ for the objective of (1.1). To better understand how problem (1.1) arises in applications, let us consider two representative examples.

  • Orthogonal Dictionary Learning (ODL). The goal of ODL is to find an orthonormal basis $Q \in \mathbb{R}^{n \times n}$ that can compactly represent a given set of $m$ ($m \ge n$) data points $y_1, \ldots, y_m \in \mathbb{R}^n$. Such a problem arises in many signal and image processing applications; see, e.g., [4, 23] and the references therein. By letting $Y = [y_1, \ldots, y_m] \in \mathbb{R}^{n \times m}$, the problem can be understood as finding an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ and a sparse matrix $S \in \mathbb{R}^{n \times m}$ such that $Y = QS$. Noting that this means $Q^\top Y = S$ should be sparse, one approach is to find a collection of sparse vectors from the row space of $Y$ and apply some post-processing procedure to the collection to form the orthogonal matrix $Q$. This has been pursued in various works; see, e.g., [24, 18, 25, 2]. In particular, the work [2] considers the formulation (1.1) with $A = Y$ and shows that under a standard generative model of the data, one can recover the columns of $Q$ from certain local minimizers of problem (1.1) (a small synthetic illustration of this generative model appears right after these two examples).

  • Robust Subspace Recovery (RSR). RSR is a fundamental problem in machine learning and data mining [11]. It is concerned with fitting a linear subspace to a dataset corrupted by outliers. Specifically, given a dataset $A = [X \ O]\Gamma \in \mathbb{R}^{n \times m}$, where the columns of $X$ are the inlier points spanning a $d$-dimensional subspace $\mathcal{S}$ of $\mathbb{R}^n$ ($d < n$), the columns of $O$ are outlier points without a linear structure, and $\Gamma$ is an unknown permutation, the goal is to recover the inlier subspace $\mathcal{S}$, or equivalently, to cluster the points into inliers and outliers. One recently proposed approach for solving this problem is the so-called dual principal component pursuit (DPCP) [27, 31]. A key task in DPCP is to find a hyperplane that contains all the inliers. Such a task can be tackled by solving problem (1.1). In fact, it has been shown in [27, 31] that under certain conditions on the inliers and outliers, any global minimizer of problem (1.1) yields a vector that is orthogonal to the inlier subspace $\mathcal{S}$.
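To make the reduction to (1.1) concrete, the sketch below generates a synthetic ODL instance under a Bernoulli-Gaussian sparsity model (one common choice of "standard generative model"; the dimensions and sparsity level are illustrative assumptions) and checks that the columns of the ground-truth dictionary give a much smaller ℓ_1 objective than a random unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, theta = 30, 2000, 0.1                  # ambient dim, #samples, sparsity level (assumed)

# Synthetic ODL instance: Y = Q S with Q orthogonal and S Bernoulli-Gaussian sparse.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))        # ground-truth orthonormal dictionary
S = rng.standard_normal((n, m)) * (rng.random((n, m)) < theta)
Y = Q @ S                                                # data matrix, plays the role of A in (1.1)

f = lambda x: np.abs(Y.T @ x).sum()                      # objective of (1.1) with A = Y

# Columns of Q should give a much smaller l1 value than random unit vectors.
vals_true = [f(Q[:, j]) for j in range(n)]
x_rand = rng.standard_normal(n)
x_rand /= np.linalg.norm(x_rand)
print(np.mean(vals_true), f(x_rand))
```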

Despite its attractive theoretical properties in various applications, problem (1.1) is numerically challenging to solve due to its nonsmooth objective and nonconvex constraint. Nevertheless, the manifold structure of the constraint set suggests that problem (1.1) could be amenable to manifold optimization techniques [1]. One approach is to apply smoothing to the nonsmooth objective in (1.1) and use existing algorithms for Riemannian smooth optimization to solve the resulting problem. For instance, when tackling the ODL problem, Sun et al. [25, 26] and Gilboa et al. [10] proposed to replace the absolute value function by the smooth surrogate $h_\mu(z) = \mu \log\cosh(z/\mu)$ with $\mu > 0$ being a smoothing parameter, while Qu et al. [19] proposed to replace the $\ell_1$ norm with the $\ell_4$ norm. They then solve the resulting smoothed problems by either the Riemannian trust-region method [25, 26] or the Riemannian gradient descent method [10, 19]. Although it can be shown that these methods will yield the desired orthonormal basis under a standard generative model of the data, the smoothing approach can introduce significant analytic and computational difficulties [2]. Another approach, which avoids smoothing the objective, is to solve (1.1) directly using Riemannian nonsmooth optimization techniques. For instance, in the recent work [2], Bai et al. proposed to solve (1.1) using the Riemannian subgradient method (RSGM), which generates the iterates via

$$x^{k+1} = \frac{x^k - \gamma_k \tilde{g}^k}{\|x^k - \gamma_k \tilde{g}^k\|_2}, \qquad \tilde{g}^k \in \partial_R f(x^k). \tag{1.2}$$

Here, $\gamma_k > 0$ is the step size and $\partial_R f$ denotes the Riemannian subdifferential of $f(x) = \|A^\top x\|_1$ on the sphere, which is given by

$$\partial_R f(x) = (I_n - x x^\top)\, \partial f(x),$$

where $I_n$ is the identity matrix and $\partial f$ is the usual subdifferential of the convex function $f$ [29, Section 5]. Bai et al. [2] showed that for the ODL problem, the RSGM with a suitable initialization will converge at a sublinear rate to a basis vector with high probability under a standard generative model of the data. Moreover, by running the RSGM sufficiently many times, each time with an independent random initialization, one can recover the entire orthonormal basis with high probability. Around the same time, Zhu et al. [31] proposed a projected subgradient method (PSGM) for solving (1.1). The method generates the iterates via

$$x^{k+1} = \frac{x^k - \gamma_k g^k}{\|x^k - \gamma_k g^k\|_2}, \qquad g^k \in \partial f(x^k). \tag{1.3}$$

The updates (1.2) and (1.3) differ in the choice of the direction: the former uses a Riemannian subgradient of $f$ at $x^k$, while the latter uses a usual Euclidean subgradient. For the DPCP formulation of the RSR problem, Zhu et al. [31] showed that under certain assumptions on the data, the PSGM with suitable initialization and piecewise geometrically diminishing step sizes will converge at a linear rate to a vector that is orthogonal to the inlier subspace $\mathcal{S}$. The step sizes take the form $\gamma_k = \gamma_0 \beta^{\lfloor k/K_0 \rfloor}$, where the initial step size $\gamma_0 > 0$, the decay factor $\beta \in (0,1)$, and the period $K_0 \in \mathbb{N}$ satisfy certain conditions. In practice, however, these parameters are difficult to determine. Therefore, Zhu et al. [31] also proposed a PSGM with modified backtracking line search (PSGM-MBLS), which works well in practice but has no convergence guarantee.
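For concreteness, here is a minimal numpy sketch of the PSGM update (1.3) with a piecewise geometrically diminishing step size schedule; the parameter values are illustrative and not those prescribed by the theory in [31].

```python
import numpy as np

def psgm(A, x0, gamma0=1e-2, beta=0.5, K0=50, iters=500):
    """Projected subgradient method (1.3) for min ||A^T x||_1 s.t. ||x||_2 = 1 (sketch)."""
    x = x0 / np.linalg.norm(x0)
    for k in range(iters):
        g = A @ np.sign(A.T @ x)                 # Euclidean subgradient of ||A^T x||_1
        gamma = gamma0 * beta ** (k // K0)       # piecewise geometrically diminishing step size
        y = x - gamma * g
        x = y / np.linalg.norm(y)                # project back onto the sphere
    return x
```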

1.1 Motivations for this Work

Although the results in [2, 31] demonstrate, both theoretically and computationally, the efficacy of RSGM and PSGM for solving instances of (1.1) that arise from the ODL and RSR problems, respectively, two fundamental questions remain. First, while the PSGM can be shown to achieve a linear convergence rate on the DPCP formulation of the RSR problem [31], only a sublinear convergence rate has been established for the RSGM on the ODL problem [2]. Given the similarity of the updates (1.2) and (1.3), it is natural to ask whether the slower convergence rate of the RSGM is an artifact of the analysis or due to the inherent structure of the ODL problem. Second, the convergence analyses in [2, 31] focus only on the ODL and RSR problems. In particular, they do not shed light on the performance of the RSGM or PSGM when tackling general instances of problem (1.1). It would be of interest to fill this gap by identifying or developing practically fast methods that have more general convergence guarantees, especially since different applications may give rise to instances of problem (1.1) with different structures. In a recent attempt to address these questions, Li et al. [13] showed, among other things, that the RSGM will converge at the sublinear rate of $\mathcal{O}(k^{-1/4})$ (here, $k$ is the iteration counter) when applied to a general instance of problem (1.1) and at a linear rate when applied to a so-called sharp instance of problem (1.1). Informally, an optimization problem is said to possess the sharpness property if the objective function grows linearly with the distance to a set of local minima [5]. Such a property is key to establishing fast convergence guarantees for a host of iterative methods; see, e.g., [5, 13, 15] and also [17, 30, 16] for related results. Since the ODL problem and the DPCP formulation of the RSR problem are known to possess the sharpness property under certain assumptions on the data [2, 31], the results in [13] imply that the RSGM will converge linearly on these problems.

1.2 Our Contributions

In this paper, we depart from the subgradient-type approaches (such as the RSGM (1.2) and PSGM (1.3)) and present another method called the manifold proximal point algorithm (ManPPA) to tackle problem (1.1). At each iterate $x^k$, ManPPA computes a search direction by minimizing the sum of $d \mapsto \|A^\top(x^k + d)\|_1$ and a proximal term defined in terms of the Euclidean distance over the tangent space to $S^{n-1}$ at $x^k$. This should be contrasted with other existing PPAs on manifolds (see, e.g., [8, 3]), in which the proximal term is defined in terms of the Riemannian distance. Such a difference is important. Indeed, although the search direction defined in ManPPA does not admit a closed-form formula, it can be computed in a highly efficient manner by exploiting the structure of problem (1.1); see Section 2.2. By contrast, the search direction defined in the existing PPAs on manifolds can be as difficult to compute as a solution to the original problem. Consequently, the applicability of those methods is rather limited.

We now summarize our contributions as follows:

  1. We show that ManPPA will converge at the sublinear rate of $\mathcal{O}(1/\sqrt{k})$ when applied to a general instance of problem (1.1) and at a quadratic rate when applied to a sharp instance of problem (1.1). Although the sublinear rate result follows from the results in [6], the quadratic rate result is new. Moreover, both rates are superior to those of the RSGM established in [13].

  2. We propose a stochastic version of ManPPA called StManPPA to tackle problem (1.1). StManPPA is well suited for the setting where the number $m$ of data points is extremely large, as each iteration involves only a simple closed-form update. We also analyze the convergence behavior of StManPPA. In particular, we show that it converges at the sublinear rate of $\mathcal{O}(k^{-1/4})$ when applied to a general instance of problem (1.1). Although the convergence rate of StManPPA is the same as that of RSGM established in [13], StManPPA is more efficient in practice, as it has a cheaper per-iteration update.

  3. Using ManPPA as a building block, we develop a new approach for solving the following matrix analog of problem (1.1):

    $$\min_{X \in \mathbb{R}^{n \times p}} \ \|A^\top X\|_1 \quad \text{subject to} \quad X^\top X = I_p, \tag{1.4}$$

    where $\|\cdot\|_1$ denotes the entrywise $\ell_1$ norm of a matrix and the constraint set is the Stiefel manifold $\mathrm{St}(n,p) = \{X \in \mathbb{R}^{n \times p} : X^\top X = I_p\}$ with $1 \le p \le n$.

    Our interest in problem (1.4) stems from the observation that it provides alternative formulations of the ODL and RSR problems. Indeed, for the ODL problem, one can recover the entire orthonormal basis all at once by solving problem (1.4) with $p = n$. For the RSR problem, if one knows the dimension $d$ of the inlier subspace $\mathcal{S}$, then one can recover it by solving problem (1.4) with $p = n - d$. We show that a good feasible solution to problem (1.4) can be found in a column-by-column manner by suitably modifying ManPPA. Although the proposed method is only a heuristic, our extensive numerical experiments show that it yields solutions of quality comparable to those of existing methods on the ODL and RSR problems while being significantly faster.

1.3 Organization and Notation

The rest of the paper is organized as follows. In Section 2, we present ManPPA for solving problem (1.1) and describe a highly efficient method for solving the subproblem that arises in each iteration of ManPPA. We also analyze the convergence behavior of ManPPA. In Section 3, we propose StManPPA, a stochastic version of ManPPA that is well suited for large-scale computation, and analyze its convergence behavior. In Section 4 we discuss an extension of ManPPA for solving the matrix analog (1.4) of problem (1.1). In Section 5, we apply ManPPA to solve the ODL problem and the DPCP formulation of the RSR problem and compare its performance with some existing methods. We draw our conclusions in Section 6.

Besides the notation introduced earlier, we use $L$ to denote the Lipschitz constant of $f(x) = \|A^\top x\|_1$; i.e., $|f(x) - f(y)| \le L\|x - y\|_2$ for all $x, y \in \mathbb{R}^n$ (note that one can take $L = \sum_{i=1}^m \|a_i\|_2$, where $a_i$ is the $i$-th column of $A$). Given a closed set $\mathcal{C} \subseteq \mathbb{R}^n$, we use $\mathrm{proj}_{\mathcal{C}}(x)$ to denote the projection of $x$ onto $\mathcal{C}$ and $\mathrm{dist}(x, \mathcal{C})$ to denote the distance between $x$ and $\mathcal{C}$. Given a proper lower semicontinuous function $h: \mathbb{R}^n \to (-\infty, +\infty]$, its proximal mapping is given by $\mathrm{prox}_h(x) = \arg\min_{y \in \mathbb{R}^n} \{ h(y) + \tfrac{1}{2}\|y - x\|_2^2 \}$. Given two vectors $u, v \in \mathbb{R}^n$, we use $\langle u, v \rangle$ or $u^\top v$ to denote their usual inner product. Other notation is standard.
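As a concrete instance of the proximal mapping, which recurs throughout the paper, the proximal mapping of $t\|\cdot\|_1$ is the entrywise soft-thresholding operator; a minimal sketch:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal mapping of t*||.||_1, i.e., entrywise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Example: prox_l1([3, -0.5, 1], 1) = [2, 0, 0]
print(prox_l1(np.array([3.0, -0.5, 1.0]), 1.0))
```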

2 A Manifold Proximal Point Algorithm

Since problem (1.1) is nonconvex, our goal is to compute a stationary point of (1.1), i.e., a point $x \in S^{n-1}$ that satisfies the first-order optimality condition

$$0 \in \partial_R f(x) = (I_n - x x^\top)\, \partial f(x)$$

(see [29]). In the recent work [6], Chen et al. considered the more general problem of minimizing the sum of a smooth function and a nonsmooth convex function over the Stiefel manifold and developed a manifold proximal gradient method (ManPG) for finding a stationary point of it. When specialized to solve problem (1.1), the method generates the iterates via

$$x^{k+1} = \frac{x^k + \alpha_k d^k}{\|x^k + \alpha_k d^k\|_2}, \tag{2.1}$$

where the search direction $d^k$ is given by

$$d^k = \arg\min_{d \in \mathbb{R}^n} \left\{ \|A^\top(x^k + d)\|_1 + \frac{1}{2t}\|d\|_2^2 \ : \ d^\top x^k = 0 \right\}, \tag{2.2}$$

and $t > 0$ and $\alpha_k \in (0, 1]$ are step sizes. As the reader may readily recognize, without the constraint $d^\top x^k = 0$, the subproblem (2.2) simply computes the proximal mapping of $f$ (scaled by $t$) at $x^k$ and coincides with the update of the classic proximal point algorithm (PPA) [21]. The constraint in (2.2), which states that the search direction should lie in the tangent space to $S^{n-1}$ at $x^k$, is introduced to account for the manifold constraint in problem (1.1) and ensures that the next iterate achieves sufficient decrease in objective value. Motivated by the above discussion, we call the method obtained by specializing ManPG to the setting of problem (1.1) ManPPA and present its details below:

1:  Input: initial point $x^0$ with $\|x^0\|_2 = 1$, proximal parameter $t > 0$, line-search parameter $\gamma \in (0,1)$.
2:  for $k = 0, 1, 2, \ldots$ do
3:     Solve the subproblem (2.2) to obtain $d^k$.
4:     Let $j_k$ be the smallest nonnegative integer such that the backtracking sufficient-decrease condition holds with step size $\gamma^{j_k}$.
5:     Set $x^{k+1}$ according to (2.1) with $\alpha_k = \gamma^{j_k}$.
6:  end for
Algorithm 1 ManPPA for Solving Problem (1.1)

Naturally, ManPPA inherits the properties of ManPG established in [6]. However, due to the structure of problem (1.1), many of the developments in [6] can be refined and/or simplified for ManPPA. In particular, by observing that (2.2) is a linearly constrained strongly convex quadratic minimization problem, we show in Section 2.2 how line 3 of Algorithm 1 can be implemented in a provably and practically efficient manner. Moreover, we can prove the following result, which shows that the line search step in line 4 of Algorithm 1 is well defined. It simplifies [6, Lemma 5.2] and yields sharper constants. The proof can be found in Appendix B.

Proposition 2.1.

Let be the sequence generated by Algorithm 1. Define . For any , we have

(2.3)

As a result, we have for any in Algorithm 1, which implies that the line search step terminates after at most iterations. In particular, if , then we have , which implies that we can take in line 4 of Algorithm 1; i.e., no line search is needed.
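To make one iteration of Algorithm 1 concrete, the following sketch solves the subproblem (2.2) with a generic convex modeling tool (cvxpy) and then retracts to the sphere. This is only for illustration: the efficient SSN-ALM solver of Section 2.2 is what makes ManPPA fast in practice, and the unit step taken here ignores the line search.

```python
import cvxpy as cp
import numpy as np

def manppa_step(A, x, t=0.1):
    """One ManPPA iteration (sketch): solve subproblem (2.2) with a generic
    convex solver, then retract to the sphere."""
    n = x.size
    d = cp.Variable(n)
    obj = cp.norm1(A.T @ (x + d)) + cp.sum_squares(d) / (2 * t)
    prob = cp.Problem(cp.Minimize(obj), [x @ d == 0])   # tangent-space constraint
    prob.solve()
    y = x + d.value                    # unit step; a backtracking line search may shrink it
    return y / np.linalg.norm(y)

# Usage (illustrative): iterate manppa_step until ||d|| becomes small.
```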

2.1 Convergence Analysis of ManPPA

In this subsection, we study the convergence behavior of ManPPA. Recall from [6, Lemma 5.3] that if $d^k = 0$ in (2.2), then $x^k$ is a stationary point of problem (1.1). This motivates us to call $x^k$ an $\epsilon$-stationary point of problem (1.1) with $\epsilon > 0$ if the solution $d^k$ to (2.2) satisfies $\|d^k\|_2 \le \epsilon$. By specializing the convergence results in [6, Theorem 5.5] for ManPG to ManPPA, we obtain the following theorem:

Theorem 2.2.

Any limit point of the sequence $\{x^k\}$ generated by Algorithm 1 is a stationary point of problem (1.1). Moreover, for any $\epsilon > 0$, Algorithm 1 returns an $\epsilon$-stationary point in at most $\mathcal{O}\big((f(x^0) - f^*)/\epsilon^2\big)$ iterations, where $f^*$ is the optimal value of problem (1.1).

Theorem 2.2 shows that the rate at which ManPPA converges to a stationary point of problem (1.1) is $\mathcal{O}(1/\epsilon^2)$ in terms of iteration complexity, which is superior to the $\mathcal{O}(1/\epsilon^4)$ complexity of the RSGM established in [13].

Now, let us analyze the convergence rate of ManPPA in the setting where problem (1.1) possesses the sharpness property. Such a setting is highly relevant in applications, as both the ODL problem and the DPCP formulation of the RSR problem give rise to sharp instances of problem (1.1); see [2, Proposition C.8] and [13, Proposition 4]. To proceed, we first introduce the notion of sharpness.

Definition 2.3 (Sharpness; see, e.g., [5]).

We say that $\mathcal{X}^* \subseteq S^{n-1}$ is a set of weak sharp minima for the function $f$ with parameters $(\alpha, \delta)$ (where $\alpha, \delta > 0$) if for any $x \in S^{n-1}$ with $\mathrm{dist}(x, \mathcal{X}^*) \le \delta$, we have

$$f(x) - \min_{y \in \mathcal{X}^*} f(y) \ge \alpha \cdot \mathrm{dist}(x, \mathcal{X}^*). \tag{2.4}$$

From the definition, we see that if $\mathcal{X}^*$ is a set of weak sharp minima of $f$, then it is the set of minimizers of $f$ over the region where the sharpness condition (2.4) holds. Moreover, the function value grows linearly with the distance to $\mathcal{X}^*$. In the presence of such a regularity property, ManPPA can be shown to converge at a much faster rate. The following result, which has not previously appeared in the literature, constitutes the first main contribution of this paper.

Theorem 2.4.

Suppose that $\mathcal{X}^*$ is a set of weak sharp minima for the function $f$ with parameters $(\alpha, \delta)$. Let $\{x^k\}$ be the sequence generated by Algorithm 1 with a suitably chosen proximal parameter $t$ and an initial point $x^0$ sufficiently close to $\mathcal{X}^*$. Then, we have

(2.5)
(2.6)

Theorem 2.4 establishes the quadratic convergence rate of ManPPA when applied to a sharp instance of problem (1.1). Again, this is superior to the linear convergence rate of the RSGM established in [13] for this setting. The proof of Theorem 2.4 can be found in Appendix C.

2.2 Solving the Subproblem (2.2)

Although Theorem 2.2 shows that ManPPA computes an $\epsilon$-stationary point of problem (1.1) within $\mathcal{O}(1/\epsilon^2)$ iterations, each iteration requires solving the subproblem (2.2) to obtain the search direction. Thus, the efficiency of ManPPA depends crucially on how fast we can solve the subproblem (2.2). Since (2.2) is a linearly constrained strongly convex minimization problem, one approach is to adapt the semismooth Newton (SSN) augmented Lagrangian method (ALM) originally developed in [14] for LASSO-type problems to solve it. As we shall see, such an approach yields a highly efficient method for solving the subproblem (2.2).

To set the stage for our development, let us drop the index $k$ from (2.2) for simplicity and set $b = A^\top x$. Then, upon introducing the auxiliary variable $z = A^\top(x + d) = A^\top d + b$, we see that (2.2) can be equivalently written as

$$\min_{d \in \mathbb{R}^n,\ z \in \mathbb{R}^m} \ \|z\|_1 + \frac{1}{2t}\|d\|_2^2 \quad \text{subject to} \quad A^\top d + b = z, \quad x^\top d = 0. \tag{2.7}$$

Problem (2.7) can be solved by an inexact ALM. To describe the algorithm, we first write down the augmented Lagrangian function corresponding to (2.7):

$$\mathcal{L}_\sigma(d, z; y, \lambda) = \|z\|_1 + \frac{1}{2t}\|d\|_2^2 + \langle y, A^\top d + b - z \rangle + \lambda\, x^\top d + \frac{\sigma}{2}\|A^\top d + b - z\|_2^2 + \frac{\sigma}{2}(x^\top d)^2. \tag{2.8}$$

Here, $y \in \mathbb{R}^m$ and $\lambda \in \mathbb{R}$ are Lagrange multipliers (dual variables) associated with the constraints in (2.7) and $\sigma > 0$ is a penalty parameter. Then, the inexact ALM for solving (2.7) can be described as follows [20, 14]:

1:  Input: , , , , .
2:  for  do
3:     Compute
(2.9)
4:     Update dual variables:
5:     Update .
6:  end for
Algorithm 2 Inexact ALM for Solving Problem (2.7)

Since the subproblem (2.9) can only be solved inexactly in general, we adopt the following stopping criteria, which are standard in the literature (see [20, 14]):

(2.10a)
(2.10b)
(2.10c)

Here, the reference value appearing in (2.10a)–(2.10c) is the optimal value of (2.9). Conditions (2.10a)–(2.10c) ensure that, starting from any initial point, the inexact ALM (Algorithm 2) converges at a superlinear rate to an optimal solution to problem (2.7). The proof can be found in Appendix D.

Now, it remains to discuss how to solve the subproblem (2.9) in an efficient manner. Again, let us drop the index in (2.9) for simplicity. By simple manipulation, we have

Consider the function . Upon letting and using the definition of the proximal mapping of , we have

It follows that if and only if

Using [22, Theorem 2.26] and the Moreau decomposition $u = \mathrm{prox}_{h}(u) + \mathrm{prox}_{h^*}(u)$ (where $h^*$ denotes the conjugate function of $h$), it can be deduced that the function in question is strongly convex and continuously differentiable with

Thus, we can find the desired solution by solving the nonsmooth equation

(2.11)

Toward that end, we apply an SSN method, which finds the solution by successive linearization of the nonsmooth map in (2.11). To implement the method, we first need to compute the generalized Jacobian of this map [7, Definition 2.6.1]. By the chain rule [7, Corollary of Theorem 2.6.6] and the Moreau decomposition, each element of the generalized Jacobian takes the form

(2.12)

where $D$ is an element of the generalized Jacobian of the proximal mapping of the $\ell_1$ norm. Using the definition of this proximal mapping, it can be shown that a diagonal matrix whose diagonal entries equal 1 on the coordinates where the soft-thresholding operator acts as the identity and 0 elsewhere is such an element [14, Section 3.3] and hence can be used to define an element of the generalized Jacobian via (2.12). Note that the matrix so defined is positive definite. As such, the following generic iteration of the SSN method for solving (2.11) is well defined:

(2.13a)
(2.13b)

Here, the scalar in (2.13b) is the step size. Moreover, since the diagonal matrix $D$ has entries that are either 0 or 1, the coefficient matrix of the Newton system (2.13a) can be assembled in a very efficient manner (a generic illustration of this point is given after Algorithm 3); again, see [14]. The detailed implementation of the SSN method for solving (2.11) is given below:

1:  Input: , , , .
2:  for  do
3:     Choose . Let . Find an approximate solution to the linear system
that satisfies
4:     Let be the smallest nonnegative integer such that
5:     Set with .
6:  end for
Algorithm 3 SSN method for Solving the Nonsmooth Equation (2.11)

The SSN method (Algorithm 3) will converge at a superlinear rate to the unique solution to (2.11). The proof can be found in Appendix D.
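To see why a 0/1 diagonal matrix makes the Newton system cheap to form, consider the following generic illustration; the exact matrix appearing in (2.13a) may differ, and the point is only that columns not selected by the diagonal mask drop out.

```python
import numpy as np

def newton_matrix(B, d_mask, sigma):
    """Illustrative assembly of a matrix of the form I + sigma * B D B^T,
    where D = diag(d_mask) has 0/1 entries. Only the columns of B selected
    by the mask contribute, so the cost scales with the number of ones in D.
    (Generic sketch; the exact Newton matrix in (2.13a) may differ.)"""
    n = B.shape[0]
    Bs = B[:, d_mask]                          # keep only columns with D_ii = 1
    return np.eye(n) + sigma * (Bs @ Bs.T)

# Example: with a sparse mask, Bs has few columns and the product is cheap.
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 5000))
mask = rng.random(5000) < 0.01                 # roughly 1% active coordinates
M = newton_matrix(B, mask, sigma=1.0)
```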

3 Stochastic Manifold Proximal Point Algorithm

In this section, we propose, for the first time, a stochastic ManPPA (StManPPA) for solving problem (1.1), which is well suited for the setting where $m$ (typically representing the number of data points) is much larger than $n$ (typically representing the ambient dimension of the data points). To begin, observe that problem (1.1) has the finite-sum structure

$$\|A^\top x\|_1 = \sum_{i=1}^m |a_i^\top x|,$$

where $a_i$ is the $i$-th column of $A$. When $m$ is extremely large, computing the matrix-vector product $A^\top x$ can be expensive. To circumvent this difficulty, in each iteration of StManPPA, a column of $A$, say $a_{i_k}$, is randomly chosen and the search direction $d^k$ is given by

$$d^k = \arg\min_{d \in \mathbb{R}^n} \left\{ |a_{i_k}^\top(x^k + d)| + \frac{1}{2\gamma_k}\|d\|_2^2 \ : \ d^\top x^k = 0 \right\}, \tag{3.1}$$

where $\gamma_k > 0$ is a step size.

The key advantage of StManPPA is that the subproblem (3.1) admits a closed-form solution that is very easy to compute.

Proposition 3.1.

Let $\hat{a} = (I_n - x^k (x^k)^\top)\, a_{i_k}$ denote the component of $a_{i_k}$ tangent to the sphere at $x^k$. Then, the solution to (3.1) is given by

$$d^k = \begin{cases} -\gamma_k\, \hat{a}, & \text{if } a_{i_k}^\top x^k > \gamma_k \|\hat{a}\|_2^2, \\ -\dfrac{a_{i_k}^\top x^k}{\|\hat{a}\|_2^2}\, \hat{a}, & \text{if } |a_{i_k}^\top x^k| \le \gamma_k \|\hat{a}\|_2^2 \text{ and } \hat{a} \neq 0, \\ \gamma_k\, \hat{a}, & \text{if } a_{i_k}^\top x^k < -\gamma_k \|\hat{a}\|_2^2, \end{cases}$$

with the convention that $d^k = 0$ when $\hat{a} = 0$.

Proof.

The first-order optimality conditions of (3.1) are

$$0 = s\, a_{i_k} + \frac{1}{\gamma_k} d^k + \mu\, x^k \quad \text{with} \quad s \in \partial|\cdot|\big(a_{i_k}^\top(x^k + d^k)\big), \tag{3.2a}$$
$$(x^k)^\top d^k = 0, \tag{3.2b}$$

where $\mu \in \mathbb{R}$ is the multiplier associated with the tangent-space constraint. Suppose that $(d^k, s, \mu)$ is a solution to (3.2). If $a_{i_k}^\top(x^k + d^k) > 0$, then $s = 1$, and (3.2a) implies that $d^k = -\gamma_k(a_{i_k} + \mu x^k)$. This, together with (3.2b) and the fact that $\|x^k\|_2 = 1$, gives $\mu = -a_{i_k}^\top x^k$. It follows that $d^k = -\gamma_k \hat{a}$ and hence $a_{i_k}^\top(x^k + d^k) > 0$ is equivalent to $a_{i_k}^\top x^k > \gamma_k\|\hat{a}\|_2^2$. This establishes the first case in (3.1). The other two cases in (3.1), which correspond to $a_{i_k}^\top(x^k + d^k) = 0$ and $a_{i_k}^\top(x^k + d^k) < 0$, can be derived using a similar argument. ∎

We now present the details of StManPPA below:

1:  Input: initial point $x^0 \in S^{n-1}$, step sizes $\{\gamma_k\}_{k=0}^{K-1}$.
2:  for $k = 0, 1, \ldots, K-1$ do
3:     Select $i_k \in \{1, \ldots, m\}$ uniformly at random and solve the subproblem (3.1) with $a_{i_k}$, $\gamma_k$ to obtain $d^k$.
4:     Set $x^{k+1} = (x^k + d^k)/\|x^k + d^k\|_2$.
5:  end for
6:  Output: $\bar{x} = x^k$, where $k$ is chosen from $\{0, \ldots, K-1\}$ with probability proportional to $\gamma_k$.
Algorithm 4 StManPPA for Solving Problem (1.1)
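Based on the closed-form solution of (3.1) given above (reconstructed here, so the exact case split should be treated as an assumption), one StManPPA update can be implemented in a few lines:

```python
import numpy as np

def stmanppa_step(a, x, gamma):
    """One StManPPA update for a single sampled column a (sketch based on the
    closed-form solution of subproblem (3.1))."""
    c = a @ x
    a_hat = a - c * x                    # tangent component of a at x
    beta = a_hat @ a_hat
    if beta == 0.0:
        d = np.zeros_like(x)
    elif abs(c) <= gamma * beta:
        d = -(c / beta) * a_hat          # lands exactly on the hyperplane a^T y = 0
    else:
        d = -gamma * np.sign(c) * a_hat
    y = x + d
    return y / np.linalg.norm(y)         # retract back to the sphere
```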

3.1 Convergence Analysis of StManPPA

In this section, we present our convergence results for StManPPA. Let us begin with some preparations. Define $f_i$ to be the function $f_i(x) = |a_i^\top x|$ and let $L_i$ denote the Lipschitz constant of $f_i$ for $i = 1, \ldots, m$. Set $\bar{L} = \max_{1 \le i \le m} L_i$. Furthermore, define the Moreau envelope and proximal mapping on $S^{n-1}$ by

$$f_\lambda(x) = \min_{y \in S^{n-1}} \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|_2^2 \Big\}, \tag{3.3a}$$

$$\mathrm{prox}_{\lambda f}(x) = \arg\min_{y \in S^{n-1}} \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|_2^2 \Big\}, \tag{3.3b}$$

respectively. The proximal mapping is well defined since the constraint set is compact. As it turns out, the proximal mapping can be used to define an alternative notion of stationarity for problem (1.1). Indeed, for any $x \in S^{n-1}$ and $\lambda > 0$, if we denote $\hat{x} = \mathrm{prox}_{\lambda f}(x)$, then the optimality condition of (3.3) yields an inclusion relating $(x - \hat{x})/\lambda$ to the Riemannian subdifferential of $f$ at $\hat{x}$.

Since the projection onto the tangent space is nonexpansive, we obtain $\mathrm{dist}\big(0, \partial_R f(\hat{x})\big) \le \|x - \hat{x}\|_2/\lambda$.

In particular, if $\|x - \hat{x}\|_2/\lambda \le \epsilon$, then (i) $\hat{x}$ is $\epsilon$-stationary in the sense that $\mathrm{dist}\big(0, \partial_R f(\hat{x})\big) \le \epsilon$ and (ii) $x$ is close to the $\epsilon$-stationary point $\hat{x}$. This motivates us to use

$$\Theta_\lambda(x) = \frac{\|x - \mathrm{prox}_{\lambda f}(x)\|_2}{\lambda}$$

as a stationarity measure for problem (1.1). We call $x$ an $\epsilon$-nearly stationary point of problem (1.1) if $\Theta_\lambda(x) \le \epsilon$. It is worth noting that such a notion has also been used in [13] to study the stochastic RSGM.

We are now ready to establish the convergence rate of StManPPA, which constitutes the second main contribution of this paper. The proof can be found in Appendix E.

Theorem 3.2.

For any number of iterations $K$, the point $\bar{x}$ output by Algorithm 4 satisfies a bound on the expected stationarity measure $\mathbb{E}[\Theta_\lambda(\bar{x})]$ in terms of the step sizes $\{\gamma_k\}$,

where the expectation is taken over all random choices made by the algorithm. In particular, if the step sizes satisfy $\sum_{k=0}^{\infty} \gamma_k = \infty$ and $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$, then $\mathbb{E}[\Theta_\lambda(\bar{x})] \to 0$ as $K \to \infty$. Moreover, if we take $\gamma_k = \mathcal{O}(1/\sqrt{K})$ for $k = 0, \ldots, K-1$, then the number of iterations needed by StManPPA to obtain a point satisfying $\mathbb{E}[\Theta_\lambda(\bar{x})] \le \epsilon$ is $\mathcal{O}(1/\epsilon^4)$.

The sublinear convergence rate of StManPPA established in Theorem 3.2 is comparable to that of RSGM established in [13]. However, since StManPPA has a much lower per-iteration complexity than RSGM, the former is computationally more efficient.

4 Extension to Stiefel Manifold Constraint

In this section, we consider the matrix analog (1.4) of problem (1.1), which also arises in many applications such as certain "one-shot" formulations of the ODL and RSR problems (see Section 1.2). Currently, there are two existing approaches for solving problem (1.4), namely a sequential linear programming (SLP) approach and an iteratively reweighted least squares (IRLS) approach [27]. In the SLP approach, the columns of $X$ are extracted one at a time. Suppose that we have already obtained the first $j$ columns of $X$ ($1 \le j \le p-1$) and arrange them in the matrix $X_j = [x_1, \ldots, x_j]$ (with $X_j^\top X_j = I_j$). Then, the $(j+1)$-st column of $X$ is obtained by solving

$$\min_{x \in \mathbb{R}^n} \ \|A^\top x\|_1 \quad \text{subject to} \quad \|x\|_2 = 1, \quad X_j^\top x = 0.$$

This is achieved by the alternating linearization and projection (ALP) method, which generates the iterates via

$$x^{l+1} \in \arg\min_{x \in \mathbb{R}^n} \Big\{ \|A^\top x\|_1 \ : \ x^\top \hat{x}^l = 1, \ X_j^\top x = 0 \Big\}, \qquad \hat{x}^{l+1} = \frac{x^{l+1}}{\|x^{l+1}\|_2}.$$

Note that the $x$-subproblem is a linear program, which can be efficiently solved by off-the-shelf solvers.
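A minimal sketch of this ALP scheme using a generic convex solver (cvxpy); the initialization, iteration count, and the way the orthogonality constraint is imposed are simplifying assumptions.

```python
import cvxpy as cp
import numpy as np

def alp(A, Xj=None, iters=10, seed=0):
    """Alternating linearization and projection (sketch): repeatedly solve the
    linear program min ||A^T x||_1 s.t. x^T x_hat = 1 (and optionally Xj^T x = 0),
    then renormalize."""
    rng = np.random.default_rng(seed)
    n_dim = A.shape[0]
    x_hat = rng.standard_normal(n_dim)
    x_hat /= np.linalg.norm(x_hat)
    for _ in range(iters):
        v = cp.Variable(n_dim)
        cons = [x_hat @ v == 1]
        if Xj is not None:
            cons.append(Xj.T @ v == 0)     # keep the new column orthogonal to earlier ones
        cp.Problem(cp.Minimize(cp.norm1(A.T @ v)), cons).solve()
        x_hat = v.value / np.linalg.norm(v.value)
    return x_hat
```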

In the IRLS approach, the following variant of problem (1.4), which favors row-wise sparsity of $A^\top X$ and has also been studied by Lerman et al. in [12], is considered:

$$\min_{X \in \mathbb{R}^{n \times p}} \ \sum_{i=1}^m \|X^\top a_i\|_2 \quad \text{subject to} \quad X^\top X = I_p. \tag{4.1}$$

Here, the objective is the sum of the Euclidean norms of the rows of $A^\top X$. The IRLS method for solving (4.1) generates the iterates via

$$X^{l+1} \in \arg\min_{X^\top X = I_p} \ \sum_{i=1}^m w_i^l\, \|X^\top a_i\|_2^2,$$

where $w_i^l = 1/\max\{\delta, \|(X^l)^\top a_i\|_2\}$ and $\delta > 0$ is a perturbation parameter to prevent the denominator from being 0. The solution $X^{l+1}$ can be obtained via an SVD and is thus easy to implement. However, there has been no convergence guarantee for the IRLS method so far.
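A compact numpy sketch of this IRLS scheme; the weight formula, guard parameter, and initialization are as reconstructed above and should be treated as assumptions.

```python
import numpy as np

def irls_stiefel(A, p, iters=50, delta=1e-8):
    """IRLS for min sum_i ||X^T a_i||_2 s.t. X^T X = I_p (sketch).
    Each iteration solves a weighted least-squares problem whose solution is
    given by the eigenvectors of a weighted covariance matrix."""
    n, m = A.shape
    X = np.linalg.qr(np.random.default_rng(0).standard_normal((n, p)))[0]
    for _ in range(iters):
        r = np.linalg.norm(X.T @ A, axis=0)      # residual norms ||X^T a_i||_2
        w = 1.0 / np.maximum(r, delta)           # reweighting, guarded away from 0
        C = (A * w) @ A.T                        # weighted covariance sum_i w_i a_i a_i^T
        eigvals, eigvecs = np.linalg.eigh(C)
        X = eigvecs[:, :p]                       # eigenvectors of the p smallest eigenvalues
    return X
```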

Recently, Wang et al. [28] proposed a proximal alternating maximization method for solving a maximization version of (1.4), which arises in the so-called $\ell_1$-PCA problem (see [11]). However, the method cannot be easily adapted to solve problem (1.4).

The similarity between problems (1.1) and (1.4) suggests that the latter can also be solved by ManPPA. This is indeed the case. However, the SSN method for solving the resulting nonsmooth equation (i.e., the matrix analog of (2.11)) can be slow, as the dimension of the linear system (2.13) is high. To overcome this difficulty, we propose an alternative method called sequential ManPPA, which solves problem (1.4) in a column-by-column manner and constitutes the third main contribution of this paper. The method is based on the observation that the objective function in (1.4) is separable in the columns of $X$. To find the first column $x_1$, we simply solve

$$\min_{x \in \mathbb{R}^n} \ \|A^\top x\|_1 \quad \text{subject to} \quad \|x\|_2 = 1$$

using ManPPA, as it is an instance of problem (1.1). Now, suppose that we have found the first $j$ columns of $X$ ($1 \le j \le p-1$) and arrange them in the matrix $X_j = [x_1, \ldots, x_j]$ (with $X_j^\top X_j = I_j$). Then, we find the $(j+1)$-st column by solving

$$\min_{x \in \mathbb{R}^n} \ \|A^\top x\|_1 \quad \text{subject to} \quad \|x\|_2 = 1, \quad X_j^\top x = 0. \tag{4.2}$$

The difference between problems (1.1) and (4.2) lies in the extra linear equality constraint $X_j^\top x = 0$. Due to this constraint, neither the PSGM nor the RSGM can be easily applied to solve (4.2). However, our ManPPA can still be adapted to solve (4.2) with very minor modifications. To simplify notation, let us drop the index $j$ and denote the current iterate by $x^k$ and the matrix of previously computed columns by $W$. In the $k$-th iteration of ManPPA, we update the iterate by (2.1), where the search direction is computed by