# Dual Principal Component Pursuit: Probability Analysis and Efficient Algorithms

Recent methods for learning a linear subspace from data corrupted by outliers are based on convex ℓ_1 and nuclear norm optimization and require the dimension of the subspace and the number of outliers to be sufficiently small. In sharp contrast, the recently proposed Dual Principal Component Pursuit (DPCP) method can provably handle subspaces of high dimension by solving a non-convex ℓ_1 optimization problem on the sphere. However, its geometric analysis is based on quantities that are difficult to interpret and are not amenable to statistical analysis. In this paper we provide a refined geometric analysis and a new statistical analysis that show that DPCP can tolerate as many outliers as the square of the number of inliers, thus improving upon other provably correct robust PCA methods. We also propose a scalable Projected Sub-Gradient Method method (DPCP-PSGM) for solving the DPCP problem and show it admits linear convergence even though the underlying optimization problem is non-convex and non-smooth. Experiments on road plane detection from 3D point cloud data demonstrate that DPCP-PSGM can be more efficient than the traditional RANSAC algorithm, which is one of the most popular methods for such computer vision applications.

## Authors

• 23 publications
• 6 publications
• 17 publications
• 5 publications
• 60 publications
• 16 publications
10/15/2015

### Dual Principal Component Pursuit

We consider the problem of outlier rejection in single subspace learning...
06/06/2017

### Hyperplane Clustering Via Dual Principal Component Pursuit

We extend the theoretical analysis of a recently proposed single subspac...
10/06/2021

### Boosting RANSAC via Dual Principal Component Pursuit

In this paper, we revisit the problem of local optimization in RANSAC. O...
12/07/2013

### Robust Subspace System Identification via Weighted Nuclear Norm Optimization

Subspace identification is a classical and very well studied problem in ...
08/26/2011

### Solving Principal Component Pursuit in Linear Time via l_1 Filtering

In the past decades, exactly recovering the intrinsic data structure fro...
10/20/2010

### Robust PCA via Outlier Pursuit

Singular Value Decomposition (and Principal Component Analysis) is one o...
12/07/2015

### Pseudo-Bayesian Robust PCA: Algorithms and Analyses

Commonly used in computer vision and other applications, robust PCA repr...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Fitting a linear subspace to a dataset corrupted by outliers is a fundamental problem in machine learning and statistics, primarily known as

(Robust) Principal Component Analysis (PCA) or Robust Subspace Recovery (Jolliffe, 1986; Candès et al., 2011; Lerman and Maunu, 2018)

. The classical formulation of PCA, dating back to Carl F. Gauss, is based on minimizing the sum of squares of the distances of all points in the dataset to the estimated linear subspace. Although this problem is non-convex, it admits a closed form solution given by the span of the top eigenvectors of the data covariance matrix. Nevertheless, it is well-known that the presence of outliers can severely affect the quality of the computed solution because the Euclidean norm is not robust to outliers. To clearly specify the problem, given a unit

-norm dataset , where are inlier points spanning a -dimensional subspace of , are outlier points having no linear structure, and is an unknown permutation, the goal is to recover the inlier space or equivalently to cluster the points into inliers and outliers.

### 1.1 Related Work

The sensitivity of classical -based PCA to outliers has been addressed by using robust maximum likelihood covariance estimators, such as the one in (Tyler, 1987). However, the associated optimization problems are non-convex and thus difficult to provide global optimality guarantees. Another classical approach is the exhaustive-search method of Random Sampling And Consensus (RANSAC) (Fischler and Bolles, 1981), which given a time budget, computes at each iteration a -dimensional subspace as the span of randomly chosen points, and outputs the subspace that agrees with the largest number of points. Even though RANSAC is currently one of the most popular methods in many computer vision applications such as multiple view geometry (Hartley and Zisserman, 2004), its performance is sensitive to the choice of a thresholding parameter. Moreover, the number of required samplings may become prohibitive in cases when the number of outliers is very large and/or the subspace dimension is large and close to the dimension of the ambient space (i.e., the high relative dimension case).

As an alternative to traditional robust subspace learning methods, during the last decade ideas from compressed sensing (Candès and Wakin, 2008) have given rise to a new class of methods that are based on convex optimization, and admit elegant theoretical analyses and efficient algorithmic implementations.

##### ℓ2,1-Rpca

A prominent example is based on decomposing the data matrix into low-rank (corresponding to the inliers) and column-sparse parts (corresponding to the ourliers) (Xu et al., 2010). Specifically, -RPCA solves the following problem

 minimizeL,E∥L∥∗+λ∥E∥2,1, s.t.  L+E=˜\mathbfcalX, (1)

where denotes the nulcear norm of

(a.k.a the sum of the singular values of

) and amounts to the sum of the norm of each column in . The main insight behind -RPCA is that has rank which is small for a relative low dimensional subspace, while has zero columns. Since both the norm and the nuclear norm are convex functions and the later can be optimized efficiently via SemiDefinite Programming (SDP), (1) can be solved using a general SDP solver such as SDPT3 or SeDuMi. However, such methods do not scale well to large data-sets and the SDP becomes prohibitive even for modest problem sizes. As a consequence, the authors (Xu et al., 2010) proposed an efficient algorithm based on proximal gradient descent. The main limitation of this method is that it is theoretically justifiable only for subspaces of low relative dimension (and hence is low-rank compared to the ambient dimension ).

##### Coherence Pursuit (CoP)

Motivated by the observation that inliers lying in a low-dimensional subspace are expected to be correlated with each other while the outliers do not admit such strucuture or property, CoP (Rahmani and Atia, 2016) first measures the coherence of each point with every other point in the dataset. Compared to the outliers, the inliers tend to have higher coherence with the rest of the points in the data set. Thus, the inliers are then simply detected by selecting sufficiently number of points that have the highest coherence with the rest of the points until they span a -dimensional subspace. This algorithm can be implemented very efficiently as it only involves computing the Gram matrix of the data set for the coherence and also as demonstrated in (Rahmani and Atia, 2016), CoP has a competitive performance. Similar to the -RPCA, the main limitation of CoP is that it is theoretically guaranteed only for . However, for applications such as D point cloud analysis, two/three-view geometry in computer vision, and system identification, a subspace of dimension (high relative dimension) is sought (Späth and Watson, 1987; Wang et al., 2015).

##### Reaper

A promising direction towards handling subspaces of high relative dimension is minimizing the sum of the distances of the points to the subspace111

denotes the set of eigenvalues.

 minimizePdL∑j=1∥(I−Pd)˜xj∥2 s.t.  trace(Pd)=d,Pd=P⊤d,eig(Pd)={0,1}, (2)

where is an orthogonal projection onto any -dimensinal subspace. Due to the nonconvexity of the set of orthogonal projections, (2) is a non-convex problem, which REAPER (Lerman et al., 2015a) relaxes to the following SDP:

 minimizePdL∑j=1∥(I−Pd)˜xj∥2 s.t.  trace(Pd)=d, 0⪯Pd⪯I. (3)

To overcome the computational issue of the above SDP solved by general SDP solvers, an Iteratively Reweighted Least Squares (IRLS) scheme is proposed in (Lerman et al., 2015a) to obtain a numerical solution of (3). Even though in practice REAPER outperforms low-rank methods (Xu et al., 2010; Soltanolkotabi et al., 2014; You et al., 2017; Rahmani and Atia, 2016) for subspaces of high relative dimension, its theoretical guarantees still require .

The limitation mentioned above of REAPER is improved upon by the recent work of (Maunu and Lerman, 2017) that directly considers (2) by parameterizing the orthogonal projector and addressing its equivalent form

 minimizeV∈RD×dL∑j=1∥(I−VV⊤)˜xj∥2 s.t.  VTV=I, (4)

which is then solved by using a geodesic gradient descent (GGD) algorithm (which is a subgradient method on the Grassmannian in the strict sense). When initialized near the inlier subspace, GGD is theoretically guaranteed to converge to the inlier subspace, despite the nonconvexity of (2). In particular, with a random Gaussian model for the dataset, this convergence holds with high-probability for any , as long as the number of outliers scales as . Also as a first-order method in nature, the GGD can be implemented efficiently with each iteration mainly involving the computation of a subgradient, its singular value decompotion (SVD) and update of the subspace algong the geodesic direction. However, only a sublinear convergence rate ( where indicaes the total number of iteration) was formally established for GGD in (Maunu and Lerman, 2017), although a fast convergence was experimentally observed with a line search method for choosing the step sizes.

##### Dual Principal Component Pursuit (DPCP)

As a dual approach to (2), the recently proposed DPCP method (Tsakiris and Vidal, 2015, 2017a, 2017b) seeks to learn recursively a basis for the orthogonal complement of the subspace by solving an

minimization problem on the sphere. In particular, the main idea of DPCP is to first compute a hyperplane

that contains all the inliers . Such a hyperplane can be used to discard a potentially very large number of outliers, after which a method such as RANSAC may successfully be applied to the reduced dataset 222Note that if the outliers are in general position, then will contain at most outliers.. Alternatively, if is known, then one may proceed to recover as the intersection of orthogonal hyperplanes that contain

. In any case, DPCP computes a normal vector

to the first hyperplane as follows:

 minb∈RD∥˜\mathbfcalX⊤b∥0 s.t.  b≠0. (5)

Notice that the function being minimized simply counts how many points in the dataset that are not contained in the hyperplane with normal vector . Assuming that there are at least inliers and at least outliers (this is to avoid degenerate situations), and that all points are in general position333Every -tuple of inliers is linearly independent, and every -tuple of outliers is linearly independent., then every solution to (5) must correspond to a hyperplane that contains , and hence is orthogonal to . Since (5) is computationally intractable, it is reasonable to replace it by444This optimization problem also appears in different contexts (e.g., (Qu et al., 2014) and (Späth and Watson, 1987)).

 minb∈RDf(b):=∥˜\mathbfcalX⊤b∥1 s.t.  ∥b∥2=1. (6)

In fact, the above optimization problem is equivalent to (4) for the special case . As shown in (Tsakiris and Vidal, 2015, 2017a), as long as the points are well distributed in a certain deterministic sense, any global minimizer of this non-convex problem is guaranteed to be a vector orthogonal to the subspace, regardless of the outlier/inlier ratio and the subspace dimension; a result that agrees with the earlier findings of (Lerman and Zhang, 2014). Indeed, for synthetic data drawn from a hyperplane (), DPCP has been shown to be the only method able to correctly recover the subspace with up to outliers (). Nevertheless, the analysis of (Tsakiris and Vidal, 2015, 2017a) involves geometric quantities that are difficult to analyze in a probabilistic setting, and consequently it has been unclear how the number of outliers that can be tolerated scales as a function of the number of inliers. Moreover, even though (Tsakiris and Vidal, 2015, 2017a)

show that relaxing the non-convex problem to a sequence of linear programs (LPs) guarantees finite convergence to a vector orthogonal to the subspace, this approach is computationally expensive. See

Section 3.1 for the corresponding Alternating Linerization and Projection (ALP) method. Alternatively, while the IRLS scheme proposed in (Tsakiris and Vidal, 2017a, b)

is more efficient than the linear programming approach, it comes with no theoretical guarantees and scales poorly for high-dimensional data, since it involves an SVD at each iteration.

We note that the literature on this subject is vast and the above mentioned related work is far from exhaustive; other methods such as the Fast Median Subspace (FMS) (Lerman and Maunu, 2017), Geometric Median Subspace (GMS) (Zhang and Lerman, 2014), Tyler M-Estimator (TME) (Zhang, 2016), Thresholding-based Outlier Robust PCA (TORP) (Cherapanamjeri et al., 2017), expressing each data point as a sparse linear combination of other data points (Soltanolkotabi et al., 2014; You et al., 2017), online subspace learning (Balzano et al., 2010), etc. For many other related methods, we refer to a recent review article (Lerman and Maunu, 2018) that thoroughly summarizes the entire body of work on robust subspace recovery.

### 1.2 Main Contribution and Outline

The focus of the present paper is to provide improved deterministic analysis as well as probability analysis of global optimialtiy and efficient algorithms for the DPCP approach. We make the following specific contributions.

• Theory: Although the conditions established in (Tsakiris and Vidal, 2015, 2017a) suggest that global minimizers of (6) are orthogonal to (if the outliers are well distributed on the unit sphere and the inliers are well distributed on the intersection of the unit sphere with the subspace ), they are deterministic in nature and difficult to interpret. In Section 2, we provide an improved analysis of global optimality for DPCP that replaces the cumbersome geometric quantities in (Tsakiris and Vidal, 2015, 2017a) with new quantities that are both tighter and easier to bound in probability. In order to do this, we first provide a geometric characterization of the critical points of (6) (see Lemma 1), revealing that any critical point of (6) is either orthogonal to the inlier subspace , or very close to , with its principal angle from the inlier subspace being smaller for well distributed points and smaller outlier to inlier ratios . Employing a spherical random model, the improved global optimality condition suggests that DPCP can handle outliers. This is in sharp contrast to existing provably correct state-of-the-art robust PCA methods, which as per the recent review (Lerman and Maunu, 2018, Table I) can tolerate at best outliers, when the subspace dimension and the ambient dimension are fixed. The comparison of the largest number of outliers can be tolarated by different approaches is listed in Table 1, where the random models are descripted in Section 2.3.

• Algorithms: In Section 3, we first establish conditions under which solving one linear program returns a normal vector to the inlier subspace. Specifically, given an appropriate , the optimal solution of the following linear program

 minb∈RDf(b):=∥˜\mathbfcalX⊤b∥1 s.t.  b⊤0b=1

must be orthogonal to the inlier subspace. This improves upon the convergence guarantee in (Tsakiris and Vidal, 2017a, b) where the ALP is only proved to converge in a finite number of iterations, but without any explict convergence rate. To further reduce the computational burden, we then provide a scalable Projected Sub-Gradient Method with piecewise geometrically diminishing step size (DPCP-PSGM), which is proven to solve the non-convex DPCP problem (6) with linear convergence and using only matrix-vector multiplications. This is in sharp contrast to classic results in the literature on PSGM methods, which usually require the problem to be convex in order to establish sub-linear convergence (Boyd et al., 2003). DPCP-PSGM is orders of magnitude faster than the ALP and IRLS schemes proposed in (Tsakiris and Vidal, 2017a), allowing us to extend the size of the datasets that we can handle from to data points.

• Experiments: Aside from experiments with synthetical data, we conduct experiments on road plane detection from 3D point cloud data using the KITTI dataset (Geiger et al., 2013), which is an important computer vision task in autonomous car driving systems. The experiments show that for the same computational budget DPCP-PSGM outperforms RANSAC, which is one of the most popular methods for such computer vision applications.

### 1.3 Notation

We briefly introduce some of the notations used in this paper. Finite-dimensional vectors and matrices are indicated by bold characters. The symbols and

represent the identity matrix and zero matrix/vector, respectively. We denote the sign function by

 sign(a):={a/|a|,a≠0,0,a=0.

We also require the sub-differential of the absolute value function defined as

 Sgn(a):={a/|a|,a≠0,,a=0. (7)

Denote by an arbitrary point in . We also use to indicate that we apply the function element-wise to the vector and similarly for and . The unit sphere of is denoted by .

## 2 Global Optimality Analysis for Dual Principal Component Pursuit

Although problem (6) is non-convex (because of the constraint) and non-smooth (because of the norm), the work of (Tsakiris and Vidal, 2015, 2017a) established conditions suggesting that if the outliers are well distributed on the unit sphere and the inliers are well distributed on the intersection of the unit sphere with the subspace , then global minimizers of (6) are orthogonal to . Nevertheless, these conditions are deterministic in nature and difficult to interpret. In this section, we give improved global optimality conditions that are i) tighter, ii) easier to interpret and iii) amenable to a probabilistic analysis.

### 2.1 Geometry of the Critical Points

The heart of our analysis lies in a tight geometric characterization of the critical points of (6) (see Lemma 1 below). Before stating the result, we need to introduce some further notation and definitions. Letting be the orthogonal projection onto , we define the principal angle of from as such that . Since we will consider the first-order optimality condition of (6), we naturally need to compute the sub-differential of the objective function in (6). Since is convex, its sub-differental at is defined as

 ∂f(b):={d′∈RD:f(a)≥f(b)+⟨d′,a−b⟩, ∀a∈RD},

where each is called a subgradient of at . Note that the

norm is subdifferetially regular. By the chain rule for subdifferentials of subdifferentially regular functions, we have

 ∂f(b)=˜\mathbfcalXSgn(˜\mathbfcalX⊤b). (8)

Next, global minimizers of (6) are critical points in the following sense:

###### Definition 1

is called a critical point of (6) if there is such that the Riemann gradient .

We now illustrate the key idea behind characterizing the geometry of the critical points. Let be a critical point that is not orthogonal to . Then, under general position assumptions on the data, can be orthogonal to columns of . It follows that any Riemann sub-gradient evaluated at has the form

 d=(I−bb⊤)\mathbfcalXsgn(\mathbfcalX⊤b)+(I−bb⊤)\mathbfcalOsign(\mathbfcalO⊤b)+ξ, (9)

where with the columns of orthogonal to and . Note that . Since is a critical point, Definition 1 implies a choice of so that . Decompose , where is the principal angle of from , and and are the orthonormal projections of onto and , respectively. Defining and noting that , it follows that

 0=⟨d,g⟩ =g⊤\mathbfcalOsign(\mathbfcalO⊤b)+g⊤\mathbfcalXsgn(\mathbfcalX⊤b)+g⊤ξ =g⊤\mathbfcalOsign(\mathbfcalO⊤b)−sin(ϕ)s⊤\mathbfcalXsgn(\mathbfcalX⊤s)+g⊤ξ =g⊤\mathbfcalOsign(\mathbfcalO⊤b)−sin(ϕ)∥∥\mathbfcalX⊤s∥∥1+g⊤ξ,

which in particular implies that

 sin(ϕ) =g⊤\mathbfcalOsign(\mathbfcalO⊤b)+g⊤ξ∥∥\mathbfcalX⊤s∥∥1 ≤∣∣g⊤\mathbfcalOsign(\mathbfcalO⊤b)∣∣+D∥∥\mathbfcalX⊤s∥∥1

since . Thus, we obtain Lemma 1 after defining

 η\mathbfcalO:=1Mmaxb∈SD−1 ∥∥(I−bb⊤)\mathbfcalOsign(\mathbfcalO⊤b)∥∥2 (10)

and

 c\mathbfcalX,min:=1Nminb∈S∩SD−1∥\mathbfcalX⊤b∥1. (11)
###### Lemma 1

Any critical point of (6) must either be a normal vector of , or have a principal angle from smaller than or equal to , where

 ¯¯¯η\mathbfcalO:=η\mathbfcalO+D/M. (12)

Towards interpreting Lemma 1, we first give some insight into the quantities and . First, we claim that

reflects how well distributed the outliers are, with smaller values corresponding to more uniform distributions. This can be seen by noting that as

and assuming that remains well distributed, the quantity tends to the quantity , where is the average height of the unit hemi-sphere of (Tsakiris and Vidal, 2015, 2017a)

 (13)

Since , in the limit . Second, the quantity is the same as the permeance statistic defined in (Lerman et al., 2015a), and for well-distributed inliers is bounded away from small values, since there is no single direction in sufficiently orthogonal to . We thus see that according to Lemma 1, any critical point of (6) is either orthogonal to the inlier subspace , or very close to , with its principal angle from being smaller for well distributed points and smaller outlier to inlier ratios . Interestingly, Lemma 1 suggests that any algorithm can be utilized to find a normal vector to as long as the algorithm is guaranteed to find a critical point of (6) and this critical point is sufficiently far from the subspace , i.e., it has principal angle larger than . We will utilize this crucial observation in the next section to derive guarantees for convergence to the global optimum for a new scalable algorithm.

We now compare creftypecap 1 with the result in (Maunu and Lerman, 2017, Theorem 1). In the case when the subspace is a hyperplane, creftypecap 1 and (Maunu and Lerman, 2017, Theorem 1) share similarities and differences. For comparison, we interpret the results in (Maunu and Lerman, 2017, Theorem 1) for the DPCP problem. On one hand, both creftypecap 1 and (Maunu and Lerman, 2017, Theorem 1) attempt to characterize certian behaviors of the objective function when is away from the subspace by looking at the first-order information. On the other hand, we obtain creftypecap 1 by directly considering the Riemannian subdifferentional and proving that any Riemannian subgradient is not zero when is away from the subspace but not its normal vector. While (Maunu and Lerman, 2017, Theorem 1) is obtained by checking a particular (directional) geodesic subderivatrive and showing it is negative666There is a subtle issue for the optimality condition by only checking a particular subderivative. This issue can be solved by checking either all the elements in the (directional) geodesic subdifferentional or (directional) geodesic directional derivative. In particular, under the general assumption of the data points as utilized in this paper, this issue can be mitigated by adding an additional term (such as the difference between and ) into (Maunu and Lerman, 2017, Theorem 1).. These two approaches also lead to different quantities utilized for capturing the region in which there is no critical point or local minimum. Particularly, with let

which are the regions in which there is no critical point and no local minimum as claimed in creftypecap 1 and (Maunu and Lerman, 2017, Theorem 1), respectively777(Maunu and Lerman, 2017, Theorem 1) utilizes a different quantity, which we prove is equivalent to for the hyperplance case.. We note that is larger than ,888This can be seen as follows

where the second equality follows from the min-max theorem. indicating that the region is larger than . Also, under a probability setting, we provide a much tighter upper bound for , i.e., versus (which roughtly scales as ) in (Maunu and Lerman, 2017). Consequently our result leads to a much better bound on the number of outliers that can be tolerated scales as a function of the number of inliers.

We finally note that (Maunu and Lerman, 2017, Theorem 1) also covers the case where the subspace has higher co-dimension. We leave the extension of creftypecap 1 for multiple normal vectors as future work.

### 2.2 Global Optimality

In order to characterize the global solutions of (6), we define quantities similar to but associated with the outliers, namely

 c\mathbfcalO,min:=1Mminb∈SD−1∥\mathbfcalO⊤b∥1  and  c\mathbfcalO,max:=1Mmaxb∈SD−1∥\mathbfcalO⊤b∥1. (14)

The next theorem, whose proof relies on creftypecap 1, provides new deterministic conditions under which any global solution to (6) must be a normal vector to .

###### Theorem 1

Any global solution to (6) must be orthogonal to the inlier subspace as long as

 MN⋅√¯¯¯η2\mathbfcalO+(c\mathbfcalO,max−c\mathbfcalO,min)2c\mathbfcalX,min<1. (15)

The proof of creftypecap 1 is given in Section 4.1. Towards interpreting Theorem 1, recall that for well distributed inliers and outliers is small, while the permeance statistics are bounded away from small values. Now, the quantity , thought of as a dual permeance statistic, is bounded away from large values for the reason that there is not a single direction in that can sufficiently capture the distribution of . In fact, as increases the two quantities tend to each other and their difference goes to zero as . With these insights, Theorem 1 implies that regardless of the outlier/inlier ratio , as we have more and more inliers and outliers while keeping and fixed, and assuming the points are well-distributed, condition (15) will eventually be satisfied and any global minimizer must be orthogonal to the inlier subspace .

We note that a similar condition to (15) is also given in (Tsakiris and Vidal, 2015, Theorem 2). Although the proofs of the two theorems share some common elements, (Tsakiris and Vidal, 2015, Theorem 2) is derived by establishing discrepancy bounds between (6) and a continuous analogue of (6), and involves quantities difficult to handle such as spherical cap discrepancies and circumradii of zonotopes. In addition, as shown in Figure 1, a numerical comparison of the conditions of the two theorems reveals that condition (15) is much tighter. We attribute this to the quantities in our new analysis better representing the function being minimized, namely , , , and , when compared to the quantities used in the analysis of (Tsakiris and Vidal, 2015, 2017a). Moreover, our quantities are easier to bound under a probabilistic model, thus leading to the following characterization of the number of outliers that may be tolerated.

###### Theorem 2

Consider a random spherical model where the columns of and are drawn independently and uniformly at random from the unit sphere and the intersection of the unit sphere with a subspace of dimension , respectively. Then for any positive scalar , with probability at least , any global solution of (6) is orthogonal to as long as

 (4+t)2M+C0(√DlogD+t)2M≤(cdN−(2+t/2)√N)2, (16)

where is a universal constant that is indepedent of and , and is defined in (13).

The proof of creftypecap 2 is given in Section 4.2. To interpret (16), first note that for any .999We show it by induction. For the LHS, first note that holds for and . Now suppose it holds for any and we show it is also true for . Towards that end, by the defintion of (13), we have Thus, by induction, we have for any . Similarly, for the RHS, holds for any and . Now suppose it holds for any and we show it is also true for . Towards that end, by the defintion of (13), we have Thus, by induction, we have for any . As a consequence, (16) implies that at least . More interestingly, according to Theorem 2 DPCP can tolerate outliers, and particularly for fixed and . We note that the universal constant comes from [ (Van der Vaart, 1998), Cor. 19.35] which is utilized to bound the supreme of an empirical process related to our quantity . However, it is possible to get rid of this constant or obtain an explit expression of the constant by utilizing a different approach to interpret in the random model. With a different approach for , we also believe it is possible to improve the bound with respect to for (16). In particular, we expect that an alternative bound for improves the condition for the success of DPCP up to . This topic will be the subject of future work.

As corollaries of creftypecap 2, the following two results further establish the global optimality condition for the random spherical model in the cases of high-dimensional subspace and large-scale data points.

###### Corollary 1 (high-dimensional subspace)

Consider the same random spherical model as in creftypecap 2. Then for any positive scalar , with probability at least , any global solution of (6) is orthogonal to as long as

 (4+√td)2M+C0(√DlogD+√td)2M≤(cdN−(2+√td2)√N)2, (17)

where is a universal constant in creftypecap 2.

###### Corollary 2 (large-scale data points)

Consider the same random spherical model as in creftypecap 2. Then for any and positive scalar , with probability at least , any global solution of (6) is orthogonal to as long as

 (4+√tNα)2M+C0(√DlogD+√tNα)2M≤(cdN−(2+√tNα2)√N)2, (18)

where is a universal constant in creftypecap 2.

Unlike creftypecap 2, for fixed , the failure probability in creftypecap 1 or creftypecap 2 decreases when or increases, respectively. On the other hand, (17) gives a similar bound while (18) requires , which maybe slightly stronger than indicated by (16).

### 2.3 Comparison with Existing Results

We now compare with the existing methods that are provably tolerable to the outliers. The methods are covered by a recent review in (Lerman and Maunu, 2018, Table I), including the Geodesic Gradient Descent (GGD) (Maunu and Lerman, 2017), Fast Median Subspace (FMS) (Lerman and Maunu, 2017), REAPER (Lerman et al., 2015a), Geometric Median Subspace (GMS) (Zhang and Lerman, 2014), -RPCA (Xu et al., 2010) (which is called Outlier Pursuit (OP) in (Lerman and Maunu, 2018, Table I)), Tyler M-Estimator (TME) (Zhang, 2016), Thresholding-based Outlier Robust PCA (TORP) (Cherapanamjeri et al., 2017) and the Coherence Pursuit (CoP) (Rahmani and Atia, 2016). However, we note that the comparison maynot be very fair since the results summarized in (Lerman and Maunu, 2018, Table I) are established for random Gaussian models where the columns of and are drawn independently and uniformly at random from the distribtuion and with being an orthonormal basis of the inlier subspace . Nevertheless, these two random models are closely related since each columns of or in the random Gaussian model is also concentrated around the sphere , especially when is large.

That being said, we now review these results on the random Gaussian model. First, the global optimality condition in (Lerman et al., 2015a) indicates that with probability at least , the inlier subpsace can be exactly recovered by solving the convex problem (3) (possibly with a final projection step) if

 6(MD+1+t)≤1√32π(Nd−π(4+2t)), d≤(D−1)/2. (19)

Compared with (19) which requires , (16) requires . On one hand, when the dimension and are fixed as constants, (16) gives a better relationship between and than (19). On the other hand, the relationship between and given in (16) is worser than the one in (19).

The work of (Maunu and Lerman, 2017) establishes and interprets a local optimality condition (which is similar to creftypecap 1) rather than a global one for (2). Specifically, according to (Maunu and Lerman, 2017, Theorem 5), suppose that for some absolute constant , and other constants , and ,

 cos(θ)((1−a)√Nd−˜C1)2>⎛⎝M√dD+√Md+√2M˜C0dD⎞⎠, N>˜C21(1−a)2d. (20)

Then, with probability at least for some absolute constant , the inlier subspace is the only local minimizer of (2) among all the subspaces (which is one-to-one correspondence to their orthogonal projection) that have subspace angle at most to the inlier subspace . Using this, one may establish a similar global optimality condition with the approach used in creftypecap 1. Here, we instead interpret our creftypecap 1 in the random spherical model, implying that for any positive constants , if

 cos(θ)(cdN−(2+t2)√N)>C0(√Dlog(√cDD)+t)√M, t<2(cd√N−2), (21)

then with probability at least , any critical point of (6) that has principal angle from small than must be orthogonal to . Here is a universal constant in creftypecap 2. Now comparing (20) and (21), (20) requires , while (21) needs . As we stated before, this difference mostly owes to a much tighter upper bound for , i.e., versus (which roughtly scales as ) in (Maunu and Lerman, 2017).

Finally, we summarize the exact recovery condition or global optimality condition for the existing methods that are provably tolerable to the outliers in Table 1 which imitates (Lerman and Maunu, 2018, Table I). One one hand, for fixed and , Table 1 indicates that DPCP in the only method that can tolerate up to outliers. On the other hand, the result on DPCP has suboptimal bound with respect to .

## 3 Efficient Algorithms for Dual Principal Component Pursuit

In this section, we first review the linear programming approach proposed in (Tsakiris and Vidal, 2015, 2017a) for solving the DPCP problem (6) and provide new convergence guarantee for this approach. We then provide a projected sub-gradient method which has guaranteed convergence performance and is scalable to the problem sizes as it only uses marix-vector multiplications in each iteration.

### 3.1 Alternating Linerization and Projection Method

Note that the DPCP problem (6) involves a convex objective function and a non-convex feasible region, which nevertheless is easy to project onto. This structure was exploited in (Qu et al., 2014; Tsakiris and Vidal, 2015), where in the second case the authors proposed an Alternating Linearization and Projection (ALP) method that solves a sequence of linear programs (LP) with a linearization of the non-convex constraint and then projection onto the sphere. Specifically, if we denote the constraint function associated with (6) as , then the first order Tayor approximation of at any is . With an initial guess of , we compute a sequence of iterates via the update (Tsakiris and Vidal, 2017a)

 bk=arg~{}minb∈Rd∥˜\mathbfcalX⊤b∥1, s.t.  b⊤ˆbk−1=1andˆbk=bk/∥bk∥2, (22)

where the optimization subproblem can be written as a linear program (LP) rewritting the norm in an equivalent linear form with auxiliary variables. An alternatively view of the constraint in (22) is that it defines an affine hyperplane which excludes the original point and has as its normal vector.

The following result establishes conditions under which is orthogonal to the subspace and new conditions to guarantee that the sequence converges to a normal vector of in a finite number of iterations.

###### Theorem 3

Consider the sequence generated by the recursion (22). Let be the principal angle of from . Then,

1. is orthogonal to the subspace if

 ϕ0>ϕ♮0:=arctan(Mc\mathbfcalO,maxNc\mathbfcalX,min−M¯¯¯η\mathbfcalO); (23)
2. the sequence converges to a normal vector of in a finite number of iterations if

 ϕ0>ϕ⋆0:=arccos⎛⎜ ⎜⎝√N2c2\mathbfcalX,min−M2¯¯¯η2\mathbfcalO−M(c\mathbfcalO,max−c\mathbfcalO,min)Nc\mathbfcalX,max⎞⎟ ⎟⎠, (24)

where .

The proof of creftypecap 3 is given in Section 4.3. First note that the expressions in (23) and (24) always define angles between and when the condition (15) is satisfied. As illustrated in Figure 2, when the initialization is not very close to the subspace , creftypecap 3 () indicates that one procedure of (22) gives a vector that is orthogonal , i.e., finds a hyperplance that contains all the inliers . creftypecap 3 () suggests that the requirement on the principal angle of the initialization can be weakened if we are allowed to implement multiple alternating procedures in (22). To see this, for well distributed points (both inliers and outliers), the angle defined in (24) tends to zero as go to infitiny with constant; while defined in (23) goes to . We finally note that a similar condition to (24) is also given in (Tsakiris and Vidal, 2017a, Theorem 12). Compared with the condition in (Tsakiris and Vidal, 2017a, Theorem 12), (24) defines a slightly smaller angle and again more importantly, the quantities in (15) is slightly tighter and amenable to probabilistic analysis.

### 3.2 Projected Sub-Gradient Method for Dual Principal Component Pursuit

Although efficient LP solvers (such as Gurobi (Gurobi Optimization, 2015)) may be used to solve each LP in the ALP approach, these methods do not scale well with the problem size (i.e., and ). Inspired by creftypecap 1, which states that any critical point that has principal angle larger than must be a normal vector of , we now consider solving (6) with a first-order method, specifically Projected Sub-Gradient Method (DPCP-PSGM), which is stated in Algorithm 1.

To see why it is possible that DPCP-PSGM finds a normal vector, recall that at the -th step:

 bk=ˆbk−1−μk˜\mathbfcalXsign(˜\mathbfcalX⊤ˆbk−1)=ˆbk−1−μk\mathbfcalXsign(\mathbfcalX⊤ˆbk−1)−μk\mathbfcalOsign(\mathbfcalO⊤ˆbk−1). (25)

For the rest of this section, it is more convenient to use the principal angle between and the orthogonal subspace ; thus is a normal vector of if and only if . Similarly let be the principal angle between and the complement . Suppose , i.e., is not a normal vector to . We rewrite as

 ˆbk−1=sin(θk−1)sk−1+cos(θk−1)nk−1, (26)

where and are the orthonormal projections of onto and , respectively. Since is the normalized version of , they have the same principal angle to . We now consider how the principal angle changes after we apply sub-gradient method to as in (25). Towards that end, with (26), we first decompose the term (appeared in (25)) into two parts

 \mathbfcalOsign(