# Coherence Pursuit: Fast, Simple, and Robust Principal Component Analysis

This paper presents a remarkably simple, yet powerful, algorithm termed Coherence Pursuit (CoP) for robust Principal Component Analysis (PCA). As inliers lie in a low dimensional subspace and are mostly correlated, an inlier is likely to have strong mutual coherence with a large number of data points. By contrast, outliers either do not admit low dimensional structures or form small clusters. In either case, an outlier is unlikely to bear strong resemblance to a large number of data points. Given that, CoP sets an outlier apart from an inlier by comparing their coherence with the rest of the data points. The mutual coherences are computed by forming the Gram matrix of the normalized data points. Subsequently, the sought subspace is recovered from the span of the subset of the data points that exhibit strong coherence with the rest of the data. As CoP only involves one simple matrix multiplication, it is significantly faster than the state-of-the-art robust PCA algorithms. We derive analytical performance guarantees for CoP under different models for the distributions of inliers and outliers in both noise-free and noisy settings. CoP is the first robust PCA algorithm that is simultaneously non-iterative, provably robust to both unstructured and structured outliers, and able to tolerate a large number of unstructured outliers.

## I Introduction

Standard tools such as Principal Component Analysis (PCA) have been instrumental in reducing dimensionality by finding linear projections of high-dimensional data along the directions where the data is most spread out to minimize information loss. These techniques are widely applicable in a broad range of data analysis problems, including problems in computer vision, image processing, machine learning and bioinformatics [1, 2, 3, 4, 5, 6].

Given a data matrix $\mathbf{D} \in \mathbb{R}^{m \times n}$, PCA finds an $r$-dimensional subspace by solving

$$\min_{\hat{\mathbf{U}}} \ \|\mathbf{D} - \hat{\mathbf{U}}\hat{\mathbf{U}}^T \mathbf{D}\|_F \quad \text{subject to} \quad \hat{\mathbf{U}}^T \hat{\mathbf{U}} = \mathbf{I}, \tag{1}$$

where $\hat{\mathbf{U}}$ is an orthonormal basis for the $r$-dimensional subspace, $\mathbf{I}$ denotes the identity matrix, and $\|\cdot\|_F$ the Frobenius norm. Despite its notable impact on exploratory data analysis and multivariate analyses, PCA is notoriously sensitive to outliers, which are prevalent in much real-world data: the solution to (1) can deviate arbitrarily from the true subspace in the presence of a small number of outlying data points that do not conform to the low-dimensional model [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17].
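As a point of reference, the minimizer of (1) can be obtained from a truncated SVD (the Eckart-Young theorem), and its sensitivity to a single gross outlier is easy to observe numerically. The following is a minimal NumPy sketch; the function names, matrix sizes, and the magnitude of the outlier are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def pca_subspace(D, r):
    """Orthonormal basis of the r-dimensional subspace minimizing (1)."""
    # The top-r left singular vectors of D solve (1) (Eckart-Young theorem).
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :r]

def residual(D, U_hat):
    """Frobenius norm of the part of D that lies outside span(U_hat)."""
    return np.linalg.norm(D - U_hat @ (U_hat.T @ D), "fro")

rng = np.random.default_rng(0)
m, n, r = 20, 50, 3  # illustrative sizes (assumption)
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r data

# Without outliers, PCA captures the clean data exactly.
clean_residual = residual(L, pca_subspace(L, r))

# Replace one column by a large outlier, refit, and measure how well the
# refitted subspace still explains the clean data.
D = L.copy()
D[:, 0] = 1e3 * rng.standard_normal(m)
corrupted_residual = residual(L, pca_subspace(D, r))
```

In this sketch the refitted subspace spends one of its $r$ directions on the outlier, so part of the clean data is no longer captured.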

As a result, much research has been devoted to PCA algorithms that are robust to outliers. The corrupted data can be expressed as

$$\mathbf{D} = \mathbf{L} + \mathbf{C}, \tag{2}$$

where $\mathbf{L}$ is a low rank matrix whose columns span a low-dimensional subspace, and the matrix $\mathbf{C}$, referred to as the outlier matrix, models the data corruption. Two main models for the outlier matrix were considered in the literature – these two models are mostly incomparable in theory, practice and analysis techniques. The first corruption model is the element-wise model in which $\mathbf{C}$ is a sparse matrix with arbitrary support, whose entries can have arbitrarily large magnitudes [18, 19, 20, 21, 22, 23, 24]. In view of the arbitrary support of $\mathbf{C}$, any of the columns of $\mathbf{L}$ may be affected by the non-zero elements of $\mathbf{C}$. We do not consider this model in this paper. The second model, which is the focus of our paper, is a column-wise model wherein only a fraction of the columns of $\mathbf{C}$ are non-zero, wherefore a portion of the columns of $\mathbf{L}$ (the so-called inliers) remain unaffected by $\mathbf{C}$ [25, 26, 27, 28, 29].

### I-A The inlier-outlier structure

We formally describe the data model adopted in this paper, which only focuses on the column-wise outlier model.

###### Data Model 1.

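The given data matrix $\mathbf{D}$ satisfies the following.
1. The matrix $\mathbf{D}$ can be expressed as

$$\mathbf{D} = \mathbf{L} + \mathbf{C} = [\mathbf{A} \ \ \mathbf{B}]\,\mathbf{T}, \tag{3}$$

where $\mathbf{A} \in \mathbb{R}^{m \times n_1}$, $\mathbf{B} \in \mathbb{R}^{m \times n_2}$, and $\mathbf{T}$ is an arbitrary permutation matrix.
2. The columns of $\mathbf{A}$ lie in an $r$-dimensional subspace $\mathcal{U}$, the column space of $\mathbf{L}$. The columns of $\mathbf{B}$ do not lie entirely in $\mathcal{U}$, i.e., the columns of $\mathbf{A}$ are the inliers and the columns of $\mathbf{B}$ are the outliers.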

We consider two types of column-wise outliers. The first type consists of data points which do not follow a low-dimensional structure. In addition, a small number of these points are not linearly dependent. We refer to this type as ‘unstructured outliers’. Unstructured outliers are typically modeled as data points scattered uniformly at random in the ambient space [30, 31, 32]. Such outliers are generally distinguishable even if they dominate the data [30, 32]. A conceivable scenario for unstructured outliers is when a set of the data points are intolerably noisy or highly corrupted. The second type, which we refer to as ‘structured outliers’, concerns data points which are linearly dependent or form a cluster. In other words, structured outliers exist in small numbers relative to the size of the data but form some low-dimensional structure different from that of most of the data points. Structured outliers are often associated with rare patterns or events of interest, such as important regions in an image [33], malignant tissues [34], or web attacks [35].

The column-wise model for robust PCA has direct bearing on a host of applications in signal processing and machine learning, which spurred enormous progress in dealing with subspace recovery in the presence of outliers. This paper is motivated by some of the limitations of existing techniques, which we further detail in Section II on related work. The vast majority of existing approaches to robust PCA have high computational complexity, which makes them unsuitable in high-dimensional settings. For instance, many of the existing iterative techniques incur a long run time as they require a large number of iterations, each with a Singular Value Decomposition (SVD) operation. Also, most iterative solvers have no provable guarantees for exact subspace recovery. Moreover, some of the existing methods rest upon restrictive definitions of outliers. For instance, [30, 31, 32] can only detect unstructured randomly distributed outliers and [26] requires $\mathbf{C}$ to be column sparse. In this paper, we present a new provable non-iterative robust PCA algorithm, dubbed Coherence Pursuit (CoP), which involves one simple matrix multiplication, and thereby achieves remarkable speedups over the state-of-the-art algorithms. CoP does not presume a restrictive model for outliers and provably detects both structured and unstructured outliers. In addition, it can tolerate a large number of unstructured outliers – even if the ratio of inliers to outliers approaches zero.

### I-B Notation and definitions

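Bold-face upper-case and lower-case letters are used to denote matrices and vectors, respectively. Given a matrix $\mathbf{A}$, $\|\mathbf{A}\|$ denotes its spectral norm, $\|\mathbf{A}\|_*$ its nuclear norm, and $\operatorname{col}(\mathbf{A})$ its column space. For a vector $\mathbf{a}$, $\|\mathbf{a}\|_p$ denotes its $\ell_p$-norm and $\mathbf{a}(i)$ its $i$-th element. Given two matrices $\mathbf{A}_1$ and $\mathbf{A}_2$ with an equal number of rows, the matrix

$$\mathbf{A}_3 = [\mathbf{A}_1 \ \ \mathbf{A}_2]$$

is the matrix formed by concatenating their columns. For a matrix $\mathbf{A}$, $\mathbf{a}_i$ denotes its $i$-th column, and $\mathbf{A}_{-i}$ is equal to $\mathbf{A}$ with the $i$-th column removed. Given matrix $\mathbf{A}$, $\|\mathbf{A}\|_{1,2} = \sum_i \|\mathbf{a}_i\|_2$. The function orth returns an orthonormal basis for the range of its matrix argument. In addition, $\mathbb{S}^{m-1}$ denotes the unit $\ell_2$-norm sphere in $\mathbb{R}^m$.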

###### Definition 1.

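The coherence value corresponding to the $i$-th data point with parameter $p$ is defined as

$$\mathbf{p}(i) = \sum_{\substack{k=1 \\ k \neq i}}^{n} |\mathbf{x}_i^T \mathbf{x}_k|^p,$$

where $\mathbf{x}_j = \mathbf{d}_j / \|\mathbf{d}_j\|_2$ for $1 \le j \le n$. The vector $\mathbf{p}$ contains the coherence values for all the data points.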
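Definition 1 amounts to one Gram-matrix product on the column-normalized data followed by a column-wise sum. A minimal NumPy sketch is given below; the function name `coherence_values` is an illustrative choice, and the sketch assumes that no column of the data matrix is identically zero.

```python
import numpy as np

def coherence_values(D, p=1):
    """Coherence value of every column of D, following Definition 1."""
    # Normalize each column to unit l2-norm (assumes no all-zero column).
    X = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = X.T @ X                  # Gram matrix of the normalized data
    np.fill_diagonal(G, 0.0)     # exclude the k = i term of the sum
    return np.sum(np.abs(G) ** p, axis=0)

# Tiny worked example: the first two columns are parallel, the third is
# orthogonal to both, so the coherence values are 1, 1 and 0.
D = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
p1 = coherence_values(D, p=1)
```

Because the Gram matrix is symmetric, summing over rows or over columns gives the same vector.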

## II Related Work

Some of the earliest approaches to robust PCA relied on robust estimation of the data covariance matrix, such as S-estimators, the minimum covariance determinant, the minimum volume ellipsoid, and the Stahel-Donoho estimator [36]. This is a class of iterative approaches that compute a full SVD or eigenvalue decomposition in each iteration and generally have no explicit performance guarantees. The performance of these approaches greatly degrades when

.

To enhance robustness to outliers, another approach is to replace the Frobenius norm in (1) with other norms [37]. For example, [38] uses an $\ell_1$-norm relaxation commonly used for sparse vector estimation, yielding robustness to outliers [39, 40, 22]. However, the approach presented in [38] has no provable guarantees and requires $\mathbf{C}$ to be column sparse, i.e., a very small portion of the columns of $\mathbf{C}$ can be non-zero. The work in [41] replaces the $\ell_1$-norm in [38] with the $\ell_{1,2}$-norm. While the algorithm in [41] can handle a large number of outliers, the complexity of each iteration is and its iterative solver has no performance guarantees. Recently, the idea of using a robust norm was revisited in [42, 43]. Therein, the non-convex constraint set is relaxed to a larger convex set and exact subspace recovery is guaranteed under certain conditions. The algorithm presented in [42] obtains and [43] finds its complement. However, the iterative solver of [42] computes a full SVD of an weighted covariance matrix in each iteration. Thus, the overall complexity of the solver of [42] is roughly per iteration, where the second term is the complexity of computing the weighted covariance matrix. Similarly, the solver of [43] has complexity per iteration. In [31], the complement of the column space of $\mathbf{L}$ is recovered via a series of linear optimization problems, each obtaining one direction in the complement space. This method is sensitive to structured outliers, particularly linearly dependent outliers, and requires the inliers not to exhibit a clustering structure, even though such structure is prevalent in much real-world data. Also, the approach presented in [31] requires solving linear optimization problems consecutively, resulting in high computational complexity and long run time for high-dimensional data.

Robust PCA using convex rank minimization was first analyzed in [21, 22] for the element-wise corruption model. In [26], the algorithm analyzed in [21, 22] was extended to the column-wise corruption model where it was shown that the optimal point of

$$\min_{\hat{\mathbf{L}},\, \hat{\mathbf{C}}} \ \|\hat{\mathbf{L}}\|_* + \lambda \|\hat{\mathbf{C}}\|_{1,2} \quad \text{subject to} \quad \hat{\mathbf{L}} + \hat{\mathbf{C}} = \mathbf{D} \tag{4}$$

yields the exact subspace and correctly identifies the outliers provided that $\mathbf{C}$ is sufficiently column-sparse. The solver of (4) requires many iterations, each computing the SVD of an $m \times n$ dimensional matrix. Also, the algorithm can only tolerate a small number of outliers – the ratio should be roughly less than 0.05. Moreover, the algorithm is sensitive to linearly dependent outliers.

A different approach to outlier detection was proposed in [32, 44], where a data point is classified as an outlier if it does not admit a sparse representation in the rest of the data. However, this approach is limited to the randomly distributed unstructured outliers. In addition, the complexity of solving the corresponding optimization problem is

per iteration. In the outlier detection algorithm presented in [30], a data point is identified as an outlier if the maximum value of its mutual coherences with the other data points falls below a predefined threshold. Clearly, this approach places a restrictive assumption on the outlying data points and is unable to detect structured outliers.

### II-A Motivation and summary of contributions

This work is motivated by the limitations of prior work on robust PCA as summarized below.

Complex iterations. Most of the state-of-the-art robust PCA algorithms require a large number of iterations each with high computational complexity. For instance, many of these algorithms require the computation of the SVD of an , or , or matrix in each iteration [45, 26, 42], leading to long run time.

Guarantees. While the optimal points of the optimization problems underlying many of the existing robust subspace recovery techniques yield the exact subspace, there are no such guarantees for their corresponding iterative solvers. Examples include the optimization problems presented in [26, 41]. In addition, most of the existing guarantees are limited to the cases where the outliers are scattered uniformly at random in the ambient space and the inliers are distributed uniformly at random in [32, 31, 30].

Robustness issues. Some of the existing algorithms are tailored to one specific class of outlier models. For example, algorithms based on sparse outlier models utilize sparsity promoting norms, thus can only handle a small number of outliers. On the other hand, algorithms such as [30, 31, 32] can handle a large number of unstructured outliers, albeit they fail to locate structured ones (e.g., linearly dependent or clustered outliers). Spherical PCA (SPCA) is a non-iterative robust PCA algorithm that is also scalable [46]. In this algorithm, all the columns of $\mathbf{D}$ are first projected onto the unit sphere, then the subspace is identified as the span of the principal directions of the normalized data. However, in the presence of outliers, the recovered subspace is never equal to the true subspace and it significantly deviates from the underlying subspace when outliers abound.
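The SPCA procedure described above can be sketched in a few lines, which also makes its limitation visible: with outliers present, the recovered subspace is close to, but not equal to, the true one. The sketch below is an illustration only; the function names, the error measure, and the problem sizes are assumptions rather than details taken from [46].

```python
import numpy as np

def spherical_pca(D, r):
    """Project columns onto the unit sphere, then take the top-r directions."""
    X = D / np.linalg.norm(D, axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :r]

def subspace_error(U_true, U_hat):
    """Frobenius norm of the part of U_hat outside span(U_true)."""
    return np.linalg.norm(U_hat - U_true @ (U_true.T @ U_hat), "fro")

rng = np.random.default_rng(0)
m, r, n1, n2 = 30, 3, 100, 100                       # illustrative sizes
U_true, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U_true @ rng.standard_normal((r, n1))            # inliers in span(U_true)
B = rng.standard_normal((m, n2))                     # unstructured outliers

err_clean = subspace_error(U_true, spherical_pca(A, r))
err_mixed = subspace_error(U_true, spherical_pca(np.hstack([A, B]), r))
```

With inliers only, the error is at the level of floating-point round-off; once outliers are mixed in, it becomes clearly non-zero.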

To the best of our knowledge, CoP is the first algorithm that addresses these concerns all at once. In the proposed method, we distinguish outliers from inliers by comparing their degree of coherence with the rest of the data. The advantages of the proposed algorithm are summarized below.

• CoP is a remarkably simple non-iterative algorithm which roughly involves one matrix multiplication to compute the Gram matrix.

• CoP can tolerate a large number of unstructured outliers. It is shown that exact subspace recovery is guaranteed with high probability even if $n_1/n_2$ goes to zero provided that is sufficiently large.

• CoP is robust to both structured and unstructured outliers with provable performance guarantees for both types of outlying data points.

• CoP is notably and provably robust to the presence of additive noise.

## III Proposed Method

In this section, we present the Coherence Pursuit algorithm and provide some insight into its characteristics. The main theoretical results are provided in Section IV. Algorithm 1 presents CoP along with the definitions of the used symbols.

Coherence: The inliers lie in a low-dimensional subspace $\mathcal{U}$. In addition, the inliers are mostly correlated and form clusters. Thus, an inlier bears strong resemblance to many other inliers. By contrast, an outlier is by definition dissimilar to most of the other data points. As such, CoP uses the coherence value in Definition 1 to measure the degree of similarity between data points. Then, $\mathcal{U}$ is obtained as the span of those columns that have large coherence values.

For instance, assume that the distributions of the inliers and outliers satisfy the following assumption.

###### Assumption 1.

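The subspace $\mathcal{U}$ is a random $r$-dimensional subspace in $\mathbb{R}^m$. The columns of $\mathbf{A}$ are drawn uniformly at random from the intersection of $\mathbb{S}^{m-1}$ and $\mathcal{U}$. The columns of $\mathbf{B}$ are drawn uniformly at random from $\mathbb{S}^{m-1}$. To simplify the exposition and notation, it is assumed without loss of generality that $\mathbf{T}$ in (3) is the identity matrix, i.e., $\mathbf{D} = [\mathbf{A} \ \ \mathbf{B}]$.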
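Synthetic data following Assumption 1 can be generated as sketched below. The helper names and the example sizes are illustrative assumptions; the construction relies on two standard facts, namely that the column space of a Gaussian matrix is a uniformly random subspace and that a normalized Gaussian vector is uniform on the unit sphere.

```python
import numpy as np

def sample_sphere(dim, n, rng):
    """n points drawn uniformly at random from the unit sphere in R^dim."""
    Z = rng.standard_normal((dim, n))
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def generate_assumption1(m, r, n1, n2, rng):
    """Return D = [A B] and an orthonormal basis U of the inlier subspace."""
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # random r-dim subspace
    A = U @ sample_sphere(r, n1, rng)   # inliers: uniform on subspace ∩ sphere
    B = sample_sphere(m, n2, rng)       # outliers: uniform on the sphere
    return np.hstack([A, B]), U

rng = np.random.default_rng(0)
D, U = generate_assumption1(m=40, r=4, n1=30, n2=20, rng=rng)
```

Since `U` has orthonormal columns, multiplying by it preserves norms, so the inliers stay on the unit sphere.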

Suppose the column is an inlier and the column is an outlier. In the appendix, it is shown that under Assumption 1, , while , for where denotes the expectation. Accordingly, if , the inliers have much larger coherence values. In the following, we demonstrate the important features of CoP, then present the theoretical results.
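The scale of these expectations can be sanity-checked for $p = 2$ using one standard fact: if $\mathbf{x}$ is uniform on the unit sphere of a $d$-dimensional space and $\mathbf{y}$ is a fixed unit vector in that space, then $\mathbb{E}\,(\mathbf{x}^T\mathbf{y})^2 = 1/d$. The following back-of-the-envelope calculation is an illustration under Assumption 1, not the derivation given in the appendix; $\mathbf{a}_i$ and $\mathbf{b}_j$ denote an inlier and an outlier column, respectively.

$$
\begin{aligned}
\mathbb{E}\,(\mathbf{a}_i^T \mathbf{a}_k)^2 &= \frac{1}{r} \ \ (k \neq i), \qquad
\mathbb{E}\,(\mathbf{a}_i^T \mathbf{b}_j)^2 = \mathbb{E}\,(\mathbf{b}_j^T \mathbf{b}_l)^2 = \frac{1}{m} \ \ (l \neq j), \\
\mathbb{E}\,[\text{coherence value of an inlier}] &= \frac{n_1 - 1}{r} + \frac{n_2}{m}, \\
\mathbb{E}\,[\text{coherence value of an outlier}] &= \frac{n_1 + n_2 - 1}{m}.
\end{aligned}
$$

When $r \ll m$, the inlier value is dominated by $n_1/r$, so inliers stand out whenever $n_1/r$ is large compared with $(n_1 + n_2)/m$.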

### III-A Large number of unstructured outliers

Unlike some of the robust PCA algorithms which require to be much smaller than , CoP tolerates a large number of unstructured outliers. For instance, consider a setting in which , , , and the distributions of inliers and outliers follow Assumption 1. Fig. 1 shows the vector for different values of and . In all the plots, the maximum element is scaled to 1. One can observe that even if , CoP can recover the exact subspace since there is a clear gap between the values of corresponding to outliers and inliers.

### III-B Robustness to noise

In the presence of additive noise, we model the data as

$$\mathbf{D} = [\mathbf{A} \ \ \mathbf{B}]\,\mathbf{T} + \mathbf{E}, \tag{5}$$

where $\mathbf{E}$ represents the noise component.

The strong resemblance between the inliers (columns of ) unlike the outliers (columns of ) creates a large gap between their corresponding coherence values as evident in Fig. 1 even when . This large gap affords tolerance to high levels of noise. For example, assume , , and the distributions of the inliers and outliers follow Assumption 1. Define the parameter as

$$\tau = \frac{\mathbb{E}\|\mathbf{e}\|_2}{\mathbb{E}\|\mathbf{a}\|_2}, \tag{6}$$

where $\mathbf{e}$ and $\mathbf{a}$ are arbitrary columns of $\mathbf{E}$ and $\mathbf{A}$, respectively. Fig. 2 shows the entries of $\mathbf{p}$ for different values of $\tau$. As shown, the elements corresponding to inliers are clearly separated from the ones corresponding to outliers even at very low signal to noise ratio, e.g. and .
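A noisy instance of model (5) with a prescribed value of $\tau$ can be simulated as sketched below. Two choices here are assumptions made for illustration: the noise is Gaussian, and the expectations in (6) are replaced by empirical means over the columns. The sizes and the value $\tau = 0.5$ are likewise illustrative, not the settings used for Fig. 2.

```python
import numpy as np

def unit_columns(Z):
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def coherence_values(D, p=2):
    X = unit_columns(D)
    G = X.T @ X
    np.fill_diagonal(G, 0.0)
    return np.sum(np.abs(G) ** p, axis=0)

rng = np.random.default_rng(1)
m, r, n1, n2, tau = 50, 4, 100, 50, 0.5              # illustrative values
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U @ unit_columns(rng.standard_normal((r, n1)))   # inliers
B = unit_columns(rng.standard_normal((m, n2)))       # unstructured outliers

# Gaussian noise, rescaled so that the empirical version of (6) equals tau.
E = rng.standard_normal((m, n1 + n2))
E *= tau * np.linalg.norm(A, axis=0).mean() / np.linalg.norm(E, axis=0).mean()
empirical_tau = np.linalg.norm(E, axis=0).mean() / np.linalg.norm(A, axis=0).mean()

D = np.hstack([A, B]) + E
p = coherence_values(D, p=2)
inlier_mean, outlier_mean = p[:n1].mean(), p[n1:].mean()
```

In this setting the average coherence value of the noisy inliers remains well above that of the outliers.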

### III-C Structured outlying columns

At a fundamental level, CoP affords a global view of an outlying column, namely, a data column is identified as an outlier if it has weak total coherence with the rest of the data. This global view of a data point with respect to the rest of the data allows the algorithm to identify outliers that bear resemblance to few other outliers. Therefore, unlike some of the more recent robust PCA algorithms [30, 31, 32] which are restricted to unstructured randomly distributed outliers, CoP can detect both structured and unstructured outliers. For instance, suppose the columns of admit the following clustering structure.

###### Assumption 2.

The outlier is formed as . The vectors and are drawn uniformly at random from .

Under Assumption 2, the columns of are clustered around . As decreases, the outliers get closer to each other. Suppose contains such outliers. Fig. 3 shows the elements of for different values of . When , the outliers are tightly concentrated around , i.e., are very similar to each other, but even then CoP can clearly distinguish the outliers.
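The behaviour described above can be reproduced with a small simulation. The exact formula of Assumption 2 is not reproduced here; the sketch assumes that each outlier has the form $(\mathbf{q} + \mu \mathbf{v}_j)/\sqrt{1 + \mu^2}$ with $\mathbf{q}$ and $\mathbf{v}_j$ uniform on the unit sphere, which matches the clustering behaviour and the unit expected squared norm mentioned in the text. The symbol names, sizes, and the value of $\mu$ are illustrative assumptions.

```python
import numpy as np

def unit_columns(Z):
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def coherence_values(D, p=2):
    X = unit_columns(D)
    G = X.T @ X
    np.fill_diagonal(G, 0.0)
    return np.sum(np.abs(G) ** p, axis=0)

rng = np.random.default_rng(2)
m, r, n1, n2, mu = 100, 5, 200, 10, 0.1              # illustrative values
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U @ unit_columns(rng.standard_normal((r, n1)))   # inliers

# Assumed form of the clustered outliers: tightly concentrated around q.
q = unit_columns(rng.standard_normal((m, 1)))
V = unit_columns(rng.standard_normal((m, n2)))
B = (q + mu * V) / np.sqrt(1.0 + mu**2)

p = coherence_values(np.hstack([A, B]), p=2)
min_inlier, max_outlier = p[:n1].min(), p[n1:].max()
```

Each outlier is strongly coherent with the few other outliers only, so its total coherence stays below that of every inlier in this example.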

### III-D Subspace identification

In the third step of Algorithm 1, we sample the columns of with the largest coherence values which span an -dimensional space. In this section, we present several options for efficient implementation of this step. One way is to start sampling the columns with the highest coherence values and stop when the rank of the sampled columns is equal to . However, if the columns of admit a clustering structure and their distribution is highly non-uniform, this method will sample many redundant columns, which can in turn increase the run time of the algorithm. Hence, we propose two low-complexity techniques to accelerate the subspace identification step.

1. In many applications, we may have an upper bound on . For instance, suppose we know that up to 40 percent of the data could be outliers. In this case, we simply remove 40 percent of the columns corresponding to the smallest values of and obtain the subspace using the remaining data points.

2. The second technique is an adaptive sampling method presented in Algorithm 2. First, the data is projected onto a random -dimensional subspace to reduce the computational complexity for some integer . According to the analysis presented in [29, 33], even is sufficient to preserve the rank of and the structure of the outliers , i.e., the rank of is equal to and the columns of do not lie in , where is the projection matrix. The parameter that thresholds the -norms of the columns of the projected data is chosen based on the noise level (if the data is noise free, ). In Algorithm 2, the data is projected onto the span of the sampled columns (step 2.3). Thus, a newly sampled column brings innovation relative to the previously sampled ones. Therefore, redundant columns are not sampled.
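The first of the two techniques above can be sketched as follows: compute the coherence values, discard the assumed fraction of columns with the smallest values, and take the span of the remaining columns. This is a minimal illustration, not a transcription of Algorithm 1 or Algorithm 2; the function names, the use of an SVD to extract the basis, and the problem sizes are assumptions.

```python
import numpy as np

def unit_columns(Z):
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def coherence_values(D, p=1):
    X = unit_columns(D)
    G = X.T @ X
    np.fill_diagonal(G, 0.0)
    return np.sum(np.abs(G) ** p, axis=0)

def cop_subspace(D, r, outlier_fraction, p=1):
    """Drop the least coherent columns, then span the rest with r directions."""
    n = D.shape[1]
    n_drop = int(np.ceil(outlier_fraction * n))
    scores = coherence_values(D, p)
    keep = np.argsort(scores)[n_drop:]    # discard the n_drop smallest scores
    U_hat, _, _ = np.linalg.svd(D[:, keep], full_matrices=False)
    return U_hat[:, :r], keep

rng = np.random.default_rng(3)
m, r, n1, n2 = 100, 5, 100, 50                       # illustrative sizes
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U @ unit_columns(rng.standard_normal((r, n1)))   # inliers (columns 0..n1-1)
B = unit_columns(rng.standard_normal((m, n2)))       # outliers
D = np.hstack([A, B])

# Assume up to 40 percent of the columns may be outliers.
U_hat, keep = cop_subspace(D, r, outlier_fraction=0.4)
recovery_error = np.linalg.norm(U_hat - U @ (U.T @ U_hat), "fro")
```

Removing 40 percent of the columns also discards a few inliers here, which is harmless as long as the remaining inliers still span the subspace.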

###### Remark 1.

Suppose we run Algorithm 2   times – each time the sampled columns are removed from the data and newly sampled columns are added to . If the given data is noisy, the first singular values of are the dominant ones and the rest correspond to the noise component. If we increase , the span of the dominant singular vectors will be closer to . However, if is chosen unreasonably large, the sampler may also sample outliers.

### III-E Computational complexity

The main computational complexity is in the second step of Algorithm 1 which is of order . If we utilize Algorithm 2 as the third step of Algorithm 1, the overall complexity is of order . However, unlike most existing algorithms, CoP does not require solving an optimization problem and roughly involves only one matrix multiplication. Therefore, it is very fast and simple for hardware implementation (cf. Section V-B on run time). Moreover, the overall complexity can be reduced to if we utilize the randomized sketching designs presented in [29, 33].

## IV Theoretical Investigations

The theoretical results are presented in the next 4 subsections and their proofs are provided in Sections VI and VII. First, we show that CoP can recover the true subspace even if the data is predominantly unstructured outliers. Second, we show that CoP can accurately distinguish structured outliers provided their population size is small enough. Third, we extend the robustness analysis to noisy settings. Fourth, we show that the more coherent the inliers are, the better CoP is at distinguishing them from the outliers.

In the theoretical studies corresponding to the unstructured outliers, the performance guarantees are provided for both $p = 1$ and $p = 2$. In the rest of the studies, the results are only presented for $p = 1$. For each case, we present two guarantees. First, we establish sufficient conditions to ensure that the expected values of the elements of the vector $\mathbf{p}$ corresponding to inliers are much greater than the ones corresponding to outliers, in which case the algorithm is highly likely to yield exact subspace recovery. Second, we present theoretical results which guarantee exact subspace recovery with high probability.

### IV-A Subspace recovery with dominant unstructured outliers

Here, we focus on the unstructured outliers, i.e., it is assumed that the distribution of the outlying columns follows Assumption 1. The following lemmas establish sufficient conditions for the expected values of the elements of $\mathbf{p}$ corresponding to inliers to be at least twice as large as those corresponding to outliers.

###### Lemma 1.

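Suppose Assumption 1 holds, the $i$-th column is an inlier and the $(n_1 + j)$-th column is an outlier. If

$$\frac{n_1}{\sqrt{r}}\left(\sqrt{\frac{2}{\pi}} - \sqrt{\frac{4r^2}{m}}\right) > \frac{5 n_2}{4\sqrt{m}} + \sqrt{\frac{2}{\pi r}}, \tag{7}$$

then

$$\mathbb{E}\|\mathbf{g}_i\|_1 > 2\,\mathbb{E}\|\mathbf{g}_{n_1+j}\|_1,$$

recalling that $\mathbf{g}_i$ is the $i$-th column of the Gram matrix $\mathbf{G}$.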

###### Lemma 2.

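Suppose Assumption 1 holds, the $i$-th column is an inlier and the $(n_1 + j)$-th column is an outlier. If

$$\frac{n_1}{r}\left(1 - \frac{2r^2}{m}\right) > \frac{n_2}{m} + \frac{1}{r}, \tag{8}$$

then

$$\mathbb{E}\|\mathbf{g}_i\|_2^2 > 2\,\mathbb{E}\|\mathbf{g}_{n_1+j}\|_2^2.$$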

The sufficient conditions provided in Lemma 1 and Lemma 2 reveal three important points.

I) The ratios $n_1/r$ and $n_2/m$ are key performance factors. The intuition is that as $n_1/r$ increases, the density of the inliers in the subspace increases, and consequently their mutual coherence also increases. Similarly, the mutual coherence between the outliers is proportional to $n_2/m$. Thus, the main requirement is that $n_1/r$ should be sufficiently larger than $n_2/m$.

II) In real applications, and , hence the sufficient conditions are easily satisfied. This fact is evident in Fig. 1, which shows that CoP can recover the correct subspace even if .

III) In high-dimensional settings, . Therefore, could be much greater than . Accordingly, the conditions in Lemma 1 are stronger than those in Lemma 2, suggesting that CoP can tolerate a larger number of unstructured outliers with than with . This is confirmed by comparing the plots in the last row of Fig. 1.
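The comparison in point III can be made concrete by evaluating both sufficient conditions numerically. The sketch below assumes the following readings of the two inequalities, reconstructed from the garbled source: (7) as $\frac{n_1}{\sqrt r}\big(\sqrt{2/\pi} - \sqrt{4r^2/m}\big) > \frac{5 n_2}{4\sqrt m} + \sqrt{2/(\pi r)}$ and (8) as $\frac{n_1}{r}\big(1 - 2r^2/m\big) > \frac{n_2}{m} + \frac{1}{r}$. The function names and the example values are illustrative.

```python
import math

def lemma1_condition(n1, n2, r, m):
    """Sufficient condition (7) for p = 1, as read above (an assumption)."""
    lhs = n1 / math.sqrt(r) * (math.sqrt(2 / math.pi) - math.sqrt(4 * r**2 / m))
    rhs = 5 * n2 / (4 * math.sqrt(m)) + math.sqrt(2 / (math.pi * r))
    return lhs > rhs

def lemma2_condition(n1, n2, r, m):
    """Sufficient condition (8) for p = 2, as read above (an assumption)."""
    lhs = n1 / r * (1 - 2 * r**2 / m)
    rhs = n2 / m + 1 / r
    return lhs > rhs

# Many more outliers than inliers: 50 inliers against 5000 outliers.
many_outliers = dict(n1=50, n2=5000, r=5, m=1000)
p1_ok = lemma1_condition(**many_outliers)
p2_ok = lemma2_condition(**many_outliers)
```

For these values the $p = 2$ condition holds while the $p = 1$ condition does not, in line with the observation that the requirement of Lemma 1 is the more restrictive of the two.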

The following theorems show that the same set of factors are important to guarantee that CoP recovers the exact subspace with high probability.

###### Theorem 3.

If Assumption 1 is true and

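$$\frac{n_1}{\sqrt{r}}\left(\sqrt{\frac{2}{\pi}} - \frac{r + 2\sqrt{\beta\kappa r}}{\sqrt{m}}\right) - 2\sqrt{n_1} - \sqrt{\frac{2 n_1 \log\frac{n_1}{\delta}}{r-1}} > \frac{n_2}{\sqrt{m}} + 2\sqrt{n_2} + \sqrt{\frac{2 n_2 \log\frac{n_2}{\delta}}{m-1}} + \frac{1}{\sqrt{r}}, \tag{9}$$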

then Algorithm 1 with recovers the exact subspace with probability at least , where and .

###### Theorem 4.

If Assumption 1 is true and

$$n_1\left(\frac{1}{r} - \frac{r + 4\zeta\kappa + 4\sqrt{\zeta r \kappa}}{m}\right) - \eta_1 > 2\eta_2 + \frac{1}{r}, \tag{10}$$

then Algorithm 1 with recovers the correct subspace with probability at least , where

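$$\eta_1 = \max\left(\frac{4}{3}\log\frac{2 r n_1}{\delta},\ \sqrt{4\,\frac{n_1}{r}\log\frac{2 r n_1}{\delta}}\right),$$

$$\eta_2 = \max\left(\frac{4}{3}\log\frac{2 m n_2}{\delta},\ \sqrt{4\,\frac{n_2}{m}\log\frac{2 m n_2}{\delta}}\right),$$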

, and .

###### Remark 2.

The dominant factors of the LHS and the RHS of (10) are and , respectively. As in Lemma 2, we see the factor , but under a square root. Thus, the requirement of Theorem 4 is less stringent than that of Lemma 2. This is because Theorem 4 guarantees that the elements of corresponding to inliers are greater than those corresponding to outliers with high probability, but does not guarantee a large gap between their values as in Lemma 2.

### IV-B Distinguishing structured outliers

In this section, we focus on structured outliers whose distribution is assumed to follow Assumption 2. Under Assumption 2, each column has a unit expected squared norm, which affords a more tractable analysis versus normalizing the data. The columns of are clustered around , and get closer to each other as decreases. The following lemma establishes that if is sufficiently small, the expected coherence value for an inlier is at least twice that of an outlier.

###### Lemma 5.

Suppose the distribution of the outliers follows Assumption 2 and the inliers are distributed as in Assumption 1. Define and set the diagonal elements of equal to zero. Assume the column is an inlier, the column is an outlier, and . If

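$$(n_1 - 1)\sqrt{\frac{2}{\pi r}} > \frac{2 n_2}{1+\mu^2} + \frac{1}{\sqrt{m}}\left(\frac{2\mu^2 n_2 + 4\mu n_2 + 2 n_1 \sqrt{r(1+\mu^2)}\,(\mu+1)}{1+\mu^2}\right), \tag{11}$$

then

$$\mathbb{E}\|\mathbf{g}'_i\|_1 > 2\,\mathbb{E}\|\mathbf{g}'_{n_1+j}\|_1.$$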

The sufficient condition (11) is consistent with our intuition regarding the detection of structured outliers. According to (11), if $n_2$ is smaller or $\mu$ is larger (i.e., the outliers are less structured), the outliers will be more distinguishable. The following theorem reaffirms the requirements of Lemma 5. Before we state the theorem, we define , where and is the incomplete beta function [47]. The function is monotonically decreasing. Examples are shown in Fig. 4, which displays for different values of . The function decays nearly exponentially fast with and converges for large values of to the function shown in yellow with circle markers in Fig. 4 where the plots for and coincide.

###### Theorem 6.

Suppose the distribution of the outliers follows Assumption 2 and the inliers are as in Assumption 1. Define and set the diagonal elements of equal to zero. Assume the column is an inlier, the column is an outlier, and . If

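$$\sqrt{\frac{2}{\pi}}\,\frac{n_1 - 1}{\sqrt{r}} - 2\sqrt{n_1} - \sqrt{\frac{2 n_1 \log\frac{n_1}{\delta}}{r}} > \frac{n_2}{1+\mu^2} + \frac{\mu^2+\mu}{1+\mu^2}\left(\frac{n_2}{\sqrt{m}} + 2\sqrt{n_2} + \sqrt{\frac{2 n_2 \log\frac{n_2}{\delta}}{m-1}}\right) + \frac{\mu\, n_2 \sqrt{t_\delta}}{(1+\mu^2)\sqrt{m}} + \frac{n_1(\mu+1)}{\sqrt{(1+\mu^2)\,m}}\left(\sqrt{r} + 2\sqrt{\beta\kappa}\right), \tag{12}$$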

then for all and with probability at least , where and .

Theorem 6 certifies the requirements of Lemma 5. According to (12), if the outliers are structured, the number of inliers should be sufficiently larger than the number of outliers.

### IV-C Performance analysis with noisy data

CoP is notably robust to noise since the noise is neither coherent with the inliers nor the outliers. In this section, we establish performance guarantees for noisy data. It is assumed that the given data satisfies the following assumption.

###### Assumption 3.

Matrices $\mathbf{A}$ and $\mathbf{B}$ follow Assumption 1. The columns of are drawn uniformly at random from . Each column of matrix is defined as , where are i.i.d. samples from a normal distribution and and are the columns of and , respectively. The given data matrix can be expressed as

According to Assumption 3, each inlier is a sum of a random unit -norm vector in the subspace and a random vector which models the noise. Per Assumption 3, each data column has an expected squared norm equal to 1.

###### Lemma 7.

Suppose $\mathbf{D}$ follows Assumption 3. Define , set the diagonal elements of equal to zero, and define , where is the -th column of . In addition, assume the $i$-th column is an inlier and the $(n_1 + j)$-th column is an outlier. If

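$$\frac{n_1}{\sqrt{r}}\left(\sqrt{\frac{2}{\pi(1+\sigma_n^2)}} - \sqrt{\frac{4r^2}{m}}\right) > \frac{n_2\sqrt{1+\sigma_n^2}}{\sqrt{m}} + \sqrt{\frac{2}{\pi r}} + \xi, \tag{13}$$

where

$$\xi = \sqrt{\frac{2\sigma_n^2}{\pi m}}\left(\frac{n_1}{\sqrt{1+\sigma_n^2}}\left(1 + \sigma_n\sqrt{\frac{\pi}{2}} + \sqrt{r}\right) + n_2 + 2 n_1\right), \tag{14}$$

then

$$\mathbb{E}\|\mathbf{g}^e_i\|_1 > 2\,\mathbb{E}\|\mathbf{g}^e_{n_1+j}\|_1.$$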

The sufficient conditions of Lemma 7 are very similar to the conditions presented in Lemma 1 for noise-free data with one main difference, namely, an additional term on the RHS of (13) due to the presence of noise. Nevertheless, akin to the unstructured outliers, the component corresponding to noise is linear in , where is the ambient dimension. In addition, is practically smaller than 1 noting that the signal to noise ratio is . Thus, CoP exhibits robustness even in the presence of a strong noise component. The effect of noise is manifested in the subspace identification step wherein the subspace is recovered as the span of the principal singular vectors of the noisy inliers. If the noise power increases, the distance between the span of the principal singular vectors of the noisy inliers and the column space of the noise-free inliers increases. However, this error is inevitable and we cannot achieve better recovery given the noisy data. The following theorem affirms that the noise component does not have a notable effect on the sufficient conditions for the elements of corresponding to inliers to be greater than those corresponding to outliers with high probability.

###### Theorem 8.

Suppose $\mathbf{D}$ follows Assumption 3. Define , set the diagonal elements of equal to zero, and define . If

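$$\frac{n_1}{\sqrt{r}}\left(\sqrt{\frac{2}{\pi(1+\sigma_n^2)}} - \frac{r + 2\sqrt{\beta r}}{\sqrt{m-1}}\right) - 2\sqrt{\frac{n_1}{1+\sigma_n^2}} - \sqrt{\frac{2 n_1 \log\frac{n_1}{\delta}}{(r-1)(1+\sigma_n^2)}} > \sqrt{1+\sigma_n^2}\left(\frac{n_2}{\sqrt{m}} + 2\sqrt{n_2} + \sqrt{\frac{2 n_2 \log\frac{n_2}{\delta}}{m-1}}\right) + \frac{1}{\sqrt{r}} + \varsigma, \tag{15}$$

where

$$\varsigma = \left(\frac{c\,\sigma_n + c^2\sigma_n^2}{\sqrt{1+\sigma_n^2}} + c\,\sigma_n\right)\left(\frac{n_1}{\sqrt{m}} + 2\sqrt{n_1} + \sqrt{\frac{2 n_1 \log\frac{n_1}{\delta}}{m-1}}\right) + \frac{c\, n_1 \sigma_n}{\sqrt{1+\sigma_n^2}}\left(\sqrt{\frac{r}{m}} + 2\sqrt{\frac{\beta'}{m-1}}\right), \tag{16}$$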

, , and , then for all and with probability at least .

Again, the sufficient condition (15) is very similar to (9) for noise-free data. The main difference is the additional term on the RHS of (15). However, the presence of has no effect on the orders in the sufficient condition in comparison to (9), and is approximately linear in .

### IV-D The distribution of inliers

In the theoretical investigations presented in Section IV-A, Section IV-B, and Section IV-C, we assumed a random distribution for the inliers. However, we emphasize that this is not a requirement of the proposed approach. In fact, the random distribution of the inliers leads to a fairly challenging scenario. In practice, the inliers mostly form clusters and tend to be highly correlated. Since CoP exploits the coherence between the inliers, its ability to distinguish inliers could even improve if their distribution is further away from being uniformly random. We provide a theoretical example to underscore this fact. In this example, we assume that the inliers form a cluster around a given direction in $\mathcal{U}$. The distribution of the inliers is formalized in the following assumption.

###### Assumption 4.

The inlier is formed as . The vectors and are drawn uniformly at random from the intersection of and .

According to Assumption 4, the inliers are clustered around the vector . For example, suppose and . Fig. 5 shows the distribution of the inliers for different values of $\nu$. The data points become more uniformly distributed as $\nu$ increases, and form a cluster when $\nu$ is less than one.

###### Lemma 9.

Suppose the distribution of the inliers follows Assumption 4 and the columns of are drawn uniformly at random from . Define and set its diagonal elements to zero. Assume the column is an inlier, the column is an outlier, and . If

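$$n_1\left(1-\frac{\nu^2+2\nu}{\sqrt{r}}\right)>1+\frac{2n_1(1+\nu)\sqrt{r(1+\nu^2)}}{\sqrt{m}}+\frac{n_2\sqrt{1+\nu^2}}{\sqrt{m}}\left(\nu-\sqrt{\frac{2}{\pi}}+2\sqrt{1+\nu^2}\right), \tag{17}$$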

then

$$\mathbb{E}\|\mathbf{g}'_i\|_1>2\,\mathbb{E}\|\mathbf{g}'_{n_1+j}\|_1.$$

According to (17), if $\nu$ decreases (i.e., the data points are less randomly distributed), it is more likely that CoP recovers the correct subspace. In other words, with CoP, clustered inliers are preferred over randomly distributed inliers.
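The dependence on the clustering parameter can be checked numerically. The following is a minimal sketch that evaluates the margin (LHS minus RHS) of condition (17), reading its flattened fractions as $n_1(1-(\nu^2+2\nu)/\sqrt{r})$ on the left and $1+2n_1(1+\nu)\sqrt{r(1+\nu^2)}/\sqrt{m}+n_2\sqrt{1+\nu^2}(\nu-\sqrt{2/\pi}+2\sqrt{1+\nu^2})/\sqrt{m}$ on the right. The function name `lemma9_margin` and the problem sizes are illustrative assumptions, not values taken from the paper.

```python
import math

def lemma9_margin(n1, n2, r, m, nu):
    """Return LHS - RHS of condition (17); a positive value means (17) holds."""
    root = math.sqrt(1.0 + nu ** 2)
    lhs = n1 * (1.0 - (nu ** 2 + 2.0 * nu) / math.sqrt(r))
    rhs = (
        1.0
        + 2.0 * n1 * (1.0 + nu) * math.sqrt(r) * root / math.sqrt(m)
        + n2 * root / math.sqrt(m) * (nu - math.sqrt(2.0 / math.pi) + 2.0 * root)
    )
    return lhs - rhs

# Hypothetical sizes: the margin shrinks as nu grows (inliers less clustered).
for nu in (0.05, 0.2, 0.5, 1.0):
    print(nu, lemma9_margin(n1=100, n2=100, r=4, m=400, nu=nu))
```

Under this reading, the left-hand side decreases and the right-hand side increases with the clustering parameter, so the margin is largest for tightly clustered inliers.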

## V Numerical Simulations

In this section, the performance of CoP is investigated with both synthetic and real data. We compare the performance of CoP with the state-of-the-art robust PCA algorithms including FMS [25], GMS [43], R1-PCA [41], OP [26], and SPCA [46]. For FMS, we implemented Algorithm 1 in [25] with . We have also tried different values for , which did not yield much difference in the results from what we report in our experiments. For the GMS algorithm, we implemented Algorithm 2 in [43] to obtain the matrix . The output of the algorithm is the last singular vectors of the obtained matrix , which serve as an orthonormal basis for the span of the inliers. For R1-PCA, we implemented the iterative algorithm presented in [41], which iteratively updates an orthonormal basis for the inliers.

### V-A Phase transition

Our analysis with unstructured outliers has shown that CoP yields exact subspace recovery with high probability if is sufficiently greater than . In this experiment, we investigate the phase transition of CoP in the and plane. Suppose, , , and the distributions of inliers/outliers follow Assumption 1. Define and as the exact and recovered orthonormal bases for the span of the inliers, respectively. A trial is considered successful if

$$\frac{\|\mathbf{U}-\hat{\mathbf{U}}\hat{\mathbf{U}}^T\mathbf{U}\|_F}{\|\mathbf{U}\|_F}\le 10^{-5}.$$

In this simulation, we construct the matrix using 20 columns of corresponding to the largest 20 elements of the vector . Fig. 6 shows the phase transition, where white indicates correct subspace recovery and black designates incorrect recovery. As shown, if increases, we need larger values of . However, one can observe that with , the algorithm can yield exact recovery even if .
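A single trial of this experiment can be sketched in NumPy as follows. The sketch makes several assumptions that are not spelled out in this section: the coherence score is taken to be the $\ell_1$ norm of each column of the zero-diagonal Gram matrix, uniform directions are generated by normalizing Gaussian vectors, and the function names (`cop_basis`, `relative_error`, `run_trial`) and problem sizes are hypothetical. It is an illustration of the procedure, not the authors' implementation.

```python
import numpy as np

def cop_basis(D, r, n_cols):
    """Coherence-based subspace estimate (sketch; l1 coherence score assumed)."""
    X = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm columns
    G = X.T @ X                                       # Gram matrix of mutual coherences
    np.fill_diagonal(G, 0.0)                          # discard self-coherence
    scores = np.abs(G).sum(axis=0)                    # l1 norm of each column of G
    top = np.argsort(scores)[-n_cols:]                # indices of the most coherent columns
    U_hat, _, _ = np.linalg.svd(D[:, top], full_matrices=False)
    return U_hat[:, :r]                               # orthonormal basis of their span

def relative_error(U, U_hat):
    """Frobenius-norm residual of U after projection onto span(U_hat)."""
    return np.linalg.norm(U - U_hat @ (U_hat.T @ U)) / np.linalg.norm(U)

def run_trial(m, r, n1, n2, n_cols, rng):
    """Generate random inliers/outliers and report success at the 1e-5 level."""
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # random r-dimensional subspace
    A = U @ rng.standard_normal((r, n1))              # inliers inside the subspace
    B = rng.standard_normal((m, n2))                  # outliers in the ambient space
    D = np.hstack([A, B])
    D /= np.linalg.norm(D, axis=0, keepdims=True)     # place all columns on the unit sphere
    return bool(relative_error(U, cop_basis(D, r, n_cols)) <= 1e-5)

rng = np.random.default_rng(0)
print(run_trial(m=100, r=5, n1=200, n2=100, n_cols=20, rng=rng))
```

Repeating `run_trial` over a grid of inlier and outlier counts and averaging the outcomes would give an empirical phase-transition map of the kind described above.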

### V-B Running time

In this section, we compare the speed of CoP with the existing approaches. Table I shows the run time in seconds for different data sizes. In all experiments, and . One can observe that CoP is remarkably faster by virtue of its simplicity (single step algorithm).

### V-C Subspace recovery in presence of unstructured outliers

In this experiment, we assess the robustness of CoP to outliers in comparison to existing approaches. It is assumed that , , and the distributions of inliers/outliers follow Assumption 1. Define and as before, and the recovery error as

$$\text{Log-Recovery Error}=\log_{10}\left(\frac{\|\mathbf{U}-\hat{\mathbf{U}}\hat{\mathbf{U}}^T\mathbf{U}\|_F}{\|\mathbf{U}\|_F}\right).$$

In this simulation, we use 30 columns to form the matrix . Fig. 7 shows the recovery error versus for different values of . In addition to its simplicity, CoP yields exact subspace recovery even if the data is overwhelmingly outliers. Similar to CoP and FMS, the algorithms presented in [30, 32] can also yield exact subspace recovery in presence of unstructured outliers even if they dominate the data. However, they are not applicable to the next experiments that deal with structured outliers. For instance, the outlier detection method presented in [30] assumes the order of the inner product between any two outliers is . Therefore, it is unable to identify structured outliers in high-dimensional data.

### V-D Detection of structured outliers

In this section, we examine the ability of CoP to detect structured outliers in four experiments. In the first experiment, a robust PCA algorithm is used to identify the saliency map [48] of a given image. For the second experiment, an outlier detection algorithm is used to detect the frames corresponding to an activity in a video file. In the third, we examine the performance of the robust PCA algorithms with synthetic structured outliers. For the fourth experiment, we consider the problem of identifying the dominant low-dimensional subspace with real world data.

Example D.1 (Saliency map identification): A saliency map indicates the regions of an image that tend to attract the attention of a human viewer [48, 33]. If we divide the image into small patches and construct a data matrix from the vectorized versions of the patches, the salient regions can be viewed as outlying columns [33, 49]. Hence, if we are able to detect the outlying columns, we will identify the salient regions from the corresponding patches. However, the different patches in the salient regions could be similar to each other. Therefore, the outlying data points are normally structured outliers. In this experiment, we obtained the images shown in the first column of Fig. 8 from the MSRA Salient Object Database [50]. The patches are non-overlapping pixel windows. Fig. 8 shows the saliency maps obtained by CoP and FMS. In both methods, the parameter (the rank of ) is set equal to 2. As shown, both CoP and FMS properly identify the visually salient regions of the images since the two methods are robust to both structured and unstructured outliers.

Example D.2 (Activity detection): In many applications, an anomaly/outlier corresponds to the occurrence of some important rare event. In this experiment, we use the robust PCA method to detect activity in a video file. The file we use here is the Waving Tree file, a video of a dynamic background [51, 52] showing a tree smoothly waving, and in the middle of the video a person crosses the frame. We expect the algorithm to detect those few frames where the person is present as outliers. We construct the data matrix from the vectorized video frames, i.e., each column corresponds to a specific frame.

The frames which show the background are inliers. Since the tree is waving, the rank of is greater than one. We set the parameter in this experiment. The outliers correspond to the few frames in which the person crosses the scene. Obviously, in this application the outliers are structured because the consecutive frames are quite similar to each other. Thus, algorithms such as [30, 32], which model the outliers as randomly distributed vectors, are not applicable here to detect the outliers. We use CoP, FMS, and R1-PCA to detect the outlying frames. Define as the obtained orthonormal basis for the inliers. We identify as an outlier if . CoP and FMS identify all the outlying frames correctly. Fig. 10 shows some of the frames identified as inliers and outliers. R1-PCA could only detect a subset of the outliers. In the video file, the person enters the scene from one side, stays for a second, and leaves the scene from the other side. R1-PCA detects only those frames in which the person enters or leaves the scene. Fig. 10 shows two outlying frames that R1-PCA could detect and two frames it could not detect.
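Once a basis for the inliers is available, flagging outlying frames reduces to thresholding how much of each column lies outside the recovered subspace. The exact criterion and threshold are not recoverable from the text above, so the sketch below assumes a relative-residual test; the function name `flag_outliers` and the `threshold` parameter are hypothetical, and columns are assumed to be nonzero.

```python
import numpy as np

def flag_outliers(D, U_hat, threshold):
    """Flag columns whose relative residual outside span(U_hat) exceeds threshold.

    Assumed criterion: ||d - U_hat U_hat^T d||_2 / ||d||_2 > threshold.
    """
    residual = D - U_hat @ (U_hat.T @ D)              # component orthogonal to the subspace
    ratio = np.linalg.norm(residual, axis=0) / np.linalg.norm(D, axis=0)
    return ratio > threshold

# Toy example: a 2-D subspace spanned by the first two coordinate axes of R^6.
U_hat = np.eye(6)[:, :2]
D = np.array([
    [1.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
print(flag_outliers(D, U_hat, threshold=0.2))         # last two columns leave the subspace
```

In the video setting, each column of the data matrix would be a vectorized frame, and the flagged columns would correspond to the frames in which the person is present.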

Example D.3 (Synthetic clustered outliers): In this experiment, we use synthetic data to study the performance of CoP in distinguishing structured outliers. The data matrix is generated as , where with follows Assumption 4 with . The matrix follows Assumption 2. Thus, the inliers are clustered and the outliers could be clustered too depending on the value of . Table II shows the subspace recovery error, , for different values of . One can observe that CoP and GMS correctly recover the column space of for all values of . However, for smaller values of , where the outliers become more concentrated, FMS and R1-PCA fail to retrieve the exact subspace.

Example D.4 (Dominant subspace identification): An application of robust PCA is in the problem of subspace clustering [53, 6]. This problem is a general form of PCA in which the data points lie in a union of linear subspaces [53]. A subspace clustering algorithm identifies the subspaces and clusters the data points with respect to the subspaces. A robust PCA algorithm can be applied in two different ways to the subspace clustering problem. The first way is to use the robust PCA method sequentially to learn one subspace in each iteration. In other words, in each iteration the data points in the dominant subspace (the one which contains the maximum number of data points) are considered as inliers and the others as outliers. In each step one subspace is identified and the corresponding data points are removed from the data. RANSAC is a popular subspace clustering method which is based on robust PCA [54, 53]. The second way is to use robust PCA just to identify the most dominant subspace. In many applications, such as motion segmentation, the majority of the data points lie in a data cluster and the rest of the data points – which are of particular interest – form data clusters with smaller populations. Therefore, by identifying the dominant subspace and removing its data points, we can substantially reduce the computational complexity of the subsequent processing algorithms (e.g., the clustering algorithm).

In this experiment, we use the Hopkins155 dataset, which contains video sequences of 2 or 3 motions [55]. The data is generated by extracting and tracking a set of feature points through the frames. In motion segmentation, each motion corresponds to one subspace. Thus, the problem here is to cluster data lying in two or three subspaces [53]. Here, we use 8 data matrices of traffic videos with 2 motions. Since the data lies in a union of 2 subspaces, we can also cluster the data via learning the dominant subspace. The number of data points in the dominant subspace is large and it is important to observe the accuracy of the algorithm at identifying the outliers. Thus, we define the average clustering error as

$$\mathrm{ACE}=0.5\left(\frac{n_{e1}}{n_1}+\frac{n_{e2}}{n_2}\right),$$

where and are the numbers of misclassified inliers and misclassified outliers, respectively. Table III reports the ACE for different algorithms. As shown, CoP yields the most accurate result.
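For concreteness, the ACE can be computed from boolean inlier masks as in the sketch below. It assumes that the denominators are the true numbers of inliers and outliers; the function and argument names are illustrative.

```python
import numpy as np

def average_clustering_error(true_inlier, pred_inlier):
    """ACE = 0.5 * (misclassified inliers / #inliers + misclassified outliers / #outliers)."""
    t = np.asarray(true_inlier, dtype=bool)
    p = np.asarray(pred_inlier, dtype=bool)
    n1, n2 = t.sum(), (~t).sum()                      # true inlier and outlier counts
    ne1 = np.sum(t & ~p)                              # inliers labelled as outliers
    ne2 = np.sum(~t & p)                              # outliers labelled as inliers
    return 0.5 * (ne1 / n1 + ne2 / n2)

# One of four inliers and one of two outliers misclassified: 0.5 * (1/4 + 1/2).
print(average_clustering_error([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))
```

Averaging the two error rates keeps the much smaller outlier population from being swamped by the dominant subspace.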

### V-E Clustering error correction – Real data

In this section, we present a new application of robust PCA in subspace clustering. The performance of the subspace clustering algorithms – especially the ones with scalable computational complexity – degrades in presence of noise or when the subspaces are closer to each other. Without loss of generality, suppose the data , where the columns of lie in the linear subspaces , respectively, and is the number of subspaces. Define as the output of the clustering algorithm (the clustered data). Define the clustering error as the ratio of misclassified points to the total number of data points. With errors in clustering, some of the columns of believed to lie in may actually belong to some other subspace. Such columns can be viewed as outliers in the matrix . Accordingly, the robust PCA algorithm can be utilized to correct the clustering error. We present Algorithm 3 as an error correction algorithm which can be applied to the output of any subspace clustering algorithm to reduce the clustering error. In each iteration, Algorithm 3 applies the robust PCA algorithm to the clustered data to obtain a set of bases for the subspaces. Subsequently, the obtained clustering is updated based on the obtained bases.

In this experiment, we imagine a subspace clustering algorithm with 20 percent clustering error and apply Algorithm 3 to the output of the algorithm to correct the errors. We use the Hopkins155 dataset. Thus, the problem here is to cluster data lying in two or three subspaces [53]. We use the traffic data sequences, which include 8 scenarios with two motions and 8 scenarios with three motions. When CoP is applied, 50 percent of the columns of are used to form the matrix . Fig. 11 shows the average clustering error (over all traffic data matrices) after each iteration of Algorithm 3 for different robust PCA algorithms. CoP clearly outperforms the other approaches. As a matter of fact, most of the robust PCA algorithms fail to obtain the correct subspaces and end up increasing the clustering error. The outliers in this application are linearly dependent and highly correlated. Thus, the approaches presented in [30, 31, 32] which assume a random distribution for the outliers are not applicable.
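The alternation described for Algorithm 3 can be sketched as below. Algorithm 3 itself is not reproduced in this section, so the sketch assumes one plausible update rule: each point is reassigned to the subspace onto which its projection has the largest norm. The names `correct_clustering` and `fit_basis` are hypothetical, clusters are assumed to stay non-empty, and the default `svd_basis` is a plain truncated SVD used only as a stand-in; a robust PCA routine such as CoP would be passed as `fit_basis` in practice.

```python
import numpy as np

def svd_basis(D, r):
    """Stand-in subspace estimator (plain truncated SVD, not robust)."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :r]

def correct_clustering(D, labels, n_clusters, r, fit_basis=svd_basis, n_iter=5):
    """Alternate between fitting one basis per cluster and reassigning points."""
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        # Fit an r-dimensional basis to the points currently assigned to each cluster.
        bases = [fit_basis(D[:, labels == k], r) for k in range(n_clusters)]
        # Norm of the projection of every point onto every fitted subspace.
        proj = np.stack([np.linalg.norm(U.T @ D, axis=0) for U in bases])
        new_labels = np.argmax(proj, axis=0)          # assumed reassignment rule
        if np.array_equal(new_labels, labels):
            break                                     # clustering no longer changes
        labels = new_labels
    return labels

# Toy data: two orthogonal 2-D subspaces in R^6 and 20 percent initial label errors.
theta = np.pi * np.arange(50) / 50
P = np.vstack([np.cos(theta), np.sin(theta)])
D = np.zeros((6, 100))
D[0:2, :50] = P
D[2:4, 50:] = P
truth = np.repeat([0, 1], 50)
noisy = truth.copy()
noisy[::5] = 1 - noisy[::5]
print(np.mean(correct_clustering(D, noisy, n_clusters=2, r=2) != truth))
```

The toy subspaces are orthogonal, so even the non-robust stand-in recovers them; with correlated subspaces, as in the traffic sequences, the quality of the per-cluster estimator determines whether the iteration reduces or increases the error.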

## VI Proofs of the Main Results

Proof of Lemma 1
The column of without its element can be expressed as

$$\left[\mathbf{a}_i^T\mathbf{A}_{-i}\;\;\mathbf{a}_i^T\mathbf{B}\right]^T. \tag{18}$$

Thus,

$$\|\mathbf{g}_i\|_1=\|\mathbf{a}_i^T\mathbf{A}_{-i}\|_1+\|\mathbf{a}_i^T\mathbf{B}\|_1. \tag{19}$$

If , then

$$\mathbb{E}|\mathbf{a}_i^T\mathbf{a}_k|=\mathbb{E}|\mathbf{u}^T\mathbf{a}_k|\ge\sqrt{\frac{2}{\pi r}}, \tag{20}$$

where is a fixed vector in with unit -norm. The last inequality follows from [42]. By Assumption 1, is a random subspace and is a random direction in . Accordingly, the distribution of is the same as the distribution of a vector drawn uniformly at random from . Thus, similar to (20)

$$\mathbb{E}|\mathbf{a}_i^T\mathbf{b}_k|\ge\sqrt{\frac{2}{\pi m}}. \tag{21}$$

Replacing (20) and (21) in (19),

$$\mathbb{E}\|\mathbf{g}_i\|_1\ge(n_1-1)\sqrt{\frac{2}{\pi r}}+n_2\sqrt{\frac{2}{\pi m}}. \tag{22}$$

The column of without its element can be expressed as

$$\left[\mathbf{b}_j^T\mathbf{A}\;\;\mathbf{b}_j^T\mathbf{B}_{-j}\right]^T.$$

Define as an orthonormal basis for . Thus,

$$\mathbb{E}|\mathbf{b}_j^T\mathbf{a}_k|\le\mathbb{E}\|\mathbf{b}_j^T\mathbf{U}\|_2. \tag{23}$$

It is not hard to show that

$$\mathbb{E}\|\mathbf{b}_j^T\mathbf{U}\|_2^2=\frac{r}{m}. \tag{24}$$

Since is a convex function, by Jensen’s inequality

$$\mathbb{E}\|\mathbf{b}_j^T\mathbf{U}\|_2\le\sqrt{\frac{r}{m}}. \tag{25}$$

Similarly, for ,

$$\mathbb{E}|\mathbf{b}_j^T\mathbf{b}_k|\le\sqrt{\frac{1}{m}}. \tag{26}$$

Therefore, according to (25) and (26)

$$\mathbb{E}\|\mathbf{g}_{n_1+j}\|_1\le n_1\sqrt{\frac{r}{m}}+(n_2-1)\sqrt{\frac{1}{m}}. \tag{27}$$

Thus, if (7) is satisfied, .
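The moment bounds used in this proof, (20), (21), (24), (25), and (26), can be sanity-checked by Monte Carlo simulation. The sketch below assumes that uniform directions are generated by normalizing Gaussian vectors; the function names, dimensions, sample count, and seed are illustrative choices rather than values from the paper.

```python
import numpy as np

def unit_columns(rows, cols, rng):
    """Columns drawn uniformly at random from the unit sphere in R^rows."""
    X = rng.standard_normal((rows, cols))
    return X / np.linalg.norm(X, axis=0, keepdims=True)

def lemma1_moments(m=50, r=5, n_samples=20000, seed=0):
    """Empirical versions of the expectations bounded in (20), (21), (24)-(26)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal basis of a random subspace
    a = U @ unit_columns(r, n_samples, rng)           # inliers: unit vectors inside the subspace
    a2 = U @ unit_columns(r, n_samples, rng)
    b = unit_columns(m, n_samples, rng)               # outliers: unit vectors in R^m
    b2 = unit_columns(m, n_samples, rng)
    proj = np.linalg.norm(U.T @ b, axis=0)            # ||b^T U||_2 for each outlier
    return {
        "inlier_inlier": np.mean(np.abs(np.sum(a * a2, axis=0))),    # (20)
        "inlier_outlier": np.mean(np.abs(np.sum(a * b, axis=0))),    # (21)
        "proj_sq": np.mean(proj ** 2),                                # (24)
        "proj": np.mean(proj),                                        # (25)
        "outlier_outlier": np.mean(np.abs(np.sum(b * b2, axis=0))),  # (26)
    }

m, r = 50, 5
est = lemma1_moments(m, r)
print(est["inlier_inlier"], ">=", np.sqrt(2 / (np.pi * r)))
print(est["inlier_outlier"], ">=", np.sqrt(2 / (np.pi * m)))
print(est["proj_sq"], "~", r / m)
print(est["proj"], "<=", np.sqrt(r / m))
print(est["outlier_outlier"], "<=", np.sqrt(1 / m))
```

The gap between the inlier-inlier coherence, of order $1/\sqrt{r}$, and the coherences involving outliers, of order $1/\sqrt{m}$, is what drives the separation between (22) and (27).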

Proof of Lemma 2
If the column is an inlier, then

$$\|\mathbf{g}_i\|_2^2=\|\mathbf{a}_i^T\mathbf{A}_{-i}\|_2^2+\|\mathbf{a}_i^T\mathbf{B}\|_2^2. \tag{28}$$

Since the inliers are distributed uniformly at random within ,

$$\mathbb{E}\|\mathbf{a}_i^T\mathbf{A}_{-i}\|_2^2=\frac{n_1-1}{r}. \tag{29}$$

The subspace is a random subspace and is a random direction within . Thus,

$$\mathbb{E}\|\mathbf{a}_i^T\mathbf{B}\|_2^2=\frac{n_2}{m}. \tag{30}$$

Replacing in (28),

$$\mathbb{E}\|\mathbf{g}_i\|_2^2=\frac{n_1-1}{r}+\frac{n_2}{m}. \tag{31}$$

Similarly,

 ∥gn1+j∥22=∥bTjA∥