# Nonconvex Low-Rank Matrix Recovery with Arbitrary Outliers via Median-Truncated Gradient Descent

Recent work has demonstrated the effectiveness of gradient descent for directly recovering the factors of low-rank matrices from random linear measurements in a globally convergent manner when initialized properly. However, the performance of existing algorithms is highly sensitive in the presence of outliers that may take arbitrary values. In this paper, we propose a truncated gradient descent algorithm to improve the robustness against outliers, where the truncation is performed to rule out the contributions of samples that deviate significantly from the sample median of measurement residuals adaptively in each iteration. We demonstrate that, when initialized in a basin of attraction close to the ground truth, the proposed algorithm converges to the ground truth at a linear rate for the Gaussian measurement model with a near-optimal number of measurements, even when a constant fraction of the measurements are arbitrarily corrupted. In addition, we propose a new truncated spectral method that ensures an initialization in the basin of attraction at slightly higher requirements. We finally provide numerical experiments to validate the superior performance of the proposed approach.

## Authors

• 2 publications
• 29 publications
• 15 publications
• 47 publications
• ### Nonconvex Robust Low-rank Matrix Recovery

In this paper we study the problem of recovering a low-rank matrix from ...
09/24/2018 ∙ by Xiao Li, et al. ∙ 0

• ### Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers

This paper investigates the phase retrieval problem, which aims to recov...
03/11/2016 ∙ by Huishuai Zhang, et al. ∙ 0

• ### Nonconvex Matrix Factorization from Rank-One Measurements

We consider the problem of recovering low-rank matrices from random rank...
02/17/2018 ∙ by Yuanxin Li, et al. ∙ 0

• ### Median and Mode Ellipse Parameterization for Robust Contour Fitting

Problems that require the parameterization of closed contours arise freq...
04/22/2015 ∙ by Michael A. Greminger, et al. ∙ 0

• ### Reconstructing Point Sets from Distance Distributions

We study the problem of reconstructing the locations u_i of a set of po...
04/06/2018 ∙ by Shuai Huang, et al. ∙ 0

• ### Near-Optimal Joint Object Matching via Convex Relaxation

Joint matching over a collection of objects aims at aggregating informat...
02/06/2014 ∙ by Yuxin Chen, et al. ∙ 0

• ### A Computational Theory of Robust Localization Verifiability in the Presence of Pure Outlier Measurements

The problem of localizing a set of nodes from relative pairwise measurem...
10/12/2019 ∙ by Mahroo Bahreinian, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Low-rank matrix recovery is a problem of great interest in applications such as collaborative filtering, signal processing, and computer vision. A considerable amount of work has been done on low-rank matrix recovery in recent years, where it is shown that low-rank matrices can be recovered accurately and efficiently from much fewer observations than their ambient dimensions

[1, 2, 3, 4, 5, 6]. An extensive overview on low-rank matrix recovery can be found in [7]. In particular, convex relaxation is a popular strategy which replaces the low-rank constraint by a convex surrogate, such as nuclear norm minimization [1, 8, 9, 5, 6]. However, despite statistical (near-)optimality, the computational and memory costs of this approach are prohibitive for high-dimensional problems.

In practice, a widely used alternative, pioneered by Burer and Monteiro [10]

, is to directly estimate the factors

and , of a low-rank matrix if its rank

is approximately known or can be upper bounded. Since the factors have a much lower-dimensional representation, this approach admits more computationally and memory efficient algorithms. Due to the bilinear constraint induced by factorization, this typically leads to a nonconvex loss function that may be difficult to optimize globally. Interestingly, a series of recent work has demonstrated that, starting from a careful initialization, simple algorithms such as gradient descent

[11, 12, 13, 14, 15] and alternating minimization [16, 17] enjoy global convergence guarantees under near-optimal sample complexity. Some of these algorithms also converge at a linear rate, making them extremely appealing computationally. On the other hand, the global geometry of nonconvex low-rank matrix estimation has been investigated in [18, 19, 20, 21], and it is proven that no spurious local optima, except strict saddle points, exist under suitable coherence conditions and sufficiently large sample size. This implies global convergence from random initialization, provided the algorithm of choice can escape saddle points [22, 23, 24].

In real-world applications, it is quite typical that measurements may suffer from outliers that need to be addressed carefully. In this paper, we focus on low-rank matrix recovery from random linear measurements in the presence of arbitrary outliers. Specifically, the sensing matrices are generated with i.i.d. standard Gaussian entries. Moreover, we assume that a small number of measurements are corrupted by outliers, possibly in an adversarial fashion with arbitrary amplitudes. This setting generalizes the outlier-free models studied in [1, 9, 11, 12], where convex and nonconvex approaches have been developed to accurately recover the low-rank matrix. Unfortunately, the vanilla gradient descent algorithm in [11, 12] is very sensitive in the presence of even a single outlier, as the outliers can perturb the search directions arbitrarily. To handle outliers, existing convex optimization approaches based on sparse and low-rank decompositions can be applied using semidefinite programming [25, 26]. However, their computational cost is very expensive. Therefore, our goal in this paper is to develop fast and robust nonconvex alternatives that are globally convergent in a provable manner that can handle a large number of adversarial outliers.

### 1.1 Our Approach and Results

We propose a median-truncation strategy to robustify the gradient descent approach in [11, 12], which includes careful modifications on both initialization and local search. As it is widely known, the sample median is a more robust quantity to outliers, compared with the sample mean, which cannot be perturbed arbitrarily unless over half of the samples are outliers [27]. Therefore, it becomes an ideal metric to illuminate samples that are likely to be outliers and therefore should be eliminated during the gradient descent updates. Indeed, in a recent work by a subset of current authors [28], a median-truncated gradient descent algorithm has been proposed to robustify phase retrieval via a nonconvex method, where the sample median was exploited to control both the initialization and the local search step, so that only a subset of samples are selected to contribute to the search direction in each iteration. It was demonstrated that such an approach provably tolerates a constant fraction of outliers at a near-optimal sample complexity up to a logarithmic factor for the phase retrieval problem.

Inspired by [28], we design a tailored median-truncated gradient descent (median-TGD) algorithm for low-rank matrix recovery, where we carefully set the truncation strategy to mitigate the impact of outliers. Specifically, we develop a truncated spectral method for initialization, where only samples whose absolute values are not too deviated from the sample median are included. Similarly, we develop a truncated gradient update, where only samples whose measurement residuals using the current estimates are not too deviated from the sample median are included. This leads to an adaptive, iteration-varying strategy to mitigate the effects of outliers. In particular, the proposed algorithm does not assume a priori information regarding the outliers in terms of their fraction, distribution nor values.

Theoretically, we demonstrate that, when initialized in a basin of attraction close to the ground truth, the proposed algorithm converges to the ground truth at a linear rate for the Gaussian measurement model with an order of measurements, where , even when a constant fraction of the measurements are arbitrarily corrupted, which is nearly optimal up to a logarithmic factor. In addition, the truncated spectral method ensures an initialization in the basin of attraction with an order of measurements when a fraction of measurements are arbitrarily corrupted. In the case when the rank is a small constant, our results indicate that the proposed algorithm can tolerate a constant fraction of outliers with an order of measurements, which is much smaller than the size of the matrix.

To obtain the performance guarantees, we establish that the proposed median-truncated gradient satisfies a so-called regularity condition [29], which is a sufficient condition for establishing the linear convergence to the ground truth. Since its debut in [29], the regularity condition has been employed successfully in the analysis of phase retrieval [29, 30, 31, 28], blind deconvolution [32] and low-rank matrix recovery [11, 12, 15] in the recent literature, to name a few. However, our analysis is significantly more involved due to the fact that the truncation procedure involving low-rank matrices has not been tackled in the previous literature. In particular, we establish a new restricted isometry property (RIP) of the sample median for the class of low-rank matrices, which can be thought as an extension of the RIP for the sample mean in compressed sensing literature [1, 9]. We remark that such a result can be of independent interest, and its establishment is non-trivial due to the nonlinear character of the median operation. Numerical experiments demonstrate the excellent empirical performance of the proposed algorithm for low-rank matrix recovery from outlier-corrupted measurements, which significantly outperforms the existing algorithms that are not resilient to outliers [11, 12].

Computationally, because the sample median can be computed in a linear time [33], our median-truncated gradient descent algorithm shares a similar attractive computational cost as [11, 12]. Specifically, the per-iteration computational complexity of the proposed algorithm is on the order of , which is linear with respect to , while is quadratic with respect to and 111In practice, our algorithm can be applied to other measurement ensembles with more structures, such as sparsity, and the computational complexity can be further reduced.. The proposed algorithm enjoys a lower computational complexity, compared with SVD-based methods [8] and alternating minimization [16], which usually require more than or operations during each iteration.

### 1.2 Related Works

Our work is amid the recent surge of nonconvex approaches for high-dimensional signal estimation, e.g. an incomplete and still growing list [29, 13, 30, 14, 31, 28, 32, 11, 12, 15]. The most closely-related work is on low-rank matrix recovery using random linear measurements [11, 12] in the absence of outliers, in the context of which our algorithm can be thought as a robust counterpart. Our particular approach is inspired by the previous work of a subset of current authors [28] on robust phase retrieval, which can be thought as robust recovery of a rank-one positive semidefinite (PSD) matrix using rank-one measurement operators [6]. Our model in the current paper differs as we tackle low-rank matrix recovery using random full-rank measurement operators, and thus non-trivial technical developments are necessary.

It is worth mentioning that other nonconvex approaches for robust low-rank matrix completion have been presented in [34, 35, 36], where the goal is to separate a low-rank matrix and sparse outliers from a small number of direct or linear measurements of their sum. The approaches typically use thresholding-based truncation for outlier removal and projected gradient descent for low-rank matrix recovery, which are somewhat similar to our approach in terms of different ways to remove outliers. However, this line of work typically requires stronger assumptions on the outliers such as spread-ness conditions, while we allow arbitrary outliers.

### 1.3 Paper Organizations and Notations

The remainder of this paper is organized as follows. Section 2 formulates the problems of interest. Section 3 describes the proposed algorithm and its performance guarantees. Section 4 provides numerical evidence on the superior performance of the proposed algorithm in the presence of outliers. Section 5 and Section 6 provide the proofs of the main theoretical results, and finally, we conclude the paper in Section 7.

Throughout this paper, we denote vectors by boldface lowercase letters and matrices by boldface uppercase letters. The notations

, , and represent the transpose, the spectral norm and the Frobenius norm of a matrix , respectively. We denote the

th singular value of

by , and the

th eigenvalue by

. For a vector , denotes the median of the entries in , and denotes the vector that contains its entry-wise absolute values. The th entry of a matrix is denoted by . Besides, the inner product between two matrices and is defined as , where denotes the trace of a matrix. The indicator function of an event is denoted by , which equals to if is true and otherwise. In addition, we use , , , with different superscripts and subscripts to represent universal constants, whose values may change from line to line.

## 2 Problem Formulation

Let be a rank- matrix that can be written as

 M=XYT, (1)

where and are the low-rank factors of . Define the condition number and the average condition number of as , and , respectively. Clearly, .

Let be the number of measurements, and the set of sensing matrices are given as , where is the th sensing matrix. In particular, each entry of is generated with i.i.d. standard Gaussian entries, i.e. . Denote the index set of corrupted measurements by , and correspondingly, the index set of clean measurements is given as the complementary set . Mathematically, the measurements are given as

 yi={⟨Ai,M⟩, ifi∈Sc;ηi,ifi∈S, (2)

where is the set of outliers that can take arbitrary values. Denote the cardinality of by , where is the fraction of outliers. To simplify the notations, we define the linear maps , and .

Instead of recovering , we aim to directly recover its low-rank factors from the corrupted measurements , without a priori knowledge of statistical distribution or fractions of the outliers, in a computationally efficient and provably accurate manner. It is straightforward to see that for any orthonormal matrix and scaler such that , we have . To address the scaling ambiguity, we assume , and consequently, can be recovered only up to orthonormal transformations. Hence, we measure the estimation accuracy by taking this into consideration. Let the estimates of low-rank factors be and , and define the augmented variables

 W=[UV]∈R(n1+n2)×r,Z=[XY]∈R(n1+n2)×r. (3)

Then the distance between and is measured as

 (4)

Define

 Q(W,Z)=argminP∈Rr×r,PPT=I∥W−ZP∥F, (5)

and then , where the subscript of is dropped for notational simplicity.

## 3 Proposed Algorithm and Theoretical Guarantees

Define a quadratic loss function with respect to the th measurement as

 fi(U,V)=14m(yi−Ai(UVT))2, (6)

where and . In order to get rid of the impact of outliers, an ideal approach is to minimize an oracle loss function, expressed as

 horacle(U,V)=foracle(U,V)+g(U,V)=∑i∈Scfi(U,V)+λ4∥∥UTU−VTV∥∥2F, (7)

which aims to minimize the quadratic loss over only the clean measurements, in addition to a regularization term

 g(U,V)=λ4∥∥UTU−VTV∥∥2F, (8)

that aims at balancing the norm of the two factors. Nevertheless, it is impossible to minimize directly, since the oracle information regarding the support of outliers is absent. Moreover, the loss function is nonconvex, adding difficulty to its global optimization.

We consider a gradient descent strategy where in each iteration, only a subset of all samples contribute to the search direction:

 Ut+1=Ut−μt∥U0∥2⋅∇Uht(Ut,Vt);Vt+1=Vt−μt∥V0∥2⋅∇Vht(Ut,Vt), (9)

where denotes the step size, and is the initialization that will be specified later. Also, denote . In particular, the iteration-varying loss function is given as

 ht(U,V)=∑i∈Etfi(U,V)+g(U,V):=ftr(U,V)+g(U,V), (10)

where the set varies at each iteration and includes only samples that are likely to be inliers. Denote the residual of the th measurement at the th iteration by

 rti=yi−Ai(UtVTt),i=1,2,…,m, (11)

and . Then the set is defined as

 Et={i∣∣|rti|≤αh⋅med{|rt|}}, (12)

where is some small constant. In other words, only samples whose current absolute residuals are not too deviated from the sample median of the absolute residuals are included in the gradient update. As the estimate gets more accurate, we expect that the set gets closer to the oracle set , and hence the gradient search is more accurate. Note that the set varies per iteration, and therefore can adaptively prune the outliers. The gradients of with respect to and are given as

 ∇Uht(U,V)=12m∑i∈Et[Ai(UVT)−yi]AiV+λU(UTU−VTV);∇Vht(U,V)=12m∑i∈Et[Ai(UVT)−yi]ATiU+λV(VTV−UTU). (13)

For initialization, we adopt a truncated spectral method, which uses the top singular vectors of a sample-weighted surrogate matrix, where again only the samples whose values do not significantly digress from the sample median are included. To avoid statistical dependence in the theoretical analysis, we split the samples by using the sample median of samples to estimate , and then using the rest of the samples to construct the truncated surrogate matrix to perform a spectral initialization. In practice, we find that this sample split is unnecessary, as demonstrated in the numerical simulations.

The details of the proposed algorithm, denoted as median-truncated gradient descent (median-TGD), are provided in Algorithm 1, where the stopping criterion is simply set as reaching a preset maximum number of iterations. In practice, it is also possible to set the stopping criteria by examining the progress between iterations. In sharp contrast to the standard gradient descent approach that exploits all samples in every iteration [11], both the initialization and the search directions are controlled more carefully in order to adaptively eliminate outliers, while maintaining a similar low computational cost.

### 3.2 Theoretical Guarantees

Theorem 1 summarizes the performance guarantee of median-TGD in Algorithm 1 for low-rank matrix recovery using Gaussian measurements in the presence of sparse arbitrary outliers, when initialized within a proper neighborhood around the ground truth.

###### Theorem 1 (Exact recovery with sparse arbitrary outliers).

Assume the measurement model (2), where each is generated with i.i.d. standard Gaussian entries. Suppose that the initialization satisfies

 dist(W0,Z)≤124σr(Z).

Recall that . Set and with . There exist some constants ,

such that with probability at least

, if , and , there exists a constant , such that with , the estimates of median-TGD satisfy

 dist(Wt,Z)≤(1−μ10κ)t/2dist(W0,Z).

Theorem 1 suggests that if the initialization lies in the basin of attraction, median-TGD converges to the ground truth at a linear rate as long as the number of measurements is on the order of , even when a constant fraction of measurements are corrupted arbitrarily. In comparisons, the gradient descent algorithm by Tu et.al. [12] achieves the same convergence rate in a similar basin of attraction, with an order of measurements using outlier-free measurements. Therefore, our algorithm achieves robustness up to a constant fraction of outliers with a slight price of an additional logarithmic factor in the sample complexity.

Theorem 2 guarantees that the proposed truncated spectral method provides an initialization in the basin of attraction with high probability.

###### Theorem 2.

Assume the measurement model (2), and . Set . There exist some constants and such that with probability at least , if , and , we have

 dist(W0,Z)≤124σr(Z).

Theorem 2 suggests that the proposed initialization scheme is guaranteed to obtain a valid initialization in the basin of attraction with an order of measurements when a fraction of measurements are arbitrarily corrupted, assuming the average condition number is a small constant. In comparisons, in the outlier-free setting, Tu et.al. [12] requires an order of measurements for a one-step spectral initialization, which is closest to our scheme. Therefore, our initialization achieves robustness to a fraction of outliers at a slight price of additional logarithmic factors in the sample complexity. It is worthwhile to note that in the absence of outliers, Tu et.al. [12] was able to further reduce the sample complexity of initialization to an order of by running multiple iterations of projected gradient descent. However, it is not clear whether such an iterative scheme can be generalized to the setting with outliers in our paper.

Finally, we note that the parameter bounds in all theorems, including , and , are not optimized for performance, but mainly selected to establish the theoretical guarantees.

## 4 Numerical Experiments

In this section, we evaluate the performance of the proposed median-TGD algorithm via conducting several numerical experiments. As mentioned earlier, for the initialization step, in practice we find it is not necessary to split the samples into two parts. Therefore, the matrix in (14) is changed instead to

 Y=1mm∑i=1yiAiI{|yi|≤αy⋅med(|y|)}. (15)

In particular, we check the trade-offs between the number of measurements, the rank and the fraction of outliers for accurate low-rank matrix recovery, and compare against the algorithm in [12], referred to as the vanilla gradient descent algorithm (vanilla-GD), to demonstrate the performance improvements in the presence of outliers due to median truncations.

Let , . We randomly generate a rank- matrix as , where both and are composed of i.i.d. standard Gaussian variables. The outliers are i.i.d. randomly generated following . We set and , and pick a constant step size . In all experiments, the maximum number of iterations for median-TGD algorithm is set as to guarantee convergence. Moreover, let be the solution to the algorithm under examination, and the recovered low-rank matrix is given as . Then, the normalized estimate error is defined as .

### 4.1 Phase Transitions

We first examine the phase transitions of median-TGD algorithm with respect to the number of measurements, the rank and the percent of outliers. Fix the percent of outliers as

. Given a pair of and , a ground truth is generated composed of i.i.d. standard Gaussian variables. Multiple Monte Carlo trials are carried out, and each trial is deemed a success if the normalized estimate error is less than . Fig. 1 (a) shows the success rates of median-TGD, averaged over trials, with respect to the number of measurements and the rank, where the red line shows the theoretical limit defined as

by a heuristic count of the degrees of freedom. It can be seen that the required number of measurements for a successful matrix recovery scales linearly with the rank

, and the transition is sharp. We next examine the success rates of median-TGD with respect to the percent of outliers and the rank. Fix . Under the same setup as Fig. 1 (a), Fig. 1 (b) shows the success rate of median-TGD, averaged over trials, with respect to the rank and the percent of outliers. The performance of median-TGD degenerates smoothly with the increase of the percent of outliers. Similarly, the red line shows the theoretical limit as a comparison.

### 4.2 Stability to Additional Bounded Noise

We next examine the performance of median-TGD when the measurements are contaminated by both sparse outliers and dense noise. Here, the measurements are rewritten as

 yi={⟨Ai,M⟩+wi,ifi∈Sc;ηi+wi,ifi∈S,

where , for , denote the additional bounded noise.

Fix and . The dense noise is generated with i.i.d. random entries following . Fig. 2 depicts the average normalized reconstruction errors with respect to the number of measurements using both median-TGD and vanilla-GD [12], where vanilla-GD is always given the true rank information, i.e. . The performance of median-TGD is comparable to that of vanilla-GD using outlier-free measurements, which cannot produce reliable estimates when the measurements are corrupted by outliers. Therefore, median-TGD can handle outliers in a much more robust manner. Moreover, the performance of median-GD is stable as long as an upper bound of the true rank is used.

We next compare the convergence rates of median-TGD and vanilla-GD under various outlier settings, by fixing while keeping the other settings the same as Fig. 2. Fig. 3 shows the normalized estimate error with respect to the number of iterations of median-TGD and vanilla-GD with no outliers, of outliers, and of outliers, respectively. In the outlier-free case, both algorithms have comparable convergence rates. However, even with a few outliers, vanilla-GD suffers from a dramatical performance degradation, while median-TGD is robust against outliers and can still converge to an accurate estimate.

## 5 Proof of Linear Convergence

In this section, we present the proof of Theorem 1. Section 5.1 first establishes an RIP-like property for the median of random linear measurements of low-rank matrices, which can be of independent interest. Section 5.2 describes the regularity condition (RC), which is used to certify the linear convergence of the proposed algorithm. Section 5.3 proves several properties of the truncated gradient which are then used in Section 5.4 to prove the RC and finish the proof.

### 5.1 Concentration Property of Sample Median

To begin, we define below the quantile function of a population distribution and its corresponding sample version.

###### Definition 1 (Generalized quantile function).

Let

. For a cumulative distribution function (CDF)

, the generalized quantile function is defined as

 F−1(τ)=inf{x∈R:F(x)≥τ}.

For simplicity, denote as the -quantile of . Moreover, for a sample collection , the sample -quantile means , where is the empirical distribution of the samples . Specifically, .

We establish a RIP-style concentration property for the sample median used in the truncation indicator of gradient descent, which provides theoretical footings on the success of the proposed algorithm. The concentration property of the sample -quantile function of all rank- matrices is formulated in the following proposition, of which the proof is shown in Appendix B.

###### Proposition 1.

Fix . If for some large enough constant , then with probability at least , where and are some constants, we have for all rank- matrices ,

 θ12(|A(G)|)∈[0.6745−ϵ,0.6745+ϵ]∥G∥F, θ0.49(|A(G)|)∈[0.6588−ϵ,0.6588+ϵ]∥G∥F, θ0.51(|A(G)|)∈[0.6903−ϵ,0.6903+ϵ]∥G∥F.

Proposition 1 suggests that as long as is on the order of , the sample median concentrates around a scaled for all rank- matrices , which resembles the matrix RIP in [1]. Based on Proposition 1, provided that for some large enough constant , setting , we have

 θ0.49,θ12,θ0.51(∣∣A(XYT)−A(UVT)∣∣)∈[0.65,0.70]∥∥XYT−UVT∥∥F (16)

holds with probability at least for all , , , and . On the other end, due to Lemma 4, we have

 θ12−s(∣∣A(XYT)−A(UVT)∣∣)≤med(∣∣y−A(UVT)∣∣)≤θ12+s(∣∣A(XYT)−A(UVT)∣∣).

As a result, when the fraction of corruption satisfies , the above equation together with (16) yields

 (17)

Therefore, an important consequence is that the truncation event satisfies

 (18)

### 5.2 Regularity Condition

We first introduce the so-called Regularity Condition (RC) [29, 11, 12] that characterizes the benign curvature of the loss function around the ground truth, and guarantees the linear convergence of gradient descent to the ground truth.

We first rewrite the loss function in terms of the augmented variables in (3). Denote the matrix , and define and , then we can have the equivalent representation

 Bi(WWT)=⟨Bi,WWT⟩=⟨Ai,UVT⟩=Ai(UVT). (19)

The regularizer can be rewritten as

 g(W)=λ4∥∥WTDW∥∥2F, (20)

where , and its gradient can be rewritten as

 ∇g(W)=λDW(WTDW). (21)

Then the truncated gradient can be rewritten as a function of ,

 (22)

where

 Ei={∣∣yi−Bi(WWT)∣∣≤αh⋅med(∣∣y−B(WWT)∣∣)}. (23)

Then the RC is defined in the following definition.

###### Definition 2 (Regularity Condition).

Suppose is the ground truth. The set of matrices that are in an -neighborhood of is defined as

 C(ϵ)={W∈R(n1+n2)×r:dist(W,Z)≤ϵ}.

Then the function is said to satisfy the RC, denoted by , if for all matrices , the following inequality holds:

 ⟨∇h(W),W−ZQ⟩≥σ2r(Z)α∥W−ZQ∥2F+1β∥Z∥2∥∇h(W)∥2F, (24)

where is an orthonormal matrix given in (5).

The neighborhood is known as the basin of attraction. Interestingly, if satisfies the RC, then initializing a simple gradient descent algorithm in the basin of attraction guarantees that the iterates converge at a linear rate to the ground truth, as summarized in the following lemma.

###### Lemma 1.

[29, 11, 12] Suppose that satisfies and . Consider the gradient descent update

 Wt+1=Wt−μ∥Z∥2∇h(Wt) (25)

with the step size . Then for all , we have and

 dist(Wt,Z)≤(1−2μακ)t/2dist(W0,Z).

Note that since the initialization satisfies , by the triangle inequality we can guarantee that

 2324∥Z∥≤∥W0∥≤2524∥Z∥,

which implies

 2324√2∥Z∥≤∥U0∥≤2524√2∥Z∥; 2324√2∥Z∥≤∥V0∥≤2524√2∥Z∥,

where we use the fact . Therefore, instead of proving the linear convergence of the actual update size and , we prove it for the step size in (25), since they only differ by a constant scaling of . Hence, the rest of the proof is to verify that RC holds for the truncated gradient.

### 5.3 Properties of Truncated Gradient

We start by proving a few key properties of the truncated gradient . Consider the measurement model with sparse outliers in (2). Define the truncation event

which is the same as except that the measurements used to calculate the residual are replaced by clean measurements. In particular, it is straight to see that (18) also holds for . Then we can write as

 ∇ftr(W) =1mm∑i=1(Bi(WWT)−yi)BiWIEi =1m∑i∉S(Bi(WWT)−Bi(ZZT))BiWI~Ei+1m∑i∈S(Bi(WWT)−yi)BiWIEi =1mm∑i=1(Bi(WWT)−Bi(ZZT))BiWI~Ei∇cftr(W) +1m∑i∈S[(Bi(WWT)−yi)IEi−(Bi(WWT)−Bi(ZZT))I~Ei]BiW∇oftr(W),

where corresponds to the truncated gradient as if all measurements are clean, and corresponds to the contribution of the outliers.

For notational simplicity, define

 H=[H1H2]=W−ZQ=[U−XQV−YQ], (26)

where is given in (5). We have

 ⟨∇cftr(W),H⟩ =1mm∑i=1⟨Bi,WWT−ZZT⟩⋅⟨Bi,HWT⟩⋅I~Ei. (27)

Define the set as

 D ={i|⟨Bi,WWT−ZZT⟩⋅⟨Bi,HWT⟩<0}. (28)

We can then split (27) and bound it as

 ⟨∇cftr(W),H⟩ (29) ≥1m∑i∉D⟨Bi,WWT−ZZT⟩⋅⟨Bi,HWT⟩⋅I{∣∣⟨Bi,WWT−ZZT⟩∣∣≤0.65αh∥∥UVT−XYT∥∥F} +1m∑i∈D⟨Bi,WWT−ZZT⟩⋅⟨Bi,HWT⟩⋅I{∣∣⟨Bi,WWT−ZZT⟩∣∣≤0.70αh∥∥UVT−XYT∥∥F} =12m∑i∉D⟨Ai,UVT−XYT⟩⋅⟨Ai,H1VT+UHT2⟩⋅I{∣∣⟨Ai,UVT−XYT⟩∣∣≤0.65αh∥∥UVT−XYT∥∥F}B1 +12m∑i∈D⟨Ai,UVT−XYT⟩⋅⟨Ai,H1VT+UHT2⟩⋅I{∣∣⟨Ai,UVT−XYT⟩∣∣≤0.70αh∥∥UVT−XYT∥∥F}B2. (30)

The first term in (30) can be lower bounded by Proposition 2, whose proof is given in Appendix C.

###### Proposition 2.

Provided , then

 B1 (31)

holds for all with probability at least , where with , and are numerical constants.

The second term in (30) can be lower bounded by Proposition 3, whose proof is given in Appendix D.

###### Proposition 3.

Provided , we have

 B2 ≥−0.36αh∥∥UVT−XYT∥∥F∥∥H1HT2∥∥F (32)

holds for all with probability at least , where are numerical constants.

The contribution of outliers can be bounded by the following proposition, whose proof is given in Appendix E.

###### Proposition 4.

Provided , we have

 |⟨∇oftr(W),H⟩|≤0.71αh√s∥∥XYT−UVT∥∥F∥H1VT+UHT2∥F (33)

holds for all with probability at least , where are numerical constants.

On the other end, Proposition 5 establishes an upper bound for , whose proof is given in Appendix F.

###### Proposition 5.

Let . Provided , we have

 ∥∇ftr(W)∥2F≤0.25α2h∥∥UVT−XYT∥∥2F∥W∥2 (34)

holds for all with probability at least , where