 # Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers

This paper investigates the phase retrieval problem, which aims to recover a signal from the magnitudes of its linear measurements. We develop statistically and computationally efficient algorithms for the situation when the measurements are corrupted by sparse outliers that can take arbitrary values. We propose a novel approach to robustify the gradient descent algorithm by using the sample median as a guide for pruning spurious samples in initialization and local search. Adopting the Poisson loss and the reshaped quadratic loss respectively, we obtain two algorithms termed median-TWF and median-RWF, both of which provably recover the signal from a near-optimal number of measurements when the measurement vectors are composed of i.i.d. Gaussian entries, up to a logarithmic factor, even when a constant fraction of the measurements are adversarially corrupted. We further show that both algorithms are stable in the presence of additional dense bounded noise. Our analysis is accomplished by developing non-trivial concentration results of median-related quantities, which may be of independent interest. We provide numerical experiments to demonstrate the effectiveness of our approach.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Phase retrieval is a classical problem in signal processing, optics and machine learning that has a wide range of applications such as X-ray crystallography , astronomical imaging, and diffraction imaging. Mathematically, it is formulated as recovering a signal from the magnitudes of its linear measurements:

 yi=|⟨ai,x⟩|2,i=1,…,m, (1)

where is the total number of measurements, and is the th known measurement vector, . Phase retrieval is known to be notoriously difficult due to the quadratic form of the measurements. Classical methods based on alternating minimization between the signal of interest and the phase information , though computationally simple, are often trapped at local minima and lack rigorous performance guarantees.

There has been, however, a recent line of work that successfully develops provably accurate algorithms for phase retrieval, in particular for the case when the measurement vectors ’s are composed of independent and identically distributed

(i.i.d.) Gaussian entries. Broadly speaking, two classes of approaches have been proposed based on convex and nonconvex optimization techniques, respectively. Using the lifting trick, the phase retrieval problem can be reformulated as estimating a rank-one positive semidefinite (PSD) matrix

from linear measurements , for which convex relaxations into semidefinite programs have been studied [50, 10, 16, 19, 8, 35]. In particular, Phaselift 

perfectly recovers the signal with high probability as long as the number of measurements

is on the order of . However, the computational complexity of Phaselift is at least cubic in , which becomes expensive when is large. Very recently, another convex relaxation has been proposed in the natural parameter space without lifting [24, 3, 26]

, resulting in a linear program that can handle large problem dimensions as long as

is on the order of .

Another class of approaches aims to find the signal that minimizes a loss function based on certain postulated noise model, which often results in a nonconvex optimization problem due to the quadratic measurements. Despite nonconvexity, it is demonstrated in

[9, 42] that the so-called Wirtinger flow (WF) algorithm, based on gradient descent, works remarkably well: it converges to the global optima when properly initialized using the spectral method. Several variants of WF have been proposed thereafter to further improve its performance, including the truncated Wirtinger flow (TWF) algorithm , the reshaped Wirtinger flow (RWF) algorithm , and the truncated amplitude flow (TAF) algorithm  Notably, TWF, RWF and TAF are shown to converge globally at a linear rate as long as is on the order of , and attain -accuracy within flops using a constant step size.333Notation or means that there exists a constant such that .

### 1.1 Outlier-Robust Phase Retrieval

The aforementioned algorithms are evaluated based on their statistical and computational performances: statistically, we wish the sample complexity to be as small as possible; computationally, we wish the run time to be as fast as possible. As can be seen, existing WF-type algorithms are already near-optimal both statistically and computationally. This paper introduces a third consideration, which is the robustness to outliers, where we wish the algorithm continues to work well even in the presence of outliers that may take arbitrary magnitudes. This bears great importance in practice, because outliers arise frequently from the phase imaging applications  due to various reasons such as detector failures, recording errors, and missing data. Specifically, suppose the set of measurements are given as

 yi=|⟨ai,x⟩|2+ηi,i=1,⋯,m, (2)

where for are outliers that can take arbitrary values. We assume that outliers are sparse with no more than nonzero values, i.e., , where . Here, is a nonzero constant, representing the faction of measurements that are corrupted by outliers.

The goal of this paper is to develop phase retrieval algorithms with both statistical and computational efficiency, and provable robustness to even a constant proportion of outliers. None of the existing algorithms meet all of the three considerations simultaneously. The performance of WF-type algorithms is very sensitive to outliers which introduce anomalous search directions when their values are excessively deviated. While a form of Phaselift  is robust to a constant portion of outliers, it is computationally too expensive.

A natural idea is to recover the signal as a solution to the following loss minimization problem:

 minz12mm∑i=1ℓ(z;yi) (3)

where is postulated using the negative likelihood of Gaussian or Poisson noise model. Since the measurements are quadratic in , the objective function is nonconvex. We consider two choices of in this paper. The first one is the Poisson loss function of employed in TWF , which is given by

 ℓ(z;yi)=|aTiz|2−yilog|aTiz|2. (4)

The second one is the reshaped444It is called “reshaped” in order to distinguish it from the quadratic loss of used in . quadratic loss of employed in RWF , which is given by

 ℓ(z;yi)=(|aTiz|−√yi)2. (5)

Though (5) is not smooth everywhere, it resembles more closely the least-squares loss as if the phase information is available than the quadratic loss of , resulting in a more amenable curvature.

In the presence of outliers, the signal of interest may no longer be the global optima of (3). Therefore, we wish to only include the clean samples that are not corrupted in the optimization (3), which is, however, impossible as we do not assume any a priori knowledge of the outliers. Our key strategy is to prune the bad samples adaptively and iteratively, using a gradient descent procedure that proceeds as follows:

 z(t+1)=z(t)−μm∑i∈Tt+1∇ℓ(z(t);yi). (6)

where denotes the th iterate of the algorithm, is the gradient of , and is the step size, for . In each iteration, only a subset of data-dependent and iteration-varying samples contributes to the search direction. But how to select the set ? Note that the gradient of the loss function typically contains the term (for TWF) or (for RWF), which measures the residual using the current iterate. With being corrupted by arbitrarily large outliers, the gradient can deviate the search direction from the signal arbitrarily.

Inspired by the utility of median to combat outliers in robust statistics , we prune samples whose gradient components are much larger than the sample median to control the search direction of each update. Hiding some technical details, this gives the main ingredient of our median-truncated gradient descent update rule555Please see the exact form of the algorithms in Section 2., i.e., for each iterate :

 Tt+1 :={i:|yi−|aTiz(t)|2|≲med({|yi−|aTiz(t)|2}mi=1)},for TWF, (7) Tt+1 :={i:|√yi−|aTiz(t)||≲med({|√yi−|aTiz(t)|}mi=1)},% for RWF, (8)

where denotes the sample median. The robust property of median lies in the fact that the median cannot be arbitrarily perturbed unless the outliers dominate the inliers . This is in sharp contrast to the sample mean, which can be made arbitrarily large even by a single outlier. Thus, using the sample median in the truncation rule can effectively remove the impact of outliers. Finally, there still left the question of initialization, which is critical to the success of the algorithm. We use the spectral method, i.e., initialize

by a proper rescaling of the top eigenvector of a surrogate matrix

 Y=1m∑i∈T0yiaiaTi, (9)

where again includes only a subset of samples whose values are not excessively large compared with the sample median of the measurements, given as

 T0={i:yi≲med({yi}mi=1)}. (10)

Putting things together (the update rule (6) and the initialization (9)), we obtain two new median-truncated gradient descent algorithms, median-TWF and median-RWF, based on applying the median truncation strategy for the loss functions used in TWF and RWF, respectively. The median-TWF and median-RWF algorithms do not assume a priori knowledge of the outliers, such as their existence or the number of outliers, and therefore can be used in an oblivious fashion. Importantly, we establish the following performance guarantees.

Main Result (informal): For the Gaussian measurement model, with high probability, median-TWF and median-RWF recover all signal up to the global sign at a linear rate of convergence, even with a constant fraction of outliers, as long as the number of measurements is on the order of . Furthermore, the reconstruction is stable in the presence of additional bounded dense noise.

Statistically, the sample complexity of both algorithms is near-optimal up to a logarithmic factor, and to reassure, they continue to work even when outliers are absent. Computationally, both algorithms converge linearly, requiring a mere computational cost of to reach

-accuracy. More importantly, our algorithms now tolerate a constant fraction of arbitrary outliers, without sacrificing performance otherwise. To the best of our knowledge, this is the first application of the median to robustify high-dimensional statistical estimation in the presence of arbitrary outliers with rigorous non-asymptotic performance guarantees.

To establish the performance guarantees, we first show that the initialization is close enough to the ground truth, and then that within the neighborhood of the ground truth, the gradients satisfy certain Regularity Condition [9, 13] that guarantees linear convergence of the descent rule, as long as the fraction of outliers is small enough and the sample complexity is large enough. As a nonlinear operator, the sample median is much more difficult to analyze than the sample mean, which is a linear operator and many existing concentration inequalities are readily applicable. Therefore, considerable technical efforts are devoted to develop novel non-asymptotic concentrations of the sample median, and various statistical properties of the sample median related quantities, which may be of independent interest.

Finally, we note that while median-TWF and median-RWF share similar theoretical performance guarantees, their empirical performances vary under different scenarios, due to the use of different loss functions. Their theoretical analyses also have significant difference that worth separate treatments. While we only consider the loss functions used in TWF and RWF in this paper, we believe the median-truncation technique can be applied to gradient descent algorithms for solving other problems as well.

### 1.3 Related Work

Our work is closely related to the TWF algorithm , which is also a truncated gradient descent algorithm for phase retrieval. However, the truncation rule in TWF is based on the sample mean, which is very sensitive to outliers. In [37, 25, 53], the problem of phase retrieval under outliers is investigated, but the proposed algorithms either lack performance guarantees or are computationally too expensive.

The adoption of median in machine learning is not unfamiliar, for example, -median clustering  and resilient data aggregation for sensor networks . Our work here further extends the applications of median to robustifying high-dimensional estimation problems with theoretical guarantees. Another popular approach in robust estimation is to use the trimmed mean , which has found success in robustifying sparse regression , subspace clustering , etc. However, using the trimmed mean requires knowledge of an upper bound on the number of outliers, whereas median does not require such information.

Developing non-convex algorithms with provable global convergence guarantees has attracted intensive research interest recently. A partial list of these studies include phase retrieval [38, 9, 13, 51, 44], matrix completion [31, 29, 45, 27, 18, 57, 30, 23], low-rank matrix recovery [6, 17, 47, 56, 40, 52, 33, 36], robust PCA [39, 54]

, robust tensor decomposition

, dictionary learning [2, 43], community detection , phase synchronization, blind deconvolution [32, 34], joint alignment , etc. Our algorithm provides a new instance in this list that emphasizes robust high-dimensional signal estimation under minimal assumptions of outliers.

### 1.4 Paper Organization and Notations

The rest of this paper is organized as follows. Section 2 describes the proposed two algorithms, median-TWF and median-RWF, in details and their performance guarantees. Section 3 presents numerical experiments. Section 4 provides the preliminaries and the proof road map. Section 5 provides the proofs for median-TWF and Section 6 provides the proofs of median-RWF, respectively. Finally, we conclude in Section 7. Supporting proofs are given in the Appendix.

We adopt the following notations in this paper. Given a set of numbers , the sample median is denoted as . The indicator function if the event holds, and otherwise. For a vector , denotes the norm. For two matrices, if is a positive semidefinite matrix.

## 2 Algorithms and Performance Guarantees

We consider the following model for phase retrieval, where the measurements are corrupted by not only sparse arbitrary outliers but also dense bounded noise. Under such a model, the measurements are given as

 yi=|⟨ai,x⟩|2+wi+ηi,i=1,⋯,m, (11)

where is the unknown signal666We focus on real signals here, but our analysis can be extended to complex signals., is the th measurement vector composed of Gaussian entries distributed as , and for are outliers with arbitrary values satisfying , where is the fraction of outliers, and is the bounded noise satisfying for some universal constant .

It is straightforward that changing the sign of the signal does not affect the measurements. The goal is to recover the signal , up to a global sign difference, from the measurements and the measurement vectors . To this end, we define the Euclidean distance between two vectors up to a global sign difference as the performance metric,

 dist(z,x):=min{∥z+x∥,∥z−x∥}. (12)

We propose two median-truncated gradient descent algorithms, median-TWF in Section 2.1 and median-RWF in Section 2.2, based on different choices of the loss functions. This leads to applying the truncation based on the sample median of in median-TWF, and the sample median of in median-RWF. Section 2.3 provides the theoretical performance guarantees of median-TWF and median-RWF, which turn out to be almost the same at the order level except the choice of constants. The empirical comparisons of median-TWF and median-RWF are demonstrated in Section 3.

### 2.1 Median-TWF Algorithm

In median-TWF, we adopt the Poisson loss function of employed in TWF , given as

 ℓ(z):=12mm∑i=1(|aTiz|2−yilog|aTiz|2). (13)

The median-TWF algorithm, as described in Algorithm 1, gradually eliminates the influence of outliers on the way of minimizing (13). Specifically, it comprises an initialization step and a truncated gradient descent step.

1. Initialization: As in (14), we initialize by the spectral method using a truncated set of samples, where the threshold is determined by . As will be shown in Section 4.2, as long as the fraction of outliers is not too large and the sample complexity is large enough, our initialization is guaranteed to be within a small neighborhood of the true signal.

2. Gradient loop: for each iteration , median-TWF uses an iteration-varying truncated gradient given as

 ∇ℓtr(z(t))=1mm∑i=1|aTiz(t)|2−yiaTiz(t)ai1Ei1∩Ei2. (16)

In order to remove the contribution of corrupted samples, from the definition of the set (see Algorithm 1), it is clear that samples are truncated if their measurement residuals evaluated using the current iterate are much larger than the sample median. Samples are also truncated according to the set

, which removes the contribution of samples outside some confidence interval to better control the search direction, since

. The median-TWF algorithm closely resembles the TWF algorithm, except that the truncation is guided by the sample median, rather than the sample mean.

We set the step size in median-TWF to be a fixed small constant, i.e., . The rest of the parameters are set to satisfy

 ζ1:=max{E[ξ21{|ξ|<√1.01αl or |ξ|>√0.99αu}],E[1{|ξ|<√1.01αl or |ξ|>√0.99αu}]}, ζ2:=E[ξ21{|ξ|>0.248αh}], (17) 2(ζ1+ζ2)+√8/πα−1h<1.99 αy≥3,

where . For example, we can set and , and consequently and .

### 2.2 Median-RWF Algorithm

In median-RWF, we adopt the reshaped quadratic loss function of employed in RWF , given as

 R(z)=12mm∑i=1(√yi−|aTiz|)2, (19)

which has been shown to be advantageous over other loss functions for phase retrieval .

Similarly to median-TWF, the median-RWF algorithm as described in Algorithm 2, gradually eliminates the influence of outliers on the way of minimizing (19). Specifically, it also comprises an initialization step and a truncated gradient descent step.

1. Initialization: we initialize in the same manner as in median-TWF (Algorithm 1).

2. Gradient loop: for each iteration , median-RWF uses the following iteration-varying truncated gradient:

 ∇Rtr(z(t))=1mm∑i=1(aTiz(t)−√yi⋅aTiz(t)|aTiz(t)|)ai1Ti, (20)

From the definition of the set (see Algorithm 2), samples are truncated by the sample median of gradient components evaluated at the current iteration. We set the step size in median-RWF to be a fixed small constant, i.e., . Compared with median-TWF, the truncation rule is much simpler with fewer parameters. We simply set the truncation threshold . It is possible that including may further improves the performance, but we wish to highlight that, in this paper, the simple truncation rule is already sufficient to guarantee both robustness and efficiency of median-RWF.

### 2.3 Performance Guarantees

In this section, we characterize the performance guarantees of median-TWF and median-RWF, which turn out to be very similar though the proofs in fact involve quite different techniques. To avoid repetition, we present the guarantees together for both algorithms. We note that the values of constants in the results can vary for median-TWF and median-RWF.

We first show that median-TWF/median-RWF performs well for the noise-free model in the following proposition, which lends support to the model with outliers. This also justifies that we can run median-TWF/median-RWF without having to know whether the underlying measurements are corrupted.

###### Proposition 1 (Exact recovery for the noise-free model).

Suppose that the measurements are noise-free, i.e., and for in the model (11). There exist constants , and such that if and , then with probability at least , median-TWF/median-RWF yields

 dist(z(t),x)≤ν(1−ρ)t∥x∥,∀t∈N (21)

simultaneously for all .

Proposition 1 suggests that median-TWF/median-RWF allows exact recovery at a linear rate of convergence as long as the sample complexity is on the order of , which is in fact slightly worse, by a logarithmic factor, than existing WF-type algorithms (TWF, RWF and TAF) for the noise-free model. This is a price due to working with the nonlinear operator of median in the proof, and it is not clear whether it is possible to further improve the result. Nonetheless, as the median is quite stable as long as the number of outliers is not so large, the following main theorem indeed establishes that median-TWF/median-RWF still performs well even in the presence of a constant fraction of sparse outliers with the same sample complexity.

###### Theorem 1 (Exact recovery with sparse arbitrary outliers).

Suppose that the measurements are corrupted by sparse outliers, i.e., for in the model (11). There exist constants , , and such that if , , , then with probability at least , median-TWF/median-RWF yields

 dist(z(t),x)≤ν(1−ρ)t∥x∥,∀t∈N (22)

simultaneously for all .

Theorem 1 indicates that median-TWF/median-RWF admits exact recovery for all signals in the presence of sparse outliers with arbitrary magnitudes even when the number of outliers scales linearly with the number of measurements, as long as the sample complexity satisfies . Moreover, median-TWF/median-RWF converges at a linear rate using a constant step size, with per-iteration cost (note that the median can be computed in linear time ). To reach -accuracy, i.e., , only iterations are needed, yielding the total computational cost as , which is highly efficient. Empirically in the numerical experiments in Section 3, median-RWF converges faster and tolerates a larger fraction of outliers than median-TWF, which can be due to the use of the reshaped quadratic loss function.

We next consider the model when the measurements are corrupted by both sparse arbitrary outliers and dense bounded noise. Our following theorem characterizes that median-TWF/median-RWF is stable to coexistence of the two types of noises.

###### Theorem 2 (Stability to sparse arbitrary outliers and dense bounded noises).

Consider the phase retrieval problem given in (11) in which measurements are corrupted by both sparse arbitrary and dense bounded noises. There exist constants , and such that if , , , then with probability at least , median-TWF and median-RWF respectively yield

 dist(z(t),x)≲∥\boldmath{w}∥∞∥x∥+(1−ρ)t∥x∥,∀t∈N for median-TWF, (23) dist(z(t),x)≲√∥\boldmath{w}∥∞+(1−ρ)t∥x∥,∀t∈N for median-RWF, (24)

simultaneously for all .

Theorem 2 immediately implies the stability of median-TWF/median-RWF when the measurements are only corrupted by dense bounded noise.

###### Corollary 1.

Consider the phase retrieval problem in which measurements are corrupted only by dense bounded noises, i.e., for in the model (11). There exist constants , and such that if , , then with probability at least , median-TWF and median-RWF respectively yield

 dist(z(t),x)≲∥\boldmath{w}∥∞∥x∥+(1−ρ)t∥x∥,∀t∈N for median-TWF, (25) dist(z(t),x)≲√∥\boldmath{w}∥∞+(1−ρ)t∥x∥,∀t∈N for median-RWF, (26)

simultaneously for all .

With both sparse arbitrary outliers and dense bounded noises, Theorem 2 and Corollary 1 imply that median-TWF/median-RWF achieves the same convergence rate and the same level of estimation error as the model with only bounded noise. In fact, together with Theorem 1 and Proposition 1, it can be seen that applying median-TWF/median-RWF does not require the knowledge of the existence of outliers. When there do exist outliers, median-TWF/median-RWF achieves almost the same performance as if outliers do not exist.

## 3 Numerical Experiments

In this section, we provide numerical experiments to demonstrate the effectiveness of median-TWF and median-RWF, which corroborate our theoretical findings.

### 3.1 Exact Recovery for Noise-free Data

We first show that, in the noise-free case, median-TWF and median-RWF provide similar performance as TWF  and RWF  for exact recovery. We set the parameters of median-TWF and median-RWF as specified in Section 2.1 and Section 2.2, and those of TWF and RWF as suggested in  and , respectively. Let the signal length take values from to by a step size of , and the ratio of the number of measurements to the signal dimension, , take values from to by a step size of . For each pair of , we generate a signal , and the measurement vectors i.i.d. for . For all algorithms, a fixed number of iterations are run, and the trial is declared successful if , the output of the algorithm, satisfies . Figure 1 shows the number of successful trials out of 20 trials for all algorithms, with respect to and . It can be seen that, as soon as is above

, exact recovery is achieved for all four algorithms. Around the phase transition boundary, the empirical sample complexity of median-TWF is slightly worse than that of TWF, which is possibly due to the inefficiency of median compared to mean in the noise-free case

. Interestingly, the empirical sample complexity of median-RWF is slightly better than RWF because the truncation rule used in median-RWF allows sample pruning that improves the performance.777The original RWF in  does not have sample truncation.

### 3.2 Exact Recovery with Sparse Outliers

We next examine the performance of median-TWF and median-RWF in the presence of sparse outliers. We compare the performance of median-TWF and median-RWF with TWF, and also an alternative which we call trimean-TWF, based on replacing the sample mean in TWF by the trimmed mean for truncation purpose. Specifically, trimean-TWF requires knowing the fraction of outliers so that samples corresponding to largest values in the measurements or gradients are first removed, and truncation is then applied based on the mean of the remaining samples similar to TWF.

We fix the signal length and the number of measurements . Let each measurement be corrupted with probability independently, where the corruption value

is randomly generated from a uniform distribution. Figure

2 shows the success rate of exact recovery over trials as a function of at different levels of outlier magnitudes , for the four algorithms median-TWF, median-RWF, trimean-TWF and TWF.

From Figure 2, it can be seen that median-TWF and median-RWF allow exact recovery as long as

is not too large for all levels of outlier magnitudes, without assuming any knowledge of the outliers, which validates our theoretical analysis. Empirically, median-RWF can tolerate a larger fraction of outliers than median-TWF. This could be due to the fact that the lower-order objective adopted in median-RWF reduces the variance and allows more stable search direction. Unsurprisingly, TWF fails quickly even with a very small fraction of outliers. No successful instance is observed for TWF when

irrespective of the value of . Trimean-TWF, even knowing the number of outliers, still does not exhibit a sharp phase transition, and in general underperforms the proposed median-TWF and median-RWF, except when both and get very large, see Figure 2(d). This is because in this range of large outliers, knowing the fraction of outliers provides substantial advantage for trimean-TWF to eliminate them, while the sample median can deviate significantly from the true median for large . Moreover, it is worth mentioning that exact recovery is more challenging for median-TWF and median-RWF when the magnitudes of most outliers are comparable to the measurements, as in Figure 2(c). In such a case, the outliers are more difficult to exclude as opposed to the case with very large outlier magnitudes as in Figure 2(d); and meanwhile, the outlier magnitudes in Figure 2(c) are large enough to affect the accuracy heavily in contrast to the cases in Figures 2(a) and 2(b) where outliers are less prominent.

### 3.3 Stable Recovery with Sparse Outliers and Dense Noise

We now examine the performance of median-TWF and median-RWF in the presence of both sparse outliers and dense bounded noise. The entries of the dense bounded noise term are generated independently from . The entries of the sparse outlier are then generated as independently. Figure 3(a) and Figure 3(b) depict the relative error with respect to the iteration count , when and respectively. In the presence of sparse outliers, it can be seen that both median-TWF and median-RWF clearly outperforms TWF under the same situation, and acts as if the outliers do not exist by achieving almost the same accuracy as TWF without outliers. Moreover, the relative error of the reconstruction using median-TWF or median-RWF has 10 times gain from Figure 3(a) to Figure 3(b) as shrinks by a factor of , which corroborates Theorem 2 nicely. Furthermore, it can be seen that median-RWF converges faster than the other algorithms, due to the improved curvature of using low-order objectives, corroborating the result in . On the other hand, median-TWF returns more accurate estimates, due to employing more delicate truncation rules that may help reduce the noise.

Finally, we consider the case when the measurements are corrupted by both Poisson noise and outliers, modeling photon detection in optical imaging applications. We generate each measurement as , for , which is then corrupted with probability by outliers. The entries of the outlier are obtained by first generating independently, and then rounding it to the nearest integer. Figure 4 depicts the relative error with respect to the iteration count , where median-TWF and median-RWF under both outliers and Poisson noise have almost the same accuracy as, if not better than, TWF under only the Poisson noise. Figure 4: The relative error with respect to the iteration count for median-TWF, median-RWF and TWF with both Poisson noise and sparse outliers, and TWF with only Poisson noise.

## 4 Preliminaries and Proof Roadmap

Broadly speaking, the proofs for median-TWF and median-RWF follow the same roadmap. The crux is to use the statistical properties of the median to show that the median-truncated gradients satisfy the so-called Regularity Condition , which guarantees the linear convergence of the update rule, provided the initialization provably lands in a small neighborhood of the true signal.

We first develop a few statistical properties of median that will be useful throughout our analysis in Section 4.1. Section 4.2 analyzes the initialization that is used in both algorithms. We then state the definition of Regularity Condition in Section 4.3 and explain how it leads to the linear convergence rate. We provide separate detailed proofs for two algorithms in Section 5 and Section 6, respectively, because they involve different bounding techniques that may be of independent interest due to different loss functions.

### 4.1 Properties of Median

We start by the definitions of the quantile of a population distribution and its sample version.

###### Definition 1 (Generalized quantile function).

Let

. For a cumulative distribution function (CDF)

, the generalized quantile function is defined as

 F−1(p)=inf{x∈R:F(x)≥p}. (27)

For simplicity, denote as the -quantile of . Moreover for a sample sequence , the sample -quantile means , where is the empirical distribution of the samples .

###### Remark 1.

We note that the median , and we use both notations interchangeably.

Next, we show that as long as the sample size is large enough, the sample quantile concentrates around the population quantile (motivated from ), as in Lemma 1.

###### Lemma 1.

Suppose is cumulative distribution function (i.e., non-decreasing and right-continuous) with continuous density function . Assume the samples are i.i.d. drawn from . Let . If for all in }, then

 |θp({Xi}mi=1)−θp(F)|<ϵ (28)

holds with probability at least .

###### Proof.

See Appendix A.1. ∎

Lemma 2 bounds the distance between the median of two sequences.

###### Lemma 2.

Given a vector , reorder the entries in a non-decreasing manner

 X(1)≤X(2)≤...≤X(n−1)≤X(n).

Given another vector , then

 |X(k)−Y(k)|≤∥\boldmath{X}−Y∥∞, (29)

holds for all

###### Proof.

See Appendix A.2. ∎

Lemma 3, as a key robustness property of median, suggests that in the presence of outliers, one can bound the sample median from both sides by neighboring quantiles of the corresponding clean samples.

###### Lemma 3.

Consider clean samples . If a fraction () of them are corrupted by outliers, one obtains contaminated samples which contain corrupted samples and clean samples. Then for a quantile such that , we have

 θp−s({~Xi})≤θp({Xi})≤θp+s({~Xi}).
###### Proof.

See Appendix A.3. ∎

Finally, Lemma 4

is related to bound the value of the median, as well as the density at the median for the product of two possibly correlated standard Gaussian random variables.

###### Lemma 4.

Let which can be correlated with the correlation coefficient . Let , and represent the density of . Denote as the median of , and the value of at the median as . Then for all ,

 0.348<θ1/2(ψρ)<0.455, 0.47<ψρ(θ1/2)<0.76.
###### Proof.

See Appendix A.4. ∎

### 4.2 Robust Initialization with Outliers

Considering the model that the measurements are corrupted by both bounded noise and sparse outliers given by (11), we show that the initialization provided by the median-truncated spectral method in (14) is close enough to the ground truth, i.e., .

###### Proposition 2.

Fix and , and consider the model given by (11). Suppose that for some sufficiently small constant and that for some sufficiently small constant . With probability at least , the initialization given by the median-truncated spectral method obeys888Notation or means that there exists a constant such that .

 dist(z(0),x)≤δ∥x∥, (30)

provided that for some constant .

See Appendix B.

### 4.3 Regularity Condition

Once the initialization is guaranteed to be within a small neighborhood of the ground truth, we only need to show that the truncated gradient (16) and (20) satisfy the Regularity Condition () [9, 13], which guarantees the geometric convergence of median-TWF/median-RWF once the initialization lands into this neighborhood.

###### Definition 2.

The gradient is said to satisfy the Regularity Condition if

 ⟨∇ℓ(z),z−x⟩≥μ2∥∇ℓ(z)∥2+λ2∥z−x∥2 (31)

for all obeying .

The above guarantees that the gradient descent update converges to the true signal geometrically  if . We repeat this argument below for completeness.

 dist2(z−μ∇ℓ(z),x) ≤∥z−μ∇ℓ(z)−x∥2 ≤∥z−x∥2+∥μ∇ℓ(z)∥2−μ2∥∇ℓ(z)∥2−μλ∥z−x∥2 =(1−μλ)dist2(z,x).

## 5 Proofs for Median-TWF

We first show that in (16) satisfies the for the noise-free case in Section 5.1, and then extend it to the model with only sparse outliers in Section 5.2, thus together with Proposition 2 establishing the global convergence of median-TWF in both cases. Section 5.3 proves Theorem 2 in the presence of both sparse outliers and dense bounded noise.

### 5.1 Proof of Proposition 1

We consider the noise-free model. The central step to establish the is to show that the sample median used in the truncation rule of median-TWF concentrates at the level as stated in the following proposition.

###### Proposition 3.

If , then with probability at least ,

 0.6∥z∥∥z−x∥≤θ0.49,θ0.5,θ0.51({∣∣|aTix|2−|aTiz|2∣∣}mi=1)≤∥z∥∥z−x∥, (32)

holds for all satisfying .

###### Proof.

Detailed proof is provided in Appendix C.1. ∎

We note that a similar property for the sample mean has been shown in  as long as the number of measurements is on the order of . In fact, the sample median is much more challenging to bound due to its non-linearity, which also causes slightly more measurements compared to the sample mean.

Then we can establish that is lower bounded on the order of , as in Proposition 4, and that is upper bounded on the order of , as in Proposition 5.

###### Proposition 4 (Adapted version of Proposition 2 of ).

Consider the noise-free case for , and any fixed constant . Under the condition (17), if , then with probability at least ,

 ⟨∇ℓtr(z),z−x⟩≥{1.99−2(ζ1+ζ2)−√8/πα−1h−ϵ}∥z−x∥2 (33)

holds uniformly over all satisfying

 ∥z−x∥∥z∥≤min{111,αlαh,αl6,√98/3(αl)22αu+αl}, (34)

where are some universal constants, and and are defined in (17).

The proof of Proposition 4 adapts the proof of Proposition 2 of , by properly setting parameters based on the properties of sample median. For completeness, we include a short outline of the proof in Appendix C.2.

###### Proposition 5 (Lemma 7 of ).

Under the same condition as in Proposition 4, if , then there exist some constants such that with probability at least ,

 ∥∇ℓtr(z)∥≤(1+δ)⋅2√1.02+2/αh∥z−x∥ (35)

holds uniformly over all satisfying

 ∥z−x∥∥z∥≤min{111,αlαh,αl6,√98/3(αl)22αu+αl}, (36)

where can be arbitrarily small as long as sufficiently large, and and are given in (17).

###### Proof.

See the proof of Lemma 7 in . ∎

With these two propositions and (17), is guaranteed by setting

 μ<μ0:=(1.99−2(ζ1+ζ2)−√8/πα−1h2(1+δ)2⋅(1.02+2/αh), λ+μ⋅4(1+δ)2⋅(1.02+2/αh)<2{1.99−2(ζ1+ζ2)−√8/πα−1h−ϵ}.

### 5.2 Proof of Theorem 1

We next consider the model (11) with only sparse outliers. It suffices to show that continues to satisfy the . The critical step is to bound the sample median of the corrupted measurements. Lemma 3 yields

 θ12−s({|(aTix)2−(aTiz)2|})≤θ