# Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization

This paper concerns the problem of recovering an unknown but structured signal x ∈ R^n from m quadratic measurements of the form y_r=|<a_r,x>|^2 for r=1,2,...,m. We focus on the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal (m<<n). We formulate the recovery problem as a nonconvex optimization problem where prior structural information about the signal is enforced through constrains on the optimization variables. We prove that projected gradient descent, when initialized in a neighborhood of the desired signal, converges to the unknown signal at a linear rate. These results hold for any constraint set (convex or nonconvex) providing convergence guarantees to the global optimum even when the objective function and constraint set is nonconvex. Furthermore, these results hold with a number of measurements that is only a constant factor away from the minimal number of measurements required to uniquely identify the unknown signal. Our results provide the first provably tractable algorithm for this data-poor regime, breaking local sample complexity barriers that have emerged in recent literature. In a companion paper we demonstrate favorable properties for the optimization problem that may enable similar results to continue to hold more globally (over the entire ambient space). Collectively these two papers utilize and develop powerful tools for uniform convergence of empirical processes that may have broader implications for rigorous understanding of constrained nonconvex optimization heuristics. The mathematical results in this paper also pave the way for a new generation of data-driven phase-less imaging systems that can utilize prior information to significantly reduce acquisition time and enhance image reconstruction, enabling nano-scale imaging at unprecedented speeds and resolutions.

## Authors

• 36 publications
• ### Reshaped Wirtinger Flow and Incremental Algorithm for Solving Quadratic System of Equations

We study the phase retrieval problem, which solves quadratic system of e...
05/25/2016 ∙ by Huishuai Zhang, et al. ∙ 0

• ### Non-Convex Structured Phase Retrieval

Phase retrieval (PR), also sometimes referred to as quadratic sensing, i...
06/23/2020 ∙ by Namrata Vaswani, et al. ∙ 0

• ### Robust Wirtinger Flow for Phase Retrieval with Arbitrary Corruption

We consider the phase retrieval problem of recovering the unknown signal...
04/20/2017 ∙ by Jinghui Chen, et al. ∙ 0

• ### Signal Recovery on Incoherent Manifolds

Suppose that we observe noisy linear measurements of an unknown signal t...
02/08/2012 ∙ by Chinmay Hegde, et al. ∙ 0

• ### On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems

We consider the problem of recovering a complex vector x∈C^n from m quad...
02/04/2020 ∙ by Parth Thaker, et al. ∙ 5

• ### A Nonconvex Framework for Structured Dynamic Covariance Recovery

We propose a flexible yet interpretable model for high-dimensional data ...
11/11/2020 ∙ by Katherine Tsai, et al. ∙ 0

• ### Reconstructing Point Sets from Distance Distributions

We study the problem of reconstructing the locations u_i of a set of po...
04/06/2018 ∙ by Shuai Huang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Signal reconstruction from quadratic measurements is at the heart of many applications in signal and image processing. In this problem we acquire quadratic measurements of the form

 yr=|⟨ar,x⟩|2,r=1,2,…,m, (1.1)

from an unknown, structured signal . Here,

are known sampling vectors and

are observed measurements. Such quadratic signal recovery problems are of interest in a variety of domains ranging from combinatorial optimization, to wireless communications and imaging. Focusing on signal processing applications, recovering a signal from measurements of the form (

1.1) is usually referred to as the generalized phase retrieval problem. The connection with phase retrieval is due to the fact that optical detectors, specially at small wavelengths, can often only record the intensity of the light field and not its phase. Indeed, the acquired measurements in many popular coherent diffraction imaging systems such as those based on Ptychography or phase from defocus are of the form (1.1), with corresponding to the object of interest, modulated sinusoids, and the recorded data.

Given the ubiquity of the generalized phase retrieval problem in signal processing, over the years many heuristics have been developed for its solution. On the one hand, invention of new X-ray sources and new experimental setups that enable recording and reconstruction of non-crystalline objects has caused a major revival in the use of phase retrieval techniques in imaging. On the other hand, the last five years has also witnessed tremendous progress in terms of providing rigorous mathematical guarantees for the performance of some classical heuristics such as alternating minimization [58, 75] as well as newer ones based on semidefinite programing [17] and Wirtinger flows [16] and its variants [19, 80, 81, 14, 77]. We shall review all these algorithms and mathematical results in greater detail in Section 6. These results essentially demonstrate that a signal of dimension can be recovered efficiently and reliably from the order of generic quadratic measurements of the form (1.1).

The recent surge of applied and theoretical activity regarding phaseless imaging is in part driven by the hope that it will eventually lead to successful imaging of large protein complexes and biological specimens enabling live imaging of bio-chemical activities at the molecular level. Furthermore, phaseless imaging techniques increasingly play a crucial role in emerging national security applications aimed at monitoring electronic products that are intended for military or infrastructure use so as to ensure these products do not contain secret backdoors granting foreign governments cyber access to vital US infrastructure. Despite the incredible progress discussed earlier on both applied and mathematical fronts, major challenges impede the use of such techniques to these emerging domains. One major challenge is that acquiring measurements of large specimens at high resolutions (corresponding to very short wavelengths) require time consuming and expensive measurements. To be concrete, the most modern phase-less imaging setups require image acquisition times exceeding days for imaging a micrometer micrometer specimen at nm resolution!

To overcome these challenges, in this paper we aim to utilize a-priori structural information available about the signal to reduce the required number of quadratic measurements of the form (1.1). Indeed, in the application domains discussed above there is a lot of a-priori knowledge available that can be utilized to reduce acquisition time and enhance image reconstruction. For example, images of electronic chips are extremely structured e.g. piecewise constant and often projections of 3D rectilinear models. While, historically various a-priori information such as non-negativty has been used to enhance image reconstruction, such simple forms of structural information are often not sufficient. Complicating the matter further our mathematical understanding of how well even simple forms of a-priori information can enhance reconstruction is far from complete. To be concrete, assume we know the signal of interest is sparse e.g. it has at most nonzero entries. In this case for known tractable algorithms to yield accurate solutions the number of generic measurements must exceed with a constant [60]

. This is surprising, as the degrees of freedom of an

-sparse vector is of the order and based on the compressive sensing literature one expects to be able to recover the signal from the order of generic quadratic measurements. In fact, it is known that on the order of generic quadratic measurements uniquely specify the unknown signal up to a global phase factor. However, it is not known whether a tractable algorithm can recover the signal from such minimal number of generic quadratic measurements.

The above example demonstrates a significant gap in our ability to utilize prior structural assumptions in phase retrieval problems so as to reduce the required number of measurements or sample complexity. This is not an isolated example, and such gaps hold more generally for a variety of problems and structures (see [60, 2, 22, 23, 21] for more details on related gaps). The emergence of such sample complexity “barriers” is quite surprising as in many cases there is no tractable algorithm known to close this gap. In fact, for some problems such as sparse PCA it is known that closing this gap via a computationally tractable approach will yield tractable algorithms for notoriously difficult problems such as planted clique [9].

## 2 Minimizing (non)convex objectives with (non)convex constraints

We wish to discern an unknown but “structured” signal from quadratic measurements of the form , for . However, in the applications of interest typically the number of equations is significantly smaller than the number of variables so that there are infinitely many solutions obeying the quadratic constraints. However, it may still be possible to recover the signal by exploiting knowledge of its structure. To this aim, let be a cost function that reflects some notion of “complexity” of the “structured” solution. It is then natural to use the following optimization problem to recover the signal.

 minimizez∈CnL(z):=1mm∑r=1ℓ(√yr,|a∗rz|)subject toR(z)≤R(x). (2.1)

Here,

is a loss function measuring the misfit between the measurements

and the data model and is a regularization function that reflects known prior knowledge about the signal. A natural approach to solve this problem is via projected gradient type updates of the form

 zτ+1=PK(zτ−μτ∇L(zτ)). (2.2)

Here, is the Wirtinger derivative of (see [16, Section 6] for details) and denotes the projection of onto the set

 K={w∈Cn:R(w)≤R(x)}. (2.3)

Following [16], we shall refer to this iterative procedure as the Projected Wirtinger Flow (PWF) algorithm.

A-priori it is completely unclear why the iterative updates (2.2) should converge as not only the loss function may be nonconvex but also the regularization function! Efficient signal reconstruction from nonlinear measurements in this high-dimensional setting poses new challenges:

• When are the iterates able to escape local optima and saddle points and converge to global optima?

• How many measurements do we need? Can we break through the barriers faced by convex relaxations?

• How does the number of measurements depend on the a-priori prior knowledge available about the signal? What regularizer is best suited to utilizing a particular form of prior knowledge?

• How many passes (or iterations) of the algorithm is required to get to an accurate solution?

At the heart of answering these questions is the ability to predict convergence behavior/rate of (non)convex constrained optimization algorithms.

## 3 Precise measures for statistical resources

Throughout the rest of the paper we assume that the signal and the measurement vectors are all real-valued. For sake of brevity we have focused our attention to this real-valued case. However, we note that all of our definitions/results trivially extend to the complex case. We wish to characterize the rates of convergence for the projected gradient updates (2.2) as a function of the number of samples, the available prior knowledge and the choice of the regularizer. To make these connections precise and quantitative we need a few definitions. Naturally the required number of samples for reliable signal reconstruction depends on how well the regularization function can capture the properties of the unknown signal . For example, if we know our unknown parameter is approximately sparse naturally using an norm for the regularizer is superior to using an regularizer. To quantify this capability we first need a couple of standard definitions which we adapt from [61, 62].

###### Definition 3.1 (Descent set and cone)

The set of descent of a function at a point is defined as

 DR(x)={h: R(x+h)≤R(x)}.

The cone of descent is defined as a closed cone that contains the descent set, i.e. . The tangent cone is the conic hull of the descent set. That is, the smallest closed cone obeying .

We note that the capability of the regularizer in capturing the properties of the unknown signal depends on the size of the descent cone . The smaller this cone is the more suited the function is at capturing the properties of . To quantify the size of this set we shall use the notion of mean width.

###### Definition 3.2 (Gaussian width)

The Gaussian width of a set is defined as:

 ω(C):=Eg[supz∈C ⟨g,z⟩],

where the expectation is taken over . Throughout we use to denote the the unit ball/sphere of .

We now have all the definitions in place to quantify the capability of the function in capturing the properties of the unknown parameter . This naturally leads us to the definition of the minimum required number of samples.

###### Definition 3.3 (minimal number of samples)

Let be a cone of descent of at . We define the minimal sample function as

 M(R,x)=ω2(CR(x)∩Bn).

We shall often use the short hand with the dependence on implied.

We note that is exactly the minimum number of samples required for structured signal recovery from linear measurements when using convex regularizers [18, 1]. Specifically, the optimization problem

 m∑r=1(yr−⟨ar,x⟩)2subject toR(z)≤R(x), (3.1)

succeeds at recovering the unknown signal

with high probability from

measurements of the form if and only if .111We would like to note that only approximately characterizes the minimum number of samples required. A more precise characterization is where . However, since our results have unspecified constants we avoid this more accurate characterization. While this result is only known to be true for convex regularization functions we believe that also characterizes the minimal number of samples even for nonconvex regularizers in (3.1). See [61] for some results in the nonconvex case as well as the role this quantity plays in the computational complexity of projected gradient schemes for linear inverse problems. Given that in phase-less imaging we have less information (we loose the phase of the linear measurements) we can not hope to recover structured signals from when using (2.1). Therefore, we can use as a lower-bound on the minimum number of measurements required for projected gradient descent iterations (2.2) to succeed in recovering the signal of interest.

## 4 Nonconvex regularization examples

Next we provide two examples of (non)convex regularizers which are of interest in phase-less imaging applications. These two simple examples are meant to highlight the importance of nonconvex regularizers in imaging. However, our theoretical framework is by no means limited to these simple examples and can deal with significantly more complicated nonconvex regularizers capturing much more nuanced forms of prior structure.

• piecewise constant structure. One a-priori structural information available in many phase-less imaging applications is that images tend to be piecewise constant. For example, contiguous parts of biological specimens are made up of the same tissue and exhibit the same behavior under electro-magnetic radiations. Similarly, images of electronic chips are often projections of piecewise constant, 3D rectilinear models. Let denote a image consisting of an array of pixels. A popular approache for exploiting piecewise-constant structures is to use total variation regularization functions. Two common choices are the isotropic and anisotropic totoal variation regularizations defined as

 Riso(z)= ∑i,j(√∣∣zi+1,j−zi,j∣∣2+∣∣zi,j+1−zi,j∣∣2)p, Rani(z)= ∑i,j∣∣zi+1,j−zi,j∣∣p+∣∣zi,j+1−zi,j∣∣p.

When , these regularization functions are convex. However, for many image reconstruction tasks total variation regularization with , despite being nonconvex, is significantly more effective at capturing piece-wise constant structure.

• discrete values. Another form of a-priori knowledge that is sometimes available in imaging applications is possible discrete values the image pixels can take. For example, in imaging electronic chips the metallic composition of the different parts pre-determines the possible discrete values and are known in advance. Let denote a image consisting of an array of pixels and assume the possible discrete values are . A natural regularization in this case is

This regularization is convenient as projection onto its sub-level sets is easy and amounts to replacing each entry of the input vector/matrix with the closest discrete value from (a.k.a. hard thresholding). Of course this is not the only regularization function that can enforce discrete structures and in practice “soft thresholding” variants may be more effective. Our framework can be used to analyze many such variants. Indeed, an interesting aspect of our results is that it allows us to understand what regularizer is best suited at enforcing a particular form of prior structure.

## 5 Theoretical results for Projected Wirtinger Flows

In this section we shall explain our main theoretical results. To this aim we need to define the distance to the solution set.

###### Definition 5.1

Let be any solution to the quadratic system (the signal we wish to recover). For each , define

 \emphdist(z,x)=min(∥z−x∥ℓ2,∥z+x∥ℓ2).

As we mentioned earlier we are interested in recovering structured signal recovery problems from quadratic measurements via the optimization problem (2.1). Naturally the convergence/lack of convergence as well as the rate of convergence of projected Wirtinger Flow iterates (2.2) depends on the loss function . We now discuss our theoretical results for two different loss functions.

### 5.1 Intensity-based Wirtinger Flows with convex regularizers

Our first result focuses on a quadratic loss function applied to intensity measurements, i.e.  in (2.1). In this case, the optimization problem takes the form

 minimizez∈RnLI(z):=14mm∑r=1(yr−|a∗rz|2)2subject toR(z)≤R(x). (5.1)

Our first result studies the effectiveness of projected Wirtinger Flows on this objective.

###### Theorem 5.2

Let be an arbitrary vector and be a proper convex function. Suppose is a Gaussian map and let be

, start from a point obeying

 (5.2)

and apply the Projected Wirtinger Flow (PWF) updates

 zτ+1=PK(zτ−μτ∇L(zτ)), (5.3)

with . Also set the learning parameter sequence as and 222We note that can be trivially estimated from the measurements as and our proofs are robust to this misspecification. We avoid stating this variant for ease of reading. for all and assume for some fixed numerical constant . Furthermore, let , defined by 3.3, be our lower bound on the number of measurements. Also assume

 m>cm0logn, (5.4)

holds for a fixed numerical constant . Then there is an event of probability at least such that on this event starting from any initial point obeying (5.2) the update (5.3) satisfy

 \emphdist(zτ,x)≤(1−μ125)τ2\emphdist(z0,x). (5.5)

As mentioned earlier is the minimal number of measurements required to recover a structured signal from linear measurements. also serves as a lower bound on structured signal recovery from quadratic measurements as they are even less informative (we loose sign information). Theorem 5.2 shows that PWF applied to quadratic loss using intensity measurements can (locally) reconstruct the signal with this minimal sample complexity (up to a constant and log factor). To be concrete consider the case where the unknown signal is known to be sparse and we use as the regularizer in (5.1). In this case it is known that and Theorem 5.2 predicts that an -sparse signal can (locally) be recovered from the order of measurements. This breaks through well-known barriers that have emerged for this problem in recent literature. Indeed, for known tractable convex relaxation schemes to yield accurate solutions the number of generic measurements must exceed with a constant [54, 60]. We also note that even recent nonconvex approaches such as [14, 78] have also not succeeded at breaking through this barrier even when an initialization obeying (5.2) is available.

The convergence guarantees provided above hold as long as PWF is initialized per (5.2) in a neighborhood of the unknown signal with relative error less than a constant. In this paper we are concerned only with local convergence properties of PWF and therefore do not provide an explicit construction for such an initialization. However, in a companion paper we demonstrate that the optimization problem (5.1) has certain favorable characteristics that may allow global convergence guarantees from any initialization using second order methods.

Another interesting aspect of the above result is that the rate of convergence is geometric. Specifically, to achieve a relative error of (), the required number of iterations is . Note that the cost of each iteration depends on applying the matrix and its transpose which has computational complexity on the order of . This is assuming that the projection has negligible cost compared to the cost of applying . This is the case for example for sparse signals when using the regularizer . Therefore, in these cases to achieve a relative error of the total computational complexity of PWF is on the order of .

Let us now discuss some ways in which this theorem is sub-optimal. Even though this theorem breaks through known sample complexity barriers, a natural question is whether it is possible to remove the log factor so as to have a sample complexity that is only a constant factor away from the minimum sample complexity of structured signal recovery from linear measurements. Another way in which the algorithm is sub-optimal is computational complexity. While the rate of convergence of PWF stated above is geometric, it is not linear. With a linear rate of convergence to achieve a relative error of the total computational complexity would be on the order of which is a factor of smaller than the guarantees provided by PWF. In the next section we will show how to close these gaps in sample complexity and computational complexity by using a different loss function in (5.1). Finally, a major draw back of Theorem 5.2 it that it only applies to convex regularizers. In the next section we will show how to also remove this assumption so as to allow arbitrary nonconvex regularizers.

### 5.2 Amplitude-based Wirtinger Flows with (non)convex regularizers

Our second result focuses on a quadratic loss function applied to amplitude measurements, i.e.  in (2.1). In this case, the optimization problem takes the form

 minimizez∈RnLA(z):=12mm∑r=1(√yr−|a∗rz|)2subject toR(z)≤R(x). (5.6)

One challenging aspect of the above loss function is that it is not differentiable and it is not clear how to run projected gradient descent. However, this does not pose a fundamental challenge as the loss function is differentiable except for isolated points and we can use the notion of generalized gradients to define the gradient at a non-differentiable point as one of the limits points of the gradient in a local neighborhood of the non-differentiable point. For the loss in (5.6) the generalized gradient takes the form

 ∇LA(z):=1mm∑r=1(|a∗rz|−√yr)sgn(a∗rz)ar. (5.7)
###### Theorem 5.3

Let be an arbitrary vector and be a proper function (convex or nonconvex). Suppose is a Gaussian map and let be quadratic measurements. To estimate , start from a point obeying

 (5.8)

and apply the Projected Wirtinger Flow (PWF) updates

 zτ+1=PK(zτ−μτ∇LA(zτ)), (5.9)

with and defined via (5.7). Also set the learning parameter sequence and for all . Furthermore, let , defined by 3.3, be our lower bound on the number of measurements. Also assume

 m>cm0, (5.10)

holds for a fixed numerical constant . Then there is an event of probability at least such that on this event starting from any initial point obeying (5.8) the update (5.9) satisfy

 \emphdist(zτ,x)≤(23)τ\emphdist(z0,x). (5.11)

Here is a fixed numerical constant.

The first interesting and perhaps surprising aspect of this result is its generality: it applies not only to convex regularization functions but also nonconvex ones! As we mentioned earlier the optimization problem in (2.1) is not known to be tractable even for convex regularizers. Despite the nonconvexity of both the objective and regularizer, the theorem above shows that with a near minimal number of measurements, projected gradient descent provably converges to the original signal without getting trapped in any local optima.

The amplitude-based loss also has stronger sample complexity and computational complexity guarantees compared with the intensity-based version. Indeed, the required number of measurements improves upon the intensity-based loss by a logarithmic factor, achieving a near optimal sample complexity for this problem (up to a constant factor). Also, the convergence rate of the amplitude-based approach is now linear. Therefore, to achieve a relative error of the total number of iterations is on the order of . Thus the overall computational complexity is on the order of (in general the cost is the total number of iterations multiplied by the cost of applying the measurement matrix and its transpose). As a result, the computational complexity is also now optimal in terms of dependence on the matrix dimensions. Indeed, for a dense matrix even verifying that a good solution has been achieved requires one matrix-vector multiplication which takes time.

We now pause to discuss the choice of the loss function. The theoretical results above suggests that the least squares loss on amplitude values is superior to one on intensity values in terms of both sample and computational complexity. Such improved performance has also been observed empirically for more realistic models in optics [79]. Indeed, [79] shows that not only the amplitude-based least squares has faster convergence rates but also is more robust to noise and model misspecification. However, we would like to point out that the least squares objective on intensity values does have certain advantages. For instance, it is possible to do exact line search (in closed form) on this objective. We have observed that this approach works rather well in some practical domains (e.g. ptychography for chip imaging) without the need for any tuning as the step size in each iteration is calculated in closed form via exact line search. Therefore, we would like to caution against rushed judgments declaring one variant of Wirtinger Flow superior to another due to minor (e.g. logarithmic) theoretical improvements in sample complexity and or computational complexity.333Unfortunately such premature declarations have become exceedingly common in recent literature. We would like to emphasize that there is no “best” or “correct” loss function that works better than others for all application domains. Ultimately, the choice of the loss function is dictated by the statistics of the noise or misspecification present in a particular domain.

## 6 Discussions and prior art

Phase retrieval is a century old problem and many heuristics have been developed for its solution. For a partial review of some of these heuristics as well as some recent theoretical advances in related problems we refer the reader to the overview articles/chapters [68, 56, 40] and [69, Part II] as well as [15, Section 1.6], and references therein such as [13, 59, 7, 76, 43, 31, 32, 50, 51]. There has also been a surge of activity surrounding nonconvex optimization problems in the last few years. While discussing all of these results is beyond the scope of this paper we shall briefly discuss some of the most relevant and recent literature in the coming paragraphs. We refer the reader to [71] and references therein [12, 47, 45, 46, 33] for a more comprehensive review of such results. We also refer the reader to[35, 29, 5]

for recent algorithmic approaches based on linear programs and

[52] for characterizing large systems limits of dynamics of phase retrieval algorithms.

The Wirtinger Flow algorithms for solving quadratic systems of equations was introduce in [16]. [16] also provides a local convergence analysis when no prior structural assumption is available about the signal. The analysis of [16] was based on the so called regularity condition. This regularity condition and closely related notions have been utilized/generalized in a variety of interesting ways to provide rigorous convergence guarantees for related nonconvex problems arising in diverse applications ranging from matrix completion to dictionary learning and blind deconvolution [6, 72, 66, 3, 73, 19, 14, 83, 82, 20, 80, 70, 84, 81, 77, 64, 53]. The intensity-based results presented in Section 5.1 are based on a generalization of the regularity condition in [16] so as to allow arbitrary convex constraints.

The second set of results we presented in this paper were based on least squares fit of the amplitudes. This objective function has been historically used in phase retrieval applications [27] and has close connections with the classical Fienup algorithm [69, Chapter 13]. Focusing on more recent literature, [79] demonstrated the effectiveness of this approach in optical applications. More recently, a few interesting publications [81, 77] study variants of this loss function and develop guarantees for its convergence. The analysis presented in both of these papers are also based on variants of the regularity condition of [16] and do not utilize any structural assumptions. In this paper we have analyzed the performance of the amplitude-based PWF with any constraint (convex or nonconvex). These results are based on a new approach to analyzing nonconvex optimization problems that differs from the regularity approach used in [16] and all of the papers mentioned above. Rather, this new technique follows a more direct route, utilizing/developing powerful concentration inequalities to directly show the error between the iterates and the structured signal decreases at each iteration.

A more recent line of research aims to provide a more general understanding of the geometric landscape of nonconvex optimization problems by showing that in many problems there are no spurious local minmizers and saddles points have favorable properties [70, 28, 10, 11]. A major advantage of such results is that they do not required specialized initializations in the sense that trust region-type algorithms or noisy stochastic methods are often guaranteed to converge from a random initialization and not just when an initial solution is available in a local neighborhood of the optimal solution. The disadvantage of such results is that the guaranteed rates of convergence of these approaches are either not linear/geometric or each iteration is very costly. These approaches also have slightly looser sample complexity bounds. Perhaps the most relevant result of this kind to this paper is the interesting work of Sun, Qu and Wright [70] which studies the geometric landscape of the objective (5.1) in the absence of any regularizer. The authors also show that a certain trust region algorithm achieves a relative error of after as long as the number of samples exceeds . As mentioned previously, using different proof techniques in a companion paper we demonstrate a result of a similar flavor to [70] for the constrained problem (5.1). This results shows that with measurements all local optima are global and a second order scheme recovers the global optima (the unknown signal) in a polynomial number of iterations.

We now pause to caution against erroneous miss-interpretations of the theoretical results discussed in the previous paragraph:

• There are no spurious local optima i.e. all local optima are global in phase retrieval applications

• Initialization is irrelevant in phase retrieval applications

The reason these conclusions are inaccurate are two-fold. First, while the results of the previous paragraph and Theorem 5.2 both require on the order of samples the multiplicative constants in these results tend to be drastically different in practice. Second, the measurement vectors occurring in practical domains are substantially more ill-conditioned than the Gaussian measurements studied in this paper. This further amplifies the gap between the sample complexity of local versus global results. Indeed, in many practical domains where phase retrieval is applied local optima are abound and a major source of algorithmic stagnation. Therefore, carefully crafted initialization schemes or regularization methods are crucial for the convergence of local search heuristics in many phase-less imaging domains.

We would also like to mention prior work on sparse phase retrieval. For generic measurements such as the Gaussian distribution studied in this paper,

[54] provides guarantees for the convex relaxation-based PhaseLift algorithm as long as the number of samples exceed where is the number of non-zeros in the sparse signal. The papers [54, 60] showed that these results are essentially unimprovable when using simple SDP relaxations. More recently, interesting work by Cai, Li and Ma [14] studies the performance of Wirtinger Flow based schemes for sparse phase retrieval problems. This result also requires measurements even when an initialization obeying (5.2) is available. Therefore, this results also does not breakthrough the local barrier. More recently, there are a few publications aimed at going below measurements. These results differ from ours in that they are either applicable to specific designs which tailor the algorithm to the measurement process [4, 65, 48, 36, 37] or require additional constraints on the coefficient of the sparse signal [78, 34, 49]. In contrast to the above publications in this paper we have demonstrated that locally only samples suffice to recover any sparse signal from generic quadratic measurements formally breaking through the barrier. Furthermore, our results applies to any regularizer (convex or nonconvex), allowing us to enforce various forms of prior knowledge in our reconstuction.

Finally, we would like to mention that there has also been some recent publications aimed at developing theoretical guarantees for more practical models. For instance, the papers [15, 32, 67, 42, 41] develop theoretical guarantees for convex relaxation techniques for more realistic Fourier based models such as coded diffraction patterns and Ptychography. More recently, the papers [39, 38, 44, 26, 8] also develop some theoretical guarantees for faster but sometimes design-specific algorithms. Despite all of this interesting progress the known results for more realistic measurement models are far inferior to their Gaussian counterparts in terms of sample complexity, computational complexity or stability to noise. Closing these gaps is an interesting and important future direction.

## 7 Proofs

In the Gaussian model the measurement vectors also obey for all with probability at least . Thoughout the proofs, we assume we are on this event without explicitly mentioning it each time. Without loss of generality we will assume throughout the proofs that . We remind the reader that throughout is a solution to our quadratic equations, i.e. obeys and that the sampling vectors are independent from . We also remind the reader that for a set , is the mean width of per Definition 3.2. Throughout, we use to denote the unit sphere/unit ball of . We first discuss some common background and results used for proving both theorems. Since the proof of the two theorems follow substantially different paths we dedicate a subsection to each: Section 7.4 for proof of Theorem 5.2 and Section 7.5 for proof of Theorem 5.3.

As a reminder the intensity-based loss function is equal to

 LI(z):=14mm∑r=1(yr−|a∗rz|2)2,

 ∇LI(z)=1mm∑r=1(|a∗rz|2−yr)(a∗rz)ar.

As a reminder the amplitude-based loss function is equal to

 LA(z):=12mm∑r=1(√yr−|a∗rz|)2,

and the generalized gradient is equal to

 ∇LA(z)=1mm∑r=1(|a∗rz|−√yr)sgn(a∗rz)ar.

### 7.2 Concentration and bounds for stochastic processes

In this section we gather some useful results on concentration of stochastic processes which will be crucial in our proofs. We begin with a lemma which is a direct consequence of Gordon’s escape from the mesh lemma [30] whose proof is deferred to Appendix A.1.

###### Lemma 7.1

Assume is a cone and is the unit sphere of . Also assume that

 m≥max(20ω2(C∩Sn−1)δ2,12δ−1),

for a fixed numerical constant . Then for all

 ∣∣ ∣∣1mm∑r=1(a∗rh)2−∥h∥2ℓ2∣∣ ∣∣≤δ∥h∥2ℓ2,

holds with probability at least .

We also need a generalization of the above lemma stated below and proved in Appendix A.2.

###### Lemma 7.2

Assume is a cone (not necessarily convex) and is the unit sphere of . Also assume that

 m≥max(80ω2(C∩Sn−1)δ2,2δ−1),

for a fixed numerical constant . Then for all

 ∣∣ ∣∣1mm∑r=1(a∗ru)(a∗rh)−u∗h∣∣ ∣∣≤δ∥u∥ℓ2∥h∥ℓ2,

holds with probability at least .

We next state a generalization of Gordon’s escape through the mesh lemma, whose proof appears in Appendix A.3.

###### Lemma 7.3

Let be fixed vector with nonzero entries and construct the diagonal matrix . Also, let have i.i.d.  entries. Furthermore, assume and define

 bm(d)=E[∥Dg∥ℓ2],

where is distributed as . Define

 σ(T):=maxv∈T % ∥v∥ℓ2,

then for all

 ∣∣∥DAu∥ℓ2−bm(d)∥u∥ℓ2∣∣≤∥d∥ℓ∞ω(T)+η,

holds with probability at least

 1−6e−η28∥d∥2ℓ∞σ2(T).

The previous lemma leads to the following Corollary. We skip the proof as it is identical to how Lemma 7.1 is derived from Gordon’s lemma (See Section A.1 for details).

###### Corollary 7.4

Let be fixed vector with nonzero entries and assume . Furthermore, assume

 (m∑r=1d2r)≥max(20∥d∥2ℓ∞ω2(T)δ2,32δ−1).

Then for all ,

 ∣∣ ∣∣∑mr=1d2r(a∗ru)2∑mr=1d2r−∥u∥2ℓ2∣∣ ∣∣≤δ,

holds with probability at least .

The above generalization of Gordon’s lemma together with its corollary will be very useful in our proofs in particular it allows us to prove the following key result whose proof is also deferred to Appendix A.4.

###### Lemma 7.5

Assume is a cone and is the unit sphere of . Furthermore, let be a fixed vector. Also assume that

Then for all

 1mm∑r=1(a∗rh)2(a∗rx)2−(∥h∥2ℓ2∥x∥2ℓ2+2(h∗x)2)≤δ∥h∥2ℓ2∥x∥2ℓ2,

holds with probability at least with and fixed numerical constants.

We also need the following important lemma. The proof of this lemma is based on the paper [57]. Please also see [25] for related calculations. We defer the proof to Appendix A.5.

###### Lemma 7.6

Assume are cones and is the unit sphere of . Also assume

 m≥c⋅max(ω2(C∩Sn−1),ω2(C′∩Sn−1)),

for a fixed numerical constant . Then for any and

 ∣∣ ∣∣1mm∑r=1|u∗ara∗rv|−E[|u∗aa∗v|]∣∣ ∣∣≤δ∥u∥ℓ2∥v∥ℓ2,

holds with probability at least where is a fixed numerical constant. Here, is distributed as .

We also state a simple generalization of Lemma 7.6 above. This lemma has a near identical proof. We skip details for brevity.

###### Lemma 7.7

Assume are sets with diameters bounded by fixed numerical constants. Also assume

 m≥c⋅max(ω2(C),ω2(D)),

for a fixed numerical constant . Then for any and

 ∣∣ ∣∣1mm∑r=1|u∗ara∗rv|−E[|u∗aa∗v|]∣∣ ∣∣≤δ

holds with probability at least where is a fixed numerical constant. Here, is distributed as .

Finally, we also need the following lemma with the proof appearing in Appendix A.6.

###### Lemma 7.8

For any define . Then we have

 E[|u∗aa∗v|]=2π∥u∥ℓ2∥v∥ℓ2(sin(θ)+cos(θ)(π2−θ))≥2π∥u∥ℓ2∥v∥ℓ2.

### 7.3 Cone and projection identities

In this section we will gather a few results regarding higher dimensional cones and projections that are used throughout the proofs. These results are directly adapted from [61, 6.1]. We begin with a result about projections onto sets. The first part concerning projections onto convex sets is the well known contractivity result regarding convex projections.

###### Lemma 7.9

Assume is a closed set and . Then if is convex for every we have

 ∥PK(v)−u∥ℓ2≤∥v−u∥ℓ2. (7.1)

Furthermore, for any closed set (not necessarily convex) and for every we have

 ∥PK(v)−u∥ℓ2≤2∥v−u∥ℓ2. (7.2)

Proof Equation (7.1) is well known. We shall prove the second result. To this aim note that by definition of projection onto a set we have

 ∥v−PK(v)∥2ℓ2≤∥v−u∥2ℓ2. (7.3)

Also note that

 ∥v−PK(v)∥2ℓ2= ∥(v−u)−(PK(v)−u)∥2ℓ2, = ∥v−u∥2ℓ2+∥PK(v)−u∥2ℓ2−2⟨PK(v)−u,v−u⟩.

Combining the latter inequality with (7.3) and using the Cauchy-Schwarz inequality we have

 ∥PK(v)−u∥2ℓ2= ∥v−PK(v)∥2ℓ2−∥v−u∥2ℓ2+2⟨PK(v)−u,v−u⟩, ≤ 2⟨PK(v)−u,v−u⟩, ≤ 2∥PK(v)−u∥ℓ2∥v−u∥ℓ2.

Dividing both sides of the above inequality by concludes the proof.

We now state a result concerning projection onto cones.

###### Lemma 7.10

Let be a closed cone and . The followings two identities hold

 ∥v∥2ℓ2= ∥v−PC(v)∥2ℓ2+∥PC(v)∥2ℓ2, (7.4) ∥PC(v)∥ℓ2= supu∈C∩Bnu∗v. (7.5)

The following lemma is straightforward and follows from the fact that translation preserves distances.

###### Lemma 7.11

Suppose is a closed set. The projection onto obeys

 PK(x+v)−x=PK−{x}(v).

The next lemma compares the length of a projection onto a set to the length of projection onto the conic approximation of the set.

###### Lemma 7.12 (Comparison of projections)

Let be a closed and nonempty set that contains . Let be a nonempty and closed cone containing (). Then for all ,

 ∥PD(v)∥ℓ2≤2∥PC(v)∥ℓ2 (7.6)

Furthermore, assume is a convex set. Then for all ,

 ∥PD(v)∥ℓ2≤∥PC(v)∥ℓ2. (7.7)

### 7.4 Convergence analysis for intensity-based Wirtinger Flows

In this section we shall prove Theorem 5.2. The proof of this result is based on an extension of the framework developed in [16]. Therefore, the outline of our exposition closely follows that of [16]. Section 7.4.1 discusses our general convergence analysis and shows that it follows from a certain Regularity Condition (RC). In this section we also show that the regularity condition can be proven by showing two sufficient Local Curvature and Local Smoothness conditions denoted by LCC and LSC. We then prove the Local Curvature condition in Section 7.4.2 and the Local Smoothness condition in Section 7.4.3.

#### 7.4.1 General convergence analysis

Note that (5.2) guarantees that either or is small. Throughout the proof without loss of generality we assume is the smaller one. To introduce our general convergence analysis we begin by defining

 E(ϵ)={z∈Rn:R(z)≤R(x), ∥z−x∥ℓ2≤ϵ}.

Note that when condition (5.2) holds the next iterate obeys with . The reason is that when the regularizer is convex so is the set and by contractivity of projection onto convex sets (Lemma 7.9) we also have

 ∥z1−x∥ℓ2=∥PK(z0)−x∥ℓ2≤∥z0−x∥ℓ2≤ϵ.

We will assume that the function satisfies a regularity condition on , which essentially states that the gradient of the function is well-behaved.

###### Condition 7.13 (Regularity Condition)

We say that the function satisfies the regularity condition or if for all vectors we have

 (7.8)

In the lemma below we show that as long as the regularity condition holds on then Projected Wirtinger Flow starting from an initial solution in converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.

###### Lemma 7.14

Assume that obeys RC for all . Furthermore, suppose , and assume . Consider the following update

 zτ+1=PK(zτ−μ∇f(zτ)).

Then for all we have and

 ∥zτ−x∥2ℓ2≤(1−2μα)τ∥z0−x∥2ℓ2.

Proof The proof is similar to a related proof in the Wirtinger Flow paper [16]. We prove that if then for all

 z+=z−μ∇f(z)

obeys

 ∥z+−x∥2ℓ2≤(1−2μα)∥z−x∥2ℓ2. (7.9)

The latter implies that if , then . Combining the latter with the fact that projection onto convex sets are contractive (Lemma 7.9) we conclude that

 ∥PK(z+)−x∥ℓ2=∥PK(z+)−P(x)∥ℓ2≤∥z+−x∥ℓ2≤∥z−x∥ℓ2≤ϵ. (7.10)

Also by the definition of we have . Therefore, if then we also have . The lemma follows by inductively applying (7.9) and (7.10). Now let us demonstrate how (7.9) follows from simple algebraic manipulations together with the regularity condition (7.8). To this aim note that

 ∥z+−x∥2ℓ2= ∥z−x−μ∇LI(z)∥2ℓ2 = ∥z−x∥2ℓ2−2μ⟨∇LI(z),(z−x)⟩+μ2∥∇LI(z)∥2ℓ2 ≤ ∥z−x∥2ℓ2−2μ(1α∥z−x∥2ℓ2+1β∥∇LI(z)∥2ℓ2)+μ2∥∇LI(z)∥2ℓ2 = (1−2μα)∥z−x∥2ℓ2+μ(μ−2β)∥∇LI(z)∥2ℓ2 ≤ (1−2μα)∥z−x∥2ℓ2,

where the last line follows from . This concludes the proof.

For any , we need to show that

 (7.11)

We prove that (7.11) holds with by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining both these two properties gives (7.11).

###### Condition 7.15 (Local Curvature Condition)

We say that the function satisfies the local curvature condition or if for all vectors ,

 (7.12)

This condition essentially states that the function curves sufficiently upwards (along most directions) near the curve of global optimizers.

###### Condition 7.16 (Local Smoothness Condition)

We say that the function satisfies the local smoothness condition or if for all vectors we have

 ∥∇LI(z)∥2ℓ2≤β(λ∥z−x∥2ℓ2+γmm∑r=1|a∗r(z−x)|4). (7.13)

This condition essentially states that the gradient of the function is well behaved (the function does not vary too much) near the curve of global optimizers.

#### 7.4.2 Proof of the local curvature condition

For any , we want to prove the local curvature condition (7.12). Recall that

 ∇LI(z)=1mm∑r=1(|⟨ar,z⟩|2−yr)(ara∗r)z,

and define . To establish (7.12) it suffices to prove that

 1mm∑r=1(2(h∗ara∗rx)2+3(h∗ara∗rx)|a∗rh|2+|a∗rh|4)−(γmm∑r=1|a∗rh|4)≥(1α+λ)∥h∥2ℓ2, (7.14)

holds for all satisfying . Equivalently, we only need to prove that for all