1 Introduction
Signal reconstruction from quadratic measurements is at the heart of many applications in signal and image processing. In this problem we acquire quadratic measurements of the form
(1.1) 
from an unknown, structured signal . Here,
are known sampling vectors and
are observed measurements. Such quadratic signal recovery problems are of interest in a variety of domains ranging from combinatorial optimization, to wireless communications and imaging. Focusing on signal processing applications, recovering a signal from measurements of the form (
1.1) is usually referred to as the generalized phase retrieval problem. The connection with phase retrieval is due to the fact that optical detectors, specially at small wavelengths, can often only record the intensity of the light field and not its phase. Indeed, the acquired measurements in many popular coherent diffraction imaging systems such as those based on Ptychography or phase from defocus are of the form (1.1), with corresponding to the object of interest, modulated sinusoids, and the recorded data.Given the ubiquity of the generalized phase retrieval problem in signal processing, over the years many heuristics have been developed for its solution. On the one hand, invention of new Xray sources and new experimental setups that enable recording and reconstruction of noncrystalline objects has caused a major revival in the use of phase retrieval techniques in imaging. On the other hand, the last five years has also witnessed tremendous progress in terms of providing rigorous mathematical guarantees for the performance of some classical heuristics such as alternating minimization [58, 75] as well as newer ones based on semidefinite programing [17] and Wirtinger flows [16] and its variants [19, 80, 81, 14, 77]. We shall review all these algorithms and mathematical results in greater detail in Section 6. These results essentially demonstrate that a signal of dimension can be recovered efficiently and reliably from the order of generic quadratic measurements of the form (1.1).
The recent surge of applied and theoretical activity regarding phaseless imaging is in part driven by the hope that it will eventually lead to successful imaging of large protein complexes and biological specimens enabling live imaging of biochemical activities at the molecular level. Furthermore, phaseless imaging techniques increasingly play a crucial role in emerging national security applications aimed at monitoring electronic products that are intended for military or infrastructure use so as to ensure these products do not contain secret backdoors granting foreign governments cyber access to vital US infrastructure. Despite the incredible progress discussed earlier on both applied and mathematical fronts, major challenges impede the use of such techniques to these emerging domains. One major challenge is that acquiring measurements of large specimens at high resolutions (corresponding to very short wavelengths) require time consuming and expensive measurements. To be concrete, the most modern phaseless imaging setups require image acquisition times exceeding days for imaging a micrometer micrometer specimen at nm resolution!
To overcome these challenges, in this paper we aim to utilize apriori structural information available about the signal to reduce the required number of quadratic measurements of the form (1.1). Indeed, in the application domains discussed above there is a lot of apriori knowledge available that can be utilized to reduce acquisition time and enhance image reconstruction. For example, images of electronic chips are extremely structured e.g. piecewise constant and often projections of 3D rectilinear models. While, historically various apriori information such as nonnegativty has been used to enhance image reconstruction, such simple forms of structural information are often not sufficient. Complicating the matter further our mathematical understanding of how well even simple forms of apriori information can enhance reconstruction is far from complete. To be concrete, assume we know the signal of interest is sparse e.g. it has at most nonzero entries. In this case for known tractable algorithms to yield accurate solutions the number of generic measurements must exceed with a constant [60]
. This is surprising, as the degrees of freedom of an
sparse vector is of the order and based on the compressive sensing literature one expects to be able to recover the signal from the order of generic quadratic measurements. In fact, it is known that on the order of generic quadratic measurements uniquely specify the unknown signal up to a global phase factor. However, it is not known whether a tractable algorithm can recover the signal from such minimal number of generic quadratic measurements.The above example demonstrates a significant gap in our ability to utilize prior structural assumptions in phase retrieval problems so as to reduce the required number of measurements or sample complexity. This is not an isolated example, and such gaps hold more generally for a variety of problems and structures (see [60, 2, 22, 23, 21] for more details on related gaps). The emergence of such sample complexity “barriers” is quite surprising as in many cases there is no tractable algorithm known to close this gap. In fact, for some problems such as sparse PCA it is known that closing this gap via a computationally tractable approach will yield tractable algorithms for notoriously difficult problems such as planted clique [9].
2 Minimizing (non)convex objectives with (non)convex constraints
We wish to discern an unknown but “structured” signal from quadratic measurements of the form , for . However, in the applications of interest typically the number of equations is significantly smaller than the number of variables so that there are infinitely many solutions obeying the quadratic constraints. However, it may still be possible to recover the signal by exploiting knowledge of its structure. To this aim, let be a cost function that reflects some notion of “complexity” of the “structured” solution. It is then natural to use the following optimization problem to recover the signal.
(2.1) 
Here,
is a loss function measuring the misfit between the measurements
and the data model and is a regularization function that reflects known prior knowledge about the signal. A natural approach to solve this problem is via projected gradient type updates of the form(2.2) 
Here, is the Wirtinger derivative of (see [16, Section 6] for details) and denotes the projection of onto the set
(2.3) 
Following [16], we shall refer to this iterative procedure as the Projected Wirtinger Flow (PWF) algorithm.
Apriori it is completely unclear why the iterative updates (2.2) should converge as not only the loss function may be nonconvex but also the regularization function! Efficient signal reconstruction from nonlinear measurements in this highdimensional setting poses new challenges:

When are the iterates able to escape local optima and saddle points and converge to global optima?

How many measurements do we need? Can we break through the barriers faced by convex relaxations?

How does the number of measurements depend on the apriori prior knowledge available about the signal? What regularizer is best suited to utilizing a particular form of prior knowledge?

How many passes (or iterations) of the algorithm is required to get to an accurate solution?
At the heart of answering these questions is the ability to predict convergence behavior/rate of (non)convex constrained optimization algorithms.
3 Precise measures for statistical resources
Throughout the rest of the paper we assume that the signal and the measurement vectors are all realvalued. For sake of brevity we have focused our attention to this realvalued case. However, we note that all of our definitions/results trivially extend to the complex case. We wish to characterize the rates of convergence for the projected gradient updates (2.2) as a function of the number of samples, the available prior knowledge and the choice of the regularizer. To make these connections precise and quantitative we need a few definitions. Naturally the required number of samples for reliable signal reconstruction depends on how well the regularization function can capture the properties of the unknown signal . For example, if we know our unknown parameter is approximately sparse naturally using an norm for the regularizer is superior to using an regularizer. To quantify this capability we first need a couple of standard definitions which we adapt from [61, 62].
Definition 3.1 (Descent set and cone)
The set of descent of a function at a point is defined as
The cone of descent is defined as a closed cone that contains the descent set, i.e. . The tangent cone is the conic hull of the descent set. That is, the smallest closed cone obeying .
We note that the capability of the regularizer in capturing the properties of the unknown signal depends on the size of the descent cone . The smaller this cone is the more suited the function is at capturing the properties of . To quantify the size of this set we shall use the notion of mean width.
Definition 3.2 (Gaussian width)
The Gaussian width of a set is defined as:
where the expectation is taken over . Throughout we use to denote the the unit ball/sphere of .
We now have all the definitions in place to quantify the capability of the function in capturing the properties of the unknown parameter . This naturally leads us to the definition of the minimum required number of samples.
Definition 3.3 (minimal number of samples)
Let be a cone of descent of at . We define the minimal sample function as
We shall often use the short hand with the dependence on implied.
We note that is exactly the minimum number of samples required for structured signal recovery from linear measurements when using convex regularizers [18, 1]. Specifically, the optimization problem
(3.1) 
succeeds at recovering the unknown signal
with high probability from
measurements of the form if and only if .^{1}^{1}1We would like to note that only approximately characterizes the minimum number of samples required. A more precise characterization is where . However, since our results have unspecified constants we avoid this more accurate characterization. While this result is only known to be true for convex regularization functions we believe that also characterizes the minimal number of samples even for nonconvex regularizers in (3.1). See [61] for some results in the nonconvex case as well as the role this quantity plays in the computational complexity of projected gradient schemes for linear inverse problems. Given that in phaseless imaging we have less information (we loose the phase of the linear measurements) we can not hope to recover structured signals from when using (2.1). Therefore, we can use as a lowerbound on the minimum number of measurements required for projected gradient descent iterations (2.2) to succeed in recovering the signal of interest.4 Nonconvex regularization examples
Next we provide two examples of (non)convex regularizers which are of interest in phaseless imaging applications. These two simple examples are meant to highlight the importance of nonconvex regularizers in imaging. However, our theoretical framework is by no means limited to these simple examples and can deal with significantly more complicated nonconvex regularizers capturing much more nuanced forms of prior structure.

piecewise constant structure. One apriori structural information available in many phaseless imaging applications is that images tend to be piecewise constant. For example, contiguous parts of biological specimens are made up of the same tissue and exhibit the same behavior under electromagnetic radiations. Similarly, images of electronic chips are often projections of piecewise constant, 3D rectilinear models. Let denote a image consisting of an array of pixels. A popular approache for exploiting piecewiseconstant structures is to use total variation regularization functions. Two common choices are the isotropic and anisotropic totoal variation regularizations defined as
When , these regularization functions are convex. However, for many image reconstruction tasks total variation regularization with , despite being nonconvex, is significantly more effective at capturing piecewise constant structure.

discrete values. Another form of apriori knowledge that is sometimes available in imaging applications is possible discrete values the image pixels can take. For example, in imaging electronic chips the metallic composition of the different parts predetermines the possible discrete values and are known in advance. Let denote a image consisting of an array of pixels and assume the possible discrete values are . A natural regularization in this case is
This regularization is convenient as projection onto its sublevel sets is easy and amounts to replacing each entry of the input vector/matrix with the closest discrete value from (a.k.a. hard thresholding). Of course this is not the only regularization function that can enforce discrete structures and in practice “soft thresholding” variants may be more effective. Our framework can be used to analyze many such variants. Indeed, an interesting aspect of our results is that it allows us to understand what regularizer is best suited at enforcing a particular form of prior structure.
5 Theoretical results for Projected Wirtinger Flows
In this section we shall explain our main theoretical results. To this aim we need to define the distance to the solution set.
Definition 5.1
Let be any solution to the quadratic system (the signal we wish to recover). For each , define
As we mentioned earlier we are interested in recovering structured signal recovery problems from quadratic measurements via the optimization problem (2.1). Naturally the convergence/lack of convergence as well as the rate of convergence of projected Wirtinger Flow iterates (2.2) depends on the loss function . We now discuss our theoretical results for two different loss functions.
5.1 Intensitybased Wirtinger Flows with convex regularizers
Our first result focuses on a quadratic loss function applied to intensity measurements, i.e. in (2.1). In this case, the optimization problem takes the form
(5.1) 
Our first result studies the effectiveness of projected Wirtinger Flows on this objective.
Theorem 5.2
Let be an arbitrary vector and be a proper convex function. Suppose is a Gaussian map and let be
quadratic measurements. To estimate
, start from a point obeying(5.2) 
and apply the Projected Wirtinger Flow (PWF) updates
(5.3) 
with . Also set the learning parameter sequence as and ^{2}^{2}2We note that can be trivially estimated from the measurements as and our proofs are robust to this misspecification. We avoid stating this variant for ease of reading. for all and assume for some fixed numerical constant . Furthermore, let , defined by 3.3, be our lower bound on the number of measurements. Also assume
(5.4) 
holds for a fixed numerical constant . Then there is an event of probability at least such that on this event starting from any initial point obeying (5.2) the update (5.3) satisfy
(5.5) 
As mentioned earlier is the minimal number of measurements required to recover a structured signal from linear measurements. also serves as a lower bound on structured signal recovery from quadratic measurements as they are even less informative (we loose sign information). Theorem 5.2 shows that PWF applied to quadratic loss using intensity measurements can (locally) reconstruct the signal with this minimal sample complexity (up to a constant and log factor). To be concrete consider the case where the unknown signal is known to be sparse and we use as the regularizer in (5.1). In this case it is known that and Theorem 5.2 predicts that an sparse signal can (locally) be recovered from the order of measurements. This breaks through wellknown barriers that have emerged for this problem in recent literature. Indeed, for known tractable convex relaxation schemes to yield accurate solutions the number of generic measurements must exceed with a constant [54, 60]. We also note that even recent nonconvex approaches such as [14, 78] have also not succeeded at breaking through this barrier even when an initialization obeying (5.2) is available.
The convergence guarantees provided above hold as long as PWF is initialized per (5.2) in a neighborhood of the unknown signal with relative error less than a constant. In this paper we are concerned only with local convergence properties of PWF and therefore do not provide an explicit construction for such an initialization. However, in a companion paper we demonstrate that the optimization problem (5.1) has certain favorable characteristics that may allow global convergence guarantees from any initialization using second order methods.
Another interesting aspect of the above result is that the rate of convergence is geometric. Specifically, to achieve a relative error of (), the required number of iterations is . Note that the cost of each iteration depends on applying the matrix and its transpose which has computational complexity on the order of . This is assuming that the projection has negligible cost compared to the cost of applying . This is the case for example for sparse signals when using the regularizer . Therefore, in these cases to achieve a relative error of the total computational complexity of PWF is on the order of .
Let us now discuss some ways in which this theorem is suboptimal. Even though this theorem breaks through known sample complexity barriers, a natural question is whether it is possible to remove the log factor so as to have a sample complexity that is only a constant factor away from the minimum sample complexity of structured signal recovery from linear measurements. Another way in which the algorithm is suboptimal is computational complexity. While the rate of convergence of PWF stated above is geometric, it is not linear. With a linear rate of convergence to achieve a relative error of the total computational complexity would be on the order of which is a factor of smaller than the guarantees provided by PWF. In the next section we will show how to close these gaps in sample complexity and computational complexity by using a different loss function in (5.1). Finally, a major draw back of Theorem 5.2 it that it only applies to convex regularizers. In the next section we will show how to also remove this assumption so as to allow arbitrary nonconvex regularizers.
5.2 Amplitudebased Wirtinger Flows with (non)convex regularizers
Our second result focuses on a quadratic loss function applied to amplitude measurements, i.e. in (2.1). In this case, the optimization problem takes the form
(5.6) 
One challenging aspect of the above loss function is that it is not differentiable and it is not clear how to run projected gradient descent. However, this does not pose a fundamental challenge as the loss function is differentiable except for isolated points and we can use the notion of generalized gradients to define the gradient at a nondifferentiable point as one of the limits points of the gradient in a local neighborhood of the nondifferentiable point. For the loss in (5.6) the generalized gradient takes the form
(5.7) 
Theorem 5.3
Let be an arbitrary vector and be a proper function (convex or nonconvex). Suppose is a Gaussian map and let be quadratic measurements. To estimate , start from a point obeying
(5.8) 
and apply the Projected Wirtinger Flow (PWF) updates
(5.9) 
with and defined via (5.7). Also set the learning parameter sequence and for all . Furthermore, let , defined by 3.3, be our lower bound on the number of measurements. Also assume
(5.10) 
holds for a fixed numerical constant . Then there is an event of probability at least such that on this event starting from any initial point obeying (5.8) the update (5.9) satisfy
(5.11) 
Here is a fixed numerical constant.
The first interesting and perhaps surprising aspect of this result is its generality: it applies not only to convex regularization functions but also nonconvex ones! As we mentioned earlier the optimization problem in (2.1) is not known to be tractable even for convex regularizers. Despite the nonconvexity of both the objective and regularizer, the theorem above shows that with a near minimal number of measurements, projected gradient descent provably converges to the original signal without getting trapped in any local optima.
The amplitudebased loss also has stronger sample complexity and computational complexity guarantees compared with the intensitybased version. Indeed, the required number of measurements improves upon the intensitybased loss by a logarithmic factor, achieving a near optimal sample complexity for this problem (up to a constant factor). Also, the convergence rate of the amplitudebased approach is now linear. Therefore, to achieve a relative error of the total number of iterations is on the order of . Thus the overall computational complexity is on the order of (in general the cost is the total number of iterations multiplied by the cost of applying the measurement matrix and its transpose). As a result, the computational complexity is also now optimal in terms of dependence on the matrix dimensions. Indeed, for a dense matrix even verifying that a good solution has been achieved requires one matrixvector multiplication which takes time.
We now pause to discuss the choice of the loss function. The theoretical results above suggests that the least squares loss on amplitude values is superior to one on intensity values in terms of both sample and computational complexity. Such improved performance has also been observed empirically for more realistic models in optics [79]. Indeed, [79] shows that not only the amplitudebased least squares has faster convergence rates but also is more robust to noise and model misspecification. However, we would like to point out that the least squares objective on intensity values does have certain advantages. For instance, it is possible to do exact line search (in closed form) on this objective. We have observed that this approach works rather well in some practical domains (e.g. ptychography for chip imaging) without the need for any tuning as the step size in each iteration is calculated in closed form via exact line search. Therefore, we would like to caution against rushed judgments declaring one variant of Wirtinger Flow superior to another due to minor (e.g. logarithmic) theoretical improvements in sample complexity and or computational complexity.^{3}^{3}3Unfortunately such premature declarations have become exceedingly common in recent literature. We would like to emphasize that there is no “best” or “correct” loss function that works better than others for all application domains. Ultimately, the choice of the loss function is dictated by the statistics of the noise or misspecification present in a particular domain.
6 Discussions and prior art
Phase retrieval is a century old problem and many heuristics have been developed for its solution. For a partial review of some of these heuristics as well as some recent theoretical advances in related problems we refer the reader to the overview articles/chapters [68, 56, 40] and [69, Part II] as well as [15, Section 1.6], and references therein such as [13, 59, 7, 76, 43, 31, 32, 50, 51]. There has also been a surge of activity surrounding nonconvex optimization problems in the last few years. While discussing all of these results is beyond the scope of this paper we shall briefly discuss some of the most relevant and recent literature in the coming paragraphs. We refer the reader to [71] and references therein [12, 47, 45, 46, 33] for a more comprehensive review of such results. We also refer the reader to[35, 29, 5]
for recent algorithmic approaches based on linear programs and
[52] for characterizing large systems limits of dynamics of phase retrieval algorithms.The Wirtinger Flow algorithms for solving quadratic systems of equations was introduce in [16]. [16] also provides a local convergence analysis when no prior structural assumption is available about the signal. The analysis of [16] was based on the so called regularity condition. This regularity condition and closely related notions have been utilized/generalized in a variety of interesting ways to provide rigorous convergence guarantees for related nonconvex problems arising in diverse applications ranging from matrix completion to dictionary learning and blind deconvolution [6, 72, 66, 3, 73, 19, 14, 83, 82, 20, 80, 70, 84, 81, 77, 64, 53]. The intensitybased results presented in Section 5.1 are based on a generalization of the regularity condition in [16] so as to allow arbitrary convex constraints.
The second set of results we presented in this paper were based on least squares fit of the amplitudes. This objective function has been historically used in phase retrieval applications [27] and has close connections with the classical Fienup algorithm [69, Chapter 13]. Focusing on more recent literature, [79] demonstrated the effectiveness of this approach in optical applications. More recently, a few interesting publications [81, 77] study variants of this loss function and develop guarantees for its convergence. The analysis presented in both of these papers are also based on variants of the regularity condition of [16] and do not utilize any structural assumptions. In this paper we have analyzed the performance of the amplitudebased PWF with any constraint (convex or nonconvex). These results are based on a new approach to analyzing nonconvex optimization problems that differs from the regularity approach used in [16] and all of the papers mentioned above. Rather, this new technique follows a more direct route, utilizing/developing powerful concentration inequalities to directly show the error between the iterates and the structured signal decreases at each iteration.
A more recent line of research aims to provide a more general understanding of the geometric landscape of nonconvex optimization problems by showing that in many problems there are no spurious local minmizers and saddles points have favorable properties [70, 28, 10, 11]. A major advantage of such results is that they do not required specialized initializations in the sense that trust regiontype algorithms or noisy stochastic methods are often guaranteed to converge from a random initialization and not just when an initial solution is available in a local neighborhood of the optimal solution. The disadvantage of such results is that the guaranteed rates of convergence of these approaches are either not linear/geometric or each iteration is very costly. These approaches also have slightly looser sample complexity bounds. Perhaps the most relevant result of this kind to this paper is the interesting work of Sun, Qu and Wright [70] which studies the geometric landscape of the objective (5.1) in the absence of any regularizer. The authors also show that a certain trust region algorithm achieves a relative error of after as long as the number of samples exceeds . As mentioned previously, using different proof techniques in a companion paper we demonstrate a result of a similar flavor to [70] for the constrained problem (5.1). This results shows that with measurements all local optima are global and a second order scheme recovers the global optima (the unknown signal) in a polynomial number of iterations.
We now pause to caution against erroneous missinterpretations of the theoretical results discussed in the previous paragraph:

There are no spurious local optima i.e. all local optima are global in phase retrieval applications

Initialization is irrelevant in phase retrieval applications
The reason these conclusions are inaccurate are twofold. First, while the results of the previous paragraph and Theorem 5.2 both require on the order of samples the multiplicative constants in these results tend to be drastically different in practice. Second, the measurement vectors occurring in practical domains are substantially more illconditioned than the Gaussian measurements studied in this paper. This further amplifies the gap between the sample complexity of local versus global results. Indeed, in many practical domains where phase retrieval is applied local optima are abound and a major source of algorithmic stagnation. Therefore, carefully crafted initialization schemes or regularization methods are crucial for the convergence of local search heuristics in many phaseless imaging domains.
We would also like to mention prior work on sparse phase retrieval. For generic measurements such as the Gaussian distribution studied in this paper,
[54] provides guarantees for the convex relaxationbased PhaseLift algorithm as long as the number of samples exceed where is the number of nonzeros in the sparse signal. The papers [54, 60] showed that these results are essentially unimprovable when using simple SDP relaxations. More recently, interesting work by Cai, Li and Ma [14] studies the performance of Wirtinger Flow based schemes for sparse phase retrieval problems. This result also requires measurements even when an initialization obeying (5.2) is available. Therefore, this results also does not breakthrough the local barrier. More recently, there are a few publications aimed at going below measurements. These results differ from ours in that they are either applicable to specific designs which tailor the algorithm to the measurement process [4, 65, 48, 36, 37] or require additional constraints on the coefficient of the sparse signal [78, 34, 49]. In contrast to the above publications in this paper we have demonstrated that locally only samples suffice to recover any sparse signal from generic quadratic measurements formally breaking through the barrier. Furthermore, our results applies to any regularizer (convex or nonconvex), allowing us to enforce various forms of prior knowledge in our reconstuction.Finally, we would like to mention that there has also been some recent publications aimed at developing theoretical guarantees for more practical models. For instance, the papers [15, 32, 67, 42, 41] develop theoretical guarantees for convex relaxation techniques for more realistic Fourier based models such as coded diffraction patterns and Ptychography. More recently, the papers [39, 38, 44, 26, 8] also develop some theoretical guarantees for faster but sometimes designspecific algorithms. Despite all of this interesting progress the known results for more realistic measurement models are far inferior to their Gaussian counterparts in terms of sample complexity, computational complexity or stability to noise. Closing these gaps is an interesting and important future direction.
7 Proofs
In the Gaussian model the measurement vectors also obey for all with probability at least . Thoughout the proofs, we assume we are on this event without explicitly mentioning it each time. Without loss of generality we will assume throughout the proofs that . We remind the reader that throughout is a solution to our quadratic equations, i.e. obeys and that the sampling vectors are independent from . We also remind the reader that for a set , is the mean width of per Definition 3.2. Throughout, we use to denote the unit sphere/unit ball of . We first discuss some common background and results used for proving both theorems. Since the proof of the two theorems follow substantially different paths we dedicate a subsection to each: Section 7.4 for proof of Theorem 5.2 and Section 7.5 for proof of Theorem 5.3.
7.1 Formulas for gradients, generalized gradients and their expected values
As a reminder the intensitybased loss function is equal to
and the gradient equal to
As a reminder the amplitudebased loss function is equal to
and the generalized gradient is equal to
7.2 Concentration and bounds for stochastic processes
In this section we gather some useful results on concentration of stochastic processes which will be crucial in our proofs. We begin with a lemma which is a direct consequence of Gordon’s escape from the mesh lemma [30] whose proof is deferred to Appendix A.1.
Lemma 7.1
Assume is a cone and is the unit sphere of . Also assume that
for a fixed numerical constant . Then for all
holds with probability at least .
We also need a generalization of the above lemma stated below and proved in Appendix A.2.
Lemma 7.2
Assume is a cone (not necessarily convex) and is the unit sphere of . Also assume that
for a fixed numerical constant . Then for all
holds with probability at least .
We next state a generalization of Gordon’s escape through the mesh lemma, whose proof appears in Appendix A.3.
Lemma 7.3
Let be fixed vector with nonzero entries and construct the diagonal matrix . Also, let have i.i.d. entries. Furthermore, assume and define
where is distributed as . Define
then for all
holds with probability at least
The previous lemma leads to the following Corollary. We skip the proof as it is identical to how Lemma 7.1 is derived from Gordon’s lemma (See Section A.1 for details).
Corollary 7.4
Let be fixed vector with nonzero entries and assume . Furthermore, assume
Then for all ,
holds with probability at least .
The above generalization of Gordon’s lemma together with its corollary will be very useful in our proofs in particular it allows us to prove the following key result whose proof is also deferred to Appendix A.4.
Lemma 7.5
Assume is a cone and is the unit sphere of . Furthermore, let be a fixed vector. Also assume that
Then for all
holds with probability at least with and fixed numerical constants.
We also need the following important lemma. The proof of this lemma is based on the paper [57]. Please also see [25] for related calculations. We defer the proof to Appendix A.5.
Lemma 7.6
Assume are cones and is the unit sphere of . Also assume
for a fixed numerical constant . Then for any and
holds with probability at least where is a fixed numerical constant. Here, is distributed as .
We also state a simple generalization of Lemma 7.6 above. This lemma has a near identical proof. We skip details for brevity.
Lemma 7.7
Assume are sets with diameters bounded by fixed numerical constants. Also assume
for a fixed numerical constant . Then for any and
holds with probability at least where is a fixed numerical constant. Here, is distributed as .
Finally, we also need the following lemma with the proof appearing in Appendix A.6.
Lemma 7.8
For any define . Then we have
7.3 Cone and projection identities
In this section we will gather a few results regarding higher dimensional cones and projections that are used throughout the proofs. These results are directly adapted from [61, 6.1]. We begin with a result about projections onto sets. The first part concerning projections onto convex sets is the well known contractivity result regarding convex projections.
Lemma 7.9
Assume is a closed set and . Then if is convex for every we have
(7.1) 
Furthermore, for any closed set (not necessarily convex) and for every we have
(7.2) 
Proof Equation (7.1) is well known. We shall prove the second result. To this aim note that by definition of projection onto a set we have
(7.3) 
Also note that
Combining the latter inequality with (7.3) and using the CauchySchwarz inequality we have
Dividing both sides of the above inequality by concludes the proof.
We now state a result concerning projection onto cones.
Lemma 7.10
Let be a closed cone and . The followings two identities hold
(7.4)  
(7.5) 
The following lemma is straightforward and follows from the fact that translation preserves distances.
Lemma 7.11
Suppose is a closed set. The projection onto obeys
The next lemma compares the length of a projection onto a set to the length of projection onto the conic approximation of the set.
Lemma 7.12 (Comparison of projections)
Let be a closed and nonempty set that contains . Let be a nonempty and closed cone containing (). Then for all ,
(7.6) 
Furthermore, assume is a convex set. Then for all ,
(7.7) 
7.4 Convergence analysis for intensitybased Wirtinger Flows
In this section we shall prove Theorem 5.2. The proof of this result is based on an extension of the framework developed in [16]. Therefore, the outline of our exposition closely follows that of [16]. Section 7.4.1 discusses our general convergence analysis and shows that it follows from a certain Regularity Condition (RC). In this section we also show that the regularity condition can be proven by showing two sufficient Local Curvature and Local Smoothness conditions denoted by LCC and LSC. We then prove the Local Curvature condition in Section 7.4.2 and the Local Smoothness condition in Section 7.4.3.
7.4.1 General convergence analysis
Note that (5.2) guarantees that either or is small. Throughout the proof without loss of generality we assume is the smaller one. To introduce our general convergence analysis we begin by defining
Note that when condition (5.2) holds the next iterate obeys with . The reason is that when the regularizer is convex so is the set and by contractivity of projection onto convex sets (Lemma 7.9) we also have
We will assume that the function satisfies a regularity condition on , which essentially states that the gradient of the function is wellbehaved.
Condition 7.13 (Regularity Condition)
We say that the function satisfies the regularity condition or if for all vectors we have
(7.8) 
In the lemma below we show that as long as the regularity condition holds on then Projected Wirtinger Flow starting from an initial solution in converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.
Lemma 7.14
Assume that obeys RC for all . Furthermore, suppose , and assume . Consider the following update
Then for all we have and
Proof The proof is similar to a related proof in the Wirtinger Flow paper [16]. We prove that if then for all
obeys
(7.9) 
The latter implies that if , then . Combining the latter with the fact that projection onto convex sets are contractive (Lemma 7.9) we conclude that
(7.10) 
Also by the definition of we have . Therefore, if then we also have . The lemma follows by inductively applying (7.9) and (7.10). Now let us demonstrate how (7.9) follows from simple algebraic manipulations together with the regularity condition (7.8). To this aim note that
where the last line follows from . This concludes the proof.
For any , we need to show that
(7.11) 
We prove that (7.11) holds with by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining both these two properties gives (7.11).
Condition 7.15 (Local Curvature Condition)
We say that the function satisfies the local curvature condition or if for all vectors ,
(7.12) 
This condition essentially states that the function curves sufficiently upwards (along most directions) near the curve of global optimizers.
Condition 7.16 (Local Smoothness Condition)
We say that the function satisfies the local smoothness condition or if for all vectors we have
(7.13) 
This condition essentially states that the gradient of the function is well behaved (the function does not vary too much) near the curve of global optimizers.
Comments
There are no comments yet.