denotes the conjugate of a complex number . denotes the phase of
. We use bold lower-case and upper case letters for vectors and matrices respectively. For a matrix, and denote the transpose of a matrix and its Hermitian respectively. Throughout the paper, we also use the following two notations: and . and said to be circularly-symmetric Gaussian, denoted as , if and and
are two independent real Gaussian random variables with mean zero and variance. Finally, we define for .
1.2 Informal statement of our results
Phase retrieval refers to the task of recovering a signal from its phaseless linear measurements:
where is the th component of and a Gaussian noise. The recent surge of interest [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 12, 16, 17, 18, 19, 20, 21, 22, 23] has led to a better understanding of the theoretical aspects of this problem. Thanks to such research we now have access to several algorithms, inspired by different ideas, that are theoretically guaranteed to recover exactly in the noiseless setting. Despite all this progress, there is still a gap between the theoretical understanding of the recovery algorithms and what practitioners would like to know. For instance, for many algorithms, including Wirtinger flow [4, 5] and amplitude flow [6, 7], the exact recovery is guaranteed with either or measurements, where is often a fixed but large constant that does not depend on . In both cases, it is often claimed that the large value of or the existence of is an artifact of the proving technique and the algorithm is expected to work with for a reasonably small value of . Such claims have left many users wondering
Which algorithm should we use? Since the theoretical analyses are not sharp, they do not shed any light on the relative performance of different algorithms. Answering this question through simulations is very challenging too, since many factors including the distribution of the noise, the true signal , and the number of measurements may have impact on the answer.
When can we trust the performance of these algorithms in the presence of noise? Suppose for a moment that we know the minimum number of measurements that is required for the exact recovery through simulations. Should we collect the same number of measurements in the noisy settings too?
What is the impact of initialization schemes, such as spectral initialization? Can we trust these initialization schemes in the presence of noise? How should we compare different initialization schemes?
Researchers have developed certain intuition based on a combination of theoretical and empirical results, to give heuristic answers to these questions. However, as demonstrated in a series of papers in the context of compressed sensing, such folklores are sometimes inaccurate. To address Question Q.1, several researchers have adopted the asymptotic framework , , and provided sharp analyses for the performance of several algorithms [20, 21, 22]. This line of work studies recovery algorithms that are based on convex optimization. In this paper, we adopt the same asymptotic framework and study the following popular non-convex problem, known as amplitude-based optimization [7, 6, 25]:
-regularizer. Regularization is known to reduce the variance of an estimator and hence is expected to be useful when. However, as we will try to clarify later in this section, since the loss function is non-convex, regularization can help the iterative algorithm that aims to solve (1.2) even in the noiseless settings.
present sharp characterization of the mean square error (even the constants are sharp) in both noiseless and noisy settings.
present a quantitative characterization of the gain initialization and regularization can offer to our algorithms.
Furthermore, the sharpness of our results enables us to present a quantitative and accurate comparison with convex optimization based recovery algorithms [20, 21, 22] and give partial answers to Question Q.1 mentioned above. Below we introduce our message passing algorithm and informally state some of our main results. The careful and accurate statements of our results are postponed to Section 2.
Following the steps proposed in , we obtain the following algorithm called, Approximate Message Passing for Amplitude-based optimization (AMP.A). Starting from an initial estimate , AMP.A proceeds as follows for :
|In these iterations|
In the above, at can be any fixed number and does not affect the performance of AMP.A. Further, the “divergence” term is defined as
where and denote the real and imaginary parts of respectively (i.e., ). For readers’ convenience, we include the derivations of AMP.A in Appendix A.
The first point that we would like to discuss here is the effect of the regularizer on AMP.A. For the moment suppose that the noise is zero. Does including the regularizer in (1.2) benefit AMP.A? Clearly, any regularization may introduce unnecessary bias to the solution. Hence, if the final goal is to obtain exactly we should set . However, the optimization problem in (1.2) is non-convex and iterative algorithms intended to solve it can get stuck at bad local minima. In this regard, regularization can still help AMP.A to escape bad local minima through continuation. Continuation is popular in convex optimization for improving the convergence rate of iterative algorithms , and has been applied to the phase retrieval problem in . In continuation we start with a value of for which AMP.A is capable of finding the global minimizer of (1.2). Then, once converges we will either decrease or increase a little bit (depending on the final value of for which we want to solve the problem) and use the previous fixed point of AMP.A as the initialization for the new AMP.A. We continue this process until we reach the value of we are interested in. For instance, if we would like to solve the noiseless phase retrieval problem then should eventually go to zero so that we do not introduce unnecessary bias. The rationale behind continuation is the following. Let and be two different values of the regularization parameter, and they are close to each other. Suppose that the global minimizer of (1.2) with regularization parameter is and is given to the user. Suppose further that the user would like to find the global minimizer of (1.2) with . Then, it is conceivable that the global minimizer of the new problem is close to .111Given the sometimes complex geometry of non-convex problems, this might not always be the case. Hence, the user can initialize AMP.A with and hope that the algorithm may converge to the global minimizer of (1.2) for .
A more general version of the continuation idea we discussed above is to let change at every iteration (denoted as ), and set according to :
This way we can not only automate the continuation process, but also let AMP.A decide which choice of is appropriate at a given stage of the algorithm. Our discussion so far has been heuristic. It is not clear whether and how much the generalized continuation can benefit the algorithm. To give a partial answer to this question we focus on the following particular continuation strategy: and obtain the following version of AMP.A:
Below we informally discuss some of the results we will prove in this paper.
Informal result 1. Consider the AMP.A algorithm for complex-valued signals with . Under the noiseless setting, if , then “converges to” as long as the initial estimate is not orthogonal to and . When , has a fixed point at . However, it has to be initialized very carefully to reach .
Before we discuss and explain the implications of this result, let us expand the scope of our results. This extension enables us to compare our results with existing work [20, 21, 22]. So far, we have discussed the case . However, in some applications, such as astronomical imaging, we are interested in real-valued signals . In Section 3, we will introduce a real-valued version of . The following informal result summarizes the performance of this algorithm.
Informal result 2. Consider the AMP.A algorithm for real-valued signals with . Under the noiseless setting, if , then “converges to” as long as the initialization is not orthogonal to . When , has a fixed point at . However, it has to be initialized very carefully to reach .
We would like to make the following remarks about these two results:
As is clear from our second informal result, when , cannot converge to . This value of is different from the information theoretic lower bound . This discrepancy is in fact due to the type of continuation we used in this paper. Note that this issue does not happen in the complex-valued . The search for a better continuation strategy for the real-valued is left as future research.
Simulation results presented in our forthcoming paper  show that for real-valued signals, AMP.A with can only recover when . As mentioned in our second informal result, continuation has improved the threshold of correct recovery to .
How much does spectral initialization improve the performance of AMP.A? To answer this question, let us focus on the real-valued signals. As discussed in our second Informal result, two values of are important for AMP.A: and . If , then AMP.A recovers exactly as long as the initialization is not orthogonal to . In this case spectral method helps, since it offers an initialization that is not orthogonal to . However, if the mean of is not zero, a simple initial estimate can work as well as the spectral initialization. Hence, in this case spectral initialization does not offer a major improvement. A more important question is whether spectral initialization can help AMP.A to perform exact recovery for . Our forthcoming paper  shows that the answer to this question is negative. Hence, as long as the final estimate of AMP.A is concerned, the impact of spectral initialization seems to be marginal.
Now let us discuss the performance of AMP.A under noisy settings. We assume that the measurement noise is Gaussian and small. Clearly, in this setting exact recovery is impossible, hence we study the asymptotic mean square error defined as the following almost sure limit ()
Informal result 3. Consider the AMP.A algorithm for complex-valued signals with . Let , then
Notice that the above result was derived based under the assumption . To interpret the above result correctly, we should discuss the signal to noise ratio of each measurement. Suppose that . Then the signal to noise ratio of each measurement is . In other words, as we increase the number of measurements or equivalently , then we reduce the signal to noise ratio of each measurement too. This causes some issues when we compare the for different values of . One easy fix is to assume that the variance of the noise is , where is a fixed number. Then we can define the noise sensitivity as
It is straightforward to use (1.8) to show that . Note that if we use with , then the noise sensitivity is approximately . If this level of noise sensitivity is not acceptable for an application, then the user should collect more measurements to reduce the noise sensitivity. Noise sensitivity can also be calculated for real-valued AMP.A:
Informal result 4. Consider the AMP.A algorithm for real-valued signals with . Let , then
1.3 Related work
1.3.1 Existing theoretical work
Early theoretical results on phase retrieval, such as PhaseLift  and PhaseCut , are based on semidefinite relaxations. For random Gaussian measurements, a variant of PhaseLift can recover the signal exactly (up to global phase) in the noiseless setting using measurements . However, PhaseLift (or PhaseCut) involves solving a semidefinite programming (SDP) and is computationally prohibitive for large-scale applications. A different convex optimization approach for phase retrieval, which has the same sample complexity, was independently proposed in  and . This method is formulated in the natural signal space and does not involve lifting, and is therefore computationally more attractive than SDP-based counterparts. However, both methods require an anchor vector that has non-zero correlation with the true signal, and the quality of the recovery highly depends on the quality of the anchor.
Apart from convex relaxation approaches, non-convex optimization approaches attract considerable recent interests. These algorithms typically consist of a carefully designed initialization step (usually accomplished via a spectral method ) followed by iterations that refine the estimate. An early work in this direction is the alternating minimization algorithm proposed in , which has sub-optimal sample complexity. Another line of work includes the Wirtinger flow algorithm [4, 32], truncated Wirtinger flow algorithm , and other variants[10, 7, 6, 25, 12]. Other approaches include Kaczmarz method [33, 34, 16, 17], trust region method , coordinate decent , prox-linear algorithm  and Polyak subgradient method .
All the above theoretical results guarantee successful recovery with measurements (or more) where is a fixed often large constant. However, such theories are not capable of providing fair comparison among different algorithms. To resolve this issue researchers have started studying the performance of different algorithms under the asymptotic setting and . An interesting iterative projection method was proposed in, whose dynamics can be characterized exactly under this asymptotic setting. However,  does not analyze the number of measurements required for this algorithm to work. The work in  provides sharp characterization of the spectral initialization step (which is a key ingredient to many of the above algorithms). The analysis in 
reveals a phase transition phenomenon: spectral method produces an estimate not orthogonal to the signal if and only ifis larger than a threshold (called “weak threshold” in ). Later,  derived the information-theoretically optimal weak threshold (which is for the real-valued model and for the complex-valued model) and proved that the optimal weak threshold can be achieved by an optimally-tuned spectral method. Using the non-rigorous replica method from statistical physics,  analyzes the exact threshold of (for the real-value setting) above which the PhaseMax method in  and  achieves perfect recovery. The analysis in  shows that the performance of PhaseMax highly depends on initialization (see Fig. 1 of ), and the required is lower bounded by for real-valued models. The analysis in  was later rigorously proved in  via the Gaussian min-max framework [36, 37], and a new algorithm called PhaseLamp was proposed. The PhaseLamp method has superior recovery performance over PhaseMax, but again it does not work when for real-valued models. A recent paper 
extends the asymptotic analysis of to the complex-valued setting, and it was shown that PhaseMax cannot work for . On the other hand, AMP.A proposed in this paper achieves perfect recovery when and , for the real and complex-valued models respectively. Further, [20, 21] focus on the noiseless scenario, while in this paper we also analyze the noise sensitivity of AMP.A. Finally, a recent paper  derived an upper bound of such that PhaseLift achieves perfect recovery. The exact value of this upper bound can be derived by solving a three-variable convex optimization problem and empirically  shows that for real-valued models.
1.3.2 Existing work based on AMP
Our work in this paper is based on the approximate message passing (AMP) framework [39, 40], in particular the generalized approximate message passing (GAMP) algorithm developed and analyzed in [26, 41]. A key property of AMP (including GAMP) is that its asymptotic behavior can be characterized exactly via the state evolution platform [39, 40, 26, 41].
For phase retrieval, a Bayesian GAMP algorithm has been proposed in . However,  did not provide rigorous performance analysis, partly due to the heuristic treatments used in the algorithm (such as damping and restart). Another work related to ours is the recent paper  (appeared on Arxiv while we are preparing this paper), which analyzed the phase transitions of the Bayesian GAMP algorithms for a class of nonlinear acquisition models. For the phase retrieval problem, a phase transition diagram was shown in [43, Fig. 1] under a Bernoulli-Gaussian signal prior. The numerical results in  indeed achieve state-of-the-art reconstruction results for real-valued models. However,  did not provide the analysis of their results and in particular did not mention how they handle a difficulty related to initialization. Further, the algorithm in  is based on the Bayesian framework which assumes that the signal and the measurements are generated according to some known distributions. Contrary to  and , this paper considers a version of GAMP derived from solving the popular optimization problem (1.2). We provide rigorous performance analysis of our algorithm for both real and complex-valued models. Note that the advantages and disadvantages of Bayesian and optimization-based techniques have been a long debate in the field of Statistics. Hence, we do not repeat those debates here. Given our experience in the fields of compressed sensing and phase retrieval, it seems that the performance of Bayesian algorithms are more sensitive to their assumptions than the optimization-based schemes. Furthermore, performance analyses of Bayesian algorithms are often very challenging under “non-ideal” situations which the algorithms are not designed for.
Here, we emphasize another advantage of our approach. Given the fact that the most popular schemes in practice are iterative algorithms derived for solving non-convex optimization problems, the detailed analyses of presented in our paper may also shed light on the performance of these algorithms and suggest new ideas to improve their performances.
1.3.3 Fundamental limits
It the literature of phase retrieval, it is well known that to make the signal-to-observation mapping injective one needs at least measurements  (or  in the case of real-valued models). On the other hand, the measurement thresholds obtained in this paper are and respectively. In fact, our algorithm can in principal recover the signal when and (or if continuation is not applied) for complex and real-valued models, provided that the algorithm is initialized close enough to the signal (though no known initialization strategy can accomplish this goal). Hence, our threshold are even smaller than the injectivity bounds. We emphasize that this is possible since the injectivity bounds derived in [45, 44] are defined for all (which can depend on in the worst case scenario). This is different from our assumption that is independent of , which is more relevant in applications where one has some freedom to randomize the sampling mechanism. In fact, several papers have observed that their algorithm can operate at the injectivity thresholds for real-valued models [6, 13]. These two different notions of thresholds were discussed in . In the context of phase retrieval, the reader is referred to the recent paper , which showed that by solving a compression-based optimization problem, the required number of observations for recovery is essentially the information dimension of the signal (see  for the precise definition). For instance, if the signal is -sparse and complex-valued, then measurements suffice.
1.4 Organization of the paper
2 Asymptotic analysis of
In this section, we present the asymptotic platform under which is studied, and we derive a set of equations, known as state evolution (SE), that capture the performance of under the asymptotic analysis.
2.1 Asymptotic framework and state evolution
Our analysis of is carried out based on a standard asymptotic framework developed in [40, 48]. In this framework, we let , while . Within this section, we will write , , and as , , and to make explicit their dependency on the signal dimension . In this section we focus on the complex-valued AMP. We postpone the discussion of the real-valued AMP until Section 3. Following , we introduce the following definition of converging sequences.
The sequence of instances is said to be a converging sequence if the following hold:
, as .
has i.i.d. Gaussian entries where .
The empirical distribution of converges weakly to a probability measure with bounded second moment. Further, where is the second moment of . For convenience and without loss of generality, we assume .222Otherwise, we can introduce the following normalized variables: , , , and . One can verify that the AMP.A algorithm defined in (1.6) for these normalized variables remains unchanged. Therefore, we can view that our analyses are carried out for these normalized variables; we don’t need to actually change the algorithm though.
The empirical distribution of converges weakly to .
Under the asymptotic framework introduced above, the behavior of can be characterized exactly. Roughly speaking, the estimate produced by in each iteration is approximately distributed as the (scaled) true signal additive Gaussian noise; in other words, can be modeled as , where behaves like an iid standard complex normal noise. We will clarify this claim in Theorem 1 below. The scaling constant
and the noise standard deviationevolve according to a known deterministic rule, called the state evolution (SE), defined below.
Starting from fixed , the sequences and are generated via the following recursion:
where and are respectively given by
In the above equations, the expectations are over all random variables involved: , where is independent of , and where is independent of both and . Further, the partial Wirtinger derivative is defined as:
where and are the real and imaginary parts of (i.e., ).
The functions and are well defined except when both and are zero.
Most of the analysis in this paper is concerned with the noiseless case. For brevity, we will often write (where ) as . Further, when our focus is on and rather than , we will simply write as .
In Appendix B.2, we simplify the functions and into the following expressions (with being the phase of ):
The above expressions for and are more convenient for our analysis.
The state evolution framework for generalized AMP (GAMP) algorithms  was first introduced and analyzed in  and later formally proved in . As we will show later in Theorem 1, SE characterizes the macroscopic behavior of . To apply the results in [26, 41] to AMP.A, however, we need two generalizations. First, we need to extend the results in [26, 41] to complex-valued models. This is straightforward by applying a complex-valued version of the conditioning lemma introduced in [26, 41]. Second, existing results in [26, 41] require the function to be smooth. Our simulation results in case of complex-valued show that SE predicts the performance of despite the fact that is not smooth. Since our paper is long, we postpone the proof of this claim to another paper. Instead we use the smoothing idea discussed in  to connect the SE equations presented in (2.1) with the iterations of in (1.6). Let be a small fixed number. Consider the following smoothed version of :
where refers to a vector produced by applying below component-wise:
where for , is defined as
Note that as , and hence we expect the iterations of smoothed- converge to the iterations of .
Theorem 1 (asymptotic characterization).
Let be a converging sequence of instances. For each instance, let be an initial estimate independent of . Assume that the following hold almost surely
Let be the estimate produced by the smoothed initialized by (which is independent of ) and . Let denote a sequence of smoothing parameters for which as Then, for any iteration , the following holds almost surely
where , and is independent of . Further, and are determined by (2.1) with initialization and .
2.2 Convergence of the SE for noiseless model
We now analyze the dynamical behavior of the SE. Before we proceed, we point out that in phase retrieval, one can only hope to recover the signal up to global phase ambiguity [2, 1, 4], for generic signals without any structure. In light of (2.5), AMP.A is successful if and as .
Let us start with the following interesting feature of the state evolution, which can be seen from (2.2).
For any , and satisfy the following properties:
, with being the phase of ;
Hence, if denotes the phase of , then .
In light of this lemma, we can focus on real and nonnegative values of . In particular, we assume that and we are interested in whether and under what conditions can the SE converge to the fixed point . The following two values of will play critical roles in the analysis of SE:
Our next theorem reveals the importance of . The proof of this theorem detailed in Section 4.3.
Theorem 2 (convergence of SE).
Consider the noiseless model where . If , then for any and , the sequences and defined in (2.1) converge to
Notice that is essential for the success of AMP.A. This can be seen from the fact that is always a fixed point of for any . From our definition of in Theorem 1, is equivalent to . This means that the initial estimate cannot be orthogonal to the true signal vector , otherwise there is no hope to recover the signal no matter how large is.
The following theorem describes the importance of and its proof can be found in Section 4.4.
Theorem 3 (local convergence of SE).
When , then is a fixed point of the SE in (2.2). Furthermore, if , then there exist two constants and such that the SE converges to this fixed point for any and . On the other hand if , then the SE cannot converge to except when initialized there.
According to Theorem 3, with proper initialization, SE can potentially converge to even if . However, there are two points we should emphasize here: (i) we find that when , standard initialization techniques, such as the spectral method, do not help converge to . Hence, the question of finding initialization in the basin of attraction of (when ) remains open for future research. (In Appendix F, we briefly discuss how we combine spectral initialization with AMP.A. More details will be reported in our forthcoming paper .) (ii) As decreases from to the basin of attraction of shrinks. Check the numerical results in Figure 1.
2.3 Noise sensitivity
So far we have only discussed the performance of in the ideal setting where the noise is not present in the measurements. In general, one can use (2.1) to calculate the asymptotic MSE (AMSE) of as a function of the variance of the noise and . However, as our next theorem demonstrates it is possible to obtain an explicit and informative expression for AMSE of in the high signal-to-noise ratio (SNR) regime.
Theorem 4 (noise sensitivity).
Suppose that and and . Then, in the high SNR regime the asymptotic MSE defined in (1.7) behaves as
The proof of this theorem can be found in Appendix E.
3 Extension to real-valued signals
Until now our focus is on complex-valued signals. In this section, our goal is to extend our results to real-valued signals. Since most of the results are similar to the complex-valued case, we will skip the details and only emphasize on the main differences.
In the real-valued case, uses the following iterations:
|where is given by|
where denotes the sign of . We emphasize that the divergence term contains a Dirac delta at due to the discontinuity of the sign function. This makes the calculation of the divergence in the algorithm tricky. One can use the smoothing idea we discussed in Section 2.1. Alternatively, there are several possible approaches to estimate the divergence term. These practical issues will be discussed in details in our follow-up paper .
3.2 Asymptotic Analysis
Our analysis is based on the same asymptotic framework detailed in Section 2.2. The only difference is that the measurement matrix is now real Gaussian with and . In the real-valued setting, the state evolution (SE) recursion of in (3.1) becomes the following.
Starting from fixed the sequences and are generated via the following iterations:
where, with some abuse of notations, and are now defined as
The expectations are over the following random variables: , where is independent of , and where independent of both and .
In Appendix B.3, we derived the following closed-form expressions of and :
As in the complex-valued case, we would like to study the dynamics of these two equations. The following lemma simplifies the analysis.
Again the following two values of play a critical role in the performance of AMP:
Theorem 5 (convergence of SE).
Suppose that and . For any and , the sequences and defined in (3.2) converge:
Theorem 6 (local convergence of SE).
For the noiseless setting where , is a fixed point of the SE in (2.2). Furthermore, if , then there exist two constants and such that the SE converges to this fixed point for any and . On the other hand if , then the SE cannot converge to except when initialized there.
Note that here is different from the information theoretic limit . We should emphasize that if we had not used the continuation discussed in (1.5), then the basin of attraction of would be non-empty as long as .
Finally, we discuss the performance of in the high SNR regime. See Section E for its proof.
Theorem 7 (noise sensitivity).
Suppose that and and . Then, in the high SNR regime we have