1 Introduction
Many problems in machine learning and signal processing can be reduced to solving a quadratic system of equations. For instance, in phase retrieval applications, e.g., X-ray crystallography and coherent diffraction imaging
[3, 4, 5], the structure of an object is to be recovered from the measured far-field diffracted intensity when the object is illuminated by a light source. Mathematically, such a problem amounts to recovering the signal from measurements of magnitudes only. Specifically, the problem is formulated below.
Problem 1.
Recover $\boldsymbol{x} \in \mathbb{R}^n$ (or $\mathbb{C}^n$) from $m$ measurements $y_i$ given by
(1) $y_i = \left|\langle \boldsymbol{a}_i, \boldsymbol{x} \rangle\right|, \quad i = 1, \dots, m,$
where $\boldsymbol{a}_i \in \mathbb{R}^n$ (or $\mathbb{C}^n$) are known random design vectors.
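As a concrete illustration of Problem 1, the real Gaussian measurement model can be simulated in a few lines of NumPy (a sketch with illustrative dimensions; the paper's experiments use Matlab):

```python
import numpy as np

# Simulate the real Gaussian instance of Problem 1: y_i = |<a_i, x>| with
# a_i ~ N(0, I_n) i.i.d. The sizes n and m below are illustrative only.
rng = np.random.default_rng(0)
n, m = 20, 160                       # signal dimension, number of measurements
x = rng.standard_normal(n)           # unknown signal to be recovered
A = rng.standard_normal((m, n))      # known design vectors a_i stacked as rows
y = np.abs(A @ x)                    # magnitude-only measurements
# The sign of a_i^T x is lost: x and -x yield identical measurements.
assert np.allclose(y, np.abs(A @ (-x)))
```

The assertion makes the phase ambiguity explicit: any recovery can only be up to a global sign (or phase, in the complex case).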
Various algorithms have been proposed to solve this problem since the 1970s. The error-reduction methods proposed in [6, 7] work well empirically but lack theoretical guarantees. More recently, convex relaxations of the problem have been formulated, for example, via PhaseLift [8, 9, 10] and PhaseCut [11], and the correspondingly developed algorithms typically come with performance guarantees. The reader can refer to the review paper [12] to learn more about applications and algorithms for the phase retrieval problem.
While these convex methods enjoy good theoretical guarantees, they often suffer from high computational complexity, particularly when the signal dimension is large. On the other hand, more efficient nonconvex approaches have been proposed and shown to recover the true signal as long as the initialization is good enough. [13] proposed the AltMinPhase algorithm, which alternately updates the phase and the signal, with each signal update solving a least-squares problem, and showed that AltMinPhase converges linearly and recovers the true signal with Gaussian measurements. More recently, [1] introduced the Wirtinger flow (WF) algorithm, which guarantees signal recovery via a simple gradient algorithm with only Gaussian measurements and attains accuracy within flops. More specifically, WF obtains a good initialization by the spectral method, and then minimizes the following nonconvex loss function
(2) $\ell_{WF}(\boldsymbol z) := \frac{1}{2m} \sum_{i=1}^{m} \left( \left|\boldsymbol a_i^T \boldsymbol z\right|^2 - y_i^2 \right)^2$
via the gradient descent scheme.
WF is further improved by the truncated Wirtinger flow (TWF) algorithm proposed in [2], which adopts a Poisson loss function of $|\boldsymbol a_i^T \boldsymbol z|^2$, and keeps only well-behaved measurements, based on carefully designed truncation thresholds, for calculating the initial seed and every gradient step. Such truncation helps to yield linear convergence with a certain fixed step size and reduces both the sample complexity to and the convergence time to .
It can be observed that WF uses the quadratic loss of $|\boldsymbol a_i^T \boldsymbol z|^2$, so that the optimization objective is a smooth function of $\boldsymbol a_i^T \boldsymbol z$ and the gradient step becomes simple. But this comes at the cost of increasing the order of $\boldsymbol a_i^T \boldsymbol z$ to four in the loss function. In this paper, we adopt the quadratic loss of $|\boldsymbol a_i^T \boldsymbol z|$. Although the loss function is not smooth everywhere, it reduces the order of $\boldsymbol a_i^T \boldsymbol z$ to two, and the overall curvature can be more amenable to the convergence of the gradient method. The goal of this paper is to explore the potential advantages of such a nonsmooth lower-order loss function.
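To make the contrast concrete, the following sketch evaluates both losses; the $1/(2m)$ normalization is a common convention and the problem sizes are illustrative:

```python
import numpy as np

def wf_loss(z, A, y):
    # quadratic loss of |a_i^T z|^2: fourth order in z, smooth everywhere
    return np.mean((np.abs(A @ z) ** 2 - y ** 2) ** 2) / 2

def rwf_loss(z, A, y):
    # quadratic loss of |a_i^T z|: second order in z, nonsmooth where a_i^T z = 0
    return np.mean((np.abs(A @ z) - y) ** 2) / 2

rng = np.random.default_rng(1)
n, m = 10, 80
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)
# Both losses vanish at the global optima x and -x.
assert np.isclose(rwf_loss(x, A, y), 0.0) and np.isclose(wf_loss(-x, A, y), 0.0)
```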
Furthermore, incremental/stochastic methods have been proposed to solve Problem 1. Specifically, the Kaczmarz method for phase retrieval (KaczmarzPR) [14, 15] has been shown to have excellent empirical performance, but no global convergence guarantee has been established for it. Incremental truncated Wirtinger flow (ITWF) [16] is a stochastic algorithm developed based on TWF, and it exhibits linear convergence to the true signal once initialized properly. In this paper, we consider the incremental/stochastic version of RWF (IRWF) and compare its performance with KaczmarzPR and ITWF, in order to further demonstrate the advantage of the lower-order loss function.
1.1 Our Contribution
This paper adopts the following loss function:^1
(3) $\ell(\boldsymbol z) := \frac{1}{2m} \sum_{i=1}^{m} \left( \left|\boldsymbol a_i^T \boldsymbol z\right| - y_i \right)^2.$
^1 The loss function (3) was also used in [7] to derive a gradient-like update for the phase retrieval problem with Fourier magnitude measurements. However, the focus of this paper is to characterize the global convergence guarantee for such an algorithm with appropriate initialization, which was not studied in [7].
Compared to the loss function (2) in WF, which adopts $|\boldsymbol a_i^T \boldsymbol z|^2$, the above loss function adopts the absolute value/magnitude $|\boldsymbol a_i^T \boldsymbol z|$ and hence has lower-order variables. For such a nonconvex and nonsmooth loss function, we develop a gradient-descent-like algorithm, which sets the "gradient" component corresponding to nonsmooth samples to zero. We refer to such an algorithm, together with an initialization using a new spectral method (different from that employed in TWF or WF), as reshaped Wirtinger flow (RWF). We show that the lower-order loss function has a great advantage in both statistical and computational efficiency, despite sacrificing smoothness. In fact, the curvature of such a loss function behaves similarly to that of a least-squares loss function in the neighborhood of the global optima (see Section 2.2), and hence RWF converges fast. The nonsmoothness does not significantly affect the convergence of the algorithm, because the algorithm encounters nonsmooth points for some samples only with negligible probability, and those samples are furthermore set not to contribute to the gradient direction by the algorithm.
We further exploit the loss function (3) to design an incremental/stochastic reshaped Wirtinger flow (IRWF), and we show that IRWF also enjoys the advantageous local curvature of such a loss function and achieves excellent statistical and computational performance. In particular, IRWF numerically outperforms other competitive incremental methods (ITWF and KaczmarzPR) as well as batch algorithms (RWF, TWF and WF).
We summarize our main results as follows.

Statistically, we show that RWF recovers the true signal with samples when the design vectors consist of independently and identically distributed (i.i.d.) Gaussian entries, which is optimal in the order sense. Thus, even without truncation in the gradient steps (truncation is used only in the initialization stage), RWF improves the sample complexity of WF and achieves the same sample complexity as TWF, which uses truncation in the gradient step. It is thus more robust to random measurements.

Computationally, RWF converges geometrically to the true signal, requiring flops to reach accuracy. Again, without truncation in the gradient steps, RWF improves the computational cost of WF and achieves the same computational cost as TWF.

Numerically, RWF adopts a fixed step size and does not require truncation, which avoids the trouble of setting truncation thresholds in practice. It is generally two times faster than TWF and four to six times faster than WF in terms of the number of iterations and the time cost.

We also show that RWF is robust to bounded additive noise. The estimation error is shown to diminish geometrically to a level determined by the noise power, up to a certain coefficient. Experiments with Poisson noise further corroborate the stability guarantee.

We further propose an incremental/stochastic algorithm based on RWF (referred to as IRWF) and show that IRWF converges to the true signal geometrically under an appropriate initialization. More interestingly, we show that randomized KaczmarzPR (i.e., the Kaczmarz method adapted for phase retrieval [14]) can be viewed as IRWF with a specific choice of step size, via which we further establish a geometric convergence guarantee for randomized KaczmarzPR.
Compared to WF and TWF, the new form of the gradient step, due to the nonsmoothness of the loss function, requires new developments of bounding techniques in the technical analysis. On the other hand, our proof of the performance guarantee is much simpler, because the lower-order loss function allows us to bypass higher-order moments of variables as well as truncation in the gradient steps. We also anticipate that such analysis is more easily extendable.
1.2 Connection to Related Work
Along the line of developing nonconvex algorithms with global performance guarantees for the phase retrieval problem, [13] developed an alternating minimization algorithm, [1, 2, 17, 18] developed/studied first-order gradient-like algorithms, and a recent study [19] characterized the geometric structure of the nonconvex objective and designed a second-order trust-region algorithm. This paper is most closely related to [1, 2, 20, 17], but develops a new gradient-like algorithm based on a lower-order nonsmooth (as well as nonconvex) loss function that yields advantageous statistical/computational efficiency.
Stochastic algorithms have also been developed for the phase retrieval problem. [16] studied the incremental truncated Wirtinger flow (ITWF) and showed that ITWF needs far fewer passes over the data than TWF to reach the same accuracy. [14] adapted the Kaczmarz method to solve the phase retrieval problem and demonstrated its fast empirical convergence. We propose a stochastic algorithm based on RWF (IRWF) and show that it has a close connection with KaczmarzPR. We also show that IRWF runs faster than ITWF due to the benefit of the lower-order loss function.
After our work was posted on arXiv, an independent work [21] was subsequently posted, which also adopts the same loss function but develops a slightly different algorithm, TAF (truncated amplitude flow). One major difference of our RWF from TAF is that RWF does not require truncation in the gradient loop, while TAF employs truncation. Hence, RWF has fewer parameters to tune and is easier to implement in practice. Furthermore, RWF demonstrates the performance advantage of adopting a lower-order loss function even without truncation, which cannot be observed from TAF. Moreover, we analyze a stochastic algorithm based on the new loss function, while [21] does not.
More generally, various problems have been studied by minimizing nonconvex loss functions. A partial list of these studies includes matrix completion [22, 23, 24, 25, 26, 27, 28, 29], low-rank matrix recovery [30, 31, 32, 33, 34, 35], robust PCA [36], robust tensor decomposition [37], dictionary learning [38, 39], community detection [40], phase synchronization [41], blind deconvolution [42, 43], etc. For minimizing a general nonconvex nonsmooth objective, various algorithms have been proposed, such as the gradient sampling algorithm [44, 45] and the majorization-minimization method [46]. These algorithms were often shown to converge to critical points, which may be local minimizers or saddle points, without an explicit characterization of the convergence rate. In contrast, our algorithm is specifically designed for the phase retrieval problem, and can be shown to converge linearly to the global optimum under appropriate initialization.
The advantage of the nonsmooth loss function exhibited in our study is analogous in spirit to that of the rectifier activation function (of the form $\max\{0, x\}$) in neural networks. It has been shown that the rectified linear unit (ReLU) enjoys a superb advantage in reducing the training time [47] and promoting sparsity [48] over its counterparts, the sigmoid and hyperbolic tangent functions, in spite of its nonlinearity and nondifferentiability at zero. Our result in fact also demonstrates that a nonsmooth but simpler loss function yields improved performance.
1.3 Paper Organization and Notations
The rest of this paper is organized as follows. Section 2 describes the RWF algorithm in detail and establishes its performance guarantee. Section 3 introduces the IRWF algorithm, establishes its performance guarantee, and compares it with existing stochastic algorithms. Section 4 compares RWF and IRWF with other competitive algorithms numerically. Finally, Section 5 concludes the paper with comments on future directions.
Throughout the paper, boldface lowercase letters such as $\boldsymbol{x}$ denote vectors, and boldface capital letters such as $\boldsymbol{A}$ denote matrices. For two matrices, $\boldsymbol A \succ \boldsymbol B$ means that $\boldsymbol A - \boldsymbol B$ is positive definite. For a complex matrix or vector, $\boldsymbol A^*$ and $\boldsymbol a^*$ denote the conjugate transposes of $\boldsymbol A$ and $\boldsymbol a$, respectively. For a real matrix or vector, $\boldsymbol A^T$ and $\boldsymbol a^T$ denote the transposes of $\boldsymbol A$ and $\boldsymbol a$, respectively. The indicator function $\mathbf{1}_E = 1$ if the event $E$ is true, and $0$ otherwise.
2 Reshaped Wirtinger Flow
Consider the problem of recovering a signal $\boldsymbol x$ based on $m$ measurements $y_i$ given by
(4) $y_i = \left|\boldsymbol a_i^T \boldsymbol x\right|, \quad i = 1, \dots, m,$
where $\boldsymbol a_i$ for $i = 1, \dots, m$ are known measurement vectors, independently generated according to the Gaussian distribution $\mathcal N(\boldsymbol 0, \boldsymbol I_n)$. It can be observed that if $\boldsymbol x$ is a solution, i.e., it satisfies (1), then $e^{j\phi}\boldsymbol x$ for any phase $\phi$ is also a solution, so the recovery is up to a global phase difference. Thus, we define the Euclidean distance between two vectors up to a global phase difference [1] as, for complex signals,
(5) $\mathrm{dist}(\boldsymbol z, \boldsymbol x) := \min_{\phi \in [0, 2\pi)} \left\| \boldsymbol z - e^{j\phi} \boldsymbol x \right\|,$
which is simply $\min\{\|\boldsymbol z - \boldsymbol x\|, \|\boldsymbol z + \boldsymbol x\|\}$ for the real case. We focus on the real-valued case in the analysis, but the algorithm designed below is applicable to the complex-valued case and the case with coded diffraction patterns (CDP), as we demonstrate via numerical results in Section 4.
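For reference, the distance up to a global phase can be computed as follows; this is an illustrative helper, using the closed-form optimal phase for the complex case:

```python
import numpy as np

def dist(z, x):
    """Euclidean distance between z and x up to a global phase."""
    if np.iscomplexobj(z) or np.iscomplexobj(x):
        # the minimizing phase aligns z with x: phi = angle(z^H x)
        phi = np.angle(np.vdot(z, x))
        return np.linalg.norm(x - np.exp(1j * phi) * z)
    # real case: the only phase ambiguity is the sign
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))

x = np.array([1.0, -2.0, 3.0])
assert dist(-x, x) == 0.0            # -x is an equally valid solution
# a complex signal rotated by a global phase is at distance (numerically) zero
c = np.array([1.0 + 0j, 2.0 + 0j])
assert np.isclose(dist(1j * c, c), 0.0)
```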
We design RWF (see Algorithm 1) for solving the above problem, which contains two stages: spectral initialization and gradient loop. Suggested values for the parameters are and .^2 (^2 For the complex Gaussian case, we suggest .) The scaling parameter in and the conjugate transpose allow the algorithm to be readily applicable to the complex and CDP cases. We next describe the two stages of the algorithm in detail in Sections 2.1 and 2.2, respectively, and establish the convergence of the algorithm in Section 2.3. Finally, we provide the stability result of RWF in Section 2.4.
Input: , ;
Parameters: Lower and upper thresholds for truncation in initialization, step size ;
Initialization: Let , where and
is the leading eigenvector of
(6) 
Gradient loop: for $t = 0$ to $T - 1$ do
(7) $\boldsymbol z^{(t+1)} = \boldsymbol z^{(t)} - \frac{\mu}{m} \sum_{i=1}^{m} \left( \boldsymbol a_i^T \boldsymbol z^{(t)} - y_i \cdot \mathrm{sgn}\!\left(\boldsymbol a_i^T \boldsymbol z^{(t)}\right) \right) \boldsymbol a_i.$
Output .
2.1 Initialization via Spectral Method
Different from the spectral initialization methods for WF in [1] and for TWF in [2], both of which are based on , we propose an alternative initialization in Algorithm 1 that uses the magnitude instead, and truncates samples with both lower and upper thresholds as in (6). We show that such initialization achieves a smaller sample complexity than WF and the same order-level sample complexity as TWF, and furthermore, performs better than both WF and TWF numerically.
Our initialization consists of estimating both the norm and the direction of $\boldsymbol x$. The norm estimate of $\boldsymbol x$ is given by $\lambda$ in Algorithm 1, with mathematical justification provided in Appendix A. Intuitively, with real Gaussian measurements, the scaling coefficient satisfies $\mathbb{E}\left|\boldsymbol a_i^T \boldsymbol x\right| = \sqrt{2/\pi}\, \|\boldsymbol x\|$. Moreover, the measurements $y_i$, for $i = 1, \dots, m$, are independent sub-Gaussian random variables with mean $\sqrt{2/\pi}\, \|\boldsymbol x\|$, and thus their empirical mean concentrates around this value. Combining these two facts yields the desired argument.

The direction of $\boldsymbol x$ is approximated by the leading eigenvector of the truncated matrix in (6), because this matrix approaches its expectation by concentration of measure, and the leading eigenvector of the expectation takes the form $c\boldsymbol x$ for some scalar $c$. We note that (6) involves truncation of samples from both sides, in contrast to the truncation only by an upper threshold in [2]. This difference is due to the following reason. In the high-dimensional setting, two random vectors are almost perpendicular to each other [49], which indicates that there is a considerable number of samples $y_i$ with small values, i.e., less than 1. These samples with small values bias the direction of the leading eigenvector away from $\boldsymbol x$, and their effect cannot be offset or neglected if there is only a moderate number of samples. Specifically, [2] uses $y_i^2$ to weight the contribution of $\boldsymbol a_i \boldsymbol a_i^T$, and the square power helps to reduce the contribution of bad directions (i.e., samples with small values). In contrast, we use $y_i$ to weight the contribution of $\boldsymbol a_i \boldsymbol a_i^T$ and apply truncation from below to filter out bad directions directly.
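The initialization described above can be sketched as follows; the truncation thresholds and the $\sqrt{\pi/2}$ scaling are illustrative choices for real Gaussian measurements, not the paper's tuned constants:

```python
import numpy as np

def spectral_init(A, y, alpha_l=1.0, alpha_u=5.0):
    """Two-sided-truncated spectral initialization (illustrative thresholds)."""
    m, n = A.shape
    # norm estimate: E|a^T x| = sqrt(2/pi) * ||x||, so invert the scaling
    norm_est = np.sqrt(np.pi / 2) * np.mean(y)
    # keep only samples whose magnitude is neither too small nor too large
    keep = (y > alpha_l * np.mean(y)) & (y < alpha_u * np.mean(y))
    # (1/m) * sum over kept samples of y_i * a_i a_i^T
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    _, V = np.linalg.eigh(Y)          # eigenvalues in ascending order
    return norm_est * V[:, -1]        # scaled leading eigenvector

rng = np.random.default_rng(7)
n, m = 20, 4000
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)
z0 = spectral_init(A, y)
rel_err = min(np.linalg.norm(z0 - x), np.linalg.norm(z0 + x)) / np.linalg.norm(x)
```

With a generous sample size the relative error of this initialization is small, which is what the gradient loop needs to start inside the basin of attraction.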
We next provide the formal statement of the performance guarantee for the initialization step that we propose.
Proposition 1.
Fix . The initialization step in Algorithm 1 yields satisfying with probability at least , if , where is some positive constant and is a positive number that depends only on and .
Proof.
See Appendix A. ∎
Finally, Figure 1 demonstrates that RWF achieves better initialization accuracy in terms of the relative error than WF and TWF. Furthermore, we also include in the comparison the orthonormal-promoting initialization method proposed for truncated amplitude flow (TAF) in the independent work [21]. It can be seen that our initialization is slightly better.
2.2 Gradient Loop and Why RWF is Fast
We let the update direction be as follows:
(9) $\nabla \ell(\boldsymbol z) := \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol a_i^T \boldsymbol z - y_i \cdot \mathrm{sgn}\!\left(\boldsymbol a_i^T \boldsymbol z\right) \right) \boldsymbol a_i,$
where $\mathrm{sgn}(\cdot)$ is the sign function for nonzero arguments. We further set $\mathrm{sgn}(0) = 0$. In fact, $\nabla \ell(\boldsymbol z)$ equals the gradient of the loss function (8) if $\boldsymbol a_i^T \boldsymbol z \neq 0$ for all $i$. For samples at nonsmooth points, i.e., with $\boldsymbol a_i^T \boldsymbol z = 0$, we adopt the Fréchet superdifferential [50] for nonconvex functions to set the corresponding gradient component to zero (as zero is an element of the Fréchet superdifferential). With an abuse of terminology, we still refer to $\nabla \ell(\boldsymbol z)$ in (9) as the "gradient" for simplicity; it rather represents the update direction in the gradient loop of Algorithm 1.
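A minimal sketch of the resulting gradient loop is given below; the step size and the warm start are illustrative, and the paper instead initializes with the truncated spectral method of Section 2.1:

```python
import numpy as np

# Sketch of the RWF gradient loop with the update direction above.
rng = np.random.default_rng(3)
n, m = 40, 1200
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)                      # magnitude-only measurements

z = x + 0.1 * rng.standard_normal(n)   # stand-in for the spectral initialization
mu = 0.6                               # fixed step size (illustrative)
for _ in range(200):
    Az = A @ z
    # np.sign(0) = 0, so nonsmooth samples contribute nothing to the direction
    z = z - (mu / m) * (A.T @ (Az - y * np.sign(Az)))
rel_err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
```

Starting from a point near the truth, the iterates converge geometrically to $\pm x$, matching the behavior analyzed in Section 2.3.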
We next provide the intuition about why reshaped WF is fast. Suppose that the spectral method sets an initial point in the neighborhood of the ground truth $\boldsymbol x$. We compare RWF with the following problem of solving $\boldsymbol x$ from linear equations, with both $\{\tilde y_i = \boldsymbol a_i^T \boldsymbol x\}$ and $\{\boldsymbol a_i\}$ given. In particular, we note that this problem observes both the magnitude and the sign of the measurements. In this case, it is natural to use the least-squares loss. Assume that gradient descent is applied to solve this problem. Then the gradient is given by
(10) $\nabla \ell_{LS}(\boldsymbol z) := \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol a_i^T \boldsymbol z - \boldsymbol a_i^T \boldsymbol x \right) \boldsymbol a_i.$
We now argue intuitively that the gradient (9) of RWF behaves similarly to the least-squares gradient (10). For each $i$, $y_i \cdot \mathrm{sgn}(\boldsymbol a_i^T \boldsymbol z) = \pm\, \boldsymbol a_i^T \boldsymbol x$, and hence the two gradient components are close if $y_i \cdot \mathrm{sgn}(\boldsymbol a_i^T \boldsymbol z)$ is viewed as an estimate of $\boldsymbol a_i^T \boldsymbol x$. The following lemma shows that if $\mathrm{dist}(\boldsymbol z, \boldsymbol x)$ is small (guaranteed by initialization), then $\boldsymbol a_i^T \boldsymbol z$ has the same sign as $\boldsymbol a_i^T \boldsymbol x$ for large $|\boldsymbol a_i^T \boldsymbol x|$.
Lemma 1.
Let . For any given and , independent of , satisfying , we have
(11) 
where .
Proof.
See Appendix B.2. ∎
It is easy to observe in (11) that $\boldsymbol a_i^T \boldsymbol z$ is likely to have the same sign as $\boldsymbol a_i^T \boldsymbol x$ when $|\boldsymbol a_i^T \boldsymbol x|$ is large, so that the corresponding gradient components in (9) and (10) are likely equal, whereas when $|\boldsymbol a_i^T \boldsymbol x|$ is small, $\boldsymbol a_i^T \boldsymbol z$ may have a different sign from $\boldsymbol a_i^T \boldsymbol x$, but such samples contribute less to the gradient. Hence, overall the two gradients (9) and (10) should be close to each other with large probability.
This fact can be further verified numerically. Figure 2(a) illustrates that RWF takes almost the same number of iterations to recover a signal (with only magnitude information) as the least-squares gradient descent method takes to recover a signal (with both magnitude and sign information).
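This closeness can also be checked directly; the sizes and the perturbation level below are illustrative:

```python
import numpy as np

# Compare the RWF update direction (9) against the least-squares gradient (10)
# computed with the signed measurements a_i^T x, at a point near the truth.
rng = np.random.default_rng(2)
n, m = 50, 2000
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

z = x + 0.1 * rng.standard_normal(n)           # point near the ground truth
Az = A @ z
g_rwf = A.T @ (Az - y * np.sign(Az)) / m        # RWF "gradient" (9); sign(0) = 0
g_ls = A.T @ (Az - A @ x) / m                   # least-squares gradient (10)
rel_gap = np.linalg.norm(g_rwf - g_ls) / np.linalg.norm(g_ls)
cos_sim = (g_rwf @ g_ls) / (np.linalg.norm(g_rwf) * np.linalg.norm(g_ls))
```

Near the truth, only the few samples with small $|\boldsymbol a_i^T \boldsymbol x|$ can flip sign, so the two directions nearly coincide.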



Furthermore, Figure 3 illustrates a quadratic function (Figure 3(a)), the expected loss surface of RWF (Figure 3(b); see Appendix B for the expression), and the expected loss surface of WF (Figure 3(c); see Appendix B for the expression). It can be seen that the loss of RWF, rather than that of WF, has a curvature similar to the quadratic function around the global optima, which justifies its better performance compared to WF.
2.3 Geometric Convergence of RWF
We characterize the convergence of RWF in the following theorem.
Theorem 1.
Outline of the Proof.
We outline the proof here with details relegated to Appendix C. Compared to WF and TWF, our proof requires new development of bounding techniques to deal with nonsmoothness, but is much simpler due to the lowerorder loss function that RWF relies on.
We first introduce a global phase notation for the real case as follows:
(13) $\boldsymbol h := \begin{cases} \boldsymbol z - \boldsymbol x, & \text{if } \|\boldsymbol z - \boldsymbol x\| \le \|\boldsymbol z + \boldsymbol x\|, \\ \boldsymbol z + \boldsymbol x, & \text{otherwise.} \end{cases}$
For the sake of simplicity, we let $\boldsymbol h$ be $\boldsymbol z - \boldsymbol x$, which indicates that $\boldsymbol z$ is always in the neighborhood of $\boldsymbol x$ rather than $-\boldsymbol x$.
Here, the central idea is to show that within the neighborhood of the global optima, RWF satisfies the Regularity Condition [2], i.e.,
(14) $\left\langle \nabla \ell(\boldsymbol z),\, \boldsymbol h \right\rangle \ge \frac{\mu}{2} \left\| \nabla \ell(\boldsymbol z) \right\|^2 + \frac{\lambda}{2} \left\| \boldsymbol h \right\|^2$
for all $\boldsymbol z$ obeying $\|\boldsymbol h\| \le \epsilon \|\boldsymbol x\|$, where $\lambda$ is some constant. Then, as shown in [2], once the initialization lands in this neighborhood, geometric convergence is guaranteed, i.e.,
(15) $\mathrm{dist}^2\!\left(\boldsymbol z^{(t+1)}, \boldsymbol x\right) \le (1 - \mu\lambda)\, \mathrm{dist}^2\!\left(\boldsymbol z^{(t)}, \boldsymbol x\right)$
for any $\boldsymbol z^{(t)}$ with $\mathrm{dist}(\boldsymbol z^{(t)}, \boldsymbol x) \le \epsilon \|\boldsymbol x\|$.
We note that (17) implies an upper bound , obtained by taking and to be sufficiently small. This suggests a range for setting the step size in Algorithm 1. However, in practice, the step size can be set much larger than such a bound, say , while still keeping the algorithm convergent. This is because the coefficients in the proof are chosen for convenience of analysis rather than being tightly optimized.
Theorem 1 indicates that RWF recovers the true signal with samples, which is order-level optimal. Such an algorithm improves the sample complexity of WF. Furthermore, RWF does not require truncation of weak samples in the gradient step to achieve the same sample complexity as TWF. This is mainly because RWF benefits from the lower-order loss function given in (8), whose curvature locally behaves similarly to that of the least-squares loss function, as we explain in Section 2.2.
Theorem 1 also suggests that RWF converges geometrically with a constant step size. To reach accuracy, it requires a computational cost of flops, which is better than WF (). Furthermore, it does not require truncation in the gradient steps to achieve the same computational cost as TWF. Numerically, as we demonstrate in Section 4, RWF is two times faster than TWF and four to six times faster than WF in terms of both the iteration count and the time cost in various examples.
2.4 Stability to Bounded Noise
We have established that RWF guarantees exact recovery with geometric convergence in the noise-free case. We now study RWF in the presence of noise. Suppose the measurements are corrupted by bounded noise and are given by
(18) $y_i = \left|\boldsymbol a_i^T \boldsymbol x\right| + \eta_i, \quad i = 1, \dots, m,$
where $\eta_i$ denotes the additive noise. Then the following theorem shows that RWF is robust under such noise corruption.
Theorem 2.
Proof.
See Appendix D. ∎
The numerical result under the Poisson noise model in Section 4 further corroborates the stability of RWF.
3 Incremental Reshaped Wirtinger Flow
In large-sample and online scenarios, stochastic algorithms are preferred due to their potential advantages of fast convergence and low computational complexity. Thus, in this section, we develop a stochastic algorithm based on RWF, named incremental reshaped Wirtinger flow (IRWF). We show that IRWF guarantees exact recovery with a linear convergence rate. We further draw the connection between IRWF and the KaczmarzPR algorithm recently developed for phase retrieval [14, 15, 51].
3.1 IRWF: Algorithm and Convergence
We describe IRWF in Algorithm 2. More specifically, IRWF applies the same initialization step as in RWF, but calculates each gradient update using only one sample selected randomly.
Input: , ;
Initialization: Same as in RWF (Algorithm 1);
Gradient loop: for do
Choose $i_t$ uniformly at random from $\{1, \dots, m\}$, and let
(20) $\boldsymbol z^{(t+1)} = \boldsymbol z^{(t)} - \mu \left( \boldsymbol a_{i_t}^T \boldsymbol z^{(t)} - y_{i_t} \cdot \mathrm{sgn}\!\left(\boldsymbol a_{i_t}^T \boldsymbol z^{(t)}\right) \right) \boldsymbol a_{i_t}.$
Output .
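The IRWF loop can be sketched as follows; the step size $\mu = 1/n$ is an illustrative choice motivated by the Kaczmarz connection discussed later in this section, and the warm start stands in for the spectral initialization:

```python
import numpy as np

# Sketch of the IRWF inner loop: one uniformly sampled measurement per update.
rng = np.random.default_rng(4)
n, m = 30, 900
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

z = x + 0.1 * rng.standard_normal(n)   # stand-in for the spectral initialization
mu = 1.0 / n                           # illustrative: ||a_i||^2 concentrates at n
for _ in range(10 * m):                # 10 effective passes over the data
    i = rng.integers(m)
    r = A[i] @ z
    z = z - mu * (r - y[i] * np.sign(r)) * A[i]
rel_err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
```

Because the variance of the stochastic update shrinks as the iterate approaches the truth, a fixed step size still yields linear convergence here.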
We characterize the convergence of IRWF in the following theorem.
Theorem 3.
Consider the problem of solving for any given signal from the system of equations (4) with Gaussian measurement vectors. There exist universal constants and such that if and , then with probability at least , Algorithm 2 yields
(21) 
where the expectation is taken with respect to the algorithmic randomness, conditioned on the high-probability event of the random measurements.
We suggest the step size in practice.
Proof.
See Appendix E.1. ∎
Theorem 3 establishes that IRWF achieves linear convergence to the global optimum. For general objectives, an incremental/stochastic first-order method is not expected to achieve linear convergence, due to the variance of the stochastic gradient. However, for our specific problem, the variance of the stochastic gradient diminishes as the estimate approaches the true solution, and hence a fixed step size can be employed and linear convergence can be established. Such a result was also established in [16] for the stochastic algorithm based on TWF (ITWF). We comment further on the comparison between our algorithm and ITWF in Section 3.4. Another explanation may be the fact [52, 53] that the stochastic gradient method yields linear convergence to the minimizer when the objective is a smooth and strongly convex function whose minimizer also minimizes all component functions. This may also hold for our objective (3), whose summands share the same minimizer, although it is neither convex nor smooth.
3.2 Minibatch IRWF: Algorithm and Convergence
In order to fully exploit the processing throughput of CPUs/GPUs, we develop a minibatch version of IRWF, described in Algorithm 3. Minibatch IRWF applies the same initialization step as RWF, but uses a minibatch of data for each gradient update, in contrast to IRWF, which uses only a single sample.
Input: , ;
Initialization: Same as in RWF (Algorithm 1);
Gradient loop: for do
Choose $\Gamma_t$ uniformly at random from the subsets of $\{1, \dots, m\}$ with cardinality $k$, and let
(22) $\boldsymbol z^{(t+1)} = \boldsymbol z^{(t)} - \frac{\mu}{|\Gamma_t|}\, \boldsymbol A_{\Gamma_t}^T \left( \boldsymbol A_{\Gamma_t} \boldsymbol z^{(t)} - \boldsymbol y_{\Gamma_t} \odot \mathrm{ph}\!\left(\boldsymbol A_{\Gamma_t} \boldsymbol z^{(t)}\right) \right),$
where $\boldsymbol A_{\Gamma_t}$ is a matrix stacking $\boldsymbol a_i^T$ for $i \in \Gamma_t$ as its rows, $\boldsymbol y_{\Gamma_t}$ is a vector stacking $y_i$ for $i \in \Gamma_t$ as its elements, $\odot$ denotes the elementwise product, and $\mathrm{ph}(\cdot)$ denotes the phase vector of its argument (for the real case, the elementwise sign).
Output .
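A sketch of the minibatch loop, with an illustrative batch size and step size:

```python
import numpy as np

# Sketch of minibatch IRWF: a random index set of size k per update.
rng = np.random.default_rng(5)
n, m, k = 30, 900, 64
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

z = x + 0.1 * rng.standard_normal(n)   # stand-in for the spectral initialization
mu = 0.6                               # illustrative step size
for _ in range(300):
    idx = rng.choice(m, size=k, replace=False)   # random minibatch Gamma_t
    Ag, yg = A[idx], y[idx]
    r = Ag @ z
    # for real signals the phase vector ph(.) reduces to the elementwise sign
    z = z - (mu / k) * (Ag.T @ (r - yg * np.sign(r)))
rel_err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
```

Batched updates trade a slightly noisier direction for much better use of vectorized matrix-vector products.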
We characterize the convergence of minibatch IRWF in the following theorem.
Theorem 4.
Consider the problem of solving for any given signal from the system of equations (4) with Gaussian measurement vectors. There exist universal constants and such that if and , then with probability at least , Algorithm 3 yields
(23) 
where the expectation is taken with respect to the algorithmic randomness, conditioned on the high-probability event of the random measurements.
Proof.
See Appendix E.2. ∎
We suggest that in practice.
3.3 Connection to Kaczmarz Method for Phase Retrieval
The Kaczmarz method was originally developed for solving systems of linear equations [54]. In [14], it was adapted to solve the phase retrieval problem, and we refer to this variant as KaczmarzPR. It has been demonstrated in [14] that KaczmarzPR exhibits better empirical performance than error reduction (ER) [6, 7] and Wirtinger flow (WF) [1]. However, a theoretical guarantee for KaczmarzPR had not been well established, although the Kaczmarz method for the least-squares problem achieves a linear convergence guarantee [55, 56]. For instance, [14] obtained a bound on the estimation error that can be as large as the signal energy no matter how many iterations are taken, and [15] requires an infinite number of samples to establish asymptotic convergence.
In this section, we draw the connection between IRWF and KaczmarzPR, through which the theoretical guarantee of KaczmarzPR can be established by adapting that of IRWF. This is analogous to the connection established in [53] between the Kaczmarz method and the stochastic gradient method for solving the least-squares problem. Here, the connection is interesting because it is RWF, rather than WF or TWF, that connects to KaczmarzPR, owing to the lower-order loss function that RWF adopts.
To be more specific, KaczmarzPR (Algorithm 3 in [14]) employs the following update rule:
(24) $\boldsymbol z^{(t+1)} = \boldsymbol z^{(t)} + \frac{ y_{i_t} \cdot \mathrm{sgn}\!\left(\boldsymbol a_{i_t}^T \boldsymbol z^{(t)}\right) - \boldsymbol a_{i_t}^T \boldsymbol z^{(t)} }{ \left\| \boldsymbol a_{i_t} \right\|^2 }\, \boldsymbol a_{i_t},$
where $i_t$ is selected either in a deterministic manner or randomly. We focus on randomized KaczmarzPR, where $i_t$ is selected uniformly at random.
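The randomized KaczmarzPR iteration can be sketched as follows (each step projects the iterate onto the hyperplane determined by the current sign guess; sizes and pass counts are illustrative):

```python
import numpy as np

# Sketch of randomized Kaczmarz for phase retrieval: project z onto the
# hyperplane a_i^T z = y_i * sign(a_i^T z) chosen by the current sign guess.
rng = np.random.default_rng(6)
n, m = 30, 900
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

z = x + 0.1 * rng.standard_normal(n)   # warm start inside the basin of attraction
for _ in range(10 * m):                # 10 effective passes over the data
    i = rng.integers(m)
    r = A[i] @ z
    z = z + ((y[i] * np.sign(r) - r) / (A[i] @ A[i])) * A[i]
rel_err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
```

Note the only difference from a single-sample IRWF step is that the fixed step size is replaced by $1/\|\boldsymbol a_i\|^2$.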
Comparing (24) and (20), the update rule of randomized KaczmarzPR is a special case of IRWF with the step size $\mu$ replaced by $1/\|\boldsymbol a_{i_t}\|^2$. Moreover, these two update rules are close if $\mu$ is set as suggested, because $\|\boldsymbol a_{i_t}\|^2$ concentrates around $n$ by the law of large numbers. As we demonstrate in the empirical results (see Table 1), the two methods have similar performance, as anticipated. Thus, following the convergence result of Theorem 3 for IRWF, we obtain the following convergence guarantee for randomized KaczmarzPR.
Theorem 5.
Assume the measurement vectors $\boldsymbol a_i$ are independent, with each $\boldsymbol a_i \sim \mathcal N(\boldsymbol 0, \boldsymbol I_n)$. There exist some universal constants and such that if , then with probability at least , the randomized KaczmarzPR update rule (24) yields
(25) 
which holds for all $\boldsymbol z^{(t)}$ satisfying .
Proof.
See Appendix E.3. ∎
The above theorem implies that once the estimate enters the neighborhood of the true solutions (often referred to as the basin of attraction), the error diminishes geometrically in expectation with each update.
Furthermore, [14] also provided a block KaczmarzPR (similar to the minibatch version), whose update rule is given by
(26) $\boldsymbol z^{(t+1)} = \boldsymbol z^{(t)} + \boldsymbol A_{\Gamma_t}^{\dagger} \left( \boldsymbol y_{\Gamma_t} \odot \mathrm{ph}\!\left(\boldsymbol A_{\Gamma_t} \boldsymbol z^{(t)}\right) - \boldsymbol A_{\Gamma_t} \boldsymbol z^{(t)} \right),$
where $\odot$ is the elementwise product, $\mathrm{ph}(\cdot)$ denotes the phase vector of its argument (stacking the phase of each element together), $\boldsymbol A_{\Gamma_t}$ is a matrix with each row being a measurement vector $\boldsymbol a_i^T$ for $i \in \Gamma_t$, and $\Gamma_t$ is the block selected at iteration $t$, containing the chosen row indices. Moreover, $\boldsymbol A_{\Gamma_t}^{\dagger}$ represents the Moore–Penrose pseudoinverse, which can be computed as
(27) $\boldsymbol A_{\Gamma_t}^{\dagger} = \boldsymbol A_{\Gamma_t}^T \left( \boldsymbol A_{\Gamma_t} \boldsymbol A_{\Gamma_t}^T \right)^{-1}.$
Comparing (26) with the minibatch IRWF update in (22), the two update rules are similar to each other when $\boldsymbol A_{\Gamma_t}^{\dagger}$ is close to a scaled version of $\boldsymbol A_{\Gamma_t}^T$. For the case with Gaussian measurements, $\boldsymbol A_{\Gamma_t}$ has linearly independent rows with high probability if the block size does not exceed $n$, and hence $\boldsymbol A_{\Gamma_t} \boldsymbol A_{\Gamma_t}^T$ is not far from a scaled identity. Our empirical results (see Table 1) further suggest a similar convergence rate for these two algorithms with the same block size.
Next, we argue that in the CDP setting, block KaczmarzPR is the same as minibatch IRWF with an appropriate step size. The CDP measurements are collected in the form
(28) $\boldsymbol y^{(l)} = \left| \boldsymbol F \boldsymbol D^{(l)} \boldsymbol x \right|, \quad l = 1, \dots, L,$
where $\boldsymbol F$ represents the discrete Fourier transform (DFT) matrix, and $\boldsymbol D^{(l)}$ is a diagonal matrix (mask). We choose the block size to be $n$, the dimension of the signal, for the convenience of the Fourier transform. Then $\boldsymbol A_{\Gamma_t}$ becomes a Fourier transform composed with the mask effect, and $\boldsymbol A_{\Gamma_t}^{\dagger}$ becomes the conjugate transpose scaled by the inverse Fourier transform normalization. Therefore, $\boldsymbol A_{\Gamma_t}^{\dagger} = \frac{1}{n} \boldsymbol A_{\Gamma_t}^{*}$ if the diagonal elements of the mask have unit magnitude. Taking the step size accordingly, the two algorithms are identical.

On the other hand, since block KaczmarzPR needs to calculate a matrix inverse or to solve an inverse problem, the block size cannot be too large. However, minibatch IRWF works well for a wide range of batch sizes, which can even vary with the signal dimension, as long as a batch of data can be loaded into memory.
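The pseudoinverse identity for a DFT-with-mask block, namely $\boldsymbol A^{\dagger} = \frac{1}{n}\boldsymbol A^{*}$ when the mask has a unit-modulus diagonal, can be verified numerically; this is a small sketch with an illustrative dimension and random mask:

```python
import numpy as np

# One CDP block: A = F D with F the (unnormalized) DFT matrix and D a diagonal
# mask of unit-modulus entries, so A A^* = F D D^* F^* = F F^* = n I.
rng = np.random.default_rng(8)
n = 8
F = np.fft.fft(np.eye(n))                  # DFT matrix: F @ v == np.fft.fft(v)
d = np.exp(2j * np.pi * rng.random(n))     # unit-magnitude mask entries
A = F @ np.diag(d)

gram = A @ A.conj().T
assert np.allclose(gram, n * np.eye(n))    # rows are orthogonal with norm sqrt(n)
pinv = A.conj().T / n                      # equals the Moore-Penrose pseudoinverse
assert np.allclose(pinv @ A, np.eye(n))
```

In practice this means the block-Kaczmarz "inverse problem" for a CDP block costs only one inverse FFT, which is why the block size $n$ is convenient.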
3.4 Comparison with Incremental Truncated Wirtinger Flow (ITWF)
Recently, [16] designed and analyzed an incremental algorithm based on TWF, referred to as ITWF. More specifically, ITWF employs the same initialization procedure as TWF and randomly chooses one sample for each gradient update as follows:
(29) 
where the truncation rule is determined by two events and . Compared to ITWF, which is developed based on TWF, our IRWF uses the lower-order variable $|\boldsymbol a_i^T \boldsymbol z|$ rather than the $|\boldsymbol a_i^T \boldsymbol z|^2$ used in ITWF. Moreover, IRWF does not employ any truncation in its gradient loop and hence has fewer parameters to tune, making it easier to implement in practice.
[16] proved that ITWF converges linearly to the true signal as long as the ratio of the sample size to the signal dimension is large enough. Compared to ITWF, IRWF also achieves the same linear convergence, but runs faster than ITWF numerically, as demonstrated in Section 4. The proof for IRWF requires different bounding techniques, but is simpler than that of ITWF, due to the lower-order loss function that RWF adopts and the avoidance of truncation in the gradient update.
4 Numerical Results
In this section, we demonstrate the numerical efficiency of RWF and IRWF by comparing their performance with other competitive algorithms. Our experiments are conducted not only for the real Gaussian case but also for the complex Gaussian and CDP cases. All the experiments are implemented in Matlab 2015b and carried out on a computer equipped with an Intel Core i7 3.4 GHz CPU and 12 GB RAM.
We first compare the sample complexity of RWF and IRWF with those of TWF and WF via the empirical successful recovery rate versus the number of measurements. For RWF, we follow Algorithm 1 with the suggested parameters. For IRWF, we adopt a minibatch size of 64 for efficiency and set the step size . For WF and TWF, we use the codes provided in the original papers with the suggested parameters. For ITWF, we also adopt a minibatch size of 64 and set the step size (the optimal step size). We conduct the experiment for the real Gaussian, complex Gaussian, and CDP cases, respectively. For the real and complex cases, we set the signal dimension to be 1000, and set the ratio to take values from to with a step size of . For each ratio, we run 100 trials and count the number of successful trials. For each trial, we run a fixed number of iterations/passes for all algorithms. A trial is declared to be successful if the output of the algorithm satisfies . For the real Gaussian case, we generate the signal and the measurement vectors i.i.d. for . For the complex Gaussian case, we generate the signal and the measurements i.i.d. for . For the CDP case (28), we set for convenience of the FFT and . All other settings are the same as those for the real case.



Figure 4 plots the fraction of successful trials out of 100 trials for all algorithms, with respect to . It can be seen that IRWF exhibits the best sample complexity for all three cases, which is close to the theoretical limit [57]. It can also be seen that the two incremental methods (IRWF and ITWF) outperform the batch methods (RWF, TWF and WF). This can be attributed to the fact that the inherent noise in incremental methods helps to escape bad local minima, which is extremely helpful in the regime of a small number of samples, where local minima do exist near the global ones. Comparing among the three batch methods (RWF, TWF and WF), it can be seen that although RWF outperforms only WF (not TWF) for the real Gaussian case, it outperforms both WF and TWF for the complex Gaussian and CDP cases. An intuitive explanation for the real case is that a substantial number of samples with small magnitudes can deviate the gradient, so that truncation indeed helps to stabilize the algorithm when the number of measurements is not large. Furthermore, RWF exhibits a sharper transition than TWF and WF.
Table 1: Number of passes and running time for the algorithms to achieve the given relative error (real and complex Gaussian cases).

                                         Real Gaussian        Complex Gaussian
                                         #passes   time(s)    #passes   time(s)
  Batch        RWF                       72        0.52       177       4.81
  methods      TWF                       182       1.30       484       13.5
               WF                        217       2.22       922       24.9
               AltMinPhase               6         0.94       157       91.8
  Incremental  IRWF                      9         3.15       21        11.8
  methods      minibatch IRWF (64)       9         0.28       21        1.53
               minibatch ITWF (64)       15        0.72       28.6      3.28
               KaczmarzPR                9         3.71       21        13.2
               block KaczmarzPR (64)     8         0.45       21        3.22
We next compare the convergence rate of RWF and IRWF with those of TWF, ITWF, WF and AltMinPhase. We run all of the algorithms with the parameter settings suggested in the original codes. We generate the signal and measurements in the same way as in the first experiment. All algorithms are seeded with the RWF initialization. In Table 1, we list the number of passes and the time cost for these algorithms to achieve the given relative error, averaged over 10 trials. For the incremental methods, one pass corresponds to processing all of the samples once. Clearly, IRWF with minibatch size 64 runs fastest for both the real and complex cases. Moreover, among the batch (deterministic) algorithms, RWF requires many fewer passes and runs much faster than TWF and WF. Although RWF takes more iterations than AltMinPhase, it runs much faster, because each iteration of AltMinPhase needs to solve a least-squares problem, which takes much longer than a simple gradient update in RWF.
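A minimal sketch of the minibatch incremental update may help fix ideas. It uses the per-sample reshaped gradient (a_i^T x − y_i · sign(a_i^T x)) a_i implied by the absolute-value loss; the step size, batching scheme, and function name are illustrative assumptions rather than the paper's tuned settings:

```python
import numpy as np

def irwf_minibatch(A, y, x0, step=0.5, batch=64, passes=50, rng=None):
    """Minibatch incremental reshaped Wirtinger flow (sketch, real case).

    A: (m, n) measurement matrix; y: magnitudes |A @ x*|; x0: initial point.
    Each minibatch update descends along the averaged reshaped gradient
    (a_i^T x - y_i * sign(a_i^T x)) a_i over the sampled indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, _ = A.shape
    x = x0.copy()
    for _ in range(passes):
        order = rng.permutation(m)              # reshuffle samples each pass
        for start in range(0, m, batch):
            idx = order[start:start + batch]
            Ax = A[idx] @ x
            grad = A[idx].T @ (Ax - y[idx] * np.sign(Ax)) / len(idx)
            x -= step * grad
    return x
```

Note that each noiseless sample has zero reshaped gradient at the true signal, so a constant step size can still drive the iterates to high accuracy from a good initialization.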
We also compare the performance of the above algorithms on the recovery of a real image from Fourier intensity measurements (the two-dimensional CDP case). The image (see Figure 5) is the Milky Way Galaxy. Table 2 lists the number of passes and the time cost of the above six algorithms to achieve the given relative error for one R/G/B channel. All algorithms are seeded with the RWF initialization. To exploit the advantage of FFT, we run the incremental/stochastic methods with a suitable minibatch size for each R/G/B channel. We note that with such a minibatch size IRWF is equivalent to block KaczmarzPR, from the discussion in Section 3.3. It can be seen that, in general, the incremental/stochastic methods (IRWF/KaczmarzPR and ITWF) run faster than the batch methods (RWF, TWF, WF and AltMinPhase). Moreover, among the batch methods, RWF outperforms the other three algorithms in both the number of passes and the computational time. In particular, RWF runs twice as fast as TWF and six times as fast as WF in terms of both the number of iterations and the computational time.
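The Kaczmarz-PR connection mentioned above can be sketched for a single sample (a sketch under this interpretation; the full equivalence is discussed in Section 3.3). The update projects the iterate onto the hyperplane consistent with the measured magnitude and the current sign estimate, which is an IRWF-style step whose step size is normalized by the squared norm of the measurement vector:

```python
import numpy as np

def kaczmarz_pr_step(x, a, y):
    """One randomized Kaczmarz step for phase retrieval (sketch, real case):
    project x onto the hyperplane {z : a^T z = y * sign(a^T x)},
    using the current iterate to resolve the sign of the measurement."""
    r = y * np.sign(a @ x) - a @ x          # signed residual
    return x + (r / (a @ a)) * a            # exact projection onto the hyperplane
```

After the step, the iterate satisfies the selected sample's magnitude equation exactly (up to the current sign estimate).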
Table 2: Number of passes and running time to recover one R/G/B channel of the Galaxy image (two measurement settings).

  Algorithms       RWF    IRWF/KaczmarzPR   TWF     ITWF    WF      AltMinPhase
  #passes          140    24                410     41      fail    230
  time cost (s)    110    21.2              406     43      fail    167

  #passes          70     8                 190     12      315     120
  time cost (s)    107    13.7              363.6   25.9    426     171
We next demonstrate the robustness of RWF to noise corruption and compare it with TWF. We consider the phase retrieval problem in imaging applications, where random Poisson noise is often used to model the sensor and electronic noise [58]. Specifically, each noisy intensity measurement is a random sample drawn from the Poisson distribution whose mean is the corresponding noiseless intensity, with a parameter controlling the level of input noise. It can be observed from Figure 6 that RWF performs better than TWF in terms of recovery accuracy under different noise levels.

5 Conclusion
In this paper, we proposed RWF and its stochastic version IRWF to recover a signal from a quadratic system of equations, based on a nonconvex and nonsmooth quadratic loss function of the absolute values of the measurements. This loss function sacrifices smoothness but enjoys advantages in statistical and computational efficiency. It also has the potential to be extended to various scenarios. One interesting direction is to extend the algorithm to exploit signal structures (e.g., nonnegativity, sparsity) to assist the recovery. The lower-order loss function may offer great simplicity for proving performance guarantees in such cases.
Another interesting direction is to study the convergence of the algorithms from random initialization. In the regime of large sample size, the empirical loss surface approaches the asymptotic loss (Figure 3(b)) and hence has no spurious local minima. By [59], it is conceivable that gradient descent converges from a random starting point; similar phenomena have been observed in [19, 29]. However, under a moderate number of measurements, authentic local minima do exist and are often located not far from the global ones. In this regime, the batch gradient method often fails with random initialization. As is widely believed, stochastic algorithms are efficient in escaping bad local minima and saddle points in nonconvex optimization because of their inherent noise [60, 26]. We observe numerically that IRWF and block IRWF with a random starting point still converge to the global minimum even with a very small sample size, close to the theoretical limits [57]. It is of interest to analyze theoretically how stochastic methods escape these local minima (not just saddle points) efficiently.
Appendix
We first introduce some notation here. We let $\mathcal{A}$ denote the linear map that takes a vector to its measurements, i.e., $\mathcal{A}(x) = \{a_j^\top x\}_{1 \le j \le m}$. We let $\|\cdot\|$ and $\|\cdot\|_1$ denote the $\ell_2$ norm and $\ell_1$ norm of a vector, respectively. Moreover, let $\|\cdot\|_F$ and $\|\cdot\|$ denote the Frobenius norm and the spectral norm of a matrix, respectively. We note that the constants may differ from line to line, for the sake of notational simplicity.
Appendix A Proof of Proposition 1: Performance Guarantee for Initialization
Compared to the proof for TWF [2], this proof involves new technical developments to address the magnitude measurements and the truncation from both sides.
We first estimate the norm of as
(30) 
Since , by a Hoeffding-type inequality, it can be shown that
(31) 
holds with probability at least for some constant .
Moreover, given , ’s are independent sub-Gaussian random variables. Thus, by a Hoeffding-type inequality, it can be shown that
(32) 
holds with probability at least for some constant .
On the event , it can be argued that
(33) 
Without loss of generality, we let . Then on the event , the truncation function satisfies the following bounds
Thus, by defining
we have . We further compute the expectations of and and obtain
(34) 
where
where . For given and , small value of yields arbitrarily close and , as well as arbitrarily close and . For example, taking and , we have .
Now applying standard results on random matrices with non-isotropic sub-Gaussian rows [61, equation (5.26)] and noticing that can be rewritten as for a sub-Gaussian vector , one can derive
(35) 
with probability for some positive which is only affected by , provided that exceeds a certain constant. Furthermore, when is sufficiently small, one has . Combining the above facts, one can show that
(36) 
Let be the normalized leading eigenvector of . Following the arguments in [1, Section 7.8] and taking and to be sufficiently small, one has
(37) 
for a given , as long as exceeds a certain constant.
Appendix B Supporting Arguments for Section 2.2
B.1 Expectation of loss functions
We next show that the expectation of the loss function (3) of RWF has the following form: