Reshaped Wirtinger Flow and Incremental Algorithm for Solving Quadratic System of Equations

05/25/2016 · Huishuai Zhang et al. · Syracuse University, The Ohio State University

We study the phase retrieval problem of solving a quadratic system of equations, i.e., recovering a vector x∈R^n from its magnitude measurements y_i=|〈a_i, x〉|, i=1,..., m. We develop a gradient-like algorithm (referred to as RWF, for reshaped Wirtinger flow) that minimizes a nonconvex nonsmooth loss function. In comparison with the existing nonconvex Wirtinger flow (WF) algorithm [1], although the loss function becomes nonsmooth, it involves only the second power of the variable and hence reduces the complexity. We show that, for random Gaussian measurements, RWF enjoys geometric convergence to a global optimal point as long as the number m of measurements is on the order of n, the dimension of the unknown x. This improves the sample complexity of WF and matches the sample complexity of truncated Wirtinger flow (TWF) [2], but without truncation in the gradient loop. Furthermore, RWF costs less computationally than WF and runs faster numerically than both WF and TWF. We further develop the incremental (stochastic) reshaped Wirtinger flow (IRWF) and show that IRWF converges linearly to the true signal. We also establish a performance guarantee for an existing Kaczmarz method for the phase retrieval problem based on its connection to IRWF, and we empirically demonstrate that IRWF outperforms the existing ITWF algorithm (the stochastic version of TWF) as well as other batch algorithms.


1 Introduction

Many problems in machine learning and signal processing can be reduced to solving a quadratic system of equations. For instance, in phase retrieval applications such as X-ray crystallography and coherent diffraction imaging [3, 4, 5], the structure of an object is to be recovered from the measured far-field diffracted intensity when the object is illuminated by a source light. Mathematically, such a problem amounts to recovering a signal from measurements of magnitudes only. Specifically, the problem is formulated below.

Problem 1.

Recover x ∈ R^n from the m measurements given by

(1)    y_i = |〈a_i, x〉|,  i = 1, ..., m,

where the a_i are known random design vectors.

Various algorithms have been proposed to solve this problem since the 1970s. The error-reduction methods proposed in [6, 7] work well empirically but lack theoretical guarantees. More recently, convex relaxations of the problem have been formulated, for example, via PhaseLift [8, 9, 10] and PhaseCut [11], and the correspondingly developed algorithms typically come with performance guarantees. The reader can refer to the review paper [12] to learn more about applications and algorithms for the phase retrieval problem.

While backed by good theoretical guarantees, these convex methods often suffer from high computational complexity, particularly when the signal dimension is large. On the other hand, more efficient nonconvex approaches have been proposed and shown to recover the true signal as long as the initialization is good enough. [13] proposed the AltMinPhase algorithm, which alternately updates the phase and the signal, with each signal update solving a least-squares problem, and showed that AltMinPhase converges linearly and recovers the true signal with Gaussian measurements. More recently, [1] introduced the Wirtinger flow (WF) algorithm, which guarantees signal recovery via a simple gradient algorithm with only O(n log n) Gaussian measurements and attains ε accuracy within O(mn² log(1/ε)) flops. More specifically, WF obtains a good initialization by the spectral method, and then minimizes the following nonconvex loss function

(2)    ℓ_WF(z) := (1/2m) ∑_{i=1}^m ( |〈a_i, z〉|² − y_i² )²

via the gradient descent scheme.

WF is further improved by the truncated Wirtinger flow (TWF) algorithm proposed in [2], which adopts a Poisson loss function of |〈a_i, z〉|² and keeps only well-behaved measurements, based on carefully designed truncation thresholds, for calculating the initial seed and every gradient step. Such truncation helps to yield linear convergence with a certain fixed step size and reduces both the sample complexity to O(n) and the convergence time to O(mn log(1/ε)).

It can be observed that WF uses the quadratic loss of |〈a_i, z〉|², so that the optimization objective is a smooth function of z and the gradient step becomes simple. But this comes at the cost of increasing the order of the variable z to four in the loss function. In this paper, we adopt the quadratic loss of |〈a_i, z〉|. Although the resulting loss function is not smooth everywhere, it reduces the order of z to two, and its curvature can be more amenable to convergence of the gradient method. The goal of this paper is to explore the potential advantages of such a nonsmooth lower-order loss function.

Furthermore, incremental/stochastic methods have been proposed to solve Problem 1. Specifically, the Kaczmarz method for phase retrieval (Kaczmarz-PR) [14, 15] is shown to have excellent empirical performance, but no global convergence guarantee has been established for it. Incremental truncated Wirtinger flow (ITWF) [16] is a stochastic algorithm developed based on TWF that exhibits linear convergence to the true signal once initialized properly. In this paper, we consider the incremental/stochastic version of RWF (IRWF) and compare its performance with Kaczmarz-PR and ITWF, in order to further demonstrate the advantage of the lower-order loss function.

1.1 Our Contribution

This paper adopts the following loss function (this loss function was also used in [7] to derive a gradient-like update for the phase retrieval problem with Fourier magnitude measurements; however, the focus of this paper is to characterize a global convergence guarantee for such an algorithm with appropriate initialization, which was not studied in [7]):

(3)    ℓ(z) := (1/2m) ∑_{i=1}^m ( |〈a_i, z〉| − y_i )²

Compared to the loss function (2) in WF, which adopts |〈a_i, z〉|², the above loss function adopts the absolute value/magnitude |〈a_i, z〉| and hence has lower-order variables. For such a nonconvex and nonsmooth loss function, we develop a gradient descent-like algorithm, which sets to zero the "gradient" components corresponding to nonsmooth samples. We refer to such an algorithm, together with an initialization using a new spectral method (different from that employed in TWF or WF), as reshaped Wirtinger flow (RWF). We show that the lower-order loss function has a great advantage in both statistical and computational efficiency, although it sacrifices smoothness. In fact, the curvature of such a loss function behaves similarly to that of a least-squares loss function in the neighborhood of the global optima (see Section 2.2), and hence RWF converges fast. The nonsmoothness does not significantly affect the convergence of the algorithm, because the algorithm encounters nonsmooth points for some samples only with negligible probability, and those samples are furthermore set not to contribute to the gradient direction.

We further exploit the loss function (3) to design the incremental/stochastic reshaped Wirtinger flow (IRWF), and we show that IRWF also enjoys the advantageous local curvature of such a loss function and achieves excellent statistical and computation performance. In particular, IRWF performs better than other competitive incremental methods (ITWF and Kaczmarz-PR) as well as batch algorithms (RWF, TWF and WF) numerically.

We summarize our main results as follows.

  • Statistically, we show that RWF recovers the true signal with O(n) samples when the design vectors consist of independently and identically distributed (i.i.d.) Gaussian entries, which is optimal in the order sense. Thus, even without truncation in the gradient steps (truncation is used only in the initialization stage), RWF improves the sample complexity of WF and achieves the same sample complexity as TWF with truncation in the gradient step. It is thus more robust to random measurements.

  • Computationally, RWF converges geometrically to the true signal, requiring O(mn log(1/ε)) flops to reach ε accuracy. Again, without truncation in the gradient steps, RWF improves the computational cost of WF and achieves the same computational cost as TWF.

  • Numerically, RWF adopts a fixed step size and does not require truncation, which avoids the trouble of setting truncation thresholds in practice. It is generally two times faster than TWF and four to six times faster than WF in terms of the number of iterations and the time cost.

  • We also show that RWF is robust to bounded additive noise. The estimation error is shown to diminish geometrically to the level of the bounded noise, up to a certain coefficient. Experiments with Poisson noise further corroborate the stability guarantee.

  • We further propose an incremental/stochastic algorithm based on RWF (referred to as IRWF) and show that IRWF converges to the true signal geometrically under an appropriate initialization. More interestingly, we show that randomized Kaczmarz-PR (i.e., the Kaczmarz method adapted for phase retrieval [14]) can be viewed as IRWF under a specific choice of step size, via which we further establish a geometric convergence guarantee for randomized Kaczmarz-PR.

Compared to WF and TWF, the new form of the gradient step due to the nonsmoothness of the loss function requires, in terms of technical analysis, new developments of bounding techniques. On the other hand, our technical proof of the performance guarantee is much simpler, because the lower-order loss function allows us to bypass higher-order moments of the variables and truncation in the gradient steps. We also anticipate that such analysis is more easily extendable.

1.2 Connection to Related Work

Along the line of developing nonconvex algorithms with global performance guarantee for the phase retrieval problem, [13] developed alternating minimization algorithm, [1, 2, 17, 18] developed/studied first-order gradient-like algorithms, and a recent study [19] characterized geometric structure of the nonconvex objective and designed a second-order trust-region algorithm. This paper is most closely related to [1, 2, 20, 17], but develops a new gradient-like algorithm based on a lower-order nonsmooth (as well as nonconvex) loss function that yields advantageous statistical/computational efficiency.

Stochastic algorithms are also developed for the phase retrieval problem. [16] studied the incremental truncated Wirtinger flow (ITWF) and showed that ITWF needs much fewer passes of data than TWF to reach the same accuracy. [14] adapted the Kaczmarz method to solve the phase retrieval problem and demonstrated its fast empirical convergence. We propose a stochastic algorithm based on RWF (IRWF) and show that it has close connection with Kaczmarz-PR. We also show that IRWF runs faster than ITWF due to the benefit of low-order loss function.

After our work was posted on arXiv, an independent work [21] was subsequently posted, which also adopts the same loss function but develops a slightly different algorithm, TAF (truncated amplitude flow). One major difference of our algorithm RWF from TAF is that RWF does not require truncation in the gradient loop, while TAF employs truncation. Hence, RWF has fewer parameters to tune and is easier to implement in practice than TAF. Furthermore, RWF demonstrates the performance advantage of adopting a lower-order loss function even without truncation, which cannot be observed from TAF. Moreover, we analyze a stochastic algorithm based on the new loss function, while [21] does not.

More generally, various problems have been studied by minimizing nonconvex loss functions. A partial list of these studies includes matrix completion [22, 23, 24, 25, 26, 27, 28, 29], low-rank matrix recovery [30, 31, 32, 33, 34, 35], robust PCA [36], robust tensor decomposition [37], dictionary learning [38, 39], community detection [40], phase synchronization [41], blind deconvolution [42, 43], etc.

For minimizing a general nonconvex nonsmooth objective, various algorithms have been proposed, such as the gradient sampling algorithm [44, 45] and the majorization-minimization method [46]. These algorithms were often shown to converge to critical points, which may be local minimizers or saddle points, without an explicit characterization of the convergence rate. In contrast, our algorithm is specifically designed for the phase retrieval problem and can be shown to converge linearly to the global optimum under appropriate initialization.

The advantage of the nonsmooth loss function exhibited in our study is analogous in spirit to that of the rectifier activation function (of the form max{·, 0}) in neural networks. It has been shown that the rectified linear unit (ReLU) enjoys a superb advantage in reducing the training time [47] and promoting sparsity [48] over its counterparts, the sigmoid and hyperbolic tangent functions, in spite of its non-linearity and non-differentiability at zero. Our result in fact also demonstrates that a nonsmooth but simpler loss function yields improved performance.

1.3 Paper Organization and Notations

The rest of this paper is organized as follows. Section 2 describes RWF algorithm in detail and establishes its performance guarantee. Section 3 introduces the IRWF algorithm and establishes its performance guarantee and compares it with existing stochastic algorithms. Section 4 compares RWF and IRWF with other competitive algorithms numerically. Finally, Section 5 concludes the paper with comments on future directions.

Throughout the paper, boldface lowercase letters such as denote vectors, and boldface capital letters such as denote matrices. For two matrices, means that is positive definite. For a complex matrix or vector, and denote conjugate transposes of and respectively. For a real matrix or vector, and denote transposes of and respectively. The indicator function if the event is true, and otherwise.

2 Reshaped Wirtinger Flow

Consider the problem of recovering a signal x ∈ R^n based on m measurements given by

(4)    y_i = |〈a_i, x〉|,  i = 1, ..., m,

where the a_i, for i = 1, ..., m, are known measurement vectors, independently generated from the Gaussian distribution N(0, I_{n×n}). It can be observed that if x is a solution, i.e., x satisfies (1), then any vector that differs from x only by a global phase (−x in the real case) is also a solution of the problem. So the recovery is up to a phase difference. Thus, we define the Euclidean distance between two vectors up to a global phase difference [1] as, for complex signals,

(5)    dist(z, x) := min_{φ∈[0,2π)} ‖z − e^{jφ} x‖,

which reduces to min{‖z − x‖, ‖z + x‖} in the real case. We focus on the real-valued case in the analysis, but the algorithm designed below is applicable to the complex-valued case and the case with coded diffraction patterns (CDP), as we demonstrate via numerical results in Section 4.
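A small MATLAB-style sketch of the distance (5) follows; the helper name is ours, and the complex branch uses the closed-form optimal phase angle(x'*z).

    % Sketch of dist(z, x) in (5), up to a global phase (helper name is ours).
    function d = dist_to_signal(z, x)
        if isreal(z) && isreal(x)
            d = min(norm(z - x), norm(z + x));   % real case: only a sign ambiguity
        else
            phi = angle(x' * z);                 % optimal global phase e^{j*phi}
            d   = norm(z - exp(1i * phi) * x);
        end
    end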

We design RWF (see Algorithm 1) for solving the above problem, which contains two stages: spectral initialization and gradient loop. Suggested values for the parameters are given in Algorithm 1 (for the complex Gaussian case, slightly different values are suggested). The scaling parameter in the norm estimate and the conjugate transpose make the algorithm readily applicable to the complex and CDP cases. We next describe the two stages of the algorithm in detail in Sections 2.1 and 2.2, respectively, and establish the convergence of the algorithm in Section 2.3. Finally, we provide the stability result of RWF in Section 2.4.

Input: , ;
Parameters: Lower and upper thresholds for truncation in initialization, stepsize ;
Initialization: Let z^(0) = λ_0 · z̃, where λ_0 is the norm estimate of x and z̃ is the leading eigenvector of

(6)

Gradient loop: for t = 0, 1, ..., T − 1 do

(7)

Output z^(T).

Algorithm 1 Reshaped Wirtinger Flow

2.1 Initialization via Spectral Method

In contrast to the spectral initialization methods for WF in [1] and for TWF in [2], both of which are based on the squared magnitudes y_i², we propose an alternative initialization in Algorithm 1 that uses the magnitudes y_i instead and truncates samples with both lower and upper thresholds as in (6). We show that such an initialization achieves a smaller sample complexity than WF and the same order-level sample complexity as TWF, and furthermore, performs better than both WF and TWF numerically.

Our initialization consists of estimating both the norm and the direction of x. The norm estimate of x is given by λ_0 in Algorithm 1, with mathematical justification in Appendix A. Intuitively, with real Gaussian measurements, the scaling coefficient accounts for the fact that E|〈a_i, x〉| = sqrt(2/π)·‖x‖. Moreover, the y_i are independent sub-Gaussian random variables with this mean, and thus their sample average concentrates around it. Combining these two facts yields the desired argument.

Figure 1: Comparison of different initialization methods with 50 iterations.

The direction of x is approximated by the leading eigenvector of the matrix in (6), because this matrix approaches its population version by concentration of measure, and the leading eigenvector of the population version takes the form c·x for some scalar c. We note that (6) involves truncation of samples from both sides, in contrast to truncation only by an upper threshold in [2]. This difference is due to the following reason. In the high-dimensional setting, two random vectors are almost perpendicular to each other [49], which implies that there is a considerable number of samples with small values of y_i (e.g., less than 1). These samples deviate the direction of the leading eigenvector of (6) away from x, and their effect cannot be offset or neglected if there is only a moderate number of samples. Specifically, [2] uses y_i² to weight the contribution of each a_i a_i^T, and the square power helps to reduce the contribution of the bad directions (i.e., the samples with small y_i). In contrast, we use y_i to weight the contribution of each a_i a_i^T and apply truncation from below to filter out the bad directions directly.
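To make the initialization concrete, the following MATLAB-style sketch implements one plausible form of the norm estimate and of the truncated, magnitude-weighted direction matrix; the scaling constant sqrt(pi/2), the two-sided truncation rule, and the thresholds alpha_l, alpha_u are assumptions for illustration, not a transcription of (6).

    % Sketch of the spectral initialization (real case); the exact form of (6)
    % and the threshold values are assumptions.
    function z0 = rwf_init_sketch(A, y, alpha_l, alpha_u)
        m       = size(A, 1);
        lambda0 = sqrt(pi/2) * mean(y);          % norm estimate: E|<a_i,x>| = sqrt(2/pi)*||x||
        keep    = (y > alpha_l * lambda0) & (y < alpha_u * lambda0);   % two-sided truncation
        Ak      = A(keep, :);
        Y       = (Ak' * bsxfun(@times, y(keep), Ak)) / m;   % y-weighted, truncated matrix
        [v, ~]  = eigs(Y, 1);                    % leading eigenvector as direction estimate
        z0      = lambda0 * v;
    end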

We next provide the formal statement of the performance guarantee for the initialization step that we propose.

Proposition 1.

Fix . The initialization step in Algorithm 1 yields satisfying with probability at least , if , where is some positive constant and is a positive number only affected by and .

Proof.

See Appendix A. ∎

Finally, Figure 1 demonstrates that RWF achieves better initialization accuracy, in terms of the relative error, than WF and TWF. Furthermore, we also include in the comparison the orthogonality-promoting initialization method proposed for truncated amplitude flow (TAF) in the independent work [21]. It can be seen that our initialization is slightly better.

2.2 Gradient Loop and Why RWF is Fast

The gradient loop of Algorithm 1 is based on the loss function (3), which is rewritten below:

(8)    ℓ(z) := (1/2m) ∑_{i=1}^m ( |〈a_i, z〉| − y_i )²

We let the update direction be as follows:

(9)    ∇ℓ(z) := (1/m) ∑_{i=1}^m ( 〈a_i, z〉 − y_i · sgn(〈a_i, z〉) ) a_i,

where sgn(·) is the sign function for nonzero arguments, and we set sgn(0) = 0. In fact, (9) equals the gradient of the loss function (8) if 〈a_i, z〉 ≠ 0 for all i. For samples at nonsmooth points, i.e., with 〈a_i, z〉 = 0, we adopt the Fréchet superdifferential [50] for nonconvex functions and set the corresponding gradient component to zero (as zero is an element of the Fréchet superdifferential). With an abuse of terminology, we still refer to (9) as the "gradient" for simplicity; it represents the update direction in the gradient loop of Algorithm 1.
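For concreteness, a MATLAB-style sketch of one gradient iteration (7) with the update direction (9) follows; MATLAB's convention sign(0) = 0 conveniently zeroes the components of nonsmooth samples, and the step size is left to the caller.

    % Sketch of one RWF gradient step (7) using the update direction (9).
    % A: m-by-n matrix with rows a_i', y: m-by-1 magnitudes, z: current iterate.
    function z = rwf_step(A, y, z, mu)
        m    = size(A, 1);
        Az   = A * z;
        grad = (A' * (Az - y .* sign(Az))) / m;  % (1/m) sum_i (a_i'z - y_i*sgn(a_i'z)) a_i
        z    = z - mu * grad;                    % sign(0) = 0 matches the nonsmooth convention
    end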

We next provide the intuition for why reshaped WF is fast. Suppose that the spectral method sets an initial point in the neighborhood of the ground truth x. We compare RWF with the problem of solving x from the linear equations b_i = 〈a_i, x〉, with b_i and a_i given for i = 1, ..., m. In particular, this problem observes both the magnitude and the sign of each measurement, and it is then natural to use the least-squares loss. Suppose that gradient descent is applied to solve this problem; the gradient is given by

(10)    ∇ℓ_LS(z) := (1/m) ∑_{i=1}^m ( 〈a_i, z〉 − b_i ) a_i.

We now argue intuitively that the gradient (9) of RWF behaves similarly to the least-squares gradient (10). For each i, y_i · sgn(〈a_i, z〉) can be viewed as an estimate of b_i = 〈a_i, x〉, and hence the corresponding gradient components are close whenever the sign is estimated correctly. The following lemma shows that if ‖z − x‖ is small (which is guaranteed by the initialization), then 〈a_i, z〉 has the same sign as 〈a_i, x〉 for large |〈a_i, x〉|.

Lemma 1.

Let . For any given and , independent from , satisfying , we have

(11)

where .

Proof.

See Appendix B.2. ∎

It is easy to observe from (11) that when |〈a_i, x〉| is large, 〈a_i, z〉 is likely to have the same sign as 〈a_i, x〉, so that the corresponding gradient components in (9) and (10) are likely equal, whereas when |〈a_i, x〉| is small, the signs may differ but the sample contributes less to the gradient. Hence, overall the two gradients (9) and (10) should be close to each other with large probability.

This fact can be further verified numerically. Figure 2(a) illustrates that RWF takes almost the same number of iterations for recovering a signal (with only magnitude information) as the least-squares gradient descent method for recovering a signal (with both magnitude and sign information).

(a) Convergence behavior
Figure 2: Comparison of convergence behavior between RWF and least-squares gradient descent with the same initialization. Parameters , , step size .
(a) Quadratic surface
(b) Expected loss of RWF
(c) Expected loss of WF
Figure 3: (a) Surface of quadratic function with . (b) Expected loss function of RWF for . (c) Expected loss function of WF for .

Furthermore, Figure 3 illustrates a quadratic function (Figure 3(a)), the expected loss surface of RWF (Figure 3(b); see Appendix B for its expression), and the expected loss surface of WF (Figure 3(c); see Appendix B for its expression). It can be seen that the loss of RWF, rather than the loss of WF, has a curvature similar to that of the quadratic function around the global optima, which justifies its better performance than WF.

2.3 Geometric Convergence of RWF

We characterize the convergence of RWF in the following theorem.

Theorem 1.

Consider the problem of solving any given from a system of equations (4) with Gaussian measurement vectors. There exist some universal constants ( can be set as in practice), and such that if and , then with probability at least , Algorithm 1 yields

(12)
Outline of the Proof.

We outline the proof here with details relegated to Appendix C. Compared to WF and TWF, our proof requires new development of bounding techniques to deal with nonsmoothness, but is much simpler due to the lower-order loss function that RWF relies on.

We first introduce a global phase notation for the real case as follows:

(13)

For the sake of simplicity, we let be , which indicates that is always in the neighborhood of .

Here, the central idea is to show that within the neighborhood of the global optima, RWF satisfies the Regularity Condition [2], i.e.,

(14)

for all obeying , where is some constant. Then, as shown in [2], once the initialization lands in this neighborhood, geometric convergence can be guaranteed, i.e.,

(15)

for any with .

Lemmas 2 and 3 in Appendix C yield that

Lemma 4 in Appendix C further yields that

(16)

Therefore, the above two bounds imply that Regularity Condition (14) holds for and satisfying

(17)

We note that (17) implies an upper bound on the step size, obtained by taking the relevant constants to be sufficiently small. This suggests a range for setting the step size in Algorithm 1. However, in practice, the step size can be set much larger than such a bound while still keeping the algorithm convergent. This is because the coefficients in the proof are chosen for convenience of analysis rather than being tight.

Theorem 1 indicates that RWF recovers the true signal with O(n) samples, which is order-level optimal. Such an algorithm improves the sample complexity of WF. Furthermore, RWF does not require truncation of weak samples in the gradient step to achieve the same sample complexity as TWF. This is mainly because RWF benefits from the lower-order loss function given in (8), whose curvature locally behaves similarly to that of the least-squares loss function, as explained in Section 2.2.

Theorem 1 also suggests that RWF converges geometrically with a constant step size. To reach ε accuracy, it requires a computational cost of O(mn log(1/ε)) flops, which is better than that of WF (O(mn² log(1/ε))). Furthermore, it does not require truncation in the gradient steps to reach the same computational cost as TWF. Numerically, as we demonstrate in Section 4, RWF is two times faster than TWF and four to six times faster than WF in terms of both the iteration count and the time cost in various examples.

2.4 Stability to Bounded Noise

We have established that RWF guarantees exact recovery with geometric convergence in the noise-free case. We now study RWF in the presence of noise. Suppose the measurements are corrupted by bounded noise and are given by

(18)    y_i = |〈a_i, x〉| + η_i,  i = 1, ..., m,

where the η_i denote the additive noise. The following theorem shows that RWF is robust under such noise corruption.

Theorem 2.

Consider the model (18). Suppose that the measurement vectors are independently Gaussian, i.e., for , and the noise is bounded, i.e., . Then there exist some universal constants ( can be set as in practice), and such that if and , then with probability at least , Algorithm 1 yields

(19)

for some .

Proof.

See Appendix D. ∎

The numerical result under the Poisson noise model in Section 4 further corroborates the stability of RWF.

3 Incremental Reshaped Wirtinger Flow

In large-sample and online scenarios, stochastic algorithms are preferred due to their potential advantages of fast convergence and low per-iteration computational complexity. Thus, in this section, we develop a stochastic algorithm based on RWF, named incremental reshaped Wirtinger flow (IRWF). We show that IRWF guarantees exact recovery with a linear convergence rate. We further draw the connection between IRWF and the Kaczmarz-PR algorithm recently developed for phase retrieval [14, 15, 51].

3.1 IRWF: Algorithm and Convergence

We describe IRWF in Algorithm 2. More specifically, IRWF applies the same initialization step as in RWF, but calculates each gradient update using only one sample selected randomly.

Input: , ;
Initialization: Same as in RWF (Algorithm 1);

Gradient loop: for do
Choose uniformly at random from , and let

(20)

Output .

Algorithm 2 Incremental Reshaped Wirtinger Flow (IRWF)
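A MATLAB-style sketch of the IRWF gradient loop (20) follows; the choice of step size (here a caller-supplied mu, roughly on the order of 1/n so that it mimics 1/||a_i||^2 for Gaussian rows, per the Kaczmarz connection in Section 3.3) is an assumption rather than the paper's exact suggestion.

    % Sketch of the IRWF gradient loop (20); the step-size scale is an assumption.
    function z = irwf_loop(A, y, z, mu, T)
        m = size(A, 1);
        for t = 1:T
            i  = randi(m);                          % sample chosen uniformly at random
            ai = A(i, :)';
            r  = ai' * z - y(i) * sign(ai' * z);    % residual of the chosen sample
            z  = z - mu * r * ai;                   % stochastic update with fixed step size
        end
    end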

We characterize the convergence of IRWF in the following theorem.

Theorem 3.

Consider the problem of solving any given from a system of equations (4) with Gaussian measurement vectors. There exist some universal constants and such that if and , then with probability at least , Algorithm 2 yields

(21)

where denotes the expectation with respect to algorithm randomness conditioned on the high probability event of random measurements .

We suggest the step size in practice.

Proof.

The proof is relegated to Appendix E.1, which uses technical lemmas established for proving Theorem 1. ∎

Theorem 3 establishes that IRWF achieves linear convergence to the global optimum. For general objectives, incremental/stochastic first-order methods are not expected to achieve linear convergence due to the variance of the stochastic gradient. However, for our specific problem, the variance of the stochastic gradient reduces as the estimate approaches the true solution, and hence a fixed step size can be employed and linear convergence can be established. Such a result is also established in [16] for the stochastic algorithm based on TWF (ITWF). We comment further on the comparison between our algorithm and ITWF in Section 3.4. Another explanation may be the fact [52, 53] that the stochastic gradient method yields linear convergence to the minimizer when the objective is smooth and strongly convex and the minimizer simultaneously minimizes all component functions. This may also hold for our objective (3), whose summands share the same minimizer, although it is neither convex nor smooth.

3.2 Minibatch IRWF: Algorithm and Convergence

In order to fully exploit the processing throughput of CPUs/GPUs, we develop a minibatch version of IRWF, described in Algorithm 3. Minibatch IRWF applies the same initialization step as RWF, but uses a minibatch of data for each gradient update, in contrast to IRWF, which uses only a single sample.

Input: , ;
Initialization: Same as in RWF (Algorithm 1);

Gradient loop: for do
Choose uniformly at random from the subsets of with cardinality , and let

(22)

where is a matrix stacking for as its rows, is a vector stacking for as its elements, denotes element-wise product, and denotes a phase vector of .

Output .

Algorithm 3 Minibatch Incremental Reshaped Wirtinger Flow (minibatch IRWF)
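A MATLAB-style sketch of one minibatch update (22) follows (real case); the 1/n normalization of the step is an assumption, chosen so that the update approximates the block Kaczmarz-PR step of Section 3.3 when the minibatch rows are nearly orthogonal with squared norm about n.

    % Sketch of one minibatch IRWF update (22), real case; the 1/n scaling is assumed.
    function z = minibatch_irwf_step(A, y, z, mu, k)
        [m, n] = size(A);
        idx = randperm(m, k);                       % random minibatch of size k
        Ab  = A(idx, :);
        rb  = Ab * z - y(idx) .* sign(Ab * z);      % residuals on the minibatch
        z   = z - (mu / n) * (Ab' * rb);            % update along the minibatch direction
    end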

We characterize the convergence of minibatch IRWF in the following theorem.

Theorem 4.

Consider the problem of solving any given from a system of equations (4) with Gaussian measurement vectors. There exist some universal constants and such that if and , then with probability at least , Algorithm 3 yields

(23)

where denotes the expectation with respect to algorithm randomness conditioned on the high probability event of random measurements .

Proof.

See Appendix E.2. ∎

We suggest that in practice.

3.3 Connection to Kaczmarz Method for Phase Retrieval

The Kaczmarz method was originally developed for solving systems of linear equations [54]. In [14], it was adapted to solve the phase retrieval problem, which we refer to as Kaczmarz-PR. It has been demonstrated in [14] that Kaczmarz-PR exhibits better empirical performance than error reduction (ER) [6, 7] and Wirtinger flow (WF) [1]. However, a theoretical guarantee for Kaczmarz-PR has not been well established, although the Kaczmarz method for the least-squares problem achieves a linear convergence guarantee [55, 56]. For instance, [14] obtained a bound on the estimation error which can be as large as the signal energy no matter how many iterations are taken, and [15] requires an infinite number of samples to establish asymptotic convergence.

In this section, we draw the connection between IRWF and Kaczmarz-PR, through which a theoretical guarantee for Kaczmarz-PR can be established by adapting that of IRWF. This is analogous to the connection established in [53] between the Kaczmarz method and the stochastic gradient method for solving the least-squares problem. Here, the connection is interesting because it is RWF, rather than WF or TWF, that connects to Kaczmarz-PR, owing to the lower-order loss function that RWF adopts.

To be more specific, Kaczmarz-PR (Algorithm 3 in [14]) employs the following update rule

(24)

where the index is selected either in a deterministic manner or randomly. We focus on randomized Kaczmarz-PR, where the index is selected uniformly at random.
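For reference, a MATLAB-style sketch of one randomized Kaczmarz-PR update (24) in the real case follows; it is the standard Kaczmarz projection onto the hyperplane defined by the currently guessed sign, and coincides with the IRWF update with step size 1/||a_i||^2.

    % Sketch of one randomized Kaczmarz-PR update (24), real case.
    function z = kaczmarz_pr_step(A, y, z)
        m   = size(A, 1);
        i   = randi(m);                              % row chosen uniformly at random
        ai  = A(i, :)';
        res = y(i) * sign(ai' * z) - ai' * z;        % signed residual for the chosen row
        z   = z + (res / (ai' * ai)) * ai;           % project onto the row's hyperplane
    end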

Comparing (24) and (20), the update rule of randomized Kaczmarz-PR is a special case of IRWF with the step size replaced by 1/‖a_{i_t}‖². Moreover, these two update rules are close when the step size of IRWF is set as suggested, because ‖a_{i_t}‖² concentrates around its mean by the law of large numbers. As we demonstrate in the empirical results (see Table 1), these two methods have similar performance, as anticipated. Thus, following the convergence result of Theorem 3 for IRWF, we obtain the following convergence guarantee for randomized Kaczmarz-PR.

Theorem 5.

Assume the measurement vectors are independent and each . There exist some universal constants and such that if , then with probability at least , the randomized Kaczmarz update rule (24) yields

(25)

holds for all satisfying .

Proof.

See Appendix E.3. ∎

The above theorem implies that once the estimate enters the neighborhood of the true solutions (often referred to as the basin of attraction), the error diminishes geometrically with each update, in expectation.

Furthermore, [14] also provided a block Kaczmarz-PR (similar to the minibatch version), whose update rule is given by

(26)

where ⊙ is the element-wise product, the phase vector stacks the phase of each element of the corresponding block measurement together, the block measurement matrix has the measurement vectors of the selected block as its rows, and the block selected at each iterate contains the chosen row indices. Moreover, the Moore–Penrose pseudoinverse involved can be computed as follows:

(27)
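A MATLAB-style sketch of one block update (26), applying the pseudoinverse via (27), follows for the real case; the selected block is assumed to have full row rank so that the inner system is invertible.

    % Sketch of one block Kaczmarz-PR update (26)-(27), real case.
    % Gamma: vector of row indices forming the current block.
    function z = block_kaczmarz_pr_step(A, y, z, Gamma)
        Ag = A(Gamma, :);
        rg = y(Gamma) .* sign(Ag * z) - Ag * z;      % block residual with current signs
        z  = z + Ag' * ((Ag * Ag') \ rg);            % pseudoinverse action: Ag'*(Ag*Ag')^{-1}*rg
    end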

Comparing (26) and the minibatch IRWF update in (22), these two update rules are similar to each other if approaches . For the case with Gaussian measurements, has linearly independent rows with high probability if and hence is not far from . Our empirical results (see Table 1) further suggest similar convergence rate for these two algorithms with the same block size.

Next, we argue that for the CDP setting, block Kaczmarz-PR is the same as minibatch IRWF with an appropriate step size. The CDP measurements are collected in the following form

(28)

where F represents the discrete Fourier transform (DFT) matrix and D is a diagonal matrix (mask). We choose the block size to be n, the dimension of the signal, for the convenience of the Fourier transform. Then applying one block of the measurement matrix amounts to a Fourier transform composed with the mask effect, and applying the pseudoinverse amounts to multiplication by the corresponding inverse Fourier transform. Therefore, the pseudoinverse reduces to an inverse Fourier transform composed with the conjugate mask if the diagonal elements of D have unit magnitude. Taking the corresponding step size, the two algorithms are identical.
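To illustrate this claim, a MATLAB-style sketch of CDP measurements (28) and of one FFT-based block update of size n follows; the number of masks, their generation as random unit-modulus phases, and the variable names are assumptions, and MATLAB's sign(c) = c/|c| acts as the complex phase.

    % Sketch of CDP measurements (28) and one block update of size n via the FFT.
    % With unit-modulus mask entries, the pseudoinverse reduces to an inverse FFT
    % composed with the conjugate mask, so block Kaczmarz-PR and minibatch IRWF coincide.
    n = 256; L = 6;                                  % signal length and number of masks (assumed)
    x = randn(n, 1) + 1i * randn(n, 1);              % signal used to simulate measurements
    d = exp(1i * 2 * pi * rand(n, L));               % random unit-modulus masks (assumed)
    y = abs(fft(bsxfun(@times, d, x)));              % n-by-L magnitude measurements
    z = randn(n, 1) + 1i * randn(n, 1);              % current iterate (placeholder)
    l  = randi(L);                                   % pick one mask, i.e., one block of size n
    Az = fft(d(:, l) .* z);                          % block measurement of the iterate
    r  = y(:, l) .* sign(Az) - Az;                   % residual with the current phases
    z  = z + conj(d(:, l)) .* ifft(r);               % inverse FFT realizes the pseudoinverse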

On the other hand, since block Kaczmarz-PR needs to calculate a matrix inverse or to solve an inverse problem, the block size cannot be too large. In contrast, minibatch IRWF works well for a wide range of batch sizes, which can even scale with the signal dimension, as long as a batch of data fits into memory.

3.4 Comparison with Incremental Truncated Wirtinger Flow (ITWF)

Recently, [16] designed and analyzed an incremental algorithm based on TWF, referred to as ITWF. More specifically, ITWF employs the same initialization procedure as TWF and randomly chooses one sample for each gradient update as follows:

(29)

where the truncation rule is determined by two events. Compared to ITWF, which is developed based on TWF, our IRWF uses the lower-order variable |〈a_i, z〉| rather than the |〈a_i, z〉|² used in ITWF. Moreover, IRWF does not employ any truncation in the gradient loop and hence has fewer parameters to tune, which makes it easier to implement in practice.

[16] proved that ITWF converges linearly to the true signal as long as the ratio of the sample size to the signal dimension is large enough. Compared to ITWF, IRWF achieves the same linear convergence, but runs faster than ITWF numerically, as demonstrated in Section 4. The proof for IRWF requires different bounding techniques, but is simpler than that of ITWF due to the lower-order loss function that RWF adopts and the avoidance of truncation in the gradient update.

4 Numerical Results

In this section, we demonstrate the numerical efficiency of RWF and IRWF by comparing their performance with that of other competitive algorithms. Our experiments are conducted not only for the real Gaussian case but also for the complex Gaussian and CDP cases. All experiments are implemented in MATLAB 2015b and carried out on a computer equipped with an Intel Core i7 3.4 GHz CPU and 12 GB RAM.

We first compare the sample complexity of RWF and IRWF with those of TWF and WF via the empirical successful recovery rate versus the number of measurements. For RWF, we follow Algorithm 1 with the suggested parameters. For IRWF, we adopt a block size of 64 for efficiency and set the step size accordingly. For WF and TWF, we use the codes provided with the original papers with the suggested parameters. For ITWF, we also adopt a block size of 64 and set its step size to the optimal value. We conduct the experiment for the real Gaussian, complex Gaussian and CDP cases, respectively. For the real and complex cases, we set the signal dimension to 1000 and let the ratio of the number of measurements to the signal dimension take a range of values with a fixed step. For each ratio, we run 100 trials and count the number of successful trials, running a fixed number of iterations/passes for all algorithms in each trial. A trial is declared to be successful if the output of the algorithm achieves a sufficiently small relative error with respect to the true signal. For the real Gaussian case, we generate the signal and the measurement vectors with i.i.d. real Gaussian entries; for the complex Gaussian case, with i.i.d. complex Gaussian entries. For the CDP case (28), we set the signal dimension for the convenience of the FFT. All other settings are the same as those for the real case.
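For concreteness, one trial in the real Gaussian setting above might look like the following MATLAB-style sketch; the success threshold, step size, iteration count, and truncation thresholds are assumed values (the exact numbers are not reproduced in this text), and the helper functions refer to the sketches given in Section 2.

    % Sketch of one Monte Carlo trial, real Gaussian case; numeric values are assumed.
    n = 1000; m = 5 * n;
    x = randn(n, 1);                                 % true signal
    A = randn(m, n);                                 % i.i.d. Gaussian measurement vectors
    y = abs(A * x);                                  % magnitude measurements
    z = rwf_init_sketch(A, y, 1, 5);                 % spectral initialization (assumed thresholds)
    for t = 1:500
        z = rwf_step(A, y, z, 0.8);                  % RWF iterations (assumed step size)
    end
    success = min(norm(z - x), norm(z + x)) / norm(x) < 1e-5;   % assumed success criterion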

(a) Real Gaussian case
(b) Complex Gaussian case
(c) CDP case
Figure 4: Comparison of sample complexity among RWF, IRWF, TWF, ITWF and WF.

Figure 4 plots the fraction of successful trials out of 100 trials for all algorithms, with respect to the ratio of the number of measurements to the signal dimension. It can be seen that IRWF exhibits the best sample complexity in all three cases, close to the theoretical limits [57], and that the two incremental methods (IRWF and ITWF) outperform the batch methods (RWF, TWF and WF). This may be because the inherent noise in incremental methods helps to escape bad local minima, which is extremely helpful in the regime of a small number of samples, where local minima do exist near the global ones. Comparing the three batch methods (RWF, TWF and WF), although RWF outperforms only WF (not TWF) for the real Gaussian case, it outperforms both WF and TWF for the complex Gaussian and CDP cases. An intuitive explanation for the real case is that a substantial number of samples with small measured magnitudes can deviate the gradient, so that truncation indeed helps to stabilize the algorithm when the number of measurements is not large. Furthermore, RWF exhibits a sharper transition than TWF and WF.

                                     Real Gaussian          Complex Gaussian
                                     #passes    time(s)     #passes    time(s)
Batch        RWF                     72         0.52        177        4.81
methods      TWF                     182        1.30        484        13.5
             WF                      217        2.22        922        24.9
             AltMinPhase             6          0.94        157        91.8
Incremental  IRWF                    9          3.15        21         11.8
methods      minibatch IRWF (64)     9          0.28        21         1.53
             minibatch ITWF (64)     15         0.72        28.6       3.28
             Kaczmarz-PR             9          3.71        21         13.2
             block Kaczmarz-PR (64)  8          0.45        21         3.22

Table 1: Comparison of iteration count and time cost among algorithms

We next compare the convergence rates of RWF and IRWF with those of TWF, ITWF, WF and AltMinPhase. We run all of the algorithms with the suggested parameter settings in the original codes, and generate the signal and measurements in the same way as in the first experiment. All algorithms are seeded with the RWF initialization. In Table 1, we list the number of passes and the time cost for these algorithms to achieve a given relative error, averaged over 10 trials. For incremental methods, one pass corresponds to processing all samples once. Clearly, IRWF with minibatch size 64 runs the fastest for both the real and complex cases. Moreover, among the batch (deterministic) algorithms, RWF takes many fewer passes and runs much faster than TWF and WF. Although RWF takes more iterations than AltMinPhase, it runs much faster than AltMinPhase, because each iteration of AltMinPhase solves a least-squares problem, which takes much longer than a simple gradient update in RWF.

We also compare the performance of the above algorithms on the recovery of a real image from Fourier intensity measurements (the two-dimensional CDP case). The image (see Figure 5) is the Milky Way Galaxy. Table 2 lists the number of passes and the time cost of the above six algorithms to achieve a given relative error for one R/G/B channel. All algorithms are seeded with the RWF initialization. To exploit the advantage of the FFT, we run the incremental/stochastic methods with minibatch size equal to the dimension of one R/G/B channel. We note that with such a minibatch size, IRWF is equivalent to block Kaczmarz-PR by the discussion in Section 3.3. It can be seen that, in general, the incremental/stochastic methods (IRWF/Kaczmarz-PR and ITWF) run faster than the batch methods (RWF, TWF, WF and AltMinPhase). Moreover, among the batch methods, RWF outperforms the other three algorithms in both the number of passes and the computational time cost. In particular, RWF runs two times faster than TWF and six times faster than WF in terms of both the number of iterations and the computational time cost.

Figure 5: Milky Way Galaxy.
Algorithms      RWF     IRWF/Kaczmarz-PR    TWF     ITWF    WF      AltMinPhase
#passes         140     24                  410     41      fail    230
time cost(s)    110     21.2                406     43      fail    167
#passes         70      8                   190     12      315     120
time cost(s)    107     13.7                363.6   25.9    426     171

Table 2: Comparison of iterations and time cost among algorithms on the Galaxy image (shown in Figure 5)

We next demonstrate the robustness of RWF to noise corruption and compare it with TWF. We consider the phase retrieval problem in imaging applications, where random Poisson noise is often used to model the sensor and electronic noise [58]. Specifically, each noisy intensity measurement is a random sample generated by a Poisson distribution whose mean is determined by the noiseless intensity, with a parameter controlling the level of input noise. It can be observed from Figure 6 that RWF performs better than TWF in terms of recovery accuracy under different noise levels.

Figure 6: Comparison of relative error under Poisson noises between RWF and TWF.

5 Conclusion

In this paper, we proposed RWF and its stochastic version IRWF to recover a signal from a quadratic system of equations, based on a nonconvex and nonsmooth quadratic loss function of the absolute values of the measurements. This loss function sacrifices smoothness but enjoys advantages in statistical and computational efficiency. It has the potential to be extended to various scenarios. One interesting direction is to extend such an algorithm to exploit signal structures (e.g., non-negativity, sparsity) to assist the recovery. The lower-order loss function may offer great simplicity for proving performance guarantees in such cases.

Another interesting direction is to study the convergence of the algorithms from random initialization. In the regime of large sample size, the empirical loss surface approaches the asymptotic loss (Figure 3(b)) and hence has no spurious local minima. Based on [59], it is conceivable that gradient descent converges from a random starting point; similar phenomena have been observed in [19, 29]. However, under a moderate number of measurements, genuine local minima do exist and are often located not far from the global ones. In this regime, the batch gradient method often fails with random initialization. As is widely believed, stochastic algorithms are efficient in escaping bad local minima and saddle points in nonconvex optimization because of their inherent noise [60, 26]. We observe numerically that IRWF and block IRWF converge to the global minimum from a random starting point even with a very small sample size close to the theoretical limits [57]. It is of interest to analyze theoretically how stochastic methods escape these local minima (not just saddle points) efficiently.

Appendix

We first introduce some notations here. We let be a linear map. We let and denote the norm and norm of a vector, respectively. Moreover, let and denote the Frobenius norm and the spectral norm of a matrix, respectively. We note that the constants may be different from line to line, for the sake of notational simplicity.

Appendix A Proof of Proposition 1: Performance Guarantee for Initialization

Compared to the proof for TWF [2], this proof has new technical developments to address the magnitude measurements and truncation from both sides.

We first estimate the norm of as

(30)

Since , by a Hoeffding-type inequality, it can be shown that

(31)

holds with probability at least for some constant .

Moreover, given , ’s are independent sub-Gaussian random variables. Thus, by a Hoeffding-type inequality, it can be shown that

(32)

holds with probability at least for some constant .

On the event , it can be argued that

(33)

Without loss of generality, we let . Then on the event , the truncation function satisfies the following bounds

Thus, by defining

we have . We further compute the expectations of and and obtain

(34)

where

where . For given and , small value of yields arbitrarily close and , as well as arbitrarily close and . For example, taking and , we have .

Now applying standard results on random matrices with non-isotropic sub-Gaussian rows [61, equation (5.26)] and noticing that can be rewritten as for sub-Gaussian vector , one can derive

(35)

with probability for some positive which is only affected by , provided that exceeds a certain constant. Furthermore, when is sufficiently small, one further has . Combining the above facts together, one can show that

(36)

Let be the normalized leading eigenvector of . Following the arguments in [1, Section 7.8] and taking and to be sufficiently small, one has

(37)

for a given , as long as exceeds a certain constant.

Appendix B Supporting Arguments for Section 2.2

B.1 Expectation of Loss Functions

The expectation of the loss function (2) of WF is given by [19] as

(38)

We next show that the expectation of the loss function (3) of RWF has the following form: