Reconstructing signal from intensity measurements only, which is also known as phase retrieval, is an important problem with applications in various fields, including the X-ray crystallography, astronomy, and diffraction imaging [Fienup(1982), Gerchberg(1972), Hauptman(1991)]
. There is a recent resurgence of interest in solving phase retrieval problem in machine learning community. A typical setting of those works is that, the observed datais of the form,
where is a random gaussian measurement vector and is unknown [Candes et al.(2013)Candes, Strohmer, and Voroninski]. Numerous approaches have been proposed to solve this problem. They can be mainly categorized into two classes. The first one is converting the problem into a convex program which can be easily solved. PhaseLift [Candes et al.(2013)Candes, Strohmer, and Voroninski] and PhaseCut [Waldspurger et al.(2015)Waldspurger, d’Aspremont, and Mallat] all fall into this category. It has been established that this kind of algorithm can recover any vector exactly from only samples [Candès and Li(2014)]. However, the computational complexities of those convex surrogates scale as
, which limits applicability to high dimensional data. Another class of algorithm is to optimize the nonconvex problem directly. One of the most successful algorithm in this category is the Wirtinger Flow (WF), which was proposed by[Candes et al.(2015)Candes, Li, and Soltanolkotabi]
. Chen et al. improved WF by wisely discarding certain outlier gradients, and named this procedure as the Truncated Wirtinger Flow (TWF)[Chen and Candes(2015)]. The TWF method achieves linear sample complexity and linear-time computational cost, which is the optimal statistical complexity one can obtain for solving a well-posed problem. There are numerous follow up works focus on improving the truncation rules [Kolte and Özgür(2016), Wang et al.(2016)Wang, Giannakis, and Eldar, Zhang et al.(2016)Zhang, Chi, and Liang, Wang et al.(2017)Wang, Giannakis, Saad, and Chen]. These works lower the critical sampling ratio required to exactly solve the random systems of quadratic equations. In realistic scenarios, the systems to be solved often admit low data to parameter ratio, e.g for real-valued system. For instance, the crystals of important large protein complexes diffract to low resolutions, which limits the number of measurements collected in the experiment and compromises the model accuracy [Poon et al.(2007)Poon, Chen, Lu, Vyas, Quiocho, Wang, and Ma, Schröder et al.(2010)Schröder, Levitt, and Brunger]. Therefore, it’s of great importance to develop new methods which are capable of solving systems with low data to parameter ratio, and can be employed in real-world problems. In this paper, we follow the second route to solve the nonconvex phase retrieval problem directly and design new methods which can not only improve the success rate of solving the random systems of quadratic equations with low data to parameter ratio but also admits optimal statistical complexity.
Our contributions can be summarized as follows; 1) We designed a novel data dependent nonlinear weight function to improve the regularity of wirtinger flow in the area with large model errors. This approach is different from the traditional truncation rules and has shown to increase the empirical success rates in numerical simulation. 2) We developed a new spectral initialization method which can obtain initial solution with high correlation w.r.t the true signal. 3) By combining our weight function with a weight function similar to the reweighted amplitude flow (RAF) [Wang et al.(2017)Wang, Giannakis, Saad, and Chen], we obtained a more powerful phasing method, RTanhWF, and achieved the highest empirical success rates on solving the random systems of quadratic equations relative to state-of-the-art approaches. 4) We explored a different approach to establish the regularity condition which is the mainstream method to demonstrate the convergence of phasing method. Instead of using the net argument, we leveraged the results about the supreme of empirical process and obtained tighter bounds.
The remainder of this paper is organized as follows. We first present how our weight functions are originated and developed, and how they can be combined with RAF in section 2. The numerical test results of our new methods and the TWF are shown in section 3. Detailed theoretical analysis for our methods can be found in section 4. The final section is attributed to concluding remarks.
2 Tanh wirtinger Flow
Throughout this paper, we will use the following notation. We denote the true real signal as , and the design matrix as where We use to represent a quantity which is of the order , where is a constant grater than 1. sgn refers to the sign function, .
We first assume that the solution and our current estimation are statistically independent. For simplicity, we restrict our attention to the real-valued system. Given an estimated signal, for the gaussian random measurement vector , the corresponding observation and the estimation form a bivariate guassian, which is of the form
where , and the covariance matrix can be written as,
Denote as and as , the joint probability can be expanded as
and the probability of observing is . Thus the probability of conditioned on can be otained as follows,
where , which is the correlation between the solution and real signal . Given the conditional probability , we can derive a likelihood function for the estimation of , which can be obtained by maximizing the total log likelihood . By observing that and are of the unit vector forms in the likelihood function, we can further simplify the likelihood to obtain the following target function,
It’s worth noting that similar likelihood function has been derived in crystallography fields for a long time [Pannu and Read(1996), Murshudov et al.(1997)Murshudov, Vagin, and Dodson, Bricogne and Irwin(1996)]. General overviews for the likelihood function can be found in [Murshudov et al.(1997)Murshudov, Vagin, and Dodson, Lunin et al.(2002)Lunin, Afonine, and Urzhumtsev]. The gradient of the target function 6 for any is of the form,
We then have the corresponding gradient descent update rule,
The vanilla gradient descent is not the only method to update parameters. In fact, we can incorporate our new gradient with any first order optimization algorithm.
To gain some empirical understandings about the new gradient, we reexamine it in the noiseless setting, namely, . Suppose , we rewrite the gradient in equation 7 as
where serves as an estimator for the true signal . It’s approximated as linear combinations of measurement vectors where each vector is weighted by . The factor is at the core of our new form of gradient since it serves to modulate the contribution of each observation with the estimated phase. We will investigate how this factor affects estimating . It should be noted that the correlation between true signal and estimation, , in the gradient 8 remains unknown unless the true solution is revealed. It in turn requires us to design effective methods to estimate the correlation. In the traditional crystallographic refinement field, this factor is often estimated by the maximum likelihood method [Lunin and Skovoroda(1995)], which introduces additional complexity.
We encountered the problem about designing effective weight function without any knowledge of the correlation. The main principle of such weight function is to downweight the data point where the estimated phase is wrong, thus improving the correlation between the estimated signal and the true signal . In this paper, we used a geometric observation about phase retrieval problem to achieve this goal. According to the Grothendieck’s identity from the Lemma 3.6.6 in [Vershynin(2016)], the relationship between the inner products of a random gaussian vector with any fixed vectors can be understood by reducing the problem to . Applying the transformation, by equation 24, we have , where is the length of projected design vector. A few observations lead us to consider as a good candidate for the weight function since its average value is smaller in the region where is negative, namely, where the estimatied phase is incorrect. This prompts us to estimate the length of projected design vector. As the square norm of projected design vector is given by , suppose and are of equal length, we can approximate it with and denote it as . The final weight function is of the form and the corresponding wirtinger flow is named as TanhWFQ. We further consider alternative form for the TanhWFQ. We first reorganize as,
It is easy to identify that the weight function consists of two layers of transformations: the first layer is a quadratic transformation about the variable
, and the second layer applies a tanh activation function to the output of the first layer. We may replace the quadratic transformation with an absolute function, which can be written asand changes more conservatively in certain region. The wirtinger flow weighted by the new alternative weight function is named as TanhWFL, while this class of wirtinger flow is called TanhWF.
We proceed to show how the TanhWF can be further improved by combining with the RAF [Wang et al.(2017)Wang, Giannakis, Saad, and Chen] and further explain our choice of weight function using probability argument. The weighting scheme proposed in RAF applies to each , which serves as an estimator for , while our weight function acts upon the estimator for only. We postulate that our weighting scheme is complementary to the RAF weighting scheme. We then propose a new type of wirtinger flow by knitting together these two weighting schemes as below,
where and are the weight functions in TanhWF and RAF, respectively. Since our weight functions depends solely on the value of , we should explore the connection between its value and the credibility of the estimated phase. In other words, we will compare the probability of obtaining correct estimated phase and the probability of obtaining wrong estimated phase for a given value of statistics. We can then construct certain weight functions to downweight the data points with ambiguous estimated phases. Denote as , using the results obtained in appendix A, we have or when the phases of and are the same, and when the phases of and are different. Given that the estimated phase agrees with as the true phase, we have . By equation 14, the probability density of is of the form
where , which is the relative error, and is the correlation between error and true signal. When the phases of and differs, the statistics can be expressed as . Let with . By change of variables, we have
We then compare the probability densities of and , whose values are in the interval , on the event that they have the same value. Suppose and , we have the inequality,
which holds for as long as . Consequently, the most ambiguous regions are near the points where or 0. Hence, an ideal weight function should place small weights on the gradients around these points. Our weight function in ThanWF has exactly the desired property. In the combined wirtinger flow, we make the minimum of weight function in RAF be at , that is,
where is a time varying coefficient, and slightly adjust the weight function from TanhWFL. Since has already suppressed the magnitude of gradient around the point , we increase the value of around this point while reducing the value of around by setting it as , where is also a time varying coefficient and is a bias constant. Both and
control the magnitude or the step size of gradient. By the heuristic in[Candes et al.(2015)Candes, Li, and Soltanolkotabi], they can be updated using an increasing function w.r.t step, which is of the form . Finally, we name this method as RTanhWFL and present it with Nesterov accelerated gradient descent [Nesterov(1983), Bengio et al.(2013)Bengio, Boulanger-Lewandowski, and Pascanu] in algorithm 2.
Except the update rule, the form of the tanh weighted wirtinger flow also inspires us to propose a new initialization algorithm. Suppose , we expect namely,
is the leading eigenvector of the matrix. In the initialization step, we can replace the term with a crude estimation. To further exclude those outliers which is weakly correlated with , we can discard the observations whose magnitudes are not exceeding certain threshold as it’s done in [Wang et al.(2016)Wang, Giannakis, and Eldar, Chen and Candes(2015)]. We summarize the workflow of our new phasing algorithm where the initial solution is generated by the new spectral initialization algorithm, the wirtinger flow is given by TanhWFL or TanhWFQ and the update rule is Nesterov accelerated gradient descent [Nesterov(1983), Bengio et al.(2013)Bengio, Boulanger-Lewandowski, and Pascanu] in algorithm 2.
Measurements and sampling vectors ; Initialization scale factor , trimming threshold , gradient type , momentum and step size .
Drawn from , and normalize it as . Set .
InputInput RefinementRefinement OutputOutput Measurements and sampling vectors ; Initial solution , exponential decay parameter , weights , bias , momentum and step size .
3 Numerical Experiment
In this section, we report the numerical simulation results to demonstrate the effectiveness of our initialization method and update rules. In all the simulations performed for TanhWF and TWF methods in this paper, we used the following parameter settings: for the initialization stage, we used 100 power iterations; for truncated spectral initialization, we set the trimming threshold ; for tanh weighted spectral initialization, the scale factor was set to be and the trimming threshold was set to be ; for TWF, we adopted the default parameters used in [Chen and Candes(2015)] when calculating the gradient; we set the number of iterations to be , the learning rate to be and the momentum to be . When testing the TWF method, we replaced the gradient calculation formula in algorithm 2 with the formula defined in TWF. We first compared these methods by measuring the empirical success rates on random systems of quadratic equations with different data to parameter ratios. The metric used to evaluate the solutions is the relative error which is defined in [Chen and Candes(2015)]. In these tests, we fixed the number of unknowns to be while varying the number of measurements from to with a step size of , and we considered a system as solved if the minimum relative error in the optimization process was smaller than . For each method and number of measurements, we conducted trials in which the system to be solved was randomly generated every time and the average values were reported. The final results for the empirical success rate test are presented in figure 1. As it can be seen from figure 1, the tanh weighted spectral initialization significantly improved the success rate of solving quadratic systems. Besides, the TanhWF methods also have higher success rates comparing with TWF method when using the same initialization method. Among all these methods, the TanhWFL with tanh weighted spectral initialization achieves the highest empirical success rate at every number of measurements. It can almost solve any random system of quadratic equations (with probability ) when the sampling ratio exceeds 2.
Another part of numerical experiment is to compare the empirical success rate of RTanhWFL method and the reweighted tanh wirtinger flow method without weighting , which is called RTanhWF and implemented by letting . We fixed the initialization method in this test to be the tanh weighted spectral method, and didn’t change the parameter setting for this method. We still worked with the random systems with the same number of unknowns. The parameter settings for RTanhWFL and RTanhWF were as follows; the exponential decay parameter was set to be ; the weights and were and , respectively; the bias was ; the learning rate was set to be and the momentum was ; the number of iterations was 1500. All the empirical success rates reported were the average values of 400 trials. The empirical success rates w.r.t different numbers of measurements are shown in figure 2. It can be seen that the RTanhWFL method can solve the random quadratic systems with probability higher than when the measurement to unknown ratio is not less than 1.7. It is also self-evident that the RTanhWFL method outperforms the RTanhWF method at every measurement to unknown ratio. To the best of our knowledge, this is currently the best possible result one can obtain for solving random systems of quadratic equations.
We also presented the relative errors and their correlations w.r.t true signal of the initial solutions returned by different initialization methods. The correlation between error and true signal is defined as
where and is the corresponding vector. These two quantities for the Tanh and Truncated spectral initialization methods, which are obtained at the sampling ratio and the number of unknowns , are shown in figure 3
. The initialization errors in different realizations are shown in the upper part of figure. It is self-evident that the initial solution returned by the Tanh method fluctuates around 0.95, and has smaller relative error in each of 100 realizations and smaller variance. The correlation between initial error and true signal for Tanh method fluctuates around 0.5.
4 Convergence Analysis of the TanhWFQ method
In this section, we will perform the convergence analysis for TanhWFQ by verifying that it satisfies the regularity condition proposed in [Candes et al.(2015)Candes, Li, and Soltanolkotabi]. Without loss of generality, we assume throughout this section. To establish the regularity condition, we need to bound two quantities, and , which are the curvature and smoothness of target function , respectively. We have the following two propositions for them. In the noiseless setting, for a fixed vector , with probability at least ,
holds for all , where and are universal positive constants.
Specifically, for all satisfying and outside the region , we have . Therefore, given sufficiently large , it is guaranteed that holds for a constant with high probability. Moreover, the norm of gradient satisfies the smoothness condition. In the noiseless setting, let
be an isotropic, sub-gaussian random matrix. Given, there exist some universal constants such that
holds with probability at least for all .
The detailed proofs of proposition 4 and proposition 4 can be found in appendix A and appendix B. These proofs consist of two steps. In the first step, the expectations of these quantities in certain regions are calculated by assuming is statistically independent from the measurement vectors . However, due to the fact that the expectation has no analytical solution, we used dyadic decomposition to obtain numerically integrated bounds on those regions. The second step is to demonstrate the concentration property of these quantities for all in those regions. Instead of using the net argument in previous works, we resort to a result about the supreme of empirical process, which is more general and provides tighter bound.
Given and hold with high probability, we have the following main theorem. In the noiseless setting, for a fixed vector , there exist some universal constants and such that with probability exceeding ,
holds for all satisfying and not in the set as long as is sufficiently small.
Proof. Since , we have
Therefore, as long as , namely, , the gradient update is contractive and our algorithm enjoys geometric convergence rate. The convergence properties of TanhWFL method can also be established using similar approaches.
4.1 Initialization via tanh weighted spectral method
To demonstrate that it is possible to solve the phasing problem with our algorithm, it remains to prove that the spectral initialization method can return a solution which is close to the true signal . Without loss of generality, we assume the minimum distance between estimated solution and true signal is . The distance between and is bounded in the following theorem. Consider the model where and , with probability exceeding , the solution returned by the tanh weighted spectral method obeys
where is a small constant, provided that for some sufficiently large constant .
Our proof starts by introducing a new vector with the norm and is parallel to the vector . Then the distance between and can be decomposed as,
Without loss of generality, we can assume . Since the length of is determined by an average of , using the inequality 154 in [Chen and Candes(2015)], the first term in the inequality 10 can be bounded by with probability for sufficiently large . It takes more efforts to show that the second term is also bounded by a small constant with high probability. The detailed proof about the bound of second term is included in appendix.
5 Concluding Remarks
In this paper, we presented a new phase retrieval algorithm which employs the tanh activation function to weight the current estimation about the phase for each measurement. We have shown that the TanhWF method has higher success rate in solving random systems of quadratic equations than the TWF method when using the same initialization method and parameter update rule. In addition, we also proposed a new tanh weighted spectral initialization method which significantly improved the success rate comparing with the truncated initialization method. We have proved that the TanhWF method satisfies the regularity condition for gaussian design matrix [Candes et al.(2015)Candes, Li, and Soltanolkotabi]. Finally, we designed the RTanhWFL method which achieved the best possible performance for solving random quadratic systems. It is worth pointing out that there remain some problems to be addressed, such as completing the convergence analysis for the RTanhWFL method and in the noisy setting, and investigating the effect of acceleration in our method. Future possible research extensions include extending our theoretical analysis to complex-valued signals, and developing criteria to compare different wirtinger flow methods.
We thank a bunch of people.
- [Bengio et al.(2013)Bengio, Boulanger-Lewandowski, and Pascanu] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE, 2013.
- [Bricogne and Irwin(1996)] G Bricogne and JJ Irwin. Maximum-likelihood refinement of incomplete models with buster+ tnt. Macromolecular refinement. Proceedings of the CCP4 Study Weekend at Chester College, pages 85–92, 1996.
- [Bubeck(2013)] Sebastien Bubeck. https://blogs.princeton.edu/imabandit/2013/03/28/smoothfunctions/. 2013.
- [Candès and Li(2014)] Emmanuel J Candès and Xiaodong Li. Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017–1026, 2014.
- [Candes et al.(2013)Candes, Strohmer, and Voroninski] Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
- [Candes et al.(2015)Candes, Li, and Soltanolkotabi] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [Chen and Candes(2015)] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.
- [Dirksen et al.(2015)] Sjoerd Dirksen et al. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.
- [Fienup(1982)] James R Fienup. Phase retrieval algorithms: a comparison. Applied optics, 21(15):2758–2769, 1982.
- [Gerchberg(1972)] Ralph W Gerchberg. A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik, 35:237–246, 1972.
- [Hauptman(1991)] Herbert A Hauptman. The phase problem of x-ray crystallography. Reports on Progress in Physics, 54(11):1427, 1991.
- [Jensen(2000)] Jerry L Jensen. Statistics for petroleum engineers and geoscientists, volume 2. Gulf Professional Publishing, 2000.
- [Kolte and Özgür(2016)] Ritesh Kolte and Ayfer Özgür. Phase retrieval via incremental truncated Wirtinger flow. arXiv preprint arXiv:1606.03196, 2016.
- [Liaw et al.(2017)Liaw, Mehrabian, Plan, and Vershynin] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric Aspects of Functional Analysis, pages 277–299. Springer, 2017.
- [Lunin and Skovoroda(1995)] V Yu Lunin and TP Skovoroda. R-free likelihood-based estimates of errors for phases calculated from atomic models. Acta Crystallographica Section A: Foundations of Crystallography, 51(6):880–887, 1995.
- [Lunin et al.(2002)Lunin, Afonine, and Urzhumtsev] VY Lunin, PV Afonine, and AG Urzhumtsev. Likelihood-based refinement. I. Irremovable model errors. Acta Crystallographica Section A: Foundations of Crystallography, 58(3):270–282, 2002.
- [Murshudov et al.(1997)Murshudov, Vagin, and Dodson] Garib N Murshudov, Alexei A Vagin, and Eleanor J Dodson. Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallographica Section D: Biological Crystallography, 53(3):240–255, 1997.
- [Nesterov(1983)] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O (1/k2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
- [Pannu and Read(1996)] Navraj S Pannu and Randy J Read. Improved structure refinement through maximum likelihood. Acta Crystallographica Section A: Foundations of Crystallography, 52(5):659–668, 1996.
- [Poon et al.(2007)Poon, Chen, Lu, Vyas, Quiocho, Wang, and Ma] Billy K Poon, Xiaorui Chen, Mingyang Lu, Nand K Vyas, Florante A Quiocho, Qinghua Wang, and Jianpeng Ma. Normal mode refinement of anisotropic thermal parameters for a supramolecular complex at 3.42-å crystallographic resolution. Proceedings of the National Academy of Sciences, 104(19):7869–7874, 2007.
- [Schröder et al.(2010)Schröder, Levitt, and Brunger] Gunnar F Schröder, Michael Levitt, and Axel T Brunger. Super-resolution biomolecular crystallography with low-resolution data. Nature, 464(7292):1218, 2010.
Introduction to the non-asymptotic analysis of random matrices.Eprint Arxiv, 2010.
- [Vershynin(2016)] Roman Vershynin. High Dimensional Probability. 2016.
- [Waldspurger et al.(2015)Waldspurger, d’Aspremont, and Mallat] Irène Waldspurger, Alexandre d’Aspremont, and Stéphane Mallat. Phase recovery, maxcut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81, 2015.
- [Wang et al.(2016)Wang, Giannakis, and Eldar] Gang Wang, Georgios B Giannakis, and Yonina C Eldar. Solving systems of random quadratic equations via truncated amplitude flow. arXiv preprint arXiv:1605.08285, 2016.
- [Wang et al.(2017)Wang, Giannakis, Saad, and Chen] Gang Wang, Georgios B Giannakis, Yousef Saad, and Jie Chen. Solving almost all systems of random quadratic equations. arXiv preprint arXiv:1705.10407, 2017.
- [Yu et al.(2014)Yu, Wang, and Samworth] Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323, 2014.
- [Zhang et al.(2016)Zhang, Chi, and Liang] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable Non-convex Phase Retrieval with Outliers: Median TruncatedWirtinger Flow. In International conference on machine learning, pages 1022–1031, 2016.
Appendix A Proof of proposition 4: the local curvature condition
We first verify that the curvature satisfies the lower bound,
where is a constant smaller than 1 in certain regions. We rewrite this quantity to strengthen its connection with ,
The first term in equation 11 can be bounded using the standard result since
is a simple gaussian random variable. It then boils down to showing thatholds with high probability. Our proof mainly consists of two parts: we will calculate the expectation of this random variable and shows it is smaller than , where in certain regions; next, we demonstrate that the sample average is concentrated around its expectation with high probability for all in a set. We begin with writing the random variable as a function of and only. Suppose , when and have the same sign, we have , which is equivalent to and ; when and have different phases, always holds, thus resulting in . In the case of , are also in the same ranges under these conditions. Hence, denote as , as and as , we can express by
namely, the random variable is the product of a function and
. We then turn to derive the joint distribution ofand which can be used to calculate the expectation of . We have the following proposition about the joint distribution of and . Suppose is a random Gaussian vector where each element has zero mean and unit variance, given two vectors , the joint distribution of and is
a.1 Proof of the Proposition 12
Since the joint probability of and
is a bivariate gaussian distribution with covariance matrix,
the conditional distribution of given [Jensen(2000)] can be expressed as,
while the marginal distribution of is a gaussian with variance . Denote as , as and as , we then have
The Jacobian matrix of this transformation is
Using change of variables, the joint probability of and is