Optimal Rates of Convergence for Noisy Sparse Phase Retrieval via Thresholded Wirtinger Flow

06/10/2015 · by T. Tony Cai, et al.

This paper considers the noisy sparse phase retrieval problem: recovering a sparse signal x ∈R^p from noisy quadratic measurements y_j = (a_j' x )^2 + ϵ_j, j=1, ..., m, with independent sub-exponential noise ϵ_j. The goals are to understand the effect of the sparsity of x on the estimation precision and to construct a computationally feasible estimator to achieve the optimal rates. Inspired by the Wirtinger Flow [12] proposed for noiseless and non-sparse phase retrieval, a novel thresholded gradient descent algorithm is proposed and it is shown to adaptively achieve the minimax optimal rates of convergence over a wide range of sparsity levels when the a_j's are independent standard Gaussian random vectors, provided that the sample size is sufficiently large compared to the sparsity of x.


1 Introduction

In a range of fields in science and engineering, researchers face the problem of recovering a p-dimensional signal of interest x by probing it with a set of p-dimensional sensing vectors a_j, j = 1, ..., m, so that the observations are the inner products a_j' x contaminated with noise. This gives rise to the linear regression model in statistical terminology, where x is the regression coefficient vector and the matrix with rows a_j' is the design matrix. There is an extensive literature on the theory and methods for the estimation/recovery of x under such a linear model. However, in many important applications, including X-ray crystallography, microscopy, astronomy, diffraction and array imaging, interferometry, and quantum information, it is sometimes impossible to observe a_j' x directly, and the measurements that one is able to obtain are the magnitudes/energies of the a_j' x contaminated with noise. In other words, the observations are generated by the following phase retrieval model:

y_j = (a_j' x)^2 + ϵ_j,   j = 1, ..., m,   (1.1)

where ϵ = (ϵ_1, ..., ϵ_m)' is a vector of stochastic noise with mean zero. Note that (a_j' x)^2 = (a_j' (−x))^2, so in the real case, (1.1) can be treated as a generalized linear model with the multi-valued link function μ ↦ ±√μ. We refer interested readers to [41] and the references therein for more detailed discussions of the scientific and engineering background of this model.
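As a concrete illustration of model (1.1), the following self-contained Python sketch generates noisy quadratic measurements with a standard Gaussian design and centered exponential (hence sub-exponential) noise; the dimensions p and m, the sparsity k, and the noise scale below are illustrative choices, not values used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    p, m, k, sigma = 400, 2000, 5, 0.5   # illustrative sizes, not the paper's settings

    # k-sparse signal x in R^p with a uniformly chosen support
    x = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    x[support] = rng.standard_normal(k)

    # independent standard Gaussian sensing vectors a_j (rows of A)
    A = rng.standard_normal((m, p))

    # centered sub-exponential noise: a shifted exponential has mean zero and heavy one-sided tails
    eps = sigma * (rng.exponential(scale=1.0, size=m) - 1.0)

    # noisy quadratic measurements y_j = (a_j' x)^2 + eps_j, as in (1.1)
    y = (A @ x) ** 2 + eps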

In many applications, especially those related to imaging, the signal x admits a sparse representation under some known and deterministic linear transformation. Without loss of generality, we assume in the rest of the paper that such a linear transform has already taken place and hence the signal x is sparse itself. In this case, model (1.1) is referred to as the sparse phase retrieval model. In addition, we consider the case where the ϵ_j's are independent centered sub-exponential random errors. This is motivated by the observation that in application settings where model (1.1) is appropriate, especially in optics, heavy-tailed noise may arise due to photon counting.

Efficient computational methods for phase retrieval have been proposed in the optics community, mostly based on the seminal work of Gerchberg, Saxton, and Fienup [21, 19]. The effectiveness of these methods relies on careful exploitation of prior information about the signal in the spatial domain. Moreover, these methods were later recognized as non-convex successive projection algorithms [30, 4], which offers insight into the occasionally observed stagnation of the iterates and failure to converge.

Recently, inspired by multiple illumination, novel computational methods have been proposed for phase retrieval that do not rely on a priori information about the signal. These methods include semidefinite programming [14, 10, 11, 44, 13], polarization [2], alternating minimization [37], gradient methods [12], alternating projection [35], etc. More importantly, profound and remarkable theoretical guarantees for these methods have also been established. As for noiseless sparse phase retrieval, semidefinite programming has been proven effective with theoretical guarantees [31, 38, 22]. Other empirical methods for sparse phase retrieval include belief propagation [39] and greedy methods [40].

Regarding noisy phase retrieval, some stability results have been established in the literature; see [9, 42, 15]. In particular, stability results were established in [16] for noisy sparse phase retrieval by semidefinite programming, though the authors did not study the optimal dependence of the convergence rates on the sparsity of the signal and the sample size. Nearly minimax convergence rates for sparse phase retrieval with Gaussian noise were established in [28] under sub-gaussian design matrices. However, those rates are achieved by empirical risk minimization under a sparsity constraint, in which both the objective function and the constraint are non-convex, so the procedure is not computationally feasible.

In the present paper, we establish the minimax optimal rates of convergence for noisy sparse phase retrieval under sub-exponential noise, and propose a novel thresholded gradient descent method to estimate the signal under model (1.1). For conciseness, we focus on the case where the signal and the sensing vectors are all real-valued; the key ideas extend naturally to the complex case. The theoretical analysis sheds light on the effects of the sparsity of the signal and the presence of sub-exponential noise on the minimax rates for the estimation of x under the ℓ2 loss, as long as the sensing vectors a_j are independent standard Gaussian vectors. Combining the minimax upper and lower bounds given in Section 3, the optimal rate of convergence for estimating the signal x under the ℓ2 loss is determined by the sparsity k of x, its Euclidean norm ‖x‖_2, and the noise level σ. Moreover, it is shown that the thresholded gradient descent procedure is both rate-optimal and computationally efficient, and its sample size requirement matches the state-of-the-art result in computational sparse phase retrieval under structureless Gaussian design matrices.

We explain some notation used throughout the paper. For any p-dimensional vector w and any subset S ⊂ {1, ..., p}, we denote by w_S the p-dimensional vector obtained by keeping the coordinates of w with indices in S unchanged, while changing all other components to zero. We also denote ‖w‖_q = (∑_j |w_j|^q)^{1/q} for q ≥ 1, and ‖w‖_∞ = max_j |w_j|. We denote by ‖w‖_0 the number of nonzero components of w. For any matrix M and any subsets S_1 and S_2, M_{S_1, S_2} is defined by keeping the submatrix of M with row index set S_1 and column index set S_2, while changing all other entries to zero. For any 1 ≤ q_1, q_2 ≤ ∞, we denote by ‖M‖_{q_1 → q_2} the operator norm induced from ℓ_{q_1} to ℓ_{q_2}. For simplicity, denote ‖M‖ = ‖M‖_{2→2}. We also denote by I_p the p × p identity matrix.

The rest of the paper is organized as follows. In Section 2, we introduce in detail the thresholded gradient descent procedure, which consists of two steps. The first is an initialization step obtained by applying a diagonal thresholding method to a matrix constructed from the available data. The second step applies an iterative thresholding procedure for the recovery of the sparse vector x. Section 3 establishes the minimax optimal rates of convergence for noisy sparse phase retrieval under the ℓ2 loss; the results show that the proposed thresholded gradient descent method is rate-optimal. In Section 4, numerical simulations illustrate the effectiveness of thresholding in denoising, and demonstrate how the relative estimation error depends on the thresholding parameter, the sample size m, the sparsity k, and the noise-to-signal ratio. In Section 5, we discuss the connections between our thresholded gradient method for noisy sparse phase retrieval and related methods proposed in the literature for high-dimensional regression. The proofs are given in Section 6, with some technical details deferred to the appendix.

2 Methodology

The major component of our method is a thresholded gradient descent algorithm that seeks a sparse solution to a given non-convex empirical risk minimization problem. Due to the non-convex nature of the problem, the initialization step is crucial for avoiding local optima that are far away from the truth. Thus, we also provide a candidate initialization method that is theoretically justified to yield a good initializer. The methodology is proposed assuming that the design matrix has independent standard Gaussian entries, though it could potentially also be used when such an assumption does not necessarily hold.

2.1 Thresholded Wirtinger flow

Given the sensing vectors a_j and the noisy magnitude measurements y_j as in (1.1) for j = 1, ..., m, one can consider estimating x by minimizing the following empirical risk function

f(z) = (1/(2m)) ∑_{j=1}^{m} ((a_j' z)^2 − y_j)^2.   (2.1)

Statistically speaking, in the low-dimensional setup where p is fixed and m grows, if the additive noises are heavy-tailed, least-absolute-deviations (LAD) methods might be more robust than least-squares methods. However, recent progress in modern linear regression analysis shows that least squares can be preferable to LAD when p and m are proportional, even when the noises are sub-exponential [18]. Due to this surprising phenomenon, we simply take the least-squares empirical risk in (2.1), although phase retrieval is a nonlinear regression problem, which could be very different from linear regression. More importantly, closed-form gradients can be derived from the empirical risk function in (2.1), which is computationally convenient. To be specific, at any current value z, one updates the estimator by taking a step along the negative gradient direction, where

∇f(z) = (2/m) ∑_{j=1}^{m} ((a_j' z)^2 − y_j)(a_j' z) a_j,   (2.2)

until a stationary point is reached. Indeed, Candès et al. [12] showed that, under appropriate conditions and initialized by an appropriate spectral method, such a gradient method, referred to as Wirtinger flow, leads to accurate recovery of x up to a global phase in the complex, noiseless setting.
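For concreteness, here is a minimal numpy sketch of a least-squares risk of the form (2.1) together with its gradient; the 1/(2m) normalization is an assumption made for illustration and may differ from the constant used in (2.1) and (2.2).

    import numpy as np

    def empirical_risk(z, A, y):
        """Least-squares phase retrieval risk f(z) = (1/(2m)) * sum_j ((a_j' z)^2 - y_j)^2."""
        m = len(y)
        r = (A @ z) ** 2 - y
        return np.sum(r ** 2) / (2.0 * m)

    def gradient(z, A, y):
        """Gradient of the risk above: (2/m) * sum_j ((a_j' z)^2 - y_j) (a_j' z) a_j."""
        m = len(y)
        Az = A @ z
        return (2.0 / m) * (A.T @ ((Az ** 2 - y) * Az))

With these in hand, plain Wirtinger flow amounts to iterating z ← z − μ · gradient(z, A, y) for a suitable step size μ.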

However, the direct application of gradient descent is not ideal for noisy sparse phase retrieval since it does not utilize the knowledge that the true signal x is sparse in order to mitigate the contamination of the noise. To incorporate this a priori knowledge, it makes sense to seek a “sparse minimizer” of (2.1). To this end, suppose we have a sparse initial guess for x. To update it to another sparse vector, we may take a step along the negative gradient and then sparsify the result by thresholding.

Input: Data (a_j, y_j), j = 1, ..., m; initial estimator; thresholding function satisfying (2.5); gradient tuning parameter; thresholding tuning parameter; number of iterations.
Output: Final estimator.
Initialize the iterate with the initial estimator and repeat:
      Compute the threshold level
(2.3)
      Update the iterate by a thresholded gradient step
(2.4)
      where the gradient is defined in (2.2);
until the prescribed number of iterations is reached. Return the final iterate.
Algorithm 1: Thresholded Wirtinger flow for noisy sparse phase retrieval

Indeed, if we were given the oracle knowledge of the support S of x, then we could reduce the problem to recovering x_S based on the data restricted to the coordinates in S. By avoiding estimating any coordinate of x outside S, we could greatly reduce the variance of the resulting estimator of x. In reality, we do not have such oracle knowledge, and the additional thresholding step added on top of gradient descent is intended to mimic the oracle behavior by hopefully restricting all the updated coordinates to S.

Let the thresholding function be any univariate function satisfying

(2.5)

For any vector, the thresholding function is applied coordinatewise with a common threshold level. With the foregoing definition, the proposed thresholded gradient descent method can be summarized as Algorithm 1. In view of the Wirtinger flow method for noiseless phase retrieval [12], we name our approach the “Thresholded Wirtinger Flow” method. The data-driven choice of the threshold level in (2.3) is motivated by the following intuition. Recall that we assume the sensing vectors are independent standard Gaussian vectors. For a fixed iterate, if we act as if each coefficient multiplying a_j in (2.2) were a fixed constant, then the gradient in (2.2) is a linear combination of Gaussian vectors and hence has i.i.d. Gaussian entries with mean zero and a common variance determined by the data and the current iterate. Therefore, the threshold is simply a constant multiple of √(log p) times the standard deviation of these Gaussian random variables, which is essentially the universal thresholding in the Gaussian sequence model literature [24]. Although the above intuition is not exactly true, the resulting thresholds in (2.3) are indeed the right choices, as justified later in Section 3 and illustrated in Section 4. Notice that there are two tuning parameters, the gradient tuning parameter and the thresholding tuning parameter, which should be treated as absolute constants. We will validate some theoretical choices and also provide practical choices later.
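To make the thresholded update and the universal-threshold intuition concrete, the sketch below applies soft thresholding after a gradient step, with the threshold set to a constant multiple of √(log p) times a plug-in estimate of the per-coordinate standard deviation of the gradient (treating the coefficients multiplying the a_j's as fixed). The use of soft thresholding, the plug-in scale estimate, and the constants mu and alpha are assumptions made for illustration; they are not the exact specification in (2.3) and (2.4).

    import numpy as np

    def soft_threshold(w, tau):
        """Coordinatewise soft thresholding: shrink each entry toward zero by tau."""
        return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

    def thresholded_gradient_step(z, A, y, mu=0.1, alpha=1.0):
        """One thresholded Wirtinger-flow-style update (illustrative constants mu and alpha)."""
        m, p = A.shape
        Az = A @ z
        c = (Az ** 2 - y) * Az                      # coefficients multiplying the a_j's in the gradient
        grad = (2.0 / m) * (A.T @ c)                # gradient of the least-squares risk
        # if the c_j were fixed, each gradient coordinate would be Gaussian with this standard deviation
        scale = 2.0 * np.sqrt(np.mean(c ** 2) / m)
        tau = alpha * np.sqrt(np.log(p)) * scale    # universal-threshold-style level
        return soft_threshold(z - mu * grad, tau)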

2.2 Initialization

Input: Data (a_j, y_j), j = 1, ..., m; tuning parameter.
Output: Initial estimator.
Compute
(2.6)
and
(2.7)
Select a set of coordinates
(2.8)
Compute a matrix
(2.9)
Return
(2.10)
where the returned vector is proportional to the leading eigenvector of the matrix in (2.9).
Algorithm 2: Initialization for Algorithm 1

It is worth noting that the success of Algorithm 1 depends crucially on the initial estimator for two reasons. First, the empirical risk (2.1) is a non-convex function of z and hence it could have multiple local minimizers, so the success of a gradient descent based approach depends naturally on the starting point. Moreover, an accurate initializer can reduce the required number of iterations of the thresholded Wirtinger flow algorithm. In view of its crucial role, we propose in Algorithm 2 an initialization method which can be proven to yield a decent starting point for Algorithm 1 under our modeling assumptions.

The motivation of the algorithm is similar to that of diagonal thresholding [25] for sparse PCA: we want to identify a small collection of coordinates with large marginal signals and then compute an estimator of x by focusing only on these coordinates. In particular, the quantity in (2.7) captures the marginal signal strength of each coordinate, and (2.8) selects all coordinates with large marginal signals. Last but not least, (2.9) and (2.10) compute the initial estimator by focusing only on the selected coordinates. There is one tuning parameter needed as input to the algorithm, which can be treated as an absolute constant. We will provide a theoretically justified choice later.
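The following numpy sketch conveys the spirit of Algorithm 2: score each coordinate by a marginal statistic, keep the coordinates with large scores, form a spectral matrix restricted to those coordinates, and return a scaled leading eigenvector. The specific score, the top-`num_keep` selection rule, and the norm estimate below are assumptions made for illustration; they are not the exact formulas (2.6)-(2.10).

    import numpy as np

    def diagonal_thresholding_init(A, y, num_keep):
        """Sparse spectral initializer in the spirit of Algorithm 2 (illustrative version)."""
        m, p = A.shape
        # marginal score of coordinate j: average of y_i * a_{ij}^2, which tends to be
        # larger when coordinate j belongs to the support of x
        scores = (A ** 2).T @ y / m
        S = np.argsort(scores)[-num_keep:]                 # keep the coordinates with the largest scores
        # spectral matrix (1/m) * sum_i y_i * a_{i,S} a_{i,S}' restricted to the selected coordinates
        M = (A[:, S] * y[:, None]).T @ A[:, S] / m
        eigvals, eigvecs = np.linalg.eigh(M)
        v = eigvecs[:, -1]                                 # leading eigenvector on the selected set
        z0 = np.zeros(p)
        z0[S] = v
        # scale by a rough norm estimate: E[y_j] = ||x||_2^2 when the noise is centered
        z0 *= np.sqrt(max(np.mean(y), 0.0))
        return z0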

3 Theory

We first establish the statistical convergence rate for the thresholded Wirtinger flow method in the case of a “Gaussian design”, i.e., the a_j's in (1.1) are independent N(0, I_p) vectors. Moreover, we assume the signal x is k-sparse, i.e., ‖x‖_0 ≤ k, and the noises ϵ_j are independent centered sub-exponential random variables with maximum sub-exponential norm σ, i.e., max_j ‖ϵ_j‖_{ψ1} ≤ σ. Here, for any random variable X, its sub-exponential norm is defined as ‖X‖_{ψ1} = sup_{q ≥ 1} q^{-1} (E|X|^q)^{1/q}. This definition, as well as some fundamental properties of sub-exponential variables (such as Bernstein's inequality), can be found in Section 5.2.4 of [43].
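For reference, the sub-exponential norm above and the corresponding Bernstein-type inequality for independent centered sub-exponential variables X_1, ..., X_m with max_j ‖X_j‖_{ψ1} ≤ K take the following standard form, as in [43], up to the values of the absolute constants:

    \[
    \|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1}\big(\mathbb{E}|X|^{q}\big)^{1/q},
    \qquad
    \mathbb{P}\left( \Big| \frac{1}{m}\sum_{j=1}^{m} X_j \Big| \ge t \right)
    \le 2\exp\left[ -c\, m \min\left( \frac{t^{2}}{K^{2}}, \frac{t}{K} \right) \right]
    \quad \text{for all } t \ge 0,
    \]

where c > 0 is an absolute constant.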

Theorem 3.1

Suppose the tuning parameters in (2.3) and (2.8) are chosen to be sufficiently large absolute constants, the gradient tuning parameter in (2.4) is a suitable absolute constant, and the sample size is sufficiently large compared to the sparsity. Then, for all iterations, with high probability there holds

where the remaining quantities appearing in the bound are absolute constants.

The proof is given in Section 6. Lemma 6.3 guarantees the efficacy of the initialization step (Algorithm 2), and Lemmas 6.4 and 6.5 explain why the thresholded Wirtinger flow method leads to accurate estimation. The theoretical choices of the tuning parameters here are made for analytical convenience; the discussion of empirical choices of the tuning parameters is deferred to Section 4.

Let us interpret Theorem 3.1 by considering the following cases. In the noiseless case, with high probability, the estimation error decays geometrically in the number of iterations. This implies that the thresholded gradient descent method leads to linear convergence to the original signal up to a global sign.

In the noisy case, if the noise level is an absolute constant, then by choosing the number of iterations appropriately we obtain the stated rate of convergence with high probability. If such knowledge is not available, by choosing a larger number of iterations we can obtain the same rate up to any predetermined accuracy. The convergence rate is better than the upper bound result established in [28], which is achieved by the intractable sparsity-constrained empirical risk minimization. Our contribution is to show that this rate can be obtained tractably by a fast algorithm.

Ignoring any polylog factor, the above properties of thresholded Wirtinger flow are guaranteed under a sample size condition of the order m ≳ k² log p. When the sample size is much smaller than the dimension p, this condition is crucial for the effectiveness of the initialization step (Algorithm 2). An immediate question is whether such a minimum sample size condition is in some sense necessary for any computationally efficient algorithm when the sensing matrix is random and structureless. A similar phenomenon has been previously observed in the related but different problem of sparse principal component analysis. Assuming the hardness of the planted clique problem [3], a series of papers [6, 45, 20] have shown that a comparable minimum sample size condition is necessary for any estimator computable in polynomial time to achieve consistency and the optimal convergence rates uniformly over a parameter space of interest. In particular, it was shown in [20] that this is the case even for the most restrictive parameter space in sparse principal component analysis, the (discretized) Gaussian single spiked model with a sparse leading eigenvector. Establishing comparable computational lower bounds for sparse phase retrieval, especially under the Gaussian design, is an interesting direction for future research.

In the case where the sample size is comparable to the dimension p (ignoring any log factor), it is well-known that a consistent initializer can be obtained by spectral methods [37, 12], no matter whether x is sparse or not. In other words, the diagonal thresholding idea in Algorithm 2 is not as crucial as in the case where the sample size is much smaller than p. It is interesting to investigate whether the sample size condition can be relaxed such that the optimal convergence rates can still be achieved by thresholded Wirtinger flow.

The convergence rate is essentially optimal, as shown by the following lower bound result, which has been essentially proven in [28]:

Theorem 3.2

([28]) Suppose the a_j's are i.i.d. N(0, I_p), the ϵ_j's are i.i.d. centered Gaussian, and they are mutually independent. Then, under model (1.1), there holds a minimax lower bound matching the upper bound of Theorem 3.1 up to constant factors, provided the sample size is sufficiently large, where the constants involved are absolute constants.

Notice that for a centered Gaussian variable with variance σ², its sub-exponential norm is a constant multiple of σ. For brevity, we do not rescale the Gaussian noises so that their sub-exponential norms are less than or equal to σ.

4 Numerical Simulation

In this section, we report numerical simulation results to demonstrate how the relative estimation error depends on the thresholding parameter, the noise-to-signal ratio (NSR), the sample size m, and the sparsity k. To guarantee a fair comparison, we always fix the length p of the signal and the initialization tuning parameter (except for the first case on the thresholding effect). Moreover, in each numerical experiment, we conservatively choose the gradient tuning parameter and the number of iterations for thresholded Wirtinger flow. The resulting estimator is denoted by x̂, and the relative error is measured up to a global sign. With each fixed sparsity k, the support of x is uniformly distributed at random, and the nonzero entries of x are generated i.i.d. at random. The noises are i.i.d. centered Gaussian, with a standard deviation determined by the signal strength and the choice of the NSR. As discussed before, the design matrix consists of independent standard Gaussian random variables.
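For concreteness, the error metric and the role of the NSR in the experiments can be coded along the following lines; defining the relative error up to a global sign and taking the NSR to be the noise standard deviation divided by ‖x‖_2^2 are assumptions made for illustration, since the exact conventions are not reproduced above.

    import numpy as np

    def relative_error(x_hat, x):
        """Relative estimation error, accounting for the global sign ambiguity of phase retrieval."""
        return min(np.linalg.norm(x_hat - x), np.linalg.norm(x_hat + x)) / np.linalg.norm(x)

    def noise_std_from_nsr(x, nsr):
        """Noise standard deviation chosen so that sigma / ||x||_2^2 equals the target NSR
        (the measurements (a_j' x)^2 scale like ||x||_2^2)."""
        return nsr * np.linalg.norm(x) ** 2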

  1. Thresholding effect: Fix the signal length, the sample size, the sparsity, and the NSR. For each value of the thresholding parameter, we run the algorithm repeatedly with independently generated signal, design, and noise, and then take the average of the independent relative errors. The relation between the average relative error and the thresholding parameter is plotted as the red curve in Figure 1. The result shows that the average relative error first decreases sharply as the thresholding parameter increases from zero, and then increases slowly as the thresholding parameter continues to grow.

    We implement the above experiments again with a single parameter changed. The relation between the relative estimation error and the thresholding parameter is plotted as the blue curve in Figure 1. It is clear that the performance of the algorithm is very close to that in the first setting.

    Figure 1: The relation between the average relative error and the thresholding parameter. The red and blue curves correspond to the two settings described above.
  2. Noise effect: Fix the signal length, the sample size, the sparsity, and the thresholding parameter. For each choice of the NSR, with instances generated independently, we take the average of the relative error. Figure 2 shows how the average relative error depends on the NSR: the average relative error strictly increases as the NSR increases.

    Figure 2: The relation between the average relative error and the noise-to-signal ratio, with the remaining parameters fixed as described in the text.
  3. Sample size effect: Fix the signal length, the sparsity, the NSR, and the thresholding parameter. For each choice of the sample size m, with instances generated independently, we take the average of the relative error. Figure 3 shows how the average relative error depends on the sample size. When the sample size takes the two smallest values considered, the average relative errors are large and the thresholded gradient descent method leads to poor recovery of the original signal. As the sample size increases, the average relative error decreases steadily.

    Figure 3: The relation between the average relative error and the sample size m, with the remaining parameters fixed as described in the text.
  4. Sparsity effect: Fix the signal length, the sample size, the NSR, and the thresholding parameter. For each choice of the sparsity k, with instances generated independently, we take the average of the relative error. Figure 4 demonstrates the relation between the average relative error and the sparsity: the average relative error essentially increases as the sparsity increases.

    Figure 4: The relation between the average relative error and the sparsity k, with the remaining parameters fixed as described in the text.

5 Discussion

In this paper, we established the optimal rates of convergence for noisy sparse phase retrieval under the Gaussian design in the presence of sub-exponential noise, provided that the sample size is sufficiently large. Furthermore, a thresholded gradient descent method called “Thresholded Wirtinger Flow” was introduced and shown to achieve the optimal rates.

Iterative thresholding has been employed in a variety of problems in high-dimensional statistics, machine learning, and signal processing, under the assumption that the signal or parameter vector/matrix satisfies a sparse or low-rank constraint. Examples include compressed sensing/sparse approximation [17, 36, 34, 7], sparse principal component analysis [33, 48], high-dimensional regression [1, 47, 23], and low-rank recovery [8, 26, 29].

Regarding the application of iterative thresholding and projected gradient methods in high-dimensional M-estimation, their statistical optimality has been established when the empirical risk function satisfies certain properties, such as restricted strong convexity and smoothness (RSC and RSS) [1, 47, 23]. Although our thresholded gradient method aims to solve (2.1) for a sparse solution, the existing analytical framework for high-dimensional M-estimation does not apply to the sparse phase retrieval problem, since the empirical risk function in (2.1) does not satisfy RSC in general, no matter how large the sample size is. Instead, we have shown that thresholded gradient methods can achieve optimal statistical precision for signal recovery even when the empirical risk function does not satisfy the common assumption of RSC.
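For context, a commonly used form of the restricted strong convexity condition in the high-dimensional M-estimation literature (in the style of [1, 47, 23]) requires, for all perturbations Δ around the target parameter β,

    \[
    f(\beta + \Delta) - f(\beta) - \langle \nabla f(\beta), \Delta \rangle
    \;\ge\; \frac{\kappa}{2}\,\|\Delta\|_2^2 \;-\; \tau\,\frac{\log p}{m}\,\|\Delta\|_1^2,
    \]

for some curvature constant κ > 0 and tolerance τ ≥ 0; the point made above is that the empirical risk in (2.1) fails to satisfy such a condition in general, regardless of the sample size.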

Besides thresholded gradient methods, convexly and non-convexly regularized methods are also widely used for high-dimensional M-estimation. In fact, some iterative thresholding methods are induced by regularization; see, e.g., [17]. Therefore, an alternative candidate method for solving the noisy sparse phase retrieval problem is to penalize the empirical risk function in (2.1) before taking the minimum, in order to promote a sparse solution. The major difficulty is apparently the non-convexity of the empirical risk function. An interesting result in [32] guarantees the statistical precision of all local optima, as long as the non-convex penalty satisfies certain regularity conditions and the empirical risk function, possibly non-convex, satisfies restricted strong convexity. A similar result appeared in [46], in which the empirical risk function is required to satisfy a sparse eigenvalue (SE) condition. However, returning to noisy sparse phase retrieval, the empirical risk function in (2.1) satisfies neither RSC nor SE in general, so there is no guarantee that all local optima are consistent. A natural question is whether some penalized version of (2.1) is strongly convex in a sufficiently large neighborhood of its global minimum, such that a tractable initializer lies in this neighborhood provided the sample size is sufficiently large. Another interesting question is whether the global minimizer of such a penalized version of (2.1) is a rate-optimal estimator of the original sparse signal. We leave these questions for future research.

6 Proof of Theorem 3.1

In model (1.1), denote , which implies . Without loss of generality, we assume . As to the Gaussian design matrix , denote

(6.1)

both of which are in .

For any two random variables/vectors/matrices/sets X and Y, we write X ⊥ Y if X and Y are independent.

Lemma 6.1

From the model (1.1), we have . Moreover, we have and , where and are defined in (2.6) and (2.7), respectively.

Proof The fact implies straightforwardly that . By (2.7), we know for all , are defined by and , which implies that for all . Finally, by (2.6), we know is determined uniquely by , which implies that .  

Lemma 6.2

On an event with probability at least ,

for some numerical constant . As a consequence, as long as , there holds

Proof By the definition of and , we have

As shown in Lemma A.7, with probability at least ,

for some numerical constant . Moreover, since is fixed, there holds

By Lemma 4.1 of [27], with probability at least , we have

The proof is done.  

Lemma 6.3

Let for some large enough absolute constant , and be defined in Algorithm 2. There exists a random vector satisfying and , such that on an event with probability at least , we have

provided . Here is an absolute constant.

Proof Recall that and for . Define

(6.2)

Since , we have . Define as the leading eigenvector of

with -norm . This easily implies . Since , we also have .

To simplify notation, let us write for any , , which implies . Notice that

(6.3)

in which we will first control the second term. For a given , we know are i.i.d. centered sub-exponential random variables with sub-exponential norms being an absolute constant. Then, by Bernstein inequality (see, e.g., Proposition 16 in [43]), we have with probability at least ,

for some absolute constant . Then by Lemma A.7, with probability at least , we have

(6.4)

provided for some absolute constant .

Next, we prove that with high probability . It suffices to prove , i.e., . For any , and are independent, and so conditional on , is a weighted sum of variables. By Lemma 4.1 of [27],

Moreover, Chebyshev’s inequality, the Gaussian tail bound and the union bound lead to

Thus, with probability at least , for all ,

(6.5)

Here the last inequality holds when for some absolute constant .

Since with large enough , by (6.3), (6.5), (6.4) and Lemma 6.2, we obtain that with probability at least , for all ,

which implies that .

Next, we prove that with high probability. For any fixed , straightforward calculation yields . On the other hand,

So for , we have , and . By Lemma A.1,

Next, Lemma 4.1 of [27] leads to with probability at least ,

The last two inequalities, together with (6.4) and (6.3), imply that with probability at least , for all ,

Define . Then, for all we have

Since with sufficiently large absolute constant , by Lemma 6.2, we have for all ,

with probability at least . This implies .

Therefore, we have , provided that . Notice that , which implies that