# DOLPHIn - Dictionary Learning for Phase Retrieval

We propose a new algorithm to learn a dictionary for reconstructing and sparsely encoding signals from measurements without phase. Specifically, we consider the task of estimating a two-dimensional image from squared-magnitude measurements of a complex-valued linear transformation of the original image. Several recent phase retrieval algorithms exploit underlying sparsity of the unknown signal in order to improve recovery performance. In this work, we consider such a sparse signal prior in the context of phase retrieval, when the sparsifying dictionary is not known in advance. Our algorithm jointly reconstructs the unknown signal - possibly corrupted by noise - and learns a dictionary such that each patch of the estimated image can be sparsely represented. Numerical experiments demonstrate that our approach can obtain significantly better reconstructions for phase retrieval problems with noise than methods that cannot exploit such "hidden" sparsity. Moreover, on the theoretical side, we provide a convergence result for our method.


## I Introduction

Phase retrieval has been an active research topic for decades [1, 2]. The underlying goal is to estimate an unknown signal from the modulus of a complex-valued linear transformation of the signal. With such nonlinear measurements, the phase information is lost (hence the name “phase retrieval”), rendering the recovery task ill-posed and, perhaps not surprisingly, NP-hard [3]. Traditional approaches consider cases where the solution is unique up to a global phase shift, which can never be uniquely resolved, and devise signal reconstruction algorithms for such settings. Uniqueness properties and the empirical success of recovery algorithms usually hinge on oversampling the signal, i.e., taking more measurements than the number of signal components.

The most popular techniques for phase retrieval are based on alternating projections, see [4, 5, 6] for overviews. These methods usually require precise prior information about the signal (such as knowledge of the support set) and often converge to erroneous results. More recent approaches include semidefinite programming relaxations [7, 8, 9, 10, 11] and gradient-based methods such as Wirtinger Flow [12, 13].

In recent years, new phase retrieval techniques were developed for recovering sparse signals, which are linear combinations of only a few atoms from a known dictionary [8, 14, 13, 15]. With a sparsity assumption, these algorithms obtained better recovery performance than traditional non-sparse approaches. The main idea is akin to compressed sensing, where one works with fewer (linear) measurements than signal components [16, 17, 18]. An important motivation for developing sparse recovery techniques was that many classes of signals admit a sparse approximation in some basis or overcomplete dictionary [19, 20, 21]. While sometimes such dictionaries are known explicitly, better results have been achieved by adapting the dictionary to the data, e.g., for image denoising [20]. Numerous algorithms have been developed for this task, see, e.g., [19, 22, 23]. In this traditional setting, the signal measurements are linear and a large database of training signals is used to train the dictionary.

In this work, we propose a dictionary learning formulation for simultaneously solving the signal reconstruction and sparse representation problems given nonlinear, phaseless and noisy measurements. To optimize the resulting (nonconvex) objective function, our algorithm—referred to as DOLPHIn (DictiOnary Learning for PHase retrIeval)—alternates between several minimization steps, thus monotonically reducing the value of the objective until a stationary point is found (if step sizes are chosen appropriately). Specifically, we iterate between best fitting the data and sparsely representing the recovered signal. DOLPHIn combines projected gradient descent steps to update the signal, iterative shrinkage to obtain a sparse approximation [24], and block-coordinate descent for the dictionary update [23].

In various experiments on image reconstruction problems, we demonstrate the ability of DOLPHIn to achieve significantly improved results when the oversampling ratio is low and the noise level high, compared to the recent state-of-the-art Wirtinger Flow (WF) method [12], which cannot exploit sparsity if the dictionary is unknown. In this two-dimensional setting, we break an image down into small patches and train a dictionary such that each patch can be sparsely represented using this dictionary. The patch size as well as the amount of overlap between patches can be freely chosen, which allows us to control the trade-off between the amount of computation required to reconstruct the signal and the quality of the result.

The paper is organized as follows: In Sections II and III, we introduce the DOLPHIn framework and algorithm. Then, in Section IV, we present numerical experiments and implementation details, along with discussions about (hyper-)parameter selection and variants of DOLPHIn. We conclude the paper in Section V. The appendices provide further details on the mathematical derivation of the DOLPHIn algorithm and its properties. A short preliminary version of this work appeared in the conference paper [25].

## II Phase Retrieval Meets Dictionary Learning

In mathematical terms, the phase retrieval problem can be formulated as solving a nonlinear system of equations:

$$\text{Find } x \in \mathcal{X} \subseteq \mathbb{C}^N \ \text{ s.t. } \ |f_i(x)|^2 = y_i \quad \forall\, i = 1,\dots,M, \qquad (1)$$

where the functions $f_i$ are linear operators and the scalars $y_i$ are nonlinear measurements of the unknown original signal $\hat{x}$, obtained by removing the phase information. The set $\mathcal{X}$ represents constraints corresponding to additional prior information about $\hat{x}$. For instance, when dealing with real-valued bounded signals, this may typically be a box constraint such as $\mathcal{X} = [0,1]^N$. Other common constraints include information about the support set—that is, the set of nonzero coefficients of $\hat{x}$. Classical phase retrieval concerns the recovery of $\hat{x}$ given the (squared) modulus of the signal’s Fourier transform. Other commonly considered cases pertain to randomized measurements (the $f_i$ are random linear functions) or coded diffraction patterns, i.e., concatenations of random signal masks and Fourier transforms (see, e.g., [2, 12]).

### II-A Prior and Related Work

The most popular methods for classical phase retrieval—Fienup’s algorithm [5] and many related approaches [2, 4, 6, 26, 27]—are based on alternating projections onto the set of points consistent with the modulus measurements and onto the constraint set $\mathcal{X}$. However, the nonconvexity of the measurement-consistency set makes the projection onto it not uniquely defined and possibly hard to compute. The success of such projection-based methods hinges critically on precise prior knowledge (which, in general, will not be available in practice) and on the choice of the projection operator onto $\mathcal{X}$. Ultimately, convergence to $\hat{x}$ (up to global phase) is in general not guaranteed, and these methods often fail in practice.

Further algorithmic techniques to tackle (1) include two different semidefinite relaxation approaches, PhaseLift [7] and PhaseCut [11]. PhaseLift “lifts” (1) into the space of (complex) positive semidefinite rank-one matrices via the variable transformation $X = xx^*$. Then, the nonlinear constraints are equivalent to linear constraints with respect to the matrix variable $X$. By suitably relaxing the immediate but intractable rank-minimization objective, one obtains a convex semidefinite program (SDP). Similarly, PhaseCut introduces a separate variable for the phase, allowing one to eliminate $x$, and then lifts the phase variable to obtain an equivalent problem with a rank-one constraint, which can be dropped to obtain a different SDP relaxation of (1). Despite some guarantees on when these relaxations are tight, i.e., allow for correctly recovering the solution of (1) (again up to a global phase factor), their practical applicability is limited by the size of the SDPs, whose dimension grows quadratically with the problem dimension.

A recent method that works in the original variable space is the so-called Wirtinger Flow algorithm [12]. Here, (1) is recast as the optimization problem

$$\min_{x \in \mathbb{C}^N} \ \frac{1}{4M}\sum_{i=1}^{M}\big(|f_i(x)|^2 - y_i\big)^2, \qquad (2)$$

which is approximately solved by a gradient descent algorithm. Note that in the case of complex variables, the concept of a gradient is not well-defined, but as shown in [12], a strongly related expression termed the “Wirtinger derivative” can be used instead and indeed reduces to the actual gradient in the real case. For the case of i.i.d. Gaussian random measurements, local convergence with high probability can be proven for the method, and a certain spectral initialization provides sufficiently accurate signal estimates for these results to be applicable. Further variants of the Wirtinger Flow (WF) method have been investigated: Truncated WF [28], which improves the search directions by a statistically motivated technique that filters out components bearing “too much” influence, and Thresholded WF [13], which allows for improved reconstruction of sparse signals (i.e., ones with only a few significant components or nonzero elements), in particular when the measurements are corrupted by noise.

The concept of sparsity has been successfully employed in the context of signal reconstruction from linear measurements, perhaps most prominently in the field of compressed sensing [16, 17, 18, 29] during the past decade. There, the task is to recover an unknown signal from linear measurements—that is, finding the desired solution among the infinitely many solutions of an underdetermined system of linear equations. For signals that are (exactly or approximately) sparse with respect to some basis or dictionary, i.e., when $x = Da$ for a matrix $D$ and a vector $a$ that has few nonzero entries, such recovery problems have been shown to be solvable in a very broad variety of settings and applications, and with a host of different algorithms. Dictionaries enabling sparse signal representations are sometimes, but not generally, known in advance. The goal of dictionary learning is to improve upon the sparsity achievable with a given (analytical) dictionary, or to find a suitable dictionary in the first place. Given a set of training signals, the task consists of finding a dictionary $D$ such that every training signal can be well-approximated by linear combinations of just a few atoms. Again, many methods have been developed for this purpose (see, e.g., [19, 20, 21, 22, 23]) and demonstrated to work well in different practical applications.

Signal sparsity (or compressibility) can also be beneficially exploited in phase retrieval methods, cf. [8, 9, 13, 14, 15]. However, to the best of our knowledge, existing methods assume that the signal is sparse itself or sparse with respect to a fixed pre-defined dictionary. This motivates the development of new algorithms and formulations to jointly learn suitable dictionaries and reconstruct input signals from nonlinear measurements.

### II-B Dictionary Learning for Phase Retrieval

In this paper, we consider the problem of phase retrieval by focusing on image reconstruction applications. Therefore, we will work in a two-dimensional setting directly. However, it should be noted that all expressions and algorithms can also easily be formulated for one-dimensional signals like (1), as detailed in Appendix A. We will also consider the case of noisy measurements, and will show that our approach based on dictionary learning is particularly robust to noise, which is an important feature in practice.

Concretely, we wish to recover an image $\hat{X} \in [0,1]^{N_1 \times N_2}$ from noise-corrupted phaseless nonlinear measurements

$$Y := \big|\mathcal{F}(\hat{X})\big|^2 + N, \qquad (3)$$

where $\mathcal{F}$ is a linear operator, $N$ is a real matrix whose entries represent noise, and the complex modulus and squares are taken component-wise. As mentioned earlier, signal sparsity is known to improve the performance of phase retrieval algorithms, but a sparsifying transform is not always known in advance, or a better choice than a predefined selection can sometimes be obtained by adapting the dictionary to the data. In the context of image reconstruction, this motivates learning a dictionary $D$ such that each patch $x_i$ of $\hat{X}$, represented as a vector, can be approximated by $x_i \approx D a_i$ with a sparse coefficient vector $a_i$. Here, the patch size is chosen a priori and the number $p$ of patches depends on whether the patches are overlapping or not. In general, the number of dictionary atoms is chosen at least as large as the patch dimension, so that $D$ is (over)complete. With linear measurements, the paradigm would be similar to the successful image denoising technique of [20], but the problem (3) is significantly more difficult to solve due to the modulus operator.

Before detailing our algorithm for solving (3), we introduce the following notation. Because our approach is patch-based (as are most dictionary learning formulations), we consider the linear operator $\mathcal{E}$ that extracts the $p$ patches $x_1,\dots,x_p$ (which may overlap or not) from an image $X$ and forms the matrix $\mathcal{E}(X) = (x_1,\dots,x_p)$. Similarly, we define the linear operator $\mathcal{R}$ that reverses this process, i.e., builds an image from a matrix containing vectorized patches as its columns. Thus, in particular, we have $\mathcal{R}(\mathcal{E}(X)) = X$. When the patches do not overlap, the operator $\mathcal{R}$ simply places every small patch at its appropriate location in a larger image. When they overlap, $\mathcal{R}$ averages the pixel values from the patches that fall into the same location. Further, let $A = (a_1,\dots,a_p)$ be the matrix containing the patch representation coefficient vectors as columns. Then, our desired sparse-approximation relation “$x_i \approx D a_i$ for all $i$” can be expressed as $\mathcal{E}(X) \approx DA$.
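The patch operators $\mathcal{E}$ and $\mathcal{R}$ can be sketched in a few lines of NumPy. This is a minimal sketch assuming square patches and a uniform stride; `extract_patches` and `reassemble` are hypothetical helper names, and the count array returned alongside the reassembled image corresponds to the per-pixel patch counts used later in the gradient computations of Section III-B:

```python
import numpy as np

def extract_patches(X, s, stride):
    """E: collect all s-by-s patches (column-vectorized) into an s^2-by-p matrix."""
    n1, n2 = X.shape
    cols = []
    for i in range(0, n1 - s + 1, stride):
        for j in range(0, n2 - s + 1, stride):
            cols.append(X[i:i+s, j:j+s].reshape(-1))
    return np.stack(cols, axis=1)

def reassemble(P, shape, s, stride):
    """R: rebuild an image from patch columns, averaging overlapping pixels.
    Also returns the count matrix (number of patches covering each pixel)."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    k = 0
    for i in range(0, shape[0] - s + 1, stride):
        for j in range(0, shape[1] - s + 1, stride):
            acc[i:i+s, j:j+s] += P[:, k].reshape(s, s)
            cnt[i:i+s, j:j+s] += 1
            k += 1
    return acc / cnt, cnt
```

With non-overlapping patches the count matrix is all ones; with overlap, reassembly averages, so $\mathcal{R}(\mathcal{E}(X)) = X$ holds in both cases.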

With this notation in hand, we may now introduce our method, called DOLPHIn (DictiOnary Learning for PHase retrIeval). We consider an optimization problem which can be interpreted as a combination of an optimization-based approach to phase retrieval—minimizing the residual norm with respect to the set of nonlinear equations induced by the phaseless measurements, cf. (2)—and a (patch-based) dictionary learning model similar to that used for image denoising in [20]. The model contains three variables: the image, or phase retrieval solution, $X$; the dictionary $D$; and the matrix $A$ containing as columns the coefficient vectors of the patch representations $x_i \approx D a_i$. The phase retrieval task consists of estimating $X$, and the dictionary learning or sparse coding task consists of estimating $D$ and $A$; a common objective function provides feedback between the two tasks, with the goal of improving the phase retrieval reconstruction procedure by encouraging the patches of $X$ to admit a sparse approximation.

Formally, the DOLPHIn formulation consists of minimizing

$$\min_{X, D, A} \ \ \frac{1}{4}\big\|Y - |\mathcal{F}(X)|^2\big\|_F^2 + \frac{\mu}{2}\big\|\mathcal{E}(X) - DA\big\|_F^2 + \lambda\sum_{i=1}^{p}\|a_i\|_1 \quad \text{s.t.} \ \ X \in \mathcal{X}, \ D \in \mathcal{D}. \qquad (4)$$

Here, $\|\cdot\|_F$ denotes the Frobenius matrix norm, which generalizes the Euclidean norm to matrices. The parameters $\mu, \lambda > 0$ in the objective (4) provide a way to control the trade-off between the data fidelity term from the phase retrieval problem and the approximation sparsity of the image patches. (We discuss suitable choices and the sensitivity of the model to these parameters in detail in Section IV-D.) To that effect, we use the $\ell_1$-norm, which is well known to have a sparsity-inducing effect [30]. In order to avoid scaling ambiguities, we also restrict $D$ to be in the subset $\mathcal{D}$ of matrices with column $\ell_2$-norms at most 1, and assume that the number of patches exceeds the number of dictionary atoms (otherwise, each patch is trivially representable by a 1-sparse vector by including a rescaled copy of that patch as a column of $D$).
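For concreteness, the objective (4) can be evaluated in a few lines of NumPy. This is a minimal sketch assuming the measurement operator `F` and patch extractor `E` are supplied as callables; `dolphin_objective` is a hypothetical helper name, not part of the original algorithm description:

```python
import numpy as np

def dolphin_objective(Y, X, D, A, F, E, mu, lam):
    """Value of the DOLPHIn objective (4): data fit + patch fit + l1 sparsity."""
    data_fit = 0.25 * np.linalg.norm(Y - np.abs(F(X))**2, 'fro')**2
    patch_fit = 0.5 * mu * np.linalg.norm(E(X) - D @ A, 'fro')**2
    sparsity = lam * np.abs(A).sum()
    return data_fit + patch_fit + sparsity
```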

The model (4) could also easily be modified to include further side constraints, a different type of nonlinear measurements, or multiple images or measurements, respectively; we omit these extensions for simplicity.

## III Algorithmic Framework

Similar to classical dictionary learning [19, 22, 21, 31] and phase retrieval, problem (4) is nonconvex and difficult to solve. Therefore, we adopt an algorithm that provides monotonic decrease of the objective while converging to a stationary point (see Section III-D below).

The algorithmic framework we employ is that of alternating minimization: for each variable $A$, $X$, and $D$ in turn, we take one step towards solving (4) with respect to this variable alone, keeping the other ones fixed. Each of these subproblems is convex in the remaining unfixed optimization variable, and well-known efficient algorithms can be employed accordingly. We summarize our method in Algorithm 1, where the superscript $^*$ denotes the adjoint operator (for a matrix $M$, $M^*$ is thus the conjugate transpose), $\Re(\cdot)$ extracts the real part of a complex-valued argument, and $\odot$ denotes the Hadamard (element-wise) product of two matrices. The algorithm also involves the classical soft-thresholding operator $\mathcal{S}_\tau(z) = \max\{0, |z| - \tau\}\cdot\mathrm{sign}(z)$ and the Euclidean projection onto $\mathcal{X} = [0,1]^{N_1 \times N_2}$; here, all operations are meant component-wise.

To avoid training the dictionary on potentially useless early estimates, the algorithm performs two phases: while the iteration counter $\ell$ is smaller than a given threshold, the dictionary is not updated. Below, we explain the algorithmic steps in more detail.

Note that DOLPHIn actually produces two distinct reconstructions of the desired image, namely $X$ (the per se “image variable”) and $\mathcal{R}(DA)$ (the image assembled from the sparsely coded patches). (Technically, $\mathcal{R}(DA)$ might contain entries not in $[0,1]$, so one should project it onto $\mathcal{X}$ once more. Throughout, we often omit this step for simplicity; differences, if any, between the two variants were insignificant in all our tests.) Our numerical experiments in Section IV show that in many cases, $\mathcal{R}(DA)$ is in fact slightly or even significantly better than $X$ with respect to at least one quantitative quality measure and is therefore also considered a possible reconstruction output of Algorithm 1 (at least in the noisy setups we consider in this paper). Nevertheless, $X$ is sometimes more visually appealing and can be used, for instance, to refine parameter settings (if it differs strongly from $\mathcal{R}(DA)$) or to assess the influence of the patch-based “regularization” on the pure non-sparse Wirtinger Flow method corresponding to the formulation in which $\mu$ and $\lambda$ are set to zero.

### III-A Updating the Patch Representation Vectors

Updating $A$ (i.e., considering (4) with $X$ and $D$ fixed at their current values) consists of decreasing the objective

$$\sum_{i=1}^{p}\left(\frac{1}{2}\big\|D^{(\ell)} a_i - x_i^{(\ell)}\big\|_2^2 + \frac{\lambda}{\mu}\,\|a_i\|_1\right), \qquad (5)$$

which is separable in the patches $i = 1,\dots,p$. Therefore, we can update all vectors $a_i$ independently and/or in parallel. To do so, we choose to perform one step of the well-known algorithm ISTA (see, e.g., [24]), which is a gradient-based method able to take into account a non-smooth regularizer such as the $\ell_1$-norm. Concretely, the following update is performed for each $i$:

$$a_i^{(\ell+1)} = \mathcal{S}_{\gamma_A \lambda/\mu}\Big(a_i^{(\ell)} - \gamma_A\,(D^{(\ell)})^{\top}\big(D^{(\ell)} a_i^{(\ell)} - x_i^{(\ell)}\big)\Big). \qquad (6)$$

This update involves a gradient descent step (the gradient with respect to $a_i$ of the smooth term in each summand of (5) is $(D^{(\ell)})^{\top}(D^{(\ell)} a_i - x_i^{(\ell)})$, respectively) followed by soft-thresholding. Constructing $A^{(\ell+1)}$ from the $a_i^{(\ell+1)}$ as specified above is equivalent to Step 2 of Algorithm 1.

The step size parameter $\gamma_A$ can be chosen in $(0, 1/L_A]$, where $L_A$ is an upper bound on the Lipschitz constant of the gradient; here, $L_A = \|D^{(\ell)}\|_2^2$ (the squared spectral norm) would be appropriate, but a less computationally demanding strategy is to use a backtracking scheme to automatically update $\gamma_A$ [24].

A technical subtlety is noteworthy in this regard: we can either find one $\gamma_A$ that works for the whole matrix-variable update problem—this is what is stated implicitly in Step 2—or we can find different values, say $\gamma_{A,i}$, for each column $a_i$, $i = 1,\dots,p$, of $A$ separately. Our implementation does the latter, since it employs a backtracking strategy for each column update independently.
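The patch-coefficient update above (a gradient step on the smooth part of (5) followed by soft-thresholding) can be sketched as follows; `soft_threshold` and `ista_step` are hypothetical names, and the step size is assumed to satisfy $\gamma \le 1/\|D\|_2^2$:

```python
import numpy as np

def soft_threshold(z, tau):
    """Component-wise soft-thresholding S_tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista_step(D, a, x, lam_over_mu, gamma):
    """One ISTA step on 0.5*||D a - x||_2^2 + (lam/mu)*||a||_1, cf. update (6)."""
    grad = D.T @ (D @ a - x)  # gradient of the smooth quadratic part
    return soft_threshold(a - gamma * grad, gamma * lam_over_mu)
```

In a full DOLPHIn iteration, this step would be applied to each column $a_i$ independently (and possibly in parallel), matching the per-column backtracking described above.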

### III-B Updating the Image Estimate

With $D$ and $A$ fixed, updating $X$ consists of decreasing the objective

$$\frac{1}{4}\big\|Y - |\mathcal{F}(X)|^2\big\|_F^2 + \frac{\mu}{2}\big\|\mathcal{E}(X) - DA\big\|_F^2 \quad \text{with} \quad X \in \mathcal{X} = [0,1]^{N_1 \times N_2}. \qquad (7)$$

This problem can be seen as a regularized version of the phase retrieval problem (with regularization parameter $\mu$) that encourages the patches of $X$ to be close to the sparse approximation $DA$ obtained during the previous (inner) iterations.

Our approach to decreasing the value of the objective (7) is a projected gradient descent step. In fact, for $\mu = 0$, this step reduces to the Wirtinger flow method [12], but with necessary modifications to take into account the constraints on $X$ (real-valuedness and variable bounds $[0,1]$).

The gradient of the data fidelity term $\varphi(X) := \frac{1}{4}\|Y - |\mathcal{F}(X)|^2\|_F^2$ with respect to $X$ can be computed as

$$\nabla\varphi(X) = \Re\Big(\mathcal{F}^*\big(\mathcal{F}(X) \odot (|\mathcal{F}(X)|^2 - Y)\big)\Big),$$

by using the chain rule. For the patch-fit term $\psi(X) := \frac{\mu}{2}\|\mathcal{E}(X) - DA\|_F^2$, the gradient is given by

$$\nabla\psi(X) = \mu\,\mathcal{E}^*\big(\mathcal{E}(X) - DA\big) = \mu\, R \odot \mathcal{R}\big(\mathcal{E}(X) - DA\big),$$

where $R$ is an $N_1 \times N_2$ matrix whose entries equal the number of patches the respective pixel is contained in. Note that if the whole image is divided into a complete set of nonoverlapping patches, $R$ will just be the all-ones matrix; otherwise, the element-wise multiplication with $R$ undoes the averaging of pixel values performed by $\mathcal{R}$ when assembling an image from overlapping patches.

Finally, the gradient w.r.t. $X$ of the objective in (7) is $\nabla\varphi(X) + \nabla\psi(X)$, and the update in Step 3 of Algorithm 1 is indeed seen to be a projected gradient descent step. Typically, a backtracking (line search) strategy is used for choosing the step size $\gamma_X$; see Theorem 2 in Section III-D for a selection rule that gives theoretical convergence, and Section IV-B for a heuristic alternative.
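Putting the two gradients together, the projected gradient step on (7) might look as follows in NumPy. This is a sketch under the assumption that $\mathcal{F}$, $\mathcal{F}^*$, $\mathcal{E}$, and $\mathcal{R}$ are passed in as callables, with `cnt` playing the role of the count matrix $R$; `x_update` is a hypothetical helper name:

```python
import numpy as np

def x_update(X, Y, DA, F, Fadj, E, R, cnt, mu, gamma):
    """Projected gradient step on (7): X <- P_[0,1](X - gamma*(grad_phi + grad_psi)).
    DA holds the current sparse patch approximations D @ A as columns."""
    FX = F(X)
    grad_phi = np.real(Fadj(FX * (np.abs(FX)**2 - Y)))  # Wirtinger-type gradient
    grad_psi = mu * cnt * R(E(X) - DA)                  # E* = cnt-weighted reassembly
    return np.clip(X - gamma * (grad_phi + grad_psi), 0.0, 1.0)
```

With `mu = 0`, identity patch operators, and no box constraint, this reduces to a plain Wirtinger-flow-style gradient step.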

### III-C Updating the Dictionary

To update the dictionary, i.e., to approximately solve (4) w.r.t. $D$ alone, keeping $X$ and $A$ fixed at their current values, we employ one pass of a block-coordinate descent (BCD) algorithm on the columns of the dictionary [23]. The objective to decrease may be written as

$$\frac{1}{2}\sum_{i=1}^{p}\big\|D\, a_i^{(\ell+1)} - x_i^{(\ell+1)}\big\|_2^2 \quad \text{s.t.} \quad D \in \mathcal{D}, \qquad (8)$$

and the update rule given by Steps 4–13 corresponds to one iteration of [21, Algorithm 11] applied to (8). (In [21, Algo. 11], and in our implementation, we simply normalize the columns of $D$; it is easily seen that any solution with $\|d_j\|_2 < 1$ for some $j$ is suboptimal w.r.t. (4), since rescaling that column to unit norm allows one to reduce the corresponding coefficients in $A$ and thus to improve the $\ell_1$-term of the DOLPHIn objective (4). However, using the projection is more convenient for proving the convergence results without adding more technical subtleties w.r.t. this aspect.)

To see this, note that each column update problem has a closed-form solution:

$$(d_j)^{(\ell+1)} = \mathcal{P}_{\|\cdot\|_2 \le 1}\left(\frac{1}{\sum_{i=1}^{p}(a_{ij})^2}\,\sum_{i=1}^{p} a_{ij}\Big(x_i - \sum_{\substack{k=1\\ k \neq j}}^{n} a_{ik}\, d_k\Big)\right) = \frac{1}{\max\big\{1,\, \big\|\tfrac{1}{w_j}\, q_j\big\|_2\big\}}\Big(\tfrac{1}{w_j}\, q_j\Big)$$

with $w_j := \sum_{i=1}^{p}(a_{ij})^2$ and $q_j := \sum_{i=1}^{p} a_{ij}\big(x_i - \sum_{k \neq j} a_{ik}\, d_k\big)$; here, we abbreviated $a_i = a_i^{(\ell+1)}$ and $x_i = x_i^{(\ell+1)}$. If $w_j = 0$, and thus $a_{ij} = 0$ for all $i$, then column $d_j$ is not used in any of the current patch representations; in that case, the column’s update problem has a constant objective and is therefore solved by any $d_j$ with $\|d_j\|_2 \le 1$, e.g., a normalized random vector as in Step 12 of Algorithm 1. The computations performed in Steps 8–13 of Algorithm 1 are equivalent to these solutions, expressed differently using the matrices defined there. Note that the operations could be parallelized to speed up computation.
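One pass of the column-wise BCD dictionary update can be sketched as follows; `bcd_dictionary_pass` is a hypothetical helper (a sketch of one iteration in the spirit of [21, Algorithm 11], not the reference implementation), and unused atoms are re-randomized as in Step 12:

```python
import numpy as np

def bcd_dictionary_pass(D, A, Xp, rng=None):
    """One block-coordinate pass over the columns of D for 0.5*||D A - Xp||_F^2,
    with each column projected onto the unit Euclidean ball.
    Xp holds the patches x_i as columns."""
    rng = rng or np.random.default_rng(0)
    B = Xp @ A.T   # s x n: correlations between patches and codes
    C = A @ A.T    # n x n: Gram matrix of the codes (C[j, j] = w_j)
    for j in range(D.shape[1]):
        if C[j, j] == 0:
            # atom j unused: any feasible vector is optimal, pick a random unit one
            d = rng.standard_normal(D.shape[0])
            D[:, j] = d / np.linalg.norm(d)
            continue
        # closed-form minimizer for column j, then projection onto the unit ball
        u = D[:, j] + (B[:, j] - D @ C[:, j]) / C[j, j]
        D[:, j] = u / max(1.0, np.linalg.norm(u))
    return D
```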

### III-D Convergence of the Algorithm

As mentioned at the beginning of this section, with appropriate step size choices, DOLPHIn (Algorithm 1) monotonically decreases the objective function value (4) at each iteration. In particular, many line-search type step size selection mechanisms aim precisely at reducing the objective; for simplicity, we will refer to such subroutines as “suitable backtracking schemes” below. Concrete examples are the ISTA backtracking from [24, Section 3], which we can employ in the update of $A$, or the rule given in Theorem 2 for the projected gradient descent update of $X$ (a different choice is described in Section IV-B); further variants are discussed, e.g., in [32].

###### Proposition 1

Let $(A^{(\ell+1)}, X^{(\ell+1)}, D^{(\ell+1)})$ be the current iterates (after the $\ell$-th iteration) of Algorithm 1 with step sizes $\gamma_A^{(\ell)}$ and $\gamma_X^{(\ell)}$ determined by suitable backtracking schemes (or arbitrary, resp.), and let $f^{(\ell)}$ denote the objective function value of the DOLPHIn model (4) at the iterates after the $\ell$-th iteration. Then, DOLPHIn either terminates in the $(\ell+1)$-th iteration, or it holds that $f^{(\ell+1)} \le f^{(\ell)}$.

###### Proof:

Since we use ISTA to update $A$, it follows from [24] that the $A$-update does not increase the objective. Similarly, a suitable backtracking strategy is known to enforce descent in the projected gradient method when the gradient is locally Lipschitz-continuous, whence the $X$-update does not increase the objective either. Finally, the same holds for the $D$-update by standard results for BCD methods applied to convex problems, see, e.g., [33]. Combining these inequalities proves the claim.

The case of termination in Proposition 1 can occur when the backtracking scheme is combined with a maximal number of trial steps, which is often used as a safeguard against numerical stability problems or as a heuristic stopping condition to terminate the algorithm if no (sufficient) improvement can be reached even with tiny step sizes. Note also that the assertion of Proposition 1 trivially holds true if all step sizes are zero; naturally, a true descent of the objective requires a strictly positive step size in at least one update step. In our algorithm, step size positivity can always be guaranteed, since all these updates involve objective functions whose gradient is Lipschitz continuous, and backtracking essentially finds step sizes inversely proportional to (local) Lipschitz constants. (Due to non-expansiveness, the projection in the $X$-update poses no problem either.)

Adopting a specific Armijo-type step size selection rule for the $X$-update allows us to infer a convergence result, stated in Theorem 2 below. To simplify the presentation, let $\varphi$ and $\psi$ denote the two smooth terms of the objective (7) (cf. Section III-B).

###### Theorem 2

Let $\bar{\gamma} > 0$ and $\eta \in (0,1)$. Consider the DOLPHIn variant consisting of Algorithm 1 with the following Armijo rule to be used in Step 3:

Determine $\gamma_X^{(\ell)}$ as the largest number in $\{\bar{\gamma}\eta^t\}_{t=0,1,2,\dots}$ such that the projected gradient step achieves a sufficient decrease of the objective (7), and set $X^{(\ell+1)}$ to the resulting iterate.

If all step sizes are bounded away from zero, i.e., for some $\underline{\gamma} > 0$ it holds that $\gamma_A^{(\ell)} \ge \underline{\gamma}$ and $\gamma_X^{(\ell)} \ge \underline{\gamma}$ for all $\ell$, then every accumulation point of the sequence of DOLPHIn iterates is a stationary point of problem (4).

###### Proof:

The proof works by expressing Algorithm 1 as a specific instantiation of the coordinate gradient descent (CGD) method from [34] and analyzing the objective descent achievable in each update of , and , respectively. The technical details make the rigorous formal proof somewhat lengthy; therefore, we only sketch it here and defer the full proof to Appendix B.

The CGD method works by solving subproblems to obtain directions of improvement for blocks of variables at a time—in our case, (the columns of) , the matrix , and the columns of correspond to such blocks—and then taking steps along these directions. More specifically, the directions are generated using a (suitably parameterized) strictly convex quadratic approximation of the objective (built using the gradient). Essentially due to the strict convexity, it is then always possible to make a positive step along such a direction that decreases the (original) objective, unless stationarity already holds. Using a certain Armijo line-search rule designed to find such positive step sizes which achieve a sufficient objective reduction, [34, Theorem 1] ensures (under mild further assumptions, which in our case essentially translate to the stated boundedness requirement of the step sizes) that every accumulation point of the iterate sequence is indeed a stationary point of the addressed (block-separable) problem.

To embed DOLPHIn into the CGD framework, we can interpret the difference between one iterate and the next (w.r.t. the variable “block” under consideration) as the improvement direction, and proceed to show that we can always choose a step size equal to 1 in the Armijo-criterion from [34] (cf. (9) and (46) therein). For this to work out, we need to impose slightly stricter conditions on other parameters used to define that rule than what is needed in [34]; these conditions are derived directly from known descent properties of the - and -updates of our method (essentially, ISTA descent properties as in [24]). That way, the - and -updates automatically satisfy the specific CGD Armijo rule, and the actual backtracking scheme for the -update given in the present theorem can be shown to assert that our -update does so as well. (The step sizes used in DOLPHIn could also be reinterpreted in the CGD framework as scaling factors of diagonal Hessian approximations of the combined objective to be used in the direction-finding subproblems. With such simple Hessian approximations, the obtained directions are then indeed equivalent to the iterate-differences resulting from the DOLPHIn update schemes.) The claim then follows directly from [34, Theorem 1(e) (and its extensions discussed in Section 8)].

A more formal explanation of why the step sizes can be chosen positive in each step can be found on page 392 of [34]; the boundedness of the approximate Hessians is stated in [34, Assumption 1]. Arguably, assuming step sizes bounded away from zero by a constant may be problematic in theory (imagine an Armijo-generated step size sequence converging to zero), but it poses no problem in practice, where one always faces the limitations of numerical machine precision. (Note also that, in practice, the number of line-search trials can be effectively reduced by choosing the initial trial step size based on the previous one [34].)

Our implementation uses a different backtracking scheme for the $X$-update (see Section IV-B), which can be viewed as a cheaper heuristic alternative to the stated Armijo rule that still ensures monotonic objective descent (and hence is “suitable” in the context of Proposition 1), also enables strictly positive steps, and empirically performs equally well. Finally, we remark that the boundedness condition in Theorem 2 can be dropped if the relevant objective parts of problem (4) are not rescaled for the $A$- and $D$-updates, respectively.

To conclude the discussion of convergence, we point out that one can obtain a linear rate of convergence for DOLPHIn with the Armijo rule from Theorem 2, by extending the results of [34, Theorem 2 (cf. Section 8)].

## IV Numerical Results

In this section, we discuss various numerical experiments that study the effectiveness of the DOLPHIn algorithm. To that end, we consider several types of linear operators within our model (4) (namely, different types of Gaussian random operators and coded diffraction models). Details on the experimental setup and our implementation are given in the first two subsections, before we present the main numerical results in Subsection IV-C. Our experiments demonstrate that with noisy measurements, DOLPHIn gives significantly better image reconstructions than the Wirtinger Flow method [12], a recent state-of-the-art phase retrieval algorithm, thereby showing that introducing sparsity via a (simultaneously) learned dictionary is indeed a promising new approach for signal reconstruction from noisy, phaseless, nonlinear measurements. Furthermore, we discuss the sensitivity of DOLPHIn with regard to various (hyper-)parameter choices (Subsections IV-D, IV-E and IV-F) and a variant in which the $\ell_1$-regularization term in the objective is replaced by explicit constraints on the sparsity of the patch-representation coefficient vectors (Subsection IV-G).

### IV-A Experimental Setup

We consider several linear operators corresponding to different types of measurements that are classical in the phase retrieval literature. We denote by $\mathcal{F}$ the (normalized) 2D-Fourier operator (implemented using fast Fourier transforms), and introduce two complex Gaussian matrices whose entries are i.i.d. complex Gaussian samples. Then, we experiment with several operators composed of these matrices and the Fourier transform, as well as the coded diffraction pattern model

$$F(X)=\begin{pmatrix}\mathcal{F}(\overline{M}_1\odot X)\\ \vdots\\ \mathcal{F}(\overline{M}_m\odot X)\end{pmatrix},\qquad F^*(Z)=\sum_{j=1}^{m} M_j\odot\mathcal{F}^*(Z_j),\tag{9}$$

where the $M_j$'s are admissible coded diffraction patterns (CDPs); see, for instance, [12, Section 4.1]. In our experiments we used ternary CDPs, so that each mask entry takes one of three values. (Later, we will also consider octanary CDPs, whose entries take one of eight complex values.)
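As an illustration, the CDP operator from (9) and its adjoint can be sketched in a few lines. This is a hedged NumPy sketch (the paper's actual implementation is in Matlab); `cdp_forward` and `cdp_adjoint` are our own illustrative names, and the orthonormal FFT plays the role of the normalized 2D-Fourier operator:

```python
import numpy as np

def cdp_forward(X, masks):
    """Forward CDP operator from Eq. (9): the j-th block is the normalized
    2D FFT of the image modulated by the conjugate mask, F(conj(M_j) * X)."""
    return np.stack([np.fft.fft2(np.conj(M) * X, norm="ortho") for M in masks])

def cdp_adjoint(Z, masks):
    """Adjoint operator F*(Z) = sum_j M_j * F^{-1}(Z_j), cf. Eq. (9)."""
    return sum(M * np.fft.ifft2(Zj, norm="ortho") for M, Zj in zip(masks, Z))
```

A useful sanity check for any such pair is the adjointness identity ⟨F(X), Z⟩ = ⟨X, F*(Z)⟩, which holds here because the orthonormal FFT is unitary.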

To reconstruct the image, we choose an oversampling setting for the Gaussian-type and CDP operators, respectively. Moreover, we corrupt our measurements with additive white Gaussian noise at fixed SNR levels (in dB) for the Gaussian-type and CDP measurements, respectively. Note that these settings yield, in particular, a relatively heavy noise level for the Gaussian cases and a relatively low oversampling ratio for the CDPs.
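To make the noise model concrete: corrupting measurements at a prescribed SNR (in dB) amounts to scaling a white Gaussian noise vector so that the ratio of signal to noise energy matches the target. A small NumPy sketch (`add_awgn` is an illustrative name, not from the paper's code):

```python
import numpy as np

def add_awgn(y, snr_db, rng=None):
    """Corrupt measurements y with white Gaussian noise scaled so that
    10*log10(||y||^2 / ||n||^2) equals snr_db exactly."""
    rng = np.random.default_rng(rng)
    n = rng.standard_normal(y.shape)
    # rescale the noise to hit the desired signal-to-noise ratio
    n *= np.linalg.norm(y) / (np.linalg.norm(n) * 10 ** (snr_db / 20))
    return y + n
```

With this exact rescaling the realized SNR matches the target deterministically; drawing the noise at a fixed variance instead would only match it in expectation.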

### IV-B Implementation Details

We choose to initialize our algorithm with a simple random image in to demonstrate the robustness of our approach with respect to its initialization. Nevertheless, other choices are possible. For instance, one may also initialize with a power-method scheme similar to that proposed in [12], modified to account for the real-valuedness and box-constraints. The dictionary is initialized as in , where corresponds to the two-dimensional discrete cosine transform (see, e.g., [20]).
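A common way to build such a DCT-based initial dictionary (cf. [20]) is to form a 1D DCT frame and take its Kronecker product with itself, so that each column is a vectorized separable 2D-DCT atom. A sketch under the assumption of square patches; the function name and the exact frame construction are ours:

```python
import numpy as np

def overcomplete_dct_dictionary(patch_size, n_atoms_1d):
    """Overcomplete 2D-DCT dictionary: (patch_size^2) x (n_atoms_1d^2),
    with unit-norm columns. For n_atoms_1d > patch_size it is overcomplete."""
    t = np.arange(patch_size)
    D1 = np.cos(np.pi * np.outer(t + 0.5, np.arange(n_atoms_1d)) / n_atoms_1d)
    D1[:, 1:] -= D1[:, 1:].mean(axis=0)   # zero-mean all atoms except DC
    D1 /= np.linalg.norm(D1, axis=0)      # unit-norm columns
    return np.kron(D1, D1)                # separable 2D atoms
```

Since the Kronecker product of unit vectors is again a unit vector, the resulting 2D dictionary inherits unit-norm columns from the 1D frame.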

To update the sparse coefficients, we use the ISTA implementation from the SPAMS package [23] with its integrated backtracking line search. Regarding the step sizes for the update of the image estimate (Step 3 of Algorithm 1), we adopt the following simple strategy, which is similar to that from [24] and may be viewed as a heuristic alternative to the Armijo rule from Theorem 2: Whenever the gradient step leads to a reduction in the objective function value, we accept it. Otherwise, we recompute the step with the step size halved until a reduction is achieved; here, as a safeguard against numerical issues, we implemented a limit of 100 trials (forcing termination in case all were unsuccessful), but this limit was never reached in any of our computational experiments. Regardless of whether the step size was reduced or not, we reset it to its initial value for the next round; that initial value is derived from the objective function of the DOLPHIn model (4), evaluated at the initial iterates and least-squares patch representations. (Note that, while this rule deviates from the theoretical convergence Theorem 2, Proposition 1 and the remarks following it remain applicable.)
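The described step-size heuristic for the image update can be sketched as follows (a NumPy sketch with hypothetical names; the box constraint is assumed to be [0, 1], as for normalized grayscale images):

```python
import numpy as np

def halving_pg_step(X, grad, step0, objective, max_trials=100):
    """Projected gradient step with the simple halving heuristic: accept the
    step if the objective decreases, otherwise halve the step size, up to
    max_trials times as a numerical safeguard. The step size is not carried
    over; each call starts again from step0."""
    f0 = objective(X)
    step = step0
    for _ in range(max_trials):
        X_new = np.clip(X - step * grad, 0.0, 1.0)  # projection onto [0,1]
        if objective(X_new) < f0:
            return X_new
        step *= 0.5
    return X  # all trials failed; keep the current iterate
```

Compared with a full Armijo rule, this accepts any strict descent rather than requiring a sufficient-decrease fraction, which keeps the per-iteration cost low while still guaranteeing monotone objective descent.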

Finally, we consider nonoverlapping patches and run DOLPHIn (Algorithm 1) with fixed iteration limits; the regularization/penalty parameter values can be read from Table I. We remark that these parameter values were empirically benchmarked to work well for the measurement setups and instances considered here; a discussion of the stability of our approach with respect to these parameter choices is presented below in Section IV-D. Further experiments with a sparsity-constrained DOLPHIn variant and with overlapping patches are discussed in Section IV-G.
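For nonoverlapping patches, extraction and exact reassembly are simple reshapes. A sketch assuming square patches whose side length divides both image dimensions (function names are ours):

```python
import numpy as np

def image_to_patches(X, s):
    """Split X (N1 x N2, both divisible by s) into nonoverlapping s x s
    patches, returned as the columns of an (s*s) x p matrix."""
    n1, n2 = X.shape
    P = X.reshape(n1 // s, s, n2 // s, s).transpose(0, 2, 1, 3)
    return P.reshape(-1, s * s).T

def patches_to_image(P, shape, s):
    """Inverse of image_to_patches (exact, since patches do not overlap)."""
    n1, n2 = shape
    Q = P.T.reshape(n1 // s, n2 // s, s, s).transpose(0, 2, 1, 3)
    return Q.reshape(n1, n2)
```

With overlapping patches (Section IV-G), reassembly would instead average the overlapping contributions per pixel; the nonoverlapping case used here keeps the patch-to-image map a bijection.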

Our DOLPHIn code is available online on the first author’s webpage.

### IV-C Computational Experiments

We test our method on a collection of typical (grayscale) test images used in the literature, see Figure 1. All experiments were carried out on a Linux -bit quad-core machine ( GHz,  GB RAM) running Matlab R2016a (single-thread).

We evaluate our approach with the following question in mind: Can we improve upon the quality of reconstruction compared to standard phase retrieval algorithms? Standard methods cannot exploit sparsity if the underlying basis or dictionary is unknown; as we will see, the introduced (patch-) sparsity indeed allows for better recovery results (at least in the oversampling and noise regimes considered here).

To evaluate the achievable sparsity, we look at the average number of nonzeros in the columns of the coefficient matrix after running our algorithm. Generally, smaller values indicate an improved suitability of the learned dictionary for sparse patch coding (high values often occur if the regularization parameter is too small and the dictionary is learning the noise, which is something we would like to avoid). To assess the quality of the image reconstructions, we consider two standard measures, namely the peak signal-to-noise ratio (PSNR) of a reconstruction as well as its structural similarity index (SSIM) [35]. For PSNR, larger values are better; SSIM values always range between 0 and 1, and values closer to 1 indicate better visual quality.
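For reference, PSNR for images scaled to [0, 1] is 10·log10(peak²/MSE); a minimal sketch (the function name is ours):

```python
import numpy as np

def psnr(x_hat, x_true, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((np.asarray(x_hat, float) - np.asarray(x_true, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

SSIM is more involved (local means, variances, and covariances over sliding windows, cf. [35]) and is typically taken from an existing implementation rather than re-derived.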

Table I displays the CPU times, PSNR and SSIM values, and mean patch representation vector sparsity levels obtained for the various measurement types, averaged over the instance groups of the same size. The concrete examples in Figures 2 and 3 show the results from DOLPHIn and plain Wirtinger Flow (WF; the real-valued, box-constrained variant, which corresponds to running Algorithm 1 with the patch-based objective terms disabled and the dictionary and coefficient updates omitted). In all tests, we let the Wirtinger Flow method run for the same number of iterations (75) and use the same starting points as for the DOLPHIn runs. Note that instead of a random initial image, we could also use a spectral initialization similar to the one proposed for the (unconstrained) Wirtinger Flow algorithm, see [12]. Such an initialization can improve WF reconstruction (at least in the noiseless case), and may also provide better initial estimates for DOLPHIn. We have experimented with such a spectral approach and found the results comparable to what is achievable with random initialization, both for WF and DOLPHIn; therefore, we do not report these experiments in the paper.

The DOLPHIn method consistently provides better image reconstructions than WF, which clearly shows that our approach successfully introduces sparsity into the phase retrieval problem and exploits it for estimating the solution. As can be seen from Table I, the obtained dictionaries allow for rather sparse representation vectors, with the effect of making better use of the information provided by the measurements and also denoising the image along the way. The latter fact can be seen in the examples (Fig. 2 and 3, see also Fig. 4) and also inferred from the significantly higher PSNR and SSIM values for the estimates obtained from DOLPHIn compared to the reconstruction of the WF algorithm (which cannot make use of hidden sparsity). The gain in reconstruction quality is more visible in the example of Fig. 3 (cf. Fig. 4) than in that of Fig. 2, though both cases exhibit higher quantitative measures. Furthermore, note that DOLPHIn naturally has higher running times than WF, since it performs more work per iteration (also, different types of measurement operators require different amounts of time to evaluate). Note also that storing the dictionary and the sparse coefficient vectors instead of an actual image (such as the WF reconstruction) requires saving only about half as many numbers (including integer index pairs for the nonzero coefficient entries).

As indicated earlier, one of the two reconstructions is quite often better than the other w.r.t. at least one of the PSNR or SSIM values. Nonetheless, one estimate may be visually more appealing than the other even if the latter exhibits a higher quantitative quality measure (as is the case, for instance, in the example of Figures 3 and 4). Furthermore, one estimate occasionally achieves notably better (quantitative) measures than the other; an intuitive explanation may be that while the sparse coding of patches serves well to eliminate the noise and, by means of the patch-fit objective term, to successfully "push" the image-update steps toward a solution of good quality, once that solution becomes good enough, the fact that the patch-based estimate is designed to be only an approximation predominates.

On the other hand, the two estimates are sometimes very close to each other, which indicates a suboptimal setting of the parameters that control how much "feedback" the patch-fitting objective term introduces into the Wirtinger-Flow-like image update in the DOLPHIn algorithm. We discuss parameter choices in more detail in the following subsection.

### IV-D Hyperparameter Choices and Sensitivity

The DOLPHIn algorithm requires several parameters to be specified a priori. Most can be referred to as design parameters; the most prominent ones are the size of the image patches, whether or not patches should overlap, and the number of dictionary atoms to learn. Furthermore, there are certain algorithmic parameters (in a broad sense) that need to be fixed, e.g., the iteration limits or the initial dictionary and image estimate. The arguably most important parameters, however, are the model or regularization parameters. For any fixed combination of design and algorithmic parameters in a certain measurement setup (fixed measurement type/model and (assumed) noise level), it is conceivable that one can find regularization parameter values that work well for most instances, while the converse (choosing, say, iteration limits for fixed regularization and other parameters) is clearly not a very practical approach.

As is common for regularization parameters, a good "general-purpose" way to choose them a priori is unfortunately not known. To obtain the specific choices used in our experiments, we fixed all the other parameters (including noise SNR and oversampling ratios), then (for each measurement model) ran preliminary tests to identify values of one parameter for which good results could be produced, fixed it at such values, and finally ran extensive benchmark tests to find values of the other parameter that give the best results.

For DOLPHIn, one of these parameters offers some control over how much "feedback" from the current sparse approximation of the current image estimate is taken into account in the update step that produces the next image iterate: overly large values hinder the progress made by the Wirtinger-Flow-like part of the image update, while too small values marginalize the influence of the sparse approximation, with one consequence being that the automatic denoising feature is essentially lost. Nevertheless, in our experiments we found that DOLPHIn is not strongly sensitive to the specific choice of this parameter once a regime has been identified in which meaningful results can be produced. Hence, the sparsity regularization parameter may be considered the most important one; this intuitively makes sense, as it controls how strongly sparsity of the patch representations is actually promoted, and exploiting that sparsity to obtain improved reconstructions is the driving purpose behind our development of DOLPHIn.

Figure 5 illustrates the sensitivity of DOLPHIn with respect to the sparsity regularization parameter, in terms of reconstruction quality and achievable patch-sparsity, for different noise levels and exemplary measurement types and problem sizes. (In this figure, image quality is measured by SSIM values alone; the plots using PSNR instead look very similar and were thus omitted. Note, however, that the parameters yielding the best SSIM and PSNR values, respectively, need not be the same.) As shown by (a) and (d), there is a clear correlation between the best reconstruction quality that can be achieved (in noisy settings) and the average sparsity of the patch representation vectors. For larger noise, a larger regularization parameter is clearly needed to achieve good results, see (b) and (e), which shows that a stronger promotion of patch-sparsity is an adequate response to increased noise, as is known for linear sparsity-based denoising as well. Similarly, increasing the number of measurements allows one to pick a smaller parameter value whatever the noise level actually is, as can be seen in (c) and (f), respectively. The dependence of the best parameter value on the noise level appears to follow an exponential curve (w.r.t. the reciprocal SNR) which is "dampened" by the sampling ratio, i.e., it becomes less steep and pronounced the more measurements are available, cf. (b) and (e). Indeed, again referring to subplots (c) and (f), at a fixed noise level the best values seem to decrease exponentially with a growing number of measurements. It remains a subject of future research to investigate these dependencies in more detail, e.g., to come up with more or less general (functional) rules for choosing this parameter.

### IV-E Impact of Increased Inner Iteration Counts

It is worth considering whether more inner iterations, i.e., consecutive update steps for the different variable blocks, lead to further improvements of the results and/or faster convergence. In general, this is an open question for block-coordinate descent algorithms, so the choices are typically made empirically. Our default choices of a single ISTA iteration for the coefficient update (Step 2 in Algorithm 1), a single projected gradient descent step for the image update (Step 3), and a single iteration of the BCD scheme for the dictionary update (Steps 4-13) primarily reflect the desire to keep the overall iteration cost low. To assess whether another choice might yield significant improvements, we evaluated the DOLPHIn performance for various combinations of inner iteration counts, keeping all other parameters equal to the settings from the experiments reported on above. (We also tried these combinations together with an increased iteration count for the image update, but already for small counts the results were far worse than with just a single projected gradient descent step; the reason can likely be found in the fact that without adapting the dictionary and coefficients to a modified image iterate, the patch-fitting term of the objective tries to keep the image close to a then-unsuitable estimate based on older iterates, which apparently has a quite notable negative influence on the achievable progress in the image update loop.)

The results are summarized in condensed format in Table II, from which we can read off the spread of the best and worst results (among the best ones achievable with either reconstruction) for each measurement-instance combination over all tested combinations of inner iteration counts. (Full tables for each test run can be found alongside our DOLPHIn code on the first author's webpage.) As the table shows, the results are all quite close; while some settings lead to sparser patch representations, the overall quality of the best reconstructions for the various combinations usually differs only slightly, and no particular combination stands out as clearly better than all others. In particular, comparing the results with those in Table I, we find that our default settings provide consistently good results; they may be improved upon with some other combination of inner iteration counts, but at least with the same total iteration horizon, the improvements are often only marginal. Since the overall runtime reductions (if any) obtained with other choices are also very small, there does not seem to be a clear advantage to using more than a single iteration for either update problem.

### IV-F Influence of the First DOLPHIn Phase

Our algorithm keeps the dictionary fixed at its initialization for the first iterations in order to prevent the dictionary from training on relatively useless early reconstruction iterates. Indeed, if all variables including the dictionary are updated right from the beginning, then we end up with inferior results (keeping all other parameters unchanged): the obtained patch representations are much less sparse, the quality of the image estimate decreases drastically, and the quality of the reconstruction also becomes notably worse. This demonstrates that the dictionary apparently "learns too much noise" when updated from the beginning, and the positive effect of filtering out much of the noise in the first phase, by regularizing with sparsely coded patches over a fixed initial dictionary, is almost completely lost. To give an example, for the CDP test set, the mean patch sparsity and the PSNR and SSIM values of both reconstructions obtained by DOLPHIn when also updating the dictionary from the first iteration onward are all clearly inferior to those of default DOLPHIn (cf. Table I).

On the other hand, one might argue that if the first iterates are relatively worthless, the effort of updating the sparse coefficients in the first DOLPHIn phase could be saved as well. However, the influence of the objective terms involving the dictionary and coefficients should then be completely removed from the algorithm for the first iterations; otherwise, the patch-fitting term will certainly hinder progress made by the image update, because it then amounts to trying to keep the image close to the initial estimate, which obviously need not bear any resemblance to the sought solution at all. Thus, if both the dictionary and the coefficients are to be unused in the first phase, one should temporarily disable the corresponding objective terms, with the consequence that the first phase reduces to pure projected gradient descent for the image with respect to the phase retrieval objective, i.e., essentially, Wirtinger Flow. Therefore, proceeding like this simply amounts to a different initialization of the image estimate. Experiments with this DOLPHIn variant (initial WF iterations followed by full DOLPHIn iterations including the dictionary and coefficient updates; all other parameters again left unchanged) showed that the achievable patch-sparsities remain about the same for the Gaussian measurement types but become much worse for the CDP setup, and that the average PSNR and SSIM values become (often clearly) worse in virtually all cases. In one exemplary setup on the test images, the above-described variant runs faster on average (as less effort is spent in the first phase, the runtime is naturally lower than for default DOLPHIn) and produces slightly lower average patch-sparsity, but for both reconstructions the PSNR and SSIM values are notably worse. The reason for the observed behavior can be found in the inferior performance of WF without exploiting patch-sparsity (cf. also Table I); note that the results also further demonstrate DOLPHIn's robustness w.r.t. the initial point: apparently, an initial point obtained from some WF iterations is no more helpful for DOLPHIn than a random first guess.

Finally, it is also worth considering what happens if the dictionary updates are turned off completely. Then, DOLPHIn reduces to patch-sparsity regularized Wirtinger Flow, a WF variant that apparently has not been considered previously. Additional experiments with this variant (other parameters left unchanged) showed that it consistently produces higher sparsity (i.e., a smaller average number of nonzero entries) of the patch representation coefficient vectors, but that its best reconstruction is always significantly inferior to the best one produced by our default version of DOLPHIn. The first observation is explained by the fact that with the dictionary fixed throughout, the patch coding (coefficient update) only needs to adapt to new image iterates but not to a new dictionary; at least if the new image is not too different from the previous one, the former representation coefficient vectors still yield an acceptable approximation, which no longer holds true if the dictionary is also modified. While it should be mentioned that the patch-sparsity regularized version already performs much better than plain WF, the second observation clearly indicates the additional benefit of working with trained dictionaries, i.e., the superiority of DOLPHIn also over the regularized WF variant.
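For concreteness, with a fixed dictionary the patch coding subproblem is an ℓ1-regularized least-squares problem per patch, which ISTA solves by alternating a gradient step with elementwise soft-thresholding. A minimal sketch (illustrative only; the paper uses the SPAMS implementation with backtracking):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, p, lam, n_iter=100):
    """ISTA for min_a 0.5*||D a - p||^2 + lam*||a||_1 with fixed dictionary D
    and a single vectorized patch p, using the constant step size 1/L."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - p) / L, lam / L)
    return a
```

In the orthonormal special case (e.g., a complete DCT dictionary), ISTA converges in a single step to plain soft-thresholding of the analysis coefficients, which makes the role of the regularization weight as a sparsity threshold explicit.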

Full tables containing results for all test runs reported on in this subsection are again available online along with our DOLPHIn code on the first author's website.