## 1 Introduction

The task of recovering a signal from observations obtained by some acquisition process is common in many fields of science and engineering, and is referred to as an inverse problem. In image processing, inverse problems are often linear, in the sense that the observations can be formulated by a linear model

$$y = Ax + e, \qquad (1)$$

where $x \in \mathbb{R}^n$ represents the unknown original image, $y \in \mathbb{R}^m$ represents the observations, $A$ is an $m \times n$ measurement matrix, and $e \in \mathbb{R}^m$ is a noise vector. For example, this model corresponds to tasks like denoising

(Rudin et al., 1992), deblurring (Danielyan et al., 2012; Yang et al., 2010), and compressed sensing (Donoho, 2006; Candès and Wakin, 2008).

A common strategy for recovering $x$ is to solve an optimization problem that is composed of a fidelity term $\ell(\tilde{x})$, which enforces agreement with the observations $y$, and a prior term $s(\tilde{x})$ ($\tilde{x}$ is the optimization variable). Note that using a prior model is inevitable, since the inverse problems represented by (1) are typically ill-posed. The optimization problem is usually stated in a penalized form

$$\min_{\tilde{x}} \; \ell(\tilde{x}) + \beta s(\tilde{x}), \qquad (2)$$

or in a constrained form

$$\min_{\tilde{x}} \; \ell(\tilde{x}) \quad \mathrm{s.t.} \quad s(\tilde{x}) \le R, \qquad (3)$$

where $\beta$ and $R$ are positive scalars that control the level of regularization. While a vast amount of research has focused on designing good prior models, most works use the typical least squares (LS) fidelity term

$$\ell_{LS}(\tilde{x}) = \frac{1}{2}\|y - A\tilde{x}\|_2^2, \qquad (4)$$

where $\|\cdot\|_2$ stands for the Euclidean norm.

Recently, a new framework (Tirer and Giryes, 2018) has implicitly considered a different fidelity term, and has demonstrated excellent reconstruction results for popular tasks such as image deblurring (Tirer and Giryes, 2018) and super-resolution (Tirer and Giryes, 2019). Assuming that $m < n$ and $\mathrm{rank}(A) = m$ (which is the common case, e.g., in super-resolution and compressed sensing tasks), this fidelity term, which has been coined the "back-projection" (BP) term in (Tirer and Giryes, 2020), can be explicitly written as

$$\ell_{BP}(\tilde{x}) = \frac{1}{2}\left\|A^\dagger(y - A\tilde{x})\right\|_2^2, \qquad (5)$$

where $A^\dagger = A^\top(AA^\top)^{-1}$ is the pseudoinverse of $A$, or, equivalently (as can be seen by expanding the two quadratic forms), in a weighted least-squares fashion

$$\ell_{BP}(\tilde{x}) = \frac{1}{2}(y - A\tilde{x})^\top (AA^\top)^{-1} (y - A\tilde{x}). \qquad (6)$$
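As a quick sanity check, the equivalence of (5) and (6) can be verified numerically; the following sketch (our own, with arbitrary dimensions) is not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50                      # m < n, so A is full row-rank w.p. 1
A = rng.standard_normal((m, n))
x_t = rng.standard_normal(n)
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

A_pinv = np.linalg.pinv(A)         # A^+ = A^T (A A^T)^{-1} for full row-rank A
r = y - A @ x_t                    # residual y - A x~

bp_form5 = 0.5 * np.linalg.norm(A_pinv @ r) ** 2   # eq. (5)
bp_form6 = 0.5 * r @ np.linalg.inv(A @ A.T) @ r    # eq. (6)
assert np.isclose(bp_form5, bp_form6)
```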

The work in (Tirer and Giryes, 2020) has focused on examining and comparing the LS and BP terms from an estimation accuracy (MSE) point of view, and identified cases with advantages of the BP term over the LS term. Yet, the empirical results there also show that using the BP term, rather than the LS term, requires fewer iterations of plain and accelerated proximal gradient algorithms (e.g., ISTA and FISTA (Beck and Teboulle, 2009)) applied on the penalized formulation (2). This advantage implies reduced overall run-time if the operator $A^\dagger$ has a fast implementation (e.g., in image deblurring and super-resolution) or if the proximal operation dominates the computational cost of each iteration. We emphasize that this behavior has not been mathematically analyzed in (Tirer and Giryes, 2020).

Contribution. In this paper, we provide mathematical reasoning for the faster convergence of BP compared to LS, for both projected gradient descent (PGD) applied on the constrained form (3), and the more general proximal gradient method applied on the penalized form (2). Our analysis for PGD (Section 3), which is inspired by the analysis in (Oymak et al., 2017), requires very mild assumptions and allows us to identify sources for the different convergence rates. Our analysis for proximal methods (Section 4, and Appendix B) requires a relaxed contraction condition on the proximal mapping of the prior, and further highlights the advantage of BP when $AA^\top$ is badly conditioned. Numerical experiments (Section 5, and Appendix C) corroborate our theoretical results for PGD with both convex ($\ell_1$-norm) and non-convex (pre-trained DCGAN (Radford et al., 2015)) priors. For the $\ell_1$-norm prior, we also present experiments for proximal methods, and connect them with our analysis.

## 2 Preliminaries

Let us present notations and definitions that are used in the paper. We write $\|\cdot\|_2$ for the Euclidean norm of a vector, $\|\cdot\|$ for the spectral norm of a matrix, and $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ for the largest and smallest eigenvalues of a matrix, respectively. We denote the unit Euclidean ball and sphere in $\mathbb{R}^n$ by $\mathcal{B}^n$ and $\mathcal{S}^{n-1}$, respectively. We denote by $\mathcal{P}_{\mathcal{K}}$ the orthogonal projection onto a set $\mathcal{K}$. We denote by $I_n$ the identity matrix in $\mathbb{R}^{n \times n}$, and by $P_A := A^\dagger A$ and $Q_A := I_n - A^\dagger A$ the orthogonal projections onto the row space and the null space (respectively) of the full row-rank matrix $A$. Let us also define the descent set and its tangent cone as follows.

###### Definition 2.1.

The descent set of the function $s(\cdot)$ at a point $\tilde{x}$ is defined as

$$\mathcal{D}(s, \tilde{x}) := \left\{h \in \mathbb{R}^n : s(\tilde{x} + h) \le s(\tilde{x})\right\}. \qquad (7)$$

The tangent cone $\mathcal{C}(s, \tilde{x})$ at a point $\tilde{x}$ is the smallest closed cone satisfying $\mathcal{D}(s, \tilde{x}) \subseteq \mathcal{C}(s, \tilde{x})$.

In this paper, we focus on minimizing (3) using PGD, i.e., by applying iterations of the form

$$x_{k+1} = \mathcal{P}_{\mathcal{K}}\left(x_k - \mu \nabla\ell(x_k)\right), \qquad (8)$$

where $\nabla\ell(x_k)$ is the gradient of $\ell(\cdot)$ at $x_k$, $\mu$ is a step-size, and

$$\mathcal{K} := \left\{\tilde{x} : s(\tilde{x}) \le R\right\}, \qquad \mathcal{P}_{\mathcal{K}}(z) := \arg\min_{\tilde{x} \in \mathcal{K}} \|\tilde{x} - z\|_2. \qquad (9)$$

Note that

$$\nabla\ell_{LS}(\tilde{x}) = A^\top(A\tilde{x} - y), \qquad \nabla\ell_{BP}(\tilde{x}) = A^\dagger(A\tilde{x} - y). \qquad (10)$$

Therefore, we can examine a unified formulation of PGD for both objectives

$$x_{k+1} = \mathcal{P}_{\mathcal{K}}\left(x_k - \mu B^\top(Ax_k - y)\right), \qquad (11)$$

where $B^\top$ equals $A^\top$ or $A^\dagger$ for the LS and BP terms, respectively.
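The unified iteration (11) can be sketched in a few lines. The following is a minimal illustration only (the `pgd` helper and its arguments are ours, not the paper's; the projection is supplied by the caller):

```python
import numpy as np

def pgd(A, y, B_T, proj, mu, x0, iters):
    """Unified PGD iteration (11): x_{k+1} = P_K(x_k - mu * B^T (A x_k - y)).
    B_T = A.T gives the LS objective; B_T = np.linalg.pinv(A) gives BP."""
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - mu * B_T @ (A @ x - y))
    return x

# Unconstrained sanity check (proj = identity): the LS iterates become
# consistent with the measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 12))
y = A @ rng.standard_normal(12)
mu = 1.0 / np.linalg.eigvalsh(A.T @ A)[-1]     # 1 / lambda_max(A^T A)
x_ls = pgd(A, y, A.T, lambda z: z, mu, np.zeros(12), 500)
assert np.allclose(A @ x_ls, y)
```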

## 3 Comparing PGD Convergence Rates

The goal of this section is to provide mathematical reasoning for the observation (shown in Section 5) that using the BP term, rather than the LS term, requires fewer PGD iterations. We start in Section 3.1 with a warm-up example with a very restrictive prior that fixes the value of $\tilde{x}$ on the null space of $A$, which provides us with some intuition as to the advantage of BP. Then, in Sections 3.2 - 3.4 we build on the analysis technique in (Oymak et al., 2017) to show that the advantage of BP carries on to practical priors.

### 3.1 Warm-Up: Restrictive “Oracle” Prior

Let us define the following "oracle" prior, which fixes the value of $\tilde{x}$ on the null space of $A$ to that of the latent signal $x^*$ (in fact, the results in this warm-up only require that the prior fixes $\tilde{x}$ to a constant value on the null space of $A$; the value itself does not affect the convergence rates):

$$s_{oracle}(\tilde{x}) := \begin{cases} 0, & Q_A\tilde{x} = Q_A x^* \\ \infty, & \text{otherwise.} \end{cases} \qquad (12)$$

Applying the PGD update rule from (11) using this prior (note that the projection onto its feasible set is $z \mapsto P_A z + Q_A x^*$), we have

$$x_{k+1} = P_A\left(x_k - \mu B^\top(Ax_k - y)\right) + Q_A x^*. \qquad (13)$$

In the following, we specialize (13) for LS and BP with a step-size of 1 over the Lipschitz constant of $\nabla\ell$. This step-size is perhaps the most common choice of practitioners, as it ensures (sublinear) convergence of the sequence for general convex priors (Beck and Teboulle, 2009) (i.e., for a larger constant step-size, PGD and general proximal methods may "swing" and not converge). Here, due to the constant Hessian matrix $B^\top A$ of both the LS and BP terms, this step-size can be computed as $\mu = 1/\lambda_{\max}(B^\top A)$.

LS case. For the LS objective, we have $B^\top A = A^\top A$ and $\mu = 1/\lambda_{\max}(A^\top A)$. Therefore,

$$x_{k+1} = \left(P_A - \tfrac{1}{\lambda_{\max}(A^\top A)}A^\top A\right)x_k + \tfrac{1}{\lambda_{\max}(A^\top A)}A^\top y + Q_A x^*. \qquad (14)$$

Let $\hat{x}$ be the stationary point of the sequence $\{x_k\}$, i.e., $\hat{x} = \left(P_A - \tfrac{1}{\lambda_{\max}(A^\top A)}A^\top A\right)\hat{x} + \tfrac{1}{\lambda_{\max}(A^\top A)}A^\top y + Q_A x^*$. The convergence rate can be obtained as follows:

$$\|x_{k+1} - \hat{x}\|_2 = \left\|\left(P_A - \tfrac{1}{\lambda_{\max}(A^\top A)}A^\top A\right)(x_k - \hat{x})\right\|_2 \le \left(1 - \frac{\lambda_{\min}(AA^\top)}{\lambda_{\max}(AA^\top)}\right)\|x_k - \hat{x}\|_2. \qquad (15)$$

BP case. For the BP objective, we have $B^\top A = A^\dagger A = P_A$ and $\mu = 1/\lambda_{\max}(P_A) = 1$, where the last equality follows from the fact that $P_A$ is a non-trivial orthogonal projection. Substituting these terms in (13), we get

$$x_{k+1} = P_A x_k - P_A x_k + A^\dagger y + Q_A x^* = A^\dagger y + Q_A x^*. \qquad (16)$$

Note that while the use of the LS objective leads to a linear convergence rate of $1 - \lambda_{\min}(AA^\top)/\lambda_{\max}(AA^\top)$, using the BP objective requires only a single iteration. This result hints that an advantage of BP may exist even for practical priors $s(\tilde{x})$, which only implicitly impose some restrictions on the null space component of $\tilde{x}$.
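This dichotomy is easy to reproduce numerically. The following sketch is our own toy setup (a random Gaussian matrix and noiseless measurements, with the oracle projection of (12)-(13)), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 60
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
y = A @ x_star                       # noiseless measurements, for clarity

A_pinv = np.linalg.pinv(A)
P_A = A_pinv @ A                     # projection onto the row space of A
Q_A = np.eye(n) - P_A                # projection onto the null space of A

def oracle_step(x, B_T, mu):
    # One PGD step (13): the null-space component is fixed by the prior (12).
    return P_A @ (x - mu * B_T @ (A @ x - y)) + Q_A @ x_star

mu_ls = 1.0 / np.linalg.eigvalsh(A.T @ A)[-1]    # 1 / lambda_max(A^T A)
x_bp = oracle_step(np.zeros(n), A_pinv, 1.0)     # BP: a single step suffices
x_ls = np.zeros(n)
for _ in range(200):                             # LS: linear rate, many steps
    x_ls = oracle_step(x_ls, A.T, mu_ls)

err_bp = np.linalg.norm(x_bp - x_star)           # ~ 0 after one iteration
err_ls = np.linalg.norm(x_ls - x_star)           # still nonzero after 200
```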

### 3.2 General Analysis

The following theorem provides a term that characterizes the convergence rate of PGD for both LS and BP objectives for general priors. It is closely related to Theorem 2 in (Oymak et al., 2017). The difference is twofold. First, the theorem in (Oymak et al., 2017) considers only the LS objective, and its derivation is not valid for the BP objective. Second, (Oymak et al., 2017) considers the estimation error, rather than only the convergence rate. Therefore, it examines $\|x_k - x^*\|_2$, where $x^*$ is the unknown "ground truth" signal, and assumes that $s(x^*)$ is known, which allows setting $R = s(x^*)$. In contrast, we generalize the theory for both LS and BP objectives, and for an arbitrary value of $R$. Among others, our theorem covers any stationary point $\hat{x}$ of the PGD scheme (11) (i.e., an optimal point for convex $s(\cdot)$) for which $s(\hat{x}) = R$ (essentially, we require that $R$ is small enough such that the prior is not meaningless). The proof of the theorem is deferred to Appendix A.

###### Theorem 3.1.

Let $s(\cdot)$ be a lower semi-continuous function (i.e., a function whose sub-level sets are closed), and let $\bar{x}$ be a point on the boundary of $\mathcal{K}$, i.e., $s(\bar{x}) = R$. Let $\xi$ be a constant that is equal to 1 for convex $s(\cdot)$ and equal to 2 otherwise. Then, the sequence $\{x_k\}$ obtained by (11) obeys

$$\|x_{k+1} - \bar{x}\|_2 \le \xi\,\rho(\bar{x})\,\|x_k - \bar{x}\|_2 + \xi\mu\left\|B^\top(A\bar{x} - y)\right\|_2, \qquad (17)$$

where

$$\rho(\bar{x}) := \sup_{u, v \,\in\, \mathcal{C}(s,\bar{x}) \cap \mathcal{S}^{n-1}} u^\top\left(I_n - \mu B^\top A\right)v. \qquad (18)$$

Theorem 3.1 is meaningful, i.e., implies convergence and provides a characterization of its rate, if $\xi\rho(\bar{x}) < 1$. We elaborate on this issue for LS and BP below Propositions 3.2 and 3.3, respectively.

Assuming that $\xi\rho(\bar{x}) < 1$, the term $\xi\mu\|B^\top(A\bar{x} - y)\|_2$ belongs to the component of the bound (17) which cannot be compensated for by using more iterations. Note that if $s(x^*) = R$, then Theorem 3.1 can be applied with $\bar{x} = x^*$. In this case, $\xi\mu\|B^\top(Ax^* - y)\|_2$ characterizes the estimation error (up to a factor due to the recursion in (17)). Moreover, $Ax^* - y = -e$, so the term vanishes if there is no noise.

In practice, one does not know the value of $s(x^*)$, and oftentimes a value of $R$ that is not equal to $s(x^*)$ provides better results in the presence of noise or when $s(\cdot)$ is non-convex. Therefore, in this work we choose to compare the convergence rates for the LS and BP objectives for arbitrary values of $R$. As we consider arbitrary $R$, we focus on $\bar{x} = \hat{x}$, i.e., the stationary point obtained by PGD. In this case $\|x_k - \hat{x}\|_2 \to 0$, and thus the component with $\|B^\top(A\hat{x} - y)\|_2$ in (17) presents slackness that is a consequence of the proof technique. To further see that this slackness is not expected to affect the conclusions of our analysis, note that we examine PGD with step-sizes that ensure convergence in convex settings (Beck and Teboulle, 2009) (namely, with the common step-size of $\mu = 1/\lambda_{\max}(B^\top A)$). Therefore, for convex $s(\cdot)$, misbehavior of $\{x_k\}$ like "swinging" is not possible. Empirically, monotonic convergence of $\{x_k\}$ is observed in Section 5 even for a highly non-convex prior such as DCGAN.

In the rest of this section we focus on the term $\rho(\bar{x})$ in (17). Whenever $\xi\rho(\bar{x}) < 1$, this term characterizes the convergence rate of PGD: a smaller $\rho(\bar{x})$ implies faster convergence. We start with specializing and bounding it for $\ell_{LS}$ and $\ell_{BP}$.

###### Proposition 3.2.

Consider the LS objective and step-size $\mu = 1/\lambda_{\max}(A^\top A)$. We have

$$\rho(\bar{x}) \le \rho_{LS}, \qquad (19)$$

where

$$\rho_{LS} := 1 - \frac{\lambda_{\min}(A^\top A;\, \mathcal{C})}{\lambda_{\max}(A^\top A)}, \qquad \lambda_{\min}(A^\top A;\, \mathcal{C}) := \min_{v \,\in\, \mathcal{C}(s,\bar{x}) \cap \mathcal{S}^{n-1}} v^\top A^\top A\, v. \qquad (20)$$

###### Proof.

For the LS objective, we have $B^\top A = A^\top A$ and $\mu = 1/\lambda_{\max}(A^\top A)$. Therefore, $I_n - \mu A^\top A$ is positive semi-definite, and using the generalized Cauchy-Schwarz inequality we get

$$\rho(\bar{x}) = \sup_{u,v \,\in\, \mathcal{C} \cap \mathcal{S}^{n-1}} u^\top\left(I_n - \mu A^\top A\right)v \le \sup_{v \,\in\, \mathcal{C} \cap \mathcal{S}^{n-1}} v^\top\left(I_n - \mu A^\top A\right)v = 1 - \frac{\lambda_{\min}(A^\top A;\, \mathcal{C})}{\lambda_{\max}(A^\top A)}. \qquad (21)$$

∎

Various works (Chandrasekaran et al., 2012; Plan and Vershynin, 2012; Amelunxen et al., 2014; Genzel et al., 2017) have proved, via Gordon's lemma (Corollary 1.2 in (Gordon, 1988)) and the notion of Gaussian width, that if: 1) the entries of $A$ are i.i.d. Gaussians $\mathcal{N}(0, 1/m)$; 2) $\bar{x}$ belongs to a parsimonious signal model (e.g., a sparse signal); and 3) $s(\cdot)$ is an appropriate prior for the signal model (e.g., the $\ell_0$-quasi-norm or $\ell_1$-norm for sparse signals), then there exist tight lower bounds (tightness that has been shown empirically) on the restricted smallest eigenvalue $\lambda_{\min}(A^\top A;\, \mathcal{C})$, which are much greater than the naive lower bound $\lambda_{\min}(A^\top A)$ (recall that $m < n$, so $\lambda_{\min}(A^\top A) = 0$). Together with the (small enough) step-size $\mu = 1/\lambda_{\max}(A^\top A)$, this implies that $\rho_{LS} < 1$, and therefore Theorem 3.1 indeed provides meaningful guarantees for PGD applied on the LS objective under the above conditions.

###### Proposition 3.3.

Consider the BP objective and step-size $\mu = 1/\lambda_{\max}(A^\dagger A) = 1$. We have

$$\rho(\bar{x}) \le \rho_{BP}, \qquad (22)$$

where

$$\rho_{BP} := 1 - \lambda_{\min}(A^\dagger A;\, \mathcal{C}), \qquad \lambda_{\min}(A^\dagger A;\, \mathcal{C}) := \min_{v \,\in\, \mathcal{C}(s,\bar{x}) \cap \mathcal{S}^{n-1}} v^\top A^\dagger A\, v. \qquad (23)$$

###### Proof.

For the BP objective, we have $B^\top A = A^\dagger A$ and $\mu = 1$. Since $I_n - A^\dagger A = Q_A$ is positive semi-definite, using similar steps to those in the proof of Proposition 3.2 we get

$$\rho(\bar{x}) \le \sup_{v \,\in\, \mathcal{C} \cap \mathcal{S}^{n-1}} v^\top\left(I_n - A^\dagger A\right)v = 1 - \lambda_{\min}(A^\dagger A;\, \mathcal{C}). \qquad (24)$$

∎

As shown in Proposition 3.4 below, if $\lambda_{\min}(A^\top A;\, \mathcal{C}) > 0$ then $\lambda_{\min}(A^\dagger A;\, \mathcal{C}) > 0$ as well. Therefore, Theorem 3.1 provides meaningful guarantees also for PGD applied on the BP objective. However, obtaining tight lower bounds directly on the restricted smallest eigenvalue $\lambda_{\min}(A^\dagger A;\, \mathcal{C})$, similar to those obtained (in some cases) for $\lambda_{\min}(A^\top A;\, \mathcal{C})$, appears to be an open problem. Its difficulty stems from the fact that tools like Slepian's lemma and the Sudakov-Fernique inequality, which are at the core of Gordon's lemma that is used to bound $\lambda_{\min}(A^\top A;\, \mathcal{C})$, cannot be used in this case.

Denote by $\hat{x}_{LS}$ and $\hat{x}_{BP}$ the recoveries obtained by the LS and BP objectives, respectively. The terms $\rho_{LS}$ and $\rho_{BP}$ upper bound the convergence rate for each objective. Observing these expressions, we identify two factors that affect their relation, and are thus possible sources for different convergence rates. The two factors, labeled "intrinsic" and "extrinsic", are explained in Sections 3.3 and 3.4, respectively.

### 3.3 Intrinsic Source of Faster Convergence for BP

The following proposition addresses the case where the obtained minimizers are similar, i.e., $\hat{x}_{LS} \approx \hat{x}_{BP}$ (so that the cones $\mathcal{C}$ in (20) and (23) essentially coincide). It guarantees that $\rho_{BP}$ is not larger than $\rho_{LS}$ for any full row-rank $A$.

###### Proposition 3.4.

For any closed cone $\mathcal{C}$ and full row-rank matrix $A$, we have

$$\rho_{BP} \le \rho_{LS}, \quad \text{i.e.,} \quad \lambda_{\min}(A^\dagger A;\, \mathcal{C}) \ge \frac{\lambda_{\min}(A^\top A;\, \mathcal{C})}{\lambda_{\max}(A^\top A)}. \qquad (25)$$

###### Proof.

For any $v \in \mathcal{C} \cap \mathcal{S}^{n-1}$,

$$v^\top A^\dagger A\, v = v^\top A^\top (AA^\top)^{-1} A\, v \ge \frac{v^\top A^\top A\, v}{\lambda_{\max}(AA^\top)} \ge \frac{\lambda_{\min}(A^\top A;\, \mathcal{C})}{\lambda_{\max}(A^\top A)}. \qquad (26)$$

Taking the minimum over $v \in \mathcal{C} \cap \mathcal{S}^{n-1}$ yields (25).

∎

Notice that in (26) we use an inequality that does not take into account the fact that $Av$ resides in a restricted set. As discussed above, this is due to the lack of tighter lower bounds for $\lambda_{\min}(A^\dagger A;\, \mathcal{C})$. Still, following the warm-up example and the discussions below Propositions 3.2 and 3.3, we conjecture that the inequality in Proposition 3.4 is strict, i.e., that $\rho_{BP} < \rho_{LS}$, in generic cases where the entries of $A$ are i.i.d. Gaussians $\mathcal{N}(0, 1/m)$, the recovered signals belong to parsimonious models, and the feasible sets are appropriately chosen. (Note that the general analysis subsumes the warm-up result with a strict inequality for the convergence rates: for the prior in (12), the descent set and its tangent cone are the subspace spanned by the rows of $A$; therefore, $\rho_{BP} = 0$ while $\rho_{LS} = 1 - \lambda_{\min}(AA^\top)/\lambda_{\max}(AA^\top) > 0$.)

Let us present an experiment that supports our conjecture. We consider a Gaussian $A$, as mentioned above, and a feasible set of $\kappa$-sparse signals, i.e., the number of non-zero elements in any $\tilde{x} \in \mathcal{K}$ is at most $\kappa$. In this case, $\lambda_{\min}(A^\top A;\, \mathcal{C})$ can be approximated by: 1) drawing many supports, i.e., choices of $\kappa$ out of the $n$ columns of $A$; 2) for each support, creating an $m \times \kappa$ matrix $A_S$ and computing $\lambda_{\min}(A_S^\top A_S)$; and 3) keeping the minimal value. Plugging the approximation of $\lambda_{\min}(A^\top A;\, \mathcal{C})$ into (20), we obtain an approximation of $\rho_{LS}$.

Similarly, to approximate $\lambda_{\min}(A^\dagger A;\, \mathcal{C})$, the same procedure can be done with $(AA^\top)^{-1/2}A$ in place of $A$ (note that $A^\dagger A = \big((AA^\top)^{-1/2}A\big)^\top (AA^\top)^{-1/2}A$). Plugging the approximation into (23), we obtain an approximation of $\rho_{BP}$.

Fig. 0(a) shows the approximate ratio $\rho_{BP}/\rho_{LS}$ for a fixed problem size and different values of the sparsity level $\kappa$. Fig. 0(b) shows this ratio for a fixed $\kappa$ and different values of $m$. In both figures $\rho_{BP}$ is strictly smaller than $\rho_{LS}$, which agrees with our conjecture.
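The support-sampling procedure described above can be sketched as follows (the dimensions are illustrative, not the paper's; we use the identity $A^\dagger A = B^\top B$ with $B = (AA^\top)^{-1/2}A$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, kappa, trials = 25, 40, 3, 500
A = rng.standard_normal((m, n)) / np.sqrt(m)

# B = (A A^T)^{-1/2} A, so that B^T B = A^+ A.
w, U = np.linalg.eigh(A @ A.T)
B = (U / np.sqrt(w)) @ U.T @ A

lam_ls, lam_bp = np.inf, np.inf
for _ in range(trials):
    S = rng.choice(n, size=kappa, replace=False)          # random support
    lam_ls = min(lam_ls, np.linalg.eigvalsh(A[:, S].T @ A[:, S])[0])
    lam_bp = min(lam_bp, np.linalg.eigvalsh(B[:, S].T @ B[:, S])[0])

rho_ls = 1 - lam_ls / np.linalg.eigvalsh(A.T @ A)[-1]     # approximates (20)
rho_bp = 1 - lam_bp                                       # approximates (23)
assert rho_bp < rho_ls   # agrees with the conjecture (Proposition 3.4)
```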

Discussion. As we have obtained (worst-case) upper bounds on the PGD convergence rates for LS and BP, a natural question is: should the bounds be tight in order to deduce conclusions on the relation of the real rates for LS and BP? Interestingly, when both objectives lead to a similar stationary point $\hat{x}$, it is enough to verify that $\rho_{LS}$ is tight to conclude that the real rate for BP is better than for LS. This follows from the fact that the real rate of BP is smaller (i.e., better) than $\rho_{BP}$, and $\rho_{BP} \le \rho_{LS}$. That is, while tightness of $\rho_{LS}$ is important (and is indeed obtained in certain cases, as discussed above and empirically demonstrated in (Oymak et al., 2017)), any looseness of $\rho_{BP}$ only increases the gap between the real rates of LS and BP in favor of BP!

### 3.4 Extrinsic Source of Different Convergence Rates

Since using the LS and BP objectives in (3) defines two different optimization problems, potentially, one may prefer to assign a different value of the regularization parameter $R$ in each case. This obviously translates to using feasible sets of different volumes. Note that the obtained convergence rates depend on the feasible set through the tangent cone $\mathcal{C}$ in $\rho_{LS}$ and $\rho_{BP}$, and are therefore affected by the value of $R$. We refer to this effect on the convergence rate as "extrinsic" because it originates in a modified prior rather than directly from the different BP and LS objectives.

For the LS objective, under the assumption of a Gaussian $A$, the work (Oymak et al., 2017) has used the notion of Gaussian width to theoretically link the complexity of the signal prior, which translates to the feasible set in (3), with the convergence rate of PGD. Their result implies that increasing the size of the feasible set (due to a relaxed prior) is expected to decrease the convergence rate, i.e., slow down PGD. Therefore, it is expected that using $R_{LS} > R_{BP}$ would increase the gap between the convergence rates in favor of the BP term, beyond the effect of its intrinsic advantage described in Section 3.3. On the other hand, using $R_{BP} > R_{LS}$ may counteract the intrinsic advantage of the BP term.

## 4 Convergence Analysis Beyond PGD

Many works on inverse problems use the penalized optimization problem (2) rather than the constrained one (3). Oftentimes, (2) is minimized using the proximal gradient method, which is given by

$$x_{k+1} = \mathrm{prox}_{\mu\beta s}\left(x_k - \mu\nabla\ell(x_k)\right), \qquad (27)$$

where

$$\mathrm{prox}_{g}(z) := \arg\min_{\tilde{x}} \frac{1}{2}\|\tilde{x} - z\|_2^2 + g(\tilde{x}) \qquad (28)$$

is the proximal mapping of $g(\cdot)$ at the point $z$, which was introduced for convex functions in (Moreau, 1965). Note that PGD with a convex feasible set is essentially the proximal gradient method for the case where $\beta s(\cdot)$ is a convex indicator function. Similarly to PGD, setting the step-size to 1 over the Lipschitz constant of $\nabla\ell$ ensures sublinear convergence of (27) in convex settings (Beck and Teboulle, 2009).
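For instance, for $g(\tilde{x}) = t\|\tilde{x}\|_1$ the proximal mapping (28) is the well-known elementwise soft-thresholding operator, which a brute-force 1-D minimization of (28) confirms:

```python
import numpy as np

def prox_l1(z, t):
    """prox_{t * ||.||_1}(z): elementwise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Sanity check against the definition (28) on a dense 1-D grid:
z, t = 1.3, 0.5
grid = np.linspace(-3, 3, 60001)
brute = grid[np.argmin(0.5 * (grid - z) ** 2 + t * np.abs(grid))]
assert abs(brute - prox_l1(np.array([z]), t)[0]) < 1e-3
```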

To obtain an expression that allows comparing the convergence rates of (27) with $\ell_{LS}$ and $\ell_{BP}$, we make a relaxed contraction assumption. Namely, we require that the proximal mapping of $\beta s(\cdot)$ is a contraction (only) in the null space of $A$ (rather than in all of $\mathbb{R}^n$).

Condition B.2. Given the convex function $\beta s(\cdot)$ and the full row-rank matrix $A$, there exists $\gamma < 1$ such that for all $z_1, z_2 \in \mathbb{R}^n$

$$\left\|Q_A\left(\mathrm{prox}_{\mu\beta s}(z_1) - \mathrm{prox}_{\mu\beta s}(z_2)\right)\right\|_2 \le \gamma\left\|Q_A\left(z_1 - z_2\right)\right\|_2. \qquad (29)$$

Note that the constant $\gamma$ reflects the restriction that the prior imposes on the null space of $A$. Condition B.2 is satisfied by priors such as Tikhonov regularization (Tikhonov, 1963), and even by a recent GMM-based prior (Teodoro et al., 2018). See Appendix B for more details on this condition.

Under Condition B.2, we prove (Theorem B.3 in Appendix B) that the proximal gradient method with a step-size of $\mu = 1/\lambda_{\max}(B^\top A)$ exhibits linear convergence, with a rate that shows an advantage of the BP term over the LS term, due to a better "restricted condition number" of the Hessian of $\ell(\cdot)$ in the subspace spanned by the rows of $A$. Informally, we have

$$\|x_k - \hat{x}_{LS}\|_2 \lesssim \left(\max\{\gamma,\, \delta_{LS}\}\right)^k, \qquad \|x_k - \hat{x}_{BP}\|_2 \lesssim \left(\max\{\gamma,\, \delta_{BP}\}\right)^k, \qquad (30)$$

where $\delta_{LS} = 1 - \lambda_{\min}(AA^\top)/\lambda_{\max}(AA^\top)$ and $\delta_{BP} = 0$. Note that if $\gamma < \delta_{LS}$, then the bound on the rate of BP is better, regardless of the prior. Alternatively, this result hints that a worse condition number of $AA^\top$ is expected to correlate with a larger difference between the convergence rates of LS and BP, in favor of BP. Since PGD with a convex feasible set is a special case of the proximal gradient method, this behavior is also expected for PGD. Indeed, this is demonstrated for compressed sensing in Appendix C.
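A toy illustration of this effect (our own construction, not the paper's experiment): with the Tikhonov prior $s(\tilde{x}) = \frac{1}{2}\|\tilde{x}\|_2^2$, whose proximal mapping is the simple shrinkage $z \mapsto z/(1+\mu\beta)$, the proximal gradient method (27) on an ill-conditioned $A$ needs far more iterations for LS than for BP:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, beta = 20, 40, 1e-3
# Synthetic ill-conditioned A: prescribe decaying singular values.
U, _, Vt = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
A = U @ np.diag(np.logspace(0, -1, m)) @ Vt    # cond(A A^T) = 100
y = A @ rng.standard_normal(n)

def prox_grad(B_T, mu, tol=1e-9, max_iter=200000):
    # Proximal gradient (27) with the Tikhonov prox z / (1 + mu*beta),
    # run until the update stalls; returns the iterate and iteration count.
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_new = (x - mu * B_T @ (A @ x - y)) / (1.0 + mu * beta)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

_, k_ls = prox_grad(A.T, 1.0 / np.linalg.eigvalsh(A.T @ A)[-1])
_, k_bp = prox_grad(np.linalg.pinv(A), 1.0)    # lambda_max(A^+ A) = 1
print(k_ls, k_bp)   # BP requires far fewer iterations
```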

In this paper we mainly focus on direct PGD results (rather than on those obtained for general proximal methods) for two reasons. Firstly, they do not require a contraction assumption. Secondly, identifying an "intrinsic factor" for the different convergence rates is easier for PGD, both in experiments (see the discussion on Fig. 3) and in analysis (the dependence of $\gamma$ on $\beta$ is not explicit, and the effect of the regularization parameters on the results cannot be bypassed by assuming similar minimizers, as we have done in Section 3.3 for $\rho_{LS}$ and $\rho_{BP}$ in PGD).

## 5 Experiments

In this section, we provide numerical experiments that corroborate our mathematical analysis for both convex ($\ell_1$-norm) and non-convex (pre-trained DCGAN (Radford et al., 2015)) priors. For the $\ell_1$-norm prior, we examine the performance of PGD with the LS and BP objectives for compressed sensing (CS). It is demonstrated that both objectives prefer (i.e., provide better PSNR for) a similar value of $R$ — a case in which the faster convergence of BP is dictated by its "intrinsic" advantage, rather than by an "extrinsic" source. For this prior, we also examine an accelerated proximal gradient method (FISTA (Beck and Teboulle, 2009)) applied on (2) with the LS and BP fidelity terms, and suggest an explanation for the observed behavior using the "extrinsic" and "intrinsic" sources. For the DCGAN prior, we examine the performance of PGD for CS and super-resolution (SR) tasks, and show again the inherent advantage of the BP objective.

### 5.1 $\ell_1$-Norm Prior

We consider a typical CS scenario (see Appendix C for more scenarios), where the measurement matrix is Gaussian (with i.i.d. entries drawn from $\mathcal{N}(0, 1/m)$), $m < n$, and the signal-to-noise ratio (SNR) is 20dB (with white Gaussian noise). We use four standard test images: cameraman, house, peppers, and Lena. To apply sparsity-based recovery, we represent the images in the Haar wavelet basis, i.e., $A$ is the multiplication of the Gaussian measurement matrix with an $n \times n$ Haar wavelet basis.

For the reconstruction, we use the feasible set $\mathcal{K} = \{\tilde{x} : \|\tilde{x}\|_1 \le R\}$, where $\|\cdot\|_1$ is the $\ell_1$-norm, and project onto it using the fast algorithm from (Duchi et al., 2008). Starting from $x_0 = 0$, we apply 1000 iterations of PGD on the BP and LS objectives with the typical step-size of 1 over the spectral norm of the objective's Hessian. We compute $A^\dagger$ in advance, so the PGD iterations for both objectives have similar computational cost.
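The Euclidean projection onto the $\ell_1$-ball can be computed exactly by the sorting-based algorithm of (Duchi et al., 2008); a compact implementation:

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {x : ||x||_1 <= R} (Duchi et al., 2008)."""
    if np.abs(v).sum() <= R:
        return v.copy()                      # already feasible
    u = np.sort(np.abs(v))[::-1]             # magnitudes, descending
    css = np.cumsum(u)
    # Largest index (0-based) where the soft-threshold level is still active:
    rho = np.nonzero(u - (css - R) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)     # optimal threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

For example, projecting `[3, 1]` onto the ball of radius 2 soft-thresholds both entries by 1, giving `[2, 0]`.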

Fig. 1(a) shows the PSNR of the reconstructions, averaged over all images, for different values of the regularization parameter $R$. Fig. 1(b) shows the average PSNR as a function of the iteration number, for the best-performing value of $R$ and for a larger value equal to the average $\ell_1$-norm of the four "ground truth" test images (in their Haar basis representation); note that the latter yields less accurate results. From Fig. 1(b) we see that when PGD is applied on the BP and LS objectives with the same value of $R$, BP is indeed faster, which demonstrates its "intrinsic" advantage. Also, when $R$ is increased, the convergence of PGD for both objectives becomes slower, due to this "extrinsic" modification. Note, though, that Fig. 1(a) implies that both objectives prefer a similar value of $R$. Therefore, when $R$ is (uniformly) tuned for the best PSNR of each method, it is expected that the intrinsic advantage of BP over LS is the reason for its faster PGD convergence.

We turn to recover the images by minimizing (2) using 1000 iterations of FISTA (Beck and Teboulle, 2009) with the LS and BP fidelity terms. Figs. 2(a) and 2(b) show the average PSNR vs. $\beta$ and vs. the iteration number, respectively. Fig. 4 presents the average $\ell_1$-norm of the recoveries vs. $\beta$. Note that the best PSNRs for BP and LS are obtained at values of $\beta$ for which the $\ell_1$-norm of the recovery is very similar for both terms, i.e., the equivalent constrained LS and BP formulations have a very similar $R$ (as observed for PGD).

However, disentangling the factors behind the different convergence rates of LS and BP, where for each of them the regularization parameter is (uniformly) tuned for the best PSNR, is more complicated for proximal methods than for PGD. To see this, note that in Figs. 2(a) and 4, for each fidelity term, similar values of PSNR and $\ell_1$-norm can be obtained for different values of $\beta$. Yet, as shown in Fig. 2(b), different values of $\beta$ significantly change the convergence rate of FISTA for the same fidelity term (a known behavior of proximal methods; see, e.g., (Hale et al., 2008)). Therefore, contrary to our conclusion for PGD, in this case, when $\beta$ is uniformly tuned for the best PSNR of each fidelity term, an "extrinsic source" (the $\beta$ setting) can affect the convergence rate as well.

Finally, we emphasize that, in general, having $\beta_{LS} = \beta_{BP}$ in a penalized formulation does not have an immediate implication on the different convergence rates of proximal methods (e.g., it does not necessarily imply that $R_{LS} = R_{BP}$ in the equivalent constrained formulations). This is mainly because the scales of $\ell_{LS}$ and $\ell_{BP}$ may be very different (depending on $A$ in the problem).

### 5.2 DCGAN Prior

In recent years, there has been a significant improvement in learning generative models via deep neural networks. Methods like variational auto-encoders (VAEs) (Kingma and Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014) have found success at modeling data distributions. This has naturally led to using pre-trained generative models as priors in imaging inverse problems (see, e.g., (Bora et al., 2017; Shah and Hegde, 2018; Abu Hussein et al., 2020)).

Since in popular generative models (Kingma and Welling, 2013; Goodfellow et al., 2014) a generator $G(\cdot)$ learns a mapping from a low-dimensional space $\mathbb{R}^d$ to the signal space $\mathbb{R}^n$ ($d \ll n$), one can search for a reconstruction of $x$ only in the range of a pre-trained generator, i.e., in the set $\{G(z) : z \in \mathbb{R}^d\}$. Note that the proposed PGD theory, which assumes in (9) that $\mathcal{K} = \{\tilde{x} : s(\tilde{x}) \le R\}$, covers the above feasible set for $R = 0$ and the following non-convex prior

$$s(\tilde{x}) = \min_{z \in \mathbb{R}^d} \|\tilde{x} - G(z)\|_2. \qquad (31)$$

In the next experiments we use a generator $G(\cdot)$ that we obtained by training a DCGAN (Radford et al., 2015) on the first 200,000 images (out of 202,599) of the CelebA dataset. We use a downscaled version of the images and a training procedure similar to (Radford et al., 2015; Bora et al., 2017).

We start with the CS scenario from the previous section, where the entries of the measurement matrix are i.i.d. drawn from $\mathcal{N}(0, 1/m)$ and the SNR is 20dB. The last 10 images in CelebA are used as test images.

The recovery using each of the LS and BP objectives is based on 50 iterations of PGD with the typical step-size of 1 over the spectral norm of the objective's Hessian, and an initialization of $x_0 = 0$. As the projection, we use $\mathcal{P}_{\mathcal{K}}(\bar{x}) = G(\hat{z})$, where $\hat{z}$ is obtained by minimizing $\|G(z) - \bar{x}\|_2$ with respect to $z$. This inner (non-convex) minimization problem is carried out by 1000 iterations of ADAM (Kingma and Ba, 2014) with a learning rate of 0.1 and multiple initializations. The value of $z$ that gives the lowest $\|G(z) - \bar{x}\|_2$ is chosen as $\hat{z}$. For the projection in the first PGD iteration, we use the same 10 random initializations of $z$ for both LS and BP. For the other PGD iterations, warm starts are used (i.e., values of $\hat{z}$ from the end of the inner minimization in the previous iteration). For both LS and BP objectives, the computational cost of a PGD iteration is dominated by the complexity of the projection. Thus, the overall complexity of each objective is dictated by the number of iterations that it requires.

Table 1 shows the PSNR of the reconstructions, averaged over the test images. Several visual results are shown in Fig. 6. Fig. 5(a) shows the average PSNR as a function of the iteration number. Again, it is clear that the BP objective requires significantly fewer iterations. Since the DCGAN prior does not involve a regularization parameter, the discussed "extrinsic" source of faster convergence is not relevant. However, recall that the DCGAN prior is (highly) non-convex, contrary to the $\ell_1$-norm prior. Therefore, $\hat{x}_{LS}$ and $\hat{x}_{BP}$, the PGD stationary points for the LS and BP objectives, may be extremely different, and similarly, their two associated cones $\mathcal{C}(s, \hat{x}_{LS})$ and $\mathcal{C}(s, \hat{x}_{BP})$ may have very different geometries. This fact is another source of different convergence rates.

As an attempt to (approximately) isolate the effect of the intrinsic source on the convergence rates, we present in Fig. 5(b) the PSNR vs. iteration number only for image 202592 in CelebA, for which the recoveries using the LS and BP objectives are relatively similar (see Fig. 6). The similarity between the convergence rates in Figs. 5(a) and 5(b) hints that the inherent advantage of BP plays an essential role in its faster PGD convergence also for the other images in the examined scenario, where the recoveries are not similar.

| | LS objective | BP objective |
|---|---|---|
| CS | 23.14 | 23.57 |
| SR x3 | 23.29 | 23.90 |

Our final experiment demonstrates faster PGD convergence of the BP objective for a different observation model: the SR task, where $A$ is a composite operator of anti-aliasing filtering followed by down-sampling. We consider a commonly examined scenario with a scale factor of 3 and a Gaussian anti-aliasing filter with standard deviation 1.6. For the reconstruction, we use PGD with the DCGAN prior, initialized with a bicubic upsampling of $y$. The rest of the configuration is exactly as used for CS (obviously, now the gradients and the step-size are computed for the new relevant $A$).

Table 1 shows the PSNR of the reconstructions, averaged over the test images. Several visual results are shown in Fig. 8. Fig. 7 shows the average PSNR versus the iteration number. Once again, the convergence of PGD for the BP objective is faster. However, this time the difference in the convergence rates is modest. Since in this SR experiment we have obtained significantly different recoveries for the LS and BP objectives (BP consistently yields a higher PSNR), we cannot try to isolate the effect of the intrinsic source, as done above. Yet, presumably, in this SR scenario the prior imposes a weaker restriction on the null space of $A$ than in the CS scenario above, which is translated in the analysis to a smaller gap between $\rho_{LS}$ and $\rho_{BP}$.

## 6 Conclusion

In this paper, we compared the convergence rates of PGD applied on the LS and BP objectives, and identified an intrinsic source of the faster convergence of BP. Numerical experiments supported our theoretical findings for both convex ($\ell_1$-norm) and non-convex (pre-trained DCGAN) priors. For the $\ell_1$-norm prior, we also provided numerical experiments that connect the PGD analysis with the behavior observed for the proximal gradient method in (Tirer and Giryes, 2020). A study of the latter further highlighted the advantage of BP over LS when $AA^\top$ is badly conditioned.

A possible direction for future research is obtaining a lower bound for $\lambda_{\min}(A^\dagger A;\, \mathcal{C})$ that is tighter than the one implied by Proposition 3.4.