# Projected Nesterov's Proximal-Gradient Algorithm for Sparse Signal Reconstruction with a Convex Constraint

We develop a projected Nesterov's proximal-gradient (PNPG) approach for sparse signal reconstruction that combines adaptive step size with Nesterov's momentum acceleration. The objective function that we wish to minimize is the sum of a convex differentiable data-fidelity (negative log-likelihood (NLL)) term and a convex regularization term. We apply sparse signal regularization where the signal belongs to a closed convex set within the closure of the domain of the NLL; the convex-set constraint facilitates flexible NLL domains and accurate signal recovery. Signal sparsity is imposed using the ℓ_1-norm penalty on the signal's linear transform coefficients or gradient map, respectively. The PNPG approach employs projected Nesterov's acceleration step with restart and an inner iteration to compute the proximal mapping. We propose an adaptive step-size selection scheme to obtain a good local majorizing function of the NLL and reduce the time spent backtracking. Thanks to step-size adaptation, PNPG does not require Lipschitz continuity of the gradient of the NLL. We present an integrated derivation of the momentum acceleration and its O(k^-2) convergence-rate and iterate convergence proofs, which account for adaptive step-size selection, inexactness of the iterative proximal mapping, and the convex-set constraint. The tuning of PNPG is largely application-independent. Tomographic and compressed-sensing reconstruction experiments with Poisson generalized linear and Gaussian linear measurement models demonstrate the performance of the proposed approach.

Comments

There are no comments yet.

## Authors

• 2 publications
• 3 publications
11/11/2021

### Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum

Stochastic gradient descent with momentum (SGDM) is the dominant algorit...
04/06/2018

### Adaptive Three Operator Splitting

We propose and analyze a novel adaptive step size variant of the Davis-Y...
03/14/2016

### On the Influence of Momentum Acceleration on Online Learning

The article examines in some detail the convergence rate and mean-square...
04/06/2019

### Convex-Concave Backtracking for Inertial Bregman Proximal Gradient Algorithms in Non-Convex Optimization

Backtracking line-search is an old yet powerful strategy for finding bet...
05/15/2019

### Iterative Alpha Expansion for estimating gradient-sparse signals from linear measurements

We consider estimating a piecewise-constant image, or a gradient-sparse ...
10/15/2019

### Adaptive Step Sizes in Variance Reduction via Regularization

The main goal of this work is equipping convex and nonconvex problems wi...
05/01/2021

### NuSPAN: A Proximal Average Network for Nonuniform Sparse Model – Application to Seismic Reflectivity Inversion

We solve the problem of sparse signal deconvolution in the context of se...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Most natural signals are well described by only a few significant coefficients in an appropriate transform domain, with the number of significant coefficients much smaller than the signal size. Therefore, for a vector

that represents the signal and an appropriate sparsifying transform , is a signal transform-coefficient vector with most elements having negligible magnitudes. The idea behind compressed sensing [Candes2006] is to sense the significant components of using a small number of measurements. Define the noiseless measurement vector , where and . Most effort in compressed sensing has focused on the linear sparsifying transform and noiseless measurement models with

 \bpsi(\bx)=ΨT\bx (1a) ϕ(\bx)=Φ\bx (1b)

where and are known sparsifying dictionary and sensing matrices. Here, we consider signals that belong to a closed convex set in addition to their sparsity in the transform domain. The nonnegative signal scenario with

 C=\mathamsbbRp+ (2)

is of significant practical interest and applicable to X-ray CT, SPECT, PET, and MRI, where the pixel values correspond to inherently nonnegative density or concentration maps [PrinceLinks2015]. Harmany2012 consider such a nonnegative sparse signal model and develop in [Harmany2012] and [Harmany2010Gradient] a convex-relaxation SPIRAL and a linearly constrained gradient projection method for Poisson and Gaussian linear measurements, respectively. In addition to signal nonnegativity, other convex-set constraints have been considered in the literature, such as prescribed value in the Fourier domain; box, geometric, and total-energy constraints; and intersections of these sets [YoulaWebb1982, SezanStark1983].

 We adopt the analysis regularization framework and minimize f(\bx)=\cL(\bx)+ur(\bx) (3a) with respect to the signal \bx, where \cL(\bx) is a convex differentiable data-fidelity (NLL) term, u>0 is a scalar tuning constant that quantifies the weight of the convex regularization term r(\bx) that imposes signal sparsity and the convex-set constraint: r(\bx)=\norm\bpsi(\bx)1+\mathamsbbIC(\bx) (3b)

and is the indicator function. Common choices for the signal sparsifying transform are the linear map in (1a), isotropic gradient map

 \SBR\bpsi(\bx)p′i=1\df√∑j∈Ni(xi−xj)2 (4)

and their combinations; here, is the index set of neighbors of in an appropriate (e.g., 2D) arrangement. Summing (4) over leads to the isotropic TV penalty [gdasil15, Harmany2012, Beck2009TV]; in the 2D case, anisotropic TV penalty is slightly different and easy to accommodate as well. Assume

 C⊆\closure\PARENSs\dom\cL(\bx) (5)

which ensures that is computable for all and closure ensures that points in but close to its open boundary, if there is any, will not be excluded upon projecting onto the closed set . If is not empty, then is not computable in it, which needs special attention; see Section III.

Define the proximal operator for a function scaled by :

 \proxopλr\ba=argmin\bx12\norm\bx−\ba22+λr(\bx). (6)

References [DupeFadiliStarck2012, Raguet2013GFB, Condat2013Primal] view (3a) as a sum of three terms, , , and , and minimize it by splitting schemes, such as forward-backward, Douglas-Rachford, and primal-dual. A potential benefit of splitting schemes is that they apply proximal operations on individual summands rather than on their combination, which is useful if all individual proximal operators are easy to compute. However, [DupeFadiliStarck2012] requires the proximal operator of , which is difficult in general and needs an inner iteration. Both [DupeFadiliStarck2012] and GFB splitting [Raguet2013GFB] require inner iterations for solving (see (1a) and (6)) in the general case where the sparsifying matrix is not orthogonal. The elegant PDS method in [Condat2013Primal, Vu2013] does not require inner iterations. The convergence rate of both GFB and PDS methods can be upper-bounded by where is the number of iterations and the constant is determined by values of the tuning proximal and relaxation constants [Liang2014Conv, Davis2015ConvPDS].

In this paper, we develop a PNPG method whose momentum acceleration accommodates (increasing) adaptive step size selection (see also [gdasil15, gdasil14, NesterovTechReport]) and convex-set constraint on the signal . PNPG needs an inner iteration to compute the proximal operator with respect to , which implies inexact proximal operation. We account for this inexactness and establish convergence rate of the PNPG method as well as convergence of its iterates; the obtained convergence conditions motivate our selection of convergence criteria for proximal-mapping iterations. We modify the original Nesterov’s acceleration [Nesterov1983, Beck2009FISTA] so that we can establish these convergence results when the step size is adaptive and adjusts to the local curvature of the NLL. Thanks to the step-size adaptation, PNPG does not require Lipschitz continuity of the gradient of the NLL and applies to the Poisson compressed-sensing scenario described in Section II-A. Our integration of the adaptive step size and convex-set constraint extends the application of the Nesterov-type acceleration to more general measurement models than those used previously. Furthermore, a convex-set constraint can bring significant improvement to signal reconstructions compared with imposing signal sparsity only, as illustrated in Section LABEL:sec:linear1dex. See Section LABEL:sec:Okminustwoaccelerationapproaches for discussion of other acceleration approaches: Auslender-Teboulle (AT) [Auslender2006AT, Becker2011TFOCS] and BonettiniPortaRuggiero2015 [BonettiniPortaRuggiero2015]. Proximal Quasi-Newton type methods with problem-specific diagonal Hessian approximations have been considered in [BonettiniPortaRuggiero2015, BonettiniLorisPortaPrato2015]; [BonettiniLorisPortaPrato2015] applies step-size adaptation and accounts for inaccurate proximal operator, but does not employ acceleration or provide fast convergence-rate guarantees.

PNPG code is easy to maintain: for example, the proximal-mapping computation can be easily replaced as a module by the latest state-of-the-art solver. Furthermore, PNPG requires minimal application-independent tuning; indeed, we use the same set of tuning parameters in two different application examples. This is in contrast with the existing splitting methods, which require problem-dependent (NLL-dependent) design and tuning.

We introduce the notation: , ,

, denote the vectors of zeros and ones and identity matrix, respectively; “

” is the elementwise version of “”. For a vector , define the projection and soft-thresholding operators:

 \projC\ba = argmin\bx∈C\norm\bx−\ba22 (7a) \SBR\softthrλ\bai = \sgn(ai)\maxp\absai−λ,0 (7b) and the elementwise logarithm and exponential functions \SBRsln∘\bai=lnai and \SBRsexp∘\bai=expai. The projection onto \mathamsbbRN+ and the proximal operator (6) for the ℓ1-norm ∥\bx∥1 can be computed in closed form: \SBRbig\proj\mathamsbbRN+\bai=max(ai,0), \proxopλ\norm⋅1\ba=\softthrλ\ba. (7c)

Define the -subgradient [Rockafellar1970, Sec. 23]:

 ∂εr(\bx) \df \CBRbigg∈\mathamsbbRp∣r(z)≥r(\bx)+(\bz−\bx)Tg−ε,∀\bz∈\mathamsbbRp

and an inexact proximal operator:

###### Definition 1

We say that is an approximation of with -precision [Villa2013Inexact], denoted

 \bx≊ε\proxsur\ba (9a) if \ba−\bxu∈∂ε22ur(\bx). (9b)

Note that (9a) implies

 \norm\bx−\proxsur\ba22≤ε2. (10)

We introduce representative NLL functions in Section II, describe the proposed PNPG reconstruction algorithm in Section III, establish its convergence properties (Section LABEL:sec:convergence_analysis), present numerical examples (Section LABEL:sec:NumEx), and make concluding remarks (Section LABEL:sec:conclusion).

## Ii Probabilistic Measurement Models

For numerical stability, we normalize the likelihood function so that the corresponding NLL

is lower-bounded by zero. For NLL that correspond to discrete GLM, this normalization corresponds to the generalized Kullback-Leibler divergence form of the NLL and is also closely related to the Bregman divergence

[BanerjeeDhillon2005].

### Ii-a Poisson Generalized Linear Model

GLM with Poisson observations are often adopted in astronomic, optical, hyperspectral, and tomographic imaging [Willett2014, PrinceLinks2015, StarckMurtagh2006, Snyder1993] and used to model event counts, e.g., numbers of particles hitting a detector. Assume that the measurements

are independent Poisson-distributed

111 Here, we use the extended Poisson pmf for all by defining to accommodate the identity-link model. with means .

Upon ignoring constant terms and normalization, we obtain the generalized Kullback-Leibler divergence form [ZanniBertero2014] of the NLL

 \cL(\bx)=1T\SBRsϕ(\bx)−\by+∑n,yn≠0ynlnyn\SBRs\bphi(\bx)n. (11a) The NLL \cL(\bx):\mathamsbbRp↦\mathamsbbR+ is a convex function of the signal \bx. Here, the relationship between the linear predictor Φ\bx and the expected value \bphi(\bx) of the measurements \by is summarized by the link function \bg(⋅):\mathamsbbRN↦\mathamsbbRN [McCullagh1989]: \Exp(\by)=\bphi(\bx)=\bg−1(Φ\bx). (11b)

Note that .

Two typical link functions in the Poisson GLM are log and identity, described in the following:

#### Ii-A1 Identity link

The identity link function with

 \bg(μ)=μ−\bb, \bphi(\bx)=Φ\bx+\bb (12)

is used for modeling the photon count in optical imaging [Snyder1993] and radiation activity in emission tomography [PrinceLinks2015, Ch. 9.2], as well as for astronomical image deconvolution [StarckMurtagh2006, Sec. 3.5.4]. Here, and are the known sensing matrix and intercept term, respectively; the intercept models background radiation and scattering determined, e.g., by calibration before the measurements have been collected. The nonnegative set in (2) satisfies (5), where we have used the fact that the elements of are nonnegative. If has zero components, is not empty and the NLL does not have a Lipschitz-continuous gradient.

Setting leads to the identity link without intercept used, e.g., in [Snyder1993, StarckMurtagh2006, Harmany2012].

#### Ii-A2 Log link

The log-link function

 \bg(μ)=−ln∘\PARENSsμ/\cI0, \bphi(\bx)=\cI0exp∘(−Φ\bx) (13)

has been used to account for the exponential attenuation of particles (e.g., in tomographic imaging), where is the incident energy before attenuation. The intercept term is often assumed known [Lange2013, Sec. 8.10]. The Poisson GLM with log link function is referred to as the log-linear model in [McCullagh1989, Ch. 6], which treats known and unknown as the same model.

Log link with unknown intercept. For unknown , (11a) does not hold because the underlying NLL is a function of both and . Substituting (13) into the NLL function, concentrating it with respect to , and ignoring constant terms yields the following convex concentrated (profile) NLL:

 \cL\tc(\bx)=1T\byln[1Texp∘(−Φ\bx)]+\byTΦ\bx (14)

see [NesterovTechReport, App. LABEL:report-app:derconcentratednllpoisson], where we also derive the Hessian of (14). Note that ; hence, any closed convex satisfies (5).

### Ii-B Linear Model with Gaussian Noise

Linear measurement model (1b) with zero-mean AWGN leads to the following scaled NLL:

 \cL(\bx)=12∥\by−Φ\bx∥22 (15)

where is the measurement vector and constant terms (not functions of ) have been ignored. This NLL belongs to the Gaussian GLM with identity link without intercept: . Here, , any closed convex satisfies (5), and the set is empty.

Minimization of the objective function (3a) with penalty (3b) and Gaussian NLL (15) can be thought of as an analysis BPDN problem with a convex signal constraint; see also [NesterovTechReport, gdasil14] which use the nonnegative in (2). A synthesis BPDN problem with a convex signal constraint was considered in [YamagishiYamada2009].

## Iii Reconstruction Algorithm

We propose a PNPG approach for minimizing (3a) that combines convex-set projection with Nesterov acceleration [Nesterov1983, Beck2009FISTA] and applies adaptive step size to adapt to the local curvature of the NLL and restart to ensure monotonicity of the resulting iteration. The pseudo code in Algorithm 1 summarizes our PNPG method.

Define the quadratic approximation of the NLL :

 Qβ\PARENSs\bx∣\wbx=L(\wbx)+(\bx−\wbx)T∇L(\wbx)+12β\norm\bx−\wbx22 (16)

with chosen so that (16) majorizes in the neighborhood of . Iteration  of the PNPG method proceeds as follows:

 (17a) (17d) (17e) (17aw)

We first prove Lemma LABEL:thm:lll and then derive the acceleration (17a)–(17e) and prove Theorem LABEL:thm:conv.

###### Proof:

According to Definition 1 and (LABEL:eq:proxgradstepE),

 ur(\bx) ≥ ur\PARENSs\bx(i)+(\bx−\bx(i))T[\wbx(i)−\bx(i)β(i)−∇\cL(\wbx(i))] (A1a) −\PARENSsε(i)22β(i) for any \bx∈\mathamsbbRp. Moreover, due to the convexity of \cL(\bx), we have \cL(\bx) ≥ \cL\PARENSs\wbx(i)+\PARENSs\bx−\wbx(i)T∇\cL\PARENSs\wbx(i). (A1b)

Summing (A1a), (A1b), and (LABEL:eq:majorCond) completes the proof.

The following result from [BertsekasOzdaglarNedic2003, Proposition 2.2.1] states that the distance between and can be reduced by projecting them onto a closed convex set .

###### Lemma 2 (Projection theorem)

The projection mapping onto a nonempty closed convex set is nonexpansive

 \norm\projC\bx−\projC\by22≤\norm\bx−\by22 (A2)

for all .

We now derive the Nesterov’s acceleration step (17)–(17e) with goal to select in (17aw) that achieves the convergence rate of .

Define sequences and , multiply them with (LABEL:eq:stari) and (LABEL:eq:im1i), respectively, add the resulting expressions, and multiply by to obtain

 \IEEEeqnarraymulticol3l−2β(i)\stepciΔ(i)+2β(i)\stepbiΔ(i−1) (A3) ≥ 1\stepci\normbig\stepci\step\bxi−\stepbi\step\bxi−1−\stepai\bx⋆22 −1\stepci\normbig\stepci\wbx(i)−\stepbi\step\bxi−1−\stepai\bx⋆22−c(i)\PARENSsε(i)2 = \stepci\SBRbig\stepti−\step¯ti−\PARENSsε(i)2

where

 \stepci \df \stepai+\stepbi (A4a) t(i) \df \norm\step\bxi−\bz(i)22,\step¯ti\df\norm\step\wbxi−\bz(i)22 (A4b) \bz(i) \df \stepbi\stepci\step\bxi−1+\stepai\stepci\bx⋆. (A4c)

We arranged (A3) using completion of squares so that the first two summands are similar (but with opposite signs), with goal to facilitate cancellations as we sum over . Since we have control over the sequences and , we impose the following conditions for :

 \stepci−1t(i−1) ≥ \stepci¯t(i) (A5a) π(i) ≥ 0 (A5b)

where

 π(i) \df β(i)c(i)−β(i+1)b(i+1). (A6)

Now, apply the inequality (A5a) to the right-hand sides of (A3):

 −2β(i)\stepciΔ(i)+2β(i)\stepbiΔ(i−1) ≥ \stepcit(i)−\stepci−1t(i−1) (A7a) −c(i)\PARENSsε(i)2 and sum (A7a) over i=1,2,…,k, which leads to summand cancellations and \IEEEeqnarraymulticol3l−2β(k)\stepckΔ(k)+2β(1)\stepb1Δ(0)−2k−1∑i=1π(i)Δ(i) (A7c) ≥\stepckt(k)−\stepc0t(0)−k∑i=1c(i)\PARENSsε(i)2 ≥−\stepc0t(0)−k∑i=1c(i)\PARENSsε(i)2 and (A7c) follows from (A7c) by discarding the nonnegative term \stepckt(k).

Now, due to , the inequality (A7c) leads to

 Δ(k)≤2β(1)\stepb1Δ(0)+\stepc0t(0)+∑ki=1c(i)\PARENSsε(i)22β(k)\stepck (A8)

with simple upper bound on the right-hand side, thanks to summand cancellations facilitated by the assumptions (A5).

As long as grows at a rate of and the inaccuracy of the proximal mappings leads to bounded , the centered objective function can achieve the desired bound decrease rate of . Now, we discuss how to satisfy (A5) and the growth rate of by an appropriate selection of .

### A-I Satisfying Conditions (A5)

#### A-Ia Imposing equality in (A5a)

(A5a) holds with equality for all and any when we choose that satisfy

 √\stepci−1\PARENSs\step\bxi−1−\step\bzi−1=√\stepci\PARENSs\what\bx(i)−\step\bzi. (A9)

Now, (A9) requires equal coefficients multiplying on both sides, thus for all , where is a constant (not a function of ), which implies and , see also (A4a). Upon defining

 \stepθi\dfw2\stepai (A10a) we have w2\stepci=\PARENSsθ(i)2, w2\stepbi=\PARENSsθ(i)2−θ(i). (A10b)

Plug (A10) into (A9) and reorganize to obtain the following form of momentum acceleration:

 \step\what\bxi=\step\bxi−1+\stepΘi\PARENSs\step\bxi−1−\step\bxi−2. (A11)

Although satisfies (A5a), it is not guaranteed to be within ; consequently, the proximal-mapping step for this selection may not be computable.

#### A-Ib Selecting \wbx(i)∈C that satisfies (A5a)

We now seek within that satisfies the inequality (A5a). Since and are in , by the convexity of ; see (A4c). According to Lemma 2, projecting (A11) onto preserves or reduces the distance between points. Therefore,

 \wbx(i)=PC\PARENSs\what\bx(i) (A12)

belongs to and satisfies the condition (A5a):

 \stepci−1t(i−1) = \stepci\norm\step\what\bxi−\bz(i)22 (A13a) ≥ \stepci\norm\step\wbxi−\bz(i)22=\stepci¯t(i) (A13b)

where (A13a) and (A13b) follow from (A9) and by using Lemma 2, respectively; see also (A4b).

Without loss of generality, set and rewrite and modify (A6), (A4b), and (A7c) using (A10) to obtain

 π(i) = β(i)\PARENSs\stepθi2 (A14a) −β(i+1)\stepθi+1\PARENSbigθ(i+1)−1,i≥1 \PARENSs\stepθi2t(i) = \normbigθ(i)\bx(i)−\PARENSbigθ(i)−1\bx(i−1)−\bx⋆22 (A14b) k−1∑i=1π(i)Δ(i) ≤ 12\SBRBig\PARENSs\stepθ02t(0)+k∑i=1\PARENSs\stepθiε(i)2 (A14c)

where (A14c) is obtained by discarding the negative term and the zero term (because ) on the left-hand side of (A7c). Now, (LABEL:eq:DueTo0Theta) follows from (A8) by using (see (17)), (A10), and (A14b) with .

#### A-Ic Satisfying (A5b)

By substituting (A14a) into (A5b), we obtain the conditions (LABEL:eq:thetaCond) and interpret as the sequence of gaps between the two sides of (LABEL:eq:thetaCond).

### A-Ii Connection to Convergence-Rate Analysis of Fista

If the step-size sequence is non-increasing (e.g., in the backtracking-only scenario with ), (17) with also satisfies the inequality (LABEL:eq:solvedTheta). In this case, (LABEL:eq:DueTo0Theta) still holds but (LABEL:eq:upperboundonDeltawithbetaonly) does not because (LABEL:eq:thetaGrow) no longer holds. However, because , we have and

 Δ(k) ≤ γ2\normbig\step\bx0−\bx⋆22+E(k)2β(k)(k+1)2 (A15)

which generalizes [Beck2009FISTA, Th. 4.4] to include the inexactness of the proximal operator and the convex-set projection.

## Appendix B Convergence of Iterates

To prove convergence of iterates, we need to show that the centered objective function decreases faster than the right-hand side of (LABEL:eq:upperboundonDeltawithbetaonly). We introduce Lemmas 3 and 4 and then use them to prove Theorem LABEL:thm:convItr. Throughout this Appendix, we assume that Assumption LABEL:th1cond of Theorem LABEL:thm:convItr holds, which justifies (LABEL:eq:twoineq) and (LABEL:eq:thetaCond0) as well as results from Appendix 17 that we use in the proofs.

###### Lemma 3

Under Assumptions LABEL:th1condLABEL:convitergammabcond of Theorem LABEL:thm:convItr,

 +∞∑i=1\PARENSbig2θ(i)−1δ(i)<+∞. (B1)
###### Proof:

By letting in (A14c) and using (LABEL:eq:Econverges), we obtain

 +∞∑i=1π(i)Δ(i)<+∞. (B2)

For , rewrite (A14a) using expressed in terms of (based on (17)):

 π(i) = β(i+1)γ\SBRBig(γ−2)θ(i+1)+1−bγ2γ (B3) ≥ γ−2γβ(i+1)θ(i+1)

where the inequality in (B3) is due to ; see Assumption LABEL:convitergammabcond. Apply nonexpansiveness of the projection operator to (LABEL:eq:im1i) and use (A11) to obtain

 2β(i)\PARENSs\stepΔi−1−\stepΔi ≥ δ(i)−\PARENSsΘ(i)2δ(i−1)−\PARENSsε(i)2 (B4)

then multiply both sides of (B4) by , sum over and reorganize:

 \IEEEeqnarraymulticol3lk−1∑i=1\PARENSs2θ(i)−1δ(i)≤\PARENSsθ(0)−12δ(0)−\PARENSsθ(k)2δ(k)+2β(1)Δ(0) (B5a) +E(k)+2k−1∑i=1ρ(i)Δ(i) ≤ 2β(1)Δ(0)+E(k)+4γ−2k−1∑i=1π(i)Δ(i) (B5b) where (see (A14a)) ρ(i) = β(i+1)\PARENSsθ(i+1)2−β(i)\PARENSsθ(i)2 (B5c) = β(i+1)θ(i+1)−π(i), (B5d)

and we drop the zero term and the negative term from (B5a) and use the fact that implied by (B3) to get (B5b). Finally, let and use (LABEL:eq:Econverges) and (B2) to conclude (B1).

###### Lemma 4

For ,

 Πj\df+∞∑k=jk∏ℓ=jΘ(ℓ) ≤ γθ(j−1)−1. (B6)
###### Proof:

For ,

 1√β(k−1)θ(k−1)θ(k) ≤ γ√β(k−1)θ(k−1)−γ√β(k)θ(k) (B7a) ≤ γ√β(k−2)θ(k−2)−γ√β(k)θ(k) (B7b)

where we obtain the inequality (B7a) by combining the terms on the right-hand size and using (LABEL:eq:grb1) and (B7b) holds because is an increasing sequence (see Section LABEL:sec:convergence_analysis). Now,

 Πj ≤ +∞∑k=jk∏ℓ=jβ(ℓ−2)\PARENSbigθ(ℓ−2)2β(ℓ−1)θ(ℓ−1)θ(ℓ)=+∞∑k=jβ(j−2)\PARENSbigθ(j−2)2θ(j−1)β(k−1)\PARENSbigθ(k−1)2θ(k) (B8a) \IEEEeqnarraymulticol3l≤γβ(j−2)\PARENSbigθ(j−2)2θ(j−1)√β(j−2)θ(j−2)√β(j−1)θ(j−1)=γ√B(j−1)θ(j−2) (B8b)

where (B8a) follows by using (17d), (LABEL:eq:thetaCond) with , and fraction-term cancellation; (B8b) is obtained by substituting (B7b) into (B8a) and canceling summation terms. (B8b) implies (B6) by using (LABEL:eq:grb1) with .

Define

 λ(i)\df\norm\bx(i)−\bx⋆22, Λ(i)\dfλ(i)−λ(i−1). (B9)

Since converges to as the iteration index grows and is a minimizer, it is sufficient to prove the convergence of , see [Chambolle2015Convergence, Th. 4.1].

###### Proof:

Use (LABEL:eq:stari) and the fact that to get

 0 ≥ λ(i)−\norm\wbx(i)−\bx⋆22−\PARENSsε(i)2. (B10)

Now,

 \IEEEeqnarraymulticol3l\norm\wbx(i)−\bx⋆22≤\norm\step\what\bxi−\bx⋆22=λ(i−1)+\PARENSsΘ(i)2δ(i−1) (B11a) +2Θ(i)\PARENSs\bx(i−1)−\bx⋆T\PARENSs\bx(i−1)−\bx(i−2) ≤ λ(i−1)+\PARENSsΘ(i)2δ(i−1)+Θ(i)\PARENSsΛ(i−1)+δ(i−1) (B11b)

where (B11a) and (B11b) follow by using the nonexpansiveness of the projection operator (see also (A11)) and the identity

 2(\ba−\bb)T(\ba−\bc)=\norm\ba−\bb22+\norm\ba−\bc22−\norm\bb−\bc22 (B12)

respectively. Combine the inequalities (B11b) and (B10) to get

 Λ(i) ≤ Θ(i)\SBRbigΛ(i−1)+\PARENSbigΘ(i)+1δ(i−1)+\PARENSsε(i)2 (B13a) ≤ Θ(i)\PARENSbigΛ(i−1)+2δ(i−1)/ξ+\PARENSsε(i)2 (B13b)

where (B13b) is due to (see (LABEL:eq:xi)) and the following

 Θ(i) < θ(i−1)θ(i)=√β(i−1)θ(i−1)√β(i)√β(i)θ(i)√β(i−1) (B14a) < √β(i)√β(i−1)≤1√ξ<1ξ (B14b)

where we have used (17d) and that is an increasing sequence, (see Section LABEL:sec:stepsize), and (LABEL:eq:xi).

According to (LABEL:eq:thetaGrow) and Assumption LABEL:stepsizeseqcond that the sequence is bounded, there exists an integer such that

 θ(j−1)≥2, Θ(j)≥1θ(j)>0 (B15)

for all , where the second inequality follows from the first and the definition of , see (17d). Then

 Ω(i) \df max\PARENSs0,Λ(i)≤Θ(i)\SBRbiggΩ(i−1)+2δ(i−1)ξ+\PARENSsε(i)2Θ(i) (B16a) \IEEEeqnarraymulticol3l≤i∑j=J\SBRbigg2δ(j−1)ξ+\PARENSsε(j)2Θ(j)i∏ℓ=jΘ(ℓ)+Ω(J−1)i∏ℓ=JΘ(ℓ) (B16b)

for , where the inequality in (B16a) follows by combining the inequalities (B13b) and and (B16b) follows by recursively applying inequality (B16a) with replace by . Now, sum the inequalities (B16b) over and exchange the order of summation over and on the right-hand side:

 +∞∑i=JΩ(i) ≤ +∞∑j=JΠj\SBRbigg2