1 Introduction
Problems involving the estimation of an unknown vector x in R^n from a set of noisy measurements
are important in many areas, including machine learning, image processing, and compressive sensing. Consider the scenario in Fig.
1, where a vector x passes through the measurement channel to produce the measurement vector y. When the estimation problem is ill-posed, it becomes essential to include the prior in the estimation process. However, in high-dimensional settings, it is often difficult to directly obtain the true prior, and one is hence restricted to various indirect sources of prior information on x. This paper considers the case where the prior information on x is specified only via a denoising function, D_σ, designed for the removal of additive white Gaussian noise (AWGN).

There has been considerable recent interest in leveraging denoisers as priors for the recovery of x. One popular strategy, known as plug-and-play priors (PnP) Venkatakrishnan.etal2013 , extends traditional proximal optimization Parikh.Boyd2014 by replacing the proximal operator with a general off-the-shelf denoiser. It has been shown that the combination of proximal algorithms with advanced denoisers, such as BM3D Dabov.etal2007 or DnCNN Zhang.etal2017 , leads to state-of-the-art performance for various imaging problems Chan.etal2016 ; Sreehari.etal2016 ; Ono2017 ; Kamilov.etal2017 ; Meinhardt.etal2017 ; Zhang.etal2017a ; Buzzard.etal2017 ; Sun.etal2018a ; Teodoro.etal2019 . A similar strategy has also been adopted in the context of a related class of algorithms known as approximate message passing (AMP) Tan.etal2015 ; Metzler.etal2016 ; Metzler.etal2016a ; Fletcher.etal2018 . Regularization-by-denoising (RED) Romano.etal2017 , and the closely related deep mean-shift priors Bigdeli.etal2017 , represent an alternative, in which the denoiser is used to specify an explicit regularizer that has a simple gradient. More recent work has clarified the existence of explicit RED regularizers Reehorst.Schniter2019 , demonstrated its excellent performance on phase retrieval Metzler.etal2018 , and further boosted its performance in combination with a deep image prior Mataev.Elad2019 .
In short, the use of advanced denoisers has proven to be essential for achieving state-of-the-art results in many contexts. However, solving the corresponding estimation problem remains a significant computational challenge, especially for the high-dimensional vectors x typical in modern applications.
In this work, we extend the current family of RED algorithms by introducing a new block coordinate RED (BC-RED) algorithm. The algorithm relies on random partial updates on x, which makes it scalable to vectors that would otherwise be prohibitively large for direct processing. Additionally, as we shall see, the overall computational complexity of BC-RED can sometimes be lower than that of corresponding methods operating on the full vector. This behavior is consistent with traditional coordinate descent methods, which can outperform their full-gradient counterparts by better reusing local updates and taking larger steps Nesterov2012 ; Beck.Tetruashvili2013 ; Wright2015 ; Fercoq.Gramfort2018 . We present two theoretical results related to BC-RED. We first theoretically characterize the convergence of the algorithm under a set of transparent assumptions on the data-fidelity and the denoiser. Our analysis complements the recent theoretical analysis of full-gradient RED algorithms in Reehorst.Schniter2019 by considering block-coordinate updates and establishing an explicit worst-case convergence rate. Our second result establishes backward compatibility of BC-RED with traditional proximal optimization. We show that when the denoiser corresponds to a proximal operator, BC-RED can be interpreted as an approximate MAP estimator, whose approximation error can be made arbitrarily small. To the best of our knowledge, this explicit link with proximal optimization is missing in the current literature on RED. BC-RED thus provides a flexible, scalable, and theoretically sound algorithm applicable to a wide variety of large-scale estimation problems. We demonstrate BC-RED on image recovery from linear measurements using several denoising priors, including those based on convolutional neural network (CNN) denoisers.
All proofs and some technical details have been omitted for space and are included in the supplement, which also provides more background and additional simulations.
2 Background
It is common to formulate the estimation in Figure 1 as an optimization problem
(1)   x̂ = argmin_{x ∈ R^n} f(x)   with   f(x) = g(x) + h(x)
where g is the data-fidelity term and h is the regularizer. For example, the maximum a posteriori probability (MAP) estimator is obtained by setting g(x) = −log p(y|x) and h(x) = −log p(x), where p(y|x) is the likelihood that depends on y and p(x) is the prior. One of the most popular data-fidelity terms is least-squares g(x) = (1/2)||y − Ax||_2^2, which assumes a linear measurement model under AWGN. Similarly, one of the most popular regularizers is based on a sparsity-promoting penalty h(x) = τ||Wx||_1, where W is a linear transform and τ > 0 is the regularization parameter Rudin.etal1992 ; Tibshirani1996 ; Candes.etal2006 ; Donoho2006 .

Many widely used regularizers, including the ones based on the ℓ1 norm, are nondifferentiable. Proximal algorithms Parikh.Boyd2014 , such as the proximal-gradient method (PGM) Figueiredo.Nowak2003 ; Daubechies.etal2004 ; Bect.etal2004 ; Beck.Teboulle2009 and the alternating direction method of multipliers (ADMM) Eckstein.Bertsekas1992 ; Afonso.etal2010 ; Ng.etal2010 ; Boyd.etal2011 , are a class of optimization methods that circumvent the need to differentiate nonsmooth regularizers by using the proximal operator
(2)   prox_{τh}(z) := argmin_{x ∈ R^n} { (1/2)||x − z||_2^2 + τ h(x) }
The observation that the proximal operator can be interpreted as the MAP denoiser for AWGN has prompted the development of PnP Venkatakrishnan.etal2013 , where the proximal operator within ADMM or PGM is replaced with a more general denoising function D_σ.
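As a concrete illustration (not part of the original text), the proximal operator of the ℓ1 norm is elementwise soft-thresholding, which can be viewed as the MAP denoiser for AWGN under a Laplacian prior; a minimal NumPy sketch, with the threshold and signal sizes chosen arbitrarily:

```python
import numpy as np

def prox_l1(z, tau):
    """Proximal operator of tau*||x||_1: elementwise soft-thresholding.

    Solves argmin_x { 0.5*||x - z||^2 + tau*||x||_1 }.
    """
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# Denoising a noisy sparse vector with the ell-1 MAP denoiser.
rng = np.random.default_rng(0)
x_true = np.zeros(100)
x_true[:5] = 5.0                      # a few large entries, the rest zero
z = x_true + 0.1 * rng.standard_normal(100)   # AWGN-corrupted observation
x_hat = prox_l1(z, 0.3)               # shrinks noise, keeps the large entries
```

The soft-thresholding step illustrates the general pattern: a denoiser maps a noisy vector to a "cleaner" one, and PnP simply substitutes any such map for the proximal operator.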
Consider the following alternative to PnP that also relies on a denoising function D_σ Bigdeli.etal2017 ; Romano.etal2017
(3)   zer(G) := {x ∈ R^n : G(x) = 0}   with   G(x) := ∇g(x) + τ H(x),   H(x) := x − D_σ(x)
Under some conditions on the denoiser, it is possible to relate G in (3) to some explicit regularization function h. For example, when the denoiser D_σ is locally homogeneous and has a symmetric Jacobian Romano.etal2017 ; Reehorst.Schniter2019 , the operator τH corresponds to the gradient of the following function
(4)   h(x) = (τ/2) x^T (x − D_σ(x))
On the other hand, when the denoiser corresponds to the minimum mean squared error (MMSE) estimator for the AWGN denoising problem Bigdeli.etal2017 ; Reehorst.Schniter2019 , D_σ(z) = E[x | z], with z = x + e and e ~ N(0, σ^2 I), the operator τH corresponds to the gradient of
(5)   h(x) = −τσ^2 log((p_x ∗ φ_σ)(x))
where φ_σ is the Gaussian probability density function of variance σ^2 and ∗ denotes convolution. In this paper, we will use the term RED to denote all methods seeking the fixed points of (3). The key benefits of the RED methods Romano.etal2017 ; Bigdeli.etal2017 ; Metzler.etal2018 ; Reehorst.Schniter2019 ; Mataev.Elad2019 are their explicit separation of the forward model from the prior, their ability to accommodate powerful denoisers (such as the ones based on CNNs) without differentiating them, and their state-of-the-art performance on a number of imaging tasks. The next section further extends the scalability of RED by designing a new block coordinate RED algorithm.

3 Block Coordinate RED
All the current RED algorithms operate on vectors in R^n. We propose BC-RED, shown in Algorithm 1, to allow for partial randomized updates on x. Consider the decomposition of R^n into b subspaces, R^n = R^{n_1} × ⋯ × R^{n_b}. For each i ∈ {1, …, b}, we define the matrix U_i ∈ R^{n×n_i} that injects a vector in R^{n_i} into R^n and its transpose U_i^T that extracts the ith block from a vector in R^n. Then, for any x ∈ R^n,
(6)   x = Σ_{i=1}^b U_i x_i   with   x_i := U_i^T x ∈ R^{n_i}
Note that (6) directly implies the norm preservation ||x||_2^2 = Σ_{i=1}^b ||x_i||_2^2 for any x ∈ R^n. We are interested in a block-coordinate algorithm that uses only a subset of operator outputs corresponding to the coordinates in some block i. Hence, for an operator G: R^n → R^n, we define the block-coordinate operator G_i as
(7)   G_i(x) := U_i^T G(x) ∈ R^{n_i},   i ∈ {1, …, b}
We introduce the following BCRED algorithm.
Note that when b = 1, we have G_1 = G and U_1 = I. Hence, the theoretical analysis in this paper is also applicable to the full-gradient RED algorithm in (3).
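Since Algorithm 1 is not reproduced above, the following is a minimal Python sketch of the BC-RED iteration with i.i.d. uniform block selection; the least-squares data-fidelity, the fixed coordinate partition, and the soft-thresholding denoiser are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

def bc_red(y, A, denoise, tau=1.0, gamma=0.1, num_blocks=4, num_iter=2000, seed=0):
    """Sketch of BC-RED: x <- x - gamma * U_i G_i(x) for a random block i, where
    G(x) = grad g(x) + tau*(x - D(x)) and g(x) = 0.5*||y - A x||^2."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(A.shape[1]), num_blocks)  # fixed partition
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        idx = blocks[rng.integers(num_blocks)]            # i.i.d. uniform block choice
        grad_i = A[:, idx].T @ (A @ x - y)                # block gradient of g
        G_i = grad_i + tau * (x[idx] - denoise(x)[idx])   # block of the RED operator
        x[idx] -= gamma * G_i                             # partial update of block i
    return x

# Illustrative run with a soft-thresholding denoiser (an assumption of this sketch).
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20)) / np.sqrt(100)
x_true = np.zeros(20)
x_true[:3] = 1.0
y = A @ x_true
x_hat = bc_red(y, A, soft)
```

Each iteration touches only one block of x, which is exactly what makes the method attractive when n is too large for full-vector updates.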
As with traditional coordinate descent methods (see Wright2015 for a review), BC-RED can be implemented using different block selection strategies. The strategy adopted for our theoretical analysis selects the block indices i_k as i.i.d. random variables distributed uniformly over {1, …, b}. An alternative is to proceed in epochs of b consecutive iterations, where at the start of each epoch the set {1, …, b} is reshuffled, and i_k is then selected consecutively from this ordered set. We numerically compare the convergence of both BC-RED variants in Section 5.

BC-RED updates its iterates one randomly picked block at a time using the output of G_i. When the algorithm converges, it converges to the vectors in the zero set zer(G) of the operator G.
Consider the following two sets: zer(∇g) := {x ∈ R^n : ∇g(x) = 0} and fix(D_σ) := {x ∈ R^n : D_σ(x) = x}, where zer(∇g) is the set of all critical points of the data-fidelity g and fix(D_σ) is the set of all fixed points of the denoiser D_σ. Intuitively, the fixed points of D_σ correspond to all the vectors that are not modified by denoising, and therefore can be interpreted as vectors that are noise-free according to the denoiser.
Note that if x* ∈ zer(∇g) ∩ fix(D_σ), then G(x*) = 0, and x* is one of the solutions of BC-RED. Hence, any vector that is consistent with the data for a convex g and noiseless according to D_σ is in the solution set. On the other hand, when zer(∇g) ∩ fix(D_σ) = ∅, zer(G) corresponds to a tradeoff between the two sets, explicitly controlled via the parameter τ (see Fig. 8 in the supplement for an illustration). This explicit control is one of the key differences between RED and PnP.
BC-RED benefits from considerable flexibility compared to the full-gradient RED. Since each update is restricted to only one block of x, the algorithm is suitable for parallel implementations and can deal with problems where the vector x is distributed in space and in time. However, the maximal benefit of BC-RED is achieved when G_i is efficient to evaluate. Fortunately, it was systematically shown in Peng.etal2016 that many operators—common in machine learning, image processing, and compressive sensing—admit coordinate-friendly updates.
For a specific example, consider the least-squares data-fidelity g(x) = (1/2)||y − Ax||_2^2 and a blockwise denoiser. Define the residual vector r := y − Ax and consider a single iteration of BC-RED that produces x⁺ by updating the ith block of x. Then, the update direction and the residual update can be computed as
(8)   G_i(x) = τ(x_i − D_{σ,i}(x)) − A_i^T r,   r⁺ = r − A_i(x_i⁺ − x_i)
where A_i := A U_i ∈ R^{m×n_i} is the submatrix of A consisting of the columns corresponding to the ith block. In many problems of practical interest Peng.etal2016 , the complexity of working with A_i is roughly b times lower than that of working with A. Also, many advanced denoisers can be effectively applied on image blocks rather than on the full image Elad.Aharon2006 ; Buades.etal2010 ; Zoran.Weiss2011 . Therefore, the speed of b iterations of BC-RED is expected to be at least comparable to a single iteration of the full-gradient RED (see also Section E.1 in the supplement).
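The incremental residual bookkeeping in (8) can be checked numerically; the sizes and the block chosen below are arbitrary assumptions, and the point is only that the O(m · n_i) update r ← r − A_i(x_i⁺ − x_i) matches recomputing r = y − Ax from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 12
block = slice(0, 4)                  # hypothetical i-th block of n_i = 4 coordinates
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

r = y - A @ x                        # maintained residual r = y - A x
x_new = x.copy()
x_new[block] += rng.standard_normal(4)               # update only the i-th block
r_new = r - A[:, block] @ (x_new[block] - x[block])  # O(m * n_i) incremental update

assert np.allclose(r_new, y - A @ x_new)  # agrees with the O(m * n) recomputation
```

This is the mechanism that lets b block iterations cost roughly as much as one full-gradient iteration.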
4 Convergence Analysis and Compatibility with Proximal Optimization
In this section, we present two theoretical results related to BC-RED. We first establish its convergence to an element of zer(G) and then discuss its compatibility with the theory of proximal optimization.
4.1 Fixed Point Convergence of BCRED
Our analysis requires three assumptions that together serve as sufficient conditions for convergence.
Assumption 1.
The operator G is such that zer(G) ≠ ∅. There is a finite number R₀ such that the distance of the initial iterate x⁰ to the farthest element of zer(G) is bounded, that is, ||x⁰ − x*||_2 ≤ R₀ for all x* ∈ zer(G).
This assumption is necessary to guarantee convergence and is related to the existence of the minimizers in the literature on traditional coordinate minimization Nesterov2012 ; Beck.Tetruashvili2013 ; Wright2015 .
The next two assumptions rely on Lipschitz constants along the directions specified by individual blocks. We say that an operator T: R^n → R^n is block i Lipschitz continuous with constant λ_i > 0 if ||T_i(x + U_i h_i) − T_i(x)||_2 ≤ λ_i ||h_i||_2 for all x ∈ R^n and h_i ∈ R^{n_i}.
When λ_i = 1, we say that T is block i nonexpansive. Note that if an operator is globally Lipschitz continuous, then it is straightforward to see that each T_i is also block Lipschitz continuous.
Assumption 2.
The function g is continuously differentiable and convex. Additionally, for each i ∈ {1, …, b}, the block gradient ∇_i g is block i Lipschitz continuous with constant L_i > 0. We define the largest block Lipschitz constant as L_max := max{L_1, …, L_b}.
Let L denote the global Lipschitz constant of ∇g. We always have L_max ≤ L and, for some g, it may even happen that L_max ≪ L Wright2015 . As we shall see, the largest possible step size of BC-RED depends on L_max, while that of the full-gradient RED depends on L. Hence, one natural advantage of BC-RED is that it can often take more aggressive steps than the full-gradient RED.
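For the least-squares data-fidelity, the block Lipschitz constants are L_i = ||A_i||_2^2, while the global constant is L = ||A||_2^2, so L_max ≤ L always holds. A quick numerical check under an assumed partition of the columns into four blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
blocks = np.array_split(np.arange(100), 4)   # assumed partition into 4 blocks

L_global = np.linalg.norm(A, 2) ** 2         # L = ||A||_2^2 (spectral norm squared)
L_max = max(np.linalg.norm(A[:, b], 2) ** 2 for b in blocks)  # max_i ||A_i||_2^2

# BC-RED may therefore use a step size on the order of 1/(L_max + tau),
# which can be larger than the full-gradient step 1/(L + tau).
assert 0 < L_max <= L_global
```

Each A_i is a column submatrix of A, so its spectral norm can never exceed that of A; the gap between L_max and L is what allows the more aggressive block steps.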
Assumption 3.
The denoiser D_σ is such that each block denoiser D_{σ,i} := U_i^T D_σ is block i nonexpansive.
Since the proximal operator is nonexpansive Parikh.Boyd2014 , it automatically satisfies this assumption. We revisit this scenario in greater depth in Section 4.2. We can now establish the following result for BC-RED.
Theorem 1.
A proof of the theorem is provided in the supplement. Theorem 1 establishes that the iterates of BC-RED can, in expectation, get arbitrarily close to zer(G), at the rate O(1/t). The proof relies on monotone operator theory Bauschke.Combettes2017 ; Ryu.Boyd2016 , widely used in the context of convex optimization Parikh.Boyd2014 , including in the unified analysis of various traditional coordinate descent algorithms Peng.etal2016a ; Chow.etal2017 .
Since L_max ≤ L, one important implication of Theorem 1 is that the worst-case convergence rate (in expectation) of b iterations of BC-RED is better than that of a single iteration of the full-gradient RED (to see this, note that the full-gradient rate is obtained by setting b = 1 and L_max = L, and removing the expectation in (9)). This implies that in coordinate-friendly settings (as discussed at the end of Section 3), the overall computational complexity of BC-RED can be lower than that of the full-gradient RED. This gain is primarily due to two factors: (a) the possibility to pick a larger step size γ; (b) the immediate reuse of each local block update when computing the next iterate (the full-gradient RED updates the full vector before computing the next iterate).
In the special case of τ(x − D_σ(x)) = ∇h(x), for some convex function h, BC-RED reduces to the traditional randomized coordinate descent method applied to (1). Hence, under the assumptions of Theorem 1, one can rely on the analysis of traditional coordinate descent methods in Wright2015 to obtain
(10) 
where f* is the minimum value of f in (1). A proof of (10) is provided in the supplement for completeness. Therefore, such denoisers lead to explicit convex RED regularizers and to convergence of BC-RED in terms of the objective. However, as discussed in Section 4.2, when the denoiser is a proximal operator of some convex h, BC-RED is not directly solving (1), but rather an approximation of it.
Finally, note that the analysis in Theorem 1 only provides sufficient conditions for the convergence of BC-RED. As corroborated by our numerical studies in Section 5, the actual convergence of BC-RED is more general and often holds beyond nonexpansive denoisers. One plausible explanation is that such denoisers are locally nonexpansive over the set of input vectors used in testing. On the other hand, recent techniques for the spectral normalization of CNNs Miyato.etal2018 ; Sedghi.etal2019 ; Gouk.etal2018 provide a convenient tool for building globally nonexpansive neural denoisers that result in provable convergence of BC-RED.
4.2 Convergence for Proximal Operators
One of the limitations of the current RED theory is its limited backward compatibility with the theory of proximal optimization. For example, as discussed in Romano.etal2017 (see the section “Can we mimic any prior?”), the popular total variation (TV) denoiser Rudin.etal1992 cannot be justified with the original RED regularization function (4). In this section, we show that BC-RED (and hence also the full-gradient RED) can be used to solve (1) for any convex, closed, and proper function h. We do this by establishing a formal link between RED and the concept of Moreau smoothing, widely used in nonsmooth optimization Moreau1965 ; Rockafellar.Wets1998 ; Yu2013 . In particular, we consider the following generic denoiser
(11)   D_σ(z) := prox_{(1/τ)h}(z) = argmin_{x ∈ R^n} { (1/2)||x − z||_2^2 + (1/τ) h(x) }
where h is a closed, proper, and convex function Parikh.Boyd2014 . Since the proximal operator is nonexpansive, it is also block nonexpansive, which means that Assumption 3 is automatically satisfied. Our analysis, however, requires an additional assumption involving the constant R₀ defined in Assumption 1.
Assumption 4.
There is a finite number G_∞ that bounds the largest subgradient of h, that is, ||u||_2 ≤ G_∞ for all u ∈ ∂h(x) and all x ∈ B_{R₀}(x⁰),
where B_R(x) denotes a ball of radius R, centered at x.
This assumption on the boundedness of the subgradients holds for a large number of regularizers used in practice, including both the TV and the ℓ1-norm penalties. We can now establish the following result.
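The link to Moreau smoothing can be verified numerically for the ℓ1 norm: with the denoiser D = prox_{(1/τ)h}, the residual τ(x − D(x)) equals the gradient of the Moreau envelope of h with parameter 1/τ, and the smoothing error is at most G_∞²/(2τ). A small sketch under these assumptions (the test point and τ are arbitrary):

```python
import numpy as np

def prox_l1(z, lam):
    """prox of lam*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def moreau_env_l1(x, lam):
    """Moreau envelope of ||.||_1 with parameter lam (elementwise Huber)."""
    p = prox_l1(x, lam)
    return np.sum(np.abs(p)) + np.sum((x - p) ** 2) / (2 * lam)

tau = 50.0                            # larger tau -> tighter approximation of h
x = np.array([1.5, -0.3, 0.002, 2.0])
lam = 1.0 / tau

# The RED residual tau*(x - D(x)) equals the gradient of the Moreau envelope,
# checked here against a central finite difference in the first coordinate.
g_red = tau * (x - prox_l1(x, lam))
eps = 1e-6
e0 = np.zeros_like(x)
e0[0] = eps
fd = (moreau_env_l1(x + e0, lam) - moreau_env_l1(x - e0, lam)) / (2 * eps)
assert abs(g_red[0] - fd) < 1e-4

# Smoothing error h(x) - h_{1/tau}(x) is bounded by G_inf^2/(2*tau); for the
# ell-1 norm on R^d the subgradient norm is at most sqrt(d), so G_inf^2 = d.
gap = np.sum(np.abs(x)) - moreau_env_l1(x, lam)
assert 0.0 <= gap <= x.size / (2 * tau)
```

The check makes the mechanism of Theorem 2 concrete: the RED operator with a proximal denoiser is an exact gradient of a smoothed regularizer, and the smoothing gap vanishes as τ grows.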
Theorem 2.
The theorem is proved in the supplement. It establishes that BC-RED in expectation approximates the solution of (1), with an approximation error bounded by G_∞²/(2τ). Hence, by jointly setting the parameter τ and the step size γ, one can establish the following worst-case convergence rate
(13) 
When h = −log p(x), the proximal operator (11) corresponds to the MAP denoiser, and the solution of BC-RED corresponds to an approximate MAP estimator. This approximation can be made as precise as desired by considering larger values of the parameter τ. Note that this further justifies the RED framework by establishing that it can be used to compute a minimizer of any proper, closed, and convex (but not necessarily differentiable) h. Therefore, our analysis strengthens RED by showing that it can accommodate a much larger class of explicit regularization functions, beyond those characterized in (4) and (5).
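As a toy 1D illustration of this approximate MAP behavior (the values of y and τ below are arbitrary assumptions), running the full-gradient RED iteration (the b = 1 case) with the proximal-operator denoiser of h(x) = |x| converges to the minimizer of the Moreau-smoothed problem, which approaches the exact MAP solution x* = 0 as τ grows:

```python
import numpy as np

def prox_abs(z, lam):
    """1D prox of lam*|.|: soft-thresholding."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def red_fixed_point(y, tau, iters=1000):
    """Full-gradient RED (the b = 1 case) for g(x) = 0.5*(x - y)^2 and the
    proximal-operator denoiser D = prox_{(1/tau)|.|}, with step 1/(1 + tau)."""
    gamma = 1.0 / (1.0 + tau)   # here the gradient of g has Lipschitz constant L = 1
    x = 0.0
    for _ in range(iters):
        x -= gamma * ((x - y) + tau * (x - prox_abs(x, 1.0 / tau)))
    return x

y = 0.5  # the exact MAP solution of 0.5*(x - y)^2 + |x| is x* = 0 since |y| <= 1
for tau in [1.0, 10.0, 100.0]:
    x_tau = red_fixed_point(y, tau)
    # the smoothed solution 0.5/(1 + tau) approaches x* = 0 as tau grows
    assert abs(x_tau - 0.5 / (1.0 + tau)) < 1e-6
```

In this example the RED fixed point minimizes the Huber-smoothed objective, whose minimizer 0.5/(1 + τ) tends to the exact nonsmooth MAP solution as τ increases, mirroring the τ-controlled tradeoff of Theorem 2.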
5 Numerical Validation
There has been considerable recent interest in using advanced priors in the context of image recovery from underdetermined (m < n) and noisy measurements. Recent work Romano.etal2017 ; Bigdeli.etal2017 ; Reehorst.Schniter2019 ; Metzler.etal2018 ; Mataev.Elad2019 suggests significant performance improvements due to advanced denoisers (such as BM3D Dabov.etal2007 or DnCNN Zhang.etal2017 ) over traditional sparsity-driven priors (such as TV Rudin.etal1992 ). Our goal is to complement these studies with several simulations validating our theoretical analysis and providing additional insights into BC-RED.
We consider inverse problems of the form y = Ax + e, where e is an AWGN vector and A ∈ R^{m×n} is a matrix corresponding to either a sparse-view Radon transform, an i.i.d. zero-mean Gaussian random matrix, or a radially subsampled two-dimensional Fourier transform. Such matrices are commonly used in the context of computerized tomography (CT) Kak.Slaney1988 , compressive sensing Candes.etal2006 ; Donoho2006 , and magnetic resonance imaging (MRI) Knoll.etal2011 , respectively. In all simulations, we set the measurement ratio m/n to approximately the same value, with AWGN corresponding to input signal-to-noise ratios (SNR) of 30 dB and 40 dB. The images used correspond to 10 images randomly selected from the NYU fastMRI dataset Zbontar.etal2018 (see Fig. 5 in the supplement). BC-RED is set to work with 16 blocks. The reconstruction quality is quantified using the SNR averaged over all ten test images.

In addition to well-studied denoisers, such as TV and BM3D, we design our own CNN denoiser, denoted DnCNN*, which is a simplified version of the popular DnCNN denoiser (see Supplement E for details). This simplification reduces the computational complexity of denoising, which is important when running many iterations of BC-RED. Additionally, it makes it easier to control the global Lipschitz constant of the CNN via spectral normalization Sedghi.etal2019 . We train DnCNN* for the removal of AWGN at four noise levels σ. For each experiment, we select the denoiser achieving the highest SNR value. Note that the parameter σ of BM3D is also fine-tuned for each experiment over the same set of noise levels.
Table 1: Average SNR (dB) for the three measurement matrices at input noise levels of 30 dB and 40 dB.

Methods            |     Radon     |     Random    |    Fourier
                   | 30 dB | 40 dB | 30 dB | 40 dB | 30 dB | 40 dB
PGM (TV)           | 20.66 | 24.40 | 26.07 | 28.42 | 28.74 | 29.99
UNet               | 21.90 | 21.72 | 16.37 | 16.40 | 22.11 | 22.11
RED (TV)           | 20.79 | 24.46 | 25.64 | 28.67 |   –   | 29.97
BC-RED (TV)        | 20.78 | 24.42 | 25.70 | 28.71 |   –   | 29.99
RED (BM3D)         |   –   |   –   | 26.46 | 27.82 | 28.89 | 29.79
BC-RED (BM3D)      |   –   |   –   | 26.52 | 27.89 | 28.85 | 29.80
RED (DnCNN*)       | 20.89 | 24.38 |   –   |   –   | 28.05 |   –
BC-RED (DnCNN*)    | 20.88 | 24.42 |   –   |   –   | 28.12 |   –
Theorem 1 establishes that the sequence of iterates generated by BC-RED converges in expectation to an element of zer(G). This is illustrated in Fig. 2 (left) for the Radon matrix and a nonexpansive denoiser (see also Fig. 6 in the supplement). The average value of ||G(x^t)||_2^2 is plotted against the iteration number for the full-gradient RED and BC-RED, with b updates of BC-RED (each modifying a single block) represented as one iteration. We numerically tested two block selection rules for BC-RED (i.i.d. and epoch) and observed that processing in randomized epochs leads to faster convergence. For reference, the figure also plots the normalized squared norm of the gradient mapping vectors produced by the traditional PGM with TV Beck.Teboulle2009a . The shaded areas indicate the range of values taken over the runs corresponding to each test image. The results highlight the potential of BC-RED to enjoy a better convergence rate than the full-gradient RED, with BC-RED (epoch) reaching a given accuracy in 104 iterations, while the full-gradient RED needs 190 iterations to achieve the same accuracy.
Theorem 2 establishes that for proximal-operator denoisers, BC-RED computes an approximate solution to (1) with an accuracy controlled by the parameter τ. This is illustrated in Fig. 2 (right) for the Fourier matrix and the TV-regularized least-squares problem. The average value of f(x^t) − f* is plotted against the iteration number for BC-RED with several values of τ. The optimal value f* is obtained by running the traditional PGM until convergence. As before, the figure groups b updates of BC-RED into a single iteration. The results are consistent with our theoretical analysis and show that as τ increases, BC-RED provides an increasingly accurate solution to TV. On the other hand, since the range of possible values for the step size γ depends on τ, the speed of convergence to f* is also influenced by τ.
The benefits of the full-gradient RED algorithms have been well discussed in prior work Romano.etal2017 ; Bigdeli.etal2017 ; Reehorst.Schniter2019 ; Metzler.etal2018 ; Mataev.Elad2019 . Table 1 summarizes the average SNR performance of BC-RED in comparison to the full-gradient RED for all three matrix types and several priors. Unlike the full-gradient RED, BC-RED is implemented using blockwise denoisers that work on image patches rather than on the full images. We empirically found that 40-pixel padding on the denoiser input is sufficient for BC-RED to match the performance of the full-gradient RED. The table also includes the results for the traditional PGM with TV Beck.Teboulle2009a and the widely used end-to-end UNet approach Jin.etal2017a ; Han.etal2017 . The latter first backprojects the measurements into the image domain and then denoises the result using UNet Ronneberger.etal2015 . The model was specifically trained end-to-end for the Radon matrix with 30 dB noise and applied as such to the other measurement settings. All the algorithms were run until convergence with hyperparameters optimized for SNR. The DnCNN* denoiser in the table corresponds to the residual network with a Lipschitz constant of two (see Supplement E.2 for details). First, note the excellent agreement between BC-RED and the full-gradient RED. This close agreement between the two methods is encouraging: BC-RED relies on blockwise denoising and our analysis does not establish uniqueness of the solution, yet, in practice, both methods yield solutions of nearly identical quality. Second, note that BC-RED and RED provide excellent approximations to the PGM-TV solutions. Third, note how (unlike UNet) BC-RED and RED with DnCNN* generalize to different measurement models. Finally, no prior seems to be universally good on all measurement settings, which points to the potential benefit of tailoring specific priors to specific measurement models.

Coordinate descent methods are known to be highly beneficial in problems where both m and n are very large, but each measurement depends only on a small subset of the unknowns Niu.etal2011 . Fig. 3 demonstrates BC-RED in such a large-scale setting by adopting the experimental setup from a recent work Farrens.etal2017 (see also Fig. 10 in the supplement). Specifically, we consider the recovery of a large galaxy image degraded by 597 known point spread functions (PSFs) corresponding to different spatial locations (see Supplement F for details). The natural sparsity of the problem makes it ideal for BC-RED, which is implemented to update pixel blocks in a randomized fashion by only picking areas containing galaxies. The computational complexity of BC-RED is further reduced by considering a simpler variant of DnCNN* that has only four convolutional layers (see Fig. 4 in the supplement). For comparison, we additionally show the result obtained by using the low-rank recovery method from Farrens.etal2017 with all the parameters kept at the values set by the authors.
Note that our intent here is not to justify DnCNN* as a prior for image deblurring, but to demonstrate that BC-RED can indeed be applied to a realistic, nontrivial image recovery task on a large image.
6 Conclusion and Future Work
Coordinate descent methods have become increasingly important in optimization for solving large-scale problems arising in data analysis. We have introduced BC-RED as a coordinate descent extension of the current family of RED algorithms and theoretically analyzed its convergence. Preliminary experiments suggest that BC-RED can be an effective tool for large-scale estimation problems arising in image recovery. More experiments are certainly needed to better assess the promise of this approach in various estimation tasks. For future work, we would like to explore accelerated and asynchronous variants of BC-RED to further enhance its performance in parallel settings.
Acknowledgments
This material is based upon work supported by NSF award CCF-1813910 and by NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
 (1) S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proc. IEEE Global Conf. Signal Process. and Inf. Process. (GlobalSIP), 2013.
 (2) N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
 (3) K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, August 2007.
 (4) K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, July 2017.
 (5) S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Trans. Comp. Imag., vol. 3, no. 1, pp. 84–98, March 2017.

 (6) S. Sreehari et al., “Plug-and-play priors for bright field electron tomography and sparse interpolation,” IEEE Trans. Comp. Imag., vol. 2, no. 4, pp. 408–423, December 2016.
 (7) S. Ono, “Primal-dual plug-and-play image restoration,” IEEE Signal Process. Lett., vol. 24, no. 8, pp. 1108–1112, 2017.
 (8) U. S. Kamilov, H. Mansour, and B. Wohlberg, “A plug-and-play priors approach for solving nonlinear imaging inverse problems,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1872–1876, December 2017.
 (9) T. Meinhardt, M. Moeller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), 2017.

 (10) K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
 (11) G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, “Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium,” SIAM J. Imaging Sci., vol. 11, no. 3, pp. 2001–2020, 2018.
 (12) Y. Sun, B. Wohlberg, and U. S. Kamilov, “An online plug-and-play algorithm for regularized image reconstruction,” IEEE Trans. Comput. Imaging, 2019.
 (13) A. M. Teodoro, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “A convergent image fusion algorithm using scene-adapted Gaussian-mixture-based denoising,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 451–463, Jan. 2019.
 (14) J. Tan, Y. Ma, and D. Baron, “Compressive imaging via approximate message passing with image denoising,” IEEE Trans. Signal Process., vol. 63, no. 8, pp. 2085–2092, Apr. 2015.
 (15) C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, September 2016.
 (16) C. A. Metzler, A. Maleki, and R. Baraniuk, “BM3D-PRGAMP: Compressive phase retrieval based on BM3D denoising,” in Proc. IEEE Int. Conf. Image Proc., 2016.
 (17) A. Fletcher, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Advances in Neural Information Processing Systems 31, 2018.
 (18) Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
 (19) S. A. Bigdeli, M. Jin, P. Favaro, and M. Zwicker, “Deep mean-shift priors for image restoration,” in Proc. Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, Dec. 2017.
 (20) E. T. Reehorst and P. Schniter, “Regularization by denoising: Clarifications and new interpretations,” IEEE Trans. Comput. Imag., vol. 5, no. 1, pp. 52–67, Mar. 2019.
 (21) C. A. Metzler, P. Schniter, A. Veeraraghavan, and R. G. Baraniuk, “prDeep: Robust phase retrieval with a flexible deep network,” in Proc. 35th Int. Conf. Machine Learning (ICML), 2018.
 (22) G. Mataev, M. Elad, and P. Milanfar, “DeepRED: Deep image prior powered by RED,” 2019, arXiv:1903.10176 [cs.CV].
 (23) Y. Nesterov, “Efficiency of coordinate descent methods on hugescale optimization problems,” SIAM J. Optim., vol. 22, no. 2, pp. 341–362, 2012.
 (24) A. Beck and L. Tetruashvili, “On the convergence of block coordinate descent type methods,” SIAM J. Optim., vol. 23, no. 4, pp. 2037–2060, Oct. 2013.
 (25) S. J. Wright, “Coordinate descent algorithms,” Math. Program., vol. 151, no. 1, pp. 3–34, June 2015.

 (26) O. Fercoq and A. Gramfort, “Coordinate descent methods,” lecture notes for Optimization for Data Science, École polytechnique, 2018.
 (27) L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, no. 1–4, pp. 259–268, November 1992.
 (28) R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
 (29) E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.
 (30) D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
 (31) M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration,” IEEE Trans. Image Process., vol. 12, no. 8, pp. 906–916, August 2003.
 (32) I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, November 2004.
 (33) J. Bect, L. Blanc-Feraud, G. Aubert, and A. Chambolle, “A unified variational framework for image restoration,” in Proc. ECCV, New York: Springer, 2004, vol. 3024, pp. 1–13.
 (34) A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
 (35) J. Eckstein and D. P. Bertsekas, “On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Math. Program., vol. 55, pp. 293–318, 1992.
 (36) M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2345–2356, September 2010.
 (37) M. K. Ng, P. Weiss, and X. Yuan, “Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods,” SIAM J. Sci. Comput., vol. 32, no. 5, pp. 2710–2736, August 2010.
 (38) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
 (39) Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin, “Coordinate-friendly structures, algorithms and applications,” Adv. Math. Sci. Appl., vol. 1, no. 1, pp. 57–119, Apr. 2016.
 (40) M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, December 2006.
 (41) A. Buades, B. Coll, and J. M. Morel, “Image denoising methods. A new nonlocal principle,” SIAM Rev, vol. 52, no. 1, pp. 113–147, 2010.
 (42) D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), 2011.
 (43) H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, 2nd ed., 2017.
 (44) E. K. Ryu and S. Boyd, “A primer on monotone operator methods,” Appl. Comput. Math., vol. 15, no. 1, pp. 3–43, 2016.
 (45) Z. Peng, Y. Xu, M. Yan, and W. Yin, “ARock: An algorithmic framework for asynchronous parallel coordinate updates,” SIAM J. Sci. Comput., vol. 38, no. 5, pp. A2851–A2879, 2016.
 (46) Y. T. Chow, T. Wu, and W. Yin, “Cyclic coordinateupdate algorithms for fixedpoint problems: Analysis and applications,” SIAM J. Sci. Comput., vol. 39, no. 4, pp. A1280–A1300, 2017.
 (47) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations (ICLR), 2018.

 (48) H. Sedghi, V. Gupta, and P. M. Long, “The singular values of convolutional layers,” in International Conference on Learning Representations (ICLR), New Orleans, LA, USA, May 2019.
 (49) H. Gouk, E. Frank, B. Pfahringer, and M. Cree, “Regularisation of neural networks by enforcing Lipschitz continuity,” 2018, arXiv:1804.04368.
 (50) J. J. Moreau, “Proximité et dualité dans un espace hilbertien,” Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
 (51) R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer, 1998.
 (52) Y.-L. Yu, “Better approximation and faster algorithm using the proximal average,” in Proc. Advances in Neural Information Processing Systems 26, Lake Tahoe, CA, USA, December 5–10, 2013, pp. 458–466.
 (53) A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, IEEE, 1988.
 (54) F. Knoll, K. Brendies, T. Pock, and R. Stollberger, “Second order total generalized variation (TGV) for MRI,” Magn. Reson. Med., vol. 65, no. 2, pp. 480–491, February 2011.
 (55) Zbontar et al., “fastMRI: An open dataset and benchmarks for accelerated MRI,” 2018, arXiv:1811.08839.
 (56) A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, November 2009.
 (57) K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sept. 2017.

 (58) Y. S. Han, J. Yoo, and J. C. Ye, “Deep learning with domain adaptation for accelerated projection-reconstruction MR,” Magn. Reson. Med., vol. 80, no. 3, pp. 1189–1205, Sept. 2017.
 (59) O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

 (60) F. Niu, B. Recht, C. Ré, and S. J. Wright, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Proc. Advances in Neural Information Processing Systems 24, Granada, Spain, December 12–15, 2011, pp. 693–701.
 (61) S. Farrens, F. M. Ngolè Mboula, and J.-L. Starck, “Space variant deconvolution of galaxy survey images,” A&A, vol. 601, pp. A66, 2017.
 (62) S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.
 (63) Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 2004.
 (64) Mandelbaum et al., “The third gravitational lensing accuracy testing (GREAT3) challenge handbook,” Astrophys. J. Suppl. S., vol. 212, no. 1, pp. 5, Aug. 2014.
 (65) Cropper et al., “VIS: The visible imager for Euclid,” in Proc. SPIE, 2012, vol. 8442.
 (66) T. Kuntzer, M. Tewes, and F. Courbin, “Stellar classification from singleband imaging using machine learning,” A&A, vol. 591, pp. A54, 2016.
Appendix A Proof of Theorem 1
Fixed-point convergence of averaged operators is well-known under the name of the Krasnosel’skii–Mann theorem (see Section 5.2 in Bauschke.Combettes2017 ) and was recently applied to the analysis of PnP Sun.etal2018a and of several full-gradient RED algorithms in Reehorst.Schniter2019 . Our analysis here extends these results to the block-coordinate setting and provides explicit worst-case convergence rates for BC-RED.
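As a self-contained numerical illustration of the Krasnosel’skii–Mann iteration itself (not of the BC-RED operators, and with a hypothetical nonexpansive map standing in for the update operator), the averaged iteration x ← x + λ(T(x) − x) with λ ∈ (0, 1) drives x to a fixed point of T:

```python
# Krasnosel'skii-Mann fixed-point iteration x <- x + lam * (T(x) - x).
# T below is a hypothetical nonexpansive (in fact contractive) map on R^2;
# its unique fixed point is (2, 2).

def T(x):
    return [0.5 * x[0] + 0.5 * x[1], 0.5 * x[1] + 1.0]

def km_iterate(x, lam=0.5, iters=200):
    for _ in range(iters):
        tx = T(x)
        x = [xi + lam * (ti - xi) for xi, ti in zip(x, tx)]
    return x

x_star = km_iterate([0.0, 0.0])
# x_star is numerically a fixed point: T(x_star) matches x_star, near (2, 2)
```

The averaging parameter λ plays the role of the stepsize: any λ ∈ (0, 1) yields convergence, trading per-iteration progress for stability.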
We consider the following operators
and proceed in several steps.


From the definition of and the fact that is block nonexpansive, we know that is block nonexpansive.

Consider any , an index picked uniformly at random, and a single iteration of BC-RED. Define a vector . We then have
(14) where in the third line we used , in the fourth line the block cocoercivity of , and in the last line the fact that .

By taking a conditional expectation on both sides and rearranging the terms, we obtain

Hence by averaging over iterations and taking the total expectation
The last inequality directly leads to the result.
Remark. Eq. (4) implies that, under Assumptions 1–3, the iterates of BC-RED satisfy
(15) 
which means that the distance of the iterates of BC-RED to the fixed-point set is nonincreasing.
Remark. Suppose we are solving a coordinate-friendly problem Peng.etal2016 , in which the cost of a full gradient update is b times the cost of a single block update, where b is the number of blocks. Consider the stepsize given by the inverse of the global Lipschitz constant of the gradient. A similar analysis to the one above yields the following convergence rate for the full-gradient method
Now, consider the stepsize and suppose that we run updates of BC-RED with . Then, we have that
Since , where the upper bound can sometimes be tight, we conclude that the expected complexity of the block-coordinate algorithm is lower than that of the full-gradient algorithm.
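The coordinate-friendly structure can be made concrete on a least-squares data term f(x) = ½‖Ax − y‖² (a hypothetical choice used only for illustration): by caching the residual r = Ax − y, a single-coordinate gradient step touches one column of A, so a full sweep over all coordinates costs about the same as one full-gradient evaluation:

```python
# Coordinate-friendly updates for f(x) = 0.5 * ||A x - y||^2 (illustrative).
# The cached residual r = A x - y makes a single-coordinate gradient step
# cost O(m) instead of the O(m n) cost of recomputing the full gradient.

def coord_step(A, x, r, j, step):
    """Gradient step on coordinate j; keeps the residual cache r consistent."""
    grad_j = sum(A[i][j] * r[i] for i in range(len(A)))  # j-th partial of f
    delta = -step * grad_j
    x[j] += delta
    for i in range(len(A)):  # incremental residual update: r += delta * A[:, j]
        r[i] += A[i][j] * delta

A = [[1.0, 0.0], [0.0, 2.0]]
y = [1.0, 2.0]
x = [0.0, 0.0]
r = [A[i][0] * x[0] + A[i][1] * x[1] - y[i] for i in range(2)]  # r = A x - y
for k in range(200):
    coord_step(A, x, r, k % 2, step=0.2)
# x approaches the least-squares solution [1.0, 1.0]
```

The same residual-caching pattern is what Peng.etal2016 formalize as a coordinate-friendly structure.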
Appendix B Proof of Theorem 2
The concept of Moreau smoothing is well-known and has been used extensively in other contexts (see, for example, Yu2013 ). Our contribution is to formally connect this concept to RED-based algorithms, which leads to a novel justification of RED as an approximate MAP estimator. A basic review of the relevant concepts from proximal optimization is given in Supplement D.4.
For , we consider the Moreau envelope of
From Proposition 9 in Supplement D.4 we know that
(16) 
and from Proposition 8 in Supplement D.4, we know that
(17) 
Hence, we can express the function as follows
where . From eq. (17), we conclude that a single iteration of BC-RED
performs block-coordinate descent on the function . From eq. (16) and the convexity of the Moreau envelope, we have
Hence, there exists a finite such that with . Consider the iteration of BC-RED; then we have that
where we applied (10), which is further discussed in Supplement C.
The proof of eq. (13) is directly obtained by setting , , and noting that , for all .
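A concrete one-dimensional instance of Moreau smoothing (with the hypothetical choice h(x) = |x|, not the regularizer used in this paper) can be checked by hand: the proximal operator is soft-thresholding, the Moreau envelope is the Huber function, and the gradient of the envelope equals (x − prox(x))/λ:

```python
# Moreau envelope of the hypothetical choice h(x) = |x| with parameter lam.
# prox_{lam h} is soft-thresholding, the envelope is the Huber function, and
# grad(envelope)(x) = (x - prox(x)) / lam.

def prox_abs(x, lam):
    # soft-thresholding: argmin_z { |z| + (1 / (2 * lam)) * (z - x)**2 }
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env(x, lam):
    z = prox_abs(x, lam)
    return abs(z) + (z - x) ** 2 / (2.0 * lam)

lam = 0.5
env = moreau_env(2.0, lam)                  # 1.75, i.e. |x| - lam / 2 for |x| >= lam
slope = (2.0 - prox_abs(2.0, lam)) / lam    # 1.0, the envelope's gradient at x = 2
near0 = moreau_env(0.2, lam)                # approx. x**2 / (2 * lam) for |x| < lam
```

The envelope is smooth everywhere (quadratic near the kink, linear in the tails), which is exactly the property the proof exploits.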
Appendix C Convergence of the Traditional Coordinate Descent
The following analysis is adapted from Wright2015 ; we include it here for completeness.
Consider the following denoiser
and the following function
where and are both convex and continuously differentiable. For this denoiser, we have that
Therefore, in this case, BC-RED minimizes a convex and smooth function , which means that any is a global minimizer of . Additionally, due to Proposition 2 in Supplement D.1 and Proposition 7 in Supplement D.3, we have
Hence, for such denoisers, Assumption 3 is equivalent to the Lipschitz smoothness of the block gradients .
To prove eq. (10), we consider the following iteration
which under our assumptions is a special case of the setting for Theorem 1.


From the block Lipschitz continuity of , we conclude that
where the last inequality comes from the fact that .

For all , define
Then from (a), we can conclude that
where in the last inequality we used Jensen’s inequality and the fact that

From convexity, we know that
where in the last inequality we used eq. (15). This, combined with the result of (b), implies that

Note that from (c), we can obtain
By iterating this inequality, we get the final result
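The reduction of BC-RED to coordinate descent for gradient-type denoisers can also be checked numerically. The sketch below uses the hypothetical choices g(x) = ½‖x − y‖² and h(x) = ¼‖x‖², so that D(x) = x − ∇h(x) = x/2 and the minimizer of g + τh is y/(1 + τ/2):

```python
# If D(x) = x - grad_h(x) for a smooth convex h, the BC-RED coordinate update
#     x_j <- x_j - mu * ( grad_j g(x) + tau * (x - D(x))_j )
# is plain coordinate descent on f = g + tau * h.  Tiny 2-D check with the
# hypothetical choices g(x) = 0.5 * ||x - y||^2 and h(x) = 0.25 * ||x||^2.

y = [2.0, -1.0]
tau, mu = 1.0, 0.5

def denoiser(x):
    # D = I - grad_h, with grad_h(x) = 0.5 * x for h(x) = 0.25 * ||x||^2
    return [0.5 * xi for xi in x]

def bcred_coord_step(x, j):
    grad_g_j = x[j] - y[j]        # j-th partial derivative of g
    h_j = x[j] - denoiser(x)[j]   # j-th component of x - D(x)
    x[j] -= mu * (grad_g_j + tau * h_j)

x = [0.0, 0.0]
for k in range(100):
    bcred_coord_step(x, k % 2)
# x approaches the minimizer of g + tau * h, namely y / 1.5 = [4/3, -2/3]
```

With tau = 1, the iterates converge to y/1.5, matching the first-order condition (x − y) + τ∇h(x) = 0 of the explicit objective.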
Appendix D Background Material
The results in this section are well-known in the optimization literature and can be found in different forms in standard textbooks Rockafellar.Wets1998 ; Boyd.Vandenberghe2004 ; Nesterov2004 ; Bauschke.Combettes2017 . For completeness, we summarize the key results useful for our analysis by restating them in a block-coordinate form.
D.1 Properties of Block-Coordinate Operators
Most of the concepts in this part come from the traditional monotone operator theory Ryu.Boyd2016 ; Bauschke.Combettes2017 , adapted for block-coordinate operators.
Definition 1.
We define the block-coordinate operator