Block Coordinate Regularization by Denoising

05/13/2019, by Yu Sun et al., Washington University in St. Louis

We consider the problem of estimating a vector from its noisy measurements using a prior specified only through a denoising function. Recent work on plug-and-play priors (PnP) and regularization-by-denoising (RED) has shown the state-of-the-art performance of estimators under such priors in a range of imaging tasks. In this work, we develop a new block coordinate RED algorithm that decomposes a large-scale estimation problem into a sequence of updates over a small subset of the unknown variables. We theoretically analyze the convergence of the algorithm and discuss its relationship to the traditional proximal optimization. Our analysis complements and extends recent theoretical results for RED-based estimation methods. We numerically validate our method using several denoiser priors, including those based on convolutional neural network (CNN) denoisers.


1 Introduction

Problems involving the estimation of an unknown vector x ∈ R^n from a set of noisy measurements y ∈ R^m are important in many areas, including machine learning, image processing, and compressive sensing. Consider the scenario in Fig. 1, where a vector x passes through the measurement channel p_{y|x} to produce the measurement vector y. When the estimation problem is ill-posed, it becomes essential to include the prior p_x in the estimation process. However, in high-dimensional settings, it is often difficult to directly obtain the true prior, and one is hence restricted to various indirect sources of prior information on x. This paper considers the cases where the prior information on x is specified only via a denoising function, D_σ : R^n → R^n, designed for the removal of additive white Gaussian noise (AWGN).

There has been considerable recent interest in leveraging denoisers as priors for the recovery of x. One popular strategy, known as plug-and-play priors (PnP) Venkatakrishnan.etal2013 , extends traditional proximal optimization Parikh.Boyd2014 by replacing the proximal operator with a general off-the-shelf denoiser. It has been shown that the combination of proximal algorithms with advanced denoisers, such as BM3D Dabov.etal2007 or DnCNN Zhang.etal2017 , leads to state-of-the-art performance for various imaging problems Chan.etal2016 ; Sreehari.etal2016 ; Ono2017 ; Kamilov.etal2017 ; Meinhardt.etal2017 ; Zhang.etal2017a ; Buzzard.etal2017 ; Sun.etal2018a ; Teodoro.etal2019 . A similar strategy has also been adopted in the context of a related class of algorithms known as approximate message passing (AMP) Tan.etal2015 ; Metzler.etal2016 ; Metzler.etal2016a ; Fletcher.etal2018 . Regularization-by-denoising (RED) Romano.etal2017 , and the closely related deep mean-shift priors Bigdeli.etal2017 , represent an alternative, in which the denoiser is used to specify an explicit regularizer that has a simple gradient. More recent work has clarified the existence of explicit RED regularizers Reehorst.Schniter2019 , demonstrated the excellent performance of RED on phase retrieval Metzler.etal2018 , and further boosted its performance in combination with a deep image prior Mataev.Elad2019 . In short, the use of advanced denoisers has proven to be essential for achieving state-of-the-art results in many contexts. However, solving the corresponding estimation problem is still a significant computational challenge, especially in the context of the high-dimensional vectors x typical in modern applications.

In this work, we extend the current family of RED algorithms by introducing a new block coordinate RED (BC-RED) algorithm. The algorithm relies on random partial updates on x, which makes it scalable to vectors that would otherwise be prohibitively large for direct processing. Additionally, as we shall see, the overall computational complexity of BC-RED can sometimes be lower than that of corresponding methods operating on the full vector. This behavior is consistent with traditional coordinate descent methods, which can outperform their full-gradient counterparts by better reusing local updates and taking larger steps Nesterov2012 ; Beck.Tetruashvili2013 ; Wright2015 ; Fercoq.Gramfort2018 . We present two theoretical results related to BC-RED. We first theoretically characterize the convergence of the algorithm under a set of transparent assumptions on the data-fidelity term and the denoiser. Our analysis complements the recent theoretical analysis of full-gradient RED algorithms in Reehorst.Schniter2019 by considering block-coordinate updates and establishing an explicit worst-case convergence rate. Our second result establishes backward compatibility of BC-RED with traditional proximal optimization. We show that when the denoiser corresponds to a proximal operator, BC-RED can be interpreted as an approximate MAP estimator, whose approximation error can be made arbitrarily small. To the best of our knowledge, this explicit link with proximal optimization is missing in the current literature on RED. BC-RED thus provides a flexible, scalable, and theoretically sound algorithm applicable to a wide variety of large-scale estimation problems. We demonstrate BC-RED on image recovery from linear measurements using several denoising priors, including those based on convolutional neural network (CNN) denoisers.

All proofs and some technical details have been omitted from the main text for space and are included in the supplement, which also provides more background and additional simulations.

Figure 1: The estimation problem considered in this work. The vector x, with a prior p_x, passes through the measurement channel p_{y|x} to result in the measurements y. The estimation algorithm does not have direct access to the prior, but can rely on a denoising function D_σ, specifically designed for the removal of additive white Gaussian noise (AWGN). We propose block coordinate RED as a scalable algorithm for estimating x given y and D_σ.

2 Background

It is common to formulate the estimation in Figure 1 as an optimization problem

x̂ = argmin_{x ∈ R^n} f(x)   with   f(x) = g(x) + h(x),   (1)

where g is the data-fidelity term and h is the regularizer. For example, the maximum a posteriori probability (MAP) estimator is obtained by setting g(x) = −log(p_{y|x}(y|x)) and h(x) = −log(p_x(x)), where p_{y|x} is the likelihood that depends on y and p_x is the prior. One of the most popular data-fidelity terms is least-squares g(x) = (1/2)||y − Ax||², which assumes a linear measurement model y = Ax + e under AWGN e. Similarly, one of the most popular regularizers is based on a sparsity-promoting penalty h(x) = τ||Wx||₁, where W is a linear transform and τ > 0 is the regularization parameter Rudin.etal1992 ; Tibshirani1996 ; Candes.etal2006 ; Donoho2006 .

Many widely used regularizers, including the ones based on the ℓ1-norm, are nondifferentiable. Proximal algorithms Parikh.Boyd2014 , such as the proximal-gradient method (PGM) Figueiredo.Nowak2003 ; Daubechies.etal2004 ; Bect.etal2004 ; Beck.Teboulle2009 and the alternating direction method of multipliers (ADMM) Eckstein.Bertsekas1992 ; Afonso.etal2010 ; Ng.etal2010 ; Boyd.etal2011 , are a class of optimization methods that circumvent the need to differentiate nonsmooth regularizers by using the proximal operator

prox_{γh}(z) := argmin_{x ∈ R^n} { (1/2)||x − z||² + γh(x) }.   (2)

The observation that the proximal operator can be interpreted as the MAP denoiser for AWGN has prompted the development of PnP Venkatakrishnan.etal2013 , where the proximal operator prox_{γh}, within ADMM or PGM, is replaced with a more general denoising function D_σ.
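To make the PnP substitution concrete, the following is a minimal numpy sketch of PnP-PGM for the least-squares data-fidelity g(x) = (1/2)||y − Ax||²; it is only an illustration, not the authors' implementation, and the soft-thresholding function used in the toy example is a hypothetical stand-in for any off-the-shelf denoiser D_σ.

```python
import numpy as np

def pnp_pgm(y, A, denoise, gamma, num_iter=200, x0=None):
    """Plug-and-play PGM sketch: gradient step on g(x) = 0.5*||y - A x||^2,
    followed by an off-the-shelf denoiser in place of prox_{gamma*h}."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    for _ in range(num_iter):
        grad = A.T @ (A @ x - y)        # gradient of the data-fidelity term
        x = denoise(x - gamma * grad)   # denoiser replaces the proximal operator
    return x

# toy usage: sparse recovery with soft-thresholding as a stand-in denoiser
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60)) / np.sqrt(30)
x_true = np.zeros(60); x_true[::10] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(30)
denoise = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.02, 0.0)
x_hat = pnp_pgm(y, A, denoise, gamma=1.0 / np.linalg.norm(A, 2) ** 2)
print(np.linalg.norm(x_hat - x_true))
```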

Consider the following alternative to PnP that also relies on a denoising function D_σ Bigdeli.etal2017 ; Romano.etal2017

x^k = x^{k−1} − γ G(x^{k−1})   with   G(x) := ∇g(x) + H(x),   H(x) := τ(x − D_σ(x)),   (3)

where γ > 0 is the step-size and τ > 0 is the regularization parameter. Under some conditions on the denoiser, it is possible to relate H in (3) to an explicit regularization function h. For example, when the denoiser is locally homogeneous and has a symmetric Jacobian Romano.etal2017 ; Reehorst.Schniter2019 , the operator H corresponds to the gradient of the following function

h(x) = (τ/2) xᵀ(x − D_σ(x)).   (4)

On the other hand, when the denoiser corresponds to the minimum mean squared error (MMSE) estimator for the AWGN denoising problem Bigdeli.etal2017 ; Reehorst.Schniter2019 , D_σ(z) = E[x | z], with z = x + e, x ∼ p_x, and e ∼ N(0, σ²I), the operator H corresponds to the gradient of

h(x) = −τσ² log p_σ(x),   (5)

where p_σ := p_x ∗ φ_σ, φ_σ is the Gaussian probability density function of variance σ², and ∗ denotes convolution. In this paper, we will use the term RED to denote all methods seeking the fixed points of (3). The key benefits of the RED methods Romano.etal2017 ; Bigdeli.etal2017 ; Metzler.etal2018 ; Reehorst.Schniter2019 ; Mataev.Elad2019 are their explicit separation of the forward model from the prior, their ability to accommodate powerful denoisers (such as the ones based on CNNs) without differentiating them, and their state-of-the-art performance on a number of imaging tasks. The next section further extends the scalability of RED by designing a new block coordinate RED algorithm.

3 Block Coordinate RED

All the current RED algorithms operate on vectors in R^n. We propose BC-RED, shown in Algorithm 1, to allow for partial randomized updates on x. Consider the decomposition of R^n into b ≥ 1 subspaces

R^n = R^{n_1} × R^{n_2} × ⋯ × R^{n_b}   with   n = n_1 + n_2 + ⋯ + n_b.

For each i ∈ {1, …, b}, we define the matrix U_i ∈ R^{n×n_i} that injects a vector in R^{n_i} into R^n and its transpose U_iᵀ that extracts the ith block from a vector in R^n. Then, for any x ∈ R^n,

x = U_1 x_1 + U_2 x_2 + ⋯ + U_b x_b   with   x_i := U_iᵀ x ∈ R^{n_i}.   (6)

Note that (6) directly implies the norm preservation ||x||² = ||x_1||² + ⋯ + ||x_b||² for any x ∈ R^n. We are interested in a block-coordinate algorithm that uses only a subset of operator outputs corresponding to coordinates in some block i. Hence, for an operator G : R^n → R^n, we define the block-coordinate operator G_i as

G_i(x) := U_iᵀ G(x) ∈ R^{n_i}.   (7)
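The injection and extraction matrices U_i and U_iᵀ never need to be formed explicitly; for contiguous equal-sized blocks they reduce to slicing and zero-padding, as in the following numpy sketch illustrating (6)-(7) (an illustration only, not part of the paper).

```python
import numpy as np

def extract_block(x, i, m):
    """U_i^T x : pull out the i-th length-m block of x."""
    return x[i * m:(i + 1) * m]

def inject_block(x_i, i, m, n):
    """U_i x_i : place a length-m block into an otherwise zero vector of length n."""
    out = np.zeros(n)
    out[i * m:(i + 1) * m] = x_i
    return out

# sanity check of (6): x = sum_i U_i U_i^T x, and norms are preserved
n, b = 12, 4
m = n // b
x = np.arange(n, dtype=float)
parts = [extract_block(x, i, m) for i in range(b)]
assert np.allclose(sum(inject_block(p, i, m, n) for i, p in enumerate(parts)), x)
assert np.isclose(sum(np.linalg.norm(p) ** 2 for p in parts), np.linalg.norm(x) ** 2)
```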

We introduce the following BC-RED algorithm.

1:  input: initial value x^0 ∈ R^n, parameter τ > 0, and step-size γ > 0.
2:  for k = 1, 2, 3, … do
3:     Choose an index i_k ∈ {1, …, b}
4:     x^k ← x^{k−1} − γ U_{i_k} G_{i_k}(x^{k−1}), where G_i(x) := U_iᵀ G(x) and G(x) := ∇g(x) + H(x) with H(x) := τ(x − D_σ(x)).
5:  end for
Algorithm 1 Block Coordinate Regularization by Denoising (BC-RED)
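The following numpy sketch mirrors Algorithm 1 for the least-squares data-fidelity and equal-sized contiguous blocks. It is a minimal illustration: the denoiser acting on the full vector and the uniform i.i.d. block rule are assumptions made for the example, not prescriptions from the paper.

```python
import numpy as np

def bc_red(y, A, denoise, tau, gamma, b, num_iter=2000, rng=None):
    """Minimal BC-RED sketch (Algorithm 1) for g(x) = 0.5*||y - A x||^2."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    assert n % b == 0, "this sketch assumes b equal-sized contiguous blocks"
    m = n // b
    x = np.zeros(n)
    for _ in range(num_iter):
        i = int(rng.integers(b))                  # step 3: pick a block i_k
        sl = slice(i * m, (i + 1) * m)
        grad_i = A[:, sl].T @ (A @ x - y)         # i-th block of grad g(x)
        h_i = tau * (x[sl] - denoise(x)[sl])      # i-th block of tau*(x - D(x))
        x[sl] -= gamma * (grad_i + h_i)           # step 4: x <- x - gamma*U_i G_i(x)
    return x
```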

Note that when b = 1, we have U_1 = I and G_1 = G. Hence, the theoretical analysis in this paper is also applicable to the full-gradient RED algorithm in (3).

As with traditional coordinate descent methods (see Wright2015 for a review), BC-RED can be implemented using different block selection strategies. The strategy adopted for our theoretical analysis selects the block indices i_k as i.i.d. random variables distributed uniformly over {1, …, b}. An alternative is to proceed in epochs of b consecutive iterations, where at the start of each epoch the set {1, …, b} is reshuffled, and i_k is then selected consecutively from this reordered set. We numerically compare the convergence of both BC-RED variants in Section 5.
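The two block selection rules compared in Section 5 can be implemented as simple index generators; a minimal illustrative sketch:

```python
import numpy as np

def iid_blocks(b, num_iter, rng):
    """i.i.d. uniform block selection (the rule used in the theoretical analysis)."""
    for _ in range(num_iter):
        yield int(rng.integers(b))

def epoch_blocks(b, num_epochs, rng):
    """Epoch-based selection: reshuffle {0, ..., b-1} at the start of every epoch."""
    for _ in range(num_epochs):
        for i in rng.permutation(b):
            yield int(i)

rng = np.random.default_rng(0)
print(list(iid_blocks(4, 8, rng)))
print(list(epoch_blocks(4, 2, rng)))
```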

BC-RED updates its iterates one randomly picked block at a time using the output of the block operator G_i. When the algorithm converges, it converges to the vectors in the zero set of G

Zer(G) := {x ∈ R^n : G(x) = 0}.

Consider the following two sets

Zer(∇g) := {x ∈ R^n : ∇g(x) = 0}   and   Fix(D_σ) := {x ∈ R^n : D_σ(x) = x},

where Zer(∇g) is the set of all critical points of the data-fidelity term and Fix(D_σ) is the set of all fixed points of the denoiser. Intuitively, the fixed points of D_σ correspond to all the vectors that are not denoised, and they can therefore be interpreted as vectors that are noise-free according to the denoiser.

Note that if x* ∈ Zer(∇g) ∩ Fix(D_σ), then G(x*) = 0 and x* is one of the solutions of BC-RED. Hence, any vector that is consistent with the data (for a convex g) and noiseless according to D_σ is in the solution set. On the other hand, when Zer(∇g) ∩ Fix(D_σ) = ∅, any x* ∈ Zer(G) corresponds to a tradeoff between the two sets, explicitly controlled via τ (see Fig. 8 in the supplement for an illustration). This explicit control is one of the key differences between RED and PnP.

BC-RED benefits from considerable flexibility compared to the full-gradient RED. Since each update is restricted to only one block of x, the algorithm is suitable for parallel implementations and can deal with problems where the vector x is distributed in space and in time. However, the maximal benefit of BC-RED is achieved when G_i is efficient to evaluate. Fortunately, it was systematically shown in Peng.etal2016 that many operators—common in machine learning, image processing, and compressive sensing—admit coordinate friendly updates.

For a specific example, consider the least-squares data-fidelity g(x) = (1/2)||y − Ax||² and a block-wise denoiser applied to the ith block x_i. Define the residual vector r := Ax − y and consider a single iteration of BC-RED that produces x⁺ by updating the ith block of x. Then, the update direction and the residual update can be computed as

G_i(x) = A_iᵀ r + τ(x_i − D_σ(x_i))   and   r⁺ = r + A_i(x_i⁺ − x_i),   (8)

where A_i is the submatrix of A consisting of the columns corresponding to the ith block. In many problems of practical interest Peng.etal2016 , the complexity of working with A_i is roughly b times lower than with A. Also, many advanced denoisers can be effectively applied on image blocks rather than on the full image Elad.Aharon2006 ; Buades.etal2010 ; Zoran.Weiss2011 . Therefore, the speed of b iterations of BC-RED is expected to be at least comparable to a single iteration of the full-gradient RED (see also Section E.1 in the supplement).
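A minimal sketch of the coordinate-friendly update in (8), maintaining the residual r = Ax − y so that each iteration only touches the columns A_i; the block-wise denoiser denoise_block is a hypothetical placeholder (for instance, a patch denoiser applied to the ith block), and the contiguous-block layout is an assumption of this example.

```python
import numpy as np

def bc_red_ls_step(x, r, i, A_cols, denoise_block, tau, gamma, m):
    """One BC-RED update for g(x) = 0.5*||A x - y||^2 using the residual r = A x - y.
    A_cols[i] holds the columns of A corresponding to the i-th block."""
    sl = slice(i * m, (i + 1) * m)
    A_i = A_cols[i]
    x_i_old = x[sl].copy()
    g_i = A_i.T @ r + tau * (x_i_old - denoise_block(x_i_old))  # direction in (8)
    x[sl] = x_i_old - gamma * g_i            # update only the i-th block
    r += A_i @ (x[sl] - x_i_old)             # cheap residual update in (8)
    return x, r
```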

4 Convergence Analysis and Compatibility with Proximal Optimization

In this section, we present two theoretical results related to BC-RED. We first establish its convergence to an element of Zer(G) and then discuss its compatibility with the theory of proximal optimization.

4.1 Fixed Point Convergence of BC-RED

Our analysis requires three assumptions that together serve as sufficient conditions for convergence.

Assumption 1.

The operator G is such that Zer(G) ≠ ∅. There is a finite number R_0 > 0 such that the distance of the initial x^0 to the farthest element of Zer(G) is bounded, that is,

||x^0 − x*|| ≤ R_0   for all   x* ∈ Zer(G).

This assumption is necessary to guarantee convergence and is related to the existence of the minimizers in the literature on traditional coordinate minimization Nesterov2012 ; Beck.Tetruashvili2013 ; Wright2015 .

The next two assumptions rely on Lipschitz constants along directions specified by specific blocks. We say that an operator G_i is block Lipschitz continuous with constant λ > 0 if

||G_i(x + U_i h_i) − G_i(x)|| ≤ λ||h_i||   for all   x ∈ R^n, h_i ∈ R^{n_i}.

When λ = 1, we say that G_i is block nonexpansive. Note that if an operator G is globally λ-Lipschitz continuous, then it is straightforward to see that each G_i is also block λ-Lipschitz continuous.

Assumption 2.

The function g is continuously differentiable and convex. Additionally, for each i ∈ {1, …, b}, the block gradient ∇_i g := U_iᵀ ∇g is block Lipschitz continuous with constant L_i > 0. We define the largest block Lipschitz constant as L_max := max{L_1, …, L_b}.

Let L denote the global Lipschitz constant of ∇g. We always have L_max ≤ L and, for some g, it may even happen that L_max ≪ L Wright2015 . As we shall see, the largest possible step-size of BC-RED depends on L_max, while that of the full-gradient RED depends on L. Hence, one natural advantage of BC-RED is that it can often take more aggressive steps compared to the full-gradient RED.
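For the least-squares data-fidelity, these constants can be computed directly: L = ||A||² (squared spectral norm) and L_i = ||A_i||², so the gap between L and L_max is easy to inspect numerically. A small sketch with illustrative values only:

```python
import numpy as np

# For g(x) = 0.5*||y - A x||^2:  L = ||A||_2^2  and  L_i = ||A_i||_2^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256))
b = 4
m = A.shape[1] // b
L_global = np.linalg.norm(A, 2) ** 2
L_blocks = [np.linalg.norm(A[:, i * m:(i + 1) * m], 2) ** 2 for i in range(b)]
print("L     =", round(L_global, 1))
print("L_max =", round(max(L_blocks), 1))  # typically smaller, allowing a larger step-size
```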

Assumption 3.

The denoiser D_σ is such that each block denoiser (D_σ)_i := U_iᵀ D_σ is block nonexpansive.

Since the proximal operator is nonexpansive Parikh.Boyd2014 , it automatically satisfies this assumption. We revisit this scenario in greater depth in Section 4.2. We can now establish the following result for BC-RED.

Theorem 1.

Run BC-RED for t ≥ 1 iterations with random i.i.d. block selection under Assumptions 1-3 using a fixed step-size γ. Then, we have

(9)

A proof of the theorem is provided in the supplement. Theorem 1 establishes that the iterates of BC-RED in expectation can get arbitrarily close to Zer(G) with O(1/t) rate. The proof relies on monotone operator theory Bauschke.Combettes2017 ; Ryu.Boyd2016 , widely used in the context of convex optimization Parikh.Boyd2014 , including in the unified analysis of various traditional coordinate descent algorithms Peng.etal2016a ; Chow.etal2017 .

Since L_max ≤ L, one important implication of Theorem 1 is that the worst-case convergence rate (in expectation) of b iterations of BC-RED is better than that of a single iteration of the full-gradient RED (to see this, note that the full-gradient rate is obtained by setting b = 1, L_max = L, and removing the expectation in (9)). This implies that in coordinate friendly settings (as discussed at the end of Section 3), the overall computational complexity of BC-RED can be lower than that of the full-gradient RED. This gain is primarily due to two factors: (a) the possibility to pick a larger step-size γ; (b) the immediate reuse of each local block-update when computing the next iterate (the full-gradient RED updates the full vector before computing the next iterate).

In the special case of H = ∇h, for some convex function h, BC-RED reduces to the traditional randomized coordinate descent method applied to (1). Hence, under the assumptions of Theorem 1, one can rely on the analysis of traditional coordinate descent methods in Wright2015 to obtain

(10)

where f* is the minimum value in (1). A proof of (10) is provided in the supplement for completeness. Therefore, such denoisers lead to explicit convex RED regularizers and to convergence of BC-RED in terms of the objective. However, as discussed in Section 4.2, when the denoiser is a proximal operator of some convex h, BC-RED is not directly solving (1), but rather its approximation.

Finally, note that the analysis in Theorem 1 only provides sufficient conditions for the convergence of BC-RED. As corroborated by our numerical studies in Section 5, the actual convergence of BC-RED is more general and often holds beyond nonexpansive denoisers. One plausible explanation for this is that such denoisers are locally nonexpansive over the set of input vectors used in testing. On the other hand, the recent techniques for spectral-normalization of CNNs Miyato.etal2018 ; Sedghi.etal2019 ; Gouk.etal2018 provide a convenient tool for building globally nonexpansive neural denoisers that result in provable convergence of BC-RED.
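One practical way to probe the "locally nonexpansive" explanation is to estimate the local Lipschitz constant of a trained denoiser around test inputs. The following Monte-Carlo sketch is a hypothetical diagnostic, not part of the paper's methodology; estimates at or below one are consistent with local nonexpansiveness.

```python
import numpy as np

def local_lipschitz_estimate(denoise, x, num_probes=50, eps=1e-2, rng=None):
    """Crude estimate of the local Lipschitz constant of a denoiser around x:
    the maximum of ||D(x + d) - D(x)|| / ||d|| over small random perturbations d."""
    rng = np.random.default_rng() if rng is None else rng
    Dx = denoise(x)
    ratios = []
    for _ in range(num_probes):
        d = rng.standard_normal(x.shape)
        d *= eps / np.linalg.norm(d)
        ratios.append(np.linalg.norm(denoise(x + d) - Dx) / np.linalg.norm(d))
    return max(ratios)
```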

4.2 Convergence for Proximal Operators

One of the limitations of the current RED theory is its limited backward compatibility with the theory of proximal optimization. For example, as discussed in Romano.etal2017 (see the section “Can we mimic any prior?”), the popular total variation (TV) denoiser Rudin.etal1992 cannot be justified with the original RED regularization function (4). In this section, we show that BC-RED (and hence also the full-gradient RED) can be used to solve (1) for any convex, closed, and proper function h. We do this by establishing a formal link between RED and the concept of Moreau smoothing, widely used in nonsmooth optimization Moreau1965 ; Rockafellar.Wets1998 ; Yu2013 . In particular, we consider the following generic denoiser

D_σ(z) := prox_{(1/τ)h}(z) = argmin_{x ∈ R^n} { (1/2)||x − z||² + (1/τ)h(x) },   (11)

where h is a closed, proper, and convex function Parikh.Boyd2014 . Since the proximal operator is nonexpansive, it is also block nonexpansive, which means that Assumption 3 is automatically satisfied. Our analysis, however, requires an additional assumption using the constant R_0 defined in Assumption 1.

Assumption 4.

There is a finite number G > 0 that bounds the largest subgradient of h over a ball around the iterates, that is,

||u|| ≤ G for all u ∈ ∂h(x) and all x in that ball,

where the radius of the ball is determined by the constant R_0 from Assumption 1.

This assumption on the boundedness of the subgradients holds for a large number of regularizers used in practice, including both the TV and the ℓ1-norm penalties. We can now establish the following result.

Theorem 2.

Run BC-RED for t ≥ 1 iterations with random i.i.d. block selection and the denoiser (11) under Assumptions 1-4 using a fixed step-size γ. Then, we have

(12)

where the function f is defined in (1) and f* is its minimum value.

The theorem is proved in the supplement. It establishes that BC-RED in expectation approximates the solution of (1), with an approximation error that is explicitly bounded and that vanishes as τ grows. Hence, by choosing τ and γ appropriately as functions of the iteration number t, one can establish the following worst-case convergence rate

(13)

When h = −log(p_x), the proximal operator corresponds to the MAP denoiser, and the solution of BC-RED corresponds to an approximate MAP estimator. This approximation can be made as precise as desired by considering larger values of the parameter τ. Note that this further justifies the RED framework by establishing that it can be used to compute a minimizer of any proper, closed, and convex (but not necessarily differentiable) regularizer h. Therefore, our analysis strengthens RED by showing that it can accommodate a much larger class of explicit regularization functions, beyond those characterized in (4) and (5).
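For a concrete instance of the proximal-operator denoiser in (11), take h(x) = ||x||₁: the denoiser is then plain soft-thresholding with threshold 1/τ, which is nonexpansive (Assumption 3) and whose subgradients are bounded (Assumption 4). A minimal sketch, assuming the (1/τ)-scaled proximal form written in (11):

```python
import numpy as np

def prox_l1_denoiser(z, tau):
    """D(z) = prox_{(1/tau)||.||_1}(z): soft-thresholding with threshold 1/tau."""
    t = 1.0 / tau
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Plugging this denoiser into BC-RED approximately solves
#   min_x 0.5*||y - A x||^2 + ||x||_1,
# with the approximation becoming tighter as tau grows (Theorem 2).
```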

5 Numerical Validation

There is considerable recent interest in using advanced priors in the context of image recovery from underdetermined (m < n) and noisy measurements. Recent work Romano.etal2017 ; Bigdeli.etal2017 ; Reehorst.Schniter2019 ; Metzler.etal2018 ; Mataev.Elad2019 suggests significant performance improvements due to advanced denoisers (such as BM3D Dabov.etal2007 or DnCNN Zhang.etal2017 ) over traditional sparsity-driven priors (such as TV Rudin.etal1992 ). Our goal is to complement these studies with several simulations validating our theoretical analysis and providing additional insights into BC-RED.

We consider inverse problems of the form y = Ax + e, where e is an AWGN vector and A is a matrix corresponding to either a sparse-view Radon transform, an i.i.d. zero-mean Gaussian random matrix, or a radially subsampled two-dimensional Fourier transform. Such matrices are commonly used in the context of computerized tomography (CT) Kak.Slaney1988 , compressive sensing Candes.etal2006 ; Donoho2006 , and magnetic resonance imaging (MRI) Knoll.etal2011 , respectively. In all simulations, we fix the measurement ratio m/n and add AWGN corresponding to input signal-to-noise ratios (SNRs) of 30 dB and 40 dB. The images used correspond to 10 images randomly selected from the NYU fastMRI dataset Zbontar.etal2018 (see Fig. 5 in the supplement). BC-RED is set to work with 16 blocks of equal size. The reconstruction quality is quantified using the SNR averaged over all ten test images.

In addition to well-studied denoisers, such as TV and BM3D, we design our own CNN denoiser, denoted DnCNN∗, which is a simplified version of the popular DnCNN denoiser (see Supplement E for details). This simplification reduces the computational complexity of denoising, which is important when running many iterations of BC-RED. Additionally, it makes it easier to control the global Lipschitz constant of the CNN via spectral normalization Sedghi.etal2019 . We train DnCNN∗ for the removal of AWGN at four noise levels σ. For each experiment, we select the denoiser achieving the highest SNR value. Note that the parameter σ of BM3D is also fine-tuned for each experiment over the same set of noise levels.

Figure 2: Left: Illustration of the convergence of BC-RED under a nonexpansive prior. The average normalized distance to Zer(G) is plotted against the iteration number, with the shaded areas representing the range of values attained over all test images. Right: Illustration of the influence of the parameter τ for solving the TV regularized least-squares problem using BC-RED. As τ increases, BC-RED provides an increasingly accurate approximation to the TV optimization problem.
Methods          |      Radon      |      Random     |     Fourier
                 | 30 dB  | 40 dB  | 30 dB  | 40 dB  | 30 dB  | 40 dB
-----------------|--------|--------|--------|--------|--------|-------
PGM (TV)         | 20.66  | 24.40  | 26.07  | 28.42  | 28.74  | 29.99
U-Net            | 21.90  | 21.72  | 16.37  | 16.40  | 22.11  | 22.11
RED (TV)         | 20.79  | 24.46  | 25.64  | 28.30  | 28.67  | 29.97
BC-RED (TV)      | 20.78  | 24.42  | 25.70  | 28.40  | 28.71  | 29.99
RED (BM3D)       | 21.55  | 25.24  | 26.46  | 27.82  | 28.89  | 29.79
BC-RED (BM3D)    | 21.56  | 25.16  | 26.52  | 27.89  | 28.85  | 29.80
RED (DnCNN∗)     | 20.89  | 24.38  | 26.53  | 28.05  | 29.33  | 30.32
BC-RED (DnCNN∗)  | 20.88  | 24.42  | 26.60  | 28.12  | 29.40  | 30.39

Table 1: Average SNRs (dB) obtained for different measurement matrices and image priors.

Theorem 1 establishes that the sequence of iterates generated by BC-RED converges in expectation to an element of Zer(G). This is illustrated in Fig. 2 (left) for the Radon matrix and a nonexpansive denoiser (see also Fig. 6 in the supplement). The average normalized distance to Zer(G) is plotted against the iteration number for the full-gradient RED and BC-RED, with b updates of BC-RED (each modifying a single block) represented as one iteration. We numerically tested two block selection rules for BC-RED (i.i.d. and epoch) and observed that processing in randomized epochs leads to faster convergence. For reference, the figure also plots the normalized squared norm of the gradient-mapping vectors produced by the traditional PGM with TV Beck.Teboulle2009a . The shaded areas indicate the range of values taken over the runs corresponding to each test image. The results highlight the potential of BC-RED to enjoy a better convergence rate compared to the full-gradient RED, with BC-RED (epoch) reaching a given accuracy in 104 iterations, while the full-gradient RED requires 190 iterations to achieve the same accuracy.

Theorem 2 establishes that for proximal-operator denoisers, BC-RED computes an approximate solution to (1) with an accuracy controlled by the parameter τ. This is illustrated in Fig. 2 (right) for the Fourier matrix and the TV regularized least-squares problem. The average objective gap f(x) − f* is plotted against the iteration number for BC-RED with several values of τ. The optimal value f* is obtained by running the traditional PGM until convergence. As before, the figure groups b updates of BC-RED into a single iteration. The results are consistent with our theoretical analysis and show that as τ increases, BC-RED provides an increasingly accurate solution to the TV problem. On the other hand, since the range of possible values for the step-size γ depends on τ, the speed of convergence to Zer(G) is also influenced by τ.

The benefits of the full-gradient RED algorithms have been well documented in prior work Romano.etal2017 ; Bigdeli.etal2017 ; Reehorst.Schniter2019 ; Metzler.etal2018 ; Mataev.Elad2019 . Table 1 summarizes the average SNR performance of BC-RED in comparison to the full-gradient RED for all three matrix types and several priors. Unlike the full-gradient RED, BC-RED is implemented using block-wise denoisers that work on image patches rather than on the full images. We empirically found that 40 pixel padding on the denoiser input is sufficient for BC-RED to match the performance of the full-gradient RED. The table also includes the results for the traditional PGM with TV Beck.Teboulle2009a and the widely used end-to-end U-Net approach Jin.etal2017a ; Han.etal2017 . The latter first backprojects the measurements into the image domain and then denoises the result using U-Net Ronneberger.etal2015 . The model was specifically trained end-to-end for the Radon matrix with 30 dB noise and applied as such to the other measurement settings. All the algorithms were run until convergence with hyperparameters optimized for SNR. The DnCNN∗ denoiser in the table corresponds to the residual network with a Lipschitz constant of two (see Supplement E.2 for details). First, note the excellent agreement between BC-RED and the full-gradient RED. This close agreement between the two methods is encouraging: BC-RED relies on block-wise denoising and our analysis does not establish uniqueness of the solution, yet, in practice, both methods seem to yield solutions of nearly identical quality. Second, note that BC-RED and RED provide excellent approximations to the PGM-TV solutions. Third, note how (unlike U-Net) BC-RED and RED with DnCNN∗ generalize to different measurement models. Finally, no prior seems to be universally good on all measurement settings, which indicates the potential benefit of tailoring specific priors to specific measurement models.

Figure 3: Recovery of a large galaxy image degraded by a spatially variant blur and a high amount of AWGN. The efficacy of BC-RED on this problem is due to the natural sparsity of the image, with all of the information contained in a small part of the full image.

Coordinate descent methods are known to be highly beneficial in problems where both m and n are very large, but each measurement depends only on a small subset of the unknowns Niu.etal2011 . Fig. 3 demonstrates BC-RED in such a large-scale setting by adopting the experimental setup from a recent work Farrens.etal2017 (see also Fig. 10 in the supplement). Specifically, we consider the recovery of a galaxy image degraded by 597 known point spread functions (PSFs) corresponding to different spatial locations (see Supplement F for details). The natural sparsity of the problem makes it ideal for BC-RED, which is implemented to update pixel blocks in a randomized fashion by only picking areas containing galaxies. The computational complexity of BC-RED is further reduced by considering a simpler variant of DnCNN∗ that has only four convolutional layers (see Fig. 4 in the supplement). For comparison, we additionally show the result obtained by using the low-rank recovery method from Farrens.etal2017 with all the parameters kept at the values set by the authors. Note that our intent here is not to justify DnCNN∗ as a prior for image deblurring, but to demonstrate that BC-RED can indeed be applied to a realistic, nontrivial image recovery task on a large image.

6 Conclusion and Future Work

Coordinate descent methods have become increasingly important in optimization for solving large-scale problems arising in data analysis. We have introduced BC-RED as a coordinate descent extension to the current family of RED algorithms and theoretically analyzed its convergence. Preliminary experiments suggest that BC-RED can be an effective tool in large-scale estimation problems arising in image recovery. More experiments are certainly needed to better assess the promise of this approach in various estimation tasks. For future work, we would like to explore accelerated and asynchronous variants of BC-RED to further enhance its performance in parallel settings.

Acknowledgments

This material is based upon work supported by NSF award CCF-1813910 and by NVIDIA Corporation with the donation of the Titan Xp GPU for research.

References

  • (1) S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proc. IEEE Global Conf. Signal Process. and Inf. Process. (GlobalSIP), 2013.
  • (2) N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
  • (3) K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, August 2007.
  • (4) K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, July 2017.
  • (5) S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Trans. Comp. Imag., vol. 3, no. 1, pp. 84–98, March 2017.
  • (6) Sreehari et al., “Plug-and-play priors for bright field electron tomography and sparse interpolation,” IEEE Trans. Comp. Imag., vol. 2, no. 4, pp. 408–423, December 2016.
  • (7) S. Ono, “Primal-dual plug-and-play image restoration,” IEEE Signal Process. Lett., vol. 24, no. 8, pp. 1108–1112, 2017.
  • (8) U. S. Kamilov, H. Mansour, and B. Wohlberg, “A plug-and-play priors approach for solving nonlinear imaging inverse problems,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1872–1876, December 2017.
  • (9) T. Meinhardt, M. Moeller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), 2017.
  • (10) K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
  • (11) G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, “Plug-and-play unplugged: Optimization free reconstruction using consensus equilibrium,” SIAM J. Imaging Sci., vol. 11, no. 3, pp. 2001–2020, 2018.
  • (12) Y. Sun, B. Wohlberg, and U. S. Kamilov, “An online plug-and-play algorithm for regularized image reconstruction,” IEEE Trans. Comput. Imaging, 2019.
  • (13) A. M. Teodoro, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “A convergent image fusion algorithm using scene-adapted Gaussian-mixture-based denoising,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 451–463, Jan. 2019.
  • (14) J. Tan, Y. Ma, and D. Baron, “Compressive imaging via approximate message passing with image denoising,” IEEE Trans. Signal Process., vol. 63, no. 8, pp. 2085–2092, Apr. 2015.
  • (15) C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, September 2016.
  • (16) C. A. Metzler, A. Maleki, and R. Baraniuk, “BM3D-PRGAMP: Compressive phase retrieval based on BM3D denoising,” in Proc. IEEE Int. Conf. Image Proc., 2016.
  • (17) A. Fletcher, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Advances in Neural Information Processing Systems 32, 2018.
  • (18) Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
  • (19) S. A. Bigdeli, M. Jin, P. Favaro, and M. Zwicker, “Deep mean-shift priors for image restoration,” in Proc. Advances in Neural Information Processing Systems 31, Long Beach, CA, USA, Dec. 2017.
  • (20) E. T. Reehorst and P. Schniter, “Regularization by denoising: Clarifications and new interpretations,” IEEE Trans. Comput. Imag., vol. 5, no. 1, pp. 52–67, Mar. 2019.
  • (21) C. A. Metzler, P. Schniter, A. Veeraraghavan, and R. G. Baraniuk, “prDeep: Robust phase retrieval with a flexible deep network,” in Proc. 35th Int. Conf. Machine Learning (ICML), 2018.
  • (22) G. Mataev, M. Elad, and P. Milanfar, “DeepRED: Deep image prior powered by RED,” 2019, arXiv:1903.10176 [cs.CV].
  • (23) Y. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM J. Optim., vol. 22, no. 2, pp. 341–362, 2012.
  • (24) A. Beck and L. Tetruashvili, “On the convergence of block coordinate descent type methods,” SIAM J. Optim., vol. 23, no. 4, pp. 2037–2060, Oct. 2013.
  • (25) S. J. Wright, “Coordinate descent algorithms,” Math. Program., vol. 151, no. 1, pp. 3–34, June 2015.
  • (26) O. Fercoq and A. Gramfort, “Coordinate descent methods,” lecture notes, Optimization for Data Science, École polytechnique, 2018.
  • (27) L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, no. 1–4, pp. 259–268, November 1992.
  • (28) R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • (29) E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.
  • (30) D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
  • (31) M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration,” IEEE Trans. Image Process., vol. 12, no. 8, pp. 906–916, August 2003.
  • (32) I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, November 2004.
  • (33) J. Bect, L. Blanc-Féraud, G. Aubert, and A. Chambolle, “A ℓ1-unified variational framework for image restoration,” in Proc. ECCV, Springer, Ed., New York, 2004, vol. 3024, pp. 1–13.
  • (34) A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • (35) J. Eckstein and D. P. Bertsekas, “On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Mathematical Programming, vol. 55, pp. 293–318, 1992.
  • (36) M. V. Afonso, J. M.Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2345–2356, September 2010.
  • (37) M. K. Ng, P. Weiss, and X. Yuan, “Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods,” SIAM J. Sci. Comput., vol. 32, no. 5, pp. 2710–2736, August 2010.
  • (38) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • (39) Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin, “Coordinate-friendly structures, algorithms and applications,” Adv. Math. Sci. Appl., vol. 1, no. 1, pp. 57–119, Apr. 2016.
  • (40) M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, December 2006.
  • (41) A. Buades, B. Coll, and J. M. Morel, “Image denoising methods. A new nonlocal principle,” SIAM Rev, vol. 52, no. 1, pp. 113–147, 2010.
  • (42) D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), 2011.
  • (43) H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, 2 edition, 2017.
  • (44) E. K. Ryu and S. Boyd, “A primer on monotone operator methods,” Appl. Comput. Math., vol. 15, no. 1, pp. 3–43, 2016.
  • (45) Z. Peng, Y. Xu, M. Yan, and W. Yin, “ARock: An algorithmic framework for asynchronous parallel coordinate updates,” SIAM J. Sci. Comput., vol. 38, no. 5, pp. A2851–A2879, 2016.
  • (46) Y. T. Chow, T. Wu, and W. Yin, “Cyclic coordinate-update algorithms for fixed-point problems: Analysis and applications,” SIAM J. Sci. Comput., vol. 39, no. 4, pp. A1280–A1300, 2017.
  • (47) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations (ICLR), 2018.
  • (48) H. Sedghi, V. Gupta, and P. M. Long, “The singular values of convolutional layers,” in International Conference on Learning Representations (ICLR), New Orleans, LA, USA, May 2019.
  • (49) H. Gouk, E. Frank, B. Pfahringer, and M. Cree, “Regularisation of neural networks by enforcing Lipschitz continuity,” 2018, arXiv:1804.04368.
  • (50) J. J. Moreau, “Proximité et dualité dans un espace hilbertien,” Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
  • (51) R. T. Rockafellar and R. J-B Wets, Variational Analysis, Springer, 1998.
  • (52) Y.-L. Yu, “Better approximation and faster algorithm using the proximal average,” in Proc. Advances in Neural Information Processing Systems 26, Lake Tahoe, CA, USA, December 5-10, 2013, pp. 458–466.
  • (53) A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, IEEE, 1988.
  • (54) F. Knoll, K. Brendies, T. Pock, and R. Stollberger, “Second order total generalized variation (TGV) for MRI,” Magn. Reson. Med., vol. 65, no. 2, pp. 480–491, February 2011.
  • (55) Zbontar et al., “fastMRI: An open dataset and benchmarks for accelerated MRI,” 2018, arXiv:1811.08839.
  • (56) A. Beck and M. Teboulle, “Fast gradient-based algorithm for constrained total variation image denoising and deblurring problems,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, November 2009.
  • (57) K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sept. 2017.
  • (58) Y. S. Han, J. Yoo, and J. C. Ye, “Deep learning with domain adaptation for accelerated projection reconstruction MR,” Magn. Reson. Med., vol. 80, no. 3, pp. 1189–1205, Sept. 2017.
  • (59) O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • (60) F. Niu, B. Recht, C. Ré, and S. J. Wright, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Proc. Advances in Neural Information Processing Systems 24, Granada, Spain, December 12-15, 2011, pp. 693–701.
  • (61) S. Farrens, F. M. Ngolè Mboula, and J.-L. Starck, “Space variant deconvolution of galaxy survey images,” A&A, vol. 601, pp. A66, 2017.
  • (62) S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.
  • (63) Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 2004.
  • (64) Mandelbaum et al., “The third gravitational lensing accuracy testing (GREAT3) challenge handbook,” Astrophys. J. Suppl. S., vol. 212, no. 1, pp. 5, Aug. 2014.
  • (65) Cropper et al., “VIS: The visible imager for Euclid,” in Proc. SPIE, 2012, vol. 8442.
  • (66) T. Kuntzer, M. Tewes, and F. Courbin, “Stellar classification from single-band imaging using machine learning,” A&A, vol. 591, pp. A54, 2016.

Appendix A Proof of Theorem 1

The fixed-point convergence of averaged operators is well known under the name of the Krasnosel'skii-Mann theorem (see Section 5.2 in Bauschke.Combettes2017 ) and was recently applied to the analysis of PnP Sun.etal2018a and of several full-gradient RED algorithms in Reehorst.Schniter2019 . Our analysis here extends these results to the block-coordinate setting and provides explicit worst-case convergence rates for BC-RED.

We consider the following operators

and proceed in several steps.

  (a) Since ∇_i g is block L_i-Lipschitz continuous, it is also block L_max-Lipschitz continuous. Hence, we know from Proposition 7 in Supplement D.3 that it is block cocoercive. Then from Proposition 4 in Supplement D.2, we know that the corresponding operator is block nonexpansive.

  (b) From the definition of D_σ and the fact that D_σ is block nonexpansive, we know that the corresponding operator is block nonexpansive.

  (c) From Proposition 1 in Supplement D.1, we know that a convex combination of block nonexpansive operators is also block nonexpansive; hence we conclude that the combined operator is block nonexpansive. Then from Proposition 4 in Supplement D.2, we know that G is block cocoercive.

  (d) Consider any x* ∈ Zer(G), an index i picked uniformly at random, and a single iteration of BC-RED x⁺ = x − γU_i G_i(x). We then have

    (14)

    where in the third line we used G(x*) = 0, in the fourth line the block cocoercivity of G, and in the last line the bound on the step-size γ.

  (e) By taking a conditional expectation on both sides and rearranging the terms, we obtain the corresponding per-iteration bound.

  (f) Hence, by averaging over the t iterations and taking the total expectation, we obtain the stated bound.

The last inequality directly leads to the result.

Remark. Eq. (14) implies that, under Assumptions 1-3, the iterates of BC-RED satisfy

(15)

which means that the distance of the iterates of BC-RED to Zer(G) is nonincreasing.

Remark. Suppose we are solving a coordinate friendly problem Peng.etal2016 , in which the cost of a full gradient update is b times the cost of a block update. Consider the step-size γ = 1/L, where L is the global Lipschitz constant used by the full-gradient method. A similar analysis as above would yield a convergence rate for the gradient method that scales as L/t.

Now, consider the step-size γ = 1/L_max and suppose that we run bt updates of BC-RED (that is, the same computational budget). Then, the resulting expected rate scales as L_max/t.

Since L_max ≤ L ≤ bL_max, where the upper bound can sometimes be tight, we conclude that the expected complexity of the block-coordinate algorithm is lower compared to that of the full-gradient algorithm.

Appendix B Proof of Theorem 2

The concept of Moreau smoothing is well-known and has been extensively used in other contexts (see for example Yu2013 ). Our contribution is to formally connect the concept to RED-based algorithms, which leads to its novel justification as an approximate MAP estimator. The basic review of relevant concepts from proximal optimization is given in Supplement D.4.

For λ := 1/τ, we consider the Moreau envelope of h

h_λ(x) := min_{z ∈ R^n} { (1/(2λ))||z − x||² + h(z) }.

From Proposition 9 in Supplement D.4 we know that

(16)

and from Proposition 8 in Supplement D.4, we know that

∇h_λ(x) = (1/λ)(x − prox_{λh}(x)) = τ(x − D_σ(x)) = H(x).   (17)

Hence, we can express the function f in (1) as

f(x) = f_λ(x) + (h(x) − h_λ(x)),   where   f_λ := g + h_λ.

From eq. (17), we conclude that a single iteration of BC-RED is performing a block-coordinate descent step on the smoothed function f_λ. From eq. (16) and the convexity of the Moreau envelope, the gap between f and f_λ is bounded, so the minimum of f_λ is within a finite distance of the minimum f* of f. Considering the t-th iteration of BC-RED, we then obtain the stated bound, where we applied (10), which is further discussed in Supplement C.

The proof of eq. (13) is directly obtained by setting τ and γ appropriately in (12) and noting that the resulting bound holds for all t ≥ 1.

Appendix C Convergence of the Traditional Coordinate Descent

The following analysis has been adopted from Wright2015 . We include it here for completeness.

Consider the following denoiser

D_σ(x) := x − (1/τ)∇h(x)

and the following function

f(x) := g(x) + h(x),

where g and h are both convex and continuously differentiable. For this denoiser, we have

G(x) = ∇g(x) + τ(x − D_σ(x)) = ∇g(x) + ∇h(x) = ∇f(x).

Therefore, in this case, BC-RED is minimizing the convex and smooth function f, which means that any x* ∈ Zer(G) = Zer(∇f) is a global minimizer of f. Additionally, due to Proposition 2 in Supplement D.1 and Proposition 7 in Supplement D.3, the block nonexpansiveness of the block denoisers translates into Lipschitz continuity of the block gradients ∇_i h. Hence, for such denoisers, Assumption 3 is equivalent to a Lipschitz-smoothness condition on the block gradients ∇_i h.

To prove eq. (10), we consider the iteration

x^k = x^{k−1} − γ U_{i_k} ∇_{i_k} f(x^{k−1}),

which under our assumptions is a special case of the setting of Theorem 1.

  (a) From the block Lipschitz continuity of the block gradients ∇_i f, we obtain a per-block descent inequality, where the last inequality comes from the bound on the step-size γ.

  (b) For all k ≥ 1, define the quantity used in the analysis. Then from (a), we can conclude the corresponding bound, where in the last inequality we used Jensen's inequality and the fact that the block index is selected uniformly at random.

  (c) From convexity, we can bound the objective gap, where in the last inequality we used eq. (15). This, combined with the result of (b), implies the desired recursion.

  (d) Note that from (c), we can obtain a recursive inequality. By iterating this inequality, we get the final result.

Appendix D Background Material

The results in this section are well-known in the optimization literature and can be found in different forms in standard textbooks Rockafellar.Wets1998 ; Boyd.Vandenberghe2004 ; Nesterov2004 ; Bauschke.Combettes2017 . For completeness, we summarize the key results useful for our analysis by restating them in a block-coordinate form.

D.1 Properties of Block-Coordinate Operators

Most of the concepts in this part come from the traditional monotone operator theory Ryu.Boyd2016 ; Bauschke.Combettes2017 adapted for block-coordinate operators.

Definition 1.

We define the block-coordinate operator