
Scalable Plug-and-Play ADMM with Convergence Guarantees

Plug-and-play priors (PnP) is a broadly applicable methodology for solving inverse problems by exploiting statistical priors specified as denoisers. Recent work has reported the state-of-the-art performance of PnP algorithms using pre-trained deep neural nets as denoisers in a number of imaging applications. However, current PnP algorithms are impractical in large-scale settings due to their heavy computational and memory requirements. This work addresses this issue by proposing an incremental variant of the widely used PnP-ADMM algorithm, making it scalable to large-scale datasets. We theoretically analyze the convergence of the algorithm under a set of explicit assumptions, extending recent theoretical results in the area. Additionally, we show the effectiveness of our algorithm with nonsmooth data-fidelity terms and deep neural net priors, its fast convergence compared to existing PnP algorithms, and its scalability in terms of speed and memory.


1 Introduction

Plug-and-play priors (PnP) is a simple yet flexible methodology for imposing statistical priors without explicitly forming an objective function Venkatakrishnan.etal2013 ; Sreehari.etal2016 . PnP algorithms alternate between imposing data consistency by minimizing a data-fidelity term and imposing a statistical prior by applying an additive white Gaussian noise (AWGN) denoiser. PnP draws its inspiration from the proximal algorithms extensively used in nonsmooth composite optimization Parikh.Boyd2014 , such as the proximal-gradient method (PGM) Figueiredo.Nowak2003 ; Daubechies.etal2004 ; Bect.etal2004 ; Beck.Teboulle2009 and the alternating direction method of multipliers (ADMM) Eckstein.Bertsekas1992 ; Afonso.etal2010 ; Ng.etal2010 ; Boyd.etal2011 . The popularity of deep learning has led to a wide adoption of PnP for exploiting learned priors specified through pre-trained deep neural nets, leading to its state-of-the-art performance in a variety of applications Zhang.etal2017a ; Dong.etal2019 ; Zhang.etal2019 ; Ahmad.etal2020 ; Wei.etal2020 . Its empirical success has spurred follow-up work that provided theoretical justifications of PnP in various settings Chan.etal2016 ; Meinhardt.etal2017 ; Buzzard.etal2017 ; Sun.etal2018a ; Tirer.Giryes2019 ; Teodoro.etal2019 ; Ryu.etal2019 . Despite this progress, current PnP algorithms are not practical for addressing large-scale problems due to their computation time and memory requirements. To the best of our knowledge, the only prior work on developing PnP algorithms that are suitable for large-scale problems is the stochastic gradient descent variant of PnP (PnP-SGD), whose fixed-point convergence was recently analyzed for smooth data-fidelity terms Sun.etal2018a .

In this work, we present a new incremental PnP-ADMM (IPA) algorithm for solving large-scale inverse problems. As an extension of the widely used PnP-ADMM Venkatakrishnan.etal2013 ; Sreehari.etal2016 , IPA can integrate statistical information from a data-fidelity term and a pre-trained deep neural net. However, unlike PnP-ADMM, IPA can effectively scale to datasets that are too large for traditional batch processing by using a single element or a small subset of the dataset at a time. The memory and per-iteration complexity of IPA are independent of the number of measurements, thus allowing it to deal with very large datasets. Additionally, unlike PnP-SGD Sun.etal2018a , IPA can effectively address problems with nonsmooth data-fidelity terms and generally has faster convergence. We present a detailed convergence analysis of IPA under a set of explicit assumptions on the data-fidelity term and the denoiser. Our analysis extends the recent fixed-point analysis of PnP-ADMM in Ryu.etal2019 to partial randomized processing of data. To the best of our knowledge, the proposed scalable PnP algorithm and the corresponding convergence analysis are absent from the current literature in this area. Our numerical validation demonstrates the practical effectiveness of IPA for integrating nonsmooth data-fidelity terms and deep neural net priors, its fast convergence compared to PnP-SGD, and its scalability in terms of both speed and memory. In summary, we establish IPA as a flexible, scalable, and theoretically sound PnP algorithm applicable to a wide variety of large-scale problems.

All proofs and some technical details have been omitted due to space restrictions and are included in the supplement, which also provides additional background and simulations.

2 Background

Consider the problem of estimating an unknown vector $x \in \mathbb{R}^n$ from a set of noisy measurements $y \in \mathbb{R}^m$. It is common to formulate this estimation as an optimization problem

(1)   $\widehat{x} \in \operatorname*{arg\,min}_{x \in \mathbb{R}^n} f(x) \quad\text{with}\quad f(x) = g(x) + h(x),$

where $g$ is a data-fidelity term that quantifies consistency with the observed data $y$ and $h$ is a regularizer that encodes prior knowledge on $x$. As an example, consider the nonsmooth $\ell_1$-norm data-fidelity term $g(x) = \|y - Ax\|_1$, which assumes a linear observation model $y = Ax + e$, and the TV regularizer $h(x) = \tau\|Dx\|_1$, where $D$ is the gradient operator and $\tau > 0$ is the regularization parameter. Common applications of (1) include sparse vector recovery in compressive sensing Candes.etal2006 ; Donoho2006 , image restoration using total variation (TV) Beck.Teboulle2009a , and low-rank matrix completion Recht.etal2010 .
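As a small illustration of the formulation in (1), the sketch below evaluates the composite objective with the $\ell_1$ data-fidelity term and an anisotropic TV regularizer on a toy problem; the problem sizes and the anisotropic TV variant are illustrative assumptions rather than choices taken from the paper.

```python
import numpy as np

def tv(x_img):
    """Anisotropic total variation of a 2D image (sum of absolute finite differences)."""
    return np.abs(np.diff(x_img, axis=0)).sum() + np.abs(np.diff(x_img, axis=1)).sum()

def objective(x_img, y, A, tau):
    """f(x) = ||y - A x||_1 + tau * TV(x) for a vectorized image x."""
    return np.abs(y - A @ x_img.ravel()).sum() + tau * tv(x_img)

# toy problem: an 8x8 image observed through 32 random measurements
rng = np.random.default_rng(0)
x_true = rng.random((8, 8))
A = rng.standard_normal((32, 64))
y = A @ x_true.ravel()
print(objective(x_true, y, A, tau=0.1))   # data term is zero here, only TV remains
```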

Proximal algorithms are often used for solving problems of the form (1) when $g$ or $h$ are nonsmooth Parikh.Boyd2014 . For example, ADMM is one such standard algorithm; it can be summarized as

(2a)   $x^k = \mathrm{prox}_{\gamma g}(z^{k-1} - s^{k-1})$
(2b)   $z^k = \mathrm{prox}_{\gamma h}(x^k + s^{k-1})$
(2c)   $s^k = s^{k-1} + x^k - z^k,$

where $\gamma > 0$ is the penalty parameter Boyd.etal2011 . ADMM relies on the proximal operator

(3)   $\mathrm{prox}_{\gamma h}(z) \,:=\, \operatorname*{arg\,min}_{x \in \mathbb{R}^n}\left\{\tfrac{1}{2}\|x - z\|_2^2 + \gamma h(x)\right\},$

which is well-defined for any proper, closed, and convex function $h$ Parikh.Boyd2014 . The proximal operator can be interpreted as a maximum a posteriori probability (MAP) estimator for the AWGN denoising problem

(4)   $z = x + e, \quad e \sim \mathcal{N}(0, \sigma^2 I),$

by setting $\gamma = \sigma^2$ and taking $h$ to be the negative log-prior of $x$. This perspective inspired the development of PnP Venkatakrishnan.etal2013 ; Sreehari.etal2016 , where the proximal operator is simply replaced by a more general denoiser $\mathsf{D}_\sigma$, such as BM3D Dabov.etal2007 or DnCNN Zhang.etal2017 . For example, the widely used PnP-ADMM can be summarized as

(5a)   $x^k = \mathrm{prox}_{\gamma g}(z^{k-1} - s^{k-1})$
(5b)   $z^k = \mathsf{D}_\sigma(x^k + s^{k-1})$
(5c)   $s^k = s^{k-1} + x^k - z^k,$

where, in analogy to $\gamma$ in (3), we introduce the parameter $\sigma > 0$ controlling the relative strength of the denoiser. Remarkably, this heuristic of using denoisers not associated with any regularizer $h$ within an iterative algorithm has exhibited great empirical success Zhang.etal2017a ; Dong.etal2019 ; Zhang.etal2019 ; Ahmad.etal2020 and spurred a great deal of theoretical work on PnP algorithms Chan.etal2016 ; Meinhardt.etal2017 ; Buzzard.etal2017 ; Sun.etal2018a ; Tirer.Giryes2019 ; Teodoro.etal2019 ; Ryu.etal2019 .
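The following minimal sketch spells out the iterations (5a)-(5c) in code. It assumes a quadratic data-fidelity term so that the x-update has the closed form discussed later in (8), and it uses a trivial stand-in for the denoiser; both are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def pnp_admm(y, A, denoiser, gamma, sigma, num_iter=50):
    """PnP-ADMM sketch following (5a)-(5c); `denoiser(v, sigma)` stands in for a
    pre-trained AWGN denoiser D_sigma."""
    n = A.shape[1]
    x, z, s = np.zeros(n), np.zeros(n), np.zeros(n)
    # Precompute the system matrix for prox of g(x) = 0.5*||y - Ax||^2.
    M = np.eye(n) + gamma * (A.T @ A)
    for _ in range(num_iter):
        x = np.linalg.solve(M, (z - s) + gamma * (A.T @ y))   # (5a) data-consistency
        z = denoiser(x + s, sigma)                            # (5b) prior via denoiser
        s = s + x - z                                         # (5c) dual update
    return x

# toy usage with a placeholder "denoiser" (mild shrinkage toward the mean)
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 64)) / np.sqrt(40)
x_true = rng.random(64)
y = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = pnp_admm(y, A, lambda v, sig: 0.9 * v + 0.1 * v.mean(), gamma=1.0, sigma=0.1)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```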

An elegant fixed-point convergence analysis of PnP-ADMM was recently presented in Ryu.etal2019 . By substituting $v^k := x^k + s^{k-1}$ into PnP-ADMM, the algorithm is expressed in terms of an operator

(6)   $\mathsf{T} \,:=\, \tfrac{1}{2}\left(\mathsf{I} + (2\,\mathrm{prox}_{\gamma g} - \mathsf{I})(2\,\mathsf{D}_\sigma - \mathsf{I})\right),$

where $\mathsf{I}$ denotes the identity operator. The convergence of PnP-ADMM is then established through its equivalence to the fixed-point convergence of the sequence $v^k = \mathsf{T}v^{k-1}$. The equivalence of PnP-ADMM to the iterations of the operator (6) originates from the well-known relationship between ADMM and the Douglas-Rachford splitting Parikh.Boyd2014 ; Eckstein.Bertsekas1992 .
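The equivalence can be checked numerically. The sketch below reuses the toy quadratic prox and stand-in denoiser from the previous example (both assumptions for illustration) and verifies that iterating the operator in (6) on $v^k = x^k + s^{k-1}$ reproduces the PnP-ADMM iterates of (5a)-(5c).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, gamma = 30, 50, 0.5
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ rng.random(n)
M = np.eye(n) + gamma * (A.T @ A)

prox_g = lambda v: np.linalg.solve(M, v + gamma * (A.T @ y))  # prox of 0.5*||y - Ax||^2
D = lambda v: 0.9 * v + 0.1 * v.mean()                        # toy stand-in denoiser
T = lambda v: 0.5 * (v + 2 * prox_g(2 * D(v) - v) - (2 * D(v) - v))  # operator (6)

x, z, s = np.zeros(n), np.zeros(n), np.zeros(n)
x = prox_g(z - s)          # x^1
v = x + s                  # v^1 = x^1 + s^0
for _ in range(20):
    z = D(x + s)           # (5b)
    s = s + x - z          # (5c)
    x = prox_g(z - s)      # (5a) for the next iteration
    v = T(v)               # one application of (6)
print(np.linalg.norm(v - (x + s)))   # ~0: both recursions coincide
```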

Scalable optimization algorithms have become increasingly important in the context of large-scale problems arising in machine learning and data science Bottou.etal2018 . Stochastic and online optimization techniques have been investigated for traditional ADMM Wang.Banerjee2012 ; Ouyang.etal2013 ; Suzuki2013 ; Zhong.etal2014 ; Huang.etal2019 , where the data-fidelity term $g$ is approximated using a subset of observations (with or without subsequent linearization). Our work contributes to this area by investigating the scalability of PnP-ADMM, which does not minimize any explicit objective function. Since PnP-ADMM can integrate powerful deep neural net denoisers, there is a pressing need to understand its theoretical properties and its ability to address large-scale imaging applications.

Before introducing our algorithm, it is worth briefly mentioning an emerging paradigm of using deep neural nets for solving ill-posed imaging inverse problems (see the reviews McCann.etal2017 ; Lucas.etal2018 ; Knoll.etal2020 ; Ongie.etal2020 ). This work is most related to techniques that explicitly decouple the measurement model from the learned prior. For example, learned denoisers have been adopted for a class of algorithms in compressive sensing known as approximate message passing (AMP) Tan.etal2015 ; Metzler.etal2016a ; Metzler.etal2016 ; Fletcher.etal2018 . The key difference of PnP from AMP is that it does not assume random measurement operators. Regularization by denoising (RED), and the closely related deep mean-shift priors, rely on the denoiser to specify an explicit regularizer that has a simple gradient Romano.etal2017 ; Bigdeli.etal2017 ; Sun.etal2019b ; Mataev.etal2019 . PnP does not assume the existence of such an objective, instead interpreting its solutions as equilibrium points balancing the data fit and the prior Buzzard.etal2017 . Finally, a recent line of work has investigated recovery and convergence guarantees for priors specified by generative adversarial networks (GANs) Bora.etal2017 ; Shah.Hegde2018 ; Hyder.etal2019 ; Raj.etal2019 ; Latorre.etal2019 . PnP does not seek to project its iterates onto the range of a GAN; instead, it directly uses the output of a simple AWGN denoiser to improve the estimation quality. This greatly simplifies the training and application of learned priors within the PnP methodology. Our work contributes to this broad area by providing new conceptual, theoretical, and empirical insights into incremental ADMM optimization under statistical priors specified as deep neural net denoisers.

3 Incremental PnP-ADMM

1:input: initial values $z^0, s^0 \in \mathbb{R}^n$, parameters $\gamma, \sigma > 0$.
2:for $k = 1, 2, 3, \dots$ do
3:     Choose an index $i_k \in \{1, \dots, b\}$
4:     $x^k \leftarrow \mathrm{prox}_{\gamma g_{i_k}}(z^{k-1} - s^{k-1})$   ▷ impose data-consistency
5:     $z^k \leftarrow \mathsf{D}_\sigma(x^k + s^{k-1})$   ▷ impose prior knowledge
6:     $s^k \leftarrow s^{k-1} + x^k - z^k$
7:end for
Algorithm 1 Incremental Plug-and-Play ADMM (IPA)
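The sketch below gives a direct Python transcription of Algorithm 1. It assumes quadratic per-block data-fidelity terms $g_i(x) = \tfrac{1}{2}\|y_i - A_i x\|_2^2$ so that each partial proximal operator has the closed form later given in (8); the block structure, sizes, and the stand-in denoiser argument are illustrative assumptions.

```python
import numpy as np

def ipa(y_blocks, A_blocks, denoiser, gamma, sigma, num_iter=200, seed=0):
    """Incremental PnP-ADMM (Algorithm 1) for quadratic per-block data-fidelity terms.
    `denoiser(v, sigma)` stands in for a pre-trained AWGN denoiser D_sigma."""
    rng = np.random.default_rng(seed)
    n = A_blocks[0].shape[1]
    x, z, s = np.zeros(n), np.zeros(n), np.zeros(n)
    # Pre-factor (I + gamma * A_i^T A_i) for each block so the partial prox is cheap.
    Ms = [np.eye(n) + gamma * (A_i.T @ A_i) for A_i in A_blocks]
    for _ in range(num_iter):
        i = rng.integers(len(A_blocks))                  # i.i.d. uniform block selection
        x = np.linalg.solve(Ms[i], (z - s) + gamma * (A_blocks[i].T @ y_blocks[i]))  # data-consistency
        z = denoiser(x + s, sigma)                       # prior knowledge
        s = s + x - z                                    # dual update
    return x
```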

Batch PnP algorithms operate on the whole observation vector $y \in \mathbb{R}^m$. We are interested in partial randomized processing of observations by considering the decomposition of $y$ into $b$ blocks $y_1, \dots, y_b$. We thus consider data-fidelity terms of the form

(7)   $g(x) = \frac{1}{b}\sum_{i=1}^{b} g_i(x),$

where each $g_i$ is evaluated only on the subset $y_i$ of the full data $y$.

PnP-ADMM is often impractical when $m$ is very large due to the complexity of computing $\mathrm{prox}_{\gamma g}$. IPA extends stochastic variants of traditional ADMM Wang.Banerjee2012 ; Ouyang.etal2013 ; Suzuki2013 ; Zhong.etal2014 ; Huang.etal2019 by integrating denoisers that are not associated with any regularizer $h$. Its per-iteration complexity is independent of the number of data blocks $b$, since it processes only a single component function at every iteration. IPA can also be implemented as a minibatch algorithm, processing several blocks in parallel at every iteration, thus improving its efficiency on multi-processor hardware architectures (see details in Supplement A).

In principle, IPA can be implemented using different block selection rules. The strategy adopted for our theoretical analysis is the usual one of selecting the indices $i_k$ as independent and identically distributed (i.i.d.) random variables distributed uniformly over $\{1, \dots, b\}$. An alternative would be to proceed in epochs of $b$ consecutive iterations, where at the start of each epoch the set $\{1, \dots, b\}$ is reshuffled and $i_k$ is selected from this ordered set Bertsekas2011 . In some applications, it might also be beneficial to select indices in an online data-adaptive fashion by taking into account the statistical relationships among observations Tian.etal2015 ; Kellman.etal2020 .

Unlike PnP-SGD, IPA does not require smoothness of the functions $g_i$. Instead of computing the partial gradient $\nabla g_i$, as is done in PnP-SGD, IPA evaluates the partial proximal operator $\mathrm{prox}_{\gamma g_i}$. Thus, the maximal benefit of IPA is expected for problems in which $\mathrm{prox}_{\gamma g_i}$ is efficient to evaluate. This is the case for a large number of functions commonly used in computational imaging, compressive sensing, and machine learning (see the extensive discussion on proximal operators in Beck2017a ).

Let us discuss two widely used scenarios. The proximal operator of the $\ell_2$-norm data-fidelity term $g_i(x) = \tfrac{1}{2}\|y_i - A_i x\|_2^2$ has the closed-form solution

(8)   $\mathrm{prox}_{\gamma g_i}(z) = \left(\mathsf{I} + \gamma A_i^{\mathsf{T}} A_i\right)^{-1}\left(z + \gamma A_i^{\mathsf{T}} y_i\right).$

Prior work has extensively discussed efficient strategies for evaluating (8) for a variety of linear operators, including convolutions, partial Fourier transforms, and subsampling masks Afonso.etal2010 ; Wohlberg2016 ; Ramani.Fessler2012a ; Almeida.Figueiredo2013 . As a second example, consider the $\ell_1$ data-fidelity term $g_i(x) = \|y_i - A_i x\|_1$, which is nonsmooth. The corresponding proximal operator has a closed-form solution for any orthogonal operator $A_i$ and can also be efficiently computed in many other settings Beck2017a . We numerically evaluate the effectiveness of IPA with both $\ell_1$- and $\ell_2$-norm data-fidelity terms and deep neural net priors in Section 5.
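The sketch below implements these two proximal operators: the closed form (8) for the quadratic term and the orthogonal-operator case for the $\ell_1$ term (via soft-thresholding in the transformed domain). The matrix sizes and the sanity check are illustrative assumptions.

```python
import numpy as np

def prox_l2(z, A, y, gamma):
    """Closed-form prox of g(x) = 0.5*||y - A x||_2^2, as in (8)."""
    n = A.shape[1]
    return np.linalg.solve(np.eye(n) + gamma * (A.T @ A), z + gamma * (A.T @ y))

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l1_orth(z, A, y, gamma):
    """Prox of g(x) = ||y - A x||_1 for an orthogonal A (A^T A = A A^T = I)."""
    return A.T @ (y + soft(A @ z - y, gamma))

# quick sanity check of the l1 prox against its defining minimization
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # orthogonal operator
y, z = rng.standard_normal(16), rng.standard_normal(16)
x = prox_l1_orth(z, Q, y, gamma=0.3)
obj = lambda u: 0.5 * np.sum((u - z) ** 2) + 0.3 * np.sum(np.abs(y - Q @ u))
print(obj(x) <= obj(z), prox_l2(z, Q, y, gamma=0.3).shape)   # True (16,)
```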

4 Theoretical Analysis

We now present a theoretical analysis of IPA. We first give an intuitive interpretation of its solutions and then develop our convergence analysis under a set of explicit assumptions.

4.1 Fixed Point Interpretation

IPA cannot be interpreted using the standard tools from convex optimization, since its solution is generally not a minimizer of an objective function. Nonetheless, we develop an intuitive operator-based interpretation. The full technical exposition of the discussion here can be found in Supplement D.

Consider the following set-valued operator

(9)   $\mathsf{G} \,:=\, \gamma\,\partial g + \left(\mathsf{D}_\sigma^{-1} - \mathsf{I}\right),$

where $\partial g$ is the subdifferential of the data-fidelity term and $\mathsf{D}_\sigma^{-1}$ is the inverse operator of the denoiser $\mathsf{D}_\sigma$. Note that this inverse operator exists even when $\mathsf{D}_\sigma$ is not one-to-one Eckstein.Bertsekas1992 ; Ryu.Boyd2016 . By characterizing the fixed points of PnP algorithms, it can be shown that their solutions can be interpreted as vectors in the zero set $\mathrm{zer}(\mathsf{G}) := \{x \in \mathbb{R}^n : 0 \in \mathsf{G}(x)\}$.

Consider the following two sets

$\mathrm{zer}(\partial g) := \{x \in \mathbb{R}^n : 0 \in \partial g(x)\} \quad\text{and}\quad \mathrm{fix}(\mathsf{D}_\sigma) := \{x \in \mathbb{R}^n : \mathsf{D}_\sigma(x) = x\},$

where $\mathrm{zer}(\partial g)$ is the set of all critical points of the data-fidelity term and $\mathrm{fix}(\mathsf{D}_\sigma)$ is the set of all fixed points of the denoiser. Intuitively, the fixed points of $\mathsf{D}_\sigma$ correspond to all vectors that are left unchanged by the denoiser and can therefore be interpreted as vectors that are noise-free according to the denoiser.

If $x^* \in \mathrm{zer}(\partial g) \cap \mathrm{fix}(\mathsf{D}_\sigma)$, then $0 \in \mathsf{G}(x^*)$, which implies that $x^*$ is one of the solutions. Hence, any vector that minimizes a convex data-fidelity term and is noise-free according to $\mathsf{D}_\sigma$ is in the solution set. On the other hand, when $\mathrm{zer}(\partial g) \cap \mathrm{fix}(\mathsf{D}_\sigma) = \varnothing$, a solution $x^* \in \mathrm{zer}(\mathsf{G})$ corresponds to an equilibrium point between the two sets.

This interpretation of PnP highlights one important aspect that is often overlooked in the literature, namely that, unlike in the traditional formulation (1), the regularization in PnP depends on both the denoiser parameter $\sigma$ and the penalty parameter $\gamma$, with both influencing the solution through different mechanisms. Hence, the best performance is obtained by jointly tuning both parameters for a given experimental setting. In the special case of $\mathsf{D}_\sigma = \mathrm{prox}_{\gamma h}$ with $\gamma = \sigma^2$, we have $\mathrm{zer}(\mathsf{G}) = \mathrm{zer}(\partial g + \partial h)$, which corresponds to the optimization formulation (1), whose solutions are independent of $\gamma$.

4.2 Convergence Analysis

Our analysis requires three assumptions that jointly serve as sufficient conditions.

Assumption 1.

Each $g_i$ is proper, closed, convex, and Lipschitz continuous with constant $L_i > 0$. We define the largest Lipschitz constant as $L := \max_i L_i$.

This assumption is commonly adopted in nonsmooth optimization and is equivalent to the existence of a global upper bound on the subgradients Boyd.Vandenberghe2008 ; Yu2013 ; Ouyang.etal2013 . It is satisfied by a large number of functions, such as the $\ell_1$-norm. The $\ell_2$-norm data-fidelity term also satisfies Assumption 1 when it is evaluated over a bounded subset of $\mathbb{R}^n$. We next state our assumption on the denoiser $\mathsf{D}_\sigma$.

Assumption 2.

The residual $\mathsf{I} - \mathsf{D}_\sigma$ of the denoiser is firmly nonexpansive.

We review firm nonexpansiveness and other related concepts in the supplement. Firmly nonexpansive operators are a subset of nonexpansive operators (those that are Lipschitz continuous with constant one). A simple strategy to obtain a firmly nonexpansive operator is to create a $\tfrac{1}{2}$-averaged operator $\tfrac{1}{2}(\mathsf{I} + \mathsf{N})$ from a nonexpansive operator $\mathsf{N}$ Parikh.Boyd2014 . The residual $\mathsf{I} - \mathsf{D}_\sigma$ is firmly nonexpansive if and only if $\mathsf{D}_\sigma$ is firmly nonexpansive, which implies that the proximal operator automatically satisfies Assumption 2 Parikh.Boyd2014 .
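The averaging construction mentioned above is easy to verify numerically. The following spot-check, with a random linear map standing in for a nonexpansive operator (an illustrative assumption), confirms the firm nonexpansiveness inequality $\|\mathsf{F}x - \mathsf{F}y\|_2^2 \le \langle \mathsf{F}x - \mathsf{F}y,\, x - y\rangle$ for the $\tfrac{1}{2}$-averaged operator $\mathsf{F}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 20))
N = W / np.linalg.norm(W, 2)          # spectral norm 1  =>  nonexpansive
F = 0.5 * (np.eye(20) + N)            # 1/2-averaged     =>  firmly nonexpansive

for _ in range(5):
    x, y = rng.standard_normal(20), rng.standard_normal(20)
    d = F @ (x - y)
    print(np.dot(d, d) <= np.dot(d, x - y) + 1e-12)   # True on every draw
```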

The rationale for stating Assumption 2 for the residual $\mathsf{I} - \mathsf{D}_\sigma$ is based on our interest in residual deep neural nets. The success of residual learning in the context of image restoration is well known Zhang.etal2017 . Prior work has also shown that Lipschitz-constrained residual networks yield excellent performance without sacrificing stable convergence Ryu.etal2019 ; Sun.etal2019b . Additionally, there has recently been an explosion of techniques for training Lipschitz-constrained and firmly nonexpansive deep neural nets Ryu.etal2019 ; Terris.etal2020 ; Miyato.etal2018 ; Fazlyab.etal2019 .
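A common building block of such Lipschitz-constrained training schemes is spectral normalization, which rescales each layer by an estimate of its largest singular value. The sketch below shows the basic power-iteration estimate on a dense matrix standing in for a network layer; it is an illustration of the general idea, not the exact procedure used in the paper.

```python
import numpy as np

def spectral_norm(W, num_iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(num_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

W = np.random.default_rng(1).standard_normal((64, 64))
W_sn = W / spectral_norm(W)              # normalized "layer" has spectral norm ~1
print(np.linalg.norm(W_sn, 2))           # close to 1
```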

Assumption 3.

The operator $\mathsf{G}$ in (9) is such that $\mathrm{zer}(\mathsf{G}) \neq \varnothing$. There also exists a constant $R > 0$ such that $\|\mathsf{D}_\sigma(x)\|_2 \leq R$ for all $x \in \mathbb{R}^n$.

The first part of the assumption simply ensures the existence of a solution. The boundedness assumption often holds in practice, as many denoisers have bounded range spaces. In particular, this is true for a number of image denoisers whose outputs live within the bounded subset $[0, 1]^n$.

We will state our convergence results in terms of the operator $\mathsf{T}$ defined as

(10)   $\mathsf{T} \,:=\, \tfrac{1}{2}\left(\mathsf{I} + (2\,\mathrm{prox}_{\gamma g} - \mathsf{I})(2\,\mathsf{D}_\sigma - \mathsf{I})\right).$

Both IPA and traditional PnP-ADMM can be interpreted as algorithms for computing an element of $\mathrm{fix}(\mathsf{T})$, which is equivalent to finding an element of $\mathrm{zer}(\mathsf{G})$ (see details in Supplement D).

We are now ready to state our main result on IPA.

Theorem 1.

Run IPA for $t \geq 1$ iterations with random i.i.d. block selection under Assumptions 1-3 using a fixed penalty parameter $\gamma > 0$. Then, the sequence of iterates generated by IPA satisfies

(11)

where is a positive constant.

In order to contextualize this result, we also review the convergence of the traditional PnP-ADMM.

Theorem 2.

Run PnP-ADMM for $t \geq 1$ iterations under Assumptions 1-3 using a fixed penalty parameter $\gamma > 0$. Then, the sequence of iterates generated by PnP-ADMM satisfies

(12)

Both proofs are provided in the supplement. The proof of Theorem 2 is a minor modification of the analysis in Ryu.etal2019 , obtained by relaxing the strong convexity assumption in Ryu.etal2019 to Assumption 1 and replacing the assumption that the residual $\mathsf{I} - \mathsf{D}_\sigma$ is a contraction in Ryu.etal2019 with Assumption 2. Theorem 2 establishes that the iterates of PnP-ADMM satisfy $\|x^k - \mathsf{T}x^k\|_2 \to 0$ as $t \to \infty$. Since $\mathrm{prox}_{\gamma g}$ is firmly nonexpansive and $2\mathsf{D}_\sigma - \mathsf{I}$ is nonexpansive, the operator $\mathsf{T}$ is $\tfrac{1}{2}$-averaged, and the Krasnosel'skii-Mann theorem (see Section 5.2 in Bauschke.Combettes2017 ) directly implies that $\|v^k - \mathsf{T}v^k\|_2 \to 0$ and that the sequence $v^k$ converges to an element of $\mathrm{fix}(\mathsf{T})$.

Theorem 1 establishes that IPA approximates the solution obtained by the full PnP-ADMM up to an error term that depends on the penalty parameter $\gamma$. One can precisely control the accuracy of IPA by setting $\gamma$ to a desired level. In practice, $\gamma$ can be treated as a hyperparameter and tuned to maximize performance for a suitable image quality metric, such as SNR or SSIM. Our numerical results in Section 5 corroborate that excellent SNR performance of IPA can be achieved without taking $\gamma$ to zero, which simplifies the practical applicability of IPA.

Finally, note that our analysis can also be performed under the assumptions adopted in Ryu.etal2019 , namely that the functions $g_i$ are strongly convex and that the residual $\mathsf{I} - \mathsf{D}_\sigma$ is a contraction. Such an analysis leads to the following statement

(13)

which establishes linear convergence, up to an error term, to a fixed point of $\mathsf{T}$. A proof of (13) is provided in the supplement. As corroborated by our simulations in Section 5, the actual convergence of IPA holds even more broadly than suggested by both sets of sufficient conditions. This motivates further analysis of IPA under more relaxed assumptions, which we leave to future work.

5 Numerical Validation

Recent work has shown the excellent performance of PnP algorithms for smooth data-fidelity terms using advanced denoising priors. Our goal in this section is to extend these studies with simulations validating the effectiveness of IPA for nonsmooth data-fidelity terms and deep neural net priors, as well as demonstrating its scalability to large-scale inverse problems. We consider two applications of the form $y = Ax + e$, where $e$ denotes the noise and $A$ denotes either a random Gaussian matrix in compressive sensing (CS) or the transfer function in intensity diffraction tomography Ling.etal18 .

Our deep neural net prior is based on the DnCNN architecture Zhang.etal2017 , with its batch normalization layers removed so that the Lipschitz constant of the network can be controlled via spectral normalization Sedghi.etal2019 (see details in Supplement G.1). We train a nonexpansive residual network by predicting the noise residual from its noisy input, which means that the residual $\mathsf{I} - \mathsf{D}_\sigma$ satisfies the necessary condition for the firm nonexpansiveness required in Assumption 2. The training data is generated by adding AWGN to images from the BSD400 dataset Martin.etal2001 . The reconstruction quality is quantified using the signal-to-noise ratio (SNR) in dB. We pre-train several deep neural net models as denoisers for different noise levels $\sigma$, spaced at regular intervals, and use the denoiser achieving the best SNR in each experiment.
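The residual training setup described above reduces to generating pairs of noisy inputs and noise targets. The sketch below shows this data generation step; the patch source, patch size, and noise level are illustrative placeholders for the BSD400-based setup used in the paper.

```python
import numpy as np

def make_training_pairs(clean_patches, sigma, seed=0):
    """Generate (noisy input, noise residual) pairs for residual denoiser training.
    `clean_patches` is an array of image patches scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal(clean_patches.shape)
    return clean_patches + noise, noise   # the network learns to predict `noise`

# toy usage with random stand-in "patches"
patches = np.random.default_rng(1).random((16, 40, 40))
noisy, target = make_training_pairs(patches, sigma=5 / 255)
print(np.allclose(noisy - target, patches))   # ideal residual prediction recovers the patch
```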

Figure 1: Illustration of the influence of the penalty parameter $\gamma$ on the convergence of IPA for a DnCNN prior. The average normalized distance to $x^*$ and the SNR (dB) are plotted against the iteration number, with the shaded areas representing the range of values attained over the test images. The accuracy of IPA improves for smaller values of $\gamma$. However, the SNR performance is nearly identical, indicating that in practice IPA can achieve excellent results for a range of fixed values of $\gamma$.

5.1 Integration of Non-smooth Data-Fidelity Terms and Pre-Trained Deep Priors

We first validate the effectiveness of Theorem 1 for nonsmooth data-fidelity terms. The matrix $A$ is generated with i.i.d. zero-mean Gaussian random elements of fixed variance, and the noise $e$ is drawn as a sparse Bernoulli-Gaussian vector with a sparsity ratio of 0.1. This means that, in expectation, ten percent of the elements of $y$ are contaminated by AWGN. The sparse nature of the noise motivates the usage of the $\ell_1$-norm data-fidelity term, since it can effectively mitigate outliers. The nonsmoothness of the $\ell_1$-norm prevents the usage of gradient-based algorithms such as PnP-SGD. On the other hand, the application of IPA is facilitated by efficient strategies for computing the corresponding proximal operator Chambolle.2004 ; Beck.Teboulle2009a .

We fix the measurement ratio $m/n$ and the standard deviation of the AWGN. Twelve standard images from Set12 are used in testing, each resized to a smaller resolution for rapid parameter tuning and testing. We quantify the convergence accuracy using the normalized distance to the fixed point $x^*$, which is expected to approach zero as IPA converges.
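For reference, the two quality metrics used in this section follow their standard definitions; the reference point x_star below stands in for a precomputed fixed point, which is an assumption of this sketch.

```python
import numpy as np

def normalized_distance(x, x_star):
    """Normalized distance to a reference point (here, a precomputed fixed point)."""
    return np.linalg.norm(x - x_star) / np.linalg.norm(x_star)

def snr_db(x_hat, x_true):
    """Signal-to-noise ratio of the reconstruction in dB."""
    return 20 * np.log10(np.linalg.norm(x_true) / np.linalg.norm(x_hat - x_true))

x_true = np.ones(100)
x_hat = x_true + 0.01 * np.random.default_rng(0).standard_normal(100)
print(snr_db(x_hat, x_true), normalized_distance(x_hat, x_true))
```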

Theorem 1 characterizes the convergence of IPA up to a constant error term that depends on $\gamma$. This is illustrated in Fig. 1 for three values of the penalty parameter $\gamma$ with a fixed $\sigma$. The average normalized distance and SNR are plotted against the iteration number and labeled with their respective final values. The shaded areas represent the range of values attained across all test images. IPA is implemented to use a random half of the elements in $y$ in every iteration to impose data-consistency. Fig. 1 shows the improved convergence of IPA to $x^*$ for smaller values of $\gamma$, which is consistent with our theoretical analysis: the final accuracy improves as $\gamma$ is reduced. On the other hand, the SNR values are nearly identical for all three experiments, indicating that in practice different values of $\gamma$ lead to fixed points of similar quality. IPA can thus achieve high-quality results without taking $\gamma$ to zero.

Figure 2: Illustration of the scalability of IPA and several widely used PnP algorithms on problems of different sizes. The two parameters reported for each setting denote the image size and the number of acquired intensity images, respectively. The average SNR is plotted against time in seconds. Both IPA and PnP-SGD use random minibatches of 60 measurements at every iteration, while PnP-ADMM and PnP-FISTA use all the measurements. The figure highlights the fast empirical convergence of IPA compared to PnP-SGD, as well as its ability to address larger problems than PnP-ADMM and PnP-FISTA.
Algorithm      Parameters    Setting 1 (300 measurements)    Setting 2 (600 measurements)    Setting 3 (600 measurements, larger images)
PnP-FISTA      1, 5          22.60 (19.4 min)                22.79 (42.6 min)                23.56 (8.1 hr)
PnP-SGD        1, 5          22.31 (7.1 min)                 22.74 (5.2 min)                 23.42 (44.3 min)
PnP-ADMM       2.5, 1        24.23 (7.4 min)                 24.40 (14.7 min)                25.50 (1.4 hr)
IPA (Ours)     2.5, 1        23.65 (1.7 min)                 23.88 (2 min)                   24.95 (11 min)
Table 1: Final average SNR in dB (runtime) obtained by several PnP algorithms on all test images.

5.2 Scalability in Large-scale Optical Tomography

We now discuss the scalability of IPA on intensity diffraction tomography, which is a data-intensive computational imaging modality Ling.etal18 . The goal is to recover the spatial distribution of the complex permittivity contrast of an object given a set of its intensity-only measurements. In this problem, the measurement operator consists of a set of complex matrices, where each matrix is a convolution corresponding to one acquired intensity measurement. We adopt the smooth $\ell_2$-norm loss as the data-fidelity term to empirically compare the performance of IPA and PnP-SGD on the same problem.
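To make the per-measurement structure concrete, the sketch below evaluates one block of such a data-fidelity term with the convolution applied through the FFT. The kernel, image, and the quadratic form of the loss are illustrative assumptions standing in for the actual tomography transfer functions.

```python
import numpy as np

def conv2d_fft(x_img, h):
    """Circular 2D convolution of x_img with kernel h via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(h, x_img.shape) * np.fft.fft2(x_img)))

def g_i(x_img, h_i, y_i):
    """Per-measurement data-fidelity g_i(x) = 0.5*||y_i - H_i x||^2 with H_i a convolution."""
    return 0.5 * np.sum((y_i - conv2d_fft(x_img, h_i)) ** 2)

rng = np.random.default_rng(0)
x = rng.random((32, 32))
h = rng.standard_normal((5, 5))
y = conv2d_fft(x, h)          # noiseless synthetic intensity measurement
print(g_i(x, h, y))           # 0 for noiseless data
```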

In the simulation, we follow the experimental setup in Ling.etal18 under AWGN corresponding to an input SNR of 20 dB. We select six images from the CAT2000 dataset Borji.etal2015 as our test examples, each cropped to a fixed size. We assume real permittivity functions, but still consider a complex-valued measurement operator that accounts for both absorption and phase Ling.etal18 . Due to the large size of the data, we process the measurements in epochs using minibatches of size 60 (see also Supplement G.2).

Fig. 2 illustrates the evolution of the average SNR against runtime for several PnP algorithms, namely PnP-ADMM, PnP-FISTA, PnP-SGD, and IPA, for different image sizes and total numbers of intensity measurements. The final SNR values as well as the total runtimes are summarized in Table 1. The table highlights the overall best SNR performance in bold and the shortest runtime in light-green. In every iteration, PnP-ADMM and PnP-FISTA use all the measurements, while IPA and PnP-SGD use only a small subset of 60 measurements. IPA thus retains its effectiveness for large numbers of measurements, while the batch algorithms become significantly slower. Moreover, the scalability advantage of IPA over PnP-ADMM becomes more notable as the image size increases. For example, Table 1 highlights the convergence of IPA to 24.95 dB within 11 minutes, while PnP-ADMM takes 1.4 hours to reach a similar SNR value. Note the rapid progress of PnP-ADMM in the first few iterations, followed by slow but steady progress until its convergence to the values reported in Table 1. This behavior of ADMM is well known and has been widely reported in the literature (see Section 3.2.2, "Convergence in Practice," in Boyd.etal2011 ). We also observe faster convergence of IPA compared to both PnP-SGD and PnP-FISTA, further highlighting the potential of IPA to address large-scale problems where partial proximal operators are easy to evaluate.

Another key feature of IPA is its memory efficiency due to its incremental processing of data. The memory considerations in optical tomography include the size of all the variables related to the desired image, the measured data, and the forward model. Table 2 records the total memory (GB) used by IPA and PnP-ADMM for reconstructing a 1024×1024-pixel permittivity image, with the smallest value highlighted in light-green. PnP-ADMM requires 37.63 GB of memory due to its batch processing of the whole dataset, while IPA uses only 3.88 GB, nearly one-tenth of that amount, by adopting incremental processing of data (see also Supplement G.2 for additional examples). In short, our numerical evaluations highlight both the fast and stable convergence and the flexible memory usage of IPA in the context of large-scale optical tomographic imaging.

Variables           PnP-ADMM memory    IPA (Ours) memory
real                9.38 GB            0.94 GB
imaginary           9.38 GB            0.94 GB
                    18.75 GB           1.88 GB
others combined     0.13 GB            0.13 GB
Total               37.63 GB           3.88 GB
Table 2: Per-iteration memory usage for reconstructing 1024×1024 images

6 Conclusion

This work provides several new insights into the widely used PnP methodology in the context of large-scale imaging problems. First, we have proposed IPA as a new incremental PnP algorithm. IPA extends PnP-ADMM to randomized partial processing of measurements and extends traditional optimization-based ADMM by integrating pre-trained deep neural nets. Second, we have theoretically analyzed IPA under a set of realistic assumptions, showing that IPA can approximate PnP-ADMM to a desired precision by controlling the penalty parameter. Third, our simulations highlight the effectiveness of IPA for nonsmooth data-fidelity terms and deep neural net priors, as well as its scalability to large-scale imaging. We observed faster convergence of IPA compared to several baseline PnP methods, including PnP-ADMM and PnP-SGD, when partial proximal operators can be efficiently evaluated. IPA can thus be an effective alternative to existing algorithms for addressing large-scale imaging problems. For future work, we would like to further relax our assumptions and explore distributed variants of IPA to enhance its performance in parallel settings.

Broader Impact

This work is expected to impact the area of computational imaging with potential applications to computational microscopy, computerized tomography, medical imaging, and image restoration. There is a growing need in computational imaging to deal with noisy and incomplete measurements by integrating multiple information sources, including physical information describing the imaging instrument and learned information characterizing the statistical distribution of the desired image. The ability to solve large-scale computational imaging problems has the potential to enable new technological advances in 3D (space), 4D (space + time), or 5D (space, time, and spectrum) applications. These advances might lead to new imaging tools for diagnosing health conditions, understanding biological processes, or inferring properties of complex materials.

Traditionally, imaging relied on linear models and fixed transforms (filtered back projection, the wavelet transform) that are relatively straightforward to understand. Learning-based methods, including our algorithm, have the potential to enable new technological capabilities; yet, they also come with the downside of being much more complex. Their usage might thus lead to unexpected outcomes and surprising results when used by non-experts. While we aim to use our method to enable positive contributions to humanity, one can also imagine unethical usage of large-scale imaging technology. This work focuses on imaging using large-scale algorithms with learned priors, but it might be adopted within broader data science, which might lead to broader impacts that we have not anticipated.

Research presented in this article was supported by NSF award CCF-1813910 and by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20200061DR.

References

  • (1) S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proc. IEEE Global Conf. Signal Process. and Inf. Process. (GlobalSIP), 2013.
  • (2) S. Sreehari, S. V. Venkatakrishnan, B. Wohlberg, G. T. Buzzard, L. F. Drummy, J. P. Simmons, and C. A. Bouman, “Plug-and-play priors for bright field electron tomography and sparse interpolation,” IEEE Trans. Comp. Imag., vol. 2, no. 4, pp. 408–423, December 2016.
  • (3) N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
  • (4) M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration,” IEEE Trans. Image Process., vol. 12, no. 8, pp. 906–916, August 2003.
  • (5) I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, November 2004.
  • (6) J. Bect, L. Blanc-Féraud, G. Aubert, and A. Chambolle, “A $\ell_1$-unified variational framework for image restoration,” in Proc. ECCV, Springer, Ed., New York, 2004, vol. 3024, pp. 1–13.
  • (7) A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • (8) J. Eckstein and D. P. Bertsekas, “On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Mathematical Programming, vol. 55, pp. 293–318, 1992.
  • (9) M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2345–2356, September 2010.
  • (10) M. K. Ng, P. Weiss, and X. Yuan, “Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods,” SIAM J. Sci. Comput., vol. 32, no. 5, pp. 2710–2736, August 2010.
  • (11) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • (12) K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
  • (13) W. Dong, P. Wang, W. Yin, G. Shi, F. Wu, and X. Lu, “Denoising prior driven deep neural network for image restoration,” IEEE Trans. Patt. Anal. and Machine Intell., vol. 41, no. 10, pp. 2305–2318, Oct. 2019.
  • (14) K. Zhang, W. Zuo, and L. Zhang, “Deep plug-and-play super-resolution for arbitrary blur kernels,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 2019, pp. 1671–1681.
  • (15) R. Ahmad, C. A. Bouman, G. T. Buzzard, S. Chan, S. Liu, E. T. Reehorst, and P. Schniter, “Plug-and-play methods for magnetic resonance imaging: Using denoisers for image recovery,” IEEE Signal Processing Magazine, vol. 37, no. 1, pp. 105–116, 2020.
  • (16) K. Wei, A. Aviles-Rivero, J. Liang, Y. Fu, C.-B. Schönlieb, and H. Huang, “Tuning-free plug-and-play proximal algorithm for inverse imaging problems,” in Proc. 37th Int. Conf. Machine Learning (ICML), 2020, arXiv:2002.09611.
  • (17) S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Trans. Comp. Imag., vol. 3, no. 1, pp. 84–98, March 2017.
  • (18) T. Meinhardt, M. Moeller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 1799–1808.
  • (19) G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, “Plug-and-play unplugged: Optimization free reconstruction using consensus equilibrium,” SIAM J. Imaging Sci., vol. 11, no. 3, pp. 2001–2020, 2018.
  • (20) Y. Sun, B. Wohlberg, and U. S. Kamilov, “An online plug-and-play algorithm for regularized image reconstruction,” IEEE Trans. Comput. Imaging, vol. 5, no. 3, pp. 395–408, Sept. 2019.
  • (21) T. Tirer and R. Giryes, “Image restoration by iterative denoising and backward projections,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1220–1234, 2019.
  • (22) A. M. Teodoro, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “A convergent image fusion algorithm using scene-adapted Gaussian-mixture-based denoising,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 451–463, Jan. 2019.
  • (23) E. K. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” in Proc. 36th Int. Conf. Machine Learning (ICML), Long Beach, CA, USA, June 2019, pp. 5546–5557.
  • (24) E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.
  • (25) D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
  • (26) A. Beck and M. Teboulle, “Fast gradient-based algorithm for constrained total variation image denoising and deblurring problems,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, November 2009.
  • (27) B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
  • (28) K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, August 2007.
  • (29) K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, July 2017.
  • (30) L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
  • (31) H. Wang and A. Banerjee, “Online alternating direction method,” in Proc. 29th Int. Conf. Machine Learning (ICML), Edinburgh, Scotland, UK, June 26-July 1, 2012, pp. 1699–1706.
  • (32) H. Ouyang, N. He, L. Q. Tran, and A. Gray, “Stochastic alternating direction method of multipliers,” in Proc. 30th Int. Conf. Machine Learning (ICML), Atlanta, GA, USA, 16-21 June, 2013, pp. 80–88.
  • (33) T. Suzuki, “Dual averaging and proximal gradient descent for online alternating direction multiplier method,” in Proc. 30th Int. Conf. Machine Learning (ICML), Atlanta, GA, USA, June 2013, pp. 392–400.
  • (34) W. Zhong and J. Kwok, “Fast stochastic alternating direction method of multipliers,” in Proc. 31st Int. Conf. Machine Learning (ICML), Beijing, China, Jun 22-24, 2014, pp. 46–54.
  • (35) F. Huang, S. Chen, and H. Huang, “Faster stochastic alternating direction method of multipliers for nonconvex optimization,” in Proc. 36th Int. Conf. Machine Learning (ICML), Long Beach, CA, USA, June 10-15, 2019, pp. 2839–2848.
  • (36) M. T. McCann, K. H. Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging: A review,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 85–95, 2017.
  • (37) A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: Beyond analytical methods,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 20–36, Jan. 2018.
  • (38) F. Knoll, K. Hammernik, C. Zhang, S. Moeller, T. Pock, D. K. Sodickson, and M. Akcakaya, “Deep-learning methods for parallel magnetic resonance imaging reconstruction: A survey of the current approaches, trends, and issues,” IEEE Signal Process. Mag., vol. 37, no. 1, pp. 128–140, Jan. 2020.
  • (39) G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett, “Deep learning techniques for inverse problems in imaging,” 2020, arXiv:2005.06001.
  • (40) J. Tan, Y. Ma, and D. Baron, “Compressive imaging via approximate message passing with image denoising,” IEEE Trans. Signal Process., vol. 63, no. 8, pp. 2085–2092, Apr. 2015.
  • (41) C. A. Metzler, A. Maleki, and R. Baraniuk, “BM3D-PRGAMP: Compressive phase retrieval based on BM3D denoising,” in Proc. IEEE Int. Conf. Image Proc., 2016.
  • (42) C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, September 2016.
  • (43) A. Fletcher, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Advances in Neural Information Processing Systems 32, Montréal, Canada, Dec 3-8, 2018, pp. 7451–7460.
  • (44) Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
  • (45) S. A. Bigdeli, M. Jin, P. Favaro, and M. Zwicker, “Deep mean-shift priors for image restoration,” in Proc. Advances in Neural Information Processing Systems 31, Long Beach, CA, USA, Dec 4-9, 2017, pp. 763–772.
  • (46) Y. Sun, J. Liu, and U. S. Kamilov, “Block coordinate regularization by denoising,” in Proc. Advances in Neural Information Processing Systems 33, Vancouver, BC, Canada, December 8-14, 2019, pp. 382–392.
  • (47) G. Mataev, M. Elad, and P. Milanfar, “DeepRED: Deep image prior powered by RED,” in Proc. IEEE Int. Conf. Comp. Vis. Workshops (ICCVW), Seoul, South Korea, Oct 27-Nov 2, 2019, pp. 1–10.
  • (48) A. Bora, A. Jalal, E. Price, and A. G. Dimakis, “Compressed sensing using generative priors,” in Proc. 34th Int. Conf. Machine Learning (ICML), Sydney, Australia, Aug. 2017, pp. 537–546.
  • (49) V. Shah and C. Hegde, “Solving linear inverse problems using GAN priors: An algorithm with provable guarantees,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., Calgary, AB, Canada, Apr. 2018, pp. 4609–4613.
  • (50) R. Hyder, V. Shah, C. Hegde, and M. S. Asif, “Alternating phase projected gradient descent with generative priors for solving compressive phase retrieval,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., Brighton, UK, May 2019, pp. 7705–7709.
  • (51) A. Raj, Y. Li, and Y. Bresler, “GAN-based projector for faster recovery in compressed sensing with convergence guarantees,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Seoul, South Korea, Oct 27-Nov 2, 2019, pp. 5601–5610.
  • (52) F. Latorre, A. Eftekhari, and V. Cevher, “Fast and provable ADMM for learning with generative priors,” in Advances in Neural Information Processing Systems 33, Vancouver, BC, USA, December 8-14, 2019, pp. 12027–12039.
  • (53) D. P. Bertsekas, “Incremental proximal methods for large scale convex optimization,” Math. Program. Ser. B, vol. 129, pp. 163–195, 2011.
  • (54) L. Tian, Z. Liu, L. Yeh, M. Chen, J. Zhong, and L. Waller, “Computational illumination for high-speed in vitro fourier ptychographic microscopy,” Optica, vol. 2, no. 10, pp. 904–911, 2015.
  • (55) M. R. Kellman, E. Bostan, N. A. Repina, and L. Waller, “Physics-based learned design: Optimized coded-illumination for quantitative phase imaging,” IEEE Trans. Comput. Imag., vol. 5, no. 3, pp. 344–353, 2020.
  • (56) A. Beck, First-Order Methods in Optimization, chapter The Proximal Operator, pp. 129–177, MOS-SIAM Series on Optimization. SIAM, 2017.
  • (57) B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, January 2016.
  • (58) S. Ramani and J. A. Fessler, “A splitting-based iterative algorithm for accelerated statistical X-ray CT reconstruction,” IEEE Trans. Med. Imaging, vol. 31, no. 3, pp. 677–688, March 2012.
  • (59) M. S. C. Almeida and M. A. T. Figueiredo, “Deconvolving images with unknown boundaries using the alternating direction method of multipliers,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3074–3086, August 2013.
  • (60) E. K. Ryu and S. Boyd, “A primer on monotone operator methods,” Appl. Comput. Math., vol. 15, no. 1, pp. 3–43, 2016.
  • (61) S. Boyd and L. Vandenberghe, “Subgradients,” April 2008, Class notes for Convex Optimization II. http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.
  • (62) Y.-L. Yu, “Better approximation and faster algorithm using the proximal average,” in Proc. Advances in Neural Information Processing Systems 26, 2013.
  • (63) M. Terris, A. Repetti, J.-C. Pesquet, and Y. Wiaux, “Building firmly nonexpansive convolutional neural networks,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, May 2020, pp. 8658–8662.
  • (64) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations (ICLR), 2018.
  • (65) M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accurate estimation of Lipschitz constants for deep neural networks,” in Proc. Advances in Neural Information Processing Systems 33, Vancouver, BC, Canada, Dec. 2019, pp. 11427–11438.
  • (66) H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, 2 edition, 2017.
  • (67) R. Ling, W. Tahir, H. Lin, H. Lee, and L. Tian, “High-throughput intensity diffraction tomography with a computational microscope,” Biomed. Opt. Express, vol. 9, no. 5, pp. 2130–2141, May 2018.
  • (68) H. Sedghi, V. Gupta, and P. M. Long, “The singular values of convolutional layers,” in International Conference on Learning Representations (ICLR), 2019.
  • (69) D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Vancouver, Canada, July 7-14, 2001, pp. 416–423.
  • (70) A. Chambolle, “An algorithm for total variation minimization and applications,” Journal of Mathematical Imaging and Vision, vol. 20, no. 1, pp. 89–97, 2004.
  • (71) A. Borji and L. Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” Comput. Vis. Patt. Recogn. (CVPR) 2015 Workshop on "Future of Datasets", 2015.
  • (72) H. H. Bauschke, R. Goebel, Y. Lucet, and X. Wang, “The proximal average: Basic theory,” SIAM J. Optim., vol. 19, no. 2, pp. 766–785, 2008.
  • (73) R. T. Rockafellar and R. J-B Wets, Variational Analysis, Springer, 1998.
  • (74) S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, 2004.
  • (75) Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 2004.
  • (76) R. T. Rockafellar, Convex Analysis, chapter Conjugate Saddle-Functions and Minimax Theorems, pp. 388–398, Princeton Univ. Press, Princeton, NJ, 1970.
  • (77) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.

Supplementary Material for
“Scalable Plug-and-Play ADMM with Convergence Guarantees”

We adopt the monotone operator theory Ryu.Boyd2016 ; Bauschke.Combettes2017 for a unified analysis of IPA. In Supplement A, we discuss a minibatch variant of IPA that enables parallel evaluation of several component proximal operators. In Supplement B, we present the convergence analysis of IPA. In Supplement C, we analyze the convergence of the algorithm for strongly convex data-fidelity terms and contractive denoisers. In Supplement D, we discuss the interpretation of IPA's fixed points from the perspective of monotone operator theory. For completeness, in Supplement E, we discuss the convergence results for traditional PnP-ADMM Ryu.etal2019 . In Supplement F, we provide the background material used in our analysis. In Supplement G, we provide additional technical details omitted from the main paper due to space, such as the details of our deep neural net architecture and the results of additional simulations.

For the sake of simplicity, this supplement uses $\|\cdot\|$ to denote the standard $\ell_2$-norm in $\mathbb{R}^n$. We will also use $\mathsf{D}$ instead of $\mathsf{D}_\sigma$ to denote the denoiser, thus dropping the explicit notation for $\sigma$.

Appendix A Minibatch Implementation of IPA

1:input: initial values $z^0, s^0 \in \mathbb{R}^n$, parameters $\gamma, \sigma > 0$, minibatch size $p \geq 1$.
2:for $k = 1, 2, 3, \dots$ do
3:     Choose $p$ indices $i_1, \dots, i_p$ from the set $\{1, \dots, b\}$.
4:     $x^k \leftarrow \frac{1}{p}\sum_{j=1}^{p} \mathrm{prox}_{\gamma g_{i_j}}(z^{k-1} - s^{k-1})$   ▷ impose data-consistency
5:     $z^k \leftarrow \mathsf{D}(x^k + s^{k-1})$   ▷ impose prior knowledge
6:     $s^k \leftarrow s^{k-1} + x^k - z^k$
7:end for
Algorithm 2 Incremental Plug-and-Play ADMM (IPA) (Minibatch Version)

Algorithm 2 presents the minibatch version of IPA, which averages several proximal operators evaluated over different data blocks. When the minibatch size $p = 1$, Algorithm 2 reverts to Algorithm 1. The main benefit of minibatch IPA is its suitability for parallel computation of the per-block proximal operators, which can take advantage of multi-processor architectures. The convergence analysis in Theorem 1 can be easily extended to minibatch IPA through a straightforward extension of Lemma 1 in Supplement B.2 to several indices, followed by the steps of the main proof in Supplement B.1.

Minibatch IPA is related to the proximal average approximation of the proximal operator $\mathrm{prox}_{\gamma g}$ Bauschke.etal2008 ; Yu2013 . When Assumption 1 is satisfied, the corresponding approximation error is bounded Yu2013 . Minibatch IPA thus simply uses a minibatch approximation of the proximal average. One implication of this is that, even when the minibatch is exactly equal to the full measurement vector, minibatch IPA is not exact due to the approximation error introduced by the proximal average. However, the resulting approximation error can be made as small as desired by controlling the penalty parameter $\gamma$.
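The sketch below shows the minibatch x-update as an average of per-block proximal operators evaluated at the same point, as in Algorithm 2. The quadratic blocks and their closed-form proximal operators are illustrative assumptions.

```python
import numpy as np

def minibatch_prox_average(v, indices, prox_list):
    """Average of the selected per-block proximal operators evaluated at v."""
    return np.mean([prox_list[i](v) for i in indices], axis=0)

# toy setup with quadratic blocks g_i(x) = 0.5*||y_i - A_i x||^2
rng = np.random.default_rng(0)
n, gamma = 20, 0.5
prox_list = []
for _ in range(4):
    A_i = rng.standard_normal((10, n)) / np.sqrt(10)
    y_i = A_i @ rng.random(n)
    M_i = np.eye(n) + gamma * (A_i.T @ A_i)
    # default arguments capture the per-block quantities
    prox_list.append(lambda v, M=M_i, A=A_i, y=y_i: np.linalg.solve(M, v + gamma * (A.T @ y)))

v = rng.standard_normal(n)
print(minibatch_prox_average(v, indices=[0, 2], prox_list=prox_list).shape)   # (20,)
```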

Appendix B Convergence Analysis of IPA

In this section, we present one of the main results of this paper, namely the convergence analysis of IPA. The fixed-point convergence of averaged operators is well known under the name of the Krasnosel'skii-Mann theorem (see Section 5.2 in Bauschke.Combettes2017 ) and was recently applied to the analysis of PnP-SGD Sun.etal2018a . Additionally, PnP-ADMM was analyzed for strongly convex data-fidelity terms and contractive residual denoisers Ryu.etal2019 . Our analysis here extends these results to IPA by providing an explicit upper bound on its convergence. In Section B.1, we present the main steps of the proof, while in Section B.2 we prove two technical lemmas useful for our analysis.

B.1 Proof of Theorem 1

Supplement D.3 establishes that the operator $\mathsf{T}$ defined in (10) is firmly nonexpansive. Consider any and any , then we have

(14)

where we used the firm nonexpansiveness of and . The direct consequence of (14) is that

We now consider the following two equivalent representations of IPA for some iteration

(15)

where $i_k$ is a random variable uniformly distributed over $\{1, \dots, b\}$, $\mathrm{prox}_{\gamma g_{i_k}}$ is the proximal operator with respect to $g_{i_k}$, and $\mathsf{D}$ is the denoiser. To see the equivalence between the left and the right sides of (15), simply introduce the corresponding auxiliary variable into the right side of (15). It is straightforward to verify that the right side of (15) can also be rewritten as

(16)

Then, for any , we have that

where in the first inequality we used Cauchy-Schwarz and (14), and in the second inequality we used Lemma 2 in Supplement B.2. By taking the conditional expectation on both sides, invoking Lemma 1 in Supplement B.2, and rearranging the terms, we get

Hence, by averaging over iterations and taking the total expectation, we obtain

The final result is obtained by noting that

B.2 Lemmas Useful for the Proof of Theorem 1

This section presents two technical lemmas used in our analysis in Section B.1.

Lemma 1.

Assume that Assumptions 1-3 hold and let $i$ be a uniform random variable over $\{1, \dots, b\}$. Then, we have that

Proof.

Let and for any and . From the optimality conditions for each proximal operator

and

where we used Proposition 7 in Supplement F.2. By using the bound on all the subgradients (due to Assumption 1 and Proposition 8 in Supplement F.2), we obtain

where is the Lipschitz constant of all s and . This inequality directly implies that

Since this inequality holds for every $i \in \{1, \dots, b\}$, it also holds in expectation. ∎

Lemma 2.

Assume that Assumptions 1-3 hold and let the sequence be generated via the iteration (16). Then, for any , we have that

Proof.

The optimality of the proximal operator in (16) implies that there exists such that

By applying to the equality above, we obtain

Additionally, for any and , we have that

Thus, by using Assumption 3 and the bounds on all the subgradients (due to Assumption 1 and Proposition 8 in Supplement F.2), we obtain

Appendix C Analysis of IPA for Strongly Convex Functions

In this section, we analyze IPA under a different set of assumptions, namely the assumptions adopted in Ryu.etal2019 .

Assumption 4.

Each $g_i$ is proper, closed, strongly convex with constant $\mu_i > 0$, and Lipschitz continuous with constant $L_i > 0$. We define the smallest strong convexity constant as $\mu := \min_i \mu_i$ and the largest Lipschitz constant as $L := \max_i L_i$.

This assumption further restricts Assumption 1 in the main paper to strongly convex functions.

Assumption 5.

The residual $\mathsf{I} - \mathsf{D}$ of the denoiser is a contraction. It thus satisfies

$\|(\mathsf{I} - \mathsf{D})(x) - (\mathsf{I} - \mathsf{D})(y)\| \leq \varepsilon\,\|x - y\|$

for all $x, y \in \mathbb{R}^n$, for some constant $0 \leq \varepsilon < 1$.

This assumption replaces Assumption 2 in the main paper by assuming that the residual of the denoiser is a contraction. Note that this can be practically imposed on deep neural net denoisers via spectral normalization Miyato.etal2018 . We can then state the following.

Theorem 3.

Run IPA for $t \geq 1$ iterations with random i.i.d. block selection under Assumptions 3-5 using a fixed penalty parameter $\gamma > 0$. Then, the iterates of IPA satisfy

Proof.

It was shown in Theorem 2 of Ryu.etal2019 that, under Assumptions 4 and 5, we have that

for all , where is given in (10). Hence, when

the operator is a contraction.

Using the reasoning in Supplement B, the sequence can be written as

(17)

Then, for any , we have that