Deep Equilibrium Architectures for Inverse Problems in Imaging

02/16/2021 ∙ by Davis Gilton, et al.

Recent efforts on solving inverse problems in imaging via deep neural networks use architectures inspired by a fixed number of iterations of an optimization method. The number of iterations is typically quite small due to difficulties in training networks corresponding to more iterations; the resulting solvers cannot be run for more iterations at test time without incurring significant errors. This paper describes an alternative approach corresponding to an infinite number of iterations, yielding up to a 4dB PSNR improvement in reconstruction accuracy above state-of-the-art alternatives and where the computational budget can be selected at test time to optimize context-dependent trade-offs between accuracy and computation. The proposed approach leverages ideas from Deep Equilibrium Models, where the fixed-point iteration is constructed to incorporate a known forward model and insights from classical optimization-based reconstruction methods.




1 Introduction

A collection of recent efforts surveyed by [24] considers the problem of using training data to solve inverse problems in imaging. Specifically, imagine we observe a corrupted set of measurements $y$ of an image $x^\star$ according to a measurement operator $A$ with some noise $\varepsilon$ according to

$$y = A x^\star + \varepsilon. \tag{1}$$

Our task is to compute an estimate of $x^\star$ given measurements $y$ and knowledge of $A$. This task is particularly challenging when the inverse problem is ill-posed, i.e., when the system is underdetermined or ill-conditioned, in which case simple methods such as least squares estimation (i.e., $\hat{x} = (A^\top A)^{-1} A^\top y$) may not exist or may produce estimates that are highly sensitive to noise.

(a) MRI reconstruction.
(b) Deep unrolling challenges
Figure 1: (a) PSNR of reconstructed images for an MRI reconstruction problem as a function of iterations used to compute the reconstruction. Unrolled methods are optimized for a fixed computational budget during training, and running additional steps at test time yields a significant drop in performance. Our deep equilibrium methods can achieve the same PSNR for the optimal computational budget of an unrolled method, but can trade slightly more computation time for a significant increase in the PSNR, allowing a user to choose the desired computational budget and reconstruction quality. (b) Standard unrolled deep optimization networks typically require choosing some fixed number of iterates during training. Deviating from this fixed number at test time incurs a significant penalty in PSNR. The forward model here is 8x accelerated single-coil MRI reconstruction, and the unrolled algorithm is unrolled proximal gradient descent with $K$ iterates, labeled PROX-K (Fig. 3). For further experimental details see Section 6.4.

Decades of research have explored geometric models of image structure that can be used to regularize solutions to this inverse problem, including [32, 29, 10] and many others. More recent efforts have focused instead on using large collections of training images to learn effective regularizers.

One particularly popular and effective approach involves augmenting standard iterative inverse problem solvers with learned deep networks. This approach, which we refer to as deep unrolling (DU), is reviewed in Section 2.1. The basic idea is to build an architecture that mimics a small number of steps of an iterative procedure. In practice, the number of steps is quite small (typically 5-10) because of stability, memory, and numerical issues arising in backpropagation. This paper sidesteps this key limitation of deep unrolling methods with a novel approach based on deep equilibrium models (DEMs) [4], which are designed for training arbitrarily deep networks. The result is a novel approach to training networks to solve inverse problems in imaging that yields up to a 4dB improvement in performance above state-of-the-art alternatives and where the computational budget can be selected at test time to optimize context-dependent tradeoffs between accuracy and computation. The key empirical findings, which are detailed in Section 6.4, are illustrated in Fig. 1(a).

1.1 Contributions

This paper presents a fundamentally new approach to machine-learning based methods for solving linear inverse problems in imaging. Unlike most state-of-the-art methods, which are based on unrolling a small number of iterations of an iterative reconstruction scheme (“deep unrolling”), our method is based on deep equilibrium models that correspond to a potentially infinite number of iterations. This framework yields more accurate reconstructions than the current state-of-the-art across a range of inverse problems and gives users the ability to navigate a tradeoff between reconstruction computation time and accuracy at test time; specifically, we observe up to a 4dB improvement in PSNR. Furthermore, because our formulation is based on finding fixed points, we can use standard accelerated fixed point methods to speed test time computations – something that is not possible with deep unrolling methods. In addition, our approach inherits provable convergence guarantees depending on the “base” algorithm used to select a fixed point equation for the deep equilibrium framework. Experimental results also show that our proposed initialization based on pre-training is superior to random initialization, and the proposed approach is more robust to noise than past methods. Overall, the proposed approach is a unique bridge between conventional fixed-point methods in numerical analysis and deep learning and optimization-based solvers for inverse problems.

2 Relationship to Prior Work

2.1 Review of Deep Unrolling Methods

Deep Unrolling methods describe approaches to solving inverse problems which consist of a fixed number of architecturally identical “blocks,” which are often inspired by a particular optimization algorithm. These methods represent the current state of the art in MRI reconstruction, with most top submissions to the fastMRI challenge [23] being some sort of unrolled net. Unrolled networks have seen success in other imaging tasks, e.g. low-dose CT [36], light-field photography [9], and emission tomography [21].

We describe here a specific variant of deep unrolling methods based on gradient descent, although many variants exist based on alternative optimization or fixed point iteration schemes [24]. Suppose we have a known regularization function $r$ that could be applied to an image $x$; e.g. in Tikhonov regularization, $r(x) = \frac{\lambda}{2}\|x\|_2^2$ for some scalar $\lambda > 0$. Then we could compute an image estimate $\hat{x}$ by solving the optimization problem

$$\hat{x} = \arg\min_x \ \tfrac{1}{2}\|y - Ax\|_2^2 + r(x). \tag{2}$$

If $r$ is differentiable, this can be accomplished via gradient descent. That is, we start with an initial estimate $x^{(0)}$ such as $x^{(0)} = A^\top y$ and choose a step size $\eta > 0$, such that for iteration $k = 0, 1, 2, \ldots$, we set

$$x^{(k+1)} = x^{(k)} + \eta A^\top (y - A x^{(k)}) - \eta \nabla r(x^{(k)}), \tag{3}$$

where $\nabla r$ is the gradient of the regularizer.

The basic idea behind deep unrolled methods is to fix some number of iterations $K$ (typically $K$ ranges from 5 to 10), declare that $x^{(K)}$ will be our estimate $\hat{x}$, and model $\nabla r$ with a neural network, denoted $R_\theta$, that can be learned with training data. We assume that all instances of $R_\theta$ have identical weights, although other works also explore non-weight-tied variants [1]. For example, we may define the unrolled gradient descent estimate to be $\hat{x} = x^{(K)}$ where $x^{(0)} = A^\top y$ and for $k = 0, \ldots, K-1$

$$x^{(k+1)} = x^{(k)} + \eta A^\top (y - A x^{(k)}) - \eta R_\theta(x^{(k)}).$$

Training attempts to minimize the cost function $\sum_{i=1}^n \tfrac{1}{2}\|\hat{x}_i - x_i^\star\|_2^2$ with respect to the network parameters $\theta$. This form of training is often called “end-to-end”; that is, we do not train the network representing $R_\theta$ in isolation, but rather on the quality of the resulting estimate $\hat{x}$, which depends on the forward model $A$.

The number of iterations is kept small for two reasons. First, at deployment, these systems are optimized to compute image estimates quickly – a desirable property we wish to retain in developing new methods. Second, it is challenging to train deep unrolled networks for many iterations due to memory limitations of GPUs because the memory required to calculate the backpropagation updates scales linearly with the number of unrolled iterations.

As a workaround, consider training such systems for a small number of iterations, then extracting the learned regularizer gradient $R_\theta$, and using it within a gradient descent algorithm until convergence (i.e. for more iterations than used in training). Our numerical results highlight how poorly this method performs in practice (Section 6.4). Choosing the number of iterations (and hence the test time computational budget) at training time is essential. As we illustrate in Fig. 1(b), one cannot deviate from this choice after training and expect good performance.

2.2 Review of Deep Equilibrium Models

In [4], the authors propose a method for training arbitrarily-deep networks defined by repeated application of a single layer. Imagine a $K$-layer network with input $y$ and weights $\theta$. Letting $x^{(k)}$ denote the output of the $k$th hidden layer, we may write

$$x^{(k+1)} = f_\theta^{(k)}(x^{(k)}; y),$$

where $k$ is the layer index and $f_\theta^{(k)}$ is a nonlinear transformation such as inner products followed by the application of a nonlinear activation function. Recent prior work explored forcing this transformation at each layer to be the same (i.e. weight tying), so that $f_\theta^{(k)} = f_\theta$ for all $k$, and showed that such networks still yield competitive performance [11, 3]. Under weight tying, we have the recursion

$$x^{(k+1)} = f_\theta(x^{(k)}; y), \tag{4}$$

and the output as $K \to \infty$ is a fixed point of the operator $f_\theta(\cdot; y)$. [4] show this fixed point can be computed without explicitly building an infinitely-deep network, and that the network weights can be learned using implicit differentiation and constant memory, bypassing computation and numerical stability issues associated with related techniques on large-scale problems [8, 15]. This past work focused on sequence models and time-series tasks, assuming that each $f_\theta$ was a single layer of a neural network, and did not explore the image reconstruction task that is the focus of this paper.

2.3 Relationship to Plug-and-Play and Regularization by Denoising Methods

Initiated by [34], a collection of methods based on the plug-and-play (PnP) framework have been proposed, allowing denoising algorithms to be used as priors for model-based image reconstruction. Given an inverse problem setting, one writes the reconstructed image as the solution to an optimization problem in which the objective function is the sum of a data-fit term and a regularization term. Applying alternating directions method of multipliers (ADMM, [6, 7]) to this optimization problem, we arrive at a collection of update equations, one of which has the form

$$\arg\min_x \ \tfrac{1}{2}\|x - z\|_2^2 + \alpha\, r(x),$$

where $r$ is the regularization function and $\alpha > 0$ is a scaling parameter; the optimization problem in this update equation can be considered as “denoising” the image $z$. PnP methods replace this explicit optimization step with a “plugged-in” denoising method. Notably, some state-of-the-art denoisers (e.g., BM3D [10] and U-nets [28]) do not have an explicit $r$ associated with them, but nevertheless empirically work well within the PnP framework. [27] propose Regularization by Denoising (RED) based on a similar philosophy, but consider a regularizer of the form

$$r(x) = \tfrac{1}{2}\, x^\top \big(x - \rho(x)\big),$$

where $\rho$ corresponds to an image denoising function.

Recent PnP and RED efforts focus on using training data to learn denoisers [22, 30, 38, 33, 17]. In contrast to the unrolling methods described in Section 2.1, these methods are not trained end-to-end; rather, the denoising module is trained independently of the inverse problem (i.e. $A$) at hand. As described by [24], decoupling the training of the learned component from $A$ results in a reconstruction system that is flexible and does not need to be re-trained for each new $A$, but can require substantially more training samples to achieve the reconstruction accuracy of a method trained end-to-end for a specific $A$.

3 Proposed Approach

Our approach is to choose a function $f_\theta$ so that a fixed point of the recursion in (4) is a good estimate of $x^\star$ for a given $y$. To the best of our knowledge, this is the first example of the application of DEMs to image reconstruction tasks. We describe choices of $f_\theta$ (and hence of the implicit infinite-depth neural network architecture) that explicitly account for the forward model $A$ and generally for the inverse problem at hand. Specifically, we propose choosing $f_\theta$ based on the ideas of deep unrolling (Section 2.1) extended to an infinite number of iterations – a paradigm that has been beyond the reach of all previous deep unrolling methods.

Let $x^{(k)}$ denote the image estimate after $k$ rounds of an iterative algorithm (as in Section 2.1) and $y$ denote the observation from (1) under forward operator $A$. We consider three specific choices of $f_\theta$ below, but note that many other options are possible.

Deep equilibrium gradient descent (DE-Grad):

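Connecting the gradient descent iterations in (3) with the deep equilibrium model in (4), we let

$$f_\theta(x; y) = x + \eta A^\top (y - Ax) - \eta R_\theta(x), \tag{5}$$

and note that a fixed point of $f_\theta(\cdot; y)$, denoted $x^{(\infty)}$, corresponds to the limit of $x^{(k)}$ as $k \to \infty$ in (3), with $\nabla r$ replaced by the learned network $R_\theta$.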

Figure 2: Deep Equilibrium Gradient Descent (DE-Grad)
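As an illustration of the DE-Grad fixed-point iteration, here is a minimal NumPy sketch. It assumes the iteration map has the standard gradient-step form x + ηAᵀ(y − Ax) − ηR(x), and it swaps the learned network for a hypothetical stand-in R(x) = λx (the gradient of a Tikhonov regularizer) so that the fixed point can be checked against a closed-form solution. The function name `de_grad_fixed_point`, the stopping rule, and all numerical values are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def de_grad_fixed_point(A, y, R, eta, tol=1e-10, max_iter=10_000):
    """Iterate x <- x + eta*A^T(y - A x) - eta*R(x) until the relative change is below tol."""
    x = A.T @ y  # initial estimate x^(0) = A^T y
    for k in range(1, max_iter + 1):
        x_next = x + eta * A.T @ (y - A @ x) - eta * R(x)
        if np.linalg.norm(x_next - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_next, k
        x = x_next
    return x, max_iter

# Toy problem (all values are illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10)) / np.sqrt(40)
x_true = rng.standard_normal(10)
y = A @ x_true + 0.01 * rng.standard_normal(40)

lam = 0.1
R = lambda x: lam * x                      # stand-in for a trained network R_theta
L = np.linalg.eigvalsh(A.T @ A).max()      # largest eigenvalue of A^T A
eta = 1.0 / (L + lam)                      # step size small enough for a contraction

x_hat, n_iter = de_grad_fixed_point(A, y, R, eta)
# With this stand-in R, the fixed point is the Tikhonov solution.
x_ref = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)
```

The iteration count is chosen by the stopping rule at run time rather than fixed in advance, which is the property that distinguishes this scheme from a K-step unrolled network.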

Deep equilibrium proximal gradient (DE-Prox):

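Proximal gradient methods [25] use a proximal operator associated with a function $r$:

$$\operatorname{prox}_{r}(x) = \arg\min_{u} \ \tfrac{1}{2}\|u - x\|_2^2 + r(u),$$

and use this to solve the optimization problem in (2) via the iterates

$$x^{(k+1)} = \operatorname{prox}_{\eta r}\big(x^{(k)} + \eta A^\top (y - A x^{(k)})\big),$$

where $\eta > 0$ is a step size. Inspired by this optimization framework, we choose $f_\theta$ in the DEM framework as

$$f_\theta(x; y) = \operatorname{prox}_{\eta r}\big(x + \eta A^\top (y - Ax)\big).$$

Following [19], we replace $\operatorname{prox}_{\eta r}$ with a trainable network $R_\theta$, leading to the fixed-point iterations:

$$x^{(k+1)} = R_\theta\big(x^{(k)} + \eta A^\top (y - A x^{(k)})\big). \tag{8}$$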
Figure 3: Deep Equilibrium Proximal Gradient (DE-Prox)
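The DE-Prox iteration can be sketched in the same way. The example below assumes the map x ↦ R(x + ηAᵀ(y − Ax)) and uses soft-thresholding (the proximal operator of a scaled ℓ1 norm) as a hypothetical stand-in for the trained network; with that choice the loop is the classical ISTA algorithm. Names and constants are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1, used here as a stand-in for R_theta."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def de_prox_fixed_point(A, y, R, eta, tol=1e-10, max_iter=50_000):
    """Iterate x <- R(x + eta*A^T(y - A x)) until the relative change is below tol."""
    x = A.T @ y
    for k in range(1, max_iter + 1):
        x_next = R(x + eta * A.T @ (y - A @ x))
        if np.linalg.norm(x_next - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_next, k
        x = x_next
    return x, max_iter

# Toy sparse-recovery problem (illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10)) / np.sqrt(40)
x_true = np.zeros(10)
x_true[[1, 4, 7]] = [1.0, -2.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(40)

eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()   # step size 1/L
lam = 0.01
R = lambda v: soft_threshold(v, eta * lam)

x_hat, n_iter = de_prox_fixed_point(A, y, R, eta)
# How far x_hat is from satisfying x = R(x + eta*A^T(y - A x)).
residual = np.linalg.norm(R(x_hat + eta * A.T @ (y - A @ x_hat)) - x_hat)
```

Replacing `R` with a trained denoising network leaves the loop unchanged; only convergence then depends on the Lipschitz properties of that network.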

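Deep equilibrium alternating direction method of multipliers (DE-ADMM):

The Alternating Direction Method of Multipliers (ADMM, [6]) reformulates the optimization problem (2) as

$$\min_{x, z} \ \tfrac{1}{2}\|y - Ax\|_2^2 + r(z) \quad \text{subject to} \quad z = x.$$

The augmented Lagrangian (in its “scaled form” – see [6]) associated with this problem is given by

$$\mathcal{L}_\alpha(x, z, u) = \tfrac{1}{2}\|y - Ax\|_2^2 + r(z) + \tfrac{1}{2\alpha}\|x - z + u\|_2^2,$$

where $u$ is an additional auxiliary variable and $\alpha > 0$ is a user-defined parameter. The ADMM iterates are then

$$\begin{aligned}
z^{(k+1)} &= \arg\min_z \ \mathcal{L}_\alpha(x^{(k)}, z, u^{(k)}), \\
x^{(k+1)} &= \arg\min_x \ \mathcal{L}_\alpha(x, z^{(k+1)}, u^{(k)}), \\
u^{(k+1)} &= u^{(k)} + x^{(k+1)} - z^{(k+1)}.
\end{aligned}$$

Here the $z$- and $x$-updates simplify as

$$\begin{aligned}
z^{(k+1)} &= \operatorname{prox}_{\alpha r}\big(x^{(k)} + u^{(k)}\big), \\
x^{(k+1)} &= (I + \alpha A^\top A)^{-1}\big(\alpha A^\top y + z^{(k+1)} - u^{(k)}\big).
\end{aligned}$$

As in the DE-Prox approach, $\operatorname{prox}_{\alpha r}$ can be replaced with a learned network, denoted $R_\theta$. Making this replacement, and substituting $z^{(k+1)}$ directly into the expressions for $x^{(k+1)}$ and $u^{(k+1)}$ gives:

$$\begin{aligned}
x^{(k+1)} &= (I + \alpha A^\top A)^{-1}\big(\alpha A^\top y + R_\theta(x^{(k)} + u^{(k)}) - u^{(k)}\big), \\
u^{(k+1)} &= u^{(k)} + x^{(k+1)} - R_\theta(x^{(k)} + u^{(k)}).
\end{aligned}$$

Note that the updates for $x^{(k+1)}$ and $u^{(k+1)}$ depend only on the previous iterates $x^{(k)}$ and $u^{(k)}$. Therefore, the above updates can be interpreted as fixed-point iterations on the joint variable $q = (x, u)$, where the iteration map $f_\theta(q; y)$ is implicitly defined as the map that satisfies

$$q^{(k+1)} = f_\theta(q^{(k)}; y). \tag{11}$$

Here we take the estimated image to be $x^{(\infty)}$, where $q^{(\infty)} = (x^{(\infty)}, u^{(\infty)})$ is the fixed-point of $f_\theta(\cdot; y)$, i.e., the limit of $q^{(k)}$ as $k \to \infty$.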

Figure 4: Deep Equilibrium Alternating Direction Method of Multipliers (DE-ADMM)
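The joint fixed-point view of DE-ADMM can be sketched as follows. This example assumes one particular convention that the text does not pin down: a scaled augmented Lagrangian with penalty (1/(2α))‖x − z + u‖², which gives the updates z = R(x + u), x = (I + αAᵀA)⁻¹(αAᵀy + z − u), and u ← u + x − z. The stand-in for the learned network is the proximal operator of a Tikhonov regularizer, chosen only so the result can be compared with a closed-form solution; every name and constant is hypothetical.

```python
import numpy as np

def de_admm_fixed_point(A, y, R, alpha, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration on the joint variable (x, u) of the ADMM-style updates."""
    n = A.shape[1]
    # Explicit inverse is acceptable for this small toy problem.
    M = np.linalg.inv(np.eye(n) + alpha * A.T @ A)
    Aty = A.T @ y
    x, u = Aty.copy(), np.zeros(n)
    for k in range(1, max_iter + 1):
        z = R(x + u)                                 # "denoising" step
        x_next = M @ (alpha * Aty + z - u)           # data-consistency step
        u_next = u + x_next - z                      # scaled dual update
        change = np.sqrt(np.sum((x_next - x) ** 2) + np.sum((u_next - u) ** 2))
        scale = 1.0 + np.sqrt(np.sum(x ** 2) + np.sum(u ** 2))
        x, u = x_next, u_next
        if change <= tol * scale:
            return x, u, k
    return x, u, max_iter

# Toy problem (illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10)) / np.sqrt(40)
y = A @ rng.standard_normal(10) + 0.01 * rng.standard_normal(40)

alpha, lam = 1.0, 0.1
R = lambda v: v / (1.0 + alpha * lam)   # prox of alpha*(lam/2)*||.||^2, stand-in for R_theta

x_hat, u_hat, n_iter = de_admm_fixed_point(A, y, R, alpha)
x_ref = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)
```

Only the `x` component of the joint fixed point is used as the reconstruction; `u` is auxiliary state carried between iterations.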

4 Calculating forward passes and gradient updates

Given a choice of $f_\theta$, we confront the following obstacles. (1) Forward calculation: given an observation $y$ and weights $\theta$, we need to compute the fixed point of $f_\theta(\cdot; y)$ efficiently. (2) Training: given a collection of training samples $\{(x_i^\star, y_i)\}_{i=1}^n$, we need to find the optimal $\theta$.

4.1 Calculating Fixed-Points

Both training and inference in a DEM require calculating a fixed point $x^{(\infty)}$ of the iteration map $f_\theta(\cdot; y)$ given some initial point $x^{(0)}$. The most straightforward approach is to use fixed-point iterations given in (4). Convergence of this scheme for specific $f_\theta$ designs is discussed in Section 5.

However, fixed-point iterations may not converge quickly. By viewing unrolled deep networks as fixed-point iterations, we inherit the ability to accelerate inference with standard fixed-point accelerators. To our knowledge, this work is the first time iterative inversion methods incorporating deep networks have been accelerated using fixed-point accelerators.

Anderson Acceleration:

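Anderson acceleration [35] utilizes past iterates to identify promising directions to move during the iterations. (Anderson acceleration for Deep Equilibrium models was introduced in a NeurIPS tutorial by [16].) This takes the form of identifying a vector $\alpha^{(k)} \in \mathbb{R}^m$ and setting, for a mixing parameter $\beta > 0$,

$$x^{(k+1)} = (1 - \beta) \sum_{i=0}^{m-1} \alpha_i^{(k)} x^{(k-i)} + \beta \sum_{i=0}^{m-1} \alpha_i^{(k)} f_\theta(x^{(k-i)}; y).$$

We find $\alpha^{(k)}$ by solving the optimization problem:

$$\alpha^{(k)} = \arg\min_{\alpha} \ \|G\alpha\|_2^2 \quad \text{subject to} \quad \mathbf{1}^\top \alpha = 1, \tag{12}$$

with $G$ a matrix with $m$ columns, where the $i$th column is the (vectorized) residual $f_\theta(x^{(k-i)}; y) - x^{(k-i)}$. The optimization problem in (12) admits a least-squares solution, adding negligible computational overhead when $m$ is small.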
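To make the acceleration step concrete, the following is a small NumPy sketch of Anderson acceleration for a generic map `f`. It assumes the common formulation in which the mixing weights minimize the norm of a weighted combination of the last `m` residuals subject to the weights summing to one, solved through the regularized normal equations. The function name, the relative regularization `reg`, the mixing parameter `beta`, and the linear test map are all assumptions made for this example.

```python
import numpy as np

def anderson(f, x0, m=5, beta=1.0, reg=1e-10, tol=1e-10, max_iter=1000):
    """Anderson-accelerated fixed-point iteration; returns (x, number of f evaluations)."""
    xs, fs = [x0], [f(x0)]                      # recent iterates and their images under f
    for k in range(1, max_iter + 1):
        res = fs[-1] - xs[-1]
        if np.linalg.norm(res) <= tol * (1.0 + np.linalg.norm(xs[-1])):
            return xs[-1], k
        X = np.stack(xs, axis=1)                # columns: past iterates
        F = np.stack(fs, axis=1)                # columns: f applied to past iterates
        G = F - X                               # columns: residuals f(x) - x
        GtG = G.T @ G
        n = GtG.shape[0]
        GtG += reg * (np.trace(GtG) / n) * np.eye(n)   # mild relative regularization
        # Minimizer of ||G a||^2 subject to sum(a) = 1 is proportional to (G^T G)^{-1} 1.
        a = np.linalg.solve(GtG, np.ones(n))
        a /= a.sum()
        x_new = (1.0 - beta) * (X @ a) + beta * (F @ a)
        xs.append(x_new)
        fs.append(f(x_new))
        xs, fs = xs[-m:], fs[-m:]               # keep only the last m entries
    return xs[-1], max_iter

# Contractive linear test map f(x) = M x + b (illustrative assumption).
rng = np.random.default_rng(0)
n = 20
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = Q @ np.diag(np.linspace(0.1, 0.9, n)) @ Q.T   # symmetric, spectral norm 0.9
b = rng.standard_normal(n)
f = lambda x: M @ x + b

x_acc, n_evals = anderson(f, np.zeros(n))
x_exact = np.linalg.solve(np.eye(n) - M, b)
```

Setting `m=1` and `beta=1.0` reduces the routine to plain fixed-point iteration, which gives a convenient baseline for comparing evaluation counts.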

An important practical consideration is that accelerating fixed-point iterations arising from optimization algorithms with auxiliary variables (like ADMM) is non-trivial. In these cases, standard fixed-point iterations may be preferred for their simplicity of implementation. This is the approach we take in finding fixed-points of our proposed DE-ADMM model.

4.2 Gradient Calculation

In this section, we provide a brief overview of the training procedure used to train all networks in Section 6.4. We use stochastic gradient descent to find network parameters $\theta$ that (locally) minimize a cost function of the form $\sum_{i=1}^n \ell(x_i^{(\infty)}, x_i^\star)$ where $\ell(\cdot, \cdot)$ is a given loss function, $x_i^\star$ is the $i$th training image with paired measurements $y_i$, and $x_i^{(\infty)}$ denotes the reconstructed image given as the fixed-point of $f_\theta(\cdot; y_i)$. For our image reconstruction experiments, we use the mean-squared error (MSE) loss:

$$\ell(x^{(\infty)}, x^\star) = \tfrac{1}{2}\|x^{(\infty)} - x^\star\|_2^2.$$

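To simplify the calculations below, we consider gradients of the cost function with respect to a single training measurement/image pair, which we denote $(y, x^\star)$. Following [4], we leverage the fact that $x^{(\infty)}$ is a fixed-point of $f_\theta(\cdot; y)$ to find the gradient of the loss with respect to the network parameters $\theta$ without backpropagating through an arbitrarily-large number of fixed-point iterations. We summarize this approach below.

First, abbreviating $\ell(x^{(\infty)}, x^\star)$ by $\ell$, then by the chain rule

$$\frac{\partial \ell}{\partial \theta} = \left(\frac{\partial x^{(\infty)}}{\partial \theta}\right)^{\top} \frac{\partial \ell}{\partial x^{(\infty)}}. \tag{14}$$

Since we assume $\ell$ is the MSE loss, the gradient $\frac{\partial \ell}{\partial x^{(\infty)}}$ is simply the residual between $x^\star$ and the equilibrium point: $\frac{\partial \ell}{\partial x^{(\infty)}} = x^{(\infty)} - x^\star$.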

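In order to compute $\frac{\partial x^{(\infty)}}{\partial \theta}$ we start with the fixed point equation: $x^{(\infty)} = f_\theta(x^{(\infty)}; y)$. Implicitly differentiating this equation with respect to $\theta$ and rearranging gives

$$\frac{\partial x^{(\infty)}}{\partial \theta} = \left(I - \frac{\partial f_\theta(x; y)}{\partial x}\Big|_{x = x^{(\infty)}}\right)^{-1} \frac{\partial f_\theta(x^{(\infty)}; y)}{\partial \theta}.$$

Plugging this expression into (14) gives

$$\frac{\partial \ell}{\partial \theta} = \left(\frac{\partial f_\theta(x^{(\infty)}; y)}{\partial \theta}\right)^{\top} \left(I - \frac{\partial f_\theta(x; y)}{\partial x}\Big|_{x = x^{(\infty)}}\right)^{-\top} (x^{(\infty)} - x^\star).$$

This converts the memory-intensive task of backpropagating through many iterations of $f_\theta$ to the problem of calculating an inverse Jacobian-vector product. To approximate the inverse Jacobian-vector product, first we define the vector $a^{(\infty)}$ by

$$a^{(\infty)} = \left(I - \frac{\partial f_\theta(x; y)}{\partial x}\Big|_{x = x^{(\infty)}}\right)^{-\top} (x^{(\infty)} - x^\star).$$

Following [16], we note that $a^{(\infty)}$ is a fixed point of the equation

$$a = \left(\frac{\partial f_\theta(x; y)}{\partial x}\Big|_{x = x^{(\infty)}}\right)^{\top} a + (x^{(\infty)} - x^\star), \tag{16}$$

and the same machinery used to calculate the fixed point $x^{(\infty)}$ may be used to calculate $a^{(\infty)}$. For analysis purposes, we note if $a^{(0)} = 0$, simple fixed-point iterations on (16) may be represented by the Neumann series:

$$a^{(\infty)} = \sum_{p=0}^{\infty} \left[\left(\frac{\partial f_\theta(x; y)}{\partial x}\Big|_{x = x^{(\infty)}}\right)^{\top}\right]^{p} (x^{(\infty)} - x^\star). \tag{17}$$

Convergence of the above Neumann series is discussed in Section 5.

Conventional autodifferentiation tools permit quickly computing the vector-Jacobian products in (16) and (17). Once an accurate approximation to $a^{(\infty)}$ is calculated, the gradient in (14) is given by

$$\frac{\partial \ell}{\partial \theta} = \left(\frac{\partial f_\theta(x^{(\infty)}; y)}{\partial \theta}\right)^{\top} a^{(\infty)}.$$
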
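The gradient calculation process is summarized in the following steps, assuming a fixed point $x^{(\infty)}$ of $f_\theta(\cdot; y)$ is known:

  1. Compute the residual $r = x^{(\infty)} - x^\star$.

  2. Compute an approximate fixed-point $a^{(\infty)}$ of the equation $a = \left(\frac{\partial f_\theta}{\partial x}\right)^{\top} a + r$, with the Jacobian evaluated at $x^{(\infty)}$.

  3. Compute $\frac{\partial \ell}{\partial \theta} = \left(\frac{\partial f_\theta(x^{(\infty)}; y)}{\partial \theta}\right)^{\top} a^{(\infty)}$.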
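The steps above can be sketched on a toy problem where the Jacobians are available in closed form. The example assumes a hypothetical iteration map f(x; θ) = tanh(Wx + θ) with a fixed matrix W of spectral norm 0.5 (so the map is contractive) and a half-squared-error loss; the implicit gradient is then compared against central finite differences. In practice the Jacobian-transpose products would come from an autodifferentiation library rather than explicit matrices, so this is only an illustration of the bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))
W *= 0.5 / np.linalg.norm(W, 2)          # spectral norm 0.5, so f is contractive
x_star = rng.standard_normal(n)          # toy "ground truth"
theta = rng.standard_normal(n)           # toy parameters (a bias vector)

def f(x, theta):
    return np.tanh(W @ x + theta)

def fixed_point(theta, iters=200):
    x = np.zeros(n)
    for _ in range(iters):
        x = f(x, theta)
    return x

def loss(theta):
    return 0.5 * np.sum((fixed_point(theta) - x_star) ** 2)

def implicit_grad(theta, iters=200):
    x_inf = fixed_point(theta)
    d = 1.0 - np.tanh(W @ x_inf + theta) ** 2   # tanh' evaluated at the fixed point
    J_x = d[:, None] * W                        # Jacobian of f with respect to x
    J_theta = np.diag(d)                        # Jacobian of f with respect to theta
    r = x_inf - x_star                          # step 1: residual
    a = np.zeros(n)
    for _ in range(iters):                      # step 2: iterate a <- J_x^T a + r
        a = J_x.T @ a + r
    return J_theta.T @ a                        # step 3: gradient with respect to theta

g_implicit = implicit_grad(theta)

# Central finite differences through the full fixed-point solve, for comparison.
h = 1e-6
g_fd = np.array([(loss(theta + h * e) - loss(theta - h * e)) / (2 * h) for e in np.eye(n)])
```

Memory use in `implicit_grad` does not depend on how many forward iterations were needed to reach the fixed point, which is the practical benefit over backpropagating through an unrolled loop.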

5 Convergence Theory

Here we study convergence of the proposed deep equilibrium models to a fixed-point at inference time, i.e., given the iteration map $f_\theta(\cdot; y)$ we give conditions that guarantee the convergence of the iterates $x^{(k+1)} = f_\theta(x^{(k)}; y)$ to a fixed-point $x^{(\infty)}$ as $k \to \infty$.

Classical fixed-point theory ensures that the iterates converge to a unique fixed-point if the iteration map $f_\theta(\cdot; y)$ is contractive, i.e., if there exists a constant $0 \le c < 1$ such that $\|f_\theta(x; y) - f_\theta(x'; y)\| \le c\,\|x - x'\|$ for all $x, x'$. Below we give conditions on the learned component $R_\theta$ (replacing the gradient or proximal mapping of a regularizer) used in the DE-Grad, DE-Prox and DE-ADMM models that ensure the resulting iteration map is contractive and thus the fixed-point iterations for these models converge.

In particular, following [30], we assume that the learned component $R_\theta$ satisfies the following condition: there exists an $\epsilon > 0$ such that for all $x, x'$ we have

$$\|(R_\theta - I)(x) - (R_\theta - I)(x')\| \le \epsilon\,\|x - x'\|, \tag{19}$$

where $(R_\theta - I)(x) := R_\theta(x) - x$. In other words, we assume the map $R_\theta - I$ is $\epsilon$-Lipschitz.

If we interpret $R_\theta$ as a denoising or de-artifacting network, then $R_\theta - I$ is the map that outputs the noise or artifacts present in a degraded image. In practice, often $R_\theta$ is implemented with a residual “skip-connection”, such that $R_\theta = I + N_\theta$, where $N_\theta$ is, e.g., a deep U-net. Therefore, in this case, (19) is equivalent to assuming the trained network $N_\theta$ is $\epsilon$-Lipschitz.

We prove the following theorem in the supplement.

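Theorem 1 (Convergence of DE-Grad).

Assume that $R_\theta - I$ is $\epsilon$-Lipschitz (19), and let $L = \lambda_{\max}(A^\top A)$ and $\mu = \lambda_{\min}(A^\top A)$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalue, respectively. If the step-size parameter $\eta > 0$ is such that $\eta < \frac{1}{1+L}$, then the DE-Grad iteration map $f_\theta(\cdot; y)$ defined in (5) satisfies

$$\|f_\theta(x; y) - f_\theta(x'; y)\| \le \big(1 - \eta(1 + \mu) + \eta\epsilon\big)\,\|x - x'\|$$

for all $x, x'$. The coefficient on the right-hand side is less than $1$ if $\epsilon < 1 + \mu$, in which case the iterates of DE-Grad converge.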


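Let $f_\theta(x; y)$ be the iteration map for DE-Grad. The Jacobian of $f_\theta$ with respect to $x$ is given by

$$\partial_x f_\theta(x; y) = I - \eta A^\top A - \eta\,\partial_x R_\theta(x),$$

where $\partial_x R_\theta(x)$ is the Jacobian of $R_\theta$ with respect to $x$. To prove $f_\theta(\cdot; y)$ is contractive it suffices to show $\|\partial_x f_\theta(x; y)\| < 1$ for all $x$, where $\|\cdot\|$ denotes the spectral norm. Towards this end, we have

$$\begin{aligned}
\|\partial_x f_\theta(x; y)\| &= \|(1 - \eta) I - \eta A^\top A - \eta(\partial_x R_\theta(x) - I)\| \\
&\le \|(1 - \eta) I - \eta A^\top A\| + \eta\,\|\partial_x R_\theta(x) - I\| \\
&\le \max_i |1 - \eta(1 + \lambda_i)| + \eta\epsilon,
\end{aligned} \tag{20}$$

where $\lambda_i$ denotes the $i$th eigenvalue of $A^\top A$, and in the final inequality (20) we used our assumption that the map $R_\theta - I$ is $\epsilon$-Lipschitz, and therefore the spectral norm of its Jacobian $\partial_x R_\theta(x) - I$ is bounded by $\epsilon$.

Finally, by our assumption $\eta < \frac{1}{1+L}$ where $L = \lambda_{\max}(A^\top A)$, we have $\eta(1 + \lambda_i) < 1$ for all $i$, which implies $1 - \eta(1 + \lambda_i) > 0$ for all $i$. Therefore, the maximum in (20) is obtained at $\lambda_i = \mu$, which gives

$$\|\partial_x f_\theta(x; y)\| \le 1 - \eta(1 + \mu) + \eta\epsilon.$$

This shows $f_\theta(\cdot; y)$ is $\gamma$-Lipschitz with $\gamma = 1 - \eta(1 + \mu) + \eta\epsilon$, proving the claim.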
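The bound in Theorem 1 can be checked numerically in a simple special case. The sketch below assumes the DE-Grad map x + ηAᵀ(y − Ax) − ηR(x) with a hypothetical linear stand-in R(x) = Bx, chosen so that R − I has Lipschitz constant exactly ε; the map is then affine and its Lipschitz constant is the spectral norm of its Jacobian. The bound being tested, 1 − η(1 + μ) + ηε, is the one obtained from the triangle-inequality argument in the proof. All matrices and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((40, n)) / np.sqrt(40)

eps = 0.3
C = rng.standard_normal((n, n))
B = np.eye(n) + eps * C / np.linalg.norm(C, 2)   # R(x) = B x, so ||B - I||_2 = eps

eigs = np.linalg.eigvalsh(A.T @ A)
mu, L = eigs.min(), eigs.max()
eta = 0.9 / (1.0 + L)                            # satisfies eta < 1/(1 + L)

# Jacobian of the affine map x + eta*A^T(y - A x) - eta*B x.
J = np.eye(n) - eta * A.T @ A - eta * B
lipschitz = np.linalg.norm(J, 2)                 # exact Lipschitz constant of the map
bound = 1.0 - eta * (1.0 + mu) + eta * eps       # coefficient claimed by the theorem
```

Here `lipschitz` should never exceed `bound`, and `bound` is below one because ε = 0.3 is smaller than 1 + μ.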

Convergence of PnP approaches PnP-Prox and PnP-ADMM was studied in [30]. At inference time, the proposed DE-Prox and DE-ADMM methods are equivalent to the corresponding PnP method but with a retrained denoising network $R_\theta$. Therefore, the convergence results in [30] apply directly to DE-Prox and DE-ADMM. To keep the paper self-contained, we restate these results below, specialized to the case of the quadratic data-fidelity term assumed in (2).

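Theorem 2 (Convergence of DE-Prox).

Assume that $R_\theta - I$ is $\epsilon$-Lipschitz (19), and let $L = \lambda_{\max}(A^\top A)$ and $\mu = \lambda_{\min}(A^\top A)$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalue, respectively. Then the DE-Prox iteration map $f_\theta(\cdot; y)$ defined in (8) is contractive if the step-size parameter $\eta$ satisfies

$$\frac{1}{\mu(1 + 1/\epsilon)} < \eta < \frac{2}{L} - \frac{1}{L(1 + 1/\epsilon)}.$$

Such an $\eta$ exists if $\epsilon < \frac{2\mu}{L - \mu}$.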

See Theorem 1 of [30].
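For a given problem, the admissible step-size range in Theorem 2 is easy to compute. The helper below is a sketch that assumes the interval has the form 1/(μ(1 + 1/ε)) < η < 2/L − 1/(L(1 + 1/ε)), which is what follows from requiring max(|1 − ημ|, |1 − ηL|)·(1 + ε) < 1; both function names and the example constants are hypothetical.

```python
def de_prox_step_size_interval(mu, L, eps):
    """Return (lower, upper) bounds on eta, or None when no admissible step size exists."""
    if mu <= 0 or eps <= 0:
        return None
    if L > mu and eps >= 2.0 * mu / (L - mu):
        return None
    lower = 1.0 / (mu * (1.0 + 1.0 / eps))
    upper = 2.0 / L - 1.0 / (L * (1.0 + 1.0 / eps))
    return lower, upper

def contraction_factor(eta, mu, L, eps):
    """Upper bound on the Lipschitz constant of x -> R(x - eta * grad of the data term)."""
    return max(abs(1.0 - eta * mu), abs(1.0 - eta * L)) * (1.0 + eps)

# Example values (illustrative assumptions).
mu, L, eps = 0.5, 2.0, 0.3
lower, upper = de_prox_step_size_interval(mu, L, eps)
eta_mid = 0.5 * (lower + upper)
factor_inside = contraction_factor(eta_mid, mu, L, eps)
factor_outside = contraction_factor(lower - 0.01, mu, L, eps)
```

When μ = 0, as happens whenever the forward operator has a nontrivial nullspace, the helper returns `None`, which mirrors the limitation of this result for undersampled problems.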

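Theorem 3 (Convergence of DE-ADMM).

Assume that $R_\theta - I$ is $\epsilon$-Lipschitz (19) with $\epsilon < 1$, and let $L = \lambda_{\max}(A^\top A)$ and $\mu = \lambda_{\min}(A^\top A)$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalue, respectively. Then the iteration map $f_\theta(\cdot; y)$ for DE-ADMM defined in (11) is contractive if the ADMM step-size parameter $\alpha$ satisfies

$$\frac{\epsilon}{(1 + \epsilon - 2\epsilon^2)\,\mu} < \alpha.$$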

See Corollary 1 of [30].

Unlike the convergence result for DE-Grad in Theorem 1, the convergence results for DE-Prox and DE-ADMM in Theorem 2 and Theorem 3 make the assumption that $\mu > 0$, i.e., $A$ has a trivial nullspace. This condition is satisfied for certain inverse problems, such as denoising or deblurring, but violated in many others, including compressed sensing and undersampled MRI. However, in practice we observe that the iterates of DE-Prox and DE-ADMM still appear to converge even in situations where $A$ has a nontrivial nullspace, indicating this assumption may be stronger than necessary.

Finally, an important practical concern when training deep equilibrium models is whether the fixed-point iterates used to compute gradients (as detailed in Section 4.2) will converge. Specifically, the gradient of the loss at the training pair $(y, x^\star)$ involves computing the truncated Neumann series in (17). This series converges if the Jacobian $\partial_x f_\theta(x; y)$ has spectral norm less than $1$ when evaluated at any $x$, which is true if and only if $f_\theta(\cdot; y)$ is contractive. Therefore, the same conditions in Theorems 1-3 that ensure the iteration map is contractive also ensure that the Neumann series in (17) used to compute gradients converges.

6 Experimental Results

6.1 Comparison Methods and Inverse Problems

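Method groups: TV (no training data) | Plug-n-Play (U-net denoiser): PnP-Prox, PnP-ADMM | RED (U-net denoiser): RED-ADMM | Deep Unrolled Methods Trained End-to-End: DU-Grad, DU-Prox, DU-ADMM, PNeumann | Deep Equilibrium (Ours): DE-Grad, DE-Prox, DE-ADMM

| Problem    | Metric | TV    | PnP-Prox | PnP-ADMM | RED-ADMM | DU-Grad | DU-Prox | DU-ADMM | PNeumann | DE-Grad | DE-Prox | DE-ADMM |
|------------|--------|-------|----------|----------|----------|---------|---------|---------|----------|---------|---------|---------|
| Deblur (1) | PSNR   | 26.79 | 24.86    | 25.95    | 25.35    | 28.17   | 27.57   | 28.14   | 28.73    | 29.26   | 30.17   | 30.02   |
| Deblur (1) | SSIM   | 0.93  | 0.90     | 0.91     | 0.89     | 0.93    | 0.93    | 0.94    | 0.94     | 0.95    | 0.96    | 0.96    |
| Deblur (2) | PSNR   | 31.31 | 33.14    | 33.43    | 33.25    | 34.17   | 34.22   | 35.38   | 35.47    | 38.74   | 39.60   | 38.35   |
| Deblur (2) | SSIM   | 0.97  | 0.97     | 0.97     | 0.97     | 0.98    | 0.98    | 0.99    | 0.99     | 0.99    | 0.99    | 0.99    |
| CS (8x)    | PSNR   | 15.30 | 4.61     | 4.04     | 5.32     | 18.45   | 20.38   | 19.81   | 20.25    | 21.89   | 22.31   | 21.66   |
| CS (8x)    | SSIM   | 0.46  | 0.26     | 0.23     | 0.34     | 0.79    | 0.85    | 0.76    | 0.83     | 0.85    | 0.86    | 0.85    |
| MRI (8x)   | PSNR   | 24.74 | 27.45    | 26.69    | 26.82    | 29.01   | 29.10   | 28.82   | 29.38    | 31.32   | 31.31   | 31.25   |
| MRI (8x)   | SSIM   | 0.76  | 0.84     | 0.83     | 0.83     | 0.84    | 0.84    | 0.82    | 0.85     | 0.89    | 0.89    | 0.87    |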

Table 1: Median Test PSNR and SSIM comparison across reconstruction approaches and problems; for each setting, the best PSNR and SSIMs are in bold.
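For readers who want to reproduce the PSNR metric reported in the table, a minimal helper is sketched below. It uses the standard definition 10·log10(peak² / MSE); the `data_range` argument (the assumed peak signal value) is a choice made for this example, since the normalization used in the experiments is not stated here.

```python
import numpy as np

def psnr(x_hat, x_ref, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reconstruction and a reference image."""
    x_hat = np.asarray(x_hat, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    mse = np.mean((x_hat - x_ref) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# A constant error of 0.1 on a unit-range image gives MSE = 0.01.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
value_db = psnr(estimate, reference)
```

Because the scale is logarithmic, a 4 dB gain corresponds to reducing the mean-squared error by a factor of 10^0.4, roughly 2.5.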

Our numerical experiments include comparisons with a variety of models and methods. Total-variation Regularized Least Squares (TV) is an important baseline that does not use any training data but rather leverages geometric models of image structure [29, 31, 5]. The PnP and RED methods are described in Section 2.3; we consider both the original ADMM variant of [34] PnP-ADMM and a proximal gradient PnP-Prox method as described in [30]. We utilize the ADMM formulation of RED. Deep Unrolled methods (DU) are described in Section 2.1; we consider DU using gradient descent, proximal gradient, and ADMM. The preconditioned Neumann network [14] represents the state of the art in unrolled approaches but does not have simple Deep Equilibrium or Plug-and-Play analogues.

We compare the above approaches across three inverse problems: Gaussian deblurring (Deblur), 8× Gaussian compressed sensing (CS), and 8× accelerated Cartesian single-coil MRI reconstruction (MRI). The compressed sensing and single-coil MRI measurements are corrupted with additive Gaussian noise. Deblur (1) and Deblur (2) denote blurred measurements corrupted with additive Gaussian noise at two different noise levels.

For deblurring and compressed sensing, we utilize a subset of the Celebrity Faces with Attributes (CelebA) dataset [18], which consists of centered human faces. We train on a subset of 10000 of the training images. All images are resized to 128×128. For the single-coil MRI problem, we use a random subset of size 2000 of the fastMRI single-coil knee dataset [37] for training. We trim the fully-sampled images so they are 320×320 pixels.

6.2 Architecture Specifics

For our learned network, we utilize a U-Net architecture [28] with some modifications. First, we have removed all instance normalization layers. For both the CelebA and fastMRI datasets, we train six U-Net denoisers, each at a different noise variance, on the training split.

For the CelebA set, to ensure contractivity of the learned component, we add spectral normalization to all layers [4], ensuring that each layer has a Lipschitz constant bounded above by 1. This normalization is enforced during pretraining as well as during the Deep Equilibrium training phase.
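As a rough illustration of what spectral normalization does to a single layer, the sketch below estimates the largest singular value of a dense weight matrix by power iteration and rescales the matrix so that its spectral norm is approximately one. The networks in the paper are convolutional U-Nets, so the dense matrix, the function name, and the iteration count here are simplifying assumptions rather than a description of the actual training code.

```python
import numpy as np

def spectral_normalize(W, n_iter=500, seed=0):
    """Return (W / sigma, sigma), where sigma estimates the largest singular value of W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)      # approximate left singular vector
        v = W.T @ u
        v /= np.linalg.norm(v)      # approximate right singular vector
    sigma = u @ W @ v               # Rayleigh-quotient estimate of the top singular value
    return W / sigma, sigma

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
W_sn, sigma = spectral_normalize(W)
true_sigma = np.linalg.norm(W, 2)
```

If every layer is rescaled this way and the activations are 1-Lipschitz, the composed network has a Lipschitz constant of at most (approximately) one, which is the property the contractivity argument relies on.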

We found that adding spectral normalization resulted in significant PSNR drops on the fastMRI dataset, and so did not use spectral normalization for the fastMRI U-Nets. Instead, we initialized the U-Nets before pretraining by sampling kernel weights from a Gaussian distribution. Empirically, this initialization provides sufficient expressive power, but without causing a lack of contractivity, which can cause significant problems during training.

Further details on settings, parameter choices, and data may be found in the appendix and in our publicly-available code.

6.3 Parameter Tuning and Pretraining

The proposed approaches to solving inverse problems via Deep Equilibrium are all based on some iterative optimization algorithm. Each of these iterative optimization algorithms has its own set of hyperparameters to choose, e.g., the step size $\eta$ in DE-Grad, plus any parameters used to calculate the initial estimate $x^{(0)}$.

We choose to fix all algorithm-specific hyperparameters prior to training a Deep Equilibrium network, for clarity and ease of training. We perform a grid search over algorithm-specific hyperparameters, testing the performance of the untrained Deep Equilibrium network on a held-out test set.

Tuning hyperparameters requires choosing a particular $R_\theta$ during tuning. We use an $R_\theta$ that has been pretrained for Gaussian denoising. Pretraining can be done on the target dataset (e.g., training on MRI images directly) or using an independent dataset (e.g., the BSD500 image dataset [20]). We use the former approach in our experiments. In the supplemental materials we show that pretraining provides a small improvement in reconstruction accuracy over random initialization.

Because we initialize our learned components with denoisers, the initial setup of our method exactly corresponds to tuning a PnP approach with a deep denoiser. Training adapts the iteration map to a particular inverse problem and data distribution. We note that our approach may be used to adapt any iterative optimization framework satisfying the conditions in Section 4, e.g., solvers for RED [27].

6.4 Main Results

We present the main reconstruction accuracy comparison in Table 1. Each entry for Deep Equilibrium (DE), Regularization by Denoising (RED), and Plug-and-Play (PnP) approaches is the result of running fixed-point iterations until the relative change between iterations falls below a fixed tolerance. During training, all DE models were limited to 50 forward updates. The DU models are tested at the number of iterations for which they were trained and all parameters for TV reconstructions (including number of TV iterations) are cross-validated to maximize PSNR. Performance as a function of iteration is shown in Figs. 1(a), 5(a), and 5(b), with example reconstructions in Fig. 6. Further example reconstructions are available for qualitative evaluation in Appendix 8. We observe our DE reconstructions improve reconstruction quality beyond the number of iterations they were trained for.

We observe our DE-based approaches consistently outperform DU approaches across choices of $f_\theta$. Among choices of iterative reconstruction architectures for $f_\theta$, DE-Prox is a front-runner. However, some of the differences may be due to DE-ADMM not being accelerated, while DE-Prox and DE-Grad leverage Anderson acceleration.

(a) Deblurring
(b) CS
Figure 5: Iterations vs. reconstruction PSNR for DE-Prox and competing methods for (a) deblurring and (b) compressed sensing. MRI results are in Fig. 1(a). The deep unrolled prox grad was trained for 10 iterations. In (b), the PSNR of Plug-and-Play ProxGrad was below 5dB for all iterations. In all examples, deep unrolling is only effective at the number of iterations for which it is trained, whereas deep equilibrium achieves higher PSNR across a broad range of iterations, allowing a user to trade off computation time and accuracy.
(a) Ground truth
(b) IFFT (), PSNR = 22.7 dB
(c) DU-Prox, PSNR = 30.1 dB
(d) DE-Prox, PSNR = 32.2 dB
Figure 6: MRI reconstruction example. Best viewed digitally.

Across problems, the DE approach is generally competitive with the end-to-end trained DU solvers. For two of three problems, our approach requires no more computation to be competitive with a DU network, with increasing advantage to our approach with further computation.

6.5 Effect of Pre-Training

Here we compare the effect of initializing the learned component in our deep equilibrium models with a pretrained denoiser versus initializing with random weights. We use identical hyperparameters for both initialization methods.

We present our results on Deep Equilibrium Proximal Gradient Descent (DE-Prox) and Deep Equilibrium ADMM (DE-ADMM) in Figure 7. We observe an improvement in reconstruction quality when utilizing our pretraining method compared to a random initialization. We also note that pretraining enables a simpler choice of algorithm-specific hyperparameters. For example, with random initialization, choosing the proper internal step size for DE-Prox would require training several different DE-Prox instances, and choosing the correct step size based on validation set, in addition to any other parameters, such as the learning rate used during training.

Figure 7: (a) Comparison of learned DE-Prox reconstruction quality across three different inverse problems: Deblurring (Blur), compressed sensing (CS), and undersampled MRI reconstruction (MRI). In our experiments, initializing with a pretrained denoiser routinely offered as good or better reconstruction quality (in terms of PSNR) than a random initialization. (b) Noise sensitivity comparison between DU-Prox and DE-Prox. The forward model used is MRI reconstruction, and here $\sigma$ corresponds to the level of Gaussian noise added to observations.

6.6 Noise Sensitivity

We observe empirically that the Deep Equilibrium approach to training achieves competitive reconstruction quality and increased flexibility with respect to allocating computation budget. Recent work in deep inversion has questioned these methods’ robustness to noise and unexpected inputs [2, 26, 13].

To examine whether the Deep Equilibrium approach is brittle to simple changes in the noise distribution, we varied the level of Gaussian noise added to the observations at test time and observed the effect on reconstruction quality. Fig. 7(b) demonstrates that the Deep Equilibrium model DE-Prox is more robust to variation in the noise level than the analogous Deep Unrolled approach DU-Prox. The forward model used in Fig. 7(b) is MRI reconstruction.
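The test-time noise sweep described above can be sketched as follows. This is a minimal illustration rather than the code used for Fig. 7: the function names `psnr` and `noise_sweep`, the identity forward model, the identity "solver", and the specific noise levels are all hypothetical stand-ins, and a trained reconstruction network would take the place of `reconstruct`.

```python
import math
import random


def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def noise_sweep(reconstruct, measure, ground_truth, sigmas, seed=0):
    """Return {sigma: PSNR} for a fixed solver at each test-time noise level."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        # Add i.i.d. Gaussian noise of standard deviation sigma to the measurements.
        y = [m + rng.gauss(0.0, sigma) for m in measure(ground_truth)]
        results[sigma] = psnr(reconstruct(y), ground_truth)
    return results


# Hypothetical stand-ins: identity forward model and identity "solver".
identity = lambda v: list(v)
x_true = [0.5 + 0.4 * math.sin(0.1 * i) for i in range(1000)]
curve = noise_sweep(identity, identity, x_true, sigmas=[0.01, 0.05, 0.1])
```

Because the solver is held fixed while only the noise level changes, the resulting curve isolates sensitivity to a mismatch between training and test noise, which is the comparison made between DE-Prox and DU-Prox.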

7 Conclusions

This paper illustrates non-trivial quantitative benefits to using implicitly-defined infinite-depth networks for solving linear inverse problems in imaging. Other recent work has focused on such implicit networks akin to the deep equilibrium models considered here (e.g. [12]). Whether these models could lead to additional advances in image reconstruction remains an open question for future work. Furthermore, while the exposition in this work focused on linear inverse problems, nonlinear inverse problems may be solved with iterative approaches just as well. The conditions under which deep equilibrium methods proposed here may be used on such iterative approaches are an active area of investigation.


  • [1] J. Adler and O. Öktem (2018) Learned primal-dual reconstruction. IEEE transactions on medical imaging 37 (6), pp. 1322–1332. Cited by: §2.1.
  • [2] V. Antun, F. Renna, C. Poon, B. Adcock, and A. C. Hansen (2020) On instabilities of deep learning in image reconstruction and the potential costs of ai. Proceedings of the National Academy of Sciences. Cited by: §6.6.
  • [3] S. Bai, J. Z. Kolter, and V. Koltun (2018) Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682. Cited by: §2.2.
  • [4] S. Bai, J. Z. Kolter, and V. Koltun (2019) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 690–701. Cited by: §1, §2.2, §4.2, §6.2.
  • [5] A. Beck and M. Teboulle (2009) Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE transactions on image processing 18 (11), pp. 2419–2434. Cited by: §6.1.
  • [6] S. Boyd, N. Parikh, and E. Chu (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc. Cited by: §2.3, §3.
  • [7] S. H. Chan, X. Wang, and O. A. Elgendy (2016) Plug-and-play admm for image restoration: fixed-point convergence and applications. IEEE Transactions on Computational Imaging 3 (1), pp. 84–98. Cited by: §2.3.
  • [8] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. Cited by: §2.2.
  • [9] I. Y. Chun, Z. Huang, H. Lim, and J. Fessler (2020) Momentum-net: fast and convergent iterative neural network for inverse problems. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.1.
  • [10] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing 16 (8), pp. 2080–2095. Cited by: §1, §2.3.
  • [11] R. Dabre and A. Fujita (2019) Recurrent stacking of layers for compact neural machine translation models. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6292–6299. Cited by: §2.2.
  • [12] L. El Ghaoui, F. Gu, B. Travacca, and A. Askari (2019) Implicit deep learning. arXiv preprint arXiv:1908.06315. Cited by: §7.
  • [13] M. Genzel, J. Macdonald, and M. März (2020) Solving inverse problems with deep neural networks–robustness included?. arXiv preprint arXiv:2011.04268. Cited by: §6.6.
  • [14] D. Gilton, G. Ongie, and R. Willett (2019) Neumann networks for linear inverse problems in imaging. IEEE Transactions on Computational Imaging. Cited by: §6.1.
  • [15] E. Haber and L. Ruthotto (2017) Stable architectures for deep neural networks. Inverse Problems 34 (1), pp. 014004. Cited by: §2.2.
  • [16] Z. Kolter, D. Duvenaud, and M. Johnson (2020) Deep implicit layers - neural odes, deep equilibirum models, and beyond. External Links: Link Cited by: §4.2, footnote 2.
  • [17] J. Liu, Y. Sun, C. Eldeniz, W. Gan, H. An, and U. S. Kamilov (2020) Rare: image reconstruction using deep priors learned without groundtruth. IEEE Journal of Selected Topics in Signal Processing 14 (6), pp. 1088–1099. Cited by: §2.3.
  • [18] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: §6.1.
  • [19] M. Mardani, Q. Sun, D. Donoho, V. Papyan, H. Monajemi, S. Vasanawala, and J. Pauly (2018) Neural proximal gradient descent for compressive imaging. In Advances in Neural Information Processing Systems, pp. 9573–9583. Cited by: §3.
  • [20] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 416–423. Cited by: §6.3.
  • [21] A. Mehranian and A. J. Reader (2020) Model-based deep learning pet image reconstruction using forward-backward splitting expectation maximisation. IEEE Transactions on Radiation and Plasma Medical Sciences. Cited by: §2.1.
  • [22] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers (2017) Learning proximal operators: using denoising networks for regularizing inverse imaging problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1781–1790. Cited by: §2.3.
  • [23] M. J. Muckley, B. Riemenschneider, A. Radmanesh, S. Kim, G. Jeong, J. Ko, Y. Jun, H. Shin, D. Hwang, M. Mostapha, et al. (2020) State-of-the-art machine learning mri reconstruction in 2020: results of the second fastmri challenge. arXiv preprint arXiv:2012.06318. Cited by: §2.1.
  • [24] G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett (2020) Deep learning techniques for inverse problems in imaging. arXiv preprint arXiv:2005.06001. Cited by: §1, §2.1, §2.3.
  • [25] N. Parikh and S. Boyd (2014) Proximal algorithms. Foundations and Trends in optimization 1 (3), pp. 127–239. Cited by: §3.
  • [26] A. Raj, Y. Bresler, and B. Li (2020) Improving robustness of deep-learning-based image reconstruction. arXiv preprint arXiv:2002.11821. Cited by: §6.6.
  • [27] Y. Romano, M. Elad, and P. Milanfar (2017) The little engine that could: regularization by denoising (red). SIAM Journal on Imaging Sciences 10 (4), pp. 1804–1844. Cited by: §2.3, §6.3.
  • [28] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.3, §6.2.
  • [29] L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60 (1-4), pp. 259–268. Cited by: §1, §6.1.
  • [30] E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin (2019) Plug-and-play methods provably converge with properly trained denoisers. In International Conference on Machine Learning, pp. 5546–5557. Cited by: §2.3, §5, §5, §5, §5, §6.1.
  • [31] D. Strong and T. Chan (2003) Edge-preserving and scale-dependent properties of total variation regularization. Inverse problems 19 (6), pp. S165. Cited by: §6.1.
  • [32] A. N. Tikhonov (1943) On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, Vol. 39, pp. 195–198. Cited by: §1.
  • [33] T. Tirer and R. Giryes (2019) Super-resolution via image-adapted denoising cnns: incorporating external and internal learning. IEEE Signal Processing Letters 26 (7), pp. 1080–1084. Cited by: §2.3.
  • [34] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg (2013) Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 945–948. Cited by: §2.3, §6.1.
  • [35] H. F. Walker and P. Ni (2011) Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49 (4), pp. 1715–1735. Cited by: §4.1.
  • [36] D. Wu, K. Kim, and Q. Li (2019) Computationally efficient deep neural network for computed tomography image reconstruction. Medical physics 46 (11), pp. 4763–4776. Cited by: §2.1.
  • [37] J. Zbontar, F. Knoll, A. Sriram, M. J. Muckley, M. Bruno, A. Defazio, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, Z. Zhang, M. Drozdzal, A. Romero, M. Rabbat, P. Vincent, J. Pinkerton, D. Wang, N. Yakubova, E. Owens, C. L. Zitnick, M. P. Recht, D. K. Sodickson, and Y. W. Lui (2018) fastMRI: an open dataset and benchmarks for accelerated MRI. External Links: 1811.08839 Cited by: §6.1.
  • [38] K. Zhang, W. Zuo, S. Gu, and L. Zhang (2017) Learning deep cnn denoiser prior for image restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3929–3938. Cited by: §2.3.

8 Appendix

8.1 Further Qualitative Results

In this section, we provide further visualizations of the reconstructions produced by Deep Equilibrium models and the corresponding Deep Unrolled approaches, beyond those shown in the main body. Figures 8, 9, and 10 are best viewed electronically, and contain the ground-truth images, the measurements (projected back to image space in the case of MRI and compressed sensing), and reconstructions by DU-Prox and DE-Prox.

We also visualize the intermediate iterates in the fixed-point iterations, to further demonstrate the convergence properties of DEMs for image reconstruction. We find that DEMs converge quickly to reasonable reconstructions, and maintain high-quality reconstructions after more than one hundred iterations.

Ground Truth
DE-Prox (Ours)
Figure 8: Sample images and reconstructions with for accelerated MRI reconstruction with additive noise of . Best viewed electronically.
Ground Truth
DE-Prox (Ours)
Figure 9: Sample images and reconstructions for Gaussian compressed sensing with additive noise of . Best viewed electronically.
Ground Truth
Measurements
DE-Prox (Ours)
Figure 10: Sample images and reconstructions for Gaussian deblurring with additive noise of . Best viewed electronically.

8.2 Visualizing Iterates

In Figures 11, 12, and 13 we visualize the outputs of the K'th iteration of the mapping in DE-Prox. Recall that during training, DE-Prox was limited to 50 forward iterations. We observe that across forward problems, the reconstructions continue converging after this number of iterations.

We illustrate 90 iterations for deblurring and MRI reconstructions, and illustrate 190 iterations for compressed sensing, since that problem requires additional iterations to converge.
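The way these visualizations are assembled can be sketched as below: a fixed-point map is applied repeatedly, every tenth iterate is recorded, and each recorded iterate is paired with its residual against the previously recorded one (or against the input, for K=0). This is only an illustrative sketch; the helper name `iterate_with_snapshots` and the scalar contraction used as the map are hypothetical stand-ins for the learned DE-Prox iteration.

```python
import math


def iterate_with_snapshots(f, x0, num_iters, every=10):
    """Run x <- f(x) and record (k, x_k, residual) every `every` iterations.

    The residual at a snapshot is the Euclidean distance to the previously
    recorded snapshot; for the first snapshot (k = 0) it is the distance
    between the input x0 and the output of the initial iterate.
    """
    x = list(x0)
    previous = list(x0)
    snapshots = []
    for k in range(num_iters + 1):
        x = f(x)
        if k % every == 0:
            residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, previous)))
            snapshots.append((k, list(x), residual))
            previous = list(x)
    return snapshots


# Hypothetical contraction with fixed point 2.0 in every coordinate.
toy_map = lambda v: [0.9 * a + 0.2 for a in v]
snaps = iterate_with_snapshots(toy_map, [0.0, 0.0], num_iters=90)
```

For a contractive map, the residuals between successive snapshots shrink geometrically, which is the behavior one would hope to see in the residual rows of the figures.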

For further illustration we also demonstrate the qualitative effects of running DU-Prox for more iterations than it was trained for.

K=0 K=10 K=20 K=30 K=40
K=50 K=60 K=70 K=80 K=90
Figure 11: Sample images and reconstructions for MRI reconstruction with acceleration and additive Gaussian noise with . Each image represents the output of iterate number . Below each image is the residual between iterate and the previously-visualized iterate, or in the case of , between the input to the network and the output of the initial iterate. Each residual is multiplied by for easier visualization. The ground truth may be viewed in the initial column of Figure 8. Best viewed electronically.
K=0 K=10 K=20 K=30 K=40 K=50 K=60 K=70 K=80 K=90
K=100 K=110 K=120 K=130 K=140 K=150 K=160 K=170 K=180 K=190
Figure 12: Sample images and reconstructions from DE-Prox reconstructions, with the forward model Gaussian compressed sensing with Gaussian noise with . Each image represents the output of iterate number . Below each image is the residual between iterate and the previously-visualized iterate, or in the case of , between the input to the network and the output of the initial iterate. The ground truth may be viewed in the final column of Figure 9. Each residual is multiplied by for easier visualization. Best viewed electronically.
K=0 K=10 K=20 K=30 K=40
K=50 K=60 K=70 K=80 K=90
Figure 13: Sample images and reconstructions from DE-Prox performing Gaussian deblurring with Gaussian noise with . Each image represents the output of iterate number . Below each image is the residual between iterate and the previously-visualized iterate, or in the case of , between the input to the network and the output of the initial iterate. Each residual is multiplied by for easier visualization. The ground truth may be viewed in the final column of Figure 10. Best viewed electronically.
K=0 K=10 K=20 K=30 K=40
K=50 K=60 K=70 K=80 K=90
Figure 14: Sample images and reconstructions from the standard DU-Prox method performing MRI reconstruction with acceleration and additive Gaussian noise with . Each image represents the output of iterate number . Unlike our approach, iterates beyond the number DU-Prox was trained for decrease reconstruction accuracy, adding nonphysical artifacts. The ground truth may be viewed in the initial column of Figure 8. Best viewed electronically.
K=0 K=10 K=20 K=30 K=40
K=50 K=60 K=70 K=80 K=90
Figure 15: Sample images and reconstructions from the standard DU-Prox trained for 10 iterations performing Gaussian deblurring with Gaussian noise with . Each image represents the output of iterate number . Unlike with our approach, DE-Prox, running DU-Prox beyond the 10 iterations used in training decreases reconstruction quality significantly. The ground truth may be viewed in the final column of Figure 10.

8.3 Further Experimental Details

In this section we provide further details related to the experimental setup.

The input to every learned algorithm is the preconditioned measurement , where is generally set to be equal to the noise level . For MRI reconstruction experiments, was used. The masks used in the MRI reconstruction experiments are based on a Cartesian sampling pattern, as in the standard fastMRI setting. The center 4 of frequencies are fully sampled, and further frequencies are sampled according to a Gaussian distribution centered at 0 frequency with .
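A mask of the kind described above could be generated along the following lines. This is a sketch under stated assumptions, not the procedure used in the experiments: the function name `cartesian_mask`, the number of k-space columns, the center fraction, the number of sampled columns, and the Gaussian width are all hypothetical values chosen for illustration.

```python
import random


def cartesian_mask(num_cols, center_fraction, num_sampled, sigma, seed=0):
    """Return a 0/1 list marking which k-space columns are acquired.

    A contiguous block of low-frequency columns around the center is always
    kept; the remaining columns are drawn from a Gaussian centered at the
    zero-frequency column until `num_sampled` distinct columns are selected.
    """
    rng = random.Random(seed)
    center = num_cols // 2
    num_center = max(1, round(center_fraction * num_cols))
    if not num_center <= num_sampled <= num_cols:
        raise ValueError("num_sampled must lie between the center block size and num_cols")
    start = center - num_center // 2
    chosen = set(range(start, start + num_center))
    while len(chosen) < num_sampled:
        col = round(rng.gauss(center, sigma))
        if 0 <= col < num_cols:
            chosen.add(col)
    return [1 if c in chosen else 0 for c in range(num_cols)]


# Hypothetical settings: 320 columns, 40 of them acquired, Gaussian width 40.
mask = cartesian_mask(num_cols=320, center_fraction=0.04, num_sampled=40, sigma=40.0)
```

Fixing the seed makes the mask reproducible, so the same sampling pattern can be reused across methods being compared.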

The compressed sensing design matrices have entries sampled and scaled so that each entry is drawn from a Gaussian distribution with variance , where . The same design matrix is used for all learned methods.
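As a rough illustration of such a design matrix, the sketch below draws i.i.d. zero-mean Gaussian entries with a prescribed variance and fixes the random seed so that the same matrix can be shared by every method. The function names and the dimensions are hypothetical, and the variance of 1/m used in the example is only a common convention assumed here, not a value taken from the text.

```python
import random


def gaussian_design_matrix(m, n, variance, seed=0):
    """Return an m-by-n list of lists with i.i.d. N(0, variance) entries."""
    rng = random.Random(seed)
    std = variance ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(m)]


def apply_matrix(A, x):
    """Compute the measurement vector A x for a list-of-lists matrix A."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


# Hypothetical sizes: 50 measurements of a 200-dimensional signal.
m, n = 50, 200
A = gaussian_design_matrix(m, n, variance=1.0 / m)
y = apply_matrix(A, [1.0] * n)
```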

Optimization algorithm parameters for RED, Plug-and-Play, and all Deep Equilibrium approaches are chosen via a logarithmic grid search from to with 20 elements in each dimension of the grid. All DU methods were trained for 10 iterations. All testing was done on an NVIDIA Titan X. All networks were trained on a cluster with a variety of computing resources. Every experiment was run using a single-GPU, single-CPU setup with less than 12 GB of GPU memory.
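A logarithmic grid search with 20 points per dimension can be sketched as follows. The helper names `log_grid` and `grid_search`, the grid endpoints, and the toy validation score are hypothetical; in practice the score would be a validation-set reconstruction error for a given choice of algorithm parameters.

```python
import itertools
import math


def log_grid(low, high, num=20):
    """Return `num` points evenly spaced in log10 between low and high (inclusive)."""
    lo, hi = math.log10(low), math.log10(high)
    return [10.0 ** (lo + (hi - lo) * i / (num - 1)) for i in range(num)]


def grid_search(score, axes):
    """Return the parameter tuple minimizing `score` over the product of the axes."""
    return min(itertools.product(*axes), key=lambda params: score(*params))


# Hypothetical endpoints and a toy score minimized near step = 1e-2, reg = 1e-1.
grid = log_grid(1e-4, 1.0)
toy_score = lambda step, reg: (math.log10(step) + 2) ** 2 + (math.log10(reg) + 1) ** 2
best = grid_search(toy_score, [grid, grid])
```

With two parameters this evaluates 400 configurations, so the cost grows quickly with the number of dimensions searched.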