# The Little Engine that Could: Regularization by Denoising (RED)

Removal of noise from an image is an extensively studied problem in image processing. Indeed, the recent advent of sophisticated and highly effective denoising algorithms lead some to believe that existing methods are touching the ceiling in terms of noise removal performance. Can we leverage this impressive achievement to treat other tasks in image processing? Recent work has answered this question positively, in the form of the Plug-and-Play Prior (P^3) method, showing that any inverse problem can be handled by sequentially applying image denoising steps. This relies heavily on the ADMM optimization technique in order to obtain this chained denoising interpretation. Is this the only way in which tasks in image processing can exploit the image denoising engine? In this paper we provide an alternative, more powerful and more flexible framework for achieving the same goal. As opposed to the P^3 method, we offer Regularization by Denoising (RED): using the denoising engine in defining the regularization of the inverse problem. We propose an explicit image-adaptive Laplacian-based regularization functional, making the overall objective functional clearer and better defined. With a complete flexibility to choose the iterative optimization procedure for minimizing the above functional, RED is capable of incorporating any image denoising algorithm, treat general inverse problems very effectively, and is guaranteed to converge to the globally optimal result. We test this approach and demonstrate state-of-the-art results in the image deblurring and super-resolution problems.

• 30 publications
• 57 publications
• 42 publications
08/01/2020

### Regularization by Denoising via Fixed-Point Projection (RED-PRO)

Inverse problems in image processing are typically cast as optimization ...
02/26/2016

### Patch-Ordering as a Regularization for Inverse Problems in Image Processing

Recent work in image processing suggests that operating on (overlapping)...
01/27/2021

### An Interpretation of Regularization by Denoising and its Application with the Back-Projected Fidelity Term

The vast majority of image recovery tasks are ill-posed problems. As suc...
02/08/2017

### Scene-adapted plug-and-play algorithm with convergence guarantees

Recent frameworks, such as the so-called plug-and-play, allow us to leve...
03/25/2019

Inverse problems in imaging are extensively studied, with a variety of s...
11/27/2010

### Edge Preserving Image Denoising in Reproducing Kernel Hilbert Spaces

The goal of this paper is the development of a novel approach for the pr...
09/13/2021

### Implicit Regularization Effects of the Sobolev Norms in Image Processing

In this paper, we propose to use the general L^2-based Sobolev norms (i....

## Code Repositories

### RED

RED - Regularization by Denoising

## 1 Introduction

We open this paper with a bold and possibly controversial statement: To a large extent, removal of zero-mean white additive Gaussian noise from an image is a solved problem in image processing.

Before justifying this statement, let us describe the basic building block that will be the star of this paper: the image denoising engine

. From the humble linear Gaussian filter to the recently developed state-of-the-art methods using convolutional neural networks, there is no shortage of denoising approaches. In fact, these algorithms are so widely varied in their definition and underlying structure that a concise description will need to be made carefully. Our story begins with an image

x corrupted by zero-mean white additive Gaussian noise,

 y=x+e. (1)

In our notation, we consider an image as a vector of length

(after lexicographic ordering). In the above description, the noise vector is normally distributed,

. In the most general terms, the image denoising engine is a function that maps an image y to another image of the same size , with the hope to get as close as possible to the original image x. Ideally, such functions operate on the input image y to remove the deleterious effect of the noise while maintaining edges and textures beneath.

The claim made above about the denoising problem being solved is based on the availability of algorithms proposed in the past decade that can treat this task extremely effectively and stably, getting very impressive results, which also tend to be quite concentrated (see for example the work reported in [3, 4, 5, 6, 7, 42, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]). Indeed, these documented results have led researchers to the educated guess that these methods are getting very close to the optimally possible denoising performance [21, 22, 23]. This aligns well with the unspoken observation in our community in recent years that investing more work to improve image denoising algorithms seems to lead to diminishing returns.

While the above may suggest that work on denoising algorithms is turning to a dead-end avenue, a new opportunity emerges from this trend: Seeking ways to leverage the vast progress made on the image denoising front in order to treat other tasks in image processing, bringing their solutions to new heights. One natural path towards addressing this goal is to take an existing and well-performing denoising algorithm, and generalize it to handle a new problem. This has been the logical path that has led to contributions such as [24, 25, 26, 27, 28, 29], and many others. These papers, and others like them, offer an exhaustive manual adaptation of existing denoising algorithms, carefully re-tailored to handle specific alternative problems. This line of work, while often successful, is quite limited, as it offers no flexibility and no general scheme for diverting image denoising engines to treat new image processing tasks.

Could one offer a more systematic way to exploit the abundance of high-performing image-denoising algorithms to treat a much broader family of problems? The recent work by Venkatakrishnan, Bouman and Wohlberg provides a positive and tantalizing answer to this question, in the form of the Plug-and-Play Prior () method [30, 33, 31, 32]. This technique builds on the use of an implicit prior for regularizing general inverse problems. When handling the obtained optimization task via the ADMM optimization scheme [58, 59, 60], the overall problem decomposes into a sequence of image denoising tasks, coupled with simpler -regularized inverse problems that are much easier to handle.

While the scheme may sound like the perfect answer to our prayers, reality is somewhat more complicated. First, this method is not always accompanied by a clear definition of the objective function, since the regularization being effectively used is only implicit, implied by the denoising algorithm. Indeed, it is not clear at all that there is an underlying objective function behind the scheme, if arbitrary denoising engines are used [31]. Second, parameter tuning of the ADMM scheme is a delicate matter, and especially so under a non-provable convergence regime, as is the case when using sophisticated denoising algorithms. Third, being intimately coupled with the ADMM, the scheme does not offer easy and flexible ways of replacing the iterative procedure. Because of these reasons, the scheme is not a turn-key tool, nor is it free from emotional-involvement. Nevertheless, the method has drawn much attention (e.g., [31, 32, 33, 34, 35, 36, 37, 38]), and rightfully so, as it offers a clear path towards harnessing a given image denoising engine for treating more general inverse problems, just as described above.

Is there a more general alternative to the method that could be simpler and more stable? This paper puts forward such a framework, offering a systematic use of such denoising engines for regularization of inverse problems. We term the proposed method “Regularization by Denoising” (RED), relying on a general structured smoothness penalty term harnessed to regularize any desired inverse problem. More specifically, the regularization term we propose in this work is of the following

 ρ(x)=12xT[x−f(x)], (2)

in which the denoising engine itself is applied on the candidate image x, and the penalty induced is proportional to the inner-product between this image and its denoising residual. This defined smoothness regularization is effectively using an image-adaptive Laplacian, which in turn draws its definition from the arbitrary image denoising engine of choice, . Surprisingly, under mild assumptions on , it is shown that the gradient of the regularization term is manageable, given as the denoising residual, . Therefore, armed with this regularization expression, we show that any inverse problem can be handled while calling the denoising engine iteratively.

RED, the newly proposed framework, is much more flexible in the choice of the optimization method to use, not being tightly coupled to one specific technique, as in the case of the scheme (relying on ADMM). Another key difference w.r.t. the method is that our adaptive Laplacian-based regularization functional is explicit, making the overall Bayesian objective function clearer and better defined. RED is capable of incorporating any image denoising algorithm, and can treat general inverse problems very effectively, while resulting in an overall algorithm with very simple structure.

An important advantage of RED over the scheme is the flexibility with which one can choose the denoising engine to plug in the regularization term. While most of the discussion in this paper keeps focusing on White Gaussian Noise (WGN) removal, RED can actually deploy almost any denoising engine. Indeed, we define a set of two mild conditions that should satisfy, and show that many known denoising methods obey these properties. As an example, in our experiments we show how the median filter can become an effective regularizer. Last but not least, we show that the defined regularization term is a convex function, implying that in most cases, in which the log-likelihood term is convex too, the proposed algorithms are guaranteed to converge to a global optimum solution. We demonstrate this scheme, showing state-of-the-art results in image deblurring and single image super-resolution.

This paper is organized as follows: In the next section we present the background material for this work, discussing the general form of inverse problems as optimization tasks, and presenting the Plug-and-Play Prior scheme. Section 3 focuses on the image denoising engine, defining it and its properties clearly, so as to enable its use in the proposed Laplacian paradigm. Section 4 serves the main part of this work – introducing RED: a new way to use an image denoising engine to handle general structured inverse problems. In Section 5 we analyze the proposed scheme, discussing convexity, an alternative formulation, and a qualitative comparison to the scheme. Results on the image deblurring and single-image super-resolution problems are brought in Section 6, demonstrating the strength of the proposed scheme. We conclude the paper in Section 7 with a summary of the open questions that we identify for future work.

## 2 Preliminaries

In this section we provide background material that serves as the foundation to this work. We start by presenting the breed of optimization tasks we will work on throughout the paper for handling the inverse problems of interest. We then introduce the method and discuss its merits and weaknesses.

### 2.1 Inverse Problems as Optimization Tasks

Bayesian estimation of an unknown image

x given its measured version y

uses the posterior conditional probability,

, in order to infer x. The most popular estimator in this regime is the Maximum aposteriori Probability (MAP), which chooses the mode (x for which the maximum probability is obtained) of the posterior. Using Bayes’ rule, this implies that the estimation task is turned into an optimization problem of the form

 ˆxMAP=\text{Arg}max% xP(x|y) = \text{Arg}maxxP(y|x)P(x)P(y)=\text{Arg}maxxP(y|x)P(x) = \text{Arg}minx−log{P(y|x)}−log{P(x)}.

In the above derivations we exploited the fact that is not a function of x and thus can be omitted. We also used the fact that the function is monotonic decreasing, turning the maximization into a minimization problem.

The term is known as the log-likelihood term, and it encapsulates the probabilistic relationship between the desired image x and the measurements y, under the assumption that the desired image is known. We shall rewrite this term as

 ℓ(y,x)=−log{P(y|x)}. (4)

As a classic example for the log-likelihood that will accompany us throughout this paper, the expression refers to the case of , where H is a linear degradation operator and e

is white Gaussian noise contamination of variance

. Naturally, if the noise distribution changes, we depart form the comfortable form.

The second term in Equation (2.1), , refers to the prior, bringing in the influence of the statistical nature of the unknown. This term is also referred to as the regularization, as it helps in better conditioning the overall optimization task in cases where the likelihood alone cannot lead to a unique or stable solution. We shall rewrite this term as

 λρ(x)=−log{P(x)}, (5)

where is a scalar that encapsulates the confidence in this term.

What is and how is it chosen? This is the holy grail of image processing, with a progressive advancement over the years of better modeling the image statistics and leveraging this for handling various tasks in image processing. Indeed, one could claim that almost everything done in our field surrounds this quest for choosing a proper prior, from the early smoothness prior using the classic Laplacian [39], through total variation [40] and wavelet sparsity [41], all the way to recent proposals based on patch-based GMM [42, 43] and sparse-representation modeling [44]. Interestingly, the work we report here builds on the surprising comeback of the Laplacian regularization in a much more sophisticated form, as reported in [48, 49, 50, 51, 52, 53, 54, 55, 56, 57].

Armed with a clear definition of the relation between the measurements and the unknown, and with a trusted prior, the MAP estimation boils down to the optimization problem of the form

 ˆxMAP=\text{Arg}min% xℓ(y,x)+λρ(x). (6)

This defines a wide family of inverse problems that we aim to address in this work, which includes tasks such as denoising, deblurring, super-resolution, demosaicing, tomographic reconstruction, optical-flow estimation, segmentation, and many other problems. The randomness in these problems is typically due to noise contamination of the measurements, and this could be Gaussian, Laplacian, Gamma-distributed, Poisson, and other noise models.

### 2.2 The Plug-and-Play Prior (P3) Approach

For completeness of this exposition, we briefly review the approach. Aiming to solve the problem posed in Equation (6), the ADMM technique [58, 59, 60] suggests to handle this by variable splitting, leading to the equivalent problem

 (7)

The constraint is turned into a penalty term, relying on the augmented Lagrangian method (in its scaled dual form [58]), leading to

 (8)

where u serves as the Lagrange multiplier vector for the set of constraints. ADMM addresses the resulting problem by updating x, v, and u sequentially in a block-coordinate-descent fashion, leading to the following series of sub-problems:

1. Update of x: When considering v (and u) as fixed, the term is omitted, and our task becomes

 ˆx=\text{Arg}minxℓ(y,x)+β2∥x−v+u∥22, (9)

which is a far simpler inverse problem, where the regularization is an proximity one, which is easy to solve in most cases.

2. Update of v: In this stage we freeze x (and u), and thus the log-likelihood term drops, leading to

 ˆv=\text{Arg}minvλρ(v)+β2∥x−v+u∥22. (10)

This stage is nothing but a denoising of the image , assumed to be contaminated by a white additive Gaussian noise of power . This is easily verified by returning to Equation (6) and plugging the log-likelihood term referring to this case. Indeed, this is the prime observation in [30], as they suggest to replace the direct solution of (10) by activating an image denoising engine of choice. This way, we do not need to define explicitly the regularization to be used, as it is implied by the engine chosen.

3. Update of u: We complete the algorithm description by considering the update of the Lagrange multiplier vector u, which is done by .

Although the above algorithm has a clear mathematical formulation and only two parameters, denoted by and , it turns out that tuning these is not a trivial task. The source of complexity emerges from the fact that the input noise-level to the denoiser is equal to . The confidence in the prior is determined by , and the penalty on the distance between x and v is affected by . Empirically, setting a fixed value of does not seize the potential of this algorithm; following previous work (e.g. [37, 32]), a common practical strategy to achieve a high-quality estimation is to increase the value of as a function of the iterations: Starting from a relatively small value, i.e allowing an aggressive regularization, then proceeding to a more conservative one that limits the smoothing effect, up-to a point where should be large enough to ensure convergence [32] and to avoid an undesired over-smoothed outcome. As one can imagine, it is cumbersome to choose the rate in which should be increased, especially because the corrupted image is a function of the Lagrange multiplier, which varies through the iterations as well.

In terms of convergence, the scheme has been shown to be well-behaved under some conditions on the denoising algorithm used. While the work reported in [31] requires the denoiser to be a symmetric smoothing and non-expansive filter, the later work in [32] relaxes this condition to much simpler boundedness of the denoising effect. However, both these prove at best a convergence to a steady-state outcome, which is very far from the desired claim of getting to the global minimizer of the overall objective function. The work reported in [33] offers clear conditions for a global convergence of , requiring the denoiser to be non-expansive, and emerging as the minimizer of a convex functional. A recently released paper extends the above by using a specific GMM-based denoiser, showing that these two conditions are met, thus guaranteeing global convergence of their ADMM scheme [38].

Indeed, in that respect, a delicate matter with the approach is the fact that given a choice of a denoising engine, it does not necessarily refer to a specific choice of a prior , as not every such engine could have a MAP-oriented interpretation. This implies a fundamental difficulty in the scheme, as in this case we will be activating a denoising algorithm while departing from the original setting we have defined, and having no underlying cost function to serve. Indeed, the work reported in [31] addresses this very matter in a narrower setting, by studying the identity of the effective prior obtained from a chosen denoising engine. The author chooses to limit the answer to symmetric smoothing filters, showing that even in this special case, the outcome is far from being trivial. As we are about to see in the next section, this shortcoming can be overcome by adopting a different regularization strategy.

## 3 The Image Denoising Engine

Image denoising is a special case of the inverse problem posed in Equation (6), referring to the case , where e is white Gaussian noise contamination of variance . In this case, the MAP problem becomes

 ˆxDenoise=\text{Arg}minx12σ2∥y−x∥22+λρ(x). (11)

The image denoising engine, which is the focal point of this work, is any candidate solver to the above problem, under a specific choice of a prior. In fact, in this work we choose to widen the definition of the image denoising engine to be any function that maps an image y to another image of the same size, and which aims to treat the denoising problem by the operation , be it MAP-based, MMSE-based, or any other approach.

Below, we accompany the definition of a denoiser with few basic conditions on the function . Just before doing so, we make the following broad observation: Among the various degradations that inverse problems come to remedy, removal of noise is fundamentally different. Consider the set of all reasonable “natural” images living on a manifold . If we blur any given image or down-scale it, it is still likely to live in . However, if the image is contaminated by an additive noise, it pops out of the manifold along the normal to with high probability. Denoising is therefore a fundamental mechanism for an orthogonal “projection” of an image back onto 111In the context of optimization, a smaller class of the general denoising algorithms we define are characterized as “proximal operators” [61]. These operators are in fact direct generalizations of orthogonal projections.. This may explain why denoising is such a central operation, which has been so heavily studied. In the context of this work, in any given step of our iterations, this projection would allow us to project the temporary result back onto , so as to increase chances of getting a good-quality restored version of our image.

### 3.1 Conditions and Properties of f(x)

We pose the following two necessary conditions on that will facilitate our later derivations. Both these conditions rely on the differentiability222A discussion on this requirement and possible ways to relax it appear in appendix D. of the denoiser ).

• Condition 1: (Local) Homogeneity. A denoiser applied to a positively scaled image should result in a scaled version of the original image. More specifically, for any scalar we must have . In this work we shall relax this condition and demand its satisfaction for for a very small .

A direct implication of the above property refers to the behavior of the directional derivative of the denoiser along the direction x. This derivative can be evaluated as

 ∇xf(x)x=f(x+ϵx)−f(x)ϵ (12)

for a very small . Invoking the homogeneity condition this leads to

 ∇xf(x)x = (1+ϵ)f(x)−f(x)ϵ (13) = f(x).

Thus, the filter can be written as333This result is sometimes known as Euler’s homogeneous function theorem [66].

 f(x)=∇xf(x)x. (14)
• Condition 2: Strong Passivity. The Jacobian of the denoising algorithm is stable, satisfying the condition

 η(∇xf(x))≤1, (15)

where is the spectral radius of the matrix A. We interpret this condition as the restriction of the denoiser not to magnify the norm of an input image since

 ∥f(x)∥=∥∇xf(x)x∥≤η(∇xf(x))⋅∥x∥≤∥x∥. (16)

Here we have relied on the relation that has been established in Equation (14). Note that we have chosen this specific condition over the natural weaker alternative, , since the strong condition implies the weak one.

A reformulation of the denoising engine that will be found useful throughout this work is the one suggested in [52], where we assume that the algorithm is built of two phases – a first in which highly non-linear decisions are made, and a second in which these decisions are used to adapt a linear filter to the raw noisy image in order to perform the actual noise removal. Algorithms such as the NLM, kernel-regression, K-SVD, and many others admit this structure, and for them we can write

 ˆxDenoise=f(y)=W(y)y. (17)

The matrix W is an matrix, representing the (pseudo-) linear filter that multiplies the noisy image vector y. This matrix is image dependent, as it draws its content from the pixels in y. Nevertheless, this pseudo-linear format provides a convenient description of the denoising engine for our later derivations. We should emphasize that while this notation is true for only some of the denoising algorithms, the proposed framework we outline in this paper is general and applies to any denoising filter that satisfies the two conditions posed above. Indeed, the careful reader will observe that this pseudo-linear form is closely related to the directional derivative relation shown above: . In this form, the right hand side is now reminding us of the pseudo-linear form where the matrix is replaced by the Jacobian matrix .

As a side note we mention that yet another natural requirement on W (or in a wider perspective) is that it is row-stochastic, implying that (i) this matrix is (entry-wise) non-negative, and that (ii) the vector

is an eigenvector of

W. This would imply a constancy behavior – a denoiser does not alter a constant image. More specifically, defining as an -dimensional column vector of all ones, for any scalar we have . Algorithms such as the NLM [1] and its many variants all lead to such a row-stochastic Jacobian. We note that this property, while nice to have, is not required for the derivations in this paper.

An interesting consequence of the homogeneity property is the following stability of the pseudo-linear operator W. Starting with a first-order Taylor expansion of , and invoking the directional derivative relation , we get

 f(y+h) ≈ f(y)+∇yf(y)⋅h ≈ ∇f(y)⋅y+∇yf(% y)⋅h ≈ ∇yf(y)⋅(y+h).

This result implies that while may indeed depend on y, its sensitivity to its perturbation is negligible, rendering it as an essentially constant linear operator on the perturbed image .

### 3.2 Denoisers Obeying the Above Conditions

We cannot conclude this section without answering the key question: Which are the denoising engines to which we are constantly referring? While these could include any of the thousands of denoising algorithms published over the years, we obviously focus on the best performing ones, such as the Non-Local Means (NLM) and its advanced variants [1, 2, 3], the K-SVD denoising method that relies on sparse representation modeling of image patches [4] and its non-local extension [7], the kernel-regression method that exploits local orientation [6], the well-known BM3D that combines sparsity and self-similarity of patches [5], the EPLL scheme that suggests patch-modeling based on the GMM model [42], CSR and NCSR, which cluster the patches and sparsifies them jointly [8, 9], the group-Wiener filtering applied on patches [10]

, the multi-layer Perceptron method trained to clean an image or the more recent CNN-based alternative called Trainable Nonlinear Reaction-Diffusion (TNRD) algorithm

[11, 20], more recent work that proposed low-rank modeling of patches and the use of the weighted nuclear-norm [16], non-local sparsity with GSM model [17], and the list goes on and on. Each and every one of these options (and many others) is a candidate engine that could be fit into our scheme.

A fair and necessary question is whether the denoisers we work with obey the two conditions we have posed above (homogeneity and passivity), and whether the preliminary requirement of differentiability is met. We choose to defer the discussion on the differentiability to Appendix D, due to its relevance to several spread parts of this paper and focus here on the homogeneity and passivity.

Starting with the homogeneity property, we give an experimental evidence, accompanied by a theoretical analysis, to substantiate the fulfillment of this property by a series of well-known denoisers. Figure 1 shows versus as a scatter-plot, tested for K-SVD, BM3D, NLM, EPLL, and the TNRD. In all these experiments, the image x is set to be Peppers, and (level of noise assumed within ). As can be seen, a tendency to an equality

is obtained, suggesting that all these are indeed satisfying the homogeneity property. The deviation from exact equality in each of these tests has been evaluated as the standard deviation of the difference

, leading to , respectively. A further discussion on the homogeneity property from a theoretical perspective is given in Appendix C.

Turning to the passivity condition, a conceptual difficulty is the need to explicitly obtain the Jacobian of the denoiser in question. Assuming that we overcame this problem somehow and got , its spectral radius would be evaluated using the Power-Method [69] that applies iterations of the form

 hk+1=∇xf(x)⋅hk∥∇xf(x)⋅hk∥2. (19)

The spectral radius itself is easily obtained as

 η(∇xf(x))=h% Tk+1hkhTkhk. (20)

In order to bypass the need to explicitly obtain , we rely on the first order Taylor expansion again,

 f(x+h)=f(x)+∇xf(x)⋅h, (21)

implying that , which holds true if is small enough. Thus, our alternative Power-Method activates one denoising step per iteration,

 hk+1≈f(x+hk)−f(x)∥f(x+hk)−f(x)∥2. (22)

The vector is normalized in each iteration, and thus . This vector is an image, and thus the gray values in it must be very small (), so as to lead to a sum of squares to be equal to 1. This agrees with the need for the perturbation to be small.

This algorithm has been applied to K-SVD, BM3D, NLM, EPLL, and the TNRD (x set to be the image Cameraman, , number of iterations set to give an accuracy of ), resulting all with values smaller or equal to , verifying the passivity of these filters.

## 4 Regularization by Denoising (RED)

The new and alternative framework we propose relies on a form of an image-adaptive Laplacian which builds a powerful (empirical) prior that can be used to regularize a variety of inverse problems. As a place to start and motivate this definition, let’s go back to the description of the denoiser given in Equation (17), namely444Note that we conveniently assume that the prior is applied to the clean image x, a matter that will be clarified as we dive into our explanations. . We may think of this pseudo-linear filter as one where a set of coefficients (depending on x) are first computed in the matrix W, and then applied to the image x. From this we can construct the Laplacian form,

 ρL(x)=12xTL(x)% x=12xT(I−W(x))x=12xT[x−W(x)x]. (23)

This definition by itself is not novel, as it is similar to ideas brought up in a series of recent contributions [48, 49, 50, 51, 52, 53, 54, 55, 56, 57]. This expression relies on using an image-adaptive Laplacian – one that draws its definition from the image itself.

Observing the obtained expression, we note that it can be interpreted as the unnormalized cross-correlation between the image x and its corresponding residual . As a prior expression should give low values for likely images, in our case this would be achieved in one of two ways (or their combination):

• A small value is obtained for if the residual is very small, implying that the image x serves as a near fixed-point of the denoising engine, .

• A small value is obtained for

if the cross-correlation of the residual to the image itself is small, a feature that implies that the residual behaves like white noise, or alternatively, if it does not contain elements from the image itself. Interestingly, this concept has been harnessed successfully by some denoising algorithms such as the Dantzig-Selector

[63] and by image denoising boosting techniques [52, 64, 65]

. Indeed, enforcing orthogonality between the signal and its treated residual is the underlying force behind the Normal equations in statistical estimation (e.g. Least Squares and Kalman Filtering).

Given the above prior, we return to the general inverse-problem posed in Equation (6), and define our new objective,

 ˆx=\text{Arg}minxℓ(y,x)+λ2xT[x−W(x)x]. (24)

The prior expression, while exhibiting a possibly complicated dependency on the unknown x, is well-defined and clear. Nevertheless, an attempt to apply any gradient-based algorithm for solving the above minimization task encounters an immediate problem, due to the need to differentiate with respect to x. We overcome this problem by observing that is in fact the activation of the image denoising engine on x, i.e., . This observation inspires the following more general definition of the Laplacian regularizer, which is the prime message of this paper:

 ρL(x)=12xT[x−f(x)]. (25)

This is the Regularization by Denoising (RED) paradigm that this work advocates. In this expression, the residual is defined more generally for any filter even if it can not be written in the familiar (pseudo-)linear form. Note that all the preceding intuition about the meaning of this prior remains intact; namely, the value is low if the cross-correlation between the image and its denoising residual is small, or if the residual itself is small due to x being a fixed point of .

Surprisingly, while this expression is more general, it leads to a better-managed optimization problem due to the careful properties we have outlined in Section 3 on our denoising engines . The overall energy functional to minimize is

 E(x)=ℓ(y,x)+λ2xT(x−f(x)), (26)

 ∇xE(x) = = ∇xℓ(y,x)+λ2∇xxTx−λ2∇x[xTf(x)] = ∇xℓ(y,x)+λx−λ2[f(x)+∇xf(x)x% ].

Based on our prior assumption regarding the availability of a directional derivative for the denoising engine, the term can be replaced555A better approximation can be applied in which we replace by the difference , but this calls for two activations of the denoising engine per gradient evaluation. by , based on Equation (14), implying that the gradient expression is further simplified to be

 ∇xE(x)=∇xℓ(% y,x)+λx−λf(x)=∇xℓ(y,x)+λ(x−f(x)), (28)

requiring only one activation of the denoising engine for the gradient evaluation. Interestingly, if we bring back now the pseudo-linear interpretation of the denoising engine, the gradient would be the residual, just as posed above, implying that

 ∇xE(x)=∇xℓ(% y,x)+λ(x−W(x)x). (29)

Observe that this is a non-trivial derivation of the gradient of the original penalty function posed in Equation (24).

### 4.2 Deploying the Denoising Engine for Solving Inverse Problems

In the discussion above we have seen that the gradient of the energy functional to minimize (given in Equation (26)) is easily computable, given in Equation (28). We now turn to show several options for using this in order to solve a general inverse problem. Common to all these methods is the fact that the eventual algorithm is iterative, in which each step is composed of applying the denoising engine (once or more), accompanied by other simpler calculations. In Figures 2, 3, and 4 we present pseudo-code for several such algorithms, all in the context of handling the case in which the likelihood function is given by .

• Gradient Descent Methods: Given the gradient of the energy function , the Steepest-Descent (SD) is the simplest option that can be considered, and it amounts to the update formula

 ˆxk+1 = ˆxk−μ∇xE(% x)∣∣ˆxk = ˆxk−μ[∇xℓ(y,x)∣∣ˆxk+λ(ˆxk−f(ˆxk))].

Figure 2 describes this algorithm in more details.

A line-search can be proposed in order to set dynamically per iteration, but this is necessarily more involved. For example, in the case of the Armijo rule, it requires a computation of the above gradient and then assessing the energy for different values of in a retracting fashion, each of which calling for a computation of the denoising engine once.

One could envision using the Conjugate-Gradient (CG) to speed this method, or better yet, applying the Sequential Subspace Optimization (SESOP) algorithm [70]. SESOP holds the current gradient and the last several update directions as the columns of a matrix (referring to the iteration), and seeks the best linear combination of these columns as an update direction to the current solution, namely . When restricted to have only one column, this reduces to a simple SD with line-search. When using two columns, it has the flavor (and strength) of CG, and when using more columns, this method can lead to much faster convergence in non-quadratic problems. The key points of SESOP are (i) The matrix V is updated easily from one iteration to another by discarding the last direction, bringing in the last one, and adding the new gradient; and (ii) The unknown weights vector is low-dimensional, and thus updating it can be done using a Newton method. Naturally, one should evaluate the first and second derivatives of the penalty function w.r.t. , and these will leverage the relations established above. We shall not dive deeper into this option because it will not be included in our experiments.

One possible shortcoming of the gradient approach (in all its manifestations) is the fact that per activation of the denoising engine, the likelihood is updated rather mildly as a simple step toward the current log-likelihood gradient. This may imply that the overall algorithm will require many iterations to converge. The next two methods propose a way to overcome this limitation, by treating the log-likelihood term more “aggressively”.

• ADMM: Addressing the optimization task given in Equation (26), we can imitate the path taken by the scheme, and apply variable splitting and ADMM. The steps would follow the description given in Section 2 almost exactly, with one major difference – the prior in our case is explicit, and therefore, the stage termed “update of v” would become

 ˆv=\text{Arg}minvλ2vT(v−f(v))+β2∥v−x−u∥22. (31)

Rather than applying an arbitrary denoising engine to compute as a replacement to the actual minimization, we should target this minimization directly by some iterative scheme. For example, setting the gradient of the above expression to zero leads to the equation

 λ(v−f(v))+β(v−x−u)=0, (32)

which can be solved iteratively using the fixed-point strategy, by

 λ(vj−f(vj−1))+β(vj−x−u)=0 (33) →  vj=1β+λ(λf(vj−1+β(x+u)).

This means that our approach in this case is computationally more expensive, as it will require several activations of the denoising engine. However, a common approach to speed up the convergence (in terms of runtime) of the ADMM is called “early termination” [58], suggesting to approximate the solution of the v-update stage. We found this approach useful for our setting, especially because the application of a denoiser is computationally expensive. To this end, we may choose to apply only one iteration of the iterative process described in Equation (33), which amounts to one operation of a denoising algorithm. Figure 3 describes this specific algorithm in more details. If one changes all Part 2 (in Figure 3) with the computation , we obtain the scheme for the same choice of the denoising engine. While this difference is quite delicate, we should remind the reader that (i) this bridge between the two approaches is valid only when we deploy ADMM on our scheme, and (ii) as opposed to the method, our method is guaranteed to converge to the global optimum of the overall penalty function, as will be described hereafter.

We should point out that when using the ADMM, the update of x applies an aggressive inversion of the log-likelihood term, which is followed by the above optimization task. Thus, the shortcoming mentioned above regarding the lack of balance between the treatments given to the likelihood and the prior is mitigated.

• Fixed-Point Strategy: An appealing alternative to the above exists, obtained via the fixed-point strategy. As our aim is to find x that nulls the gradient, this could be posed as an implicit equation to be solved directly,

 ∇xℓ(y,x)+λ(x−f(x))=0. (34)

Using the fixed-point strategy, this could be handled by the iterative formula

 ∇xℓ(y,xk+1)+λ(xk+1−f(xk))=0. (35)

As an example, in the case of linear degradation model and Gaussian white additive noise, this equation would be

 1σ2HT(y−Hx% k+1)+λ(xk+1−f(xk))=0, (36)

leading to the recursive update relation

 xk+1=[1σ2HT% H+λI]−1[1σ2HTy+λf(xk)]. (37)

This formula suggests one activation of the denoising per iteration, followed by what seems to be a plain Wiener filtering computation666Note that Equation (33) refers to the same problem posed here under the choice and .. The matrix inversion itself could be done in the Fourier domain for block-circulant H, or iteratively using for example, the Richardson algorithm: Defining

 A=1σ2HTH+λI   and   b=1σ2H% Ty+λf(xk), (38)

our goal is to solve the linear system . This is achieved by a variant of the SD method777All this refer to a specific iteration within which we apply inner iterations to solve the linear system, and thus the use of the different index ., , where we have defined . By setting the step size to be , we greedily optimize the potential of each iteration.

Convergence of the above algorithm is guaranteed since

 ∥∥∥[1σ2HTH+λI]−1λ∇xf(xk)∥∥∥<1. (39)

This approach, similarly to the ADMM, has the desired balance mentioned above between the likelihood and the regularization terms, matching the efforts dedicated to both. A pseudo-code describing this algorithm appears in Figure 4.

A basic question that has not been discussed so far is how to set the parameters of in defining the regularization term. More specifically, assuming that the denoising engine depends on one parameter – the noise standard-deviation – the question is which value to use. While one could envision using varying values as the iterations progress and the outcome improves, the approach we take in this work is to set this parameter to be a small and fixed value. Our intuition for this choice is the desire to have a clear and fixed regularization term, which in turn implies a clear cost function to work with. Furthermore, the prior we propose should encapsulate in it our desire to get to a final image that is a stable point of such a weak denoising engine, . Clearly, more work is required to better understand the influence of this parameter and its automatic setting.

## 5 Analysis

### 5.1 Convexity

Is our proposed regularization function convex? At first glance, this may seem like too much to expect. Nevertheless, it appears that for reasonably performing denoising engines obeying the conditions posed in Section 3, this is exactly the case. For the function to be convex, we should demand that the second derivative is a positive semi-definite matrix [67]. We have already seen that the first derivative is simply , which leads to the conclusion that the second derivative is given by .

As already mentioned earlier, in the context of some algorithms such as the NLM and the K-SVD, this is associated with the Laplacian , and it is positive semi-definite if W

has all its eigenvalues in the range

888We could in fact allow negative eigenvalues for W, but this is unnatural in the context of denoising. . This is indeed the case for the NLM filter [1], the K-SVD-denoising algorithm [57], and many other denoising engines.

In the wider context of general image denoising engines, convexity is assured if the Jacobian of the denoising algorithm is stable, as indeed required in Condition 2 in Section 3, . In this case we have that is convex, and this implies that if the log-likelihood expression is convex as well, the proposed scheme is guaranteed to converge to the global optimum of our cost function in Equation (6). In this respect the proposed algorithm is superior to the scheme in its most general form, which at best is known to get to a stable-point [30, 32]. Furthermore, this result may seem similar to the one posed in [33, 38], as our two conditions for global convergence are homogeneity and passivity of , while in these papers the requirements are passivity of , along with a convex energy functional whom minimizes. The second condition – having a convex origin to derive – is in fact more restrictive than demanding homogeneity, as it is unclear which of the known denoisers meet this requirement.

### 5.2 An Alternative Prior

In Section 4 we motivated the choice of the proposed prior by the desire to characterize the unknown image x as one that is not affected by the denoising algorithm, namely, . Rather than taking the route proposed, we could have suggested a prior of the form

 ρQ(x)=∥x−f(x)∥22. (40)

This prior term makes sense intuitively, being based on the same desire to see the denoising residual being small. Indeed, this choice is somewhat related to the option we chose since

 ρQ(x)=∥x−f(x)∥22=ρL(x)+f(x)T(f(x)−x), (41)

suggesting a symmetrization of our own expression.

In order to understand the deeper meaning of this alternative, we resort again to the pseudo-linear denoisers, for which this prior is nothing but

 ρQ(x)=∥x−W(x)x% ∥22=xT(I−W(x))T(I−W(x))x. (42)

This means that rather than regularizing with the Laplacian, we do so with its square. While this is a worthy possibility which has been considered in the literature under the term “fourth order regularization” [71], it is known to be more delicate. We leave this and other possibilities of formulating the regularization with the use of for future work.

### 5.3 When is Plug-and-Play-Prior = RED ?

In Section 4 we described the use of ADMM as one of the possible avenues for handling our proposed regularization. When handling the inverse problem posed in Equation (26) with ADMM, we have shown that the only difference between this and the scheme resides in the update stage for v. Here we aim to answer the following question: Assuming that the numerical algorithm used is indeed the ADMM, under what conditions would the two methods ( and ours) become equivalent? The answer to this question resides in the optimization task for updating v, which is a denoising task. Thus, purifying this question, our goal is to find conditions on and such that the two treatments of this update stage coincide. Starting from our approach, we would seek the solution of

 ^x=\text{Arg}% minxβ2∥x−y∥22+λ2xT(x−f(x)), (43)

or, putting it in terms of nulling the gradient of this energy, require

 β(x−y)+λ(x−f(x))=0. (44)

The that is the solution of this equation is our updated image. On the other hand, the scheme would propose to simply compute999A delicate matter not considered here is that may apply in order to tune to a specific noise level. We assume for simplicity. as a replacement to this minimization task. Therefore, for the two methods to coincide, we should demand that the gradient expression posed above is solved for the choice of the scheme, namely,

 β(f(y)−y)+λ(f(y)−f(f(%y)))=0, (45)

or posed slightly different,

 f(y)−f(f(y))=βλ(y−f(y)). (46)

This means that the denoising residual should remain the same (up to a constant) for the first activation of the denoising engine , and the second one applied on the filtered image .

In order to get a better intuition towards this result, let’s return to the pseudo-linear case, with the assumption that W

is a fixed and diagonalizable matrix. Plugged into the above condition, this gives

 Wy−W2y=βλ(y−Wy), (47)

or posed differently,

 (βλI−W)(I−W)y=0. (48)

As the above equation should hold true for any image y, we require

 (βλI−W)(I−W)=0. (49)

Without loss of generality, we can assume that W is diagonal, after multiplying the above equation from the left and right by the diagonalizing matrix. With this simplification in mind, we now consider the eigenvalues of W, and the above equation implies that exact equivalence between our scheme and the one is obtained only if our denoising engine has eigenvalues that are purely ’s, or . Clearly, this is a very limiting case, which suggests that for all other cases, the two methods are likely to differ.

Interestingly, the above analysis is somewhat related to the one given in [31]. Both [31] and our treatment assume that the actual applied denoising engine is within the ADMM scheme. While we ask for the conditions on W to fit our regularization term , the author of [31] seeks the actual form of the prior to match this step, reaching the conclusion that the prior should be . Bearing in mind that the conditions we get for the equivalence between the two methods are too restricting and rarely met, the result in [31] shows the actual gap between the two methods: While we regularize with the expression , an equivalence takes place only if the modifies this to involve , getting a far more complicated and less natural term.

Just before we conclude this section, we turn briefly to discuss the computational complexity of the proposed algorithm and its relation to the complexity of the scheme. Put very simply, RED and are roughly of the same computational cost. This is the case when RED is deployed via ADMM and assuming only one iteration in the update of v, as shown above. Similarly, when using the fixed-point option, RED has the same cost as per iteration.

To conclude, we must state that this paper is about a more general framework rather than a comparison to the . Indeed, one could consider this work as an attempt to provide more solid mathematical foundations for methods like the . In addition, when comparing and RED, one can identify several major differences that are far more central than the complexity issue, such as (1) a lack of a clear objective function that serves, while our scheme has a very well-defined penalty; (2) the inability to claim much in terms of convergence of the , while our penalty is shown to be convex; (3) the complications of tuning the algorithm, which is very different from the experience we show with RED.

## 6 Results

In this section we compare the performance of the proposed framework to the approach, along with various other leading algorithms that are designed to tackle the image deblurring and super-resolution problems. To this end, we plug two substantially different denoising algorithms into the proposed scheme. The first is the (simple) median filter, which surprisingly turns out to act as a reasonable regularizer to our ill-posed inverse problems. This option is brought as a core demonstration of the idea that an arbitrary denoiser can be deployed in RED without difficulties. The second denoising engine we use is the state-of-the-art Trainable Nonlinear Reaction Diffusion (TNRD) [20] method. This algorithm trains a nonlinear reaction-diffusion model in a supervised manner. As such, in order to treat different restoration problems, one should re-train the underlying model for every specific task – something we aim to avoid. In the experiments below we build upon the published pre-trained model by the authors of TNRD, tailored to denoise images that are contaminated by white Gaussian noise with a fixed101010In order to handle an arbitrary noise-level, , we rely on the following relation , where . noise-level, which is equal to . Leveraging this, we show how state-of-the-art deblurring and super-resolution results can be achieved simply by integrating the TNRD denoiser in RED. In all the experiments that follow, the parameters were manually set in order to enable each method to get its best possible results over the subset of images tested.

### 6.1 Image Deblurring

In order to have a fair comparison to previous work, we follow the synthetic non-blind deblurring experiments conducted in the state-of-the-art work that introduced the Non-locally Centralized Sparse Representation (NCSR) algorithm [9], which combines the self-similarity assumption [1] with the sparsity-inspired model [68]. More specifically, we degrade the test images, supplied by the authors of NCSR, by convolving them with two commonly used point spread functions (PSF); the first is a uniform blur, and the second is a 2D Gaussian function with a standard deviation of . In both cases, an additive Gaussian noise with is then added to the blurred images. Similarly to NCSR, restoring an RGB image is done by converting it to the YCbCr color-space, applying the deblurring algorithm on the luminance channel only, and then converting the result back to the RGB domain.

Table 1 provides the restoration performance of the three RED schemes – the steepest-descent (SD), the ADMM, and the fixed-point (FP) methods – along with the results of the111111We note that using TNRD has never appeared in an earlier publication, and it is brought here in order to let perform as best as it possibly can. , the state-of-the-art NCSR and IDD-BM3D [27], and two additional baseline deblurring methods [72, 73]. For brevity, only the steepest-descent scheme is presented when considering the basic median filter as a denoiser. The performance is evaluated using the Peak Signal to Noise Ratio (PSNR) measure, higher is better, computed on the luminance channel of the ground-truth and the estimated image. The parameters of the proposed approach, as well as the ones of the , are tuned to achieve the best performance on this dataset; in the case of the TNRD denoiser, these are depicted in Table 2 and 3, respectively. In the setting of the median filter, which extracts the median value of a window, we choose to run the suggested steepest-descent scheme for iterations with for the uniform PSF, and with for the Gaussian PSF.