I Introduction
The Plug-and-Play (PnP) ADMM is a variant of the alternating direction method of multipliers (ADMM) algorithm. Since its introduction in 2013 [1], the algorithm has demonstrated extremely promising results in image restoration and signal recovery problems [2, 3, 4, 5, 6, 7, 8]. However, despite the numerous applications of PnP ADMM and several studies of its convergence [8, 9, 10], it is generally unclear why the algorithm performs so well. On the one hand, one can argue that the superior performance is attributed to the underlying image denoisers, which is evident from applications using convolutional neural networks [11, 12, 13, 14]. On the other hand, unless we can explicitly write down the optimization which describes the interaction of the denoisers, e.g., in the form of a maximum-a-posteriori (MAP) estimation, it is extremely difficult to quantify the solution even if we know the solution exists. The contribution of this paper is to present a set of formal performance analysis results for a family of denoisers known as graph filters, so that we can understand the behavior of PnP ADMM.

In order to state the problem more concretely, it is useful to first review the PnP ADMM. The starting point of the PnP ADMM is the ADMM algorithm, which has become a standard tool for minimizing a sum of two separable functions of the form
(1) minimize_x f(x) + λ g(x).
Here, f is the objective function arising from the physical measurement models, e.g., blur, sampling, projection, or other linear transformations. The function g is the regularization function, representing the prior model of the underlying signals, e.g., a norm penalty, sparsity, or total variation. The parameter λ > 0 is the regularization constant, which is assumed known a priori and fixed in this paper. Problems taking the form of (1) are broad, encompassing various linear inverse problems in imaging: deblurring, super-resolution, CT, MRI, and compressed sensing, to name a few.
ADMM solves (1) by converting the unconstrained optimization into a constrained problem
(2) minimize_{x,v} f(x) + λ g(v), subject to x = v,
and considers the augmented Lagrangian function:
(3) L_ρ(x, v, u) = f(x) + λ g(v) + (ρ/2)||x − v + u||² − (ρ/2)||u||².
Then, the algorithm finds the solution by seeking a saddle point of L_ρ, which involves solving a sequence of subproblems in the form
(4) x^{(k+1)} = argmin_x f(x) + (ρ/2)||x − (v^{(k)} − u^{(k)})||²,
(5) v^{(k+1)} = argmin_v λ g(v) + (ρ/2)||v − (x^{(k+1)} + u^{(k)})||²,
(6) u^{(k+1)} = u^{(k)} + (x^{(k+1)} − v^{(k+1)}),
where u is the scaled Lagrange multiplier, and x and v are the intermediate variables. Under mild conditions, e.g., when both f and g are closed, proper and convex, and if a saddle point of L_ρ exists, one can show that the iterates returned by (4)-(6) converge to the solution of (1). Readers interested in knowing more about the theoretical properties of ADMM can consult tutorials such as [15].
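As a concrete illustration (this example is ours, not the paper's), the three updates (4)-(6) can be run on the toy problem f(x) = (1/2)||x − y||² and g(v) = ||v||₁, where both subproblems have closed forms. All variable names are illustrative.

```python
import numpy as np

def admm_l1_denoise(y, lam=1.0, rho=1.0, iters=200):
    """Toy ADMM for min_x 0.5*||x - y||^2 + lam*||x||_1 (illustrative sketch)."""
    v = np.zeros_like(y)
    u = np.zeros_like(y)
    for _ in range(iters):
        # (4) x-update: closed-form proximal step of f at (v - u)
        x = (y + rho * (v - u)) / (1.0 + rho)
        # (5) v-update: prox of (lam/rho)*||.||_1, i.e. soft-thresholding
        v = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)
        # (6) scaled dual update
        u = u + x - v
    return v

y = np.array([3.0, -0.2, 1.5])
x_hat = admm_l1_denoise(y)   # converges to soft-threshold(y, 1) = [2, 0, 0.5]
```

The known closed-form minimizer, the soft-threshold of y at level λ, gives a direct check of convergence.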
The idea of PnP ADMM is to modify (5) by observing that it is essentially a denoising step, if we treat x^{(k+1)} + u^{(k)} as a “noisy” version of v^{(k+1)} and λg as a regularization for v. Based on this observation, we can replace (5) by a denoiser D_σ such that
(7) v^{(k+1)} = D_σ(x^{(k+1)} + u^{(k)}),
where σ = √(λ/ρ) is the denoising strength or a hypothesized “noise level”. The choice of D_σ is broad: D_σ can be a proximal map such as total variation denoising, or an off-the-shelf image denoiser such as BM3D, nonlocal means, or, more recently, a neural network.
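The PnP modification then amounts to swapping the v-update for a call to any denoiser. Below is a minimal sketch (our own, with an illustrative 3-tap moving-average filter standing in for D_σ):

```python
import numpy as np

def box_denoiser(x):
    """Stand-in for D_sigma: a 3-tap moving average (illustrative only)."""
    return np.convolve(x, [0.25, 0.5, 0.25], mode="same")

def pnp_admm(y, rho=1.0, iters=50, denoiser=box_denoiser):
    """PnP ADMM sketch for f(x) = 0.5*||x - y||^2."""
    v = y.copy()
    u = np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (v - u)) / (1.0 + rho)  # (4) closed-form x-update
        v = denoiser(x + u)                    # (7) plug-in denoiser replaces (5)
        u = u + x - v                          # (6) dual update
    return x

rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 32)
y = clean + 0.1 * rng.standard_normal(32)
x_hat = pnp_admm(y)
```

The output should be a smoothed version of y; any denoiser with the same calling convention can be plugged in.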
Replacing the explicit optimization in (5) by an off-the-shelf denoiser in (7) leads to many interesting questions:

Existence of g. Given an arbitrary denoiser D_σ, is it possible to have an equivalent g such that (5) and (7) coincide? If not, under what conditions on D_σ would this happen? Some recent studies provide an answer to this question [8, 16], which will be elaborated in later sections of this paper. An alternative approach is Regularization by Denoising (RED) [17], which has its own advantages and limitations.

Performance. If g does exist, what is it and why is it an effective regularization function? If g does not exist, then what does PnP ADMM solve, and why is it good? The focus of this paper is to address this performance issue. We take two paths: one through the MAP formulation, and the other through a concept called the consensus equilibrium.

Generalization. How do we generalize the algorithm to multiple denoisers? The generalization can be done using the consensus equilibrium technique. However, can we explain the performance, and what would be the optimal combination weights?
The main assumption we make in this paper is that the denoisers are in the family of symmetric smoothing filters [18, 19, 20] or graph filters [21, 22, 23, 24, 25, 26, 27]. These filters are simple in the sense that they are (pseudo) linear and hence they can be expressed using matrices. However, they are also representative enough to cover a sufficiently large group of image denoisers that we encounter in the literature.
The remainder of the paper is arranged as follows. We begin with a brief review of the construction of graph filters and their properties in Section 2. The main results are presented in Sections 3 and 4: Section 3 analyzes the problem from a MAP perspective, whereas Section 4 approaches the problem from the consensus equilibrium perspective. Further discussions of open problems are presented in Section 5.
II. Graph Filters
II-A. Constructing a Graph Filter
Graph filters are a class of image denoisers mostly studied in the field of graph signal processing. The applications of graph filters are very broad, e.g., deblurring using the graph Laplacian [24], boosting image denoisers [22], image denoising using global similarity [25], JPEG compression using random walks [28], and blind deconvolution using reweighted graph total variation [27]. In terms of theory, there are studies of graphs in the continuous domain [26], sampling of graphs [29], and filter banks on graphs [30], to name a few.
Represented as matrices, graph filters take the form:
(8) z = W y,
where y is the (vectorized) noisy image, and z is the denoised image. The matrix W is the filter. There are multiple ways of constructing W. One of the most commonly used approaches is to define a nonlocal weight kernel K:
(9) K_{ij} = exp(−||p_i − p_j||² / (2h²)),
where p_i is a d-dimensional patch centered at pixel i, and h is a parameter characterizing the denoising strength of the filter. Modifications to (9) are common, e.g., incorporating spatial distance (which gives the nonlocal means filter), restricting p_i to pixels instead of patches (which gives the bilateral filter), or replacing the norm by a weighted norm (which gives the kernel regression filter). For more examples of kernel matrices, we refer the readers to [18].
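As a small illustration (ours), the kernel in (9) can be formed for a 1D signal by comparing edge-padded patches:

```python
import numpy as np

def nlm_kernel(signal, patch_radius=2, h=0.5):
    """Nonlocal weight kernel of (9): K[i, j] = exp(-||p_i - p_j||^2 / (2 h^2)),
    with p_i the patch centered at sample i (edge-padded)."""
    n = len(signal)
    padded = np.pad(signal, patch_radius, mode="edge")
    patches = np.stack([padded[i:i + 2 * patch_radius + 1] for i in range(n)])
    # Squared patch distances between all pairs (i, j)
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * h ** 2))

x = np.sin(np.linspace(0.0, 2.0 * np.pi, 40))
K = nlm_kernel(x)   # symmetric, entries in (0, 1], ones on the diagonal
```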
The matrix K defined in (9) is symmetric, but its row sums are not 1. Thus, K cannot be used directly as a denoising filter because it amplifies or attenuates signals. To normalize the matrix while preserving its symmetry, the most popular method is the Sinkhorn-Knopp balancing algorithm [31, 19, 20], which iteratively normalizes the rows and columns of K until convergence. When K is balanced in this way, the resulting matrix is called a symmetric smoothing filter, given by
(10) W = D K D,
where the diagonal matrix D is chosen such that W1 = 1 and W^T 1 = 1. Since the columns and rows of W sum to 1, W is a doubly stochastic matrix.
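The balancing step can be sketched as follows (our implementation of the classical Sinkhorn-Knopp iteration; the final symmetrization guards against finite-iteration asymmetry):

```python
import numpy as np

def sinkhorn_knopp(K, iters=1000):
    """Balance a symmetric nonnegative kernel K so that W = diag(r) K diag(c)
    is doubly stochastic; for symmetric K the limit is symmetric."""
    n = K.shape[0]
    r = np.ones(n)
    c = np.ones(n)
    for _ in range(iters):
        r = 1.0 / (K @ c)        # normalize row sums
        c = 1.0 / (K.T @ r)      # normalize column sums
    W = (r[:, None] * K) * c[None, :]
    return 0.5 * (W + W.T)       # symmetrize residual asymmetry

t = np.linspace(0.0, 1.0, 30)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1)   # illustrative kernel
W = sinkhorn_knopp(K)
```

The result is a symmetric, nonnegative matrix whose rows and columns sum to 1 up to numerical tolerance.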
II-B. Properties of Graph Filters
There are a few interesting properties of W.

W can be considered as an undirected graph, where W_{ij} is the weight on the edge linking node i and node j.

The eigenvalues of W lie in [0, 1]. This follows from the fact that W is doubly stochastic (so its spectral radius is at most 1) and that the Gaussian kernel K in (9) is positive semidefinite (so the eigenvalues of W = DKD are all nonnegative).

The regularization defined through the graph Laplacian I − W can be interpreted as
(11) x^T (I − W) x = (1/2) Σ_{i,j} W_{ij} (x_i − x_j)²,
which measures the smoothness of the signal x with the weights defined by the graph W.

The symmetric matrix W has better denoising performance than its nonsymmetric, row-normalized counterpart. This is attributed to the implicit clustering of the pixels during the Sinkhorn-Knopp balancing algorithm [20], among other reasons, such as a reduced degree of freedom so that the drop in variance overrides the gain in bias [19].
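The smoothness interpretation in (11) is easy to verify numerically. Below (our example) we build a small symmetric doubly stochastic W as a convex combination of symmetric permutation matrices and check that x^T(I − W)x equals the weighted pairwise sum:

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric doubly stochastic W: convex combination of symmetric permutations.
P1 = np.eye(4)[[1, 0, 3, 2]]
P2 = np.eye(4)[[3, 2, 1, 0]]
W = 0.5 * np.eye(4) + 0.3 * P1 + 0.2 * P2
x = rng.standard_normal(4)

quad = x @ (np.eye(4) - W) @ x
pairwise = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                     for i in range(4) for j in range(4))
# quad == pairwise because every row of W sums to 1
```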
In order to present the theoretical results of this paper, we make an assumption about W:
Assumption 1.
We assume that the matrix W is invertible, i.e., there exists a matrix W^{-1} such that
(12) W^{-1} W = W W^{-1} = I.
Assumption 1 is very strong and could be unrealistic for practical W’s. However, for the derivation of the main results, at least in Section 3, we need it to simplify some of the technical details. In Section 4, when we discuss the performance from the consensus equilibrium perspective, we will lift the assumption.
Remark: Pseudo-Linearity. An important remark concerns the pseudo-linearity of W in the actual PnP ADMM implementation. That is, the matrix W is a function of the input noisy image y, or more precisely z = W(y)y instead of (8). However, as mentioned in [32], the dependency of W on y is typically weak if W is estimated from a prefiltered version of y, for example by applying a baseline denoising algorithm to y and constructing the weight matrix from the baseline estimate. Practically, this is also justified in applications such as [8], where W is updated for the first tens of iterations and is kept fixed for the remaining iterations. Following the same line of argument, we assume that W is predefined and is independent of y. Of particular interest is the oracle case where W is estimated from the ground truth.

III. Performance Analysis via MAP
In this section we present the first set of analytic results. We call it the maximumaposteriori (MAP) analysis because we need to explicitly derive the MAP optimization.
III-A. Existence of g
Given a graph filter D_σ(x) = Wx, the first question we should ask is whether it is possible to obtain the regularization function g defined in (5). The rationale is that if we are able to do so, then we can characterize the MAP optimization in (1), and hence understand the performance.
Consider the minimization
(13) v̂ = argmin_v λ g(v) + (ρ/2)||v − x||²,
the solution of which is characterized by
(14) v̂ = (I + (λ/ρ) ∂g)^{-1}(x),
where ∂g is the subdifferential of g, and the inverse operator is the resolvent of ∂g [15]. We assume that W is an invertible matrix. Therefore, g must be continuously differentiable, so that the subgradient at every point reduces to a unique gradient ∇g. This replaces ∂g by ∇g, and substituting D_σ(x) = Wx into (14) yields
(15) W x = (I + (λ/ρ) ∇g)^{-1}(x).
Define G as the matrix representation of ∇g such that ∇g(x) = Gx. Since (15) holds for all x, this further implies that the mappings are identical:
(16) W = (I + (λ/ρ) G)^{-1}.
Applying W^{-1} to both sides yields I + (λ/ρ)G = W^{-1}, and hence
(17) ∇g(x) = G x = (ρ/λ)(W^{-1} − I) x.
The identity in (17) offers a useful way of characterizing the existence of g.
Theorem 1 (Sreehari et al. [8], Theorem A.1, and Reehorst and Schniter [16], Theorem 1).
The mapping g exists if and only if the Jacobian of D_σ is symmetric.
Proof.
The idea is to treat D_σ as a vector field, which must be conservative for g to exist. Therefore, the Jacobian of D_σ must be symmetric, by a classic result in vector calculus [33] (in particular, Theorems 4.3.8 and 4.3.10). ∎
For graph filters, the symmetry of the Jacobian is guaranteed because W is symmetric. In addition, we also have that D_σ is nonexpansive, as the eigenvalues of W are bounded above by 1. Therefore, if we use a graph filter and if g exists, then according to [8] the corresponding minimization is a proximal map, and g is convex, closed and proper. However, an arbitrary denoiser does not necessarily have a symmetric Jacobian, and hence g is not guaranteed to exist.
III-B. Deriving g
We can now state the first key result of this paper.
Theorem 2.
Let D_σ(x) = Wx be a graph filter, with W having full rank so that the inverse W^{-1} exists. If g is defined by
(18) g(x) = (ρ/(2λ)) x^T (W^{-1} − I) x,
then the solution of the minimization in (5) is
(19) v̂ = W(x^{(k+1)} + u^{(k)}).
Proof.
Substituting (18) into (5) and setting the gradient with respect to v to zero yields
(20) ρ(W^{-1} − I) v̂ + ρ(v̂ − (x^{(k+1)} + u^{(k)})) = 0,
which simplifies to W^{-1} v̂ = x^{(k+1)} + u^{(k)}, i.e., v̂ = W(x^{(k+1)} + u^{(k)}). ∎

Can we replace W^{-1} by its pseudoinverse W† if W is not invertible? The answer is no. If we replace W^{-1} by W† in (18), the optimization (13) becomes
(21) v̂ = argmin_v (ρ/2) v^T (W† − I) v + (ρ/2)||v − x||².
The first-order optimality condition of (21) is given by W† v̂ = x, which does not imply v̂ = Wx, because W†W ≠ I and WW† ≠ I unless W has full rank. More specifically, W† v̂ = x requires x to live in the range space of W†. Since W contains zero eigenvalues, there are multiple v̂ which are mapped to the same x through W†. This is problematic, because the optimizer of (21) is no longer Wx, and hence it violates our assumption that the denoiser is the graph filter W.
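The failure of the pseudoinverse substitution is easy to see numerically (our example): for a rank-deficient symmetric filter, W†W is only the projector onto the range of W, not the identity, and distinct inputs collapse to the same image.

```python
import numpy as np

# Rank-1 symmetric "filter": orthogonal projector onto span{u}.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
W = np.outer(u, u)                  # eigenvalues {1, 0}; not invertible
W_pinv = np.linalg.pinv(W)

# For a projector, pinv(W) = W, so W_pinv @ W = W, which is not the identity.
gram = W_pinv @ W

# Two different vectors are mapped to the same point by W_pinv:
v1 = np.array([1.0, 1.0])
v2 = v1 + np.array([1.0, -1.0])     # differs by a null-space component
```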
III-C. The Role of ρ
Knowing the explicit expression of g, we can now analyze its performance. But before we proceed to the discussion, we should clarify the role of the parameter ρ. By substituting (18) into (2), and noting that λg(x) = (ρ/2) x^T (W^{-1} − I) x, we can show that the original problem which PnP ADMM solves is
(22) minimize_x f(x) + (ρ/2) x^T (W^{-1} − I) x.
There is an alternative optimization in the graph signal processing literature, known as graph Laplacian regularization, which we will study extensively in this paper:
(23) minimize_x f(x) + (λ/2) x^T (I − W) x.
In (22), the original regularization parameter λ is eliminated because g contains its reciprocal. Thus, λ is replaced by the ADMM internal parameter ρ. This internal parameter has no influence on the final solution in the classical ADMM, because ADMM converges for all ρ > 0 when f and g are closed, proper and convex. However, in PnP ADMM, the optimization cost is characterized by ρ. Thus, the solution changes as ρ changes, a phenomenon reported in earlier papers, e.g., [9].
We can also ask: for the best ρ and λ, how would the solutions of (22) and (23) behave? Figure 1 shows an experimental result for a denoising problem, where f(x) = (1/2)||x − y||². We vary λ (and, equivalently, ρ) to evaluate the mean squared error (MSE) for a fixed oracle W. As shown in Figure 1, PnP ADMM and graph Laplacian have different operating regimes: PnP ADMM prefers a smaller ρ, whereas graph Laplacian prefers a larger λ. We will explain this observation in Section 4.
III-D. Comparison with Graph Laplacian
We can now compare the performance of PnP ADMM with the conventional graph Laplacian regularization. In order to do so, we consider a denoising problem with f(x) = (1/2)||x − y||², where y = x + η and η ∼ N(0, σ²I). That is, y is a noisy version of the latent signal x. This choice of f is also justified by the subproblem (5), which is itself a denoising problem.
We consider the following two optimizations:
(24) x̂_GL = argmin_x (1/2)||x − y||² + (β/2) x^T (I − W) x,
(25) x̂_PnP = argmin_x (1/2)||x − y||² + (β/2) x^T (W^{-1} − I) x.
Here, the first optimization (24) uses the graph Laplacian as the regularization, and the second optimization (25) uses the PnP regularization. We use a single parameter β instead of λ or ρ to keep the notation simple. The solutions of the two optimizations are
(26) x̂_GL = (I + β(I − W))^{-1} y ≜ A y,
(27) x̂_PnP = (I + β(W^{-1} − I))^{-1} y ≜ B y,
where we defined the matrices A and B to simplify notation.
Our goal is to analyze the mean squared error (MSE) of these two estimates. To this end, we first eigendecompose W as W = U Λ U^T, where Λ = diag(λ_1, …, λ_n) is the eigenvalue matrix. Define x̄ = U^T x as the projection of x onto the columns of U. Then, the MSEs of x̂_GL and x̂_PnP can be found as follows.
Proposition 1.
Define x̂_GL and x̂_PnP via (26) and (27). Let W = U Λ U^T, where U is the eigenvector matrix and Λ is the eigenvalue matrix. Define x̄ = U^T x. Then, the MSEs of x̂_GL and x̂_PnP are respectively
(28) MSE_GL = Σ_i [β²(1 − λ_i)² x̄_i² + σ²] / (1 + β(1 − λ_i))²,
(29) MSE_PnP = Σ_i [β²(1 − λ_i)² x̄_i² + σ² λ_i²] / (λ_i + β(1 − λ_i))².
Proof.
The MSE of x̂_GL is
(30) MSE_GL = E||Ay − x||² = ||(A − I)x||² + σ² Tr(A^T A)
(31) = Σ_i [β²(1 − λ_i)² x̄_i² + σ²] / (1 + β(1 − λ_i))².
Similarly, the MSE of x̂_PnP is
(32) MSE_PnP = E||By − x||² = Σ_i [β²(1 − λ_i)² x̄_i² + σ² λ_i²] / (λ_i + β(1 − λ_i))².
∎
Now, we can consider the i-th term of each sum:
(33) MSE_GL,i = [β²(1 − λ_i)² x̄_i² + σ²] / (1 + β(1 − λ_i))²,
(34) MSE_PnP,i = [β²(1 − λ_i)² x̄_i² + σ² λ_i²] / (λ_i + β(1 − λ_i))².
It holds that, as λ_i → 0,
(35) MSE_GL,i → (β² x̄_i² + σ²) / (1 + β)²,
(36) MSE_PnP,i → x̄_i².
The limits in (35) and (36) provide important information about the performance. We make the following observations.
1. Energy Concentration by W. The graph filter W, when applied to a signal, projects the signal onto the space spanned by its leading eigenvectors. The effectiveness of this projection is determined by the quality of W. In the oracle setting, W ensures that most of the signal energy is captured by the first few eigenvectors. This is illustrated in Figure 2, where we consider a 1D signal x and its noisy version y. The weight matrix is constructed from x using the nonlocal means kernel defined in (9). By projecting x and y onto the eigenvector matrix U, we see that U^T x has significantly fewer nonzero entries than U^T y at large eigen-indices. Since x̄ is the vector U^T x, we have the observation:
Observation 1: For large i, x̄_i is typically very small.
2. Eigenvalues at High Frequency. Based on Observation 1, an immediate conclusion from (35) and (36) is that when x̄_i ≈ 0, the two limits are respectively
MSE_GL,i → σ²/(1 + β)² and MSE_PnP,i → 0.
In other words, the PnP regularization ensures that the MSE goes to zero at high frequencies (i.e., large eigen-index i), whereas the graph Laplacian has residual MSE at high frequencies. This can also be seen from the eigenvalues of A and B:
λ_i(A) = 1/(1 + β(1 − λ_i)) and λ_i(B) = λ_i/(λ_i + β(1 − λ_i)),
where λ_i(·) denotes the i-th eigenvalue of the matrix. If we plot λ_i(A) and λ_i(B) as functions of the eigen-index i, we obtain the result shown in Figure 3 (with β and σ fixed in this experiment). As λ_i → 0, we have
λ_i(A) → 1/(1 + β) and λ_i(B) → 0,
which explains why λ_i(A) reaches a steady state whereas λ_i(B) reaches zero as i grows.
Observation 2: As i increases so that λ_i → 0 and x̄_i → 0, the eigenvalue λ_i(A) approaches a nonzero constant, whereas λ_i(B) approaches 0. The same observation applies to the MSEs.
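The per-mode behavior can be checked numerically. The sketch below (ours) uses closed forms for the per-mode gains and MSE terms of the two estimators (26) and (27); these expressions are our own derivation under the Gaussian noise model, and the constants β, σ², x̄_i² are arbitrary illustrative values.

```python
import numpy as np

beta, sigma2, xbar2 = 2.0, 0.1, 0.5       # illustrative constants
lam = np.linspace(1e-6, 1.0, 200)         # eigenvalues; small lam = high frequency

gain_gl = 1.0 / (1.0 + beta * (1.0 - lam))     # per-mode gain of A
gain_pnp = lam / (lam + beta * (1.0 - lam))    # per-mode gain of B

# Per-mode MSE terms (squared bias + variance)
mse_gl = (beta**2 * (1 - lam)**2 * xbar2 + sigma2) * gain_gl**2
mse_pnp = ((beta**2 * (1 - lam)**2 * xbar2 + sigma2 * lam**2)
           / (lam + beta * (1.0 - lam))**2)

var_gl = sigma2 * gain_gl**2
var_pnp = sigma2 * gain_pnp**2
```

At the high-frequency end (smallest λ), the graph Laplacian gain saturates at 1/(1+β) while the PnP gain vanishes, and the PnP variance never exceeds the graph Laplacian variance.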
3. Trade-offs in Bias and Variance. The MSE is the sum of the squared bias and the variance, so we can decompose (33) and (34) and inspect
Bias²_GL,i = β²(1 − λ_i)² x̄_i² / (1 + β(1 − λ_i))², Var_GL,i = σ² / (1 + β(1 − λ_i))²,
Bias²_PnP,i = β²(1 − λ_i)² x̄_i² / (λ_i + β(1 − λ_i))², Var_PnP,i = σ² λ_i² / (λ_i + β(1 − λ_i))².
Since by construction λ_i ∈ [0, 1], we can show that Var_PnP,i ≤ Var_GL,i.
Proposition 2.
For any β > 0, σ² > 0 and λ_i ∈ [0, 1], it holds that Var_PnP,i ≤ Var_GL,i for all i.
Proof.
To prove this result we just need to show that the quantity
λ_i(1 + β(1 − λ_i)) − (λ_i + β(1 − λ_i)) = −β(1 − λ_i)²
is nonpositive, which holds since β > 0. Dividing by the positive denominators shows that λ_i/(λ_i + β(1 − λ_i)) ≤ 1/(1 + β(1 − λ_i)); squaring and multiplying by σ² gives the claim. ∎
The conclusion of this proposition is that the variance of PnP is always lower than the variance of graph Laplacian. However, the bias of PnP is larger than the bias of graph Laplacian, as we show in Figure 4. More specifically, we can prove the following result:
Proposition 3.
For any β > 0, σ² > 0 and λ_i ∈ [0, 1], it holds that Bias²_PnP,i ≥ Bias²_GL,i for all i.
Proof.
Since the numerators of Bias²_PnP,i and Bias²_GL,i are identical, we only need to compare the denominators. We have
λ_i + β(1 − λ_i) ≤ 1 + β(1 − λ_i)
if λ_i ≤ 1, which is true by construction. ∎
The reason why the overall MSE is lower for PnP is attributed to the observation below.
Observation 3: The drop in variance of PnP is more significant than the gain in bias, and hence PnP’s overall MSE is lower than that of graph Laplacian.
4. When is Graph Laplacian Better? The analysis above suggests conditions under which PnP will perform worse than graph Laplacian. This happens when the high-frequency coefficients x̄_i are large, because then the smaller denominator of MSE_PnP,i amplifies the bias term more than that of MSE_GL,i. This also suggests that PnP is more sensitive to the correctness of W than graph Laplacian. That is, when constructing W from a prefiltered signal, PnP demands a higher-quality prefiltered signal.
To confirm this claim, we conduct an experiment in which a prefiltered signal is constructed as x̃ = x + ε, where x is the true signal and ε ∼ N(0, γ²I). That is, the prefiltered signal is a noisy version of the clean signal, with perturbation specified by the noise variance γ². We then construct a graph filter W using x̃; technically speaking, W is now W(x̃). Our goal is to see how the MSEs of graph Laplacian and PnP change as γ increases.¹
¹An alternative experiment would be to pick a specific baseline denoising algorithm and adjust its internal parameters. However, different baseline denoising algorithms have different characteristics, and this could bias the analysis. Thus, we choose to design the prefiltered signal by adding i.i.d. noise.
The result of this experiment is shown in Figure 5, where we observe that PnP has a considerably lower MSE than graph Laplacian when γ is small. But as γ increases, the MSE of PnP quickly increases and eventually exceeds that of graph Laplacian. This explains the relatively weak robustness of PnP compared to graph Laplacian.
However, we should emphasize that this conclusion is only valid when W is fixed. In practical PnP ADMM, the weight matrix W is updated in every ADMM iteration. If the estimation error decreases as PnP iterates, then W could become a good filter.
Observation 4: The performance of PnP could be worse than graph Laplacian when W is estimated wrongly.
5. Geometry: Truncating the Eigenvectors of W. If W is the oracle, is there any simple geometric interpretation, besides the bias-variance analysis, that can explain the performance?
By expressing the eigendecomposition of W as
(37) W = U_1 Λ_1 U_1^T + U_2 Λ_2 U_2^T,
where Λ_1 contains the large eigenvalues and Λ_2 the near-zero eigenvalues, the PnP ADMM regularization in (25) can be written as
(38) (β/2) x^T (W^{-1} − I) x = (β/2) x^T U_1 (Λ_1^{-1} − I) U_1^T x + (β/2) x^T U_2 (Λ_2^{-1} − I) U_2^T x.
If the eigenvalues in Λ_2 approach zero, then the second term in (38) dominates, as Λ_2^{-1} grows unboundedly. In this case, we must have U_2^T x = 0 in order for the regularization to stay finite. Therefore, the overall optimization becomes constrained:
(39) minimize_x (1/2)||x − y||² + (β/2) x^T U_1 (Λ_1^{-1} − I) U_1^T x, subject to U_2^T x = 0.
The constraint U_2^T x = 0 provides a simple geometric interpretation. While the graph Laplacian regularization searches the entire column space, PnP ADMM searches only the subspace spanned by the eigenvectors with nonzero eigenvalues. As we saw in Observation 1, this provides very strong denoising power when W is the oracle. Referring to the decomposition in (37), the signal x is spanned only by U_1, whereas the noise η is spanned by both U_1 and U_2. By restricting U_2^T x = 0, we ensure that any noise living in span(U_2) is eliminated. See Figure 6 for an illustration. As a result, the denoising is significantly more effective than graph Laplacian.
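This subspace picture can be illustrated with a synthetic example (ours): put the clean signal in a 6-dimensional subspace playing the role of U_1, add white noise, and compare the error before and after enforcing the constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# Orthonormal basis; first 6 columns play U1 (signal subspace), rest play U2.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
U1, U2 = Q[:, :6], Q[:, 6:]

x = U1 @ np.array([2.0, -1.0, 0.5, 1.5, -0.3, 0.8])  # signal lives in span(U1)
y = x + 0.3 * rng.standard_normal(n)                 # noise spans U1 and U2

y_trunc = U1 @ (U1.T @ y)                # enforce the constraint U2^T x = 0
err_full = np.linalg.norm(y - x)         # keeps all of the noise
err_trunc = np.linalg.norm(y_trunc - x)  # only the noise inside span(U1) remains
```

The truncation zeroes out the pure-noise components in span(U2), so the residual error drops to roughly the noise energy within the 6-dimensional signal subspace.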
Observation 5: Geometrically, PnP ADMM restricts the search space so that the pure noise components are forced to zero.
III-E. Questions Not Answered
To summarize, the fundamental reason why PnP works better is that the graph filter W effectively removes the unwanted noise from y before proceeding to additional processing. Thus, a stronger denoiser leads to a better PnP ADMM algorithm.
One major limitation of the above MAP-based analysis is that it tries to match the denoiser with a proximal map associated with a regularization function g. This is a particularly strong requirement, because many denoisers do not have symmetric Jacobians and hence have no corresponding proximal map. It also requires W^{-1} to exist, which is difficult to have in practice; e.g., a uniform averaging operator does not have an inverse.
The question we ask is: if we are willing to give up the proximal map requirement, would it be possible to obtain another interpretation that can explain the performance of denoisers that do not have a g? This leads to the equilibrium-based analysis of the next section.
IV. Performance Analysis via Equilibrium
The MAP optimization in (1) is not the only way to characterize the PnP ADMM solution. In [34], Buzzard et al. generalize PnP ADMM by a concept called consensus equilibrium, which allows us to characterize the solution of PnP ADMM by analyzing the equilibrium state of a dynamical system. In this section we discuss the PnP ADMM performance from this equilibrium perspective.
IV-A. Equilibrium
The starting point of the equilibrium analysis is the PnP ADMM algorithm listed in (4)-(6). By defining a pair of operators F and G from (4) and (5) respectively, we can show that the PnP ADMM solution (x*, u*) satisfies
(40) F(x* − u*) = x*,
(41) G(x* + u*) = x*.
A pair (x*, u*) satisfying (40) and (41) is called a consensus equilibrium solution. To make this more concrete, consider the following denoising example.
Proposition 4.
Consider the two operators F and G:
F(z) = argmin_x (1/2)||x − y||² + (ρ/2)||x − z||² = (y + ρz)/(1 + ρ), and G(z) = Wz.
The consensus equilibrium solution satisfies
(42) x* = (ρI − (ρ − 1)W)^{-1} W y,
and u* = (y − x*)/ρ.
Proof.
Since F(x* − u*) = x*, it holds that (y + ρ(x* − u*))/(1 + ρ) = x*, which implies that
u* = (y − x*)/ρ.
The equilibrium condition of G gives
W(x* + u*) = x*.
Substituting u* = (y − x*)/ρ into this condition gives
W((ρ − 1)x* + y) = ρ x*.
Rearranging the terms yields the desired result. ∎
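The fixed-point property is easy to verify numerically. The sketch below (ours) assumes the two agents F(z) = (y + ρz)/(1 + ρ) and G(z) = Wz and the closed-form equilibrium x* = (ρI − (ρ − 1)W)^{-1}Wy, u* = (y − x*)/ρ; treat these expressions as our reconstruction of the example, not a quotation of the paper.

```python
import numpy as np

rho = 2.0
W = np.array([[0.6, 0.4], [0.4, 0.6]])      # symmetric doubly stochastic filter
y = np.array([1.0, 3.0])

F = lambda z: (y + rho * z) / (1.0 + rho)   # prox of 0.5*||x - y||^2
G = lambda z: W @ z                         # graph-filter denoiser

# Closed-form equilibrium under the assumptions stated above:
x_star = np.linalg.solve(rho * np.eye(2) - (rho - 1.0) * W, W @ y)
u_star = (y - x_star) / rho

# Both equilibrium conditions hold: F(x* - u*) = x* and G(x* + u*) = x*.
```

Given any candidate pair (x, u), the same two evaluations provide a direct equilibrium check without running the ADMM iterations.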
The interesting fact about Proposition 4 is that, given (x*, u*), we can easily check whether it is a consensus equilibrium point without running the PnP ADMM algorithm. In other words, we can “create” an alternative optimization such that its solution is an equilibrium point, and we can interpret the meaning of that optimization to gain insights. To illustrate this point, let us continue with the example in Proposition 4.
Proposition 5.
Define the following four functions:
(43)  
(44)  
(45)  
(46) 
The minimizers of these four functions all satisfy the consensus equilibrium condition (42).
Proof.
Proposition 5 shows four different optimization formulations that give the same consensus equilibrium solution. The first optimization, (43), is the MAP formulation we saw in Section 3. The other optimizations, (44), (45) and (46), have some meaningful interpretations.
Interpretation of (44). The optimization in (44) is informative in the sense that it completely eliminates the inverse matrix W^{-1}, as opposed to the MAP in (25). The objective function says that instead of minimizing the residue between the optimization variable x and the noisy observation y, we minimize the residue with respect to the filtered version Wy. For oracle W’s, where the pure noise components are isolated at high frequencies (see Observation 5), the filtering by W ensures that the pure noise components are eliminated. Therefore, when minimizing (44), we are guaranteed to search only over the signal component.
The regularization term in (44) is a concave function of x, because I − W is positive semidefinite. The concave penalty injects high-frequency components that are lost by the filtering. In the extreme cases of ρ, the minimizer reduces to either the filtered signal Wy or the noisy observation y. In order to control the amount of high-frequency injection, a small ρ is preferred over a large ρ. This explains why in Figure 1 a smaller ρ works better for PnP ADMM.
Interpretation of (45). The function in (45) offers an interpretation similar to (44): the objective places no penalty on the subspace spanned by the eigenvectors with zero eigenvalues, so the regularization forces those dimensions to zero. It is also a special case of Kheradmand and Milanfar [24], who proposed an objective of the same form for a general class of filters; for a particular choice of their parameters, we obtain (45).
Interpretation of (46). The last example, (46), while still giving the consensus equilibrium solution, is more or less useless in terms of interpretation, because it is just a restatement of the solution.
In summary, Proposition 5 shows that the solution of PnP ADMM is completely characterized by the consensus equilibrium. Solving the MAP optimization is just one of many possible ways of achieving the equilibrium. If we depart from the MAP formulation, i.e., if we are willing to change the forward modeling by integrating the denoiser into the forward model, then there is a wide range of optimization formulations that offer meaningful interpretations while satisfying the equilibrium condition.
IV-B. General Linear Inverse Problems
The equilibrium-based approach also allows us to extend the analysis to general linear inverse problems, which could be difficult to study by writing out the MSE as in Section 3. The general linear inverse problem using PnP ADMM with a graph filter requires two operators: a data-fidelity operator F, now defined through a linear forward model, and the graph filter G(z) = Wz,
of which the equilibrium solution satisfies
(48) 
The condition in (48) can be written as
which is also equivalent to
(49) 
While (49) may look unnatural at first glance, in the sparse signal processing literature it is reminiscent of the MAP-synthesis minimization discussed by Elad [35].
How do we interpret (49)? The inner minimization is the standard graph Laplacian minimization with a modified data term in the objective. If W has zero eigenvalues, the corresponding directions are left untouched by the data term; then, by virtue of the regularization, the minimizer must be zero along those directions, because I − W is positive semidefinite. The outer step of multiplying the solution by W ensures that any noise remaining from the inner minimization is eliminated before forming the final solution.
IV-C. Performance of Multiple Priors
Beyond linear inverse problems, we can extend the analysis to PnP ADMM involving multiple priors. In this case, the MAP-based optimization is