Performance Analysis of Plug-and-Play ADMM: A Graph Signal Processing Perspective

08/31/2018 · Stanley H. Chan, et al. · Purdue University

The Plug-and-Play (PnP) ADMM algorithm is a powerful image restoration framework that allows advanced image denoising priors to be integrated into physical forward models to yield a provably convergent algorithm. However, despite the enormous applications and promising results, very little is known about why the PnP ADMM performs so well. This paper presents a formal analysis of the performance of PnP ADMM. By restricting the denoisers to the class of graph filters, or more specifically the symmetric smoothing filters, we offer three contributions: (1) We rigorously show conditions under which an equivalent maximum-a-posteriori (MAP) optimization exists, (2) we derive the mean squared error of the PnP solution, and provide a simple geometric interpretation which can explain the performance, (3) we introduce a new analysis technique via the concept of consensus equilibrium, and provide interpretations to general linear inverse problems and problems with multiple priors.


I Introduction

The Plug-and-Play (PnP) ADMM is a variant of the alternating direction method of multipliers (ADMM) algorithm. Since its introduction in 2013 [1], the algorithm has demonstrated extremely promising results in image restoration and signal recovery problems [2, 3, 4, 5, 6, 7, 8]. However, despite the enormous range of applications of PnP ADMM and several studies on its convergence [8, 9, 10], it is generally unclear why the algorithm performs so well. On the one hand, one can argue that the superior performance is attributed to the underlying image denoisers, which is evident from applications using convolutional neural networks [11, 12, 13, 14]. On the other hand, unless we can explicitly write down the optimization that describes the interaction of the denoisers, e.g., in the form of a maximum-a-posteriori (MAP) estimation, it is extremely difficult to quantify the solution even if we know the solution exists. The contribution of this paper is a set of formal performance analysis results for a family of denoisers known as graph filters, so that we can understand the behavior of PnP ADMM.

In order to state the problem more concretely, it is useful to first review the PnP ADMM. The starting point of the PnP ADMM is the ADMM algorithm, which has become a standard tool for minimizing a sum of two separable functions of the form

\underset{x}{\text{minimize}}\;\; f(x) + \lambda g(x).    (1)

Here, $f(\cdot)$ represents the objective function arising from the physical measurement model, e.g., blur, sampling, projection, or other linear transformations. The function $g(\cdot)$ is the regularization function, representing the prior model of the underlying signal, e.g., the $\ell_1$ norm, sparsity, or total variation. The parameter $\lambda$ is the regularization constant, which is assumed known a priori and is fixed in this paper. Problems taking the form of (1) are broad, encompassing various linear inverse problems in imaging: deblurring, super-resolution, CT, MRI, and compressed sensing, to name a few.

ADMM solves (1) by converting the unconstrained optimization into a constrained problem

\underset{x,\,v}{\text{minimize}}\;\; f(x) + \lambda g(v), \quad \text{subject to}\;\; x = v,    (2)

and considers the augmented Lagrangian function (in the scaled form):

\mathcal{L}_\rho(x, v, u) = f(x) + \lambda g(v) + \frac{\rho}{2}\|x - v + u\|^2 - \frac{\rho}{2}\|u\|^2.    (3)

Then, the algorithm finds the solution by seeking a saddle point of $\mathcal{L}_\rho$, which involves solving a sequence of subproblems of the form

x^{(k+1)} = \underset{x}{\operatorname{argmin}}\;\; f(x) + \frac{\rho}{2}\left\|x - \left(v^{(k)} - u^{(k)}\right)\right\|^2,    (4)
v^{(k+1)} = \underset{v}{\operatorname{argmin}}\;\; \lambda g(v) + \frac{\rho}{2}\left\|v - \left(x^{(k+1)} + u^{(k)}\right)\right\|^2,    (5)
u^{(k+1)} = u^{(k)} + \left(x^{(k+1)} - v^{(k+1)}\right),    (6)

where $u$ is the scaled Lagrange multiplier, and $x$ and $v$ are the intermediate variables. Under mild conditions, e.g., when both $f$ and $g$ are closed, proper and convex, and if a saddle point of $\mathcal{L}_\rho$ exists, one can show that the iterates returned by (4)-(6) converge to the solution of (1). Readers interested in the theoretical properties of ADMM can consult tutorials such as [15].

The idea of PnP ADMM is to modify (5) by observing that it is essentially a denoising step, if we treat $x^{(k+1)} + u^{(k)}$ as a "noisy" version of $v$ and $\lambda g$ as a regularization for $v$. Based on this observation, we can replace (5) by a denoiser $D_\sigma$ such that

v^{(k+1)} = D_\sigma\!\left(x^{(k+1)} + u^{(k)}\right),    (7)

where $\sigma = \sqrt{\lambda/\rho}$ is the denoising strength or a hypothesized "noise level". The choice of $D_\sigma$ is broad: $D_\sigma$ can be a proximal map such as total variation denoising, or an off-the-shelf image denoiser such as BM3D, non-local means, or, more recently, a neural network.
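To make the algorithmic structure concrete, the following is a minimal numerical sketch of the loop (4), (7), (6) for a quadratic data term $f(x) = \frac{1}{2}\|Ax - y\|^2$ and a generic plug-in denoiser. The quadratic choice of $f$, the function names, and the setting $\lambda = 1$ are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def pnp_admm(A, y, denoise, rho=1.0, num_iters=50):
    """Minimal PnP ADMM sketch for f(x) = 0.5*||Ax - y||^2, following Eqs. (4)-(7).

    A       : (m, n) forward matrix
    y       : (m,) measurements
    denoise : callable z, sigma -> denoised z, playing the role of D_sigma
    rho     : ADMM internal parameter
    """
    n = A.shape[1]
    x, v, u = np.zeros(n), np.zeros(n), np.zeros(n)
    H = A.T @ A + rho * np.eye(n)          # normal equations of the x-subproblem
    Aty = A.T @ y
    sigma = np.sqrt(1.0 / rho)             # sigma = sqrt(lambda/rho), assuming lambda = 1
    for _ in range(num_iters):
        x = np.linalg.solve(H, Aty + rho * (v - u))   # Eq. (4)
        v = denoise(x + u, sigma)                     # Eq. (7), replacing Eq. (5)
        u = u + (x - v)                               # Eq. (6)
    return v
```

Any denoiser with this signature can be plugged in; for example, `denoise = lambda z, sigma: W @ z` plugs in the graph filter studied in the remainder of the paper.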

Replacing the explicit optimization in (5) by an off-the-shelf denoiser in (7) leads to many interesting questions:

  • Existence of $g$. Given an arbitrary denoiser $D_\sigma$, is it possible to have an equivalent $g$ such that (5) and (7) coincide? If not, under what conditions on $D_\sigma$ would this happen? Some recent studies provide an answer to this question [8, 16], which will be elaborated in later sections of this paper. An alternative approach is Regularization by Denoising (RED) [17], which has its own advantages and limitations.

  • Convergence. Does PnP ADMM converge? Clearly, convergence does not hold for arbitrary denoisers. For global convergence, the most basic requirement is that $D_\sigma$ has a symmetric gradient and is non-expansive [8]. For fixed-point convergence, $D_\sigma$ should be asymptotically invariant as $\sigma \to 0$ [9].

  • Performance. If $g$ does exist, what is it and why is it an effective regularization function? If $g$ does not exist, then what does PnP ADMM solve, and why is it good? The focus of this paper is to address this performance issue. We will take two paths, one through the MAP formulation, and the other through a concept called the consensus equilibrium.

  • Generalization. How can the algorithm be generalized to multiple denoisers? The generalization can be done using the consensus equilibrium technique. However, can we explain the performance, and what would be the optimal combination weights?

The main assumption we make in this paper is that the denoisers are in the family of symmetric smoothing filters [18, 19, 20] or graph filters [21, 22, 23, 24, 25, 26, 27]. These filters are simple in the sense that they are (pseudo-)linear and hence can be expressed using matrices. However, they are also representative enough to cover a sufficiently large group of the image denoisers encountered in the literature.

The remainder of the paper is organized as follows. We begin with a brief review of the construction of graph filters and their properties in Section II. The main results are presented in Sections III and IV: Section III analyzes the problem from the MAP perspective, whereas Section IV approaches the problem from the consensus equilibrium perspective. Further discussion of open problems is presented in Section V.

II Graph Filters

II-A Constructing a Graph Filter

Graph filters are a class of image denoisers studied mostly in the field of graph signal processing. Their applications are very broad, e.g., deblurring using the graph Laplacian [24], boosting image denoisers [22], image denoising using global similarity [25], JPEG compression using random walks [28], and blind deconvolution using reweighted graph total variation [27]. On the theory side, there are studies of graphs in the continuous domain [26], sampling of graphs [29], and graph filter banks [30], to name a few.

Represented as matrices, graph filters take the form

\widehat{x} = W y,    (8)

where $y$ is the noisy image and $\widehat{x}$ is the denoised image. The matrix $W$ is the filter. There are multiple ways of constructing $W$. One of the most commonly used approaches is to define a non-local weight kernel $K$:

K_{ij} = \exp\left\{-\frac{\|p_i - p_j\|^2}{2h^2}\right\},    (9)

where $p_i$ is a $d$-dimensional patch centered at pixel $i$, and $h$ is a parameter characterizing the denoising strength of the filter. Modifications to (9) are common, e.g., incorporating the spatial distance (which gives the non-local means filter), restricting $p_i$ to pixels instead of patches (which gives the bilateral filter), or extending the norm to a weighted norm (which gives the kernel regression filter). For more examples of kernel matrices, we refer the readers to [18].

The matrix $K$ defined in (9) is symmetric, but its row sums are not 1. Thus, $K$ cannot be used directly as a denoising filter because it amplifies or attenuates signals. To normalize the matrix while preserving the symmetry, the most popular method is the Sinkhorn-Knopp balancing algorithm [31, 19, 20], which iteratively normalizes the rows and columns of $K$ until convergence. When $K$ is symmetrized in this way, the resulting matrix is called a symmetric smoothing filter, given by

W = D^{-1/2} K D^{-1/2},    (10)

where the diagonal matrix $D$ is defined such that $W\mathbf{1} = \mathbf{1}$ and $W^T\mathbf{1} = \mathbf{1}$. Since the columns and rows of $W$ sum to 1, $W$ is a doubly stochastic matrix.
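As an illustration, the following sketch builds the kernel (9) for a 1D signal and balances it with Sinkhorn-Knopp to obtain a symmetric smoothing filter. The patch size, bandwidth $h$, test signal, and function names are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def nlm_kernel(y, patch_radius=2, h=0.5):
    """Non-local weight kernel of Eq. (9) for a 1D signal y:
    K_ij = exp(-||p_i - p_j||^2 / (2 h^2)), with p_i the patch centered at i."""
    n = len(y)
    yp = np.pad(y, patch_radius, mode='edge')
    P = np.stack([yp[i:i + 2 * patch_radius + 1] for i in range(n)])   # all patches
    d2 = ((P[:, None, :] - P[None, :, :])**2).sum(axis=2)              # pairwise patch distances
    return np.exp(-d2 / (2 * h**2))

def sinkhorn_balance(K, iters=200):
    """Sinkhorn-Knopp balancing: alternately normalize rows and columns so the matrix
    becomes (approximately) doubly stochastic. For a symmetric kernel the limit is the
    symmetric smoothing filter W of Eq. (10)."""
    W = K.copy()
    for _ in range(iters):
        W = W / W.sum(axis=1, keepdims=True)
        W = W / W.sum(axis=0, keepdims=True)
    return 0.5 * (W + W.T)

# Example: denoise a toy 1D signal with the resulting graph filter, as in Eq. (8).
rng = np.random.default_rng(0)
x_true = np.sin(np.linspace(0, 4 * np.pi, 200))
y = x_true + 0.2 * rng.standard_normal(200)
W = sinkhorn_balance(nlm_kernel(y))
x_hat = W @ y
print(np.mean((y - x_true)**2), np.mean((x_hat - x_true)**2))
```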

II-B Properties of Graph Filters

There are a few interesting properties of $W$.

  • $W$ can be considered as an undirected graph $\mathcal{G}$: $W_{ij}$ is the weight on the edge linking node $i$ and node $j$.

  • $0 \preceq W \preceq I$. The eigenvalues are bounded above by 1 because $W$ is doubly stochastic, and they are non-negative because the kernel $K$ in (9) is positive semi-definite.

  • $\mathcal{L} = I - W$ is the graph Laplacian, with the zero eigenvalue associated to the constant vector $\frac{1}{\sqrt{n}}\mathbf{1}$.

  • The regularization defined through the graph Laplacian can be interpreted as

    x^T\mathcal{L}x = x^T(I - W)x = \frac{1}{2}\sum_{i,j} W_{ij}(x_i - x_j)^2,    (11)

    which measures the smoothness of the signal $x$ with the weights defined by the graph $W$ (a numerical check of this identity is given after this list).

  • The symmetric matrix $W$ has better denoising performance than its non-symmetric counterpart $D^{-1}K$, where $D_{ii} = \sum_j K_{ij}$. This is attributed to the implicit clustering of the pixels performed by the Sinkhorn-Knopp balancing algorithm [20], among other reasons such as a reduced degree of freedom, so that the drop in variance overrides the gain in bias [19].
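The spectral bounds and the smoothness identity (11) are easy to verify numerically. The toy kernel below (a Gaussian kernel on random scalar values, with an illustrative bandwidth) is a stand-in for the patch-based kernel of (9).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
z = rng.standard_normal(n)
K = np.exp(-(z[:, None] - z[None, :])**2 / (2 * 2.0**2))   # positive-definite Gaussian kernel

for _ in range(2000):                                 # Sinkhorn-Knopp balancing
    K = K / K.sum(axis=1, keepdims=True)
    K = K / K.sum(axis=0, keepdims=True)
W = 0.5 * (K + K.T)                                   # symmetric smoothing filter, Eq. (10)

eig = np.linalg.eigvalsh(W)
print(np.allclose(W.sum(axis=1), 1.0))                # doubly stochastic (rows sum to 1)
print(eig.min() >= -1e-8, eig.max() <= 1 + 1e-8)      # eigenvalues lie in [0, 1]

x = rng.standard_normal(n)
lhs = x @ (np.eye(n) - W) @ x                         # x^T (I - W) x
rhs = 0.5 * np.sum(W * (x[:, None] - x[None, :])**2)  # (1/2) sum_ij W_ij (x_i - x_j)^2
print(np.isclose(lhs, rhs))                           # the identity in Eq. (11)
```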

In order to present the theoretical results of this paper, we make an assumption about $W$:

Assumption 1.

We assume that the matrix $W$ is invertible, i.e., there exists a matrix $W^{-1}$ such that

W^{-1}W = WW^{-1} = I.    (12)

Assumption 1 is very strong and could be unrealistic for practical $W$'s. However, we need it to simplify some of the technical details in the derivation of the main results, at least in Section III. In Section IV, when we discuss the performance from the consensus equilibrium perspective, we will lift this assumption.

Remark: Pseudo-Linearity. An important remark concerns the pseudo-linearity of $W$ in the actual PnP ADMM implementation. That is, the matrix $W$ is a function of the noisy input, so the filter is $\widehat{x} = W(y)\,y$ instead of (8). However, as mentioned in [32], the dependency of $W$ on $y$ is typically small if $W$ is estimated from a pre-filtered version of $y$, for example by applying a baseline denoising algorithm to $y$ and constructing the weight matrix from the baseline estimate. Practically this is also justified in applications such as [8], where $W$ is updated for the first tens of iterations and then kept fixed for the remaining iterations. Following the same line of argument, we assume that $W$ is pre-defined and independent of $y$. Of particular interest is the oracle case where $W$ is estimated from the ground truth.

III Performance Analysis via MAP

In this section we present the first set of analytic results. We call it the maximum-a-posteriori (MAP) analysis because we need to explicitly derive the MAP optimization.

III-A Existence of $g$

Given a graph filter $D_\sigma(x) = Wx$ with $W$ defined in (10), the first question we should ask is whether it is possible to obtain a regularization function $g$ as defined in (5). The rationale is that if we are able to do so, then we can characterize the MAP optimization in (1), and hence understand the performance.

Consider the minimization

\widehat{v} = \underset{v}{\operatorname{argmin}}\;\; \lambda g(v) + \frac{\rho}{2}\|v - x\|^2,    (13)

the solution of which is characterized by

\widehat{v} = \left(I + \tfrac{\lambda}{\rho}\,\partial g\right)^{-1}(x),    (14)

where $\partial g$ is the sub-differential of $g$, and the inverse operator $\left(I + \tfrac{\lambda}{\rho}\partial g\right)^{-1}$ is the resolvent of $\partial g$ [15]. We assume that $W$ is an invertible matrix (Assumption 1). Therefore, $g$ must be continuously differentiable, and the subgradient at every point reduces to a unique gradient. This replaces $\partial g$ by $\nabla g$, and substituting $D_\sigma(x) = Wx$ into (14) yields

Wx = \left(I + \tfrac{\lambda}{\rho}\,\nabla g\right)^{-1}(x).    (15)

Define $G$ as the matrix representation of $\nabla g$ such that $\nabla g(x) = Gx$. Since (15) holds for all $x$, this further implies that the mappings are identical:

W = \left(I + \tfrac{\lambda}{\rho}\,G\right)^{-1}.    (16)

Applying $\left(I + \tfrac{\lambda}{\rho}G\right)$ to both sides yields $\left(I + \tfrac{\lambda}{\rho}G\right)W = I$, and hence

\tfrac{\lambda}{\rho}\,G\,W = I - W.    (17)

The identity in (17) offers a useful way of characterizing the existence of $g$.

Theorem 1 (Sreehari et al. [8], Theorem A.1, and Reehorst-Schniter [16], Theorem 1).

The mapping $g$ exists if and only if the Jacobian $\nabla D_\sigma$ is symmetric.

Proof.

The idea is to treat $D_\sigma$ as a vector field which must be conservative, i.e., the gradient field of some scalar function. Therefore, its Jacobian $\nabla D_\sigma$ must be symmetric, by a classic result in vector calculus [33] (in particular Theorems 4.3.8 and 4.3.10). ∎

For graph filters, the symmetry of $\nabla D_\sigma$ is guaranteed because $W$ is symmetric. In addition, $D_\sigma$ is non-expansive, since the eigenvalues of $W$ are bounded above by 1. Therefore, if we use a graph filter and if $g$ exists, then according to [8] the corresponding minimization is a proximal map, and $g$ is convex, closed and proper. However, an arbitrary denoiser does not necessarily have a symmetric Jacobian, and hence $g$ is not guaranteed to exist.

III-B Deriving $g$

We can now state the first key result of this paper.

Theorem 2.

Let $W$ be a graph filter with full rank so that the inverse $W^{-1}$ exists. If $g$ is defined by

g(x) = \frac{\rho}{2\lambda}\, x^T\left(W^{-1} - I\right)x,    (18)

then the solution of the minimization in (5) is

\widehat{v} = \underset{v}{\operatorname{argmin}}\;\; \lambda g(v) + \frac{\rho}{2}\|v - x\|^2 = Wx.    (19)

Proof.

The simplest proof is to substitute (18) into the optimization in (19) and check the first-order optimality condition. An alternative proof is to multiply both sides of (17) by $W^{-1}$ to obtain

G = \frac{\rho}{\lambda}\left(W^{-1} - I\right).    (20)

Then integrating $\nabla g(x) = Gx$ with respect to $x$ yields $g(x) = \frac{\rho}{2\lambda}x^T(W^{-1} - I)x$, which completes the proof. ∎
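The claim of Theorem 2 is easy to verify numerically: for a toy symmetric smoothing filter $W$, the proximal step (5) with $g$ as in (18) returns exactly $Wx$. The kernel construction, the values of $\lambda$ and $\rho$, and the random test vector below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, rho = 12, 0.5, 2.0

# Toy symmetric smoothing filter: Gaussian kernel on random scalar values,
# balanced with Sinkhorn-Knopp so that W is (approximately) doubly stochastic.
z = rng.standard_normal(n)
K = np.exp(-(z[:, None] - z[None, :])**2 / 2.0)
for _ in range(500):
    K = K / K.sum(axis=1, keepdims=True)
    K = K / K.sum(axis=0, keepdims=True)
W = 0.5 * (K + K.T)

# g(v) = (rho / (2*lam)) * v^T (W^{-1} - I) v, as in Eq. (18).
G = (rho / lam) * (np.linalg.inv(W) - np.eye(n))

# Proximal step of Eq. (5): minimize lam*g(v) + (rho/2)||v - x||^2.
# First-order condition: lam*G v + rho*(v - x) = 0.
x = rng.standard_normal(n)
v_hat = np.linalg.solve(lam * G + rho * np.eye(n), rho * x)

print(np.allclose(v_hat, W @ x))   # Eq. (19): the proximal map is the graph filter
```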

Can we replace $W^{-1}$ by the pseudo-inverse $W^{\dagger}$ if $W$ is not invertible? The answer is no. If we replace $W^{-1}$ by $W^{\dagger}$ in (18), the optimization (13) becomes

\widehat{v} = \underset{v}{\operatorname{argmin}}\;\; \frac{\rho}{2}v^T\left(W^{\dagger} - I\right)v + \frac{\rho}{2}\|v - x\|^2.    (21)

The first-order optimality condition of (21) is given by $W^{\dagger}\widehat{v} = x$, which does not imply $\widehat{v} = Wx$, because $WW^{\dagger} \neq I$ and $W^{\dagger}W \neq I$ unless $W$ has full rank. More specifically, $W^{\dagger}\widehat{v} = x$ requires $x$ to live in the range space of $W^{\dagger}$. Since $W$ contains zero eigenvalues, there are multiple $\widehat{v}$'s which can be mapped to the same $x$ through $W^{\dagger}$. This is problematic, because the optimizer of (21) is no longer $Wx$, and hence it violates our assumption that the denoiser is a graph filter.

III-C The Role of $\rho$

Knowing the explicit expression of $g$, we can now analyze its performance. But before we proceed to the discussion, we should clarify the role of the parameter $\rho$. By substituting (18) into (2), and recalling the constraint $x = v$, we can show that the original problem which PnP ADMM solves is

\underset{x}{\text{minimize}}\;\; f(x) + \frac{\rho}{2}\,x^T\left(W^{-1} - I\right)x.    (22)

There is an alternative optimization in the graph signal processing literature, known as graph Laplacian regularization, which we will study extensively in this paper:

\underset{x}{\text{minimize}}\;\; f(x) + \frac{\lambda}{2}\,x^T\left(I - W\right)x.    (23)

In (22), the original regularization parameter $\lambda$ is eliminated because $g$ contains its reciprocal. Thus, $\lambda$ is replaced by the ADMM internal parameter $\rho$. This internal parameter has no influence on the final solution in the classical ADMM, because ADMM converges for all $\rho > 0$ when $f$ and $g$ are closed, proper and convex. However, in PnP ADMM, the optimization cost itself is characterized by $\rho$. Thus, the solution changes as $\rho$ changes, a phenomenon reported in earlier papers, e.g., [9].

We can also ask: for the best $\rho$ and $\lambda$, how would the solutions of (22) and (23) behave? Figure 1 shows an experimental result where $f(x) = \frac{1}{2}\|x - y\|^2$. We vary $\rho$ (and, equivalently, $\lambda$) to evaluate the mean squared error (MSE) for a fixed oracle $W$. As shown in Figure 1, PnP ADMM and graph Laplacian have different operating regimes: PnP ADMM prefers a smaller $\rho$, whereas graph Laplacian prefers a larger $\lambda$. We will explain this observation in Section IV.

Fig. 1: Mean squared error of the solutions of (22) and (23) as $\rho$ (or $\lambda$) increases. PnP has a lower MSE than graph Laplacian for the best $\rho$ and $\lambda$.

III-D Comparison with Graph Laplacian

We can now compare the performance of PnP ADMM and the conventional graph Laplacian regularization. In order to do so, we consider a denoising problem with $f(x) = \frac{1}{2}\|x - y\|^2$, where $y = x^* + \eta$ and $\eta \sim \mathcal{N}(0, \sigma^2 I)$. That is, $y$ is a noisy version of the ground truth $x^*$. This choice of $f$ is also justified by the $v$-subproblem (5), which is itself a denoising problem.

We consider the following two optimizations:

\widehat{x}_{\mathrm{GL}} = \underset{x}{\operatorname{argmin}}\;\; \frac{1}{2}\|x - y\|^2 + \frac{\beta}{2}\,x^T(I - W)x,    (24)
\widehat{x}_{\mathrm{PnP}} = \underset{x}{\operatorname{argmin}}\;\; \frac{1}{2}\|x - y\|^2 + \frac{\beta}{2}\,x^T(W^{-1} - I)x.    (25)

Here, the first optimization (24) uses the graph Laplacian as the regularization, and the second optimization (25) uses the PnP regularization. We use a single parameter $\beta$ instead of $\lambda$ or $\rho$ to keep the notation simple. The solutions of the two optimizations are

\widehat{x}_{\mathrm{GL}} = \left(I + \beta(I - W)\right)^{-1}y \;\overset{\text{def}}{=}\; A_{\mathrm{GL}}\,y,    (26)
\widehat{x}_{\mathrm{PnP}} = \left(I + \beta(W^{-1} - I)\right)^{-1}y \;\overset{\text{def}}{=}\; A_{\mathrm{PnP}}\,y,    (27)

where we have defined the matrices $A_{\mathrm{GL}}$ and $A_{\mathrm{PnP}}$ to simplify notation.

Our goal is to analyze the mean squared error (MSE) of these two estimates. To this end, we first eigen-decompose $W$ as $W = U\Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the eigenvalue matrix. Define $\bar{x} = U^T x^*$ as the projection of the ground truth $x^*$ onto the columns of $U$. Then, the MSEs of $\widehat{x}_{\mathrm{GL}}$ and $\widehat{x}_{\mathrm{PnP}}$ can be found as follows.

Proposition 1.

Define $\widehat{x}_{\mathrm{GL}}$ and $\widehat{x}_{\mathrm{PnP}}$ via (26) and (27). Let $W = U\Lambda U^T$, where $U$ is the eigenvector matrix and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the eigenvalue matrix. Define $\bar{x} = U^T x^*$. Then, the MSEs of $\widehat{x}_{\mathrm{GL}}$ and $\widehat{x}_{\mathrm{PnP}}$ are respectively

\mathrm{MSE}_{\mathrm{GL}} = \sum_{i=1}^{n} \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2 + \sigma^2}{\left(1 + \beta(1-\lambda_i)\right)^2},    (28)
\mathrm{MSE}_{\mathrm{PnP}} = \sum_{i=1}^{n} \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2 + \sigma^2\lambda_i^2}{\left(\lambda_i + \beta(1-\lambda_i)\right)^2}.    (29)
Proof.

The MSE of $\widehat{x}_{\mathrm{GL}}$ is

\mathrm{MSE}_{\mathrm{GL}} = \mathbb{E}\left\|\widehat{x}_{\mathrm{GL}} - x^*\right\|^2 = \left\|\left(A_{\mathrm{GL}} - I\right)x^*\right\|^2 + \sigma^2\,\mathrm{tr}\!\left(A_{\mathrm{GL}}^T A_{\mathrm{GL}}\right)    (30)
= \sum_{i=1}^{n}\left[\frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2}{\left(1+\beta(1-\lambda_i)\right)^2} + \frac{\sigma^2}{\left(1+\beta(1-\lambda_i)\right)^2}\right],    (31)

where the second equality uses $A_{\mathrm{GL}} = U\left(I + \beta(I - \Lambda)\right)^{-1}U^T$. Similarly, the MSE of $\widehat{x}_{\mathrm{PnP}}$ is

\mathrm{MSE}_{\mathrm{PnP}} = \sum_{i=1}^{n}\left[\frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2} + \frac{\sigma^2\lambda_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2}\right],    (32)

which follows from $A_{\mathrm{PnP}} = U\Lambda\left(\Lambda + \beta(I - \Lambda)\right)^{-1}U^T$. ∎

Now, we can consider the $i$-th term of the sums:

\mathrm{MSE}_{\mathrm{GL},i} = \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2 + \sigma^2}{\left(1+\beta(1-\lambda_i)\right)^2},    (33)
\mathrm{MSE}_{\mathrm{PnP},i} = \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2 + \sigma^2\lambda_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2}.    (34)

It holds that, for $\lambda_i \to 0$,

\lim_{\lambda_i \to 0} \mathrm{MSE}_{\mathrm{GL},i} = \frac{\beta^2\bar{x}_i^2 + \sigma^2}{(1+\beta)^2},    (35)
\lim_{\lambda_i \to 0} \mathrm{MSE}_{\mathrm{PnP},i} = \bar{x}_i^2.    (36)

The limits in (35) and (36) provide important information about the performance. We make the following observations.

1. Energy Concentration by $W$. The graph filter $W$, when applied to a signal, projects the signal onto the space spanned by its columns. The effectiveness of this projection is determined by the quality of $W$. In the oracle setting, $W$ ensures that most of the energy is captured by the first few eigenvectors. This is illustrated in Figure 2, where we consider a 1D signal $x^*$ and its noisy version $y = x^* + \eta$. The weight matrix $W$ is constructed from $x^*$ using the non-local means kernel defined in (9). By projecting $x^*$ and $y$ onto the eigenvector matrix $U$, we see that $U^Tx^*$ has significantly fewer non-zeros than $U^Ty$ at large eigen-indices. Since $\bar{x}$ is the vector $U^Tx^*$, we have the following observation:

Observation 1: For large $i$, $\bar{x}_i$ is typically very small.

Fig. 2: Projecting a noisy signal $y$ and the ground truth signal $x^*$ onto the principal eigenvectors. The two curves show $U^Ty$ and $U^Tx^*$, where $U$ is the eigenvector matrix of $W$.

2. Eigenvalues at High Frequency. Based on Observation 1, an immediate conclusion from (35) and (36) is that, when $\bar{x}_i \approx 0$, the two limits become, respectively,

\lim_{\lambda_i \to 0}\mathrm{MSE}_{\mathrm{GL},i} = \frac{\sigma^2}{(1+\beta)^2} \quad\text{and}\quad \lim_{\lambda_i \to 0}\mathrm{MSE}_{\mathrm{PnP},i} = 0.

In other words, the PnP regularization ensures that the MSE goes to zero at high frequencies (i.e., large eigen-indices $i$), whereas the graph Laplacian has a residual MSE at high frequencies. This can also be seen from the eigenvalues of $A_{\mathrm{GL}}$ and $A_{\mathrm{PnP}}$:

\lambda_i(A_{\mathrm{GL}}) = \frac{1}{1 + \beta(1-\lambda_i)} \quad\text{and}\quad \lambda_i(A_{\mathrm{PnP}}) = \frac{\lambda_i}{\lambda_i + \beta(1-\lambda_i)},

where $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue of the matrix. If we plot $\lambda_i(A_{\mathrm{GL}})$ and $\lambda_i(A_{\mathrm{PnP}})$ as functions of the eigen-index $i$, we obtain the result shown in Figure 3, where we use a fixed $\beta$ and an oracle $W$. As $\lambda_i \to 0$, we have

\lambda_i(A_{\mathrm{GL}}) \to \frac{1}{1+\beta} \quad\text{and}\quad \lambda_i(A_{\mathrm{PnP}}) \to 0,

which explains why $\lambda_i(A_{\mathrm{GL}})$ reaches a steady state whereas $\lambda_i(A_{\mathrm{PnP}})$ reaches zero as $i$ grows.

Observation 2: As $i$ increases so that $\lambda_i \to 0$ and $\bar{x}_i \to 0$, the eigenvalue $\lambda_i(A_{\mathrm{GL}})$ approaches a non-zero constant, whereas $\lambda_i(A_{\mathrm{PnP}})$ approaches 0. The same observation applies to the MSEs.

Fig. 3: Comparison of $\lambda_i(A_{\mathrm{GL}})$ and $\lambda_i(A_{\mathrm{PnP}})$.
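The following short sketch evaluates the spectral responses and the per-mode MSEs (33)-(34) on a synthetic spectrum; the values of $\beta$ and $\sigma$, the eigenvalue profile, and the decay of $\bar{x}$ are illustrative assumptions chosen only to reproduce the qualitative behavior of Observation 2.

```python
import numpy as np

# Per-mode comparison of the spectral responses and the MSEs (33)-(34).
# beta, sigma, the spectrum of W, and x_bar are illustrative assumptions.
beta, sigma = 1.0, 0.1
lam = np.linspace(1.0, 1e-3, 100)              # eigenvalues of W, sorted in decreasing order
x_bar = np.exp(-10.0 * np.arange(100) / 100)   # energy concentrated at small i (Observation 1)

a_gl  = 1.0 / (1.0 + beta * (1.0 - lam))       # lambda_i(A_GL)
a_pnp = lam / (lam + beta * (1.0 - lam))       # lambda_i(A_PnP)

mse_gl  = ((beta * (1 - lam))**2 * x_bar**2 + sigma**2) / (1 + beta * (1 - lam))**2               # Eq. (33)
mse_pnp = ((beta * (1 - lam))**2 * x_bar**2 + sigma**2 * lam**2) / (lam + beta * (1 - lam))**2    # Eq. (34)

print(a_gl[-1], a_pnp[-1])       # A_GL flattens near 1/(1+beta); A_PnP tends to zero
print(mse_gl[-1], mse_pnp[-1])   # graph Laplacian keeps a noise residue; PnP MSE vanishes
```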

3. Trade-offs in Bias and Variance. The MSE is the sum of the squared bias and the variance, so we can decompose (33) and (34) and inspect the two terms:

\mathrm{Bias}_{\mathrm{GL},i}^2 = \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2}{\left(1+\beta(1-\lambda_i)\right)^2}, \qquad \mathrm{Var}_{\mathrm{GL},i} = \frac{\sigma^2}{\left(1+\beta(1-\lambda_i)\right)^2},
\mathrm{Bias}_{\mathrm{PnP},i}^2 = \frac{\beta^2(1-\lambda_i)^2\,\bar{x}_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2}, \qquad \mathrm{Var}_{\mathrm{PnP},i} = \frac{\sigma^2\lambda_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2}.

Since by construction the eigenvalues $\lambda_i$ are decreasing in $i$ and lie in $[0,1]$, we can show that $\mathrm{Var}_{\mathrm{PnP},i} \le \mathrm{Var}_{\mathrm{GL},i}$.

Proposition 2.

For any $\beta > 0$, $\lambda_i \in [0, 1]$ and $\sigma$, it holds that $\mathrm{Var}_{\mathrm{PnP},i} \le \mathrm{Var}_{\mathrm{GL},i}$ for all $i$.

Proof.

To prove this result we just need to show that

\lambda_i\left(1+\beta(1-\lambda_i)\right) = \lambda_i + \lambda_i\beta(1-\lambda_i) \le \lambda_i + \beta(1-\lambda_i),

where the inequality holds since $\lambda_i \le 1$. Squaring both sides and rearranging gives $\frac{\sigma^2\lambda_i^2}{\left(\lambda_i+\beta(1-\lambda_i)\right)^2} \le \frac{\sigma^2}{\left(1+\beta(1-\lambda_i)\right)^2}$, i.e., $\mathrm{Var}_{\mathrm{PnP},i} \le \mathrm{Var}_{\mathrm{GL},i}$. ∎

The conclusion of this proposition is that the variance of PnP is always lower than the variance of graph Laplacian. However, the bias of PnP is larger than the bias of graph Laplacian, as we show in Figure 4. More specifically, we can prove the following result:

Proposition 3.

For any $\beta > 0$, $\lambda_i \in [0, 1]$ and $\bar{x}_i$, it holds that $\mathrm{Bias}_{\mathrm{PnP},i} \ge \mathrm{Bias}_{\mathrm{GL},i}$ for all $i$.

Proof.

Since the numerators of $\mathrm{Bias}_{\mathrm{PnP},i}$ and $\mathrm{Bias}_{\mathrm{GL},i}$ are identical, we only need to compare the denominators. We have

\lambda_i + \beta(1-\lambda_i) \le 1 + \beta(1-\lambda_i)

if $\lambda_i \le 1$, which is true because $0 \le \lambda_i \le 1$. ∎

The reason why the overall MSE is lower for PnP is attributed to the observation below.

Observation 3: The drop in variance of PnP is more significant than the gain in bias, and hence PnP’s overall MSE is lower than that of graph Laplacian.

Fig. 4: Bias and variance of PnP and graph Laplacian.
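A quick numerical sanity check of Propositions 2 and 3, using the bias and variance expressions written above over randomly drawn, purely illustrative values of $\beta$, $\lambda_i$, $\sigma$ and $\bar{x}_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000                                   # number of random test cases (illustrative)
beta  = rng.uniform(0.01, 10.0, N)
lam   = rng.uniform(0.0, 1.0, N)              # eigenvalues of W lie in [0, 1]
sigma = rng.uniform(0.01, 2.0, N)
xbar  = rng.standard_normal(N)

den_gl  = 1.0 + beta * (1.0 - lam)
den_pnp = lam + beta * (1.0 - lam)

var_gl,  var_pnp  = sigma**2 / den_gl**2, sigma**2 * lam**2 / den_pnp**2
bias_gl, bias_pnp = np.abs(beta * (1 - lam) * xbar) / den_gl, np.abs(beta * (1 - lam) * xbar) / den_pnp

print(np.all(var_pnp <= var_gl + 1e-12))      # Proposition 2
print(np.all(bias_pnp + 1e-12 >= bias_gl))    # Proposition 3
```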

4. When is Graph Laplacian Better? The analysis above also suggests conditions under which PnP will perform worse than graph Laplacian. This happens when the high-frequency entries of $\bar{x}$ are large, because then the smaller denominator of $\mathrm{MSE}_{\mathrm{PnP},i}$ amplifies $\bar{x}_i^2$ more than that of $\mathrm{MSE}_{\mathrm{GL},i}$. This also suggests that PnP is more sensitive to the correctness of $W$ than graph Laplacian. That is, when constructing $W$ from a pre-filtered signal, PnP demands a higher quality of the pre-filtered signal.

To confirm this claim, we conduct an experiment by constructing a pre-filtered signal via $\widetilde{x} = x^* + \epsilon$, where $x^*$ is the true signal and $\epsilon \sim \mathcal{N}(0, \gamma^2 I)$. That is, the pre-filtered signal is a noisy version of the clean signal, with a perturbation specified by the noise level $\gamma$. We then construct a graph filter $W$ using $\widetilde{x}$; technically speaking, $W$ is now $W(\widetilde{x})$. Our goal is to see how $\mathrm{MSE}_{\mathrm{GL}}$ and $\mathrm{MSE}_{\mathrm{PnP}}$ change as $\gamma$ increases. (Footnote 1: An alternative experiment would be to pick a specific baseline denoising algorithm and adjust its internal parameters. However, different baseline denoising algorithms have different characteristics and this could bias the analysis. Thus, we choose to design the pre-filtered signal by adding i.i.d. noise.)

The result of this experiment is shown in Figure 5, where we observe that PnP has a considerably lower MSE than graph Laplacian when $\gamma$ is small. But as $\gamma$ increases, $\mathrm{MSE}_{\mathrm{PnP}}$ quickly increases and eventually exceeds $\mathrm{MSE}_{\mathrm{GL}}$. This explains the relatively weak robustness of PnP compared to graph Laplacian.

Fig. 5: The influence of the pre-filtered signal quality on the final MSE. In this plot, $\gamma$ determines the amount of perturbation in the pre-filtered signal. As $\gamma$ increases, PnP shows a more sensitive behavior than graph Laplacian.

However, we should emphasize that this conclusion is only valid when $W$ is fixed. In practical PnP ADMM, the weight matrix $W$ is updated in every ADMM iteration. If the estimation error of $W$ decreases as PnP iterates, then $W$ could become a good filter.

Observation 4: The performance of PnP could be worse than graph Laplacian when $W$ is estimated wrongly.
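A compact sketch of this experiment is given below, using a pixel-wise Gaussian kernel as a stand-in for (9) and a small identity blend to keep $W$ safely invertible. The test signal, kernel bandwidth, $\beta$, noise levels, and the blending constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, beta, sigma = 128, 0.3, 1.0, 0.2

t = np.linspace(0.0, 1.0, n)
x_true = np.sign(np.sin(2 * np.pi * 3 * t))            # piecewise-constant test signal
y = x_true + sigma * rng.standard_normal(n)            # noisy observation

def graph_filter(z, h):
    """Pixel-wise Gaussian kernel (Eq. (9) with single-pixel 'patches'),
    balanced with Sinkhorn-Knopp into a symmetric smoothing filter."""
    K = np.exp(-(z[:, None] - z[None, :])**2 / (2 * h**2))
    for _ in range(300):
        K = K / K.sum(axis=1, keepdims=True)
        K = K / K.sum(axis=0, keepdims=True)
    W = 0.5 * (K + K.T)
    return 0.9 * W + 0.1 * np.eye(len(z))              # blend with I so W stays invertible

I = np.eye(n)
for gamma in [0.0, 0.1, 0.5, 1.0]:                     # perturbation of the pre-filtered signal
    x_tilde = x_true + gamma * rng.standard_normal(n)  # pre-filtered signal (oracle if gamma = 0)
    W = graph_filter(x_tilde, h)
    x_gl  = np.linalg.solve(I + beta * (I - W), y)                  # Eq. (26)
    x_pnp = np.linalg.solve(I + beta * (np.linalg.inv(W) - I), y)   # Eq. (27)
    print(gamma, np.mean((x_gl - x_true)**2), np.mean((x_pnp - x_true)**2))
```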

5. Geometry: Truncating the Eigenvectors of $W$. If $W$ is an oracle, is there any simple geometric interpretation, besides the bias-variance analysis, that can explain the performance?

By expressing the eigen-decomposition of $W$ as

W = U_1\Lambda_1U_1^T + U_2\Lambda_2U_2^T,    (37)

where $\Lambda_2$ collects the (near-)zero eigenvalues, the PnP ADMM regularization in (25) can be written as

\frac{\beta}{2}\,x^T(W^{-1} - I)x = \frac{\beta}{2}\,x^TU_1\left(\Lambda_1^{-1} - I\right)U_1^Tx + \frac{\beta}{2}\,x^TU_2\left(\Lambda_2^{-1} - I\right)U_2^Tx.    (38)

If the eigenvalues in $\Lambda_2$ approach zero, then the second term in (38) will dominate, as $\Lambda_2^{-1}$ tends to infinity. In this case, we must have $U_2^Tx = 0$ in order for the regularization to stay finite. Therefore, the overall optimization is now constrained:

\underset{x}{\text{minimize}}\;\; \frac{1}{2}\|x - y\|^2 + \frac{\beta}{2}\,x^TU_1\left(\Lambda_1^{-1} - I\right)U_1^Tx \quad \text{subject to} \quad U_2^Tx = 0.    (39)

The constraint $U_2^Tx = 0$ provides a simple geometric interpretation. While the graph Laplacian regularization searches over the entire column space, PnP ADMM only searches the subspace spanned by the eigenvectors of the non-zero eigenvalues. As we saw in Observation 1, this provides very strong denoising power when $W$ is the oracle. Referring to the decomposition in (37), the signal is spanned only by $U_1$, whereas the noise is spanned by both $U_1$ and $U_2$. By restricting $U_2^Tx = 0$, we ensure that any noise living in the span of $U_2$ is eliminated. See Figure 6 for an illustration. As a result, the denoising is significantly more effective than graph Laplacian.

Fig. 6: PnP ADMM decomposes the optimization variable into two subspaces: $\mathcal{R}(U_1)$, the subspace encoding both signal and noise, and $\mathcal{R}(U_2)$, the subspace encoding only the noise. By restricting $x$ such that $U_2^Tx = 0$, we force PnP ADMM to eliminate any noise living in $\mathcal{R}(U_2)$.

Observation 5: Geometrically, PnP ADMM restricts the search space so that the pure-noise components are forced to zero.
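The geometric picture can be reproduced with a synthetic spectral construction (assumed purely for illustration; the $W$ below is not built from an image): the PnP estimate leaves essentially no energy in the near-null subspace $U_2$, while the graph Laplacian estimate retains a noise residue there.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, beta, sigma = 50, 10, 1.0, 0.3

U, _ = np.linalg.qr(rng.standard_normal((n, n)))        # random orthonormal eigenvectors
lam = np.concatenate([np.linspace(1.0, 0.5, k),         # "signal" subspace U_1
                      np.full(n - k, 1e-4)])            # near-zero eigenvalues: subspace U_2
W = U @ np.diag(lam) @ U.T

x_true = U[:, :k] @ rng.standard_normal(k)              # signal lives in the span of U_1
y = x_true + sigma * rng.standard_normal(n)             # noise spreads over U_1 and U_2

I = np.eye(n)
x_gl  = np.linalg.solve(I + beta * (I - W), y)                  # Eq. (26)
x_pnp = np.linalg.solve(I + beta * (np.linalg.inv(W) - I), y)   # Eq. (27)

print(np.linalg.norm(U[:, k:].T @ x_gl))    # noticeable residue in the pure-noise subspace
print(np.linalg.norm(U[:, k:].T @ x_pnp))   # nearly zero: the constraint U_2^T x = 0 in Eq. (39)
```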

III-E Questions Not Answered

To summarize, the fundamental reason why PnP works better is that the graph filter effectively removes the unwanted noise from the observation before proceeding to additional processing. Thus, a stronger denoiser leads to a better PnP ADMM algorithm.

One major limitation of the above MAP-based analysis is that it tries to match $D_\sigma$ with a proximal map associated with a regularization function $g$. This is a particularly strong requirement, because many denoisers do not have symmetric Jacobians and hence there is no corresponding proximal map. It also requires $W^{-1}$ to exist, which is difficult to guarantee in practice; e.g., a uniform averaging operator does not have an inverse.

The question we ask is: if we are willing to give up the proximal-map requirement, is it possible to obtain another interpretation that can explain the performance for denoisers that do not have a corresponding $g$? This leads to the equilibrium-based analysis in the next section.

IV Performance Analysis via Equilibrium

The MAP optimization in (1) is not the only way to obtain the PnP ADMM solution. In [34], Buzzard et al. generalize PnP ADMM through a concept called consensus equilibrium, which allows us to characterize the solution of PnP ADMM by analyzing the equilibrium state of a dynamical system. In this section we discuss the PnP ADMM performance from this equilibrium perspective.

IV-A Equilibrium

The starting point of the equilibrium analysis is the PnP ADMM algorithm listed in (4)-(6). By defining a pair of operators $F$ and $G$ from (4) and (5) (respectively the data-fidelity update and the denoising update), we can show that the PnP ADMM solution $(x^\star, u^\star)$ satisfies

F(x^\star - u^\star) = x^\star,    (40)
G(x^\star + u^\star) = x^\star.    (41)

A solution $(x^\star, u^\star)$ satisfying (40) and (41) is called a consensus equilibrium solution. To make this more concrete, consider the following denoising example.

Proposition 4.

Consider the two operators $F$ and $G$:

F(z) = \underset{x}{\operatorname{argmin}}\;\; \frac{1}{2}\|x - y\|^2 + \frac{\beta}{2}\|x - z\|^2 = \frac{y + \beta z}{1+\beta}, \qquad G(z) = Wz,

where $\beta$ plays the role of the internal parameter $\rho$. The consensus equilibrium solution satisfies

x^\star = \left(I + \beta(W^{-1} - I)\right)^{-1}y,    (42)

and $u^\star = (W^{-1} - I)\,x^\star$.

Proof.

Since $G(x^\star + u^\star) = x^\star$, it holds that $W(x^\star + u^\star) = x^\star$, which implies that

u^\star = (W^{-1} - I)\,x^\star.

The optimality of $F$ gives

F(x^\star - u^\star) = \frac{y + \beta(x^\star - u^\star)}{1+\beta} = x^\star.

Substituting $u^\star = (W^{-1} - I)x^\star$ into this equation gives

y = x^\star + \beta(W^{-1} - I)x^\star = \left(I + \beta(W^{-1} - I)\right)x^\star.

Rearranging the terms yields the desired result. ∎

The interesting fact about Proposition 4 is that, given a candidate $(x^\star, u^\star)$, we can easily check whether it is a consensus equilibrium point without running the PnP ADMM algorithm. In other words, we can "create" an alternative optimization such that its solution is an equilibrium point, and then interpret the meaning of that optimization to gain insight. To illustrate this point, let us continue with the example in Proposition 4.
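For instance, the closed-form point of Proposition 4 can be checked directly against the equilibrium conditions (40)-(41). The toy filter, $y$, and $\beta$ below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 8, 0.7

# Toy symmetric smoothing filter (Gaussian kernel + Sinkhorn balancing).
z = rng.standard_normal(n)
K = np.exp(-(z[:, None] - z[None, :])**2 / 2.0)
for _ in range(300):
    K = K / K.sum(axis=1, keepdims=True)
    K = K / K.sum(axis=0, keepdims=True)
W = 0.5 * (K + K.T)

y = rng.standard_normal(n)
I = np.eye(n)

x_star = np.linalg.solve(I + beta * (np.linalg.inv(W) - I), y)   # Eq. (42)
u_star = (np.linalg.inv(W) - I) @ x_star

F = lambda v: (y + beta * v) / (1 + beta)    # data-fidelity update of Proposition 4
G = lambda v: W @ v                          # the graph filter

print(np.allclose(F(x_star - u_star), x_star))   # Eq. (40)
print(np.allclose(G(x_star + u_star), x_star))   # Eq. (41)
```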

Proposition 5.

Define the functions $f_1$, $f_2$, $f_3$ and $f_4$:

f_1(x) = \frac{1}{2}\|x - y\|^2 + \frac{\beta}{2}\,x^T(W^{-1} - I)x,    (43)
f_2(x) = \frac{1}{2}\|x - Wy\|^2 - \frac{1-\beta}{2}\,x^T(I - W)x,    (44)
f_3(x) = \frac{1}{2}(x - y)^TW(x - y) + \frac{\beta}{2}\,x^T(I - W)x,    (45)
f_4(x) = \frac{1}{2}\left\|x - \left(I + \beta(W^{-1} - I)\right)^{-1}y\right\|^2.    (46)

The minimizers of these four functions all satisfy the consensus equilibrium condition (42).

Proof.

Recalling (27), the minimizer of $f_1$ is

x^\star = \left(I + \beta(W^{-1} - I)\right)^{-1}y,    (47)

which satisfies (42). Rearranging the terms of the first-order optimality condition $\left(I + \beta(W^{-1} - I)\right)x^\star = y$, we obtain

\left(\beta I + (1-\beta)W\right)x^\star = Wy,

which is exactly the first-order optimality condition of $f_2$, so $x^\star$ is also the minimizer of $f_2$. The minimizers of $f_3$ and $f_4$ can be derived in a similar manner. ∎

Proposition 5 shows four different optimization formulations that give the same consensus equilibrium solution. The first optimization, $f_1$, is the MAP formulation we saw in Section III. The other optimizations, $f_2$, $f_3$ and $f_4$, have their own meaningful interpretations.

Interpretation of $f_2$. The optimization of $f_2$ is informative in the sense that it completely eliminates the inverse matrix $W^{-1}$, as opposed to the MAP in (25). The objective says that, instead of trying to minimize the residue between the optimization variable $x$ and the noisy observation $y$, we match $x$ with the filtered version $Wy$. For oracle $W$'s, where the pure noise components are isolated at high frequencies (see Observation 5), $Wy$ ensures that the pure noise components are eliminated. Therefore, when minimizing $f_2$, we are guaranteed to only search for the signal component.

The regularization term $-\frac{1-\beta}{2}x^T(I - W)x$ is a concave function for any $\beta < 1$, because $I - W$ is positive semi-definite. This concave term injects high-frequency components that are lost by $Wy$. In the extreme case where $\beta = 1$, we have

f_2(x) = \frac{1}{2}\|x - Wy\|^2,

and so the minimizer is just $Wy$. When $\beta = 0$, we have

f_2(x) = \frac{1}{2}\|x - Wy\|^2 - \frac{1}{2}x^T(I - W)x,

and so the minimizer is $y$. In order to control the amount of high-frequency injection, a small $\beta$ is preferred over a large $\beta$. This explains why, in Figure 1, a smaller $\rho$ works better for PnP ADMM.

Interpretation of $f_3$. The function $f_3$ offers an interpretation similar to $f_2$: the data-fidelity term $(x - y)^TW(x - y)$ places no penalty on the subspace spanned by the eigenvectors of the zero eigenvalues of $W$, and so the regularization forces those dimensions to zero.

$f_3$ is also a special case of Kheradmand and Milanfar [24], who suggested a family of objectives combining a $W$-weighted data-fidelity term with a graph Laplacian regularizer, parametrized by a constant; for a particular choice of that constant, we obtain $f_3$.

Interpretation of $f_4$. The last example, $f_4$, while still giving the consensus equilibrium solution, is more or less useless in terms of interpretation, because it is just a re-statement of the solution.

In summary, Proposition 5 shows that the solution of the PnP ADMM is completely characterized by the consensus equilibrium. Solving the MAP optimization is just one of the many possible ways of achieving the equilibrium. If we depart from the MAP formulation, i.e., if we are willing to change the forward modeling by integrating the denoiser into the forward model, then there is a wide range of optimization formulations that can offer meaningful interpretations while satisfying the equilibrium condition.
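A direct numerical check that the four objectives, as written in (43)-(46) above, share the same minimizer; the toy filter, $y$, and $\beta$ are illustrative, and each quadratic is minimized by solving its normal equations.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 8, 0.6

# Toy symmetric smoothing filter (Gaussian kernel + Sinkhorn balancing).
v = rng.standard_normal(n)
K = np.exp(-(v[:, None] - v[None, :])**2 / 2.0)
for _ in range(300):
    K = K / K.sum(axis=1, keepdims=True)
    K = K / K.sum(axis=0, keepdims=True)
W = 0.5 * (K + K.T)

Winv, I = np.linalg.inv(W), np.eye(n)
y = rng.standard_normal(n)

x1 = np.linalg.solve(I + beta * (Winv - I), y)           # minimizer of f_1 (the MAP form)
x2 = np.linalg.solve(I - (1 - beta) * (I - W), W @ y)    # minimizer of f_2
x3 = np.linalg.solve(W + beta * (I - W), W @ y)          # minimizer of f_3
x4 = np.linalg.solve(I + beta * (Winv - I), y)           # minimizer of f_4 (restates the solution)

print(np.allclose(x1, x2), np.allclose(x1, x3), np.allclose(x1, x4))
```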

IV-B General Linear Inverse Problems

The equilibrium-based approach also allows us to extend the analysis to general linear inverse problems, which could be difficult to study by writing out the MSE as in Section III. The general linear inverse problem using PnP ADMM with a graph filter requires two operators,

F(z) = \underset{x}{\operatorname{argmin}}\;\; \frac{1}{2}\|Ax - y\|^2 + \frac{\rho}{2}\|x - z\|^2 \qquad \text{and} \qquad G(z) = Wz,

of which the equilibrium solution satisfies

\left(A^TA + \rho\left(W^{-1} - I\right)\right)x^\star = A^Ty.    (48)

The condition in (48) can be written as

\left(WA^TAW + \rho\,W(I - W)\right)z^\star = WA^Ty, \qquad x^\star = Wz^\star,

which is also equivalent to

x^\star = W\widehat{z}, \qquad \widehat{z} = \underset{z}{\operatorname{argmin}}\;\; \frac{1}{2}\|AWz - y\|^2 + \frac{\rho}{2}\,z^TW(I - W)z.    (49)

While (49) may look unnatural at first glance, in the sparse signal processing literature it is reminiscent of the MAP-synthesis minimization discussed by Elad [35].

How do we interpret (49)? The inner minimization is the standard graph Laplacian minimization with $A$ changed to $AW$ in the objective (and with the Laplacian weighted by $W$). If $W$ has zero eigenvalues, then the inner objective leaves the coordinates of $z$ along those directions untouched. The outer step of multiplying the solution by $W$ then ensures that any noise left in those directions by the inner minimization is eliminated before creating the final solution.
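As a sanity check, one can run the PnP ADMM iterations (4), (7), (6) with $f(x) = \frac{1}{2}\|Ax - y\|^2$ and the linear denoiser $D(z) = Wz$, and verify that the fixed point satisfies the equilibrium condition (48). The matrices $A$, $y$, $W$, and the value of $\rho$ below are illustrative toy values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, rho = 12, 8, 0.5

# Toy symmetric smoothing filter (Gaussian kernel + Sinkhorn balancing).
z = rng.standard_normal(n)
K = np.exp(-(z[:, None] - z[None, :])**2 / 2.0)
for _ in range(300):
    K = K / K.sum(axis=1, keepdims=True)
    K = K / K.sum(axis=0, keepdims=True)
W = 0.5 * (K + K.T)

A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
I = np.eye(n)

x, v, u = np.zeros(n), np.zeros(n), np.zeros(n)
H = A.T @ A + rho * I
for _ in range(5000):
    x = np.linalg.solve(H, A.T @ y + rho * (v - u))   # Eq. (4)
    v = W @ (x + u)                                   # Eq. (7) with D = W
    u = u + (x - v)                                   # Eq. (6)

residual = (A.T @ A + rho * (np.linalg.inv(W) - I)) @ v - A.T @ y   # Eq. (48)
print(np.linalg.norm(residual))    # close to zero at the consensus equilibrium
```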

IV-C Performance of Multiple Priors

Beyond linear inverse problems, we can extend the analysis to PnP ADMM involving multiple priors. In this case, the MAP-based optimization is