# First order algorithms in variational image processing

Variational methods in imaging are nowadays developing into a quite universal and flexible tool, allowing for highly successful approaches to tasks like denoising, deblurring, inpainting, segmentation, super-resolution, disparity, and optical flow estimation. The overall structure of such approaches is of the form $D(Ku) + \alpha R(u) \to \min_u$, where the functional $D$ is a data fidelity term, also depending on some input data $f$ and measuring the deviation of $Ku$ from it, and $R$ is a regularization functional. Moreover, $K$ is an (often linear) forward operator modeling the dependence of the data on an underlying image, and $\alpha$ is a positive regularization parameter. While $D$ is often smooth and (strictly) convex, current practice almost exclusively uses nonsmooth regularization functionals. The majority of successful techniques use nonsmooth and convex functionals like the total variation and generalizations thereof, or $\ell_1$-norms of coefficients arising from scalar products with some frame system. The efficient solution of such variational problems in imaging demands appropriate algorithms. Taking into account their specific structure as a sum of two very different terms to be minimized, splitting algorithms are a quite canonical choice. Consequently, this field has revived interest in techniques like operator splittings and augmented Lagrangians. Here we provide an overview of methods currently developed, recent results, and some computational studies comparing different methods and illustrating their success in applications.


## 1 Introduction

Variational methods in imaging are nowadays developing into a quite universal and flexible tool, allowing for highly successful approaches to tasks like denoising, deblurring, inpainting, segmentation, super-resolution, disparity, and optical flow estimation. The overall structure of such approaches is of the form

$$D(Ku) + \alpha R(u) \to \min_u,$$

where the functional $D$ is a data fidelity term, also depending on some input data $f$ and measuring the deviation of $Ku$ from it, and $R$ is a regularization functional. Moreover, $K$ is an (often linear) forward operator modeling the dependence of the data on an underlying image $u$, and $\alpha$ is a positive regularization parameter. While $D$ is often smooth and (strictly) convex, current practice almost exclusively uses nonsmooth regularization functionals. The majority of successful techniques use nonsmooth and convex functionals like the total variation and generalizations thereof, cf. [28, 31, 40], or $\ell_1$-norms of coefficients arising from scalar products with some frame system, cf. [73] and references therein.

The efficient solution of such variational problems in imaging demands appropriate algorithms. Taking into account their specific structure as a sum of two very different terms to be minimized, splitting algorithms are a quite canonical choice. Consequently, this field has revived interest in techniques like operator splittings and augmented Lagrangians. In this chapter we provide an overview of methods currently developed and recent results, as well as some computational studies comparing different methods and illustrating their success in applications.

We start with a very general viewpoint in the first sections, discussing basic notation, properties of proximal maps, and firmly nonexpansive and averaged operators, which form the basis of further convergence arguments. Then we proceed to a discussion of several state-of-the-art algorithms and their (theoretical) convergence properties. After a section discussing issues related to the use of analogous iterative schemes for ill-posed problems, we present some practical convergence studies in numerical examples related to PET and spectral CT reconstruction.

## 2 Notation

In the following we summarize the notation and definitions that will be used throughout this chapter:

• $x_+ := \max\{x, 0\}$ for $x \in \mathbb{R}^d$, whereby the maximum operation has to be interpreted componentwise.

• $\iota_C$ is the indicator function of a set $C \subseteq \mathbb{R}^d$ given by

$$\iota_C(x) := \begin{cases} 0 & \text{if } x \in C, \\ +\infty & \text{otherwise.} \end{cases}$$
• $\Gamma_0(\mathbb{R}^d)$ denotes the set of proper, convex, and lower semi-continuous functions mapping from $\mathbb{R}^d$ into the extended real numbers $(-\infty, +\infty]$.

• $\operatorname{dom} f := \{x \in \mathbb{R}^d : f(x) < +\infty\}$ denotes the effective domain of $f$.

• $\partial f(x_0) := \{p \in \mathbb{R}^d : f(x) - f(x_0) \ge \langle p, x - x_0 \rangle \ \forall x \in \mathbb{R}^d\}$ denotes the subdifferential of $f$ at $x_0$, the set consisting of the subgradients of $f$ at $x_0$. If $f$ is differentiable at $x_0$, then $\partial f(x_0) = \{\nabla f(x_0)\}$. Conversely, if $\partial f(x_0)$ contains only one element, then $f$ is differentiable at $x_0$ and this element is just the gradient of $f$ at $x_0$. By Fermat's rule, $\hat x$ is a global minimizer of $f$ if and only if

$$0 \in \partial f(\hat x).$$
• $f^*(x^*) := \sup_{x \in \mathbb{R}^d} \{\langle x^*, x \rangle - f(x)\}$ is the (Fenchel) conjugate of $f$. For proper $f$, we have $f^{**} = f$ if and only if $f \in \Gamma_0(\mathbb{R}^d)$. If $f$ is positively homogeneous, i.e., $f(\alpha x) = \alpha f(x)$ for all $\alpha > 0$, then

$$f^*(x^*) = \iota_{C_f}(x^*), \qquad C_f := \{x^* \in \mathbb{R}^d : \langle x^*, x \rangle \le f(x) \ \forall x \in \mathbb{R}^d\}.$$

In particular, the conjugate functions of the $\ell_p$-norms, $p \in [1, +\infty]$, are given by

$$\|\cdot\|_p^*(x^*) = \iota_{B_q(1)}(x^*) \tag{1}$$

where $\frac{1}{p} + \frac{1}{q} = 1$ and, as usual, $p = 1$ corresponds to $q = \infty$ and conversely, and $B_q(\lambda) := \{x \in \mathbb{R}^d : \|x\|_q \le \lambda\}$ denotes the ball of radius $\lambda > 0$ with respect to the $\ell_q$-norm.

## 3 Proximal Operator

The algorithms proposed in this chapter to solve various problems in digital image analysis and restoration have in common that they basically reduce to the evaluation of a series of proximal problems. Therefore we start with this kind of problem. For a comprehensive overview of proximal algorithms we refer to [132].

### 3.1 Definition and Basic Properties

For $f \in \Gamma_0(\mathbb{R}^d)$ and $\lambda > 0$, the proximal operator $\operatorname{prox}_{\lambda f} : \mathbb{R}^d \to \mathbb{R}^d$ of $\lambda f$ is defined by

$$\operatorname{prox}_{\lambda f}(x) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \left\{ \frac{1}{2\lambda} \|x - y\|_2^2 + f(y) \right\}. \tag{2}$$

It compromises between minimizing $f$ and staying near to $x$, where $\lambda$ is the trade-off parameter between these terms. The Moreau envelope or Moreau-Yosida regularization ${}^{\lambda}f$ is given by

$${}^{\lambda}f(x) := \min_{y \in \mathbb{R}^d} \left\{ \frac{1}{2\lambda} \|x - y\|_2^2 + f(y) \right\}.$$

A straightforward calculation shows that ${}^{\lambda}f = \bigl(f^* + \tfrac{\lambda}{2}\|\cdot\|_2^2\bigr)^*$. The following theorem ensures that the minimizer in (2) exists, is unique, and can be characterized by a variational inequality. The Moreau envelope can be considered as a smooth approximation of $f$. For the proof we refer to [8].

###### Theorem 3.1.

Let $f \in \Gamma_0(\mathbb{R}^d)$. Then,

• For any $x \in \mathbb{R}^d$, there exists a unique minimizer $\hat x := \operatorname{prox}_{\lambda f}(x)$ of (2).

• The variational inequality

$$\frac{1}{\lambda} \langle x - \hat x, y - \hat x \rangle + f(\hat x) - f(y) \le 0 \qquad \forall y \in \mathbb{R}^d \tag{3}$$

is necessary and sufficient for $\hat x$ to be the minimizer of (2).

• $\hat x$ is a minimizer of $f$ if and only if it is a fixed point of $\operatorname{prox}_{\lambda f}$, i.e.,

$$\hat x = \operatorname{prox}_{\lambda f}(\hat x).$$
• The Moreau envelope ${}^{\lambda}f$ is continuously differentiable with gradient

$$\nabla\bigl({}^{\lambda}f\bigr)(x) = \frac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr). \tag{4}$$
• The sets of minimizers of $f$ and ${}^{\lambda}f$ are the same.

Rewriting iv) as $\operatorname{prox}_{\lambda f}(x) = x - \lambda \nabla({}^{\lambda}f)(x)$, we can interpret $\operatorname{prox}_{\lambda f}(x)$ as a gradient descent step with step size $\lambda$ for minimizing ${}^{\lambda}f$.

###### Example 3.2.

Consider the univariate function $f(y) := |y|$ and

$$\operatorname{prox}_{\lambda f}(x) = \operatorname*{argmin}_{y \in \mathbb{R}} \left\{ \frac{1}{2\lambda}(x - y)^2 + |y| \right\}.$$

Then a straightforward computation yields that $\operatorname{prox}_{\lambda f}$ is the soft-shrinkage function $S_\lambda$ with threshold $\lambda$ (see Fig. 1) defined by

$$S_\lambda(x) := (x - \lambda)_+ - (-x - \lambda)_+ = \begin{cases} x - \lambda & \text{for } x > \lambda, \\ 0 & \text{for } x \in [-\lambda, \lambda], \\ x + \lambda & \text{for } x < -\lambda. \end{cases}$$

Setting $\hat x := S_\lambda(x) = \operatorname{prox}_{\lambda f}(x)$, we get

$${}^{\lambda}f(x) = \begin{cases} \frac{1}{2\lambda} x^2 & \text{for } x \in [-\lambda, \lambda], \\ |x| - \frac{\lambda}{2} & \text{otherwise.} \end{cases}$$

This function is known as the Huber function (see Fig. 1).
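The soft-shrinkage formula and its Moreau envelope are easy to check numerically. The following sketch (plain Python; the function names are ours) evaluates the envelope of $f(y) = |y|$ by brute-force minimization over a fine grid and compares it with the closed-form Huber function:

```python
def soft_shrink(x, lam):
    """Soft-shrinkage S_lambda(x) = (x - lam)_+ - (-x - lam)_+, the prox of lam*|.|."""
    return max(x - lam, 0.0) - max(-x - lam, 0.0)

def moreau_envelope(x, lam, n=200001, span=10.0):
    """Brute-force Moreau envelope of f(y) = |y|: min_y (x - y)^2/(2*lam) + |y|."""
    ys = (-span + 2 * span * k / (n - 1) for k in range(n))
    return min((x - y) ** 2 / (2 * lam) + abs(y) for y in ys)

def huber(x, lam):
    """Closed form of the envelope: x^2/(2*lam) if |x| <= lam, else |x| - lam/2."""
    return x * x / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2

# the brute-force envelope matches the Huber function up to grid error
assert abs(moreau_envelope(2.0, 0.5) - huber(2.0, 0.5)) < 1e-3
assert abs(moreau_envelope(0.2, 0.5) - huber(0.2, 0.5)) < 1e-3
```

The minimizer of the inner problem is exactly $S_\lambda(x)$, which is why the grid search reproduces the Huber values.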

###### Theorem 3.3 (Moreau decomposition).

For $f \in \Gamma_0(\mathbb{R}^d)$ the following decomposition holds:

$$\operatorname{prox}_f(x) + \operatorname{prox}_{f^*}(x) = x, \qquad {}^{1}f(x) + {}^{1}(f^*)(x) = \tfrac{1}{2}\|x\|_2^2.$$

For a proof we refer to [141, Theorem 31.5].
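For a univariate illustration (our own minimal sketch), take $f = |\cdot|$, whose conjugate is the indicator of $[-1, 1]$ by (1); the two proximal operators are then soft-shrinkage and clipping, and the decomposition can be verified pointwise:

```python
def prox_abs(x):
    """prox of f(y) = |y|: soft-shrinkage with threshold 1."""
    return max(x - 1.0, 0.0) - max(-x - 1.0, 0.0)

def prox_conj(x):
    """prox of f*(y) = indicator of [-1, 1]: projection (clipping) onto [-1, 1]."""
    return min(max(x, -1.0), 1.0)

# Moreau decomposition: prox_f(x) + prox_{f*}(x) = x
for x in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    assert abs(prox_abs(x) + prox_conj(x) - x) < 1e-12
```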

###### Remark 3.4 (Proximal operator and resolvent).

The subdifferential operator $\partial f : \mathbb{R}^d \to 2^{\mathbb{R}^d}$ is a set-valued function. For $f \in \Gamma_0(\mathbb{R}^d)$, we have by Fermat's rule and subdifferential calculus that $\hat x = \operatorname{prox}_{\lambda f}(x)$ if and only if

$$0 \in \hat x - x + \lambda \partial f(\hat x), \qquad x \in (I + \lambda \partial f)(\hat x),$$

which implies, by the uniqueness of the proximum, that $\hat x = (I + \lambda \partial f)^{-1}(x)$. In particular, $J_{\lambda \partial f} := (I + \lambda \partial f)^{-1}$ is a single-valued operator, which is called the resolvent of the set-valued operator $\lambda \partial f$. In summary, the proximal operator of $\lambda f$ coincides with the resolvent of $\lambda \partial f$, i.e.,

$$\operatorname{prox}_{\lambda f} = J_{\lambda \partial f}.$$

The proximal operator (2) and the proximal algorithms described in Section 5 can be generalized by introducing a symmetric, positive definite matrix $Q \in \mathbb{R}^{d,d}$ as follows:

$$\operatorname{prox}_{Q, \lambda f}(x) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \left\{ \frac{1}{2\lambda} \|x - y\|_Q^2 + f(y) \right\}, \tag{5}$$

where $\|x\|_Q^2 := \langle x, Qx \rangle$, see, e.g., [49, 54, 181].

### 3.2 Special Proximal Operators

Algorithms involving the solution of proximal problems are only efficient if the corresponding proximal operators can be evaluated in an efficient way. In the following we collect frequently appearing proximal mappings in image processing. For epigraphical projections see [12, 47, 88].

#### 3.2.1 Orthogonal Projections

The proximal operator generalizes the orthogonal projection operator. The orthogonal projection $\Pi_C(x)$ of $x \in \mathbb{R}^d$ onto a non-empty, closed, convex set $C \subseteq \mathbb{R}^d$ is given by

$$\Pi_C(x) := \operatorname*{argmin}_{y \in C} \|x - y\|_2$$

and can be rewritten for any $\lambda > 0$ as

$$\Pi_C(x) = \operatorname*{argmin}_{y \in \mathbb{R}^d} \left\{ \frac{1}{2\lambda}\|x - y\|_2^2 + \iota_C(y) \right\} = \operatorname{prox}_{\lambda \iota_C}(x).$$

Some special sets are considered next.

##### Affine set

$C := \{x \in \mathbb{R}^d : Ax = b\}$ with $A \in \mathbb{R}^{m,d}$, $b \in \mathbb{R}^m$.
In case of $\|x - y\|_2 \to \min_y$ subject to $Ay = b$ we substitute $z := x - y$, which leads to

$$\|z\|_2 \to \min_z \quad \text{subject to} \quad Az = r := Ax - b.$$

This can be directly solved, see [20], and leads after back-substitution to

$$\Pi_C(x) = x - A^{\dagger}(Ax - b),$$

where $A^{\dagger}$ denotes the Moore-Penrose inverse of $A$.

##### Halfspace

$C := \{x \in \mathbb{R}^d : a^{\mathrm T} x \le b\}$ with $a \in \mathbb{R}^d \setminus \{0\}$, $b \in \mathbb{R}$.
A straightforward computation gives

$$\Pi_C(x) = x - \frac{(a^{\mathrm T} x - b)_+}{\|a\|_2^2}\, a.$$
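This projection is a one-liner to implement; the sketch below (plain Python, function names ours) applies the formula and leaves points already inside the halfspace untouched:

```python
def project_halfspace(x, a, b):
    """Project x onto {y : <a, y> <= b} via x - (a.x - b)_+ / ||a||^2 * a."""
    ax = sum(ai * xi for ai, xi in zip(a, x))
    t = max(ax - b, 0.0) / sum(ai * ai for ai in a)
    return [xi - t * ai for xi, ai in zip(x, a)]

# a violating point is mapped onto the boundary hyperplane ...
assert project_halfspace([2.0, 0.0], [1.0, 0.0], 1.0) == [1.0, 0.0]
# ... while a feasible point is a fixed point of the projection
assert project_halfspace([0.5, 3.0], [1.0, 0.0], 1.0) == [0.5, 3.0]
```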
##### Box and Nonnegative Orthant

$C := [l_1, u_1] \times \cdots \times [l_d, u_d]$ with $l_k < u_k$, $k = 1, \ldots, d$.
The projection can be applied componentwise and gives

$$(\Pi_C(x))_k = \begin{cases} l_k & \text{if } x_k < l_k, \\ x_k & \text{if } x_k \in [l_k, u_k], \\ u_k & \text{if } x_k > u_k. \end{cases}$$

For $l_k = 0$ and $u_k = +\infty$, $k = 1, \ldots, d$, we get the orthogonal projection onto the non-negative orthant

$$\Pi_C(x) = x_+.$$
##### Probability Simplex

$C := \{x \in \mathbb{R}^d : \mathbf{1}^{\mathrm T} x = 1,\ x \ge 0\}$.
Here we have

$$\Pi_C(x) = (x - \mu \mathbf{1})_+,$$

where $\mu \in \mathbb{R}$ has to be determined such that $h(\mu) := \mathbf{1}^{\mathrm T}(x - \mu \mathbf{1})_+ = 1$. Now $\mu$ can be found, e.g., by bisection with starting interval $[\max_k x_k - 1, \max_k x_k]$, or by a method similar to those described in Subsection 3.2.2 for projections onto the $\ell_1$-ball. Note that $h(\mu)$ is a piecewise linear spline function with knots $x_1, \ldots, x_d$, so that $\mu$ is completely determined once we know the neighboring knot values of $\mu$.
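A minimal bisection implementation of this projection might look as follows (plain Python; the helper name `mass` is ours). The starting interval $[\max_k x_k - 1, \max_k x_k]$ brackets the root, since the left endpoint yields a mass of at least $1$ and the right endpoint yields mass $0$:

```python
def project_simplex(x, tol=1e-12):
    """Project x onto {y >= 0, sum(y) = 1} as (x - mu)_+, mu found by bisection."""
    def mass(mu):
        # h(mu) = 1^T (x - mu*1)_+, nonincreasing and piecewise linear in mu
        return sum(max(xi - mu, 0.0) for xi in x)
    lo, hi = max(x) - 1.0, max(x)   # mass(lo) >= 1, mass(hi) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return [max(xi - mu, 0.0) for xi in x]

y = project_simplex([0.8, 0.3, -0.5])
assert abs(sum(y) - 1.0) < 1e-9 and all(v >= 0 for v in y)
```

For this input the threshold is $\mu = 0.05$, giving the projection $(0.75, 0.25, 0)$.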

#### 3.2.2 Vector Norms

We consider the proximal operator of $f := \|\cdot\|_p$, $p \in [1, +\infty]$. By the Moreau decomposition in Theorem 3.3, regarding $(\lambda f)^* = \lambda f^*(\cdot / \lambda)$ and by (1), we obtain

$$\operatorname{prox}_{\lambda f}(x) = x - \operatorname{prox}_{(\lambda f)^*}(x) = x - \Pi_{B_q(\lambda)}(x),$$

where $\frac{1}{p} + \frac{1}{q} = 1$. Thus the proximal operator can be simply computed via the projection onto the $\ell_q$-ball $B_q(\lambda)$. In particular, it follows for $p \in \{1, 2, \infty\}$:

##### p=1,q=∞:

For $x \in \mathbb{R}^d$,

$$\operatorname{prox}_{\lambda \|\cdot\|_1}(x) = x - \Pi_{B_\infty(\lambda)}(x) = S_\lambda(x),$$

where $S_\lambda(x) := \bigl((x_k - \lambda)_+ - (-x_k - \lambda)_+\bigr)_{k=1}^d$ denotes the componentwise soft-shrinkage with threshold $\lambda$.

##### p=q=2:

$$\Pi_{B_2(\lambda)}(x) = \begin{cases} x & \text{if } \|x\|_2 \le \lambda, \\ \lambda \frac{x}{\|x\|_2} & \text{if } \|x\|_2 > \lambda, \end{cases} \qquad \operatorname{prox}_{\lambda \|\cdot\|_2}(x) = \begin{cases} 0 & \text{if } \|x\|_2 \le \lambda, \\ x \left(1 - \frac{\lambda}{\|x\|_2}\right) & \text{if } \|x\|_2 > \lambda. \end{cases}$$
##### p=∞,q=1:

$$\Pi_{B_1(\lambda)}(x) = \begin{cases} x & \text{if } \|x\|_1 \le \lambda, \\ S_\mu(x) & \text{if } \|x\|_1 > \lambda, \end{cases} \qquad \operatorname{prox}_{\lambda \|\cdot\|_\infty}(x) = \begin{cases} 0 & \text{if } \|x\|_1 \le \lambda, \\ x - S_\mu(x) & \text{if } \|x\|_1 > \lambda, \end{cases}$$

with $\mu := \frac{1}{\tilde m}\bigl(\sum_{k=1}^{\tilde m} x_{\pi(k)} - \lambda\bigr)$, where $x_{\pi(1)} \ge \cdots \ge x_{\pi(d)} \ge 0$ are the sorted absolute values of the components of $x$ and $\tilde m \le d$ is the largest index such that $x_{\pi(\tilde m)}$ is positive and $x_{\pi(\tilde m)} - \frac{1}{\tilde m}\bigl(\sum_{k=1}^{\tilde m} x_{\pi(k)} - \lambda\bigr) > 0$; see also [59, 62]. Another method follows similar lines as the projection onto the probability simplex in the previous subsection.
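The sorting-based variant can be sketched as follows (plain Python, names ours), finding the threshold $\mu$ from the sorted absolute values in the spirit of the references above:

```python
def project_l1_ball(x, radius):
    """Project x onto {y : ||y||_1 <= radius} by soft-thresholding with a
    data-dependent threshold mu computed from the sorted absolute values."""
    if sum(abs(xi) for xi in x) <= radius:
        return list(x)                      # already inside the ball
    s = sorted((abs(xi) for xi in x), reverse=True)
    cum, mu = 0.0, 0.0
    for j, sj in enumerate(s, start=1):
        cum += sj
        t = (cum - radius) / j
        if sj - t > 0:
            mu = t                          # largest index with positive residual
        else:
            break
    return [max(abs(xi) - mu, 0.0) * (1 if xi >= 0 else -1) for xi in x]

y = project_l1_ball([3.0, -1.0, 0.5], 2.0)
assert abs(sum(abs(v) for v in y) - 2.0) < 1e-12
```

For this input $\mu = 1$, so the projection is $(2, 0, 0)$, which indeed has $\ell_1$-norm $2$.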

Further, grouped/mixed $\ell_2$-$\ell_p$-norms are defined for $x = (x_1, \ldots, x_n) \in \mathbb{R}^{dn}$ with $x_j := (x_{jk})_{k=1}^d \in \mathbb{R}^d$, $j = 1, \ldots, n$, by

$$\|x\|_{2,p} := \bigl\| \left( \|x_j\|_2 \right)_{j=1}^n \bigr\|_p.$$

For the $\ell_2$-$\ell_1$-norm we see that

$$\operatorname{prox}_{\lambda \|\cdot\|_{2,1}}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^{dn}} \left\{ \frac{1}{2\lambda}\|x - y\|_2^2 + \|y\|_{2,1} \right\}$$

can be computed separately for each $x_j$, which results, by the above considerations for the $\ell_2$-norm, for each $j$ in

$$\bigl(\operatorname{prox}_{\lambda \|\cdot\|_{2,1}}(x)\bigr)_j = \begin{cases} 0 & \text{if } \|x_j\|_2 \le \lambda, \\ x_j \left(1 - \frac{\lambda}{\|x_j\|_2}\right) & \text{if } \|x_j\|_2 > \lambda. \end{cases}$$

The procedure for evaluating $\operatorname{prox}_{\lambda \|\cdot\|_{2,1}}$ is sometimes called coupled or grouped shrinkage.
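Grouped shrinkage is straightforward to implement; in the sketch below (plain Python, names ours) each group is either zeroed or radially shrunk:

```python
import math

def group_shrink(groups, lam):
    """prox of lam*||.||_{2,1}: block-wise soft-shrinkage of each group x_j."""
    out = []
    for g in groups:
        norm = math.sqrt(sum(v * v for v in g))
        if norm <= lam:
            out.append([0.0] * len(g))       # small groups are set to zero
        else:
            scale = 1.0 - lam / norm          # large groups keep their direction
            out.append([scale * v for v in g])
    return out

res = group_shrink([[3.0, 4.0], [0.3, 0.4]], 1.0)
# the first group (norm 5) is shrunk toward 0; the second (norm 0.5) is zeroed
assert res[1] == [0.0, 0.0]
assert abs(res[0][0] - 2.4) < 1e-12 and abs(res[0][1] - 3.2) < 1e-12
```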

Finally, we provide the following rule from [53, Prop. 3.6].

###### Lemma 3.5.

Let $f := g + \mu|\cdot|$, where $g \in \Gamma_0(\mathbb{R})$ is differentiable at $0$ with $g'(0) = 0$. Then $\operatorname{prox}_{\lambda f} = \operatorname{prox}_{\lambda g} \circ S_{\lambda\mu}$.

###### Example 3.6.

Consider the elastic net regularizer $f(x) := \frac{1}{2}\|x\|_2^2 + \mu\|x\|_1$, see [183]. Setting the gradient in the proximal operator of $g := \frac{1}{2}\|\cdot\|_2^2$ to zero, we obtain

$$\operatorname{prox}_{\lambda g}(x) = \frac{1}{1 + \lambda}\, x.$$

The whole proximal operator of $f$ can then be evaluated componentwise, and we see by Lemma 3.5 that

$$\operatorname{prox}_{\lambda f}(x) = \operatorname{prox}_{\lambda g}\bigl(S_{\lambda\mu}(x)\bigr) = \frac{1}{1 + \lambda}\, S_{\lambda\mu}(x).$$
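A direct implementation of this componentwise formula, together with a brute-force check of the underlying minimization problem (plain Python, names ours):

```python
def soft_shrink(x, t):
    """Soft-shrinkage with threshold t."""
    return max(x - t, 0.0) - max(-x - t, 0.0)

def prox_elastic_net(x, lam, mu):
    """prox of lam*f with f(y) = mu*|y| + y^2/2, via Lemma 3.5:
    soft-shrinkage with threshold lam*mu followed by scaling 1/(1 + lam)."""
    return soft_shrink(x, lam * mu) / (1.0 + lam)

def prox_brute(x, lam, mu, n=200001, span=5.0):
    """Brute-force argmin_y (x - y)^2/(2*lam) + mu*|y| + y^2/2 for comparison."""
    best, arg = float("inf"), 0.0
    for k in range(n):
        y = -span + 2 * span * k / (n - 1)
        v = (x - y) ** 2 / (2 * lam) + mu * abs(y) + y * y / 2
        if v < best:
            best, arg = v, y
    return arg

# the closed form matches the grid minimizer up to grid resolution
assert abs(prox_elastic_net(2.0, 0.5, 1.0) - prox_brute(2.0, 0.5, 1.0)) < 1e-3
```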

#### 3.2.3 Matrix Norms

Next we deal with proximation problems involving matrix norms. For $X \in \mathbb{R}^{m,n}$, we are looking for

$$\operatorname{prox}_{\lambda \|\cdot\|}(X) = \operatorname*{argmin}_{Y \in \mathbb{R}^{m,n}} \left\{ \frac{1}{2\lambda}\|X - Y\|_F^2 + \|Y\| \right\}, \tag{6}$$

where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|$ is any unitarily invariant matrix norm, i.e., $\|UXV^{\mathrm T}\| = \|X\|$ for all unitary matrices $U \in \mathbb{R}^{m,m}$, $V \in \mathbb{R}^{n,n}$. Von Neumann (1937) [169] characterized the unitarily invariant matrix norms as those matrix norms which can be written in the form

$$\|X\| = g(\sigma(X)),$$

where $\sigma(X)$ is the vector of singular values of $X$ and $g$ is a symmetric gauge function, see [175]. Recall that $g$ is a symmetric gauge function if it is a positively homogeneous convex function which vanishes at the origin and fulfills

$$g(x) = g(\epsilon_1 x_{k_1}, \ldots, \epsilon_d x_{k_d})$$

for all $\epsilon_k \in \{-1, 1\}$ and all permutations $k_1, \ldots, k_d$ of the indices. An analogous result was given by Davis [60] for symmetric matrices, where the singular value decomposition is replaced by the eigenvalue decomposition and the singular values by the eigenvalues.

We are interested in the Schatten-$p$ norms for $p \in \{1, 2, \infty\}$, which are defined for $X \in \mathbb{R}^{m,n}$ and $t := \min\{m, n\}$ by

$$\begin{aligned}
\|X\|_* &:= \sum_{i=1}^{t} \sigma_i(X) = g_*(\sigma(X)) = \|\sigma(X)\|_1 && \text{(nuclear norm),} \\
\|X\|_F &:= \Bigl( \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2 \Bigr)^{\frac12} = \Bigl( \sum_{i=1}^{t} \sigma_i(X)^2 \Bigr)^{\frac12} = g_F(\sigma(X)) = \|\sigma(X)\|_2 && \text{(Frobenius norm),} \\
\|X\|_2 &:= \max_{i=1,\ldots,t} \sigma_i(X) = g_2(\sigma(X)) = \|\sigma(X)\|_\infty && \text{(spectral norm).}
\end{aligned}$$

The following theorem shows that the solution of (6) reduces to a proximal problem for the vector norm of the singular values of $X$. Another proof for the special case of the nuclear norm can be found in [37].

###### Theorem 3.7.

Let $X = U \Sigma_X V^{\mathrm T}$ be the singular value decomposition of $X$ and $\|\cdot\|$ a unitarily invariant matrix norm. Then $\operatorname{prox}_{\lambda\|\cdot\|}(X)$ in (6) is given by $\hat X = U \Sigma_{\hat X} V^{\mathrm T}$, where the singular values $\sigma(\hat X)$ in $\Sigma_{\hat X}$ are determined by

$$\sigma(\hat X) := \operatorname{prox}_{\lambda g}(\sigma(X)) = \operatorname*{argmin}_{\sigma \in \mathbb{R}^t} \left\{ \tfrac12 \|\sigma(X) - \sigma\|_2^2 + \lambda g(\sigma) \right\} \tag{7}$$

with the symmetric gauge function $g$ corresponding to $\|\cdot\|$.

###### Proof.

By Fermat’s rule we know that the solution $\hat X$ of (6) is determined by

$$0 \in \hat X - X + \lambda \partial \|\hat X\| \tag{8}$$

and from [175] that

$$\partial \|X\| = \operatorname{conv}\bigl\{ U D V^{\mathrm T} : X = U \Sigma_X V^{\mathrm T},\ D = \operatorname{diag}(d),\ d \in \partial g(\sigma(X)) \bigr\}. \tag{9}$$

We now construct the unique solution of (8). Let $\hat\sigma$ be the unique solution of (7). By Fermat’s rule, $\hat\sigma$ satisfies $0 \in \hat\sigma - \sigma(X) + \lambda \partial g(\hat\sigma)$, and consequently there exists $d \in \partial g(\hat\sigma)$ such that

$$0 = U\bigl(\operatorname{diag}(\hat\sigma) - \Sigma_X + \lambda \operatorname{diag}(d)\bigr)V^{\mathrm T} \quad \Leftrightarrow \quad 0 = U\operatorname{diag}(\hat\sigma)V^{\mathrm T} - X + \lambda\, U\operatorname{diag}(d)V^{\mathrm T}.$$

By (9) we see that $\hat X := U\operatorname{diag}(\hat\sigma)V^{\mathrm T}$ is a solution of (8). This completes the proof. ∎

For the special matrix norms considered above, we obtain by the previous subsection

$$\begin{aligned}
\|\cdot\|_*:&\quad \sigma(\hat X) := \sigma(X) - \Pi_{B_\infty(\lambda)}(\sigma(X)), \\
\|\cdot\|_F:&\quad \sigma(\hat X) := \sigma(X) - \Pi_{B_2(\lambda)}(\sigma(X)), \\
\|\cdot\|_2:&\quad \sigma(\hat X) := \sigma(X) - \Pi_{B_1(\lambda)}(\sigma(X)).
\end{aligned}$$

## 4 Fixed Point Algorithms and Averaged Operators

An operator $T : \mathbb{R}^d \to \mathbb{R}^d$ is contractive if it is Lipschitz continuous with Lipschitz constant $L < 1$, i.e., there exists a norm $\|\cdot\|$ on $\mathbb{R}^d$ such that

$$\|Tx - Ty\| \le L\|x - y\| \qquad \forall x, y \in \mathbb{R}^d.$$

In case $L = 1$, the operator is called nonexpansive. A function $T : \mathbb{R}^d \to \mathbb{R}^d$ is firmly nonexpansive if it fulfills for all $x, y \in \mathbb{R}^d$ one of the following equivalent conditions [12]:

$$\|Tx - Ty\|_2^2 \le \langle x - y,\, Tx - Ty \rangle, \qquad \|Tx - Ty\|_2^2 \le \|x - y\|_2^2 - \|(I - T)x - (I - T)y\|_2^2. \tag{10}$$

In particular, we see that a firmly nonexpansive function is nonexpansive.

###### Lemma 4.1.

For $f \in \Gamma_0(\mathbb{R}^d)$, the proximal operator $\operatorname{prox}_{\lambda f}$ is firmly nonexpansive. In particular, the orthogonal projection onto non-empty, closed, convex sets is firmly nonexpansive.

###### Proof.

By Theorem 3.1 ii) we have that

$$\frac1\lambda\langle x - \operatorname{prox}_{\lambda f}(x),\ z - \operatorname{prox}_{\lambda f}(x)\rangle + f(\operatorname{prox}_{\lambda f}(x)) - f(z) \le 0 \qquad \forall z \in \mathbb{R}^d.$$

With $z := \operatorname{prox}_{\lambda f}(y)$ this gives

$$\frac1\lambda\langle x - \operatorname{prox}_{\lambda f}(x),\ \operatorname{prox}_{\lambda f}(y) - \operatorname{prox}_{\lambda f}(x)\rangle + f(\operatorname{prox}_{\lambda f}(x)) - f(\operatorname{prox}_{\lambda f}(y)) \le 0$$

and similarly

$$\frac1\lambda\langle y - \operatorname{prox}_{\lambda f}(y),\ \operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\rangle + f(\operatorname{prox}_{\lambda f}(y)) - f(\operatorname{prox}_{\lambda f}(x)) \le 0.$$

Adding both inequalities, the $f$-terms cancel and we obtain

$$\langle x - \operatorname{prox}_{\lambda f}(x) + \operatorname{prox}_{\lambda f}(y) - y,\ \operatorname{prox}_{\lambda f}(y) - \operatorname{prox}_{\lambda f}(x)\rangle \le 0,$$

i.e.,

$$\|\operatorname{prox}_{\lambda f}(y) - \operatorname{prox}_{\lambda f}(x)\|_2^2 \le \langle y - x,\ \operatorname{prox}_{\lambda f}(y) - \operatorname{prox}_{\lambda f}(x)\rangle. \qquad ∎$$
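The firm nonexpansiveness of a proximal operator can be spot-checked numerically; the sketch below (plain Python, names ours) verifies the first inequality of (10) for the soft-shrinkage operator, i.e. the prox of the absolute value, on a grid of point pairs:

```python
def prox_abs(x, lam=1.0):
    """prox of lam*|.|, i.e. soft-shrinkage with threshold lam."""
    return max(x - lam, 0.0) - max(-x - lam, 0.0)

# check ||Tx - Ty||^2 <= <x - y, Tx - Ty> on a grid of pairs
pts = [-3.0, -1.2, -0.5, 0.0, 0.4, 1.1, 2.7]
for x in pts:
    for y in pts:
        tx, ty = prox_abs(x), prox_abs(y)
        assert (tx - ty) ** 2 <= (x - y) * (tx - ty) + 1e-12
```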

The Banach fixed point theorem guarantees that a contraction $T$ has a unique fixed point and that the Picard sequence

$$x^{(r+1)} = T x^{(r)} \tag{11}$$

converges to this fixed point for every initial element $x^{(0)}$. However, in many applications the contraction property is too restrictive in the sense that we often do not have a unique fixed point. Indeed, it is quite natural in many cases that the reached fixed point depends on the starting value $x^{(0)}$. Note that if $T$ is continuous and $(x^{(r)})_r$ is convergent, then it converges to a fixed point of $T$. In the following, we denote by $\operatorname{Fix}(T)$ the set of fixed points of $T$. Unfortunately, nonexpansiveness of $T$ alone does not guarantee convergence of $(x^{(r)})_r$, as the following example shows.

###### Example 4.2.

In $\mathbb{R}^2$ we consider the reflection operator

$$R := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

Obviously, $R$ is nonexpansive, and the sequence $x^{(r+1)} = R x^{(r)}$ only converges if $x^{(0)} \in \operatorname{Fix}(R)$. A possibility to obtain a 'better' operator is to average $R$, i.e., to build

$$T := \alpha I + (1 - \alpha) R = \begin{pmatrix} 1 & 0 \\ 0 & 2\alpha - 1 \end{pmatrix}, \qquad \alpha \in (0, 1).$$

By

$$Tx = x \ \Leftrightarrow\ \alpha x + (1 - \alpha) Rx = x \ \Leftrightarrow\ (1 - \alpha) Rx = (1 - \alpha) x, \tag{12}$$

we see that $R$ and $T$ have the same fixed points. Moreover, since $|2\alpha - 1| < 1$, the sequence $x^{(r+1)} = T x^{(r)}$ converges to a fixed point of $T$ for every $x^{(0)} \in \mathbb{R}^2$.
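The difference between iterating $R$ and iterating the averaged operator $T$ (here with $\alpha = \tfrac12$) is easy to observe numerically (plain Python, names ours):

```python
def reflect(x):
    """The nonexpansive reflection R = diag(1, -1); it cycles unless x[1] = 0."""
    return (x[0], -x[1])

def averaged(x, alpha=0.5):
    """T = alpha*I + (1 - alpha)*R, i.e. diag(1, 2*alpha - 1)."""
    y = reflect(x)
    return (alpha * x[0] + (1 - alpha) * y[0], alpha * x[1] + (1 - alpha) * y[1])

# plain reflection of (1, 1) cycles with period 2 and never converges
assert reflect(reflect((1.0, 1.0))) == (1.0, 1.0)

# the averaged iteration converges to the fixed point (1, 0) of both R and T
x = (1.0, 1.0)
for _ in range(60):
    x = averaged(x)
assert abs(x[0] - 1.0) < 1e-12 and abs(x[1]) < 1e-12
```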

An operator $T : \mathbb{R}^d \to \mathbb{R}^d$ is called averaged if there exist a nonexpansive mapping $R$ and a constant $\alpha \in (0, 1)$ such that

$$T = \alpha I + (1 - \alpha) R.$$

Following (12), we see that

$$\operatorname{Fix}(R) = \operatorname{Fix}(T).$$

Historically, the concept of averaged mappings can be traced back to [106, 113, 149], where the name ’averaged’ was not used yet. Results on averaged operators can also be found, e.g., in [12, 36, 52].

###### Lemma 4.3 (Averaged, (Firmly) Nonexpansive and Contractive Operators).


• Every averaged operator is nonexpansive.

• A contractive operator $T$ with Lipschitz constant $L < 1$ is averaged with respect to every parameter $\alpha \in \bigl(0, \frac{1-L}{2}\bigr]$.

• An operator is firmly nonexpansive if and only if it is averaged with $\alpha = \frac12$.

###### Proof.

i) Let $T = \alpha I + (1 - \alpha) R$ be averaged. Then the first assertion follows by

$$\|Tx - Ty\|_2 \le \alpha\|x - y\|_2 + (1 - \alpha)\|Rx - Ry\|_2 \le \|x - y\|_2.$$

ii) We define the operator $R := \frac{1}{1 - \alpha}(T - \alpha I)$. It holds for all $x, y \in \mathbb{R}^d$ that

$$\|Rx - Ry\|_2 = \frac{1}{1 - \alpha}\|(T - \alpha I)x - (T - \alpha I)y\|_2 \le \frac{1}{1 - \alpha}\|Tx - Ty\|_2 + \frac{\alpha}{1 - \alpha}\|x - y\|_2 \le \frac{L + \alpha}{1 - \alpha}\|x - y\|_2,$$

so $R$ is nonexpansive if $\frac{L + \alpha}{1 - \alpha} \le 1$, i.e., if $\alpha \le \frac{1 - L}{2}$.
iii) With $R := 2T - I$ we obtain the following equalities:

$$\|Rx - Ry\|_2^2 = \|Tx - Ty - \bigl((I - T)x - (I - T)y\bigr)\|_2^2 = -\|x - y\|_2^2 + 2\|Tx - Ty\|_2^2 + 2\|(I - T)x - (I - T)y\|_2^2,$$

and therefore, after reordering,

$$\|x - y\|_2^2 - \|Tx - Ty\|_2^2 - \|(I - T)x - (I - T)y\|_2^2 = \|Tx - Ty\|_2^2 + \|(I - T)x - (I - T)y\|_2^2 - \|Rx - Ry\|_2^2 = \tfrac12\bigl(\|x - y\|_2^2 - \|Rx - Ry\|_2^2\bigr).$$

If $R$ is nonexpansive, then the last expression is $\ge 0$ and consequently (10) holds true, so that $T$ is firmly nonexpansive. Conversely, if $T$ fulfills (10), then

$$\tfrac12\bigl(\|x - y\|_2^2 - \|Rx - Ry\|_2^2\bigr) \ge 0,$$

so that $R$ is nonexpansive. This completes the proof. ∎

By the following lemma averaged operators are closed under composition.

###### Lemma 4.4 (Composition of Averaged Operators).


• Suppose that $T$ is averaged with respect to the parameter $\alpha \in (0, 1)$. Then it is also averaged with respect to any other parameter $\tilde\alpha \in (0, \alpha]$.

• Let $T_1, T_2 : \mathbb{R}^d \to \mathbb{R}^d$ be averaged operators. Then $T_2 \circ T_1$ is also averaged.

###### Proof.

i) By assumption, $T = \alpha I + (1 - \alpha) R$ with $R$ nonexpansive. We have

$$T = \tilde\alpha I + \bigl((\alpha - \tilde\alpha) I + (1 - \alpha) R\bigr) = \tilde\alpha I + (1 - \tilde\alpha)\underbrace{\left( \frac{\alpha - \tilde\alpha}{1 - \tilde\alpha} I + \frac{1 - \alpha}{1 - \tilde\alpha} R \right)}_{=:\, \tilde R},$$

and for all $x, y \in \mathbb{R}^d$ it holds that

$$\|\tilde R x - \tilde R y\|_2 \le \frac{\alpha - \tilde\alpha}{1 - \tilde\alpha}\|x - y\|_2 + \frac{1 - \alpha}{1 - \tilde\alpha}\|Rx - Ry\|_2 \le \|x - y\|_2.$$

So $\tilde R$ is nonexpansive.
ii) By assumption there exist nonexpansive operators $R_1, R_2$ and parameters $\alpha_1, \alpha_2 \in (0, 1)$ such that $T_i = \alpha_i I + (1 - \alpha_i) R_i$, $i = 1, 2$, and hence

$$\begin{aligned}
T_2(T_1(x)) &= \alpha_2 T_1(x) + (1 - \alpha_2) R_2(T_1(x)) \\
&= \alpha_2 \bigl( \alpha_1 x + (1 - \alpha_1) R_1(x) \bigr) + (1 - \alpha_2) R_2(T_1(x)) \\
&= \underbrace{\alpha_1 \alpha_2}_{=:\, \alpha}\, x + (\alpha_2 - \alpha) R_1(x) + (1 - \alpha_2) R_2(T_1(x)) \\
&= \alpha x + (1 - \alpha)\underbrace{\left( \frac{\alpha_2 - \alpha}{1 - \alpha} R_1(x) + \frac{1 - \alpha_2}{1 - \alpha} R_2(T_1(x)) \right)}_{=:\, R(x)}.
\end{aligned}$$

The concatenation of two nonexpansive operators is nonexpansive, and the convex combination of nonexpansive operators is nonexpansive, so that $R$ is indeed nonexpansive. ∎

An operator $T : \mathbb{R}^d \to \mathbb{R}^d$ is called asymptotically regular if it holds for all $x \in \mathbb{R}^d$ that

$$T^{(r+1)} x - T^{(r)} x \to 0 \quad \text{for } r \to \infty.$$

Note that this property does not imply convergence of $(T^{(r)}x)_r$; even boundedness cannot be guaranteed. As an example, consider an iteration on $\mathbb{R}$ whose iterates are the partial sums of the harmonic series: the differences tend to zero while the iterates diverge.

###### Theorem 4.5 (Asymptotic Regularity of Averaged Operators).

Let $T = \alpha I + (1 - \alpha) R$ be an averaged operator with respect to the nonexpansive mapping $R$ and the parameter $\alpha \in (0, 1)$. Assume that $\operatorname{Fix}(T) \neq \emptyset$. Then $T$ is asymptotically regular.

###### Proof.

Let $\hat x \in \operatorname{Fix}(T)$ and $x^{(r)} := T^{(r)} x^{(0)}$ for some starting element $x^{(0)}$. Since $T$ is nonexpansive, i.e., $\|x^{(r+1)} - \hat x\|_2 \le \|x^{(r)} - \hat x\|_2$, we obtain

$$\lim_{r \to \infty} \|x^{(r)} - \hat x\|_2 = d \ge 0. \tag{13}$$

Using $\operatorname{Fix}(R) = \operatorname{Fix}(T)$, it follows that

$$\limsup_{r \to \infty} \|R(x^{(r)}) - \hat x\|_2 = \limsup_{r \to \infty} \|R(x^{(r)}) - R(\hat x)\|_2 \le \lim_{r \to \infty} \|x^{(r)} - \hat x\|_2 = d.$$