# The Primal-Dual Hybrid Gradient Method for Semiconvex Splittings

This paper deals with the analysis of a recent reformulation of the primal-dual hybrid gradient method [Zhu and Chan 2008, Pock, Cremers, Bischof and Chambolle 2009, Esser, Zhang and Chan 2010, Chambolle and Pock 2011], which allows to apply it to nonconvex regularizers as first proposed for truncated quadratic penalization in [Strekalovskiy and Cremers 2014]. Particularly, it investigates variational problems for which the energy to be minimized can be written as G(u) + F(Ku), where G is convex, F semiconvex, and K is a linear operator. We study the method and prove convergence in the case where the nonconvexity of F is compensated by the strong convexity of the G. The convergence proof yields an interesting requirement for the choice of algorithm parameters, which we show to not only be sufficient, but necessary. Additionally, we show boundedness of the iterates under much weaker conditions. Finally, we demonstrate effectiveness and convergence of the algorithm beyond the theoretical guarantees in several numerical experiments.

## Authors

• 10 publications
• 2 publications
• 21 publications
• 97 publications
• ### Linear Convergence of the Primal-Dual Gradient Method for Convex-Concave Saddle Point Problems without Strong Convexity

We consider the convex-concave saddle point problem _x_y f(x)+y^ A x-g(y...
02/05/2018 ∙ by Simon S. Du, et al. ∙ 0

• ### NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

We study a stochastic and distributed algorithm for nonconvex problems w...
05/25/2016 ∙ by Davood Hajinezhad, et al. ∙ 0

• ### Stochastic Variance Reduction Methods for Policy Evaluation

Policy evaluation is a crucial step in many reinforcement-learning proce...
02/25/2017 ∙ by Simon S. Du, et al. ∙ 0

• ### Randomized Primal-Dual Algorithms for Composite Convex Minimization with Faster Convergence Rates

We develop two novel randomized primal-dual algorithms to solve nonsmoot...
03/03/2020 ∙ by Quoc Tran-Dinh, et al. ∙ 0

• ### Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging Applications

We propose a stochastic extension of the primal-dual hybrid gradient alg...
06/15/2017 ∙ by Antonin Chambolle, et al. ∙ 0

• ### Dualize, Split, Randomize: Fast Nonsmooth Optimization Algorithms

We introduce a new primal-dual algorithm for minimizing the sum of three...
04/03/2020 ∙ by Adil Salim, et al. ∙ 8

• ### Robust supervised classification and feature selection using a primal-dual method

This paper deals with supervised classification and feature selection in...
02/05/2019 ∙ by Michel Barlaud, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

One of the most successful and popular approaches to ill-posed inverse problems particularly those arising in imaging and computer vision are variational methods. Typically, one designs an energy functional

, which maps an image from a Banach space to a number on the extended real value line, such that a low energy corresponds to an image with desired properties. The solution is then determined by finding the argument that minimizes the corresponding energy functional. In this paper we are concerned with energies that can be written as

 E(u)=G(u)+F(Ku).

In many cases can be interpreted as a data fidelity term enforcing some kind of closeness to the measurements and serves as a regularization term which incorporates prior information about , where denotes a linear operator.

One property that many functionals appearing in practice have in common is that both and often have easy-to-evaluate proximity operators, i.e. the minimization problem

 (I+τ∂G)−1(v)=argminuG(u)+12τ∥u−v∥22

can be solved efficiently to a high precision or even has a closed form solution. This observation has lead to many very efficient optimization methods, such as the primal-dual hybrid gradient method (PDHG), the alternating minimization algorithm (AMA) or equivalently proximal forward backward splitting (PFBS) or the alternating directions method of multipliers (ADMM). We refer to [3] for a detailed overview over these different splitting methods and their connections.

Although the majority of the proposed variational approaches in image processing and computer vision have focused on the use of convex functionals, there are many interesting nonconvex functionals such as the Mumford-Shah functional [6] or related piecewise smooth approximations [7, 8], nonconvex regularizers such as TV approaches motivated by natural image statistics (cf. [9, 10]), nonconvex constraints such as orthogonality constraints or sperical constraints (cf. [11] and the references therein), or application/model dependent nonconvex functionals for which the prior knowledge cannot be appropriately represented in a convex framework (e.g. [12]). Interestingly, the proximity operators to the nonconvex functions can often still be evaluated efficiently, such that splitting approaches seem to be a very effective choice even in the nonconvex case.

This paper investigates the use of a particular splitting based optimization technique for certain nonconvex with easy-to-evaluate proximity operators. As we will see, the proposed algorithm coincides with the PDHG method in the convex case, but is reformulated such that it can be applied in a straightforward fashion even to energies with nonconvex regularization functionals. This reformulation was initially proposed in [5].

The rest of this paper is organized as follows. First, we will summarize and discuss the literature on nonconvex optimization and point out some difficulties of existing methods. Particularly, we discuss why many methods are rather costly in comparison to direct splitting approaches, such as our proposed iterative PDHG scheme. In Section 2 we will analyze the behavior of the algorithm and prove that it converges to the minimum in case where the nonconvexity of is compensated by the strong convexity of . Additionally, we provide an example showing that the parameter choices which are sufficient to guarantee the convergence are also necessary. Furthermore, we will follow the analysis done for the same algorithm in the case of a nonconvex but convex in [12], and show that the iterates remain bounded under rather mild conditions. We can conclude the convergence to a critical point if additionally the difference between two sucessive iterates tends to zero. In our numerical experiments in Section 3 we will first discuss some cases in which our full convergence proof holds and illustrate why this class of problems can be efficiently tackled with splitting approaches. Next, we provide numerical examples showing that convergence along subsequences to critical points is achieved for a much wider class of problems and often goes beyond the theoretical guarantees. Finally, we draw conclusions and point out future directions of research in Section 4.

### 1.1 Related Work

For the rest of this paper, let and be finite dimensional Hilbert spaces endowed with an inner product and induced norm . In this paper we are concerned with the following class of problems.Let be a linear operator and consider the optimization problem

 minuG(u)+F(Ku), (P)

where is proper, convex and lower semicontinuous, proper, lower semicontinuous and possibly nonconvex. For now we will not specify further properties of , but in the absence of convexity one typically needs some kind of smoothness assumption or, as in our case, -semiconvexity as defined in Section 2. We will also work with the equivalent problem

 minu,gG(u)+F(g)subject tog=Ku. (PC)

We will also assume that the above minimization problems are well defined in the sense that minimizers exist.

Generally, problems of the form (P) or even more general nonconvex minimization problems have been studied in several cases and there exist a number of optimization techniques, which provably converge to critical points. In the following we will briefly summarize these approaches:

• The simplest and probably oldest approach is to use gradient descent in case where the entire energy is smooth. If

is coercive, and one uses gradient descent with appropriate step sizes, the sequence is bounded and every accumulation point is critical (cf. [13]). Gradient descent methods are, however, often inefficient. Additionally, they require the whole energy to be smooth and thus cannot deal with hard constraints.

• In the case where the nonconvex part is smooth and the convex part is nondifferentiable, one can apply forward-backward splitting techniques. We refer to [14] and the references therein for convergence results for descent methods. To give an example, recently, Ochs et al. [15] proposed a forward-backward splitting approach called iPiano which additionally uses a heavy-ball type of approach and provably converges to critical points under mild conditions. The basic idea of forward-backward splitting approaches is to construct a sequence of minimizers

 uk+1=(I+τk∂G)−1(uk−τkKT∇F(Kuk)).

While this approach works well and is applicable to many situations it has the drawback that has to be differentiable. Thus any type of hard constraint has to be incorporated into , which can make the above minimization problem hard to evaluate if no efficient closed form projection exists.

• For nonconvex functions which just need an additional squared norm to become convex (also called -semiconvex functions as defined in Section 2), Artina, Fornasier and Solombrino proposed an Augmented Lagrangian based scheme for constrained problems and proved convergence to critical points [16].

The drawback of that approach, however, would be that the resulting scheme requires the solution of an inner convex problem at each iteration which might become costly.

• Along the same lines, there exist approaches in case the functional can be written as the difference of two convex functions. For instance, if is convex, then the difference of convex functions algorithm (DCA, see e.g. [17, Algorithm 1]) would be

 un+1=argminuG(u)+F(Ku)+ω2∥Ku∥22−ω⟨KTKun,u⟩.

In this case the convergence to a critical point can be proven under mild conditions (cf. [18, 17]). Once more, the scheme requires the solution of a convex optimization problem at each iteration and will therefore be rather costly. Often times, the problem with these approaches is that it is not clear how many iterations are required to solve the inner convex problem.

• If is convex and is well approximated from above by a quadratic or a norm, there exist the iteratively reweighted least squares and iteratively reweighted algorithms. These methods solve a sequence of convex problems which majorize the original nonconvex cost function. The latter method has been recently generalized by Ochs et. al. [19] to handle linearly constrained nonconvex problems which can be written as the sum of a convex and a concave function. Furthermore they prove convergence to a stationary point if the concave term is smooth. While in principle it is still required to solve a sequence of convex subproblems, they precisely specify to which accuracy each subproblem has to be solved which makes their algorithm very efficient in practice.

• Another track of algorithms are alternating minimization methods which are based on the well-known quadratic penalty method. In the classic quadratic penalty method the constrained cost function is augmented by a quadratic term that penalizes constraint violation in order to arrive at an unconstrained optimization problem:

 minu,gG(u)+F(g)+τ2∥g−Ku∥22. (1)

This augmented cost function is then solved for increasing values of , see e.g. [20]. Since joint minimization over and is difficult, minimization is usually carried out in an alternating fashion. Variations exist where the alternating minimizations are repeated several times before increasing the penalty parameter , see e.g. [10] for details. We refer the reader to [21] and the references therein for a discussion on the convergence analysis for this class of methods.

• Finally, there exist some theory for the convergence of the PDHG method in the cases that the nonconvexity can be isolated in an operator with certain properties [22], which is different from the problem we are investigating in this paper.

The above approaches that can deal with constraints or other nondifferentiabilities all have the disadvantage, that each step of the minimization algorithm involves the solution of a convex problem, and hence each step is as expensive as a full typical minimization algorithm such as the PDHG method. This is the reason why some authors have started applying popular convex optimization methods mentioned in the previous section to nonconvex problems as if they were convex (e.g. [23, 24, 12, 25]). Although the results obtained in practise are convincing, very little is known about the convergence of these methods. One theoretical result we would like to point out is the one by Zhang and Esser in [12], where boundedness results and convergence under an additional (iteration dependend) assumption is shown. While [12] investigated the use of the PDHG method for a particular type of nonconvex and convex , we will reproduce and extend their results for nonconvex and convex .

The algorithm we are investigating has been initially proposed by Strekalovskiy and Cremers [5] and is itself is a reformulation of [1, 2, 4] to make it applicable to nonconvex . A detailed description as well as a derivation and its relation to the PDHG method is given in the next section.

### 1.2 The Algorithm

We study the following algorithm that aims at minimizing (P) for -semiconvex with as defined in Section 2. Given an and for , , , , , , iterate for all :

 gn+1 =argmingσ2∥g−K¯un∥22−⟨g,qn⟩+F(g), (2) qn+1 =qn+σ(K¯un−gn+1), un+1 =argminu12τ∥u−un∥22+⟨Ku,qn+1⟩+G(u), ¯un+1 =un+1+θ(un+1−un).

For the above update scheme to be well defined in the absence of convexity, we need existence and uniqueness of a minimizer for the minimization problem in . We will see that this is fulfilled for -semiconvex and for all . Note that the unusual step size criterion on , i.e. , will follow from our main theoretical convergence result in Section 2.

For convex , the above algorithm is exactly the primal-dual hybrid gradient method as proposed in [2, 4]. If we stack the primal and dual variable and start the iteration scheme with the primal variable update (which is equivalent to (2) for a different initial value ) our reformulation takes the following simple form:

 0∈T(zn+1)+M(zn+1−zn) (3)

with

 T=(∂GKT−K(∂F)−1),M=(τ−1I−KT−θKσ−1I), (4)

where denotes the inverse of the set-valued operator . To see this, we use the identity (see [26, Prop. 23.6 (ii)]), and writing out (3) with that, we again obtain the algorithm

 un+1 =(I+τ∂G)−1(un−τKTqn), (5) qn+1 =(I+σ(∂F)−1)−1(qn+σK(2un+1−un)) =qn+σ(K(2un+1−un)−(I+σ−1∂F)−1(K(2un+1−un)+σ−1qn)=:gn+1),

which is exactly (2) for and with the order of primal and dual update reversed.

If is proper, convex and lower semicontinuous it readily follows that and we arrive at the iteration scheme described in [27]. In that case it is easy to check that is a maximally monotone operator, and since for the matrix is symmetric and positive definite [27, Remark 1] it is well known that the algorithm is just a special case of the classical proximal point algorithm in a different norm. Thus it converges to a saddle point with .

In that sense we can interpret as a natural generalization of the subdifferential of the convex conjugate to nonconvex functions. However, for general nonconvex the operator fails to be monotone, and thus the convergence analysis becomes more complicated. There exist convergence results for the proximal point algorithm for nonmonotone operators, in particular for operators whose inverse is -hypomonotone [28, 29]. While for certain choices of and we can show that is -hypomonotone, the results obtained by this analysis are much weaker than the ones provided in the rest of this paper. Analyizing the algorithm from this set-valued operator point of view is left for further future work.

Additionally, notice that for , the primal-dual algorithm (2) is remarkably similar to the alternating direction method of multipliers (ADMM) which we obtain by replacing by in the update for . The updates for and are exactly the same. Connections of this type have been pointed out in [3] where the method we are considering here was interpreted in the sense of a preconditioned ADMM algorithm. Further generalizations for simplifying subproblems were investigated in [30].

### 1.3 Contribution

Our contribution in this paper is to present a generalized usage of the algorithm proposed in [5] and study its convergence behavior. We can prove the convergence for a certain choice of parameters in case the strong convexity of dominates the semiconvexity of the other term . We find an simple sufficient requirement for the step size parameter, namely that the step size for the minimization in has to be twice as large as necessary to make the subproblem convex and show that this criterion is also necessary by constructing an example for a diverging algorithm (for an overall convex energy) in case the step size requirement is violated. In a more general setting we can state the boundedness of the iterates under mild conditions, and can show a result similar to the one obtained in [12] for nonconvex functions  . Finally, we demonstrate the efficiency of the proposed method in several numerical examples which show that the convergence of the algorithm often goes beyond the theoretical guarantees.

## 2 Theory

### 2.1 Mathematical preliminaries

Before going into the analysis of the algorithm, let us recall some definitions that give sense to the proposed algorithm and the following analysis.

###### Definition 2.1 (Effective domain).

Let . We call the set

 dom(E):={u∈X | E(u)<∞}

the effective domain of . If is a nonempty set, then we call proper.

###### Definition 2.2 (Subdifferential, cf. [31, Definition 8.3]).

Let be a finite dimensional Hilbert space and a lower semicontinuous functional. We say that belongs to the (regular) subdifferential of at a point if and only if

 liminfρ→gρ≠gE(ρ)−E(g)−⟨ξ,ρ−g⟩∥ρ−g∥≥0.

For we define .

It is well known that this definition extends (and is consistent with) the usual definition of the subdifferential for convex functions:

 ∂E(g)={ξ∈X | E(ρ)−E(g)−⟨ξ,ρ−g⟩≥0},   g∈dom(E).

For illustrative purposes Fig. 1 shows some examples of subgradients for one dimensional functions.

###### Remark 2.3 (Sum of subdifferentials, from [16], [31, Exercise 8.8(c)]).

We say that is a perturbation of a lower semicontinuous function, if for being lower semicontinuous and being of class . For this type of functions the decomposition

 ∂E(u)=∂ˆF(u)+∂Q(u)

holds true with the subdifferential from Definition 2.2.

###### Definition 2.4 (Critical point).

Let be a lower semicontinuous function. We say that is a critical point if .

###### Definition 2.5 (Semiconvexity and strong convexity, from [16]).

• We call a lower semicontinuous function -semiconvex if is convex.

• We call a lower semicontinuous function -strongly convex, if for all , , , it holds that

 ⟨u−v,q−ξ⟩≥c∥u−v∥2.

One typically uses the term -strongly convex for and the term -semiconvex for , but if one allows negative values for the constants, it is well-known that both definitions can be used to express the same thing.

###### Remark 2.6.

An -semiconvex function is -strongly convex with . A -strongly convex function is -semiconvex with . For example, is c-strongly convex and is -semiconvex.

###### Lemma 2.7.

Let be the sum of a proper, lower semicontinuous -semiconvex function (with being a linear operator) and a proper, lower semicontinuous convex function . Additionally, let . Then the following holds true

 ∂E(u)=KT(∂F)(Ku)+∂G(u).
###### Proof.

Due to Remark 2.3, as well as the result that the subdifferential of the sum of two convex functions is the sum of the subdifferentials if their domains have a nonempty intersection (cf. [32]), we find that

 ∂(G(u)+F(Ku)) =∂(G(u)+F(Ku)+ω2∥u∥2)−ωu =∂G(u)+∂(F(Ku)+ω2∥u∥2)−ωu =∂G(u)+KT(∂F)(Ku).

### 2.2 Convergence Analysis

Throughout the rest of the paper, we will assume that the assumptions of Lemma 2.7 hold. Particularly, we make the stronger assumption that itself is -semiconvex, such that the subproblem in arising in (2) is strongly convex for and the iterates are well defined. Thus, let hold for the rest of this work. With the tools defined in the previous section, we can now state the optimality conditions for Algorithm (2) to be

 0 ∈σ(gn+1−K¯un)+∂F(gn+1)−qn, (6) 0 ∈1τ(un+1−un)+KTqn+1+∂G(un+1). (7)

The optimality conditions yield an interesting property of . Using the definition of in the optimality condition for leads to

 qn∈∂F(gn) ∀n≥1, (8)

such that the variable has an immediate interpretation. Notice that is one of the two main variables in the PDHG method in the convex case, where it is typically updated with a proximity operator involving the convex conjugate . Formulating the algorithm in terms of an update of the primal variable allows us to apply the same iterative scheme in the nonconvex case. Notice that is still defined in the case where is nonconvex, however, will be convex independent of the convexity of . Using a PDHG formulation involving would therefore implicitly use convex relaxation which is not always desirable.

#### 2.2.1 Convergence Analysis: Semiconvex + strongly convex

In this subsection we will investigate the convergence in the special case, where is semi-convex, is strongly convex, and the strong convexity dominates the semiconvexity. Our main result is stated in the following theorem. For the sake of clarity and readability, we moved the corresponding proof to Appendix A.

###### Theorem 2.8.

If is -semiconvex and -strongly convex with a constant , then Algorithm (2) converges to the unique solution of (P) for , , and any , with rate .

Although the requirements of the theorem lead to the overall energy being strongly convex, this is to the best knowledge of the authors the first convergence result for an algorithm where a nonconvex part of the energy was separated from a convex one by variable splitting. We will show some examples for energies our above convergence result can be applied to in the numerical results Section 3. Regarding the Theorem 2.8, we’d like to remark the following:

###### Remark 2.9.

Theorem 2.8 states the convergence of the variable only, which is due to the choice which does not necessarily lead to a convergence for as we will see in Example 1 below. Let us note that, however, we can modify Theorem 2.8 to also yield convergence of by choosing , but small enough such that

. In the very first estimate of the proof, this leads to having terms of both forms,

as well as , with strictly positive factors. The remaining proof continues as before and one can conclude the convergence of at least as fast as as well.

Notice that the assumptions in Theorem 2.8 require , i.e. needs to be chosen twice as large as necessary to make the subproblem for the minimization in convex. It is interesting to see that this requirement is not based on a (possibly crude) estimate for the convergence. In fact, it is not only sufficient but also necessary, as the next proposition shows.

###### Proposition 2.10.

Let all assumptions for Theorem 2.8 be met, except that we choose . Then there exists a problem for which the sequences of and of the proposed algorithm with diverge for any fixed .

###### Proof.

Let us give a very simple example problem for which the algorithm diverges for . Consider

 ^u=argminu c2∥u∥2−12∥u∥2,

Clearly, for the problem is strongly convex and the solution is given by . We apply the proposed algorithm with , being the identity, , and being strongly convex. We assume , and can immediately see that has to hold for the subproblem in to even have a minimizer, i.e. for the algorithm to be well defined. Note that by (8) such that we can eliminate from the algorithm; a short computation shows that

 gn+1 =−1σ−1gn+σσ−1un, un+1 =1τ−1+cgn+1+τ−1τ−1+cun.

Using the first formula to replace in the second equation admits the fixed point form

 (gn+1un+1)=(a1,1a1,2a2,1a2,2)(gnun).

with

 a1,1= −1σ−1  a1,2= σσ−1 a2,1= −1(τ−1+c)(σ−1)   a2,2= τ−1τ−1+c+σ(τ−1+c)(σ−1)

Obviously, this iteration can diverge as soon as one eigenvalue of the matrix

is larger than one – simply choose the corresponding eigenvector as a starting point of the iteration. The eigenvalues of a

matrix are given by

 d1= a1,1+a2,22+√(a1,1+a2,22)2−(a1,1a2,2−a1,2a2,1) d2= a1,1+a2,22−√(a1,1+a2,22)2−(a1,1a2,2−a1,2a2,1)

Due to , we know that and are real. Note that and , such that we could choose large enough such that

 d2=a1,1+ε(c),

with arbitrarily small. Assume we have picked some . Then and we can pick a large enough such that and the algorithm diverges. Notice that for any finite , we have and such that neither nor are eigenvectors which means that both variables, and , diverge.

Note that in the above example, the algorithm can diverge although the total energy was strongly convex, which shows that the convergence result of Theorem 2.8 is not trivial. Additionally, we could see that choosing a large , i.e. making the total energy even more strongly convex, led to an algorithm that provably diverges. As a second example, let us also state that the case of converging without converging can also happen in practice and is not an artifact of the convergence estimates:

###### Example 1.

Consider

 ^u=argminu c2∥u∥2−12∥Ku∥2,

with . Clearly, for the problem is strongly convex and the solution is given by . If we apply the proposed algorithm with , , and start with , , we obtain

 gn+1 =(σ−1)−1qn, qn+1 =−gn+1, un+1 =0.

Notice that we only need for the minimization problem in to be convex and for the problem to be strictly convex, coercive and the iterates to be well defined. However, is -semi convex with , and Theorem 2.8 tells us that we need for the convergence. Looking at the above iteration we see that indeed such that diverges for . It is interesting to see that Theorem 2.8 states the convergence of the variable only, which is due to the choice which - as we can see here - does not need to lead to a convergence for . As pointed out in remark 2.9 we need to choose a slightly larger to also obtain the convergence in , which is consistent with the behavior of the algorithm in this toy example.

In any case, this example as well as the previous proposition show, that the estimates used to prove the convergence of the algorithm seem to be sharp - at least in terms of the algorithm parameters.

Although the assumptions in Theorem 2.8 seem to be rather restrictive it is interesting to see that there are some nontrivial optimization problems which meet the necessary requirements as we will see in the numerical experiments section.

#### 2.2.2 Convergence Analysis: Discussion in the nonconvex case

In the case where the energy is truly nonconvex, the situation is much more difficult and we do not have a full convergence result yet. Based on the example of the diverging algorithm in Proposition 2.10 we can already see that we will need some additional assumptions. Particularly, it seems that we need the update of to be somehow contractive - despite the nonconvexity. Still, there are a few things one can state for fairly general nonconvex functions. The following analysis is closely related to the analysis in [12], where the primal-dual algorithm for nonconvex data fidelity term ( in our notation) was analyzed.

###### Proposition 2.11.

Let be a differentiable function whose derivative is uniformly bounded by some constant for all . Additionally, let there exist constants and with , such that for all with , we have

 G((1+ε)u)≥G(u)+tε∥u∥,∀ε≥0.

Then the sequence is bounded and therefore converges along subsequences.

###### Proof.

Since (see (8)), the boundedness of is trivial. We prove the boundedness of by contradiction. Assume there exists a with . Let us pick the first iterate with this property (assuming ), which ensures that . We know that minimizes

 E(u)=12τ∥u−un∥2+G(u)+⟨u,KT∇F(gn+1)⟩.

Consider . We will show that there exists a which has a lower energy than , thus leading to a contradiction. One finds that

 G(un+1)=G((1+ε)~u(ε))≥G(~u(ε))+εt∥~u(ε)∥

for all small enough such that . For these we have

 G(un+1)+⟨un+1,KT∇F(gn+1)⟩ ≥G(~u(ε))+⟨~u(ε),KT∇F(gn+1)⟩+ε(t−c∥K∥)∥~u(ε)∥ >G(~u(ε))+⟨~u(ε),KT∇F(gn+1)⟩.

 ∥un+1−un∥2 =∥(1+ε)~u(ε)−un∥2 =∥~u(ε)−un∥2+ε2∥~u(ε)∥2+2ε⟨~u(ε),~u(ε)−un⟩,

and observers that at least as long as , such that for small enough we arrive at

 E(un+1)>E(~u(ε)),

which is a contradiction to being a minimizer of . Therefore, remains bounded. Finally, notice that the boundedness of and together with the update immediately implies the boundedness of . ∎

The assumptions of Proposition 2.11 are met for instance for image denoising with an fidelity term and smooth approximations to popular regularizations like truncated quadratic or TV. We will show some of these cases in our numerical results Section 3. Basically the condition ensures the coercivity of the problem which of course seems to be a reasonable criterion to obtain boundedness of the iterates. To get a better understanding of what the assumptions of Proposition 2.11 mean we illustrate the idea in a 1D case in Fig. 2.

Unfortunately, it is not clear that the accumulation points of the bounded sequence generated by Algorithm (2) are critical. For the convergence to a critical point we need the distance between successive iterates to vanish in the limit. The following result is similar to the observations by Zhang and Esser in [12].

###### Proposition 2.12.

If the sequences , and remain bounded (and thus are convergent along subsequences), and we additionally have that and , then the iteration converges to critical points along subsequences.

###### Proof.

If we pick a convergent subsequence (denoted with indices ) that converges to a point , we have

 ∥qmn−1−qmn∥=σ∥gmn−K¯umn−1∥ ≥σ(∥gmn−Kumn∥−∥Kumn−Kumn−1∥−θ∥Kumn−1−Kumn−2∥)

which ensures that . Additionally, taking the limit of the equation

 −1τ(umn−umn−1)∈KTqmn+∂G(umn),

shows that the accumulation point satisfies

 −KT^q∈∂G(^u).

Using that (equation (8) along with the lower semicontinuity of ), as well as yields

 KT∂F(K^u)+∂G(^u)∋0,

and coincides with the definition of being a critical point. ∎

Note that if the assumptions of Proposition 2.11 are satisfied, the assumptions of Proposition 2.12 are also fulfilled.

## 3 Numerical Experiments

In this section we will provide different types of numerical examples. The first two subsections contain examples for which the overall function is strictly convex, but a splitting approach into a strongly convex and a nonconvex part clearly simplifies the numerical method, such that the primal-dual approach (2) can be applied with a convergence guarantee based on Theorem 2.8. Secondly, we consider the case of Mumford-Shah regularized denoising in the case where the overall energy is truly nonconvex. In this case we cannot guarantee the convergence a-priori, but can guarantee the boundedness of the iterates (Proposition 2.11) and have a way to check a-posteriori if a critical point was found (based on Proposition 2.12). Finally, we present numerical results on Mumford-Shah based inpainting as well as on image dithering to further illustrate the behavior of the algorithm in the fully nonconvex setting and to show that the numerical convergence often goes beyond the current theoretical guarantees.

### 3.1 Notation and Discretization

Throughout the rest of this section will denote a -dimensional discretized rectangular domain. For images with channels we denote the gradient discretization as . We will make frequent use of the following norms for :

 ∥g∥2,1:=∑x∈Ω ∥g(x)∥2,  ∥g∥2,2:=√∑x∈Ω ∥g(x)∥22,

where denotes the Frobenius norm. In the following is discretized by standard forward differences and we have the following bound on its norm (see [4]).

### 3.2 Joint Denoising and Sharpening via Backwards Diffusion

One of the most popular variational denoising methods is the Rudin-Osher-Fatemi (ROF) model [33] with the total variation (TV) as a regularizer. A possible interpretation in the vectorial case is given as

 minu:Ω→Rkc2∥u−f∥2+∥∇u∥2,1.

Based on Lemma 2.6 we can see that the energy to be minimized is -strongly convex, which gives us the freedom to introduce further -semiconvex terms and, as long as , still be able to use the primal-dual splitting approach as an efficient minimization method. One possible example would be to incorporate a sharpening/edge enhancement term of the form with . The latter constraint is needed for the assumptions of Theorem 2.8 to hold and leads to (9) being a convex problem. Note that incorporating a term like into the energy can be interpreted as an implicit step for the backward heat equation. Thus, if a blur is assumed to follow a diffusion process, the term would aim at removing the blur. Similar approaches to image sharpening have been investigated in the literature, mostly in the context of inverse diffusion equations (cf. [34, 35]).

The full energy minimization approach for joint denoising and sharpening could be

 minu:Ω→Rk c2∥u−f∥2+∥∇u∥2,1−ω2∥∇u∥22,2+ι[0,1](u) (9)

where

 ι[0,1](u)={0,0≤u(x)≤1,∀x∈Ω∞,else, (10)

restricts the range of to be between zero and one. Generally, we expect the total variation term to be dominant for small oscillations in the data, thus removing some of the noise, while the sharpening term will be dominant on large edges leading to an increased contrast. Notice that one can also iterate the above model by replacing with the previous iterate starting with , which results in a combination of the TV flow [36] and inverse diffusion.

Notice that (9) is a good example for a functional which is convex but difficult to minimize without splittings that divide the energy into a strongly convex and a nonconvex part. A convex splitting would have to compute a proximity operator with respect to , which is difficult to solve without an additional splitting. Forward-backward splittings have to use in the backward step, since it is nondifferentiable, such that one has to solve a full TV-denoising problem at each iteration. Similar problems occur in differences of convex approaches.

To the best knowledge of the authors this manuscript is the first work to theoretically justify the splitting into a convex and a nonconvex part by considering the problem

 minu,g c2∥u−f∥2+∥g∥2,1−ω2∥g∥22,2+ι[0,1](u)subject tog=∇u. (11)

Fig. 3 shows the result of the flow arising from (9) (called enhanced TV flow in Fig. 3) in comparison to a plain TV flow for and . While both, the TV and the inverse diffusion TV flow, are able to remove the noise from the image, we can see that in the course of the iteration, the inverse diffusion TV flow results remain significantly sharper. Strong image edges even get enhanced during the iterations, such that due to the nature of backward diffusion, the iterations have to be stopped at some point to avoid significant overshooting.

We do not claim the above energy to be a state of the art image enhancement technique. However, the above experiment shows that we can add a certain amount of a semiconvexity to any strongly convex energy without limiting our abilities to tackle the resulting problem with an efficient splitting approach. We believe that the inclusion of such nonconvexities in the regularization might aid in the design of energy functionals that describe the desired effects even more accurately than purely convex ones.

### 3.3 Illumination Correction

In the previous subsection, the linear operator in the nonconvex term was exactly the same as the one we would typically use for the splitting in solving a plain TV problem. Additionally, the proximity operator for the TV plus the nonconvex could be solved in closed form. In this subsection, we would like to use a toy example to show that this does not have to be the case and one could imagine adding nonconvex terms which are treated completely independent of the convex ones. Consider for instance terms of the form

 N(Au)=∥(Au−r)2−e∥1, (12)

where is a linear operator, and are given values, and the square is a componentwise operation. Due to the norm, this term has the interpretation that is sparse, i.e. that many components meet . Looking at the function in 1D in Fig. 4 it is obvious that such an is nonconvex and nondifferentiable for positive . It is, however, -semiconvex, such that adding with a sufficiently small weight to any strongly convex function leads to an overall convex energy and our convergence theory regarding a splitting into a convex and nonconvex part applies.

As a toy problem, consider again the ROF denoising model for three color channels (), with the additional idea to perform a particular illumination correction, by requiring many gray values defined as red channel plus green channel plus blue channel over three, to be close either a number or a number . In other words, we introduce the term , where and are just numbers in . For instance, for and the term would prefer many intensity values to be either or . With this term we can modify the tonemapping; notice that intensity contrast enhancement is a special case obtained for and , where the relevant part of is mere the negative quadratic of the intensity.

The energy of our variational model reads as follows:

 minu:Ω→R3 c2∥u−f∥2+∥∇u∥2,1+ω2N(Au)+ι[0,1](u). (13)

Writing the color image

in vector form, we can introduce a new variable

along with the constraint , where

 K=⎛⎜ ⎜ ⎜⎝∇000∇000∇A⎞⎟ ⎟ ⎟⎠, (14)

and denotes the discrete gradient operator of a single channel image in matrix form. We can apply the proposed algorithm which then decouples the minimization of the nonconvex term involving from the rest of the energy and all proximity that need to be evaluated in the algorithm have a closed from solution.

Fig. 5 shows some results of the method obtained by minimizing (13). We compare the original noisy image with the one obtained by (13) without the nonconvex term (i.e. ), and with the nonconvex term (i.e. for a small ) for , . It is interesting to see how strong the effect of the nonconvex term is: We see big intensity changes and particularly a stronger denoising effect in the background. To illustrate that our idea of using (12) to obtain many entries that meet , we plotted the sorted pixel intensities of the plain TV denoising solution as well as the sorted pixel intensities of our model with the additional nonconvex term. As we can see there is a drastic increase in the number of pixel for which the intensity is exactly or .

Again, the exaggerated effect of Fig. 5 was for illustration purposes. One could imagine cases where the original image has a low dynamic range and does not fully use the possible range of values from zero to one. In this case the approach seems more reasonable and we obtain the results shown in Fig. 6.

While we are not certain if the idea of incorporating the nonconvex term for illumination correction has wide applicability, our experiments do show that whenever one has a strongly convex energy, one has the freedom to model any desirable image property with a semiconvex term without complicating the minimization algorithm. As we have seen, the condition of being semiconvex is rather weak (particularly when working in a bounded domain) such that one has a great freedom in the design of such terms. Furthermore, we have seen that the effect of these additional nonconvex terms is by no means negligible. Additionally we would like to mention, that since the effect of the nonconvex terms is influenced by their weight and therefore by the strong convexity constant, the effect of any nonconvex term can be enhanced in an iterative fashion by computing the corresponding flow (which for a simple data term amounts to replacing the data by the previous iterate ).

### 3.4 Mumford-Shah Regularization

In the remaining subsections we will give examples where the overall energy is nonconvex. Our main convergence Theorem does not hold, but we observe convergence experimentally nonetheless.

The Mumford-Shah functional provides means to compute a discontinuity preserving piecewise smooth approximation of the input image. Strekalovskiy and Cremers [5] initially proposed the nonconvex primal-dual algorithm (2) to efficiently find a minimizer of the Mumford-Shah problem. The following discretization was considered:

 minu:Ω→Rkx∈Ω∑(u(x)−f(x))2+(ˆRMS∘∥⋅∥2)(∇u(x)) (15)

where is the truncated quadratic regularizer

 ˆRMS(t)=min{λ,αt2}. (16)

Since (16) is not -semiconvex due to the truncation, it seems to be necessary for the step size to approach infinity in order for the algorithm to converge. The authors of [5] employed the variable step size scheme from [4, Algorithm 2] which along with has the nice property that it provably converges in the fully convex setting with rate if the dataterm is strongly convex.

We propose to use a slight modification of (16) in order to obtain -semiconvexity of the regularizer. We consider the smoothed version of (16) as described in [16]. It is shown in Fig. 7 and given as

 ˆRεMS(t)=⎧⎨⎩αt2,ts2, (17)

where , . is a cubic spline from [16] with constants

 A=−α4ε,B=−α(2√λ/α+ε)4ε,C=λ. (18)
###### Lemma 3.1.

The smoothed truncated quadratic regularizer (17) is -semiconvex with constant

 ω=α(2+ε0)2ε0 (19)

where . Furthermore, the composition given by

 RεMS(g)=(ˆRεMS∘∥⋅∥2)(g) (20)

is also -semiconvex with the same constant.

###### Proof.

A sufficient condition for being convex is that the second derivative is nonnegative:

 (ˆRεMS)′′(t)+ω≥0,∀t∈R. (21)

We only have to consider the region , since is nonconvex only in that part:

 mins1≤t≤s2{6A(t−s2)+2B}+ω≥0 (22) ⇒ω≥−2B=α(2+ε0)2ε0.

Since is increasing and convex, the composition is convex as well. ∎

##### Evaluation of the Proximal Mapping.

Next, we describe how to evaluate the proximal mapping of the truncated quadratic regularizer . For that we note that a slight modification of [37, Theorem 4] to our setting of -semiconvex functions also holds. This allows us to reduce the evaluation of the proximal mapping of -semiconvex regularizers to single variable optimization problems.

###### Remark 3.2.

Let be increasing and -semiconvex. Furthermore, let and . Then the proximal mapping for is given as

 (I+τ∂f)−1(~u)=(I+τ∂h)−1(∥~u∥)~u∥~u∥ (23)

with the convention .

The proximal mapping