 # Adaptive Smoothing Algorithms for Nonsmooth Composite Convex Minimization

We propose an adaptive smoothing algorithm based on Nesterov's smoothing technique in Nesterov2005c for solving "fully" nonsmooth composite convex optimization problems. Our method combines Nesterov's accelerated proximal gradient scheme with a new homotopy strategy for the smoothness parameter. By an appropriate choice of smoothing functions, we develop a new algorithm that has the $\mathcal{O}(1/\varepsilon)$-worst-case iteration-complexity while preserving the same complexity-per-iteration as in Nesterov's method, and that allows one to automatically update the smoothness parameter at each iteration. Then, we customize our algorithm to solve four special cases that cover various applications. We also specify our algorithm to solve constrained convex optimization problems and show its convergence guarantee on a primal sequence of iterates. We demonstrate our algorithm through three numerical examples and compare it with other related algorithms.

## 1 Introduction

This paper develops new smoothing optimization methods for solving the following “fully” nonsmooth composite convex minimization problem:

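$$F^\star := \min_{x\in\mathbb{R}^p}\big\{ F(x) := f(x) + g(x) \big\}, \tag{1}$$

where $g$ is a proper, closed and convex function, and $f$ is a convex function defined by the following max-structure:

$$f(x) := \max_{u\in\mathbb{R}^n}\big\{ \langle x, Au\rangle - \varphi(u) : u\in\mathcal{U} \big\}. \tag{2}$$

Here, $\varphi$ is a proper, closed and convex function, $\mathcal{U}$ is a nonempty, closed, and convex set in $\mathbb{R}^n$, and $A\in\mathbb{R}^{p\times n}$ is given.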

Clearly, any proper, closed and convex function $f$ can be written as (2) using its Fenchel conjugate $f^*$, i.e., $f(x) = \max_u\{\langle x, u\rangle - f^*(u) : u\in\mathrm{dom}(f^*)\}$. Hence, the max-structure (2) does not restrict the applicability of the template (1). Moreover, (1) also directly models many practical applications in signal and image processing, machine learning, statistics and data sciences, see, e.g., Beck2009 ; BenTal2001 ; Boyd2011 ; Combettes2011a ; Nesterov2007 ; Parikh2013 ; Tran-Dinh2013a and the references quoted therein.

While the first term $f$ is nonsmooth, the second term $g$ remains unspecified. On the one hand, we can assume that $g$ is smooth and its gradient is Lipschitz continuous. On the other hand, $g$ can be nonsmooth, but equipped with a “tractable” proximity operator defined as follows: $g$ is said to be tractably proximal if its proximal operator

$$\mathrm{prox}_g(x) := \arg\min_{y}\big\{ g(y) + (1/2)\|y - x\|^2 : y\in\mathrm{dom}(g) \big\}, \tag{3}$$

can be computed “efficiently” (e.g., by a closed form or by polynomial time algorithms). In general, computing $\mathrm{prox}_g$ requires solving the strongly convex problem (3), but in many cases, this operator can be obtained in a closed form or by a low-cost polynomial algorithm. Examples of such convex functions can be found in the literature including Bauschke2011 ; Combettes2011a ; Parikh2013 .
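As a concrete instance of a tractably proximal function, the proximal operator (3) of a scaled $\ell_1$-norm has the well-known soft-thresholding closed form. The following is a minimal sketch, not taken from the paper; the function names `prox_l1` and `prox_objective` and the sample data are assumptions made for illustration.

```python
def prox_l1(x, lam):
    """Proximal operator of g(y) = lam * ||y||_1 at x (a list of floats).

    Solves argmin_y { lam * ||y||_1 + 0.5 * ||y - x||^2 } coordinate-wise,
    which yields the soft-thresholding formula sign(x_i) * max(|x_i| - lam, 0).
    """
    return [max(abs(xi) - lam, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]


def prox_objective(y, x, lam):
    """Objective of problem (3) for g = lam * ||.||_1 (illustrative helper)."""
    l1 = sum(abs(yi) for yi in y)
    quad = 0.5 * sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return lam * l1 + quad


# Hypothetical sample point, chosen only to exercise all three regimes.
x = [3.0, -0.5, 1.2]
y_star = prox_l1(x, lam=1.0)
```

Entries with magnitude below `lam` are set to zero, and the remaining entries are shrunk toward zero by `lam`, so the cost is linear in the dimension.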

Solving nonsmooth convex optimization problems remains challenging, especially when neither of the two nonsmooth terms $f$ and $g$ is equipped with a tractable proximity operator. Existing nonsmooth convex optimization approaches such as subgradient-type descent algorithms, dual averaging strategies, bundle-level techniques or derivative-free methods are often used to solve general nonsmooth convex problems. However, these methods suffer from a slow $\mathcal{O}(1/\sqrt{k})$ convergence rate (resp., $\mathcal{O}(1/\varepsilon^2)$-worst-case iteration-complexity). In addition, they are sensitive to algorithmic parameters such as stepsizes Nesterov2004 .

In his pioneering work Nesterov2005c , Nesterov showed that one can solve the nonsmooth structured convex minimization problem (1) within $\mathcal{O}(1/\varepsilon)$ iterations. This method combines a proximity smoothing technique and Nesterov’s accelerated gradient scheme Nesterov1983 to achieve the optimal worst-case iteration-complexity, which is much better than the $\mathcal{O}(1/\varepsilon^2)$-worst-case iteration-complexity of nonsmooth optimization methods.

Motivated by Nesterov2005c , Nesterov and many other researchers have proposed different algorithms using such a proximity smoothing method to solve other problems, to improve Nesterov’s original algorithm, or to customize his algorithm to specific applications, see, e.g., baes2009smoothing ; Becker2011b ; Becker2011a ; chen2014first ; Goldfarb2012 ; Necoara2008 ; Nedelcu2014 ; Nesterov2005d ; Nesterov2007d ; TranDinh2012a . In Beck2012a , Beck and Teboulle generalized Nesterov’s smoothing technique to a generic framework, where they discussed the advantages and disadvantages of smoothing techniques. In addition, they compared the numerical efficiency of smoothing techniques and proximal-type methods. In argyriou2014hybrid ; orabona2012prisma , the authors studied smoothing techniques for the sum of three convex functions, where one term has a Lipschitz gradient, while the others are nonsmooth. In boct2012variable , a variable smoothing method was proposed, which possesses the $\mathcal{O}(\ln(k)/k)$-convergence rate. This convergence rate is worse than the one in Nesterov2005c . However, as a compensation, the smoothness parameter is updated at each iteration. In addition, their method uses special quadratic proximity functions, and smooths both $f$ and $g$ under a Lipschitz continuity assumption on them.

In Nesterov2005d , Nesterov introduced an excessive gap technique, which requires both primal and dual schemes using two smoothness parameters. It symmetrically updates one parameter at each iteration. Nevertheless, this method uses different assumptions than our method. Other primal-dual methods studied in, e.g., Bot2013 ; Devolder2012 use double smoothing techniques to solve (1), but only achieve -worst-case iteration-complexity.

Our approach in this paper is also based on Nesterov’s smoothing technique in Nesterov2005c . To clarify the differences between our method and Nesterov2005d ; Nesterov2005c , let us first briefly present Nesterov’s smoothing technique in Nesterov2005c applied to (1).

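Recall that a convex function $b_{\mathcal{U}}$ is a proximity function of $\mathcal{U}$ if it is continuous, and strongly convex with the convexity parameter $\mu_b > 0$ and $\mathcal{U}\subseteq\mathrm{dom}(b_{\mathcal{U}})$. We define

$$\bar{u}^c := \arg\min_{u}\big\{ b_{\mathcal{U}}(u) : u\in\mathcal{U} \big\} \quad\text{and}\quad D_{\mathcal{U}} := \sup_{u}\big\{ b_{\mathcal{U}}(u) : u\in\mathcal{U} \big\} \in [0, +\infty).$$

Here, $\bar{u}^c$ and $D_{\mathcal{U}}$ are called the prox-center and prox-diameter of $\mathcal{U}$ w.r.t. $b_{\mathcal{U}}$, respectively. Without loss of generality, we can assume that $\mu_b = 1$ and $b_{\mathcal{U}}(\bar{u}^c) = 0$. Otherwise, we just rescale and shift $b_{\mathcal{U}}$.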

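As shown in Nesterov2005c , given $b_{\mathcal{U}}$ and $\gamma > 0$, we can approximate $f$ by $f_\gamma$ as

$$f_\gamma(x) := \max_{u}\big\{ \langle x, Au\rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u\in\mathcal{U} \big\}, \tag{4}$$

where $\gamma$ is called a smoothness parameter. Since $f_\gamma$ is smooth and has a Lipschitz gradient, one can apply accelerated proximal gradient methods Beck2009 ; Nesterov2007 to minimize the sum $F_\gamma := f_\gamma + g$. Using such methods, we can eventually guarantee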

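$$F(x^k) - F^\star \le \min_{\gamma > 0}\left\{ \frac{2\|A\|^2 R_0^2}{\gamma (k+1)^2} + \gamma D_{\mathcal{U}} \right\} = \frac{2\sqrt{2}\,\|A\| R_0 \sqrt{D_{\mathcal{U}}}}{k+1}, \tag{5}$$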

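where $\{x^k\}$ is the underlying sequence generated by the accelerated proximal-gradient method, see Nesterov2005c , and $R_0 := \|x^0 - x^\star\|$. To achieve an $\varepsilon$-solution $x^k$ such that $F(x^k) - F^\star \le \varepsilon$, we set $\gamma$ at its optimal value in (5). Hence, the algorithm requires at most $\mathcal{O}\big(\|A\| R_0\sqrt{D_{\mathcal{U}}}/\varepsilon\big)$ iterations.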
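For completeness, the closed form on the right-hand side of (5) follows from a one-dimensional minimization. The shorthand $a$, $b$ and $h$ below is introduced only for this illustration, and $D_{\mathcal{U}} > 0$ is assumed:

```latex
\begin{aligned}
h(\gamma) &:= \frac{a}{\gamma} + b\,\gamma,
  \qquad a := \frac{2\|A\|^2 R_0^2}{(k+1)^2}, \quad b := D_{\mathcal{U}}, \\
h'(\gamma) &= -\frac{a}{\gamma^2} + b = 0
  \;\Longleftrightarrow\;
  \gamma^\ast = \sqrt{a/b} = \frac{\sqrt{2}\,\|A\| R_0}{(k+1)\sqrt{D_{\mathcal{U}}}}, \\
h(\gamma^\ast) &= 2\sqrt{ab} = \frac{2\sqrt{2}\,\|A\| R_0 \sqrt{D_{\mathcal{U}}}}{k+1}.
\end{aligned}
```

Requiring $h(\gamma^\ast) \le \varepsilon$ gives $k + 1 \ge 2\sqrt{2}\,\|A\| R_0\sqrt{D_{\mathcal{U}}}/\varepsilon$, and substituting the smallest such $k+1$ into $\gamma^\ast$ yields $\gamma^\ast = \varepsilon/(2D_{\mathcal{U}})$. This makes explicit that the optimal smoothness parameter in this fixed-$\gamma$ scheme depends on both $\varepsilon$ and $D_{\mathcal{U}}$.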

#### Our approach:

The original smoothing algorithm in Nesterov2005c has three computational disadvantages even with the optimal choice of $\gamma$.

• It requires the prox-diameter $D_{\mathcal{U}}$ of $\mathcal{U}$ to determine $\gamma$, which may be expensive to estimate when $\mathcal{U}$ is complicated.

• If $\varepsilon$ is small and $D_{\mathcal{U}}$ is large, then $\gamma$ is small, and hence, the strong convexity parameter of (4) is small. Algorithms for solving (4) then converge slowly.

• The Lipschitz constant of $\nabla f_\gamma$ is $\|A\|^2/\gamma$, which is large. This leads to a small step-size of order $\gamma/\|A\|^2$ in the accelerated proximal-gradient algorithm and hence, can lead to slow convergence.

Our approach is briefly presented as follows. We first choose a smooth proximity function $b_{\mathcal{U}}$ instead of a general one. We assume that $\nabla b_{\mathcal{U}}$ is Lipschitz continuous with the Lipschitz constant $L_b \ge 1$. Then, we define $f_\gamma$ as in (4), which is a smoothed approximation to $f$ as above.

We design a smoothing accelerated proximal-gradient algorithm that updates $\gamma$ from $\gamma_k$ to $\gamma_{k+1}$ at each iteration by performing only one accelerated proximal-gradient step Beck2009 ; Nesterov2007 to minimize the sum $F_{\gamma_{k+1}} := f_{\gamma_{k+1}} + g$ for each value $\gamma_{k+1}$ of $\gamma$. We prove that the sequence of the objective residuals, $\{F(x^k) - F^\star\}$, converges to zero up to the $\mathcal{O}(1/k)$-rate.

#### Our contributions:

Our main contributions can be summarized as follows:

• We propose using a smooth proximity function to smooth the max-structure objective function in (2), and develop a new smoothing algorithm, Algorithm 1, based on the accelerated proximal-gradient method to adaptively update the smoothness parameter in a heuristic-free fashion.

• We prove up to the $\mathcal{O}(1/\varepsilon)$-worst-case iteration-complexity for our algorithm as in Nesterov2005c to achieve an $\varepsilon$-solution, i.e., $F(x^k) - F^\star \le \varepsilon$. In particular, with the quadratic proximity function $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$, our algorithm achieves exactly the $\mathcal{O}(1/\varepsilon)$-worst-case iteration-complexity as in Nesterov2005c .

• We customize our algorithm to handle four important special cases that have a great practical impact in many applications.

• We specify our algorithm to solve constrained convex minimization problems, and propose an averaging scheme to recover an approximate primal solution with a rigorous convergence guarantee.

From a practical point of view, we believe that the proposed algorithm can overcome the three disadvantages of the original smoothing algorithm in Nesterov2005c mentioned previously. However, our condition on the choice of proximity functions may limit the ability of the proposed algorithm to further exploit the structure of the constraint set $\mathcal{U}$. Fortunately, we can identify several important settings in Section 4, where we can eliminate this disadvantage. Such classes of problems cover several applications in image processing, compressive sensing, and monotropic programming Bauschke2011 ; Combettes2011a ; Parikh2013 ; Yang2011 .

#### Paper organization:

The rest of this paper is organized as follows. Section 2 briefly discusses our smoothing technique. Section 3 presents our main algorithm, Algorithm 1, and proves its convergence guarantee. Section 4 handles four special but important cases of (1). Section 5 specializes our algorithm to solve constrained convex minimization problems. Preliminary numerical examples are given in Section 6. For clarity of presentation, we move the long and technical proofs to the appendix.

#### Notation and terminology:

We work on the real spaces $\mathbb{R}^p$ and $\mathbb{R}^n$, equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and the Euclidean $\ell_2$-norm $\|\cdot\|$. Given a proper, closed, and convex function $f$, we use $\mathrm{dom}(f)$ and $\partial f(x)$ to denote its domain and its subdifferential at $x$, respectively. If $f$ is differentiable, then $\nabla f(x)$ stands for its gradient at $x$.

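We denote by $f^*(s) := \sup\{\langle s, x\rangle - f(x) : x\in\mathrm{dom}(f)\}$ the Fenchel conjugate of $f$. For a given set $\mathcal{X}$, $\delta_{\mathcal{X}}(x) := 0$ if $x\in\mathcal{X}$ and $\delta_{\mathcal{X}}(x) := +\infty$, otherwise, defines the indicator function of $\mathcal{X}$. For a smooth function $f$, we say that $f$ is $L_f$-smooth if for any $x, \tilde{x}\in\mathrm{dom}(f)$, we have $\|\nabla f(x) - \nabla f(\tilde{x})\| \le L_f\|x - \tilde{x}\|$, where $L_f\in[0, +\infty)$. We denote by $\mathcal{F}_L^{1,1}$ the class of all $L_f$-smooth and convex functions $f$. We also use $\mu_f$ for the strong convexity parameter of a convex function $f$. For a given symmetric matrix $X$, $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its smallest and largest eigenvalues, respectively; and $\lambda_{\max}(X)/\lambda_{\min}(X)$ is the condition number of $X$. Given a nonempty, closed and convex set $\mathcal{X}$, $\mathrm{dist}(u, \mathcal{X})$ denotes the distance from $u$ to $\mathcal{X}$.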

## 2 Smoothing techniques via smooth proximity functions

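Let $b_{\mathcal{U}}$ be a prox-function of the nonempty, closed and convex set $\mathcal{U}$ with the strong convexity parameter $\mu_b = 1$. In addition, $b_{\mathcal{U}}$ is smooth on $\mathcal{U}$, and its gradient $\nabla b_{\mathcal{U}}$ is Lipschitz continuous with the Lipschitz constant $L_b \ge \mu_b = 1$. In this case, $b_{\mathcal{U}}$ is said to be $L_b$-smooth. As a default example, $b_{\mathcal{U}}(u) := (1/2)\|u - \bar{u}^c\|^2$ for fixed $\bar{u}^c\in\mathcal{U}$ satisfies our assumptions with $L_b = \mu_b = 1$. Let $\bar{u}^c$ be the $b$-prox-center point of $\mathcal{U}$, i.e., $\bar{u}^c := \arg\min_u\{b_{\mathcal{U}}(u) : u\in\mathcal{U}\}$. Without loss of generality, we can assume that $b_{\mathcal{U}}(\bar{u}^c) = 0$. Otherwise, we consider $b_{\mathcal{U}}(\cdot) - b_{\mathcal{U}}(\bar{u}^c)$.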

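Given a convex function $\varphi^*(z) := \max_u\{\langle z, u\rangle - \varphi(u) : u\in\mathcal{U}\}$, we define a smoothed approximation of $\varphi^*$ as

$$\varphi^*_\gamma(z) := \max_{u\in\mathcal{U}}\big\{ \langle z, u\rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) \big\}, \tag{6}$$

where $\gamma > 0$ is a smoothness parameter. We note that $\varphi^*$ is not the Fenchel conjugate of $\varphi$ unless $\mathcal{U} = \mathrm{dom}(\varphi)$. We denote by $u^*_\gamma(z)$ the unique optimal solution of the strongly concave maximization problem (6), i.e.:

$$u^*_\gamma(z) \in \arg\max_{u}\big\{ \langle z, u\rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u\in\mathcal{U} \big\}. \tag{7}$$

We also define $D_{\mathcal{U}} := \sup\{b_{\mathcal{U}}(u) : u\in\mathcal{U}\}$, the $b$-prox diameter of $\mathcal{U}$. If $\mathcal{U}$ or $\mathrm{dom}(\varphi)$ is bounded, then $D_{\mathcal{U}}\in[0, +\infty)$.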

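Associated with $\varphi^*_\gamma$, we consider a smoothed function for $f$ in (2) as

$$f_\gamma(x) := \varphi^*_\gamma(A^\top x) = \max_{u}\big\{ \langle A^\top x, u\rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u\in\mathcal{U} \big\}. \tag{8}$$

Then, the following lemma summarizes the properties of the smoothed function $\varphi^*_\gamma$ defined by (6) and $f_\gamma$ defined by (8), whose proof can be found in Tran-Dinh2014a .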

###### Lemma 1

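The function $\varphi^*_\gamma$ defined by (6) is convex and smooth. Its gradient is given by $\nabla\varphi^*_\gamma(z) = u^*_\gamma(z)$, which is Lipschitz continuous with the Lipschitz constant $1/\gamma$. Consequently, for any $z, \bar{z}\in\mathbb{R}^n$, we have

$$\frac{\gamma}{2}\|u^*_\gamma(z) - u^*_\gamma(\bar{z})\|^2 \le \varphi^*_\gamma(z) - \varphi^*_\gamma(\bar{z}) - \langle\nabla\varphi^*_\gamma(\bar{z}), z - \bar{z}\rangle \le \frac{1}{2\gamma}\|z - \bar{z}\|^2. \tag{9}$$

For fixed $z\in\mathbb{R}^n$, $\varphi^*_\gamma(z)$ is convex w.r.t. $\gamma\in\mathbb{R}_{++}$, and

$$\varphi^*_\gamma(z) - (\hat{\gamma} - \gamma)\, b_{\mathcal{U}}(u^*_\gamma(z)) \le \varphi^*_{\hat{\gamma}}(z), \quad \forall\gamma, \hat{\gamma}\in\mathbb{R}_{++}. \tag{10}$$

As a consequence, $f_\gamma$ defined by (8) is convex and smooth. Its gradient is given by $\nabla f_\gamma(x) = A u^*_\gamma(A^\top x)$, which is Lipschitz continuous with the Lipschitz constant $\|A\|^2/\gamma$. In addition, we also have

$$f_\gamma(x) \le f(x) \le f_\gamma(x) + \gamma D_{\mathcal{U}}, \quad \forall x\in\mathbb{R}^p. \tag{11}$$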

We emphasize that Lemma 1 provides key properties to analyze the complexity of our algorithm in the next sections.
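To make the sandwich bound (11) concrete, consider the illustrative choice $f(x) = \|x\|_1 = \max\{\langle x, u\rangle : \|u\|_\infty \le 1\}$, i.e., $A = I$, $\varphi = 0$ and $\mathcal{U} = [-1, 1]^p$, smoothed with $b_{\mathcal{U}}(u) = \frac{1}{2}\|u\|^2$, so that $D_{\mathcal{U}} = p/2$. This example is not taken from the paper; the function names and test point below are assumptions. Under these choices $u^*_\gamma(x)$ is a coordinate-wise clipping and $f_\gamma$ is the Huber function.

```python
def u_star(x, gamma):
    """Maximizer in (7) for phi = 0, U = [-1, 1]^p, b_U(u) = 0.5 * ||u||^2."""
    return [min(1.0, max(-1.0, xi / gamma)) for xi in x]


def f_smooth(x, gamma):
    """Smoothed function f_gamma(x) = <x, u*> - (gamma / 2) * ||u*||^2."""
    u = u_star(x, gamma)
    inner = sum(xi * ui for xi, ui in zip(x, u))
    return inner - 0.5 * gamma * sum(ui * ui for ui in u)


def f_exact(x):
    """Original nonsmooth function f(x) = ||x||_1."""
    return sum(abs(xi) for xi in x)


# Hypothetical test point; D_U = p / 2 for the box and the quadratic prox-function.
x = [2.0, -0.3, 0.0, 0.7]
gamma = 0.5
D_U = len(x) / 2.0
lower = f_smooth(x, gamma)
upper = lower + gamma * D_U
```

Here `lower <= f_exact(x) <= upper` is exactly (11), and the gap `gamma * D_U` shrinks linearly as the smoothness parameter decreases.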

## 3 The adaptive smoothing algorithm and its convergence

Associated with (1), we consider its smoothed composite convex problem as

$$F^\star_\gamma := \min_{x\in\mathbb{R}^p}\big\{ F_\gamma(x) := f_\gamma(x) + g(x) \big\}. \tag{12}$$

Similar to Nesterov2005c , the main step of Nesterov’s accelerated proximal-gradient scheme Beck2009 ; Nesterov2007 applied to the smoothed problem (12) is expressed as follows:

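$$x^{k+1} := \mathrm{prox}_{\beta g}\big(\hat{x}^k - \beta\nabla f_\gamma(\hat{x}^k)\big) \equiv \arg\min_{x\in\mathbb{R}^p}\Big\{ g(x) + \frac{1}{2\beta}\big\| x - \big(\hat{x}^k - \beta A u^*_\gamma(A^\top\hat{x}^k)\big)\big\|^2 \Big\}, \tag{13}$$

where $\hat{x}^k\in\mathbb{R}^p$ is given, and $\beta > 0$ is a given step size, which will be chosen later.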

The following lemma provides a descent property of the proximal-gradient step (13), whose proof can be found in Appendix A.1.

###### Lemma 2

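Let $x^{k+1}$ be generated by (13). Then, for any $x\in\mathbb{R}^p$, we have

$$F_\gamma(x^{k+1}) \le \hat{\ell}^k_\gamma(x) + \frac{1}{\beta}\langle x^{k+1} - \hat{x}^k, x - \hat{x}^k\rangle - \frac{1}{2}\Big(\frac{2}{\beta} - \frac{\|A\|^2}{\gamma}\Big)\|\hat{x}^k - x^{k+1}\|^2, \tag{14}$$

where

$$\hat{\ell}^k_\gamma(x) := f_\gamma(\hat{x}^k) + \langle\nabla f_\gamma(\hat{x}^k), x - \hat{x}^k\rangle + g(x) \le F_\gamma(x) - \frac{\gamma}{2}\|u^*_\gamma(A^\top x) - u^*_\gamma(A^\top\hat{x}^k)\|^2. \tag{15}$$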

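We now adopt the accelerated proximal-gradient scheme (FISTA) in Beck2009 to solve (12) using an adaptive step-size $\beta_{k+1}$, which becomes

$$\begin{cases} \hat{x}^k := (1 - \tau_k)x^k + \tau_k\tilde{x}^k \\ x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big(\hat{x}^k - \beta_{k+1}\nabla f_{\gamma_{k+1}}(\hat{x}^k)\big) \\ \tilde{x}^{k+1} := \tilde{x}^k - \frac{1}{\tau_k}(\hat{x}^k - x^{k+1}), \end{cases} \tag{16}$$

where $\gamma_{k+1} > 0$ is the smoothness parameter, and $\tau_k\in(0, 1]$.

By letting $t_k := 1/\tau_k$, we can eliminate $\tilde{x}^k$ in (16) to obtain a compact version

$$\begin{cases} x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big(\hat{x}^k - \beta_{k+1}\nabla f_{\gamma_{k+1}}(\hat{x}^k)\big) \\ \hat{x}^{k+1} := x^{k+1} + \frac{t_k - 1}{t_{k+1}}(x^{k+1} - x^k). \end{cases} \tag{17}$$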

The following lemma provides a key estimate to prove the convergence of the scheme (16) (or (17)), whose proof can be found in Appendix A.2.

###### Lemma 3

Let $\{(x^k, \hat{x}^k, \tilde{x}^k)\}$ be the sequence generated by (16). Then

 (18)

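for any $x\in\mathbb{R}^p$, where $R_k$ is defined by

$$R_k := \tau_k\gamma_{k+1} b_{\mathcal{U}}\big(u^*_{\gamma_{k+1}}(A^\top\hat{x}^k)\big) - (1 - \tau_k)(\gamma_k - \gamma_{k+1}) b_{\mathcal{U}}\big(u^*_{\gamma_{k+1}}(A^\top x^k)\big) + \frac{(1 - \tau_k)\gamma_{k+1}}{2}\big\|u^*_{\gamma_{k+1}}(A^\top x^k) - u^*_{\gamma_{k+1}}(A^\top\hat{x}^k)\big\|^2. \tag{19}$$

Moreover, the quantity $R_k$ is bounded from below by

$$R_k \ge \frac{1}{2}(1 - \tau_k)\big[\gamma_{k+1}\tau_k - L_b(\gamma_k - \gamma_{k+1})\big] b_{\mathcal{U}}\big(u^*_{\gamma_{k+1}}(A^\top x^k)\big). \tag{20}$$

Next, we show one possibility for updating $\tau_k$ and $\gamma_k$, and provide an upper bound for $F_{\gamma_{k+1}}(x^{k+1}) - F^\star$. The proof of this lemma is moved to Appendix A.3.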

###### Lemma 4

Let us choose , , and an arbitrary constant . If the parameters and are updated by

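$$\tau_k := \frac{1}{k + \bar{c}} \quad\text{and}\quad \gamma_{k+1} := \frac{\gamma_1\bar{c}}{k + \bar{c}}, \tag{21}$$

then the quantity $R_k$ defined by (19) and the parameters $\tau_k$ and $\gamma_k$ satisfy

$$\frac{\gamma_{k+1}}{\tau_k^2}R_k \ge -\frac{\gamma_1^2\bar{c}^2\big[(L_b - 1)(k + \bar{c}) + 1\big]}{(k + \bar{c})^2}D_{\mathcal{U}} \quad\text{and}\quad \frac{(1 - \tau_k)\gamma_{k+1}}{\tau_k^2} = \frac{\gamma_k}{\tau_{k-1}^2}. \tag{22}$$

Moreover, the following estimate holds

$$F_{\gamma_{k+1}}(x^{k+1}) - F^\star \le \frac{\tau_k^2}{\gamma_{k+1}}\Big[\frac{(1 - \tau_0)\gamma_1}{\tau_0^2}\big(F_{\gamma_0}(x^0) - F^\star\big) + \frac{\|A\|^2}{2}\|x^0 - x^\star\|^2 + S_k D_{\mathcal{U}}\Big], \tag{23}$$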

where

 (24)

In particular, if we choose such that , then .
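The second identity in (22) can be checked directly from the update rule (21). A short verification, using only the definitions of $\tau_k$ and $\gamma_{k+1}$ in (21) and valid for $k \ge 1$:

```latex
\begin{aligned}
\frac{(1-\tau_k)\,\gamma_{k+1}}{\tau_k^2}
  &= \frac{k+\bar{c}-1}{k+\bar{c}} \cdot \frac{\gamma_1\bar{c}}{k+\bar{c}} \cdot (k+\bar{c})^2
   = \gamma_1\bar{c}\,(k+\bar{c}-1), \\
\frac{\gamma_k}{\tau_{k-1}^2}
  &= \frac{\gamma_1\bar{c}}{k-1+\bar{c}} \cdot (k-1+\bar{c})^2
   = \gamma_1\bar{c}\,(k+\bar{c}-1).
\end{aligned}
```

Both sides coincide, which is presumably what allows the per-iteration estimates to telescope into the bound (23).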

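By (21), we have $t_k = 1/\tau_k = k + \bar{c}$, so the second line of (17) reduces to $\hat{x}^{k+1} := x^{k+1} + \frac{k + \bar{c} - 1}{k + \bar{c} + 1}(x^{k+1} - x^k)$. Substituting this step into (17) and combining the result with the update rule (21), we can present our algorithm for solving (1) as in Algorithm 1.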
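Since the listing of Algorithm 1 is not reproduced in this text, the following is only a rough sketch of the scheme (16) combined with the update rule (21), not the authors' implementation. It assumes $\bar{c} = 1$, the step size $\beta_{k+1} = \gamma_{k+1}/\|A\|^2$ (the reciprocal of the Lipschitz constant in Lemma 1), and a toy problem $F(x) = \|x - c\|_1 + \lambda\|x\|_1$ with $A = I$, $\varphi(u) = \langle c, u\rangle$, $\mathcal{U} = [-1, 1]^p$ and $b_{\mathcal{U}}(u) = \frac{1}{2}\|u\|^2$. All function names and data are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft-thresholding)."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def objective(x, c, lam):
    """F(x) = ||x - c||_1 + lam * ||x||_1 (toy instance of problem (1))."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c)) + lam * sum(abs(xi) for xi in x)


def adaptive_smoothing(c, lam, gamma1=1.0, c_bar=1.0, iters=200):
    """Sketch of scheme (16) with the parameter updates (21), assuming A = I."""
    x = [0.0] * len(c)       # x^0
    x_tilde = list(x)        # \tilde{x}^0 = x^0
    for k in range(iters):
        tau = 1.0 / (k + c_bar)                  # tau_k from (21)
        gamma = gamma1 * c_bar / (k + c_bar)     # gamma_{k+1} from (21)
        beta = gamma                             # assumed: gamma_{k+1} / ||A||^2 with ||A|| = 1
        x_hat = [(1.0 - tau) * xi + tau * ti for xi, ti in zip(x, x_tilde)]
        # Gradient of f_gamma at x_hat: u*_gamma = clip((x_hat - c) / gamma, -1, 1).
        grad = [min(1.0, max(-1.0, (xh - ci) / gamma)) for xh, ci in zip(x_hat, c)]
        # Proximal-gradient step on g = lam * ||.||_1.
        x_new = soft_threshold([xh - beta * gi for xh, gi in zip(x_hat, grad)], beta * lam)
        x_tilde = [ti - (xh - xn) / tau for ti, xh, xn in zip(x_tilde, x_hat, x_new)]
        x = x_new
    return x


# For 0 <= lam < 1 the minimizer of this toy problem is x* = c, so F* = lam * ||c||_1.
c = [1.0, -2.0, 3.0, -4.0, 5.0]
lam = 0.1
x_final = adaptive_smoothing(c, lam)
gap = objective(x_final, c, lam) - lam * sum(abs(ci) for ci in c)
```

Note that neither the prox-diameter $D_{\mathcal{U}}$ nor a target accuracy $\varepsilon$ enters the loop: the smoothness parameter and the step size are driven purely by the iteration counter.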

The following theorem proves the convergence of Algorithm 1 and estimates its worst-case iteration-complexity.

###### Theorem 3.1

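Let $\{x^k\}$ be the sequence generated by Algorithm 1 using $\bar{c} := 1$. Then, for $k \ge 1$, we have

$$F(x^k) - F^\star \le \frac{\|A\|^2\|x^0 - x^\star\|^2}{2\gamma_1 k} + \frac{3\gamma_1 D_{\mathcal{U}}}{k} + \frac{\gamma_1(L_b - 1)(\ln(k) + 1)D_{\mathcal{U}}}{k}. \tag{25}$$

If $b_{\mathcal{U}}$ is chosen so that $L_b = 1$, e.g., $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$, then (25) reduces to

$$F(x^k) - F^\star \le \frac{\|A\|^2\|x^0 - x^\star\|^2}{2\gamma_1 k} + \frac{3\gamma_1 D_{\mathcal{U}}}{k}, \quad (\forall k \ge 1). \tag{26}$$

Consequently, if we set $\gamma_1 := \frac{R_0\|A\|}{\sqrt{6D_{\mathcal{U}}}}$, which is independent of $\varepsilon$, then

$$F(x^k) - F^\star \le \frac{R_0\|A\|\sqrt{6D_{\mathcal{U}}}}{k} \quad (\forall k \ge 1), \tag{27}$$

where $R_0 := \|x^0 - x^\star\|$.

In this case, the worst-case iteration-complexity of Algorithm 1 to achieve an $\varepsilon$-solution $x^k$ to (1) such that $F(x^k) - F^\star \le \varepsilon$ is $\mathcal{O}\big(\|A\| R_0\sqrt{D_{\mathcal{U}}}/\varepsilon\big)$.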

###### Proof

From (21), we have . Using this bound and into (23) we get

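$$F_{\gamma_k}(x^k) - F^\star \le \frac{1}{\gamma_1 k}\Big[\frac{\|A\|^2}{2}\|x^0 - x^\star\|^2 + \frac{\gamma_1(1 - \tau_0)}{\tau_0^2}\big[F_{\gamma_0}(x^0) - F^\star\big]\Big] + \big(\gamma_1(L_b - 1)[\ln(k) + 1] + 2\gamma_1\big)\frac{D_{\mathcal{U}}}{k}.$$

By (11), we have $F(x^k) \le F_{\gamma_k}(x^k) + \gamma_k D_{\mathcal{U}}$, and $\gamma_k = \gamma_1/k$ by (21) with $\bar{c} = 1$. Substituting this inequality into the last estimate, and using $\tau_0 = 1$, we obtain (25).

If we choose $b_{\mathcal{U}}$ such that $L_b = 1$, e.g., $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$, then the term involving $L_b - 1$ vanishes (see (24)). Using this, it follows from (25) that $F(x^k) - F^\star \le \frac{\|A\|^2 R_0^2}{2\gamma_1 k} + \frac{3\gamma_1 D_{\mathcal{U}}}{k}$. By minimizing the right-hand side of this estimate w.r.t. $\gamma_1 > 0$, we have $\gamma_1 = \frac{R_0\|A\|}{\sqrt{6D_{\mathcal{U}}}}$ and hence, $F(x^k) - F^\star \le \frac{R_0\|A\|\sqrt{6D_{\mathcal{U}}}}{k}$, which is exactly (27). The last statement is a direct consequence of (27).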

For a general prox-function $b_{\mathcal{U}}$ with $L_b > 1$, Theorem 3.1 shows that the convergence rate of Algorithm 1 is $\mathcal{O}(\ln(k)/k)$, which is similar to boct2012variable . However, when $L_b$ is close to $1$, the last term in (25) is better than (boct2012variable, Theorem 1).

###### Remark 1

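Let $L_b = 1$ and let $\varepsilon > 0$ be the desired accuracy. Then, (27) shows that the maximum number of iterations in Algorithm 1 is at most $R_0\|A\|\sqrt{6D_{\mathcal{U}}}/\varepsilon$, which is of the same order, $\mathcal{O}(1/\varepsilon)$, as in (5) (with different factors, $\sqrt{6}$ and $2\sqrt{2}$).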

## 4 Exploiting structures for special cases

For a general smooth proximity function $b_{\mathcal{U}}$ with $L_b > 1$, we can achieve the $\mathcal{O}(\ln(k)/k)$ convergence rate. When $L_b = 1$, we obtain exactly the $\mathcal{O}(1/k)$ rate as in Nesterov2005c . In this section, we consider three special cases of (1) where we use the quadratic proximity function $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$. Then, we specify Algorithm 1 for the $L_g$-smooth objective function $g$ in (1).

### 4.1 Fenchel conjugate

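Let $f^*$ be the Fenchel conjugate of $f$. We can write $f$ in the form of (2) as

$$f(x) = \max_{u}\big\{ \langle x, u\rangle - f^*(u) : u\in\mathrm{dom}(f^*) \big\}.$$

We can smooth $f$ by using $b_{\mathcal{U}}(u) := (1/2)\|u\|^2$ as

$$f_\gamma(x) := \max_{u\in\mathrm{dom}(f^*)}\big\{ \langle x, u\rangle - f^*(u) - (\gamma/2)\|u\|^2 \big\} = \frac{\|x\|^2}{2\gamma} - {}^{\gamma^{-1}}\!f^*(\gamma^{-1}x),$$

where ${}^{\gamma}f$ is the Moreau envelope of a convex function $f$ with a parameter $\gamma$ Bauschke2011 . In this case, $u^*_\gamma(x) = \mathrm{prox}_{\gamma^{-1}f^*}(\gamma^{-1}x) = \gamma^{-1}\big(x - \mathrm{prox}_{\gamma f}(x)\big)$. Hence, $\nabla f_\gamma(x) = \gamma^{-1}\big(x - \mathrm{prox}_{\gamma f}(x)\big)$. The main step, Step 5, of Algorithm 1 becomes

$$x^{k+1} = \mathrm{prox}_{\gamma_{k+1} g}\big(\mathrm{prox}_{\gamma_{k+1} f}(\hat{x}^k)\big).$$

Hence, Algorithm 1 can be applied to solve (1) using the proximal operators of $f$ and $g$. The worst-case complexity bound in Theorem 3.1 becomes $\mathcal{O}\big(\|x^0 - x^\star\| D_{f^*}/\varepsilon\big)$, where $D_{f^*} := \max\{\|u\| : u\in\mathrm{dom}(f^*)\}$ is the diameter of $\mathrm{dom}(f^*)$.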

### 4.2 Composite convex minimization with linear operator

We consider the following composite convex problem with a linear operator that covers many important applications in practice, see, e.g., argyriou2014hybrid ; Bauschke2011 ; Combettes2011a :

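$$F^\star := \min_{x\in\mathbb{R}^p}\big\{ F(x) := f(Ax) + g(x) \big\}, \tag{28}$$

where $f$ and $g$ are two proper, closed and convex functions, and $A$ is a linear operator from $\mathbb{R}^p$ to $\mathbb{R}^n$.

We first write $f(Ax) = \max_u\{\langle Ax, u\rangle - f^*(u) : u\in\mathrm{dom}(f^*)\}$. Next, we choose a quadratic smoothing proximity function $b_{\mathcal{U}}(u) := (1/2)\|u - \bar{u}^c\|^2$ for fixed $\bar{u}^c\in\mathrm{dom}(f^*)$, and define $\mathcal{U} := \mathrm{dom}(f^*)$. Using this smoothing prox-function, we obtain a smoothed approximation of $f(Ax)$ as follows:

$$f_\gamma(Ax) := \max_{u}\big\{ \langle Ax, u\rangle - f^*(u) - (\gamma/2)\|u - \bar{u}^c\|^2 : u\in\mathrm{dom}(f^*) \big\}.$$

In this case, we can compute $u^*_\gamma(Ax)$ by using the proximal operator of $f^*$. By Fenchel-Moreau’s decomposition as above, we can compute $\mathrm{prox}_{\gamma^{-1}f^*}$ using the proximal operator of $f$. In this case, we can specify the proximal-gradient step (13) as

$$\begin{cases} \hat{u}^*_k := \mathrm{prox}_{\gamma_{k+1}^{-1}f^*}\big(\bar{u}^c + \gamma_{k+1}^{-1}A\hat{x}^k\big) = \bar{u}^c + \gamma_{k+1}^{-1}\big(A\hat{x}^k - \mathrm{prox}_{\gamma_{k+1}f}(\gamma_{k+1}\bar{u}^c + A\hat{x}^k)\big) \\ x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big(\hat{x}^k - \beta_{k+1}A^\top\hat{u}^*_k\big), \end{cases}$$

where $\beta_{k+1}$ is the step size in (16). Using this proximal-gradient step in Algorithm 1, we still obtain the complexity as in Theorem 3.1, which is $\mathcal{O}(1/\varepsilon)$, where the domain of $f^*$ is assumed to be bounded.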
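The Fenchel-Moreau decomposition used above, $\mathrm{prox}_{\gamma^{-1}f^*}(z) = z - \gamma^{-1}\mathrm{prox}_{\gamma f}(\gamma z)$, can be sanity-checked numerically. The sketch below assumes the illustrative choice $f = \|\cdot\|_1$, for which $f^*$ is the indicator of the unit $\ell_\infty$-ball and its proximal operator is the projection onto that ball. The function names and test data are hypothetical, and $\bar{u}^c = 0$ is assumed.

```python
def prox_l1(v, t):
    """prox of t * ||.||_1: coordinate-wise soft-thresholding."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def proj_linf_ball(z):
    """prox of the indicator of {u : ||u||_inf <= 1}, i.e., projection (clipping)."""
    return [min(1.0, max(-1.0, zi)) for zi in z]


def prox_conj_via_moreau(z, gamma):
    """prox_{(1/gamma) f*}(z) computed only through prox_{gamma f}, with f = ||.||_1."""
    shrunk = prox_l1([gamma * zi for zi in z], gamma)
    return [zi - si / gamma for zi, si in zip(z, shrunk)]


# Hypothetical test vector covering points inside and outside the ball.
z = [2.0, 0.3, -1.5, -0.2]
gamma = 0.5
direct = proj_linf_ball(z)
via_moreau = prox_conj_via_moreau(z, gamma)
```

In the two-line update above with $\bar{u}^c = 0$, `prox_conj_via_moreau` evaluated at $z = \gamma_{k+1}^{-1}A\hat{x}^k$ plays the role of $\hat{u}^*_k$, so only the proximal operator of $f$ is needed in practice.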

### 4.3 The decomposable structure

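The function $\varphi$ and the set $\mathcal{U}$ in (2) are said to be decomposable if they can be represented as follows:

$$\varphi(u) := \sum_{i=1}^m \varphi_i(u_i), \quad\text{and}\quad \mathcal{U} := \mathcal{U}_1\times\cdots\times\mathcal{U}_m$$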