# Convergence guarantees for a class of non-convex and non-smooth optimization problems

We consider the problem of finding critical points of functions that are non-convex and non-smooth. Studying a fairly broad class of such problems, we analyze the behavior of three gradient-based methods (gradient descent, proximal update, and Frank-Wolfe update). For each of these methods, we establish rates of convergence for general problems, and also prove faster rates for continuous sub-analytic functions. We also show that our algorithms can escape strict saddle points for a class of non-smooth functions, thereby generalizing known results for smooth functions. Our analysis leads to a simplification of the popular CCCP algorithm, used for optimizing functions that can be written as a difference of two convex functions. Our simplified algorithm retains all the convergence properties of CCCP, along with a significantly lower cost per iteration. We illustrate our methods and theory via applications to the problems of best subset selection, robust estimation, mixture density estimation, and shape-from-shading reconstruction.

## 1 Introduction

Non-convex optimization problems arise frequently in statistical machine learning; examples include the use of non-convex penalties for enforcing sparsity [14, 28], non-convexity in likelihoods in mixture modeling [40], and non-convexity in neural network training [26]. Of course, minimizing a non-convex problem is NP-hard in general, but problems that arise in machine learning applications are not constructed in an adversarial manner. Moreover, there have been a number of recent papers demonstrating that all first (and/or second) order critical points have desirable properties for certain statistical problems (e.g. see the papers [28, 15]). Given results of this type, it is often sufficient to find critical points that are first-order (and possibly second-order) stationary. Accordingly, recent years have witnessed an explosion of research on different algorithms for non-convex problems, with the goal of trying to characterize the nature of their fixed points, and their convergence properties.

There is a lengthy literature on non-convex optimization, dating back more than six decades, and rapidly evolving in the present (e.g., see the books and papers [36, 17, 19, 23, 41, 24, 6, 32, 27, 10, 4, 16]). Perhaps the most straightforward approach to obtaining a first-order critical point is via gradient descent. Under suitable regularity conditions and step size choices, it can be shown that gradient descent can be used to compute first-order critical points. Moreover, with a random initialization and additional regularity conditions, gradient descent converges almost surely to a second-order stationary point (e.g., [24, 32]). These results, like much of the currently available theory for (sub)-gradient methods for non-convex problems, involve smoothness conditions on the underlying objectives. In practice, many machine learning problems have non-smooth components; examples include the hinge loss in support vector machines, the rectified linear unit in neural networks, and various types of matrix regularizers in collaborative filtering and recommender systems. Accordingly, a natural goal is to develop subgradient-based techniques that apply to a broader class of non-convex functions, allowing for non-smoothness.

The main contribution of this paper is to provide precisely such a set of techniques, along with non-asymptotic guarantees on their convergence rates. In particular, we study algorithms that can be used to obtain first-order (and in some cases, also second-order) optimal solutions to a relatively broad class of non-convex functions, allowing for non-smoothness in certain portions of the problem. For each sequence generated by one of our algorithms, we provide non-asymptotic bounds on the convergence rate of the gradient sequence $\{\nabla f(x^k)\}_{k \geq 0}$. Moreover, for functions that satisfy a form of the Kurdyka-Łojasiewicz inequality, we show that our methods achieve faster rates.

Our work has important points of contact with a recent line of papers on algorithms for non-convex and non-smooth problems, and we discuss a few of them here. Bolte et al. [6] developed a proximal-type algorithm applicable to objective functions formed as a sum of a smooth (possibly non-convex) function and a convex (possibly non-differentiable) function. Some recent work [39] extended these ideas and provided an analysis of block coordinate descent methods for non-convex functions. In other recent work, Hong et al. [18] provided an analysis of the ADMM method for non-convex problems. In a few recent papers [3, 38], the authors proposed a proximal-type method for non-convex functions that can be written as a sum of a smooth function, a concave continuous function and a convex lower semi-continuous function; we also analyze this class in one of our results (Theorem 2).

Our results also relate to another interesting sub-area of non-convex optimization, namely functions that can be represented as a difference of two convex functions, popularly known as DC functions. We refer the reader to the papers [36, 17, 23, 41] for more details on DC functions and their properties. One of the most popular DC optimization algorithms is the Convex Concave Procedure, or CCCP for short; see the papers [41, 27] for further details. This is a double loop algorithm that minimizes a convex relaxation of the non-convex objective function at each iteration. While the CCCP algorithm has some attractive convergence properties [23], it can be slow in many situations due to its double loop structure. One outcome of the analysis in this paper is a single-loop proximal-method that retains all the convergence guarantees of CCCP while—as shown in our experimental results—being much faster to run.

### Overview of our results

• Our first main result (Theorem 1) provides guarantees for a subgradient algorithm as applied to the minimization problem (2) defined over a closed convex set . We provide convergence bounds in terms of the Euclidean norm of the subgradient and show that our rates are unimprovable in general. We also illustrate some consequences of Theorem 1 by deriving a convergence rate for our algorithm when applied to non-smooth coercive functions; this result has interesting implications for polynomial programming. We also provide a simplification of the CCCP algorithm, along with convergence guarantees. In Corollary 3, we argue that our algorithm can escape strict saddle points for a large class of non-smooth functions, thereby generalizing known results for smooth functions.

• Our second main result (Theorem 2) provides convergence rates for a proximal-type algorithm for problem (1) (see below). In Section 5.3, we demonstrate how this proximal-type algorithm can be used to minimize a smooth convex function subject to a sparsity constraint. We demonstrate the performance of this algorithm through the example of best subset selection.

• In Theorem 3, we provide a Frank-Wolfe type algorithm for solving optimization problem (17), and we provide a rate of convergence in terms of the associated Frank-Wolfe gap.

• Finally, in Theorems 4 and 5, we prove that Algorithms 1 and 2, when applied to functions that satisfy a variant of the Kurdyka-Łojasiewicz inequality, have faster convergence rates. In particular, the convergence rate in terms of the gradient norm is at least $\mathcal{O}(1/k)$, whereas the worst case rate for general non-convex functions is $\mathcal{O}(1/\sqrt{k})$. We also provide examples of functions for which the convergence rate is of order $1/k^r$ (see Theorem 4). In Theorem 6, we characterize the class of functions that can be written as a difference of a smooth function and a differentiable convex function.

Section 5 is devoted to an illustration of our methods and theory via applications to the problems of best subset selection, robust estimation, mixture density estimation and shape-from-shading reconstruction.

##### Notation:

Given a set , we use to denote its interior. We use , and to denote the Euclidean norm, -norm and norm respectively, of a vector . We say that a continuously differentiable function is -smooth if the gradient is -Lipschitz continuous. In many examples considered in this paper, the objective function is a linear combination of a differentiable function and one or more convex functions and . With a slight abuse of notation, for a function , we refer to a vector of the form , where and , as a gradient of the function at point , and we denote it by ; here, and denote the subgradient sets of the convex functions and respectively. We say a point is a critical point of the function if . For a sequence , we define the running arithmetic mean as . Similarly, for a non-negative sequence , we use to denote the running geometric mean. Finally, for real-valued sequences and , we say , if there exists a positive constant , which is independent of , such that for all . We say if and .

## 2 Problem setup

In this paper, we study the problem of minimizing a non-convex and possibly non-smooth function over a closed convex set. More precisely, we consider optimization problems of the form

$$\min_{x \in C} \Big\{ \underbrace{g(x) - h(x) + \varphi(x)}_{f(x)} \Big\}, \tag{1}$$

where the domain is a closed convex set. In all cases, we assume the function is bounded below over domain , and that the function is continuous and convex. Our aim is to derive algorithms for problem (1) for various types of functions and .

##### Structural assumption on functions g and h
• (a) Theorems 1 and 4 are based on the assumption that the function $g$ is continuously differentiable and smooth, and that the function $\varphi \equiv 0$.

• (b) In Theorems 2 and 5, we assume that the function $g$ is continuously differentiable and smooth, and that the function $\varphi$ is convex, proper and lower semi-continuous. (Taking the function $\varphi \equiv 0$ yields part (a) as a special case, but it is worthwhile to point out that the assumptions in Theorem 1 are weaker than the assumptions of Theorem 2. Furthermore, we can prove some interesting results about saddle points when the function $\varphi \equiv 0$; see Corollary 3.)

• (c) Theorem 3 focuses on the case in which the function $g$ is continuously differentiable, and the function $\varphi \equiv 0$.

The class of non-convex functions covered in part (a) includes, as a special case, the class of differences of convex (DC) functions, for which the first convex function is smooth and the second convex function is continuous. Note that we only put a mild assumption of continuity on the convex function , meaning that the difference function can be non-smooth and non-differentiable in general. In particular, for any continuously differentiable function and any smooth function , the difference function is non-smooth. Furthermore, if we take the function , then we recover the class of smooth functions as a special case.

## 3 Main results

Our main results are analyses of three algorithms for this class of non-convex non-smooth problems; in particular, we derive non-asymptotic bounds on their rates of convergence. The first algorithm is a (sub)-gradient-type method, and it is mainly suited for unconstrained optimization; the second algorithm is based on a proximal operator and can be applied to constrained optimization problems. The third algorithm is a Frank-Wolfe-type algorithm, which is also suitable for constrained optimization problems, but it applies to a more general class of non-convex optimization problems.

### 3.1 Gradient-type method

In this section, we analyze a (sub)-gradient-based method for solving a certain class of non-convex optimization problems. In particular, consider a pair of functions such that:

Assumption GR:

• The function $g$ is continuously differentiable and $M_g$-smooth.

• The function $h$ is continuous and convex.

• There is a closed convex set $C$ such that the difference function $f := g - h$ is bounded below on the set $C$.

Under these conditions, we then analyze the behavior of a (sub)-gradient method in application to the following problem

 (2)

With a slight abuse of notation, we refer to a vector of the form with — where denote the subgradient set of the convex function at the point — as a gradient of the function at the point .

In our analysis, we assume that the initial vector is chosen such that the associated level set

$$\mathcal{L}(f(x^0)) := \{ x \in \mathbb{R}^d \mid f(x) \leq f(x^0) \}$$

is contained within . This condition is standard in the analysis of non-convex optimization methods (e.g., see Nesterov and Polyak [30]). When , it holds trivially. With this set-up, we have the following guarantees on the convergence rate of Algorithm 1.

###### Theorem 1.

Under Assumption GR, any sequence produced by Algorithm 1 has the following properties:

1. Any limit point is a critical point of the function , and the sequence of function values is strictly decreasing and convergent.

2. For all $k = 0, 1, 2, \ldots$, we have

$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2\,(f(x^0) - f^*)}{\alpha\,(k+1)}. \tag{3}$$

See Appendix B.1 for a proof of this theorem.
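
The listing of Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of a (sub)-gradient method of the kind analyzed in Theorem 1. It assumes the update $x^{k+1} = x^k - \alpha(\nabla g(x^k) - u^k)$ with $u^k \in \partial h(x^k)$, which is the unprojected analogue of update (15), and a step size $\alpha \leq 1/M_g$. The toy objective, the starting point and all function names are illustrative assumptions, not taken from the paper.

```python
def sign(t):
    """Sign of t; sign(0) = 0 is a valid subgradient of |t| at 0."""
    return float((t > 0) - (t < 0))

def dc_subgradient_method(x0, grad_g, subgrad_h, alpha, iters):
    """Iterate x <- x - alpha * (grad_g(x) - u), with u a subgradient of h at x."""
    xs = [list(x0)]
    for _ in range(iters):
        x = xs[-1]
        d = [gi - ui for gi, ui in zip(grad_g(x), subgrad_h(x))]
        xs.append([xi - alpha * di for xi, di in zip(x, d)])
    return xs

# Toy objective f = g - h with g(x) = 0.5*||x||_2^2 (so M_g = 1) and h(x) = ||x||_1.
grad_g = lambda x: list(x)
subgrad_h = lambda x: [sign(t) for t in x]
f = lambda x: 0.5 * sum(t * t for t in x) - sum(abs(t) for t in x)
grad_f_sq = lambda x: sum((t - sign(t)) ** 2 for t in x)  # ||grad_g(x) - u||_2^2

alpha = 0.5          # assumed admissible because alpha <= 1/M_g
xs = dc_subgradient_method([3.0, -0.2], grad_g, subgrad_h, alpha, iters=20)
f_star = -1.0        # in dimension 2, f attains its minimum -1 at the points (+-1, +-1)
values = [f(x) for x in xs]
```

On this example the function values decrease along the iterates and the running average of the squared gradient norms stays below the right-hand side of bound (3), which is what the theorem predicts under the stated assumptions.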

### 3.2 Consequences for differentiable functions

In the special case when the function is convex and differentiable, Algorithm 1 reduces to an ordinary gradient descent on the difference function . However, note that the step size choice required in Algorithm 1 does not depend on the smoothness of the function ; consequently, the algorithm can be applied to objective functions that are not smooth. As a simple but concrete example, suppose that we wish to apply gradient descent to minimize the function , where is any -strongly convex and -smooth function, and is a given parameter. Classical guarantees on gradient descent, which require the smoothness of the function , would not apply here since the function itself is not smooth. However, Theorem 1 guarantees that standard gradient descent would converge for any step size .

More generally, given an arbitrary continuously differentiable function , we can define its effective smoothness constant as

$$M_f^* := \inf_{h} \big\{ L \mid (f + h) \text{ is } L\text{-smooth} \big\}, \tag{4}$$

where the infimum ranges over all convex and continuously differentiable functions . Suppose that this infimum is achieved by some function , then gradient descent on the function can be viewed as applying Algorithm 1 to the decomposition , where the function is guaranteed to be -smooth. To be clear, the algorithm itself does not need to know the decomposition , but the existence of the decomposition ensures the success of a backtracking procedure. Putting together the pieces, we arrive at the following consequence of Theorem 1:

###### Corollary 1.

Given a closed convex set , consider a continuously differentiable function with effective smoothness that is bounded below on . Then for any sequence obtained by applying the gradient update with step size , we have:

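$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2\,(f(x^0) - f^*)}{\alpha\,(k+1)}. \tag{5a}$$

Moreover, if we choose the step size by backtracking with parameter $\beta \in (0,1)$ (a detailed description of gradient descent with backtracking is provided in Algorithm 4), then for all $k = 0, 1, 2, \ldots$, we have

$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2 \max\{1, M_f^*\}\,(f(x^0) - f^*)}{\beta^2\,(k+1)}. \tag{5b}$$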

See Appendix B.2 for proof of the above corollary.

Let us reiterate that the advantage of backtracking gradient descent is that it works without knowledge of the scalar . The parameter mentioned in equation (5b) is the backtracking parameter and is a user defined fraction in the backtracking method (see Algorithm 4 for details). In particular, substituting in equation (5b) yields

$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{4 \max\{1, M_f^*\}\,(f(x^0) - f^*)}{k+1},$$

which differs from the rate obtained in equation (5a) only by a factor of two, and a possible multiple of .
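
Algorithm 4 is not reproduced in this text, so the following sketch should be read as one plausible form of backtracking gradient descent rather than the paper's procedure. It assumes an Armijo-type sufficient-decrease test and an initial trial step of 1, shrunk by the factor $\beta$; the test function, the tolerance and the iteration caps are hypothetical choices made for illustration.

```python
def backtracking_gd(f, grad, x0, beta=0.5, iters=200, tol=1e-10):
    """Gradient descent on a scalar function with Armijo-type backtracking (illustrative)."""
    x = float(x0)
    history = [f(x)]
    for _ in range(iters):
        g = grad(x)
        if abs(g) < tol:
            break
        alpha = 1.0
        # Shrink alpha by the factor beta until a sufficient-decrease test holds.
        for _ in range(100):
            if f(x - alpha * g) <= f(x) - 0.5 * alpha * g * g:
                break
            alpha *= beta
        x = x - alpha * g
        history.append(f(x))
    return x, history

# A coercive polynomial that is not globally smooth: f(x) = x^4 - 2x^2.
f = lambda x: x ** 4 - 2.0 * x ** 2
grad = lambda x: 4.0 * x ** 3 - 4.0 * x
x_final, history = backtracking_gd(f, grad, x0=2.0)
```

No smoothness constant is supplied to the routine; the line search finds an acceptable step on its own, which is the practical point made about backtracking above.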

#### 3.2.1 Consequences for coercive functions

As a consequence of Corollary 1, we can obtain a rate of convergence of the backtracking gradient descent algorithm (Algorithm 4) for a class of non-smooth coercive functions. Consider any twice continuously differentiable coercive function , which is bounded below. Recall that a function is coercive if

$$f(x^\ell) \to \infty \quad \text{for any sequence } \{x^\ell\}_{\ell \geq 0} \text{ such that } \|x^\ell\|_2 \to \infty. \tag{6}$$

Let denote the level set of the function at point . It can be verified that for any coercive function , the set is bounded above for all . This property ensures that for any descent algorithm and any starting point , the set of iterates obtained from the algorithm remains within a bounded set—viz. the level set in this case. Since the function is twice continuously differentiable, we have that is smooth over bounded set ; this fact ensures that has a finite effective smoothness constant in the set , which we denote by . Finally, note that Algorithm 4 is a descent algorithm; as a result, a simple application of Corollary 1 yields the following rate of convergence for the backtracking gradient descent algorithm (Algorithm 4):

###### Corollary 2.

Consider the unconstrained minimization problem of a twice continuously differentiable coercive function that is bounded below on . Then for any initial point , the sequence obtained by applying Algorithm 4 satisfies the following property:

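$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2 \max\{1, M^*_{f, x^0}\}\,(f(x^0) - f^*)}{\beta^2\,(k+1)} \quad \text{for all } k = 0, 1, 2, \ldots, \tag{7}$$

where $\beta$ is the backtracking parameter.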

##### Implications for polynomial programming:

Corollary 2 has useful implications for problems that involve minimizing polynomials. Such problems of polynomial programming arise in various applications, including phase retrieval and shape-from-shading [37], and we illustrate our algorithms for the latter application in Section 5.1. For minimization of a coercive polynomial, Corollary 2 shows that Algorithm 4 achieves a near-optimal rate.

It is worth noting that any even degree polynomial can be represented as a difference of convex (DC) functions; hence, such problems are amenable to DC optimization techniques like CCCP, which we discuss at more length in Section 3.3. However, obtaining a good DC decomposition, which is crucial to the success of CCCP, is often a formidable task. In particular, obtaining an optimal decomposition for a polynomial with degree greater than four is NP-hard; the main reason for this phenomenon is that deciding the convexity of an even degree polynomial with degree greater than four is NP-hard [1, 37]. Even for a fourth degree polynomial with dimension larger than three, there is no known algorithm for finding an optimal DC decomposition [2]. An advantage of Algorithm 4 is that it obviates the need to find a DC decomposition.

#### 3.2.2 Escaping strict saddle points

One of the obstacles with gradient-based continuous optimization methods is possible convergence to saddle points. Here we show that with a random initialization this undesirable outcome does not occur for the class of strict saddle points. Recall that for a twice differentiable function $f$, a point $x$ is called a strict saddle point of the function $f$ if $\lambda_{\min}(\nabla^2 f(x)) < 0$, where $\lambda_{\min}(\nabla^2 f(x))$ denotes the minimum eigenvalue of the Hessian matrix $\nabla^2 f(x)$. The following corollary shows that such saddle points are not troublesome:

###### Corollary 3.

Suppose that, in addition to the conditions on from Theorem 1, the functions are twice continuously differentiable. If Algorithm 1 is applied with step size , then the set of initial points for which it converges to a strict saddle point has measure zero.

See Appendix B.3 for the proof of this corollary.

We note that similar guarantees of avoidance of strict saddle points are known when the function is twice continuously differentiable and -smooth (e.g., [24, 32]). The novelty of Corollary 3 is that the same guarantee holds without imposing a smoothness condition on the entire function .
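
To make the measure-zero statement concrete, here is a small sketch on a hypothetical two-dimensional example that is not taken from the paper: $f(x, y) = \tfrac{1}{2}(x^2 + y^2) - \sqrt{1 + 4y^2}$, with $g(x, y) = \tfrac{1}{2}(x^2 + y^2)$ (so $M_g = 1$) and $h(x, y) = \sqrt{1 + 4y^2}$ convex and twice differentiable. The origin is a strict saddle point (Hessian eigenvalues $1$ and $-3$), and the only initial points attracted to it lie on the line $y = 0$. The step size $\alpha = 0.5$ is an assumed admissible choice.

```python
import math

def grad_f(x, y):
    """Gradient of f(x, y) = 0.5*(x^2 + y^2) - sqrt(1 + 4*y^2)."""
    return x, y - 4.0 * y / math.sqrt(1.0 + 4.0 * y * y)

def gradient_descent(x, y, alpha=0.5, iters=200):
    """Plain gradient steps, i.e. the assumed update with u = grad h(x, y)."""
    for _ in range(iters):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

on_manifold = gradient_descent(1.0, 0.0)   # starts on the measure-zero line y = 0
perturbed = gradient_descent(1.0, 1e-6)    # generic start, slightly off that line
y_min = math.sqrt(15.0) / 2.0              # nonzero critical value of y, where sqrt(1 + 4y^2) = 4
```

The first run converges to the saddle at the origin, while the slightly perturbed run moves away from it and settles at the local minimizer $(0, \sqrt{15}/2)$.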

### 3.3 Connections to the convex-concave procedure

As a consequence of Algorithm 1, we show that one can obtain a convergence rate of the Euclidean norm of the gradient for CCCP (convex-concave procedure), which is a heavily used algorithm in Difference of Convex (DC) optimization problems. Before doing so, let us provide a brief description of DC functions and the CCCP algorithm.

##### DC functions:

Given a convex set , we say that a function is DC if there exist convex functions and with domain such that . Note that the DC representation mentioned in the definition is not unique. In particular, for any convex function , we can write . The class of DC functions includes a large number of non-convex problems encountered in practice. Both convex and concave functions are DC in a trivial sense, and the class of DC functions remains closed under addition and subtraction. More interestingly, under mild restrictions on the domain, the class of non-zero DC functions is also closed under multiplication, division, and composition (see the papers [36, 17]). The maximum and minimum of a finite collection of DC functions are also DC functions.

##### Convex-concave procedure:

An interesting class of problems are those that involve minimizing a DC function over a closed convex set , i.e.

$$f^* := \min_{x \in C} f(x) = \min_{x \in C} \big\{ g(x) - h(x) \big\}, \tag{8}$$

where and are proper convex functions. The above problem has been studied intensively, and there are various methods for solving it; for instance, see the papers [36, 27, 34] and references therein for details. One of the most popular algorithms to solve problem (8) is the Convex-concave Procedure (CCCP), which was introduced by Yuille and Rangarajan [41]. The CCCP algorithm is a special case of a Majorization-Minimization algorithm, which uses the DC structure of the objective function in problem (8) to construct a convex majorant of the objective function at each step. We start with a feasible point . Let denote the iterate at iteration; at the iteration we construct a convex majorant of the function via

$$f(x) \leq g(x) - h(x^k) - \langle u^k, \, x - x^k \rangle =: q(x, x^k), \tag{9}$$

where $u^k \in \partial h(x^k)$, the subgradient set of the convex function $h$ at the point $x^k$. The next iterate $x^{k+1}$ is obtained by solving the convex program

$$x^{k+1} \in \arg\min_{x \in C} q(x, x^k). \tag{10}$$

The CCCP algorithm has some attractive convergence properties. For instance, it is a descent algorithm; when the function is strongly convex differentiable and the function is continuously differentiable, it can be shown [23] that any limit point of the sequence obtained from CCCP is stationary. Under the same assumptions, one can also verify that .

We now turn to an analysis of CCCP using the techniques that underlie Theorem 1. In the next proposition, we derive a rate of convergence of the gradient sequence and show that all limit points of the sequence are stationary. Earlier analyses of CCCP, including the papers [23, 41], are mainly based on the assumption of strong convexity of the function , whereas in the next proposition, we only assume that the function is -smooth. When the function is strongly convex, our analysis recovers the well-known convergence result in past work [23]. In particular, we show that CCCP enjoys the same rate of convergence as that of Algorithm 1.

###### Proposition 1.

Under Assumption GR and with the function being convex, the CCCP sequence (10) has the following properties:

1. Any limit point of the sequence is a critical point, and the sequence of function values is strictly decreasing and convergent.

2. Furthermore, for all , we have

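$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2 M_g\,(f(x^0) - f^*)}{k+1}, \tag{11a}$$

and assuming moreover that $g$ is $\mu$-strongly convex,

$$\operatorname{Avg}\big(\|x^k - x^{k+1}\|_2^2\big) \leq \frac{2\,(f(x^0) - f^*)}{\mu\,(k+1)}. \tag{11b}$$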

The proof of this proposition builds on the argument used for Theorem 1; see Appendix B.4 for details.

#### 3.3.1 Simplifying CCCP

Algorithm 1 provides an alternative procedure for minimizing a difference of convex functions when the first convex function is smooth. The benefit of Algorithm 1 over standard CCCP is that it is a single-loop algorithm, and it is expected to be faster than the standard double-loop CCCP algorithm in many situations. Furthermore, Algorithm 1 shares convergence guarantees similar to those of the standard CCCP algorithm.
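
The difference between the two loop structures can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the inner solver for the CCCP subproblem (10) is plain gradient descent, the single-loop update is the assumed step $x^{k+1} = x^k - \alpha(\nabla g(x^k) - u^k)$ with $\alpha = 1/M_g$, and the toy functions $g(x) = \tfrac{1}{2}\sum_i a_i x_i^2$ and $h(x) = \|x\|_1$ are hypothetical.

```python
def sign(t):
    return float((t > 0) - (t < 0))

A = [1.0, 4.0]                              # g(x) = 0.5 * sum(A_i * x_i^2), so M_g = 4
grad_g = lambda x: [a * t for a, t in zip(A, x)]
subgrad_h = lambda x: [sign(t) for t in x]  # h(x) = ||x||_1
M_g = max(A)

def cccp(x0, outer=30, inner=200):
    """Double loop: each outer step minimizes q(., x^k) = g(.) - <u^k, .> + const by inner GD."""
    x, grad_calls = list(x0), 0
    for _ in range(outer):
        u = subgrad_h(x)                    # linearize h once per outer iteration
        z = list(x)
        for _ in range(inner):
            z = [zi - (gi - ui) / M_g for zi, gi, ui in zip(z, grad_g(z), u)]
            grad_calls += 1
        x = z
    return x, grad_calls

def single_loop(x0, iters=200):
    """Single loop: one step x <- x - (grad_g(x) - u^k) / M_g per iteration."""
    x, grad_calls = list(x0), 0
    for _ in range(iters):
        u = subgrad_h(x)                    # re-linearize h at every step
        x = [xi - (gi - ui) / M_g for xi, gi, ui in zip(x, grad_g(x), u)]
        grad_calls += 1
    return x, grad_calls

x_cccp, calls_cccp = cccp([2.0, 2.0])
x_single, calls_single = single_loop([2.0, 2.0])
```

Both runs reach the same critical point $(1, 0.25)$ on this example; with the loop lengths chosen here, the single-loop variant uses far fewer gradient evaluations, although the exact ratio depends entirely on how accurately the inner problem is solved.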

### 3.4 Proximal-type method

We now turn to a more general class of optimization problems of the form

$$f^* := \min_{x \in \mathbb{R}^d} f(x) = \min_{x \in \mathbb{R}^d} \big\{ (g(x) - h(x)) + \varphi(x) \big\}. \tag{12}$$

We assume that the functions $g$, $h$ and $\varphi$ satisfy the following conditions:

##### Assumption PR
• The function $f$ is bounded below on $\mathbb{R}^d$.

• The function $g$ is continuously differentiable and $M_g$-smooth; the function $h$ is continuous and convex; and the function $\varphi$ is proper, convex and lower semi-continuous.

Typical examples of the function include , or the indicator of a closed convex set . Since for a general lower semi-continuous function , the sum-function is neither differentiable nor smooth, a gradient-based method cannot be applied. One way to minimize such functions is via a proximal-type algorithm, of which the following is an instance.

The proximal update in line 3 of Algorithm 2 is often easy to compute, and in many cases it has a closed-form solution (see Parikh and Boyd [33]). Let us now derive the rate of convergence of Algorithm 2.

###### Theorem 2.

Under Assumption PR, any sequence obtained from Algorithm 2 has the following properties:

1. Any limit point of the sequence is a critical point, and the sequence of function values is strictly decreasing and convergent.

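2. For all $k = 0, 1, 2, \ldots$, we have

$$\operatorname{Avg}\big(\|x^k - x^{k-1}\|_2^2\big) \leq \frac{2 \alpha\,(f(x^0) - f^*)}{k+1}. \tag{13a}$$

If moreover the function $h$ is $M_h$-smooth, then

$$\operatorname{Avg}\big(\|\nabla f(x^k)\|_2^2\big) \leq \frac{2 \alpha\, C_{M,\alpha}\,(f(x^0) - f^*)}{k+1}, \tag{13b}$$

where $C_{M,\alpha} = \big(M_g + M_h + \tfrac{1}{\alpha}\big)^2$.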

See Appendix C for the proof of the theorem.
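
The listing of Algorithm 2 is not reproduced in this text. As a sketch only, the code below assumes the update $x^{k+1} = \mathrm{prox}_{\alpha\varphi}\big(x^k - \alpha(\nabla g(x^k) - u^k)\big)$ with $u^k \in \partial h(x^k)$, which reduces to update (15) when $\varphi$ is the indicator of a closed convex set. The choices $g(x) = \tfrac{1}{2}\|x - b\|_2^2$, $h(x) = \mu\|x\|_1$, $\varphi(x) = \lambda\|x\|_1$ and all parameter values are hypothetical.

```python
def sign(t):
    return float((t > 0) - (t < 0))

def soft_threshold(v, tau):
    """Proximal operator of tau * |.|, applied to a scalar."""
    return sign(v) * max(abs(v) - tau, 0.0)

def prox_dc_step(x, b, lam, mu, alpha):
    """One assumed proximal-type step for g = 0.5*||x - b||^2, h = mu*||x||_1, phi = lam*||x||_1."""
    out = []
    for xi, bi in zip(x, b):
        u = mu * sign(xi)                            # a subgradient of h at xi
        v = xi - alpha * ((xi - bi) - u)             # forward (gradient-type) step
        out.append(soft_threshold(v, alpha * lam))   # backward (proximal) step
    return out

b, lam, mu, alpha = [3.0, 0.2], 1.0, 0.5, 1.0        # alpha <= 1/M_g with M_g = 1 (assumed rule)
x = [0.0, 0.0]
for _ in range(10):
    x = prox_dc_step(x, b, lam, mu, alpha)
```

Here $f(x) = \tfrac{1}{2}\|x - b\|_2^2 + (\lambda - \mu)\|x\|_1$ is in fact convex, so its minimizer is the soft-thresholded vector with threshold $\lambda - \mu = 0.5$, and the iteration settles at that point.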

The proof of Theorem 2 reveals that the smoothness condition on the function in Theorem 2 can be replaced by the local smoothness of , when the sequence is bounded. Note that the local smoothness condition is weaker than the global smoothness condition. For instance, any twice continuously differentiable function is locally smooth. The boundedness assumption on the iterates holds in many situations. For instance, if the function is coercive (6), then it follows that the iterates remain bounded. Another instance is when the function is the indicator function of a compact convex set. Finally, we point out that when the function is non-smooth but the proximal-function is smooth, the existing proof can be easily modified to obtain a rate of convergence of the gradient-norm .

A special case of Algorithm 2 is when $\varphi$ is equal to the indicator function of a closed convex set $\mathcal{X}$. Consider the following constrained optimization problem

$$f^* := \min_{x \in \mathcal{X}} \Big\{ \underbrace{g(x) - h(x)}_{f(x)} \Big\}, \tag{14}$$

where $\mathcal{X}$ is a closed convex set, the function $g$ is $M_g$-smooth, and the function $h$ is convex and continuous. Using Algorithm 2, the update equation in this case is given by

$$x^{k+1} = \Pi_{\mathcal{X}}\big(x^k - \alpha(\nabla g(x^k) - u^k)\big). \tag{15}$$

In projected-gradient-type methods, we should not expect a rate in terms of the gradient. In such cases, the projected gradient step may not be aligned with the gradient direction, or the step size may be arbitrarily small due to projection. Rather, an appropriate analogue of the gradient in this case is as follows:

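$$\nabla f_{\mathcal{X}}(x^k) = \frac{1}{\alpha}\Big(x^k - \Pi_{\mathcal{X}}\big(x^k - \alpha(\nabla g(x^k) - u^k)\big)\Big). \tag{16}$$

The analysis of the projected gradient method using $\nabla f_{\mathcal{X}}$ is standard in the optimization literature [8]. It is worth pointing out that the quantity $\nabla f_{\mathcal{X}}(x^k)$ is the analogue of the gradient in the constrained optimization setup, and coincides with the gradient in the unconstrained setup. Concretely, we have where , and . Combining equations (15) and (16) gives $\nabla f_{\mathcal{X}}(x^k) = (x^k - x^{k+1})/\alpha$; applying the bound (13a) from Theorem 2 and dividing by $\alpha^2$, we find that

$$\operatorname{Avg}\big(\|\nabla f_{\mathcal{X}}(x^k)\|_2^2\big) \leq \frac{2\,(f(x^0) - f^*)}{\alpha\,(k+1)}.$$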
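
As a small numerical illustration of definition (16), the sketch below evaluates the gradient analogue for a box constraint. The box, the point `x` and the direction `d` (standing in for $\nabla g(x^k) - u^k$) are hypothetical values chosen so that the projection is inactive in one run and active in the other.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(t, lo), hi) for t in x]

def gradient_mapping(x, d, alpha, lo, hi):
    """(1/alpha) * (x - Pi_X(x - alpha * d)), where d = grad_g(x) - u and X = [lo, hi]^n."""
    step = project_box([xi - alpha * di for xi, di in zip(x, d)], lo, hi)
    return [(xi - si) / alpha for xi, si in zip(x, step)]

x, d, alpha = [0.5, 0.9], [0.4, -2.0], 0.5
interior = gradient_mapping(x, d, alpha, -10.0, 10.0)  # projection inactive
clipped = gradient_mapping(x, d, alpha, -1.0, 1.0)     # projection active in coordinate 2
```

When the projection is inactive the mapping returns `d` itself, matching the remark that the quantity coincides with the gradient in the unconstrained setup; when the second coordinate hits the boundary, the corresponding component shrinks from $-2$ to $-0.2$.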

### 3.5 Frank-Wolfe type method

In our analysis of the previous two algorithms, we assumed that the objective function has a smooth component , and we leveraged the smoothness property of to establish convergence rates. In many situations, the objective function may not have a smooth component; consequently, neither the gradient-type algorithm nor the prox-type algorithm provides any theoretical guarantee. In this section, we analyze a Frank-Wolfe-type algorithm for solving such optimization problems. In particular, consider an optimization problem of the form

 f∗:=minx∈Cf(x)=minx∈C{g(x)−h(x)}, (17)

where is a closed convex set, and the functions satisfy the following conditions:

##### Assumption FW:
• The difference function is bounded below over range .

• The function is continuously differentiable, whereas the function is convex and continuous.

The analysis of the Frank-Wolfe algorithm for a convex problem is based on the curvature constant of the convex objective function with respect to the closed convex set . This curvature constant can be defined for any differentiable function, which need not be convex [22].

Here we define a slight generalization of this notion, applicable to a non-differentiable function that can be written as a difference of a differentiable function and a continuous convex function (which may be non-differentiable). Define the set

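$$S_\gamma := \big\{ x, y \in C \mid \text{there exist } \gamma \in (0, 1] \text{ and } u \in C \text{ with } y = x + \gamma(u - x) \big\},$$

and the curvature constant

$$C_f = \sup_{\substack{x, y \in S_\gamma \\ u \in \partial h(x)}} \frac{2}{\gamma^2} \Big[ f(y) - f(x) - \langle y - x, \, \nabla g(x) - u \rangle \Big]. \tag{18}$$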

Note that in the special case , we recover the curvature constant of the differentiable function used by Lacoste-Julien [22]. We refer to the scalar as the generalized curvature constant of the function with respect to the closed convex set .

Next, we provide an analysis of Algorithm 3 in terms of the Frank-Wolfe (FW) gap defined in Step 5. We show that the minimum FW gap defined in Algorithm 3 converges to zero at the rate $\mathcal{O}(1/\sqrt{k})$.

###### Theorem 3.

Under Assumption FW, the Frank-Wolfe gap sequence from Algorithm 3 satisfies the following property:

$$\min_{0 \leq j \leq k} g_j \leq \frac{\max\{2\,(f(x^0) - f^*), \, C_0\}}{\sqrt{k+1}} \quad \text{for all } k = 0, 1, 2, \ldots.$$

See Appendix D.1 for the proof of this theorem.

The FW gap appearing in Theorem 3 is standard in the analysis of Frank-Wolfe algorithm; note that it is invariant to an affine transformation of the set . Similar convergence guarantees for the minimum FW-gap are available for differentiable functions; for instance, see the paper [22]. The novelty of the above theorem is that it provides convergence guarantees of minimum FW-gap for a class of non-differentiable functions.

##### Upper bound on generalized curvature constant:

It is worth mentioning that Algorithm 3 only requires an upper bound of the generalized curvature constant . Consequently, it is interesting to obtain an upper bound for the scalar . For a -smooth function , one well-known upper bound of the curvature constant is ; see the work by Jaggi [20]. A similar upper bound also holds for the generalized curvature constant defined in equation (18). In particular, we prove that for a difference function , with the function being convex continuous, the scalar is always upper bounded by , the curvature constant of the function (see Lemma 6).
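
The listing of Algorithm 3 is not reproduced in this text, so the sketch below is a guess at its general shape rather than the paper's procedure. It assumes a linear minimization step over $C$ in the direction $\nabla g(x^k) - u^k$, the gap $g_k = \langle x^k - s^k, \nabla g(x^k) - u^k \rangle$, and the step size $\gamma_k = \min\{g_k / C_0, 1\}$ used in standard non-convex Frank-Wolfe analyses. The box constraint, the objective, and the choice of $C_0$ as the smoothness constant of $g$ times the squared diameter of $C$ are all illustrative assumptions.

```python
import math

def sign(t):
    return float((t > 0) - (t < 0))

R = 2.0                                   # constraint set C = [-R, R]^2
b = [1.0, -0.5]
# f = g - h with g(x) = 0.5*||x - b||^2 (M_g = 1) and h(x) = 0.5*||x||_1.
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b)) - 0.5 * sum(abs(t) for t in x)
direction = lambda x: [(xi - bi) - 0.5 * sign(xi) for xi, bi in zip(x, b)]  # grad_g(x) - u

def frank_wolfe(x0, C0, iters):
    x, gaps, values = list(x0), [], [f(x0)]
    for _ in range(iters):
        d = direction(x)
        s = [-R if di > 0 else R for di in d]          # linear minimization over the box
        gap = sum((xi - si) * di for xi, si, di in zip(x, s, d))
        gaps.append(gap)
        gamma = min(gap / C0, 1.0)                      # assumed step-size rule
        x = [xi + gamma * (si - xi) for xi, si in zip(x, s)]
        values.append(f(x))
    return x, gaps, values

C0 = 1.0 * (2 * R * math.sqrt(2)) ** 2    # M_g * diam(C)^2, used as a curvature upper bound
f_star = -1.0                             # minimum of f over C, attained at (1.5, -1.0)
x_fw, gaps, values = frank_wolfe([0.0, 0.0], C0, iters=50)
```

On this example the function values never increase, and the running minimum of the gaps stays below the bound stated in Theorem 3 for every $k$.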

## 4 Faster rate under KL-inequality

In the preceding sections, we have derived rates of convergence for the gradient norms for various classes of problems. It is natural to wonder if faster convergence rates are possible when the objective function is equipped with some additional structure. Based on Theorems 1 and 2, we see that both Algorithms 1 and 2 ensure that $\|x^{k+1} - x^k\|_2 \to 0$, meaning that the successive differences between the iterates converge to zero. Although we proved that any limit point of the sequence has desirable properties, the condition $\|x^{k+1} - x^k\|_2 \to 0$ is not sufficient, at least in general, to prove convergence of the sequence $\{x^k\}_{k \geq 0}$. (The convergence of the sequence for Algorithm 2 was studied in the papers [3, 38]; we provide the proof under a weaker set of assumptions.) In this section, we provide a sufficient condition under which Algorithm 1 and Algorithm 2 yield convergent sequences of iterates $\{x^k\}_{k \geq 0}$, and we establish that the gradient sequences converge at faster rates.

### 4.1 Kurdyka-Łojasiewicz inequality

Let us now establish a faster local rate of convergence of Algorithms 1 and 2 for functions that satisfy a form of the Kurdyka-Łojasiewicz (KL) inequality. More precisely, suppose that there exists a constant such that the ratio is bounded above in a neighborhood of every point . This type of inequality is known as a Kurdyka-Łojasiewicz inequality, and the exponent is known as the Kurdyka-Łojasiewicz exponent (KL-exponent) of the function at the point . These types of inequalities were first proved by Łojasiewicz [29] for real analytic functions; Kurdyka [21] and Bolte et al. [5] proved similar inequalities for non-smooth functions, and also provided examples of many functions that satisfy a form of the KL inequality. See Appendix A.2 for further details on functions of the KL type.

##### Assumption KL:

For any point , there exists a scalar such that the ratio is bounded above in a neighborhood of . (Footnote 4: It can be shown that such an inequality holds at any non-critical point of a continuous function ; see Remark 3.2 of Bolte et al. [5]. Note that the parameter and the neighborhood mentioned in Assumption KL may depend on the point .)
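As a concrete illustration of Assumption KL, the following minimal sketch evaluates such a ratio numerically for the quadratic $f(x) = \|x\|_2^2$ at its critical point $\bar{x} = 0$. It assumes that the ratio takes the standard Łojasiewicz form $|f(x) - f(\bar{x})|^{\theta} / \|\nabla f(x)\|_2$, which is not spelled out in the text of this section; the helper names `kl_ratio` and `ratios_near_critical_point`, as well as the test function, are hypothetical choices made only for illustration.

```python
import numpy as np

def kl_ratio(f, grad, x, x_bar, theta):
    """Ratio |f(x) - f(x_bar)|^theta / ||grad f(x)||_2 (assumed standard KL form)."""
    return abs(f(x) - f(x_bar)) ** theta / np.linalg.norm(grad(x))

# Hypothetical test function: f(x) = ||x||_2^2, with critical point x_bar = 0.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x_bar = np.zeros(2)

def ratios_near_critical_point(theta, radii=(1e-1, 1e-2, 1e-3, 1e-4)):
    # Evaluate the ratio at points approaching x_bar along the first coordinate axis.
    return [kl_ratio(f, grad, np.array([r, 0.0]), x_bar, theta) for r in radii]

bounded = ratios_near_critical_point(theta=0.5)     # constant along the sequence
unbounded = ratios_near_critical_point(theta=0.25)  # grows as x approaches x_bar
```

For this quadratic the ratio equals $\|x\|_2^{2\theta - 1}/2$, so under the assumed form it remains bounded near the origin exactly when $\theta \geq 1/2$.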

### 4.2 Convergence guarantees

###### Theorem 4.

Under Assumptions GR & KL, any bounded sequence obtained from Algorithm 1 satisfies the following properties:

• The sequence converges to a critical point , and for all

$$\mathrm{Avg}\big(\|\nabla f(x^k)\|_2\big) \leq \frac{c_1}{k},$$
• Suppose that at the point , the function has a KL exponent for some . Then we have

$$\mathrm{GAvg}\big(\|\nabla f(x^k)\|_2\big) \leq \frac{c_2}{k^r} \quad \text{for all } k = 1, 2, \ldots,$$

where the constants are independent of , but they may depend on the KL parameters at the point .

See Appendix E.1 for the proof of this theorem.

It is worth pointing out that Theorem 4 does not require the function to satisfy any smoothness assumption. Such conditions are needed for applying Algorithm 2, so that Theorem 4 is based on milder conditions than Theorem 5.

Our next result is to exhibit a faster convergence rate for Algorithm 2 under the KL assumption:

###### Theorem 5.

Suppose that, in addition to Assumptions PR & KL, the function in Algorithm 2 is locally smooth. Then any bounded sequence obtained from Algorithm 2 satisfies the following properties:

• The sequence converges to a critical point , and for all

$$\mathrm{Avg}\big(\|\nabla f(x^k)\|_2\big) \leq \frac{c_1}{k}.$$
• Given some , suppose that at the point the function has a KL exponent . Then

$$\mathrm{GAvg}\big(\|\nabla f(x^k)\|_2\big) \leq \frac{c_2}{k^r} \quad \text{for all } k = 1, 2, \ldots,$$

where the constants are independent of , but they may depend on the KL parameters at the point .

See Appendix E.2 for the proof of this theorem.

Comments: Note that is upper bounded by the quantities and . It thus follows that the sequence converges to zero at a rate of at least , thereby improving the rate of convergence of obtained in Theorems 1 and 2. When , a simple modification of the proof (using ) shows that Algorithms 1 and 2 converge in a finite number of steps. Finally, we point out that when the function is non-smooth but the proximal function is smooth, the existing proof can be easily modified to obtain a rate of convergence for the gradient norm .

## 5 Some illustrative applications

In this section, we study four interesting classes of non-convex problems that fall within the framework of this paper. We also discuss various consequences of Theorems 1 – 5 as well as Corollaries 1 – 3 when applied to these problems.

### 5.1 Shape from shading

The problem of shape from shading is to reconstruct the three-dimensional (3D) shape of an object based on observing a two-dimensional (2D) image of intensities, along with some information about the light source direction. It is assumed that the observed 2D image intensity is determined by the angle between the light source direction and the surface normals of the object [12].

In more detail, suppose that both the object and its 2D image are supported on a rectangular grid of size . We introduce the shorthand notation and for the rows and columns of this grid. For each pair , we let denote the observed intensity at location in the image, and we let denote the surface normal at the vertex of the object. Based on observing the 2-dimensional image, both the intensity and co-ordinate pair are known for each pair . The goal of shape from shading is to estimate the unknown coordinate , which corresponds to the height of the object at location . Knowledge of these -coordinates allows us to generate a 3D representation of the object, as illustrated in Figure 1.

##### Lambertian lighting model:

In order to reconstruct the -coordinates, we require a model that relates the observed intensity to the surface normal. In a Lambertian model, for a given light source direction , it is assumed that the surface normal and intensity are related via the relation

$$I_{ij} = \frac{\langle L, N_{ij} \rangle}{\|N_{ij}\|_2}. \tag{19}$$

In one standard model [37], the surface normal is assumed to be determined by the triplet of vertices via the equations

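$$\begin{aligned}
p_{ij} &= \frac{(y_{i,j+1} - y_{i,j})(z_{i+1,j} - z_{ij}) - (y_{i+1,j} - y_{i,j})(z_{i,j+1} - z_{ij})}{(x_{i,j+1} - x_{ij})(y_{i+1,j} - y_{ij}) - (x_{i+1,j} - x_{ij})(y_{i,j+1} - y_{ij})}, \\
q_{ij} &= \frac{(x_{i,j+1} - x_{i,j})(z_{i+1,j} - z_{ij}) - (x_{i+1,j} - x_{i,j})(z_{i,j+1} - z_{ij})}{(x_{i,j+1} - x_{ij})(y_{i+1,j} - y_{ij}) - (x_{i+1,j} - x_{ij})(y_{i,j+1} - y_{ij})}.
\end{aligned}$$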

Squaring both sides of equation (19) and substituting the expression for surface normal yields the polynomial equation

$$(p_{ij}^2 + q_{ij}^2 + 1)\, I_{ij}^2 - (\ell_1 p_{ij} + \ell_2 q_{ij} + \ell_3)^2 = 0, \tag{20}$$

which should be satisfied under the assumed model.

In practice, this equality will not be exactly satisfied, but we can estimate the -coordinates by solving the following non-convex optimization problem in the matrix with entries :

$$\min_{z \in \mathbb{R}^{r \times c}} \Big\{ \underbrace{\sum_{i=1}^{r} \sum_{j=1}^{c} \Big( (1 + p_{ij}^2 + q_{ij}^2)\, I_{ij}^2 - (\ell_1 p_{ij} + \ell_2 q_{ij} + \ell_3)^2 \Big)^2}_{P(z)} \Big\}. \tag{21}$$
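To make the structure of the objective in (21) concrete, here is a minimal NumPy sketch that evaluates the quantities $p_{ij}$, $q_{ij}$ and the sum of squared residuals for a given height matrix. It rests on assumptions that are not stated in the text: the sum is taken only over grid cells where the forward differences exist (the paper's boundary handling is not specified here), and the function names `surface_normal_coeffs` and `shading_objective` are hypothetical.

```python
import numpy as np

def surface_normal_coeffs(x, y, z):
    """Compute p_ij and q_ij from forward differences of the vertex coordinates."""
    # Differences between vertex (i, j+1) and vertex (i, j).
    dxj = x[:-1, 1:] - x[:-1, :-1]
    dyj = y[:-1, 1:] - y[:-1, :-1]
    dzj = z[:-1, 1:] - z[:-1, :-1]
    # Differences between vertex (i+1, j) and vertex (i, j).
    dxi = x[1:, :-1] - x[:-1, :-1]
    dyi = y[1:, :-1] - y[:-1, :-1]
    dzi = z[1:, :-1] - z[:-1, :-1]
    denom = dxj * dyi - dxi * dyj
    p = (dyj * dzi - dyi * dzj) / denom
    q = (dxj * dzi - dxi * dzj) / denom
    return p, q

def shading_objective(z, x, y, intensity, light):
    """Sum of squared residuals of the squared Lambertian relation.

    Assumption: `intensity` has shape (r-1, c-1), one entry per interior cell.
    """
    p, q = surface_normal_coeffs(x, y, z)
    l1, l2, l3 = light
    residual = (1.0 + p**2 + q**2) * intensity**2 - (l1 * p + l2 * q + l3) ** 2
    return float(np.sum(residual**2))
```

As a usage check, intensities generated from the Lambertian relation itself for a planar surface give an objective value of (numerically) zero, while perturbing a single height makes it strictly positive.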
##### Some reconstruction experiments:

In order to illustrate the behavior of our method for this problem, we considered two synthetic images for simulated experiments. The first one is a image of Mozart [42], and the second one is a image of Vase. The 3D shapes were constructed from the 2D images by solving optimization problem (21) using the backtracking gradient descent method, Algorithm 4. The reconstructed surfaces for Vase and Mozart are provided in Figure 1. We ran iterations of Algorithm 4 for both images. The runtime for the Mozart example was 87 seconds, whereas the runtime for the Vase example was 39 seconds. The implementation of Algorithm 4 for Problem (21) is parallelizable; hence, a parallel implementation could achieve a much lower runtime than the one reported here. It is worth mentioning that the polynomial is a fourth-degree polynomial with dimension ; the polynomial is coercive and bounded below by zero. Consequently, we can apply Corollary 2 to the problem (21), which guarantees that the average of the squared gradient norm converges to zero at a rate .
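Algorithm 4 is not reproduced in this section. The sketch below is a generic Armijo-type backtracking gradient descent, offered as an assumption about what such a scheme looks like rather than as the authors' algorithm. The function name, the parameters `step0`, `shrink`, `c`, `max_backtracks`, and the small coercive quartic used as a test problem are all hypothetical.

```python
import numpy as np

def backtracking_gradient_descent(f, grad, x0, num_iters=200,
                                  step0=1.0, shrink=0.5, c=1e-4, max_backtracks=50):
    """Gradient descent with an Armijo sufficient-decrease line search."""
    x = np.array(x0, dtype=float)
    fx = f(x)
    history = [fx]
    for _ in range(num_iters):
        g = grad(x)
        gg = float(g @ g)
        step = step0
        for _ in range(max_backtracks):
            candidate = x - step * g
            f_candidate = f(candidate)
            # Accept the step only if it decreases f by at least c * step * ||g||^2.
            if f_candidate <= fx - c * step * gg:
                x, fx = candidate, f_candidate
                break
            step *= shrink
        # If no step was accepted, x is left unchanged for this iteration.
        history.append(fx)
    return x, history

# Hypothetical test problem: a coercive fourth-degree polynomial bounded below by zero.
def quartic(x):
    return (x[0] ** 2 - 1.0) ** 2 + x[1] ** 4 + x[1] ** 2

def quartic_grad(x):
    return np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 4.0 * x[1] ** 3 + 2.0 * x[1]])

x_final, history = backtracking_gradient_descent(quartic, quartic_grad, x0=[2.0, 1.0])
```

Because a step is accepted only when the sufficient-decrease test passes, the recorded objective values are non-increasing by construction.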

One might also consider applying the CCCP method to this problem. In a recent paper, Wang et al. [37] provided a DC decomposition of the polynomial using a sum-of-squares (SOS) optimization technique. However, it is crucial to note that the DC decomposition of polynomial obtained from the SOS-optimization method need not be optimal. In order to see this, note that the dimension of the polynomial is much larger than three. In particular, the variable is used in the computation of surface normals and , hence is related to variables , which are in turn related to the other variables. It was shown in the paper [2] that SOS techniques for deriving a DC decomposition are sub-optimal for a fourth-degree polynomial when the dimension of the polynomial is greater than 3. Consequently, deriving an optimal DC decomposition for the polynomial will be computationally intensive.

### 5.2 Robust regression using Tukey’s bi-weight

Next, we turn to the problem of robust regression with Tukey’s bi-weight penalty function. Suppose that we observe pairs linked via the noisy linear model

$$y_i = \langle z_i, \mu^* \rangle + \varepsilon_i \quad \text{for } i = 1, \ldots, n.$$

Here the vector is the unknown parameter of interest, whereas the variables correspond to additive noise. In robust regression, we obtain an estimate of the parameter vector by computing

$$\min_{\mu \in \mathbb{R}^d} \underbrace{\Big\{ \frac{1}{n} \sum_{i=1}^{n} \Psi\big(y_i - \langle z_i, \mu \rangle\big) \Big\}}_{=:\, f(\mu)} \tag{22}$$

where $\Psi$ is a known loss function with some robustness properties. One popular example of the loss function $\Psi$ is Tukey’s bi-weight function, which is given by

$$\Psi(t) = \begin{cases} 1 - \big(1 - (t/\lambda)^2\big)^3 & \text{if } |t| \leq \lambda, \\ 1 & \text{otherwise}, \end{cases} \tag{23}$$

where $\lambda$ is a tuning parameter. Note that $\Psi$ is a smooth function, whence the function $f$ in the objective (22) is also smooth, implying that Algorithm 1 is suitable for the problem.
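The loss (23) and the objective (22) are straightforward to code up. The following is a minimal sketch, not the authors' implementation: it uses plain gradient descent with a fixed step size $1/M$ as a stand-in for Algorithm 1, whose exact step-size rule is not reproduced here. The constant $M = (6/\lambda^2)\,\|Z\|_{\mathrm{op}}^2 / n$ is our own bound, derived from $|\Psi''(t)| \leq 6/\lambda^2$; all function names are hypothetical.

```python
import numpy as np

def tukey_biweight(t, lam):
    """Tukey's bi-weight loss: 1 - (1 - (t/lam)^2)^3 for |t| <= lam, and 1 otherwise."""
    u = np.clip((np.asarray(t, dtype=float) / lam) ** 2, 0.0, 1.0)
    return 1.0 - (1.0 - u) ** 3

def tukey_biweight_grad(t, lam):
    """Derivative of the loss: (6 t / lam^2)(1 - (t/lam)^2)^2 for |t| <= lam, else 0."""
    t = np.asarray(t, dtype=float)
    u = np.clip((t / lam) ** 2, 0.0, 1.0)
    return 6.0 * t / lam**2 * (1.0 - u) ** 2

def objective(mu, Z, y, lam):
    # f(mu) = (1/n) * sum_i Psi(y_i - <z_i, mu>)
    return float(np.mean(tukey_biweight(y - Z @ mu, lam)))

def gradient(mu, Z, y, lam):
    # Chain rule: each residual y_i - <z_i, mu> has gradient -z_i with respect to mu.
    return -Z.T @ tukey_biweight_grad(y - Z @ mu, lam) / len(y)

def gradient_descent(Z, y, lam, mu0, num_iters=200):
    # Assumed smoothness bound: |Psi''| <= 6 / lam^2, scaled by ||Z||_op^2 / n.
    M = 6.0 / lam**2 * np.linalg.norm(Z, 2) ** 2 / len(y)
    mu = np.array(mu0, dtype=float)
    values = [objective(mu, Z, y, lam)]
    for _ in range(num_iters):
        mu = mu - gradient(mu, Z, y, lam) / M
        values.append(objective(mu, Z, y, lam))
    return mu, values
```

With the step size $1/M$, the standard descent lemma for smooth functions implies that the recorded objective values are non-increasing.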

With this set-up, applying Theorem 1, Theorem 4 and Corollary 3, we obtain the following guarantee:

###### Corollary 4.

Given a random initialization, any bounded sequence obtained by applying Algorithm 1 to the objective (22) has the following properties:

• Almost surely with respect to the random initialization, the sequence converges to a point such that and .

• There is a universal constant such that

$$\mathrm{Avg}\big(\|\nabla f(\mu^k)\|_2\big) \leq \frac{c_1}{k} \quad \text{for all } k = 1, 2, \ldots.$$

We provide the proof in Appendix F.1.

### 5.3 Smooth function minimization with sparsity constraints

Moving beyond the robust regression problem, we now discuss another interesting problem, that of minimizing a smooth function subject to a sparsity constraint. Consider the following optimization problem

$$\min_{\substack{x \in \mathbb{R}^d \\ \|x\|_0 \leq s}} g(x), \tag{24}$$

where $g$ is a smooth function, the $\ell_0$-“norm” counts the number of non-zero entries in the vector $x$, and $s$ is a sparsity parameter. The constraint set is non-convex, and consequently, the optimization problem (24) is non-convex. However, the constraint set can be expressed as the level set of a certain DC function; see Gotoh et al. [16]. In particular, let $|x|_{(1)} \leq |x|_{(2)} \leq \cdots \leq |x|_{(d)}$ denote the values of $x$ re-ordered in terms of their absolute magnitudes. In terms of this notation, we have $\sum_{i=d-s+1}^{d} |x|_{(i)} \leq \|x\|_1$ for all $x \in \mathbb{R}^d$, with equality holding if and only if $x$ is $s$-sparse. This fact ensures that

$$\Big\{ x \in \mathbb{R}^d : \|x\|_0 \leq s \Big\} = \Big\{ x \in \mathbb{R}^d : \|x\|_1 - \sum_{i=d-s+1}^{d} |x|_{(i)} \leq 0 \Big\}. \tag{25}$$
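The identity (25) is easy to check numerically. In the short sketch below, the function names `top_s_abs_sum` and `dc_sparsity_gap` are hypothetical; the code computes the difference between the $\ell_1$ norm and the sum of the $s$ largest absolute entries, which is zero exactly for vectors with at most $s$ non-zero entries.

```python
import numpy as np

def top_s_abs_sum(x, s):
    """Sum of the s largest absolute values among the entries of x."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))  # ascending order
    return float(a[len(a) - s:].sum())

def dc_sparsity_gap(x, s):
    """||x||_1 minus the sum of the s largest |x_i|; non-negative, zero iff x is s-sparse."""
    return float(np.abs(np.asarray(x, dtype=float)).sum()) - top_s_abs_sum(x, s)

gap_sparse = dc_sparsity_gap([0.0, 3.0, 0.0, -2.0, 0.0], s=2)  # vector with 2 non-zeros
gap_dense = dc_sparsity_gap([1.0, 3.0, 0.0, -2.0, 0.0], s=2)   # vector with 3 non-zeros
```

In the second call the gap equals the magnitude of the smallest non-zero entry, which is the mass left outside the top two entries.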

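Since both of the functions $x \mapsto \|x\|_1$ and $x \mapsto \sum_{i=d-s+1}^{d} |x|_{(i)}$ are convex [7], this level set formulation is a DC constraint. Now using the representation (25), we can rewrite problem (24) as $\min_{x \in \mathbb{R}^d} g(x)$ such that $\|x\|_1 - \sum_{i=d-s+1}^{d} |x|_{(i)} \leq 0$. For our experiments, it is more convenient to solve the penalized analogue of the last problem, given by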

$$\min_{x \in \mathbb{R}^d} \Big\{ g(x) + \lambda \Big( \|x\|_1 - \sum_{i=d-s+1}^{d} |x|_{(i)} \Big) \Big\}, \tag{26}$$

where $\lambda$ is a tuning parameter. The optimization problem (26) can be solved using Algorithm 2 with , and