Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods

02/09/2016
by   Guoyin Li, et al.
UNSW

In this paper, we study the Kurdyka-Łojasiewicz (KL) exponent, an important quantity for analyzing the convergence rate of first-order methods. Specifically, we develop various calculus rules to deduce the KL exponent of new (possibly nonconvex and nonsmooth) functions formed from functions with known KL exponents. In addition, we show that the well-studied Luo-Tseng error bound together with a mild assumption on the separation of stationary values implies that the KL exponent is 1/2. The Luo-Tseng error bound is known to hold for a large class of concrete structured optimization problems, and thus we deduce the KL exponent of a large class of functions whose exponents were previously unknown. Building upon this and the calculus rules, we are then able to show that for many convex or nonconvex optimization models for applications such as sparse recovery, their objective function's KL exponent is 1/2. This includes the least squares problem with smoothly clipped absolute deviation (SCAD) regularization or minimax concave penalty (MCP) regularization and the logistic regression problem with ℓ_1 regularization. Since many existing local convergence rate analyses for first-order methods in the nonconvex scenario rely on the KL exponent, our results enable us to obtain explicit convergence rates for various first-order methods when they are applied to a large variety of practical optimization models. Finally, we further illustrate how our results can be applied to establishing local linear convergence of the proximal gradient algorithm and the inertial proximal algorithm with constant step-sizes for some specific models that arise in sparse recovery.


1 Introduction

Large-scale nonsmooth and nonconvex optimization problems are ubiquitous in machine learning and data analysis. Tremendous efforts have thus been directed at designing efficient algorithms for solving these problems. One popular class of algorithms is the class of first-order methods. These methods are noted for their simplicity, ease of implementation and relatively (often surprisingly) good performance; notable examples include the proximal gradient algorithm, inertial proximal algorithms and the alternating direction method of multipliers. Due to the excellent performance and wide applicability of first-order methods, their convergence behavior has been extensively studied in recent years; see, for example, [1, 2, 3, 10, 19, 20, 24, 26, 35] and references therein. Analyzing the convergence rate of first-order methods is an important step towards a better understanding of existing algorithms, and is also crucial for developing new optimization models and numerical schemes.

As demonstrated in [2, Theorem 3.4], the convergence behavior of many first-order methods can be understood using the celebrated Kurdyka-Łojasiewicz (KL) property and its associated KL exponent; see Definitions 2.2 and 2.3. The KL property and its associated KL exponent have their roots in algebraic geometry, and they describe a qualitative relationship between the value of a suitable potential function (depending on the optimization model and the algorithm being considered) and some first-order information (gradient or subgradient) of the potential function. The KL property has been applied to analyzing the local convergence rates of various first-order methods for a wide variety of problems by many researchers; see, for example, [2, 16, 24, 43]. In these studies, a prototypical theorem on convergence rate takes the following form:

Prototypical result on convergence rate. For a certain algorithm of interest, consider a suitable potential function. Suppose that the potential function satisfies the KL property with an exponent of α ∈ [0, 1), and that {x^k} is a bounded sequence generated by the algorithm. Then the following results hold (see also the hedged quantitative sketch after this list).

  1. If α = 0, then {x^k} converges finitely.

  2. If α ∈ (0, 1/2], then {x^k} converges locally linearly.

  3. If α ∈ (1/2, 1), then {x^k} converges locally sublinearly.
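In KL-based analyses, the local rates in items 2 and 3 typically take the following quantitative form; this is a hedged sketch in our own notation (the limit x^* and the constants C, q below are ours and are not taken from the paper):

\[
\|x^{k} - x^{*}\| \;\le\;
\begin{cases}
C\,q^{k} \ \text{ for some } q \in (0,1), & \text{if } \alpha \in \bigl(0, \tfrac12\bigr],\\[4pt]
C\,k^{-\frac{1-\alpha}{2\alpha-1}}, & \text{if } \alpha \in \bigl(\tfrac12, 1\bigr),
\end{cases}
\]

for all sufficiently large k and some constant C > 0, where x^* denotes the limit of the sequence.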

While this kind of convergence result is prominent and theoretically powerful, for the results to be fully informative, one has to be able to estimate the KL exponent. Moreover, in order to guarantee a local linear convergence rate, it is desirable to be able to determine whether a given model has a KL exponent of at most 1/2, or to be able to construct a new model whose KL exponent is at most 1/2 if the old one does not have the desired KL exponent.

However, as noted in [31, Page 63, Section 2.1], the KL exponent of a given function is often extremely hard to determine or estimate. Only a few results are available in the literature concerning the explicit KL exponent of a function. One scenario where an explicit estimate of the KL exponent is known is when the function can be expressed as the maximum of finitely many polynomials. In this case, it has been shown in [23, Theorem 3.3] that the KL exponent can be estimated explicitly in terms of the dimension of the underlying space and the maximum degree of the involved polynomials. However, the derived estimate grows rapidly with the dimension of the problem, and so leads to rather weak sublinear convergence rates. Only recently has a dimension-independent KL exponent of convex piecewise linear-quadratic functions become known, thanks to [7, Theorem 5], which connects the KL property with the concept of error bound (a notion different from the Luo-Tseng error bound to be discussed in Definition 2.1) for convex functions. In addition, an explicit KL exponent was established only very recently in [27] for a class of quadratic optimization problems with matrix variables satisfying orthogonality constraints. Nevertheless, the KL exponents of many common optimization models, such as the least squares problem with smoothly clipped absolute deviation (SCAD) regularization [14] or minimax concave penalty (MCP) regularization [45] and the logistic regression problem with ℓ_1 regularization [39], are still unknown to the best of our knowledge. In this paper, we attempt to further address the problem of determining the explicit KL exponents of optimization models, especially for those that arise in practical applications.

The main contributions of this paper are the rules for computing explicitly the KL exponent of many (convex or nonconvex) optimization models that arise in applications such as statistical machine learning. We accomplish this via two different means: studying calculus rules and building connections with the concept of Luo-Tseng error bound; see Definition 2.1. The Luo-Tseng error bound was used for establishing local linear convergence of various first-order methods, and was shown to hold for a wide range of problems; see, for example, [28, 29, 30, 40, 41, 46] for details. This concept is different from the error bound studied in [7] because the Luo-Tseng error bound is defined for specially structured optimization problems and involves first-order information, while the error bound studied in [7] does not explicitly involve any first-order information. The different nature of these two concepts was also noted in [7, Section 1], in which the Luo-Tseng error bound was referred to as a “first-order error bound”.

In this paper, we first study various calculus rules for the KL exponent. For example, we deduce the KL exponent of the minimum of finitely many KL functions, the KL exponent of the Moreau envelope of a convex KL function, and the KL exponent of a convex objective from its Lagrangian relaxation, etc., under suitable assumptions. This is the content of Section 3. These rules are useful in our subsequent analysis of the KL exponent of concrete optimization models that arise in applications. Next, we show that if the Luo-Tseng error bound holds and a mild assumption on the separation of stationary values is satisfied, then the function is a KL function with an exponent of 1/2. This is done in Section 4. Upon making this connection, we can now take advantage of the relatively better studied concept of Luo-Tseng error bound, which is known to hold for a wide range of concrete optimization problems; see, for example, [28, 29, 30, 40, 41, 46]. Hence, in Section 5, building upon the calculus rules and the connection with the Luo-Tseng error bound, we show that many optimization models that arise in applications such as sparse recovery have objectives whose KL exponent is 1/2; this covers the least squares problem with smoothly clipped absolute deviation (SCAD) [14] or minimax concave penalty (MCP) [45] regularization, and the logistic regression problem with ℓ_1 regularization [39]. We also illustrate how our results can be used for establishing linear convergence of some first-order methods, such as the proximal gradient algorithm and the inertial proximal algorithm [35] with constant step-sizes. Finally, we present some concluding remarks in Section 6.
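To make the algorithmic setting concrete, the following is a minimal sketch, in our own notation, of the proximal gradient method with a constant step size applied to the ℓ_1-regularized least squares problem min_x (1/2)‖Ax − b‖² + λ‖x‖_1, which is representative of the structured objectives and sparse recovery applications discussed later; the function names and the synthetic data below are ours, not the paper's.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal mapping of t*||.||_1, i.e., componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gradient(A, b, lam, num_iters=500):
    """Proximal gradient method with constant step size 1/L for
    min_x 0.5*||Ax - b||^2 + lam*||x||_1, where L = ||A||_2^2 is a
    Lipschitz constant of the gradient of the smooth part."""
    L = np.linalg.norm(A, 2) ** 2
    step = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                  # gradient of the least squares term
        x = prox_l1(x - step * grad, step * lam)  # forward-backward (proximal) step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 100))
    x_true = np.zeros(100)
    x_true[:5] = 1.0
    b = A @ x_true
    x_hat = proximal_gradient(A, b, lam=0.1)
    print("residual norm:", np.linalg.norm(A @ x_hat - b))
```

For many objectives of this structured form, the results of Sections 4 and 5 yield a KL exponent of 1/2 and hence local linear convergence of such iterations.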

2 Notation and preliminaries

In this paper, we use ℝ^n to denote the n-dimensional Euclidean space, equipped with the standard inner product ⟨·, ·⟩ and the induced norm ‖·‖. The closed ball centered at x with radius r is denoted by B(x, r). We denote the nonnegative orthant by ℝ^n_+, and the set of n × n symmetric matrices by S^n. For a vector x ∈ ℝ^n, we use ‖x‖_1 to denote the ℓ_1 norm and ‖x‖_0 to denote the number of entries in x that are nonzero (the “ℓ_0 norm”). For a (nonempty) closed set C ⊆ ℝ^n, the indicator function δ_C is defined as

δ_C(x) := 0 if x ∈ C, and δ_C(x) := +∞ otherwise.

In addition, we denote the distance from an x ∈ ℝ^n to C by dist(x, C) := inf_{y ∈ C} ‖x − y‖, and the set of points in C that achieve this infimum (the projection of x onto C) is denoted by Proj_C(x). The set Proj_C(x) becomes a singleton if C is a closed convex set. Finally, we write ri C to represent the relative interior of a closed convex set C.

For an extended-real-valued function f : ℝ^n → [−∞, ∞], the domain is defined as dom f := {x ∈ ℝ^n : f(x) < ∞}. Such a function is called proper if it is never −∞ and its domain is nonempty, and is called closed if it is lower semicontinuous. For a proper function f, we write z →_f x to mean z → x with f(z) → f(x). The regular subdifferential of a proper function f [38, Page 301, Definition 8.3(a)] at x ∈ dom f is given by

∂̂f(x) := { v ∈ ℝ^n : liminf_{z → x, z ≠ x} [ f(z) − f(x) − ⟨v, z − x⟩ ] / ‖z − x‖ ≥ 0 }.

The (limiting) subdifferential of a proper function f [38, Page 301, Definition 8.3(b)] at x ∈ dom f is then defined by

∂f(x) := { v ∈ ℝ^n : there exist x^k →_f x and v^k → v with v^k ∈ ∂̂f(x^k) for all k }.

(1)

By convention, if x ∉ dom f, then ∂f(x) = ∅. We also write dom ∂f := {x ∈ ℝ^n : ∂f(x) ≠ ∅}. It is well known that when f is continuously differentiable, the subdifferential (1) reduces to the gradient of f, denoted by ∇f; see, for example, [38, Exercise 8.8(b)]. Moreover, when f is convex, the subdifferential (1) reduces to the classical subdifferential in convex analysis; see, for example, [38, Proposition 8.12]. The limiting subdifferential enjoys rich and comprehensive calculus rules and has been widely used in nonsmooth and nonconvex optimization [33, 38]. We also define the limiting (resp. regular) normal cone of a closed set C at x ∈ C as N_C(x) := ∂δ_C(x) (resp. N̂_C(x) := ∂̂δ_C(x)), where δ_C is the indicator function of C. A closed set C is called regular at x ∈ C if N̂_C(x) = N_C(x) (see [38, Definition 6.4]), and a proper closed function f is called regular at x ∈ dom f if its epigraph is regular at the point (x, f(x)) (see [38, Definition 7.25]). Finally, we say that x is a stationary point of a proper closed function f if 0 ∈ ∂f(x). It is known that any local minimizer of f is a stationary point; see, for example, [38, Theorem 10.1].

For a proper closed convex function g, the proximal mapping prox_g at any y ∈ ℝ^n is defined as

prox_g(y) := argmin_{x ∈ ℝ^n} { g(x) + (1/2)‖x − y‖² },

where argmin denotes the unique minimizer of this optimization problem (the problem has a unique minimizer because its objective is proper, closed and strongly convex). For a general optimization problem, we use Argmin to denote the set of minimizers, which may be empty, a singleton, or may contain more than one point. The proximal mapping is nonexpansive, i.e., for any y_1 and y_2, we have

‖prox_g(y_1) − prox_g(y_2)‖ ≤ ‖y_1 − y_2‖;

(2)

see, for example, [36, Page 340]. Moreover, it is routine to show that y = prox_g(y) if and only if 0 ∈ ∂g(y).

The following property is defined for proper closed functions of the form F = f + P, where f is a proper closed function with an open domain on which it is continuously differentiable with a locally Lipschitz continuous gradient, and P is proper closed convex. Recall that for this class of functions, we have x ∈ X̄ if and only if x = prox_P(x − ∇f(x)), where X̄ denotes the set of stationary points of F. Indeed, this follows from the chain of equivalences sketched below, in which step (i) follows from [38, Exercise 8.8(c)].
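A minimal sketch of this chain, written in the notation F = f + P adopted above (the step labels are ours):

\[
0 \in \partial F(x)
\;\overset{(i)}{\Longleftrightarrow}\;
0 \in \nabla f(x) + \partial P(x)
\;\Longleftrightarrow\;
\bigl(x - \nabla f(x)\bigr) - x \in \partial P(x)
\;\Longleftrightarrow\;
x = \operatorname{prox}_{P}\bigl(x - \nabla f(x)\bigr),
\]

where the last equivalence is the first-order optimality condition of the strongly convex subproblem defining prox_P.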

Definition 2.1.

(Luo-Tseng error bound) (We adapt the definition from [41, Assumption 2a].) Let X̄ be the set of stationary points of F. Suppose that X̄ ≠ ∅. We say that the Luo-Tseng error bound (referred to as the first-order error bound in [7, Section 1]) holds if for any ζ ≥ inf F, there exist ε > 0 and τ > 0 so that

dist(x, X̄) ≤ τ ‖prox_P(x − ∇f(x)) − x‖

(3)

whenever F(x) ≤ ζ and ‖prox_P(x − ∇f(x)) − x‖ ≤ ε.
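As an illustration (ours, not the paper's), the bound (3) holds globally with τ = 1 + (1 + L)/μ when, in addition, dom f = ℝ^n and f is μ-strongly convex with an L-Lipschitz gradient. Indeed, writing x⁺ := prox_P(x − ∇f(x)), the optimality condition of the proximal subproblem gives x − ∇f(x) − x⁺ ∈ ∂P(x⁺), so that

\[
v := \nabla f(x^{+}) - \nabla f(x) + (x - x^{+}) \;\in\; \nabla f(x^{+}) + \partial P(x^{+}) \;=\; \partial F(x^{+}),
\qquad
\|v\| \;\le\; (L+1)\,\|x - x^{+}\|.
\]

Since F is then μ-strongly convex with a unique stationary point x̄, we get ‖x⁺ − x̄‖ ≤ ‖v‖/μ, and the triangle inequality yields dist(x, X̄) = ‖x − x̄‖ ≤ (1 + (1 + L)/μ)‖prox_P(x − ∇f(x)) − x‖.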

It is known that this property is satisfied for many choices of f and P, and we refer to [28, 29, 30, 40, 41, 46] and references therein for more detailed discussions. This property was used for establishing local linear convergence of various first-order methods applied to minimizing F.

Recently, the following property has also been used extensively for analyzing the convergence rates of first-order methods, mainly for possibly nonconvex objective functions; see, for example, [2, 3].

Definition 2.2.

(KL property & KL function) We say that a proper closed function has the Kurdyka-Łojasiewicz (KL) property at if there exist a neighborhood of , and a continuous concave function with such that:

  1. is continuously differentiable on with ;

  2. for all with , one has

A proper closed function satisfying the KL property at all points in is called a KL function.

In this paper, we are interested in the KL exponent, which is defined [2, 3] as follows.

Definition 2.3.

(KL exponent) For a proper closed function f satisfying the KL property at x̄ ∈ dom ∂f, if the corresponding function φ can be chosen as φ(s) = c̄ s^{1−α} for some c̄ > 0 and α ∈ [0, 1), i.e., there exist c, ε > 0 and ν ∈ (0, ∞] so that

dist(0, ∂f(x)) ≥ c (f(x) − f(x̄))^α

(4)

whenever x ∈ dom ∂f with ‖x − x̄‖ ≤ ε and f(x̄) < f(x) < f(x̄) + ν, then we say that f has the KL property at x̄ with an exponent of α. If f is a KL function and has the same exponent α at any x̄ ∈ dom ∂f, then we say that f is a KL function with an exponent of α. (In classical algebraic geometry, this exponent is also referred to as the Łojasiewicz exponent.)
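As a minimal worked example (ours, not from the paper), consider the convex quadratic f(x) = (1/2)⟨Ax, x⟩ with A symmetric positive definite; its unique stationary point is x̄ = 0, and

\[
f(x) - f(\bar{x}) \;=\; \tfrac{1}{2}\langle Ax, x\rangle \;\le\; \tfrac{\lambda_{\max}(A)}{2}\,\|x\|^{2},
\qquad
\operatorname{dist}\bigl(0, \partial f(x)\bigr) \;=\; \|Ax\| \;\ge\; \lambda_{\min}(A)\,\|x\|,
\]

so that dist(0, ∂f(x)) ≥ λ_min(A)√(2/λ_max(A)) · (f(x) − f(x̄))^{1/2} for all x, which is the inequality (4) with exponent α = 1/2.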

This definition encompasses broad classes of functions that arise in practical optimization problems. For example, it is known that if f is a proper closed semi-algebraic function [3], then f is a KL function with a suitable exponent α ∈ [0, 1). As established in [2, Theorem 3.4] and much subsequent work, the KL exponent has a close relationship with the rate of convergence of many commonly used optimization methods.

Before ending this section, we state two auxiliary lemmas. The first result is an immediate consequence of the fact that the set-valued mapping ∂f is outer semicontinuous (with respect to f-attentive convergence, i.e., x^k → x with f(x^k) → f(x); see [38, Proposition 8.7]), and can be found in [2, Remark 4 (b)]. This result will be used repeatedly at various places in our discussion below. We include a proof for self-containedness.

Lemma 2.1.

Suppose that f is a proper closed function and 0 ∉ ∂f(x̄) for some x̄ ∈ dom ∂f. Then, for any α ∈ [0, 1), f satisfies the KL property at x̄ with an exponent of α.

Proof.

Fix any . Since and is nonempty and closed, it follows that is positive and finite. Define . We claim that there exists so that whenever and .

Suppose for the sake of contradiction that this is not true. Then there exists a sequence with and so that

In particular, there exists a sequence satisfying and . By passing to a subsequence if necessary, we may assume without loss of generality that for some , and we have , thanks to [38, Proposition 8.7]. But then we have , a contradiction. Thus, there exists so that whenever and .

Using this, we see immediately that

whenever and , showing that satisfies the KL property at with an exponent of . This completes the proof. ∎

The second result concerns the equivalence of “norms”, whose proof is simple and is omitted.

Lemma 2.2.

Let . Then there exist so that

for any .
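A typical instance of such a “norm” equivalence, stated under our own conventions for the quantities ‖x‖_p := (∑_{i=1}^{r} |x_i|^{p})^{1/p} with p > 0, is

\[
\|x\|_{\infty} \;\le\; \|x\|_{p} \;\le\; r^{1/p}\,\|x\|_{\infty}
\qquad \text{for all } x \in \mathbb{R}^{r},
\]

which in particular provides constants c_1, c_2 > 0 with c_1‖x‖ ≤ ‖x‖_p ≤ c_2‖x‖ for the Euclidean norm ‖·‖.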

3 Calculus of the KL exponent

In this section, we discuss how the KL exponent behaves under various operations on KL functions. We briefly summarize our results below. The required assumptions will be made explicit in the respective theorems.

  1. Exponent for the minimum of finitely many KL functions, given the exponents of the individual functions; see Theorem 3.1 and Corollary 3.1.

  2. Exponent for the composition of a KL function with a continuously differentiable mapping whose Jacobian is surjective, given the exponent of the outer function; see Theorem 3.2.

  3. Exponent for a block separable sum of KL functions, given the exponents of the summands; see Theorem 3.3.

  4. Exponent for the Moreau envelope of a convex KL function; see Theorem 3.4.

  5. Deducing the exponent from the Lagrangian relaxation for convex problems; see Theorem 3.5.

  6. Exponent for a potential function used in the convergence analysis of the inertial proximal algorithm in [35]; see Theorem 3.6.

  7. Deducing the exponent of a partly smooth KL function by looking at its restriction on its active manifold; see Theorem 3.7.

We shall make use of some of these calculus rules in Section 5 to deduce the KL exponent of some concrete optimization models.

We start with our first result, which concerns the minimum of finitely many KL functions. This rule will prove to be useful in Section 5. Indeed, as we shall see there, many nonconvex optimization problems that arise in applications have objectives that can be written as the minimum of finitely many KL functions whose exponents can be deduced from our results in Section 4; this includes some prominent and widely used NP-hard optimization models, for example, the least squares problem with cardinality constraint [6].

Theorem 3.1.

(Exponent for minimum of finitely many KL functions) Let , , be proper closed functions, be continuous on and , where . Suppose further that each , , satisfies the KL property at with an exponent of . Then satisfies the KL property at with an exponent of .
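In symbols, a hedged paraphrase of this result in our own notation reads: if f = min_{1≤i≤r} f_i and each f_i that is active at x̄ (i.e., f_i(x̄) = f(x̄)) satisfies the KL property there with exponent α_i, then, under the stated continuity assumptions, f satisfies the KL property at x̄ with exponent

\[
\alpha \;=\; \max\bigl\{\alpha_{i} : i \in I(\bar{x})\bigr\},
\qquad
I(\bar{x}) \;:=\; \bigl\{\, i : f_{i}(\bar{x}) = f(\bar{x}) \,\bigr\}.
\]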

Proof.

From the definition of , we see that . Since is lower semicontinuous and the function is continuous on , there exists such that for all with , we have

Thus, whenever and , we have .

Next, using and the subdifferential rule of the minimum of finitely many functions [32, Theorem 5.5], we obtain for all that

(5)

On the other hand, by assumption, for each , there exist , , such that for all with and , one has

(6)

Let , and . Take any with and . Then and we have

where the first inequality follows from (5), the second inequality follows from (6), the construction of , , and , as well as the facts that and for ; these facts also give the last equality. This completes the proof. ∎

We have the following immediate corollary.

Corollary 3.1.

Let , , be proper closed functions with for all , and be continuous on . Suppose further that each is a KL function with an exponent of for . Then is a KL function with an exponent of .

Proof.

In view of Theorem 3.1, it suffices to show that for any , we have . To this end, take any . Note that we have by the definition, and hence . In addition, from the definition of , we have for all . Hence, is finite for all . Thus, we conclude that for all , which implies that because for all by assumption. ∎

The next theorem concerns the composition of a KL function with a smooth function that has a surjective Jacobian mapping.

Theorem 3.2.

(Exponent for composition of KL functions) Let , where is a proper closed function on and is a continuously differentiable mapping. Suppose in addition that is a KL function with an exponent of and the Jacobian is a surjective mapping at some . Then has the KL property at with an exponent of .
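In symbols, a hedged reading of this statement in our own notation: with f = g ∘ F, where g is proper closed with KL exponent α and F is continuously differentiable with surjective Jacobian at x̄, the conclusion is that

\[
\operatorname{dist}\bigl(0, \partial f(x)\bigr) \;\ge\; c\,\bigl(f(x) - f(\bar{x})\bigr)^{\alpha}
\qquad \text{for all } x \text{ near } \bar{x} \text{ with } f(\bar{x}) < f(x) < f(\bar{x}) + \nu,
\]

i.e., the exponent α of the outer function g is inherited by the composition. As the proof below indicates, the two ingredients are the chain rule ∂(g ∘ F)(x) = ∇F(x)^T ∂g(F(x)), valid near x̄ by surjectivity, and the Lyusternik-Graves theorem.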

Proof.

Note from [38, Exercise 10.7] and that . As is a KL function, there exist , , such that for all with and , one has

(7)

On the other hand, since the linear map is surjective and is continuously differentiable, it follows from the classical Lyusternik-Graves theorem (see, for example, [33, Theorem 1.57]) that there are numbers and such that for all with

where and are the closed unit balls in and , respectively. This implies that for all we have the following estimate:

(8)

whenever

. Moreover, from the chain rule of the limiting subdifferential for composite functions (see, for example,

[38, Exercise 10.7]), we have for all with that

because is a surjective mapping for all such .

Now, let be such that for all , and . Fix with and . Let be such that . Then, we have for some . Hence, it follows from (8) that

In addition, since , applying (7) with gives us that

Therefore,

Our next theorem concerns separable sums.

Theorem 3.3.

(Exponent for block separable sums of KL functions) Let , be such that . Let , where , , are proper closed functions on with . Suppose further that each is a KL function with an exponent of and that each is continuous on , . Then is a KL function with an exponent of .
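In symbols, a hedged paraphrase (the block structure and names are ours): writing x = (x_1, …, x_r) with x_i ∈ ℝ^{n_i} and

\[
F(x) \;=\; \sum_{i=1}^{r} f_{i}(x_{i}),
\]

if each f_i is a KL function with exponent α_i and is continuous on dom ∂f_i, then F is a KL function with exponent α = max_{1≤i≤r} α_i.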

Proof.

Denote with , . Then [38, Proposition 10.5] shows that for each . As each , , is a KL function with exponent , there exist , , such that for all with and , one has

(9)

Since the left hand side of (9) is always nonnegative, the above relation holds trivially whenever . In addition, since is continuous on by assumption for each , by shrinking if necessary, we conclude that whenever and . Thus, we have from these two observations that for all with ,

(10)

Let . Take any with and . We will now verify (4). To this end, let be such that

If , then clearly

since . Thus, we consider the case where . In this case, recall from [38, Proposition 10.5] that

Hence, there exist with such that

This together with (10) implies that

(11)

for . Define . Since and , it then follows from (11) that

where the second inequality follows from Lemma 2.2 with . This completes the proof. ∎

We now discuss the operation of taking the Moreau envelope, a common technique for smoothing the objective function of convex optimization problems.

Theorem 3.4.

(Exponent for Moreau envelope of convex KL functions) Let be a proper closed convex function that is a KL function with an exponent of . Suppose further that is continuous on . Fix and consider

Then is a KL function with an exponent of .
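For reference, the Moreau envelope appearing in the statement, written here as F_λ in our notation, and its gradient take the standard form (with λ > 0 fixed; the gradient identity is the fact from [4, Proposition 12.29] invoked as (12) in the proof below):

\[
F_{\lambda}(x) \;:=\; \min_{y \in \mathbb{R}^{n}} \Bigl\{\, f(y) + \tfrac{1}{2\lambda}\,\|x - y\|^{2} \,\Bigr\},
\qquad
\nabla F_{\lambda}(x) \;=\; \tfrac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr).
\]

The theorem asserts that F_λ is again a KL function, with an exponent determined by the exponent α of f.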

Proof.

It suffices to consider the case and show that has the KL property with an exponent of at any fixed , in view of Lemma 2.1 and the convexity of .

To this end, recall from [4, Proposition 12.29] that, for all ,

(12)

and that is Lipschitz continuous with a Lipschitz constant of . Consequently, we have for any that

(13)

where is the projection of onto , and the last equality holds because .

Next, note that we have according to [4, Proposition 12.28], which implies as . This together with (12) gives . Hence, . Since is a KL function with an exponent of and , using the fact that is continuous on , we obtain that there exist and so that

(14)

whenever and ; here, the condition on the bound on function values is waived by using the continuity of on and choosing a smaller if necessary. Moreover, in view of [7, Theorem 5(i)], by shrinking if necessary, we conclude that there exists so that

(15)

whenever and ; here, the condition on the bound on function values is waived similarly as before. Finally, since , we have . Combining this with (14) and (15) implies that for some ,

(16)

whenever and .

Now, using the definition of the proximal mapping as minimizer, we have by using the first-order optimality condition that for any ,

In particular, . In addition, using the above relation and (12), we deduce that

(17)

Fix an arbitrary with . Then , where the inequality is due to (2). Let . Then and . Hence, the relations (16) and (17) imply that

Applying (13) with and combining this with the preceding relation, we obtain further that

(18)

whenever . Finally, from the convexity of , we have

(19)

where the equality follows from (12).

Shrink