A nonsmooth nonconvex descent algorithm

10/24/2019 ∙ by Jan Mankau, et al.

The paper presents a new descent algorithm for locally Lipschitz continuous functions f: X → ℝ. The selection of a descent direction at some iteration point x combines an approximation of the set-valued gradient of f on a suitable neighborhood of x (recently introduced by Mankau & Schuricht) with an Armijo-type step control. The algorithm is analytically justified and it is shown that accumulation points of the iteration points are critical points of f. Finally the algorithm is tested on numerous benchmark problems and the results are compared with simulations found in the literature.


1 Introduction

In this paper we present a new descent algorithm to find local minima or critical points of a locally Lipschitz continuous function f: X → ℝ on a Hilbert space X. For the minimization of a nonsmooth function

min_{x ∈ X} f(x)    (1.1)

numerous algorithms based on quite different methods have been proposed in the literature. Let us mention, without being complete, bundle-type methods (cf. Alt [3], Frangioni [11], Gaudioso & Monaco [12], Hiriart-Urruty [17], Kiwiel [18], Mäkelä & Neittaanmäki [22], Mifflin [24], Schramm [28], Wolfe [32], Zowe [34]), proximal point and splitting methods such as FISTA or the primal-dual method (cf. Beck [5], Eckstein & Bertsekas [10], Chambolle & Pock [7]), gradient sampling algorithms (cf. Burke, Lewis & Overton [6], Kiwiel [19]), algorithms based on smoothing techniques (cf. Polak & Royset [26]), and the discrete gradient method (cf. Bagirov & Karasözen [4]).

Bundle-type methods, proximal point methods, and splitting methods require f to be convex or to have some other special structure. Many algorithms for locally Lipschitz continuous functions, such as the discrete gradient method, need to know the entire generalized gradient of f at given points. Stochastic methods like the gradient sampling algorithm are robust without knowledge of the entire generalized gradient, but at the cost of high computational effort. Therefore they are limited to minimization problems on low-dimensional spaces.

Recall that the derivative f′(x) indicates a direction of descent for f near x. However, if the direction of descent changes rapidly in a small neighborhood of x, which is typical for functions having large second derivatives or for functions that are even nonsmooth, then some knowledge of the derivative of f on a whole neighborhood of x is necessary to determine a suitable direction of descent near x.

For a new robust and fast algorithm we combine ideas of bundle methods and gradient sampling methods. We use the concept of gradients of f on sets as introduced in Mankau & Schuricht [23] (which extends ideas from Goldstein [15]). Here, similar to gradient sampling methods, generalized gradients of f on a whole neighborhood of the iteration point x are considered for the determination of a suitable descent direction near x. But, in contrast to e.g. Burke, Lewis & Overton [6] and Kiwiel [19], the set-valued gradient on a neighborhood of x is not approximated stochastically. We rather use an elaborate recursive inner approximation coupled with the computation of related descent directions until a generalized Armijo condition is satisfied (a condition similar to that used in Alt [3] and Schramm [28] in connection with the ε-subdifferential). Finally a line search along a direction of sufficient descent gives the next iteration point (cf. Pytlak [27]). For better performance we may also adapt the norm of X in each step. It turns out that our algorithm requires substantially fewer gradient computations than those in [6] and [19]. Therefore it is also applicable in high-dimensional spaces as needed for variational problems.

For a locally Lipschitz continuous function f our methods merely demand that, at any point x, both the value f(x) and at least one element of the generalized gradient ∂f(x) (in the sense of Clarke [9]) can be computed. Notice that this mild requirement is assumed in all of the above mentioned gradient-based algorithms and that it is typically met in applications. In an upcoming paper an extended algorithm is presented in which quasi-Newton and preconditioning methods are included by a suitable change of norm in each iteration step.
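In implementation terms this requirement amounts to a very small oracle. The sketch below (Python; the class name and the concrete example function are hypothetical, not taken from the paper) bundles the two callbacks the scheme needs: the function value and one element of Clarke's generalized gradient.

    from dataclasses import dataclass
    from typing import Callable
    import numpy as np

    @dataclass
    class LipschitzOracle:
        """Value of f and one element of Clarke's generalized gradient at a point."""
        value: Callable[[np.ndarray], float]
        one_gradient: Callable[[np.ndarray], np.ndarray]

    # Example: f(x) = |x_1| + x_2^2.  Away from x_1 = 0 the classical gradient is
    # the only element of the generalized gradient; at x_1 = 0 the value 0 is one
    # admissible choice from the interval [-1, 1] of possible first components.
    oracle = LipschitzOracle(
        value=lambda x: abs(x[0]) + x[1] ** 2,
        one_gradient=lambda x: np.array([np.sign(x[0]), 2.0 * x[1]]),
    )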

Section 2 gives a brief overview of gradients on sets as needed for our treatment. The algorithm and some convergence results are given in Section 3. After the formulation of the condition of sufficient descent and of several general assumptions, Section 3.1 provides the main Algorithm 3.8 and its properties. Algorithm 3.8 calls Algorithm 3.14 for the computation of a suitable inner approximation of the set-valued gradient on a neighborhood of the current iteration point and the computation of a related descent direction, while Step 3 of Algorithm 3.14 calls Algorithm 3.17 for some subiteration. Figure 1 gives an overview of the whole algorithm, and several statements justify its essential steps. Theorem 3.24 shows that every accumulation point of the iteration points produced by Algorithm 3.8 is a critical point in the sense of Clarke. The proofs are collected in Section 3.2. Comprehensive numerical tests of our algorithm for classical benchmark problems can be found in Section 4. There the simulations are also compared with results from Burke, Lewis & Overton [6], Kiwiel [19], Alt [3], Schramm [28] and the BFGS algorithm.

Notation: X is a Hilbert space (notice that any Hilbert space is uniformly convex and reflexive) with scalar product ⟨·,·⟩, where the dual X* is always identified with X. For a set A ⊆ X we write cl A for its closure, conv A for its convex hull, and cl conv A for its closed convex hull. B_ε(x) and B_ε(A) denote the open ε-neighborhood of the point x and of the set A, respectively. (x, y) stands for the open segment between x and y and, in particular, for the open interval in ℝ. ℝ⁺ denotes the positive real numbers. For a locally Lipschitz continuous function f we write f°(x; v) for Clarke's generalized directional derivative of f at x in direction v and ∂f(x) for Clarke's generalized gradient of f at x (cf. Clarke [9]).

2 Gradient on sets

Let f be a locally Lipschitz continuous function on a Hilbert space X. Clarke's generalized gradient ∂f(x) of f at x and the corresponding generalized directional derivative f°(x; v) of f at x in direction v somehow express the behavior of f at the point x (cf. Clarke [9]). However, for the construction of a descent step in a numerical scheme, some information about the behavior of f on a whole neighborhood of x is useful in general. In particular, for describing the behavior of f on the whole ε-ball B_ε(x), we use a set-valued gradient of f and a corresponding generalized directional derivative as introduced in Mankau & Schuricht [23] based on Clarke's pointwise quantities. For the convenience of the reader we present a brief specialized summary of that material as needed for our treatment.

For ε > 0 we define the gradient of f on B_ε(x) by

∂f(B_ε(x)) := cl conv ⋃_{y ∈ B_ε(x)} ∂f(y)    (2.1)

(notice that the closed convex hull agrees with the weakly closed convex hull) and the directional derivative of f on B_ε(x) in direction v ∈ X by

f°(B_ε(x); v) := sup_{y ∈ B_ε(x)} f°(y; v).    (2.2)
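In the algorithms below, the set-valued objects in (2.1) and (2.2) are never computed exactly; they are replaced by finite inner approximations built from single Clarke-gradient elements. The following sketch (Python, hypothetical names) illustrates only the principle: it collects gradient elements at finitely many points of B_ε(x), so the convex hull of the collected elements is contained in ∂f(B_ε(x)) and the maximal inner product with a direction v underestimates f°(B_ε(x); v) by the support-function identity (2.4). The paper's Algorithm 3.14 chooses the evaluation points recursively rather than by sampling.

    import numpy as np

    def sampled_inner_approximation(one_gradient, x, eps, num_points=20, seed=0):
        """Clarke-gradient elements at finitely many points of the ball B_eps(x);
        their convex hull is an inner approximation of the gradient of f on the ball."""
        rng = np.random.default_rng(seed)
        points = [x]
        for _ in range(num_points - 1):
            d = rng.standard_normal(x.shape)
            points.append(x + eps * rng.uniform() * d / np.linalg.norm(d))
        return np.array([one_gradient(p) for p in points])

    def lower_support_value(gradients, v):
        """max <xi, v> over the collected elements: a lower bound for the
        directional derivative of f on B_eps(x) in direction v (cf. (2.4))."""
        return float(np.max(gradients @ v))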

We have the following basic properties (cf. Proposition 2.3 and Corollary 2.10 in [23]).

Proposition 2.3.

Let f be Lipschitz continuous of rank L on a neighborhood of B_ε(x) with x ∈ X and ε > 0. Then

  • ∂f(B_ε(x)) is nonempty, convex, weakly compact and bounded by L.

  • The map v ↦ f°(B_ε(x); v) is finite, positively homogeneous, subadditive, and Lipschitz continuous of rank L. Moreover it is the support function of ∂f(B_ε(x)), i.e.

    f°(B_ε(x); v) = max{⟨ξ, v⟩ : ξ ∈ ∂f(B_ε(x))}  for all v ∈ X.    (2.4)
  • We have

    0 ∈ ∂f(B_ε(x))  if and only if  f°(B_ε(x); v) ≥ 0 for all v ∈ X.    (2.5)
  • Let with and let with . Then

  • Let with and let . Then

Regularity of f at x, i.e. coincidence of Clarke's generalized directional derivative with the usual one-sided directional derivative at x, implies regularity of f on some ball B_ε(x) by Proposition 2.16 in [23].

Lemma 2.6.

Let f be locally Lipschitz continuous and let 0 ∉ ∂f(x) for some x ∈ X. Then there exist ε > 0 and v ∈ X with ‖v‖ = 1 such that

f°(B_ε(x); v) < 0.

Moreover, 0 ∉ ∂f(B_ε(x)) by (2.5).

Motivated by Proposition 2.3 (4) we say that v is a descent direction of f on B_ε(x) if f°(B_ε(x); v) < 0. We call v̄ with ‖v̄‖ = 1 a steepest or optimal descent direction of f on B_ε(x) (with respect to ‖·‖) if

f°(B_ε(x); v̄) = min{f°(B_ε(x); v) : ‖v‖ = 1}.    (2.7)

Theorem 3.10 of [23] ensures the existence of optimal descent directions and of norm-minimal elements in ∂f(B_ε(x)).

Proposition 2.8.

Let f be Lipschitz continuous on a neighborhood of B_ε(x) for some x ∈ X, ε > 0. Then there is a unique ξ̄ ∈ ∂f(B_ε(x)) with

‖ξ̄‖ = min{‖ξ‖ : ξ ∈ ∂f(B_ε(x))}.    (2.9)

If 0 ∉ ∂f(B_ε(x)) or, equivalently, f°(B_ε(x); v) < 0 for some v ∈ X (cf. (2.5)), then there is a unique optimal descent direction v̄ of f on B_ε(x). In particular

v̄ = −ξ̄/‖ξ̄‖  and  f°(B_ε(x); v̄) = −‖ξ̄‖.    (2.10)

Corollary 3.15 and Corollary 3.16 in [23] state some stability of descent directions.

Corollary 2.11.

Let f be Lipschitz continuous of rank L on a neighborhood of B_ε(x) for some x ∈ X, ε > 0, let 0 ∉ ∂f(B_ε(x)), and let ξ̄, v̄ be as in Proposition 2.8. Then every v ∈ X with ‖v − v̄‖ < ‖ξ̄‖/L is a descent direction of f on B_ε(x).

This allows us to obtain descent directions from suitable approximations of the optimal descent direction v̄, which is important for our numerical algorithms.

Corollary 2.12.

Let f be Lipschitz continuous on a neighborhood of B_ε(x) for some x ∈ X, ε > 0, let 0 ∉ ∂f(B_ε(x)), and let ξ̄, v̄ be as in Proposition 2.8. Then for any δ > 0 there is some ρ > 0 such that for every ξ ∈ ∂f(B_ε(x)) with

‖ξ‖ ≤ ‖ξ̄‖ + ρ

we have that −ξ/‖ξ‖ is a descent direction of f on B_ε(x) and satisfies

f°(B_ε(x); −ξ/‖ξ‖) ≤ f°(B_ε(x); v̄) + δ.

3 Descent algorithm

We now introduce a descent algorithm for locally Lipschitz continuous functions f on a Hilbert space X. At each iteration point x we determine an approximation ξ of the norm-minimal element ξ̄ (cf. (2.9)) with respect to some suitable radius ε. We are interested in pairs (ξ, ε) satisfying a condition of sufficient descent in the sense of a generalized Armijo step of the form

(3.1)

where the parameter is fixed for the whole scheme. As new iteration point we then select a point along the induced descent direction with some step size s such that (3.1) still holds with s in place of ε. If 0 ∈ ∂f(B_ε(x)), the norm ‖ξ‖ will be very small and the null step condition

(3.2)

(with a suitable control function that is fixed for the whole scheme) indicates that situation. Here we cannot expect (3.1) in general and we have two possibilities. If ε is on the desired level of accuracy for the minimizer (or critical point), we can stop the algorithm. Otherwise the ball B_ε(x) is too large for an iteration step with sufficient descent. Therefore we decrease ε and look for sufficient descent with a new pair (ξ, ε). Our approximation of the norm-minimal element combined with the analytically justified step size control ensures that we always get sufficient descent for some ε small enough (cf. Lemma 2.6 and also the proof of Theorem 3.24). So that we finally end up with the null step condition on the desired scale, ε has to become sufficiently small during the algorithm, which is ensured by the control functions. But, so that the algorithm does not get stuck in a small ball without a critical point, ε should not approach zero too fast, which is again ensured by the control functions. Thus a careful selection of the step size, which is related to ε, plays a very important role. The algorithm can be improved by choosing suitable equivalent norms at each iteration step.
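The resulting control flow of one outer iteration is easy to state schematically. In the sketch below (Python) the two predicates are generic placeholders, since the precise conditions (3.1) and (3.2) involve the control functions and the approximate norm-minimal element: a very small direction triggers a null step that shrinks the radius, otherwise an Armijo-type backtracking along the descent direction produces the next iterate.

    import numpy as np

    def outer_step(f, x, v, eps, c=0.5, theta=0.5, tol=1e-8):
        """One schematic outer iteration: null step (shrink the ball) or an
        Armijo-type step along the descent direction v.

        Both tests are placeholders for the paper's conditions (3.1)/(3.2)."""
        f_x = f(x)
        if np.linalg.norm(v) <= tol:          # placeholder null step test
            return x, theta * eps             # keep the point, decrease the radius
        t = 1.0                               # Armijo-type backtracking surrogate
        while t > 1e-12 and f(x + t * v) > f_x - c * t * np.linalg.norm(v) ** 2:
            t *= 0.5
        return x + t * v, eps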

Let us start with general requirements for the control functions.

Assumption 3.3.

Suppose that:

  • are non-decreasing functions such that

    where is given inductively by and for all . Notice that this implies

    (otherwise for some and, since is non-decreasing, induction would give ).

  • is a function having at least one of the following properties:

    • implies for any sequences and .

    • For any there is some such that for all and .

Since the conditions for and are quite technical, we provide some typical examples.

Example 3.4 (examples for and ).

  • for  .

  • where, in particular,  .

  • with a constant satisfies (a).

  • with satisfies (b).

  • with non-decreasing and satisfies (a).

  • and satisfy (a) or (b) if , satisfy both (a) or (b), respectively.

  • satisfies (a) and satisfies (b) for any .

As already mentioned, it might be useful to adapt the norm in every iteration (recall that the Newton method can be considered as a descent algorithm with a changing norm at each step). In our algorithm we allow a change of norm in every step as long as we have some uniform equivalence.

Assumption 3.5.

The norm ‖·‖_k chosen in iteration k and the original norm ‖·‖ on X are uniformly equivalent, i.e. there is some constant C ≥ 1 such that

(1/C) ‖u‖ ≤ ‖u‖_k ≤ C ‖u‖  for all u ∈ X and all k.    (3.6)

In practice ‖·‖_k is related to the Hessian of some smooth function at the iteration point and C is some (usually not explicitly known) bound of that Hessian.

Notice that the definition of ∂f(x) and of ∂f(B_ε(x)) as subsets of the dual X* merely uses convergence in X and, thus, does not depend on equivalent norms on X. However, the Riesz mapping identifying X* with the Hilbert space X depends on the norm. Therefore ∂f(B_ε(x)) depends on the norm if it is considered as a subset of X, which we usually do for simplification of notation.
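A small finite-dimensional illustration of this norm dependence (Python; the matrix H and the vectors are hypothetical): if the equivalent inner product is ⟨u, w⟩_H = ⟨Hu, w⟩ with a symmetric positive definite H, then the representer of a fixed linear functional changes from g to H⁻¹g, which is exactly the change a gradient element undergoes when the norm is adapted.

    import numpy as np

    H = np.array([[4.0, 1.0],
                  [1.0, 3.0]])              # symmetric positive definite metric
    g = np.array([1.0, -2.0])               # representer w.r.t. the Euclidean product

    g_H = np.linalg.solve(H, g)             # representer w.r.t. <u, w>_H = <H u, w>

    # Both represent the same functional: <g, w> = <H g_H, w> for every w.
    w = np.array([0.7, 0.2])
    assert np.isclose(g @ w, (H @ g_H) @ w)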

Remark 3.7.

The gradient ∂f(B_ε(x)) based on the norm ‖·‖_k is understood as a subset of X equipped with ‖·‖_k where, in particular, the norm-minimal element is taken with respect to ‖·‖_k.

3.1 Algorithm

Now we introduce the main algorithm, which is based on two subalgorithms presented afterwards. We formulate several results that justify the individual steps and finally show convergence of the algorithm (cf. Figure 1 below for a rough overview). The proofs are collected in Section 3.2.

Algorithm 3.8 (Main Algorithm).

  • Initialization: Choose and satisfying Assumption 3.3,

    and set .

  • Choose some norm subject to Assumption 3.5, some (w.r.t. ), and some .

  • Determine (w.r.t. ) by Algorithm 3.14 such that the null step condition

    (3.9)

    or the condition of sufficient descent

    (3.10)

    is satisfied (recall Remark 3.7 for the meaning of related to and notice that (3.9) and (3.10) can be satisfied simultaneously).

  • In case (3.9) set , increment by one, and go to Step 3.

  • If (3.9) is not true, choose such that the condition of sufficient descent

    (3.11)

    is satisfied (notice that is always possible, since (3.10) is satisfied in this case). Then fix the new iteration point

    (3.12)

    set , increment by one, set , and go to Step 2.

Remark 3.13.

  • Instead of in Step 2 one could also choose

  • The selection of in Step 5 can be done by some line search in direction (cf. Pytlak [27]).

  • One can easily ensure that by requiring that e.g.

    for some with and since the proof of Theorem 3.24 shows that . But in practice this is usually not necessary.

The essential point in Algorithm 3.8 is Step 3 with the computation of a suitable approximation of the norm-smallest element ξ̄ (cf. (2.9)) such that the null step condition or the condition of sufficient descent is satisfied for the given radius. Let us briefly discuss the main idea before we formulate the corresponding subalgorithm. Usually the sets ∂f(y) defining ∂f(B_ε(x)) are not known explicitly. For the algorithm we merely suppose that at least one element of ∂f(y) can always be determined numerically (cf. Remark 3.19 below for a brief discussion of that point). On this basis we select step by step gradient elements at suitable points of B_ε(x) and determine, roughly speaking, the norm-minimal element of the convex hull of all elements collected so far, which is an approximating subset of ∂f(B_ε(x)). In doing so we still manage that the norm of this minimizer decreases sufficiently. Therefore we reach, after finitely many of these inner steps, that the null step condition (3.9) is satisfied if 0 ∈ ∂f(B_ε(x)) or, otherwise, that the current minimizer approximates ξ̄ sufficiently well in the sense of Corollary 2.12. In the latter case the induced direction is a descent direction on B_ε(x) and, by Proposition 2.3 (4), condition (3.10) of sufficient descent is satisfied with the standard norm. Clearly the quality of the algorithm is closely related to the quality of the approximating set and, in some applications, we can improve the quality substantially by choosing a suitable equivalent norm in each step.

Let us now provide the precise algorithm, where quantities determined within it carry their own marker.

Algorithm 3.14.

Let , , , , and be as in Step 3 of Algorithm 3.8 for some .

  • Choose some (w.r.t. ) and some and set . (Typically, but not necessarily, agrees with from Algorithm 3.8.)

  • Set and . If satisfies the null step condition (3.9) or condition (3.10) of sufficient descent, stop and return .

  • Otherwise compute some (w.r.t. ) for some by Algorithm 3.17 such that

    (3.15)
  • Choose some subset such that and set

  • Compute

    increment by one, and go to Step 2.

Notice that, by induction, all collected elements belong to ∂f(B_ε(x)). Hence the set constructed in Step 4 can be considered as an inner approximation of ∂f(B_ε(x)), and its norm-smallest element is an approximation of the norm-smallest element ξ̄ of ∂f(B_ε(x)). Algorithm 3.14 ensures with (3.15) that the norm of this element decreases sufficiently, i.e. it shrinks at a fixed rate from one inner iteration to the next as long as the null step condition (3.9) is not fulfilled (cf. the proof of Lemma 3.30). Hence, the null step condition (3.9) has to be satisfied after finitely many steps if we do not meet condition (3.10) of sufficient descent before. In practice we usually take

with .

Remark 3.16.

Note that the computation of the norm-smallest element in Step 5 is equivalent to the minimization of a quadratic function defined on some simplex. This can easily be done with SQP or semismooth Newton methods (cf. [2, 25, 30]). Since the number of collected gradient elements is small for typical applications, we can neglect the computational time for this minimization compared to that needed for the computation of a gradient of f.
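As an illustration of this quadratic subproblem, consider finitely many collected gradient elements as the rows of a matrix G; the norm-smallest element of their convex hull is Gᵀλ*, where λ* minimizes ‖Gᵀλ‖² over the probability simplex. The sketch below (Python with SciPy; function and variable names are illustrative) solves it with SLSQP, an SQP method of the kind the remark alludes to.

    import numpy as np
    from scipy.optimize import minimize

    def norm_minimal_element(G):
        """Smallest-norm point of conv{rows of G}: minimize ||G.T @ lam||^2
        over the simplex {lam >= 0, sum(lam) = 1}."""
        m = G.shape[0]
        Q = G @ G.T
        res = minimize(
            fun=lambda lam: lam @ Q @ lam,
            jac=lambda lam: 2.0 * Q @ lam,
            x0=np.full(m, 1.0 / m),
            method="SLSQP",
            bounds=[(0.0, 1.0)] * m,
            constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
        )
        return G.T @ res.x

    # Example: gradients (1, 0) and (-1, 1); the smallest-norm convex combination
    # is (0.2, 0.4), attained for lam = (0.6, 0.4).
    print(norm_minimal_element(np.array([[1.0, 0.0], [-1.0, 1.0]])))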

We complete our algorithm with the precise scheme for the selection required in Step 3 of Algorithm 3.14, using a nesting procedure for the relevant segment. New quantities determined in the subalgorithm carry their own marker.

Algorithm 3.17.

Let , , , , , and be as in Step 3 of Algorithm 3.14. (Notice that both the null step condition (3.9) and condition (3.10) of sufficient descent are violated for .)

  • Set , , and (notice that ).

  • Choose some (w.r.t. ).

  • If satisfies (3.15) stop and return .

  • Otherwise choose and such that

    (3.18)

    where we take if (this way the condition of sufficient descent is violated on segment with ).

  • Increment by and go to Step 2.

A slightly simplified overview of the complete algorithm is given in Figure 1.

[Figure 1 (flow diagram): initialization of the parameters, control functions, norm, and starting point; Algorithm 3.14 builds the inner approximation and improves the norm-minimal element, calling Algorithm 3.17 as a subiteration; depending on the outcome, either a null step (radius decreased) or a descent step with a line search along the descent direction follows.]
Figure 1: Flow diagram of Algorithm 3.8.
Remark 3.19.

While the implementation of most steps in Algorithm 3.8 and its subalgorithms should be quite clear, let us briefly discuss how to choose some element ξ ∈ ∂f(x). In our applications we usually have a representation of f that allows the numerical computation of some element of ∂f(x). More precisely, in many cases f is continuously differentiable on an open set whose complement has zero Lebesgue measure. Here we can use Proposition 2.1.5 or Theorem 2.5.1 from Clarke [9] to get single elements of ∂f(x). If f is defined to be the pointwise maximum or minimum of smooth functions, Proposition 2.3.12 or Proposition 2.7.3 in [9] can be used to determine some ξ ∈ ∂f(x). Moreover we can combine this with other calculus rules, e.g. the chain rule [9, Theorem 2.3.9]. Beyond these methods, which are sufficient for the benchmark problems considered in Section 4, discrete approximations of elements of ∂f(x) as e.g. in [4] can also be used. Let us finally state that the presented algorithm only assumes the possibility to compute at least one element of ∂f(x).
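For the pointwise-maximum case this is particularly simple: if f = max_i f_i with C¹ functions f_i, the cited calculus rule represents ∂f(x) as the convex hull of the gradients of the branches that are active at x, so the gradient of any one active branch is an admissible element. A minimal sketch (Python, illustrative names):

    import numpy as np

    def max_function_oracle(fs, grads, x):
        """f(x) = max_i fs[i](x): return the value and one element of the
        generalized gradient, namely the gradient of an active branch."""
        values = np.array([fi(x) for fi in fs])
        i = int(np.argmax(values))          # index of an active branch
        return values[i], grads[i](x)

    # Example: f(x) = max(x_1 + x_2, x_1^2); at (2, -1) the second branch is
    # active, so f = 4 and the returned gradient element is (4, 0).
    fs = [lambda x: x[0] + x[1], lambda x: x[0] ** 2]
    grads = [lambda x: np.array([1.0, 1.0]), lambda x: np.array([2.0 * x[0], 0.0])]
    val, xi = max_function_oracle(fs, grads, np.array([2.0, -1.0]))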

Let us now justify the essential steps of the algorithm, i.e. that the required conditions can be reached and that the iterations typically terminate after finitely many steps. We start with Algorithm 3.17 and consider in particular Step 4.

Proposition 3.20 (properties of Algorithm 3.17).

Let the assumptions of Algorithm 3.17 be satisfied. Then:

  • The choice in Step 4 of Algorithm 3.17 is possible for every .

  • The set

    has positive Lebesgue measure for every .

  • If Algorithm 3.17 does not terminate and, therefore, produces sequences and converging to some , then there is some satisfying (3.15) and is not strictly differentiable at .

  • If is convex on a neighborhood of , then Algorithm 3.17 terminates in Step 3 already for .

Though it is not guaranteed that we find some point satisfying (3.15) after finitely many steps, there is an extremely good chance according to Proposition 3.20 (2). In practice Algorithm 3.17 always terminated, also for the rather complex simulations presented here. Nevertheless there are examples where the algorithm (at least theoretically) does not terminate, as a simple induction argument shows for suitably constructed data.

Remark 3.21.

Typically it is much cheaper in terms of computation time to compute merely the scalar quantity in (3.15) instead of the complete vector. Therefore we compute the vector only if (3.15) is satisfied.

Proposition 3.22 (properties of Algorithm 3.14).

Let the assumptions of Algorithm 3.14 be satisfied, let f be Lipschitz continuous on some neighborhood of B_ε(x), and suppose that Algorithm 3.17 always terminates. Then Algorithm 3.14 stops after finitely many steps and returns some element satisfying (3.9) or (3.10).

Proposition 3.23 (properties of Algorithm 3.8).

Let the assumptions of Algorithm 3.8 be satisfied and let x be an iteration point from that algorithm. Then:

  • If is related to for some , then there exists satisfying the null step condition (3.9) or condition (3.10) of sufficient descent.

  • If , then there are only finitely many such that (3.9) is satisfied.

Though Proposition 3.23 (1) already follows from Proposition 3.22, we will still give a brief independent proof of it in the next section.

Summarizing we can say that, in principle, the presented algorithm always works and cannot get stuck, i.e. at most finitely many subiterations are necessary to find a new iteration point. The only caveat is that Algorithm 3.17 might not terminate, which, however, is quite unlikely according to Proposition 3.20 (2) and which never happened in our simulations.

Let us finally confirm that the presented descent algorithm can reach both minimizers and critical points of f.

Theorem 3.24 (accumulation points are critical points).

Let the assumptions of Algorithm 3.8 be satisfied and let (x_k) be a corresponding sequence of iteration points. Then the sequence (f(x_k)) is strictly decreasing. Moreover, if x̄ is an accumulation point of (x_k), then 0 ∈ ∂f(x̄) and f(x_k) → f(x̄).

As a consequence we can formulate a more precise statement.

Proposition 3.25.

Let the assumptions of Algorithm 3.8 be satisfied, let (x_k) be a corresponding sequence of iteration points with step sizes (s_k), and suppose that (x_k) is relatively compact. Then each accumulation point of (x_k) is a critical point of f and, if the set of accumulation points contains only finitely many critical points, (x_k) converges to a critical point of f. Moreover, if (x_k) is not convergent, then it has no isolated accumulation point.

Remark 3.26.

If X is finite dimensional and (x_k) is bounded, then (x_k) is relatively compact.

3.2 Proofs

Proof of Proposition 3.20.  Since does not satisfy (3.9), we have . By construction .

(1) Since (3.10) is not fulfilled, we have (3.18) for with . Assume that (3.18) holds for . Then

(3.27)

If neither and