In this paper we present a new descent algorithm to find local minima or critical points of a locally Lipschitz continuous function on a Hilbert space. For the minimization of a nonsmooth function
numerous algorithms based on quite different methods have been proposed in the literature. Let us mention, without claiming completeness, bundle-type methods (cf. Alt , Frangioni , Gaudioso & Monaco , Hiriart-Urruty , Kiwiel , Makela & Neittaanmaki , Mifflin , Schramm , Wolfe , Zowe ), proximal point and splitting methods such as FISTA or the primal-dual method (cf. Beck , Eckstein & Bertsekas , Chambolle & Pock ), gradient sampling algorithms (cf. Burke, Lewis & Overton , Kiwiel ), algorithms based on smoothing techniques (cf. Polak & Royset ) and the discrete gradient method (cf. Bagirov & Karasozen ).
Bundle-type methods, proximal point methods, and splitting methods require the objective to be convex or to have some other special structure. Many algorithms for locally Lipschitz continuous functions, such as the discrete gradient method, need to know the entire generalized gradient at given points. Stochastic methods like the gradient sampling algorithm are robust without knowledge of the entire generalized gradient, but at the cost of high computational effort. Therefore they are limited to minimization problems on low-dimensional spaces.
Recall that the derivative indicates a direction of descent near a given point. However, if the direction of descent changes rapidly in a small neighborhood of that point, which is typical for functions having large second derivatives or that are even nonsmooth, then some knowledge of the function on a whole neighborhood is necessary to determine a suitable direction of descent.
To obtain a new robust and fast algorithm we combine ideas from bundle methods and gradient sampling methods. We use the concept of gradients on sets as introduced in Mankau & Schuricht  (which extends ideas from Goldstein ). Here, similar to gradient sampling methods, generalized gradients on a whole neighborhood of the current point are considered for the determination of a suitable descent direction. But, in contrast to e.g. Burke, Lewis & Overton  and Kiwiel , the set-valued gradient on a neighborhood of the current point is not approximated stochastically. We rather use an elaborate recursive inner approximation, coupled with the computation of related descent directions, until a generalized Armijo condition is satisfied (a condition similar to that used in Alt  and Schramm  in connection with the ε-subdifferential). Finally, a line search along a direction of sufficient descent gives the next iteration point (cf. Pytlak ). For better performance we may also adapt the norm in each step. It turns out that our algorithm requires substantially fewer gradient computations than the methods in  and . Therefore it is also applicable in high-dimensional spaces, as needed for variational problems.
For a locally Lipschitz continuous function our method merely demands that, at any point, both the function value and at least one element of the generalized gradient (in the sense of Clarke ) can be computed. Notice that this mild requirement is assumed in all of the above-mentioned gradient-based algorithms and that it is typically met in applications. In an upcoming paper an extended algorithm is presented in which quasi-Newton methods and preconditioning methods are incorporated by a suitable change of norm in each iteration step.
Section 2 gives a brief overview of gradients on sets as needed for our treatment. The algorithm and some convergence results are given in Section 3. After the formulation of the condition of sufficient descent and of several general assumptions, Section 3.1 provides the main Algorithm 3.8 and its properties. Algorithm 3.8 calls Algorithm 3.14 for the computation of a suitable inner approximation of the set-valued gradient on a neighborhood of the current iteration point and the computation of a related descent direction, while Step 3 of Algorithm 3.14 calls Algorithm 3.17 for a subiteration. Figure 1 gives an overview of the whole algorithm, and several statements justify its essential steps. Theorem 3.24 shows that every accumulation point of the iteration points produced by Algorithm 3.8 is a critical point in the sense of Clarke. The proofs are collected in Section 3.2. Comprehensive numerical tests of our algorithm for classical benchmark problems can be found in Section 4. There the simulations are also compared with results from Burke, Lewis & Overton , Kiwiel , Alt , Schramm  and the BFGS algorithm.
Notation: The underlying space is a Hilbert space (notice that any Hilbert space is uniformly convex and reflexive) with scalar product, where the dual is always identified with the space itself. For a set we write its closure, its convex hull and its closed convex hull, and we denote the open ε-neighborhoods of a point and of a set, respectively. We further write the open segment between two points and, in particular, the open interval, and we denote the positive real numbers. For a locally Lipschitz continuous function we write Clarke's generalized directional derivative at a point in a given direction and Clarke's generalized gradient at a point (cf. Clarke ).
2 Gradient on sets
Let a locally Lipschitz continuous function on a Hilbert space be given. Clarke's generalized gradient at a point and the corresponding generalized directional derivative in a given direction express the behavior of the function at that point (cf. Clarke ). However, for the construction of a descent step in a numerical scheme, some information about the behavior of the function on a whole neighborhood is useful in general. In particular, for describing the behavior on a whole ε-ball, we use a set-valued gradient and a corresponding generalized directional derivative as introduced in Mankau & Schuricht  by means of Clarke's pointwise quantities. For the convenience of the reader we present a brief, specialized summary of that material as needed for our treatment.
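To build intuition for these set-valued quantities, here is a crude one-dimensional numerical sketch: we collect single Clarke gradients at sample points of a ball and take their convex hull (in 1D, an interval), whose support function then approximates the directional derivative on the ball. The function `f(x) = |x|`, the uniform sampling, and all names are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def clarke_gradient(x):
    # One element of Clarke's generalized gradient of f(x) = |x|;
    # at 0 the gradient is the whole interval [-1, 1], we pick 1.
    return np.sign(x) if x != 0.0 else 1.0

def sampled_gradient_on_ball(x, eps, n_samples=101):
    """Crude inner approximation of the gradient of f on the ball
    [x - eps, x + eps]: collect single Clarke gradients at sample
    points; in 1D their convex hull is the interval [min, max]."""
    ys = np.linspace(x - eps, x + eps, n_samples)
    gs = np.array([clarke_gradient(y) for y in ys])
    return gs.min(), gs.max()

def directional_derivative_on_ball(x, eps, d):
    # Support function of the approximated set in direction d.
    lo, hi = sampled_gradient_on_ball(x, eps)
    return max(lo * d, hi * d)
```

For `f(x) = |x|` and a ball around `x = 0.05` of radius `0.1`, the sampled set recovers the full interval `[-1, 1]`, while far from the kink it collapses to the single classical derivative.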
For we define the gradient of on by
(notice that the closed convex hull agrees with the weakly closed convex hull) and the directional derivative of on in direction by
We have the following basic properties (cf. Proposition 2.3 and Corollary 2.10 in ).
Let be Lipschitz continuous of rank on a neighborhood of with and . Then
is nonempty, convex, weakly compact and bounded by .
is finite, positively homogeneous, subadditive, and Lipschitz continuous of rank . Moreover it is the support function of , i.e.
Let with and let with . Then
Let with and let . Then
Regularity of at , i.e. , implies regularity of on some by Proposition 2.16 in .
Let be locally Lipschitz continuous and let for some . Then there exist and with such that
Moreover, by (2.5).
Theorem 3.10 of  ensures the existence of optimal descent directions and of norm-minimal elements in .
Let be Lipschitz continuous on a neighborhood of for some , . Then there is a unique with
If or, equivalently, for some (cf. (2.5)), then there is a unique optimal descent direction on . In particular
Corollary 3.15 and Corollary 3.16 in  state some stability of descent directions.
Let be Lipschitz continuous of rank on a neighborhood of for some , , let , and let , be as in Proposition 2.8. Then every with is a descent direction on .
This allows us to obtain descent directions from suitable approximations, which is important for our numerical algorithms.
Let be Lipschitz continuous on a neighborhood of for some , , let , and let , be as in Proposition 2.8. Then for any there is some such that for every with
we have that is a descent direction on and satisfies
3 Descent algorithm
We now introduce a descent algorithm for locally Lipschitz continuous functions on a Hilbert space. At each iteration point we determine an approximation of the norm-minimal element (cf. (2.9)) with respect to some suitable radius. We are interested in pairs satisfying a condition of sufficient descent in the sense of a generalized Armijo step of the form
where is fixed for the whole scheme. As new iteration point we then select for some such that (3.1) still holds with instead of . If , the norm will be very small and the null step condition
(with a suitable control function that is fixed for the whole scheme) indicates that situation. Here we cannot expect (3.1) in general and we have two possibilities. If the radius is already at the desired level of accuracy for the minimizer (or critical point), we can stop the algorithm. Otherwise the ball used is too large for an iteration step with sufficient descent. Therefore we decrease the radius and look for sufficient descent with a new pair. Our approximation, combined with the analytically justified step size control, ensures that we always get sufficient descent for a small enough radius (cf. Lemma 2.6 and also the proof of Theorem 3.24). So that we finally end up with the null step condition on the desired scale, the radius has to become sufficiently small during the algorithm, which is ensured by the control functions. But, so that the algorithm does not get stuck in a small ball without a critical point, the radius should not approach zero too fast, which is likewise ensured by the control functions. Thus a careful selection of the step size, which is related to the radius, plays a very important role. The algorithm can be improved by choosing suitable equivalent norms at each iteration step.
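The interplay of sufficient descent, null steps and radius reduction described above can be sketched in simplified form. All constants, the one-dimensional setting, and the oracle interface are illustrative assumptions; the actual tests (3.1) and (3.2) involve the control functions of the paper:

```python
def armijo_descent_step(f, x, direction_oracle, eps, beta=0.5, delta=1e-8,
                        t0=1.0, shrink=0.5, max_halvings=50):
    """One outer step of the scheme sketched above (illustrative constants).
    direction_oracle(x, eps) returns an approximate norm-minimal element g
    of the gradient on the eps-ball around x.  If |g| is below the null-step
    threshold we shrink eps; otherwise we backtrack along d = -g until a
    sufficient-descent (generalized Armijo) condition holds."""
    g = direction_oracle(x, eps)
    if abs(g) <= delta:             # null step: near-critical on this ball
        return x, eps * shrink      # decrease the radius and retry
    d = -g
    t = t0
    for _ in range(max_halvings):
        if f(x + t * d) <= f(x) - beta * t * g * g:  # sufficient descent
            return x + t * d, eps
        t *= shrink
    return x, eps * shrink          # fall back to a null step
```

For `f(x) = |x|` with an exact min-norm oracle, a step started at `x = 1` jumps to the minimizer, while a start inside the kink's ball triggers a null step that halves the radius.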
Let us start with general requirements for the control functions , , and .
are non-decreasing functions such that
where is given inductively by and for all . Notice that this implies
(otherwise for some and, since is non-decreasing, induction would give ).
is a function having at least one of the following properties:
implies for any sequences and .
For any there is some such that for all and .
Since the conditions for and are quite technical, we provide some typical examples.
Example 3.4 (examples for and ).
where, in particular, .
with a constant satisfies (a).
with satisfies (b).
with non-decreasing and satisfies (a).
and satisfy (a) or (b) if , satisfy both (a) or (b), respectively.
satisfies (a) and satisfies (b) for any .
As already mentioned, it might be useful to adapt the norm in every iteration (recall that Newton's method can be considered as a descent algorithm with a changing norm at each step). In our algorithm we allow a change of norm in every step as long as we have some uniform equivalence.
The norms and on are uniformly equivalent, i.e. there is some such that
In practice is related to the Hessian of some smooth function at iteration point and is some (usually not explicitly known) bound of that Hessian.
Notice that the definition of and of as subset of merely uses convergence in and, thus, does not depend on equivalent norms on . However the Riesz mapping identifying with the Hilbert space depends on the norm. Therefore depends on the norm if it is considered as subset of , which we usually do for simplification of notation.
The gradient based on the norm is understood as subset of equipped with where, in particular, is taken with respect to .
Now we introduce the main algorithm based on two subalgorithms presented afterwards. We formulate several results that justify the single steps and that finally show convergence of the algorithm (cf. Figure 1 below for a rough overview). The proofs are collected in Section 3.2.
Algorithm 3.8 (Main Algorithm).
The essential point in Algorithm 3.8 is Step 3 with the computation of a suitable approximation of the norm-smallest element (cf. (2.9)) such that either the null step condition or the condition of sufficient descent is satisfied. Let us briefly discuss the main idea before we formulate the corresponding subalgorithm. Usually the sets defining the set-valued gradient are not known explicitly. For the algorithm we merely suppose that at least one element can always be determined numerically (cf. Remark 3.19 below for a brief discussion of that point). On this basis we select step by step gradient elements at suitable points and determine coefficients such that, roughly speaking, the convex hull of all these elements is an approximating subset of the set-valued gradient containing a norm-minimal element. In doing so we still ensure that the norm of this element decreases sufficiently. Therefore, after sufficiently many steps, we reach that the null step condition (3.9) is satisfied or, otherwise, that the norm-minimal element is approximated sufficiently well in the sense of Corollary 2.12. In the latter case we have a descent direction and, by Proposition 2.3 (4),
with from Algorithm 3.8, i.e. condition (3.10) of sufficient descent is satisfied with the standard norm. Clearly the quality of the algorithm is closely related to the quality of the approximating set and, in some applications, we can improve the quality substantially by choosing a suitable equivalent norm in each step.
Let us now provide the precise algorithm where quantities determined here are marked by .
Notice that for all by induction and that
Hence can be considered as some inner approximation of and the norm-smallest element is an approximation of the norm-smallest element . Algorithm 3.14 ensures with (3.15) that decreases sufficiently, i.e. we have for some as long as the null step condition (3.9) is not fulfilled (cf. the proof of Lemma 3.30). Hence, the null step condition (3.9) has to be satisfied for some after finitely many steps if we do not meet condition (3.10) of sufficient descent before. In practice we usually take
Note that the computation of the norm-minimal element is equivalent to the minimization of a quadratic function defined on a simplex. This can easily be done with SQP or semismooth Newton methods (cf. [2, 25, 30]). Since this subproblem is low-dimensional in typical applications, we can neglect its computational time compared to that needed for the computation of a gradient.
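The quadratic subproblem mentioned here, finding the norm-minimal element of the convex hull of finitely many gradients, can be sketched as follows. Instead of SQP or a semismooth Newton method we use a simple Frank-Wolfe iteration with exact line search; this is an illustrative stand-in, not the solver of the paper:

```python
import numpy as np

def min_norm_in_hull(G, iters=500):
    """Approximate the norm-minimal element of conv{g_1, ..., g_m},
    where the g_i are the columns of G, i.e. minimize the quadratic
    ||G @ lam||^2 over the unit simplex in lam (Frank-Wolfe with
    exact line search for this quadratic)."""
    m = G.shape[1]
    lam = np.full(m, 1.0 / m)          # start at the barycenter
    p = G @ lam                        # current point in the hull
    for _ in range(iters):
        scores = G.T @ p
        i = int(np.argmin(scores))     # vertex g_i minimizing <p, g_i>
        gap = p @ p - scores[i]        # Frank-Wolfe duality gap
        if gap <= 1e-14:
            break                      # p is (numerically) optimal
        q = G[:, i]
        diff = p - q
        gamma = min(1.0, max(0.0, (p @ diff) / (diff @ diff)))
        lam = (1.0 - gamma) * lam
        lam[i] += gamma
        p = (1.0 - gamma) * p + gamma * q
    return p, lam
```

For gradients `1` and `-1` of `|x|` near the kink, the minimizer is `0`, correctly signaling a near-critical point; for gradients `(1, 0)` and `(0, 1)` it is the midpoint `(0.5, 0.5)`.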
We complete our algorithm with the precise scheme about the selection of in Step 3 of Algorithm 3.14 by some nesting procedure for the segment . New quantities determined in the subalgorithm are marked by .
A slightly simplified overview of the complete algorithm is given in Figure 1.
While the implementation of most steps in Algorithm 3.8 and its subalgorithms should be quite clear, let us briefly discuss how to choose an element of the generalized gradient. In our applications we usually have a representation of the function that allows the numerical computation of such an element. More precisely, in many cases the function is continuously differentiable on an open set whose complement has zero Lebesgue measure. Here we can use Proposition 2.1.5 or Theorem 2.5.1 from Clarke  to get single gradient elements. If the function is defined as the pointwise maximum or minimum of smooth functions, Proposition 2.3.12 or Proposition 2.7.3 in  can be used to determine such an element. Moreover, we can combine this with other calculus rules, e.g. the chain rule [9, Theorem 2.3.9]. Beyond these methods, which are sufficient for the benchmark problems considered in Section 4, discrete approximations of gradient elements as e.g. in  can also be used. Let us finally state that the presented algorithm assumes that at least one element of the generalized gradient can be computed.
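For instance, when the function is a pointwise maximum of smooth pieces, the max-rule yields one element of the generalized gradient simply by evaluating the gradient of an active piece. A minimal sketch, where the `pieces` interface of `(f_i, grad_f_i)` pairs is an illustrative assumption:

```python
def subgradient_of_max(pieces, x):
    """For f = max_i f_i with smooth f_i, the gradient of any piece that
    is active at x is one element of Clarke's generalized gradient at x
    (the max-rule cited in the text).  `pieces` is a list of pairs
    (f_i, grad_f_i) of callables."""
    vals = [fi(x) for fi, _ in pieces]
    i = max(range(len(vals)), key=vals.__getitem__)  # an active index
    return pieces[i][1](x)
```

For `f(x) = max(x, -x) = |x|` this returns `1` for positive and `-1` for negative arguments; at the kink it returns the gradient of whichever active piece is picked, which is still a valid element.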
Let us now justify the essential steps of the algorithm, i.e. that the required conditions can be reached and that the iterations typically terminate after finitely many steps. We start with Algorithm 3.17 and consider in particular Step 4.
Proposition 3.20 (properties of Algorithm 3.17).
Let the assumptions of Algorithm 3.17 be satisfied. Then:
has positive Lebesgue measure for every .
Though it is not guaranteed that we find an element satisfying (3.15) after finitely many steps, the chances are extremely good according to Proposition 3.20 (2). In practice Algorithm 3.17 always terminated, even for the rather complex simulations presented here. Nevertheless there are examples where the algorithm does not terminate (at least theoretically), as a simple induction argument shows, e.g., for with and .
Proposition 3.22 (properties of Algorithm 3.14).
Proposition 3.23 (properties of Algorithm 3.8).
Let the assumptions of Algorithm 3.8 be satisfied and let be an iteration point from that algorithm. Then:
If , then there are only finitely many such that (3.9) is satisfied.
Summarizing, we can say that, in principle, the presented algorithm always works and cannot get stuck, i.e. at most finitely many subiterations are necessary to find a new iteration point. The only caveat is that Algorithm 3.17 might not terminate which, however, is quite unlikely according to Proposition 3.20 (2) and which never happened in our simulations.
Let us finally confirm that the presented descent algorithm can reach both minimizers and critical points of .
Theorem 3.24 (accumulation points are critical points).
Let the assumptions of Algorithm 3.8 be satisfied and let be a corresponding sequence of iteration points. Then is strictly decreasing. Moreover, if is an accumulation point of , then and .
As a consequence we can formulate a more precise statement.
Let the assumptions of Algorithm 3.8 be satisfied, let be a corresponding sequence of iteration points with step sizes , and suppose that is relatively compact. Then each accumulation point of is a critical point of and, if contains only finitely many critical points, converges to a critical point of . Moreover, if is not convergent, then has no isolated accumulation point.
If and is bounded, then is relatively compact.