A note on solving nonlinear optimization problems in variable precision

12/09/2018
by   S. Gratton, et al.
ENSEEIHT
UNamur
0

This short note considers an efficient variant of the trust-region algorithm with dynamic accuracy proposed Carter (1993) and Conn, Gould and Toint (2000) as a tool for very high-performance computing, an area where it is critical to allow multi-precision computations for keeping the energy dissipation under control. Numerical experiments are presented indicating that the use of the considered method can bring substantial savings in objective function's and gradient's evaluation "energy costs" by efficiently exploiting multi-precision computations.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 8

07/07/2021

Performance Evaluation of Mixed-Precision Runge-Kutta Methods

Additive Runge-Kutta methods designed for preserving highly accurate sol...
07/17/2018

Minimizing convex quadratic with variable precision Krylov methods

Iterative algorithms for the solution of convex quadratic optimization p...
07/07/2021

A stochastic first-order trust-region method with inexact restoration for finite-sum minimization

We propose a stochastic first-order trust-region method with inexact fun...
09/24/2021

Mixed-precision explicit stabilized Runge-Kutta methods for single- and multi-scale differential equations

Mixed-precision algorithms combine low- and high-precision computations ...
09/09/2020

Efficient Parameter Selection for Scaled Trust-Region Newton Algorithm in Solving Bound-constrained Nonlinear Systems

We investigate the problem of parameter selection for the scaled trust-r...
06/08/2021

Using a New Nonlinear Gradient Method for Solving Large Scale Convex Optimization Problems with an Application on Arabic Medical Text

Gradient methods have applications in multiple fields, including signal ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Motivation and objectives

Two recent evolutions in the field of scientific computing motivate the present note. The first is the growing importance of deep-learning methods for artificial intelligence, and the second is the acknowledgement by computer architects that new high-performance machines must be able to run the basic tools of deep learning very efficiently. Because the ubiquitous mathematical problem in deep learning is nonlinear nonconvex optimization, it is therefore of interest to consider how to solve this problem in ways that are as efficient as possible on new very powerful computers. As it turns out, one of the crucial aspects in designing such machines and the algorithms that they use is mastering energy dissipation. Given that this dissipation is approximately proportional to chip surface and that chip surface itself approximately proportional to the square of the number of binary digits involved in the calculation

[19, 32, 23, 27], being able to solve nonlinear optimization problems with as few digits as possible (while not loosing on final accuracy) is clearly of interest.

This short note’s sole purpose is to show that this is possible and that algorithms exist which achieve this goal and whose robustness significantly exceed simple minded approaches. The focus is on unconstrained nonconvex optimization, the most frequent case in deep learning applications. Since the cost of solving such problems is typically dominated by that of evaluating the objective function (and derivatives if possible), our aim is therefore to propose optimization methods which are able to exploit/specify varying levels of preexisting arithmetic precision for these evaluations. Because of this feature, optimization in this context differs from other better-studied frameworks where the degree of accuracy may be chosen in a more continuous way, such as adaptive techniques for optimization with noisy functions (see [18, 9, 10]

) or with functions whose values and that of its derivatives are estimated by some (possibly dynamic) sampling process (see

[35, 12, 3, 13, 6, 5], for instance). We propose here a suitable adaptation of the dynamic-accuracy trust-region framework proposed by Carter in [10] and by Conn, Gould and Toint in Section 10.6 of [16] to the context of multi-precision computations. Our proposal complements that of [25], where inexactness is also used for energy saving purposes, and where its exploitation is restricted to the inner linear algebra work of the solution algorithm, while still assuming exact values of the nonlinear function involved(1)(1)(1)The solution of nonlinear systems of equations is considered rather than unconstrained optimization.. Note that the framework of inexact computation has already been discussed in othe contexts [1, 30, 24, 25].

The paper is organized as follows. Section 2 presents the algorithmic framework using variable accuracy. Section 3 reports encouraging numerical results suggesting the potential of the approach, while conclusions and perspectives for further research are discussed in Section 4.

2 Nonconvex optimization with dynamic accuracy

We start by briefly recalling the context of the dynamic-accuracy trust-region technique of [16]. Consider the unconstrained minimization problem

(2.1)

where is a sufficiently smooth function from into IR, and where the value of the objective can be approximated with a prespecified level of accuracy. This is to say, given and a an accuracy levels , the function such that

(2.2)

The crucial difference with a more standard approach for optimization with noisy function is that the required accuracy level may be specified by the minimization algorithm itself within a prespecified set, with the understanding that the more accurate the requirement specified by , the higher the “cost” of evaluating .

We propose to use a trust-region method, that is an iterative algorithm where, at iteration , a first-order model is approximately minimized on in a ball (the trust region) centered at the current iterate and of radius (the trust-region radius), yielding a trial point. The value of the reduction in the objective function achieved at the trial point is then compared to that predicted by the model. If the agreement is deemed sufficient, the trial point is accepted as the next iterate and the radius kept the same or increased. Otherwise the trial point is rejected and the radius reduced. This basic framework, whose excellent convergence properties and outstanding numerical performance are well-known, was modified in Section 10.6 of [16] to handle the situation when only is known, rather than . It has already been adapted to other contexts (see [2] for instance).

However, the method has the serious drawback of requiring exact gradient values, even if function value may be computed inexactly. We may then call on Section 8.4 of [16] which indicates that convergence of trust-region methods to first-order critical points may be ensured with inexact gradients. If we now define the approximate gradient at as the function such that

(2.3)

this convergence is ensured provided, througout the iterations,

(2.4)

that is if the relative error on the gradient remains suitably small. Note that this relative error need not tend to zero, but that the absolute error will when convergence occurs (see [16, Theorem 8.4.1, p. 281]). Also note that the concept of a relative gradient error is quite natural if one assumes that is computed using an arithmetic with a limited number of significant digits.

We now propose a variant of this scheme, which we state as Algorithm 2 2.

Algorithm 2.1: Trust region with dynamic accuracy on and (TR1DA) Step0: Initialization: An initial point , an initial trust-region radius , an initial accuracy levels and a desired final gradient accuracy are given. The constants , , , and are also given and satisfy

Compute and set . Step 1: Check for termination: If or , choose and compute such that
(2.5)
In all cases, terminate if
(2.6)
Step 2: Step calculation: Select a symmetric Hessian approximation and compute a step such that which sufficiently reduces the model
(2.7)
in the sense that
(2.8)
Step 3: Evaluate the objective function: Select
(2.9)
and compute . If , recompute .
Step 4: Acceptance of the trial point: Define the ratio
(2.10)
If , then define and set . Otherwise set , and .
Step 5: Radius update: Set
(2.11)
Increment by 1 and go to Step 2.

This algorithm differs from that presented in [16, p. 402] on two accounts. First, it incorporates inexact gradients, as we discussed above. Second, it does not require that the step is recomputed whenever a more accurate objective function’s value is required in Step 2. This last feature makes the algorithm more efficient. Moreover, it does not affect the sequence of iterates since the value of the model decrease predicted by the step is independent of the objective function value. As a consequence, the convergence to first-order points studied in Section 10.6.1 of [16] (under the assumption that the approximate Hessians remain bounded) still applies. In what follows, we choose to contruct this approximation using a limited-memory symmetric rank-one (SR1) quasi-Newton update(2)(2)(2)Numerical experiments not reported here suggest that our default choice of remembering 15 secant pairs gives good performance, although keeping a smaller number of pairs is still acceptable. based on gradient differences [29, Section 8.2].

As it turns out, this variant of [16] is quite close to the method proposed by Carter in [10], the main difference being that the latter uses fixed tolerances in slighly different ranges(3)(3)(3) [10] requires while we require . A fixed value is also used for , whose upper bound depends on . The Hessian approximation is computed using an unsafeguarded standard BFGS update..

We immediately note that, at termination,

(2.12)

where we have used the triangle inequality, (2.3) and (2.4). As a consequence, the TRDA algorithm terminates at a true -approximate first-order-necessary minimizer. Moreover, [16, Theorem 8.4.5] and the development of p. 404-405 in the same reference indicate that the decrease in objective function value at successful iterations is bounded below by a multiple of . As a consequence, we may immediately deduce from now standard arguments (see [21] for the first derivation), that the maximum number of iterations needed by Algorithm TRDA to find such an -approximate first-order-necessary minimizer is . Moreover, this bound was proved sharp in most cases in [11], even when exact and bounded second derivatives are used.

3 Numerical experience

We now present some numerical evidence that the TRDA algorithm can perform well and provide significant savings in terms of energy costs, when these are dominated by the function and gradient evaluations. Our experiments are run in Matlab (64bits) and use a collection of 86 small unconstrained test problems(4)(4)(4)The collection of [8] and a few other problems, all available in Matlab. detailed in Table 3.1. In what follows, we declare a run successful when an iterate is found such that (2.6), and hence (2.12), hold in at most 1000 iterations.

Name dim. source Name dim. source Name dim. source
argauss 3 [28, 8] arglina 10 [28, 8] arglinb 10 [28, 8]
arglinc 10 [28, 8] argtrig 10 [28, 8] arwhead 10 [14, 20]
bard 3 [8] bdarwhd 10 [20] beale 2 [8]
biggs6 6 [28, 8] box 3 [28, 8] booth 2 [8]
brkmcc 2 [8] brazthing 2 - brownal 10 [28, 8]
brownbs 2 [28, 8] brownden 4 [28, 8] broyden3d 10 [28, 8]
broydenbd 10 [28, 8] chebyqad 10 [28, 8] cliff 2 [8]
clustr 2 [8] cosine 2 [20] crglvy 10 [28, 8]
cube 2 [8] dixmaana 12 [17, 8] dixmaanj 12 [17, 8]
dixon 10 [8] dqrtic 10 [8] edensch 5 [26]
eg2 10 [15, 20] eg2s 10 [15, 20] engval1 10 [28, 8]
engval2 10 [28, 8] freuroth 4 [28, 8] genhumps 2 [20]
gottfr 2 [8] gulf 4 [28, 8] hairy 2 [20]
helix 3 [28, 8] hilbert 10 [8] himln3 10 [8]
himm25 10 [8] himm27 10 [8] himm28 10 [8]
himm29 10 [8] himm30 10 [8] himm33 10 [8]
hypcir 2 [8] indef 5 [20] integreq 2 [28, 8]
jensmp 2 [28, 8] kowosb 4 [28, 8] lminsurf 25 [22, 8]
mancino 10 [34, 8] mexhat 2 [7] meyer3 3 [28, 8]
morebv 12 [28, 8] msqrtals 16 [8] msqrtbls 16 [8]
nlminsurf 25 [22, 8] osbornea 5 [28, 8] osborneb 11 [28, 8]
penalty1 10 [28, 8] penalty2 10 [28, 8] powellbs 2 [28, 8]
powellsg 4 [28, 8] powellsq 2 [8] powr 10 [8]
recipe 2 [8] rosenbr 2 [28, 8] schmvett 3 [33, 8]
scosine 2 [20] sisser 2 [8] spmsqrt 10 [8]
tquartic 10 [8] tridia 10 [8] trigger 7 [31, 8]
vardim 10 [28, 8] watson 12 [28, 8] wmsqrtals 16 -
wmsqrtbls 16 - woods 12 [28, 8] zangwil2 2 [8]
zangwil3 3 [8]
Table 3.1: The test problems

In the following set of experiments with the TRDA variants, we assume that the objective function’s value and the gradient can be computed in double, single or half precision (with corresponding accuracy level equal to machine precision, half machine precision or quarter machine precision). Thus, when the TRDA algorithm specifies an accuracy level or , this may not be attainable as such, but the lower of the three available levels of accuracy is then chosen to perform the computation in (possibly moderately) higher accuracy than requested. The equivalent double-precision costs of the evaluations of and in single precision are then computed by dividing the cost of evaluation in double precision by four(5)(5)(5)Remember it is proportional to the square of the number of significant digits.. Those for half precision are correspond to double-precision costs divided by sixteen.

To set the stage, our first experiment starts by comparing three variants of the TRDA algorithm:

LMQN:

a version using for all (i.e. using the full double precision arithmetic throughout),

LMQN-s:

a version using single precision evaluation of the objective function and gradient for all ,

LMQN-h:

a version using half precision evaluation of the objective function and gradient for all .

These variants correspond to a simple minded approach where the expensive parts of the optimization calculation are conducted in reduced precision without any further adaptive accuracy management. For each variant, we report, for three different values of the final gradient accuracy ,

  1. the robustness as given by the number of successful solves for the relevant (nsucc),

  2. the average number of iterations (its.),

  3. the average equivalent double-precision costs of objective function’s evaluations (costf),

  4. the average equivalent double-precision costs of gradient’s evaluations (costg),

  5. the ratio of the average number of iterations used by the variant compared to that used by LMQN, computed on problems solved by both LMQN and the variant (rel. its.),

  6. the ratio of the average equivalent double-precision evaluation costs for used by the variant compared to that used by LMQN, computed on problems solved by both LMQN and the variant (rel. costf),

  7. the ratio of the average equivalent double-precision evaluation costs for used by the variant compared to that used by LMQN, computed on problems solved by both LMQN and the variant (rel. costg),

where all averages are computed on a sample of 20 independent runs. We are interested in making the values in the last two indicators as small as possible while maintaining a reasonable robustness (reported by nsucc).

Variant nsucc its. costf costg rel. its. rel. costf rel. costg
1e-03 LMQN 82 41.05 42.04 42.04
LMQN-s 78 41.40 42.60 42.60 1.03 1.04 1.04
LMQN-h 22 16.95 1.12 1.12 0.97 0.06 0.06
1e-05 LMQN 80 46.34 47.38 47.38
LMQN-s 48 47.79 48.96 48.96 1.08 1.08 1.08
LMQN-h 10 17.80 1.18 1.18 1.38 0.08 0.08
1e-07 LMQN 67 62.76 63.85 63.85
LMQN-s 25 28.28 28.96 28.96 0.82 0.81 0.81
LMQN-h 6 15.83 1.05 1.05 0.97 0.06 0.06
Table 3.2: Results for LMQN-s and LMQN-h compared to LMQN

Table 3.2 shows that the variants LMQN-s and LMQN-h compare very poorly to LMQN for two reasons. The first and most important is the quickly decreasing robustness when the final gradient accuracy gets tighter. The second is that, even for the cases where the robustness is not too bad (LMQN-s for and maybe ), we observe no improvement in costf and costg (as reported in the two last columns of the table). However, and as expected, when LMQN-h happens to succeed, it does so at a much lower cost, both for and .

Our second experiment again compares 3 variants:

LMQN:

as above,

iLMQN-a:

a variant of the TRDA algorithm where, for each

(3.1)
iLMQN-b:

a variant of the TRDA algorithm where, for each , is chosen as in (3.1) and

(3.2)

The updating formulae for iLMQNa are directly inspired by (2.9) and (2.4) above. The difference between the two updates of appear to give contrasted but interesting outcomes, as we discuss below. The results obtained for these variants are presented in Table 3.3 in the same format as that used for Table 3.2, the comparison in the last three columns being again computed with respect to LMQN.

Variant nsucc its. costf costg rel. its. rel. costf rel. costg
1e-03 LMQN 82 41.05 42.04 42.04
iLMQN-a 80 50.05 9.88 6.11 1.23 0.24 0.15
iLMQN-b 76 52.67 13.85 3.34 1.36 0.35 0.08
1e-05 LMQN 80 46.34 47.38 47.38
iLMQN-a 75 75.92 36.21 24.77 1.40 0.63 0.42
iLMQN-b 63 72.57 39.85 4.60 1.78 0.95 0.11
1e-07 LMQN 67 62.76 63.85 63.85
iLMQN-a 47 65.83 58.97 37.50 1.18 1.03 0.65
iLMQN-b 40 87.35 95.09 5.52 1.39 1.45 0.09
Table 3.3: Results for the variable-precision variants

These results are graphically summarized in Figures 3.1 and 3.2. In both figures, each group of vars represents the performance of the five methods discussed above: LMQN (dark blue), LMQN-s (light blue), LMQN-h (green), iLMQN-a (brown) and iLMQN-b (yellow). The left part of Figure 3.1 gives the ratio of successful solves to the number of problems, while the right part shows the relative number of iterations compared to that used by LMQN (on problems solved by both algorithms). Figure 3.2 gives the relative energy costs compared to LMQN, the right part presenting the costs of evaluating the objective function and the ledt part the costs of evaluating the gradients.

         

Figure 3.1: Relative reliabilities and iteration numbers

         

Figure 3.2: Relative energy savings for the evaluations of and

The following conclusions follow from Table 3.3 and Figures 3.1-3.2.

  1. For moderate final accuracy requirements ( or ), the inexact variants iLMQN-a and iLMQN-b perform well: they provide very acceptable robustness compared to the exact method and, most importantly here, yield very significant savings in costs, both for the gradient and the objective function, at the price of a reasonable increase in the number of iterations.

  2. The iLMQN-a variant appears to dominate the iLMQN-b in robustness and savings in the evaluation of the objective function. iLMQN-b nevertheless shows significantly larger savings in the gradient’s evaluation costs, but worse performance for the evaluation of the objective function.

  3. When the final accuracy is thigher (), the inexact methods seem to loose their edge. Not only they become less robust (especially iLMQN-b), but the gains in function evaluation costs disappear (while those in gradient evaluation costs remain significant for the problems the methods are able to solve). A closer examination of the detailed numerical results indicates that, unsurprisingly, inexact methods mostly fail on ill-conditioned problems (e.g. brownbs, powellbs, meyer3, osborneb).

  4. The comparison of iLMQN-a and even iLMQN-b with LMQN-s and LMQN-h clearly favours the new methods both in robsutness and gains obtained, showing that purpose-designed algorithms outperform simple-minded approaches in this context.

Summarizing, the iLMQN-a multi-precision algorithm appears, in our experiments, to produce significant savings in function’s and gradient’s evaluation costs when the final accuracy requirement and/or the problem conditioning is moderate. Using the iLMQN-b variant may produce, on the problems where it succeeds, larger gains in gradient’s evaluation cost at the price of more costly function evaluations.

4 Conclusions and perspectives

We have provided an improved provably convergent(6)(6)(6)With optimal wirst-case complexity. variant of the trust-region method using dynamic accuracy and have shown that, when considered in the context high performance computing and multiprecision arithmetic, this variant has the potential to bring significant savings in objective function’s and gradient’s evaluation cost. Despite the encouraging results reported in this note, the authors are of course aware that the numerical experiments discussed here are limited in size and scope and that the suggested conclusions need further assessment. In particular, it may be of interest to compare inexact trust-region algorithms with inexact regularization methods [4], especially if not only first-order but also second-order critical points are sought.

References