1 Motivation and objectives
Two recent evolutions in the field of scientific computing motivate the present note. The first is the growing importance of deep-learning methods for artificial intelligence, and the second is the acknowledgement by computer architects that new high-performance machines must be able to run the basic tools of deep learning very efficiently. Because the ubiquitous mathematical problem in deep learning is nonlinear nonconvex optimization, it is of interest to consider how to solve this problem as efficiently as possible on new, very powerful computers. As it turns out, one of the crucial aspects in designing such machines and the algorithms they run is mastering energy dissipation. Given that this dissipation is approximately proportional to chip surface, and that chip surface is itself approximately proportional to the square of the number of binary digits involved in the calculation [19, 32, 23, 27], being able to solve nonlinear optimization problems with as few digits as possible (while not losing final accuracy) is clearly of interest.

This short note's sole purpose is to show that this is possible: algorithms exist which achieve this goal and whose robustness significantly exceeds that of simple-minded approaches. The focus is on unconstrained nonconvex optimization, the most frequent case in deep-learning applications. Since the cost of solving such problems is typically dominated by that of evaluating the objective function (and its derivatives, if available), our aim is to propose optimization methods able to exploit and specify varying levels of pre-existing arithmetic precision for these evaluations. Because of this feature, optimization in this context differs from other, better-studied frameworks where the degree of accuracy may be chosen in a more continuous way, such as adaptive techniques for optimization with noisy functions (see [18, 9, 10]) or with functions whose values and derivatives are estimated by some (possibly dynamic) sampling process (see [35, 12, 3, 13, 6, 5], for instance).

We propose here a suitable adaptation of the dynamic-accuracy trust-region framework proposed by Carter in [10] and by Conn, Gould and Toint in Section 10.6 of [16] to the context of multiprecision computations. Our proposal complements that of [25], where inexactness is also used for energy-saving purposes, but where its exploitation is restricted to the inner linear-algebra work of the solution algorithm, while exact values of the nonlinear function involved are still assumed^{(1)}. Note that the framework of inexact computation has already been discussed in other contexts [1, 30, 24, 25].

^{(1)} The solution of nonlinear systems of equations is considered rather than unconstrained optimization.

2 Nonconvex optimization with dynamic accuracy
We start by briefly recalling the context of the dynamic-accuracy trust-region technique of [16]. Consider the unconstrained minimization problem

(2.1)  $\min_{x \in \mathbb{R}^n} f(x),$

where $f$ is a sufficiently smooth function from $\mathbb{R}^n$ into $\mathbb{R}$, and where the value of the objective can be approximated with a prespecified level of accuracy. That is, given $x \in \mathbb{R}^n$ and an accuracy level $\omega_f \ge 0$, one can compute an approximation $\bar{f}(x, \omega_f)$ such that

(2.2)  $|\bar{f}(x, \omega_f) - f(x)| \le \omega_f.$
The crucial difference with a more standard approach to optimization with noisy functions is that the required accuracy level may be specified by the minimization algorithm itself within a prespecified set, with the understanding that the more accurate the requirement, the higher the "cost" of the evaluation.
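To fix ideas, the following minimal Python sketch shows one way such an accuracy-parameterized oracle could look. It assumes, purely for illustration, that the available accuracy levels are those of IEEE half, single and double precision; the function names, the Rosenbrock-type stand-in objective and the rough unit roundoffs are our own choices, not part of the framework of [16].

```python
import numpy as np

def f_exact(x):
    """Rosenbrock-type objective, used here only as a stand-in for f."""
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                        + (1.0 - x[:-1]) ** 2))

def f_bar(x, omega_f):
    """Return an approximation of f(x) in the spirit of (2.2): the request
    omega_f is mapped to the cheapest of three available precisions whose
    (rough) unit roundoff does not exceed it.  This mapping is a naive
    sketch; a careful implementation must account for the conditioning of f."""
    for dtype, unit_roundoff in ((np.float16, 1e-3),
                                 (np.float32, 1e-7),
                                 (np.float64, 1e-16)):
        if unit_roundoff <= omega_f:
            xr = np.asarray(x, dtype=dtype)
            return float(np.sum(100.0 * (xr[1:] - xr[:-1] ** 2) ** 2
                                + (1.0 - xr[:-1]) ** 2))
    return f_exact(x)  # request tighter than any reduced format: full precision
```

A loose request thus triggers a cheap half-precision evaluation, while a tight one falls through to full double precision.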
We propose to use a trust-region method, that is, an iterative algorithm in which, at iteration $k$, a first-order model is approximately minimized in a ball (the trust region) centered at the current iterate $x_k$ and of radius $\Delta_k$ (the trust-region radius), yielding a trial point. The reduction in the objective function achieved at the trial point is then compared to that predicted by the model. If the agreement is deemed sufficient, the trial point is accepted as the next iterate and the radius is kept the same or increased. Otherwise, the trial point is rejected and the radius is reduced. This basic framework, whose excellent convergence properties and outstanding numerical performance are well known, was modified in Section 10.6 of [16] to handle the situation where only an approximation of the objective is known, rather than its exact value. It has already been adapted to other contexts (see [2] for instance).
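The basic loop just described can be sketched as follows. The linear model, the acceptance thresholds (0.1, 0.75) and the radius-update factors (2 and 0.5) are illustrative choices only, not the parameters of the TRDA algorithm studied in this note.

```python
import numpy as np

def trust_region(f, grad, x0, delta0=1.0, eps=1e-6, max_iter=500):
    """Minimal sketch of a basic trust-region loop: a first-order (linear)
    model is minimized exactly on the ball of radius delta, and the
    achieved-versus-predicted reduction ratio rho drives acceptance of the
    trial point and the update of the radius."""
    x = np.asarray(x0, dtype=float)
    delta = delta0
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= eps:
            break                          # approximate first-order point found
        s = -(delta / gnorm) * g           # minimizer of the linear model on the ball
        predicted = gnorm * delta          # model decrease, equal to -g^T s
        achieved = f(x) - f(x + s)
        rho = achieved / predicted
        if rho >= 0.1:                     # sufficient agreement: accept trial point
            x = x + s
            if rho >= 0.75:
                delta *= 2.0               # very successful iteration: enlarge radius
        else:
            delta *= 0.5                   # poor agreement: reject and shrink radius
    return x
```

On a simple convex quadratic this loop drives the gradient below the requested threshold, illustrating the accept/reject mechanism without any of the inexactness machinery discussed next.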
However, the method has the serious drawback of requiring exact gradient values, even if function values may be computed inexactly. We may then call on Section 8.4 of [16], which indicates that convergence of trust-region methods to first-order critical points may be ensured with inexact gradients. If we define the approximate gradient at $x$ as a function $\bar{g}(x, \omega_g)$ such that

(2.3)  $\|\bar{g}(x, \omega_g) - \nabla_x f(x)\| \le \omega_g \, \|\bar{g}(x, \omega_g)\|,$

this convergence is ensured provided that, throughout the iterations,

(2.4)  $\omega_{g,k} \le \kappa_g \quad \text{for some constant } \kappa_g \in (0,1),$

that is, if the relative error on the gradient remains suitably small. Note that this relative error need not tend to zero, but the absolute error will when convergence occurs (see [16, Theorem 8.4.1, p. 281]). Also note that the concept of a relative gradient error is quite natural if one assumes that the gradient is computed using an arithmetic with a limited number of significant digits.
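The following small experiment illustrates why a relative error model is natural for limited-precision arithmetic: evaluating a gradient formula entirely in reduced precision produces an error that scales with the gradient itself, i.e. a small relative error in the sense of (2.3). The two-variable Rosenbrock example and the helper names are our own illustrative choices.

```python
import numpy as np

def grad_rosen(x):
    """Analytic gradient of the two-variable Rosenbrock function."""
    return np.array([-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0] ** 2)])

def grad_rosen_lowprec(x, dtype=np.float16):
    """The same formula evaluated entirely in reduced-precision arithmetic."""
    xr = np.asarray(x, dtype=dtype)
    c400, c200, c2, c1 = (dtype(v) for v in (400.0, 200.0, 2.0, 1.0))
    g0 = -c400 * xr[0] * (xr[1] - xr[0] * xr[0]) - c2 * (c1 - xr[0])
    g1 = c200 * (xr[1] - xr[0] * xr[0])
    return np.array([g0, g1], dtype=np.float64)

x = np.array([-1.2, 1.0])
g_exact = grad_rosen(x)
g_half = grad_rosen_lowprec(x)
# rounding errors are roughly proportional to the gradient itself, so the
# *relative* error stays small even though the absolute error does not vanish
rel_err = np.linalg.norm(g_half - g_exact) / np.linalg.norm(g_half)
```

Here the half-precision evaluation commits an absolute error of order the unit roundoff times the gradient norm, so the relative error remains far below any reasonable constant bound of the kind appearing in (2.4).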
This algorithm differs from that presented in [16, p. 402] on two accounts. First, it incorporates inexact gradients, as discussed above. Second, it does not require the step to be recomputed whenever a more accurate objective-function value is required in Step 2. This last feature makes the algorithm more efficient, and it does not affect the sequence of iterates since the value of the model decrease predicted by the step is independent of the objective-function value. As a consequence, the convergence to first-order points studied in Section 10.6.1 of [16] (under the assumption that the approximate Hessians remain bounded) still applies. In what follows, we choose to construct this approximation using a limited-memory symmetric rank-one (SR1) quasi-Newton update based on gradient differences [29, Section 8.2]^{(2)}.

^{(2)} Numerical experiments not reported here suggest that our default choice of remembering 15 secant pairs gives good performance, although keeping a smaller number of pairs is still acceptable.
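A naive sketch of such a limited-memory SR1 approximation is given below. It replays the stored secant pairs from the identity at every call, rather than using the compact representations preferred in practice, and the class name and skipping tolerance are our own illustrative choices; only the SR1 formula itself and the standard safeguard are standard [29, Section 8.2].

```python
import numpy as np

class LimitedMemorySR1:
    """Limited-memory SR1 Hessian approximation sketch: keep the last m
    secant pairs (s, y), where y is a gradient difference, and rebuild B
    from the identity on demand.  The standard safeguard skips a pair when
    the denominator s^T (y - B s) is too small relative to |r| |s|."""

    def __init__(self, n, m=15, skip_tol=1e-8):
        self.n, self.m, self.skip_tol = n, m, skip_tol
        self.pairs = []                      # most recent secant pairs

    def update(self, s, y):
        self.pairs.append((np.asarray(s, float), np.asarray(y, float)))
        if len(self.pairs) > self.m:
            self.pairs.pop(0)                # forget the oldest pair

    def matrix(self):
        B = np.eye(self.n)
        for s, y in self.pairs:
            r = y - B @ s                    # residual of the secant equation
            d = r @ s
            if abs(d) > self.skip_tol * np.linalg.norm(r) * np.linalg.norm(s):
                B = B + np.outer(r, r) / d   # rank-one SR1 correction
        return B
```

On a quadratic with exact gradient differences, feeding linearly independent steps reproduces the true Hessian, a well-known hereditary property of the SR1 update.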
As it turns out, this variant of [16] is quite close to the method proposed by Carter in [10], the main difference being that the latter uses fixed tolerances in slightly different ranges^{(3)}.

^{(3)} In [10], a fixed value is also used for the gradient accuracy, whose upper bound depends on the algorithm's parameters. The Hessian approximation is computed using an unsafeguarded standard BFGS update.
We immediately note that, at termination,

(2.12)  $\|\nabla_x f(x_k)\| \le \|\nabla_x f(x_k) - \bar{g}(x_k, \omega_{g,k})\| + \|\bar{g}(x_k, \omega_{g,k})\| \le (1 + \kappa_g)\,\epsilon,$

where we have used the triangle inequality, (2.3) and (2.4). As a consequence, the TRDA algorithm terminates at a true approximate first-order-necessary minimizer. Moreover, [16, Theorem 8.4.5] and the development of pp. 404-405 in the same reference indicate that the decrease in objective-function value at successful iterations is bounded below by a multiple of $\epsilon^2$. As a consequence, we may immediately deduce from now-standard arguments (see [21] for the first derivation) that the maximum number of iterations needed by Algorithm TRDA to find such an approximate first-order-necessary minimizer is $O(\epsilon^{-2})$. Moreover, this bound was proved sharp in most cases in [11], even when exact and bounded second derivatives are used.
3 Numerical experience
We now present some numerical evidence that the TRDA algorithm can perform well and provide significant savings in energy costs when these are dominated by the function and gradient evaluations. Our experiments are run in Matlab (64 bits) and use a collection of 86 small unconstrained test problems^{(4)} detailed in Table 3.1. In what follows, we declare a run successful when an iterate is found such that (2.6), and hence (2.12), hold within at most 1000 iterations.

^{(4)} The collection of [8] and a few other problems, all available in Matlab.
Table 3.1: The test problems.

Name  dim.  source  Name  dim.  source  Name  dim.  source

argauss  3  [28, 8]  arglina  10  [28, 8]  arglinb  10  [28, 8] 
arglinc  10  [28, 8]  argtrig  10  [28, 8]  arwhead  10  [14, 20] 
bard  3  [8]  bdarwhd  10  [20]  beale  2  [8] 
biggs6  6  [28, 8]  box  3  [28, 8]  booth  2  [8] 
brkmcc  2  [8]  brazthing  2    brownal  10  [28, 8] 
brownbs  2  [28, 8]  brownden  4  [28, 8]  broyden3d  10  [28, 8] 
broydenbd  10  [28, 8]  chebyqad  10  [28, 8]  cliff  2  [8] 
clustr  2  [8]  cosine  2  [20]  crglvy  10  [28, 8] 
cube  2  [8]  dixmaana  12  [17, 8]  dixmaanj  12  [17, 8] 
dixon  10  [8]  dqrtic  10  [8]  edensch  5  [26] 
eg2  10  [15, 20]  eg2s  10  [15, 20]  engval1  10  [28, 8] 
engval2  10  [28, 8]  freuroth  4  [28, 8]  genhumps  2  [20] 
gottfr  2  [8]  gulf  4  [28, 8]  hairy  2  [20] 
helix  3  [28, 8]  hilbert  10  [8]  himln3  10  [8] 
himm25  10  [8]  himm27  10  [8]  himm28  10  [8] 
himm29  10  [8]  himm30  10  [8]  himm33  10  [8] 
hypcir  2  [8]  indef  5  [20]  integreq  2  [28, 8] 
jensmp  2  [28, 8]  kowosb  4  [28, 8]  lminsurf  25  [22, 8] 
mancino  10  [34, 8]  mexhat  2  [7]  meyer3  3  [28, 8] 
morebv  12  [28, 8]  msqrtals  16  [8]  msqrtbls  16  [8] 
nlminsurf  25  [22, 8]  osbornea  5  [28, 8]  osborneb  11  [28, 8] 
penalty1  10  [28, 8]  penalty2  10  [28, 8]  powellbs  2  [28, 8] 
powellsg  4  [28, 8]  powellsq  2  [8]  powr  10  [8] 
recipe  2  [8]  rosenbr  2  [28, 8]  schmvett  3  [33, 8] 
scosine  2  [20]  sisser  2  [8]  spmsqrt  10  [8] 
tquartic  10  [8]  tridia  10  [8]  trigger  7  [31, 8] 
vardim  10  [28, 8]  watson  12  [28, 8]  wmsqrtals  16   
wmsqrtbls  16    woods  12  [28, 8]  zangwil2  2  [8] 
zangwil3  3  [8] 
In the following set of experiments with the TRDA variants, we assume that the objective function's value and the gradient can be computed in double, single or half precision (with corresponding accuracy levels equal to double-precision machine precision, or to the accuracy obtained with half or a quarter of its significant digits). Thus, when the TRDA algorithm specifies an accuracy level, this may not be attainable as such; the computation is then performed at the least accurate of the three available levels that still meets the request, i.e. in (possibly moderately) higher accuracy than requested. The equivalent double-precision costs of the evaluations of the objective function and gradient in single precision are computed by dividing the cost of evaluation in double precision by four^{(5)}; those for half precision correspond to double-precision costs divided by sixteen.

^{(5)} Remember that this cost is proportional to the square of the number of significant digits.
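Under this cost model, the equivalent double-precision cost of a mixed-precision run can be accumulated as sketched below; the function name and the word lengths 64/32/16 are our assumptions matching the text.

```python
def equivalent_cost(n_double, n_single, n_half):
    """Equivalent double-precision cost of a mix of evaluations, under the
    note's model: cost is proportional to the square of the word length,
    so a single-precision evaluation costs 1/4 and a half-precision one
    1/16 of a double-precision evaluation."""
    w_double, w_single, w_half = 64, 32, 16
    return (n_double
            + n_single * (w_single / w_double) ** 2
            + n_half * (w_half / w_double) ** 2)
```

For instance, four single-precision evaluations, or sixteen half-precision ones, each cost as much as a single double-precision evaluation.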
To set the stage, our first experiment starts by comparing three variants of the TRDA algorithm:

- LMQN: a version using double-precision evaluations of the objective function and gradient at every iteration (i.e. using full double-precision arithmetic throughout),

- LMQNs: a version using single-precision evaluations of the objective function and gradient at every iteration,

- LMQNh: a version using half-precision evaluations of the objective function and gradient at every iteration.
These variants correspond to a simple-minded approach where the expensive parts of the optimization calculation are conducted in reduced precision, without any further adaptive accuracy management. For each variant and for three values of the final gradient accuracy ($\epsilon = 10^{-3}$, $10^{-5}$ and $10^{-7}$), we report:

- the robustness, as given by the number of successful solves for the relevant $\epsilon$ (nsucc),
- the average number of iterations (its.),
- the average equivalent double-precision cost of the objective-function evaluations (costf),
- the average equivalent double-precision cost of the gradient evaluations (costg),
- the ratio of the average number of iterations used by the variant to that used by LMQN, computed on problems solved by both LMQN and the variant (rel. its.),
- the ratio of the average equivalent double-precision objective-function evaluation cost of the variant to that of LMQN, computed on problems solved by both (rel. costf),
- the ratio of the average equivalent double-precision gradient-evaluation cost of the variant to that of LMQN, computed on problems solved by both (rel. costg),

where all averages are computed over a sample of 20 independent runs. We are interested in making the values of the last two indicators as small as possible while maintaining a reasonable robustness (as reported by nsucc).
Table 3.2: Results for the fixed-precision variants.

$\epsilon$  Variant  nsucc  its.   costf  costg  rel. its.  rel. costf  rel. costg
1e-03       LMQN     82     41.05  42.04  42.04
            LMQNs    78     41.40  42.60  42.60  1.03       1.04        1.04
            LMQNh    22     16.95  1.12   1.12   0.97       0.06        0.06
1e-05       LMQN     80     46.34  47.38  47.38
            LMQNs    48     47.79  48.96  48.96  1.08       1.08        1.08
            LMQNh    10     17.80  1.18   1.18   1.38       0.08        0.08
1e-07       LMQN     67     62.76  63.85  63.85
            LMQNs    25     28.28  28.96  28.96  0.82       0.81        0.81
            LMQNh    6      15.83  1.05   1.05   0.97       0.06        0.06
Table 3.2 shows that the variants LMQNs and LMQNh compare very poorly to LMQN, for two reasons. The first and most important is the quickly decreasing robustness as the final gradient accuracy gets tighter. The second is that, even in the cases where the robustness is not too bad (LMQNs for $\epsilon = 10^{-3}$ and maybe $\epsilon = 10^{-5}$), we observe no improvement in costf and costg (as reported in the last two columns of the table). However, and as expected, when LMQNh happens to succeed, it does so at a much lower cost, both for the objective function and the gradient.
Our second experiment again compares three variants:

- LMQN: as above,

- iLMQNa: a variant of the TRDA algorithm where, for each iteration, the accuracy levels are chosen according to

(3.1)

- iLMQNb: a variant of the TRDA algorithm where, for each iteration, the objective-function accuracy level is chosen as in (3.1) and the gradient accuracy level according to

(3.2)

The updating formulae for iLMQNa are directly inspired by (2.9) and (2.4) above. The difference between the two updates of the gradient accuracy appears to give contrasting but interesting outcomes, as we discuss below. The results obtained for these variants are presented in Table 3.3, in the same format as Table 3.2, the comparison in the last three columns again being computed with respect to LMQN.
Table 3.3: Results for the adaptive multiprecision variants.

$\epsilon$  Variant  nsucc  its.   costf  costg  rel. its.  rel. costf  rel. costg
1e-03       LMQN     82     41.05  42.04  42.04
            iLMQNa   80     50.05  9.88   6.11   1.23       0.24        0.15
            iLMQNb   76     52.67  13.85  3.34   1.36       0.35        0.08
1e-05       LMQN     80     46.34  47.38  47.38
            iLMQNa   75     75.92  36.21  24.77  1.40       0.63        0.42
            iLMQNb   63     72.57  39.85  4.60   1.78       0.95        0.11
1e-07       LMQN     67     62.76  63.85  63.85
            iLMQNa   47     65.83  58.97  37.50  1.18       1.03        0.65
            iLMQNb   40     87.35  95.09  5.52   1.39       1.45        0.09
These results are graphically summarized in Figures 3.1 and 3.2. In both figures, each group of bars represents the performance of the five methods discussed above: LMQN (dark blue), LMQNs (light blue), LMQNh (green), iLMQNa (brown) and iLMQNb (yellow). The left part of Figure 3.1 gives the ratio of successful solves to the number of problems, while the right part shows the relative number of iterations compared to that used by LMQN (on problems solved by both algorithms). Figure 3.2 gives the relative energy costs compared to LMQN, the right part presenting the costs of evaluating the objective function and the left part the costs of evaluating the gradients.
The following conclusions follow from Table 3.3 and Figures 3.1-3.2.

- For moderate final accuracy requirements ($\epsilon = 10^{-3}$ or $\epsilon = 10^{-5}$), the inexact variants iLMQNa and iLMQNb perform well: they provide very acceptable robustness compared to the exact method and, most importantly here, yield very significant savings in costs, both for the gradient and the objective function, at the price of a reasonable increase in the number of iterations.

- The iLMQNa variant appears to dominate iLMQNb in robustness and in savings on the evaluation of the objective function. iLMQNb nevertheless shows significantly larger savings in gradient-evaluation costs, but worse performance for the evaluation of the objective function.

- When the final accuracy is tighter ($\epsilon = 10^{-7}$), the inexact methods seem to lose their edge. Not only do they become less robust (especially iLMQNb), but the gains in function-evaluation costs disappear (while those in gradient-evaluation costs remain significant for the problems the methods are able to solve). A closer examination of the detailed numerical results indicates that, unsurprisingly, the inexact methods mostly fail on ill-conditioned problems (e.g. brownbs, powellbs, meyer3, osborneb).

- The comparison of iLMQNa, and even iLMQNb, with LMQNs and LMQNh clearly favours the new methods, both in robustness and in gains obtained, showing that purpose-designed algorithms outperform simple-minded approaches in this context.
Summarizing, the iLMQNa multiprecision algorithm appears, in our experiments, to produce significant savings in function- and gradient-evaluation costs when the final accuracy requirement and/or the problem conditioning is moderate. Using the iLMQNb variant may produce, on the problems where it succeeds, larger gains in gradient-evaluation cost at the price of more costly function evaluations.
4 Conclusions and perspectives
We have provided an improved, provably convergent^{(6)} variant of the trust-region method using dynamic accuracy, and have shown that, when considered in the context of high-performance computing and multiprecision arithmetic, this variant has the potential to bring significant savings in objective-function and gradient evaluation costs. Despite the encouraging results reported in this note, the authors are of course aware that the numerical experiments discussed here are limited in size and scope and that the suggested conclusions need further assessment. In particular, it may be of interest to compare inexact trust-region algorithms with inexact regularization methods [4], especially if not only first-order but also second-order critical points are sought.

^{(6)} With optimal worst-case complexity.
References
 [1] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. Accelerating scientific computations with mixed precision algorithms. Comput. Phys. Commun., 180:2526–2533, 2009.
 [2] S. Bellavia, S. Gratton, and E. Riccietti. A Levenberg-Marquardt method for large nonlinear least-squares problems with dynamic accuracy in functions and gradients. Numerische Mathematik, 140:791–825, 2018.
 [3] S. Bellavia, G. Gurioli, and B. Morini. Theoretical study of an adaptive cubic regularization method with dynamic inexact Hessian information. arXiv:1808.06239, 2018.
 [4] S. Bellavia, G. Gurioli, B. Morini, and Ph. L. Toint. Deterministic and stochastic inexact regularization algorithms for nonconvex optimization with optimal complexity. arXiv:1811.03831, 2018.
 [5] E. Bergou, Y. Diouane, V. Kungurtsev, and C. W. Royer. A subsampling line-search method with second-order results. arXiv:1810.07211, 2018.
 [6] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust region method via supermartingales. arXiv:1609.07428v3, 2018.

 [7] A. A. Brown and M. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of ordinary differential equations. Technical Report 178, Hatfield Polytechnic, Hatfield, UK, 1987.
 [8] A. G. Buckley. Test functions for unconstrained minimization. Technical Report CS-3, Computing Science Division, Dalhousie University, Dalhousie, Canada, 1989.
 [9] R. G. Carter. A worst-case example using line-search methods for numerical optimization with inexact gradient evaluations. Technical Report MCS-P283-1291, Argonne National Laboratory, Argonne, USA, 1991.
 [10] R. G. Carter. Numerical experience with a class of algorithms for nonlinear optimization using inexact function and gradient information. SIAM Journal on Scientific and Statistical Computing, 14(2):368–388, 1993.
 [11] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. To appear in the Proceedings of the 2018 International Congress of Mathematicians (ICM 2018), Rio de Janeiro, 2018.
 [12] C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, Series A, 159(2):337–375, 2018.
 [13] X. Chen, B. Jiang, T. Lin, and S. Zhang. On adaptive cubic regularization Newton’s methods for convex optimization via random sampling. arXiv:1802.05426, 2018.
 [14] A. R. Conn, N. I. M. Gould, M. Lescrenier, and Ph. L. Toint. Performance of a multifrontal scheme for partially separable optimization. In S. Gomez and J. P. Hennart, editors, Advances in Optimization and Numerical Analysis, Proceedings of the Sixth Workshop on Optimization and Numerical Analysis, Oaxaca, Mexico, volume 275, pages 79–96, Dordrecht, The Netherlands, 1994. Kluwer Academic Publishers.
 [15] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: a Fortran package for largescale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1992.
 [16] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. TrustRegion Methods. MPSSIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.
 [17] L. C. W. Dixon and Z. Maany. A family of test problems with sparse Hessian for unconstrained optimization. Technical Report 206, Numerical Optimization Center, Hatfield Polytechnic, Hatfield, UK, 1988.
 [18] C. Elster and A. Neumaier. A method of trust region type for minimizing noisy functions. Computing, 58(1):31–46, 1997.
 [19] S. Galal and M. Horowitz. Energyefficient floatingpoint unit design. IEEE Transactions on Computers, 60(7), 2011.

 [20] N. I. M. Gould, D. Orban, and Ph. L. Toint. CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Computational Optimization and Applications, 60(3):545–557, 2015.
 [21] S. Gratton, A. Sartenaer, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414–444, 2008.
 [22] A. Griewank and Ph. L. Toint. Partitioned variable metric updates for large structured optimization problems. Numerische Mathematik, 39:119–137, 1982.
 [23] N. J. Higham. The rise of multiprecision computations. Talk at SAMSI 2017, April 2017. https://bit.ly/highamsamsi17.
 [24] L. Kugler. Is “good enough” computing good enough? Commun. ACM, 58:12–14, 2015.
 [25] S. Leyffer, S. Wild, M. Fagan, M. Snir, K. Palem, K. Yoshii, and H. Finkel. Moore with less – leapfrogging Moore's law with inexactness for supercomputing. arXiv:1610.02606v2, 2016. (to appear in Proceedings of PMES 2018: 3rd International Workshop on Post Moore's Era Supercomputing).
 [26] G. Li. The secant/finite difference algorithm for solving sparse nonlinear systems of equations. SIAM Journal on Numerical Analysis, 25(5):1181–1196, 1988.
 [27] S. Matsuoka. private communication, March 2018.
 [28] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Transactions on Mathematical Software, 7(1):17–41, 1981.
 [29] J. Nocedal and S. J. Wright. Numerical Optimization. Series in Operations Research. Springer Verlag, Heidelberg, Berlin, New York, 1999.
 [30] K. V. Palem. Inexactness and a future of computing. Phil. Trans. R. Soc. A, 372(20130281), 2014.
 [31] G. Poenisch and H. Schwetlick. Computing turning points of curves implicitly defined by nonlinear equations depending on a parameter. Computing, 20:101–121, 1981.
 [32] J. Pu, S. Galal, X. Yang, O. Shacham, and M. Horowitz. FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 singleprecision FPU, and a 43.7 GFLOPS/W at 74.6 GFLOPS/mm2 doubleprecision FPU, in 28nm UTBB FDSOI. Hardware Architecture, 2016.
 [33] J. W. Schmidt and K. Vetters. Ableitungsfreie Verfahren für nichtlineare Optimierungsprobleme. Numerische Mathematik, 15:263–282, 1970.
 [34] E. Spedicato. Computational experience with quasiNewton algorithms for minimization problems of moderately large size. Technical Report CISEN175, CISE Documentation Service, Segrate, Milano, 1975.
 [35] P. Xu, F. RoostaKhorasani, and M. W. Mahoney. Newtontype methods for nonconvex optimization under inexact Hessian information. arXiv:1708.07164v3, 2017.