1 Motivation and objectives
Two recent evolutions in the field of scientific computing motivate the present note. The first is the growing importance of deep-learning methods for artificial intelligence, and the second is the acknowledgement by computer architects that new high-performance machines must be able to run the basic tools of deep learning very efficiently. Because the ubiquitous mathematical problem in deep learning is nonlinear nonconvex optimization, it is of interest to consider how to solve this problem as efficiently as possible on new, very powerful computers. As it turns out, one of the crucial aspects in designing such machines and the algorithms they run is mastering energy dissipation. Given that this dissipation is approximately proportional to chip surface, and that chip surface is itself approximately proportional to the square of the number of binary digits involved in the calculation [19, 32, 23, 27], being able to solve nonlinear optimization problems with as few digits as possible (while not losing final accuracy) is clearly of interest.
This short note’s sole purpose is to show that this is possible, and that algorithms exist which achieve this goal and whose robustness significantly exceeds that of simple-minded approaches. The focus is on unconstrained nonconvex optimization, the most frequent case in deep-learning applications. Since the cost of solving such problems is typically dominated by that of evaluating the objective function (and its derivatives if possible), our aim is to propose optimization methods able to exploit and specify varying levels of preexisting arithmetic precision for these evaluations. Because of this feature, optimization in this context differs from other, better-studied frameworks where the degree of accuracy may be chosen in a more continuous way, such as adaptive techniques for optimization with noisy functions (see [18, 9, 10]) or with functions whose values and derivatives are estimated by some (possibly dynamic) sampling process (see [35, 12, 3, 13, 6, 5], for instance). We propose here a suitable adaptation of the dynamic-accuracy trust-region framework proposed by Carter in [10] and by Conn, Gould and Toint in Section 10.6 of [16] to the context of multi-precision computations. Our proposal complements earlier work where inexactness is also used for energy-saving purposes, but where its exploitation is restricted to the inner linear-algebra work of the solution algorithm, exact values of the nonlinear function still being assumed (the solution of nonlinear systems of equations is considered there, rather than unconstrained optimization). Note that the framework of inexact computation has already been discussed in other contexts [1, 30, 24, 25].
2 Nonconvex optimization with dynamic accuracy
We start by briefly recalling the context of the dynamic-accuracy trust-region technique of [16]. Consider the unconstrained minimization problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f$ is a sufficiently smooth function from $\mathbb{R}^n$ into $\mathbb{R}$, and where the value of the objective can be approximated with a prespecified level of accuracy. That is, given $x$ and an accuracy level $\omega \ge 0$, one can compute a value $\bar{f}(x,\omega)$ such that
$$|\bar{f}(x,\omega) - f(x)| \le \omega.$$
The crucial difference with a more standard approach for optimization with noisy functions is that the required accuracy level (call it $\omega$) may be specified by the minimization algorithm itself within a prespecified set, with the understanding that the more accurate the requirement specified by $\omega$, the higher the “cost” of evaluating the approximate objective value.
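To make the notion of a prescribed accuracy level concrete, the following is a minimal Python sketch (the paper's experiments use Matlab) of an oracle that evaluates the objective in the cheapest IEEE format whose unit roundoff meets the requested level. The names `f_bar` and `FORMATS` are ours, not from the paper, and the oracle delivers the accuracy bound only up to problem scaling, since reduced-precision arithmetic controls the *relative* rather than absolute error.

```python
import numpy as np

# Cheapest-first list of (floating-point format, approximate unit roundoff).
FORMATS = [
    (np.float16, 2.0**-11),   # half precision
    (np.float32, 2.0**-24),   # single precision
    (np.float64, 2.0**-53),   # double precision
]

def f_bar(f, x, omega):
    """Evaluate f(x) in the cheapest format whose unit roundoff is <= omega.

    Rounding x to the chosen format and evaluating there gives an error of
    the order of that format's unit roundoff times the problem scaling.
    """
    for dtype, u in FORMATS:
        if u <= omega:
            return float(f(np.asarray(x, dtype=dtype)))
    # Requested accuracy tighter than double precision: do the best we can.
    return float(f(np.asarray(x, dtype=np.float64)))
```

A looser `omega` thus automatically routes the evaluation to a cheaper format, which is exactly the mechanism the algorithm below exploits.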
We propose to use a trust-region method, that is, an iterative algorithm where, at iteration $k$, a first-order model of the objective is approximately minimized in a ball (the trust region) centered at the current iterate $x_k$ and of radius $\Delta_k$ (the trust-region radius), yielding a trial point. The reduction in the objective function achieved at the trial point is then compared to that predicted by the model. If the agreement is deemed sufficient, the trial point is accepted as the next iterate and the radius is kept the same or increased; otherwise the trial point is rejected and the radius reduced. This basic framework, whose excellent convergence properties and outstanding numerical performance are well known, was modified in Section 10.6 of [16] to handle the situation where only an approximation of the objective value is known, rather than the exact value. It has already been adapted to other contexts (see [2], for instance).
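The loop just described can be sketched as follows. This is a generic textbook trust-region iteration in Python with a simple Cauchy-point step and exact function and gradient values, not the paper's TRDA algorithm; the constants `eta1`, `eta2` and the radius-update factors are typical illustrative choices.

```python
import numpy as np

def trust_region(f, grad, x0, delta0=1.0, eta1=0.1, eta2=0.75,
                 gamma_dec=0.5, gamma_inc=2.0, tol=1e-6, max_iter=500):
    """Minimal trust-region sketch (exact f and g, Cauchy-point step)."""
    x, delta = np.asarray(x0, float), delta0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        # Cauchy step: minimize the linear model g^T s inside ||s|| <= delta.
        s = -delta * g / np.linalg.norm(g)
        pred = -g @ s                 # decrease predicted by the model
        ared = f(x) - f(x + s)        # decrease actually achieved
        rho = ared / pred
        if rho >= eta1:               # sufficient agreement: accept the step
            x = x + s
            if rho >= eta2:
                delta *= gamma_inc    # very successful: enlarge the radius
        else:
            delta *= gamma_dec        # reject the trial point, shrink radius
    return x
```

The dynamic-accuracy variants discussed in this note replace the exact `f` and `grad` calls by approximate ones whose accuracy the algorithm itself selects.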
However, the method has the serious drawback of requiring exact gradient values, even if function values may be computed inexactly. We may then call on Section 8.4 of [16], which indicates that convergence of trust-region methods to first-order critical points may be ensured with inexact gradients. If we now define the approximate gradient at $x$ as a function $\bar{g}(x,\kappa)$ such that
$$\|\bar{g}(x,\kappa) - \nabla_x f(x)\| \le \kappa\, \|\bar{g}(x,\kappa)\|,$$
this convergence is ensured provided, throughout the iterations,
$$\kappa_k \le \kappa_{\max} < 1,$$
that is, if the relative error on the gradient remains suitably small. Note that this relative error need not tend to zero, but the absolute error will when convergence occurs (see [16, Theorem 8.4.1, p. 281]). Also note that the concept of a relative gradient error is quite natural if one assumes that the gradient is computed using an arithmetic with a limited number of significant digits.
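The relative-error viewpoint indeed matches reduced-precision arithmetic well: merely storing an otherwise exact gradient in a shorter format already yields a relative error of the order of that format's unit roundoff. A small Python illustration follows; the threshold `1e-2` is an arbitrary example of a "suitably small" relative error, not a constant from the paper.

```python
import numpy as np

def relative_gradient_error(g_exact, dtype):
    """Relative error ||g_bar - g|| / ||g_bar|| when the exact gradient g
    is simply rounded to the floating-point format `dtype`."""
    g = np.asarray(g_exact, dtype=np.float64)
    g_bar = g.astype(dtype).astype(np.float64)
    return float(np.linalg.norm(g_bar - g) / np.linalg.norm(g_bar))

g = [1.234567890123, -0.000987654321, 3.141592653589]
# Rounding to half precision keeps the relative error near 2^-11 ~ 5e-4,
# well below an illustrative threshold of 1e-2; single precision does better.
```

In other words, evaluating the gradient with fewer significant digits naturally produces exactly the kind of bounded relative error that the convergence theory tolerates.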
This algorithm differs from that presented in [16, p. 402] on two accounts. First, it incorporates inexact gradients, as discussed above. Second, it does not require the step to be recomputed whenever a more accurate objective-function value is required in Step 2. This last feature makes the algorithm more efficient. Moreover, it does not affect the sequence of iterates, since the model decrease predicted by the step is independent of the objective-function value. As a consequence, the convergence to first-order points studied in Section 10.6.1 of [16] (under the assumption that the approximate Hessians remain bounded) still applies. In what follows, we choose to construct this Hessian approximation using a limited-memory symmetric rank-one (SR1) quasi-Newton update based on gradient differences [29, Section 8.2] (numerical experiments not reported here suggest that our default choice of remembering 15 secant pairs gives good performance, although keeping a smaller number of pairs is still acceptable).
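A limited-memory SR1 approximation of this kind can be sketched as follows. This is a generic implementation that rebuilds the matrix from stored secant pairs; the skipping tolerance `r` and initial scaling `b0` are standard textbook safeguards in the spirit of [29], while the memory size 15 matches the default mentioned above.

```python
import numpy as np

def sr1_hessian(pairs, n, b0=1.0, r=1e-8, m=15):
    """Symmetric rank-one Hessian approximation from the last m secant pairs.

    pairs: list of (s, y) with s = x_{k+1} - x_k and y = g_{k+1} - g_k.
    Updates with a tiny denominator are skipped (standard SR1 safeguard).
    """
    B = b0 * np.eye(n)
    for s, y in pairs[-m:]:
        s, y = np.asarray(s, float), np.asarray(y, float)
        v = y - B @ s
        denom = float(v @ s)
        if abs(denom) > r * np.linalg.norm(s) * np.linalg.norm(v):
            B += np.outer(v, v) / denom
    return B
```

On a quadratic with Hessian $A$, exact pairs $(s, As)$ along $n$ linearly independent directions reproduce $A$ after $n$ updates (the hereditary property of SR1), which is one reason SR1 is attractive for capturing possibly indefinite curvature.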
As it turns out, this variant of [16] is quite close to the method proposed by Carter in [10], the main differences being that the latter uses fixed tolerances in slightly different ranges, uses a fixed accuracy level, and computes the Hessian approximation with an unsafeguarded standard BFGS update.
We immediately note that, at termination (with $\bar{g}_k$ the approximate gradient used at the final iterate),
$$\|\nabla_x f(x_k)\| \le \|\nabla_x f(x_k) - \bar{g}_k\| + \|\bar{g}_k\| \le (1+\kappa_k)\,\|\bar{g}_k\| \le \epsilon,$$
where we have used the triangle inequality, (2.3) and (2.4). As a consequence, the TRDA algorithm terminates at a true $\epsilon$-approximate first-order-necessary minimizer. Moreover, [16, Theorem 8.4.5] and the development of pp. 404-405 in the same reference indicate that the decrease in objective-function value at successful iterations is bounded below by a multiple of $\epsilon^2$. As a consequence, we may immediately deduce from now-standard arguments (see [21] for the first derivation) that the maximum number of iterations needed by Algorithm TRDA to find such an $\epsilon$-approximate first-order-necessary minimizer is $O(\epsilon^{-2})$. Moreover, this bound was proved sharp in most cases in [11], even when exact and bounded second derivatives are used.
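The counting argument behind this complexity bound can be summarized as follows (a generic sketch with illustrative notation: $\kappa_s > 0$ is the decrease constant mentioned above and $f_{\rm low}$ a lower bound on $f$):

```latex
f(x_0) - f_{\rm low}
  \;\ge\; \sum_{k \in \mathcal{S}} \big( f(x_k) - f(x_{k+1}) \big)
  \;\ge\; |\mathcal{S}| \, \kappa_s \, \epsilon^2
\quad\Longrightarrow\quad
|\mathcal{S}| \;\le\; \frac{f(x_0) - f_{\rm low}}{\kappa_s\,\epsilon^2}
  \;=\; O(\epsilon^{-2}),
```

where $\mathcal{S}$ denotes the set of successful iterations before termination; the total iteration count then exceeds $|\mathcal{S}|$ by at most a constant factor because of the trust-region radius update mechanism.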
3 Numerical experience
We now present some numerical evidence that the TRDA algorithm can perform well and provide significant savings in terms of energy costs when these are dominated by the function and gradient evaluations. Our experiments are run in Matlab (64 bits) and use a collection of 86 small unconstrained test problems (the collection of [8] and a few other problems, all available in Matlab) detailed in Table 3.1. In what follows, we declare a run successful when an iterate is found such that (2.6), and hence (2.12), hold in at most 1000 iterations.
| Problem | n | Source | Problem | n | Source | Problem | n | Source |
| argauss | 3 | [28, 8] | arglina | 10 | [28, 8] | arglinb | 10 | [28, 8] |
| arglinc | 10 | [28, 8] | argtrig | 10 | [28, 8] | arwhead | 10 | [14, 20] |
| biggs6 | 6 | [28, 8] | box | 3 | [28, 8] | booth | 2 | - |
| brownbs | 2 | [28, 8] | brownden | 4 | [28, 8] | broyden3d | 10 | [28, 8] |
| broydenbd | 10 | [28, 8] | chebyqad | 10 | [28, 8] | cliff | 2 | - |
| cube | 2 | - | dixmaana | 12 | [17, 8] | dixmaanj | 12 | [17, 8] |
| eg2 | 10 | [15, 20] | eg2s | 10 | [15, 20] | engval1 | 10 | [28, 8] |
| engval2 | 10 | [28, 8] | freuroth | 4 | [28, 8] | genhumps | 2 | - |
| jensmp | 2 | [28, 8] | kowosb | 4 | [28, 8] | lminsurf | 25 | [22, 8] |
| mancino | 10 | [34, 8] | mexhat | 2 | - | meyer3 | 3 | [28, 8] |
| nlminsurf | 25 | [22, 8] | osbornea | 5 | [28, 8] | osborneb | 11 | [28, 8] |
| penalty1 | 10 | [28, 8] | penalty2 | 10 | [28, 8] | powellbs | 2 | [28, 8] |
| recipe | 2 | - | rosenbr | 2 | [28, 8] | schmvett | 3 | [33, 8] |
| vardim | 10 | [28, 8] | watson | 12 | [28, 8] | wmsqrtals | 16 | - |
In the following set of experiments with the TRDA variants, we assume that the objective-function value and the gradient can be computed in double, single or half precision (with corresponding accuracy levels equal to machine precision, half machine precision or quarter machine precision). Thus, when the TRDA algorithm specifies an accuracy level for the function or the gradient, this may not be attainable as such; the lowest of the three available accuracy levels that meets the request is then chosen, so that the computation is performed in (possibly moderately) higher accuracy than requested. The equivalent double-precision costs of the evaluations of the objective function and the gradient in single precision are then computed by dividing the cost of evaluation in double precision by four (remember that this cost is proportional to the square of the number of significant digits). Those for half precision correspond to double-precision costs divided by sixteen.
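This equivalent-cost bookkeeping is easy to reproduce. The following small Python sketch is ours; only the 1, 1/4 and 1/16 ratios come from the digits-squared cost model just described.

```python
# Energy cost is taken proportional to the square of the number of
# significant digits, so relative to double precision:
COST = {"double": 1.0, "single": 1.0 / 4.0, "half": 1.0 / 16.0}

def equivalent_double_cost(evals):
    """Equivalent double-precision cost of a mix of evaluations.

    evals: dict mapping a precision name to the number of evaluations
    performed at that precision.
    """
    return sum(count * COST[precision] for precision, count in evals.items())
```

For instance, 4 double-precision, 8 single-precision and 16 half-precision evaluations together cost the same as 7 double-precision ones, which is why pushing evaluations to reduced precision pays off so quickly under this model.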
To set the stage, our first experiment compares three variants of the TRDA algorithm:
a version (denoted LMQN below) using full double-precision arithmetic for all evaluations of the objective function and gradient,
a version (LMQN-s) using single-precision evaluation of the objective function and gradient throughout,
a version (LMQN-h) using half-precision evaluation of the objective function and gradient throughout.
These variants correspond to a simple-minded approach where the expensive parts of the optimization calculation are conducted in reduced precision without any further adaptive accuracy management. For each variant, we report, for three different values of the final gradient accuracy $\epsilon$,
the robustness, as given by the number of successful solves for the relevant $\epsilon$ (nsucc),
the average number of iterations (its.),
the average equivalent double-precision cost of the objective-function evaluations (costf),
the average equivalent double-precision cost of the gradient evaluations (costg),
the ratio of the average number of iterations used by the variant to that used by LMQN, computed on problems solved by both LMQN and the variant (rel. its.),
the ratio of the average equivalent double-precision evaluation costs for the objective function used by the variant to those used by LMQN, computed on problems solved by both (rel. costf),
the ratio of the average equivalent double-precision evaluation costs for the gradient used by the variant to those used by LMQN, computed on problems solved by both (rel. costg),
where all averages are computed on a sample of 20 independent runs. We are interested in making the values in the last two indicators as small as possible while maintaining a reasonable robustness (reported by nsucc).
| Variant | nsucc | its. | costf | costg | rel. its. | rel. costf | rel. costg |
Table 3.2 shows that the variants LMQN-s and LMQN-h compare very poorly with LMQN, for two reasons. The first and most important is their quickly decreasing robustness as the final gradient accuracy gets tighter. The second is that, even in the cases where the robustness is not too bad (LMQN-s for the looser accuracy requirements), we observe no improvement in costf and costg (as reported in the last two columns of the table). However, and as expected, when LMQN-h happens to succeed, it does so at a much lower cost, both for the objective function and the gradient.
Our second experiment again compares three variants:
a variant (iLMQN-a) of the TRDA algorithm where, for each iteration, the accuracy levels for the objective function and the gradient are chosen adaptively according to (3.1),
a variant (iLMQN-b) of the TRDA algorithm where, for each iteration, the objective-function accuracy is chosen as in (3.1) and the gradient accuracy is updated by a different adaptive rule.
The updating formulae for iLMQN-a are directly inspired by (2.9) and (2.4) above. The difference between the two update rules appears to give contrasting but interesting outcomes, as we discuss below. The results obtained for these variants are presented in Table 3.3, in the same format as that used for Table 3.2, the comparison in the last three columns being again computed with respect to LMQN.
| Variant | nsucc | its. | costf | costg | rel. its. | rel. costf | rel. costg |
These results are graphically summarized in Figures 3.1 and 3.2. In both figures, each group of bars represents the performance of the five methods discussed above: LMQN (dark blue), LMQN-s (light blue), LMQN-h (green), iLMQN-a (brown) and iLMQN-b (yellow). The left part of Figure 3.1 gives the ratio of successful solves to the number of problems, while the right part shows the relative number of iterations compared to that used by LMQN (on problems solved by both algorithms). Figure 3.2 gives the relative energy costs compared to LMQN, the right part presenting the costs of evaluating the objective function and the left part the costs of evaluating the gradients.
For the two more moderate final accuracy requirements, the inexact variants iLMQN-a and iLMQN-b perform well: they provide very acceptable robustness compared to the exact method and, most importantly here, yield very significant savings in costs, both for the gradient and the objective function, at the price of a reasonable increase in the number of iterations.
The iLMQN-a variant appears to dominate iLMQN-b in robustness and in savings on objective-function evaluations, whereas iLMQN-b shows significantly larger savings in gradient-evaluation costs.
When the final accuracy requirement is tighter, the inexact methods seem to lose their edge. Not only do they become less robust (especially iLMQN-b), but the gains in function-evaluation costs disappear (while those in gradient-evaluation costs remain significant for the problems the methods are able to solve). A closer examination of the detailed numerical results indicates that, unsurprisingly, the inexact methods mostly fail on ill-conditioned problems (e.g. brownbs, powellbs, meyer3, osborneb).
The comparison of iLMQN-a, and even iLMQN-b, with LMQN-s and LMQN-h clearly favours the new methods, both in robustness and in the gains obtained, showing that purpose-designed algorithms outperform simple-minded approaches in this context.
Summarizing, the iLMQN-a multi-precision algorithm appears, in our experiments, to produce significant savings in function- and gradient-evaluation costs when the final accuracy requirement and/or the problem conditioning is moderate. Using the iLMQN-b variant may produce, on the problems where it succeeds, larger gains in gradient-evaluation cost, at the price of more costly function evaluations.
4 Conclusions and perspectives
We have provided an improved, provably convergent (with optimal worst-case complexity) variant of the trust-region method using dynamic accuracy, and have shown that, when considered in the context of high-performance computing and multi-precision arithmetic, this variant has the potential to bring significant savings in objective-function and gradient evaluation costs. Despite the encouraging results reported in this note, the authors are of course aware that the numerical experiments discussed here are limited in size and scope, and that the suggested conclusions need further assessment. In particular, it may be of interest to compare inexact trust-region algorithms with inexact regularization methods [4], especially if not only first-order but also second-order critical points are sought.
-  M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. Accelerating scientific computations with mixed precision algorithms. Comput. Phys. Commun., 180:2526–2533, 2009.
-  S. Bellavia, S. Gratton, and E. Riccietti. A Levenberg-Marquardt method for large nonlinear least-squares problems with dynamic accuracy in functions and gradients. Numerische Mathematik, 140:791–825, 2018.
-  S. Bellavia, G. Gurioli, and B. Morini. Theoretical study of an adaptive cubic regularization method with dynamic inexact Hessian information. arXiv:1808.06239, 2018.
-  S. Bellavia, G. Gurioli, B. Morini, and Ph. L. Toint. Deterministic and stochastic inexact regularization algorithms for nonconvex optimization with optimal complexity. arXiv:1811.03831, 2018.
-  E. Bergou, Y. Diouane, V. Kungurtsev, and C. W. Royer. A subsampling line-search method with second-order results. arXiv:1810.07211, 2018.
-  J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust region method via supermartingales. arXiv:1609.07428v3, 2018.
-  A. A. Brown and M. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of ordinary differential equations. Technical Report 178, Hatfield Polytechnic, Hatfield, UK, 1987.
-  A. G. Buckley. Test functions for unconstrained minimization. Technical Report CS-3, Computing Science Division, Dalhousie University, Dalhousie, Canada, 1989.
-  R. G. Carter. A worst-case example using linesearch methods for numerical optimization with inexact gradient evaluations. Technical Report MCS-P283-1291, Argonne National Laboratory, Argonne, USA, 1991.
-  R. G. Carter. Numerical experience with a class of algorithms for nonlinear optimization using inexact function and gradient information. SIAM Journal on Scientific and Statistical Computing, 14(2):368–388, 1993.
-  C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. To appear in the Proceedings of the 2018 International Conference of Mathematicians (ICM 2018), Rio de Janeiro, 2018.
-  C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, Series A, 159(2):337–375, 2018.
-  X. Chen, B. Jiang, T. Lin, and S. Zhang. On adaptive cubic regularization Newton’s methods for convex optimization via random sampling. arXiv:1802.05426, 2018.
-  A. R. Conn, N. I. M. Gould, M. Lescrenier, and Ph. L. Toint. Performance of a multifrontal scheme for partially separable optimization. In S. Gomez and J. P. Hennart, editors, Advances in Optimization and Numerical Analysis, Proceedings of the Sixth Workshop on Optimization and Numerical Analysis, Oaxaca, Mexico, volume 275, pages 79–96, Dordrecht, The Netherlands, 1994. Kluwer Academic Publishers.
-  A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1992.
-  A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.
-  L. C. W. Dixon and Z. Maany. A family of test problems with sparse Hessian for unconstrained optimization. Technical Report 206, Numerical Optimization Center, Hatfield Polytechnic, Hatfield, UK, 1988.
-  C. Elster and A. Neumaier. A method of trust region type for minimizing noisy functions. Computing, 58(1):31–46, 1997.
-  S. Galal and M. Horowitz. Energy-efficient floating-point unit design. IEEE Transactions on Computers, 60(7), 2011.
-  N. I. M. Gould, D. Orban, and Ph. L. Toint. CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Computational Optimization and Applications, 60(3):545–557, 2015.
-  S. Gratton, A. Sartenaer, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414–444, 2008.
-  A. Griewank and Ph. L. Toint. Partitioned variable metric updates for large structured optimization problems. Numerische Mathematik, 39:119–137, 1982.
-  N. J. Higham. The rise of multiprecision computations. Talk at SAMSI 2017, April 2017. https://bit.ly/higham-samsi17.
-  L. Kugler. Is “good enough” computing good enough? Commun. ACM, 58:12–14, 2015.
-  S. Leyffer, S. Wild, M. Fagan, M. Snir, K. Palem, K. Yoshii, and H. Finkel. Moore with less – leapfrogging Moore’s law with inexactness for supercomputing. arXiv:1610.02606v2, 2016. (To appear in Proceedings of PMES 2018: 3rd International Workshop on Post Moore’s Era Supercomputing.)
-  G. Li. The secant/finite difference algorithm for solving sparse nonlinear systems of equations. SIAM Journal on Numerical Analysis, 25(5):1181–1196, 1988.
-  S. Matsuoka. private communication, March 2018.
-  J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Transactions on Mathematical Software, 7(1):17–41, 1981.
-  J. Nocedal and S. J. Wright. Numerical Optimization. Series in Operations Research. Springer Verlag, Heidelberg, Berlin, New York, 1999.
-  K. V. Palem. Inexactness and a future of computing. Phil. Trans. R. Soc. A, 372(20130281), 2014.
-  G. Poenisch and H. Schwetlick. Computing turning points of curves implicitly defined by nonlinear equations depending on a parameter. Computing, 20:101–121, 1981.
-  J. Pu, S. Galal, X. Yang, O. Shacham, and M. Horowitz. FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 single-precision FPU, and a 43.7 GFLOPS/W at 74.6 GFLOPS/mm2 double-precision FPU, in 28nm UTBB FDSOI. Hardware Architecture, 2016.
-  J. W. Schmidt and K. Vetters. Ableitungsfreie Verfahren für nichtlineare Optimierungsprobleme. Numerische Mathematik, 15:263–282, 1970.
-  E. Spedicato. Computational experience with quasi-Newton algorithms for minimization problems of moderately large size. Technical Report CISE-N-175, CISE Documentation Service, Segrate, Milano, 1975.
-  P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv:1708.07164v3, 2017.