We investigate the role of the optimizer in determining the quality of the model fit for neural networks with a small to medium number of parameters. We study the performance of Adam, an algorithm for first-order gradient-based optimization that uses adaptive momentum, the Levenberg-Marquardt (LM) algorithm a second order method, Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) a second order method and L-BFGS, a low memory version of BFGS. Using these optimizers we fit the function using a neural network with a few parameters. This function has a variable amplitude and a constant frequency. We observe that the higher amplitude components of the function are fitted first and the Adam, BFGS and L-BFGS struggle to fit the lower amplitude components of the function. We also solve the Burgers equation using a physics informed neural network(PINN) with the BFGS and LM optimizers. For our example problems with a small to medium number of weights, we find that the LM algorithm is able to rapidly converge to machine precision offering significant benefits over other optimizers. We further investigated the Adam optimizer with a range of models and found that Adam optimiser requires much deeper models with large numbers of hidden units containing up to 26x more parameters, in order to achieve a model fit close that achieved by the LM optimizer. The LM optimizer results illustrate that it may be possible build models with far fewer parameters. We have implemented all our methods in Keras and TensorFlow 2.
Neural networks have been successfully applied to a broad range of applications and are now increasingly employed to gain new insights into complex and demanding science problems [Lecun2015]
. Many science problems, in the context of deep learning, are considered as regression problems where obtaining the best model fit is a key measure of success. The best model fit is typically the result of applying an optimisation method that minimises a loss function. Improvements in the model fit are often a key criteria used to establish that a new model delivers improved performance over prior modelling approaches.
Training neural networks usually continues until the optimizer achieves a minimum value where the loss function no longer decreases or where a comparison of the loss achieved on the training and test data indicates that over-fitting is taking place. The optimizer may not succeed and cannot guarantee that the global minimum has been achieved. The selection and application of the optimizer therefore warrants careful consideration.
Neural networks are considered to be universal approximators [Hornik1989]. While this property gives neural networks incredible flexibility, as evidenced by their application to a very wide range of problems, is not straightforward to determine an appropriate model neural network architecture and the corresponding set of model weights required to achieve a useful approximator. In this paper we examine the challenge of finding the optimal weights for a given neural network model architecture by comparing the performance of several well known optimisation algorithms, Adam [Kingma2017], LM [Levenberg1944] [Marquardt1963], BFGS [Nocedal2006] and BFGS-L [Liu1989]. We apply these algorithms to a well known but challenging problem where the frequency is fixed and the amplitude varies. We also investigate the process of model optimisation by studying the fit achieved in incremental steps.
Finally, the results presented here have significant practical implications. This study was motivated by earlier studies where we examined applying machine learning (ML) to detecting faults of fielded machinery using unsupervised learning[Wang2021a] [Wang2021] where the function approximation and the choice of the optimizer was a key task in delivering breakthrough results.
There are many quasi-Newton based optimisation algorithms available for training neural networks. All algorithms are a compromise between accuracy, robustness, computational efficiency, memory usage, scalability and flexibility. We have selected four very well known optimisation algorithms that have seen widespread application for our study, briefly summarized as follows.
The Adam algorithm [Kingma2017] is a first-order gradient based optimization of stochastic loss functions and has enjoyed widespread application to fitting large and complex deep neural networks. The method is robust, computationally efficient and works well on deep neural networks with millions of parameters. Tuning of the learning rate, learning rate warmup, learning rate schedules, momentum options, selection of regularizers and their parameters is needed to achieve the best optimisation result. We use the Adam optimiser algorithm as implemented in TensorFlow 2.
The BFGS algorithm [Nocedal2006] is a second-order iterative gradient method that computes the Hessian matrix of the loss function using gradient evaluations, , where
is the number of parameters. Updates to the BFGS curvature matrix do not need a matrix inversion which reduces the computational cost significantly. As the Hessian matrix is the basis of the BFGS algorithm, memory usage grows as the square of the number of parameters which results in rapid growth in memory usage, making the approach unsuitable for neural networks with large numbers of parameters. The selection of BFGS algorithm parameters is straightforward. We use the BFGS optimiser as implemented in TensorFlow Probability[abadi2015] [TensorFlow2021].
addresses the problem of large memory usage by the BFGS algorithm by saving only a few vectors that represent an estimate of the full Hessian matrix. L-BFGS is computationally more efficient than BFGS, uses less memory and can be applied to problems with larger numbers of parameters than BFGS. We use the L-BFGS optimiser as implemented in TensorFlow Probability[abadi2015] [TensorFlow2021].
Both BFGS and L-BFGS can fall into local minima. By combining Adam as an initial optimisation algorithm, run over many hundreds of epochs, then continuing with BFGS and/or L-BFGS using the model weights delivered by Adam, the risk of falling into local minima can be reduced. This approach has been used in physics-Informed machine learning in order to solve systems of partial differential equations(PDEs)[Karniadakis2021], for example, and is also used in this study.
Levenberg [Levenberg1944] was the original developer of this method for non-linear least squares based on the finding that simple gradient descent and Gauss-Newton iteration are complementary methods. Marquardt [Marquardt1963] extended the original method by scaling each component of the gradient according to the curvature by including the diagonal of the Hessian. Ananth [Ranganathan2004] provides an intuitive summary of the method. The LM method does require a matrix inversion which is computationally expensive and consumes significant amounts of memory, limiting the application of LM to moderately sized models with thousands of parameters. It is worth noting that modern computing hardware, with substantial computational capability and memory, has expanded the size of models to which LM can usefully be applied. As there is no built-in implementation of LM in Tensorflow and Keras, we imported a user contributed code for Levenberg-Marquardt (LM) algorithm by Fabio Di Marco [FabioDiMarco2021].
2.5 Test model
Our test model is the function over the domain [-1.5, 1.5]. We build a neural network model consisting of an input layer, 2 Dense layers each with 20 hidden units, and an output layer with a total of 481 trainable parameters. We use the activation function in the two dense layers. In all cases we minimize the mean square error loss. We generate a training data set consisting of 20,000 evenly spaced points. We implemented the model in Python using Keras [Chollet2015] and TensorFlow 2 [abadi2015].
2.6 PINN model
Physics Informed Neural Networks (PINNs) are neural networks designed to solve a variety of computational problems while accounting for the physical equations which govern their respective natural phenomena. PINN is an emerging deep learning technique that includes linear/nonlinear Ordinary Differential Equations (ODEs), Partial Differential Equations (PDEs) into the loss function of the NNs via automatic differentiation, in addition to any collected data, essentially acting as an approximate differential equation solver which may function well even with little to no training data. The PINN approach was originally developed by[Raissi2017], with two main applications in mind: data-driven solution (solving differential equations given data/initial conditions/equation parameters) and data-driven discovery of PDEs (find the parameters of the governing PDEs given collected data), by using the fact that deep neural networks can act as universal function approximators [Hornik1989].
PINN’s initial design philosophy addressed the issue that for certain complex physical systems the collection of data may be restrictive. PINNs shine as they do not require huge data sets to produce a sufficiently good result. In most machine learning techniques, more training data means more accurate and robust results, and a sufficient amount of training data is needed for the models to produce anything of quality. The creators of PINNs took advantage of the abundance of domain specific information surrounding physical systems which may not be included in standard neural network methods of modelling. PINNs maximise the available knowledge of a problem by including already known mathematical/physical descriptions of the problems, in addition to the available training data, into the learning algorithms. Doing so may induce faster convergence, and result in a much more generalisable solution when compared to traditional machine learning methods, even with little to no training data [Raissi2019a].
This module implements a PINN model for Burgers’ equation [Raissi2019a]. Burgers’ equation is given by:
where is the kinematic viscosity, is the spatial coordinate, is the temporal coordinate, and is the velocity of the fluid at point . This is a simplification of Navier-Stokes equation where the velocity is in one spatial dimension and where external forces and pressure gradient terms are ignored. We apply the initial conditions and the boundary conditions The PINN model predicts at the point . represents a typical engineering problem where abrupt changes are expected, which can be challenging for optimizers
We use an implementation of the solution for Burgers’ equation [Raissi2019a] written in TensorFlow provided by [okada39]. We compare solutions of the PINN model obtained using the L-BFGS-B and LM optimizers.
3 Results and Discussion
We first explored the performance of the Adam optimiser by allowing the model to train for a fixed number of epochs, at which point training was paused and the solution evaluated. The model training was then restarted using the model weights obtained when training was paused. In this way we can inspect the performance of the optimiser as training proceeds.
Figure 1 shows the fit achieved to the function of the Adam optimizer after 250, 500, 750,1000, 1250 and 1500 epochs. As expected the model fit improves as we increase the number of epochs. The Adam optimizer fits the higher amplitude oscillations first, then fits each subsequent lower amplitude oscillation in the function. Note that the frequency of the oscillations is constant i.e. it is the amplitude only that varies. After 1500 epochs we found that the model fit does not improve and from inspection of the graph we can see that the model has not achieved a highly accurate fit. Simply relying on having obtained a minimum value for the MSE loss function cannot be relied on to ensure that we have obtained a satisfactory fit. In the following experiment we compare the fit obtained from multiple loss functions.
Figure 2 provides a comparison of the performance of the Adam optimizer, the LM, BFGS and BFGS-L optimizers, where we fit the function using the identical neural network, settings and input data for all training runs. BFGS and L-BFGS use the weights generated by the Adam optimizer as their starting point as suggested in previous studies [Raissi2019a]. All model runs show the results of the best fit achieved for each optimiser. The Adam, L-BFGS and BFGS optimizers have difficulty fitting the lower amplitude tails of the function with the LM model achieving the best fit. For Adam, the minimum value of MSE loss function was 0.0002018958 and occurred at optimisation step 1,464, for BFGS-L, 8.655213e-05 at step 14,558, for BFGS, 4.2915726e-06 at step 12,477 and for LM, 1.642267e-07 at step 150, the maximum number of epochs set for LM model training. The LM MSE value is 26x lower than that obtained by BFGS, 527x lower than L-BFGS and 1,230x lower than Adam. The compute time for each optimisation method, measured using the Python time module, was for Adam 76s, BFGS 58s, L-BFGS 172 s and LM 116s, noting that the total time for BFGS and L-BFGS optimisation includes Adam as the starting point, results in a total time for BFGS of 134s and for L-BFGS of 248s.
Figure 3 shows MSE loss value, plotted on a linear scale, using the Adam optimizer for each training epoch where we fit the function using a neural network. The upper panel plots the MSE loss over the entire training run with the bottom panel focussing on the end of the model run. The Adam optimiser exhibits significant variability between epochs before reaching a minimum value after 1300 epochs.
Figure 4 and Figure 5 compare the MSE loss value, plotted on a linear scale, using the L-BFGS and BFGS optimizers. Note that the L-BFGS and BFGS optimizers were initialised with the set of weights from the conclusion of the Adam optimzer run. Figure 4 shows that the L-BFGS optimiser takes many more steps before converging at a higher value of the MSE than BFGS. This would be expected given that L-BFGS does not calculate the full Hessian matrix. Where parameter numbers are in the low to medium range BFGS is likely the better choice of optimizer than L-BFGS. Only where memory usage or computational time becomes too large would L-BFGS be recommended.
Figure 6 presents the MSE loss value, plotted on a linear scale, using the LM optimiser. The LM optimizer reduces the MSE smoothly and in only a few epochs. The MSE loss value rapidly falls below the Adam, L-BFGS and BFGS optimizer MSE values. The benefits of the high quality model fit achieved by the LM optimizer are clearly demonstrated in Figure 2 where, in contrast with the other optimizers, the low amplitude (not frequency) components of the function are well fitted.
Figure 7 compares the results obtained using all optimizers plotted on a log scale. The log scale makes it possible to more easily compare optimiser performance. Figure 7 clearly demonstrates the superior performance of the LM method with a rapid minimization of the MSE to levels several orders of magnitude lower than that achieved by other methods. The LM optimizer requires no tuning to produce these results. While these results were obtained using a simple test function, earlier studies where we examined applying machine learning (ML) to detecting faults of fielded machinery using unsupervised learning [Wang2021a] [Wang2021] [Wang2021b]with noisy data collected from aircraft sensors, support the conclusion that the LM optimizer can deliver superior performance in a real-world setting.
Having determined that the LM optimizer can deliver improved performance on problems with a small to medium number number of weights we have also considered the application of the LM optimizer to PINNs, an important new class of NN applications. Many PINN models require a small to medium number of weights and would benefit from the high accuracy solutions, often required of science and engineering problems, that the LM model is able to deliver.
Figure 8 presents the results of fitting a PINN model, as described above, to Burgers’ equation [Raissi2019a] using the L-BFGS-B optimizer. This version of the BFGS model extends L-BFGS to take into account the box constraints needed to solve Burgers’ equation. The results obtained are identical to that obtained previously [Raissi2019a]. Figure 9 presents the results obtained using the LM optimizser and are identical to the results obtained using L-BFGS-B. Finally, Figure 10 shows the MSE loss curve for the LM optimiser over 50 epochs, with the MSE value approaching
, several orders of magnitude lower than previously reported MSE values, with the potential for further improvement. The application of the LM model required no hyperparameter tuning.
The previous figures all show results with the model architecture fixed for all optimizers. As the Adam optimizer is widely used when fitting deep learning models, we have also investigated how to obtain the best fit to the test function using only the Adam optimizer by modifying the model architecture. As previously shown, the Adam optimizer delivered an MSE value approximately 3 orders of magnitude higher than the LM optimizer. To achieve an improved result for the Adam optimizer we investigated a range of hidden units (HU) using 16,32,48,64 and 80 HU and varied the number of dense layers in the model from 1-4. We use the same number of hidden units for each layer. Using this basic grid search, we have a total of 20 model fit runs using a combination of 5 hidden units and 4 layer combinations. We fit all models over 5000 epochs.
Figure 11 shows the results of this grid search organised as four panels with an increasing number of model layers with each panel including the model fit for all five HU values. Using only a single layer the model fit is very similar across all HU values and does not improve on the MSE of the original model. The maximum number of parameters using a 1 layer model with 80 HUs is 241 compared with the 481 parameters in the original test function with 2 layers and 20 HUs. With the 2 layer model all models achieve improved performance compared with the single layer models. The model with 80 HUs and 2 layers (6,721 parameters) achieves the best result with the MSE value falling below , an improvement on the original model. The model with 3 layers further improves on these results, however the model with 64 HUs (8,513 parameters), rather than 80 HUs (13,201 parameters), produces the best results with the MSE value dropping below . With 4 layers the model is able to achieve good model results 48, 64, and 80 HUs with the best results at 48 HUs (7,201 parameters) and 64 HUs (12,673 parameters). The models with the larger number of parameters were approaching the total number of training samples (20,000). Figure 12 shows the corresponding model fit achieved using the models in Figure 11. The model fit is consistent with the MSE values achieved during model fitting.
Figure 11 and 12 illustrate that the Adam optimizer performs best on this problem primarily by increasing the number of layers and then by increasing the number of HUs. Note that increasing the number of HUs to 80 did not produce the best model for models with 3 and 4 layers. The best Adam optimizer results required 15x more parameters with a 3 layer model and 26x more parameters with 4 layer model as the LM optimized model. While the Adam optimiser results can be improved, even with this much greater number of model parameters, the Adam model cannot outperform the LM optimiser. This result implies that Adam optimiser may require deeper models with large numbers of hidden units in order to achieve the best model fit. The LM optimiser results illustrate that it may be possible build models with far fewer parameters.
We compared a range of well known optimizers applied to fitting neural networks with a small to medium number of weights. We studied the performance of Adam, the Levenberg-Marquardt (LM) algorithm, BFGS and L-BFGS and concluded that the LM optimiser delivered significant advantages over the other methods, including orders of magnitude improvement in the MSE values, rapid optimisation and straightforward application as no hyperparameter tuning was required.
Using these Optimizers we fit the function using a neural network with a few parameters. This function has a variable amplitude and a constant frequency. We observed that as the model fit progressed, the higher amplitude components of the function were fit first. We found that only the LM optimiser could fit this function well across the full range of amplitudes.
We have also demonstrated the usefulness of the LM optimiser to PINNs by solving the Burgers equation to a higher accuracy than has previously been achieved. Combined with the observation that the LM method can achieve an accurate fit where other optimizers cannot, the LM method is likely to be very useful for PINNs and to the broader class of NN models with a small to medium number of parameters.
It would likely benefit the broader ML community if the LM and BFGS optimizers included in could be made available without additional wrapping in TensorFlow. This is recommended as there are a large number of potential applications that would benefit significantly from the ready availability of these optimizers.