Deep learning is now applied successfully in many fields [abiodun2018state], especially in image recognition and natural language processing. This success is due to the approximation power of neural networks [hornik1989multilayer] and to the effective application of manually designed first-order optimization methods [kingma2014adam, duchi2011adaptive, nesterov1983method]. However, the design of an optimization algorithm can itself be treated as a learning problem, in the hope that an optimizer adapted to a particular task converges better.
The No Free Lunch Theorem [wolpert1997no] suggests that there is no universally best learner, and that restricting the hypothesis class with prior knowledge about the task at hand is the only way to improve the state of affairs. This motivates the use of an optimizer learned for the given task and the use of different regularization methods. For instance, the Heavy Ball method [polyak1964some] treats gradient descent as a heavy ball sliding on the surface of the loss function, which results in faster convergence. More generally, one can view gradient descent as the movement of an object on the surface of the loss function under different forces: potential, dissipative (friction), and other external forces. Such a physical process can be described by a port-Hamiltonian system of equations [van2006port]. The work [massaroli2019porthamiltonian] considers the optimization process as the evolution of a port-Hamiltonian system, meaning that the parameters of the neural network are the solutions of the port-Hamiltonian system of equations. The results show that this framework helps to avoid getting stuck at saddle points, which motivates its use for non-convex, high-dimensional neural networks. In this work, we propose to learn the optimizer and to impose the physical laws governed by a port-Hamiltonian system of equations on the optimization algorithm. This provides an implicit bias that acts as regularization and helps to find optima that generalize better. We impose the physical structure by learning the gradients of the parameters: the gradients are the solutions of the port-Hamiltonian system, so their dynamics are governed by physical laws, which are themselves learned.
To summarize, we propose a new framework based on Hamiltonian Neural Networks that is used to learn and improve gradients for the gradient descent step. Our experiments on an artificial task and on the MNIST dataset demonstrate that our method is able to outperform many basic optimizers and achieves performance comparable to the previous LSTM-based one. Furthermore, we explore how learned optimizers transfer to other architectures with different hyper-parameters, e.g. activation functions. To this end, we train an HNN-based optimizer for a small neural network with the sigmoid activation on the MNIST dataset and then train the same network with the ReLU activation using the already trained optimizer. The results show that our method is transferable in this case, unlike the LSTM-based optimizer. The implementation is available on GitHub: https://github.com/AfoninAndrei/OPT-ML.
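Our work is mainly based on [andrychowicz2016learning]. Given a learning task and an objective function $f(\theta)$ defined over some domain $\theta \in \Theta$, the goal is to find the global minimum $\theta^* = \arg\min_{\theta \in \Theta} f(\theta)$. The usual approach in deep learning is to use the gradient-based update rule

$$\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t), \qquad (1)$$

where $\nabla f(\theta_t)$ is the gradient of the objective at the point $\theta_t$ and $\alpha_t$ is the step size. Similar to [andrychowicz2016learning], we propose to use the following update rule instead of Equation (1):

$$\theta_{t+1} = \theta_t + g_t\big(\nabla f(\theta_t), \dot{\nabla} f(\theta_t), \phi\big),$$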
where $g_t$ is the output of the optimizer neural network with parameters $\phi$ and inputs $\nabla f(\theta_t)$ and $\dot{\nabla} f(\theta_t)$, the latter being the time derivative of the gradient at the point $\theta_t$. We propose to think of the gradient as a physical system whose continuous evolution is governed by some laws. The physical structure is encoded into the architecture of the neural network, motivated by [zhong2020symplectic, zhong2020dissipative]:
where $M$ is an inertia matrix, and $q$ and $p$ are the generalized coordinate and momentum of the physical system, respectively, with time derivatives $\dot{q}$ and $\dot{p}$. $H$ is the Hamiltonian, $V$ is a potential, $D$ is a dissipative term, and $G$ denotes external forces that depend only on the coordinate $q$ and affect only the momentum $p$. This form of $G$ is applicable to many physical systems. The inertia matrix $M$, the potential term $V$, the dissipative matrix $D$, and the external forces $G$ are each approximated by a separate neural network. This form of the neural network (or, more precisely, this combination of neural networks) allows the dynamics governed by Equation (2) to be learned from data.
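As a sketch, assuming the dissipative port-Hamiltonian form used in [zhong2020dissipative] without a separate control input (the exact block structure and the kinetic-plus-potential form of $H$ are assumptions), a system matching the description above can be written as

$$
\begin{bmatrix} \dot{q} \\ \dot{p} \end{bmatrix}
= \left( \begin{bmatrix} 0 & I \\ -I & 0 \end{bmatrix}
- \begin{bmatrix} 0 & 0 \\ 0 & D(q) \end{bmatrix} \right)
\begin{bmatrix} \partial H / \partial q \\ \partial H / \partial p \end{bmatrix}
+ \begin{bmatrix} 0 \\ G(q) \end{bmatrix},
\qquad
H(q, p) = \tfrac{1}{2}\, p^\top M^{-1}(q)\, p + V(q).
$$

Written out, this gives $\dot{q} = \partial H / \partial p$ and $\dot{p} = -\partial H / \partial q - D(q)\, \partial H / \partial p + G(q)$, so the dissipative and external terms enter only the momentum equation.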
Let us introduce the notation $q_t = \nabla f(\theta_t)$ and $p_t = \dot{\nabla} f(\theta_t)$; the update proposed by our model is then

$$\theta_{t+1} = \theta_t + g_t(q_t, p_t, \phi).$$

The first input is computed by the usual backpropagation at each iteration, and the second is obtained from the output of Equation (2) at the previous iteration. We call our model the Hamiltonian Neural Network (HNN). In [andrychowicz2016learning], the authors use an LSTM model [sak2014long] as the optimizer network.
Let us take a closer look at the structure of our model. The scheme of the model is presented in Fig. A1. The blue and red blocks are the model's inputs and outputs, respectively. The light gray blocks have the same notation as in Equation (2) and are learnable. The model takes as input the vector of two stacked components at time $t$: the gradient of the objective and its time derivative (blue blocks in Fig. A1). The physical part of the model returns the vector of stacked time derivatives of the input components at time $t$ (green blocks in Fig. A1). By taking the product of the first output component and a small constant $\varepsilon$, we obtain a rough approximation of the corrected gradient at the next time step $t+1$:

$$q_{t+1} \approx q_t + \varepsilon\, \dot{q}_t,$$

where $q_t$ denotes the gradient at time $t$ and $\dot{q}_t$ the first output component.
Such a scheme allows us to correct the gradient so that it obeys the learned dynamics. As we do not know the optimal value of the constant $\varepsilon$, we propose to use a linear layer without bias and with weight matrix $A$, which lets us learn how to combine the gradient $q_t$ and its output derivative $\dot{q}_t$ in the optimal way (top orange block in Fig. A1):

$$q_{t+1} \approx A \begin{bmatrix} q_t \\ \dot{q}_t \end{bmatrix},$$
where the approximation holds up to multiplication by a constant (which can be compensated by a smaller or larger learning rate). We apply the same procedure to approximate $p_{t+1}$ with another linear layer with inputs $p_t$ and $\dot{p}_t$ (bottom orange block in Fig. A1). Recall that we need the component $p_{t+1}$ as an input to our model at the next iteration. For simplicity, we assume that the inertia matrix $M$ does not depend on $q$. This lets us approximate the partial derivative $\partial H / \partial q = \partial V / \partial q$ directly with a neural network.
Thus, for Equation (2) there is no need to compute the Hamiltonian itself: with $\partial H / \partial q = \partial V / \partial q$ and $\partial H / \partial p = M^{-1} p$, we know all the terms that depend on it. Moreover, there is no need to compute the inertia matrix itself, only its inverse. Hence we approximate the inverse of the inertia matrix directly to avoid computational issues, using a form that is positive semi-definite by construction, since $M$ is a positive semi-definite matrix.
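The simplifications above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the learned networks are replaced by fixed random stand-ins, the factorization `L @ L.T + eps * I` for the inverse inertia matrix is one common way to guarantee positive semi-definiteness (an assumption here), and the names `make_psd`, `port_hamiltonian_rhs`, and `combine` are hypothetical.

```python
import numpy as np


def make_psd(L, eps=1e-3):
    """Return L @ L.T + eps * I, which is symmetric and PSD for any square L."""
    return L @ L.T + eps * np.eye(L.shape[0])


def port_hamiltonian_rhs(q, p, M_inv, dV_dq, D, G):
    """Time derivatives (q_dot, p_dot), assuming H(q, p) = 0.5 p^T M_inv p + V(q).

    With M independent of q, dH/dq = dV/dq and dH/dp = M_inv @ p,
    so the Hamiltonian itself is never evaluated.
    """
    dH_dq = dV_dq(q)
    dH_dp = M_inv @ p
    q_dot = dH_dp
    p_dot = -dH_dq - D @ dH_dp + G(q)
    return q_dot, p_dot


def combine(x, x_dot, weights):
    """Bias-free linear combination of x and x_dot (per coordinate, an assumption)."""
    return weights[0] * x + weights[1] * x_dot


rng = np.random.default_rng(0)
n = 4
q = rng.normal(size=n)  # stands in for the gradient of the objective
p = rng.normal(size=n)  # stands in for its time derivative

M_inv = make_psd(rng.normal(size=(n, n)))       # inverse inertia, PSD by construction
D = make_psd(rng.normal(size=(n, n)), eps=0.0)  # dissipation matrix, PSD
V_weights = rng.normal(size=(n, n))


def dV_dq(q):
    """Placeholder for the network that approximates dV/dq."""
    return np.tanh(V_weights @ q)


def G(q):
    """Placeholder for the external-force network (zero in this sketch)."""
    return np.zeros_like(q)


q_dot, p_dot = port_hamiltonian_rhs(q, p, M_inv, dV_dq, D, G)

# With weights (1, eps) the linear layer reduces to the Euler-like step q + eps * q_dot.
q_next = combine(q, q_dot, weights=(1.0, 0.01))
p_next = combine(p, p_dot, weights=(1.0, 0.01))
```

In the model described above the weights of `combine` would be learned rather than fixed; the fixed pair `(1.0, 0.01)` only shows that the small-constant approximation is a special case of the linear layer.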
There have been several attempts to learn the optimizer using recurrent neural networks, specifically LSTMs [ravi2016optimization, younger2001meta]. To the best of our knowledge, this is the first attempt to learn an optimizer with a hidden physical structure.
In this section, we experimentally compare the proposed model against the LSTM model from prior work [andrychowicz2016learning] and against standard optimization methods used in deep learning, such as ADAM [kingma2014adam], RMSprop [hinton2012neural], SGD [ruder2016overview], and NAG [nesterov1983method]. For the standard optimizers and the LSTM-based one, we repeat the experiment settings reported in [andrychowicz2016learning], where the learning rate of each optimizer was tuned.
We take the update step to be the output of the neural network. Similar to [andrychowicz2016learning], we train this neural network with an objective function that depends on a part of the optimization trajectory of length $T$, the horizon.
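As a sketch, assuming the same form as in [andrychowicz2016learning] (the per-step weights $w_t$ are an assumption; the weighting used here is not stated), such a trajectory objective can be written as

$$
\mathcal{L}(\phi) = \mathbb{E}_f \left[ \sum_{t=1}^{T} w_t\, f(\theta_t) \right],
$$

where the iterates $\theta_t$ are produced by the learned update rule and therefore depend on the optimizer parameters $\phi$. Setting $w_t = 1$ for every $t$ weights all steps of the trajectory equally.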
We consider two experiment settings: minimization of a quadratic function and optimization of the base network on the MNIST dataset. The HNN is optimized using ADAM with the same learning rate in both experiment settings; no weight decay is used. We select the best parameters for our model according to the validation loss, which is computed after each training epoch. Finally, in each experiment, we report the average performance on a number of freshly sampled test problems.
III-A Quadratic functions
In this section, we consider the optimization of 10-dimensional quadratic functions of the form

$$f(\theta) = \| W\theta - y \|_2^2,$$

where the elements of $W$ and $y$ are drawn IID from a Gaussian distribution. Each function is optimized for 100 steps using objective (3), with separate horizon parameters for the LSTM and the HNN.
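The quadratic setting can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the form $\|W\theta - y\|_2^2$ follows [andrychowicz2016learning], plain gradient descent stands in for the learned optimizer, and the helper names (`sample_quadratic`, `run_optimizer`) and the horizon value of 20 are hypothetical.

```python
import numpy as np


def sample_quadratic(rng, dim=10):
    """Sample f(theta) = ||W theta - y||^2 with IID standard Gaussian W and y."""
    W = rng.normal(size=(dim, dim))
    y = rng.normal(size=dim)

    def f(theta):
        return float(np.sum((W @ theta - y) ** 2))

    def grad(theta):
        return 2.0 * W.T @ (W @ theta - y)

    return f, grad, W


def run_optimizer(f, grad, update, theta0, steps=100):
    """Apply theta <- theta + update(gradient) and record the loss at every step."""
    theta = theta0.copy()
    losses = [f(theta)]
    for _ in range(steps):
        theta = theta + update(grad(theta))
        losses.append(f(theta))
    return theta, losses


rng = np.random.default_rng(0)
f, grad, W = sample_quadratic(rng)

# Baseline: gradient descent with step 1 / L, where L = 2 * sigma_max(W)^2
# is the Lipschitz constant of the gradient of this quadratic.
step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)


def gd_update(g):
    return -step * g


theta, losses = run_optimizer(f, grad, gd_update, theta0=rng.normal(size=10))

# Trajectory objective over an illustrative horizon (value chosen arbitrarily here).
horizon = 20
trajectory_objective = sum(losses[1:horizon + 1])
```

A learned optimizer would replace `gd_update` with the output of the optimizer network, and `trajectory_objective` would be the quantity differentiated with respect to the optimizer parameters.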
The results are presented in Tab. A1, and Fig. 1 shows the learning curves for the different optimizers. Each curve corresponds to the average performance over 100 test functions. Both learned optimizers outperform the standard optimizers, and the LSTM slightly underperforms the HNN, which has significantly fewer parameters.
In this part, we train the optimizer model to optimize a base network on the MNIST dataset. The objective function is the cross-entropy of the base network, an MLP with 20 units and a sigmoid activation function. The optimization was run for 100 steps, with separate horizon parameters for the LSTM and the HNN. We evaluate each optimization approach over 100 test runs on two base networks, one with the sigmoid and one with the ReLU activation function, and report the average results. The sources of variability between runs are the initial values of the base model parameters and the order of the data batches.
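For concreteness, the base network and its objective can be sketched as below. Several details are assumptions made for illustration: a single hidden layer of 20 units (as in [andrychowicz2016learning]), a small Gaussian initialization, and a synthetic batch with MNIST-like shapes in place of the real data. The activation is passed as an argument so that the same code covers both the sigmoid and the ReLU variants used in the transfer experiment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def init_params(rng, n_in=784, n_hidden=20, n_out=10, scale=0.01):
    """Small random initialization; the scale is an arbitrary illustrative choice."""
    return {
        "W1": scale * rng.normal(size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": scale * rng.normal(size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def cross_entropy(params, x, labels, activation):
    """Mean cross-entropy of a one-hidden-layer MLP on a batch."""
    hidden = activation(x @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


rng = np.random.default_rng(0)
x = rng.random(size=(128, 784))         # synthetic stand-in for a batch of MNIST images
labels = rng.integers(0, 10, size=128)  # synthetic class labels
params = init_params(rng)

loss_sigmoid = cross_entropy(params, x, labels, sigmoid)  # network the optimizer is trained on
loss_relu = cross_entropy(params, x, labels, relu)        # network used to test transfer
```

Because only the `activation` argument changes between the two settings, an optimizer trained on the sigmoid objective can be applied to the ReLU objective without any change to its inputs or outputs.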
The results for the sigmoid and ReLU activations are presented in Tab. A2 and Tab. A3, respectively. Learning curves for the base network under the different optimizers, averaged over 100 runs, are shown in Fig. 2 and Fig. 3. The sigmoid results show that the HNN is comparable to or slightly worse than the other methods, which we attribute to the shallowness of our model and, as a result, its weak expressive power. At the same time, the ReLU results show that our model has better transfer properties than the LSTM and produces results comparable to those of the standard methods.
This work started from the idea of applying Neural Ordinary Differential Equations (Neural ODE) [chen2019neural] to the given problem. That is, Equation (2) serves as the basis of a Neural ODE that takes the current parameters as input and returns the updated parameters after integration. Another idea is to apply the same procedure to the gradient of the objective function at the current point, so that integration yields a 'corrected' version of the gradient according to the learned dynamics. Experiments with these approaches are quite time-consuming (due to the integration step in the Neural ODE), and the results are not promising. Because it is simpler and performs better than the approaches discussed above, we adopt the method proposed in this paper: without an ODE solver, the model outputs not the corrected gradient itself but its time derivative, which, after multiplication by a small constant, can be seen as a rough approximation of the change in the gradient according to the learned dynamics.
Because they are specialized, optimizers trained for a given problem produce better or comparable results relative to standard optimizers. Our experiments show that the learned neural optimizer with a hidden physical structure performs comparably to the LSTM optimizer proposed in prior work and to widely used gradient methods, while having far fewer parameters. Moreover, it generalizes better than the LSTM model because it is not over-parameterized and thus does not overfit to MNIST with the sigmoid activation. This shows that gradients can be learned by the HNN and that we can benefit from the induced implicit bias, at least in the simple optimization of a quadratic function.