Gradient methods, with stochastic gradient descent at the forefront, play a fundamental role in the nonlinear optimization of artificial neural networks. In recent years many efforts have been devoted to better understanding and improving existing optimization methods. This has led to the widespread use of the momentum method and learning rate schedulers, as well as to the creation of new algorithms such as AdaGrad, AdaDelta, RMSprop, and Adam.
These methods, however, are not without flaws, as they need extensive tuning or a costly grid search of hyperparameters, together with a suitable learning rate scheduler: a too-large step-size may result in instability, while a too-small one may converge slowly and may lead to getting stuck in saddle points.
As the choice of a proper learning rate is crucial, in this paper we aim to find a step which is locally optimal with respect to the direction of the gradient. Our ideas were motivated by the L4 algorithm, which, however, applies this idea globally: it computes the linearization of the loss function at the given point and proceeds to the global root of this linearization. Furthermore, unlike WNGrad, our algorithm can both increase and decrease the learning rate.
Our algorithm is similar in principle to line search methods, as it aims to adjust the learning rate based on function evaluations in the selected direction. Line search methods, however, perform well only in a deterministic setting and require a stochastic equivalent to perform well with SGD. In contrast, lossgrad is intended to work well also in a stochastic setting. The difference is that our algorithm changes the step-size after taking the step, thus never requiring any copying of potentially large amounts of memory for the network weights.
The method we propose is based on the idea of finding a step-size which is locally optimal, i.e. we follow the direction of the gradient so as to maximally decrease the cost function. Thus, given a function $f$ (which we want to minimize), a point $x$ and the gradient $\nabla f(x)$ (clearly, an arbitrary direction $v$ provided by some minimization method can be used in place of the gradient), we aim to find the step-size $h$ which is (locally) optimal, i.e. which satisfies:

$$h = \mathop{\mathrm{arg\,min}}_{t \geq 0} f(x - t\,v). \tag{1}$$
A natural question is how to realize the above search in practice. This paper is devoted to the examination of one possible solution.
We assume that we are given a candidate step-size $h$ from the previous step (or some initial choice in the first step). A crucial role is played by the relation between the value of $f$ after the step of size $h$ and the value given by its linearized prediction (see Figure 1):

$$f(x - h\,v) \;\approx\; f(x) - h\,\langle \nabla f(x), v\rangle.$$

We implicitly assume here that $\langle \nabla f(x), v\rangle > 0$ (if this is not the case, we replace $v$ by $-v$).
Our idea relies on considering the loss function in the direction $v$:

$$L(t) = f(x - t\,v),$$
and fitting a quadratic function $q$, which coincides with $L$ at the end points of the interval $[0, h]$ and has the same derivative at zero (see Figure 2), i.e. such that:

$$q(0) = L(0), \qquad q(h) = L(h), \qquad q'(0) = L'(0) = -\langle \nabla f(x), v\rangle.$$
Then we get:

$$q(t) = L(0) + L'(0)\,t + \frac{L(h) - L(0) - L'(0)\,h}{h^2}\,t^2.$$
To compute $q$ we need the gradient $\nabla f(x)$ and the evaluation of $f$ at $x - h\,v$ (the predicted next point at which we will arrive according to the current value of the step-size). Consequently, in the case $v = \nabla f(x)$ (i.e. in the case of gradient methods), we only need to additionally evaluate $f(x - h\,\nabla f(x))$. Then the value $L'(0)$ simplifies to:

$$L'(0) = -\|\nabla f(x)\|^2.$$
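To make the fit concrete, the following sketch (our own illustrative code, not a released implementation; all names are ours) constructs $q$ from one gradient and two loss evaluations, on a toy quadratic for which the fit happens to be exact:

```python
import numpy as np

def fit_q(f, x, g, h):
    """Fit q(t) = L(0) + L'(0) t + a t^2 to L(t) = f(x - t g),
    using L(0), L(h) and L'(0) = -||g||^2 (gradient-direction case)."""
    L0 = f(x)                       # loss at the current point
    Lh = f(x - h * g)               # loss at the predicted next point
    dL0 = -np.dot(g, g)             # slope of L at t = 0
    a = (Lh - L0 - dL0 * h) / h**2  # curvature chosen so that q(h) = L(h)
    q = lambda t: L0 + dL0 * t + a * t**2
    dq = lambda t: dL0 + 2 * a * t
    return q, dq

# Toy example: f(x) = x1^2 + 2 x2^2 at x = (1, 1), so g = (2, 4)
f = lambda x: x[0]**2 + 2 * x[1]**2
x, g, h = np.array([1.0, 1.0]), np.array([2.0, 4.0]), 0.1
q, dq = fit_q(f, x, g, h)
# here dq(h) < 0, which signals that the step-size should be increased
```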
The derivative of the function $q$ at the point $h$ is given by:

$$q'(h) = L'(0) + \frac{2\,\big(L(h) - L(0) - L'(0)\,h\big)}{h} = \frac{2\,\big(L(h) - L(0)\big)}{h} - L'(0).$$
This means that if:

$$L(h) < L(0) + \tfrac{1}{2}\,L'(0)\,h,$$

then the derivative of $q$ at the point $h$ is negative, and therefore we should increase the step-size to further minimize $L$ (see Figure 2). On the other hand, if:

$$L(h) > L(0) + \tfrac{1}{2}\,L'(0)\,h,$$

then the derivative of $q$ at the point $h$ is positive, and therefore we should decrease the step-size to further minimize $L$.
In our method, increasing (resp. decreasing) the step-size takes place by multiplying the current learning rate by a learning rate adjustment factor $\rho > 1$ (resp. by $1/\rho$). Pseudocode of the lossgrad algorithm can be found in Algorithm 1. Notice that using our method involves almost no additional computation.
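The resulting loop can be sketched as follows: a minimal full-batch illustration in plain NumPy, with a toy objective and function names of our own choosing. Note that the step is taken before the loss comparison, so no copy of the weights is needed:

```python
import numpy as np

def lossgrad_sgd(f, grad, x, h=0.01, rho=1.1, steps=100):
    """LOSSGRAD-style gradient descent sketch: take the step with the
    current step-size h, then grow or shrink h by the factor rho
    depending on the sign of the fitted quadratic's derivative at h."""
    for _ in range(steps):
        g = grad(x)
        loss_before = f(x)
        x = x - h * g              # take the step first ...
        loss_after = f(x)          # ... then evaluate the new loss
        # q'(h) < 0  iff  L(h) < L(0) - (h / 2) * ||g||^2
        if loss_after < loss_before - 0.5 * h * np.dot(g, g):
            h *= rho               # step was "too short": increase h
        else:
            h /= rho               # step overshot: decrease h
    return x, h

# Usage: minimize f(x) = ||x||^2, whose optimal step-size is 0.5;
# h self-tunes towards that value from a small initial guess
x_opt, h_final = lossgrad_sgd(lambda x: np.dot(x, x),
                              lambda x: 2.0 * x,
                              np.array([3.0, -4.0]), steps=100)
```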
3 LOSSGRAD asymptotic analysis in two dimensions
In the following section we show how this process behaves for the quadratic form $f(x) = \frac{1}{2}\langle A x, x\rangle$, where $x \in \mathbb{R}^2$ and $A$ is a symmetric positive-definite matrix. Observe that in this case lossgrad can be seen as an approximation of the exact solution to equation (1). Therefore, in this section we study how the minimization process given by (1) works for quadratic functions.
To obtain an exact formula we now apply the orthonormal change of coordinates in which $f$ takes the simple form:

$$f(x, y) = \tfrac{1}{2}\,\big(a\,x^2 + b\,y^2\big), \qquad 0 < a \leq b,$$

where $a, b$ are the eigenvalues of $A$.
Starting from a point $(x, y)$, whose gradient equals $\nabla f(x, y) = (a\,x,\; b\,y)$, we have:

$$\varphi(h) = f\big((x, y) - h\,\nabla f(x, y)\big) = \tfrac{1}{2}\,\big(a\,x^2 (1 - h a)^2 + b\,y^2 (1 - h b)^2\big).$$
After taking the derivative we have:

$$\varphi'(h) = -a^2 x^2 (1 - h a) - b^2 y^2 (1 - h b).$$
As our goal is to find the optimal step-size, we equate the derivative to zero:

$$a^2 x^2 (1 - h a) + b^2 y^2 (1 - h b) = 0.$$
Since $a^3 x^2 + b^3 y^2 > 0$, we have:

$$h = \frac{a^2 x^2 + b^2 y^2}{a^3 x^2 + b^3 y^2}. \tag{2}$$
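The closed-form optimal step derived here can be checked numerically against a brute-force line search; the eigenvalues and starting point below are illustrative choices of ours:

```python
import numpy as np

# Diagonalized 2-D quadratic f(x, y) = (a x^2 + b y^2) / 2
a, b = 1.0, 10.0                 # eigenvalues of A (illustrative)
x, y = 3.0, -2.0                 # current point
gx, gy = a * x, b * y            # gradient components

# Closed-form locally optimal step-size along the gradient:
h_star = (a**2 * x**2 + b**2 * y**2) / (a**3 * x**2 + b**3 * y**2)

# Brute-force check: h_star minimizes phi(h) = f((x, y) - h * grad)
phi = lambda h: (a * (x - h * gx)**2 + b * (y - h * gy)**2) / 2
hs = np.linspace(0.0, 1.0, 100001)
h_num = hs[np.argmin(phi(hs))]
```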
To see how this process behaves after a greater number of steps, we assume that $0 < a < b$. We consider the function $F$ which transports a point to its next iteration:

$$F(x, y) = (x, y) - h(x, y)\,\nabla f(x, y), \tag{3}$$

where $h(x, y)$ is given by (2), and thus $f(F(x, y)) \leq f(x, y)$. By taking the worst possible case, attained for points proportional to $(1/a, \pm 1/b)$, we obtain the following estimate for the convergence:

$$\frac{f(F(x, y))}{f(x, y)} \;\leq\; \left(\frac{b - a}{b + a}\right)^2. \tag{4}$$
Notice that this estimate is invariant to data and function scaling (i.e. to the substitutions $x \mapsto c\,x$ and $f \mapsto c\,f$ for $c > 0$).
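Under illustrative eigenvalues, the estimate can be verified by iterating the exact line-search step from the classical worst-case starting point, whose gradient has components of equal magnitude (a numerical sketch of ours, not the authors' code):

```python
import numpy as np

a, b = 1.0, 4.0                       # eigenvalues (illustrative)
rate = ((b - a) / (b + a))**2         # predicted worst-case ratio, here 0.36

f = lambda x, y: (a * x**2 + b * y**2) / 2
x, y = 1.0 / a, 1.0 / b               # worst-case point: gradient is (1, 1)

ratios = []
for _ in range(20):
    gx, gy = a * x, b * y             # gradient
    h = (a**2 * x**2 + b**2 * y**2) / (a**3 * x**2 + b**3 * y**2)
    f_prev = f(x, y)
    x, y = x - h * gx, y - h * gy     # exact line-search step
    ratios.append(f(x, y) / f_prev)   # per-step decrease ratio
# for this starting point every ratio equals ((b - a) / (b + a))**2
```

From this starting point the iterates stay proportional to $(1/a, \pm 1/b)$, so the bound is attained at every step rather than merely approached.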
One can easily observe that the estimate (4) gives an upper bound on the decrease rate of the solution for any standard gradient descent method with a fixed learning rate. Whether the same holds for the lossgrad approximation is an open problem.
We tested lossgrad in multiple different setups. First, we explore the algorithm's resilience to the choice of the initial learning rate and its behavior for a range of different LR adjustment factor values. Then we proceed to test the algorithm's performance of fully connected networks on the MNIST classification and Fashion-MNIST autoencoder tasks. Convolutional neural networks are tested on the CIFAR-10 classification task using the ResNet-56 architecture. We also evaluate our optimizer for an LSTM model trained on the Large Movie Review Dataset (IMDb). Finally, we test lossgrad on a Wasserstein Auto-Encoder with Maximum Mean Discrepancy-based penalty on the CelebA dataset. Network architectures and all hyperparameters used in the experiments are listed in Appendix A.
In our experiments we compare lossgrad with WNGrad and with L4 applied to SGD. We found that L4 based on vanilla SGD is extremely unstable, both with standard and with tuned parameters, on almost all datasets and network architectures, so we do not present those results here. For each experiment we also tested the standard SGD optimizer with a range of learning rate hyperparameters, including a highly tuned one. For comparison, we also included SGD with a scheduled learning rate whenever that enhanced the results. Because dropout heavily affects lossgrad's stability, we decided not to use it in our experiments.
[Table: test loss and test accuracy]
We test the initial learning rate over a range of values for a convolutional neural network on CIFAR-10, with the rest of the settings staying the same. The resulting test loss values are presented in Fig. 3 on the right, while the step-size values for the first 800 batches are presented on the left. Irrespective of the initial learning rate chosen, the step size always converges to the same range of values for this experiment setup. Thus, the need for tuning the initial learning rate is practically eliminated, and this property makes the algorithm noticeably attractive.
As lossgrad requires one hyperparameter, the learning rate adjustment factor, we explore which values are appropriate. This is tested by training a convolutional neural network on the CIFAR-10 dataset, using our optimizer parameterized with a different factor value each time, with the rest of the settings staying the same. We found that both very low and very high values tend to cause unstable behavior. Based on these results, the rest of the experiments in this paper use the same adjustment factor and initial learning rate.
[Table: train loss and test loss]
[Table: test loss and test accuracy]
Tab. 2 and Tab. 4 present the results for a fully connected network trained on MNIST and an autoencoder trained on Fashion-MNIST, respectively. We noticed the occurrence of sudden spikes in the loss (classification accuracy drops and subsequent recoveries) in the case of classification on MNIST, but not when training the autoencoder on Fashion-MNIST. The spikes correspond to learning rate peaks, which suggests that a temporarily too high step size causes the learning process to diverge.
Fig. 4 presents test accuracy and averaged step size (learning rate) when training a ResNet-56 network on CIFAR-10, while Tab. 4 presents the results summary. Even with a low initial learning rate, lossgrad still manages to achieve better results than SGD, being beaten only by optimized scheduled SGD. Note the step-size spike at the beginning of the training process. This spike consistently appears at the beginning of training in nearly every setup tested in this paper. This behavior is in line with many learning rate schedulers used in the training of neural networks, which increase the step-size at the beginning of training and then, after a few epochs, decrease it.
[Table: test loss and test accuracy]
[Table: train loss and test loss]
Results for the LSTM trained on the IMDb dataset are presented in Tab. 6. Here, for vanilla SGD, a higher learning rate is preferred. lossgrad instead selects a very low step-size after the initial peak, and only manages to achieve better results than untuned SGD.
Finally, the results for WAE-MMD are presented in Tab. 6. The optimizer (Adam) and scheduler combination used in the original work is marked as "original". Properly tuned SGD, as well as WNGrad, yields better results than lossgrad, which chooses a very low step size for this problem.
We provide an implementation of the algorithm, with basic examples of usage, in a git repository: https://github.com/bartwojcik/lossgrad.
We proposed lossgrad, a simple optimization method that approximates the locally optimal step-size. We analyzed the algorithm's behavior on a two-dimensional quadratic form example and tested it on a broad range of experiments. Resilience to the choice of the initial learning rate and the lack of additional hyperparameters are the most attractive properties of our algorithm.
In future work, we aim to investigate and possibly mitigate the loss spikes encountered in the experiments, as well as work on increasing the algorithm’s effectiveness. A version for momentum SGD and Adam is also an interesting topic for exploration that we intend to pursue.
-  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
-  Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
-  Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2015.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
-  Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems, pages 6434–6444, 2018.
-  Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
-  Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
-  Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
-  Xiaoxia Wu, Rachel Ward, and Léon Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770, 2018.
-  Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Network architecture, hyperparameters and datasets description
Tab. 7 also contains the architectures used for image classification on the MNIST dataset and for autoencoder training on the Fashion-MNIST dataset. We trained both networks for 30 epochs, with the rest of the hyperparameters being the same as in the CIFAR-10 experiment. For MNIST, we selected a range of learning rate values for SGD and for WNGrad; for Fashion-MNIST, we likewise tested several learning rate values for SGD and for WNGrad. We also evaluated two scheduled optimizers. The first one linearly increases the learning rate until epoch 10, keeps it constant until epoch 15, and decreases it afterward (trapezoidal). The second one multiplies the learning rate by a fixed factor after 20 epochs. Mean square error was used as the loss function.
The ResNet-56 architecture is described in the original ResNet paper. We trained the network for 250 epochs, used random cropping and inversion during training, and applied weight decay. We tested several learning rate values for SGD and for WNGrad; for step-scheduled SGD, the learning rate was multiplied by a fixed factor after epochs 100 and 150. We set the mini-batch size to 128 in this experiment.
For the IMDb experiments, we used a pretrained embedding layer, trained with the GloVe algorithm on 6 billion tokens, available from the torchtext library. Its 100-element output was fed to one layer of bidirectional LSTM with 256 hidden units for each direction. A final linear layer transformed the LSTM output to a scalar. We set the learning rate for WNGrad, tested several learning rate values for SGD, and also tested a scheduled SGD with the learning rate decreased after 10 epochs.
The architecture for WAE-MMD is the same as in the original work. The minibatch size was set to 64, mean square error was selected as the loss function, and the network was trained for 80 epochs. We set the initial learning rates for WNGrad and for SGD accordingly.