A novel adaptive learning rate scheduler for deep neural networks

by Rahul Yedida, et al.

Optimizing deep neural networks is largely thought to be an empirical process, requiring manual tuning of several parameters, such as learning rate, weight decay, and dropout rate. Arguably, the learning rate is the most important of these to tune, and this has gained more attention in recent works. In this paper, we propose a novel method to compute the learning rate for training deep neural networks. We derive a theoretical framework to compute learning rates dynamically, and then show experimental results on standard datasets and architectures to demonstrate the efficacy of our approach.





1 Introduction

Deep learning [8] is becoming increasingly ubiquitous across several tasks, including image recognition [23, 30], face recognition [31], and object detection [6]. At the same time, the trend is towards deeper neural networks [13, 9]. Deep convolutional neural networks [16, 17] are a variant that introduces convolutional and pooling layers, and have seen incredible success in image classification [22, 34], even surpassing human-level performance [9]. Very deep convolutional neural networks have even crossed 1000 layers [11].

Despite their popularity, training neural networks is made difficult by several problems, including vanishing and exploding gradients [7, 3] and overfitting. Various advances, including different activation functions [15, 18], batch normalization [13], novel initialization schemes [9], and dropout [27], offer solutions to these problems.

However, a more fundamental problem is that of finding optimal values for various hyperparameters, of which the learning rate is arguably the most important. It is well known that learning rates that are too small are slow to converge, while learning rates that are too large cause divergence [2]. Recent works agree that rather than a fixed learning rate, a non-monotonic learning rate schedule offers faster convergence [21, 24]. It has also been argued that the traditional wisdom that large learning rates should not be used may be flawed: large learning rates can lead to "super-convergence" and have regularizing effects [26]. Our experimental results agree with this statement; however, rather than use cyclical learning rates based on intuition, we propose a novel method to compute an adaptive learning rate backed by theoretical foundations.

To the best of our knowledge, this is the first work to suggest an adaptive learning rate scheduler with a theoretical background and show experimental verification of its claim on standard datasets and network architectures. Thus, our contributions are as follows. First, we propose a novel theoretical framework for computing an optimal learning rate in stochastic gradient descent in deep neural networks, based on the Lipschitz constant of the loss function. We show that for certain choices of activation functions, only the activations in the last two layers are required to compute the learning rate. Second, we compute the ideal learning rate for several commonly used loss functions, and use these formulas to experimentally demonstrate the efficacy of our approach. Finally, we extend the above theoretical framework to derive adaptive versions of other common optimization algorithms, namely, gradient descent with momentum, RMSprop, and Adam. We also show experimental results using these algorithms.

During the experiments, we explore cases where adaptive learning rates outperform fixed learning rates. Our approach exploits functional properties of the loss function, and makes only two minimal assumptions about it: it must be Lipschitz continuous [20] and (at least) once differentiable. Commonly used loss functions satisfy both these properties.

The code, trained models, program outputs, and training history are available in our GitHub repository at https://github.com/yrahul3910/adaptive-lr-dnn.

The rest of the paper is organized as follows. Section 2 discusses some related work. Section 3 introduces our novel theoretical framework. Sections 4 to 6 derive the learning rates for some common loss functions. Section 7 discusses how regularization fits into our proposed approach. Section 8 extends our framework to other optimization algorithms. Section 9 then shows experimental results demonstrating the benefits of our approach. Finally, Section 10 discusses some practical considerations when using our approach, and Section 11 concludes and discusses possible future work.

2 Related Work

Several enhancements to the original gradient descent algorithm have been proposed. These include adding a "momentum" term to the update rule [29], and "adaptive gradient" methods such as RMSProp [32] and Adam [14], which combines RMSProp and AdaGrad [5]. These methods have seen widespread use in deep neural networks [19, 33, 1].

Recently, there has been a lot of work on finding novel ways to adaptively change the learning rate. These have both theoretical [21] and intuitive, empirical [26, 24] backing. These works rely on non-monotonic scheduling of the learning rate. [24] argues for cyclical learning rates. Our proposed method also yields a non-monotonic learning rate, but does not follow any predefined shape.

Our work is also motivated by recent works that theoretically show that stochastic gradient descent is sufficient to optimize over-parameterized neural networks, making minimal assumptions [35, 4]. Our aim is to mathematically identify an optimal learning rate, rejecting the notion that only small learning rates must be used, and then experimentally show the validity of our claims.

We also emphasize here that we discuss extensions to our framework, and apply it to other optimization algorithms; few papers explore these algorithms, choosing instead to only focus on SGD.

3 Theoretical framework

For a neural network that uses the sigmoid, ReLU, or softmax activations, it is easily shown that the gradients get smaller towards the earlier layers in backpropagation. Because of this, the gradients at the last layer are the largest among all the gradients computed during backpropagation. If $w^{(l)}_{jk}$ is the weight from node $k$ to node $j$ at layer $l$, and if $L$ is the number of layers, then

$$\max_{j,k}\left|\frac{\partial E}{\partial w^{(L)}_{jk}}\right| \geq \left|\frac{\partial E}{\partial w^{(l)}_{jk}}\right| \quad \forall\, l, j, k \qquad (1)$$

Essentially, (1) says that the maximum gradient of the error with respect to the weights in the last layer is greater than the gradient of the error with respect to any weight in the network (note that we loosely use the term "weights" to refer to both the weight matrices and the biases). In other words, finding the maximum gradient at the last layer gives us a supremum of the Lipschitz constants of the error, where the gradient is taken with respect to the weights at any layer. For brevity, we call this supremum the Lipschitz constant of the loss function.

We now analytically arrive at a theoretical Lipschitz constant for different types of problems. The inverse of these values can be used as a learning rate in gradient descent. Specifically, since the Lipschitz constant that we derive is an upper bound on the gradients, we effectively limit the size of the parameter updates without necessitating an overly guarded learning rate. In any layer, we have the computations

$$z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = g\left(z^{(l)}\right)$$

where $g$ is the activation function. Thus, the gradient with respect to any weight in the last layer is computed via the chain rule as follows.

$$\frac{\partial E}{\partial w^{(L)}_{jk}} = \frac{\partial E}{\partial a^{(L)}_j}\cdot\frac{\partial a^{(L)}_j}{\partial z^{(L)}_j}\cdot\frac{\partial z^{(L)}_j}{\partial w^{(L)}_{jk}} \qquad (5)$$

This gives us

$$\left|\frac{\partial E}{\partial w^{(L)}_{jk}}\right| \leq \left|\frac{\partial E}{\partial a^{(L)}_j}\right|\cdot\left|g'\left(z^{(L)}_j\right)\right|\cdot\left|a^{(L-1)}_k\right| \qquad (6)$$

The third part, the penultimate-layer activation, cannot be analytically computed; we denote its upper bound by $K_z$. We now look at various types of problems and compute these components.
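Concretely, $K_z$ can be estimated from a mini-batch of penultimate-layer activations. A minimal sketch (the helper name and the use of the 2-norm are our own choices):

```python
import numpy as np

def estimate_kz(penultimate_activations):
    """Estimate K_z as the largest 2-norm of the penultimate-layer
    activations over a mini-batch (a hypothetical helper)."""
    norms = np.linalg.norm(penultimate_activations, axis=1)
    return norms.max()

# Toy mini-batch: 4 examples, 3 penultimate units.
acts = np.array([[1.0, 2.0, 2.0],   # norm 3
                 [0.0, 0.0, 1.0],   # norm 1
                 [3.0, 4.0, 0.0],   # norm 5
                 [1.0, 0.0, 0.0]])  # norm 1
print(estimate_kz(acts))  # -> 5.0
```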

4 Regression

For regression, we use the least-squares cost function. Further, we assume that there is only one output node. That is,

$$E = \frac{1}{2m}\left\lVert y - a^{(L)}\right\rVert^2$$

where the vectors $y$ and $a^{(L)}$ contain the values for each training example. Then we have,

$$\frac{\partial E}{\partial a^{(L)}} = \frac{1}{m}\left(a^{(L)} - y\right)$$

This gives us,

$$\left\lVert\frac{\partial E}{\partial a^{(L)}}\right\rVert \leq \frac{\left\lVert a^{(L)}\right\rVert + \lVert y\rVert}{m} \leq \frac{2M}{m}$$

where $M$ is the upper bound of $\lVert y\rVert$ and $\lVert a^{(L)}\rVert$. A reasonable choice of norm is the 2-norm.

Looking back at (6), the second term on the right side of the equation is the derivative of the activation with respect to its parameter. Notice that if the activation is sigmoid or softmax, this derivative is necessarily less than 1; if it is ReLU, it is either 0 or 1. Therefore, to find the maximum, we assume that the network is composed solely of ReLU activations, and the maximum of this term is 1.

From (6), we have

$$K \leq \frac{2M}{m}\cdot 1\cdot K_z = \frac{2MK_z}{m}$$
The inverse of this, therefore, can be set as the learning rate for gradient descent.
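As a worked sketch of this inverse (assuming the bound $K \leq 2MK_z/m$ above; the function name is ours):

```python
def regression_lr(m, M, kz):
    """Inverse of the regression bound K <= 2*M*K_z / m (symbols as in
    this section; the helper itself is a sketch, not the paper's code)."""
    K = 2.0 * M * kz / m
    return 1.0 / K

# Mini-batch of 128 examples, targets/outputs bounded by M = 4, K_z = 16.
print(regression_lr(128, 4.0, 16.0))  # -> 1.0
```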

5 Binary classification

For binary classification, we use the binary cross-entropy loss function. Assuming only one output node,

$$E = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log a_i + (1-y_i)\log(1-a_i)\right] \qquad (10)$$

where $a = \sigma(z)$ is the sigmoid function. We use a slightly different version of (6) here:

$$\frac{\partial E}{\partial w^{(L)}_{jk}} = \frac{\partial E}{\partial z^{(L)}_j}\cdot\frac{\partial z^{(L)}_j}{\partial w^{(L)}_{jk}} \qquad (11)$$

Then, we have

$$\frac{\partial E}{\partial z} = \frac{1}{m}\left(\sigma(z) - y\right) \qquad (12)$$

It is easy to show, using the second derivative, that the sigmoid's slope $\sigma'(z) = \sigma(z)(1-\sigma(z))$ attains a maximum at $z = 0$:

$$\frac{d}{dz}\,\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)\left(1-2\sigma(z)\right) \qquad (13)$$

Setting (13) to 0 yields $\sigma(z) = \frac{1}{2}$, and thus $z = 0$. This implies $\sigma(0) = \frac{1}{2}$. Now whether $y$ is 0 or 1, substituting this back in (12), we get

$$\left|\frac{\partial E}{\partial z}\right| \leq \frac{1}{2m} \qquad (14)$$

Using (14) in (11),

$$K \leq \frac{K_z}{2m} \qquad (15)$$
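A hedged sketch of the resulting learning rate, assuming the binary cross-entropy bound works out to $K = K_z/(2m)$ as in our reading of this section (treat the constant as a reconstruction):

```python
def binary_ce_lr(m, kz):
    """Inverse of the assumed bound K = K_z / (2m) for binary
    cross-entropy; the factor of 2 is our reconstruction."""
    return 2.0 * m / kz

print(binary_ce_lr(128, 64.0))  # -> 4.0
```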
6 General cross-entropy loss function

While multi-class classification is conventionally done using one-hot encoded outputs, that representation is not convenient to work with mathematically. An equivalent form is to assume the output follows a multinomial distribution, and to update the loss function accordingly. This works because the effect of the typical loss function is to consider only the "hot" component; we achieve the same effect using the Iverson bracket notation, which is equivalent to the Kronecker delta. With this framework, the loss function for $k$ classes is

$$E = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}\left[y_i = j\right]\log a_{ij} \qquad (16)$$

Then the first part of (6) is trivial to compute:

$$\frac{\partial E}{\partial a_j} = -\frac{1}{m}\cdot\frac{[y=j]}{a_j} \qquad (17)$$

The second part, the derivative of the softmax, is computed as follows.

$$\frac{\partial a_j}{\partial z_c} = a_j\left(\delta_{jc} - a_c\right) \qquad (18)$$

Combining (17) and (18) in (5) gives

$$\frac{\partial E}{\partial z_c} = \frac{1}{m}\left(a_c - [y=c]\right) \qquad (19)$$

It is easy to show that the limiting case of this is when all softmax values are equal, each $a_c = \frac{1}{k}$; using this and $[y=c] = 1$ in (19) and combining with (6) gives us our desired result:

$$K \leq \frac{(k-1)K_z}{km} \qquad (20)$$
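A sketch of the resulting learning rate for $k$-class cross-entropy, assuming the bound takes the form $K = (k-1)K_z/(km)$ as reconstructed above. Using the $K_z$ value reported for CIFAR-10 in Section 9 (206.695) and batch size 128, this simplified formula gives roughly 0.69; the paper reports 0.668 via the full computation, so the exact constants here are an assumption:

```python
def softmax_ce_lr(m, k, kz):
    """Inverse of the assumed bound K = (k - 1) * K_z / (k * m) for
    k-class cross-entropy (a reconstruction, not the paper's code)."""
    K = (k - 1) * kz / (k * m)
    return 1.0 / K

# CIFAR-10-like setting: batch size 128, 10 classes, K_z = 206.695.
print(round(softmax_ce_lr(128, 10, 206.695), 3))  # -> 0.688
```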
7 A note on regularization

It should be noted that this framework is extensible to the case where the loss function includes a regularization term. In particular, if an $L_2$ regularization term $\frac{\lambda}{2m}\lVert w\rVert^2$ is added, it is trivial to show that the Lipschitz constant increases by $\frac{\lambda K_w}{m}$, where $K_w$ is the upper bound for $\lVert w\rVert$. More generally, if a Tikhonov regularization term $\frac{\lambda}{2m}\lVert\Gamma w\rVert^2$ is added, then the increase in the Lipschitz constant can be computed as below. If $\lVert\Gamma\rVert$ is bounded by $K_\Gamma$,

$$\left\lVert\frac{\lambda}{m}\Gamma^\top\Gamma w\right\rVert \leq \frac{\lambda K_\Gamma^2 K_w}{m}$$

This additional term may be added to the Lipschitz constants derived above when gradient descent is performed on a loss function that includes a Tikhonov regularization term. Clearly, for an $L_2$-regularizer, since $\Gamma = I$, we have $K_\Gamma = 1$, recovering the increase of $\frac{\lambda K_w}{m}$.

8 Going Beyond SGD

The framework presented so far easily extends to algorithms that build on SGD, such as RMSprop, momentum, and Adam. In this section, we present adaptive versions of several popular optimization algorithms.

RMSprop, gradient descent with momentum, and Adam are based on exponentially weighted averages of the gradients. The trick then is to compute the Lipschitz constant as an exponentially weighted average of the norms of the gradients. This makes sense, since it provides a supremum of the “velocity” or “accumulator” terms in momentum and RMSprop respectively.

8.1 Gradient Descent with Momentum

SGD with momentum uses an exponentially weighted average of the gradient as a velocity term. The gradient is replaced by the velocity in the weight update rule.

1  $v \leftarrow 0$; $\beta \leftarrow 0.9$; $K_0 \leftarrow 0$;
2  for each iteration do
3      Compute $\nabla_W E$ and the Lipschitz constant $K$ (Sections 4–6) for all layers;
4      $v \leftarrow \beta v + (1-\beta)\nabla_W E$;
5      // Compute the exponentially weighted average of LC
6      $K_t \leftarrow \beta K_{t-1} + (1-\beta)K$;
7      // Weight update
8      $W \leftarrow W - \frac{1}{K_t}v$;
9  end for
Algorithm 1: AdaMo

Algorithm 1 shows the adaptive version of gradient descent with momentum. The only changes are the exponentially weighted average of the Lipschitz constant and the weight update. The averaged Lipschitz constant ensures that the learning rate for that iteration is appropriate, and the weight update is changed to reflect our new learning rate. We use the symbol $W$ to consistently refer to the weights as well as the biases; while "parameters" may be a more apt term, we use $W$ to stay consistent with the literature.

Notice that only the Lipschitz-constant computation is our job; deep learning frameworks will typically take care of the rest. We simply need to compute $K_t$ and use a learning rate scheduler that uses the inverse of this value.
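The learning-rate portion of AdaMo can be sketched in pure Python as follows (the function name, the illustrative $\beta$, and returning a list of per-epoch rates are our own; pinning the first epoch to a fixed rate follows the discussion in Section 9):

```python
def adamo_lr_schedule(ks, beta=0.9, first_epoch_lr=0.1):
    """Per-epoch learning rates from an exponentially weighted average
    of Lipschitz estimates `ks`; the first epoch is pinned to a fixed
    rate, as we do for AdaMo."""
    lrs = [first_epoch_lr]
    k_avg = ks[0]
    for k in ks[1:]:
        k_avg = beta * k_avg + (1 - beta) * k  # EWA of the Lipschitz constant
        lrs.append(1.0 / k_avg)                # learning rate is its inverse
    return lrs

# With a constant Lipschitz estimate of 2, later epochs settle at 1/2.
print(adamo_lr_schedule([2.0, 2.0, 2.0], beta=0.5))  # -> [0.1, 0.5, 0.5]
```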

8.2 RMSprop

RMSprop uses an exponentially weighted average of the square of the gradients. The square is performed element-wise, and thus preserves dimensions. The update rule in RMSprop replaces the gradient with the ratio of the current gradient and the exponentially moving average. A small value is added to the denominator for numerical stability.

Algorithm 2 shows the modified version of RMSprop. We simply maintain an exponentially weighted average of the Lipschitz constant as before; the learning rate is also replaced by the inverse of the update term, with the exponentially weighted average of the square of the gradient replaced with our computed exponentially weighted average.

1  $K_0 \leftarrow 0$; $\beta \leftarrow 0.9$; $\epsilon \leftarrow 10^{-8}$;
2  for each iteration do
3      Compute $\nabla_W E$ and the Lipschitz constant $K$ on the mini-batch;
4      // Compute the exponentially weighted average of LC
5      $K_t \leftarrow \beta K_{t-1} + (1-\beta)K^2$;
6      // Weight update
7      $W \leftarrow W - \frac{\nabla_W E}{\sqrt{K_t} + \epsilon}$;
8  end for
Algorithm 2: Adaptive RMSprop

8.3 Adam

Adam combines the above two algorithms. We thus need to maintain two exponentially weighted average terms. The algorithm, shown in Algorithm 3, is quite straightforward.

1  $v \leftarrow 0$; $u_0 \leftarrow 0$; $r_0 \leftarrow 0$; $\beta_1 \leftarrow 0.9$; $\beta_2 \leftarrow 0.999$;
2  for each iteration do
3      Compute $\nabla_W E$ and the Lipschitz constant $K$ on the mini-batch;
4      $v \leftarrow \beta_1 v + (1-\beta_1)\nabla_W E$;
5      // Compute the exponentially weighted averages of LC
6      $u_t \leftarrow \beta_1 u_{t-1} + (1-\beta_1)K$;
7      $r_t \leftarrow \beta_2 r_{t-1} + (1-\beta_2)K^2$;
8      // Weight update
9      $W \leftarrow W - \frac{1}{u_t}\cdot\frac{v}{\sqrt{r_t} + \epsilon}$;
10 end for
Algorithm 3: Auto-Adam

In our experiments, we use the defaults of $\beta_1 = 0.9$ and $\beta_2 = 0.999$.

In practice, it is difficult to get a good estimate of the effective learning rate term in Auto-Adam. For this reason, we tried two different estimates:

  • The first estimate set the learning rate high (around 4 on CIFAR-10 with DenseNet), and the model quickly diverged.

  • The second estimate turned out to be an overestimation, and while the same model above did not diverge, it oscillated around a local minimum. We fixed this by removing the middle term of the estimate, which worked quite well empirically.

8.4 A note on bias correction

Some implementations of the above algorithms also perform bias correction. This involves computing the exponentially weighted average and then dividing it by $1 - \beta^t$, where $t$ is the epoch number. In this case, the above algorithms may be adjusted by also dividing the Lipschitz constants by the same factor.
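A minimal sketch of this adjustment (the helper name is ours):

```python
def bias_corrected(k_avg, beta, t):
    """Divide the running Lipschitz average by 1 - beta**t, mirroring
    Adam-style bias correction; `t` is the epoch number."""
    return k_avg / (1.0 - beta ** t)

# After one epoch with beta = 0.5, the average is scaled by 1 / 0.5.
print(bias_corrected(0.05, 0.5, 1))  # -> 0.1
```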

9 Experiments

Dataset     Architecture   Algorithm   LR Policy   Weight Decay   Valid. Acc.
MNIST       Custom         SGD         Adaptive    None           99.50%
MNIST       Custom         Momentum    Adaptive    None           99.57%
MNIST       Custom         Adam        Adaptive    None           99.43%
CIFAR-10    ResNet20 v1    SGD         Baseline    -              60.33%
CIFAR-10    ResNet20 v1    SGD         Fixed       -              87.02%
CIFAR-10    ResNet20 v1    SGD         Adaptive    -              89.37%
CIFAR-10    ResNet20 v1    Momentum    Baseline    -              58.29%
CIFAR-10    ResNet20 v1    Momentum    Adaptive    -              84.71%
CIFAR-10    ResNet20 v1    Momentum    Adaptive    -              89.27%
CIFAR-10    ResNet20 v1    RMSprop     Baseline    -              84.92%
CIFAR-10    ResNet20 v1    RMSprop     Adaptive    -              86.66%
CIFAR-10    ResNet20 v1    Adam        Baseline    -              84.67%
CIFAR-10    ResNet20 v1    Adam        Fixed       -              70.57%
CIFAR-10    DenseNet       SGD         Baseline    -              84.84%
CIFAR-10    DenseNet       SGD         Adaptive    -              91.34%
CIFAR-10    DenseNet       Momentum    Baseline    -              85.50%
CIFAR-10    DenseNet       Momentum    Adaptive    -              92.36%
CIFAR-10    DenseNet       RMSprop     Baseline    -              91.36%
CIFAR-10    DenseNet       RMSprop     Adaptive    -              90.14%
CIFAR-10    DenseNet       Adam        Baseline    -              91.38%
CIFAR-10    DenseNet       Adam        Adaptive    -              88.23%
CIFAR-100   ResNet56 v2    SGD         Adaptive    -              54.29%
CIFAR-100   ResNet164 v2   SGD         Baseline    -              26.96%
CIFAR-100   ResNet164 v2   SGD         Adaptive    -              75.99%
CIFAR-100   ResNet164 v2   Momentum    Baseline    -              27.51%
CIFAR-100   ResNet164 v2   Momentum    Adaptive    -              75.39%
CIFAR-100   ResNet164 v2   RMSprop     Baseline    -              70.68%
CIFAR-100   ResNet164 v2   RMSprop     Adaptive    -              70.78%
CIFAR-100   ResNet164 v2   Adam        Baseline    -              71.96%
CIFAR-100   DenseNet       SGD         Baseline    -              50.53%
CIFAR-100   DenseNet       SGD         Adaptive    -              68.18%
CIFAR-100   DenseNet       Momentum    Baseline    -              52.28%
CIFAR-100   DenseNet       Momentum    Adaptive    -              69.18%
CIFAR-100   DenseNet       RMSprop     Baseline    -              65.41%
CIFAR-100   DenseNet       RMSprop     Adaptive    -              67.30%
CIFAR-100   DenseNet       Adam        Baseline    -              66.05%
CIFAR-100   DenseNet       Adam        Adaptive    -              40.14%*
Table 1: Summary of all experiments
(*) Obtained after 67 epochs. After that, the performance deteriorated, and after 170 epochs we stopped running the model. We also ran the model on the same architecture but restricting the number of filters to 12, which yielded 59.08% validation accuracy.

Below we show the results and details of our experiments on some publicly available datasets. While our results are not state-of-the-art, our focus was to show empirically that optimization algorithms can be run with higher learning rates than typically understood. On CIFAR, we only use the flipping and translation augmentation schemes of [10]. In all experiments, the raw image values were divided by 255 after removing the mean across each channel. We also provide baseline experiments performed with a fixed learning rate for a fair comparison, using the same data augmentation scheme.
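The preprocessing described above can be sketched as follows (a minimal NumPy version; the helper name is ours):

```python
import numpy as np

def preprocess(images):
    """Remove the per-channel mean, then divide by 255, as described above.
    `images` is an (N, H, W, C) array of raw pixel values."""
    x = images.astype(np.float64)
    x -= x.mean(axis=(0, 1, 2))  # per-channel mean removal
    x /= 255.0                   # scale the raw pixel range
    return x

# Toy batch: two 1x1 RGB "images".
batch = np.array([[[[0, 100, 200]]],
                  [[[100, 200, 0]]]], dtype=np.float64)
out = preprocess(batch)  # channel means become 0 before scaling
```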

A summary of our experiments is given in Table 1. DenseNet refers to a DenseNet [12] architecture.

9.1 MNIST

Layer            Filters   Padding
3x3 Conv         32        Valid
3x3 Conv         32        Valid
2x2 MaxPool      -         -
Dropout (0.2)    -         -
3x3 Conv         64        Same
3x3 Conv         64        Same
2x2 MaxPool      -         -
Dropout (0.25)   -         -
3x3 Conv         128       Same
Dropout (0.25)   -         -
Dense (128)      -         -
Dropout (0.25)   -         -
Dense (10)       -         -
Table 2: CNN used for MNIST

On MNIST, the architecture we used is shown in Table 2. All activations except the last layer are ReLU; the last layer uses softmax activations. The model has 730K parameters.

Our preprocessing involved random shifts (up to 10%), zoom (up to 10%), and rotations. We used a batch size of 256, and ran the model for 20 epochs. The experiment on MNIST used only an adaptive learning rate, where the Lipschitz constant, and therefore the learning rate, was recomputed every epoch. Note that this works even though the penultimate layer is a Dropout layer. No regularization was used during training. With these settings, we achieved a training accuracy of 98.57% and validation accuracy of 99.5%.

Figure 1: Adaptive learning rate over time on MNIST

Finally, Figure 1 shows the computed learning rate over epochs. Note that unlike the computed adaptive learning rates for CIFAR-10 (Figure 3) and CIFAR-100 (Figure 7), the learning rate for MNIST starts at a much higher value. While the learning rate here seems much more random, it must be noted that this was run for only 20 epochs, and hence any variation is exaggerated in comparison to the other models, run for 300 epochs.

The results of our Adam optimizer are also shown in Table 1. The optimizer achieved its peak validation accuracy after only 8 epochs.

We also used a custom implementation of SGD with momentum (see Appendix A for details), and computed an adaptive learning rate using our AdaMo algorithm. Surprisingly, this outperformed both our adaptive SGD and Auto-Adam algorithms. However, the algorithm consistently chose a large learning rate (around 32) for the first epoch before computing more reasonable values. Since this hindered performance, we modified our AdaMo algorithm so that on the first epoch, it sets the learning rate to 0.1. We discuss this issue further in Section 9.2.

9.2 CIFAR-10

For the CIFAR-10 experiments, we used a ResNet20 v1 [10]. A residual network is a deep neural network made of "residual blocks". A residual block is a special case of a highway network [28] that does not contain any gates in its skip connections. ResNet v2 also uses "bottleneck" blocks, which consist of a 1x1 layer for reducing dimension, a 3x3 layer, and a 1x1 layer for restoring dimension [11]. More details can be found in the original ResNet papers [10, 11].

We ran two sets of experiments on CIFAR-10 using SGD. First, we empirically computed $K_z$ by running one epoch and finding the activations of the penultimate layer, then ran our model for 300 epochs using the same fixed learning rate. We used a batch size of 128 and a small weight decay. Our computed values of the two constants and the learning rate were 206.695, 43.257, and 0.668 respectively. It should be noted that while computing the Lipschitz constant, $m$ in the denominator must be set to the batch size, not the total number of training examples. In our case, we set it to 128.

Figure 2: Plot of accuracy score and loss over epochs on CIFAR-10

Figure 2 shows the plots of accuracy score and loss over time. As noted in [25], a horizontal validation loss indicates little overfitting. We achieved a training accuracy of 97.61% and a validation accuracy of 87.02% with these settings.

Second, we used the same hyperparameters as above, but recomputed , , and the learning rate every epoch. We obtained a training accuracy of 99.47% and validation accuracy of 89.37%. Clearly, this method is superior to a fixed learning rate policy.

(a) Learning rate over epochs
(b) Learning rate from epoch 2
Figure 3: Adaptive learning rate over time on CIFAR-10

Figure 3 shows the learning rate over time. The adaptive scheme automatically chooses a decreasing learning rate, as suggested by the literature on the subject. On the first epoch, however, the model chooses a very small learning rate, owing to the random initialization.

Observe that while it does follow the conventional wisdom of choosing a higher learning rate initially to explore the weight space faster and then slowing down as it approaches the global minimum, it ends up choosing a significantly larger learning rate than traditionally used. Clearly, there is no need to decay learning rate by a multiplicative factor. Our model with adaptive learning rate outperforms our model with a fixed learning rate in only 65 epochs. Further, the generalization error is lower with the adaptive learning rate scheme using the same weight decay value. This seems to confirm the notion in [26] that large learning rates have a regularization effect.

(a) Learning rate over epochs
(b) Learning rate from epoch 2
Figure 4: Adaptive learning rate over time on CIFAR-10 using DenseNet

Figure 4 shows the learning rate over time on CIFAR-10 using a DenseNet architecture and SGD. Evidently, the algorithm automatically adjusts the learning rate as needed.

Figure 5: Learning rate over epochs on CIFAR-10 using Adam and DenseNet

Interestingly, in all our experiments, ResNets consistently performed poorly when run with our auto-Adam algorithm. Despite trying both fixed and adaptive learning rates and several weight decay values, we could not optimize ResNets using auto-Adam. DenseNets and our custom architecture on MNIST, however, had no such issues. Our best results with auto-Adam on ResNet20 and CIFAR-10 were obtained when we continued using the learning rate of the first epoch (around 0.05) for all 300 epochs.

Figure 5 shows a possible explanation. Note that over time, our auto-Adam algorithm causes the learning rate to slowly increase. We postulate that this may be the reason for ResNet’s poor performance using our auto-Adam algorithm. However, using SGD, we are able to achieve competitive results for all architectures. We discuss this issue further in Section 10.

ResNets did work well with our AdaMo algorithm, though, performing nearly as well as with SGD. As with MNIST, we had to set the initial learning rate to a fixed value with AdaMo. We find that a reasonable choice lies between 0.1 and 1 (both inclusive). We also find that for higher values of weight decay, lower initial values perform better, but we do not perform a more thorough investigation in this paper. In our experiments, we choose the initial value by simply trying 0.1, 0.5, and 1.0, running the model for five epochs, and choosing the one that performs best. The two experiments in Table 1 using ResNet20 and momentum used different initial values chosen this way.

Figure 6: LR over epochs with DenseNet on CIFAR-10 with momentum

AdaMo also worked well with DenseNets on CIFAR-10; we chose the initial learning rate in the same manner for this model. This model crossed 90% validation accuracy before 100 epochs, maintaining a learning rate higher than 1, and was the best among all our models trained on CIFAR-10. This shows the strength of our algorithm. Figure 6 shows the learning rate over epochs for this model.

9.3 CIFAR-100

For the CIFAR-100 experiments, we used a ResNet164 v2 [11]. Our experiments on CIFAR-100 only used an adaptive learning rate scheme.

We largely used the same parameters as before. Data augmentation involved only flipping and translation. We ran our model for 300 epochs with a batch size of 128, using the same weight decay as in [11]. We achieved a training accuracy of 99.68% and validation accuracy of 75.99% with these settings.

For the ResNet164 model trained using AdaMo, we found one of the three candidate initial learning rates to be clearly the best. Note that AdaMo performs competitively compared to SGD. For DenseNet, we chose the initial value in the same manner.

(a) Learning rate over epochs
(b) Learning rate from epoch 3
Figure 7: Adaptive learning rate over time on CIFAR-100

Figure 7 shows the learning rate over epochs. As with CIFAR-10, the first two epochs start off with a very small learning rate, but the model quickly adjusts to the changing weights.

9.4 Baseline Experiments

For our baseline experiments, we used the same weight decay value as our other experiments; the only difference was that we used a fixed value of the default learning rate for that experiment. For SGD and SGD with momentum, this meant a learning rate of 0.01. For Adam and RMSprop, the learning rate was 0.001. In SGD with momentum and RMSprop, $\beta = 0.9$ was used. For Adam, $\beta_1 = 0.9$ and $\beta_2 = 0.999$ were used.

10 Practical Considerations

Although our approach is theoretically sound, there are a few practical issues that need to be considered. In this section, we discuss these issues, and possible remedies.

The first issue is that our approach takes longer per epoch than choosing a standard learning rate. Our code was based on the Keras deep learning library, which, to the best of our knowledge, does not include a mechanism to get the outputs of intermediate layers directly. Other libraries like PyTorch, however, do provide this functionality through "hooks". This eliminates the need to perform a partial forward propagation simply to obtain the penultimate layer activations, and saves computation time. We find that computing the Lipschitz constant itself takes very little time, so it is not important to optimize its computation.

Another practical issue is random initialization. Because the weights are initialized randomly, it is difficult to compute the correct learning rate for the first epoch: there is no data from a previous epoch to use. We discussed the effects of this with respect to our AdaMo algorithm, and we believe it is the reason for the poor performance of auto-Adam in all our experiments. Fortunately, if this is the case, it can be spotted within the first two epochs: if large values of the intermediate computations ($K$, $K_z$, etc.) are observed, then it may be necessary to set the initial learning rate to a suitable value, as we discussed for the AdaMo algorithm. In practice, we find that this rarely occurs for RMSprop; when it does, the large intermediate values appear in the very first epoch. We find that a small fixed value works well as the initial learning rate. In our experiments, we only had to do this for ResNet on CIFAR-100.
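One possible safeguard along these lines, sketched below (the threshold and fallback values are illustrative assumptions, not from our experiments):

```python
def safe_initial_lr(k_estimate, fallback=0.1, threshold=100.0):
    """Fall back to a fixed small learning rate when the first-epoch
    Lipschitz estimate is implausibly large (random initialization).
    The threshold and fallback here are illustrative, not tuned values."""
    if k_estimate <= 0 or k_estimate > threshold:
        return fallback
    return 1.0 / k_estimate

print(safe_initial_lr(500.0))  # -> 0.1 (estimate too large; use fallback)
print(safe_initial_lr(2.0))    # -> 0.5 (normal case: inverse of the estimate)
```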

11 Discussion and Conclusion

In this paper, we derived a theoretical framework for computing an adaptive learning rate; on deriving the formulas for various common loss functions, it was revealed that this is also “adaptive” with respect to the data. We explored the effectiveness of this approach on several public datasets, with commonly used architectures and various types of layers.

Our approach works "out of the box" with various regularization methods, including $L_2$ regularization, dropout, and batch normalization; it does not interfere with regularization, and automatically chooses an optimal learning rate in stochastic gradient descent. Moreover, we contend that our computed larger learning rates do indeed, as pointed out in [26], have a regularizing effect; for this reason, our experiments used small values of weight decay. Indeed, increasing the weight decay significantly hampered performance. This shows that "large" learning rates may not be as harmful as once thought; rather, a large value may be used if carefully computed, along with a guarded value of weight decay. We also demonstrated the efficacy of our approach with other optimization algorithms, namely SGD with momentum, RMSprop, and Adam.

Our auto-Adam algorithm performs surprisingly poorly. We postulate that, like AdaMo, it will perform better when initialized more thoughtfully. To test this hypothesis, we re-ran the experiment with ResNet20 on CIFAR-10 using the same weight decay. We fixed one of the two averaging parameters to 1 and searched over candidate values for the other in the same manner as for AdaMo. We found that the lower this value, the better our results. While at this stage we can only conjecture that this combination will work in all cases, we leave a more thorough investigation as future work. Using this configuration, we achieved 83.64% validation accuracy.

A second avenue of future work involves obtaining a tighter bound on the Lipschitz constant and thus computing a more accurate learning rate. Another possible direction is to investigate possible relationships between the weight decay and the initial learning rate in the AdaMo algorithm.


Acknowledgments

The authors would like to thank the Science and Engineering Research Board (SERB)-Department of Science and Technology (DST), Government of India for supporting this research. The project reference number is SERB-EMR/2016/005687.


Appendix A Implementation Details

All our code was written using the Keras deep learning library. The architecture we used for MNIST was taken from a Kaggle Python notebook by Aditya Soni (https://www.kaggle.com/adityaecdrid/mnist-with-keras-for-beginners-99457). For ResNets, we used the code from the Examples section of the Keras documentation (https://keras.io/examples/cifar10_resnet/). The DenseNet implementation we used was from a GitHub repository by Somshubra Majumdar (https://github.com/titu1994/DenseNet). Finally, our implementation of SGD with momentum is a modified version of the Adam implementation in Keras (https://github.com/keras-team/keras/blob/master/keras/optimizers.py#L436).