1 Introduction
Adaptive gradient methods such as Adam (Kingma & Ba (2014)), Adagrad, Adadelta (Zeiler (2012)), RMSProp (Hinton et al. (2012)), Nadam (Dozat (2016)), and AdamW (Loshchilov & Hutter (2017)) were proposed as alternatives to SGD with momentum for optimizing stochastic objectives in high dimensions. Amsgrad was recently proposed as an improvement to Adam that fixes convergence issues in the latter. These methods offer benefits such as faster convergence and insensitivity to hyperparameter selection, i.e., they are demonstrated to work well with little tuning. On the downside, these adaptive methods have shown poorer empirical performance and weaker generalization compared to SGD with momentum. The authors of Padam attribute this phenomenon to the "over-adaptiveness" of the adaptive methods. The key contributions of Padam, as stated by the authors, are:

The authors put forward that Padam unifies Adam/Amsgrad and SGD with momentum via a partially adaptive parameter, and that Adam/Amsgrad can be seen as a special, fully adaptive instance of Padam. They further claim that Padam resolves the "small learning rate dilemma" for adaptive gradient methods and allows for faster convergence, hence closing the generalization gap.

The authors claim that Padam generalizes as well as SGD with momentum while achieving the fastest convergence.
We address and comment on each of the above claims from an empirical point of view. We run additional experiments to study the effect of the learning rate (and its schedule) on the optimal value of the partially adaptive parameter p. Based on our analysis, we propose varying p on a suitable schedule as training proceeds, in order to actually get the best of both worlds.
2 Background
Padam is inspired by two recent adaptive techniques, Adam and Amsgrad, which we discuss briefly here. Adam uses bias-corrected first- and second-order moment estimates of the gradients for the weight update:

$$\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t, \quad \text{where } m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\ \ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2,\ \ \hat{m}_t = \frac{m_t}{1-\beta_1^t},\ \ \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(Adam)}$$
A convergence issue in Adam was recently uncovered and addressed by Amsgrad, which tweaks the update rule slightly to fix it. Padam builds on this updated algorithm:

$$\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{\hat{v}_t} + \epsilon}\, m_t, \quad \text{where } \hat{v}_t = \max(\hat{v}_{t-1}, v_t) \quad \text{(Amsgrad)}$$
2.1 Padam Algorithm
Padam introduces a new partially adaptive parameter p that takes values in the range [0, 0.5], replacing the square root in the denominator with $\hat{v}_t^{\,p}$. At the extremes of this range it takes the form of SGD with momentum or AMSGrad: from Algorithm 1, setting p to 0.0 reduces Padam to SGD with momentum, whereas setting it to 0.5 leaves us with the AMSGrad optimizer.
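As a concrete illustration, a Padam-style update step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' TensorFlow implementation; the default hyperparameter values are assumptions chosen to match common settings.

```python
import numpy as np

def padam_step(theta, grad, m, v, v_hat,
               lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One Padam-style update on parameters `theta` (illustrative sketch).

    p = 0.5 gives an AMSGrad-like step, while p = 0.0 reduces to
    SGD with momentum (up to the small eps term).
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    v_hat = np.maximum(v_hat, v)                 # AMSGrad max correction
    theta = theta - lr * m / (v_hat ** p + eps)  # partially adaptive step
    return theta, m, v, v_hat
```

With p = 0 the denominator collapses to 1 (plus eps), leaving a pure momentum step; with p = 0.5 the step is fully adaptive.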
3 Experiments
In this section we describe our experimental settings for evaluating Padam. We have tried to keep our implementation faithful to the authors' code. We build the three CNN architectures proposed in the paper and compare Padam's performance against the other baseline algorithms. We built the Amsgrad and Padam optimizers on top of the base code of Adam in TensorFlow (https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/training/adam.py).
3.1 Environmental Setup
We built the experiments using TensorFlow version 1.13.0 with the graph-free Eager Execution mode under Python 3.5.2. We ran the experiments on four Tesla Xp GPUs (12 GB RAM per GPU).
3.2 Datasets
The experiments were conducted on two popular image-classification datasets: CIFAR-10 and CIFAR-100 (Krizhevsky (2009)). The performance of the various optimizers on these datasets was evaluated with three different CNN architectures: VGGNet (Simonyan & Zisserman (2014)), ResNet (He et al. (2016)), and Wide ResNet (Zagoruyko & Komodakis (2016)). We run the CIFAR-10 and CIFAR-100 tasks for 200 epochs, with a learning rate decay at every 50th epoch, i.e. at epochs 50, 100, and 150. We were unable to perform the experiments on the ImageNet dataset because of time constraints and the limited availability of computing resources.
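The step decay used for the CIFAR runs can be written as a one-line schedule. A minimal sketch; the decay factor of 0.1 is an assumption carried over from the factor the authors use in their p grid search.

```python
def step_decay_lr(epoch, base_lr=0.1, decay=0.1, step=50):
    """Learning rate after step decay: drops by `decay` every `step`
    epochs, i.e. at epochs 50, 100 and 150 over a 200-epoch run."""
    return base_lr * decay ** (epoch // step)
```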
3.3 Baseline Algorithms
We compare Padam against the most popular adaptive gradient optimizers and SGD with momentum. Note that the evaluation against AdamW was added by the authors at a later stage, and the details about it were not completely disclosed in the updated version of the paper or code; owing to this delay, we have not been able to carry out experiments with AdamW.
| Optimizer | Padam | SGD + Momentum | Adam | Amsgrad |
| --- | --- | --- | --- | --- |
| Initial learning rate | 0.1 | 0.1 | 0.001 | 0.001 |
| Beta1 | 0.9 | - | 0.9 | 0.9 |
| Beta2 | 0.999 | - | 0.99 | 0.99 |
| Weight decay | 0.0005 | 0.0005 | 0.0001 | 0.0001 |
| Momentum | - | 0.9 | - | - |

Note: we have used p = 0.125 as the Padam hyperparameter for the experiments unless specifically mentioned.
Figure 1: Architectures used in the experiments. Note that for brevity we do not explicitly show the batch normalization layer after each convolution operation.
3.4 Architectures
We have built the architectures faithful to the code released by the authors; they are shown in Figure 1.
3.4.1 VGGNet
The VGG-16 network uses only 3×3 convolutional layers stacked on top of each other in increasing depth, and adopts max pooling to reduce the volume size. Finally, two fully-connected layers are followed by a softmax classifier.
3.4.2 ResNet
Residual Neural Networks (ResNet) (He et al. (2016)) introduce a novel architecture with "skip connections" and feature heavy use of batch normalization. Following the authors, we use ResNet-18 for this experiment, which contains 4 blocks, each comprising 2 basic building blocks.
3.4.3 Wide ResNet
Wide Residual Networks (Zagoruyko & Komodakis (2016)) further exploit the "skip connections" used in ResNet while also increasing the width of the residual blocks. In detail, we use the 16-layer Wide ResNet with a width multiplier of 4 (WRN-16-4) in the experiments.
4 Evaluation and Results
In this section we comment on the results we obtained with this reproducibility effort. We divide this section into four parts.
4.1 Train Experiments
Padam is compared with the other proposed baselines. Figure 2 shows the top-5 test error, and Figure 3 shows the train loss and test error for the three architectures on CIFAR-10. We find that Padam performs comparably with SGD with momentum on all three architectures in test error, and maintains a rate of convergence between Adam/AMSGrad and SGD. Padam works as proposed by the authors to bridge the gap between adaptive methods and SGD, at the cost of introducing a new hyperparameter p that requires tuning. However, we do not see a clear motivation behind the grid-search approach used by the authors to select the value of this partially adaptive parameter p.
The results on CIFAR-100 can be found in the appendix.
| Methods | 50th epoch | 100th epoch | 150th epoch | 200th epoch |
| --- | --- | --- | --- | --- |
| SGD Momentum | 68.71 | 87.88 | 92.94 | 92.95 |
| Adam | 84.62 | 91.54 | 92.34 | 92.39 |
| Amsgrad | 87.89 | 91.75 | 92.26 | 92.19 |
| Padam | 67.92 | 90.86 | 93.08 | 93.06 |
4.2 pvalue Experiments
In order to find an optimal working value of p, the authors perform a grid search over three values: {0.25, 0.125, 0.0625}. They do so while keeping the base learning rate fixed at 0.1 and decaying it by a factor of 0.1 every 30 epochs. We perform the same experiment on CIFAR-10 and CIFAR-100; the results are plotted in Figure 4. We observe results similar to the authors' and, of the three proposed values, find p = 0.125 to work best. Nevertheless, we would like to stress that this value of p may still be suboptimal and may turn out to be sensitive to the base learning rate. To analyze this, we perform sensitivity experiments of p against various learning rates, and it turns out that p is indeed sensitive to it.
4.3 Sensitivity Experiments
To evaluate the possibility that the optimal value of the partially adaptive parameter p is entangled with the learning rate, we run sensitivity experiments. We fix p to each of the three values {0.25, 0.125, 0.0625}, and for each fixed value of p we vary the base learning rate over {0.1, 0.01, 0.001}. We run each configuration for 30 epochs on CIFAR-10 and CIFAR-100 with ResNet. We expect this to uncover the dependence of p on the base learning rate.
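The sweep described above amounts to the cross product of the candidate p values and base learning rates. A small sketch enumerating the nine configurations (the training loop itself is omitted):

```python
from itertools import product

# The three candidate p values and three base learning rates from the text.
p_values = [0.25, 0.125, 0.0625]
base_lrs = [0.1, 0.01, 0.001]

# Nine (p, lr) configurations, each run for 30 epochs in our experiments.
configs = list(product(p_values, base_lrs))
```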
The results for CIFAR-10 are plotted in Figure 5. From Figure 5(b) we observe that with p = 0.25, a base learning rate of 0.01 or 0.001 appears to be a more appropriate choice than 0.1, owing to its better test-error performance. As we decrease the value of p, higher base learning rates start performing better, as is evident from Figures 5(d) and 5(f). This observation supports the argument that p is indeed sensitive to the base learning rate.
The results of the sensitivity experiments on CIFAR-100 are deferred to the appendix.
4.4 Proposed Further Study
From the sensitivity experiments we can infer that with higher values of p, Padam behaves more adaptive-like (it performs better with lower learning rates), while with smaller values of p, Padam demonstrates behavior closer to SGD (it performs better with higher learning rates).
The primary objective behind Padam is to achieve two things: good convergence (initially) and better generalization (finally). To do so, we would like Padam to behave adaptive-like initially and SGD-like finally. This way, Padam would be able to exploit both worlds to their fullest within the training lifecycle.
To this end, we propose initializing Padam with a high p and a low base learning rate, and then decaying p over the course of training. Correspondingly, the learning rate can be mildly decreased in the middle or towards the end of the training cycle in order to create the conditions for the SGD-like Padam to converge.
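This proposal can be sketched as a simple schedule. All the constants here (`p_start`, `p_end`, the decay point and factor) are hypothetical values for illustration, not tuned settings from our experiments:

```python
def padam_schedule(epoch, total_epochs=200,
                   p_start=0.25, p_end=0.0625,
                   base_lr=0.001, lr_decay=0.5, lr_decay_epoch=150):
    """Decay p linearly from an adaptive-like value towards an SGD-like
    value, and mildly decrease the (initially low) learning rate late in
    training so the SGD-like phase can converge. Hypothetical sketch."""
    frac = min(epoch / total_epochs, 1.0)
    p = p_start + frac * (p_end - p_start)
    lr = base_lr * lr_decay if epoch >= lr_decay_epoch else base_lr
    return p, lr
```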
Recently, AdamW has demonstrated better generalization by decoupling the weight-decay mechanism from the update rule; this method could further complement Padam's results. We have not been able to finish running these proposed experiments due to time and resource constraints.
5 Discrepancies, Suggestions and Conclusion
The authors argue that adaptive gradient methods, when used with a larger base learning rate, give rise to the gradient explosion problem because of the presence of the second-order moment term in the denominator. This proposition implicitly assumes the second-order moment term $\hat{v}_t$ to be between 0 and 1, which might not always be the case; hence the factor $1/\hat{v}_t^{\,p}$ may cause the effective learning rate to either increase or decrease.
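This point is easy to check numerically: depending on whether the second-order moment term is below or above 1, the partially adaptive factor inflates or shrinks the effective step (a small sketch; the `v_hat` values below are arbitrary illustrations):

```python
def effective_lr(lr, v_hat, p=0.125):
    """Effective step size after dividing by v_hat**p. The factor
    amplifies the step when v_hat < 1 and dampens it when v_hat > 1."""
    return lr / (v_hat ** p)
```

For example, `effective_lr(0.1, 1e-4)` is roughly 0.32, larger than the base rate, while `effective_lr(0.1, 1e2)` is roughly 0.056.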
Overall, we conclude from our empirical evaluation that Padam is capable of mixing the benefits of adaptive gradient methods with those of SGD with momentum. Studying the newly introduced partially adaptive parameter p further seems a good direction for continuing this work.
References

Timothy Dozat. Incorporating Nesterov momentum into Adam, 2016.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, 2012.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.

Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.

Matthew D. Zeiler. Adadelta: An adaptive learning rate method, 2012.
Appendix A Experiments on CIFAR-100

| Methods | 50th epoch | 100th epoch | 150th epoch | 200th epoch |
| --- | --- | --- | --- | --- |
| SGD Momentum | 37.29 | 61.18 | 71.55 | 71.54 |
| Adam | 55.44 | 65.67 | 66.70 | 66.65 |
| Amsgrad | 58.85 | 68.21 | 69.94 | 69.95 |
| Padam | 42.05 | 66.92 | 72.04 | 72.08 |