1 Introduction
In theory, deep neural networks (DNNs) are hard to train [2]. Apart from sundry architecture configurations – such as network depth, layer width, and the type of activation functions – there are key algorithmic hyperparameters that must be properly tuned in order to obtain a model that generalizes well within a reasonable amount of time.
Among these hyperparameters, the one of pivotal importance is the step size [3, 4]. To set the background, note that most algorithms in practice are gradient-descent based: given the current model estimate w_t and some training examples, we iteratively compute the gradient of the objective f, and update the model by advancing along the negative direction of the gradient, weighted by the step size η; i.e.,

w_{t+1} = w_t − η ∇f(w_t).

This is the crux of all gradient-descent based algorithms, including the ubiquitous stochastic gradient descent (SGD) algorithm. Here, the step size η could be set as a constant, or could change per iteration [5], usually based on a predefined learning rate schedule [6, 7, 8].
Beyond practical strategies and tricks that lead to faster convergence and better generalization for simple gradient-based algorithms [9, 10, 5], during the past decade we have witnessed a family of algorithms that argue for automatic hyperparameter adaptation [11] during training (including the step size). The list includes AdaGrad [12], Adam [13], AdaDelta [14], RMSProp [15], AdaMax [13], and Nadam [16], just to name a few. These algorithms utilize current and past gradient information ∇f(w_i), for i ≤ t, to design preconditioning matrices H_t that better pinpoint the local curvature of the objective function. A simpler description of the above algorithms (in this work, our theory focuses on adaptive but non-momentum-based methods, such as AdaGrad) is as follows:

w_{t+1} = w_t − η H_t^{−1} ∇f(w_t).

The main argument is that adaptivity eliminates presetting a learning rate schedule, or diminishes the effect of initially bad step size choices, thus detaching the time-consuming part of step size tuning from the practitioner [17].
Recently though, it has been argued that simple gradient-based algorithms may perform better than adaptive ones, creating skepticism regarding the efficiency of the latter. More specifically, for the linear regression setting, [1] shows that, under specific assumptions, adaptive methods converge to a different solution than the minimum norm one. The latter has received attention due to its efficiency as the maximum margin solution in classification [18]. This behavior is also demonstrated using DNNs, where simple gradient descent generalizes at least as well as the adaptive methods, and adaptive techniques require at least the same amount of tuning as simple gradient descent methods.
In this paper, our aim is to further study this conjecture. The paper is separated into a theoretical part (Sections 2–4) and a practical part (Subsection 4.3 and Section 5). For our theory, we focus on simple linear regression (Section 2), and discuss the differences between under- and overparameterization. Section 3 focuses on simple gradient descent, and establishes closed-form solutions for both settings. We study the AdaGrad algorithm in the same setting in Section 4, and discuss under which conditions gradient descent and AdaGrad perform similarly, and when their behaviors diverge.
Our findings can be summarized as follows:


For underparameterized linear regression, closed-form solutions indicate that simple gradient descent and AdaGrad converge to the same solution; thus, adaptive methods have the same generalization capabilities, which is somewhat expected in underparameterized convex linear regression.

In overparameterized linear regression, simple gradient methods converge to the minimum norm solution. In contrast, AdaGrad converges to a different point: while the computed model predicts correctly within the training data set, it has a different generalization behavior on unseen data than the minimum norm solution. [1] shows that AdaGrad generalizes worse than gradient descent; in this work, we empirically show that an AdaGrad variant generalizes better than the minimum norm solution on a different counterexample. We conjecture that the superiority of simple or adaptive methods depends on the problem/data at hand, and that the discussion of “who is provably better” is inconclusive.

We conduct neural network experiments using different datasets and network architectures. Overall, we observe similar behavior using either simple or adaptive methods. Our findings support the conclusions of [1] that adaptive methods still require fine parameter tuning. Generalization-wise, we observe that simple algorithms are not universally superior to adaptive ones.
2 Background on linear regression
Consider the linear regression setting:

min_w f(w) := ½ ‖Xw − y‖₂²,

where X ∈ ℝ^{n×d} is the feature matrix and y ∈ ℝ^n are the observations. There are two different settings, depending on the number of samples n and the dimension d:


Overparameterized case: here, we have more parameters than samples, i.e., n < d. In this case, assuming that X is in general position, XX^T is full rank.

Underparameterized case: here, the number of samples is larger than the number of parameters, i.e., n > d. In this case, usually X^T X is full rank.
The most studied case is that of n > d: the problem has the solution w★ = (X^T X)^{−1} X^T y, under a full-rankness assumption on X. In the case where the problem is overparameterized (n < d), there is a solution of similar form that has received significant attention, despite the infinite cardinality of optimal solutions. This is the so-called minimum norm solution. The optimization instance to obtain the minimum norm solution is:

min_w ‖w‖₂²  subject to  Xw = y.
Let us denote its solution as w_mn. It can be found through the method of Lagrange multipliers. Define the Lagrange function: L(w, λ) = ‖w‖₂² + λ^T (Xw − y). The optimality conditions w.r.t. the variables w and λ are:

∇_w L = 2w + X^T λ = 0  and  ∇_λ L = Xw − y = 0.

Substituting w = −½ X^T λ back into the second condition gives λ = −2 (XX^T)^{−1} y, and hence the minimum norm solution: w_mn = X^T (XX^T)^{−1} y. Any other solution satisfying Xw = y has equal or larger Euclidean norm than w_mn.
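The derivation can be sanity-checked numerically. The sketch below (our own instance, using the notation X ∈ ℝ^{n×d}, y ∈ ℝ^n from above) compares the Lagrangian closed form against the Moore–Penrose pseudoinverse, and verifies the minimality of the norm:

```python
import numpy as np

# Minimum norm interpolant in the overparameterized case n < d.
rng = np.random.default_rng(1)
n, d = 20, 60
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_mn = X.T @ np.linalg.solve(X @ X.T, y)   # closed form from the multipliers
w_pinv = np.linalg.pinv(X) @ y             # Moore-Penrose pseudoinverse solution

assert np.allclose(X @ w_mn, y)            # it interpolates the training data
assert np.allclose(w_mn, w_pinv)           # and matches the pseudoinverse

# Any other interpolating solution differs by a null-space component and
# therefore has at least this Euclidean norm:
w_other = w_mn + (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.standard_normal(d)
assert np.allclose(X @ w_other, y)
assert np.linalg.norm(w_other) >= np.linalg.norm(w_mn)
```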
Observe that the two solutions, w★ and w_mn, differ between the two cases: in the underparameterized case, the matrix X^T X is well-defined (full rank) and has an inverse, while in the overparameterized case it is XX^T that is full rank. Importantly, there are differences in how we obtain these solutions in an iterative fashion. We next show how both simple and adaptive gradient descent algorithms find w★ for well-determined systems. This does not hold for the overparameterized case: there are infinitely many solutions, and the question of which one the algorithms select is central in the recent literature [1, 19, 20].
3 Closedform expressions for gradient descent in linear regression
Studying iterative routines on simple tasks provides intuition on how they might perform on more complex problems, such as neural networks. Next, we separate our analysis into the under- and overparameterized settings for linear regression.
3.1 Underparameterized linear regression.
Here, n > d and X is assumed to be full rank. Simple gradient descent with step size η satisfies: w_{t+1} = w_t − η X^T (X w_t − y). Unfolding for t iterations (with w_0 = 0), we get:

w_t = η ( Σ_{i=0}^{t−1} (I − η X^T X)^i ) X^T y.

See also Section 7.1. The expression in the parentheses satisfies:

Σ_{i=0}^{t−1} (I − η X^T X)^i = (η X^T X)^{−1} ( I − (I − η X^T X)^t ).

Therefore, we get the closed-form solution: w_t = (X^T X)^{−1} ( I − (I − η X^T X)^t ) X^T y. In order to prove that gradient descent converges to the solution w★ in the limit, we need to prove that:

(I − η X^T X)^t → 0, as t → ∞.
This is equivalent to showing that ‖(I − η X^T X)^t‖₂ → 0. From optimization theory [21], we need η < 2/λ_max(X^T X) for convergence, where λ(·) denotes the eigenvalues of its argument. Then, I − η X^T X has spectral norm smaller than 1, i.e., ‖I − η X^T X‖₂ < 1. Combining the above, we make use of the following theorem.

Theorem 1
Using the above theorem, M := I − η X^T X has ‖M‖₂ < 1. Further, ‖M^t‖₂ ≤ ‖M‖₂^t, which converges to zero for increasing t. As t goes to infinity, this concludes the proof, and leads to the left inverse solution: w_t → (X^T X)^{−1} X^T y, as t → ∞. This is identical to the closed-form solution of well-determined linear regression.
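This limit is easy to check numerically; the following sketch (an illustrative instance, not taken from the paper) runs the recursion against the left-inverse closed form on a noisy, inconsistent system:

```python
import numpy as np

# Underparameterized case n > d: gradient descent approaches (X^T X)^{-1} X^T y.
rng = np.random.default_rng(2)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)               # inconsistent system; residual stays nonzero

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # left-inverse (least squares) solution

w = np.zeros(d)
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # eta < 2 / lambda_max(X^T X)
for _ in range(20000):
    w -= eta * X.T @ (X @ w - y)

assert np.allclose(w, w_star, atol=1e-6)
```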
3.2 Overparameterized linear regression.
For completeness, we briefly provide the analysis for the overparameterized setting, where n < d and X is assumed to be full rank. By inspection, unfolding the gradient descent recursion (with w_0 = 0) gives:

w_t = η X^T ( Σ_{i=0}^{t−1} (I − η X X^T)^i ) y.

Similarly, the summation can be simplified to:

Σ_{i=0}^{t−1} (I − η X X^T)^i = (η X X^T)^{−1} ( I − (I − η X X^T)^t ),

and, therefore:

w_t = X^T (X X^T)^{−1} ( I − (I − η X X^T)^t ) y.

Under a similar assumption on the spectral norm of I − η X X^T and using Theorem 1, we obtain the right inverse solution: w_t → X^T (X X^T)^{−1} y, as t → ∞. The bottom line is that, in both cases, gradient descent converges to the left and right inverse solutions, respectively, both related to the Moore–Penrose pseudoinverse.
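The overparameterized limit can likewise be verified numerically. In the sketch below (an illustrative instance of our own), gradient descent started at w_0 = 0 recovers the right-inverse, i.e., minimum norm, solution; a nonzero w_0 would shift the limit by the null-space component of w_0:

```python
import numpy as np

# Overparameterized case n < d with w_0 = 0: the limit is X^T (X X^T)^{-1} y.
rng = np.random.default_rng(3)
n, d = 15, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_mn = X.T @ np.linalg.solve(X @ X.T, y)     # right-inverse / minimum norm solution

w = np.zeros(d)
eta = 1.0 / np.linalg.eigvalsh(X @ X.T).max()
for _ in range(20000):
    w -= eta * X.T @ (X @ w - y)             # every update stays in the row space of X

assert np.allclose(w, w_mn, atol=1e-6)
```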
4 Closedform expressions for adaptive gradient descent in linear regression
Let us now focus on adaptive methods. For simplicity, we study only non-accelerated adaptive gradient descent methods, like AdaGrad, following the analysis in [1]; momentum-based schemes are left for future work. While there exists considerable work analyzing the stochastic variants of adaptive methods [12, 13, 24, 25], we concentrate on non-stochastic variants, for simplicity and ease of comparison with gradient descent. In summary, we study the recursion: w_{t+1} = w_t − η H_t^{−1} X^T (X w_t − y). E.g., in the case of AdaGrad, we have:

H_t = diag( Σ_{i≤t} g_i ∘ g_i )^{1/2},  where g_i denotes the gradient at iteration i.

The main ideas apply for any positive definite preconditioner. The case where H_t = D, for a constant matrix D, is deferred to the appendix (Section 7.2). Here, we focus on the case where H_t varies per iteration.
4.1 Underparameterized linear regression.
When H_t varies (Section 7.3), we end up with the following proposition (folklore); the proof is in Section 7.4:
Proposition 1
Consider the underparameterized case. Assume the recursion w_{t+1} = w_t − η H_t^{−1} X^T (X w_t − y), for positive definite matrices H_t. Then, after t iterations, w_t satisfies:

w_t = w★ + ∏_{i=0}^{t−1} (I − η H_i^{−1} X^T X) (w_0 − w★),  where w★ = (X^T X)^{−1} X^T y.

Using Theorem 1, we can again infer that, for sufficiently large t and for sufficiently small η such that ‖I − η H_i^{−1} X^T X‖₂ < 1 for all i, the product term vanishes. Thus, for sufficiently large t and assuming w_0 = 0: w_t → (X^T X)^{−1} X^T y, which is the same as the plain gradient descent approach. Thus, in this case, under proper assumptions (which might seem stricter than for plain gradient descent), adaptive methods have the same generalization capabilities as gradient descent.
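This claim can be checked numerically. The sketch below uses a diagonal AdaGrad-style preconditioner on an illustrative instance of our own (the step size and iteration budget are arbitrary choices, not values from the paper):

```python
import numpy as np

# Underparameterized case: diagonal AdaGrad reaches the same unique minimizer
# (X^T X)^{-1} X^T y as plain gradient descent.
rng = np.random.default_rng(4)
n, d = 30, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

w, G = np.zeros(d), np.zeros(d)
eta, eps = 0.1, 1e-12
for _ in range(20000):
    g = X.T @ (X @ w - y)              # full-batch gradient
    G += g * g                         # running sum of squared gradients
    w -= eta * g / (np.sqrt(G) + eps)  # H_t = diag(sum_i g_i o g_i)^{1/2}

assert np.allclose(w, w_star, atol=1e-4)
```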
4.2 Overparameterized linear regression.
Let us focus on the case where n < d. Finding a closed form for w_t, as in the underparameterized case, seems much trickier to achieve, despite our best efforts. Here, we follow a different path than in the previous sections.
What is the predictive power of adaptive methods within the training set?
For the first question, we look for a way to express the predictions within the training dataset, i.e., X w_t, where w_t is found by the above recursion after t updates.
Proposition 2
Consider the overparameterized case. Assume the recursion w_{t+1} = w_t − η H_t^{−1} X^T (X w_t − y), for positive definite matrices H_t. Then, after t iterations, the prediction X w_t satisfies:

X w_t = y + ∏_{i=0}^{t−1} (I − η X H_i^{−1} X^T) (X w_0 − y).
The proof can be found in Section 7.5. Using Theorem 1, we observe that, for sufficiently large t and for a sufficiently small step size η, the product term vanishes. Thus, X w_t → y. This further implies that the training error goes to zero as t increases; i.e., adaptive methods fit the training data, and make the correct predictions within the training dataset.
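The two facts together, interpolation of the training set but a limit different from the minimum norm solution, can be observed in a small sketch (our own illustrative instance, using a diagonal AdaGrad recursion):

```python
import numpy as np

# Overparameterized case: diagonal AdaGrad drives X w_t -> y, but its limit
# generally differs from the minimum norm interpolant.
rng = np.random.default_rng(5)
n, d = 10, 30
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w, G = np.zeros(d), np.zeros(d)
eta, eps = 0.1, 1e-12
for _ in range(20000):
    g = X.T @ (X @ w - y)
    G += g * g
    w -= eta * g / (np.sqrt(G) + eps)

w_mn = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ w, y, atol=1e-4)      # fits the training data ...
assert np.linalg.norm(w - w_mn) > 1e-3       # ... but is not the min norm point
```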
What is the predictive power of adaptive methods on unseen data?
We start with the counterexample in [1], where adaptive methods – with H_t taking the form of (4) – fail to find a solution that generalizes, in contrast to gradient descent methods (under assumptions).
Let us briefly describe their setting: we take the dimension d to be of the order of the number of samples n, with d > n; empirically, the counterexample holds for various values of d, as long as d > n. For the responses, we consider two classes, y_i ∈ {+1, −1}. For i = 1, …, n, we sample y_i = +1 with probability p, and y_i = −1 with probability 1 − p, for some p > 1/2. Given y_i, for each i, we design the i-th row of X as in (1).
Given this structure for X, only the first feature is indicative of the predicted class: i.e., a model that always predicts correctly is w = (1, 0, …, 0); however, we note that this is not the only model that might lead to correct predictions. The 2nd and 3rd features are the same for every sample, and the remaining features are unique for each sample (the positions of the nonzeros in the i-th row are unique to each i).
Given this generative model and assuming p > 1/2, [1] show theoretically that AdaGrad, with H_t as in (4), only predicts the positive class correctly, while plain gradient-descent based schemes perform flawlessly (predicting both positive and negative classes correctly), as long as the number of positive examples in training sufficiently exceeds the number of negative ones. This shows that simple gradient descent generalizes better than adaptive methods for some simple problem instances; this further suggests that such behavior might transfer to more complex cases, such as neural networks.
4.3 A counterexample for the counterexample
We alter the previous counterexample by slightly changing the problem setting: at first, we reduce the margin between the two classes; the case where we increase the margin is provided in the appendix. We empirically show that gradient-descent methods fail to generalize as well as adaptive methods – with a preconditioner slightly different than AdaGrad's.
In particular, for the responses, we consider two classes {+c, −c} for some 0 < c < 1; i.e., we consider a smaller margin between the two classes. (One can consider classes in {+1, −1}, but the rest of the problem setting would need to be weighted accordingly; we selected to weight the classes differently in order not to drift much from the counterexample of [1].) The constant c can take different values, and we still get the same performance, as we show in the experiments below. The rest of the problem setting is the same. As above, only the first feature is indicative of the correct class.
Given this generative model, we construct samples (X, y), and set the problem sizes to different values. We compare two simple algorithms: plain gradient descent, w_{t+1} = w_t − η X^T (X w_t − y); and the recursion w_{t+1} = w_t − η H_t^{−1} X^T (X w_t − y), where η is set as above and H_t follows the rule:
(2) 
Observe that this H_t uses the dot product of gradients, squared. A variant of this preconditioner is found in [25]; however, our purpose is not to recommend a particular preconditioner, but to show that there are preconditioners that lead to better performance than the minimum norm solution. We denote by w_adam, w_ada, and w_gd the estimates of Adam, the AdaGrad variant, and simple gradient descent, respectively.
The experiment obeys the following steps: we train both gradient and adaptive gradient methods on the same training set, and we test the models on new data. We define performance in terms of the classification error: for a new sample x and the given models, the only features that are nonzero in both the models and the x's are the first 3 entries [1, pp. 5]. This is due to the fact that, for gradient descent and given the structure in X, only these 3 features affect the performance of gradient descent (further experiments in Section 7.7 empirically verify this consistency). Thus, the decision rules for both algorithms are:
where the rule assigns to x the class whose point is nearest w.r.t. the model's prediction. With this example, our aim is to show that adaptive methods can lead to models that generalize better than gradient descent.
Table 1: Accuracy (%) on unseen data; each row corresponds to one problem setting.

          Gradient Descent  AdaGrad variant  Adam
Acc. (%)        63               100          91
Acc. (%)        53               100          87
Acc. (%)        58                99          84
Acc. (%)        77               100          88
Acc. (%)        80               100          89
Acc. (%)        91               100          89
Acc. (%)        85               100          95
Acc. (%)        83               100          76
Acc. (%)       100               100          90
Table 1 summarizes the empirical findings. In order to cover a wider range of settings, we consider several values of n and set d accordingly, as dictated by [1]. We generate y as above, where instances in the positive class are generated with probability p; the cases with other values of p are provided in Appendix Section 7.7, and convey the same message as Table 1.
The simulation is completed as follows: for each setting, we generate 100 different instances of (X, y), and for each instance we compute the solutions from gradient descent, the AdaGrad variant, and Adam (RMSProp is included in the appendix), as well as the minimum norm solution. In the appendix, we repeat the above table with the AdaGrad variant that normalizes the final solution (Table 3) before calculating the distance w.r.t. the minimum norm solution: we observed that this step neither improved nor worsened the performance, compared to the unnormalized solution. This further indicates that there is an infinite collection of solutions – with different magnitudes – that lead to better performance than plain gradient descent; thus, our findings are not a pathological example where adaptive methods happen to work better.
We record the distance ‖w − w_mn‖₂, where w represents the corresponding solution obtained by each algorithm in the comparison list. For each instance, we further generate unseen test samples, and we evaluate the performance of the models on predicting their labels.
Table 1 shows that gradient descent converges to the minimum norm solution, in contrast to the adaptive methods. This confirms that the adaptive gradient methods (including the proposed AdaGrad variant) converge to a different solution than the minimum norm one. Nevertheless, the accuracy on unseen data is higher for the adaptive methods (both our proposed AdaGrad variant and, in most instances, Adam) than for plain gradient descent when the margin is small: the adaptive methods successfully identify the correct class, while gradient descent mostly predicts one class (the positive one; this is corroborated by the fact that the accuracy obtained is approximately p, as n increases).
The proposed AdaGrad variant, described in equation (2), falls under the broad class of adaptive algorithms with positive definite preconditioners. However, for the counterexample in [1, pp. 5], the AdaGrad variant neither satisfies the convergence guarantees of Lemma 3.1 there, nor does it converge to the minimum norm solution, as evidenced by its norm in Table 1. To buttress our claim that the AdaGrad variant in (2) converges to a solution different from that of minimum norm (which is the limit of plain gradient descent), we provide the following proposition for a specific class of problems (not the problem proposed in the counterexample of [1, pp. 5]); the proof is provided in Appendix 7.6.
Proposition 3
Suppose X^T y has no zero components. Define u = sign(X^T y) and assume there exists a scalar c > 0 such that X u = c y. Then, when initialized at 0, the AdaGrad variant in (2) converges to the unique solution proportional to u.
This result, combined with our experiments, indicates that the minimum norm solution does not guarantee better generalization performance in overparameterized settings, even in the case of linear regression. Thus, it is unclear why that should be the case for deep neural networks.
A detailed analysis about the class of counterexamples is available in Section 7.7.1.
5 Experiments
We empirically compare two classes of algorithms in deep neural network training:

Plain gradient descent algorithms, including mini-batch stochastic gradient descent and accelerated stochastic gradient descent with constant momentum.

Adaptive methods, including AdaGrad (and the variant considered above), Adam, and RMSProp.
The details of the datasets and the DNN architectures used in our experiments are given in Table 2.
5.1 Hyperparameter tuning
Name  Network type  Dataset 

M1UP  Shallow CNN + FFN  MNIST 
M1OP  Shallow CNN + FFN  MNIST 
C1UP  Shallow CNN + FFN  CIFAR10 
C1OP  ResNet18  CIFAR10 
C2OP  PreActResNet18  CIFAR100 
C3OP  MobileNet  CIFAR100 
C4OP  MobileNetV2  CIFAR100 
C5OP  GoogleNet  CIFAR100 
Table 2: Summary of the datasets and the architectures used for experiments. CNN stands for convolutional neural network, FFN for feed-forward network. More details are given in the main text.
Both for adaptive and non-adaptive methods, the step size and momentum parameters are key for favorable performance, as also concluded in [1]. Default values were used for the remaining parameters. The step size was tuned over an exponentially-spaced set, while the momentum parameter was tuned over a small set of standard values. We observed that step sizes and momentum values smaller/larger than those in these sets gave worse results. Yet, we note that a better step size could lie between the values of the exponentially-spaced set. The decay models were similar to the ones used in [1]: no decay and fixed decay. We used fixed decay in the overparameterized cases, using the StepLR implementation in PyTorch. We experimented with both the decay rate and the decay step in order to ensure fair comparisons with the results in [1]. A complete set of the hyperparameters tuned over for comparison can be found in Section 7.8 in the Appendix.

5.2 Results
Our main observation is that, both in under- and overparameterized cases, adaptive and non-adaptive methods converge to solutions with similar testing accuracy: the superiority of simple or adaptive methods depends on the problem/data at hand. Further, as already pointed out in [1], adaptive methods often require similar parameter tuning. Most of the experiments use readily available code from GitHub repositories. Since increasing/decreasing the batch size affects convergence [26], all the experiments were simulated with identical batch sizes. Finally, our goal is to show performance results in the purest algorithmic setups: often, our tests did not achieve state-of-the-art performance.
Overall, despite not necessarily converging to the same solution as gradient descent, adaptive methods generalize as well as their non-adaptive counterparts. In the M1 and C1UP settings, we compute standard deviations over all Monte Carlo instances and plot them with the learning curves (shown in shaded colors is one standard deviation; best viewed in electronic form). For the cases of C{1–5}OP, we show single runs due to the lack of excessive computational resources.
MNIST dataset and the M1 architecture.
Each experiment for M1 is simulated over 50 epochs and 10 runs, for both under- and overparameterized settings. Both MNIST architectures consist of two convolutional layers (the second one with dropout [27]) followed by two fully connected layers. The primary difference between the M1OP (K parameters) and M1UP (K parameters) architectures is the number of channels in the convolutional layers and of nodes in the last fully connected hidden layer. Figure 1, left two columns, reports the results over 10 Monte Carlo realizations. The top row corresponds to the M1UP case; the bottom row to the M1OP case. We plot both training errors and the accuracy results on unseen data. For the M1UP case, despite the grid search, observe that AdaGrad (and its variant) do not perform as well as the rest of the algorithms. Nevertheless, adaptive methods (such as Adam and RMSProp) perform similarly to simple SGD variants, supporting our conjecture that each algorithm requires a different configuration, but can still converge to a good local point; also, that adaptive methods require the same (if not more) tuning. For the M1OP case, SGD with momentum performs less favorably than plain SGD, and we conjecture that this is due to non-optimal tuning. In this case, all adaptive methods perform similarly to SGD.
CIFAR10 dataset and the C1 architecture.
For C1, C1UP is trained for 10 runs, while C1OP was trained for 1 run. The underparameterized setting is tweaked on purpose to ensure that we have fewer parameters than examples (K parameters), and slightly deviates from [28]; our generalization guarantees are in conjunction with the attained test accuracy levels. Similarly, for the C1OP case, we implement a ResNet [29] + dropout architecture (million parameters); the code from the following GitHub repository was used for the experiments: https://github.com/kuangliu/pytorch-cifar. Adam and RMSProp achieve better performance than their non-adaptive counterparts for both the underparameterized and overparameterized settings.
Figure 1, right panel, follows the same pattern as the MNIST data; it reports the results over 10 Monte Carlo realizations. Again, we observe that the AdaGrad methods do not perform as well as the rest of the algorithms. Nevertheless, adaptive methods (such as Adam and RMSProp) perform similarly to simple SGD variants.
CIFAR100 and other deep architectures (C{25}OP).
In this experiment, we focus only on the overparameterized case: DNNs are usually designed to be overparameterized in practice, with an ever-growing number of layers and, eventually, a larger number of parameters [30]. Due to the depth and complexity of the networks, we only perform one run for each architecture. C2OP corresponds to PreActResNet18 from [31], C3OP corresponds to MobileNet from [32], C4OP is MobileNetV2 from [33], and C5OP is GoogleNet from [34]. The results are depicted in Figure 2. We did not perform a fine grid search over the hyperparameters, but selected the best choices among the parameters used for the MNIST/CIFAR10 experiments. The results show only slight superiority of non-adaptive methods, but overall support our claims: the superiority depends on the problem/data at hand; also, all algorithms require fine-tuning to achieve their best performance. We note that a more comprehensive assessment requires multiple runs for each network, as other hyperparameters (such as initialization) might play a significant role in closing the gap between different algorithms.
6 Conclusions and Future Work
In this work, we revisited the question of how adaptive and non-adaptive training algorithms compare: focusing on the linear regression setting, as in [1], we show that there are similarities and differences between the behavior of adaptive and non-adaptive methods, depending on whether we have under- or overparameterization; even when similarities occur, there are differences in how the hyperparameters are set between the two algorithmic classes in order to obtain similar behavior. In the overparameterized linear regression case, we provide a small toy example showing that adaptive methods, such as AdaGrad, can generalize better than plain gradient descent, under assumptions; however, this is not a rule that applies universally. Our findings on training DNNs show that there is no clear and provable superiority of plain or adaptive gradient methods. What was clear from our experiments, though, is that adaptive methods require no less fine-tuning than plain gradient methods.
We highlight that the small superiority of non-adaptive methods in some DNN simulations is not fully understood, and needs further investigation beyond the simple linear regression model. A preliminary analysis of regularization for overparameterized linear regression reveals that it can act as an equalizer over the set of adaptive and non-adaptive optimization methods, i.e., force all optimizers to converge to the same solution. However, more work is needed to analyze its effect on the overall generalization guarantees, both theoretically and experimentally, as compared to the non-regularized versions of these algorithms.
References

[1] A. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
[2] A. Blum and R. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494–501, 1989.
[3] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[4] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013.
[5] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
[6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[7] W. Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490, 2011.
[8] A. Senior, G. Heigold, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6724–6728. IEEE, 2013.
[9] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.
[10] G. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer, 2003.
[11] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[12] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] M. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[15] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[16] T. Dozat. Incorporating Nesterov momentum into Adam. 2016.
[17] J. Zhang and I. Mitliagkas. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.
[18] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar. Theory of deep learning III: Explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.
[19] M. S. Nacson, J. Lee, S. Gunasekar, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.
[20] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
[21] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[22] R. Horn and C. Johnson. Matrix Analysis. 1985.
[23] D. Dowler. Bounding the norm of matrix powers. 2013.
[24] R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
[25] M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. arXiv preprint arXiv:1706.05507, 2017.
[26] S. L. Smith, P.-J. Kindermans, and Q. V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[28] M. D. McDonnell and T. Vladusich. Enhanced image classification with a fast-learning shallow convolutional neural network. In International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2015.
[29] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[30] M. Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
[31] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[32] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
7 Supplementary Material
7.1 Unfolding gradient descent in the underparameterized setting
Let us unfold the gradient descent recursion w_{t+1} = w_t − η X^T (X w_t − y), assuming that w_0 = 0:
What we observe is that:


The coefficients follow the Pascal triangle principle and can easily be expressed through binomial coefficients.

The step size η appears with increasing powers, as does the term X^T X.

There are some constant terms, X^T and y.
The above lead to the following generic characterization of the gradient descent recursion:

w_t = η ( Σ_{i=0}^{t−1} (I − η X^T X)^i ) X^T y.

The expression in the parentheses satisfies:

Σ_{i=0}^{t−1} (I − η X^T X)^i = (η X^T X)^{−1} ( I − (I − η X^T X)^t ),

where, since I and η X^T X commute, the binomial theorem can be used to expand each power (I − η X^T X)^i and recover the Pascal-triangle coefficients noted above. Thus, we finally get:

w_t = (X^T X)^{−1} ( I − (I − η X^T X)^t ) X^T y.
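The partial-sum identity used above can be checked numerically; the sketch below (an illustrative instance of our own, with A = X^T X) compares both sides of the equality for a finite t:

```python
import numpy as np

# Check: sum_{i=0}^{t-1} (I - eta A)^i == (1/eta) A^{-1} (I - (I - eta A)^t).
rng = np.random.default_rng(6)
X = rng.standard_normal((40, 6))
A = X.T @ X                                 # full rank, invertible
eta = 1.0 / np.linalg.eigvalsh(A).max()
t = 25

M = np.eye(6) - eta * A
lhs = sum(np.linalg.matrix_power(M, i) for i in range(t))
rhs = np.linalg.solve(A, np.eye(6) - np.linalg.matrix_power(M, t)) / eta
assert np.allclose(lhs, rhs)
```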
7.2 H_t = D is a constant diagonal matrix with positive diagonal entries.
Here, we simplify the selection of the preconditioner in adaptive gradient methods. Our purpose is to characterize their performance, and check how an adaptive (i.e., preconditioned) algorithm performs in both under- and overparameterized settings.
7.2.1 Underparameterized linear regression.
Unfolding the “adaptive" gradient descent recursion w_{t+1} = w_t − η D^{−1} X^T (X w_t − y) for w_0 = 0, we get the following closed-form solution:

w_t = η ( Σ_{i=0}^{t−1} (I − η D^{−1} X^T X)^i ) D^{−1} X^T y.

Once again the question is: under which conditions does the above recursion converge to the left inverse solution?
For the special case of D being a positive definite constant matrix, observe that, for full rank X, the matrix D^{−1} X^T X is also full rank, and thus invertible. We can transform the above sum, using similar reasoning as before, to the following expression:

Σ_{i=0}^{t−1} (I − η D^{−1} X^T X)^i = (η D^{−1} X^T X)^{−1} ( I − (I − η D^{−1} X^T X)^t ).

This further transforms our recursion into:

w_t = (X^T X)^{−1} D ( I − (I − η D^{−1} X^T X)^t ) D^{−1} X^T y.

Using Theorem 1, we can again prove that, for sufficiently large t and for a sufficiently small step size η, (I − η D^{−1} X^T X)^t → 0. Thus, for sufficiently large t:

w_t → (X^T X)^{−1} X^T y,

which is the left inverse solution, as in gradient descent.
7.2.2 Overparameterized linear regression.
For the overparameterized linear regression, we obtain a different expression by using a different kind of variable grouping in the unfolding procedure. In particular, we need to take into consideration that now XX^T is full rank, and thus the matrix X D^{−1} X^T is also full rank, and thus invertible. Going back to the main preconditioned gradient descent recursion w_{t+1} = w_t − η D^{−1} X^T (X w_t − y), unfolding with w_0 = 0 leads to the following closed-form solution:

w_t = η D^{−1} X^T ( Σ_{i=0}^{t−1} (I − η X D^{−1} X^T)^i ) y.

The sum can be similarly simplified as:

Σ_{i=0}^{t−1} (I − η X D^{−1} X^T)^i = (η X D^{−1} X^T)^{−1} ( I − (I − η X D^{−1} X^T)^t ).

This further transforms our recursion into:

w_t = D^{−1} X^T (X D^{−1} X^T)^{−1} ( I − (I − η X D^{−1} X^T)^t ) y.

Using Theorem 1, we can again prove that, for sufficiently large t and for a sufficiently small step size η, (I − η X D^{−1} X^T)^t → 0. Thus, for sufficiently large t:

w_t → D^{−1} X^T (X D^{−1} X^T)^{−1} y,

which is not the same as the minimum norm solution, except when D = c·I for some constant c > 0. This proves that preconditioned algorithms might lead to different solutions, depending on the selection of the preconditioning matrix/matrices.
7.3 Unfolding adaptive gradient descent with varying in the underparameterized setting
Unfolding the recursion, when H_t varies per iteration, we get: