1 Introduction
Stochastic gradient based methods dominate optimization for most large-scale machine learning problems, due to their computational simplicity and their compatibility with modern parallel hardware such as GPUs.
In most cases these methods use overparametrized models that allow for interpolation, i.e., perfect fitting of the training data. While we do not yet have a full understanding of why such interpolated solutions generalize, a wealth of empirical evidence indicates that they do (e.g., [22, 2]), and we are beginning to recognize their desirable properties for optimization, particularly in the SGD setting [11]. In this paper, we leverage the power of the interpolated setting to propose MaSS (Momentum-added Stochastic Solver), a stochastic momentum method for efficient training of overparametrized models. See pseudocode in Appendix A.
The algorithm maintains two weight variables, which are updated using the following rules:
Here the step-size, the secondary step-size and the acceleration parameter are fixed hyperparameters, independent of the iteration counter. Updates are executed iteratively until certain convergence criteria are satisfied or a desired number of iterations has been completed. In each iteration, the algorithm takes a noisy first-order gradient, estimated from a mini-batch of the training data.
The algorithm is simple to implement and is nearly as memory- and computationally efficient as standard mini-batch SGD. Unlike many existing accelerated SGD (ASGD) methods (e.g., [8, 1]), it requires no adjustment of hyperparameters during training and no costly computations of full gradients.
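To make the updates concrete, the following is a minimal Python sketch of MaSS on a least-squares problem with mini-batches of size 1. The names `eta1` (step-size), `eta2` (secondary step-size, weighting the compensation term) and `gamma` (acceleration parameter) follow the roles described in the text; the exact update form (a Nesterov-style step plus a compensation term along the stochastic gradient, with a non-negative compensation coefficient as discussed in Section 3) and all constants are our illustrative choices, not a prescription from the paper.

```python
import numpy as np

def mass_least_squares(X, y, eta1=0.05, eta2=0.02, gamma=0.3, n_iters=2000, seed=0):
    """Sketch of MaSS for the least-squares loss with mini-batches of size 1.

    Update rules (assumed form; the sign of the compensation term follows the
    remark that its coefficient is non-negative):
        w_next = u - eta1 * g
        u_next = w_next + gamma * (w_next - w) + eta2 * g
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = w.copy()  # both variables are initialized with the same vector
    for _ in range(n_iters):
        i = rng.integers(n)              # sample a mini-batch of size 1
        g = (X[i] @ u - y[i]) * X[i]     # noisy first-order gradient at u
        w_next = u - eta1 * g
        u = w_next + gamma * (w_next - w) + eta2 * g
        w = w_next
    return w
```

On an interpolated (realizable) linear problem, this loop converges without any step-size decay, consistent with the exponential-convergence guarantees discussed in Section 3.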
Note that, except for the additional compensation term (underlined in the equation above), the update rules are those of the stochastic variant of the classical Nesterov's method (SGD+Nesterov) with constant coefficients [15]. In Section 3, we provide guarantees of exponential convergence for our algorithm in quadratic as well as more general convex interpolated settings. Moreover, for the quadratic case, we show provable acceleration over SGD. Furthermore, we show that the compensation term is essential to guarantee convergence in the stochastic setting, by giving examples (in Subsection 3.3) where SGD+Nesterov without this term provably diverges for a range of parameters, including those commonly used in practice. The non-convergence of SGD+Nesterov is intuitively consistent with the recent observation that "momentum hurts the convergence within the neighborhood of global optima" [10].
In the case of a quadratic objective function, the optimal parameter selection for the algorithm, including its dependence on the mini-batch size, can be theoretically derived from our convergence analysis. In particular, in the full-batch case (i.e., with full gradient computation), our optimal parameter setting suggests setting the compensation parameter to zero, thus reducing (full-gradient) MaSS to the classical Nesterov's method. In that case, the convergence rate obtained here for MaSS matches the well-known rate of Nesterov's method [15, 3]. In the mini-batch case, our method is an acceleration of SGD. Similarly to the case of SGD [11], the gains from increasing the mini-batch size follow a "diminishing returns" pattern beyond a certain critical value.
Under certain conditions, MaSS can also be shown to have a computational advantage over (full-gradient) Nesterov's method; we give examples of such settings in Section 3. We discuss related work in Section 4.
In Section 5, we provide an experimental evaluation of the performance of MaSS in both convex and non-convex settings. Specifically, we show that MaSS is competitive with, or outperforms, the popular Adam algorithm [8] and SGD on different neural network architectures, including convolutional networks, ResNet and fully-connected networks. Additionally, we show strong results on optimization of convex kernel methods, where our analysis provides direct cues for parameter selection.
2 Notations and Preliminaries
We denote column vectors by boldface lowercase letters, and use the standard Euclidean inner product and norm; the Mahalanobis norm with respect to a positive semi-definite matrix is defined accordingly. The dataset consists of input feature vectors and real-valued labels. A mini-batch of the dataset is sampled uniformly at random, with a fixed batch size. We write the exact gradient of a function at a point, and the unbiased estimate of the gradient evaluated on a mini-batch. Without ambiguity, we omit the subscript when we refer to the one-point (batch size 1) estimate. We assume the objective function to have the form of a finite sum,
(2) 
where each summand typically depends only on a single data point. In the case of the least-squares loss, each summand is the squared error on the corresponding point. Note that the features could be either the original feature vectors, or ones obtained after certain transformations, e.g., kernel-mapped features or the activations at a certain layer of a neural network.
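The finite-sum structure is what makes cheap unbiased gradient estimates possible: averaging the one-point gradient estimates over all data points recovers the exact gradient. The following self-contained check illustrates this for the least-squares loss (all names and constants are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Finite-sum least-squares objective: f(w) = (1/n) * sum_i (1/2) (x_i . w - y_i)^2
def full_gradient(w):
    return X.T @ (X @ w - y) / n

def minibatch_gradient(w, idx):
    # Unbiased estimate of the full gradient from the mini-batch `idx`
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

# Averaging the one-point estimates over all n points recovers the exact
# gradient, which is precisely the unbiasedness property used in the text.
per_sample = np.stack([minibatch_gradient(w, [i]) for i in range(n)])
assert np.allclose(per_sample.mean(axis=0), full_gradient(w))
```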
We also use the concepts of strong convexity and smoothness of functions, see definitions in Appendix B.
2.1 Preliminaries for Least Square Objective Functions
Within the scope of least-squares objective functions, we introduce the following notation. The Hessian matrix of the objective function is defined in the usual way; we also use its unbiased estimate based on a mini-batch and, in particular, its one-sample estimate based on a single data point.
We assume that the least-squares objective function is strongly convex, i.e., that its smallest Hessian eigenvalue is positive. Note that mini-batch gradients are always perpendicular to the eigendirections with zero Hessian eigenvalues, so no actual update happens along such eigendirections. Hence, without loss of generality, we can assume that the Hessian matrix has no zero eigenvalues. Following [6], we define the computational and statistical condition numbers as follows. First, take the smallest positive number such that
(3) 
The (computational) condition number is defined to be . The statistical condition number is defined to be the smallest positive number such that
(4) 
Proposition 1.
Proof.
The smallest positive number such that is the largest eigenvalue , and
(5) 
Hence , implying is smooth. ∎
Remark 1.
If , then .
Remark 2.
The (computational) condition number defined above is always larger than the usual condition number.
Remark 3.
It is important to note that , since .
The following lemma is useful for dealing with the minibatch scenario:
Lemma 1.
(6) 
Proof.
∎
2.2 Interpolation and Automatic Variance Reduction
Interpolation is the setting in which the model fits the training data exactly, i.e., the loss is zero at every training point. In other words, there exists a point in the parameter space such that
(7) 
It follows immediately that the training loss attains its minimum value of zero at such a point. For least-squares objective functions, interpolation implies that the associated linear system has at least one solution (the so-called realizable case).
Denote the solution set accordingly; it is obviously non-empty in the interpolation regime. Given any weight vector, we denote its closest solution and define the corresponding error. One should be aware that different weight vectors may correspond to different closest solutions. In particular, for linear regression, the solution set is an affine subspace and gradients are always perpendicular to it; therefore, we can assume without loss of generality that the Hessian has no zero eigenvalues. In the interpolation regime, one can always write the least-squares loss as
(8) 
A key property of interpolation is that the variance of the stochastic gradient decreases to zero as the weight approaches an optimal solution.
Proposition 2 (Automatic Variance Reduction).
For the least-squares objective function, the stochastic gradient at an arbitrary point can be written as
(9) 
Moreover, the variance of the stochastic gradient satisfies
(10) 
Proof.
Eq. (9) follows directly from the fact that the labels are exactly realized at an interpolating solution. ∎
Since the matrix governing the error term is constant, the above proposition reveals a linear dependence of the variance of the stochastic gradient on the squared norm of the error. Therefore, the closer the weight is to an optimal solution, the more exact the gradient estimate, which in turn helps convergence of stochastic gradient based algorithms near the optimal solution. This observation underlies the exponential convergence of SGD in certain convex settings [20, 12, 19, 13, 11].
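The proposition can be checked numerically: under interpolation the per-sample gradients are linear in the error, so their variance scales exactly quadratically with the distance to the solution. A small sketch (the setup and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                  # interpolation: y is exactly realized by w_star

def grad_variance(w):
    # Per-sample stochastic gradients g_i = (x_i . w - y_i) x_i and their spread
    G = (X @ w - y)[:, None] * X        # shape (n, d); row i is g_i(w)
    return np.mean(np.sum((G - G.mean(axis=0)) ** 2, axis=1))

v = rng.normal(size=d)
# Under interpolation g_i(w) = x_i x_i^T (w - w_star) is linear in the error,
# so the variance scales exactly with the squared distance to the solution.
var1 = grad_variance(w_star + v)
var2 = grad_variance(w_star + 0.1 * v)
print(var2 / var1)   # -> 0.01 (up to floating point)
print(grad_variance(w_star))   # -> 0.0: the noise vanishes at the solution
```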
2.3 An Equivalent Form of MaSS and Hyperparameters
We can rewrite MaSS in the following equivalent form, which is more convenient for analysis. Introducing an additional variable, the update rules of MaSS can be written as:
(11a)  
(11b)  
(11c) 
There is a bijection between the hyperparameters () and (), which is given by:
(12) 
We will use either parameterization depending on the setting: one for the theoretical analysis and experimental reports, and the other when discussing optimal hyperparameter selection.
Remark 4.
When the compensation term is switched off, i.e., , as in the case of SGD+Nesterov, the hyperparameter is fixed to be .
3 Convergence Analysis and Hyperparameters Selection
In this section, we analyze the convergence of MaSS for convex objective functions in the interpolation regime. The section is organized as follows: in subsection 3.1, we prove convergence of MaSS for the least-squares objective function and derive optimal hyperparameters (including their mini-batch dependence) to obtain provable acceleration over mini-batch SGD. In subsection 3.2, we extend the analysis to more general strongly convex functions. In subsection 3.3, we discuss the importance of the compensation term by showing the non-convergence of SGD+Nesterov for a range of hyperparameters.
3.1 Acceleration in the Least Square Setting
Based on the equivalent form of MaSS in Eq. (11), the following theorem shows that, for strongly convex least-squares loss in the interpolation regime, MaSS is guaranteed to converge exponentially when the hyperparameters satisfy certain conditions.
Theorem 1 (Convergence of MaSS).
Suppose the objective function is in the interpolation regime. Let the smallest positive eigenvalue of the Hessian and the constant defined in Eq. (3) be given. In MaSS with mini-batches of a fixed size, if the hyperparameters satisfy the following conditions:
(13) 
then, after iterations,
(14) 
Consequently,
(15) 
where is a finite constant.
Remark 5.
Since the rate parameter is selected from the stated interval, this theorem implies exponential convergence of MaSS.
Proof of Theorem 1.
Hyperparameter Selection.
From Theorem 1, we observe that the convergence rate is determined by . Therefore, larger is preferred for faster convergence. Combining the conditions in Eq.(13), we have
(19) 
By setting the free parameter to the value that maximizes the right-hand side of the inequality, we obtain the optimal selection. Note that this setting determines the remaining hyperparameter uniquely through the conditions in Eq. (13). By Eq. (32), the optimal selection would be:
(20) 
This quantity is usually larger than 1, which implies that the coefficient of the compensation term is non-negative. Note that gradient terms usually carry negative coefficients, so that the weights move in directions opposite to the gradients. The non-negative coefficient indicates that the weight is "over-descended" in SGD+Nesterov and needs to be compensated along the gradient direction.
With such an optimal hyperparameter selection, we have the following theorem:
Theorem 2.
Remark 6.
In the case of a mini-batch of size 1, the asymptotic convergence rate is faster than that of SGD [11].
Remark 7 (MaSS reduces to Nesterov’s method in full batch).
In the limit of full batch, the optimal parameter selection in Eq. (20) reduces to
(22) 
It is interesting to observe that, in the full-batch scenario, the compensation term is turned off, and the remaining hyperparameters coincide with those of full-gradient Nesterov's method. Hence MaSS with the optimal hyperparameter selection reduces to Nesterov's method in the full-batch limit. Moreover, the convergence rate in Theorem 2 reduces exactly to the well-known convergence rate of Nesterov's method.
Remark 8 (Diminishing returns and the critical minibatch size).
The relevant condition number monotonically decreases as the mini-batch size increases, which implies that a larger mini-batch always results in faster convergence per iteration. However, the improvement in per-iteration convergence saturates as the mini-batch size approaches a critical value, beyond which the maximum possible additional improvement is bounded. This phenomenon parallels the mini-batch convergence saturation of ordinary SGD analyzed in [11].
Example 1 (Gaussian Distributed Data).
Suppose the data feature vectors are zero-mean Gaussian distributed. Then, by the standard fourth-moment identity for zero-mean Gaussian random variables, we have
(23)  
(24) 
where the dimension of the feature vectors enters as a constant. Hence the computational and statistical condition numbers can be computed explicitly, which yields an explicit convergence rate of MaSS when the batch size is 1. In particular, if the feature vectors are high-dimensional, e.g., as in kernel learning, then MaSS with mini-batches of size 1 retains the corresponding convergence rate.
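The fourth-moment fact invoked in this example can be checked by Monte Carlo. For a standard Gaussian vector in d dimensions, the identity below is the standard Gaussian moment computation (the constant d + 2 follows from E[z_1^4] = 3 for a unit normal):

```python
import numpy as np

# Monte-Carlo check of a Gaussian fourth-moment identity of the kind used in
# the example: for z ~ N(0, I_d),  E[ ||z||^2 * z z^T ] = (d + 2) * I_d.
rng = np.random.default_rng(0)
d, n = 5, 400_000
Z = rng.normal(size=(n, d))
sq_norms = (Z ** 2).sum(axis=1, keepdims=True)      # ||z_i||^2, shape (n, 1)
M = (Z * sq_norms).T @ Z / n                        # estimate of E[||z||^2 zz^T]
assert np.allclose(M, (d + 2) * np.eye(d), atol=0.25)
```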
3.2 Convergence Analysis for Strongly Convex Objective
We now extend the analysis to strongly convex objective functions in the interpolation regime. First, we extend the definition of the constant in Eq. (3) to general convex functions,
(25) 
It can be shown that this definition is consistent with the one for the quadratic case; see Eq. (3). We further assume that the objective function is strongly convex and admits an interpolating solution, i.e., that it is in the interpolation regime.
Theorem 3.
Suppose there exists a strongly convex and smooth nonnegative function such that and , for some . In MaSS, if the hyperparameters are set to be:
(26) 
then after iterations,
(27) 
where is a finite constant.
Proof.
See Appendix C. ∎
Remark 9.
When the objective is the square loss, the corresponding function satisfies the assumptions made in Theorem 3.
3.3 On the importance of the compensation term
In this section we show that the compensation term is needed to ensure convergence. Specifically, we demonstrate both theoretically and empirically, by providing concrete examples, that SGD+Nesterov (which lacks the compensation term) diverges for a range of parameters, including those commonly used in practice. In contrast, the classical (full-batch) Nesterov method is well known to converge for such settings.
Another view of SGD+Nesterov.
Before going into the technical details, we show how Theorem 1 provides intuition for the divergence of SGD+Nesterov. In short, the hyperparameters corresponding to SGD+Nesterov do not lie in the convergence range given by Theorem 1.
Specifically, since SGD+Nesterov has no compensation term, the corresponding parameter in Eq. (31b) is fixed. Note that a smaller batch size results in a larger deviation of this parameter from the value required for convergence. It is easy to check that the hyperparameter setting of SGD+Nesterov does not satisfy the conditions in Theorem 1, except in the full-batch case.
Now, we present a family of examples, where SGD+Nesterov diverges.
Example: 2-dimensional component-decoupled data.
Fix an arbitrary positive constant and let a scalar be randomly drawn from a zero-mean Gaussian distribution. The data points are constructed as follows:
(28) 
where the two vectors are the canonical basis vectors of the plane.
Note that the corresponding least-squares loss function on this data is in the interpolation regime. The Hessian and stochastic Hessian matrices turn out to be
(29) 
Note that the Hessian is diagonal, which implies that stochastic gradient based algorithms applied to this data evolve independently along each axis-parallel direction. This allows a simplified per-direction analysis of the algorithms.
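A toy construction in the spirit of the component-decoupled data makes the decoupling visible: every feature vector lies along one of the two canonical axes, so each stochastic gradient updates exactly one coordinate and the empirical Hessian is diagonal. The constants below are illustrative choices of ours, not necessarily those of Eq. (28):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)
axis = rng.integers(2, size=n)                 # pick basis vector e_1 or e_2
X = np.zeros((n, 2))
X[axis == 0, 0] = np.sqrt(2) * z[axis == 0]    # x along e_1 with prob 1/2
X[axis == 1, 1] = 0.2 * z[axis == 1]           # x along e_2, smaller scale
y = np.zeros(n)                                # w* = 0 interpolates the data

w = rng.normal(size=2)
G = (X @ w - y)[:, None] * X                   # per-sample stochastic gradients
# Each stochastic gradient is supported on a single coordinate:
assert all(np.count_nonzero(g) <= 1 for g in G)
# Hence the empirical Hessian X^T X / n is diagonal:
H = X.T @ X / n
assert abs(H[0, 1]) < 1e-12
```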
Here we list some results that will be useful in the later analysis. By the fourth-moment identity for Gaussian variables, the per-direction second moments can be computed in closed form, where the superscript indexes the coordinates. The computational and statistical condition numbers then follow explicitly.
Analysis of SGD+Nesterov.
Theorem 4 (Divergence of SGD+Nesterov).
With the stated step-size and acceleration parameter, SGD+Nesterov, when initialized at a generic point, diverges in expectation on the least-squares loss with the 2-dimensional component-decoupled data defined in Eq. (28).
Proof.
See Appendix D. ∎
Remark 10.
When the initialization is drawn at random over the parameter space, the required condition is satisfied with probability 1, since the complementary cases form a lower-dimensional manifold (a straight line in this case), which has measure zero.
We provide numerical validation of the divergence of SGD+Nesterov, as well as of the faster convergence of MaSS, by training linear regression models on synthetic datasets. Figure 1 presents the training curves of different optimization algorithms on a realization of the component-decoupled data. The batch size is 1 for all algorithms. Hyperparameters of MaSS are set as suggested by Eq. (20), i.e.,
(30) 
Hyperparameters of SGD+Nesterov are set identically to those of MaSS, but with the compensation term turned off. The step size of SGD is the same as that of MaSS.
We provide additional numerical validation on synthetic data in Appendix E, which includes more realizations of the component-decoupled data as well as centered Gaussian distributed data.
It can be seen that MaSS with the suggested parameter selection indeed converges faster than SGD, and that SGD+Nesterov diverges, even though its parameters are set to values at which the full-gradient Nesterov's method is guaranteed to converge quickly. Since MaSS differs from SGD+Nesterov only by the compensation term, this experiment illustrates the importance of that term. Note that the vertical axes are log-scaled, so the linear decrease of the log losses in the plots implies an exponential decrease of the loss, with the slopes corresponding to the coefficients in the exponents.
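A comparison in the same spirit can be reproduced in a self-contained way. The dataset and hyperparameter values below are our illustrative choices, deliberately conservative so that all three methods remain numerically stable (the divergence of SGD+Nesterov proved above occurs for more aggressive settings, e.g., acceleration parameter near 0.9 with batch size 1); here SGD+Nesterov is simply MaSS with the compensation term switched off.

```python
import numpy as np

def run(optimizer, X, y, eta1=0.05, gamma=0.3, eta2=0.02, n_iters=3000, seed=0):
    """Train least squares with batch size 1. `optimizer` is one of
    'sgd', 'sgd+nesterov', 'mass'; 'sgd+nesterov' sets the compensation
    coefficient to zero."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)
    u = w.copy()
    for _ in range(n_iters):
        i = rng.integers(n)
        g = (X[i] @ u - y[i]) * X[i]
        if optimizer == "sgd":
            u = u - eta1 * g
            w = u
        else:
            comp = eta2 if optimizer == "mass" else 0.0
            w_next = u - eta1 * g
            u = w_next + gamma * (w_next - w) + comp * g
            w = w_next
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10)          # interpolated (realizable) problem
losses = {name: run(name, X, y) for name in ("sgd", "sgd+nesterov", "mass")}
print(losses)
```

With these conservative values, both SGD and MaSS drive the training loss to near machine precision, mirroring the exponential decrease discussed above.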
4 Related Work
Overparameterized models have drawn increasing attention in the literature, as many modern machine learning models, especially neural networks, are overparameterized [4] and show strong generalization performance, as has been observed in practice [16, 22]. Overparameterized models usually result in a nearly perfect fit (or interpolation) of the training data [22, 18, 2]. As discussed in subsection 2.2, interpolation helps convergence of SGD-based algorithms.
There is a large body of work on combining stochastic gradients with momentum in order to achieve accelerated convergence over SGD, including [6, 8, 17, 1]. This line of work has been particularly active since [21] demonstrated empirically that SGD with momentum achieves strong performance in training deep neural networks.
The work [6] is probably the most closely related to MaSS. The algorithm proposed there has a form similar to that of MaSS in Eq. (11), but with an extra hyperparameter, a different way of setting the hyperparameters, and a tail-averaging step at the output stage. In the interpolation setting, their algorithm achieves a certain convergence rate for the square loss when tail-averaging is taken over the last iterations, and a slower rate without averaging; this compares to the rate obtained in our setting.
Perhaps the most widely used ASGD methods in practice are Adam [8] and AMSGrad [17]. These methods adaptively adjust the step size according to an exponentially weighted accumulation of the gradient history, and show strong empirical performance on modern deep learning models. The authors also prove convergence of Adam and AMSGrad in the convex case, under a time-dependent learning-rate decay; note, however, that in practical implementations the rate is typically not decayed.
The Katyusha algorithm [1] introduced the so-called "Katyusha momentum," which is computed from a snapshot of the full gradient, in order to reduce the variance of noisy gradients. The method has a provable upper bound on the time complexity needed to reach a given error in the training loss, but it requires computing full gradients in the course of the algorithm.
5 Empirical Evaluation
In this section, we report empirical evaluation results of the proposed algorithm, MaSS, on realworld datasets. We demonstrate that MaSS has strong optimization performance on both overparameterized neural network architectures and kernel machines.
5.1 Evaluation on Neural Networks
We investigate three types of architectures: fully-connected networks (FCN), convolutional neural networks (CNN) and residual networks (ResNet) [5]. For each type of network, we compare the optimization and generalization performance of MaSS with that of Adam and SGD. For all the experiments, we tune and report the hyperparameter settings of MaSS. We use constant-step-size SGD with the same step size as MaSS. For Adam, the constant step-size parameter is optimized by grid search; the remaining Adam hyperparameters are set to the values typically used in training deep neural networks. All algorithms are run with mini-batches of a fixed size.
Fullyconnected Networks.
We train a fully-connected neural network with 3 hidden layers of 80 ReLU-activated neurons each on the MNIST data set, which has 60,000 training images and 10,000 test images of size 28×28. After each hidden layer, there is a dropout layer with keep probability 0.5. The network takes 784-dimensional vectors as input and has 10 softmax-activated output neurons, for a total of 76,570 trainable parameters. We solve the classification problem by optimizing the categorical cross-entropy loss, using the following hyperparameter setting in MaSS:
Curves of the training loss and test accuracy for different optimizers are shown in Figure 2.
Convolutional Networks.
We consider the image classification problem with convolutional neural networks on the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images of size 32×32. Our CNN has two convolutional layers with 64 channels each and no padding, each followed by a max-pooling layer with stride 2. On top of the last max-pooling layer, there is a fully-connected ReLU-activated layer of size 64, followed by an output layer of size 10 with softmax nonlinearity. A dropout layer with keep probability 0.5 is applied after the fully-connected layer. This CNN architecture has 210,698 trainable parameters in total. Again, we minimize the categorical cross-entropy loss, using the following hyperparameter setting in MaSS:
See Figure 3 for the performance of different algorithms.
Residual Networks.
Finally, we train a residual network with 38 layers for the multiclass classification problem on CIFAR-10. The ResNet we use has a sequence of 18 residual blocks [5], organized in three groups of 6 with progressively smaller output resolution. On top of these blocks, there is an average-pooling layer with stride 2, followed by an output layer of size 10 with softmax nonlinearity. We optimize the cross-entropy loss. This ResNet has 595,466 trainable parameters in total.
Figure 4 shows the performance of MaSS, Adam and SGD. In this experiment, we use the following hyperparameter setting for MaSS:
5.2 Evaluation on Kernel Regression Models
Linear regression with kernel-mapped features is a convex optimization problem. We solve the linear regression task with two different types of features, Gaussian kernel features and Laplacian kernel features, separately. We randomly subsample 2,000 MNIST training images as the training set and use the MNIST test images as the test set. We generate kernel features using Gaussian and Laplacian kernels, considering each training point as a kernel center; bandwidth values are set to 5 for both kernels. We train the linear regression models with the kernel features as inputs and one-hot-encoded labels as desired outputs, the training objective being to minimize the mean squared error (MSE) between the model outputs and the one-hot-encoded ground-truth labels. In each kernel regression task, the number of trainable parameters is determined by the number of kernel centers and output classes. We use the following hyperparameter setting in MaSS for both kernel regression tasks:
The training curves of MaSS, Adam and SGD are demonstrated in Figure 5.
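The kernel-feature construction described above can be sketched as follows. Every training point serves as a kernel center, and bandwidth 5 follows the text; the helper names, the synthetic stand-in data, and the exact kernel parameterizations (one common convention each for the Gaussian and Laplacian kernels) are our assumptions:

```python
import numpy as np

def gaussian_features(A, centers, bandwidth=5.0):
    # exp(-||a - c||^2 / (2 * bandwidth^2)), one common Gaussian-kernel convention
    d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def laplacian_features(A, centers, bandwidth=5.0):
    # exp(-||a - c|| / bandwidth), one common Laplacian-kernel convention
    d = np.sqrt(((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / bandwidth)

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(100, 20))          # stand-in for the subsampled images
Phi = gaussian_features(Xtr, Xtr)         # (n_train, n_centers) feature matrix
Lap = laplacian_features(Xtr, Xtr)
assert Phi.shape == (100, 100)
assert np.allclose(np.diag(Phi), 1.0)     # unit self-similarity at each center
assert np.allclose(np.diag(Lap), 1.0)
assert np.allclose(Phi, Phi.T)            # training features form a symmetric kernel matrix
```

A linear model trained on such features with the MSE loss is then an ordinary convex least-squares problem, to which the analysis of Section 3 applies.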
Acknowledgements
We thank NSF for financial support. We thank Xiao Liu for helping with the empirical evaluation of our proposed method and useful discussions.
References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
[2] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
[3] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.
[4] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
[7] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. International Conference on Learning Representations (ICLR), 2018.
[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] Ji Liu and Stephen Wright. An accelerated randomized Kaczmarz algorithm. Mathematics of Computation, 85(297):153–178, 2016.
[10] Tianyi Liu, Zhehui Chen, Enlu Zhou, and Tuo Zhao. Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. arXiv preprint arXiv:1802.05155, 2018.
[11] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. International Conference on Machine Learning (ICML), 2018.
[12] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
[13] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[14] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[15] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[16] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[17] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. International Conference on Learning Representations (ICLR), 2018.
[18] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
[19] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
[20] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[21] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[22] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017.
Appendix A Pseudocode for MaSS
Note that the proposed algorithm initializes the variables and with the same vector, which could be randomly generated.
As discussed in subsection 2.3, MaSS can be equivalently implemented using the following update rules:
(31a)  
(31b)  
(31c) 
In this case, variables , and should be initialized with the same vector.
There is a bijection between the hyperparameters () and (), which is given by:
(32) 
Appendix B Strong Convexity and Smoothness of Functions
Definition 1 (Strong Convexity).
A differentiable function $f$ is $\mu$-strongly convex ($\mu > 0$) if
(33) $f(\mathbf{w}) \ge f(\mathbf{v}) + \langle \nabla f(\mathbf{v}), \mathbf{w} - \mathbf{v} \rangle + \frac{\mu}{2}\|\mathbf{w} - \mathbf{v}\|^2, \quad \forall\, \mathbf{w}, \mathbf{v}.$
Definition 2 (Smoothness).
A differentiable function $f$ is $L$-smooth ($L > 0$) if
(34) $f(\mathbf{w}) \le f(\mathbf{v}) + \langle \nabla f(\mathbf{v}), \mathbf{w} - \mathbf{v} \rangle + \frac{L}{2}\|\mathbf{w} - \mathbf{v}\|^2, \quad \forall\, \mathbf{w}, \mathbf{v}.$