Stochastic gradient-based methods are dominant in optimization for most large-scale machine learning problems, due to their computational simplicity and their compatibility with modern parallel hardware, such as GPUs.
In most cases these methods use over-parametrized models that allow for interpolation, i.e., perfect fitting of the training data. While we do not yet have a full understanding of why such solutions generalize (as indicated by a wealth of empirical evidence, e.g., [22, 2]), we are beginning to recognize their desirable properties for optimization, particularly in the SGD setting.
In this paper, we leverage the power of the interpolated setting to propose MaSS (Momentum-added Stochastic Solver), a stochastic momentum method for efficient training of over-parametrized models. See pseudo code in Appendix A.
The algorithm keeps two variables (weights), $\mathbf{w}_t$ and $\mathbf{u}_t$. These are updated using the following rules:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{u}_t - \eta_1 \tilde\nabla f(\mathbf{u}_t), \qquad \mathbf{u}_{t+1} \leftarrow (1+\gamma)\,\mathbf{w}_{t+1} - \gamma\,\mathbf{w}_t + \underline{\eta_2 \tilde\nabla f(\mathbf{u}_t)}.$$
Here the step-size $\eta_1$, the secondary step-size $\eta_2$, and the acceleration parameter $\gamma$ are fixed hyper-parameters independent of $t$. Updates are executed iteratively until certain convergence criteria are satisfied or a desired number of iterations has been completed. In each iteration, the algorithm takes a noisy first-order gradient $\tilde\nabla f(\mathbf{u}_t)$, estimated from a mini-batch of the training data.
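The update above can be sketched in a few lines of NumPy. The following is our own minimal illustration for the least-squares loss of Section 2, not the authors' reference implementation; the names (`mass_step`, `eta1`, `eta2`, `gamma`) are labels we introduce here:

```python
import numpy as np

def mass_step(w, u, X, y, eta1, eta2, gamma, rng, batch_size=8):
    """One MaSS iteration on the least-squares loss f(w) = (1/2n) sum_i (<w, x_i> - y_i)^2."""
    # Noisy first-order gradient, estimated from a uniformly sampled mini-batch.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ u - yb) / batch_size

    # Descent step from u, as in SGD+Nesterov ...
    w_new = u - eta1 * grad
    # ... plus the compensation term eta2 * grad, which distinguishes MaSS.
    u_new = (1 + gamma) * w_new - gamma * w + eta2 * grad
    return w_new, u_new
```

Note that only the two weight vectors need to be stored, so the memory footprint is essentially that of mini-batch SGD with momentum.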
The algorithm is simple to implement and is nearly as memory- and computation-efficient as standard mini-batch SGD. Unlike many existing accelerated SGD (ASGD) methods (e.g., [8, 1]), it requires no adjustment of hyper-parameters during training and no costly computations of full gradients.
Note that, except for the additional compensation term (underscored in the equation above), the update rules are those of the stochastic variant of the classical Nesterov's method (SGD+Nesterov) with constant coefficients. In Section 3, we provide guarantees of exponential convergence for our algorithm in quadratic as well as more general convex interpolated settings. Moreover, for the quadratic case, we show provable acceleration over SGD. Furthermore, we show that the compensation term is essential to guarantee convergence in the stochastic setting by giving examples (in Subsection 3.3) where SGD+Nesterov without this term provably diverges for a range of parameters, including those commonly used in practice. The observation of non-convergence of SGD+Nesterov is intuitively consistent with the recent notion that "momentum hurts the convergence within the neighborhood of global optima" [10].
In the case of a quadratic objective function, optimal parameter selection for the algorithm, including its dependence on the mini-batch size, can be theoretically derived from our convergence analysis. In particular, in the case of the full batch (i.e., full gradient computation), our analysis suggests setting the compensation parameter to zero, thus reducing (full gradient) MaSS to the classical Nesterov's method. In that case, the convergence rate obtained here for MaSS matches the well-known rate of Nesterov's method [15, 3]. In the mini-batch case our method is an acceleration of SGD. Similarly to the case of SGD, the gains from increasing the mini-batch size follow a "diminishing returns" pattern beyond a certain critical value.
Under certain conditions it can also be shown to have computational advantage over the (full gradient) Nesterov’s method. We give examples of such settings in Section 3. We discuss related work in Section 4.
In Section 5, we provide an experimental evaluation of the performance of MaSS in both convex and non-convex settings. Specifically, we show that MaSS is competitive with or outperforms the popular Adam algorithm and SGD on different neural network architectures, including convolutional networks, ResNet, and fully connected networks. Additionally, we show strong results in the optimization of convex kernel methods, where our analysis provides direct cues for parameter selection.
2 Notations and Preliminaries
We denote column vectors in $\mathbb{R}^d$ by bold-face lowercase letters. $\langle\cdot,\cdot\rangle$ denotes the inner product in a Euclidean space. $\|\cdot\|$ denotes the Euclidean norm of vectors, while $\|\cdot\|_A$ denotes the Mahalanobis norm induced by a positive-definite matrix $A$, i.e., $\|\mathbf{v}\|_A = \sqrt{\mathbf{v}^\top A \mathbf{v}}$.
$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is a dataset, with $\mathbf{x}_i \in \mathbb{R}^d$ the input feature and $y_i \in \mathbb{R}$ the real-valued label. $\mathcal{B} \subset \mathcal{D}$ denotes a mini-batch of the dataset sampled uniformly at random, with batch size $m$. $\nabla f(\mathbf{w})$ denotes the exact gradient of the function $f$ at point $\mathbf{w}$, and $\tilde\nabla_m f(\mathbf{w})$ denotes an unbiased estimate of the gradient evaluated on a mini-batch of size $m$. Without ambiguity, we omit the subscript when we refer to the one-point estimate, i.e., $\tilde\nabla f := \tilde\nabla_1 f$.
We assume the objective function $f(\mathbf{w})$ has the form of a finite sum, $f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n f_i(\mathbf{w})$, where each $f_i$ typically depends only on a single data point $(\mathbf{x}_i, y_i)$. In the case of the least-squares loss, $f_i(\mathbf{w}) = \frac{1}{2}(\langle\mathbf{w}, \mathbf{x}_i\rangle - y_i)^2$. Note that the $\mathbf{x}_i$ could be either the original feature vectors, or ones obtained after certain transformations, e.g., kernel-mapped features or the neurons at a certain layer of a neural network.
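The finite-sum objective and its mini-batch gradient estimate can be sketched as follows. This is our own NumPy illustration of the least-squares setting; the function names are ours:

```python
import numpy as np

def full_loss(w, X, y):
    # f(w) = (1/n) sum_i f_i(w),  with  f_i(w) = 0.5 * (<w, x_i> - y_i)^2
    return 0.5 * np.mean((X @ w - y) ** 2)

def minibatch_grad(w, X, y, batch_size, rng):
    # Unbiased estimate of grad f(w) from a uniformly sampled mini-batch:
    # (1/m) sum_{i in batch} (<w, x_i> - y_i) x_i
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size
```

With `batch_size = n`, the estimate coincides with the exact gradient, matching the full-batch limit discussed later.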
We also use the concepts of strong convexity and smoothness of functions, see definitions in Appendix B.
2.1 Preliminaries for Least Square Objective Functions
Within the scope of least-squares objective functions, we introduce the following notation. The Hessian matrix of the objective function is defined as $H := \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top$. $\tilde H_m$ denotes the unbiased estimate of $H$ based on a mini-batch of size $m$; in particular, $\tilde H := \tilde H_1 = \mathbf{x}\mathbf{x}^\top$ is the one-sample estimate of $H$.
We assume that the least-squares objective function is strongly convex on the span of the data; let $\mu > 0$ denote the minimum positive Hessian eigenvalue. Note that mini-batch gradients are always perpendicular to the eigen-directions with zero Hessian eigenvalues, so no actual update happens along such eigen-directions. Hence, without loss of generality, we can assume that the matrix $H$ has no zero eigenvalues.
Following , we define the computational and statistical condition numbers $\kappa_1$ and $\tilde\kappa_1$ as follows. Let $L_1$ be the smallest positive number such that
$$\mathbb{E}\big[\|\mathbf{x}\|^2\,\mathbf{x}\mathbf{x}^\top\big] \preceq L_1 H.$$
The (computational) condition number is defined to be $\kappa_1 := L_1/\mu$. The statistical condition number $\tilde\kappa_1$ is defined to be the smallest positive number such that
$$\mathbb{E}\big[\|\mathbf{x}\|^2_{H^{-1}}\,\mathbf{x}\mathbf{x}^\top\big] \preceq \tilde\kappa_1 H.$$
The quadratic loss function $f$ is $L$-smooth, where $L := \lambda_{\max}(H)$ is the largest eigenvalue of $H$.
The smallest positive number $L$ such that $H \preceq L\,I$ is the largest eigenvalue $\lambda_{\max}(H)$.
Hence $\nabla^2 f = H \preceq L\,I$, implying that $f$ is $L$-smooth. ∎
The (computational) condition number $\kappa_1$ defined above is always at least as large as the usual condition number $\kappa := L/\mu$. Indeed, $\mathbb{E}[\|\mathbf{x}\|^2\mathbf{x}\mathbf{x}^\top] = \mathbb{E}[(\mathbf{x}\mathbf{x}^\top)^2] \succeq (\mathbb{E}[\mathbf{x}\mathbf{x}^\top])^2 = H^2$, which forces $L_1 \ge L$ and hence $\kappa_1 \ge \kappa$.
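The relation $L_1 \ge L$ can also be checked numerically on sampled data. The sketch below (our own construction, not from the paper) estimates the smallest $L_1$ satisfying $\mathbb{E}[\|\mathbf{x}\|^2\mathbf{x}\mathbf{x}^\top] \preceq L_1 H$ via a whitened eigenvalue computation:

```python
import numpy as np

def smoothness_constants(X):
    """Empirical L = lambda_max(H) and the smallest L1 with E[||x||^2 x x^T] <= L1 * H."""
    n = len(X)
    H = X.T @ X / n                                           # Hessian E[x x^T]
    M = (X * (X ** 2).sum(axis=1, keepdims=True)).T @ X / n   # E[||x||^2 x x^T]
    evals, evecs = np.linalg.eigh(H)
    H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # M <= L1 * H  iff  H^{-1/2} M H^{-1/2} <= L1 * I.
    L1 = np.linalg.eigvalsh(H_inv_sqrt @ M @ H_inv_sqrt)[-1]
    return evals[-1], L1
```

On any dataset with a non-degenerate empirical Hessian, the returned `L1` is at least `L`, as the argument above guarantees.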
The following lemma is useful for dealing with the mini-batch scenario:
2.2 Interpolation and Automatic Variance Reduction
Interpolation is the setting in which the model fits the training data exactly. In other words, there exists $\mathbf{w}^*$ in the parameter space such that $f_i(\mathbf{w}^*) = 0$ for all $i \in \{1, \dots, n\}$. It follows immediately that the training loss $f(\mathbf{w}^*) = 0$. For least-squares objective functions, interpolation implies that the linear system $\{\langle\mathbf{w}, \mathbf{x}_i\rangle = y_i\}_{i=1}^n$ has at least one solution (the so-called realizable case).
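The realizable, over-parameterized case is easy to illustrate numerically (synthetic data of our own choosing): with more parameters than samples, the linear system is solvable and the training loss at the interpolating solution vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # over-parameterized: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)              # even arbitrary labels can be interpolated

# Minimum-norm solution of the underdetermined linear system X w = y.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
train_loss = 0.5 * np.mean((X @ w_star - y) ** 2)   # zero up to round-off
```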
Denote the solution set by $\mathcal{W}^*$; obviously $\mathcal{W}^* \neq \emptyset$ in the interpolation regime. Given any $\mathbf{w}$, we denote its closest solution by $\mathbf{w}^* := \arg\min_{\mathbf{v} \in \mathcal{W}^*} \|\mathbf{w} - \mathbf{v}\|$, and define the error $\mathbf{e} := \mathbf{w} - \mathbf{w}^*$. One should be aware that different $\mathbf{w}$ may correspond to different $\mathbf{w}^*$. In particular, for linear regression, $\mathcal{W}^*$ is an affine subspace of $\mathbb{R}^d$, and gradients are always perpendicular to $\mathcal{W}^*$. Therefore, we can assume without loss of generality that $H$ has no zero eigenvalues.
In the interpolation regime, one can always write the least-squares loss as
$$f(\mathbf{w}) = \tfrac{1}{2}\,\|\mathbf{w} - \mathbf{w}^*\|_H^2.$$
A key property of interpolation is that the variance of the stochastic gradient of $f$ decreases to zero as the weight $\mathbf{w}$ approaches an optimal solution $\mathbf{w}^*$.
Proposition 2 (Automatic Variance Reduction).
For the least-squares objective function, the stochastic gradient at an arbitrary point $\mathbf{w}$ can be written as
$$\tilde\nabla f(\mathbf{w}) = \tilde H\,(\mathbf{w} - \mathbf{w}^*). \quad (9)$$
Moreover, the variance of the stochastic gradient scales with the error:
$$\mathbb{E}\big[\|\tilde\nabla f(\mathbf{w})\|^2\big] = (\mathbf{w}-\mathbf{w}^*)^\top\, \mathbb{E}[\tilde H^2]\, (\mathbf{w}-\mathbf{w}^*).$$
Eq.(9) directly follows from the fact that $y_i = \langle\mathbf{w}^*, \mathbf{x}_i\rangle$ for every data point. ∎
Since $\mathbb{E}[\tilde H^2]$ is a constant matrix, the above proposition reveals a linear dependence of the variance of the stochastic gradient on the squared norm of the error, $\|\mathbf{w} - \mathbf{w}^*\|^2$. Therefore, the closer $\mathbf{w}$ is to $\mathbf{w}^*$, the more exact the gradient estimate is, which in turn helps convergence of stochastic gradient-based algorithms near the optimal solution. This observation underlies the exponential convergence of SGD in certain convex settings [20, 12, 19, 13, 11].
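The automatic variance reduction is easy to observe numerically. The sketch below (our own construction) estimates the trace of the covariance of the one-sample stochastic gradient at points increasingly close to an interpolating solution:

```python
import numpy as np

def grad_variance(w, X, y):
    """Trace of the covariance of the one-sample stochastic gradient at w."""
    per_sample = (X @ w - y)[:, None] * X        # row i: (<w, x_i> - y_i) x_i
    mean = per_sample.mean(axis=0)
    return ((per_sample - mean) ** 2).sum(axis=1).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_star = rng.normal(size=5)
y = X @ w_star                                   # interpolation: zero loss at w_star

e = rng.normal(size=5)                           # a fixed error direction
# Variance scales with the squared distance to w_star and vanishes at w_star.
variances = [grad_variance(w_star + t * e, X, y) for t in (1.0, 0.1, 0.01)]
```

Because each per-sample gradient is linear in the error, the measured variance shrinks by exactly a factor of 100 each time the distance to $\mathbf{w}^*$ shrinks by a factor of 10.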
2.3 An Equivalent Form of MaSS and Hyper-parameters
We can rewrite MaSS in the following equivalent form, which is more convenient for analysis. Introducing an additional variable $\mathbf{v}_t$, the update rules of MaSS can be written as:
There is a bijection between the two sets of hyper-parameters, which is given by:
We will use one parameterization or the other depending on the setting: the equivalent form for the theoretical analysis, and the original form when reporting experiments and discussing optimal hyper-parameter selection.
When the compensation term is switched off, as is the case for SGD+Nesterov, the corresponding hyper-parameter in the equivalent form is no longer free but is fixed by the remaining ones.
3 Convergence Analysis and Hyper-parameters Selection
In this section, we analyze the convergence of MaSS for convex objective functions in the interpolation regime. The section is organized as follows: in Subsection 3.1, we prove convergence of MaSS for the least-squares objective function and derive optimal hyper-parameters (including their mini-batch dependence) to obtain provable acceleration over mini-batch SGD. In Subsection 3.2, we extend the analysis to more general strongly convex functions. In Subsection 3.3, we discuss the importance of the compensation term by showing the non-convergence of SGD+Nesterov for a range of hyper-parameters.
3.1 Acceleration in the Least Square Setting
Based on the equivalent form of MaSS in Eq.(11), the following theorem shows that, for strongly convex least square loss in the interpolation regime, MaSS is guaranteed to have exponential convergence when hyper-parameters satisfy certain conditions.
Theorem 1 (Convergence of MaSS).
Suppose the objective function $f$ is in the interpolation regime. Let $\mu$ be the smallest positive eigenvalue of the Hessian and let $L_1$ be as defined in Eq.(3). In MaSS with mini-batches of size $m$, if the hyper-parameters satisfy the following conditions:
then, after $t$ iterations,
where $C$ is a finite constant.
Since the rate parameter lies in the interval $(0, 1)$, this theorem implies exponential convergence of MaSS.
Proof of Theorem 1.
Setting the free parameter to the value that maximizes the right-hand side of the inequality, we obtain the optimal selection. Note that this setting determines the remaining hyper-parameters uniquely via the conditions in Eq.(13). By Eq.(32), the optimal selection of the compensation parameter would be:
This quantity is usually larger than 1, which implies that the coefficient of the compensation term is non-negative. Note that gradient terms usually have negative coefficients, so that the weights move in directions opposite to the gradients. The non-negative coefficient indicates that the weight is "over-descended" in SGD+Nesterov and needs to be compensated along the gradient direction.
With such an optimal hyper-parameter selection, we have the following theorem:
In the case of mini-batches of size 1, the asymptotic convergence rate is $O(e^{-t/\sqrt{\kappa_1\tilde\kappa_1}})$, which is faster than the $O(e^{-t/\kappa_1})$ rate of SGD [11].
Remark 7 (MaSS reduces to Nesterov’s method in full batch).
In the limit of full batch, $m = n$, the optimal parameter selection in Eq.(20) reduces to
It is interesting to observe that, in the full batch scenario, the compensation term is suggested to be turned off, and the remaining hyper-parameters are suggested to be the same as those of full-gradient Nesterov's method. Hence MaSS with the optimal hyper-parameter selection reduces to Nesterov's method in the full batch limit. Moreover, the convergence rate in Theorem 2 reduces to $O(e^{-t/\sqrt{\kappa}})$, which is exactly the well-known convergence rate of Nesterov's method.
Remark 8 (Diminishing returns and the critical mini-batch size).
Recall that the effective condition number decreases monotonically as the mini-batch size $m$ increases. This fact implies that a larger $m$ always results in faster convergence per iteration. However, the improvement in per-iteration convergence saturates as $m$ approaches a critical mini-batch size; beyond it, increasing $m$ further yields only a bounded additional speed-up. This phenomenon parallels the mini-batch convergence saturation of ordinary SGD analyzed in [11].
Example 1 (Gaussian Distributed Data).
Suppose the data feature vectors $\mathbf{x}$ are zero-mean Gaussian distributed. Then, by the fact that $\mathbb{E}[z_1^2 z_2^2] = \mathbb{E}[z_1^2]\,\mathbb{E}[z_2^2] + 2\,(\mathbb{E}[z_1 z_2])^2$ for zero-mean jointly Gaussian random variables $z_1$ and $z_2$, we have
$$\mathbb{E}\big[\|\mathbf{x}\|^2\,\mathbf{x}\mathbf{x}^\top\big] = \operatorname{tr}(H)\,H + 2H^2 \preceq (d+2)\,L\,H,$$
where $d$ is the dimension of the feature vectors. Hence $L_1 \le (d+2)L$, $\kappa_1 \le (d+2)\kappa$, and $\tilde\kappa_1 = d+2$. This implies a convergence rate of $O(e^{-t/((d+2)\sqrt{\kappa})})$ for MaSS when the batch size is 1. In particular, if the feature vectors are $n$-dimensional, e.g., as in kernel learning, then MaSS with mini-batches of size 1 has a convergence rate of $O(e^{-t/((n+2)\sqrt{\kappa})})$.
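The Gaussian fourth-moment identity underlying this example, $\mathbb{E}[\mathbf{x}\mathbf{x}^\top A\,\mathbf{x}\mathbf{x}^\top] = 2HAH + \operatorname{tr}(AH)H$ (so with $A = I$, $\mathbb{E}[\|\mathbf{x}\|^2\mathbf{x}\mathbf{x}^\top] = \operatorname{tr}(H)H + 2H^2$), can be verified by a quick Monte Carlo check of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 400_000
A = rng.normal(size=(d, d))
H = A @ A.T / d + np.eye(d)          # a non-trivial covariance / Hessian
X = rng.multivariate_normal(np.zeros(d), H, size=n)

# Monte Carlo estimate of E[||x||^2 x x^T] ...
M_hat = (X * (X ** 2).sum(axis=1, keepdims=True)).T @ X / n
# ... versus the closed form for zero-mean Gaussians: tr(H) H + 2 H^2.
M_exact = np.trace(H) * H + 2 * H @ H
```

With a few hundred thousand samples, the two matrices agree to within Monte Carlo error.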
3.2 Convergence Analysis for Strongly Convex Objective
We now extend the analysis to strongly convex objective functions in the interpolation regime. First, we extend the definition of $L_1$ to general convex functions:
It can be shown that this definition of $L_1$ is consistent with the definition in the quadratic case, see Eq.(3). We further assume that the objective function $f$ is $\mu$-strongly convex and that there exists a $\mathbf{w}^*$ such that $f_i(\mathbf{w}^*) = 0$ for all $i$, i.e., $f$ is in the interpolation regime.
Suppose there exists a -strongly convex and -smooth non-negative function such that and , for some . In MaSS, if the hyper-parameters are set to be:
then, after $t$ iterations,
where $C$ is a finite constant.
See Appendix C. ∎
When $f$ is the least-squares objective function, the assumptions made in Theorem 3 are satisfied.
3.3 On the importance of the compensation term
In this section we show, both theoretically and empirically, that the compensation term is needed to ensure convergence: SGD+Nesterov, which lacks this term, can diverge for a range of hyper-parameters, including those commonly used in practice. In contrast, the classical (full batch) Nesterov method is well known to converge.
Another view of SGD+Nesterov.
Before going into the technical details, we show how Theorem 1 provides intuition for the divergence of SGD+Nesterov. In short, the hyper-parameters corresponding to SGD+Nesterov fall outside the convergence range given by Theorem 1. Specifically, since SGD+Nesterov has no compensation term, the corresponding parameter in Eq.(31b) is fixed rather than free. When the step-size and acceleration parameter are set to their optimal values, as in SGD+Nesterov, a smaller batch size results in a larger deviation of the fixed parameter from its admissible range. It is easy to check that such a hyper-parameter setting for SGD+Nesterov does not satisfy the conditions of Theorem 1, except in the full batch case.
We now present a family of examples on which SGD+Nesterov diverges.
Example: 2-dimensional component-decoupled data.
Fix an arbitrary $\kappa > 1$ and let $z$ be randomly drawn from the zero-mean Gaussian distribution with variance $\sigma^2$, i.e., $z \sim \mathcal{N}(0, \sigma^2)$. The data points are constructed as follows:
where $\mathbf{e}_1$ and $\mathbf{e}_2$ are the canonical basis vectors of $\mathbb{R}^2$.
Note that the corresponding least-squares loss function on this data is in the interpolation regime. The Hessian and stochastic Hessian matrices turn out to be
Note that the stochastic Hessian is diagonal, which implies that stochastic gradient-based algorithms applied to this data evolve independently along each axis-parallel direction. This allows a simplified per-direction analysis of the algorithms.
Here we list some results used later in the analysis. By the fourth-moment identity for the Gaussian variable $z$, $\mathbb{E}[z^4] = 3\sigma^4$; this yields the per-coordinate moments below, where the superscript indexes the coordinates of $\mathbb{R}^2$. The computational and statistical condition numbers can then be computed in closed form.
Analysis of SGD+Nesterov.
Theorem 4 (Divergence of SGD+Nesterov).
Let the step-size and acceleration parameter lie in the stated range. Then SGD+Nesterov, when initialized at a point $\mathbf{w}_0$ satisfying the stated non-degeneracy condition, diverges in expectation on the least-squares loss with the 2-d component-decoupled data defined in Eq.(28).
See Appendix D. ∎
When $\mathbf{w}_0$ is randomly initialized over the parameter space, this condition is satisfied with probability 1, since the complementary cases form a lower-dimensional manifold (a straight line in this case), which has measure zero.
We provide numerical validation of the divergence of SGD+Nesterov, as well as the faster convergence of MaSS, by training linear regression models on synthetic datasets. Figure 1 presents the training curves of different optimization algorithms on a realization of the component-decoupled data. The batch size is 1 for all algorithms. The hyper-parameters of MaSS are set as suggested by Eq.(20). The hyper-parameters of SGD+Nesterov are identical to those of MaSS, but with the compensation term turned off. The step-size of SGD is the same as that of MaSS.
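A compact version of such a synthetic experiment can be scripted as follows. This is our own sketch on simple centered-Gaussian data with conservative hand-picked hyper-parameters (not the exact settings of Eq.(20)); it runs MaSS, plain SGD, or SGD+Nesterov (MaSS with the compensation term switched off) side by side. With these mild settings SGD+Nesterov need not diverge; divergence arises for the data and parameter ranges of Theorem 4.

```python
import numpy as np

def run(optimizer, X, y, steps=500, eta1=0.02, eta2=0.01, gamma=0.5, seed=0):
    """Train linear regression with batch size 1; returns the loss trajectory."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    u = w.copy()
    losses = []
    for _ in range(steps):
        i = rng.integers(len(X))
        grad = (X[i] @ u - y[i]) * X[i]                  # one-sample stochastic gradient
        w_new = u - eta1 * grad
        if optimizer == "sgd":
            u_new = w_new                                # plain SGD: no momentum
        else:
            comp = eta2 if optimizer == "mass" else 0.0  # 0 => SGD+Nesterov
            u_new = (1 + gamma) * w_new - gamma * w + comp * grad
        w, u = w_new, u_new
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return losses
```

Plotting the three trajectories on a log-scaled vertical axis reproduces the qualitative behavior shown in Figure 1.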
We provide additional numerical validation on synthetic data in Appendix E, which includes more realizations of the component decoupled data and centered Gaussian distributed data.
It can be seen that MaSS with the suggested parameter selection indeed converges faster than SGD, and that SGD+Nesterov diverges, even though the parameters are set to values at which the full-gradient Nesterov's method is guaranteed to converge quickly. Since MaSS differs from SGD+Nesterov only by the compensation term, this experiment illustrates the importance of that term. Note that the vertical axes are log-scaled, so the linear decrease of the log-loss in the plots corresponds to an exponential decrease of the loss, with slopes given by the coefficients in the exponents.
4 Related Work
Over-parameterized models have drawn increasing attention in the literature, as many modern machine learning models, especially neural networks, are over-parameterized and show strong generalization performance in practice [16, 22]. Over-parameterized models usually result in a nearly perfect fit (or interpolation) of the training data [22, 18, 2]. As discussed in Subsection 2.2, interpolation helps convergence of SGD-based algorithms.
There is a large body of work on combining stochastic gradients with momentum in order to achieve accelerated convergence over SGD, including [6, 8, 17, 1]. This line of work has been particularly active after [21] demonstrated empirically that SGD with momentum achieves strong performance in training deep neural networks.
The work  is probably the most closely related to MaSS. The algorithm proposed there has a similar form to MaSS in Eq.(11), but with an extra hyper-parameter, a different way of setting the hyper-parameters, and a tail-averaging step at the output stage. In the interpolation setting, their algorithm achieves a certain convergence rate for the square loss when tail-averaging is taken over the last iterations, and a slower rate when not averaging; this compares to the rate obtained in our setting.
Another line of work concerns adaptive-gradient methods such as Adam and AMSGrad. These methods adaptively adjust the step size according to a weight-decayed accumulation of the gradient history, and show strong empirical performance on modern deep learning models. The authors also proved convergence of Adam and AMSGrad in the convex case, under a time-dependent learning rate decay. Note that in practical implementations the rate is typically not decayed.
The Katyusha algorithm introduced the so-called "Katyusha momentum", which is computed based on a snapshot of the full gradient, in order to reduce the variance of noisy gradients. The Katyusha method has a provable upper bound on the time complexity needed to reach a given error in training loss, but it requires computing full gradients in the course of the algorithm.
5 Empirical Evaluation
In this section, we report empirical evaluation results of the proposed algorithm, MaSS, on real-world datasets. We demonstrate that MaSS has strong optimization performance on both over-parameterized neural network architectures and kernel machines.
5.1 Evaluation on Neural Networks
We investigate three types of architectures: fully-connected network (FCN), convolutional neural network (CNN) and residual network (ResNet). For each type of network, we compare the optimization and generalization performance of MaSS with that of Adam and SGD.
For all experiments, we tune and report the hyper-parameter settings of MaSS. We use constant step-size SGD with the same step-size as MaSS. For Adam, the constant step-size parameter is optimized by grid search. We use the Adam hyper-parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which are the values typically used in training deep neural networks. All algorithms are implemented with mini-batches of the same size.
We train a fully-connected neural network with 3 hidden layers, each with 80 ReLU-activated neurons, on the MNIST dataset, which has 60,000 training images and 10,000 test images of size $28 \times 28$. After each hidden layer, there is a dropout layer with keep probability 0.5. This network takes 784-dimensional vectors as input, and has 10 softmax-activated output neurons. The network has 76,570 trainable parameters in total.
We solve the classification problem by optimizing the categorical cross-entropy loss. We use the following hyper-parameter setting in MaSS:
Curves of the training loss and test accuracy for different optimizers are shown in Figure 2.
We consider the image classification problem with convolutional neural networks on the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images of size $32 \times 32$. Our CNN has two convolutional layers with 64 channels each and no padding. Each convolutional layer is followed by a max pooling layer with stride 2. On top of the last max pooling layer, there is a fully-connected ReLU-activated layer of size 64, followed by an output layer of size 10 with softmax non-linearity. A dropout layer with keep probability 0.5 is applied after the fully-connected layer. This CNN architecture has 210,698 trainable parameters in total.
Again, we minimize the categorical cross-entropy loss. We use the following hyper-parameter setting in MaSS:
See Figure 3 for the performance of different algorithms.
Finally, we train a residual network with 38 layers for the multiclass classification problem on CIFAR-10. The ResNet we use has a sequence of 18 residual blocks, arranged in three groups of 6 blocks, with each group producing outputs of a different shape. On top of these blocks, there is an average pooling layer with stride 2, followed by an output layer of size 10 with softmax non-linearity; we use the cross-entropy loss for optimization. This ResNet has 595,466 trainable parameters in total.
Figure 4 shows the performance of MaSS, Adam and SGD. In this experiment, we use the following hyper-parameter setting for MaSS:
5.2 Evaluation on Kernel Regression Models
Linear regression with kernel-mapped features is a convex optimization problem. We solve it with two different types of features: Gaussian kernel features and Laplacian kernel features, separately. We randomly subsample 2,000 MNIST training images as the training set, and use the MNIST test images as the test set. We generate kernel features using Gaussian and Laplacian kernels, treating each training point as a kernel center; the bandwidth is set to 5 for both kernels. We train the linear regression models with the kernel features as inputs and one-hot-encoded labels as the desired outputs. The training objective is to minimize the mean squared error (MSE) between the model outputs and the one-hot-encoded ground-truth labels. In each kernel regression task, the number of trainable parameters is $2000 \times 10 = 20{,}000$.
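The kernel feature construction can be sketched as follows (our own implementation; the bandwidth convention here is $\exp(-\|x-z\|^2/(2s^2))$ for the Gaussian kernel and $\exp(-\|x-z\|/s)$ for the Laplacian, which may differ from the paper's exact convention):

```python
import numpy as np

def gaussian_features(X, centers, s=5.0):
    # K(x, z) = exp(-||x - z||^2 / (2 s^2)), one feature per kernel center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

def laplacian_features(X, centers, s=5.0):
    # K(x, z) = exp(-||x - z|| / s)
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / s)
```

With the 2,000 training points as centers and 10 one-hot outputs, the resulting linear model has the $2000 \times 10$ trainable parameters described above.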
We use the following hyper-parameter setting in MaSS, for both kernel regression tasks:
The training curves of MaSS, Adam and SGD are shown in Figure 5.
We thank NSF for financial support. We thank Xiao Liu for helping with the empirical evaluation of our proposed method and useful discussions.
-  Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
-  Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
-  Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
-  Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
-  Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. International Conference on Learning Representations (ICLR), 2018.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm. Mathematics of Computation, 85(297):153–178, 2016.
-  Tianyi Liu, Zhehui Chen, Enlu Zhou, and Tuo Zhao. Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. arXiv preprint arXiv:1802.05155, 2018.
-  Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. International Conference on Machine Learning (ICML), 2018.
-  Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
-  Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
-  Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
-  Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
-  Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
-  Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. International Conference on Learning Representations (ICLR), 2018.
-  Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
-  Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
-  Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
-  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017.
Appendix A Pseudocode for MaSS
Note that the proposed algorithm initializes the two variables $\mathbf{w}$ and $\mathbf{u}$ with the same vector, which can be randomly generated.
As discussed in subsection 2.3, MaSS can be equivalently implemented using the following update rules:
In this case, all three variables should be initialized with the same vector.
There is a bijection between the two sets of hyper-parameters, which is given by:
Appendix B Strong Convexity and Smoothness of Functions
Definition 1 (Strong Convexity).
A differentiable function $f$ is $\mu$-strongly convex ($\mu > 0$) if, for all $\mathbf{w}, \mathbf{v}$,
$$f(\mathbf{w}) \ge f(\mathbf{v}) + \langle \nabla f(\mathbf{v}), \mathbf{w} - \mathbf{v}\rangle + \tfrac{\mu}{2}\|\mathbf{w} - \mathbf{v}\|^2.$$
Definition 2 (Smoothness).
A differentiable function $f$ is $L$-smooth ($L > 0$) if, for all $\mathbf{w}, \mathbf{v}$,
$$f(\mathbf{w}) \le f(\mathbf{v}) + \langle \nabla f(\mathbf{v}), \mathbf{w} - \mathbf{v}\rangle + \tfrac{L}{2}\|\mathbf{w} - \mathbf{v}\|^2.$$
Appendix C Proof of Theorem 3
The update rule for variable is, as in Eq.(31b):
By -strong convexity of , we have
Taking expectation on both sides, we get
where in the last inequality we used the convexity of , the assumption in the Theorem and the definition of for general convex functions, see Eq.(25).
By the strong convexity of ,
On the other hand,