# MaSS: an Accelerated Stochastic Method for Over-parametrized Learning

In this paper we introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parameterized networks. Our method is simple and efficient to implement and does not require changing parameters or computing full gradients in the course of optimization. We provide a detailed theoretical analysis for convergence and parameter selection including their dependence on the mini-batch size in the quadratic case. We also provide theoretical convergence results for a more general convex setting. We provide an experimental evaluation showing strong performance of our method in comparison to Adam and SGD for several standard architectures of deep networks including ResNet, convolutional and fully connected networks. We also show its performance for convex kernel machines.


## 1 Introduction

Stochastic gradient based methods are dominant in optimization for most large-scale machine learning problems, due to the simplicity of computation and their compatibility with modern parallel hardware, such as GPUs.

In most cases these methods use over-parametrized models allowing for interpolation, i.e., perfect fitting of the training data. While we do not yet have a full understanding of why these solutions generalize (as indicated by a wealth of empirical evidence, e.g., [22, 2]), we are beginning to recognize their desirable properties for optimization, particularly in the SGD setting [11].

In this paper, we leverage the power of the interpolated setting to propose MaSS (Momentum-added Stochastic Solver), a stochastic momentum method for efficient training of over-parametrized models. See pseudo code in Appendix A.

The algorithm keeps two variables (weights), w and u. These are updated using the following rules:

 wt+1 ← ut − η1~∇f(ut),  ut+1 ← (1+γ)wt+1 − γwt + η2~∇f(ut), (1)

where the last term, η2~∇f(ut), is the compensation term. Here the step-size η1, the secondary step-size η2 and the acceleration parameter γ are fixed hyper-parameters independent of the iteration number t. Updates are executed iteratively until certain convergence criteria are satisfied or a desired number of iterations has been completed. In each iteration, the algorithm takes a noisy first-order gradient ~∇f(ut), estimated from a mini-batch of the training data.
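The two-line update above can be sketched in a few lines of NumPy. This is a minimal sketch: the function name and the hyper-parameter values in the usage below are our own illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def mass_step(w, u, grad_u, eta1, eta2, gamma):
    """One MaSS iteration.

    w, u    : the two weight vectors kept by the algorithm (w_t and u_t)
    grad_u  : mini-batch gradient estimate evaluated at u_t
    Returns (w_{t+1}, u_{t+1}).
    """
    w_next = u - eta1 * grad_u  # SGD-style descent step from u_t
    # Nesterov-style extrapolation plus the compensation term eta2 * grad_u
    u_next = (1 + gamma) * w_next - gamma * w + eta2 * grad_u
    return w_next, u_next
```

With η2 = 0 this is exactly the stochastic Nesterov update; the compensation term is the only difference.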

The algorithm is simple to implement and is nearly as memory and computationally efficient as the standard mini-batch SGD. Unlike many existing accelerated SGD (ASGD) methods (e.g., [8, 1]), it requires no adjustments of hyper-parameters during training and no costly computations of full gradients.

Note that, except for the additional compensation term η2~∇f(ut), the update rules are those of the stochastic variant of the classical Nesterov’s method (SGD+Nesterov) with constant coefficients [15]. In Section 3, we provide guarantees of exponential convergence for our algorithm in quadratic as well as more general convex interpolated settings. Moreover, for the quadratic case, we show provable acceleration over SGD. Furthermore, we show that the compensation term is essential to guarantee convergence in the stochastic setting by giving examples (in Subsection 3.3) where SGD+Nesterov without this term provably diverges for a range of parameters, including those commonly used in practice. The observation of non-convergence of SGD+Nesterov is intuitively consistent with the recent notion that “momentum hurts the convergence within the neighborhood of global optima” ([10]).

In the case of the quadratic objective function, optimal parameter selection for the algorithm, including their dependence on the mini-batch size, can be theoretically derived from our convergence analysis. In particular, in the case of the full batch (i.e., full gradient computation) our optimal parameter setting suggests setting the compensation parameter to zero, thus reducing (full gradient) MaSS to the classical Nesterov’s method. In that case, the convergence rate obtained here for MaSS matches the well-known rate of the Nesterov’s method [15, 3]. In the mini-batch case our method is an acceleration of SGD. Similarly to the case of SGD [11], the gains from increasing the minibatch size have a “diminishing returns” pattern beyond a certain critical value.

Under certain conditions MaSS can also be shown to have a computational advantage over the (full-gradient) Nesterov method. We give examples of such settings in Section 3. We discuss related work in Section 4.

In Section 5, we provide an experimental evaluation of the performance of MaSS in both convex and non-convex settings. Specifically, we show that MaSS is competitive with or outperforms the popular Adam algorithm [8] and SGD on different architectures of neural networks, including convolutional networks, ResNet and fully connected networks. Additionally, we show strong results in optimization of convex kernel methods, where our analysis provides direct cues for parameter selection.

## 2 Notations and Preliminaries

We denote column vectors in ℝd by bold-face lowercase letters. ⟨·,·⟩ denotes the inner product in a Euclidean space. ∥·∥ denotes the Euclidean norm of vectors, while ∥·∥A denotes the Mahalanobis norm induced by a positive-definite matrix A, i.e. ∥x∥A := √(xTAx).

D = {(xi, yi)}ni=1 is a dataset, with xi ∈ ℝd the input features and yi ∈ ℝ the real-valued labels. ~Dm denotes a mini-batch of the dataset sampled uniformly at random, with batch size m. ∇f(w) denotes the exact gradient of the function f at point w, and ~∇mf(w) denotes an unbiased estimate of the gradient evaluated based on a mini-batch of size m. Without ambiguity, we omit the subscript m when we refer to the one-point estimate, i.e. ~∇f := ~∇1f.

We assume the objective function f to be of finite-sum form,

 f(w) = (1/n)∑ni=1 fi(w), (2)

where each fi typically depends only on a single data point (xi, yi). In the case of the least square loss, fi(w) = ½(⟨w, xi⟩ − yi)2. Note that the features xi could be either the original feature vectors, or ones obtained after certain transformations, e.g. kernel-mapped features, or neurons at a certain layer of a neural network.

We also use the concepts of strong convexity and smoothness of functions, see definitions in Appendix B.

### 2.1 Preliminaries for Least Square Objective Functions

Within the scope of least square objective functions, we introduce the following notations. The Hessian matrix of the objective function is defined as H := E~x∼D[~x~xT]. ~Hm denotes the unbiased estimate of H based on a mini-batch of size m; particularly, ~H := ~H1 = ~x~xT is the one-sample estimate of H.

We assume that the least square objective function is strongly convex on the span of the data, i.e. the smallest positive Hessian eigenvalue μ > 0. Note that mini-batch gradients are always perpendicular to the eigen-directions with zero Hessian eigenvalues, so no actual update happens along such eigen-directions. Hence, without loss of generality, we can assume that the matrix H has no zero eigenvalues.

Following [6], we define the computational and statistical condition numbers κ and ~κ as follows. Let L be the smallest positive number such that

 E[∥~x∥2~x~xT] ⪯ LH, ~x ∼ D. (3)

The (computational) condition number is defined to be κ := L/μ. The statistical condition number ~κ is defined to be the smallest positive number such that

 E[∥~x∥2H−1~x~xT] ⪯ ~κH. (4)
###### Proposition 1.

The function f is L-smooth, where L is defined as above.

###### Proof.

The smallest positive number c such that f is c-smooth is the largest Hessian eigenvalue λ1(H), and

 E[∥~x∥2~x~xT] = E[~H21] = H2 + E[(~H1−H)2] ⪰ H2. (5)

Hence LH ⪰ H2, i.e. L ≥ λ1(H), implying that f is L-smooth. ∎

If the one-sample estimate is exact, i.e. ~H1 = H, then E[(~H1−H)2] = 0 and L = λ1(H).

###### Remark 2.

The (computational) condition number κ = L/μ defined above is always at least as large as the usual condition number λ1(H)/μ, since L ≥ λ1(H).

###### Remark 3.

It is important to note that ~κ ≤ κ, since H−1 ⪯ (1/μ)I implies E[∥~x∥2H−1~x~xT] ⪯ (1/μ)E[∥~x∥2~x~xT] ⪯ (L/μ)H = κH.

The following lemma is useful for dealing with the mini-batch scenario:

###### Lemma 1.
 E[~HmH−1~Hm] − H ⪯ (1/m)(E[~HH−1~H] − H) ⪯ (1/m)(~κ−1)H. (6)
###### Proof.
 E[~HmH−1~Hm] = (1/m2)E[∑mi=1 ~xi~xiTH−1~xi~xiT + ∑i≠j ~xi~xiTH−1~xj~xjT] ⪯ (1/m)~κH + ((m−1)/m)H. ∎

### 2.2 Interpolation and Automatic Variance Reduction

Interpolation is the setting where the loss is zero at every data point. In other words, there exists w∗ in the parameter space such that

 fi(w∗)=0, ∀i=1,2,⋯,n. (7)

It follows immediately that the training loss f(w∗) = 0. For least square objective functions, interpolation implies that the linear system {⟨w, xi⟩ = yi}ni=1 has at least one solution (the so-called realizable case).

Denote the solution set as W∗. Obviously, W∗ ≠ ∅ in the interpolation regime. Given any w, we denote its closest solution as

 w∗ := argminv∈W∗ ∥w−v∥,

and define the error ϵ := w − w∗. One should be aware that different w may correspond to different w∗. Particularly for linear regression, W∗ is an affine subspace of ℝd, and gradients are always perpendicular to W∗. Therefore, we can assume without loss of generality that H has no zero eigenvalues.

In the interpolation regime, one can always write the least square loss as

 f(w)=12(w−w∗)TH(w−w∗)=12∥w−w∗∥2H. (8)

A key property of interpolation is that the variance of the stochastic gradient of f decreases to zero as the weight w approaches an optimal solution w∗.

###### Proposition 2 (Automatic Variance Reduction).

For the least square objective function, the stochastic gradient at an arbitrary point w can be written as

 ~∇f(w) = ~H(w−w∗) = ~Hϵ. (9)

Moreover, the variance of the stochastic gradient satisfies

 Var[~∇f(w)] ⪯ ∥ϵ∥2E[(~H−H)2]. (10)
###### Proof.

Eq.(9) directly follows from the fact that the labels satisfy yi = ⟨w∗, xi⟩; Eq.(10) then follows since ~∇f(w) − ∇f(w) = (~H−H)ϵ. ∎

Since E[(~H−H)2] is a constant matrix, the above proposition unveils a linear dependence of the variance of the stochastic gradient on the squared norm of the error, ∥ϵ∥2. Therefore, the closer w is to w∗, the more exact the gradient estimate is, which in turn helps the convergence of stochastic gradient based algorithms near the optimal solution. This observation underlies the exponential convergence of SGD in certain convex settings [20, 12, 19, 13, 11].
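The quadratic scaling in Eq.(10) is easy to check numerically. The sketch below (synthetic data and sizes are our own illustrative choices) estimates the total variance of one-sample gradients at two points whose errors differ by a factor of 10; the variance should shrink by roughly a factor of 100.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                       # interpolation: labels exactly realizable

def grad_trace_variance(w, n_samples=20_000):
    """Sum of per-coordinate variances of one-sample gradients at w."""
    idx = rng.integers(0, n, size=n_samples)
    # one-sample gradient of the least square loss: ~x (<w, ~x> - y)
    grads = X[idx] * (X[idx] @ w - y[idx])[:, None]
    return grads.var(axis=0).sum()

eps = np.ones(d)
v_far = grad_trace_variance(w_star + eps)         # error of norm ||eps||
v_near = grad_trace_variance(w_star + 0.1 * eps)  # error 10x smaller
```

At w∗ itself the one-sample gradient is exactly zero for every sample, which is the "automatic variance reduction" of the proposition.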

### 2.3 An Equivalent Form of MaSS and Hyper-parameters

We can rewrite MaSS in the following equivalent form, which is more convenient for analysis. Introducing an additional variable v, the update rules of MaSS can be written as:

 wt+1 ← ut − η~∇f(ut), (11a)
 vt+1 ← (1−α)vt + αut − δ~∇f(ut), (11b)
 ut+1 ← (αvt+1 + wt+1)/(1+α). (11c)

There is a bijection between the hyper-parameters (η1, η2, γ) and (η, α, δ), which is given by:

 γ = (1−α)/(1+α), η1 = η, η2 = (η−αδ)/(1+α). (12)

We will use (η1, η2, γ) or (η, α, δ) depending on the setting: we use (η, α, δ) for the theoretical analysis, and report (η1, η2, γ) when discussing optimal hyper-parameter selection.
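The bijection of Eq.(12) is a one-line computation; a minimal helper (the function name is our own):

```python
def mass_params(eta, alpha, delta):
    """Map the analysis parameters (eta, alpha, delta) of Eq. (11)
    to the implementation parameters (eta1, eta2, gamma) of Eq. (12)."""
    gamma = (1 - alpha) / (1 + alpha)
    eta1 = eta
    eta2 = (eta - alpha * delta) / (1 + alpha)
    return eta1, eta2, gamma
```

In particular, setting δ = η/α makes η2 = 0, which turns the compensation term off and recovers SGD+Nesterov.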

###### Remark 4.

When the compensation term is switched off, i.e., η2 = 0, as in the case of SGD+Nesterov, the hyper-parameter δ is fixed to be η/α.

## 3 Convergence Analysis and Hyper-parameters Selection

In this section, we analyze the convergence of MaSS for convex objective functions in the interpolation regime. The section is organized as follows: in sub-section 3.1, we prove convergence of MaSS for the least square objective function. We derive optimal hyper-parameters (including mini-batch dependence) to obtain provable acceleration over mini-batch SGD. In sub-section 3.2 we extend the analysis to more general strongly convex functions. In section 3.3, we discuss the importance of the compensation term, by showing the non-convergence of SGD+Nesterov for a range of hyper-parameters.

### 3.1 Acceleration in the Least Square Setting

Based on the equivalent form of MaSS in Eq.(11), the following theorem shows that, for strongly convex least square loss in the interpolation regime, MaSS is guaranteed to have exponential convergence when hyper-parameters satisfy certain conditions.

###### Theorem 1 (Convergence of MaSS).

Suppose the objective function f is a least square loss in the interpolation regime. Let μ be the smallest positive eigenvalue of the Hessian H and let L be as defined in Eq.(3). Denote ~κm := (~κ+m−1)/m. In MaSS with mini batches of size m, if the hyper-parameters η, α and δ satisfy the following conditions:

 α/μ ≤ δ,  δ~κm + η2L/α − 2η/α ≤ 0, (13)

then, after t iterations,

 E[∥vt−w∗∥2H−1 + (δ/α)∥wt−w∗∥2] ≤ (1−α)t(∥v0−w∗∥2H−1 + (δ/α)∥w0−w∗∥2). (14)

Consequently,

 ∥wt−w∗∥2≤C⋅(1−α)t, (15)

where C is a finite constant.

###### Remark 5.

Since α is selected from the interval (0,1), this theorem implies an exponential convergence of MaSS.

###### Proof of Theorem 1.

By Eq.(11b), we have

 ∥vt+1−w∗∥2H−1 = ∥(1−α)vt+αut−w∗∥2H−1 (=: ~A) + δ2∥~Hm(ut−w∗)∥2H−1 (=: ~B) − 2δ⟨~Hm(ut−w∗), (1−α)vt+αut−w∗⟩H−1 (=: ~C).

By the convexity of the squared norm and the μ-strong convexity of f (so that H−1 ⪯ (1/μ)I), we get

 E[~A] ≤ (1−α)∥vt−w∗∥2H−1 + α∥ut−w∗∥2H−1 ≤ (1−α)∥vt−w∗∥2H−1 + (α/μ)∥ut−w∗∥2.

Applying Lemma 1 to the term ~B, we have

 E[~B]≤δ2~κm∥ut−w∗∥2H. (16)

Note that E[~Hm] = H and, by Eq.(11c), (1−α)vt + αut − w∗ = ut − w∗ + ((1−α)/α)(ut−wt); then

 E[~C] = −2δ⟨ut−w∗, (1−α)vt+αut−w∗⟩ (17)
 = −2δ⟨ut−w∗, ut−w∗ + ((1−α)/α)(ut−wt)⟩
 = −2δ∥ut−w∗∥2 + ((1−α)/α)δ(∥wt−w∗∥2 − ∥ut−w∗∥2 − ∥wt−ut∥2)
 ≤ ((1−α)/α)δ∥wt−w∗∥2 − ((1+α)/α)δ∥ut−w∗∥2.

Therefore,

 E[∥vt+1−w∗∥2H−1] ≤ (1−α)∥vt−w∗∥2H−1 + ((1−α)/α)δ∥wt−w∗∥2 + (α/μ − ((1+α)/α)δ)∥ut−w∗∥2 + δ2~κm∥ut−w∗∥2H.

On the other hand, by Proposition 1,

 E[∥wt+1−w∗∥2] = E[∥ut−w∗−η~Hm(ut−w∗)∥2] (18)
 ≤ ∥ut−w∗∥2 − 2η∥ut−w∗∥2H + η2L∥ut−w∗∥2H.

Hence,

 E[∥vt+1−w∗∥2H−1 + (δ/α)∥wt+1−w∗∥2] ≤ (1−α)∥vt−w∗∥2H−1 + ((1−α)/α)δ∥wt−w∗∥2 + c1∥ut−w∗∥2 + c2∥ut−w∗∥2H,

where c1 := α/μ − δ and c2 := δ2~κm + δη2L/α − 2ηδ/α. By the conditions in Eq.(13), c1 ≤ 0 and c2 ≤ 0, so the last two terms are non-positive. Hence,

 E[∥vt+1−w∗∥2H−1 + (δ/α)∥wt+1−w∗∥2] ≤ (1−α)(∥vt−w∗∥2H−1 + (δ/α)∥wt−w∗∥2) ≤ (1−α)t+1(∥v0−w∗∥2H−1 + (δ/α)∥w0−w∗∥2). ∎

#### Hyper-parameter Selection.

From Theorem 1, we observe that the convergence rate is determined by 1−α. Therefore, a larger α is preferred for faster convergence. Combining the conditions in Eq.(13), we have

 α2 ≤ (μ/~κm)·η(2−ηL). (19)

By setting η = 1/L, which maximizes the right-hand side of the inequality, we obtain the optimal selection α∗ = 1/√(κ~κm). Note that this setting of η and α determines a unique δ∗ = α∗/μ by the conditions in Eq.(13). By Eq.(32), the optimal selection of (η1, η2, γ) would be:

 η∗1 = 1/L,  η∗2 = η∗1·(√(κ~κm)/(√(κ~κm)+1))·(1−1/~κm),  γ∗ = (√(κ~κm)−1)/(√(κ~κm)+1). (20)

Since ~κm is usually larger than 1, the coefficient η∗2 of the compensation term is non-negative. Note that the gradient terms usually have negative coefficients, so that the weights move in directions opposite to the gradients. The non-negative coefficient η∗2 indicates that the weight is “over-descended” in SGD+Nesterov and needs to be compensated along the gradient direction.
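The optimal selection of Eq.(20) is a closed-form computation; a minimal sketch (function name and the assumption κ̃m = (κ̃+m−1)/m are spelled out in the code):

```python
import math

def optimal_mass_params(L, mu, kappa_tilde, m):
    """Optimal (eta1*, eta2*, gamma*) of Eq. (20), taking kappa = L/mu and
    the mini-batch statistical condition number kappa_m = (kt + m - 1)/m."""
    kappa = L / mu
    kappa_m = (kappa_tilde + m - 1) / m
    s = math.sqrt(kappa * kappa_m)
    eta1 = 1.0 / L
    eta2 = eta1 * (s / (s + 1)) * (1 - 1 / kappa_m)
    gamma = (s - 1) / (s + 1)
    return eta1, eta2, gamma
```

As m → ∞ we have κ̃m → 1, so η2* → 0 and γ* → (√κ−1)/(√κ+1): the full-gradient Nesterov setting.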

With such an optimal hyper-parameter selection, we have the following theorem:

###### Theorem 2.

Under the same assumptions as in Theorem 1, if we set the hyper-parameters of MaSS as in Eq.(20), then after t iterations of MaSS with mini batches of size m,

 ∥wt−w∗∥2≤C⋅(1−1/√κ~κm)t, (21)

where C is a finite constant.

###### Remark 6.

In the case of mini batches of size 1, ~κ1 = ~κ, and the asymptotic convergence rate is (1−1/√(κ~κ))t, which is faster than the corresponding rate (1−1/κ)t of SGD [11], since ~κ ≤ κ.

###### Remark 7 (MaSS reduces to Nesterov’s method in full batch).

In the limit of full batch, m → ∞, we have ~κm → 1, and the optimal parameter selection in Eq.(20) reduces to

 η∗1 = 1/L,  γ∗ = (√κ−1)/(√κ+1),  η∗2 = 0. (22)

It is interesting to observe that, in the full batch scenario, the compensation term is suggested to be turned off, and η∗1 and γ∗ are suggested to be the same as those in the full-gradient Nesterov method. Hence MaSS with the optimal hyper-parameter selection reduces to Nesterov’s method in the limit of full batch. Moreover, the convergence rate in Theorem 2 reduces to (1−1/√κ)t, which is exactly the well-known convergence rate of Nesterov’s method.

###### Remark 8 (Diminishing returns and the critical mini-batch size).

Recall that ~κm = (~κ+m−1)/m; hence ~κm monotonically decreases as the mini-batch size m increases. This fact implies that a larger m always results in a faster convergence rate per iteration. However, improvements in convergence per iteration saturate as m grows beyond the critical value m∗ ≈ ~κ, at which point ~κm approaches its limit of 1. Compared with m = 1, the improvement in the rate exponent 1/√(κ~κm) is at most a factor of √~κ. This phenomenon is parallel to the mini-batch convergence saturation of ordinary SGD analyzed in [11].
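The saturation effect can be seen by tabulating the per-iteration rate exponent 1/√(κ·κ̃m) as a function of m (the values of κ and κ̃ below are illustrative):

```python
import math

def rate_exponent(kappa, kappa_tilde, m):
    """Per-iteration rate exponent 1/sqrt(kappa * kappa_m) from Theorem 2,
    with kappa_m = (kappa_tilde + m - 1) / m."""
    kappa_m = (kappa_tilde + m - 1) / m
    return 1.0 / math.sqrt(kappa * kappa_m)

kappa, kt = 100.0, 8.0
exps = [rate_exponent(kappa, kt, m) for m in (1, 8, 64, 10**6)]
```

Going from m = 1 to m ≈ κ̃ already captures most of the speed-up; the total gain over m = 1 is bounded by √κ̃.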

#### Example 1 (Gaussian Distributed Data).

Suppose the data feature vectors ~x ∈ ℝd are zero-mean Gaussian distributed. Then, by the fact that

 E[z1z2z3z4] = E[z1z2]E[z3z4] + E[z1z3]E[z2z4] + E[z1z4]E[z2z3]

for zero-mean Gaussian random variables z1, z2, z3, z4, we have

 E[~H~H] = (2H + tr(H)I)H, (23)
 E[~HH−1~H] = E[~x~xTH−1~x~xT] = (2+d)H, (24)

where d is the dimension of the feature vectors. Hence L = 2λ1(H) + tr(H) and ~κ = 2+d, and κ = (2λ1(H)+tr(H))/μ. This implies a convergence rate of (1−1/√((2+d)κ))t for MaSS when the batch size is 1. Particularly, if the feature vectors are n-dimensional, e.g. as in kernel learning, then MaSS with mini batches of size 1 has a convergence rate of (1−1/√((2+n)κ))t.
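The identity of Eq.(24) is easy to verify by Monte Carlo. The sketch below uses an arbitrary positive-definite covariance of our own choosing, with a sample size large enough for the fourth-moment estimate to be accurate to a few percent:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)            # arbitrary positive-definite covariance
X = rng.multivariate_normal(np.zeros(d), H, size=200_000)

# empirical E[x x^T H^{-1} x x^T]
Hinv = np.linalg.inv(H)
quad = np.einsum('ni,ij,nj->n', X, Hinv, X)   # x^T H^{-1} x per sample
M = (X * quad[:, None]).T @ X / len(X)        # mean of (x^T H^{-1} x) * x x^T
```

The matrix M should be close to (2+d)H, confirming ~κ = 2+d for Gaussian features.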

### 3.2 Convergence Analysis for Strongly Convex Objective

Now, we extend the analysis to strongly convex objective functions in the interpolation regime. We first extend the definition of L to general convex functions:

 L := inf{c ∈ ℝ | E[∥~∇f(w)∥2] ≤ 2c(f(w)−f∗), ∀w}. (25)

It can be shown that this definition of L is consistent with the definition in the quadratic case, Eq.(3). We further assume that the objective function f is strongly convex and that there exists a w∗ such that fi(w∗) = 0 for all i, i.e. f is in the interpolation regime.

###### Theorem 3.

Suppose there exists a strongly convex and (1/μ)-smooth non-negative function g such that g(w∗) = 0 and ⟨∇g(w), ∇f(u)⟩ ≥ (1−ϵ)⟨w−w∗, u−w∗⟩ for all w, u, for some ϵ ∈ [0,1). In MaSS, if the hyper-parameters are set to be:

 η = 1/(2L),  α = (1−ϵ)/(2κ),  δ = 1/(2L), (26)

then after t iterations,

 E[f(wt)] ≤ C·(1 − (1−ϵ)/(2κ))t, (27)

where C is a finite constant.

###### Proof.

See Appendix C. ∎

###### Remark 9.

When f is the least square objective function, the function g(w) = ½∥w−w∗∥2H−1 satisfies the assumptions made in Theorem 3, with ϵ = 0.

### 3.3 On the importance of the compensation term

In this section we show that the compensation term is needed to ensure convergence. Specifically, we demonstrate theoretically and empirically, by providing a family of examples, that SGD+Nesterov (which lacks the compensation term) diverges for a range of hyper-parameters, including those commonly used in practice. In contrast, the classical (full batch) Nesterov method is well known to converge for those same settings.

#### Another view of SGD+Nesterov.

Before going into the technical details, we show how Theorem 1 provides intuition for the divergence of SGD+Nesterov. In short, the hyper-parameters corresponding to SGD+Nesterov are not in the range for convergence given by Theorem 1.

Specifically, since SGD+Nesterov has no compensation term, δ in Eq.(11b) is fixed to be η/α. When η and α are set to their optimal values, i.e. η = 1/L and α = 1/√(κ~κm), SGD+Nesterov implicitly uses δ = √(κ~κm)/L. Note that a smaller batch size results in a larger deviation of this δ from the range allowed by Theorem 1. It is easy to check that such a hyper-parameter setting of SGD+Nesterov does not satisfy the conditions in Theorem 1, except in the full batch case.

Now, we present a family of examples, where SGD+Nesterov diverges.

#### Example: 2-dimensional component-decoupled data.

Fix an arbitrary w∗ ∈ ℝ2 and let z be randomly drawn from the zero-mean Gaussian distribution with variance 2, i.e. z ∼ N(0,2). The data points (x, y) are constructed as follows:

 x = { σ1·z·e1 w.p. 0.5;  σ2·z·e2 w.p. 0.5 },  and  y = ⟨w∗, x⟩, (28)

where e1, e2 are the canonical basis vectors of ℝ2, and σ1 > σ2 > 0.

Note that the corresponding least square loss function on D is in the interpolation regime, since y = ⟨w∗, x⟩ by construction. The Hessian and stochastic Hessian matrices turn out to be

 H = diag(σ21, σ22),  ~H = diag((x[1])2, (x[2])2). (29)

Note that is diagonal, which implies that stochastic gradient based algorithms applied on this data evolve independently in each axis-parallel direction. This allows a simplified directional analysis on the algorithms applied on it.

Here we list some results used in the analysis below. The fourth moment of the Gaussian variable z ∼ N(0,2) is E[z4] = 12. Hence E[(x[1])4] = 6σ41 and E[(x[2])4] = 6σ42, where the superscript [i] indexes the coordinates in ℝ2. The computational and statistical condition numbers turn out to be κ = 6σ21/σ22 and ~κ = 6, respectively.
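A generator for this dataset can be sketched as follows (the variance-2 Gaussian makes the Hessian exactly diag(σ1², σ2²); the function name and sample sizes are our own):

```python
import numpy as np

def decoupled_data(n, sigma1, sigma2, w_star, rng):
    """Sample the 2-d component-decoupled data of Eq. (28):
    x = sigma_i * z * e_i with probability 1/2 each, z ~ N(0, 2), y = <w*, x>."""
    z = rng.normal(scale=np.sqrt(2.0), size=n)
    axis = rng.integers(0, 2, size=n)                 # which basis vector e_i
    X = np.zeros((n, 2))
    X[np.arange(n), axis] = np.where(axis == 0, sigma1, sigma2) * z
    return X, X @ np.asarray(w_star)

rng = np.random.default_rng(2)
X, y = decoupled_data(400_000, 1.0, 0.5, [1.0, -2.0], rng)
H_emp = X.T @ X / len(X)            # should approach diag(sigma1^2, sigma2^2)
```

Because every sample excites exactly one coordinate, the empirical Hessian is exactly diagonal, which is what makes the per-direction analysis tractable.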

#### Analysis of SGD+Nesterov.

###### Theorem 4 (Divergence of SGD+Nesterov).

Let the step-size η and the acceleration parameter γ be set in a range including the values that are optimal for the full-gradient Nesterov method. Then SGD+Nesterov, when initialized with a w0 satisfying a mild genericity condition (see Appendix D), diverges in expectation on the least square loss with the 2-d component-decoupled data defined in Eq.(28).

###### Proof.

See Appendix D. ∎

###### Remark 10.

When w0 is randomly initialized over the parameter space, the initialization condition of Theorem 4 is satisfied with probability 1, since the complementary cases form a lower-dimensional manifold, a straight line in this case, which has measure 0.

We provide numerical validation of the divergence of SGD+Nesterov, as well as the faster convergence of MaSS, by training linear regression models on synthetic datasets. Figure 1 presents the training curves of different optimization algorithms on a realization of the component-decoupled data. The batch size is 1 for all algorithms. The hyper-parameters of MaSS are set as suggested by Eq.(20), i.e.,

 η = 1/(6σ21),  α = σ2/(6σ1),  δ = 1/(6σ1σ2). (30)

The hyper-parameters of SGD+Nesterov are set to be identical to those of MaSS, but with the compensation term turned off. The step size of SGD is the same as that of MaSS, i.e. η = 1/(6σ21).

We provide additional numerical validation on synthetic data in Appendix E, which includes more realizations of the component decoupled data and centered Gaussian distributed data.

It can be seen that MaSS with the suggested parameter selection indeed converges faster than SGD, and that SGD+Nesterov diverges, even though its parameters are set to values at which the full-gradient Nesterov method is guaranteed to converge quickly. Recall that MaSS differs from SGD+Nesterov by only the compensation term; this experiment illustrates the importance of this term. Note that the vertical axes are log-scaled, so the linear decrease of the log losses in the plots implies an exponential decrease of the loss, with the slopes corresponding to the coefficients in the exponents.
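This kind of comparison is easy to reproduce in a few lines. The sketch below runs MaSS in the equivalent form of Eq.(11), batch size 1, with the Eq.(30) parameters, on one realization of the component-decoupled data; the σ values and sample count are our own illustrative choices, not the paper's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma1, sigma2 = 1.0, 0.25
w_star = np.array([1.0, -1.0])

# one realization of the component-decoupled data of Eq. (28), z ~ N(0, 2)
n = 20_000
z = rng.normal(scale=np.sqrt(2.0), size=n)
axis = rng.integers(0, 2, size=n)
X = np.zeros((n, 2))
X[np.arange(n), axis] = np.where(axis == 0, sigma1, sigma2) * z
y = X @ w_star

# MaSS with parameters from Eq. (30)
eta = 1.0 / (6 * sigma1**2)
alpha = sigma2 / (6 * sigma1)
delta = 1.0 / (6 * sigma1 * sigma2)
w = v = u = np.zeros(2)
for i in range(n):
    g = X[i] * (X[i] @ u - y[i])                  # one-sample gradient at u
    w_new = u - eta * g                           # (11a)
    v = (1 - alpha) * v + alpha * u - delta * g   # (11b)
    u = (alpha * v + w_new) / (1 + alpha)         # (11c)
    w = w_new
final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

Forcing δ = η/α (i.e., dropping the compensation term) with the same η and momentum turns this loop into SGD+Nesterov, which, per Theorem 4, diverges in expectation on the same stream.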

## 4 Related Work

Over-parameterized models have drawn increasing attention in the literature as many modern machine learning models, especially neural networks, are over-parameterized [4] and show strong generalization performance, as has been observed in practice [16, 22]. Over-parameterized models usually result in a nearly perfect fit (or interpolation) of the training data [22, 18, 2]. As discussed in sub-section 2.2, interpolation helps convergence of SGD-based algorithms.

There is a large body of work on combining stochastic gradient with momentum in order to achieve an accelerated convergence over SGD including [6, 8, 17, 1]. This line of work has been particularly active after [21] demonstrated empirically that SGD with momentum achieves strong performance in training deep neural networks.

The work [6] is probably the most closely related to MaSS. The algorithm proposed there has a similar form to MaSS in Eq.(11), but with an extra hyper-parameter, a different way of setting the hyper-parameters, and a tail-averaging step at the output stage. In the interpolation setting, their algorithm achieves an accelerated convergence rate for the square loss when tail-averaging is applied over the last iterations, and a slower rate when not averaging; this compares to the rate of Theorem 2 in our setting.

Perhaps the most practically used ASGD methods are Adam [8] and AMSGrad [17]. These methods adaptively adjust the step size according to a weight-decayed accumulation of the gradient history, and show strong empirical performance on modern deep learning models. The authors also proved convergence of Adam and AMSGrad in the convex case, under the setting of a time-dependent learning rate decay. Note that in practical implementations the rate is typically not decayed.

The Katyusha algorithm [1] introduced the so-called “Katyusha momentum”, which is computed based on a snapshot of the full gradient, to reduce the variance of noisy gradients. The Katyusha method provably achieves an accelerated time-complexity upper bound for reducing the training loss below a prescribed error. This method requires computation of full gradients in the course of the algorithm.

Other works that incorporate momentum into stochastic optimizers include accelerated coordinate descent [14] and accelerated randomized Kaczmarz [9]. Both algorithms include a Nesterov-type momentum term.

## 5 Empirical Evaluation

In this section, we report empirical evaluation results of the proposed algorithm, MaSS, on real-world datasets. We demonstrate that MaSS has strong optimization performance on both over-parameterized neural network architectures and kernel machines.

### 5.1 Evaluation on Neural Networks

We investigate three types of architectures: fully-connected network (FCN), convolutional neural network (CNN) and residual network (ResNet)

[5]. For each type of network, we compare the optimization and generalization performance of MaSS with that of Adam and SGD.

For all the experiments, we tune and report the hyper-parameter settings of MaSS. We use constant step-size SGD with the same step size as MaSS. For Adam, the constant step-size parameter is optimized by grid search. We use the Adam hyper-parameters β1 = 0.9 and β2 = 0.999, which are the values typically used in training deep neural networks. All algorithms are run with mini batches of the same size.

#### Fully-connected Networks.

We train a fully-connected neural network with 3 hidden layers, with 80 RELU-activated neurons in each layer, on the MNIST data set, which has 60,000 training images and 10,000 test images of size 28×28. After each hidden layer, there is a dropout layer with keep probability 0.5. This network takes 784-dimensional vectors as input, and has 10 softmax-activated output neurons. This network has 76,570 trainable parameters in total.

We solve the classification problem by minimizing the categorical cross-entropy loss. We use the following hyper-parameter setting in MaSS:

 η=0.01,δ=1/12,γ=0.01.

Curves of the training loss and test accuracy for different optimizers are shown in Figure 2.

#### Convolutional Networks.

We consider the image classification problem with convolutional neural networks on the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images of size 32×32. Our CNN has two convolutional layers with 64 channels each and no padding. Each convolutional layer is followed by a max-pooling layer with stride 2. On top of the last max-pooling layer, there is a fully-connected RELU-activated layer of size 64, followed by the output layer of size 10 with softmax non-linearity. A dropout layer with keep probability 0.5 is applied after the fully-connected layer. This CNN architecture has 210,698 trainable parameters in total.

Again, we minimize the categorical cross-entropy loss. We use the following hyper-parameter setting in MaSS:

 η=0.01,δ=1/12,γ=0.01.

See Figure 3 for the performance of different algorithms.

#### Residual Networks.

Finally, we train a residual network with 38 layers for the multiclass classification problem on CIFAR-10. The ResNet we use has a sequence of 18 residual blocks [5], organized into three groups of 6 blocks, where the blocks within each group share a common output shape. On top of these blocks, there is an average pooling layer with stride 2, followed by an output layer of size 10 activated by softmax non-linearity, and we optimize the cross-entropy loss. This ResNet has 595,466 trainable parameters in total.

Figure 4 shows the performance of MaSS, Adam and SGD. In this experiment, we use the following hyper-parameter setting for MaSS:

 η=0.001,δ=1/20,γ=0.005.

### 5.2 Evaluation on Kernel Regression Models

Linear regression with kernel-mapped features is a convex optimization problem. We solve the linear regression with two different types of features: Gaussian kernel features and Laplacian kernel features, separately. We randomly subsample 2,000 MNIST training images as the training set, and use the MNIST test images as the test set. We generate kernel features using Gaussian and Laplacian kernels, considering each training point as a kernel center. Bandwidth values are set to 5 for both kernels. We train the linear regression models with the kernel features as inputs and one-hot-encoded labels as desired outputs. The training objective is to minimize the mean squared error (MSE) between the model outputs and the one-hot-encoded ground-truth labels. In each kernel regression task, the number of trainable parameters is 2000 × 10 = 20,000.
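The kernel feature map used here can be sketched as follows. This is one common parameterization of the Gaussian and Laplacian kernels; the paper does not spell out its exact bandwidth convention, so the normalization below is an assumption.

```python
import numpy as np

def kernel_features(X, centers, bandwidth=5.0, kind="gaussian"):
    """Feature map phi(x) = (k(x, c_1), ..., k(x, c_n)) over the kernel centers.

    gaussian : k(x, c) = exp(-||x - c||^2 / (2 s^2))
    laplacian: k(x, c) = exp(-||x - c|| / s)
    """
    diff = X[:, None, :] - centers[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    if kind == "gaussian":
        return np.exp(-sq_dist / (2 * bandwidth ** 2))
    return np.exp(-np.sqrt(sq_dist) / bandwidth)
```

With 2,000 centers and 10 one-hot outputs, a linear model on top of these features has 2,000 × 10 trainable parameters.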

We use the following hyper-parameter setting in MaSS, for both kernel regression tasks:

 η=0.01,δ=1/12,γ=0.01.

The training curves of MaSS, Adam and SGD are demonstrated in Figure 5.

## Acknowledgements

We thank NSF for financial support. We thank Xiao Liu for helping with the empirical evaluation of our proposed method and useful discussions.

## Appendix A Pseudocode for MaSS

Note that the proposed algorithm initializes the variables w and u with the same vector, which can be randomly generated.

As discussed in subsection 2.3, MaSS can be equivalently implemented using the following update rules:

 wt+1 ← ut − η~∇f(ut), (31a)
 vt+1 ← (1−α)vt + αut − δ~∇f(ut), (31b)
 ut+1 ← (αvt+1 + wt+1)/(1+α). (31c)

In this case, the variables w, v and u should be initialized with the same vector.
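A direct transcription of the update rules above (a sketch; the objective and hyper-parameters passed in are the caller's responsibility, e.g. set per Section 3):

```python
import numpy as np

def mass(grad, w0, eta, alpha, delta, n_iters):
    """Run MaSS in the equivalent (eta, alpha, delta) form.

    grad(u) returns a (possibly stochastic) gradient estimate at u;
    w, v and u all start from the same vector w0.
    """
    w = v = u = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(u)
        w = u - eta * g                                  # (31a)
        v = (1 - alpha) * v + alpha * u - delta * g      # (31b)
        u = (alpha * v + w) / (1 + alpha)                # (31c)
    return w
```

With full gradients and the full-batch parameter setting (η = 1/L, α = 1/√κ, δ = α/μ) this behaves like the classical Nesterov method, as discussed in Remark 7.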

There is a bijection between the hyper-parameters (η1, η2, γ) and (η, α, δ), which is given by:

 γ=(1−α)/(1+α), η1=η, η2=(η−αδ)/(1+α). (32)

## Appendix B Strong Convexity and Smoothness of Functions

###### Definition 1 (Strong Convexity).

A differentiable function f is μ-strongly convex (μ > 0) if

 f(x) ≥ f(z) + ⟨∇f(z), x−z⟩ + (μ/2)∥x−z∥2, ∀x,z ∈ ℝd. (33)

###### Definition 2 (Smoothness).

A differentiable function f is L-smooth (L > 0) if

 f(x) ≤ f(z) + ⟨∇f(z), x−z⟩ + (L/2)∥x−z∥2, ∀x,z ∈ ℝd. (34)

## Appendix C Proof of Theorem 3

###### Proof.

The update rule for the variable v is, as in Eq.(31b):

 vt+1 = (1−α)vt + αut − δ~∇f(ut). (35)

By the (1/μ)-smoothness of g, we have

 g(vt+1) ≤ g((1−α)vt+αut) + ⟨∇g((1−α)vt+αut), −δ~∇f(ut)⟩ + (δ2/(2μ))∥~∇f(ut)∥2. (36)

Taking expectations on both sides, we get

 E[g(vt+1)] = g((1−α)vt+αut) + ⟨∇g((1−α)vt+αut), −δ∇f(ut)⟩ + (δ2/(2μ))E[∥~∇f(ut)∥2]
 ≤ (1−α)g(vt) + αg(ut) − δ(1−ϵ)⟨(1−α)vt+αut−w∗, ut−w∗⟩ + δ2κf(ut),

where in the last inequality we used the convexity of g, the assumption on ⟨∇g, ∇f⟩ made in the theorem, and the definition of L for general convex functions, Eq.(25).

By Eq.(17),

 −δ(1−ϵ)⟨(1−α)vt+αut−w∗, ut−w∗⟩ ≤ (δ(1−ϵ)/2)(((1−α)/α)∥wt−w∗∥2 − ((1+α)/α)∥ut−w∗∥2).

By the (1/μ)-smoothness of g and g(w∗) = 0,

 αg(ut) ≤ (α/(2μ))∥ut−w∗∥2.

Hence,

 E[g(vt+1)] ≤ (1−α)g(vt) + ((1−α)δ(1−ϵ)/(2α))∥wt−w∗∥2 + (α/(2μ) − δ(1+α)(1−ϵ)/(2α))∥ut−w∗∥2 + δ2κf(ut). (37)

On the other hand,

 E[∥wt+1−w∗∥2] = ∥ut−w∗∥2 − 2η⟨ut−w∗, ∇f(ut)⟩ + η2E[∥~∇f(ut)∥2].