# The Multiplicative Noise in Stochastic Gradient Descent: Data-Dependent Regularization, Continuous and Discrete Approximation

The randomness in Stochastic Gradient Descent (SGD) is considered to play a central role in the observed strong generalization capability of deep learning. In this work, we re-interpret the stochastic gradient of vanilla SGD as a matrix-vector product between the matrix of per-example gradients and a random noise vector (namely multiplicative noise, M-Noise). Compared with the existing theory that explains SGD using additive noise, the M-Noise helps establish a general case of SGD, namely Multiplicative SGD (M-SGD). The advantage of M-SGD is that it decouples the noise from the parameters, providing clear insight into the inherent randomness in SGD. Our analysis shows that 1) the M-SGD family, including vanilla SGD, can be viewed as a minimizer with a data-dependent regularizer resembling Rademacher complexity, which contributes to the implicit bias of M-SGD; 2) M-SGD converges strongly to a continuous stochastic differential equation under the Gaussian noise assumption, ensuring the path-wise closeness of the discrete and continuous dynamics. For applications, based on M-SGD we design a fast algorithm to inject noise of various types (e.g., Gaussian and Bernoulli) into gradient descent. Based on this algorithm, we further demonstrate that M-SGD can approximate SGD with various noise types and recover its generalization performance, which reveals the potential of M-SGD for practical deep learning problems, e.g., large batch training with strong generalization performance. We have validated our observations in multiple practical deep learning scenarios.


## 1 Introduction

With the rise of deep learning, Stochastic Gradient Descent (SGD) has become one of the standard workhorses for optimizing deep models bottou1991stochastic . Studies on the memorization behavior of deep neural networks suggest that commonly used learning algorithms, e.g., SGD, play an important role of implicit regularization, preventing over-parameterized models from converging to minima that cannot generalize well zhang2017understanding . More specifically, for SGD it is believed that the inherent randomness, induced by the random sampling strategies adopted, contributes to its implicit regularization effects hu2019quasi ; zhu2018anisotropic . One piece of direct evidence is that large batch SGD typically performs worse than small batch SGD hoffer2017 ; keskar2016large , since a larger batch size reduces the randomness in SGD. Thus, in order to further demystify deep learning, understanding the randomness in SGD as well as its effect on generalization becomes critical.

Most previous research studied the properties of SGD by modeling the algorithm as gradient descent (GD) with an unbiased noise term introduced by the random sampling. For example, daneshmand2018escaping ; jin2017escape ; kleinberg2018alternative studied how SGD noise helps the learning dynamics escape from saddles and local minima. For neural networks with one hidden layer, the implicit regularization effect of SGD has been studied in brutzkus2017sgd . Generalization bounds for stochastic gradient Langevin dynamics are obtained in mou2017generalization , shedding light on the regularization role of SGD noise.

Yet another important line of work understands SGD from the continuous-time perspective, where stochastic differential equations (SDEs) serve as mathematical tools oksendal2003stochastic to analyze SGD noise. The weak convergence between SGD and a continuous SDE was first established by li2017stochastic , after which further efforts have been made to understand SGD and its noise hu2017diffusion ; hu2019quasi ; feng2019uniform . To leverage the SGD noise, hoffer2017 ; jastrzkebski2017three studied approaches to control its scale by tuning batch sizes and learning rates. Beyond the noise scale, zhu2018anisotropic studied the structure of the SGD noise, where the anisotropic property of SGD noise and its benefit in escaping (bad) local minima have been well examined. From the Bayesian perspective, the SDE-based interpretation mandt2017stochastic ; chaudhari2017stochastic ; smith2018bayesian further suggested that SGD indeed performs variational inference through entropy regularization, which prevents over-fitting. Though the SDE li2017stochastic offers a powerful tool for analyzing SGD mathematically, to what extent the approximation holds in practice is not fully understood. As a reference, the recent study simsekli2019tail argued that the SGD noise is heavy-tailed and non-Gaussian, so an SDE driven by Brownian motion might not be the best tool for approximating SGD.

Here we provide, for the first time, insights into the SGD noise from a mini-batch sampling perspective: instead of adopting additive noise models, we propose Multiplicative SGD (M-SGD) as a general case of SGD that models the stochastic gradient estimated at each iteration as the matrix-vector product between a matrix of per-example gradient vectors, namely the gradient matrix, and a vector of random noise, namely the Multiplicative Noise (M-Noise). Compared with the traditional additive interpretation, M-Noise has the advantage of decoupling randomness from model parameters, shedding new light on understanding SGD noise. Based on this novel perspective, we explicitly demonstrate the regularization effects of SGD, introduce a fast algorithm to generate SGD-like noise to study the effects of SGD noise, and empirically verify the approximation between SGD and SDE li2017stochastic . Concisely, our main contributions are:

Result I - Our theoretical analysis of M-SGD shows that learning with SGD leads to an organic Structural Risk Minimization framework with a data-dependent regularizer resembling local Rademacher complexity. This finding explains the “implicit regularization” effect of SGD; the explicit regularization of SGD with local Rademacher complexity and the benefit of such regularization were studied in yang2019empirical .

Result II - Beyond the weak convergence between SGD and SDE li2017stochastic , which relies primarily on the moment information of the SGD noise, we show that a special case of M-SGD with Gaussian M-Noise, namely M-SGD-Gaussian, converges strongly to the SDE.

Result III - Favorably, the M-SGD model also provides an efficient way to approximate SGD with noises of desired types, including Gaussian noises based on either gradient covariance or Fisher matrices, where M-SGD is equipped with M-Noise drawn from interchangeable random distributions. Using this approach, we empirically verify that it is possible to approximate the SGD noise by a Gaussian noise without loss of generalization performance, which supports our Result II.

Result IV - Moreover, we empirically demonstrate that M-SGD can well approximate SGD with desired noises under practical mini-batch settings. We design a systematic series of experiments to show that the M-Noise of SGD can be well approximated via 1) Bernoulli noise, 2) Gaussian noise with mini-batch estimated Fisher and 3) sparse Gaussian noise. These results suggest the potential of using M-SGD to develop practical learning algorithms.

## 2 M-SGD: Multiplicative Stochastic Gradient Descent

Machine learning problems usually involve minimizing an empirical loss over training data $\{x_i\}_{i=1}^N$, $L(\theta) = \frac{1}{N}\sum_{i=1}^N \ell(x_i;\theta)$, where $\ell(x_i;\theta)$ is the loss over one example and $\theta \in \mathbb{R}^d$ is the parameter to be optimized. Define the “loss vector” as $\mathcal{L}(\theta) = \big(\ell(x_1;\theta),\dots,\ell(x_N;\theta)\big)$, then the gradient matrix is $\nabla_\theta\mathcal{L}(\theta) \in \mathbb{R}^{d\times N}$. Let $\mathbf{1} = (1,\dots,1)^T \in \mathbb{R}^N$, then $L(\theta) = \frac{1}{N}\mathcal{L}(\theta)\cdot\mathbf{1}$.

#### SGD

The typical SGD iteration works as follows: it first randomly draws a mini-batch of samples with index set $B_t$, $|B_t| = b$, and then updates the parameter using the stochastic gradient estimated on the mini-batch and a learning rate $\eta$,

$$\theta_{t+1} - \theta_t = -\eta\,\tilde g(\theta_t), \qquad \tilde g(\theta_t) = \frac{1}{b}\sum_{i\in B_t} \nabla_\theta \ell(x_i;\theta_t). \tag{1}$$

$$\tilde g(\theta_t) = \nabla_\theta L(\theta_t) + \mathcal{V}(\theta_t), \qquad \mathcal{V}(\theta_t) := \tilde g(\theta_t) - \nabla_\theta L(\theta_t), \tag{2}$$

where $\mathcal{V}(\theta_t)$ represents the Additive-Noise (A-Noise) of SGD. We call the interpretation of SGD by Eq. (1) and Eq. (2) the Additive-SGD (A-SGD) model. Note that $\mathcal{V}$ might not be a Gaussian noise simsekli2019tail ; its mean is zero and its covariance is $\frac{1}{b}\Sigma_{\rm sgd}(\theta_t)$, where $\Sigma_{\rm sgd}$ is the gradient covariance given in Eq. (15). Though this model is commonly adopted in the literature li2017stochastic ; zhu2018anisotropic ; chaudhari2017stochastic ; mandt2017stochastic ; smith2018bayesian ; jastrzkebski2017three , it is clear that the A-Noise depends on the parameter $\theta$; it therefore varies along the optimization path, which complicates understanding and analysis. To overcome this obstacle, many works assume that the A-Noise is constant or upper bounded by some constant chaudhari2017stochastic ; jastrzkebski2017three ; mandt2017stochastic ; zhang2017hitting . Thus a natural question arises: can the noise in SGD be decoupled from the parameters? Fortunately, our multiplicative noise provides a positive answer, as elaborated in the following.

#### Multiplicative Noise (M-Noise)

By the definition of SGD, the randomness of SGD is indeed caused by the mini-batch sampling procedure, and this procedure is independent of the current model parameter. Thus there should exist a parameter-independent (i.e., $\theta$-independent) model to characterize SGD noise, rather than the aforementioned A-SGD. To this end, we propose the following formulation:

$$\tilde g(\theta_t) = \nabla_\theta\mathcal{L}(\theta_t)\cdot\mathcal{W}_{\rm sgd}, \tag{3}$$

where $\mathcal{W}_{\rm sgd} \in \mathbb{R}^N$ is a random vector characterizing the mini-batch sampling process, i.e., for sampling without replacement, $\mathcal{W}_{\rm sgd}$ contains $b$ entries equal to $1/b$ and $N-b$ zeros, at random indices.

We hereby use Multiplicative-SGD (M-SGD) to denote the method of modeling SGD by Eq. (1) and Eq. (3), and Multiplicative-Noise (M-Noise) to denote $\mathcal{W}_{\rm sgd}$. Note that the M-Noise is independent of the parameter $\theta$. The following Proposition 1 characterizes the properties of the M-Noise of SGD.

###### Proposition 1.

(Mean and covariance of M-Noise in SGD) For a mini-batch sampled with replacement, the M-Noise in SGD satisfies

$$\mathbb{E}[\mathcal{W}_{\rm sgd}] = \frac{1}{N}\mathbf{1}, \qquad \mathrm{Var}[\mathcal{W}_{\rm sgd}] = \frac{1}{bN}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big). \tag{4}$$

For a mini-batch sampled without replacement, the M-Noise in SGD satisfies

$$\mathbb{E}[\mathcal{W}'_{\rm sgd}] = \frac{1}{N}\mathbf{1}, \qquad \mathrm{Var}[\mathcal{W}'_{\rm sgd}] = \frac{N-b}{bN(N-1)}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big). \tag{5}$$

The proof is left to Section A.1 of the Appendix. We only consider the sampling-with-replacement case in the remaining parts, since most of our results carry over to the other case, unless otherwise noted.
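As a sanity check on Proposition 1, the with-replacement statistics in Eq. (4) can be verified by Monte Carlo. The sketch below (NumPy, with illustrative sizes $N = 20$, $b = 5$ of our own choosing) builds the M-Noise directly from sampled index counts:

```python
import numpy as np

# Monte Carlo check of Proposition 1 (sampling WITH replacement):
# the M-Noise W averages b one-hot vectors, one per sampled index.
rng = np.random.default_rng(0)
N, b, trials = 20, 5, 200_000

idx = rng.integers(0, N, size=(trials, b))        # b independent draws per batch
W = np.zeros((trials, N))
rows = np.repeat(np.arange(trials), b)
np.add.at(W, (rows, idx.ravel()), 1.0 / b)        # W_j = (#times index j sampled) / b

emp_mean = W.mean(axis=0)
emp_cov = np.cov(W, rowvar=False)

thy_mean = np.full(N, 1.0 / N)                            # Eq. (4): E[W] = (1/N) 1
thy_cov = (np.eye(N) - np.ones((N, N)) / N) / (b * N)     # Var[W] = (I - 11^T/N)/(bN)

print(np.abs(emp_mean - thy_mean).max())          # small (Monte Carlo error)
print(np.abs(emp_cov - thy_cov).max())            # small (Monte Carlo error)
```

Both deviations shrink as the number of trials grows, matching the closed forms in Eq. (4).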

Besides SGD, we extend the M-Noise to general cases and overload the notation M-SGD as:

$$\theta_{t+1} - \theta_t = -\eta\,\nabla_\theta\mathcal{L}(\theta_t)\cdot\mathcal{W}, \qquad \mathbb{E}[\mathcal{W}] = \frac{1}{N}\mathbf{1}, \quad \mathcal{W} \in \mathbb{R}^{N\times 1}. \tag{6}$$

Note that: 1) M-SGD (6) becomes standard GD when $\mathcal{W} = \frac{1}{N}\mathbf{1}$, and SGD when $\mathcal{W} = \mathcal{W}_{\rm sgd}$. 2) The M-Noise is independent of model, parameter and dataset. Such decoupling provides a clear picture of the regularization effect of SGD; we elaborate this point in the next section. 3) One special case deserving particular attention is when $\mathcal{W}$ is a Gaussian noise, i.e., $\mathcal{W}_G \sim \mathcal{N}\big(\mathbb{E}[\mathcal{W}_{\rm sgd}], \mathrm{Var}[\mathcal{W}_{\rm sgd}]\big)$, and we call Eq. (6) with this choice M-SGD-Gaussian. Our analysis will later show that the discrete M-SGD-Gaussian (6) is strongly approximated by a continuous SDE li2017stochastic . Moreover, we will empirically demonstrate that approximating $\mathcal{W}_{\rm sgd}$ by $\mathcal{W}_G$ achieves highly similar regularization effects. Thus it is meaningful to use the SDE as a tool for understanding the generalization benefits of SGD and its variants.
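To make the matrix-vector view concrete, here is a minimal NumPy sketch of the M-SGD update (6) on an illustrative least-squares problem (the loss, sizes, and data are stand-ins of our own, not from the paper). It checks that $\mathcal{W} = \frac{1}{N}\mathbf{1}$ recovers GD and that the SGD M-Noise recovers the usual mini-batch gradient:

```python
import numpy as np

# Illustrative per-example loss: l(x_i; theta) = 0.5 * (x_i @ theta - y_i)^2.
rng = np.random.default_rng(1)
N, d, b = 50, 3, 10
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
theta = np.zeros(d)
eta = 0.1

def grad_matrix(theta):
    # d x N matrix whose column i is the gradient of the i-th example's loss.
    return (X * (X @ theta - y)[:, None]).T

# W = (1/N) 1 recovers full-batch gradient descent ...
W_gd = np.full(N, 1.0 / N)
assert np.allclose(grad_matrix(theta) @ W_gd, X.T @ (X @ theta - y) / N)

# ... while the SGD M-Noise (b entries of 1/b at sampled indices) gives
# exactly the usual mini-batch stochastic gradient.
batch = rng.choice(N, size=b, replace=False)
W_sgd = np.zeros(N)
W_sgd[batch] = 1.0 / b
g_minibatch = X[batch].T @ (X[batch] @ theta - y[batch]) / b
assert np.allclose(grad_matrix(theta) @ W_sgd, g_minibatch)

theta = theta - eta * grad_matrix(theta) @ W_sgd   # one M-SGD step
```

The randomness enters only through the vector `W_sgd`, never through the gradient matrix itself, which is exactly the decoupling point 2) above.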

#### Connection between A-Noise and M-Noise

Let the M-Noise be $\mathcal{W}$; then the corresponding A-Noise is $\mathcal{V}(\theta) = \nabla_\theta\mathcal{L}(\theta)\big(\mathcal{W} - \frac{1}{N}\mathbf{1}\big)$. Moreover, under the assumption that $\mathcal{W}$ follows a Gaussian distribution, $\mathcal{V}(\theta)$ is Gaussian too. This property plays a crucial role in designing the fast algorithm for injecting noise into gradient-based methods, as shown in Section 5.1. Though A-Noise and M-Noise can be converted into each other, M-Noise decouples the noise from the parameters, which gives new insights into the behavior of SGD. For example, we now use the M-Noise perspective to explicitly elaborate the implicit bias of SGD.

## 3 M-SGD Performs Data-Dependent Regularization

This section presents the details of Result I. Let us first recall that $\nabla_\theta L(\theta) = \frac{1}{N}\nabla_\theta\mathcal{L}(\theta)\cdot\mathbf{1}$. In Eq. (6), let $\mathcal{V} = N\mathcal{W} - \mathbf{1}$ and rewrite it as

$$\theta_{k+1} - \theta_k = -\eta\,\nabla_\theta L(\theta_k) - \eta\,\nabla_\theta\Big(\frac{1}{N}\mathcal{L}(\theta_k)\cdot\mathcal{V}\Big) = -\eta\,\nabla_\theta\Big(L(\theta_k) + \frac{1}{N}\mathcal{L}(\theta_k)\cdot\mathcal{V}\Big), \qquad \mathbb{E}[\mathcal{V}] = 0. \tag{7}$$

Thus learning by M-SGD (6) is equivalent to applying GD to an objective with a randomized data-dependent regularization term:

$$\tilde L(\theta) := L(\theta) + \frac{1}{N}\mathcal{L}(\theta)\cdot\mathcal{V} = \frac{1}{N}\sum_{i=1}^N \ell(x_i;\theta) + \frac{1}{N}\sum_{i=1}^N v_i\,\ell(x_i;\theta) \le \frac{1}{N}\sum_{i=1}^N \ell(x_i;\theta) + \frac{1}{N}\sup_{\|\theta'-\theta\|_2\le\delta}\Big|\sum_{i=1}^N v_i\,\ell(x_i;\theta')\Big|. \tag{8}$$

We upper bound the random term in M-SGD by its local maximum in a $\delta$-ball, and the inequality becomes tighter as $\delta \to 0$. The right-hand side of the objective (8) can be treated as the empirical realization of the population objective (9):

$$\tilde L_{\rm popu}(\theta) := \mathbb{E}_x[\ell(x;\theta)] + \mathbb{E}_{x_1,\dots,x_N}\mathbb{E}_{\mathcal{V}}\Big[\frac{1}{N}\sup_{\|\theta'-\theta\|_2\le\delta}\Big|\sum_{i=1}^N v_i\,\ell(x_i;\theta')\Big|\Big]. \tag{9}$$

The explicit regularization of SGD with the local Rademacher complexity over a $\delta$-ball, and the empirical benefit of such regularization in image classification and neural architecture search, have been reported in yang2019empirical . The difference is that we show SGD carries an implicit regularization resembling the local Rademacher complexity.

We denote $R(\mathcal{V},\theta,\delta,N) := \frac{1}{N}\,\mathbb{E}_{x_1,\dots,x_N}\mathbb{E}_{\mathcal{V}}\sup_{\|\theta'-\theta\|_2\le\delta}\big|\sum_{i=1}^N v_i\,\ell(x_i;\theta')\big|$. For any M-Noise $\mathcal{W}$ with $\mathcal{V} = N\mathcal{W} - \mathbf{1}$, $R$ defines a local complexity measure.

Note that the components of $\mathcal{V}$ might not be independent. Specifically, 1) for $v_i \in \{-1, +1\}$ i.i.d. with equal probability, $R$ is the local Rademacher complexity bartlett2002rademacher ; bartlett2005local ; bartlett2006local ; yang2019empirical ; 2) for $v_i \sim \mathcal{N}(0, 1)$ i.i.d., $R$ is the local Gaussian complexity bartlett2002rademacher , which is the regularization term corresponding to M-SGD-Gaussian with independent Gaussian M-Noise; 3) for the SGD noise $\mathcal{V}_{\rm sgd} = N\mathcal{W}_{\rm sgd} - \mathbf{1}$, we name $R$ the local SGD complexity. (In the literature, Rademacher/Gaussian complexity is defined sometimes with the absolute value and sometimes without; this makes no essential difference for obtaining generalization bounds, and we adopt the version with the absolute value.) We provide the following results to bound the local Rademacher, Gaussian and SGD complexities.

###### Theorem 1.

(Local Rademacher, Gaussian and SGD complexity) Let $\mathcal{A}$, $\mathcal{G}$, $\mathcal{V}_{\rm sgd}$ be the Rademacher, Gaussian and SGD random variables, respectively. Then there exist absolute constants $c, C > 0$ such that:

$$(1)\quad c\,R(\mathcal{A},\theta,\delta,N) \le R(\mathcal{G},\theta,\delta,N) \le C\ln N\cdot R(\mathcal{A},\theta,\delta,N), \tag{10}$$

$$(2)\quad R(\mathcal{V}_{\rm sgd},\theta,\delta,N) \le \frac{2(k-1)}{k}\,R(\mathcal{A},\theta,\delta,b), \quad \text{if } N = kb,\ k\in\mathbb{N},\ k>1. \tag{11}$$

The proof can be found in Sections 1.2 and 1.3 of the Appendix.

Theorem 1 tells us that the local Gaussian complexity is equivalent to the local Rademacher complexity up to a logarithmic factor, which explains the generalization advantage of M-SGD-Gaussian, since regularizing Rademacher complexity is known to benefit generalization mou2018dropout ; yang2019empirical ; bartlett2006local . Though we cannot yet build a perfect bridge between the local SGD complexity and the local Rademacher complexity, in Section 5 we will show that M-SGD-Gaussian can closely simulate SGD, given a proper covariance of the Gaussian M-Noise. Thus we conclude that the local SGD complexity works similarly to the local Gaussian and local Rademacher complexities, and the implicit bias of SGD is due to this data-dependent complexity regularizer.
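The two-sided relation (10) between the Rademacher and Gaussian complexities can be illustrated numerically. The sketch below uses a hypothetical one-dimensional loss class $\ell(x_i;\theta') = |\theta' - x_i|$ of our own choosing, with the supremum over the $\delta$-ball replaced by a grid of parameter values; it is an illustration only, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 50, 4000
x = rng.normal(size=N)
thetas = np.linspace(-0.5, 0.5, 41)            # grid standing in for the delta-ball
losses = np.abs(np.subtract.outer(thetas, x))  # l(x_i; theta') = |theta' - x_i|, shape (41, N)

def complexity(noise):
    # (1/N) E sup_{theta'} | sum_i v_i l(x_i; theta') |, estimated by Monte Carlo.
    vals = np.abs(losses @ noise.T)            # shape (41, trials)
    return vals.max(axis=0).mean() / N

R_A = complexity(rng.choice([-1.0, 1.0], size=(trials, N)))  # Rademacher noise
R_G = complexity(rng.normal(size=(trials, N)))               # Gaussian noise

print(R_A, R_G)  # comparable magnitudes, as Theorem 1 (10) predicts
```

The estimated ratio $R_G/R_A$ stays bounded between constants of order one, consistent with the $c$ and $C\ln N$ factors in (10).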

Figure 1 (a)(d) shows an empirical comparison of the generalization performance of GD, SGD, the M-SGD family, and GD optimizing the loss with a Rademacher regularizer. We can clearly observe that SGD and the M-SGD family behave similarly to GD-Rademacher, supporting our understanding of the data-dependent regularization effect of SGD and M-SGD.

## 4 The Continuous Approximation of M-SGD

This section primarily focuses on presenting Result II of our work. With the implicit bias of M-SGD established, we now address its continuous approximation. We first recall the weak approximation between discrete A-SGD and a continuous SDE li2017stochastic ; hu2017diffusion ; feng2019uniform .

Heuristically, let $t = k\eta$ and $\Theta_t \approx \theta_k$; the A-SGD iteration (1, 2) can be treated as a discretization of the following SDE

$$\mathrm{d}\Theta_t = -\nabla_\theta L(\Theta_t)\,\mathrm{d}t + \sqrt{\tfrac{\eta}{b}}\,\Sigma_{\rm sgd}(\Theta_t)^{1/2}\,\mathrm{d}W_t. \tag{12}$$

It is important to recognize that the noises driving the A-SGD iteration (1, 2) and the SDE (12) are independent processes; hence we can only relate the two in a weak sense.

###### Theorem 2.

(Weak convergence between A-SGD and SDE li2017stochastic ) Let $T > 0$. Under mild assumptions, SGD (1) is an order-1 weak approximation of the SDE (12), i.e., for a general class of test functions $g$,

$$\big|\mathbb{E}\,g(\Theta_{k\eta}) - \mathbb{E}\,g(\theta_k)\big| \le C\eta \quad \text{for all } 0\le k\le \lfloor T/\eta\rfloor. \tag{13}$$

Please refer to Theorem 1 in li2017stochastic for the rigorous statement and proof.

Similarly, the weak approximation also holds for M-SGD (1, 3), provided the corresponding M-Noise shares the same covariance as the multiplicative noise of SGD, since Theorem 2 only makes use of the moments of the SGD noise. The weak convergence establishes the equivalence of the discrete iteration and the continuous SDE at the level of probability distributions. Nonetheless, the path-wise closeness between the two processes is not ensured.

M-SGD-Gaussian. To obtain a stronger approximation, e.g., path-wise convergence, we need to assume that the M-Noises are drawn from a Gaussian distribution, i.e., M-SGD-Gaussian. Concisely, Theorem 3 guarantees the strong convergence between M-SGD-Gaussian and the SDE (12).

###### Theorem 3.

(Strong convergence between M-SGD-Gaussian and SDE) Let $T > 0$. Assume the $\ell(x_i;\cdot)$ are bounded with uniformly Lipschitz continuous gradients for all $i$. Then Eq. (12) is an order-1 strong approximation of M-SGD-Gaussian (6), i.e.,

$$\mathbb{E}\,\|\Theta_{k\eta} - \theta_k\|^2 \le C\eta^2 \quad \text{for all } 0\le k\le \lfloor T/\eta\rfloor. \tag{14}$$

The rigorous statement and proof are deferred to Section 1.4 in the Appendix.

The strong convergence guarantees the path-wise closeness between $\Theta_{k\eta}$ and $\theta_k$, which indicates that the two processes behave closely not only at the level of probability distributions but also at the level of sample paths. In Section 5 (Figure 1 (a)(d)), we will empirically verify that M-SGD-Gaussian achieves highly similar regularization effects to SGD, which makes it reasonable to understand SGD via M-SGD-Gaussian and its strong approximation, the continuous SDE.

## 5 The Discrete Approximation of SGD using M-SGD

In this section we study how to approximate SGD using M-SGD with M-Noise drawn from interchangeable random distributions, with or without mini-batch settings. Compared to A-SGD, our proposed M-SGD can easily generate noises of various useful and desired types with low computational complexity, using M-Noise drawn from the distributions of interest. In the rest of this section, we first introduce the fast algorithm for implementing M-SGD-Gaussian, then present the details of Result III and Result IV, all based on the Fast M-SGD-Gaussian algorithm and its variants.

### 5.1 Fast M-SGD-Gaussian: efficient Gaussian noise generation with gradient covariance

Approximating the noise in SGD by a Gaussian one is a commonly used trick zhu2018anisotropic ; jastrzkebski2017three ; wen2019interplay . The target noise is a Gaussian with the gradient covariance as its covariance, denoted as $\mathcal{N}(0, \Sigma_{\rm sgd}(\theta))$. To obtain such noise, one would first compute the covariance matrix $\Sigma_{\rm sgd}(\theta) \in \mathbb{R}^{d\times d}$ and then apply the singular value decomposition (SVD), $\Sigma_{\rm sgd}(\theta) = U\Lambda U^T$, to transform a white noise $\epsilon \sim \mathcal{N}(0, I_d)$ into the desired noise $U\Lambda^{1/2}\epsilon$.

However, there are two obstacles in the above generation procedure: 1) evaluating and storing $\Sigma_{\rm sgd}(\theta)$ is computationally unacceptable when both $N$ and $d$ are large; 2) performing SVD on a $d\times d$ matrix is prohibitively hard when $d$ is extremely large. Furthermore, one needs to repeat 1) and 2) at every parameter update, since $\Sigma_{\rm sgd}$ depends on the parameter $\theta$. As a compromise, current works suggest approximating the gradient covariance using only its diagonal or block-diagonal elements wen2019interplay ; zhu2018anisotropic ; jastrzkebski2017three ; martens2015optimizing . Generally, there is no guarantee that the diagonal information approximates the full gradient covariance well. Specifically, zhu2018anisotropic empirically showed that such diagonal approximation cannot fully recover the regularization effects of SGD. Thus a more effective approach to generating Gaussian noise with gradient covariance is of both theoretical and empirical importance.

Inspired by the M-SGD framework (6), we propose a fast algorithm to generate Gaussian SGD-like noise. First of all, a short calculation shows that

$$\Sigma_{\rm sgd}(\theta) = \frac{1}{N}\Big(\nabla_\theta\mathcal{L}(\theta)\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big)\Big)\Big(\nabla_\theta\mathcal{L}(\theta)\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big)\Big)^T. \tag{15}$$

In this way, the preferred Gaussian noise can be sampled as $v = \frac{1}{\sqrt{bN}}\nabla_\theta\mathcal{L}(\theta)\big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\big)\epsilon$ with $\epsilon \sim \mathcal{N}(0, I_N)$. Besides, since $\nabla_\theta L(\theta) = \frac{1}{N}\nabla_\theta\mathcal{L}(\theta)\cdot\mathbf{1}$, we can indeed use M-SGD-Gaussian as the approximation of SGD with Gaussian noise:

$$\theta_{t+1} - \theta_t = -\eta\,\nabla_\theta L(\theta_t) - \eta v = -\eta\,\nabla_\theta\mathcal{L}(\theta_t)\cdot\mathcal{W}_G, \qquad \mathcal{W}_G = \frac{1}{N}\mathbf{1} + \frac{1}{\sqrt{bN}}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big)\epsilon, \quad \epsilon\sim\mathcal{N}(0, I). \tag{16}$$
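A quick Monte Carlo check of Eqs. (15)-(16): sampling the noise as a single matrix-vector product $\frac{1}{\sqrt{bN}}\nabla_\theta\mathcal{L}(\theta)(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T)\epsilon$ reproduces the target covariance $\Sigma_{\rm sgd}/b$ without ever forming the $d\times d$ matrix or running an SVD. The gradient matrix below is a random stand-in, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, b = 100, 4, 10
G = rng.normal(size=(d, N))                   # stand-in gradient matrix (columns = per-example grads)
C = np.eye(N) - np.ones((N, N)) / N           # centering projector: C @ C == C

# Target covariance of the injected noise: Sigma_sgd / b with Sigma_sgd from Eq. (15).
target = (G @ C) @ (G @ C).T / (b * N)

# Fast sampling, Eq. (16): one matrix-vector product per draw.
eps = rng.normal(size=(N, 50_000))
v = G @ (C @ eps) / np.sqrt(b * N)            # each column ~ N(0, Sigma_sgd / b)

emp = v @ v.T / eps.shape[1]
print(np.abs(emp - target).max())             # Monte Carlo error, small
```

The per-draw cost is that of one weighted gradient sum, which is the source of the computational advantage discussed in the remark below.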

#### Fast Implementation

Thanks to the linearity of the differentiation operator, which commutes with taking a weighted average of the per-example losses, we can design a fast algorithm (described in Algorithm 1) to implement M-SGD-Gaussian in the form of Eq. (16): since $\nabla_\theta\mathcal{L}(\theta)\cdot\mathcal{W} = \nabla_\theta\big(\mathcal{L}(\theta)\cdot\mathcal{W}\big)$ for a fixed realization of $\mathcal{W}$, a single backward pass on the $\mathcal{W}$-weighted loss suffices.

Remark: 1) Before the deep learning era, the typical setting of machine learning was $N > d$, i.e., more samples than parameters. In this circumstance, the SVD way of generating Gaussian noise is indeed plausible. However, for deep networks where $d \gg N$, or where both numbers are large, computing the full gradient turns out to be far more efficient than explicitly evaluating the covariance matrix and performing SVD, which gives our method its computational advantage over the traditional one. 2) Our method is easily extended to generate other types of noise besides Gaussian, e.g., Bernoulli noise and mini-batch versions of the noises. See the following for more discussion.
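The fast implementation rests on the identity $\nabla_\theta\mathcal{L}(\theta)\cdot\mathcal{W} = \nabla_\theta(\mathcal{L}(\theta)\cdot\mathcal{W})$ for fixed $\mathcal{W}$. A minimal sketch, with an illustrative least-squares loss of our own and finite differences standing in for an autodiff backward pass:

```python
import numpy as np

# The gradient of the scalar W-weighted loss L_cal(theta) @ W equals
# grad_matrix(theta) @ W, so one backward pass replaces N per-example gradients.
rng = np.random.default_rng(3)
N, d, b = 30, 3, 5
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
theta = rng.normal(size=d)

C = np.eye(N) - np.ones((N, N)) / N
W = 1.0 / N + C @ rng.normal(size=N) / np.sqrt(b * N)   # Gaussian M-Noise, Eq. (16)

def weighted_loss(th):
    return (0.5 * (X @ th - y) ** 2) @ W                # scalar: L_cal(th) @ W

g_mv = (X * (X @ theta - y)[:, None]).T @ W             # explicit grad-matrix @ W

h = 1e-5                                                # central finite differences
g_fd = np.array([(weighted_loss(theta + h * e) - weighted_loss(theta - h * e)) / (2 * h)
                 for e in np.eye(d)])
assert np.allclose(g_mv, g_fd, atol=1e-6)
```

In a deep learning framework the finite-difference check would simply be the backward pass on the weighted loss, so no per-example gradient is ever materialized.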

### 5.2 Approximating the M-Noise of SGD by Gaussian and component-independent noises

Here, we present the details of Result III. First, based on the Fast M-SGD-Gaussian (16), we unify two types of commonly used Gaussian noise for simulating SGD’s behavior: Gaussian noise with gradient covariance (M-SGD-Cov) zhu2018anisotropic and Gaussian noise with Fisher (M-SGD-Fisher) wen2019interplay .

#### M-SGD-Cov and M-SGD-Fisher

First, we know $\Sigma_{\rm sgd}(\theta) = F(\theta) - \nabla_\theta L(\theta)\nabla_\theta L(\theta)^T$, where $F(\theta) = \frac{1}{N}\nabla_\theta\mathcal{L}(\theta)\nabla_\theta\mathcal{L}(\theta)^T$ is the Fisher. Intuitively, M-SGD-Cov and M-SGD-Fisher should not be far from each other. We can see this using the SDE (12). At the beginning of SGD training, the drift term outscales the diffusion term zhu2018anisotropic ; shwartz2017opening and dominates the optimization, so the noise term contributes almost nothing, no matter whether it is gradient covariance noise or Fisher noise. During the later diffusion stage, however, the gradient becomes close to zero, and thus $\Sigma_{\rm sgd}(\theta) \approx F(\theta)$. In a nutshell, covariance noise and Fisher noise should behave similarly in regularizing the SGD iteration.

Thanks to the M-SGD-Gaussian formulation, we can now give a mathematical analysis of the difference between these two types of noise. Let $\mathcal{W}_{\rm fisher}$ and $\mathcal{W}_{\rm cov}$ be the M-Noises generating the Fisher noise and the gradient covariance noise, respectively. Then from Eq. (15) and Eq. (16), we have:

$$\mathcal{W}_{\rm fisher} = \frac{1}{N}\mathbf{1} + \frac{1}{\sqrt{bN}}\,\epsilon, \qquad \mathcal{W}_{\rm cov} = \frac{1}{N}\mathbf{1} + \frac{1}{\sqrt{bN}}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big)\epsilon, \qquad \epsilon\sim\mathcal{N}(0, I). \tag{17}$$

Note that the matrix $I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ centralizes a random vector. Thus the M-SGD perspective tells us the only difference between $\mathcal{W}_{\rm cov}$ and $\mathcal{W}_{\rm fisher}$ is that, in the former, the white noise generating the M-Noise is first centralized. On the other hand, since the components of $\epsilon$ are identically distributed with zero mean, and $N$ is extremely large in deep learning with huge training data, $\frac{1}{N}\mathbf{1}\mathbf{1}^T\epsilon \approx 0$ and the centralization barely changes the white noise, i.e., $\mathcal{W}_{\rm cov} \approx \mathcal{W}_{\rm fisher}$. Therefore $\Sigma_{\rm sgd}(\theta) \approx F(\theta)$ along the whole optimization path, which leads to nearly identical regularization effects when learning deep models.
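The claim that centralization barely changes a white noise vector for large $N$ can be seen directly: $(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T)\epsilon = \epsilon - \bar\epsilon\,\mathbf{1}$ with $\bar\epsilon \sim \mathcal{N}(0, 1/N)$. A small numerical sketch:

```python
import numpy as np

# Relative change induced by the centering projector on white noise,
# for increasing N; it shrinks roughly like 1/sqrt(N).
rng = np.random.default_rng(4)
for N in (100, 10_000, 1_000_000):
    eps = rng.normal(size=N)
    centered = eps - eps.mean()          # (I - 11^T/N) @ eps, computed in O(N)
    rel = np.linalg.norm(centered - eps) / np.linalg.norm(eps)
    print(N, rel)
```

At training-set scales ($N$ in the millions), the relative change is of order $10^{-3}$, which is why $\mathcal{W}_{\rm cov}$ and $\mathcal{W}_{\rm fisher}$ are practically indistinguishable.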

#### M-SGD-Bernoulli

To further verify our observation, we introduce M-SGD-Bernoulli, which employs a Bernoulli M-Noise to approximate the behavior of SGD with the diagonal part of its M-Noise covariance matrix, i.e., $\mathrm{diag}(\mathrm{Var}[\mathcal{W}_{\rm sgd}])$. Consider a random vector $\mathcal{W}_B$ with i.i.d. components $w_i \in \{0, 1/b\}$, $P(w_i = 1/b) = b/N$. Then $\mathbb{E}[\mathcal{W}_B] = \frac{1}{N}\mathbf{1}$ and $\mathrm{Var}[\mathcal{W}_B] = \frac{N-b}{bN^2}\,I$. In this way, we can see that the covariance of the Bernoulli M-Noise is the diagonal of the covariance of the (without-replacement) SGD M-Noise. Note that this “diagonal” relationship might not hold for their corresponding A-Noises. The Bernoulli M-Noise can be viewed as the best approximation of the SGD M-Noise among all random vectors with independent components.
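Under our reading of the construction (taking $P(w_i = 1/b) = b/N$, an assumption where the extracted text is ambiguous), the mean and variance identities can be checked in a few lines:

```python
import numpy as np

# Bernoulli M-Noise: i.i.d. components, w_i = 1/b with probability p = b/N, else 0.
N, b = 1000, 50
p = b / N
mean_bern = p / b                        # = 1/N, matching E[W_sgd]
var_bern = (1 / b) ** 2 * p * (1 - p)    # = (N - b) / (b * N^2)

# Diagonal entry of the without-replacement SGD M-Noise covariance, Eq. (5):
var_sgd_diag = (N - b) / (b * N * (N - 1)) * (1 - 1 / N)

assert np.isclose(mean_bern, 1 / N)
assert np.isclose(var_bern, var_sgd_diag)
```

Both quantities simplify to $(N-b)/(bN^2)$, confirming the diagonal relationship claimed above.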

#### Results and Observations

The experimental results shown in Figure 1 (a)(d) demonstrate that, under the same settings, M-SGD-Fisher and M-SGD-Cov perform almost identically, while the performance of M-SGD-Bernoulli tightly follows the previous two. Together with our theoretical insights from the M-SGD perspective, we conclude that 1) the gradient covariance is effectively equivalent to the Fisher for SGD (validating our theoretical findings), and 2) the M-Noise of SGD can be well approximated by noises with independent components, e.g., Gaussian and Bernoulli ones.

### 5.3 Practical SGD approximation using mini-batch M-SGD

M-SGD-[Fisher-b] and M-SGD-[Cov-b] To derive our results step by step, we first introduce two intermediate variants, M-SGD-[Fisher-b] and M-SGD-[Cov-b]. These M-SGD variants approximate the behavior of SGD by using mini-batch estimates of the gradient covariance or Fisher matrix, with batch size $b$, to generate the Gaussian random noise. The implementation of these two algorithms is addressed in Section 2 of the Appendix. Note that, though mini-batch gradients are used to estimate the Fisher/covariance matrix, the generated M-Noises (17) are not sparse, since each is still the sum of a constant vector and a dense Gaussian noise.

[M-SGD-Fisher]-b and [M-SGD-Cov]-b We further define the mini-batch versions of M-SGD-Fisher and M-SGD-Cov. Their M-Noises are defined as the composition of a mini-batch sampling random variable (with batch size $b$) and a Gaussian random variable. Thus the M-Noises are sparse Gaussians with $b$ non-zero Gaussian elements. In this way, $b$ naturally becomes the batch size of M-SGD, as the gradients corresponding to the zero elements of the M-Noise vector are ignored in the matrix-vector product. Please refer to Section 2 of the Appendix for implementation details. Note that these algorithms lower the computational complexity of M-SGD by using only a mini-batch of data for each parameter update.

Results and Observations Under the setting $b \ll N$, estimating a gradient covariance/Fisher matrix from a batch of $b$ gradients should be difficult. To our surprise, the experimental results shown in Figure 1 (b)(e) demonstrate that the generalization performance of M-SGD-[Fisher-b] and M-SGD-[Cov-b] is close to that of M-SGD-Fisher and M-SGD-Cov, which estimate the gradient covariance and Fisher matrices using full gradients. Furthermore, Figure 1 (c)(f) shows that the testing accuracy of [M-SGD-Fisher]-b and [M-SGD-Cov]-b is still maintained even when the M-Noises are sparse, indicating the strong application prospects of M-SGD.

Large Batch Training When the batch size becomes large, the generalization of vanilla SGD is hurt, performing even worse than SGD with small batch sizes hoffer2017 ; keskar2016large . In the same Figure 1 (c)(f), our experiments show that M-SGD with various M-Noise settings can still recover the generalization performance under the same large batch settings (with ghost batch normalization, learning rate tuning, and regime adaptation hoffer2017 ). Thus, our multiplicative perspective on SGD might shed new light on developing large batch training algorithms that maintain both the speed advantage and the generalization guarantee. We leave further investigation along this direction as future work.

## 6 Discussions and Conclusions

In this work, we introduced the Multiplicative SGD (M-SGD) model to interpret the randomness of SGD from the Multiplicative Noise (M-Noise) perspective. First, we found that the M-Noise helps establish a theory connecting the generalization of SGD to a data-dependent regularizer of Rademacher complexity type. Moreover, under the Gaussian M-Noise assumption, the M-SGD model converges strongly to the known SDE of SGD, beyond the weak convergence obtained in li2017stochastic . In addition, based on the M-SGD formulation, a fast algorithm is developed to efficiently inject noise into gradient descent. Using this algorithm, we empirically verified that M-SGD with various desired types of M-Noise can well approximate the behavior of SGD, in the sense of achieving similar generalization performance. Compared to the traditional analytical models based on additive noise, multiplicative noise provides an alternative way to understand SGD, with insightful new results for both theory and application.

As the first work along the M-Noise road, there are several unsolved theoretical challenges, e.g., the relationship between local Rademacher complexity and local SGD complexity, and more general local complexity measures. These open problems are left for future work.

## Acknowledgement

The contributions of the authors are the following: JW came up with the core ideas, contributed to the proof of Theorem 1, implemented all the experiments and wrote most of the paper. WH contributed to the proofs of Theorems 1 and 2 and participated in the paper writing. HX led the research discussions with JW as an intern at Baidu Research and wrote part of the paper. JH participated in the discussion and wrote part of the paper. ZZ led the research on studying the behavior of SGD. With JW he jointly proposed and discussed the research agenda on the multiplicative noise of SGD, proposed the core idea of Section 3, and wrote part of the paper.

## Appendix A Missing Proofs in Main Paper

### a.1 Proof of Proposition 1

###### Proof.

Sampling with replacement

By definition, the random variable $\mathcal{W}_{\rm sgd}$ can be decomposed as

$$\mathcal{W}_{\rm sgd} = \mathcal{W}^1 + \dots + \mathcal{W}^b, \tag{18}$$

where $\mathcal{W}^1,\dots,\mathcal{W}^b$ are i.i.d., each representing one sampling draw. Thus $\mathcal{W}^i$ contains one entry equal to $1/b$ and $N-1$ zeros, at a random index. By its definition, we know

$$\mathbb{E}[w^i_j] = \frac{1}{bN}\ \ \forall j, \qquad \mathbb{E}[(w^i_j)^2] = \frac{1}{b^2N}\ \ \forall j, \qquad \mathbb{E}[w^i_j w^i_k] = 0\ \ \forall j\neq k. \tag{19-21}$$

Thus

$$\mathbb{E}[\mathcal{W}^i] = \frac{1}{bN}\mathbf{1}, \qquad \mathrm{Var}[\mathcal{W}^i] = \mathbb{E}[\mathcal{W}^i(\mathcal{W}^i)^T] - \mathbb{E}[\mathcal{W}^i]\,\mathbb{E}[\mathcal{W}^i]^T = \frac{1}{b^2N}\,I - \frac{1}{b^2N^2}\mathbf{1}\mathbf{1}^T = \frac{1}{b^2N}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big). \tag{22-25}$$

Because $\mathcal{W}^1,\dots,\mathcal{W}^b$ are i.i.d., we have

$$\mathbb{E}[\mathcal{W}_{\rm sgd}] = b\,\mathbb{E}[\mathcal{W}^i] = \frac{1}{N}\mathbf{1}, \qquad \mathrm{Var}[\mathcal{W}_{\rm sgd}] = b\,\mathrm{Var}[\mathcal{W}^i] = \frac{1}{bN}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big). \tag{26-27}$$

Sampling without replacement

Let $\mathcal{W}' = \mathcal{W}'_{\rm sgd}$; by definition, $\mathcal{W}'$ contains $b$ entries equal to $1/b$ and $N-b$ zeros, at random indices. Thus

$$\mathbb{E}[w'_j] = \binom{N-1}{b-1}\frac{1}{b}\Big/\binom{N}{b} = \frac{1}{N}\ \ \forall j, \qquad \mathbb{E}[(w'_j)^2] = \binom{N-1}{b-1}\frac{1}{b^2}\Big/\binom{N}{b} = \frac{1}{bN}\ \ \forall j, \qquad \mathbb{E}[w'_j w'_k] = \binom{N-2}{b-2}\frac{1}{b^2}\Big/\binom{N}{b} = \frac{b-1}{bN(N-1)}\ \ \forall j\neq k. \tag{28-30}$$

Hence

$$\mathbb{E}[\mathcal{W}'_{\rm sgd}] = \frac{1}{N}\mathbf{1}, \qquad \mathrm{Var}[\mathcal{W}'_{\rm sgd}] = \mathbb{E}[\mathcal{W}'_{\rm sgd}(\mathcal{W}'_{\rm sgd})^T] - \mathbb{E}[\mathcal{W}'_{\rm sgd}]\,\mathbb{E}[\mathcal{W}'_{\rm sgd}]^T = \frac{1}{bN}\,I + \frac{b-1}{bN(N-1)}\big(\mathbf{1}\mathbf{1}^T - I\big) - \frac{1}{N^2}\mathbf{1}\mathbf{1}^T = \frac{N-b}{bN(N-1)}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big). \tag{31-34}$$

∎

### a.2 Proof of Theorem 1: first half

Define the Rademacher variables $a_1,\dots,a_N$ taking values $\pm 1$ with equal probability. Define the (local) Rademacher complexity

$$R(\mathcal{A},\theta,\delta,N) = \frac{1}{N}\,\mathbb{E}_{x_1,\dots,x_N}\mathbb{E}_{a_1,\dots,a_N}\sup_f\Big|\sum_{i=1}^N a_i f(x_i)\Big|. \tag{36}$$

Let $g_1,\dots,g_N$ be a sequence of independent standard Gaussian random variables. Define the (local) Gaussian complexity

$$R(\mathcal{G},\theta,\delta,N) = \frac{1}{N}\,\mathbb{E}_{x_1,\dots,x_N}\mathbb{E}_{g_1,\dots,g_N}\sup_f\Big|\sum_{i=1}^N g_i f(x_i)\Big|. \tag{37}$$
###### Theorem (The first part of Theorem 1 in the paper, Lemma 4 in [2]).

There are absolute positive constants $c$ and $C$ such that

$$c\,R(\mathcal{A},\theta,\delta,N) \overset{(38a)}{\le} R(\mathcal{G},\theta,\delta,N) \overset{(38b)}{\le} C\ln N\cdot R(\mathcal{A},\theta,\delta,N). \tag{38}$$
###### Proof.

The proof follows [2, 28, 24].

Indeed, our proof holds not only for the local Rademacher and Gaussian complexities, but also for the original (global) ones. Thus, for simplicity of notation, we omit the $\delta$-ball constraint on $\theta'$ and write the supremum simply as $\sup_f$.

We first prove the inequality (38a). Let $\mu$ be the product probability measure of $(g_1,\dots,g_N)$ and let $c = \mathbb{E}|g_1|$; note that $a_i|g_i|$ and $g_i$ are identically distributed. Then

$$\mathbb{E}_{a}\sup_f\Big|\sum a_i f(x_i)\Big| = \frac{1}{c}\,\mathbb{E}_{a}\sup_f\Big|\sum a_i\Big(\int|g_i|\,\mathrm{d}\mu\Big) f(x_i)\Big| \le \frac{1}{c}\,\mathbb{E}_{a}\int\sup_f\Big|\sum a_i|g_i|\,f(x_i)\Big|\,\mathrm{d}\mu = \frac{1}{c}\,\mathbb{E}_{g}\sup_f\Big|\sum g_i f(x_i)\Big|,$$

where the inequality follows from Jensen's inequality (moving the integral outside the supremum), and the last equality holds since $(a_i|g_i|)_i$ and $(g_i)_i$ are identically distributed. Hence (38a) holds.

Let us now demonstrate (38b). To this end, we first state the following contraction estimate [24]. If $|\alpha_i| \le 1$ for all $i$, then

$$\mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i\alpha_i f(x_i)\Big| \le \mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i f(x_i)\Big|. \tag{48}$$

If we apply (48) to $\alpha_i = |g_i|/\max_{j}|g_j|$, then we get

$$\mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i|g_i|\,f(x_i)\Big| \le \Big(\max_{i=1,\dots,N}|g_i|\Big)\cdot\mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i f(x_i)\Big|, \tag{49}$$

and thus

$$\mathbb{E}_{g}\sup_f\Big|\sum_{i=1}^N g_i f(x_i)\Big| = \mathbb{E}_{g}\mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i|g_i|\,f(x_i)\Big| \le \Big(\mathbb{E}_{g}\max_{i=1,\dots,N}|g_i|\Big)\cdot\mathbb{E}_{a}\sup_f\Big|\sum_{i=1}^N a_i f(x_i)\Big|, \tag{50-52}$$

where the first equality holds since $g_i$ and $-g_i$ are identically distributed, so that we conclude (38b) by noticing that $\mathbb{E}_{g}\max_{i=1,\dots,N}|g_i| \le C\ln N$ ([6], Lemma 11.3).

It remains to show (48). Due to the absolute value inside the supremum and the symmetry of the $a_i$, without loss of generality we can always assume that $\alpha_i \ge 0$. If $N = 2$, we are left to show that

$$\frac{1}{2}\sup_f|\alpha_1 f(x_1)+\alpha_2 f(x_2)| + \frac{1}{2}\sup_f|\alpha_1 f(x_1)-\alpha_2 f(x_2)| \le \frac{1}{2}\sup_f|f(x_1)+f(x_2)| + \frac{1}{2}\sup_f|f(x_1)-f(x_2)|. \tag{53}$$

We can fix $\alpha_1$ and consider the function $F(\alpha_2) = \frac{1}{2}\sup_f|\alpha_1 f(x_1)+\alpha_2 f(x_2)| + \frac{1}{2}\sup_f|\alpha_1 f(x_1)-\alpha_2 f(x_2)|$. It can be directly verified that $F$ is convex in $\alpha_2$, since it is the sum of two convex functions of $\alpha_2$. Also $F(\alpha_2) = F(-\alpha_2)$, and thus for any $|\alpha_2| \le 1$ we have $F(\alpha_2) \le F(1)$. In the same way we treat $\alpha_1$, and we conclude (48) for $N = 2$.

The case of general $N$ follows the same idea by introducing the function

$$F(\alpha_1,\dots,\alpha_N) = \frac{1}{2^{N-1}}\sum_{\text{all } 2^{N-1} \text{ combinations of } (1,\pm 1,\dots,\pm 1)} \sup_f\big|\alpha_1 f(x_1) \pm \alpha_2 f(x_2) \pm \dots \pm \alpha_N f(x_N)\big|, \tag{54}$$

and iteratively applying the convexity-and-symmetry argument coordinate by coordinate yields $F(\alpha_1,\dots,\alpha_N) \le F(1,\dots,1)$, which is (48).

In summary we finish the proof. ∎

### a.3 Proof of Theorem 1: second half

Let $\mathcal{V}_{\rm sgd} = N\mathcal{W}_{\rm sgd} - \mathbf{1}$ be the centered M-Noise of SGD (for sampling without replacement); by definition, the number of entries equal to $N/b - 1$ is $b$ and the number of entries equal to $-1$ is $N - b$. For simplicity, let $N = kb$. Thus in $\mathcal{V}_{\rm sgd}$, the number of entries equal to $k-1$ is $b$ and the number of entries equal to $-1$ is $N-b$.

###### Theorem (The second part of Theorem 1 in the paper).

Assume $N = kb$, $k\in\mathbb{N}$, $k > 1$; then

$$R(\mathcal{V},\theta,\delta,bk) \le \frac{2(k-1)}{k}\,R(\mathcal{A},\theta,\delta,b). \tag{55}$$
###### Proof.

First, we know that for i.i.d. examples $x_1,\dots,x_N$, the following equation holds for any function $F$:

$$\mathbb{E}_{x_1,\dots,x_N}F(x_1,\dots,x_N) = \mathbb{E}_{x_1,\dots,x_N}F(x_{i_1},\dots,x_{i_N}), \tag{56}$$

where $(i_1,\dots,i_N)$ is a permutation of $(1,\dots,N)$.

Thus, by the definition of the SGD complexity

$$R(\mathcal{V},\theta,\delta,N) = \frac{1}{N}\,\mathbb{E}_{x_i}\mathbb{E}_{v_i}\sup_f\Big|\sum_{i=1}^N v_i f(x_i)\Big| \tag{57}$$

and the definition of the M-Noise $\mathcal{V}_{\rm sgd}$, i.e., the number of entries equal to $k-1$ is $b$ and the number of entries equal to $-1$ is $N-b$ in all cases, we can permute the indices of the $v_i$ such that $v_i = k-1$ for $i \le b$ and $v_i = -1$ for $i > b$, without affecting the SGD complexity. Thus we have

 R(\Vcal,θ,δ,N)=1N\Ebbxisupf∣∣∣(<