1 Introduction
With the rise of deep learning, Stochastic Gradient Descent (SGD) has become one of the standard workhorses for optimizing deep models bottou1991stochastic . Studies on the memorization behavior of deep neural networks suggest that commonly used learning algorithms, e.g., SGD, play an important role as implicit regularizers, preventing overparameterized models from converging to minima that generalize poorly zhang2017understanding . More specifically, for SGD it is believed that the inherent randomness, induced by the random sampling strategy, contributes to this implicit regularization effect hu2019quasi ; zhu2018anisotropic . One piece of direct evidence is that large-batch SGD typically generalizes worse than small-batch SGD hoffer2017 ; keskar2016large , since a larger batch size reduces the randomness in SGD. Thus, in order to further demystify deep learning, understanding the randomness in SGD as well as its effect on generalization becomes critical.

Most previous research studied the properties of SGD by modeling the algorithm as gradient descent (GD) plus an unbiased noise term introduced by the random sampling. For example, daneshmand2018escaping ; jin2017escape ; kleinberg2018alternative studied how SGD noise helps the learning dynamics escape from saddle points and local minima. For neural networks with one hidden layer, the implicit regularization effect of SGD was studied in brutzkus2017sgd . Generalization bounds for stochastic gradient Langevin dynamics were obtained in mou2017generalization , shedding light on the regularization role of SGD noise.
Yet another important line of work understands SGD from a continuous-time perspective, where stochastic differential equations (SDEs) serve as the mathematical tool oksendal2003stochastic for analyzing SGD noise. The weak convergence between SGD and a continuous SDE was first established in li2017stochastic , and further efforts built on it to understand SGD and its noise hu2017diffusion ; hu2019quasi ; feng2019uniform . To leverage the SGD noise, hoffer2017 ; jastrzkebski2017three studied approaches to control the scale of SGD noise by tuning batch sizes and learning rates. In addition to the noise scale, zhu2018anisotropic studied the structure of the SGD noise, where the anisotropic property of SGD noise and its benefit for escaping from (bad) local minima were examined. From the Bayesian perspective, SDE-based interpretations mandt2017stochastic ; chaudhari2017stochastic ; smith2018bayesian further suggested that SGD performs variational inference through entropy regularization, which prevents overfitting. Though the SDE of li2017stochastic offers a powerful tool for analyzing SGD mathematically, to what extent the approximation holds in practice is not fully understood. As a reference, the recent study simsekli2019tail argued that SGD noise is heavy-tailed and non-Gaussian, so SDEs driven by Brownian motion might not be the best tool to approximate SGD.
Here we provide, for the first time, insights that explain SGD noise from a minibatch sampling perspective: instead of adopting additive noise models, we propose Multiplicative SGD (MSGD) as a general case of SGD that models the stochastic gradient estimated at each iteration as the matrix-vector product between a matrix whose columns are per-sample gradient vectors, namely the gradient matrix, and a vector of random noise, namely the Multiplicative Noise (MNoise). Compared with the traditional additive interpretation, MNoise has the advantage of decoupling randomness from model parameters, shedding new light on SGD noise. Based on this perspective, we explicitly demonstrate the regularization effects of SGD, introduce a fast algorithm to generate SGD-like noise for studying the effects of SGD noise, and empirically verify the approximation between SGD and the SDE of li2017stochastic . Concisely, our main contributions are:

Result I. Our theoretical analysis of MSGD shows that learning with SGD leads to an organic Structural Risk Minimization framework with a data-dependent regularizer resembling local Rademacher complexity. This finding explains the "implicit regularization" effects of SGD, whereas explicit regularization of SGD with local Rademacher complexity and the benefit of such regularization were studied in yang2019empirical .
Result II. Beyond the weak convergence between SGD and the SDE li2017stochastic , which relies primarily on the moment information of the SGD noise, we show that a special case of MSGD with Gaussian MNoise, namely MSGD-Gaussian, converges strongly to the SDE.

Result III. Favorably, the MSGD model also provides an efficient way to approximate SGD with noise of desired types, including Gaussian noise based on either gradient covariance or Fisher matrices, by equipping MSGD with MNoise drawn from interchangeable random distributions. Using this approach, we empirically verify that SGD noise can be approximated by Gaussian noise without loss of generalization performance, which supports Result II.
Result IV. Moreover, we empirically demonstrate that MSGD can well approximate SGD with desired noises under practical minibatch settings. We design a series of experiments showing that the MNoise of SGD can be well approximated by 1) Bernoulli noise, 2) Gaussian noise with minibatch-estimated Fisher, and 3) sparse Gaussian noise. These results suggest the potential of using MSGD to develop practical learning algorithms.
2 MSGD: Multiplicative Stochastic Gradient Descent
Machine learning problems usually involve minimizing an empirical loss over training data $\{x_i\}_{i=1}^{n}$, $L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell_i(\theta)$, where $\ell_i(\theta)$ is the loss over one example and $\theta \in \mathbb{R}^d$ is the parameter to be optimized. Define the "loss vector" as $\ell(\theta) = (\ell_1(\theta), \dots, \ell_n(\theta))^{T}$; then the gradient matrix is $\nabla\ell(\theta) = (\nabla\ell_1(\theta), \dots, \nabla\ell_n(\theta)) \in \mathbb{R}^{d\times n}$. Let $\mathbf{1}$ denote the all-ones vector; then $\nabla L(\theta) = \frac{1}{n}\nabla\ell(\theta)\,\mathbf{1}$.
SGD
The typical SGD iteration works as follows: it first randomly draws a minibatch of $b$ samples with index set $B_t$, and then updates the parameter using the stochastic gradient estimated from the minibatch and the learning rate $\eta$,

$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{b}\sum_{i\in B_t}\nabla\ell_i(\theta_t)$   (1)
Additive Noise (ANoise)
Traditionally, SGD noise is interpreted from an additive viewpoint, i.e.,

$\frac{1}{b}\sum_{i\in B_t}\nabla\ell_i(\theta_t) = \nabla L(\theta_t) + \xi_t$   (2)

where $\xi_t$ represents the Additive Noise (ANoise) of SGD. We refer to the interpretation of SGD given by Eq. (1) and Eq. (2) as the Additive-SGD (ASGD) model. Note that $\xi_t$ might not be Gaussian simsekli2019tail ; its mean is zero and its covariance depends on the current parameter. Though commonly adopted in the literature li2017stochastic ; zhu2018anisotropic ; chaudhari2017stochastic ; mandt2017stochastic ; smith2018bayesian ; jastrzkebski2017three , it is clear that the ANoise depends on the parameter $\theta_t$; it therefore varies along the optimization path, which complicates understanding and analysis. To overcome this obstacle, many works assumed that the ANoise is constant or upper bounded by some constant chaudhari2017stochastic ; jastrzkebski2017three ; mandt2017stochastic ; zhang2017hitting . Thus a natural question arises: can the noise in SGD be decoupled from the parameters? Fortunately, our multiplicative noise provides a positive answer, as elaborated in the following.
Multiplicative Noise (MNoise)
By the definition of SGD, the randomness of SGD is entirely caused by the minibatch sampling procedure, and this procedure is independent of the current model parameter. Thus there should exist a parameter-independent model that characterizes SGD noise, rather than the aforementioned ASGD. To this end, we propose the following formulation:

$\frac{1}{b}\sum_{i\in B_t}\nabla\ell_i(\theta_t) = \frac{1}{n}\nabla\ell(\theta_t)\,\epsilon_t$   (3)

where $\epsilon_t \in \mathbb{R}^n$ is a random vector characterizing the minibatch sampling process; e.g., for sampling without replacement, $\epsilon_t$ contains $b$ entries equal to $n/b$ and $n-b$ zeros, at random indices.
We hereby use Multiplicative-SGD (MSGD) to denote the modeling of SGD by Eq. (1) and Eq. (3), and Multiplicative Noise (MNoise) to denote $\epsilon_t$. Note that the MNoise is independent of the parameter $\theta_t$. The following Proposition 1 characterizes the properties of the MNoise of SGD.
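To make the multiplicative formulation concrete, the following is a minimal NumPy sketch checking Eq. (3) on a toy least-squares problem: the usual minibatch gradient coincides with the matrix-vector product between the gradient matrix and the sampling vector described above. The quadratic loss and all names are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 200, 5, 20                      # samples, parameters, batch size

# Toy least-squares problem (an illustrative stand-in for the loss in the paper).
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta = rng.normal(size=d)

def grad_matrix(theta):
    """Columns are per-sample gradients of l_i(theta) = 0.5 * (x_i^T theta - y_i)^2."""
    return (X * (X @ theta - y)[:, None]).T          # shape (d, n)

# A minibatch sampled without replacement, and the corresponding MNoise vector:
idx = rng.choice(n, size=b, replace=False)
eps = np.zeros(n)
eps[idx] = n / b                          # b entries equal to n/b, the rest zero

G = grad_matrix(theta)
minibatch_grad = G[:, idx].mean(axis=1)   # the usual SGD stochastic gradient
multiplicative_grad = G @ eps / n         # the MSGD form (1/n) * G @ eps

print(np.allclose(minibatch_grad, multiplicative_grad))   # True
```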
Proposition 1.
(Mean and covariance of MNoise in SGD) For a minibatch sampled with replacement, the MNoise in SGD satisfies
(4) 
For a minibatch sampled without replacement, the MNoise in SGD satisfies
(5) 
The proof is given in Section 1.1 of the Appendix. In the remainder we only consider the sampling-with-replacement case, since most of our results also hold for the other case unless pointed out otherwise.
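A quick Monte Carlo sanity check of Proposition 1 can be run along the following lines. The closed-form covariance it compares against is our own derivation for the with-replacement case, included only because the bodies of Eqs. (4)-(5) are not reproduced in this text; treat it as an assumption rather than a restatement of the proposition.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b, trials = 50, 5, 200_000

def mnoise_with_replacement():
    """Sampling vector eps such that the stochastic gradient is (1/n) * G @ eps."""
    counts = np.bincount(rng.integers(0, n, size=b), minlength=n)
    return (n / b) * counts

samples = np.stack([mnoise_with_replacement() for _ in range(trials)])
emp_mean = samples.mean(axis=0)
emp_cov = np.cov(samples, rowvar=False)

# Closed form we expect for this case (our own derivation, stated here because the
# equation bodies of (4)-(5) are not shown in this text): Cov = (1/b)(n*I - 1 1^T).
expected_cov = (n * np.eye(n) - np.ones((n, n))) / b

print(np.abs(emp_mean - 1.0).max())          # ~0: the mean is the all-ones vector
print(np.abs(emp_cov - expected_cov).max())  # small Monte Carlo error
```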
Beyond SGD, we extend the MNoise to general noise distributions and overload the notation MSGD as:

$\theta_{t+1} = \theta_t - \eta\,\frac{1}{n}\nabla\ell(\theta_t)\,\epsilon_t$   (6)
Note that: 1) MSGD (6) becomes standard GD when $\epsilon_t = \mathbf{1}$, and SGD when $\epsilon_t$ is the sampling vector defined above. 2) The MNoise is independent of the model, the parameter and the dataset. Such decoupling provides a clear picture of the regularization effect of SGD; we elaborate on this point in the next section. 3) One special case deserving particular attention is when $\epsilon_t$ follows a Gaussian distribution, in which case we call Eq. (6) MSGD-Gaussian. Our analysis will later show that the discrete MSGD-Gaussian (6) admits a strong approximation by the continuous SDE of li2017stochastic . Moreover, we will empirically demonstrate that approximating the SGD MNoise by a Gaussian MNoise achieves highly similar regularization effects. Thus it is meaningful to use the SDE as a tool for understanding the generalization benefits of SGD and its variants.
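Since the MNoise is decoupled from the parameter, the same update routine can be driven by interchangeable noise distributions, as the following sketch illustrates on a toy quadratic loss. The scale used for the Gaussian MNoise is only an illustrative guess roughly matching the per-coordinate variance of the sampling vector; it is not the covariance that MSGD-Gaussian actually uses.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, b, lr = 200, 5, 20, 0.05
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_matrix(theta):
    return (X * (X @ theta - y)[:, None]).T            # shape (d, n)

def msgd_step(theta, eps):
    """One MSGD update: theta <- theta - lr * (1/n) * G(theta) @ eps."""
    return theta - lr * grad_matrix(theta) @ eps / n

# Interchangeable MNoise samplers:
def eps_gd():                                           # all-ones vector recovers GD
    return np.ones(n)

def eps_sgd():                                          # minibatch sampling vector
    e = np.zeros(n)
    e[rng.choice(n, size=b, replace=False)] = n / b
    return e

def eps_gaussian():                                     # illustrative Gaussian MNoise
    return 1.0 + rng.normal(scale=np.sqrt(n / b - 1.0), size=n)

theta0 = rng.normal(size=d)
for sampler in (eps_gd, eps_sgd, eps_gaussian):
    t = theta0.copy()
    for _ in range(100):
        t = msgd_step(t, sampler())
    print(sampler.__name__, 0.5 * np.mean((X @ t - y) ** 2))
```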
Connection between ANoise and MNoise
Let the MNoise be $\epsilon_t$; then the corresponding ANoise is $\xi_t = \frac{1}{n}\nabla\ell(\theta_t)\,(\epsilon_t - \mathbf{1})$. Moreover, if $\epsilon_t$ follows a Gaussian distribution, then $\xi_t$ is Gaussian too. This property plays a crucial role in designing the fast algorithm for injecting noise into gradient-based methods in Section 5.1. Though ANoise and MNoise can be converted into each other, the MNoise decouples noise from parameters, which gives new insight into the behavior of SGD. For example, we now use the MNoise perspective to explicitly elaborate the implicit bias of SGD.

3 MSGD Performs Data-Dependent Regularization
This section presents the details of Result I. Recall that $\nabla L(\theta) = \frac{1}{n}\nabla\ell(\theta)\,\mathbf{1}$. In Eq. (6), we write $\epsilon_t = \mathbf{1} + (\epsilon_t - \mathbf{1})$ and rewrite the update as

$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta\Big[\,L(\theta_t) + \frac{1}{n}\big\langle \epsilon_t - \mathbf{1},\ \ell(\theta_t)\big\rangle\Big]$   (7)
Thus learning by MSGD (6) is equivalent to applying GD to an objective with a randomized data-dependent regularization term:
(8)  
We upper bound the random term in MSGD by its local maximum over a ball around the current parameter; as the radius of the ball shrinks, the inequality becomes tighter. The right-hand side of the objective (8) can be treated as the empirical realization of the population objective (9):
(9) 
Explicit regularization of SGD with the local Rademacher complexity over a ball, and the empirical benefit of such regularization in image classification and neural architecture search, were reported in yang2019empirical . The difference is that we show SGD carries an implicit regularization resembling local Rademacher complexity.
For any MNoise distribution, the expected local maximum of the random term above defines a local complexity measure. Note that the components of the noise vector need not be independent. Specifically, 1) when the components are i.i.d. Rademacher variables, the measure is the local Rademacher complexity bartlett2002rademacher ; bartlett2005local ; bartlett2006local ; yang2019empirical ; 2) when the components are i.i.d. standard Gaussian variables, the measure is the local Gaussian complexity bartlett2002rademacher , which is the regularization term corresponding to MSGD-Gaussian with independent Gaussian MNoise; 3) for the SGD MNoise, we call the resulting quantity the local SGD complexity.¹ We provide the following results bounding the local Rademacher, Gaussian and SGD complexities.

¹In the literature, Rademacher/Gaussian complexity is defined sometimes with and sometimes without the absolute value. This makes little difference for obtaining generalization bounds, and we adopt the version with the absolute value.
Theorem 1.
(Local Rademacher, Gaussian and SGD complexity) Let the noise vectors be the Rademacher, Gaussian and SGD random variables, respectively. Then there exist absolute positive constants such that:
(10)
(11) 
The proof can be found in Sections 1.2 and 1.3 of the Appendix.
Theorem 1 tells us that the local Gaussian complexity is equivalent to the local Rademacher complexity, which explains the generalization advantage of MSGD-Gaussian, since regularizing Rademacher complexity is known to benefit generalization mou2018dropout ; yang2019empirical ; bartlett2006local . Though we cannot yet build a perfect bridge between the local SGD complexity and the local Rademacher complexity, in Section 5 we show that MSGD-Gaussian can closely simulate SGD, given a proper covariance for the Gaussian MNoise. Thus we conclude that the local SGD complexity behaves similarly to the local Gaussian and Rademacher complexities, and that the implicit bias of SGD is due to this data-dependent complexity regularizer.
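For intuition, such local complexities can be estimated numerically. The sketch below is a crude Monte Carlo illustration on a toy quadratic loss: the supremum over the ball is approximated by random search, and the SGD case uses the centered MNoise (sampling vector minus the all-ones vector), which is our reading of the definition. The three values are not directly comparable in magnitude because the noise variances differ; the point is only how such quantities can be computed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, b, radius = 200, 5, 20, 0.5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta0 = rng.normal(size=d)

def losses(theta):
    return 0.5 * (X @ theta - y) ** 2                     # per-sample losses l_i(theta)

def random_point_in_ball():
    u = rng.normal(size=d)
    return theta0 + radius * rng.uniform() ** (1.0 / d) * u / np.linalg.norm(u)

# Pre-sample candidate parameters in the ball; the sup is approximated by a max over them.
candidates = np.stack([losses(random_point_in_ball()) for _ in range(500)])   # (500, n)

def local_complexity(noise_sampler, n_noise=500):
    """Monte Carlo estimate of E_noise sup_theta |(1/n) sum_i noise_i * l_i(theta)|."""
    vals = [np.abs(candidates @ noise_sampler() / n).max() for _ in range(n_noise)]
    return float(np.mean(vals))

def rademacher():
    return rng.choice([-1.0, 1.0], size=n)

def gaussian():
    return rng.normal(size=n)

def sgd_centered():
    counts = np.bincount(rng.integers(0, n, size=b), minlength=n)
    return (n / b) * counts - 1.0                         # SGD MNoise minus its mean

for sampler in (rademacher, gaussian, sgd_centered):
    print(sampler.__name__, local_complexity(sampler))
```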
Figures 1(a)-(d) show an empirical comparison of the generalization performance of GD, SGD, the MSGD family, and GD optimizing the loss with a Rademacher regularizer. We clearly observe that SGD and the MSGD family behave similarly to GD-Rademacher, supporting our understanding of the data-dependent regularization effect of SGD and MSGD.
[Figure 1: panels (a)-(f).]
4 The Continuous Approximation of MSGD
This section focuses on presenting Result II. With the implicit bias of MSGD established, we now address its continuous approximation. We first recall the weak approximation between the discrete ASGD and a continuous SDE li2017stochastic ; hu2017diffusion ; feng2019uniform .
Heuristically, identifying the learning rate $\eta$ with a time step, the ASGD iteration (1), (2) can be treated as a discretization of the following SDE

$d\Theta_t = -\nabla L(\Theta_t)\,dt + \big(\eta\,\Sigma(\Theta_t)\big)^{1/2}\,dW_t$   (12)

where $\Sigma(\theta)$ denotes the covariance of the stochastic gradient and $W_t$ is a standard Brownian motion.
It is important to recognize that the noises driving the ASGD iteration (1), (2) and the SDE (12) are independent processes; hence the approximation between them can only be understood in a weak sense.
Theorem 2.
(Weak convergence between ASGD and SDE li2017stochastic ) Under mild assumptions, SGD (1) is an order-1 weak approximation of the SDE (12), i.e., for a general class of test functions,
(13) 
Please refer to Theorem 1 in li2017stochastic for the rigorous statement and proof.
Similarly, the weak approximation also holds for MSGD (1), (3), provided the corresponding MNoise shares the same covariance as the multiplicative noise of SGD, since Theorem 2 only makes use of the moments of the SGD noise. The weak convergence establishes the equivalence of the discrete iteration and the continuous SDE at the level of probability distributions. Nonetheless, pathwise closeness between the two processes is not ensured.
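To illustrate what the weak approximation means operationally, the sketch below compares SGD iterates with an Euler-Maruyama discretization of an SDE of the form (12) on a toy quadratic problem, and looks only at low-order statistics of the final iterate. The batch-size scaling of the diffusion coefficient and the choice of time step equal to the learning rate are modeling assumptions of this sketch rather than details taken from li2017stochastic .

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, b, eta, steps, runs = 200, 3, 10, 0.05, 150, 300
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta0 = rng.normal(size=d)

def per_sample_grads(theta):
    return X * (X @ theta - y)[:, None]            # shape (n, d), row i is grad of l_i

def full_grad(theta):
    return per_sample_grads(theta).mean(axis=0)

def sigma_sqrt(theta):
    """Square root of the stochastic-gradient covariance (per-sample covariance / b)."""
    cov = np.cov(per_sample_grads(theta), rowvar=False) / b
    w, V = np.linalg.eigh(cov)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def run_sgd():
    theta = theta0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=b, replace=False)
        theta = theta - eta * per_sample_grads(theta)[idx].mean(axis=0)
    return theta

def run_sde():
    # Euler-Maruyama for d(Theta) = -grad L dt + (eta * Sigma)^{1/2} dW, time step eta,
    # so the Brownian increment has standard deviation sqrt(eta) per step.
    theta = theta0.copy()
    for _ in range(steps):
        dW = np.sqrt(eta) * rng.normal(size=d)
        theta = theta - eta * full_grad(theta) + np.sqrt(eta) * (sigma_sqrt(theta) @ dW)
    return theta

final_sgd = np.stack([run_sgd() for _ in range(runs)])
final_sde = np.stack([run_sde() for _ in range(runs)])
print(final_sgd.mean(axis=0), final_sde.mean(axis=0))   # first moments should be close
print(final_sgd.var(axis=0), final_sde.var(axis=0))     # second moments should be comparable
```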
MSGD-Gaussian. To obtain a stronger approximation, e.g., pathwise convergence, we need to assume that the MNoise is drawn from a Gaussian distribution, i.e., MSGD-Gaussian. Concisely, Theorem 3 guarantees the strong convergence between MSGD-Gaussian and the SDE (12).
Theorem 3.
The rigorous statement and proof are deferred to Section 1.4 in the Appendix.
The strong convergence guarantees pathwise closeness between the discrete iterates and the continuous process, which indicates similar behavior not only at the level of probability distributions but also at the level of sample paths. In Section 5 (Figure 1 (a)-(d)), we empirically verify that MSGD-Gaussian achieves highly similar regularization effects as SGD, which makes it reasonable to understand SGD via MSGD-Gaussian and its strong approximation, the continuous SDE.
5 The Discrete Approximation of SGD using MSGD
In this section, we study how to approximate SGD using MSGD with MNoise drawn from interchangeable random distributions, with and without minibatch settings. Compared to ASGD, the proposed MSGD can easily generate noises of various useful and desired types at low computational cost, using MNoise drawn from the distributions of interest. In the rest of this section, we first introduce the fast algorithm for implementing MSGD-Gaussian, and then present the details of Result III and Result IV, all based on the fast MSGD-Gaussian algorithm and its variants.
5.1 Fast MSGD-Gaussian: efficient Gaussian noise generation with gradient covariance
Approximating the noise in SGD by a Gaussian one is a commonly used trick zhu2018anisotropic ; jastrzkebski2017three ; wen2019interplay . The target noise is a Gaussian noise whose covariance is the gradient covariance. To obtain such a noise, one would first compute the covariance matrix and then apply a singular value decomposition (SVD) to transform a white noise into the desired one.

However, there are two obstacles in this procedure: 1) evaluating and storing the covariance matrix is computationally unacceptable when both the number of samples and the number of parameters are large; 2) performing the SVD of such a matrix is prohibitively hard when the parameter dimension is extremely large. Furthermore, one needs to repeat 1) and 2) at every parameter update, since the covariance depends on the current parameter. As a compromise, existing works suggest approximating the gradient covariance using only its diagonal or block-diagonal elements wen2019interplay ; zhu2018anisotropic ; jastrzkebski2017three ; martens2015optimizing . In general, there is no guarantee that the diagonal information approximates the full gradient covariance well; in particular, zhu2018anisotropic empirically showed that such a diagonal approximation cannot fully recover the regularization effects of SGD. Thus a more effective approach to generating Gaussian noise with gradient covariance is of both theoretical and practical importance.
Inspired by the MSGD framework (6), we propose a fast algorithm to generate Gaussian-like SGD noise. First, a short calculation shows that
(15) 
In this way, the preferred Gaussian noise can be sampled as a linear transform of a white noise by the gradient matrix. Consequently, MSGD-Gaussian can serve as the approximation of SGD with Gaussian noise such that
(16)  
Fast Implementation
Thanks to the linearity of the differentiation operator, which commutes with taking weighted averages, we can design a fast algorithm (described in Algorithm 1) to implement MSGD-Gaussian in the form of Eq. (16).
Remark: 1) Before the deep learning era, the typical machine learning setting had more samples than parameters. In that regime, the SVD approach to generating Gaussian noise is indeed plausible. However, for deep networks, where the number of parameters is comparable to or exceeds the number of samples, computing full gradients is far more efficient than explicitly evaluating the covariance matrix and performing an SVD, which gives our method its computational advantage over the traditional one. 2) Our method can easily be extended to generate other types of noise besides Gaussian, e.g., Bernoulli noise and minibatch versions of the noises; see the discussion below and the sketch that follows.
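The following sketch illustrates the principle behind the fast generation of Gaussian noise with gradient covariance, on a toy problem where the gradient matrix is available explicitly. It is not a reproduction of Algorithm 1: the normalization and batch-size factors of Eqs. (15)-(16) may differ, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 4
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta = rng.normal(size=d)

G = (X * (X @ theta - y)[:, None]).T          # gradient matrix, shape (d, n)
mean_grad = G.mean(axis=1, keepdims=True)

# Target: Gaussian noise whose covariance is the per-sample gradient covariance
# C = (1/n) * sum_i (g_i - g_bar)(g_i - g_bar)^T.
C = (G - mean_grad) @ (G - mean_grad).T / n

def fast_noise():
    """One noise sample as a weighted combination of per-sample gradients:
    xi = (1/sqrt(n)) * G @ (z - mean(z)),  z ~ N(0, I_n).
    Neither C nor its square root is ever formed."""
    z = rng.normal(size=n)
    return G @ (z - z.mean()) / np.sqrt(n)

samples = np.stack([fast_noise() for _ in range(100_000)])
emp_cov = np.cov(samples, rowvar=False)
print(np.abs(emp_cov - C).max(), "vs typical scale", np.abs(C).max())
```

For a deep network, the weighted combination of per-sample gradients equals the gradient of the correspondingly reweighted training loss, so it can be obtained with a single backward pass; this is the linearity argument above, and we presume it is the essence of Algorithm 1.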
5.2 Approximating the MNoise of SGD by Gaussian noises and noises with independent components
Here we present the details of Result III. First, based on the fast MSGD-Gaussian (16), we unify two types of commonly used Gaussian noise for simulating SGD's behavior: Gaussian noise with gradient covariance (MSGD-Cov) zhu2018anisotropic and Gaussian noise with Fisher (MSGD-Fisher) wen2019interplay .
MSGD-Cov and MSGD-Fisher
First, the gradient covariance differs from the Fisher only by the outer product of the full gradient. Intuitively, MSGD-Cov and MSGD-Fisher should therefore not be far from each other. We can see this using the SDE (12). At the beginning of SGD training, the drift term outweighs the diffusion term in scale zhu2018anisotropic ; shwartz2017opening and dominates the optimization, so the noise term makes almost no contribution, no matter whether it uses the gradient covariance or the Fisher. During the later diffusion stage, however, the gradient is close to zero, so the covariance and the Fisher nearly coincide. In a nutshell, covariance noise and Fisher noise should behave similarly in regularizing the SGD iteration.
Thanks to the MSGD-Gaussian formulation, we can now give a mathematical analysis of the difference between these two types of noise. Let $\epsilon^{F}$ and $\epsilon^{C}$ be the MNoises for generating the Fisher noise and the gradient covariance noise, respectively; then from Eq. (15) and Eq. (16) we have:
(17) 
Note that the centering matrix $I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ centralizes a random vector. Thus the MSGD perspective tells us that the only difference between $\epsilon^{F}$ and $\epsilon^{C}$ is whether the white noise used to generate the MNoise is first centered. On the other hand, since the components of the white noise are already identically distributed with zero mean, and $n$ is extremely large in deep learning with huge training sets, the centering barely changes the white noise, i.e., $\epsilon^{F} \approx \epsilon^{C}$. Therefore the two noises remain close along the whole optimization path, which leads to essentially identical regularization effects when learning deep models.
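The claim that centering barely changes a long white-noise vector is easy to check numerically; the relative change shrinks roughly like $1/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(6)
for n in (10, 1_000, 100_000):
    z = rng.normal(size=n)
    z_centered = z - z.mean()          # effect of the centering matrix I - (1/n) 1 1^T
    print(n, np.linalg.norm(z - z_centered) / np.linalg.norm(z))   # ~ 1/sqrt(n)
```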
MSGD-Bernoulli
To further verify this observation, we introduce MSGD-Bernoulli, which employs a Bernoulli MNoise to approximate the behavior of SGD using the diagonal part of the SGD MNoise covariance matrix. Consider a random vector with i.i.d. components, each being a scaled Bernoulli variable with mean one. Then its mean matches that of the SGD MNoise, and its covariance is diagonal. In this way, the covariance of the Bernoulli MNoise is the diagonal of the covariance of the SGD MNoise. Note that this "diagonal" relationship might not hold for the corresponding ANoises. The Bernoulli MNoise can be viewed as the best approximation of the SGD MNoise among all random vectors with independent components.
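A small Monte Carlo check of this "diagonal" relationship is sketched below. The parameterization of the Bernoulli MNoise (each coordinate equals $n/b$ with probability $b/n$, independently) and the use of without-replacement sampling for the SGD MNoise are our reading of the construction; under this reading the diagonals match exactly in expectation.

```python
import numpy as np

rng = np.random.default_rng(7)
n, b, trials = 40, 8, 100_000

def sgd_mnoise():                              # sampling without replacement
    e = np.zeros(n)
    e[rng.choice(n, size=b, replace=False)] = n / b
    return e

def bernoulli_mnoise():                        # assumed form: (n/b) * Bernoulli(b/n), i.i.d.
    return (n / b) * (rng.random(n) < b / n)

cov_sgd = np.cov(np.stack([sgd_mnoise() for _ in range(trials)]), rowvar=False)
cov_ber = np.cov(np.stack([bernoulli_mnoise() for _ in range(trials)]), rowvar=False)

print(np.abs(np.diag(cov_ber) - np.diag(cov_sgd)).max())   # diagonals agree up to MC error
print(np.abs(cov_ber - np.diag(np.diag(cov_ber))).max())   # Bernoulli off-diagonals ~ 0
```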
Results and Observations
The experimental results shown in Figures 1(a)-(d) demonstrate that, under the same settings, MSGD-Fisher and MSGD-Cov perform almost identically, while the performance of MSGD-Bernoulli closely follows the previous two. Together with our theoretical insights from the MSGD perspective, we conclude that 1) the gradient covariance is effectively equivalent to the Fisher for SGD (validating our theoretical findings), and 2) the MNoise of SGD can be well approximated by noises with independent components, e.g., Gaussian or Bernoulli.
5.3 Practical SGD approximation using minibatch MSGD
MSGD-[Fisher-b] and MSGD-[Cov-b]. To derive our results step by step, we first introduce two intermediate variants, MSGD-[Fisher-b] and MSGD-[Cov-b]. These MSGD variants approximate the behavior of SGD by generating Gaussian random noise from a gradient covariance or Fisher matrix estimated on a minibatch of size b. The implementation of these two algorithms is given in Section 2 of the Appendix. Note that, although minibatch gradients are used to estimate the Fisher/covariance matrix, the generated MNoises (17) are not sparse, since each is still the sum of a constant vector and a Gaussian noise.
[MSGD-Fisher]-b and [MSGD-Cov]-b. We further define the minibatch versions of MSGD-Fisher and MSGD-Cov. Their MNoises are defined as the composition of a minibatch sampling random variable (with batch size b) and a Gaussian random variable. Thus the MNoises are sparse Gaussians with b non-zero Gaussian elements. In this way, b naturally becomes the batch size of MSGD, since the gradients corresponding to the zero elements of the MNoise vector are ignored in the matrix-vector product; a sketch follows below. Please refer to Section 2 of the Appendix for implementation details. Note that these algorithms lower the computational complexity of MSGD by using only a minibatch of data for each parameter update.
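A sketch of how such a sparse MNoise can be generated is given below; the exact form of the non-zero entries (sampling weight plus a Gaussian perturbation) is an assumption made here purely to illustrate why $b$ effectively becomes the batch size.

```python
import numpy as np

rng = np.random.default_rng(8)
n, b = 10_000, 128

def sparse_gaussian_mnoise(scale=1.0):
    """One reading of the sparse MNoise in [MSGD-Fisher]-b / [MSGD-Cov]-b:
    only b randomly chosen coordinates are non-zero, each carrying the sampling
    weight plus a Gaussian perturbation (the exact form is an assumption here)."""
    idx = rng.choice(n, size=b, replace=False)
    eps = np.zeros(n)
    eps[idx] = n / b + scale * rng.normal(size=b)
    return eps, idx

eps, idx = sparse_gaussian_mnoise()
print(np.count_nonzero(eps), "non-zero entries out of", n)
# In the update (1/n) * grad_matrix @ eps, only the b columns indexed by idx are needed,
# so only a minibatch of per-sample gradients has to be computed.
```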
Results and Observations. Estimating a gradient covariance or Fisher matrix from only a minibatch of gradients should be difficult. To our surprise, the experimental results shown in Figures 1(b)(e) demonstrate that the generalization performance of MSGD-[Fisher-b] and MSGD-[Cov-b] is close to that of MSGD-Fisher and MSGD-Cov, which estimate the gradient covariance and Fisher matrices using full gradients. Furthermore, Figures 1(c)(f) show that the testing accuracy of [MSGD-Fisher]-b and [MSGD-Cov]-b is still maintained even when the MNoises are sparse, indicating the strong application prospects of MSGD.
Large Batch Training. When the batch size becomes large, the generalization of vanilla SGD degrades and can be worse than that of SGD with a small batch size hoffer2017 ; keskar2016large . In the same Figures 1(c)(f), our experiments show that MSGD with various MNoise settings can still recover the generalization performance under the same large-batch settings (large batch sizes, with ghost batch normalization, learning rate tuning, and regime adaptation) hoffer2017 . Thus, our multiplicative perspective on SGD might shed new light on developing large-batch training algorithms that maintain both the speed advantage and the generalization guarantee. We leave further investigation along this direction as future work.

6 Discussions and Conclusions
In this work, we introduce the Multiplicative SGD (MSGD) model to interpret the randomness of SGD from the Multiplicative Noise (MNoise) perspective. First, we find that the MNoise helps establish a theory connecting the generalization of SGD to a data-dependent regularizer of Rademacher complexity type. Moreover, under the Gaussian MNoise assumption, the MSGD model converges strongly to the known SDE of SGD, beyond the weak convergence obtained in li2017stochastic . In addition, based on the MSGD formulation, a fast algorithm is developed to efficiently inject noise into gradient descent. Using this algorithm, we empirically verify that MSGD with various desired types of MNoise can well approximate the behavior of SGD, in the sense of achieving similar generalization performance. Compared to the traditional analytical models based on additive noise, we find that multiplicative noise provides an alternative way to understand SGD, with insightful new results for both theory and application.
As the first work along the MNoise direction, several theoretical challenges remain unsolved, e.g., the relationship between the local Rademacher complexity and the local SGD complexity, and more general local complexity measures. These open problems are left for future work.
Acknowledgement
The contributions of the authors are the following: JW came up with the core ideas, contributed to the proof of Theorem 1, implemented all the experiments and wrote most of the paper. WH contributed to the proofs of Theorems 1 and 2 and participated in paper writing. HX led the research discussions with JW as an intern at Baidu Research and wrote part of this paper. JH participated in the discussion and wrote part of the paper. ZZ led the research on studying the behavior of SGD. Together with JW, he proposed and discussed the research agenda on the multiplicative noise of SGD, proposed the core idea of Section 3, and wrote part of the paper.
References
 (1) Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
 (2) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 (3) Peter L Bartlett and Shahar Mendelson. Local rademacher complexities and empirical minimization. Annals of Statistics, 34, 2006.
 (4) Vivek S Borkar and Sanjoy K Mitter. A strong approximation theorem for stochastic recursive algorithms. Journal of optimization theory and applications, 100(3):499–513, 1999.
 (5) Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of NeuroNımes, 91(8), 1991.
 (6) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 (7) Alon Brutzkus, Amir Globerson, Eran Malach, and Shai ShalevShwartz. Sgd learns overparameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
 (8) Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.
 (9) Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
 (10) Yuanyuan Feng, Tingran Gao, Lei Li, JianGuo Liu, and Yulong Lu. Uniformintime weak error analysis for stochastic gradient descent algorithms via diffusion approximation. arXiv preprint arXiv:1902.00635, 2019.
 (11) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.
 (12) Wenqing Hu, Chris Junchi Li, Lei Li, and JianGuo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
 (13) Wenqing Hu, Zhanxing Zhu, Haoyi Xiong, and Jun Huan. Quasipotential as an implicit regularizer for the loss function in the stochastic gradient descent. arXiv preprint arXiv:1901.06054, 2019.
 (14) Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
 (15) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
 (16) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.
 (17) Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does sgd escape local minima? arXiv preprint arXiv:1802.06175, 2018.
 (18) Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110, 2017.

 (19) Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
 (20) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
 (21) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for nonconvex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
 (22) Wenlong Mou, Yuchen Zhou, Jun Gao, and Liwei Wang. Dropout training, datadependent regularization, and generalization bounds. In International Conference on Machine Learning, pages 3642–3650, 2018.
 (23) Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pages 65–84. Springer, 2003.
 (24) Laurent Schwartz and Paul R. Chernoff. Geometry and Probability in Banach Spaces, Lecture 12. Springer Berlin Heidelberg, Berlin, Heidelberg, 1981.
 (25) Ravid ShwartzZiv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
 (26) Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tailindex analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053, 2019.
 (27) Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. International Conference on Learning Representations, 2018.
 (28) Nicole TomczakJaegermann. BanachMazur distances and finitedimensional operator ideals, volume 38. Longman Sc & Tech, 1989.
 (29) Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise, 2019.
 (30) Yingzhen Yang, Xingjian Li, and Jun Huan. An empirical study on regularization of deep neural networks by local rademacher complexity. arXiv preprint arXiv:1902.00873, 2019.
 (31) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
 (32) Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
 (33) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.
Appendix A Missing Proofs in Main Paper
A.1 Proof of Proposition 1
Proof.
Sampling without replacement
By definition, the random variable can be decomposed as
(18) 
where the summands are i.i.d., each representing one sampling draw. Thus each summand contains one non-zero entry and zeros elsewhere, at a random index. By definition, we know
(19)  
(20)  
(21) 
Thus
(22)  
(23)  
(24)  
(25) 
Because are i.i.d., we have
(26)  
(27) 
Sampling with replacement
By definition, the noise vector contains non-zero entries and zeros, at random indices. Thus
(28)  
(29)  
(30) 
Hence
(31)  
(32)  
(33)  
(34) 
∎
A.2 Proof of Theorem 1: first half
Define the Rademacher variables, taking the values ±1 with equal probability, and define the standard Rademacher complexity
(36) 
Consider a sequence of independent standard Gaussian random variables, and define the Gaussian complexity
(37) 
Theorem (The first part of Theorem 1 in the paper, Lemma 4 in [2]).
There are absolute positive constants and such that
(38) 
Proof.
Indeed, our proof holds not only for the local Rademacher and Gaussian complexities, but also for the original Rademacher and Gaussian complexities. Thus, for simplicity of notation, we omit the localization in the function class.
We first prove the inequality (38a). Take the product probability measure as above, and note that the two random sequences are identically distributed. Then
(39)  
(40)  
(41)  
(42)  
(43)  
(44)  
(45)  
(46)  
(47) 
Hence (38a) holds.
Let us now demonstrate (38b). To this end we first propose the following estimate [24]. If , then
(48) 
If we apply (48) to , then we get
(49) 
and thus
(50)  
(since and are identically distributed)  (51)  
(52) 
so that we conclude (38b) by noticing that ([6], Lemma 11.3).
It remains to show (48). Due to the absolute sign inside the sup and the symmetry, without loss of generality we can always assume that . If , we are left to show that
(53)  
We can fix and consider the function . It can be directly verified that it is convex, since it is the sum of two convex functions. Also , and thus for any we have . In the same way , and we conclude (48) for .
The case for general follows the same idea by introducing the function
(54) 
and iteratively , which is (48).
In summary we finish the proof. ∎
A.3 Proof of Theorem 1: second half
Let be the MNoise of SGD, by definition we know that for , the number of is and the number of is . For simplicity, let . Thus in , the number of is and the number of is .
Theorem (The second part of Theorem 1 in the paper).
Assume , then
(55) 
Proof.
First we know that for i.i.d. examples , the following equation holds for any function :
(56) 
where is a permutation of .
Thus by definition of SGD complexity
(57) 
and by the definition of the MNoise, i.e., the numbers of non-zero and zero entries are fixed in any case, we can permute the indices without affecting the SGD complexity. Thus we have