With the rise of deep learning, Stochastic Gradient Descent (SGD) has become one of the standard workhorses for optimizing deep models bottou1991stochastic . Studies on the memorization behavior of deep neural networks suggest that commonly used learning algorithms, e.g., SGD, play an important role as implicit regularizers, preventing over-parameterized models from converging to minima that generalize poorly zhang2017understanding . More specifically, for SGD it is believed that the inherent randomness, induced by the random sampling strategy, contributes to its implicit regularization effects hu2019quasi ; zhu2018anisotropic . One direct piece of evidence is that large-batch SGD typically generalizes worse than small-batch SGD hoffer2017 ; keskar2016large , since a larger batch size reduces the randomness in SGD. Thus, to further demystify deep learning, understanding the randomness in SGD and its effect on generalization becomes critical.
Most previous research studied the properties of SGD by modeling the algorithm as gradient descent (GD) plus an unbiased noise term introduced by the random sampling. For example, daneshmand2018escaping ; jin2017escape ; kleinberg2018alternative studied the mechanism by which SGD noise helps the learning dynamics escape from saddles and local minima. For neural networks with one hidden layer, the implicit regularization effect of SGD was studied in brutzkus2017sgd . Generalization bounds for stochastic gradient Langevin dynamics were obtained in mou2017generalization , shedding light on the regularization role of SGD noise.
Yet another important line of work understands SGD from the continuous-time perspective, where stochastic differential equations (SDEs) serve as mathematical tools oksendal2003stochastic for analyzing SGD noise. The weak convergence between SGD and a continuous SDE was first established by li2017stochastic , from which further efforts have been made to understand SGD and its noise hu2017diffusion ; hu2019quasi ; feng2019uniform . To leverage SGD noise, hoffer2017 ; jastrzkebski2017three studied approaches to control its scale through tuning batch sizes and learning rates. Beyond the noise scale, zhu2018anisotropic studied the structure of SGD noise, where the anisotropic property of the noise and its benefits in escaping from (bad) local minima were examined. From the Bayesian perspective, SDE-based interpretations mandt2017stochastic ; chaudhari2017stochastic ; smith2018bayesian further suggested that SGD performs variational inference through entropy regularization, which prevents over-fitting. Though the SDE li2017stochastic offers a powerful tool for analyzing SGD mathematically, to what extent the approximation holds in practice is not fully understood. As a reference, a recent study simsekli2019tail argued that SGD noise is heavy-tailed and non-Gaussian, so an SDE driven by Brownian motion might not be the best tool for approximating SGD.
Here we provide, for the first time, insights into SGD noise from a mini-batch sampling perspective: instead of adopting an additive noise model, we propose Multiplicative SGD (M-SGD) as a generalization of SGD that models the stochastic gradient estimated at each iteration as the matrix-vector product between a matrix whose columns are the per-example gradient vectors, namely the gradient matrix, and a vector of random noise, namely the Multiplicative Noise (M-Noise). Compared with the traditional additive interpretation, M-Noise has the advantage of decoupling the randomness from the model parameters, shedding new light on SGD noise. Based on this novel perspective, we explicitly demonstrate the regularization effects of SGD, introduce a fast algorithm that generates SGD-like noise to study its effects, and empirically verify the approximation between SGD and the SDE li2017stochastic . Concisely, our main contributions are:
Result I - Our theoretical analysis of M-SGD shows that learning with SGD leads to an organic Structural Risk Minimization framework with a data-dependent regularizer resembling local Rademacher complexity. This finding explains the "implicit regularization" effects of SGD; the explicit regularization of SGD with local Rademacher complexity and the benefits of such regularization were previously studied in yang2019empirical .
Result II - Beyond the weak convergence between SGD and the SDE li2017stochastic , which relies primarily on moment information of the SGD noise, we show that a special case of M-SGD with Gaussian M-Noise, namely M-SGD-Gaussian, converges strongly to the SDE.
Result III - Favorably, the M-SGD model also provides an efficient way to approximate SGD with noises of desired types, including Gaussian noises based on either the gradient covariance or the Fisher matrix, where M-SGD is equipped with M-Noise drawn from interchangeable random distributions. Using this approach, we empirically verify that SGD noise can be approximated by Gaussian noise without loss of generalization performance, which supports Result II.
Result IV - Moreover, we empirically demonstrate that M-SGD can well approximate SGD with desired noises under practical mini-batch settings. We design a series of systematic experiments showing that the M-Noise of SGD can be well approximated by 1) Bernoulli noise, 2) Gaussian noise with a mini-batch estimated Fisher matrix, and 3) sparse Gaussian noise. These results suggest the potential of using M-SGD to develop practical learning algorithms.
2 M-SGD: Multiplicative Stochastic Gradient Descent
Machine learning problems usually involve minimizing an empirical loss over training data, $L(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i(\theta)$, where $\ell_i(\theta)$ is the loss over one example and $\theta \in \mathbb{R}^d$ is the parameter to be optimized. Define the "loss vector" as $\ell(\theta) = (\ell_1(\theta), \dots, \ell_n(\theta))^T$; then the gradient matrix is $\nabla \ell(\theta) = (\nabla \ell_1(\theta), \dots, \nabla \ell_n(\theta)) \in \mathbb{R}^{d \times n}$. Let $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^n$; then $\nabla L(\theta) = \frac{1}{n}\nabla \ell(\theta)\mathbf{1}$.
The typical SGD iteration works as follows: it first randomly draws a mini-batch of $b$ samples with index set $B_t$, and then performs the parameter update using the stochastic gradient estimated on the mini-batch and the learning rate $\eta$,
$$\theta_{t+1} = \theta_t - \frac{\eta}{b}\sum_{i \in B_t} \nabla \ell_i(\theta_t). \quad (1)$$
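As a concrete illustration (our own toy sketch, not an experiment from the paper), the iteration takes only a few lines of numpy; the per-example loss 0.5*(theta - x_i)^2 and all constants below are hypothetical:

```python
import numpy as np

def sgd_step(theta, grad_fn, data, batch_size, lr, rng):
    """One SGD step: draw a mini-batch index set (with replacement here),
    average the per-example gradients, and update the parameter."""
    n = len(data)
    idx = rng.choice(n, size=batch_size, replace=True)
    g = np.mean([grad_fn(theta, data[i]) for i in idx], axis=0)
    return theta - lr * g

# Toy problem: per-example loss 0.5 * (theta - x_i)^2, so grad_i = theta - x_i
rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0])
grad_fn = lambda th, x: th - x
theta = 0.0
for _ in range(500):
    theta = sgd_step(theta, grad_fn, data, batch_size=2, lr=0.1, rng=rng)
# theta fluctuates around the empirical minimizer, the data mean 2.5
```

The residual fluctuation around the minimizer is exactly the SGD noise the rest of the section analyzes.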
Additive Noise (A-Noise)
Traditionally, the SGD noise is interpreted from an additive viewpoint, i.e.,
$$\frac{1}{b}\sum_{i \in B_t} \nabla \ell_i(\theta_t) = \nabla L(\theta_t) + \epsilon_t, \quad (2)$$
where $\epsilon_t$ represents the Additive Noise (A-Noise) of SGD. We call the interpretation of SGD by Eq. (1) and Eq. (2) the Additive-SGD (A-SGD) model. Note that $\epsilon_t$ might not be Gaussian simsekli2019tail ; its mean is zero and, for sampling with replacement, its covariance is $\Sigma(\theta_t) = \frac{1}{b}\big(\frac{1}{n}\sum_{i=1}^n \nabla\ell_i(\theta_t)\nabla\ell_i(\theta_t)^T - \nabla L(\theta_t)\nabla L(\theta_t)^T\big)$. Though commonly adopted in the literature li2017stochastic ; zhu2018anisotropic ; chaudhari2017stochastic ; mandt2017stochastic ; smith2018bayesian ; jastrzkebski2017three , the A-Noise clearly depends on the parameter $\theta$; it therefore varies along the optimization path, which causes trouble for understanding and analysis. To overcome this obstacle, many works assume that the A-Noise is constant or upper bounded by some constant chaudhari2017stochastic ; jastrzkebski2017three ; mandt2017stochastic ; zhang2017hitting . A natural question thus arises: can the noise in SGD be decoupled from the parameters? Fortunately, our multiplicative noise provides a positive answer, as elaborated in the following.
Multiplicative Noise (M-Noise)
By the definition of SGD, the randomness of SGD is caused entirely by the mini-batch sampling procedure, and this procedure is independent of the current model parameter. Thus there should exist a parameter-independent (i.e., $\theta$-independent) model characterizing SGD noise, rather than the aforementioned A-SGD. To this end, we propose the following formulation:
$$\frac{1}{b}\sum_{i \in B_t} \nabla \ell_i(\theta_t) = \frac{1}{n}\nabla \ell(\theta_t) W_t, \quad (3)$$
where $W_t \in \mathbb{R}^n$ is a random vector characterizing the mini-batch sampling process; e.g., for sampling without replacement, $W_t$ contains $b$ entries equal to $n/b$ and $n-b$ zeros, at random positions.
We hereby use Multiplicative-SGD (M-SGD) to denote the model of SGD given by Eq. (1) and Eq. (3), and Multiplicative-Noise (M-Noise) to denote $W_t$. Note that the M-Noise is independent of the parameter $\theta$. The following Proposition 1 characterizes the properties of the M-Noise of SGD.
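The multiplicative form is easy to sanity-check numerically. In the sketch below (our own illustration, with an arbitrary random gradient matrix), the without-replacement mini-batch gradient coincides with the product of the gradient matrix and the sampling vector:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 6, 3, 2

# Per-example gradients as columns of the gradient matrix (arbitrary values)
G = rng.standard_normal((d, n))

# M-Noise sampling vector for a batch drawn without replacement:
# b entries equal to n/b at random positions, the rest zero.
idx = rng.choice(n, size=b, replace=False)
W = np.zeros(n)
W[idx] = n / b

# Mini-batch gradient the usual way vs. the multiplicative form (1/n) G W
g_batch = G[:, idx].mean(axis=1)
g_mult = G @ W / n
assert np.allclose(g_batch, g_mult)
```

The zero entries of W simply drop the unselected columns, so the two expressions agree exactly.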
(Mean and covariance of the M-Noise in SGD) For mini-batches sampled with replacement, the M-Noise in SGD satisfies
$$\mathbb{E}[W] = \mathbf{1}, \qquad \mathrm{Cov}(W) = \frac{1}{b}\left(n I_n - \mathbf{1}\mathbf{1}^T\right). \quad (4)$$
For mini-batches sampled without replacement, the M-Noise in SGD satisfies
$$\mathbb{E}[W] = \mathbf{1}, \qquad \mathrm{Cov}(W) = \frac{n-b}{b(n-1)}\left(n I_n - \mathbf{1}\mathbf{1}^T\right). \quad (5)$$
The proof is given in Section 1.1 of the Appendix. In the remainder we only consider the with-replacement case, since most of our results also hold for the other case unless pointed out otherwise.
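For the with-replacement sampling vector, the mean and covariance in Proposition 1 can be checked by Monte Carlo. The sketch below (our own illustration) compares empirical estimates against E[W] = 1 and Cov(W) = (n I - 11^T)/b:

```python
import numpy as np

rng = np.random.default_rng(2)
n, b, trials = 5, 2, 200_000

# With-replacement M-Noise: (n/b) times a multinomial count vector,
# since each of the b draws picks one of the n examples uniformly.
counts = rng.multinomial(b, np.ones(n) / n, size=trials)
W = counts * (n / b)

mean = W.mean(axis=0)
cov = np.cov(W, rowvar=False)

# Proposition 1 (with replacement): E[W] = 1, Cov(W) = (n*I - 11^T) / b
expected_cov = (n * np.eye(n) - np.ones((n, n))) / b
assert np.allclose(mean, np.ones(n), atol=0.05)
assert np.allclose(cov, expected_cov, atol=0.1)
```

The negative off-diagonal entries reflect that drawing one example more often forces others to be drawn less often.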
Besides SGD, we extend the M-Noise to general cases and overload the notation M-SGD as:
$$\theta_{t+1} = \theta_t - \frac{\eta}{n}\nabla \ell(\theta_t) W_t, \qquad \mathbb{E}[W_t] = \mathbf{1}. \quad (6)$$
Note that: 1) M-SGD (6) reduces to standard GD when $W_t = \mathbf{1}$, and to SGD when $W_t$ is the mini-batch sampling vector of Eq. (3). 2) The M-Noise is independent of the model, the parameters and the dataset. Such decoupling provides a clear picture of the regularization effect of SGD; we elaborate on this point in the next section. 3) One special case deserving particular attention is when $W_t$ is Gaussian, i.e., $W_t \sim \mathcal{N}(\mathbf{1}, \Sigma_W)$, and we call Eq. (6) in this case M-SGD-Gaussian. Our analysis will later show that the discrete M-SGD-Gaussian (6) is strongly approximated by a continuous SDE li2017stochastic . Moreover, we will empirically demonstrate that approximating the SGD M-Noise by a Gaussian one achieves highly similar regularization effects. Thus it is meaningful to use the SDE as a tool for understanding the generalization benefits of SGD and its variants.
Connection between A-Noise and M-Noise
Let the M-Noise be $W_t$; then the corresponding A-Noise is $\epsilon_t = \frac{1}{n}\nabla\ell(\theta_t)(W_t - \mathbf{1})$. Moreover, under the assumption that $W_t$ follows a Gaussian distribution, $\epsilon_t$ (conditioned on $\theta_t$) is Gaussian too. This property plays a crucial role in designing the fast algorithm for injecting noise into gradient-based methods, as shown in Section 5.1. Though A-Noise and M-Noise can be converted into each other, the M-Noise decouples the noise from the parameters, which gives us new insights into the behavior of SGD. For example, we now use the M-Noise perspective to explicitly elaborate the implicit bias of SGD.
3 M-SGD Performs Data-Dependent Regularization
This section presents the details of Result I. Let us first recall that $\nabla L(\theta) = \frac{1}{n}\nabla\ell(\theta)\mathbf{1}$. In Eq. (6), let $W_t = \mathbf{1} + (W_t - \mathbf{1})$ and rewrite it as
Thus learning by M-SGD (6) is equivalent to applying GD to optimize an objective with a randomized data-dependent regularization term:
We upper bound the random term in M-SGD by its local maximum over an $r$-ball, and as $r \to 0$ the inequality becomes tighter. The right-hand side of the objective (8) can be treated as the empirical realization of the population objective (9):
The explicit regularization of SGD with local Rademacher complexity over an $r$-ball, and the empirical benefits of such regularization in image classification and neural architecture search, have been reported in yang2019empirical . The difference is that we show SGD carries an implicit regularization resembling local Rademacher complexity.
For any M-Noise $W$, the quantity defined above gives a local complexity measure.
Note that the components of $W$ might not be independent. Specifically: 1) for i.i.d. Rademacher components, the measure is the local Rademacher complexity bartlett2002rademacher ; bartlett2005local ; bartlett2006local ; yang2019empirical ; 2) for i.i.d. standard Gaussian components, it is the local Gaussian complexity bartlett2002rademacher , which is the regularization term of M-SGD-Gaussian with independent Gaussian M-Noise; 3) for the SGD noise, we name the measure the local SGD complexity. (In the literature, some define Rademacher/Gaussian complexity with an absolute value inside and some without; this does not make a significant difference for obtaining generalization bounds, and we adopt the version with the absolute value.) We provide the following results bounding the local Rademacher, Gaussian and SGD complexities.
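The equivalence of these complexity measures can be probed numerically on a toy finite class. The sketch below (our own illustration, with hypothetical loss vectors standing in for the values of the loss over a ball) estimates the Rademacher and Gaussian variants by Monte Carlo, using the absolute-value convention from the footnote:

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 50, 20000

# A small toy "loss vector" class: a handful of hypothetical loss vectors,
# standing in for ell(theta') as theta' ranges over a ball around theta.
loss_vecs = rng.standard_normal((8, n))

def local_complexity(noise):
    """Monte Carlo estimate of E sup_f | <W, ell_f> / n | over the toy class."""
    inner = np.abs(noise @ loss_vecs.T) / n      # (trials, num functions)
    return inner.max(axis=1).mean()

rad = local_complexity(rng.choice([-1.0, 1.0], size=(trials, n)))   # Rademacher noise
gauss = local_complexity(rng.standard_normal((trials, n)))          # Gaussian noise

# Theorem 1: the two complexities agree up to absolute constants.
assert 0.5 * rad <= gauss <= 3.0 * rad
```

On this toy class the two estimates land within a constant factor of each other, as the theorem predicts.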
(Local Rademacher, Gaussian and SGD complexity)
Let the M-Noises be the Rademacher, Gaussian and SGD random variables, respectively. Then there exist absolute constants such that:
The proof can be found in Sections 1.2 and 1.3 of the Appendix.
Theorem 1 tells us that local Gaussian complexity is equivalent to local Rademacher complexity, which explains the generalization advantage of M-SGD-Gaussian, since regularizing Rademacher complexity is known to benefit generalization mou2018dropout ; yang2019empirical ; bartlett2006local . Though we cannot yet build a perfect bridge between local SGD complexity and local Rademacher complexity, in Section 5 we will show that M-SGD-Gaussian can closely simulate SGD, given a proper covariance for the Gaussian M-Noise. Thus we conclude that local SGD complexity works similarly to local Gaussian and local Rademacher complexity, and the implicit bias of SGD is due to this data-dependent complexity regularizer.
Figures 1(a)(d) show an empirical comparison of the generalization performance of GD, SGD, the M-SGD family, and GD optimizing the loss with a Rademacher regularizer. We clearly observe that SGD and the M-SGD family behave similarly to GD-Rademacher, supporting our understanding of the data-dependent regularization effect of SGD and M-SGD.
4 The Continuous Approximation of M-SGD
This section primarily focuses on presenting Result II of our work. With the implicit bias of M-SGD established, we now address its continuous approximation. We first recall the weak approximation between the discrete A-SGD and the continuous SDE li2017stochastic ; hu2017diffusion ; feng2019uniform .
Please refer to Theorem 1 in li2017stochastic for the rigorous statement and proof.
The weak approximation makes use only of the moments of the SGD noise. Weak convergence establishes the equivalence of the discrete iteration and the continuous SDE in the sense of probability distributions; nonetheless, path-wise closeness between the two processes is not ensured.
M-SGD-Gaussian. To obtain a stronger approximation, e.g., path-wise convergence, we need to assume that the M-Noises are drawn from a Gaussian distribution, i.e., M-SGD-Gaussian. Concisely, Theorem 3 guarantees the strong convergence between M-SGD-Gaussian and the SDE (12).
The rigorous statement and proof are deferred to Section 1.4 in the Appendix.
The strong convergence guarantees path-wise closeness between the discrete M-SGD-Gaussian iterates and the continuous SDE solution, which indicates close behavior not only at the level of probability distributions but also at the level of individual sample paths. In Section 5 (Figures 1(a)(d)), we will empirically verify that M-SGD-Gaussian achieves regularization effects highly similar to SGD, which makes it reasonable to understand SGD via M-SGD-Gaussian and its strong approximation, the continuous SDE.
5 The Discrete Approximation of SGD using M-SGD
In this section we study how to approximate SGD using M-SGD with M-Noise drawn from interchangeable random distributions, with or without mini-batch settings. Compared to A-SGD, the proposed M-SGD can easily generate noises of various useful types with low computational complexity. In the rest of this section, we first introduce the fast algorithm for implementing M-SGD-Gaussian, and then present the details of Result III and Result IV, all based on the Fast M-SGD-Gaussian algorithm and its variants.
5.1 Fast M-SGD-Gaussian: efficient Gaussian noise generation with gradient covariance
Approximating the noise in SGD by a Gaussian one is a commonly used trick zhu2018anisotropic ; jastrzkebski2017three ; wen2019interplay . The target noise is a Gaussian with the gradient covariance $\Sigma(\theta)$, i.e., $\epsilon \sim \mathcal{N}(0, \Sigma(\theta))$. To obtain such noise, one would first compute the covariance matrix and then apply a singular value decomposition (SVD), $\Sigma = U \Lambda U^T$, to transform a white noise $z \sim \mathcal{N}(0, I_d)$ into the desired noise, $\epsilon = U \Lambda^{1/2} z$.
However, there are two obstacles in the above generation procedure: 1) evaluating and storing the covariance matrix is computationally unacceptable when both the sample size and the dimension are large; 2) performing an SVD on such a matrix is prohibitively expensive when the dimension is extremely large. Furthermore, one needs to repeat 1) and 2) at every parameter update, since the covariance depends on the parameter. As a compromise, current works suggest approximating the gradient covariance using only its diagonal or block-diagonal elements wen2019interplay ; zhu2018anisotropic ; jastrzkebski2017three ; martens2015optimizing . In general, there is no guarantee that the diagonal information approximates the full gradient covariance well; specifically, zhu2018anisotropic empirically showed that such a diagonal approximation cannot fully recover the regularization effects of SGD. Thus a more effective approach to generating Gaussian noise with the gradient covariance is of both theoretical and empirical importance.
Inspired by the M-SGD framework (6), we propose a fast algorithm to generate Gaussian-like SGD noise. First, a short calculation shows that
In this way, the preferred Gaussian noise can be sampled via a single matrix-vector product with the gradient matrix. Moreover, we can use M-SGD-Gaussian as an approximation of SGD with Gaussian noise such that
Thanks to the linearity of the differentiation operator, and the fact that it commutes with weighted averaging, we can design a fast algorithm (described in Algorithm 1) to implement M-SGD-Gaussian (in the form of Eq. (16)).
Remark: 1) Before the deep learning era, the typical machine learning setting had more samples than parameters. In that regime, the SVD approach to generating Gaussian noise is indeed plausible. However, for deep networks where the number of parameters exceeds the number of samples, or where both numbers are large, computing full gradients is far more efficient than explicitly evaluating the covariance matrix and performing an SVD, which gives our method a computational advantage over the traditional one. 2) Our method easily extends to generating other types of noise besides Gaussian, e.g., Bernoulli noise and mini-batch versions of the noises. See below for more discussion.
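To make the computational point concrete, the sketch below (our own numpy illustration, not the paper's Algorithm 1) draws the Gaussian M-Noise as u = 1 + sqrt(n/b)(z - mean(z)) for white noise z, so that a single matrix-vector product (1/n) G u yields a stochastic gradient whose noise covariance is the with-replacement SGD covariance, without ever forming the covariance matrix or running an SVD:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, b = 400, 5, 5

G = rng.standard_normal((d, n))      # gradient matrix at the current parameter
g = G @ np.ones(n) / n               # full gradient, (1/n) G 1

# Gaussian M-Noise: u = 1 + sqrt(n/b) * (z - mean(z)), z ~ N(0, I_n).
# Then (1/n) G u = g + Gaussian noise with the SGD (with-replacement) covariance.
trials = 10000
Z = rng.standard_normal((trials, n))
U = 1.0 + np.sqrt(n / b) * (Z - Z.mean(axis=1, keepdims=True))
noise = U @ G.T / n - g              # one stochastic-gradient noise per row

emp_cov = np.cov(noise, rowvar=False)
target = ((G @ G.T) / n - np.outer(g, g)) / b   # (1/b)((1/n) G G^T - g g^T)
assert np.allclose(emp_cov, target, atol=0.05)
```

Each sample costs one length-n Gaussian draw and one d-by-n matrix-vector product, instead of a d-by-d covariance and its factorization.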
5.2 Approximating the M-Noise of SGD by Gaussian and component-independent noises
Here we present the details of Result III. First, based on the Fast M-SGD-Gaussian (16), we unify two types of commonly used Gaussian noise for simulating SGD's behavior: Gaussian noise with the gradient covariance (M-SGD-Cov) zhu2018anisotropic and Gaussian noise with the Fisher matrix (M-SGD-Fisher) wen2019interplay .
M-SGD-Cov and M-SGD-Fisher
First, we know $\Sigma(\theta) = \frac{1}{b}\left(F(\theta) - \nabla L(\theta)\nabla L(\theta)^T\right)$, where $F(\theta) = \frac{1}{n}\sum_{i} \nabla\ell_i(\theta)\nabla\ell_i(\theta)^T$ is the (empirical) Fisher. Intuitively, M-SGD-Cov and M-SGD-Fisher should not be far from each other. We can see this using the SDE (12). At the beginning of SGD training, the drift term dominates the diffusion term in scale zhu2018anisotropic ; shwartz2017opening and drives the optimization, so the noise term contributes little, whether it is gradient-covariance noise or Fisher noise. During the later diffusion stage, however, the gradient becomes close to zero, and thus $\Sigma(\theta) \approx \frac{1}{b}F(\theta)$. In a nutshell, covariance noise and Fisher noise should behave similarly in regularizing the SGD iteration.
Thanks to the M-SGD-Gaussian formulation, we can now give a mathematical analysis of the difference between these two types of noise. Let the M-Noises for generating Fisher noise and gradient-covariance noise be as defined above; then from Eq. (15) and Eq. (16), we have:
Note that the matrix $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ centralizes a random vector. Thus the M-SGD perspective tells us that the only difference between the two M-Noises is whether the white noise used to generate the M-Noise is first centralized. On the other hand, since the components of the white noise are already identically distributed with zero mean, and $n$ is extremely large in deep learning with huge training sets, its empirical mean is close to zero and the centralization barely changes the white noise. Therefore the two M-Noises nearly coincide along the whole optimization path, which leads to identical regularization effects when learning deep models.
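The effect of centralization can be quantified directly. In the sketch below (our own illustration; the sqrt(n/b) scaling is used purely for concreteness), the centralized and uncentralized M-Noise vectors differ by a relative amount of order 1/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(7)
n, b = 50_000, 32

z = rng.standard_normal(n)

# One M-Noise uses the raw white noise z, the other first centralizes it.
# Since z.mean() ~ 1/sqrt(n), the two vectors are nearly identical for large n.
u_raw = 1.0 + np.sqrt(n / b) * z
u_centered = 1.0 + np.sqrt(n / b) * (z - z.mean())

rel = np.linalg.norm(u_raw - u_centered) / np.linalg.norm(u_centered - 1.0)
assert rel < 0.05
```

With n in the tens of thousands the relative gap is a fraction of a percent, matching the argument above.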
To further verify this observation, we introduce M-SGD-Bernoulli, which employs Bernoulli M-Noise to approximate the behavior of SGD using (part of) the diagonal of its M-Noise covariance matrix. Consider a random vector with i.i.d. components taking value $n/b$ with probability $b/n$ and $0$ otherwise. Then its mean is $\mathbf{1}$ and its covariance is diagonal, and this covariance matches the diagonal of the covariance of the SGD M-Noise. Note that this "diagonal" relationship might not hold for the corresponding A-Noises. The Bernoulli M-Noise can be viewed as the best approximation of the SGD M-Noise among all random vectors with independent components.
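Assuming the Bernoulli components take value n/b with probability b/n, the claimed diagonal relationship can be checked by Monte Carlo against the without-replacement sampling vector; the sketch below is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, b, trials = 6, 2, 200_000

# Bernoulli M-Noise: i.i.d. components equal to n/b with probability b/n, else 0
W_bern = (rng.random((trials, n)) < b / n) * (n / b)

# SGD M-Noise (without replacement): exactly b random entries equal to n/b.
# A double argsort turns uniform draws into per-row ranks; rank < b marks the batch.
ranks = rng.random((trials, n)).argsort(axis=1).argsort(axis=1)
W_sgd = (ranks < b) * (n / b)

cov_bern = np.cov(W_bern, rowvar=False)
cov_sgd = np.cov(W_sgd, rowvar=False)

# The Bernoulli covariance is (near) diagonal and its diagonal matches that of
# the SGD M-Noise covariance: each component has variance n/b - 1.
assert np.allclose(np.diag(cov_bern), np.diag(cov_sgd), atol=0.1)
assert np.allclose(cov_bern - np.diag(np.diag(cov_bern)), 0.0, atol=0.05)
```

The Bernoulli noise keeps the per-component variance but discards the negative cross-correlations that mini-batch sampling induces.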
Results and Observations
The experimental results shown in Figures 1(a)(d) demonstrate that, under the same settings, M-SGD-Fisher and M-SGD-Cov perform almost identically, while the performance of M-SGD-Bernoulli tightly follows the previous two. Together with our theoretical insights from the M-SGD perspective, we can conclude that 1) the gradient covariance is effectively equivalent to the Fisher for SGD (validating our theoretical findings), and 2) the M-Noise of SGD can be well approximated by noises with independent components, e.g., Bernoulli or independent Gaussian noise.
5.3 Practical SGD approximation using mini-batch M-SGD
M-SGD-[Fisher-b] and M-SGD-[Cov-b] To derive our results step by step, we first introduce two intermediate algorithms, M-SGD-[Fisher-b] and M-SGD-[Cov-b]. These M-SGD variants approximate the behavior of SGD by using mini-batch estimated gradient covariance or Fisher matrices, with batch size $b$, to generate Gaussian random noises. The implementations of these two algorithms are described in Section 2 of the Appendix. Note that, although mini-batch gradients are used to estimate the Fisher/covariance matrix, the generated M-Noises (17) are not sparse, since each is still the sum of a constant and a Gaussian noise.
[M-SGD-Fisher]-b and [M-SGD-Cov]-b We further define mini-batch versions of M-SGD-Fisher and M-SGD-Cov. Here the M-Noises are defined as the composition of a mini-batch sampling random variable (with batch size $b$) and a Gaussian random variable; thus the M-Noises are sparse Gaussian vectors with $b$ non-zero Gaussian elements. In this way, $b$ naturally becomes the batch size of M-SGD, as the gradients corresponding to the zero elements of the M-Noise vector are ignored in the matrix-vector product. Please refer to Section 2 of the Appendix for implementation details. Note that these algorithms lower the computational complexity of M-SGD by using only a mini-batch of data for each parameter update.
Results and Observations Under mini-batch settings, estimating a gradient covariance/Fisher matrix from only a batch of gradients should be difficult. To our surprise, the experimental results shown in Figures 1(b)(e) demonstrate that the generalization performance of M-SGD-[Fisher-b] and M-SGD-[Cov-b] is close to that of M-SGD-Fisher and M-SGD-Cov, which estimate the gradient covariance and Fisher matrices using full gradients. Furthermore, Figures 1(c)(f) show that the test accuracy of [M-SGD-Fisher]-b and [M-SGD-Cov]-b is maintained even when the M-Noises are sparse, indicating the strong application prospects of M-SGD.
Large Batch Training In particular, when the batch size becomes large, the generalization of vanilla SGD is hurt, performing even worse than SGD with a small batch size hoffer2017 ; keskar2016large . In the same Figures 1(c)(f), our experiments show that M-SGD with various M-Noise settings can still recover the generalization performance under the same large-batch settings (with ghost batch normalization, learning rate tuning, and regime adaptation hoffer2017 ). Thus, our multiplicative perspective on SGD might shed new light on developing large-batch training algorithms that maintain both the speed advantage and generalization guarantees. We leave further investigation along this direction as future work.
6 Discussions and Conclusions
In this work, we introduce the Multiplicative SGD (M-SGD) model to interpret the randomness of SGD from the Multiplicative Noise (M-Noise) perspective. First, we find that the M-Noise helps establish a theory connecting the generalization of SGD to a data-dependent regularizer of Rademacher-complexity type. Moreover, under Gaussian M-Noise assumptions, the M-SGD model converges strongly to the known SDE of SGD, beyond the weak convergence obtained in li2017stochastic . In addition, based on the M-SGD formulation, a fast algorithm is developed to efficiently inject noise into gradient descent. Using this algorithm, we empirically verify that M-SGD with various desired types of M-Noise can well approximate the behavior of SGD, in the sense of achieving similar generalization performance. Compared to traditional analytical models based on additive noise, multiplicative noise provides an alternative way to understand SGD, with insightful new results for both theory and application.
As the first work along the M-Noise direction, several theoretical challenges remain open, e.g., the precise relationship between local Rademacher complexity and local SGD complexity, and more general local complexity measures. We leave these open problems for future work.
The contributions of the authors are as follows: JW came up with the core ideas, contributed to the proof of Theorem 1, implemented all the experiments, and wrote most of the paper. WH contributed to the proofs of Theorems 1 and 2 and participated in the paper writing. HX led the research discussions with JW during an internship at Baidu Research and wrote part of the paper. JH participated in the discussion and wrote part of the paper. ZZ led the research on studying the behavior of SGD; with JW he jointly proposed and discussed the research agenda on the multiplicative noise of SGD, proposed the core idea of Section 3, and wrote part of the paper.
- (1) Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
- (2) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- (3) Peter L Bartlett and Shahar Mendelson. Local rademacher complexities and empirical minimization. Annals of Statistics, 34, 2006.
- (4) Vivek S Borkar and Sanjoy K Mitter. A strong approximation theorem for stochastic recursive algorithms. Journal of optimization theory and applications, 100(3):499–513, 1999.
- (5) Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.
- (6) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- (7) Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
- (8) Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.
- (9) Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
- (10) Yuanyuan Feng, Tingran Gao, Lei Li, Jian-Guo Liu, and Yulong Lu. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation. arXiv preprint arXiv:1902.00635, 2019.
- (11) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.
- (12) Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
- (13) Wenqing Hu, Zhanxing Zhu, Haoyi Xiong, and Jun Huan. Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent. arXiv preprint arXiv:1901.06054, 2019.
- (14) Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
- (15) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
- (16) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.
- (17) Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does sgd escape local minima? arXiv preprint arXiv:1802.06175, 2018.
- (18) Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110, 2017.
- (19) Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
- (20) James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
- (21) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
- (22) Wenlong Mou, Yuchen Zhou, Jun Gao, and Liwei Wang. Dropout training, data-dependent regularization, and generalization bounds. In International Conference on Machine Learning, pages 3642–3650, 2018.
- (23) Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pages 65–84. Springer, 2003.
- (24) Laurent Schwartz and Paul R. Chernoff. Geometry and Probability in Banach Spaces, Lecture 12. Springer Berlin Heidelberg, Berlin, Heidelberg, 1981.
- (25) Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
- (26) Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053, 2019.
- (27) Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. International Conference on Learning Representations, 2018.
- (28) Nicole Tomczak-Jaegermann. Banach-Mazur distances and finite-dimensional operator ideals, volume 38. Longman Sc & Tech, 1989.
- (29) Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise, 2019.
- (30) Yingzhen Yang, Xingjian Li, and Jun Huan. An empirical study on regularization of deep neural networks by local rademacher complexity. arXiv preprint arXiv:1902.00873, 2019.
- (31) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
- (32) Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
- (33) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.
Appendix A Missing Proofs in Main Paper
A.1 Proof of Proposition 1
Sampling without replacement
By definition, the random variable can be decomposed as
where the components are i.i.d., each representing one sampling draw. Thus each contains a single one and zeros elsewhere, at a random index. By its definition, we know
Because are i.i.d., we have
Sampling with replacement
Let , by definition, contain s and zeros, with random index. Thus
A.2 Proof of Theorem 1: first half
Define the Rademacher variables as taking the values $\pm 1$ with equal probability. Define the standard Rademacher complexity
Let there be a sequence of independent Gaussian random variables. Define the Gaussian complexity
There are absolute positive constants $c$ and $C$ such that
Indeed, our proof holds not only for the local Rademacher and Gaussian complexities, but also for the original Rademacher and Gaussian complexities. Thus, for simplicity of notation, we omit the localization constraint and use the shortened notation.
We first prove inequality (38a). Set $\mu$ to be the product probability measure, and note that the relevant random variables are identically distributed. Then
Hence (38a) holds.
If we apply (48) to , then we get
(since the two random variables are identically distributed) (51)
It remains to show (48). Due to the absolute sign inside the sup and the symmetry, without loss of generality we can always assume that . If , we are left to show that
We can fix and consider the function . It can be directly verified that it is convex in , since it is the sum of two convex functions in . Also , and thus for any we have . In the same way , and we conclude (48) for .
The case of general follows the same idea by introducing the function
and iterating, which yields (48).
In summary we finish the proof. ∎
A.3 Proof of Theorem 1: second half
Let be the M-Noise of SGD, by definition we know that for , the number of is and the number of is . For simplicity, let . Thus in , the number of is and the number of is .
Theorem (The second part of Theorem 1 in the paper).
Assume , then
First we know that for i.i.d. examples , the following equation holds for any function :
where is a permutation of .
Thus by definition of SGD complexity
and the definition of the M-Noise , i.e., the number of is and the number of is in all cases, we can permute the indices of , such that , , without affecting the SGD complexity. Thus we have