One of the main goals of modern statistical learning theory is to derive algorithm-dependent and data-dependent generalization bounds for learning algorithms and models. A learning algorithm may use a large hypothesis space, but its randomized way of exploring the space controls the actual capacity in a data-dependent manner. As a result, algorithm-dependent bounds usually go beyond classical notions of model capacity, such as VC dimension and Rademacher complexity. For stochastic gradient methods (SGM) in particular, the number of iterations and the step sizes serve as implicit regularization and restrict the growth of model capacity. Algorithm-dependent generalization bounds have been intensively studied for SGM under convex settings (Hardt et al., 2015; Lin and Rosasco, 2016; Lin et al., 2016), but very little is known for the non-convex case. Nevertheless, practitioners believe that such guarantees hold in a regime far beyond existing theory. The prevailing success of stochastic gradient methods is attributed not only to their computational speed, but also to their learning-theoretic merits, summarized as "Train Faster, Generalize Better".
The most important arena for algorithm-dependent bounds is perhaps deep learning. Experiments reveal that algorithm-independent model capacities are too large to guarantee meaningful generalization performance (Zhang et al., 2016). With natural images as inputs, they show that standard neural networks can fit completely noisy labels in the training set. Obviously, such a network has no generalization power at all, and if the capacity of the neural network itself were the only thing controlling generalization performance, the DNN models in real-world use would be at the same risk. Fortunately, a key difference between training with random labels and training with true labels lies in the running time: with random labels, SGD needs significantly more steps to reach an optimal point. Therefore, it is possible that good generalization performance with real labels can be guaranteed by algorithm-dependent bounds for stochastic gradient methods, while the running time for training with random labels becomes too large to yield reasonable bounds. In this sense, the classical wisdom of algorithm-dependent generalization bounds could prove critical to understanding the generalization performance of deep learning, and bounds for stochastic gradient methods with non-convex objectives are central to this question.
Therefore, the goal of this paper is to understand the effect of stochastic gradient methods on generalization performance in non-convex risk minimization. We would also like to emphasize that algorithm-dependent bounds for multi-pass non-convex optimization algorithms play a much less trivial role than their convex counterparts: a single pass of SGD over convex objectives already achieves optimality in stochastic optimization, but in non-convex settings, computational considerations naturally require going through the training data for many more than one pass. We adopt the standard setting in learning theory, where we perform the (regularized) empirical risk minimization procedure:
We are interested in the generalization error, which is defined as the gap between empirical and population loss. We consider this error in expectation with respect to the randomness of the algorithm.
The objective function used by the optimization algorithm may coincide with the loss that defines the generalization error, or may be a surrogate for it (e.g., in classification problems we usually use the hinge loss as a surrogate for the 0-1 loss).
Instead of working on SGD itself, we consider Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of stochastic gradient methods that adds isotropic Gaussian noise in each round of gradient updates, i.e.,
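We assume that the algorithm is initialized with a Gaussian distribution. The stochastic gradients in each round are unbiased estimates of the gradient of the (regularized) empirical objective, which can be decomposed into the stochastic gradient of the unregularized loss plus the gradient of the regularization term. Popular choices include the full gradient and the one-point stochastic gradient computed on a single randomly drawn training example.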
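To make the update concrete, here is a minimal sketch of SGLD with one-point stochastic gradients on a toy one-dimensional problem. The update form w ← w − η·g + sqrt(2η/β)·ξ, the inverse temperature `beta`, the decaying step-size schedule, and the toy loss behind `toy_grad` are assumptions chosen for illustration; they are not taken from the text above.

```python
import math
import random

def toy_grad(w, z):
    """Gradient in w of the toy non-convex loss (w*w - 1)**2 + (w - z)**2 / 2."""
    return 4.0 * w * (w * w - 1.0) + (w - z)

def sgld(data, step_sizes, beta=10.0, init_std=1.0, seed=0):
    """Run SGLD with one-point stochastic gradients.

    Assumed update (hypothetical notation):
        w <- w - eta * g + sqrt(2 * eta / beta) * xi,
    where xi is standard Gaussian noise and g is the gradient on one
    uniformly drawn training example.
    """
    rng = random.Random(seed)
    w = rng.gauss(0.0, init_std)           # Gaussian initialization
    for eta in step_sizes:
        z = rng.choice(data)               # one randomly drawn example
        g = toy_grad(w, z)                 # unbiased estimate of the full gradient
        xi = rng.gauss(0.0, 1.0)           # isotropic Gaussian noise (1-D here)
        w = w - eta * g + math.sqrt(2.0 * eta / beta) * xi
    return w

data = [0.9, 1.1, 1.0, 0.8, 1.2]
steps = [0.01 / math.sqrt(k + 1) for k in range(2000)]
w_final = sgld(data, steps)
```

Replacing `rng.choice(data)` with an average of `toy_grad` over all of `data` gives the full-gradient variant of the same sketch.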
To obtain data-dependent and algorithm-dependent bounds, we adopt two theoretical tools: uniform stability (Elisseeff et al., 2005; Rakhlin et al., 2005) and PAC-Bayesian theory (McAllester, 2003; Germain et al., 2016). These two approaches not only make it convenient to analyze generalization properties along the optimization trajectory, but also provide different viewpoints on the effect of SGLD on generalization. Stability depends only on the relative location of parameters trained on neighboring datasets, and fast rates are usually available. PAC-Bayes bounds, on the other hand, can benefit from norm-based regularization, and they adapt to the optimization trajectory instead of taking worst-case upper bounds.
The main contributions of this paper are thus two-fold. The two generalization bounds obtained by the two methods reveal different aspects in which SGLD controls model complexity. It is important to note that the bounds have no dependence on the dimension of the parameter space, nor do they explicitly depend on the norm of the parameters. Assuming only Lipschitz continuity of the objective function, the generalization bounds are controlled by the aggregated step sizes. The informal versions of our results are stated as follows:
Theorem 1 (Uniform Stability, Informal).
Assuming is uniformly -Lipschitz, let be result of SGLD at -th round. Under regularity conditions on the tail behavior, the following inequality holds, where the expectation in LHS is taken with respect to random draw of training data.
Theorem 2 (PAC-Bayesian Theory, Informal).
For regularized ERM problem with , let be result of SGLD at -th round. Under regularity conditions on the tail behavior and appropriate initialization, the following inequality holds with high probability:
The stability-based bound exhibits a faster rate of convergence, with a complexity factor that depends mainly on the square root of the aggregated step sizes. The PAC-Bayes bound, though it has a slower rate, makes the impact of step sizes in earlier iterations decay with time. The uniform Lipschitz parameter is also replaced with the norm of the actual gradients along the optimization path. Both results greatly advance algorithm-dependent generalization bounds for non-convex stochastic gradient methods in the existing literature (Raginsky et al., 2017; Hardt et al., 2015). With the help of Gaussian noise, they even outperform previous results in the convex case assuming constant : the former bound allows us to perform gradient updates for step sizes . In the second bound, the generalization error is controlled not only by the step-size parameters we choose, but also by how large the actual steps are. In most optimization problems, including deep learning, the norm of the gradient diminishes along the trajectory as the iterates approach a stationary point, even if uniform Lipschitz constants are very large. This phenomenon puts the PAC-Bayesian bound above in a favorable situation, where the earlier large gradient steps are greatly abated by the exponentially decaying factor, while the gradients taken in later stages are inherently small.
1.1 Related Work
The effect of stochastic gradient methods on statistical learning has attracted considerable interest in the existing literature. For linear regression in Hilbert spaces, Lin and Rosasco (2016) and Lin et al. (2016) analyze multi-pass stochastic gradient methods, leading to optimal population risks. More general cases are studied via uniform stability of parameters under norm (Hardt et al., 2015; London, 2016). From the statistical inference perspective, Chen et al. (2016) constructed confidence sets based on the Markov chain induced by SGD for strongly convex objective functions. Most of these works require the objective function to be convex. While Hardt et al. (2015) considered non-convex smooth objective functions, their results require fast decay of step sizes, and the bound has exponential dependence on the smoothness parameter. In the presence of Gaussian noise, our bounds for non-convex objectives become even better than their results in the convex case.
Deliberate injection of Gaussian noise has become a rising star in the literature on non-convex optimization. Ge et al. (2015) and Jin et al. (2017) show that Gaussian noise helps SGD escape second-order saddle points efficiently. Stochastic Gradient Langevin Dynamics, proposed as a discrete version of the Langevin Equation, also plays an important role in optimization and sampling. It is well known that the Langevin Equation asymptotically converges to an equilibrium distribution; see, e.g., Markowich and Villani (2000). This property has been utilized for posterior sampling, known as Langevin Monte Carlo. The discretization error and mixing time are intensively studied by Bubeck et al. (2015) and Nagapetyan et al. (2017) for log-concave distributions. Dalalyan and Tsybakov (2012) also used Langevin MC to approximate the Exponential Weighted Aggregate, and proved PAC-Bayesian bounds for regression learning with a sparsity prior. For non-convex learning and optimization, Raginsky et al. (2017) make the first attempt at bounding the excess risk of non-convex SGLD, combining algorithmic convergence and generalization error. However, their results are based on convergence to equilibrium, which relies on constants in the Poincaré inequality and inevitably leads to exponential dependence on dimension. Though the mixing time can be prohibitive in the non-convex case, Zhang et al. (2017) recently show that the hitting time of SGLD to a small-loss region can be much better, and that the Gaussian noise in SGLD helps to escape shallow local minima. Their results also emphasize the importance of generalization guarantees for discrete-time, non-asymptotic SGLD in non-convex settings.
In addition, several recent works have studied the connection between SGD and stochastic differential equations, such as SME (Li et al., 2015, 2017). Though our results for SGLD do not directly extend to their SDEs with a data-dependent diffusion term, our methods are potentially applicable to generalization error bounds in their settings.
1.2 Why Gaussian Noise is Useful for Generalization?
Previous analyses of Gaussian noise in stochastic gradient methods mainly focus on its benefit for optimization. A natural question is whether it also helps generalization. Before going into our theoretical results, we first illustrate why stability bounds from prior analyses can be very large for non-convex objective functions, and how this can be overcome by adding Gaussian noise. This observation motivates our analysis based on KL divergence and Hellinger distance, which highlights the effect of smooth distributions on generalization error bounds.
Stability-based analysis for gradient algorithms on non-convex losses suffers from a "fence-sitting" situation, as illustrated in Figure 1. Consider a non-convex empirical loss surface with two local minima, divided into two regions by a ridge. If the iterate lies on one side of this ridge, a noiseless first-order method will lead to the local minimum on that side. However, if the iterate comes close to the ridge along its trajectory, a small shift of the loss surface caused by changing one data point can lead it to a completely different local minimum, as we can see from the figure.
To guarantee stability, we need to decide randomly which side to go to when the iterate comes close to the ridge. The noise needs to be isotropic and smooth enough to cross this ridge, as the direction of variation can be quite arbitrary. SGLD tackles the fence-sitting problem by smoothing the probability of going to either side, and by adding noise in subsequent steps to avoid unstable shallow local minima. The bounds for SGD in Hardt et al. (2015) also exploit the randomness of the stochastic gradient, but that noise is not smooth enough, so their bound requires the subsequent steps to be very small in order to keep the iterate not far from the ridge.
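The fence-sitting effect can be reproduced on a one-dimensional double-well loss. The sketch below is only an illustration under assumed choices: the loss (w² − 1)² + tilt·w, where the small `tilt` stands in for the change caused by replacing one data point, the step size, the inverse temperature `beta`, and the noise scale sqrt(2η/β) are all hypothetical. Each noisy run reuses the same seed for both tilts, so the comparison isolates the effect of the perturbation.

```python
import math
import random

def grad(w, tilt):
    """Gradient of the double-well loss (w*w - 1)**2 + tilt * w; the ridge is near w = 0."""
    return 4.0 * w * (w * w - 1.0) + tilt

def descend(tilt, beta=None, eta=0.01, n_steps=1000, seed=0):
    """Gradient descent started on the ridge; Langevin-type noise is added if beta is given."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_steps):
        w -= eta * grad(w, tilt)
        if beta is not None:
            w += math.sqrt(2.0 * eta / beta) * rng.gauss(0.0, 1.0)
    return w

def prob_right(tilt, beta=5.0, n_runs=200):
    """Fraction of noisy runs that end in the right-hand well (w > 0)."""
    hits = sum(descend(tilt, beta=beta, seed=s) > 0 for s in range(n_runs))
    return hits / n_runs

# Noiseless: a tiny change of the loss surface flips the outcome completely.
w_plus = descend(tilt=+1e-3)     # slides into the left well
w_minus = descend(tilt=-1e-3)    # slides into the right well

# Noisy: the probability of ending in either well barely moves.
p_plus = prob_right(+1e-3)
p_minus = prob_right(-1e-3)
```

In the noiseless case the two outputs sit in different wells, so any distance between them is of constant order. With noise, the output is a distribution over both wells, and the perturbation shifts that distribution only slightly.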
Notation: Suppose each data . A pair of neighboring datasets means that and differ on exactly one data point. For a continuous time SDE over , the iteration point at time is denoted as , and corresponding density function is denoted as . For discrete time SGLD run over , the iteration point and its density function at round are written as respectively. All above notations are also suitable for with an additional prime. When analyzing their derivatives, we sometimes omit the subscript for without confusion. is the step size of discrete SGLD at iteration , and . Let be the stochastic gradient operator at round without regularization, and let be the actual stochastic gradient. Without extra explanations, represents the Lipschitz constant of the objective function for any . represents the squared Hellinger distance between density function and .
Now we define an important concept which will be frequently used later:
Definition 1 (non-expansive).
Suppose and are two random points in , and their distributions are denoted as . We say a bivariate functional defined on two density functions, is non-expansive, if for any mapping , there is
It is well known that all f-divergences (including KL divergence and squared Hellinger distance) are non-expansive and jointly convex (Csiszár et al., 2004).
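A small numerical check of non-expansiveness on discrete distributions may help fix ideas. This is a sketch with assumed conventions: the squared Hellinger distance is normalized with a factor 1/2, and the mapping is a deterministic, non-injective map on four states; the helper names are hypothetical.

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger distance, here (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))**2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def kl(p, q):
    """KL divergence sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def pushforward(p, mapping, n_out):
    """Distribution of mapping[X] when X ~ p; mapping[i] is the image of state i."""
    out = [0.0] * n_out
    for i, mass in enumerate(p):
        out[mapping[i]] += mass
    return out

p = [0.1, 0.4, 0.2, 0.3]
q = [0.3, 0.1, 0.4, 0.2]
mapping = [0, 0, 1, 1]              # merges states {0, 1} and {2, 3}

p_out = pushforward(p, mapping, 2)
q_out = pushforward(q, mapping, 2)

h_before, h_after = squared_hellinger(p, q), squared_hellinger(p_out, q_out)
kl_before, kl_after = kl(p, q), kl(p_out, q_out)
```

Applying the same map to both random points can only bring their distributions closer in either divergence, which is the property the stability analysis relies on.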
2.1 Stability and generalization
Stability of an algorithm is closely related to its generalization performance, and this line of research dates back to Bousquet and Elisseeff (2002). Intuitively, the more stable an algorithm is, the better its generalization performance will be. Here, we adopt the notion of uniform stability of a randomized algorithm (Elisseeff et al., 2005; Hardt et al., 2015).
Definition 2 (Uniform Stability).
We say that a randomized algorithm is -uniformly stable with respect to the loss , if for all neighboring datasets , there is
where the expectation is over randomness of the algorithm, and are outputs of on and respectively.
Once a randomized algorithm is uniformly stable, its generalization performance in expectation follows from a standard symmetrization argument (Hardt et al., 2015).
Theorem 3 (Generalization in expectation).
Suppose a randomized algorithm is -uniformly stable, then there is
High-probability bounds with an additional term are also available by assuming uniformly bounded loss (Elisseeff et al., 2005). In this paper, we always take expectation with respect to the randomized learning algorithm when discussing generalization bounds. Under suitable assumptions, it is straightforward to extend our results to high-probability guarantees with respect to the random draw of training data, using McDiarmid's inequality. For simplicity of presentation, we restrict our attention to itself and expected generalization bounds.
2.2 PAC-Bayesian theory
Unlike the uniform stability theory above, which needs to consider the worst case in some sense, the generalization bound implied by PAC-Bayesian theory is completely algorithm-dependent and data-dependent. However, most generalization bounds in PAC-Bayesian form require the loss function to be bounded (McAllester, 1999, 2003), a condition that common losses such as the cross-entropy loss or the hinge loss do not satisfy. Germain et al. (2016) extended previous results to sub-Gaussian losses, but their result introduced an extra additive error term , where is the sub-Gaussian variance factor. To get rid of this additive term and facilitate our later analysis, we first improve the PAC-Bayesian result in Germain et al. (2016) as follows:
For loss function class and data distribution . Given any prior distribution over . If loss class is -subGaussian with respect to , i.e.,
Let be a class of posterior distributions over , with . Then the following inequality holds uniformly for all posterior distributions , with probability :
2.3 Fokker-Planck equation
As we know, the movement of a particle in the -dimensional space influenced by its current state and random forces (here we only consider a simple case), can be characterized by the following stochastic differential equation (SDE):
where is the random position of the particle at time , is the -dimensional random drift vector, and is the -dimensional Brownian motion. Denote the density function of as ; then the Fokker-Planck equation describes the evolution of :
where is the Laplace operator.
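For reference, a standard form of this SDE and Fokker-Planck pair is sketched below. The symbols are assumptions made for illustration (position x_t, drift g, inverse temperature β, density π_t) and need not match the notation used elsewhere in the paper.

```latex
% Assumed notation: x_t position, g drift, \beta inverse temperature,
% B_t standard Brownian motion, \pi_t the density of x_t.
\begin{align*}
  \mathrm{d}x_t &= -g(x_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t, \\
  \frac{\partial \pi_t}{\partial t}
    &= \frac{1}{\beta}\,\Delta \pi_t + \nabla\cdot\bigl(\pi_t\, g\bigr).
\end{align*}
```

The Laplacian term comes from the Brownian motion and the divergence term from the drift, so the two effects on the density can be analyzed separately.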
3 Ideal Case: Generalization Bounds for Langevin Equation
Intuitively, SGLD can be seen as a discretization of the Langevin Equation. Understanding the generalization performance of the ideal continuous-time algorithm provides important insights into the more involved results for the discrete-time algorithm. In this section, we present two generalization error bounds for the continuous-time Langevin Equation, using stability and PAC-Bayesian theory, respectively. We elaborate on the techniques used in our analysis, which gives a high-level view of how generalization bounds for discrete-time SGLD can be obtained.
Consider the following continuous-time Langevin Equation, where is (regularized) empirical objective function.
where is the standard Brownian motion in .
Assume the pdf of is , then it satisfies a Fokker-Planck equation:
3.1 Uniform Stability
We are going to bound uniform stability with respect to loss function, which directly controls generalization in expectation:
For uniform stability, we assume the following condition, which is slightly weaker than uniform Lipschitz continuity. Note that the generalization performance is defined in terms of the loss function , which may not be continuous, whereas the Lipschitz assumption is imposed on the objective of our algorithm, which can be a surrogate function for .
As a result, we have for different samples , ,
We first control via squared Hellinger distance:
The last inequality holds by assuming is uniformly bounded by .
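One way to see a bound of this type is the following sketch. The notation is assumed for illustration: π and π′ are the densities of the outputs on the two neighboring datasets, the loss ℓ is bounded in absolute value by C, and H² is normalized as ½∫(√π − √π′)²; the constant changes with the normalization.

```latex
% Assumed notation: |\ell| \le C, H^2(\pi,\pi') = \tfrac12 \int (\sqrt{\pi}-\sqrt{\pi'})^2 .
\begin{align*}
\bigl|\mathbb{E}\,\ell(w) - \mathbb{E}\,\ell(w')\bigr|
  &= \Bigl|\int \ell\,\bigl(\sqrt{\pi}-\sqrt{\pi'}\bigr)\bigl(\sqrt{\pi}+\sqrt{\pi'}\bigr)\Bigr| \\
  &\le C \Bigl(\int \bigl(\sqrt{\pi}-\sqrt{\pi'}\bigr)^2\Bigr)^{1/2}
         \Bigl(\int \bigl(\sqrt{\pi}+\sqrt{\pi'}\bigr)^2\Bigr)^{1/2} \\
  &\le C \,\sqrt{2 H^2(\pi,\pi')}\cdot 2
   \;=\; 2\sqrt{2}\,C\,H(\pi,\pi').
\end{align*}
```

The second line is the Cauchy-Schwarz inequality, and the third uses ∫(√π + √π′)² = 2 + 2∫√(ππ′) ≤ 4.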
Compared with Hardt et al. (2015), the bound based on f-divergence can better characterize stability with non-convex objectives: through one step of iteration, the distance between parameters can expand a lot due to the shape of the non-convex surface, but f-divergences are non-expansive under the same transformation, and will decrease by convolution with Gaussian noise. This property makes it possible to obtain much better bounds.
Under above assumptions, the expected generalization error for continuous-time Langevin Equation is bounded by:
According to the analysis above, we only need to bound from above.
Apparently, at time , . We then estimate :
The last equality is due to integration by parts. Technical conditions such as uniform decaying tails of and can be found in (Risken, 1989). We then proceed to calculate the part induced by gradient update (with coefficient ) and those induced by Gaussian convolution (with coefficient ) individually, which can be described as follows:
Integrating through time and plugging into the estimate above, we have:
3.2 PAC-Bayesian Bounds
We can also obtain PAC-Bayesian bounds for Fokker-Planck Equation with finite .
Let prior distribution . Assume that is -subGaussian with respect to . Within a class of posteriors with uniform upper bound , the following holds for Langevin Dynamics with probability :
We only need to bound the KL divergence to prior distribution .
We use the Cauchy-Schwarz inequality in the second step, and the constant will be determined later. The first term is minus the Fisher information , which can be upper bounded by itself using the logarithmic Sobolev inequality (Markowich and Villani, 2000):
Letting and plugging into the log-Sobolev inequality, we get:
Solving for with initial value , we get:
Since we use Gaussian prior, the second term in the expectation can be directly calculated as , making the bound dependent on norm of the parameter. This is undesirable in the high-dimensional settings: as , concentrates around with high probability, resulting in a term linearly dependent on . Fortunately, this can be eliminated by imposing a small regularization term. ∎
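The differential inequality solved in the argument above is handled by an integrating-factor (Grönwall-type) step. As a generic sketch, with placeholder symbols y, a, b that are not the paper's notation:

```latex
% Placeholders: y(t) >= 0 is the quantity being bounded, a > 0 is a decay rate,
% and b(t) >= 0 is a forcing term (e.g. a squared gradient norm).
\begin{align*}
  y'(t) \le -a\,y(t) + b(t)
  \;\Longrightarrow\;
  \frac{\mathrm{d}}{\mathrm{d}t}\bigl(e^{at}\,y(t)\bigr) \le e^{at}\,b(t)
  \;\Longrightarrow\;
  y(T) \le e^{-aT}\,y(0) + \int_0^T e^{-a(T-t)}\,b(t)\,\mathrm{d}t .
\end{align*}
```

The weight e^{-a(T−t)} is what discounts contributions from early times, which matches the exponentially decaying factor discussed for the PAC-Bayesian bound.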
Instead of minimizing empirical risk itself, we consider the regularized ERM problem with regularization term . To make the gradient of cancel out with the term, we choose . Using the same method of analysis, we get:
Using the same methods as before, we get:
Under the same assumptions as in Proposition 2, the Langevin Equation for regularized ERM problem with satisfies:
By assuming uniform -Lipschitzness of , we can get a simpler upper bound:
4 Stability of Discrete-Time SGLD
Though the ideal continuous-time Langevin Equation attains small generalization error, this does not imply bounds for discrete-time SGLD algorithms. Most previous analyses relate discrete-time and continuous-time analyses by estimating the discretization gap, which usually results in at least linear dependence on (Raginsky et al., 2017). In our analyses, we directly construct various SDEs that are similar to the Langevin Equation, based on the discrete-time updates. This technique makes it possible to circumvent the potentially large gaps between discrete-time and continuous-time algorithms, as we can see in this section and the next.
In this section, we will consider the stability of SGLD algorithm for non-convex objectives. To begin with, we will give the stability result of Langevin Monte Carlo (LMC), a special case of SGLD which uses full gradient in each iteration. LMC is closer to continuous-time algorithm, relatively easy to analyze and reaches the uniform stability of . However, from the practical view, SGLD with a randomly drawn example in each round is much more attractive. Hence we extend our methods and provide analyses for SGLD algorithms. A simple analysis is first presented with stability bound of . When step sizes are small, a lot more decrease in squared Hellinger distance can be acquired, and the bound can be improved to . We also obtain a rough estimate for larger step sizes. Combining two results together, a stability bound that nearly matches the ideal case is obtained for SGLD.
4.1 Stability of Langevin Monte Carlo
We consider the following LMC algorithm, which uses full gradients in each update.
To give an intuitive analysis, suppose two neighboring datasets differ only in the -th data point. Then one can divide each iteration into two parts: the first part just updates and with gradients over the same data and , i.e.
and then we obtain and by adding Gaussian noise and replacing the gradient of sample in by the gradient of sample , i.e. . In the first step, the squared Hellinger distance does not increase because of the non-expansive property. For the second step, one can view them as consecutive SDEs with drift term of order . Hence we can prove that the increment of after one iteration is of order , which leads to the following generalization bound.
Theorem 5 (Generalization Error of LMC).
Let be the result of LMC at the -th round. Under regularity conditions on the tail behavior, the following inequality holds:
where the expectation is taken over the randomness of training data.
4.2 Stability of SGLD - A Succinct Analysis
As the random draw of a training example is more popular in practice, it is desirable to analyze the generalization properties of SGLD. In the rest of this section, we will assume , where is the index of the randomly drawn training example. We will first present a simple analysis of the stability of SGLD. Though the resulting bound is not optimal, the analysis illustrates important principles for understanding how SGLD helps stability. In the following, we will derive upper bounds for recursively. There are two possible cases for :
If , then SGLD implemented over or will use the same gradient mapping, i.e. , then we have
Furthermore let , by the convexity of squared Hellinger distance (which is implied by joint convexity of f-divergences), there is
So in this case, the SGLD update is non-expansive with respect to .
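The convexity step can be checked numerically on small discrete distributions. In this sketch the two pairs of distributions stand for two possible outcomes of the random index, mixed with weight `lam`; the numbers and the 1/2 normalization of the squared Hellinger distance are assumptions made for illustration.

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger distance, here (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))**2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def mix(lam, p1, p2):
    """Mixture lam * p1 + (1 - lam) * p2 of two discrete distributions."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]

# Two cases, each with a pair of output distributions (one per neighboring dataset).
p1, q1 = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
p2, q2 = [0.1, 0.1, 0.8], [0.2, 0.5, 0.3]
lam = 0.25

# Distance between the mixtures versus the mixture of the distances.
lhs = squared_hellinger(mix(lam, p1, p2), mix(lam, q1, q2))
rhs = lam * squared_hellinger(p1, q1) + (1.0 - lam) * squared_hellinger(p2, q2)
```

Joint convexity guarantees `lhs <= rhs`, so a bound proved separately for each value of the random index carries over to the averaged distributions.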
If , we have nothing but the limited step size in hand. The increase of f-divergence can be bounded through norm-based shifts in parameter space only under smoothness conditions, which is where the Gaussian noise helps. Therefore, we expand the discrete-time update into a stochastic process, where the effect of gradient flow is smoothed by Gaussian noise at each time .
, the update can be interpolated as:
However, is not a Markov process, as it always involves the initial random point . Using the same technique as in Raginsky et al. (2017), we define . The mimicking-distribution result (Gyöngy, 1986) guarantees that the solution to the following SDE has the same one-time marginals as .
The corresponding Fokker-Planck equation for above process is:
We also have counterparts for the neighboring dataset, denoted as . With the help of these PDEs, we can bound the variation of squared Hellinger distance.
As in the ideal case, we can compute that
For , we have
Combining above two cases and using the convexity of squared Hellinger distance, we obtain
Putting them together, we get the following guarantee for SGLD:
Consider rounds of SGLD with parameters and . If we assume
the loss function is uniformly bounded by ;
, the gradients of objective function satisfy
Then we have the following generalization bound in expectation
4.3 Stability of SGLD - An Improved Analysis
Though the above analysis of the stability of SGLD is intuitive, the result is not satisfactory, as the bound has a gap compared with Langevin MC. Technically, if we choose in the -th round, both and will be smoothed by the Gaussian noise, and their squared Hellinger distance will decrease by a quadratic information-type term. This term was completely ignored in the succinct analysis, and by making use of it we can also obtain a fast rate for SGLD.
Before proceeding into improved bound, we will first introduce a framework for combining different stability results. This is motivated by time-varying step sizes in SGLD: as step size changes, the best method of estimation may be different. To utilize their respective advantages, we first prove the following theorem.
Suppose there are two types of bivariate functionals and between p.d.f.s for estimating the stability of SGLD, and there are constants and depending only on such that can be bounded by
For a SGLD algorithm with step sizes , assume we can estimate and by
Moreover, assume and are nonexpansive and convex, then for any integer , we can bound stability by
We assume there is a mixed process that uses samples for the first steps and samples for the remaining steps. We denote the corresponding parameters and p.d.f.s by and .
Here by nonexpansiveness and that for step the mixed process uses sample set ’, .
Note that for , then .
Therefore, we obtain
When the step sizes are large, e.g., , this step will make a contribution larger than in the succinct bound. However, a stochastic gradient step can change a distribution within at most scale with respect to distance. So if step sizes are large, a rough estimate based on distance will be better.
First, it is easy to see that stability can be well-controlled by distance for bounded loss.