# Generalization Error Bounds with Probabilistic Guarantee for SGD in Nonconvex Optimization

The success of deep learning has led to a rising interest in the generalization property of the stochastic gradient descent (SGD) method, and stability is one popular approach to study it. Existing works based on stability have studied nonconvex loss functions, but only considered the generalization error of the SGD in expectation. In this paper, we establish various generalization error bounds with probabilistic guarantee for the SGD. Specifically, for both general nonconvex loss functions and gradient dominant loss functions, we characterize the on-average stability of the iterates generated by SGD in terms of the on-average variance of the stochastic gradients. Such characterization leads to improved bounds for the generalization error for SGD. We then study the regularized risk minimization problem with strongly convex regularizers, and obtain improved generalization error bounds for proximal SGD. With strongly convex regularizers, we further establish the generalization error bounds for nonconvex loss functions under proximal SGD with high-probability guarantee, i.e., exponential concentration in probability.

## 1 Introduction

Many machine learning applications can be formulated as risk minimization problems, in which each data sample $z$ is assumed to be generated by an underlying multivariate distribution $\mathcal{D}$. The loss function $\ell(\cdot\,; z)$ measures the performance on the sample $z$, and its form depends on the specific application, e.g., square loss for linear regression problems, logistic loss for classification problems, and cross-entropy loss for training deep neural networks. The goal is to solve the following generic population risk minimization (PRM) problem over a certain parameter space $\Omega$.

$$\min_{w\in\Omega}\; f(w) := \mathbb{E}_{z\sim\mathcal{D}}\,\ell(w; z). \tag{PRM}$$

Directly solving the PRM can be difficult in practice, as either the distribution $\mathcal{D}$ is unknown or evaluating the expectation in the loss function incurs a high computational cost. To avoid such difficulties, one usually draws a set of $n$ data samples $S := \{z_1, \ldots, z_n\}$ from the distribution $\mathcal{D}$, and instead solves the following empirical risk minimization (ERM) problem.

$$\min_{w\in\Omega}\; f_S(w) := \frac{1}{n}\sum_{k=1}^{n} \ell(w; z_k). \tag{ERM}$$

The ERM serves as an approximation of the PRM with finite samples. In particular, when the number $n$ of data samples is large, one hopes that the solution $w_S$ found by optimizing the ERM with the data set $S$ generalizes well, i.e., that it also induces a small loss on the population risk. The gap between these two risk functions is referred to as the generalization error at $w_S$, and is formally written as

$$\text{(generalization error)} := |f_S(w_S) - f(w_S)|. \tag{1}$$
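To make the gap in (1) concrete, the following Python sketch estimates it on a toy problem. Everything in it is an illustrative assumption rather than part of the paper: the scalar least-squares loss, the synthetic distribution in `draw_samples`, and the use of a large fresh sample as a stand-in for the population risk $f$.

```python
import random

def loss(w, z):
    """Squared loss of a scalar linear model; z = (x, y). Illustrative choice."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def empirical_risk(w, samples):
    """f_S(w): average loss over the given data set."""
    return sum(loss(w, z) for z in samples) / len(samples)

def draw_samples(n, rng):
    """Hypothetical distribution D: x ~ U(-1, 1), y = 2x + small Gaussian noise."""
    out = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        out.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return out

def erm_solution(samples):
    """Closed-form minimizer of f_S for this scalar least-squares toy problem."""
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    return sxy / sxx

rng = random.Random(0)
S = draw_samples(50, rng)
w_S = erm_solution(S)

# The population risk f(w_S) is not available in closed form here, so it is
# approximated by the average loss on a large independent sample.
fresh = draw_samples(20000, rng)
gap = abs(empirical_risk(w_S, S) - empirical_risk(w_S, fresh))
```

Here `gap` is a Monte Carlo proxy for $|f_S(w_S) - f(w_S)|$; it shrinks as the size of `S` grows, which is the behavior the bounds in this paper quantify.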

Various theoretical frameworks have been established to study the generalization error from different aspects (see related work for references). This paper adopts the stability framework (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005).

For a particular learning algorithm, its stability corresponds to how stable the output of the algorithm is with regard to variations in the data set. As an example, consider two data sets $S$ and $\bar{S}$ that differ in one data sample, and denote by $w_S$ and $w_{\bar{S}}$ the outputs of the algorithm when applied to solve the ERM with the data sets $S$ and $\bar{S}$, respectively. Then, the stability of the algorithm measures the gap between the output function values of the algorithm on the perturbed data sets.

The stability framework has been applied to study the generalization property of the output produced by learning algorithms, and various notions of stability have been proposed that provide probabilistic guarantees for the generalization error (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005). Recently, the stability framework has been further developed to study the generalization performance of the output produced by the stochastic gradient descent (SGD) method from various theoretical aspects (Hardt et al., 2016; Charles & Papailiopoulos, 2017; Mou et al., 2017; Yin et al., 2017; Kuzborskij & Lampert, 2017). These studies showed that the output of SGD can achieve a vanishing generalization error after multiple passes over the data set as the sample size $n \to \infty$. These results partly justify the success of SGD in training complex objectives such as deep neural networks. Two metrics are typically used to measure the generalization error. One is the generalization error in expectation given by

$$\left|\mathbb{E}\left[f_S(w_{T,S}) - f(w_{T,S})\right]\right|, \tag{2}$$

where $w_{T,S}$ corresponds to the output of SGD at the $T$-th iteration when applied to solve the ERM with the data set $S$, and the expectation is taken over the randomness of both the algorithm and the draw of the data. The second one is the generalization error bound with probabilistic guarantee, i.e., for any $\epsilon > 0$ the quantity

$$\mathbb{P}\left(\left|f_S(w_{T,S}) - f(w_{T,S})\right| < \epsilon\right) \tag{3}$$

converges to one as $n \to \infty$. Compared to the in-expectation guarantee, the probabilistic guarantee is a stronger metric that guarantees a small generalization error with high probability.

The focus of this paper is on the generalization error of nonconvex optimization, and we adopt the stronger metric of probabilistic guarantee. In particular, we develop tighter bounds on the generalization error of SGD for smooth nonconvex optimization, and then further explore the impact of the gradient dominance condition as well as regularization on the generalization error. Although these topics have been studied very recently in other studies, our results turn out to offer different insights. We summarize our contributions as follows in the context of the existing art.

### Our Contributions with Comparison to Existing Art

For general nonconvex functions, the generalization error of SGD has been studied in (Hardt et al., 2016; Mou et al., 2017; Kuzborskij & Lampert, 2017) under the stability framework, all in the in-expectation sense. In contrast, this paper adopts the stronger notion of probabilistic guarantee. In particular, we propose a new analysis of the on-average stability of the iterates generated by SGD that exploits the optimization properties. Specifically, we show that the on-average stability of SGD is bounded by the average variance of the stochastic gradient over the randomness of the data sets. Such results improve the generalization error bounds in (Hardt et al., 2016), which depend on the uniform upper bound for the norm of the stochastic gradient. By employing existing characterizations of the mean square generalization error for randomized learning algorithms, our characterization of the on-average stability of SGD further leads to a probabilistic guarantee for the corresponding generalization errors.

For nonconvex functions that satisfy the gradient dominance condition, (Charles & Papailiopoulos, 2017) analyzed the generalization error further under a quadratic growth condition, and the analysis also required SGD to converge to a global minimizer. In contrast, this paper does not require additional conditions other than the gradient dominance condition. We show that the gradient dominance condition does improve the generalization error bound compared to general nonconvex functions.

Then, we consider nonconvex risk minimization problems with strongly convex regularizers, and study the role that the regularization plays in characterizing the generalization error bound of proximal SGD (which allows the analysis of nonsmooth regularizers). Regularization has been studied in (Hardt et al., 2016) in the in-expectation sense, and in (Mou et al., 2017) in expectation with respect to the algorithm randomness and in probability with respect to the data randomness. This paper studies the stronger notion of probabilistic guarantee with respect to both the algorithm and data randomness. Specifically, we characterize the generalization error bounds based on the on-average variance of the stochastic gradients. Our results show that strongly convex regularizers can substantially improve the generalization error bounds of SGD in nonconvex optimization. In particular, a general nonconvex loss function under a strongly convex regularizer can achieve a generalization error bound of the same order as that of a strongly convex loss function as characterized in (London, 2017).

We further study the high-probability guarantee (with exponential concentration in probability) for the generalization error of SGD in regularized nonconvex optimization by characterizing its uniform stability. Although (Mou et al., 2017) also studied regularized nonconvex problems, they only considered a particular choice of strongly convex regularizer and the high-probability guarantee is only with respect to the data randomness. In comparison, we consider all strongly convex regularizers and our probabilistic guarantee is with respect to both the algorithm and data randomness. We show that while the uniform stability of the SGD for nonconvex loss functions cannot yield an exponential concentration probability bound, it does lead to such probabilistic guarantee for the generalization error with strongly convex regularizers.

While this paper focuses on the generalization error bounds in terms of the function value, the generalization error of the gradient is also of interest to determine the convergence to a critical point. In the supplemental materials, we provide the analysis of the generalization error bounds in terms of the gradient, which is similar in spirit to the analysis in terms of the function value.

### Related Works

The stability approach was initially proposed by (Bousquet & Elisseeff, 2002) to study the generalization error, where various notions of stability were introduced to provide bounds on the generalization error with probabilistic guarantee. (Elisseeff et al., 2005) further extended the stability framework to characterize the generalization error of randomized learning algorithms. (Shalev-Shwartz et al., 2010) developed various properties of stability on learning problems. In (Hardt et al., 2016), the authors first applied the stability framework to study the expected generalization error for the SGD, and (Kuzborskij & Lampert, 2017) further provided a data dependent generalization error bound. In (Mou et al., 2017), the authors studied the generalization error of SGD with additive Gaussian noise. (Yin et al., 2017) studied the role that gradient diversity plays in characterizing the expected generalization error of SGD. All these works studied the expected generalization error of SGD. In (Charles & Papailiopoulos, 2017), the authors studied the generalization error of several first-order algorithms for loss functions satisfying the gradient dominance and the quadratic growth conditions. (Poggio et al., 2011) studied the stability of online learning algorithms.

The PAC Bayesian theory (Valiant, 1984; McAllester, 1999) is another popular framework for studying the generalization error in machine learning. It was recently used to develop bounds on the generalization error of SGD (London, 2017; Mou et al., 2017). Specifically, (Mou et al., 2017) applied the PAC Bayesian theory to study the generalization error of SGD with additive Gaussian noise. (London, 2017) combined the stability framework with the PAC Bayesian theory and provided a bound with probabilistic guarantee on the generalization error of SGD for strongly convex loss functions. The bound incorporates the divergence between the prior distribution and the posterior distribution of the parameters.

Recently, (Russo & Zou, 2016; Xu & Raginsky, 2017) applied information-theoretic tools to characterize the generalization capability of learning algorithms, and (Pensia et al., 2018) further extended the framework to study the generalization error of various first-order algorithms with noisy updates. Other approaches were also developed for characterizing the generalization error as well as the estimation error, which include, for example, the algorithm robustness framework (Xu & Mannor, 2012; Zahavy et al., 2017), large margin theory (Bartlett et al., 2017; Neyshabur et al., 2017; Sokolić et al., 2017) and the classical VC theory (Vapnik, 1995, 1998). Also, some methods have been developed to study the excess risk of the output of a learning algorithm, which include the robust stochastic approach (Nemirovski et al., 2009), the sample average approximation approach (Shapiro & Nemirovski, 2005; Lin & Rosasco, 2017), etc.

## 2 Preliminary and On-Average Stability

Consider applying the SGD to solve the empirical risk minimization (ERM) with a particular data set $S$. In particular, at each iteration $t$, the algorithm samples one data sample from the data set $S$ uniformly at random. Denote the index of the data sample drawn at the $t$-th iteration as $\xi_t$. Then, with a stepsize sequence $\{\alpha_t\}_t$ and a fixed initialization $w_0$, the update rule of the SGD can be written as, for $t = 0, \ldots, T-1$,

$$w_{t+1} = w_t - \alpha_t \nabla \ell(w_t; z_{\xi_t}). \tag{SGD}$$

Throughout the paper, we denote the iterate sequence along the optimization path as $\{w_{t,S}\}_t$, where $S$ in the subscript indicates that the sequence is generated by the algorithm using the data set $S$. The stepsize sequence $\{\alpha_t\}_t$ is a decreasing positive sequence, and typical choices for SGD are (Bottou, 2010), which we adopt in our study.
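The update rule (SGD) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar per-sample loss $\ell(w; z) = 1 - \cos(w - z)$ is a hypothetical nonconvex example, and the stepsize $0.5/(t+2)$ is just one decreasing positive sequence chosen for the demonstration.

```python
import math
import random

def grad_loss(w, z):
    """Gradient in w of the hypothetical loss l(w; z) = 1 - cos(w - z)."""
    return math.sin(w - z)

def sgd(samples, w0, num_iters, step, seed=0):
    """Run SGD and return the final iterate together with the sample path."""
    rng = random.Random(seed)
    w = w0
    path = []
    for t in range(num_iters):
        xi_t = rng.randrange(len(samples))   # index drawn uniformly at random
        path.append(xi_t)
        w = w - step(t) * grad_loss(w, samples[xi_t])
    return w, path

# Toy run: every sample equals 0.5, so the iterates move from 0 toward 0.5.
samples = [0.5] * 10
w_T, path = sgd(samples, w0=0.0, num_iters=200, step=lambda t: 0.5 / (t + 2))
```

Returning `path` alongside the final iterate is a deliberate choice: the stability arguments later in the paper compare two runs that share the same sample path.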

Clearly, the output $w_{T,S}$ is determined by the data set $S$ and the sample path $\xi$ of SGD. We are interested in the generalization error of the $T$-th output generated by SGD, i.e., $|f_S(w_{T,S}) - f(w_{T,S})|$, and throughout the paper we adopt the following standard assumptions (Hardt et al., 2016; Kuzborskij & Lampert, 2017) on the loss function.

###### Assumption 1.

For all $z$, the loss function $\ell(\cdot\,; z)$ satisfies:

1. Function $\ell(\cdot\,; z)$ is continuously differentiable;

2. Function $\ell(\cdot\,; z)$ is nonnegative and $\sigma$-Lipschitz, and $|\ell(\cdot\,; z)|$ is uniformly bounded by $M$;

3. The gradient $\nabla \ell(\cdot\,; z)$ is $L$-Lipschitz, and $\|\nabla \ell(\cdot\,; z)\|$ is uniformly bounded by $\sigma$.

The generalization error of SGD can be viewed as a nonnegative random variable whose randomness is due to the draw of the data set $S$ and the sample path $\xi$ of the algorithm. In particular, the mean square generalization error has been studied in (Elisseeff et al., 2005) for general randomized learning algorithms. Specifically, a direct application of [Lemma 11, (Elisseeff et al., 2005)] to SGD under Assumption 1 yields the following result. Throughout the paper, we denote by $\bar{S}$ the data set that replaces one data sample of $S$ with an i.i.d. copy generated from the distribution $\mathcal{D}$, and by $w_{T,\bar{S}}$ the output of SGD for solving the ERM with the data set $\bar{S}$.

###### Proposition 1.

Let Assumption 1 hold. Apply the SGD with the same sample path $\xi$ to solve the ERM with the data sets $S$ and $\bar{S}$, respectively. Then, the mean square generalization error of SGD satisfies

$$\mathbb{E}\left[|f_S(w_{T,S}) - f(w_{T,S})|^2\right] \le \frac{2M^2}{n} + 12 M \sigma\, \mathbb{E}\left[\delta_{T,S,\bar{S}}\right],$$

where $\delta_{T,S,\bar{S}} := \|w_{T,S} - w_{T,\bar{S}}\|$ and the expectation is taken over the random variables $S$, $\bar{S}$ and $\xi$.

Proposition 1 characterizes the mean square generalization error of SGD, which is related to the quantity $\delta_{T,S,\bar{S}}$. Intuitively, $\delta_{T,S,\bar{S}}$ captures the variation of the output of the algorithm with regard to the variation of the data set. Hence, its expectation can be understood as the on-average stability of the iterates generated by SGD. We note that similar notions of stability were proposed in (Kuzborskij & Lampert, 2017; Shalev-Shwartz et al., 2010; Elisseeff et al., 2005), which are instead based on the variation of the function values at the output.
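The on-average iterate stability can be estimated numerically by running SGD twice with a shared sample path, once on a data set and once on a copy with one sample replaced, and averaging the distance between the two outputs. The sketch below does this for a scalar toy problem; the loss gradient $\sin(w - z)$, the Gaussian data, the stepsize $c/(t+2)$ and all function names are assumptions made only for illustration.

```python
import math
import random

def run_sgd(samples, path, w0=0.0, c=0.5):
    """SGD on the hypothetical loss 1 - cos(w - z) along a fixed sample path."""
    w = w0
    for t, idx in enumerate(path):
        w -= c / (t + 2) * math.sin(w - samples[idx])
    return w

def iterate_gap(n, num_iters, seed):
    """One draw of the distance between the outputs on S and on S_bar."""
    rng = random.Random(seed)
    S = [rng.gauss(0.0, 1.0) for _ in range(n)]
    S_bar = list(S)
    S_bar[0] = rng.gauss(0.0, 1.0)            # replace one sample by an i.i.d. copy
    path = [rng.randrange(n) for _ in range(num_iters)]
    return abs(run_sgd(S, path) - run_sgd(S_bar, path))

def on_average_stability(n, num_iters, trials=200):
    """Monte Carlo average of the gap over data sets and sample paths."""
    return sum(iterate_gap(n, num_iters, s) for s in range(trials)) / trials

small_n = on_average_stability(n=5, num_iters=400)
large_n = on_average_stability(n=200, num_iters=400)
```

In this toy setting the estimate for the larger data set is noticeably smaller, since the replaced sample is visited less often along the path. If the replaced index never appears in the path, the two runs coincide exactly.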

## 3 Nonconvex Optimization

In this section, we study the mean square generalization error of the SGD by characterizing the corresponding on-average stability of the algorithm iterates, which incorporates the properties of the corresponding optimization path. Such results further provide a bound with probabilistic guarantee for the generalization error.

As the iterates are updated by SGD, the variance of the stochastic gradients becomes an intrinsic quantity that affects the corresponding optimization path. In order to capture the impact of the variance of the stochastic gradients on the generalization error, we adopt the following standard assumption from the stochastic optimization theory (Bottou, 2010; Nemirovski et al., 2009; Ghadimi et al., 2016).

###### Assumption 2.

For any fixed data set $S$ and any $\xi$ that is generated uniformly at random from $\{1, \ldots, n\}$, there exists a constant $\nu_S > 0$ such that for all $w \in \Omega$ one has

$$\mathbb{E}_{\xi}\left\|\nabla \ell(w; z_{\xi}) - \frac{1}{n}\sum_{k=1}^{n} \nabla \ell(w; z_k)\right\|^2 \le \nu_S^2. \tag{4}$$

Assumption 2 essentially bounds the variance of the stochastic gradients for the particular data set $S$. The variance of the stochastic gradient is typically much smaller than the uniform upper bound in Assumption 1 for the norm of the stochastic gradient (e.g., a standard normal random variable has unit variance but is unbounded), and hence may provide a tighter bound on the generalization error.
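The left-hand side of (4) is straightforward to compute for a given parameter value. The sketch below evaluates it for a scalar toy loss and takes a maximum over a finite grid as a rough numerical stand-in for the constant $\nu_S^2$. The gradient $\sin(w - z)$ and the grid are illustrative assumptions, and a grid maximum only approximates the supremum over the whole parameter space.

```python
import math

def grad_loss(w, z):
    """Gradient of the hypothetical loss l(w; z) = 1 - cos(w - z)."""
    return math.sin(w - z)

def gradient_variance(w, samples):
    """Left-hand side of (4) at a fixed w, for index xi uniform over the data."""
    grads = [grad_loss(w, z) for z in samples]
    mean_grad = sum(grads) / len(grads)
    return sum((g - mean_grad) ** 2 for g in grads) / len(grads)

def estimate_nu_squared(samples, w_grid):
    """Grid-based estimate of nu_S^2 (a lower estimate of the true supremum)."""
    return max(gradient_variance(w, samples) for w in w_grid)

samples = [-1.0, 1.0]
w_grid = [i / 10.0 for i in range(-30, 31)]
var_at_zero = gradient_variance(0.0, samples)
nu_sq = estimate_nu_squared(samples, w_grid)
```

For this loss the gradient norm is uniformly bounded by 1, so `nu_sq` can never exceed 1; on data sets whose samples are close together it is much smaller, which is the situation where a variance-based bound helps.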

Based on Assumption 2, we obtain the following bound on the mean square generalization error by exploring the optimization path of the SGD.

###### Theorem 1.

(Mean Square Bound) Suppose $\ell$ is nonconvex. Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set $S$, and choose the step size with . Then the following bound holds.

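$$\mathbb{E}_{\xi,S}\left[|f_S(w_{T,S}) - f(w_{T,S})|^2\right] \le \frac{2M^2}{n} + \frac{24 M \sigma c}{n}\sqrt{2 L f(w_0) + \frac{1}{2}\mathbb{E}_S[\nu_S^2]}\;\log T. \tag{5}$$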

The proof of Theorem 1 characterizes the on-average stability of the iterates generated by SGD, and it explores the optimization path by applying the technical tools developed in stochastic optimization theory. Thus, the obtained bound depends on the initialization $w_0$ and the on-average variance $\mathbb{E}_S[\nu_S^2]$ of the stochastic gradients. Intuitively, the on-average variance measures the ‘stability’ of the stochastic gradients over all realizations of the data set $S$. If this quantity is large, then the sampled stochastic gradient changes substantially when replacing it with a new i.i.d. sample gradient, which consequently implies a larger generalization error.

The bound in Theorem 1 is different in nature from that in (Hardt et al., 2016), although the two bounds may not be directly comparable (Theorem 1 is on the mean square generalization error whereas (Hardt et al., 2016) is on the expected generalization error). Specifically, (Hardt et al., 2016) developed the bound based on the uniform stability, i.e., a uniform upper bound over all data sets $S$ and $\bar{S}$, so that the bound is in terms of the uniform upper bound $\sigma$ on the stochastic gradient norm. In contrast, our bound is based on the on-average stability $\mathbb{E}[\delta_{T,S,\bar{S}}]$. Hence, our bound replaces $\sigma$ with the smaller on-average standard deviation $\sqrt{\mathbb{E}_S[\nu_S^2]}$.

We note that (Kuzborskij & Lampert, 2017) also exploited the optimization path to characterize the expected generalization error of SGD. However, their analysis assumes that the iterate is independent of , which may not hold after multiple passes over the data samples. Also, their result does not capture the on-average variance of the stochastic gradients. In comparison, our analysis does not require such independence to exploit the optimization path information, and we characterize the mean square generalization error, which is a stronger notion than the expected generalization error.

###### Outline of the Proof of Theorem 1.

We provide an outline of the proof of Theorem 1 here, and relegate the detailed proof to the supplementary materials.

The central idea is to bound the on-average stability $\mathbb{E}[\delta_{T,S,\bar{S}}]$ of the iterates in Proposition 1. Hence, suppose we apply the SGD with the same sample path $\xi$ to solve the ERM with the data sets $S$ and $\bar{S}$, respectively. We first obtain the following recursive property of the on-average iterate stability (Lemma 2 in Appendix A):

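$$\mathbb{E}_{S,\bar{S},\xi}\left[\delta_{t+1,S,\bar{S}}\right] \le (1 + \alpha_t L)\,\mathbb{E}_{S,\bar{S},\xi}\left[\delta_{t,S,\bar{S}}\right] + \frac{2\alpha_t}{n}\,\mathbb{E}_{S,\xi}\left[\|\nabla \ell(w_{t,S}; z_1)\|\right]. \tag{6}$$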

We then further derive the following bound on $\mathbb{E}_{\xi,S}\left[\|\nabla \ell(w_{t,S}; z_1)\|\right]$ by exploiting the optimization path of the SGD (Lemma 3 in Appendix A):

$$\mathbb{E}_{\xi,S}\left[\|\nabla \ell(w_{t,S}; z_1)\|\right] \le \sqrt{2 L f(w_0) + \frac{1}{2}\mathbb{E}_S[\nu_S^2]}. \tag{7}$$

Substituting eq. 7 into eq. 6 and telescoping, we obtain an upper bound on $\mathbb{E}_{S,\bar{S},\xi}\left[\delta_{T,S,\bar{S}}\right]$. Then, Theorem 1 follows by substituting such a bound into Proposition 1. ∎

Furthermore, Theorem 1, together with Chebyshev’s inequality, immediately implies the following probabilistic guarantee for the generalization error of SGD.

###### Theorem 2.

(Bound with Probabilistic Guarantee) Suppose $\ell$ is nonconvex. Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set $S$, and choose the step size with . Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$|f_S(w_{T,S}) - f(w_{T,S})| \le \sqrt{\frac{1}{n\delta}\left(2M^2 + 24 M \sigma c \sqrt{2 L f(w_0) + \frac{1}{2}\mathbb{E}_S[\nu_S^2]}\;\log T\right)}.$$

Theorem 2 provides a probabilistic guarantee for the generalization error of SGD. Intuitively, if SGD has a small on-average variance $\mathbb{E}_S[\nu_S^2]$, the optimization paths of SGD on two slightly different data sets are close. This further leads to a better stability of the iterates and in turn yields a low generalization error. Compared to (London, 2017), our generalization bound captures the variance in the optimization, whereas theirs captures the divergence between the prior distribution and the posterior distribution of the parameters.
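The two bounds are easy to evaluate numerically, which helps when checking how they scale with $n$, $T$ and $\delta$. The sketch below encodes the right-hand side of the mean square bound and the probabilistic bound obtained from it by dividing by $\delta$ and taking a square root. The function names and the example constants are hypothetical, and the coefficient of the variance term is taken to be $1/2$, which is an assumption about the intended formula.

```python
import math

def mean_square_bound(n, T, M, sigma, L, c, f_w0, avg_var):
    """Right-hand side of the mean square bound for general nonconvex losses.

    avg_var stands for the on-average gradient variance E_S[nu_S^2].
    """
    grad_term = math.sqrt(2.0 * L * f_w0 + 0.5 * avg_var)
    return 2.0 * M ** 2 / n + 24.0 * M * sigma * c / n * grad_term * math.log(T)

def probabilistic_bound(delta, n, T, M, sigma, L, c, f_w0, avg_var):
    """Bound holding with probability at least 1 - delta (Chebyshev step)."""
    return math.sqrt(mean_square_bound(n, T, M, sigma, L, c, f_w0, avg_var) / delta)

# Example constants, chosen only for illustration.
params = dict(n=1000, T=1000, M=2.0, sigma=1.0, L=1.0, c=0.5, f_w0=1.0, avg_var=0.5)
ms = mean_square_bound(**params)
hp = probabilistic_bound(0.1, **params)
```

With these constants the mean square bound is dominated by the $\log T$ term, and quadrupling $n$ at fixed $T$ divides `ms` by four and `hp` by two.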

## 4 Gradient Dominant Nonconvex Optimization

In this section, we consider nonconvex loss functions with the empirical risk function further satisfying the following gradient dominance condition.

###### Definition 1.

Denote $f^* := \inf_{w \in \Omega} f(w)$. Then, the function $f$ is said to be $\gamma$-gradient dominant for $\gamma > 0$ if

$$f(w) - f^* \le \frac{1}{2\gamma}\|\nabla f(w)\|^2, \quad \forall w \in \Omega. \tag{8}$$

The gradient dominance condition (also referred to as Polyak-Łojasiewicz condition (Polyak, 1963; Łojasiewicz, 1963)) guarantees a linear convergence of the function value sequence generated by gradient-based first-order methods (Karimi et al., 2016). It is a condition that is much weaker than the strong convexity, and many nonconvex machine learning problems satisfy this condition around the global minimizers (Li et al., 2016; Zhou et al., 2016).
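As a quick numerical illustration of condition (8), the sketch below checks it on a grid for the scalar function $f(w) = w^2 + 3\sin^2(w)$, a commonly used example of a function that is nonconvex yet gradient dominant. The choice of function, the constant $\gamma = 1/32$ and the grid are assumptions for this illustration, and a grid check is evidence rather than a proof.

```python
import math

def f(w):
    """Nonconvex test function with minimum value f* = 0 attained at w = 0."""
    return w * w + 3.0 * math.sin(w) ** 2

def grad_f(w):
    """Derivative of f: 2w + 6 sin(w) cos(w) = 2w + 3 sin(2w)."""
    return 2.0 * w + 3.0 * math.sin(2.0 * w)

def satisfies_gradient_dominance(gamma, grid, f_star=0.0):
    """Check f(w) - f* <= |f'(w)|^2 / (2 gamma) at every grid point."""
    return all(f(w) - f_star <= grad_f(w) ** 2 / (2.0 * gamma) for w in grid)

grid = [i / 100.0 for i in range(-1000, 1001)]
holds_small_gamma = satisfies_gradient_dominance(1.0 / 32.0, grid)
holds_large_gamma = satisfies_gradient_dominance(10.0, grid)
```

The check passes for the small constant and fails for the large one, which shows that the condition constrains how flat the function may be away from its minimizer without requiring convexity.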

The gradient dominance condition helps to improve the bound on the on-average stochastic gradient norm (see Lemma 3 in Appendix A), which is given by

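$$\mathbb{E}_{\xi,S}\left[\|\nabla \ell(w_{t,S}; z_1)\|\right] \le \sqrt{2 L\,\mathbb{E}_S[f_S^*] + \frac{1}{t}\left(2 L f(w_0) + \mathbb{E}_S[\nu_S^2]\right)}. \tag{9}$$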

Compared to eq. 7 for general nonconvex functions, the above bound is further improved by a factor of $\frac{1}{t}$. This is because SGD converges sub-linearly to the optimum function value under the gradient dominance condition, and $\frac{1}{t}$ is essentially the convergence rate of SGD. In particular, for sufficiently large $t$, the on-average stochastic gradient norm is essentially bounded by $\sqrt{2 L\,\mathbb{E}_S[f_S^*]}$, which is much smaller than the bound in eq. 7. With the bound in eq. 9, we obtain the following theorem.

###### Theorem 3.

(Mean Square Bound) Suppose $\ell$ is nonconvex, and $f_S$ is $\gamma$-gradient dominant (). Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set $S$ and choose with . Then, the following bound holds.

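$$\mathbb{E}_{\xi,S}\left[|f_S(w_{T,S}) - f(w_{T,S})|^2\right] \le \frac{2M^2}{n} + \frac{24 M \sigma c}{n}\left(\sqrt{2 L\,\mathbb{E}_S[f_S^*]}\;\log T + \sqrt{2 L f(w_0) + \mathbb{E}_S[\nu_S^2]}\right).$$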

The above bound for the mean square generalization error under the gradient dominance condition improves on that for general nonconvex functions in Theorem 1, as the dominant term (i.e., the $\log T$-dependent term) has coefficient $\sqrt{2 L\,\mathbb{E}_S[f_S^*]}$, which is much smaller than the corresponding term $\sqrt{2 L f(w_0) + \frac{1}{2}\mathbb{E}_S[\nu_S^2]}$ in the bound of Theorem 1. Intuitively, the on-average variance of the SGD is further reduced by its fast convergence rate under the gradient dominance condition. This results in a better on-average iterate stability, which in turn improves the mean square generalization error. We note that (Charles & Papailiopoulos, 2017) also studied the generalization error of SGD for loss functions satisfying both the gradient dominance condition and an additional quadratic growth condition. They also assumed that the algorithm converges to a global minimizer, which may not always hold for noisy algorithms like SGD.

Theorem 3 directly implies the following probabilistic guarantee for the generalization error of SGD.

###### Theorem 4.

(Bound with Probabilistic Guarantee) Suppose $\ell$ is nonconvex, and $f_S$ is $\gamma$-gradient dominant (). Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set $S$, and choose with . Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have

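$$|f_S(w_{T,S}) - f(w_{T,S})| \le \sqrt{\frac{2M^2}{n\delta} + \frac{24 M \sigma c}{n\delta}\left(\sqrt{2 L\,\mathbb{E}_S[f_S^*]}\;\log T + \sqrt{2 L f(w_0) + \mathbb{E}_S[\nu_S^2]}\right)}.$$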

## 5 Regularized Nonconvex Optimization

In practical applications, regularization is usually applied to the risk minimization problem in order to either promote certain structures on the desired solution or to restrict the parameter space. In this section, we explore how regularization can improve the generalization error, and hence help to avoid overfitting for SGD. Such a topic has been considered in (Hardt et al., 2016; Mou et al., 2017) to study the expected generalization error of SGD. Here, we focus on the probabilistic guarantee for the generalization error, and show that strongly convex regularizers can improve the order of the generalization error bounds.

Here, for any weight $\lambda > 0$, we consider the regularized population risk minimization (R-PRM) and the regularized empirical risk minimization (R-ERM):

$$\min_{w\in\Omega}\; \Phi(w) := f(w) + \lambda h(w), \qquad \min_{w\in\Omega}\; \Phi_S(w) := f_S(w) + \lambda h(w),$$

where $h$ corresponds to the regularizer and $f$, $f_S$ are the population and empirical risks, respectively. In particular, we are interested in the following class of regularizers.

###### Assumption 3.

The regularizer function $h$ is 1-strongly convex and nonnegative.

Without loss of generality, we assume that the strong convexity parameter of $h$ is 1, as this can be adjusted by scaling the weight parameter $\lambda$. Strongly convex regularizers are commonly used in machine learning applications, and typical examples include the squared $\ell_2$ penalty used in ridge regression, Tikhonov regularization and the elastic net, etc. Here, we allow the regularizer $h$ to be non-differentiable (e.g., the elastic net), and introduce the following proximal mapping with parameter $\alpha > 0$ to deal with the non-smoothness.

$$\mathrm{prox}_{\alpha h}(w) := \arg\min_{u\in\Omega}\; h(u) + \frac{1}{2\alpha}\|u - w\|^2. \tag{10}$$

The proximal mapping is the core of the proximal method for solving convex problems (Parikh & Boyd, 2014; Beck & Teboulle, 2009) and nonconvex ones (Li et al., 2017; Attouch et al., 2013). In particular, we apply the proximal SGD to solve the R-ERM. With the same notations as those defined in Section 2, the update rule of the proximal SGD can be written as, for $t = 0, \ldots, T-1$,

$$w_{t+1} = \mathrm{prox}_{\alpha_t h}\left(w_t - \alpha_t \nabla \ell(w_t; z_{\xi_t})\right). \tag{proximal-SGD}$$

Similarly, we denote by $\{w_{t,S}\}_t$ the iterate sequence generated by the proximal SGD with the data set $S$.

It is clear that the generalization error of the function value for the regularized risk minimization, i.e., $|\Phi(w) - \Phi_S(w)|$, is the same as that for the un-regularized risk minimization, since the regularization term cancels in the difference. Hence, Proposition 1 is also applicable to the mean square generalization error of the regularized risk minimization. However, the development of the generalization error bound differs from the analysis in Section 3 in two aspects. First, the analysis of the on-average iterate stability of the proximal SGD is technically more involved than that of SGD due to the possibly non-smooth regularizer. Second, the proximal mappings of strongly convex functions are strictly contractive (see item 2 of Lemma 5 in Appendix B). Thus, the proximal step in the proximal SGD enhances the stability between the iterates $w_{t,S}$ and $w_{t,\bar{S}}$ that are generated by the algorithm using perturbed data sets, and this further improves the generalization error. The next result provides a quantitative statement.
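The contraction property is easy to see for the simplest strongly convex regularizer. The sketch below uses $h(u) = u^2/2$ on the real line, for which the proximal mapping (10) has the closed form $w/(1+\alpha)$, and plugs it into one possible proximal SGD loop. The regularizer, the scalar loss gradient $\sin(w - z)$, the stepsize $c/(t+2)$ and the function names are illustrative assumptions, with the weight $\lambda$ taken as absorbed into $h$.

```python
import math

def prox_half_square(w, alpha):
    """prox_{alpha h}(w) for h(u) = u**2 / 2 on the real line.

    Setting the derivative of u**2 / 2 + (u - w)**2 / (2 alpha) to zero
    gives u = w / (1 + alpha).
    """
    return w / (1.0 + alpha)

def proximal_sgd(samples, path, w0=0.0, c=0.5):
    """Proximal SGD along a fixed sample path for the toy loss 1 - cos(w - z)."""
    w = w0
    for t, idx in enumerate(path):
        alpha = c / (t + 2)
        gradient_step = w - alpha * math.sin(w - samples[idx])
        w = prox_half_square(gradient_step, alpha)
    return w

# Two inputs are pulled strictly closer together by the proximal step.
a, b, alpha = 1.7, -0.4, 0.25
dist_before = abs(a - b)
dist_after = abs(prox_half_square(a, alpha) - prox_half_square(b, alpha))
w_out = proximal_sgd([0.5, -0.2, 1.0], [0, 1, 2, 0, 1])
```

Here `dist_after` equals `dist_before / (1 + alpha)`, so every proximal step shrinks the distance between two runs by a fixed factor, which is the mechanism behind the improved stability discussed above.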

###### Theorem 5.

Consider the regularized risk minimization. Suppose $\ell$ is nonconvex. Let Assumptions 1, 2 and 3 hold, and apply the proximal SGD to solve the R-ERM with the data set $S$. Let and with . Then, the following bound holds with probability at least $1 - \delta$.

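$$|\Phi(w_{T,S}) - \Phi_S(w_{T,S})| \le \sqrt{\frac{1}{n\delta}\left(2M^2 + \frac{24 M \sigma}{\lambda - L}\sqrt{L\,\Phi(w_0) + \mathbb{E}_S[\nu_S^2]}\right)}.$$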

Theorem 5 provides a probabilistic guarantee for the generalization error of the proximal SGD in terms of the on-average variance of the stochastic gradients. Comparison of Theorem 5 with Theorems 2 and 4 indicates that a strongly convex regularizer substantially improves the generalization error bound of SGD for nonconvex loss functions by removing the logarithmic dependence on $T$. It is also interesting to compare Theorem 5 with [Proposition 4 and Theorem 1, (London, 2017)], which characterize the generalization error of SGD for strongly convex functions with probabilistic guarantee. The two bounds have the same order in terms of and , indicating that a strongly convex regularizer even improves the generalization error for a nonconvex function to be the same as that for a strongly convex function. We note that in practice, the regularization weight should be properly chosen to balance the generalization error against the empirical risk, as otherwise the parameter space can be too restrictive to yield a good solution for the risk function. We further demonstrate this via experiments in Section 7.

## 6 High-Probability Guarantee

The previous sections explore the probabilistic guarantee for the generalization errors of nonconvex loss functions, gradient dominant loss functions and nonconvex loss functions with strongly convex regularizers. For example, when SGD is applied to a generic nonconvex loss function, Theorem 2 suggests that for any $\epsilon > 0$,

$$\mathbb{P}\left(|f(w_{T,S}) - f_S(w_{T,S})| > \epsilon\right) \le O\left(\frac{\log T}{n\epsilon^2}\right),$$

which decays sublinearly as $n \to \infty$. In this section, we study a stronger probabilistic guarantee for the generalization error, i.e., the probability that it exceeds $\epsilon$ decays exponentially. We refer to such a notion as high-probability guarantee. In particular, we explore for which cases of nonconvex loss functions we can establish such a stronger performance guarantee.

Towards this end, we adopt the uniform stability framework proposed in (Elisseeff et al., 2005). Note that (Hardt et al., 2016) also studied the uniform stability of SGD, but only characterized the generalization error in expectation, which is weaker than the exponential probabilistic concentration bound that we study here.

Suppose we apply the SGD with the same sample path $\xi$ to solve the ERM with the data sets $S$ and $\bar{S}$, respectively, and denote by $w_{T,S,\xi}$ and $w_{T,\bar{S},\xi}$ the corresponding outputs. Also, suppose we apply the SGD with different sample paths $\xi$ and $\bar{\xi}$ to solve the same problem with the data set $S$, and denote by $w_{T,S,\xi}$ and $w_{T,S,\bar{\xi}}$ the corresponding outputs. Here, $\bar{\xi}$ denotes the sample path that replaces one of the sampled indices with an i.i.d. copy. The following result is a variant of [Theorem 15, (Elisseeff et al., 2005)].

###### Lemma 1.

Let Assumption 1 hold. Suppose SGD satisfies the following two conditions. (The first condition is slightly different from that in [Theorem 15, (Elisseeff et al., 2005)], in which $\bar{S}$ excludes a particular sample instead of replacing it. Nevertheless, the proof follows the same idea and we omit it for simplicity.)

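$$\sup_{S,\bar S,z}\ \mathbb{E}_{\xi}\big|\ell(w_{T,S,\xi};z)-\ell(w_{T,\bar S,\xi};z)\big|\le\beta,\qquad \sup_{\xi,\bar\xi,S,z}\big|\ell(w_{T,S,\xi};z)-\ell(w_{T,S,\bar\xi};z)\big|\le\rho.$$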

Then, the following bound holds with probability at least $1-\delta$.

$$|\Phi(w_{T,S})-\Phi_S(w_{T,S})|\le 2\beta+\big(2\sqrt{n}\,\beta+\sqrt{2T}\,\rho\big)\sqrt{\log\frac{2}{\delta}}.$$

Note that Lemma 1 implies that

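$$\mathbb{P}\big(|\Phi(w_{T,S})-\Phi_S(w_{T,S})|>\epsilon\big)\le O\Big(\exp\Big(-\frac{\epsilon^2}{\sqrt{n}\,\beta+\sqrt{T}\,\rho}\Big)\Big).$$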

Hence, if and , then we have exponential decay in probability as and .
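As a rough numerical illustration of Lemma 1, the following sketch evaluates the right-hand side of its bound for given stability parameters. The function name `lemma1_bound` and the example rates $\beta = 1/n$ and $\rho = 1/T$ are assumptions made here for illustration only; they are not values derived in the analysis.

```python
import math

def lemma1_bound(n, T, beta, rho, delta):
    """Right-hand side of the Lemma 1 bound:
    2*beta + (2*sqrt(n)*beta + sqrt(2*T)*rho) * sqrt(log(2/delta))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    stability_term = 2.0 * math.sqrt(n) * beta + math.sqrt(2.0 * T) * rho
    return 2.0 * beta + stability_term * math.sqrt(math.log(2.0 / delta))

# Hypothetical stability rates beta = 1/n and rho = 1/T (illustration only).
for n in (10**2, 10**4, 10**6):
    T = n  # one pass over the data, also an assumption of this example
    print(n, lemma1_bound(n, T, beta=1.0 / n, rho=1.0 / T, delta=0.05))
```

With these assumed rates both $\sqrt{n}\,\beta$ and $\sqrt{T}\,\rho$ vanish as $n$ and $T$ grow, so the printed bound shrinks; with a slower rate such as $\beta = n^{-1/2}$ it would not.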

It turns out that our analysis of the uniform stability of SGD for general nonconvex functions yields that , which does not lead to the desired high-probability guarantee for the generalization error. On the other hand, the analysis of the uniform stability of the proximal SGD for nonconvex loss functions with strongly convex regularizers yields that which leads to the high-probability guarantee if we choose and . This further demonstrates that a strongly convex regularizer can significantly improve the quality of the probabilistic bound for the generalization error. The following result is a formal statement of the above discussion.

###### Theorem 6.

Consider the regularized risk minimization with the nonconvex loss function . Let Assumptions 1 and 3 hold, and apply the proximal SGD to solve the R-ERM with the data set . Choose and with . Then, the following bound holds with probability at least $1-\delta$:

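$$|\Phi(w_{T,S})-\Phi_S(w_{T,S})|\le\Big(\frac{4\sigma^2}{\sqrt{n}\,(\lambda-L)}+\frac{4\sigma^2 c}{T^{\,c(\lambda-L)-\frac12}}\Big)\sqrt{\log\frac{2}{\delta}}.$$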

Theorem 6 implies that

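$$\mathbb{P}\big(|\Phi(w_{T,S})-\Phi_S(w_{T,S})|>\epsilon\big)\le O\Big(\exp\Big(-\frac{\epsilon^2}{n^{-\frac12}+T^{\frac12-c(\lambda-L)}}\Big)\Big).$$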

Hence, if we choose and run the proximal SGD for iterations (i.e., constant passes over the data), then the probability of the event decays exponentially as .
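To get a feel for how the two terms of the Theorem 6 bound trade off, the sketch below evaluates its right-hand side, reading the displayed bound as $\big(\frac{4\sigma^2}{\sqrt{n}(\lambda-L)}+\frac{4\sigma^2 c}{T^{c(\lambda-L)-1/2}}\big)\sqrt{\log(2/\delta)}$. The function name `theorem6_bound` and the parameter values in the loop are hypothetical choices for illustration, not settings taken from the paper.

```python
import math

def theorem6_bound(n, T, sigma, lam, L, c, delta):
    """Evaluate (4*sigma^2 / (sqrt(n)*(lam-L)) + 4*sigma^2*c / T**(c*(lam-L) - 1/2))
    * sqrt(log(2/delta)), assuming lam > L."""
    if lam <= L:
        raise ValueError("the bound is only meaningful for lam > L")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    gap = lam - L
    data_term = 4.0 * sigma ** 2 / (math.sqrt(n) * gap)          # data-set perturbation
    path_term = 4.0 * sigma ** 2 * c / T ** (c * gap - 0.5)      # sample-path perturbation
    return (data_term + path_term) * math.sqrt(math.log(2.0 / delta))

# Hypothetical setting: c * (lam - L) = 1 and T = n (a single pass over the data).
for n in (10**2, 10**4, 10**6):
    print(n, theorem6_bound(n, T=n, sigma=1.0, lam=2.0, L=1.0, c=1.0, delta=0.05))
```

In this reading the sample-path term decays in $T$ only when $c(\lambda-L) > 1/2$, which is why the choice of $c$ relative to $\lambda - L$ matters.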

The proof of Theorem 6 characterizes the uniform iterate stability of the proximal SGD with regard to the perturbations of both the data set and the sample path. Unlike the on-average stability in Theorem 1 where the stochastic gradient norm is bounded by the on-average variance of the stochastic gradients, the uniform stability captures the worst case among all data sets, and hence uses the uniform upper bound for the stochastic gradient norm.

We note that [Theorem 3, (London, 2017)] also established a probabilistic bound comparable to ours under the PAC-Bayesian framework. However, their result holds only for strongly convex loss functions. In comparison, Theorem 6 relaxes the requirement of strong convexity for loss functions to nonconvex loss functions with strongly convex regularizers, and hence serves as a complementary result to theirs. Also, (Mou et al., 2017) establishes a high-probability bound for the generalization error of SGD with regularization. However, their result holds only for the particular regularizer , and the high-probability bound holds only with regard to the random draw of the data. In comparison, our result holds for all strongly convex regularizers, and the high-probability bound holds with regard to both the draw of the data and the randomness of the algorithm.

## 7 Experimental Evaluation

In this section, we provide experimental results on the generalization error of SGD. We perform two experiments: solving a logistic regression problem with the a9a data set (Chang & Lin, 2011) and training a three-layer ReLU neural network with the MNIST data set (Lecun et al., 1998). For both experiments, we use a fixed initialization. We set the batch size to be 10% of the training sample size for the logistic regression and 160 for the neural network, and report the averaged results over multiple trials of the experiments.

Un-regularized optimization: We first explore the generalization error and the training error when there is no regularizer in the objective function, and the results are shown in the left column of Figure 1. Note that the left y-axis corresponds to the generalization error curves and the right y-axis corresponds to the training error curves.

It can be seen that the generalization error of SGD for logistic regression improves over the training epochs, and it increases only very slowly for neural network training. This is consistent with the theoretical result, which suggests that SGD can generalize well after multiple epochs. Also, it tends to support our theoretical finding that an amenable curvature of the problem helps to improve the generalization performance of SGD, as the logistic loss is a convex function with well curved geometry whereas the loss function of a neural network is a highly nonconvex objective.

Regularized optimization: We also explore the effect of regularization on the generalization error by adding the regularizer to the objective functions. In particular, we apply the proximal SGD to solve the two problems. The right column of Figure 1 shows the results. For both logistic regression and neural network training, it can be seen that the corresponding generalization errors improve (i.e., decrease) as the regularization weight increases. This agrees with our theoretical finding on the impact of regularization. On the other hand, the training performances for both problems degrade as the regularization weight increases beyond a certain threshold, which is reasonable because in such a case the optimization focuses too much on the regularizer and the obtained solution does not minimize the loss function well. Hence, there is a trade-off between the training performance and generalization performance in tuning the regularization parameter.
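The proximal SGD update used in the regularized experiments can be sketched as follows. This is a minimal single-sample illustration, not the experimental code: it assumes the squared-norm regularizer $h(w)=\frac{\lambda}{2}\|w\|^2$ (whose proximal map is the closed-form shrinkage $v/(1+\alpha\lambda)$), a hypothetical stepsize $\alpha_t = c/(t+2)$, and synthetic data in place of a9a. All function names are illustrative.

```python
import math
import random

def _margin(w, x, y):
    return y * sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(w, x, y):
    """log(1 + exp(-y <w, x>)) for y in {-1, +1}, computed stably."""
    m = _margin(w, x, y)
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

def logistic_grad(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    m = _margin(w, x, y)
    # s = sigmoid(-m), evaluated without overflow for large |m|
    if m >= 0:
        e = math.exp(-m)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(m))
    return [-y * s * xi for xi in x]

def proximal_sgd(data, lam, steps, c=0.5, seed=0):
    """Proximal SGD for loss + (lam/2)*||w||^2 (assumed regularizer)."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])          # fixed initialization
    for t in range(steps):
        x, y = data[rng.randrange(len(data))]
        alpha = c / (t + 2)              # hypothetical stepsize schedule
        g = logistic_grad(w, x, y)
        v = [wi - alpha * gi for wi, gi in zip(w, g)]   # gradient step on the loss
        w = [vi / (1.0 + alpha * lam) for vi in v]      # prox step on the regularizer
    return w

def average_loss(w, data):
    return sum(logistic_loss(w, x, y) for x, y in data) / len(data)

def make_data(num, rng, dim=5):
    """Synthetic noisy linear labels (stand-in for a real data set)."""
    data = []
    for _ in range(num):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = 1 if x[0] + 0.5 * rng.gauss(0.0, 1.0) > 0 else -1
        data.append((x, y))
    return data

rng = random.Random(1)
train, test = make_data(200, rng), make_data(2000, rng)
for lam in (0.0, 0.1, 1.0):
    w = proximal_sgd(train, lam=lam, steps=2000)
    gap = abs(average_loss(w, test) - average_loss(w, train))
    print(f"lam={lam}: train loss={average_loss(w, train):.4f}, gap={gap:.4f}")
```

Here the held-out loss serves as a proxy for the population risk, so `gap` only estimates the generalization error; setting `lam=0.0` recovers plain SGD.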

## 8 Conclusion

In this paper, we provided probabilistic guarantees for the generalization error of SGD in various nonconvex optimization scenarios. We obtained improved bounds based on the variance of the stochastic gradients by exploiting the optimization path of SGD. In particular, the gradient dominant geometry improves the generalization error bound by facilitating the convergence in optimization, and the strongly convex regularizers significantly improve the probabilistic concentration bounds for the generalization error from the sublinear rate to the exponential rate. Our study demonstrates that the geometric structure of the problem can be an important factor in improving the generalization performance of algorithms. Thus, it is of interest to explore the generalization error under various geometric conditions of the objective function in future work.

## Appendix A Proof of Main Results

### Proof of Proposition 1

The proof is based on [Lemma 11, (Elisseeff et al., 2005)] and Assumption 1. Denote as the data set that replaces the -th sample of with an i.i.d. copy generated from the distribution . Following from Lemma 11 of (Elisseeff et al., 2005), we obtain

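$$
\begin{aligned}
\mathbb{E}_{S,\xi}\,|f_S(w_{T,S})-f(w_{T,S})|^2
&\le \frac{2M^2}{n}+\frac{12M}{n}\sum_{i=1}^{n}\mathbb{E}_{\xi,S,S^i}\big[|\ell(w_{T,S};z_i)-\ell(w_{T,S^i};z_i)|\big]\\
&\le \frac{2M^2}{n}+\frac{12M\sigma}{n}\sum_{i=1}^{n}\mathbb{E}_{\xi,S,S^i}\|w_{T,S}-w_{T,S^i}\|\\
&= \frac{2M^2}{n}+12M\sigma\,\mathbb{E}_{\xi,S,\bar S}\|w_{T,S}-w_{T,\bar S}\|,
\end{aligned}
$$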

where the second inequality uses the Lipschitz property of the loss function in Assumption 1, and the last equality is due to the fact that the perturbed samples in and are generated i.i.d. from the underlying distribution.

### Proof of Theorem 1

The proof is based on the following two important lemmas, which we prove first.

###### Lemma 2.

Let Assumption 1 hold. Apply SGD with the same sample path to solve the ERM with data sets and , respectively. Choose with . Then, the following bound holds.

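$$\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t+1,S,\bar S}\big]\le(1+\alpha_t L)\,\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t,S,\bar S}\big]+\frac{2\alpha_t}{n}\,\mathbb{E}_{S,\xi}\big[\|\nabla\ell(w_{t,S};z_1)\|\big].$$

###### Proof of Lemma 2.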

Consider the two fixed data sets and that differ at, say, the first data sample. At the -th iteration, we consider two cases of the sampled index . In the first case, (w.p. ), i.e., the sampled data from and are the same, and we obtain that

$$\delta_{t+1,S,\bar S}\le(1+\alpha_t L)\,\delta_{t,S,\bar S},\tag{11}$$

where the last inequality uses the -Lipschitz property of . In the other case, (w.p. ), we obtain that

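$$
\begin{aligned}
\delta_{t+1,S,\bar S}
&=\big\|w_{t,S}-\alpha_t\nabla\ell(w_{t,S};z_1)-w_{t,\bar S}+\alpha_t\nabla\ell(w_{t,\bar S};z_1')\big\|\\
&\le\delta_{t,S,\bar S}+\alpha_t\big(\|\nabla\ell(w_{t,S};z_1)\|+\|\nabla\ell(w_{t,\bar S};z_1')\|\big).
\end{aligned}\tag{12}
$$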

Combining the above two cases and taking expectation with respect to all randomness, we obtain that

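$$
\begin{aligned}
\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t+1,S,\bar S}\big]
&\le\Big[\frac{n-1}{n}(1+\alpha_t L)+\frac{1}{n}\Big]\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t,S,\bar S}\big]+\frac{1}{n}\alpha_t\,\mathbb{E}_{S,\bar S,\xi}\big(\|\nabla\ell(w_{t,S};z_1)\|+\|\nabla\ell(w_{t,\bar S};z_1')\|\big)\\
&\overset{(i)}{\le}(1+\alpha_t L)\,\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t,S,\bar S}\big]+\frac{2\alpha_t}{n}\,\mathbb{E}_{S,\xi}\big[\|\nabla\ell(w_{t,S};z_1)\|\big],
\end{aligned}\tag{13}
$$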

where (i) uses the fact that is an i.i.d. copy of . ∎
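The recursion in Lemma 2 can be unrolled numerically to see how the stability bound accumulates over iterations. The sketch below assumes a constant upper bound `grad_bound` on the expected gradient norm, the stepsize $\alpha_t = c/((t+2)\log(t+2))$, and $\delta_0 = 0$ (both runs start from the same initialization); the function name and all numerical values are hypothetical.

```python
import math

def unroll_stability_recursion(T, n, L, c, grad_bound):
    """Iterate d_{t+1} = (1 + alpha_t * L) * d_t + 2 * alpha_t * grad_bound / n
    for t = 0, ..., T-1, starting from d_0 = 0."""
    d = 0.0
    for t in range(T):
        alpha = c / ((t + 2) * math.log(t + 2))   # assumed stepsize schedule
        d = (1.0 + alpha * L) * d + 2.0 * alpha * grad_bound / n
    return d

# Hypothetical constants; the result scales as 1/n for fixed T.
for n in (100, 1000, 10000):
    print(n, unroll_stability_recursion(T=n, n=n, L=1.0, c=0.5, grad_bound=1.0))
```

Because the recursion is linear in the driving term $2\alpha_t G/n$, the unrolled value is exactly proportional to $1/n$ for a fixed horizon; the growth in $T$ comes from the product of the factors $(1+\alpha_t L)$.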

###### Lemma 3.

Let Assumptions 1 and 2 hold. Apply SGD to solve the ERM with data set and choose for some . Then, the following bound holds.

$$\mathbb{E}_{\xi,S}\big[\|\nabla\ell(w_{t,S};z_1)\|\big]\le\sqrt{2Lf(w_0)+\frac{1}{2}\mathbb{E}_S[\nu_S^2]}.$$

###### Proof of Lemma 3.

By Assumption 1, is nonnegative and is -Lipschitz. Then, eq. (12.6) of (Shalev-Shwartz & Ben-David, 2014) shows that

$$\forall w,\quad\|\nabla\ell(w;z)\|\le\sqrt{2L\,\ell(w;z)}.\tag{14}$$
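The self-bounding property in eq. 14 can be checked numerically on a concrete nonnegative smooth loss. The sketch below uses the scalar logistic loss $\ell(w)=\log(1+e^{-w})$, which is nonnegative and whose derivative is $\tfrac14$-Lipschitz; this choice of loss and the grid of test points are assumptions made here for illustration.

```python
import math

def logistic_loss(w):
    """log(1 + exp(-w)), computed stably."""
    return max(-w, 0.0) + math.log1p(math.exp(-abs(w)))

def logistic_grad(w):
    """Derivative of log(1 + exp(-w)), i.e. -sigmoid(-w), computed stably."""
    if w >= 0:
        e = math.exp(-w)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(w))

L_SMOOTH = 0.25  # Lipschitz constant of the derivative of the logistic loss

def self_bounding_holds(w, tol=1e-12):
    """Check |l'(w)| <= sqrt(2 * L * l(w)) up to a small tolerance."""
    return abs(logistic_grad(w)) <= math.sqrt(2.0 * L_SMOOTH * logistic_loss(w)) + tol

grid = [k / 10.0 for k in range(-200, 201)]   # w in [-20, 20]
print(all(self_bounding_holds(w) for w in grid))
```

The check relies only on nonnegativity and smoothness of the loss, which is why the same inequality can be used for nonconvex losses satisfying those two properties.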

Based on eq. 14, we further obtain that

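$$
\begin{aligned}
\mathbb{E}_{\xi,S}\|\nabla\ell(w_{t,S};z_1)\|
&\le\sqrt{2L}\,\mathbb{E}_{\xi,S}\sqrt{\ell(w_{t,S};z_1)}\overset{(i)}{\le}\sqrt{2L}\sqrt{\mathbb{E}_{\xi,S}\,\ell(w_{t,S};z_1)}\\
&\overset{(ii)}{\le}\sqrt{2L}\sqrt{\mathbb{E}_{\xi,S}\frac{1}{n}\sum_{j=1}^{n}\ell(w_{t,S};z_j)}=\sqrt{2L}\sqrt{\mathbb{E}_{\xi,S}\,f_S(w_{t,S})},
\end{aligned}\tag{15}
$$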

where (i) uses Jensen's inequality and (ii) uses the fact that all samples in are generated i.i.d. from .

Next, consider a fixed data set and denote as the sampled stochastic gradient at iteration . Then, by smoothness of and the update rule of the SGD, we obtain that

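$$
\begin{aligned}
f_S(w_{t+1,S})-f_S(w_{t,S})
&\le\langle w_{t+1,S}-w_{t,S},\nabla f_S(w_{t,S})\rangle+\frac{L}{2}\|w_{t+1,S}-w_{t,S}\|^2\\
&=\langle-\alpha_t g_{t,S},\nabla f_S(w_{t,S})\rangle+\frac{L\alpha_t^2}{2}\|g_{t,S}\|^2.
\end{aligned}
$$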

Conditioning on and taking expectation with respect to , we further obtain from the above inequality that

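$$\mathbb{E}_{\xi}\big[f_S(w_{t+1,S})-f_S(w_{t,S})\,\big|\,w_{t,S}\big]\le\Big(\frac{L\alpha_t^2}{2}-\alpha_t\Big)\|\nabla f_S(w_{t,S})\|^2+\frac{L\alpha_t^2}{2}\,\mathbb{E}_{\xi}\big[\|g_{t,S}\|^2-\|\nabla f_S(w_{t,S})\|^2\,\big|\,w_{t,S}\big].\tag{16}$$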

Note that by our choice of stepsize. Further taking expectation with respect to the randomness of and , and telescoping the above inequality over , we obtain that

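$$
\begin{aligned}
\mathbb{E}_{\xi,S}\big[f_S(w_{t,S})\big]
&\overset{(i)}{\le}\mathbb{E}_S f_S(w_0)+\sum_{t'=0}^{t-1}\frac{L\alpha_{t'}^2}{2}\,\mathbb{E}_S[\nu_S^2]\\
&=f(w_0)+\sum_{t'=0}^{t-1}\frac{Lc^2\,\mathbb{E}_S[\nu_S^2]}{2(t'+2)^2}\overset{(ii)}{\le}f(w_0)+\frac{Lc^2\,\mathbb{E}_S[\nu_S^2]}{4},
\end{aligned}
$$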

where (i) uses the fact that the variance of the stochastic gradients is bounded by , and (ii) upper bounds the summation by the integral, i.e., . Substituting the above result into eq. 15 and noting that , we obtain the desired result. ∎

Now by Lemma 2, we obtain that

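$$
\begin{aligned}
\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t+1,S,\bar S}\big]
&\le(1+\alpha_t L)\,\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t,S,\bar S}\big]+\frac{2\alpha_t}{n}\,\mathbb{E}_{S,\xi}\big[\|\nabla\ell(w_{t,S};z_1)\|\big]\\
&\overset{(i)}{\le}(1+\alpha_t L)\,\mathbb{E}_{S,\bar S,\xi}\big[\delta_{t,S,\bar S}\big]+\frac{2\alpha_t}{n}\sqrt{2Lf(w_0)+\frac{\mathbb{E}_S[\nu_S^2]}{2}},
\end{aligned}\tag{17}
$$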

where (i) applies Lemma 3. Recursively applying eq. 17 over and noting that and , we obtain

 ES,¯¯¯S,ξ[δT] ≤T−1∑t=0[T−1∏k=t+1(1+αkL)]2c√2Lf(w0)+ES[ν2S]2(t+2)log(t+2)n (i)≤T−1∑t=0[exp(T−1∑k=t+1cL(k+2)log(k+2))]2c√2Lf(w0)+ES[ν2S]2(t+2)log(t+2)n (ii)≤T−1∑t=0(logTlog(t+2))cL2c√2Lf(w0)+E