For supervised learning algorithms ranging from classical linear regression, logistic regression, and boosting to modern large-scale deep networks, the overall performance or expected excess risk can always be decomposed into two parts: the empirical error (or training error) and the generalization error (characterizing the discrepancy between the test error and the training error). A central theme in machine learning is to find an appropriate balance between empirical error and generalization error, because improperly emphasizing one over the other typically results in either overfitting or underfitting. Specifically, in the context of supervised learning models trained by iterative optimization algorithms, the empirical error at each iteration is commonly controlled by convergence rate analysis, and the generalization error can be handled by algorithmic stability analysis (Devroye and Wagner, 1979; Bousquet and Elisseeff, 2002).
The convergence rate of an algorithm characterizes how fast the optimization error decreases as the number of iterations grows. Recent years have witnessed rapid advances in convergence rate analysis of specific optimization methods over particular classes of loss functions. Such analysis has been carried out for many gradient methods, including gradient descent (GD), Nesterov's accelerated gradient descent (NAG), stochastic gradient descent (SGD), and stochastic gradient Langevin dynamics (SGLD), for convex, strongly convex, and even nonconvex functions (see, e.g., Boyd and Vandenberghe (2004); Bubeck et al. (2015); Nesterov (2013); Jin et al. (2017); Raginsky et al. (2017)). However, until the optimization error and the generalization error of these algorithms are analyzed together, it is not clear whether the fastest-converging optimization algorithm is the best for learning.
On the other hand, algorithmic stability (Devroye and Wagner, 1979; Bousquet and Elisseeff, 2002) has been introduced as an alternative way to control generalization error in learning problems, instead of uniform convergence results such as classical VC theory (Vapnik et al., 1994) and Rademacher complexity (Bartlett and Mendelson, 2003). The stability concept has an intuitive appeal: an algorithm is stable if it is robust to small perturbations in the composition of the learning data set. Recently it has been shown that algorithmic stability is well suited for controlling the generalization error of stochastic gradient methods (Hardt et al., 2016), as well as of the stochastic gradient Langevin dynamics algorithm (Mou et al., 2017).
While most previous papers study the convergence rate and the algorithmic stability of an optimization algorithm separately, a natural question arises: what is the relationship or trade-off between the convergence rate and the algorithmic stability of an iterative algorithm? Is it possible to design an algorithm that converges fastest and is at the same time most stable? If not, is there a fundamental limit on the trade-off between the two quantities, so that a fast algorithm necessarily has to be unstable?
This paper shows that there is indeed such a fundamental limit: for any iterative algorithm, at any time step, the sum of the optimization error and the stability is lower bounded by the minimax statistical error over a given loss function class. Therefore, a fast-converging algorithm cannot be too stable, and a stable algorithm cannot converge too fast. This framework provides a new criterion for comparing optimization algorithms by jointly considering convergence rate and algorithmic stability. As a consequence, it can be immediately applied to derive a new class of convergence lower bounds for algorithms with different stability rates.
In particular, we focus on two settings where the loss functions are either convex and smooth or strongly convex and smooth. In the first setting, we discuss the stability upper bounds of gradient descent (GD), stochastic gradient descent (SGD), and their variants with decreasing step sizes. New stability upper bounds are provided for Nesterov's accelerated gradient descent (NAG) and the heavy ball method (HB) under quadratic losses, and we conjecture that these upper bounds still hold for general convex smooth losses. Applying the stability upper bounds for GD and SGD in our trade-off framework, we obtain convergence lower bounds for them that match the known convergence upper bounds up to constants. Jointly considering convergence rate and algorithmic stability for NAG and GD, the trade-off shows that NAG must be less stable than GD even though it converges faster. In the second setting, where the loss functions are strongly convex and smooth, we also provide stability upper bounds and deduce convergence lower bounds for GD and NAG via our trade-off framework. Finally, simulations are conducted to show that the stability bounds established have the correct rates as functions of the sample size $n$ and the iteration number $t$. These bounds are shown to be more useful than classical uniform convergence bounds for understanding the overall performance of an algorithm in large-scale learning settings, because they better capture the generalization error at early iterations of these algorithms.
1.1 Related work
The first quantitative results on controlling generalization error via algorithmic stability date back to Rogers and Wagner (1978) and Devroye and Wagner (1979). This line of research was further developed by Bousquet and Elisseeff (2002), who provided guarantees for general supervised learning algorithms and insights into the practice of regularized algorithms. It remains unclear, however, what the algorithmic stability of general iterative optimization algorithms is. Recently, to show the effectiveness of commonly used optimization algorithms in many large-scale learning problems, algorithmic stability has been established for stochastic gradient methods (Hardt et al., 2016), stochastic gradient Langevin dynamics (Mou et al., 2017), as well as for any algorithm in situations where global minima are approximately achieved (Charles and Papailiopoulos, 2017).
Lower bounds on convergence rate
Given the importance of efficient optimization methods, many papers have been devoted to understanding the fundamental computational limits of convex optimization. Such lower bounds typically focus on a specific class of algorithms. A classical line of research has focused on first-order algorithms, where only first-order information (i.e., gradients) can be queried through an oracle model; see the book by Boyd and Vandenberghe (2004), the monograph by Bubeck et al. (2015), and references therein for further details. For convex functions, the first lower bound argument, given in Nemirovsky et al. (1982), applies to first-order algorithms whose current iterate lies in the linear span of previous gradients. It was later extended to arbitrary deterministic, and then stochastic, first-order algorithms (Agarwal and Bottou, 2015; Woodworth and Srebro, 2016).
1.2 Organization of the paper
The rest of the paper is organized as follows. In Section 2, we set up the necessary background on the classical excess risk decomposition and introduce the trade-off between optimization error (or computational bias) and generalization error. In Section 3, we provide the main theorem on the trade-off between convergence rate (as an upper bound on optimization error) and algorithmic stability (as an upper bound on generalization error). In Section 4, we establish uniform stability bounds for several gradient methods and show that our main theorem applies to these algorithms to obtain their convergence lower bounds. In Section 5, we first provide simulation results validating the rates of our stability bounds as functions of the sample size and the iteration number, and then illustrate via a simulated logistic regression problem that our stability bounds reflect the generalization errors better than simple uniform convergence bounds for GD and NAG.
In this section, we set up the necessary background on excess risk decomposition and convex optimization. Using the classical excess risk decomposition, we introduce the trade-off between expected optimization error and generalization error, which is crucial for stating our main result in the next section.
2.1 Excess risk decomposition
Throughout this paper, we consider the standard setting of supervised learning. Suppose that we are given $n$ samples $S = (z_1, \dots, z_n)$, each lying in some space $\mathcal{Z}$ and drawn i.i.d. according to a distribution $P$. The standard decision-theoretic approach is to estimate a parameter $w$ in a parameter space $\Omega$ by minimizing a loss function of the form $\ell(w; z)$, which measures the fit between the model indexed by the parameter $w$ and the sample $z$.
Given the collection of samples $S$ and a loss function $\ell$, the principle of empirical risk minimization is based on the objective function
$$F_S(w) = \frac{1}{n} \sum_{i=1}^n \ell(w; z_i).$$
This empirical risk serves as a sample-average proxy for the population risk
$$F(w) = \mathbb{E}_{z \sim P}[\ell(w; z)].$$
We denote by $\hat w$ an estimator computed from the sample $S$. The statistical question is how to bound the excess risk, measured in terms of the difference between the population risk of $\hat w$ and the minimal risk over the entire parameter space $\Omega$,
$$F(\hat w) - \inf_{w \in \Omega} F(w).$$
In most of our analysis, $\hat w$ is the output of an optimization algorithm at a particular iteration based on the sample $S$. We further denote by $\hat w_S$ an empirical risk minimizer. Note that $\hat w$ and $\hat w_S$ are in general not the same estimator.
For simplicity, we assume that there exists some $w^* \in \Omega$ such that $F(w^*) = \inf_{w \in \Omega} F(w)$. (If the infimum is not achieved within $\Omega$, for example when $\Omega$ is an open set, we can choose some $w^*$ where this equality holds up to an arbitrarily small error.)
Controlling the excess risk of the estimator $\hat w$ is usually done by decomposing it into three terms as follows:
$$F(\hat w) - F(w^*) = \underbrace{F(\hat w) - F_S(\hat w)}_{(a)} + \underbrace{F_S(\hat w) - F_S(w^*)}_{(b)} + \underbrace{F_S(w^*) - F(w^*)}_{(c)}.$$
Term (a) is the generalization error of the model $\hat w$. Term (b) is the difference in empirical risk between the model $\hat w$ and the population risk minimizer $w^*$. Term (c) is the generalization error of $w^*$.
Taking expectations in the previous decomposition and noticing that $\mathbb{E}[F_S(w^*)] = F(w^*)$, so that term (c) vanishes in expectation, we obtain first a decomposition of the expected excess risk and then an upper bound:
$$\mathbb{E}\big[F(\hat w) - F(w^*)\big] = \mathbb{E}\big[F(\hat w) - F_S(\hat w)\big] + \mathbb{E}\big[F_S(\hat w) - F_S(w^*)\big] \le \mathbb{E}\big[F(\hat w) - F_S(\hat w)\big] + \mathbb{E}\big[F_S(\hat w) - F_S(\hat w_S)\big].$$
The last inequality follows from the fact that $\hat w_S$ is the empirical risk minimizer. Hence, the expected excess risk is upper bounded by the sum of the expected generalization error $\mathbb{E}[F(\hat w) - F_S(\hat w)]$ and the expected optimization error or computational bias $\mathbb{E}[F_S(\hat w) - F_S(\hat w_S)]$. We formally define these two quantities, indexed by the estimator $\hat w$, the loss function $\ell$, the data distribution $P$, and the sample size $n$, as
$$\varepsilon_{\mathrm{gen}}(\hat w, \ell, P, n) = \mathbb{E}\big[F(\hat w) - F_S(\hat w)\big], \qquad \varepsilon_{\mathrm{opt}}(\hat w, \ell, P, n) = \mathbb{E}\big[F_S(\hat w) - F_S(\hat w_S)\big].$$
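As a quick numerical illustration (a minimal sketch of our own, with a hypothetical one-dimensional squared loss whose population risk is known in closed form, not an example from the paper), the three-term decomposition above can be checked exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, z):            # hypothetical squared loss l(w; z) = (w - z)^2 / 2
    return 0.5 * (w - z) ** 2

def pop_risk(w):           # F(w) = E[(w - z)^2] / 2 with z ~ N(0, 1)
    return 0.5 * (w ** 2 + 1.0)

z = rng.normal(size=200)   # i.i.d. sample S

def emp_risk(w):           # F_S(w)
    return loss(w, z).mean()

w_star = 0.0               # population risk minimizer
w_hat = z.mean()           # estimator: here the exact empirical risk minimizer

term_a = pop_risk(w_hat) - emp_risk(w_hat)     # generalization error of w_hat
term_b = emp_risk(w_hat) - emp_risk(w_star)    # empirical risk gap (<= 0 for the ERM)
term_c = emp_risk(w_star) - pop_risk(w_star)   # generalization error of w_star
excess = pop_risk(w_hat) - pop_risk(w_star)

# The three terms sum exactly (telescoping) to the excess risk.
assert abs((term_a + term_b + term_c) - excess) < 1e-12
assert term_b <= 0.0
```

Because the estimator here is the exact empirical risk minimizer, term (b) is nonpositive, which is what makes the final upper bound in the decomposition valid.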
Making the optimization error appear in the decomposition is useful for analyzing optimization algorithms in an iterative manner. As noted in Bousquet and Bottou (2008), introducing the optimization error allows one to analyze algorithms that perform approximate optimization. However, our framework is different from that introduced by Bousquet and Bottou (2008): we control the generalization error via iteration-dependent algorithmic stability instead of directly invoking uniform convergence results. As we are going to show, for most iterative optimization algorithms, upper bounding the generalization error by simple uniform convergence is often loose, and algorithmic stability can serve as a tighter bound.
2.2 Algorithmic Stability
Many forms of algorithmic stability have been introduced to characterize generalization error (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002). For the purpose of this paper, we are only interested in the uniform stability notion introduced by Bousquet and Elisseeff (2002). An algorithm $A$, which outputs a model $A(S)$ for a sample $S$, is $\varepsilon$-uniformly stable if for all $z \in \mathcal{Z}$ and for every pair of data samples $S = (z_1, \dots, z_n)$ and $S' = (z_1', \dots, z_n')$ differing in at most one element, with each $z_i$ and $z_i'$ i.i.d. sampled from $P$, we have
$$\big|\ell(A(S); z) - \ell(A(S'); z)\big| \le \varepsilon.$$
As we did for the generalization error, we use $\varepsilon_{\mathrm{stab}}$ to denote the uniform stability of an algorithm $A$.
A stable algorithm has the property that removing one element from its learning data set does not change its outcome by much. Such a data perturbation scheme is closely related to the jackknife in statistics (Efron, 1982). One can further show that uniform stability implies a bound on the expected generalization error (Bousquet and Elisseeff, 2002). For completeness, we reformulate this property in the following lemma: if an algorithm, which outputs a model $A(S)$ for a sample $S$, is $\varepsilon$-uniformly stable, then its expected generalization error is bounded as
$$\big|\mathbb{E}\big[F(A(S)) - F_S(A(S))\big]\big| \le \varepsilon.$$
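To make the uniform stability definition concrete, here is a small numerical sketch (our own toy example, not from the paper): the "learner" is simply the empirical mean, the loss is a squared loss on a bounded domain, and we check the loss gap after replacing a single sample:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w, z):              # squared loss; 2-Lipschitz in w when |w|, |z| <= 1
    return 0.5 * (w - z) ** 2

def algorithm(sample):       # toy "learner": the empirical mean
    return sample.mean()

n = 100
S = rng.uniform(-1, 1, size=n)
S_prime = S.copy()
S_prime[0] = rng.uniform(-1, 1)       # replace a single sample

gap = abs(algorithm(S) - algorithm(S_prime))    # |A(S) - A(S')| <= 2/n here
z_grid = np.linspace(-1, 1, 201)                # sup over test points z
stab = np.max(np.abs(loss(algorithm(S), z_grid)
                     - loss(algorithm(S_prime), z_grid)))

# Replacing one point moves the mean by at most 2/n, and the 2-Lipschitz
# loss transfers this model gap to a uniform stability bound of 4/n.
assert gap <= 2.0 / n + 1e-12
assert stab <= 2.0 * gap + 1e-12
```

The pattern used here, bounding the model gap first and then converting it into a loss gap via the Lipschitz condition, is exactly the proof strategy used for the gradient methods in Section 4.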
Lemma 2.2 implies that $\varepsilon_{\mathrm{gen}} \le \varepsilon_{\mathrm{stab}}$. The proof provided by Bousquet and Elisseeff (2002) relies on a symmetrization argument and makes use of the i.i.d. assumption on the samples in $S$. Combining this with the expected excess risk decomposition of the previous section, we conclude that the sum of the uniform stability and the expected optimization error (or computational bias) constitutes an upper bound on the expected excess risk,
$$\mathbb{E}\big[F(\hat w) - F(w^*)\big] \le \varepsilon_{\mathrm{stab}} + \varepsilon_{\mathrm{opt}}. \tag{2}$$
Note that the result is stated for a fixed loss function $\ell$ and a fixed data distribution $P$. Equation (2) is a key inequality for our analysis. Not only does it provide a way to upper bound the expected excess risk without uniform convergence results, but it also connects the statistical excess risk with the optimization convergence rate (or computational bias). This can be seen as reminiscent of the bias-variance trade-off of an algorithm in a computational sense, since stability serves as a computational variability term and optimization error as a computational bias term.
2.3 Convex optimization settings
Throughout the paper, we focus on two types of loss functions: in the first, the loss $\ell(\cdot; z)$ is $\mu$-strongly convex and $\beta$-smooth for every $z$; in the second, the loss is convex and $\beta$-smooth for every $z$. We also make use of the $L$-Lipschitz condition. We provide the definitions here; more technical details about convex optimization and relevant results are deferred to Appendix B. A function $f$ is $L$-Lipschitz if for all $w, v$ we have
$$|f(w) - f(v)| \le L \, \|w - v\|.$$
A continuously differentiable function $f$ is $\beta$-smooth if for all $w, v$ we have
$$\|\nabla f(w) - \nabla f(v)\| \le \beta \, \|w - v\|.$$
A differentiable function $f$ is convex if for all $w, v$ we have
$$f(w) \ge f(v) + \langle \nabla f(v), w - v \rangle.$$
A function $f$ is $\mu$-strongly convex if for all $w, v$ we have
$$f(w) \ge f(v) + \langle \nabla f(v), w - v \rangle + \frac{\mu}{2} \|w - v\|^2.$$
3 Trade-off between stability and convergence rate
In this section, we introduce the trade-off between stability and convergence rate via excess risk decomposition under two settings of loss functions mentioned in the previous section: the convex smooth setting and the strongly convex smooth setting. We show that for any iterative algorithm, at any time step, the sum of optimization error and stability is lower bounded by the minimax statistical error over a given loss function class. Thus algorithms sharing the same stability upper bound can be grouped to obtain convergence rate lower bounds. This provides a new class of convergence lower bounds for algorithms with different stability bounds.
We are interested in distribution-independent stability and convergence, where we take the supremum of these two quantities over distributions and losses. For a fixed iterative algorithm that outputs $w_t$ at iteration $t$, we define its uniform stability $\varepsilon_{\mathrm{stab}}(t, n)$ and optimization error $\varepsilon_{\mathrm{opt}}(t, n)$ as the suprema of the corresponding quantities over the data distributions and the loss functions under consideration.
Note that in this paper, the supremum is taken over the class of all loss functions under either of the two settings considered (convex smooth and strongly convex smooth settings).
3.1 Trade-off in the convex smooth setting
Before we state the main theorem, we first define the loss function class of interest in this section. We define the class of all convex smooth loss functions as follows,
In the convex smooth setting, we have the following lower bound on the sum of stability and convergence rate. Suppose an iterative algorithm outputs $w_t$ at iteration $t$ on an empirical loss built upon a loss $\ell$ and an i.i.d. sample of size $n$, and it has uniform stability $\varepsilon_{\mathrm{stab}}(t, n)$ and optimization error $\varepsilon_{\mathrm{opt}}(t, n)$; then there exists a universal constant $c > 0$ such that
$$\varepsilon_{\mathrm{stab}}(t, n) + \varepsilon_{\mathrm{opt}}(t, n) \ \ge\ \sup_{\ell, P} \mathbb{E}\big[F(w_t) - F(w^*)\big] \ \ge\ \frac{c}{\sqrt{n}}.$$
The first inequality of Theorem 3.1 is a simple outcome of the excess risk decomposition in Equation (2). This first inequality is not tied to the convex smooth setting and generalizes to a wide class of optimization algorithms. The second inequality is based on an adaptation of the classical method of Le Cam (1986) for minimax estimation lower bounds to the convex smooth loss function class. Further, if we know the uniform stability precisely, we obtain an immediate corollary that provides convergence lower bounds for stable optimization algorithms.
Under the conditions of Theorem 3.1, if an algorithm has uniform stability
$$\varepsilon_{\mathrm{stab}}(t, n) \le \frac{f(t)}{n}$$
with $f$ a divergent function of $t$, i.e. $\lim_{t \to \infty} f(t) = \infty$, then there exist a universal constant $c' > 0$, a sample size $n(t)$, and an iteration number $t_0$, such that for $t \ge t_0$, its convergence rate is lower bounded as follows,
$$\varepsilon_{\mathrm{opt}}\big(t, n(t)\big) \ \ge\ \frac{c'}{f(t)}.$$
Even though Theorem 3.1 is valid for any pair $(t, n)$, Corollary 3.1 requires choosing a specific sample size in the construction. However, under the assumption that the optimization algorithm has a convergence rate independent of the sample size (i.e., $\varepsilon_{\mathrm{opt}}$ is not a function of $n$), we can obtain via Corollary 3.1 a convergence lower bound that is comparable to the lower bounds in the convex optimization literature. We remark that this assumption is satisfied by commonly used optimization algorithms such as GD and NAG.
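To make the mechanism behind Corollary 3.1 explicit, the following short derivation (our own reconstruction, with $c$ the universal constant of Theorem 3.1 and integrality of the sample size ignored) shows how a stability bound $f(t)/n$ is converted into a convergence lower bound of order $1/f(t)$ by choosing the sample size as a function of $t$:

```latex
% Theorem 3.1: for every pair (t, n),
\[
\varepsilon_{\mathrm{stab}}(t, n) + \varepsilon_{\mathrm{opt}}(t, n) \;\ge\; \frac{c}{\sqrt{n}} .
\]
% Assume \varepsilon_{\mathrm{stab}}(t, n) \le f(t)/n and that \varepsilon_{\mathrm{opt}}
% does not depend on n. Set
\[
n_t = \Big( \frac{2 f(t)}{c} \Big)^{2},
\qquad\text{so that}\qquad
\frac{f(t)}{n_t} = \frac{c^{2}}{4 f(t)},
\qquad
\frac{c}{\sqrt{n_t}} = \frac{c^{2}}{2 f(t)} .
\]
% Subtracting the stability term yields the convergence lower bound
\[
\varepsilon_{\mathrm{opt}}(t) \;\ge\; \frac{c}{\sqrt{n_t}} - \frac{f(t)}{n_t}
\;=\; \frac{c^{2}}{4 f(t)} \;=\; \Omega\!\Big( \frac{1}{f(t)} \Big) .
\]
```

For GD, where $f(t)$ grows linearly in $t$, this recovers an $\Omega(1/t)$ lower bound; for NAG, where $f(t)$ grows quadratically, it gives $\Omega(1/t^2)$.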
Theorem 3.1 and Corollary 3.1 provide the trade-off between stability and optimization convergence rate: iterative optimization methods that are uniformly stable cannot converge too fast. This motivates the idea of grouping optimization methods by their algorithmic stability; optimization methods that share the same algorithmic stability have the same optimization lower bound. The proof of Theorem 3.1 is provided in Appendix A.1 and that of Corollary 3.1 in Appendix A.2.
3.2 Trade-off in the strongly convex smooth setting
Similar to the convex smooth setting, we define the class of all strongly convex smooth loss functions as follows,
In the strongly convex smooth setting, we have the following lower bound on the sum of stability and convergence rate. Suppose an iterative algorithm outputs $w_t$ at iteration $t$ on an empirical loss built upon a loss $\ell$ and an i.i.d. sample of size $n$, and it has uniform stability $\varepsilon_{\mathrm{stab}}(t, n)$ and optimization error $\varepsilon_{\mathrm{opt}}(t, n)$; then there exists a universal constant $c > 0$ such that
$$\varepsilon_{\mathrm{stab}}(t, n) + \varepsilon_{\mathrm{opt}}(t, n) \ \ge\ \frac{c}{n}.$$
The trade-off in the strongly convex smooth setting is similar to that of the convex smooth setting, except that the minimax estimation rate is of order $1/n$ instead of $1/\sqrt{n}$. Theorem 3.2 provides the trade-off between stability and optimization convergence rate in the strongly convex setting. Note that a corollary similar to Corollary 3.1 can be derived in this setting as well. The proof of Theorem 3.2 is provided in Appendix A.3.
4 Stability of first order optimization algorithms and implications for convergence lower bounds
This section is devoted to establishing stability bounds of popular first-order optimization algorithms and showing that our main theorem can be applied to these algorithms to obtain their convergence lower bounds. In particular, Subsection 4.1 establishes uniform stability for first-order iterative methods in the convex smooth setting, and Subsection 4.2 discusses the consequences of applying Theorem 3.1 to various optimization algorithms. Subsection 4.3 provides uniform stability for first-order iterative algorithms in the strongly convex smooth setting, and Subsection 4.4 discusses the consequences of applying Theorem 3.2 to GD and NAG.
The goal of proving uniform stability at iteration $t$ is to bound the difference
$$\big|\ell(w_t^S; z) - \ell(w_t^{S'}; z)\big|$$
for the sample $S$ and the perturbed one $S'$, uniformly for every $z$, where $S$ and $S'$ are drawn i.i.d. from a distribution $P$ and differ in one element. Here $w_t^S$ denotes the output model of our optimization algorithm at iteration $t$ based on the sample $S$. The optimization algorithm is applied to the pair of data samples to get two sequences of successive models $(w_t^S)_{t \ge 0}$ and $(w_t^{S'})_{t \ge 0}$. For simplicity, we write $w_t$ for $w_t^S$ and $w_t'$ for $w_t^{S'}$. We first bound the model estimate difference $\|w_t - w_t'\|$, then use the $L$-Lipschitz condition on $\ell$ to prove stability.
Recall that the empirical loss function for the data sample $S = (z_1, \dots, z_n)$ is
$$F_S(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w; z_i).$$
On the other hand, taking the perturbed sample $S'$ to differ from $S$ in its last element $z_n'$, the empirical loss function for $S'$ is
$$F_{S'}(w) = \frac{1}{n} \Big( \sum_{i=1}^{n-1} \ell(w; z_i) + \ell(w; z_n') \Big).$$
Remark that the two empirical loss functions differ in only one term, whose weight is proportional to the inverse of the sample size $n$.
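The effect of this single differing term can be seen directly in a small simulation (our own sketch, using the linear loss $\ell(w; z) = \langle z, w \rangle$, which the stability analysis below identifies as a worst case): the two GD trajectories separate at an exactly linear rate in $t$, scaled by $1/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eta, T = 50, 5, 0.1, 100

Z = rng.uniform(-1, 1, size=(n, d))
Z_prime = Z.copy()
Z_prime[-1] = rng.uniform(-1, 1, size=d)    # perturb only the last sample

# With the linear loss l(w; z) = <z, w>, grad F_S(w) is the mean of the z_i,
# so the two GD trajectories separate by exactly eta * ||z_n - z_n'|| / n per step.
g = Z.mean(axis=0)
g_prime = Z_prime.mean(axis=0)

w = np.zeros(d)
w_prime = np.zeros(d)
for t in range(1, T + 1):
    w = w - eta * g
    w_prime = w_prime - eta * g_prime

drift = np.linalg.norm(w - w_prime)
expected = eta * T * np.linalg.norm(Z[-1] - Z_prime[-1]) / n

assert abs(drift - expected) < 1e-10   # linear growth in t, 1/n scaling
```

This linear accumulation of the perturbation is exactly the behavior captured by the GD stability bound of the next subsection.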
4.1 Stability in the convex smooth setting
We establish uniform stability for gradient descent, stochastic gradient descent, Nesterov's accelerated gradient method, and the heavy ball method with fixed momentum parameter when the loss function is convex and smooth.
4.1.1 Gradient descent (GD)
The gradient descent algorithm is an iterative optimization method that uses the full gradient at each iteration (see the book by Boyd and Vandenberghe (2004)). Given a convex smooth objective $F$, GD starts at some initial point $w_0$ and iterates with the following recursion,
$$w_{t+1} = w_t - \eta \nabla F(w_t),$$
where $\eta$ is the step size. Typically, one chooses a fixed $\eta \le 1/\beta$ to ensure convergence (Boyd and Vandenberghe, 2004). In the empirical risk minimization setting, the objective of the optimization is either $F_S$ or $F_{S'}$.
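The recursion above can be sketched in a few lines (our own example on a least-squares empirical risk; the step size $\eta = 1/\beta$ follows the convergence condition just mentioned and guarantees monotone descent):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def emp_risk(w):                 # F_S(w) for the squared loss
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):                     # gradient of F_S
    return X.T @ (X @ w - y) / n

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant of F_S
eta = 1.0 / beta                               # fixed step size eta = 1/beta

w = np.zeros(d)
risks = [emp_risk(w)]
for _ in range(100):
    w = w - eta * grad(w)        # GD recursion: w_{t+1} = w_t - eta * grad F_S(w_t)
    risks.append(emp_risk(w))

# With eta <= 1/beta, gradient descent decreases a smooth convex objective monotonically.
assert all(risks[i + 1] <= risks[i] + 1e-12 for i in range(len(risks) - 1))
```

The monotone decrease checked at the end is the standard descent-lemma guarantee for $\beta$-smooth objectives and step sizes at most $1/\beta$.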
Given a data distribution $P$, under the assumption that $\ell(\cdot; z)$ is a convex, $L$-Lipschitz, and $\beta$-smooth function for every $z$, the gradient method with constant step size $\eta \le 2/\beta$ on the empirical risk with sample size $n$, which outputs $w_t$ at iteration $t$, has the following uniform stability bound for all $t \ge 1$,
$$\varepsilon_{\mathrm{stab}}(t, n) \le \frac{2 L^2 \eta\, t}{n}. \tag{3}$$
We remark that this stability bound does not depend on the exact form of the loss function $\ell$ or of the data distribution $P$. The proof of this theorem is provided in Appendix B.1. The key step of our proof is that, in this setup, the error caused by the difference in empirical loss functions accumulates linearly as the iteration count increases. We also show in Appendix B.1 that this stability upper bound is achieved by a linear loss function.
4.1.2 Nesterov accelerated gradient methods (NAG)
Nesterov's accelerated gradient method attains the optimal convergence rate in the smooth non-strongly convex setting under the deterministic first-order oracle (Nesterov, 1983). Given a convex smooth objective $F$, starting at some initial point $w_0 = v_0$, NAG uses the following updates,
$$v_{t+1} = w_t - \eta \nabla F(w_t), \qquad w_{t+1} = v_{t+1} + \frac{\lambda_t - 1}{\lambda_{t+1}} \big( v_{t+1} - v_t \big),$$
where $\eta$ is the step size. The parameter $\lambda_t$ is defined by the following recursion,
$$\lambda_0 = 0, \qquad \lambda_{t+1} = \frac{1 + \sqrt{1 + 4 \lambda_t^2}}{2},$$
satisfying $\lambda_t \ge (t+1)/2$. We only provide a uniform stability bound for NAG when the empirical risk function is quadratic. We conjecture that the same stability bound holds for general convex smooth functions.
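The scheme above can be sketched as follows (our own example on a quadratic empirical risk, using the standard $\lambda_t$ recursion; the variable names `x` for the momentum iterate and `v` for the gradient iterate are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets, so min F_S = 0

def f(w):                            # quadratic empirical risk F_S
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y) / n

beta = np.linalg.eigvalsh(X.T @ X / n).max()
eta = 1.0 / beta

x = np.zeros(d)                      # extrapolated iterate (w_t in the text)
v = x.copy()                         # gradient iterate (v_t in the text)
lam = 0.0
for _ in range(200):
    lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
    lam_after = (1 + np.sqrt(1 + 4 * lam_next ** 2)) / 2
    v_new = x - eta * grad(x)                          # gradient step
    x = v_new + ((lam_next - 1) / lam_after) * (v_new - v)   # momentum step
    v = v_new
    lam = lam_next

# NAG's O(1/t^2) guarantee puts f(v) far below the initial risk after 200 iterations.
assert f(v) < 1e-3
```

On this quadratic the accelerated method reaches a small error well within the $O(\beta \|w_0 - w^*\|^2 / t^2)$ worst-case guarantee.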
Given a data distribution $P$, under the assumption that $\ell(\cdot; z)$ is an $L$-Lipschitz, $\beta$-smooth, convex quadratic loss function defined on a bounded domain for every $z$, Nesterov's accelerated gradient method with fixed step size $\eta$, which outputs $w_t$ at iteration $t$, has a uniform stability bound of the form, for all $t \ge 1$,
$$\varepsilon_{\mathrm{stab}}(t, n) \le \frac{C L^2 \eta\, t^2}{n},$$
where $C > 0$ is a universal constant.
The proof of the theorem is provided in Appendix B.2. We also show in the Appendix that this stability upper bound is achieved by a linear loss function. Note that, unlike the full gradient method and stochastic gradient descent, the stability bound of Nesterov's accelerated gradient method depends quadratically on the iteration number $t$. Even though NAG can still have small stability when early stopping is used, its stability grows faster than that of GD at the same iteration.
4.1.3 The heavy ball method with a fixed momentum
The heavy ball method (HB), like NAG, is a multi-step extension of the gradient descent method (Polyak, 1964). With a fixed step size $\eta$ and a fixed momentum parameter $\mu$, the heavy ball method has the following updates: for $t \ge 1$,
$$w_{t+1} = w_t - \eta \nabla F(w_t) + \mu \big( w_t - w_{t-1} \big),$$
with fixed $\mu \in [0, 1)$. As for NAG, we provide a uniform stability bound for the heavy ball method only when the empirical risk function is quadratic, and we conjecture that the same stability bound holds for general convex smooth functions.
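A minimal sketch of the update on a simple quadratic objective (our own example; the momentum value $\mu = 0.5$ and the diagonal quadratic are illustrative choices, not from the paper):

```python
import numpy as np

# f(w) = 0.5 * w^T A w - b^T w with A diagonal: convex and beta-smooth with beta = 1
A = np.diag([1.0, 0.4, 0.1])
b = np.ones(3)
w_star = np.linalg.solve(A, b)     # unique minimizer

def grad(w):
    return A @ w - b

eta, mu = 1.0, 0.5                 # fixed step size and fixed momentum

w = np.zeros(3)
w_prev = w.copy()
for _ in range(300):
    # heavy ball recursion: w_{t+1} = w_t - eta * grad(w_t) + mu * (w_t - w_{t-1})
    w, w_prev = w - eta * grad(w) + mu * (w - w_prev), w

assert np.linalg.norm(w - w_star) < 1e-8
```

On this quadratic, every eigen-mode of the two-step recursion has complex characteristic roots of modulus $\sqrt{\mu}$, so the iterates contract geometrically to the minimizer.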
Given a data distribution $P$, under the assumption that $\ell(\cdot; z)$ is an $L$-Lipschitz, $\beta$-smooth, convex quadratic loss function defined on a bounded domain for every $z$, the heavy ball method with a fixed step size $\eta$ and a fixed momentum parameter $\mu$, which outputs $w_t$ at iteration $t$, has a uniform stability bound of the form, for all $t \ge 1$,
$$\varepsilon_{\mathrm{stab}}(t, n) \le \frac{C_{\mu} L^2 \eta\, t}{n},$$
where $C_{\mu}$ is a constant depending only on the momentum parameter $\mu$.
The proof of this theorem is provided in Appendix B.3. This theorem shows that although the heavy ball method with a fixed step size and a fixed momentum parameter also uses multi-step gradients, it is more stable than NAG, with a stability bound of order $t/n$ rather than $t^2/n$. This demonstrates that a multi-step scheme does not necessarily lead to a stability bound similar to or worse than that of NAG.
4.1.4 Other methods with known stability
In this subsection, we restate the stability bounds of some other gradient methods for completeness. These stability bounds are not new, but they serve as the basis of our discussion of the convergence lower bounds implied by Theorem 3.1 in Subsection 4.2.
Stochastic gradient descent (SGD) with fixed or varying step-size
Stochastic gradient descent is a randomized iterative optimization algorithm. Instead of using the full gradient information, it randomly chooses one data sample at each iteration and updates the parameter estimate according to the gradient on that sample. It starts at some initial point $w_0$ and iterates with the following recursion, with $i_t$ chosen from the set $\{1, \dots, n\}$ uniformly at random:
$$w_{t+1} = w_t - \eta_t \nabla \ell(w_t; z_{i_t}).$$
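The recursion can be sketched as follows (our own example on a realizable least-squares problem; the fixed step size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5])       # realizable (noiseless) targets

def emp_risk(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w = np.zeros(d)
eta = 0.05                                # fixed step size
for _ in range(5000):
    i = rng.integers(n)                   # i_t chosen uniformly at random
    w = w - eta * (X[i] @ w - y[i]) * X[i]   # gradient of the single-sample loss

# On a realizable problem SGD with a small fixed step contracts toward the solution.
assert emp_risk(w) < 0.05
```

Note that on non-realizable problems a fixed step size leaves a noise floor, which is exactly the non-convergence phenomenon discussed for fixed step-size SGD in Subsection 4.2.2.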
Hardt et al. (2016) adapted the definition of uniform stability to randomized algorithms and showed that fixed step-size stochastic gradient descent has an $O(\eta L^2 t / n)$ uniform stability bound in the convex, $L$-Lipschitz, and $\beta$-smooth setting. According to Theorem 3.8 in Hardt et al. (2016), for step size $\eta \le 2/\beta$ we have
$$\varepsilon_{\mathrm{stab}}(t, n) \le \frac{2 \eta L^2 t}{n}$$
for any convex, $L$-Lipschitz, and $\beta$-smooth loss function $\ell$. This is a restatement of the result of Hardt et al. (2016) in our notation.
Hardt et al. (2016) further consider stochastic gradient descent with decreasing step sizes and show that it remains uniformly stable in the same setting, with a stability bound governed by the sum of the step sizes.
Stochastic gradient Langevin dynamics (SGLD)
Stochastic gradient Langevin dynamics (SGLD) is a popular variant of stochastic gradient descent, in which properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration (Gelfand and Mitter, 1991). SGLD with inverse temperature $\beta_T$ and step sizes $\eta_t$ starts at some initial point $w_0$ and iterates with the following recursion, with $i_t$ chosen from the set $\{1, \dots, n\}$ uniformly at random and $\xi_t \sim \mathcal{N}(0, I)$:
$$w_{t+1} = w_t - \eta_t \nabla \ell(w_t; z_{i_t}) + \sqrt{\frac{2 \eta_t}{\beta_T}}\, \xi_t.$$
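A minimal SGLD sketch (our own example; the inverse temperature, the $1/k$ step-size schedule, and all constants are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5])

def emp_risk(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

inv_temp = 1e4            # illustrative inverse temperature (low injected noise)
w = np.zeros(d)
for k in range(1, 3001):
    eta_k = 0.5 / k       # decreasing step size of order 1/k
    i = rng.integers(n)
    g = (X[i] @ w - y[i]) * X[i]                      # stochastic gradient
    noise = np.sqrt(2 * eta_k / inv_temp) * rng.normal(size=d)
    w = w - eta_k * g + noise                         # SGLD update

assert np.all(np.isfinite(w))
assert emp_risk(w) < emp_risk(np.zeros(d))
```

With a large inverse temperature the injected noise is small and the trajectory behaves like decreasing-step SGD; lowering the inverse temperature makes the iterates sample from a broader Gibbs-like distribution instead of settling at the minimizer.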
SGLD plays an important role in both sampling and optimization. It can be viewed as a stochastic discretization of the Langevin equation $\mathrm{d}W_t = -\nabla F(W_t)\,\mathrm{d}t + \sqrt{2/\beta_T}\,\mathrm{d}B_t$, where $B_t$ is a standard Brownian motion. Recent work by Raginsky et al. (2017) has shown its effectiveness in nonconvex learning, with both optimization and generalization guarantees.
When SGLD is applied to optimization, decreasing step sizes should be used to ensure convergence to local minima. We study this particular step-size setting of SGLD. It has been shown by Mou et al. (2017) that SGLD admits a uniform stability bound for $L$-Lipschitz convex loss functions, growing with the accumulated step sizes. Plugging in the decreasing step-size schedule yields a uniform stability bound at iteration $t$ for any convex, $L$-Lipschitz, and $\beta$-smooth loss function $\ell$. This is an adaptation of the result of Mou et al. (2017) to our notation.
4.2 Consequences for the convergence lower bound in convex smooth setting
In this section, we apply Theorem 3.1 and Corollary 3.1 to obtain convergence lower bounds for a variety of first order optimization algorithms mentioned above. Furthermore, we compare the convergence lower bound we obtain with the known convergence upper bound for each of the optimization methods mentioned in the previous section. The known convergence upper bounds mentioned in this section can be found in the optimization textbooks (See Boyd and Vandenberghe (2004) or Bubeck et al. (2015)). We also discuss how our lower bounds compare to those obtained from classical oracle model of complexity by Nemirovsky et al. (1982).
Note that the assumptions in Theorem 3.1 are slightly different from those used to establish the stability bounds in the previous section: the former assumes a bounded domain while the latter assumes the $L$-Lipschitz condition. To make the two assumptions compatible, in this subsection we assume that the domain $\Omega$ is a fixed bounded set of diameter $D$ and that, for every $z$, there exists a point $w_z \in \Omega$ at which $\nabla \ell(w_z; z) = 0$. Then the loss is $L$-Lipschitz with $L = \beta D$. This is because for any $w \in \Omega$,
$$\|\nabla \ell(w; z)\| = \|\nabla \ell(w; z) - \nabla \ell(w_z; z)\| \le \beta \|w - w_z\| \le \beta D.$$
In Table 1, we summarize all the uniform stability results and the corresponding convergence lower bounds in the convex smooth setting. While exact constants are provided in the main text, we only show the dependence on the iteration number $t$ and the sample size $n$ in the table.
|Method|Uniform stability|Convergence upper bound (known)|Convergence lower bound (ours)|
|---|---|---|---|
|GD|$O(t/n)$|$O(1/t)$|$\Omega(1/t)$|
|SGD, fixed step size|$O(t/n)$|—|$\Omega(1/t)$|
|NAG*|$O(t^2/n)$|$O(1/t^2)$|$\Omega(1/t^2)$|
|HB*, fixed momentum|$O(t/n)$|$O(1/t)$|$\Omega(1/t)$|

(*Stability bounds established for quadratic losses; conjectured for general convex smooth losses.)
4.2.1 Gradient descent
According to Equation (3) in Theorem 4.1.1, the fixed step-size full gradient method has $O(\eta L^2 t / n)$ uniform stability. Applying Corollary 3.1, and knowing that its convergence rate does not depend on $n$, we obtain that its convergence rate is lower bounded by $\Omega(1/t)$. It is known (see, e.g., Bubeck et al. (2015)) that for a convex and $\beta$-smooth function on $\Omega$, the full gradient method with step size $\eta = 1/\beta$ satisfies
$$F_S(w_t) - \min_{w \in \Omega} F_S(w) \le O\Big( \frac{\beta \|w_0 - w^*\|^2}{t} \Big).$$
The convergence rate lower bound obtained via our stability trade-off thus matches the known upper bound up to constant factors.
4.2.2 Stochastic gradient descent
According to Hardt et al. (2016), fixed step-size stochastic gradient descent also has $O(\eta L^2 t / n)$ uniform stability. Applying Corollary 3.1, we obtain a convergence rate lower bound of order $1/t$. However, it is known that fixed step-size stochastic gradient descent cannot drive the error arbitrarily close to zero at the rate $O(1/t)$ (Delyon and Juditsky, 1993); the best achievable rate for minimizing a smooth non-strongly convex function with noisy gradients is of order $1/\sqrt{t}$ (Nemirovski et al., 2009). Therefore, in the case of fixed step-size SGD, the convergence lower bound we provide is valid but loose. Fixed step-size SGD is a stable algorithm but not a convergent one.
On the other hand, it is shown in the same work (Nemirovski et al., 2009) that the $O(1/\sqrt{t})$ convergence rate is achieved by stochastic gradient descent with decreasing step sizes of order $1/\sqrt{t}$. Using our stability argument, we provide insight into why stochastic gradient descent with decreasing step sizes cannot converge too fast. It also follows from Hardt et al. (2016) that stochastic gradient descent with decreasing step sizes of order $1/\sqrt{t}$ has $O(\sqrt{t}/n)$ uniform stability. Applying Corollary 3.1, we conclude that with this decreasing step-size schedule, stochastic gradient descent cannot converge faster than $\Omega(1/\sqrt{t})$.
Similar arguments can be used to address the conjecture by Moulines and Bach (2011) on the optimal convergence rates of stochastic gradient descent with polynomially decaying step sizes. Moulines and Bach (2011) provide a convergence rate upper bound for stochastic gradient descent in the convex $\beta$-smooth case under such step sizes, and Hardt et al. (2016) establish the matching uniform stability in this setup. Applying Corollary 3.1, we provide a proof of this conjecture, confirming the optimality of the convergence rate upper bound.
4.2.3 Nesterov accelerated gradient descent
According to Theorem 4.1.2, Nesterov's accelerated gradient descent with fixed step size has $O(t^2/n)$ uniform stability for quadratic loss functions. Under the conjecture that the same stability bound holds for general convex smooth loss functions, Corollary 3.1 yields that its convergence rate is lower bounded by $\Omega(1/t^2)$. This is compatible with the convergence rate upper bound provided in Nesterov (1983): for a convex and $\beta$-smooth function, Nesterov's accelerated gradient method with step size $\eta = 1/\beta$ satisfies
$$F(v_t) - F(w^*) \le O\Big( \frac{\beta \|w_0 - w^*\|^2}{t^2} \Big).$$
We can compare our stability-based lower bounds to the classical way of obtaining complexity lower bounds through the first-order oracle model (Nemirovsky et al., 1982; Nesterov, 2013). The classical oracle-model lower bound applies to all first-order optimization methods that fall into the following black-box framework: the method starts from an initialization $w_0$, and at iteration $t$ the iterate lies in the linear span of all previously queried gradients. In contrast, our results show that every optimization method with $O(t^2/n)$ uniform stability in the smooth non-strongly convex setting has a convergence rate lower bounded by $\Omega(1/t^2)$. The two lower bounds have a similar form but apply in different scenarios. One remarkable property of our result is that it does not depend on how the algorithm is initialized.
4.2.4 Heavy ball method with fixed step-size
According to Theorem 4.1.3, the heavy ball method with fixed step size and fixed momentum parameter has $O(t/n)$ uniform stability for quadratic loss functions. Under the conjecture that the same stability bound holds for general convex smooth loss functions, applying Corollary 3.1, we obtain that its convergence rate is lower bounded by $\Omega(1/t)$. First, this lower bound matches the convergence rate upper bound proved in Ghadimi et al. (2015). Second, unlike Nesterov's accelerated gradient descent, the heavy ball method with fixed step size and momentum is not able to achieve the optimal convergence rate $O(1/t^2)$, even though it also uses multiple steps of gradients. Another viewpoint on this result is that the careful choice of the iteration-dependent momentum weights in NAG is necessary for its optimal convergence guarantee.
4.2.5 Stochastic gradient Langevin dynamics (SGLD)
According to Mou et al. (2017), stochastic gradient Langevin dynamics with inverse temperature $\beta$ and decreasing step-sizes, when used for convex optimization, has $O(\sqrt{t}/n)$-uniform stability. Applying Corollary 3.1, we conclude that its convergence rate is lower bounded by $\Omega(1/\sqrt{t})$. While the additional noise injected by SGLD may be helpful for escaping local minima in certain non-convex optimization settings, as stated in Mou et al. (2017), SGLD has a slower worst-case convergence than GD or SGD based on our stability argument.
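A single SGLD update is a gradient step plus injected Gaussian noise whose variance is set by the step-size and the inverse temperature; the sketch below is our own minimal illustration (numpy assumed, names hypothetical):

```python
import numpy as np

def sgld_step(w, grad, step, beta, rng):
    """One SGLD update: gradient step plus Gaussian noise with
    per-coordinate variance 2 * step / beta (beta = inverse temperature)."""
    noise = rng.normal(size=w.shape) * np.sqrt(2.0 * step / beta)
    return w - step * grad(w) + noise
```

At large $\beta$ (low temperature) the noise is small and the iterates behave like noisy gradient descent; it is precisely this injected noise that degrades the worst-case stability and hence, by the trade-off, the worst-case convergence rate.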
4.3 Stability in the strongly convex smooth setting
In this subsection, we establish uniform stability for gradient descent and Nesterov accelerated gradient descent in the strongly convex smooth setting, where the loss function $f(\cdot; z)$ is $\mu$-strongly convex and $L$-smooth for every $z$.
4.3.1 Gradient descent (GD)
The gradient descent method in the strongly convex setting has exactly the same updates as before: given a strongly convex smooth objective $F$, for $t \ge 0$, $w_{t+1} = w_t - \eta \nabla F(w_t)$, where $\eta$ is the step-size. While the algorithm stays the same, strong convexity of the loss function allows the algorithm to enjoy better stability.
Given a data distribution $\mathcal{D}$, under the assumption that $f(\cdot; z)$ is $\mu$-strongly convex, $L$-smooth and $G$-Lipschitz for every $z$, the full gradient method with constant step-size $\eta \le 1/L$, which outputs $w_t$ at iteration $t$, has uniform stability of order $G^2/(\mu n)$, uniformly in $t$.
The proof of this theorem is provided in Appendix C.1.
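The mechanism behind this improved, iteration-independent stability is that under strong convexity a gradient step contracts the distance between any two iterates by a factor $(1 - \eta\mu)$, so perturbations stop accumulating. A small numerical sketch on a toy quadratic of our own (numpy assumed):

```python
import numpy as np

# f(w) = 0.5 * w^T A w, with strong convexity mu = 1 and smoothness L = 10
A = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
eta = 1.0 / L                  # constant step-size

def gd_step(w):
    """One gradient descent step on f."""
    return w - eta * (A @ w)

# Track how much one GD step shrinks the distance between two trajectories.
rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
ratios = []
for _ in range(50):
    before = np.linalg.norm(u - v)
    u, v = gd_step(u), gd_step(v)
    ratios.append(np.linalg.norm(u - v) / before)
```

Every observed ratio stays below $1 - \eta\mu$, the contraction factor that drives the bounded stability above.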
4.3.2 Stochastic gradient descent (SGD) with fixed step-size
The stochastic gradient descent method in the strongly convex setting has exactly the same updates as before. It starts at some initial point $w_0$, and iterates with the recursion $w_{t+1} = w_t - \eta \nabla f(w_t; z_{i_t})$, with $i_t$ chosen from the set $\{1, \dots, n\}$ uniformly at random.
The stability of SGD in the strongly convex setting was first discussed in Hardt et al. (2016). According to Theorem 3.10 in Hardt et al. (2016), the uniform stability of SGD in the strongly convex setting is upper bounded by $\frac{2G^2}{\mu n}$ at every iteration $t$, for any $\mu$-strongly convex, $G$-Lipschitz and $L$-smooth loss function $f$.
4.3.3 Nesterov accelerated gradient descent (NAG)
Unlike in the convex smooth setting, Nesterov's accelerated gradient descent can take a fixed momentum parameter in the strongly convex smooth setting: $x_{t+1} = y_t - \eta \nabla F(y_t)$, $y_{t+1} = x_{t+1} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}(x_{t+1} - x_t)$, where $\eta = 1/L$ is the step-size and $\kappa = L/\mu$ is the condition number.
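A minimal sketch of this fixed-momentum variant (our own illustrative implementation, with numpy assumed and names hypothetical):

```python
import numpy as np

def nag_strongly_convex(grad, x0, mu, L, iters):
    """NAG for mu-strongly convex, L-smooth objectives: step 1/L and
    fixed momentum (sqrt(kappa) - 1) / (sqrt(kappa) + 1), kappa = L / mu."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    eta = 1.0 / L
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x = y - eta * grad(y)            # gradient step at the lookahead point
        y = x + beta * (x - x_prev)      # fixed-momentum extrapolation
        x_prev = x
    return x_prev
```

With the momentum coefficient fixed by the condition number, the iterates converge linearly at a rate governed by $1 - 1/\sqrt{\kappa}$.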
We prove its uniform stability for $\mu$-strongly convex, $L$-smooth quadratic loss functions.
Given a data distribution $\mathcal{D}$, under the assumption that $f(\cdot; z)$ is $\mu$-strongly convex, $L$-smooth and $G$-Lipschitz for every $z$, the Nesterov accelerated gradient descent method described above, which outputs $w_t$ at iteration $t$, has uniform stability of order $1/n$, uniformly in $t$ (with constants depending on $G$, $\mu$, and the condition number $\kappa$).
The proof of this theorem is provided in Appendix C.2.
4.4 Consequences for the convergence lower bound in the strongly convex setting
In this subsection, we obtain convergence lower bounds for GD and NAG in the $\mu$-strongly convex $L$-smooth setting via Theorem 3.2. In Table 2, we summarize all the uniform stability results and the corresponding convergence lower bounds in the strongly convex smooth setting. While exact constants are provided in the main text, the table only shows the dependency on the iteration number $t$ and the sample size $n$.
| Method | Uniform stability | Convergence upper bound (known) | Convergence lower bound (ours) |
4.4.1 Gradient descent
According to Theorem 4.3.1, gradient descent with a fixed step-size $\eta$ in the strongly convex smooth setting has uniform stability of order $G^2/(\mu n)$, independent of $t$. We apply Theorem 3.2 to obtain a lower bound on the convergence of GD for strongly convex smooth functions.
If the leading constants matched, we could directly obtain a lower bound on the convergence of order $(1 - \eta\mu)^t$, as we would expect. Unfortunately, our proof of the empirical risk minimization lower bound loses a couple of constant factors, so directly applying the stability bound makes it impossible to match the leading constants.
Therefore, our trade-off result only gives a convergence lower bound for GD with an offset of order $1/n$, as stated in Equation (13).
Remark that a similar lower bound can be obtained for stochastic gradient descent using exactly the same argument as for GD.
4.4.2 Nesterov accelerated gradient descent
According to Theorem 4.3.3, Nesterov accelerated gradient descent with fixed step-size in the strongly convex smooth setting has uniform stability of order $1/n$ for quadratic loss functions. Since the construction of the minimax lower bound in Theorem 3.2 is based on quadratic loss functions, applying Theorem 3.2 restricted to quadratic losses, we obtain an expected convergence lower bound of order $(1 - \sqrt{\mu/L})^t$, with an offset of order $1/n$.
5 Simulation Experiments
In this section, we first show, via simulations of a simple logistic regression on the breast-cancer-wisconsin dataset, that the stability bounds established in this paper have the right scaling in the iteration number $t$. Second, we illustrate via a logistic regression problem that the stability bound characterizes the generalization error better than a simple uniform convergence bound, at least for the first iterations of GD and NAG.
5.1 Algorithmic Stability Rate Scaling
We evaluate our stability bounds for all gradient methods mentioned above on logistic regression with the binary classification dataset breast-cancer-wisconsin (Wolberg and Mangasarian, 1990); we denote its sample size by $n$ and its dimension by $d$. The logistic regression problem is formulated as follows.
Given a set of $n$ i.i.d. samples $(x_1, y_1), \dots, (x_n, y_n)$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, we want to estimate the parameter $w \in \mathbb{R}^d$ which characterizes the conditional distribution of $y$ given $x$: $\mathbb{P}(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^\top x}}$.
Let $y = (y_1, \dots, y_n)^\top$ and let $X \in \mathbb{R}^{n \times d}$ be the matrix with $x_i^\top$ as its $i$-th row. The (negative) log-likelihood objective we minimize is $f(w) = \frac{1}{n} \sum_{i=1}^{n} \left[ \log\left(1 + e^{w^\top x_i}\right) - y_i\, w^\top x_i \right]$.
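This objective and its gradient can be written in a few lines; the sketch below is our own (Python with numpy assumed, function names illustrative):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic negative log-likelihood over the sample."""
    z = X @ w
    return float(np.mean(np.log1p(np.exp(z)) - y * z))

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss: mean of (sigmoid(z_i) - y_i) * x_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))    # predicted probabilities
    return X.T @ (p - y) / len(y)
```

The gradient can be sanity-checked against central finite differences of the loss.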
It can be shown that this objective is Lipschitz and smooth, with both constants of order one once the covariate matrix $X$ is normalized so that its maximum eigenvalue equals one. When there is no regularization, the individual loss functions are not strongly convex. In all of our experiments we use a constant step-size $\eta$.

To construct samples that differ in only one data point, we first fix a sample $S$ of size $n$ from the dataset, then construct a perturbed sample $S'$ by changing one data point in $S$, and finally run the optimization algorithm on both samples to compute and plot the model difference $\|w_t - w_t'\|$. This norm difference constitutes an estimate of the uniform stability, up to constants independent of $t$ and $n$. The perturbation of the sample is repeated several times. Figure 1 shows the estimated uniform stability, averaged over the independent repeats, for all gradient methods considered: Nesterov accelerated gradient descent, the heavy ball method with fixed momentum, the full gradient method with fixed step-size, the full gradient method with decreasing step-size, the stochastic gradient method with fixed step-size, and the stochastic gradient method with decreasing step-size. We observe that the estimated uniform stabilities of the full gradient method, the stochastic gradient method, and the heavy ball method with fixed step-size all have slope 1 in the log-log plot, while the Nesterov accelerated gradient method has slope 2. As expected, the methods with decreasing step-size have a slope smaller than 1. Even though the stability bounds of NAG and HB are only established for quadratic losses, the estimated stability in the simulation leads us to conjecture that these bounds still hold in the general convex smooth setting.
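The perturbation experiment described above can be sketched as follows. Since loading the actual dataset is environment-specific, we substitute a synthetic Gaussian design; all sizes, names, and the step-size here are illustrative, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, iters, eta = 200, 5, 200, 1.0

# Sample S, and a copy S' differing in exactly one data point
X = rng.normal(size=(n, d))
X /= np.sqrt(np.max(np.linalg.eigvalsh(X.T @ X)) / n)   # normalize: lambda_max(X^T X / n) = 1
y = (rng.random(n) < 0.5).astype(float)
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)      # replace one covariate vector
y2[0] = 1.0 - y2[0]             # and flip its label

def grad(w, X, y):
    """Gradient of the average logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

# Run GD in parallel on S and S'; ||w_t - w_t'|| estimates the uniform stability
w, w2, diffs = np.zeros(d), np.zeros(d), []
for _ in range(iters):
    w = w - eta * grad(w, X, y)
    w2 = w2 - eta * grad(w2, X2, y2)
    diffs.append(np.linalg.norm(w - w2))
```

Plotting `diffs` against the iteration index on a log-log scale produces the slope estimates discussed above; by the convex-case argument, `diffs[t]` is always dominated by the linear-in-$t$ stability bound $2\eta G t / n$, with $G$ the largest per-sample Lipschitz constant.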
5.2 Algorithmic stability vs simple uniform convergence bounds
The goal of this simulation is to show that algorithmic stability characterizes the generalization error better than simple uniform convergence bounds, which cannot easily account for the growth of the effective function space of an iterative algorithm. For a $d$-dimensional estimation problem, a simple uniform convergence argument gives a generalization error bound of order $\sqrt{d/n}$. The exact constant in the uniform convergence bound depends on the function space and is hard to characterize for iterative algorithms. More refined uniform convergence bounds via Rademacher complexity (Bartlett and Mendelson, 2003) may be possible, but we are not aware of such results for general iterative algorithms. In this section, we show via simulations that the simple uniform convergence bound of order $\sqrt{d/n}$ is less precise than the stability bound in characterizing the generalization error. More precisely, when the dimension $d$ and the number of samples $n$ are large and the iteration number $t$ is small, we have $\epsilon_{\mathrm{stab}}(t) < \sqrt{d/n}$, where $\epsilon_{\mathrm{stab}}(t)$ is the stability bound for GD or NAG. We show in the next two experiments that this comparison is valid and that the stability bound is more relevant in large-scale problems.
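As a back-of-the-envelope illustration of this comparison, using the convex-case GD stability bound $2L^2\eta t/n$ from Hardt et al. (2016) and purely hypothetical problem sizes:

```python
import numpy as np

# Hypothetical sizes for illustration only
L_lip, eta, d, n = 1.0, 1.0, 50, 10_000
t = np.arange(1, 1001)
stab = 2 * L_lip ** 2 * eta * t / n       # GD stability bound (convex case), linear in t
threshold = np.sqrt(d / n)                # simple uniform convergence rate sqrt(d/n)
crossover = int(t[stab < threshold][-1])  # last iteration where the stability bound is tighter
```

For these values, the stability bound remains below $\sqrt{d/n}$ for the first few hundred iterations, after which uniform convergence takes over; this is exactly the "large $d$ and $n$, small $t$" regime described above.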
In both experiments, we fix the true parameter $w^*$ and randomly draw $n$ i.i.d. samples