# Stability and Deviation Optimal Risk Bounds with Convergence Rate O(1/n)

The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondrák, 2018, 2019), (Bousquet, Klochkov, Zhivotovskiy, 2020) contain a generally inevitable sampling error term of order Θ(1/√(n)). When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the so-called Bernstein condition is satisfied, the term Θ(1/√(n)) can be avoided, and high probability excess risk bounds of order up to O(1/n) are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate O(log n/n) for strongly convex and Lipschitz losses valid for any empirical risk minimization method. This resolves a question of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2009). We discuss how O(log n/n) high probability excess risk bounds are possible for projected gradient descent in the case of strongly convex and Lipschitz losses without the usual smoothness assumption.

## 1 Introduction

Stability is a standard method to analyze the generalization properties of learning algorithms. This approach can be traced back to the foundational works of Vapnik and Chervonenkis [45]. Using the sensitivity of the learning algorithms to the removal of one example in the learning sample, they proved optimal bounds (scaling as $O(1/n)$, where $n$ is the sample size) on the average risk of hard margin SVM and of the Perceptron algorithm. The ideas of stability were further developed by Rogers and Wagner [39], Devroye and Wagner [12, 13], Lugosi and Pawlak [32], Kearns and Ron [25] and other authors. Stability arguments are notorious for only providing in-expectation error bounds. High probability guarantees require more effort and have led to several long-standing open problems in the literature. For example, the classical stability analysis of Vapnik and Chervonenkis [45, 21] has only recently been refined to allow high probability guarantees with the optimal error rate [49, 7, 18].

The widely used notion of stability allowing high probability upper bounds is called uniform stability. It was introduced in the seminal work of Bousquet and Elisseeff [6]. Let us introduce some standard notation. We have a set of i.i.d. observations $X_1, \ldots, X_n$ sampled according to some unknown distribution $P$ defined on an abstract set $\mathcal{X}$. One may naturally think of $X_i$ as a set of instances with their labels. Our decision rules are indexed by a set $\mathcal{W}$ that is always assumed to be a closed subset of a separable Hilbert space. Given the learning sample $(X_1, \ldots, X_n)$, a learning algorithm produces the decision rule $w_n = w_n(X_1, \ldots, X_n) \in \mathcal{W}$. For the loss function $\ell : \mathcal{X} \times \mathcal{W} \to \mathbb{R}_{+}$, we define the risk and the empirical risk of $w \in \mathcal{W}$, respectively as

$$R(w) = \mathbb{E}_{P}\, \ell(X, w), \qquad R_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(X_i, w).$$

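Following [6], an algorithm $w_n$ (we always use the word algorithm both for the mapping and for the decision rule) is uniformly $\gamma$-stable, if for any $x, x', x_1, \ldots, x_n \in \mathcal{X}$ and $i \in \{1, \ldots, n\}$, it holds that

$$\bigl|\ell(x, w_n(x_1, \ldots, x_n)) - \ell(x, w_n(x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_n))\bigr| \le \gamma.$$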

The paper of Hardt, Recht, and Singer [19] on the stability of gradient descent methods has generated a wave of interest in this direction. Recent works use various notions of stability in their analysis: some authors are motivated by the analysis of gradient descent algorithms [31, 29, 15, 4], while others use the notion of average stability to obtain the in-expectation $O(1/n)$ rate for regularized regression [28, 17, 46] and some more specific improper learning procedures [36, 37]. One of the key open questions left in [19] is related to the lack of high probability generalization bounds. Their question inspired a line of research focused on getting the sharpest possible generalization bounds for uniformly stable algorithms. Based on the recent progress by Feldman and Vondrák [15, 16], the sharpest known high probability bound for uniformly stable algorithms was shown by Bousquet, Klochkov and Zhivotovskiy [8]. Their result states that for a $\gamma$-uniformly stable algorithm, if the loss is bounded by $M$, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, it holds that

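$$R(w_n) - R_n(w_n) \lesssim \underbrace{\gamma \log n \, \log\left(\tfrac{1}{\delta}\right)}_{\text{stability error}} + \underbrace{M \sqrt{\tfrac{1}{n} \log\left(\tfrac{1}{\delta}\right)}}_{\text{sampling error}}. \tag{1}$$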

One problem inherent to all high probability generalization bounds is that they are insensitive to the stability parameter $\gamma$ being smaller than $1/\sqrt{n}$. That is, in this favorable case the sampling error term scaling as $1/\sqrt{n}$ controls the generalization error. The situation where $\gamma$ is smaller than $1/\sqrt{n}$ happens in the literature on stochastic convex optimization, where strongly convex objectives are frequently considered [41, 42, 38, 20]. Unfortunately, there is no generic way to remove the $1/\sqrt{n}$ term in (1). It appears even for the algorithms that always output the same decision rule ($0$-uniform stability). The problem is that generalization bounds compare the finite-sample risk $R_n(w_n)$ with its non-empirical counterpart, namely, the population risk $R(w_n)$.
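To make the scale of the two terms in (1) concrete, here is a small numeric sketch. It drops the absolute constant hidden in $\lesssim$, and the values $M = 1$, $\delta = 0.05$ and $\gamma = 1/n$ are illustrative assumptions, not quantities taken from the paper; the function name is hypothetical.

```python
import math

def generalization_bound_terms(n, gamma, M=1.0, delta=0.05):
    """Return (stability_term, sampling_term) of bound (1), absolute constants dropped."""
    log_term = math.log(1.0 / delta)
    stability = gamma * math.log(n) * log_term      # gamma * log n * log(1/delta)
    sampling = M * math.sqrt(log_term / n)          # M * sqrt(log(1/delta) / n)
    return stability, sampling

# Illustrative regime gamma = 1/n, where the sampling term is the bottleneck.
for n in (10**2, 10**4, 10**6):
    stab, samp = generalization_bound_terms(n, gamma=1.0 / n)
    print(f"n={n:>8}  stability~{stab:.2e}  sampling~{samp:.2e}")
```

In this regime the stability term decays like $\log n / n$ while the sampling term decays only like $1/\sqrt{n}$, which is the gap the text describes.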

A frequently used alternative to generalization bounds, which avoids the sampling error, is the excess risk bound. That is, we are interested in upper bounding

$$R(w_n) - \inf_{w \in \mathcal{W}} R(w).$$

Via a standard decomposition, the generalization bounds of the form (1) can be translated into excess risk bounds for the empirical risk minimization algorithm (ERM). However, in this case the sampling error propagates into the excess risk bound, leading to suboptimal results in the cases where we expect the $O(1/n)$ rate of convergence. Thus, we focus on the following question:

Can uniform stability provide high probability excess risk bounds with the rate (up to) $O(1/n)$?

The main result of this paper answers this question positively and provides the first high probability bound based on uniform stability allowing the $O(1/n)$ rate of convergence. Similar questions appeared earlier in the literature on stochastic convex optimization, where optimal in-expectation results usually follow from stability. In particular, Shalev-Shwartz, Shamir, Srebro, and Sridharan asked in their pathbreaking paper [41] if a high probability excess risk bound for strongly convex and Lipschitz losses with the rate $O(1/n)$ is possible. As a corollary of our main result, we resolve their question by getting an almost optimal high probability bound with the rate $O(\log n / n)$.

### 1.1 Main results

It is well known that the $O(1/n)$ rate of convergence for the excess risk cannot be achieved for free, so we need to introduce an additional assumption. We consider the following generalization of the so-called Bernstein condition allowing multiple global risk minimizers. The version below is originally due to Koltchinskii [26].

###### Assumption 1.1 (Generalized Bernstein condition).

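Assume that $\mathcal{W}^{*}$ is a set of risk minimizers in a closed set $\mathcal{W}$. We say that $\mathcal{W}$ together with the measure $P$ and the loss $\ell$ satisfy the generalized Bernstein assumption if for some $B > 0$ and for any $w \in \mathcal{W}$, there is $w^{*} \in \mathcal{W}^{*}$ such that

$$\mathbb{E}\bigl(\ell(w, Z) - \ell(w^{*}, Z)\bigr)^{2} \le B\bigl(R(w) - R(w^{*})\bigr).$$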

Observe that the Bernstein assumption is independent of a specific learning algorithm. It is also not too restrictive and often accompanies uniform stability. In Section 2.1, we provide some examples and a detailed discussion.

Suppose that we are given a uniformly stable algorithm $w_n$ that attempts to minimize the empirical loss $R_n$. We denote the optimization error of such an algorithm by

$$\Delta_{\mathrm{opt}} = R_n(w_n) - \min_{w \in \mathcal{W}} R_n(w).$$

In particular, for ERM, we have $\Delta_{\mathrm{opt}} = 0$. The following theorem is our first main result.

###### Theorem 1.1.

Assume that the loss is bounded by $M$. Suppose also that Assumption 1.1 is satisfied with the parameter $B$. Let $w_n$ be a $\gamma$-stable algorithm that has the optimization error $\Delta_{\mathrm{opt}}$. There is an absolute constant $c > 0$ such that the following holds. Fix any $\eta > 0$ and $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, it holds that

$$R(w_n) - \inf_{w \in \mathcal{W}} R(w) \le \Delta_{\mathrm{opt}} + \eta\, \mathbb{E}\Delta_{\mathrm{opt}} + c\,(1 + 1/\eta)\left(\gamma \log n + \frac{M + B}{n}\right) \log\left(\frac{1}{\delta}\right).$$

Our main application of this bound is an almost optimal high probability bound for ERM with strongly convex and Lipschitz losses. See Section 2.2 and Proposition 2.1 for more detail. Observe that the bound of Theorem 1.1 contains the term corresponding to the expected optimization error $\mathbb{E}\Delta_{\mathrm{opt}}$, where the expectation is taken with respect to the learning sample. This does not pose a problem in applications known to us. In particular, Theorem 1.1 implies that if our uniformly stable algorithm is ERM in $\mathcal{W}$, then $\Delta_{\mathrm{opt}} = 0$ and, with probability at least $1 - \delta$,

$$R(w_n) - R(w^{*}) \le c\left(\gamma \log n + \frac{M + B}{n}\right) \log\left(\frac{1}{\delta}\right).$$

Our second main result complements the generalization bound (1) and provides a variance-type bound, allowing us to completely remove the $1/\sqrt{n}$ term in (1) whenever the empirical error is small.

###### Theorem 1.2.

There is an absolute constant $c > 0$ such that the following holds. Let $w_n$ be a $\gamma$-stable algorithm and assume that the loss is bounded by $M$. Fix any $\eta > 0$ and $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, it holds that

$$R(w_n) \le (1 + \eta) R_n(w_n) + c\,(1 + 1/\eta)\left(\gamma \log n + \frac{M}{n}\right) \log\left(\frac{1}{\delta}\right).$$

This result has a clear motivation: in modern practice, learning algorithms achieve a small or even zero empirical error on the learning sample, and the analysis should take this into account. Note that there are several recent variance-type stability bounds in the literature [31, 35] but under significantly stronger assumptions. In particular, in these papers, the loss is a generalized linear function, whereas we are working in the canonical framework of Bousquet and Elisseeff [6]. It is also important to note that in some examples, a small empirical error may lead to worse stability: this is called the fitting-stability tradeoff in the textbook [40]. For instance, in ridge regression, regularization improves stability but at the same time leads to an increased empirical error. And vice versa, by removing regularization, we may fit the data but lose stability.

For any two functions (or random variables) $f, g$ the symbol $f \lesssim g$ means that there is an absolute constant $c > 0$ such that $f \le c\,g$ on the entire domain. The gradient and subgradient of function $f$ at point $w$ are denoted by $\nabla f(w)$ and $\partial f(w)$, respectively. The notation $\langle \cdot, \cdot \rangle$ stands for the inner product and by writing , we usually mean .

## 2 Stochastic convex optimization with strongly convex losses

Stochastic convex optimization is a classical setup in which one minimizes a convex function based on some values or gradients at a given sequence of points. The most common setting is where at each round, the learner gets information on the objective through a stochastic gradient oracle (see [38] and references therein). Another related setup that allows us to analyze generalization is when we observe the values of the losses $\ell(X_i, \cdot)$ for an i.i.d. sample $X_1, \ldots, X_n$. Arguably the most well-studied case is when the following properties of the loss hold for any $x \in \mathcal{X}$:

• The loss is $\lambda$-strongly convex. That is, for any $w_1, w_2 \in \mathcal{W}$ and any subgradient $g \in \partial \ell(x, w_2)$,

$$\ell(x, w_1) - \ell(x, w_2) \ge \langle g, w_1 - w_2 \rangle + (\lambda/2)\, \|w_1 - w_2\|^{2}.$$

• The loss is $L$-Lipschitz. That is, for any $w_1, w_2 \in \mathcal{W}$,

$$|\ell(x, w_1) - \ell(x, w_2)| \le L\, \|w_1 - w_2\|.$$

These assumptions on the loss are standard in the literature and have been studied in, e.g., [22, 23, 43, 41, 47, 48] as well as in the recent work on stability of gradient descent methods [19]. One can reasonably argue that both assumptions are rather restrictive (see the discussions in [43, 1]). Despite that, these assumptions are fundamental to the machine learning community and provide a clear illustration of our excess risk bounds.

In this setup, given a convex and closed set $\mathcal{W}$, we want to analyze the ERM strategy (also referred to as Sample Average Approximation (SAA)). That is, we aim to provide a high probability upper bound on the excess risk

$$R(\hat{w}) - R(w^{*}), \quad \text{where} \quad \hat{w} = \operatorname*{arg\,min}_{w \in \mathcal{W}} R_n(w).$$

The question of deviation optimal bounds in a closely related setup was recently revived by Harvey, Liaw, Plan, and Randhawa [20]. They proved a generalization of Freedman’s inequality for martingale differences to show high probability guarantees for stochastic gradient descent, resolving several open questions. In our case, high probability excess risk bounds are known for some specific algorithms but follow from the regret bounds in the online setting combined with the martingale-based online to batch conversion techniques [23] (see also [41, Section 2.2]). Despite numerous attempts [43, 42, 31, 47, 48], the question of whether dimension-free high probability bounds are achievable by any algorithm minimizing the empirical error remained open.

On the technical side, since ERM cannot be seen as a result of an online to batch conversion¹, the existing martingale-based techniques cannot be directly exploited. More importantly, uniform convergence, which is a standard tool for obtaining high probability bounds for ERM, fails in our case. This follows from an example in [42, Section 4.1 and page 2646] (see also [14]). One may wonder if a more precise localized analysis [33, 26, 2] should help in our setup. This is also not the case, since according to [41, Section 5.3] there is no uniform convergence for an arbitrary localization radius. Fortunately, our stability-based method proves the desired upper bound.

¹ The result in the textbook [10, Theorem 3.1] shows that the follow-the-leader strategy (an adaptation of ERM in the online setup) achieves the regret $O(\log n)$ after $n$ rounds. However, after the online to batch conversion [23], we only get a high probability bound for an average of empirical risk minimizers.

### 2.1 Verifying the Bernstein assumption

When applying Theorem 1.1, we first need to check that the Bernstein assumption holds. Let us discuss this assumption in more detail. Assumption 1.1 appears first in a similar generality in the work of Massart [33] and under the name Bernstein class assumption in [3]. This assumption is used as one of the components for proving rates of convergence faster than $O(1/\sqrt{n})$ (see the textbook [27]). The Bernstein assumption is usually implied by the convexity of the underlying class and the convexity of the loss function. We refer to [44] for an extensive survey on related results.

For our purposes, we verify Assumption 1.1 for strongly convex and Lipschitz losses. The following result is well-understood and appears (usually implicitly) in the literature. In our case, there is a unique risk minimizer $w^{*}$; that is, $\mathcal{W}^{*} = \{w^{*}\}$. On the one hand, the Lipschitz property implies for any $w \in \mathcal{W}$,

$$\mathbb{E}\bigl(\ell(w, X) - \ell(w^{*}, X)\bigr)^{2} \le L^{2}\, \|w - w^{*}\|^{2}.$$

On the other hand, since the loss is $\lambda$-strongly convex and $w^{*}$ minimizes the risk in the convex set $\mathcal{W}$, we have

$$R(w) - R(w^{*}) \ge (\lambda/2)\, \|w - w^{*}\|^{2}.$$

Comparing the two inequalities, we have

$$\mathbb{E}\bigl(\ell(w, X) - \ell(w^{*}, X)\bigr)^{2} \le L^{2}\, \|w - w^{*}\|^{2} \le (2L^{2}/\lambda)\bigl(R(w) - R(w^{*})\bigr). \tag{2}$$

This implies that $\lambda$-strongly convex and $L$-Lipschitz losses satisfy Assumption 1.1 with $B = 2L^{2}/\lambda$.
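As a sanity check of inequality (2), the following sketch evaluates both sides on a toy one-dimensional problem. Everything specific here is an assumption chosen for illustration and does not come from the paper: the loss $\ell(x, w) = |w - x| + (\lambda/2) w^{2}$ on $\mathcal{W} = [-1, 1]$ (which is $\lambda$-strongly convex and $(1 + \lambda)$-Lipschitz there), the four-point uniform distribution, and the helper names.

```python
LAM = 0.5                              # strong convexity parameter (assumed value)
LIP = 1.0 + LAM                        # Lipschitz constant of the toy loss on W = [-1, 1]
SUPPORT = [-0.8, -0.1, 0.3, 0.9]       # X is uniform on these four points (assumed)

def loss(x, w):
    """Toy loss |w - x| + (LAM / 2) w^2."""
    return abs(w - x) + 0.5 * LAM * w * w

def risk(w):
    """Population risk R(w) under the uniform distribution on SUPPORT."""
    return sum(loss(x, w) for x in SUPPORT) / len(SUPPORT)

def argmin_convex(f, lo=-1.0, hi=1.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def bernstein_gap(w, w_star):
    """Right-hand side minus left-hand side of (2); non-negative when (2) holds."""
    second_moment = sum((loss(x, w) - loss(x, w_star)) ** 2 for x in SUPPORT) / len(SUPPORT)
    excess = risk(w) - risk(w_star)
    return (2.0 * LIP ** 2 / LAM) * excess - second_moment

w_star = argmin_convex(risk)
grid = [-1.0 + 2.0 * k / 400 for k in range(401)]
worst = min(bernstein_gap(w, w_star) for w in grid)
print(f"w* ~ {w_star:.6f}, smallest gap over the grid: {worst:.3e}")
```

The smallest gap should be non-negative up to floating-point error, in line with $B = 2L^{2}/\lambda$; this is only a numerical illustration on one example, not a proof.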

Our version of the Bernstein condition, namely Assumption 1.1, is due to Koltchinskii [26, Page 2618]. The key difference from the standard Bernstein assumption is that we allow multiple minimizers but can still provide rates of convergence. Our motivation lies in the recent interest in relaxing the strong convexity assumption in (stochastic) optimization problems. One such alternative is the Polyak-Łojasiewicz condition (PL) (see [24]). In this context, the work [30] extends the standard Bernstein assumption to go beyond the strong convexity assumptions, allowing multiple risk minimizers. Likewise, [11] claims that uniform stability results hold when the strong convexity assumption on the losses is replaced by the (PL) assumption². Thus, our general results can potentially be useful in this direction.

² The bounds in [11] require that (PL) holds for the empirical error. To the best of our knowledge, no stability results are known if (PL) is satisfied only for individual losses.

### 2.2 High probability bound for almost risk minimizers

In this section, we present the main application of Theorem 1.1. In the strongly convex case, we provide a sharp high probability guarantee valid for any learning algorithm depending on its optimization error.

###### Proposition 2.1.

Let $\mathcal{W}$ be a convex closed set. Assume that the loss function is $\lambda$-strongly convex and $L$-Lipschitz as defined above. Let an approximate empirical minimizer $\hat{w}$ have an optimization error bounded by $\bar{\Delta}$ for any learning sample. Then, with probability at least $1 - \delta$,

$$R(\hat{w}) - R(w^{*}) \lesssim \bar{\Delta} + \left(\frac{L^{2}}{\lambda n} + \sqrt{\frac{L^{2} \bar{\Delta}}{\lambda}}\right) \log n \, \log\left(\frac{1}{\delta}\right).$$

In particular, if $\hat{w}$ is ERM in $\mathcal{W}$, then $\bar{\Delta} = 0$ and

$$R(\hat{w}) - R(w^{*}) \lesssim \frac{L^{2}}{\lambda n} \log n \, \log\left(\frac{1}{\delta}\right). \tag{3}$$

The in-expectation version of (3) without an additional $\log n$-factor is well-known and attributed to the foundational papers [6, 41, 42]. As we mentioned, the possibility of a high probability bound with the rate $O(1/n)$ was asked in [41, Discussion after Claim 6.2]. Despite the recent progress, the term $1/\sqrt{n}$ is present in the sharpest known high probability bound [16, Corollary 4.2]. Proposition 2.1 settles this question up to a logarithmic factor. We note that high-probability bounds are known for ERM in the particular case where the loss is a generalized linear function with a strongly convex penalty [43]. The analysis in [43] is based on localized Rademacher complexities and exploits the linear structure of the loss. As we mentioned above, uniform convergence cannot help in our setup.

### 2.3 Application to projected gradient descent without smoothness assumptions

Let us consider a simple illustration of Proposition 2.1. In what follows, we focus on the statistical rather than the computational part of the story. The method of Projected Gradient Descent (full-batch PGD) consists of iterating the following update rules for $t = 1, \ldots, T$,

$$y_t = w_t - \nu_t g_t, \quad \text{where } g_t \in \partial R_n(w_t), \qquad w_{t+1} = \Pi_{\mathcal{W}}(y_t),$$

where $T$ is the total number of steps, $w_1$ is an initial approximation, and $\Pi_{\mathcal{W}}$ is the projection operator onto the convex closed set $\mathcal{W}$. The choice of the number of iterations and the step values affects the optimization error. For instance, when the loss is $\lambda$-strongly convex and $L$-Lipschitz, choosing $\nu_t = 2/(\lambda(t+1))$ gives the following optimization error (see [9, Theorem 3.9]),

$$R_n(\bar{w}_T) - \min_{w \in \mathcal{W}} R_n(w) \le \frac{4L^{2}}{\lambda T},$$

where $\bar{w}_T = \sum_{t=1}^{T} \frac{2t}{T(T+1)}\, w_t$ is the weighted average of iterations. Therefore, PGD achieves the optimization error $\bar{\Delta} = O(L^{2}/(\lambda n^{2}))$ after $T = O(n^{2})$ steps. By Proposition 2.1, with probability at least $1 - \delta$, it holds that

$$R(\bar{w}_T) - R(w^{*}) \lesssim \frac{L^{2}}{\lambda n} \log n \, \log\left(\frac{1}{\delta}\right). \tag{4}$$

This is the first high probability excess risk bound for PGD. Our techniques do not answer the question of whether a smaller number of iterations is sufficient in the nonsmooth case.
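The PGD scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-dimensional loss $\ell(x, w) = |w - x| + (\lambda/2) w^{2}$ on $\mathcal{W} = [-1, 1]$, the sample of size $n = 50$, the seed, and all function names are hypothetical choices. The step sizes $2/(\lambda(t+1))$ and the averaging weights $2t/(T(T+1))$ follow the weighted-average scheme of [9, Theorem 3.9].

```python
import random

LAM = 0.5            # strong convexity parameter (assumed value)
LIP = 1.0 + LAM      # bound on subgradient norms of the toy loss over W = [-1, 1]

def empirical_risk(w, sample):
    """R_n(w) for the toy loss |w - x| + (LAM / 2) w^2."""
    return sum(abs(w - x) for x in sample) / len(sample) + 0.5 * LAM * w * w

def subgradient(w, sample):
    """One element of the subdifferential of R_n at w (sign(0) is taken as 0)."""
    sign_sum = sum((w > x) - (w < x) for x in sample)
    return sign_sum / len(sample) + LAM * w

def project(y, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval W = [lo, hi]."""
    return min(max(y, lo), hi)

def pgd_weighted_average(sample, steps, w_init=1.0):
    """Full-batch PGD with steps 2 / (LAM (t + 1)); returns the weighted average of iterates."""
    w, w_bar = w_init, 0.0
    norm = steps * (steps + 1) / 2.0
    for t in range(1, steps + 1):
        w_bar += (t / norm) * w                      # weight 2t / (T (T + 1)) on w_t
        g = subgradient(w, sample)
        w = project(w - 2.0 / (LAM * (t + 1)) * g)   # w_{t+1} = Pi_W(w_t - nu_t g_t)
    return w_bar

random.seed(0)
n = 50
sample = [random.uniform(-1.0, 1.0) for _ in range(n)]
T = n ** 2                                           # T of order n^2, as in the text
w_bar = pgd_weighted_average(sample, T)

# A grid minimum over-estimates min R_n, so opt_error under-estimates the true error slightly.
grid_min = min(empirical_risk(-1.0 + 2.0 * k / 2000, sample) for k in range(2001))
opt_error = empirical_risk(w_bar, sample) - grid_min
bound = 4.0 * LIP ** 2 / (LAM * T)
print(f"optimization error ~ {opt_error:.2e}, bound 4 L^2 / (lambda T) = {bound:.2e}")
```

Taking $T = n^{2}$ steps mirrors the discussion above: it drives the optimization error down to order $L^{2}/(\lambda n^{2})$, so that the $\sqrt{L^{2}\bar{\Delta}/\lambda}$ term in Proposition 2.1 is of order $L^{2}/(\lambda n)$.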

We note that stability of PGD can be analyzed regardless of the optimization error. Indeed, the derivations of Hardt, Recht, and Singer [19, Section 3.4] (see also [16, Section 4.1.2]) imply that if the loss is $\beta$-smooth in addition to strong convexity and the Lipschitz property, that is,

$$\|\nabla_w \ell(x, w_1) - \nabla_w \ell(x, w_2)\| \le \beta\, \|w_1 - w_2\|, \quad \text{for all } w_1, w_2 \in \mathcal{W},$$

then PGD with the constant step size is -uniformly stable for any number of steps. As a result, the smoothness assumption and the error bound for PGD without averaging [9, Theorem 3.10] imply the previously unknown risk bound (4) after only steps.

## 3 Proofs

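Throughout the proofs, we rely on the $L_p$ norm. Denote the $L_p$-norm of a random variable $Z$ as $\|Z\|_p = (\mathbb{E}|Z|^{p})^{1/p}$. A moment bound can be translated into a high-probability bound as follows (see, e.g., [8, Section 2]). Assume that for some $a, b \ge 0$ and all $p \ge 2$, it holds that $\|Z\|_p \le \sqrt{p}\,a + p\,b$. Then, there is an absolute constant $C$ such that for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, it holds that

$$|Z| \le C\left(a \sqrt{\log(1/\delta)} + b \log(1/\delta)\right). \tag{5}$$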
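The passage from moments to tails in (5) can be sketched via Markov's inequality. The sketch below assumes the moment bound $\|Z\|_p \le \sqrt{p}\,a + p\,b$ is available for every $p \ge 2$ and restricts to $\delta \le e^{-2}$; both restrictions, and the resulting constant $e$, are simplifying assumptions made here and are not claimed to match the argument in [8]. For any $p \ge 2$, Markov's inequality applied to $|Z|^{p}$ gives

$$\Pr\bigl(|Z| \ge e\, \|Z\|_p\bigr) = \Pr\bigl(|Z|^{p} \ge e^{p}\, \mathbb{E}|Z|^{p}\bigr) \le e^{-p}.$$

Choosing $p = \log(1/\delta) \ge 2$ makes the right-hand side equal to $\delta$, so with probability at least $1 - \delta$,

$$|Z| < e\, \|Z\|_{p} \le e\left(a \sqrt{\log(1/\delta)} + b \log(1/\delta)\right).$$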

As we mentioned, generalization bounds of form (1) cannot provide excess risk bounds with the rate better than $O(1/\sqrt{n})$. The following lemma separates the sampling term from the generalization error.

###### Lemma 3.1.

Let $w_n$ be a $\gamma$-stable algorithm and let $w_n'$ be its independent copy. Then, for any $p \ge 2$,

$$\left\| R_n(w_n) - R(w_n) - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\bigl[\ell(X_i, w_n') \,\big|\, X_i\bigr] + \mathbb{E} R(w_n) \right\|_p \lesssim \gamma\, p \log n.$$

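Such a decomposition is possible due to the following extension of the bounded difference inequality by Bousquet, Klochkov, and Zhivotovskiy [8, Theorem 4].

###### Theorem.

Assume that $X_1, \ldots, X_n$ are independent variables and the functions $g_i : \mathcal{X}^{n} \to \mathbb{R}$ satisfy the following properties for $i = 1, \ldots, n$,

• $\bigl|\mathbb{E}[g_i(X_1, \ldots, X_n) \mid X_i]\bigr| \le K$ almost surely;

• $g_i$ has the bounded differences property with respect to all but the $i$-th variable: for all $j \ne i$ and $x_1, \ldots, x_n, x_j' \in \mathcal{X}$, we have

$$\bigl|g_i(x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_n) - g_i(x_1, \ldots, x_{j-1}, x_j', x_{j+1}, \ldots, x_n)\bigr| \le \beta;$$

• $\mathbb{E}[g_i(X_1, \ldots, X_n) \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n] = 0$ almost surely.

Then, the following moment bounds hold for all $p \ge 2$,

$$\left\| \sum_{i=1}^{n} g_i \right\|_p \le 12\sqrt{2}\, \beta\, p\, n \log n + 4K \sqrt{p n}. \tag{6}$$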

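In addition, we will use the following version of the Bernstein inequality [5, Theorem 15.11]: if $X_1, \ldots, X_n$ are zero mean, independent and bounded by $M$ almost surely, then

$$\|X_1 + \cdots + X_n\|_p \le 6 \sqrt{\Bigl(\sum_{i=1}^{n} \mathbb{E} X_i^{2}\Bigr) p} + 4 p M, \quad \forall p \ge 2. \tag{7}$$

Our last tool is the concentration inequality for non-negative weakly self-bounded functions. Assume that $a, b \ge 0$. We say that the function $f : \mathcal{X}^{n} \to [0, \infty)$ is $(a, b)$-weakly self-bounded if there exist functions $f_1, \ldots, f_n$, where $f_i$ does not depend on the $i$-th coordinate of its argument, that satisfy for all $x \in \mathcal{X}^{n}$,

$$\sum_{i=1}^{n} \bigl(f(x) - f_i(x)\bigr)^{2} \le a f(x) + b.$$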

The following concentration inequality is a lower tail version of [5, Theorem 6.19], which is originally due to Maurer [34]. The difference is that in their result it is assumed that for any . The proof of the version below is standard, and we reproduce it in Appendix for the sake of completeness. Since we consider the lower tail, we remove the term present in [5, Theorem 6.19].

###### Theorem.

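Suppose that $X_1, \ldots, X_n$ are independent random variables and the function $f$ is $(a, b)$-weakly self-bounded, and the corresponding functions $f_i$ satisfy $f_i(x) \ge f(x)$ for $i = 1, \ldots, n$ and any $x \in \mathcal{X}^{n}$. Then, for any $t > 0$,

$$\Pr\bigl(\mathbb{E} f(X_1, \ldots, X_n) \ge f(X_1, \ldots, X_n) + t\bigr) \le \exp\left(-\frac{t^{2}}{2a\, \mathbb{E} f(X_1, \ldots, X_n) + 2b}\right). \tag{8}$$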

### 3.1 Proof of Lemma 3.1

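For $w_n^{(i)} = w_n(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$, where $X_i'$ is an independent copy of $X_i$, consider the functions

$$g_i(X_1, \ldots, X_n) = \mathbb{E}_{X_i'}\, \ell(X_i, w_n^{(i)}) - \mathbb{E}_{X_i'}\, R(w_n^{(i)}).$$

One can immediately verify that these functions satisfy all three properties needed to apply (6) with $\beta = 2\gamma$. It is standard to check that (see e.g., [8, Lemma 7])

$$\Bigl| n\bigl(R_n(w_n) - R(w_n)\bigr) - \sum_{i=1}^{n} g_i \Bigr| \le 2\gamma n.$$

Let us consider for $i = 1, \ldots, n$ the functions $h_i = g_i - \mathbb{E}[g_i \mid X_i]$, where the functions $h_i$ preserve the stability property (up to a factor of $2$). Observe that $\mathbb{E}[h_i \mid X_i] = 0$ almost surely, which implies $K = 0$. Therefore, applying (6) to the functions $h_i$, we have that for any $p \ge 2$,

$$\left\| \sum_{i=1}^{n} \bigl(g_i - \mathbb{E}[g_i \mid X_i]\bigr) \right\|_p \le 48\sqrt{2}\, \gamma\, p\, n \log n.$$

Notice that $\mathbb{E}[g_i \mid X_i] = \mathbb{E}[\ell(X_i, w_n') \mid X_i] - \mathbb{E} R(w_n)$. Combining the last two displays and dividing by $n$, our result follows. ∎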

### 3.2 Proof of Theorem 1.1

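The proof starts with a standard decomposition that turns the generalization bound into an excess risk bound. Denote $R^{*} = \inf_{w \in \mathcal{W}} R(w)$. We have for any $w^{*} \in \mathcal{W}^{*}$,

$$\begin{aligned} R(w_n) - R^{*} &= R(w_n) - R_n(w_n) + R_n(w_n) - R_n(w^{*}) + R_n(w^{*}) - R^{*} \\ &\le \Delta_{\mathrm{opt}} - \bigl(R_n(w_n) - R(w_n)\bigr) + R_n(w^{*}) - R^{*}. \end{aligned}$$

Here, the expression $R_n(w_n) - R(w_n)$ stands for the generalization error and is typically of order $1/\sqrt{n}$. To avoid this, we use the decomposition of Lemma 3.1,

$$R_n(w_n) - R(w_n) = \xi + \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}'\, \ell(X_i, w_n') - \mathbb{E} R(w_n),$$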

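where $\|\xi\|_p \lesssim \gamma\, p \log n$ for any $p \ge 2$. We now need to couple the remainder term with $R_n(w^{*}) - R^{*}$ to achieve the $O(1/n)$ rate. Let $w_n'$ be an independent copy of $w_n$, and write $\mathbb{E}'$ to denote the expectation with respect to this independent copy. Since $\mathbb{E} R(w_n) = \mathbb{E} R(w_n')$, for any $w^{*} \in \mathcal{W}^{*}$ it holds that

$$R(w_n) - R^{*} \le \Delta_{\mathrm{opt}} - \xi - \frac{1}{n} \sum_{i=1}^{n} \bigl(\mathbb{E}'\, \ell(X_i, w_n') - \ell(X_i, w^{*})\bigr) + \mathbb{E} R(w_n') - R^{*}.$$

Since we are free to choose any $w^{*}$, let us take the one corresponding to $w_n'$ in Assumption 1.1. Notice that neither $\Delta_{\mathrm{opt}}$ nor $\xi$ depends on this choice. In other words, $w' = w'(w_n')$ is a random vector induced by $w_n'$, where we write $w'$ instead of $w^{*}$ to point out this dependence. Therefore, we rewrite our last display as follows

$$R(w_n) - R^{*} \le \Delta_{\mathrm{opt}} - \xi - \frac{1}{n} \sum_{i=1}^{n} \bigl(\mathbb{E}'\, \ell(X_i, w_n') - \ell(X_i, w')\bigr) + \mathbb{E} R(w_n') - R^{*}.$$

Notice that here the only terms that depend on $w_n'$ are $\ell(X_i, w')$. Taking the expectation $\mathbb{E}'$ of both sides of this inequality, we obtain

$$R(w_n) - R^{*} \le \Delta_{\mathrm{opt}} - \xi - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}'\bigl[\ell(X_i, w_n') - \ell(X_i, w')\bigr] + \mathbb{E} R(w_n') - R^{*}. \tag{9}$$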

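Here, $\|\xi\|_p \lesssim \gamma\, p \log n$, and as we have already noticed, $\mathbb{E} R(w_n') = \mathbb{E} R(w_n)$. Moreover, by the Bernstein condition and Jensen’s inequality,

$$\begin{aligned} \mathbb{E}\bigl(\mathbb{E}'[\ell(X_i, w_n') - \ell(X_i, w')]\bigr)^{2} &\le \mathbb{E}\bigl(\ell(X_i, w_n') - \ell(X_i, w')\bigr)^{2} \\ &= \mathbb{E}\, \mathbb{E}\bigl[\bigl(\ell(X_i, w_n') - \ell(X_i, w')\bigr)^{2} \,\big|\, w_n'\bigr] \\ &\le B\bigl(\mathbb{E} R(w_n') - R^{*}\bigr). \end{aligned}$$

Having this variance bound, we are ready to apply the moment Bernstein inequality (7) to the sum of independent random variables $\mathbb{E}'[\ell(X_i, w_n') - \ell(X_i, w')]$, $i = 1, \ldots, n$. Since $\mathbb{E} R(w_n') - R^{*}$ is exactly the expectation of each of these summands, we have for all $p \ge 2$,

$$\left\| \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}'\bigl[\ell(X_i, w_n') - \ell(X_i, w')\bigr] - \mathbb{E} R(w_n') + R^{*} \right\|_p \lesssim \sqrt{\frac{B\bigl(\mathbb{E} R(w_n) - R^{*}\bigr)\, p}{n}} + \frac{p M}{n}. \tag{10}$$

Plugging this into (9), we obtain for each $p \ge 2$ and some absolute constant $C > 0$,

$$\begin{aligned} \|R(w_n) - R^{*} - \Delta_{\mathrm{opt}}\|_p &\le C\left(\gamma\, p \log n + \sqrt{\frac{B\bigl(\mathbb{E} R(w_n) - R^{*}\bigr)\, p}{n}} + \frac{p M}{n}\right) \\ &\le \eta\bigl(\mathbb{E} R(w_n) - R^{*}\bigr) + C\left(\gamma\, p \log n + \Bigl(\frac{B}{\eta} + M\Bigr) \frac{p}{n}\right), \end{aligned} \tag{11}$$

where the second inequality holds (adjusting the absolute constant $C$ if necessary) since for any $a, b \ge 0$ and $\eta > 0$, it holds that $\sqrt{ab} \le \eta a + b/(4\eta)$.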

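Finally, we need an upper bound on $\mathbb{E} R(w_n) - R^{*}$. Taking $p = 2$ in (11) and using the Cauchy-Schwarz inequality, we have

$$\begin{aligned} \mathbb{E} R(w_n) - R^{*} - \mathbb{E}\Delta_{\mathrm{opt}} &\le \|R(w_n) - R^{*} - \Delta_{\mathrm{opt}}\|_2 \\ &\le \eta\bigl(\mathbb{E} R(w_n) - R^{*}\bigr) + C\bigl(2\gamma \log n + 2(B/\eta + M)/n\bigr). \end{aligned}$$

Subtracting $\eta(\mathbb{E} R(w_n) - R^{*})$ from both sides and dividing by $1 - \eta$, we obtain (with a possibly larger absolute constant $C$)

$$\mathbb{E} R(w_n) - R^{*} \le \frac{1}{1 - \eta}\, \mathbb{E}\Delta_{\mathrm{opt}} + \frac{C}{1 - \eta}\left(\gamma \log n + \Bigl(\frac{B}{\eta} + M\Bigr) \frac{1}{n}\right).$$

Plugging this bound back into (11), assuming that $\eta \le 1/2$, and translating the moment bound into the high-probability bound through (5), we obtain that, with probability at least $1 - \delta$,

$$R(w_n) - R^{*} \le \Delta_{\mathrm{opt}} + C'\left(\frac{\eta}{1 - \eta}\, \mathbb{E}\Delta_{\mathrm{opt}} + \gamma \log n \, \log\left(\frac{1}{\delta}\right) + \Bigl(\frac{B}{\eta} + M\Bigr) \frac{\log(1/\delta)}{n}\right),$$

where $C'$ is an absolute constant. By replacing $\eta/(1 - \eta)$ by $\eta$, we finish the proof. ∎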

### 3.3 Proof of Theorem 1.2

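We will show that under the conditions of the theorem the following variance bound holds. For any $\delta \in (0, 1)$, we have, with probability at least $1 - \delta$,

$$R(w_n) - R_n(w_n) \lesssim \gamma \log n \, \log\left(\frac{1}{\delta}\right) + \sqrt{\frac{M R(w_n)}{n} \log\left(\frac{1}{\delta}\right)} + \frac{M}{n} \log\left(\frac{1}{\delta}\right). \tag{12}$$

The statement of the theorem follows immediately by applying the inequality $\sqrt{ab} \le \eta a + b/(4\eta)$ to the middle term of the right-hand side and choosing the appropriate value of $\eta$.

The proof of (12) repeats the arguments of Theorem 1.1 with several important changes listed below. As in the proof of Theorem 1.1, we use the generalization bound of Lemma 3.1, and then apply the Bernstein inequality to the correcting term. Converting the moment bound into a high probability bound by (5), we have, with probability at least $1 - \delta$,

$$R(w_n) - R_n(w_n) \lesssim \gamma \log n \, \log\left(\frac{1}{\delta}\right) + \sqrt{\frac{\mathbb{E}\bigl(\ell(X', w_n)\bigr)^{2}}{n} \log\left(\frac{1}{\delta}\right)} + \frac{M}{n} \log\left(\frac{1}{\delta}\right), \tag{13}$$

where we used that the variance of $\mathbb{E}'\, \ell(X_i, w_n')$ is bounded by $\mathbb{E}(\ell(X', w_n))^{2}$ due to Jensen’s inequality.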

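Our goal is to replace the non-random term $\mathbb{E}(\ell(X', w_n))^{2}$ with its empirical version $\mathbb{E}'(\ell(X', w_n))^{2}$, where, slightly abusing the notation, $\mathbb{E}'$ denotes the integration only with respect to the independent copy $X'$. Unfortunately, a naive application of the bounded difference inequality leads to a suboptimal bound in our case. Instead, we use second order concentration through the weakly self-bounding property. Set $f = \mathbb{E}'(\ell(X', w_n))^{2}$ and $f_i = \sup_{x_i \in \mathcal{X}} \mathbb{E}'(\ell(X', w_n))^{2}$, so that $f_i \ge f$ for all $i = 1, \ldots, n$. We show that $f$ is $(8n\gamma^{2}, 2n\gamma^{4})$-weakly self-bounded. By the uniform stability and Jensen’s inequality, we have

$$\begin{aligned} \sum_{i=1}^{n} (f - f_i)^{2} &\le \sum_{i=1}^{n} \Bigl(\mathbb{E}'\bigl(\ell(X', w_n)\bigr)^{2} - \sup_{x_i \in \mathcal{X}} \mathbb{E}'\bigl(\ell(X', w_n)\bigr)^{2}\Bigr)^{2} \\ &\le n \gamma^{2} \bigl(2\, \mathbb{E}'\, \ell(X', w_n) + \gamma\bigr)^{2} \\ &\le 8 n \gamma^{2} f + 2 n \gamma^{4}. \end{aligned}$$

Therefore, by the concentration inequality (8) we have that, with probability at least $1 - \delta$,

$$\mathbb{E}\bigl(\ell(X', w_n)\bigr)^{2} - \mathbb{E}'\bigl(\ell(X', w_n)\bigr)^{2} \lesssim \sqrt{\Bigl(n \gamma^{2}\, \mathbb{E}\bigl(\ell(X', w_n)\bigr)^{2} + n \gamma^{4}\Bigr) \log(1/\delta)}.$$

Using $\sqrt{ab} \le (a + b)/2$ for all $a, b \ge 0$ and $\mathbb{E}'(\ell(X', w_n))^{2} \le M R(w_n)$, we obtain on the same event

$$\mathbb{E}\bigl(\ell(X', w_n)\bigr)^{2} - 2 M R(w_n) \lesssim n \gamma^{2} \log(1/\delta).$$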

Plugging this bound into (13) and using the union bound, we obtain (12). Hence, the theorem follows. ∎

### 3.4 Proof of Proposition 2.1

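We first check the uniform stability of $\hat{w}$. For this we need to prove that for any $x \in \mathcal{X}$,

$$\bigl|\ell(x, \hat{w}) - \ell(x, \hat{w}^{(i)})\bigr| \le 4L^{2}/(\lambda n) + \sqrt{8 L^{2} \bar{\Delta}/\lambda},$$

where $\hat{w} = \hat{w}(X_1, \ldots, X_n)$ and $\hat{w}^{(i)} = \hat{w}(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$. Let also $\tilde{w}$ be the minimizer of $R_n$, which denotes the empirical risk on the sample $(X_1, \ldots, X_n)$, and let $\tilde{w}^{(i)}$ be the minimizer of $R_n^{(i)}$, which denotes the empirical risk on the sample $(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$. Then, by [42, Theorem 2],

$$\text{for any } x \in \mathcal{X}, \quad \bigl|\ell(x, \tilde{w}) - \ell(x, \tilde{w}^{(i)})\bigr| \le 4L^{2}/(\lambda n).$$

On the other hand, since $R_n$ is $\lambda$-strongly convex,

$$(\lambda/2)\, \|\hat{w} - \tilde{w}\|^{2} \le R_n(\hat{w}) - R_n(\tilde{w}) \le \bar{\Delta},$$

which implies $\|\hat{w} - \tilde{w}\| \le \sqrt{2\bar{\Delta}/\lambda}$. A similar relation holds between $\hat{w}^{(i)}$ and $\tilde{w}^{(i)}$. Using the $L$-Lipschitz property, we conclude that for all $x \in \mathcal{X}$,

$$\begin{aligned} \bigl|\ell(x, \hat{w}) - \ell(x, \hat{w}^{(i)})\bigr| &\le \bigl|\ell(x, \tilde{w}) - \ell(x, \tilde{w}^{(i)})\bigr| + \bigl|\ell(x, \tilde{w}^{(i)}) - \ell(x, \hat{w}^{(i)})\bigr| + \bigl|\ell(x, \hat{w}) - \ell(x, \tilde{w})\bigr| \\ &\le 4L^{2}/(\lambda n) + \sqrt{8 L^{2} \bar{\Delta}/\lambda}. \end{aligned}$$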

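Since $\hat{w}$ is $\gamma$-stable with $\gamma = 4L^{2}/(\lambda n) + \sqrt{8 L^{2} \bar{\Delta}/\lambda}$, we apply Theorem 1.1. It is only left to check that the loss is bounded. This follows from the fact that it is both $L$-Lipschitz and $\lambda$-strongly convex at the same time. Indeed, we have for any $w \in \mathcal{W}$ and the risk minimizer $w^{*}$, that

$$(\lambda/2)\, \|w - w^{*}\|^{2} \le R(w) - R(w^{*}) \le L\, \|w - w^{*}\|,$$

so that the convex set $\mathcal{W}$ is bounded and contained in the ball $\{w : \|w - w^{*}\| \le 2L/\lambda\}$. Using again the Lipschitz property of $\ell$ we conclude that for any $x \in \mathcal{X}$ and $w \in \mathcal{W}$,

$$|\ell(x, w) - \ell(x, w^{*})| \le 2L^{2}/\lambda.$$

Although the conditions of Theorem 1.1 require a uniform bound $\ell \le M$, it only enters the proof in (10), where we apply the Bernstein inequality (7) to the sum of independent random variables $\mathbb{E}'[\ell(X_i, w_n') - \ell(X_i, w')]$. Therefore, the inequality still holds with $2L^{2}/\lambda$ in place of $M$. The rest of the proof of Theorem 1.1 provides us with the required bound. ∎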
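The replace-one stability of exact ERM used in this proof, $4L^{2}/(\lambda n)$, can be probed numerically. The sketch below is an illustration under assumptions that are not part of the paper: the one-dimensional loss $\ell(x, w) = |w - x| + (\lambda/2) w^{2}$ on $\mathcal{W} = [-1, 1]$, a ternary-search ERM solver, replacement points restricted to the two endpoints, and hypothetical function names. It only checks one random sample, so it illustrates the bound rather than verifying it.

```python
import random

LAM = 0.5            # strong convexity parameter (assumed value)
LIP = 1.0 + LAM      # Lipschitz constant of the toy loss on W = [-1, 1]

def loss(x, w):
    """Toy loss |w - x| + (LAM / 2) w^2."""
    return abs(w - x) + 0.5 * LAM * w * w

def erm(sample, lo=-1.0, hi=1.0, iters=200):
    """Minimize the (strongly convex) empirical risk over [lo, hi] by ternary search."""
    def emp_risk(w):
        return sum(loss(x, w) for x in sample) / len(sample)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if emp_risk(m1) <= emp_risk(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def loss_change(w, w_replaced, test_points):
    """Largest change of the loss over the test points between two decision rules."""
    return max(abs(loss(x, w) - loss(x, w_replaced)) for x in test_points)

random.seed(1)
n = 40
sample = [random.uniform(-1.0, 1.0) for _ in range(n)]
test_points = [-1.0 + 2.0 * k / 200 for k in range(201)]
w_hat = erm(sample)

# Replace each observation in turn by an extreme point of the data domain.
worst = 0.0
for i in range(n):
    for x_new in (-1.0, 1.0):
        replaced = sample[:i] + [x_new] + sample[i + 1:]
        worst = max(worst, loss_change(w_hat, erm(replaced), test_points))

bound = 4.0 * LIP ** 2 / (LAM * n)
print(f"observed replace-one change ~ {worst:.3e}, bound 4 L^2 / (lambda n) = {bound:.3e}")
```

The observed change is a lower estimate of the true stability parameter, since it maximizes over a finite set of replacements and test points only.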

#### Acknowledgments.

We thank Jaouad Mourtada for his comment on the follow-the-leader strategy and for providing several important references. We also thank Tomas Vaškevičius for valuable feedback.

## References

• [1] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, volume 26, 2013.
• [2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
• [3] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.
• [4] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, volume 33, pages 4381–4391, 2020.
• [5] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
• [6] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
• [7] O. Bousquet, S. Hanneke, S. Moran, and N. Zhivotovskiy. Proper learning, Helly number, and an optimal SVM bound. In Conference on Learning Theory, volume 125, pages 582–609. PMLR, 2020.
• [8] O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626. PMLR, 2020.
• [9] S. Bubeck. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn., 8(3–4):231–357, 2015.
• [10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• [11] Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR, 2018.
• [12] L. Devroye and T. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979.
• [13] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.
• [14] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, volume 29, 2016.
• [15] V. Feldman and J. Vondrák. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, volume 31, 2018.
• [16] V. Feldman and J. Vondrák. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory and arXiv preprint arXiv:1902.10710, pages 1270–1279. PMLR, 2019.
• [17] A. Gonen and S. Shalev-Shwartz. Average stability is invariant to data preconditioning: Implications to exp-concave empirical risk minimization. The Journal of Machine Learning Research, 18(1):8245–8257, 2017.
• [18] S. Hanneke and A. Kontorovich. Stable sample compression schemes: New applications and an optimal SVM margin bound. In Conference on Algorithmic Learning Theory, volume 132, pages 697–721. PMLR, 2021.
• [19] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning and arXiv preprint arXiv:1509.01240, volume 48, pages 1225–1234. PMLR, 2016.
• [20] N. J. Harvey, C. Liaw, Y. Plan, and S. Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
• [21] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting 0, 1-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
• [22] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
• [23] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2008.
• [24] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
• [25] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural computation, 11(6):1427–1453, 1999.
• [26] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.
• [27] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008. Springer Science & Business Media, 2011.
• [28] T. Koren and K. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, volume 28, 2015.
• [29] I. Kuzborskij and C. Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2815–2824. PMLR, 2018.
• [30] M. Liu, X. Zhang, L. Zhang, R. Jin, and T. Yang. Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. In Advances in Neural Information Processing Systems, volume 31, 2018.
• [31] T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159–2167. PMLR, 2017.
• [32] G. Lugosi and