The main motivation for our study is the analysis of learning algorithms that are uniformly stable (we recall the definition below). In this context, we are given an i.i.d. sample of points distributed according to some unknown measure, and a learning algorithm that, given the learning sample, outputs a function mapping the instance space into the space of labels. The output of the learning algorithm based on the sample will be denoted accordingly, and we also fix a loss function.
Given the random sample the risk of the algorithm is defined as
and the empirical risk as
By generalization bounds we mean high probability bounds on the difference between the actual risk of the algorithm and its empirical performance on the learning sample. The standard way to prove generalization bounds is based on the sensitivity of the algorithm to changes in the learning sample, such as leaving one of the data points out or replacing it with a different one. To the best of our knowledge, this idea was first used by Vapnik and Chervonenkis to prove an in-expectation generalization bound for what is now known as the hard-margin SVM . Later work by Devroye and Wagner used notions of stability to prove high probability generalization bounds for nearest-neighbor rules . The paper  provides an extensive analysis of different notions of stability and the corresponding (in some cases high probability) generalization bounds. Among recent contributions on high probability upper bounds based on notions of stability are the paper of Maurer , which studies generalization bounds for a particular case of linear regression with a strongly convex regularizer, and the recent work that provides sharp exponential upper bounds for the SVM in the realizable case. In order not to repeat an extensive survey of the topic, we refer to  and  and the references therein.
We return to the problem of generalization bounds. For the sake of simplicity, we denote . The learning algorithm is uniformly stable with parameter if given any samples
for any we have
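As a quick illustration of the definition, the following sketch (our own toy example, not the paper's setting) estimates the stability parameter of an algorithm that outputs the sample mean of points in [0, 1] under the absolute-error loss, which is 1-Lipschitz and bounded. Replacing one sample point moves the mean by at most 1/n, so the estimated stability parameter should never exceed 1/n.

```python
import numpy as np

# Toy check of uniform stability: the algorithm outputs the sample mean,
# the loss is the absolute error. Both choices are ours, purely for
# illustration. Swapping one point changes the mean by at most 1/n.
rng = np.random.default_rng(0)
n = 100
S = rng.uniform(0.0, 1.0, size=n)          # learning sample
Z = rng.uniform(0.0, 1.0, size=(n, 50))    # replacement and test points

def fit(sample):
    return float(np.mean(sample))

def loss(w, z):
    return np.abs(w - z)

w = fit(S)
gamma_hat = 0.0
for i in range(n):
    S_i = S.copy()
    S_i[i] = Z[i, 0]                       # swap out the i-th point
    w_i = fit(S_i)
    gamma_hat = max(gamma_hat, float(np.max(np.abs(loss(w, Z[i]) - loss(w_i, Z[i])))))
print(gamma_hat <= 1.0 / n)  # True
```

Since the loss is 1-Lipschitz in the output, the observed differences are bounded by the shift of the mean, which explains the 1/n scale of stability for this algorithm.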
Since it is only a matter of normalization, to simplify the notation in what follows we analyze the generalization error multiplied by the sample size , that is, the quantity
The basic, and until very recently the best known, result is the high probability upper bound of , which claims that for any uniformly stable algorithm with parameter , provided that the loss is bounded by , we have with probability at least
It is easy to observe that this generalization bound is tight only when , which means that only under this assumption does the generalization error converge to zero at the optimal rate . However, in some applications the regime  is of interest, and the bound (1.2) cannot guarantee any convergence. In order to handle values of  close to , Feldman and Vondrák provided sharper generalization bounds. In a series of breakthrough papers [4, 5], they first showed a generalization bound of the form,
where as before, the parameter corresponds to the stability, and the parameter bounds the loss function uniformly. In their second paper , Feldman and Vondrák showed a stronger generalization bound,
Up to logarithmic factors, the bound (1.4) shows that with high probability, in the regime , the generalization error converges to zero at the optimal rate . However, as claimed by Feldman and Vondrák, the bound (1.3) should not be wholly discarded, since it does not contain the additional logarithmic factors  and . More importantly, the bound (1.3) is sub-gaussian, which means that the dependence on  comes only in the form . At the same time, the bound (1.4) exhibits both sub-gaussian and sub-exponential regimes, since it contains two types of terms:  and . We will discuss the notions of sub-gaussian and sub-exponential high probability upper bounds below.
In , the authors ask whether their high probability upper bounds (1.3) and (1.4) can be strengthened and whether they can be matched by a high probability lower bound. In this paper, we make some progress in answering both questions. We briefly summarize our findings:
which removes the parasitic term from (1.4). We emphasize that our analysis is inspired by the original sample-splitting argument of Feldman and Vondrák , although our proof is significantly simpler. In particular, we avoid several involved technical steps, which ultimately leads to better generalization bounds.
In Section 4, we show that the bound of our Theorem 3.1 is tight unless some additional properties of the corresponding random variables are used. Our lower bounds are witnessed by specific functions satisfying the assumptions of Theorem 3.1. We remark that our lower bound does not entirely answer the question of the optimality of (1.5) for uniformly stable algorithms, as it only shows the tightness of the bound implying (1.5). We discuss this in more detail in Section 4.
We now introduce some notation that will be used throughout the paper. The symbol  will denote the indicator of the event . For a pair of non-negative functions, the notation  or  will mean that for some universal constant  it holds that . Similarly, we introduce  to be equivalent to . For  we define  and . The norm of a random variable will be denoted by . Let  denote the set . To avoid some technical problems, for , by  we usually mean
In what follows, we work with functions of independent variables . For  we will write . In addition, for  and  we write . In particular, if we have an a.s. bound  for any realisation of , then by a simple integration argument we have
i.e., in this sense a conditional bound is stronger than an unconditional one. Finally, slightly abusing the notation, we set  and .
Several facts from Probability Theory
When dealing with high probability bounds in learning theory, one usually derives a bound of the form,
with probability at least for any and some . Here, is a random variable of interest, e.g., the excess risk. The term with is referred to as a sub-gaussian tail, as it matches the deviations of a Gaussian random variable. The term with is called a sub-exponential tail. In general, the bound above represents a mixture of sub-gaussian and sub-exponential tails. In particular, all the known generalization bounds (1.2), (1.3), (1.4) are of the form (1.7).
An alternative way of studying tail bounds is via moment norms. Recall that the -norm of a random variable is . It is well known that a sub-gaussian random variable has moments
where  does not depend on ; see, e.g., Proposition 2.5.2 in . In addition, the moment norms of a sub-exponential r.v. grow not faster than , i.e.,
for some  not depending on ; see, e.g., Proposition 2.7.1 in . In what follows, we will consider random variables with two levels of moments, that is, for some  that do not depend on
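These two growth rates can be checked in closed form for two model distributions. The sketch below (our own illustration, not part of the argument) computes the exact moment norms of a standard Gaussian, a sub-gaussian variable, and of an Exp(1) variable, a sub-exponential one, and confirms that they grow like the square root of p and linearly in p, respectively.

```python
import math

# Exact moment norms ||X||_p = (E|X|^p)^(1/p) for two model cases.
def gaussian_norm(p):
    # E|N(0,1)|^p = 2^(p/2) * Gamma((p + 1)/2) / sqrt(pi)
    return (2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)) ** (1 / p)

def exp_norm(p):
    # E[Exp(1)^p] = Gamma(p + 1) = p!
    return math.gamma(p + 1) ** (1 / p)

for p in [2, 4, 8, 16, 32, 64]:
    # the first ratio stays bounded (sqrt(p) growth of the norm),
    # and so does the second (linear growth in p)
    print(p, round(gaussian_norm(p) / math.sqrt(p), 3), round(exp_norm(p) / p, 3))
```

Both ratios settle near constants (about 1/sqrt(e) and 1/e as p grows), which is exactly the statement that the constants in the moment bounds do not depend on p.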
In fact, the above bound and the bound (1.7) are equivalent up to a constant, as the following simple result suggests.
Lemma 1.1 (Equivalence of tails and moments).
Suppose a random variable  has a mixture of sub-gaussian and sub-exponential tails, in the sense that it satisfies, for any , with probability at least ,
for some . Then, for any it holds that
Conversely, if  for any , then for any  we have, with probability at least ,
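The moments-to-tails direction of the lemma is essentially Markov's inequality applied at the right moment level. The following sketch (our toy check) verifies this mechanism for X distributed as Exp(1), where the moment norm equals (p!)^(1/p), which is at most p, so the resulting tail at level e times t should be at most exp(-t); the true tail of Exp(1) is explicit and sits below the Markov bound.

```python
import math

# Moments-to-tails via Markov: if ||X||_p <= b * p for all p, then
#   P(|X| >= e*b*t) <= (||X||_t / (e*b*t))^t <= e^{-t}.
# We take X ~ Exp(1), for which ||X||_p = (p!)^(1/p) <= p, i.e. b = 1.
for t in [1, 2, 4, 8]:
    moment = math.gamma(t + 1) ** (1 / t)     # ||X||_t, exact
    markov = (moment / (math.e * t)) ** t     # Markov bound at level e*t
    exact = math.exp(-math.e * t)             # true tail P(X >= e*t)
    assert exact <= markov <= math.exp(-t) + 1e-12
    print(t, markov)
```

The choice of the moment level p equal to the deviation level t is what converts a linear moment growth into a sub-exponential tail; the same computation with a square-root moment growth yields a sub-gaussian tail.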
The proof is a simple adaptation of Theorem 2.3 from . For the sake of completeness, we present it in Section A. We conclude that moment bounds appear naturally when one deals with deviation inequalities. In addition, the moments are easier to work with when we are interested in lower bounds, as we will see in Section 4. Below we also state several well-known moment inequalities for sums and functions of independent random variables. One of them is the moment version of the bounded differences inequality, which follows immediately from Theorem 15.7 in .
Lemma 1.2 (Bounded differences/McDiarmid’s inequality).
Consider a function  of independent random variables that take their values in . Suppose that  satisfies the bounded differences property, namely, for any  and any  it holds that
Then, we have for any ,
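As a sanity check of the lemma, the sketch below (our own toy example) takes f to be the mean of n independent uniforms on [0, 1], so that changing one coordinate moves f by at most 1/n, and compares the empirical moment norms of the centered f against the predicted scale, the square root of p times the root of the sum of squared bounded-differences constants.

```python
import numpy as np

# Monte Carlo check of the moment form of the bounded differences
# inequality on the sample mean of n uniforms: here c_i = 1/n, so the
# predicted scale is sqrt(p) * (sum_i c_i^2)^{1/2} = sqrt(p / n).
rng = np.random.default_rng(0)
n, trials = 50, 100_000
f = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
centered = f - 0.5                             # E f = 1/2

ratios = []
for p in [2, 4, 8]:
    emp_norm = np.mean(np.abs(centered) ** p) ** (1.0 / p)
    bound = np.sqrt(p) * np.sqrt(n * (1.0 / n) ** 2)
    ratios.append(float(emp_norm / bound))
print(ratios)  # all well below 1: the moments sit inside the predicted scale
```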
Next, we use the following version of the classical Marcinkiewicz-Zygmund inequality (we also refer to Chapter 15 of , which contains similar inequalities).
Lemma 1.3 (The Marcinkiewicz-Zygmund inequality ).
Let be independent centered random variables with finite -th moment for . Then,
Notice that it is easy to apply the above lemma in the case when a.s. and . Since , we have
We will refer to it as the moment version of Hoeffding’s inequality.
2 From generalization to concentration of the sum of dependent random variables
In this section, we modify the generalization bound in order to get an equivalent statement about the concentration of the sum of non-independent random variables. Slightly abusing the notation, we denote
where is an independent copy of . Using the uniform stability, we can write down the following leave-one-out decomposition (see e.g., )
Our computations lead to the following simple lemma.
Under the uniform stability condition (1.1) with parameter  and uniform boundedness of the loss function , we have for  defined by  that
Moreover, we have almost surely for , and .
Finally, as a deterministic function,  satisfies the bounded differences condition (1.8) with  for all variables except one.
Similarly to the computations above we have,
The remaining properties can be immediately verified. ∎
Therefore, up to a constant term bounded by , which corresponds to in the original generalization bound, obtaining the high probability bounds for is equivalent to obtaining high probability upper bounds for .
In the next example, we provide some intuition on why the naive application of the bounded differences inequality fails to prove sharp generalization bounds. Surprisingly, it appears that the proof of the bound (1.2) is essentially equivalent to applying the triangle inequality to the sum of weakly dependent random variables.
On the sub-optimality of the bound (1.2)
First, we prove an exact moment analog of (1.2) for , with  defined by (2.1). We have in mind the illustrative regime  and , which is exactly when the bound (1.4) balances its two terms. By the triangle inequality, we have
where we used that conditionally on the random variable is centered and combined this fact with Lemma 1.2. Since is a sum of independent centered bounded random variables, Hoeffding’s inequality (1.9) is applicable to .
Observe that in the proof above we lose a lot by replacing  with . Indeed, it is easy to see that the random variables , as well as , are weakly correlated. To see this, set
where is an independent copy of . For using and together with the bounded difference property, we have
This suggests that for , the random variables and have small correlation. However, the original argument in  does not take this into account and would give the same bound even if all were replaced by the same random variable .
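The loss described above can be quantified in a toy computation (ours, not tied to the specific terms of the proof): for a sum of n independent centered terms, the second moment norm of the sum grows like the square root of n, while the triangle inequality, which is tight precisely when all terms are the same random variable, can only give a bound of order n.

```python
import numpy as np

# Triangle inequality vs. the true L2 norm of a sum of independent
# centered terms: the gap is a factor of order sqrt(n).
rng = np.random.default_rng(1)
n, trials = 100, 20_000
X = rng.uniform(-1.0, 1.0, size=(trials, n))

sum_norm = float(np.sqrt(np.mean(X.sum(axis=1) ** 2)))   # ~ sqrt(n / 3)
triangle = float(n * np.sqrt(np.mean(X ** 2)))           # n * ||X_1||_2 ~ n / sqrt(3)
print(triangle / sum_norm)  # ~ sqrt(n) = 10
```

This is exactly the kind of gap the naive argument leaves on the table: it treats the weakly correlated terms as if they were one and the same random variable.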
3 The general moment bound
Here we present an upper bound that relies solely on the properties of the functions (2.1) described in Lemma 2.1. In this section, we slightly abuse the notation: the random variables do not have to be related to any learning algorithm. For the sake of brevity, we sometimes denote  by . First, we prove our strongest moment bound, which is the main contribution of the paper.
Without loss of generality, we suppose that . Otherwise, we can add extra functions equal to zero, increasing the number of terms at most twofold.
Consider a sequence of partitions with , , and to get from we split each subset in into two equal parts. We have
By construction, we have  and  for each . For each  and , denote by  the unique set in  that contains . In particular,  and .
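The nested partition scheme just described can be sketched in a few lines (our illustration): the first partition is the whole index set, each subsequent level splits every block into two equal halves, and the last level consists of singletons. We assume the number of indices is a power of two, matching the padding step above.

```python
# Build the sequence of nested dyadic partitions of {0, ..., n-1}.
def dyadic_partitions(n):
    partitions = [[list(range(n))]]
    while len(partitions[-1][0]) > 1:
        next_level = []
        for block in partitions[-1]:
            half = len(block) // 2
            next_level.append(block[:half])
            next_level.append(block[half:])
        partitions.append(next_level)
    return partitions

parts = dyadic_partitions(16)
for level, partition in enumerate(parts):
    # level l has 2^l blocks of size n / 2^l each
    assert len(partition) == 2 ** level
    assert all(len(block) == 16 // 2 ** level for block in partition)
print([len(p) for p in parts])  # [1, 2, 4, 8, 16]
```

Each index thus belongs to exactly one block per level, which is what makes the telescoping decomposition over levels well defined.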
For each and every consider the random variables
i.e. conditioned on and all the variables that are not in the same set as in the partition . In particular, and . We can write a telescopic sum for each ,
and the total sum of interest satisfies by the triangle inequality
Since and , by applying (1.9) we have
The only non-trivial part is the second term of the r.h.s. of (3.1). Observe that
that is, the expectation is taken w.r.t. the variables . It is also not hard to see that the function  preserves the bounded differences property, just like the function . Therefore, if we apply Lemma 1.2 conditionally on , we obtain a uniform bound
as there are indices in . We have as well by (1.6).
Let us take a look at the sum for . Since for depends only on , the terms are independent and zero mean conditionally on . Applying Lemma 1.3, we have for any ,
Integrating with respect to and using , we have
It remains to apply the triangle inequality over all sets . We have,
Recall that  due to the possible extension of the sample. Then,
Before discussing the details of the proof, let us derive the following simple corollary.
Under the uniform stability condition (1.1) with parameter  and uniform boundedness of the loss function , we have, for any , with probability at least ,
The last bound improves the recent upper bound (1.4) for uniformly stable algorithms by Feldman and Vondrák. To be precise, we remove the parasitic term.
The strategy of the proof of Theorem 3.1 is inspired by the original approach of Feldman and Vondrák . Their clamping can be related to the analysis of the terms . It is important to notice that the truncation part of their analysis creates some technical difficulties, since it introduces some bias and changes the stability parameter. In particular, the truncation brings an unnecessary logarithmic factor. We entirely avoid these steps by a simple application of the Marcinkiewicz-Zygmund inequality. The analog of the dataset reduction step of Feldman and Vondrák is our nested partition scheme. However, the recursive structure of their approach is replaced by an application of telescopic sums, whereas the union bound, which also brings a logarithmic factor, is replaced by the triangle inequality for norms. Besides giving an elementary proof, our analysis leads to a better result: we eliminate the parasitic term.
Another interesting direction is the analysis of the first bound of Feldman and Vondrák (1.3), which was originally proved by techniques rooted in Differential Privacy (see the discussion of three different ways to prove this bound in ). As already noticed in , the bound (1.3) should not be discarded, since it does not contain additional -factors and, more importantly from our point of view, it has a sub-gaussian form, as it depends only on . Recall that the second bound (1.4) is a mixture of sub-gaussian and sub-exponential tails. Although we could adapt the moment technique to prove (1.3), we instead arrive at the following more general observation:
In order to show this, we have by Theorem 3.1, provided that almost surely
Since  and  for  (which is rather crude), we have for ,
Similarly to the proof of Corollary 3.2, it immediately implies
which is (1.3) up to an unnecessary factor. The latter is clearly an artifact of the proof in our case.
4 Lower bounds
Since the bound of Theorem 3.1 implies the best known risk bound, it is natural to ask if it can be improved in general. By Lemma 2.1, we know that the analysis of the generalization bounds is closely related to the analysis of the functions satisfying the assumptions of Theorem 3.1. Therefore, it is interesting to know how sharp the general bound (3.3) is. Recall that
where, as before,  is a uniform bound on . In this section, we prove that one cannot improve the bound of Theorem 3.1 apart from the -factor, and that the bound is tight with respect to the parameters in some regimes. We note, however, that this does not completely answer the question of the optimality of the risk bound of Corollary 3.2 for uniformly stable algorithms, but it shows that this is the best we can hope for as long as our upper bound is based only on the parameters . In particular, Theorem 3.1 disregards the condition . We discuss this in more detail in what follows.
As before, we need two well-known facts from probability theory. The first lemma is a moment version of the Montgomery-Smith bound which is due to Hitczenko . It characterizes the moments of Rademacher sums up to a multiplicative constant factor.
Lemma 4.1 (Moments of weighted Rademacher sums ).
Let  be a non-increasing sequence and let  be i.i.d. Rademacher signs. Then,
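The two-sided estimate of the lemma, the moment norm of a weighted Rademacher sum is comparable to the sum of the p largest weights plus the square root of p times the Euclidean norm of the remaining weights, can be checked numerically. The sketch below (our own toy weights) computes the moments exactly by enumerating all sign patterns, so there is no Monte Carlo error, and confirms that the ratio of the two sides stays within constant factors.

```python
import itertools
import math

# Exact p-th moment norm of sum_i a_i * eps_i by enumerating all 2^n
# sign patterns (feasible for small n only).
def rademacher_norm(a, p):
    n = len(a)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        s = sum(w * e for w, e in zip(a, signs))
        total += abs(s) ** p
    return (total / 2 ** n) ** (1.0 / p)

a = [1.0, 0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]  # non-increasing
ratios = []
for p in [2, 4, 6]:
    lhs = rademacher_norm(a, p)
    # Hitczenko-style two-term quantity: head sum plus sqrt(p) * tail norm
    rhs = sum(a[:p]) + math.sqrt(p) * math.sqrt(sum(w * w for w in a[p:]))
    ratios.append(lhs / rhs)
print(ratios)  # ratios stay within constant factors of 1
```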
The next lemma is Chebyshev’s association inequality, see e.g., Theorem 2.14 in .
Let and be non-decreasing real-valued functions defined on the real line. If is a real-valued random variable, then
Proposition 4.3 (The lower bound, ).
Let be i.i.d. Rademacher signs. There is an absolute constant and functions that satisfy the conditions of Theorem 3.1 with the parameters , , such that we have for any ,
Consider the functions,
It is easy to check that and a.s. Moreover, each satisfies the bounded difference property with parameter w.r.t. all except the -th variable. Denoting we have,
By the triangle inequality we have,
For we obviously have . By Lemma 4.2 and since both functions and are non-decreasing for non-negative , we have
Finally, due to the symmetry of and Lemma 4.1, we have for ,
As a result, for some , our construction implies the following lower bound
corresponds to a special case of Rademacher chaos. The behaviour of (4.4) is well understood, and the desired lower bound of order  for  will follow immediately from Corollary 1, Example 2 of Latała . We present the corresponding bound in the proof of inequality (4.5) below. This approach, in the case , removes the assumption  from Proposition 4.3.
The lower bound of Proposition 4.3 matches the result of Theorem 3.1 up to the logarithmic factor in the regime . In particular, it means that in this regime, the bound has to be sub-exponential unless we use some properties of the functions , other than mentioned in Theorem 3.1. We additionally note that our moment lower bounds imply the deviation lower bounds. We can show that there are absolute constants such that the functions defined in (4.2) satisfy for every ,
Besides, in the case , the trivial bound is the best one can have. As in (4.3), consider the functions , where
are i.i.d. Rademacher signs (this corresponds to a learning algorithm that always outputs the same classifier). By Lemma 4.1, we have for ,