Sharper bounds for uniformly stable algorithms

10/17/2019 · Olivier Bousquet et al.

Generalization bounds for stable algorithms are a classical question in learning theory, taking its roots in the early works of Vapnik and Chervonenkis and Rogers and Wagner. In a series of recent breakthrough papers, Feldman and Vondrák have shown that the best known high probability upper bounds for uniformly stable learning algorithms, due to Bousquet and Elisseeff, are sub-optimal in some natural regimes. To do so, they proved two generalization bounds that significantly outperform the original bound. Feldman and Vondrák also asked whether it is possible to provide sharper bounds and prove corresponding high probability lower bounds. This paper is devoted to these questions: firstly, inspired by the original arguments of Feldman and Vondrák, we provide a short proof of a moment bound that implies a generalization bound stronger than both recent results. Secondly, we prove general lower bounds, showing that our moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. Our main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.


1 Introduction

The main motivation of our studies is the analysis of learning algorithms that are uniformly stable (we recall the definition introduced in [2] below). In this context, we are given an i.i.d. sample of points distributed according to some unknown measure, together with a learning algorithm that, given the learning sample, outputs a function mapping the instance space into the space of labels; the quality of a prediction on a given point is measured by a loss function.

Given the random sample, the risk of the algorithm is defined as the expected loss of the learned function on an independent test point, and the empirical risk as its average loss over the training sample.

By generalization bounds we mean high probability bounds on the difference between the actual risk of the algorithm and its empirical performance on the learning sample. The standard way to prove generalization bounds is based on the sensitivity of the algorithm to changes in the learning sample, such as leaving one of the data points out or replacing it with a different one. To the best of our knowledge, this idea was first used by Vapnik and Chervonenkis to prove the in-expectation generalization bound for what is now known as hard-margin SVM [13]. Later work by Devroye and Wagner used notions of stability to prove high probability generalization bounds for k-nearest neighbors [3]. The paper [2] provides an extensive analysis of different notions of stability and the corresponding (sometimes high probability) generalization bounds. Among the recent contributions on high probability upper bounds based on notions of stability are the paper of Maurer [9], which studies generalization bounds for a particular case of linear regression with a strongly convex regularizer, and the recent work [15], which provides sharp exponential upper bounds for the SVM in the realizable case. In order not to repeat an extensive survey of the topic, we refer to [4] and [5] and the references therein.

We return to the problem of generalization bounds. For the sake of simplicity, we denote the loss of the learned function on a given point by a single symbol. The learning algorithm is uniformly stable with a given parameter if, given any two samples differing in a single element, for any point we have

(1.1)

Since it is only a matter of normalization, in order to simplify the notation in what follows, we analyze the generalization error multiplied by the sample size, which is the quantity
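As a toy illustration of the uniform stability definition (a hypothetical example of our own, not taken from the paper), the rule that predicts the sample mean under the squared loss on [0, 1] is uniformly stable with parameter at most 2/n:

```python
import random

def algorithm(sample):
    # toy learning rule: predict the sample mean (a uniformly stable algorithm)
    return sum(sample) / len(sample)

def loss(h, z):
    # squared loss; bounded by M = 1 for h, z in [0, 1]
    return (h - z) ** 2

random.seed(0)
n = 50
S = [random.random() for _ in range(n)]

# empirically estimate the uniform stability parameter: replace one point,
# measure the largest change in loss over a grid of test points
test_points = [k / 20 for k in range(21)]
gamma_hat = 0.0
for i in range(n):
    for z_new in (0.0, 1.0):  # extreme replacements
        S_prime = S[:i] + [z_new] + S[i + 1:]
        h, h_prime = algorithm(S), algorithm(S_prime)
        for z in test_points:
            gamma_hat = max(gamma_hat, abs(loss(h, z) - loss(h_prime, z)))

# for the mean predictor, |h - h'| <= 1/n, so the loss change is at most 2/n
assert gamma_hat <= 2.0 / n + 1e-12
```

Replacing one point moves the mean by at most 1/n, and the squared loss is 2-Lipschitz on the relevant range, which gives the 2/n stability parameter observed above.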

The basic, and until very recently the best known, result is the high probability upper bound of [2], which claims that for any uniformly stable algorithm with parameter and provided that the loss is bounded by , we have with probability at least 1 − δ,

(1.2)

It is easy to observe that this generalization bound is tight only for small enough values of the stability parameter; only under this assumption does the generalization error converge to zero at the optimal rate. However, in some applications a regime of larger stability parameters is of interest, and there the bound (1.2) cannot guarantee any convergence. In order to cover such values of the stability parameter, Feldman and Vondrák provided sharper generalization bounds. In a series of breakthrough papers [4, 5], they first showed a generalization bound of the form,

(1.3)

where as before, the parameter corresponds to the stability, and the parameter bounds the loss function uniformly. In their second paper [5], Feldman and Vondrák showed a stronger generalization bound,

(1.4)

Up to logarithmic factors, the bound (1.4) shows that with high probability, in the corresponding regime, the generalization error converges to zero at the optimal rate. However, as claimed by Feldman and Vondrák, the bound (1.3) should not be wholly discarded, since it does not contain additional logarithmic factors. More importantly, the bound (1.3) is sub-gaussian, which means that the dependence on the confidence level enters only through a sub-gaussian tail term. At the same time, the bound (1.4) exhibits both sub-gaussian and sub-exponential regimes, since it contains terms of both types. We will discuss the notions of sub-gaussian and sub-exponential high probability upper bounds below.

In [5], the authors ask whether their high probability upper bounds (1.3) and (1.4) can be strengthened and whether they can be matched by a high probability lower bound. In this paper, we make progress in answering both questions. We briefly summarize our findings:

  • Our main probabilistic result is Theorem 3.1, presented in Section 3. As one of its immediate corollaries, it implies a risk bound of the form,

    (1.5)

    which removes the parasitic term from (1.4). We emphasize that our analysis is inspired by the original sample-splitting argument of Feldman and Vondrák [5], although our proof is significantly more straightforward. In particular, we avoid several involved technical steps, which ultimately leads us to better generalization bounds.

  • Our Theorem 3.1 also readily implies the sub-gaussian bound (1.3). Therefore, we also build a natural bridge between the bounds of the form (1.3) and (1.4), which have different dependencies on the parameters involved.

  • In Section 4, we show that the bound of our Theorem 3.1 is tight unless some additional properties of the corresponding random variables are used. Our lower bounds are witnessed by specific functions satisfying the assumptions of Theorem 3.1. We remark that our lower bound does not entirely answer the question of the optimality of (1.5) for uniformly stable algorithms, as it only shows the tightness of the bound implying (1.5). We discuss this in more detail in Section 4.

Notation

We provide some notation that will be used throughout the paper. The symbol will denote an indicator of the event . For a pair of non-negative functions the notation or will mean that for some universal constant it holds that . Similarly, we introduce to be equivalent to . For we define and . The norm of a random variable will be denoted as . Let denote the set . To avoid some technical problems, for small moment orders we usually use the convention

In what follows we work with the functions of independent variables . For we will write . In addition, for and we write . In particular, if we have an a.s. bound for any realisation of , then by a simple integration argument we have

(1.6)

i.e., in this sense a conditional bound is stronger than the unconditional one. Finally, for slightly abusing the notation we set and .

Several facts from Probability Theory

When dealing with high probability bounds in learning theory, one usually derives a bound of the form,

(1.7)

with probability at least for any and some . Here, is a random variable of interest, e.g., the excess risk. The term with is referred to as a sub-gaussian tail, as it matches the deviations of a Gaussian random variable. The term with is called a sub-exponential tail. In general, the bound above represents a mixture of sub-gaussian and sub-exponential tails. In particular, all the known generalization bounds (1.2), (1.3), (1.4) are of the form (1.7).

An alternative way of studying tail bounds is via the moment norms. Recall that the -norm of a random variable is . It is well-known that a sub-gaussian random variable has moments

where does not depend on , see e.g., Proposition 2.5.2 in [14]. In addition, the moment norms of a sub-exponential r.v. grow not faster than , i.e.

for some not depending on , see e.g., Proposition 2.7.1 in [14]. In what follows, we will consider random variables with two levels of moments, that is, for some parameters that do not depend on

In fact, the above bound and the bound (1.7) are equivalent up to a constant, as the following simple result suggests.

Lemma 1.1 (Equivalence of tails and moments).

Suppose a random variable has a mixture of sub-gaussian and sub-exponential tails, in the sense that it satisfies for any , with probability at least ,

for some . Then, for any it holds that

And vice versa, if for any then for any we have with probability at least ,

The proof is a simple adaptation of Theorem 2.3 from [1]. For the sake of completeness, we present it in Section A. We conclude that moment bounds appear naturally when one deals with deviation inequalities. In addition, the moments are easier to work with when we are interested in lower bounds, as we will see in Section 4. Below we also state several well-known moment inequalities for sums and functions of independent random variables. One of them is the moment version of the bounded differences inequality, which follows immediately from Theorem 15.7 in [1].
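The two moment growth rates above can be checked numerically. The following sketch (our own illustration; the constants are chosen for convenience) computes exact p-norms of a standard Gaussian and an Exp(1) variable via the Gamma function and verifies the square-root versus linear growth in the moment order:

```python
import math

def norm_gaussian(p):
    # ||X||_p for X ~ N(0,1): E|X|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)
    return (2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)) ** (1 / p)

def norm_exponential(p):
    # ||X||_p for X ~ Exp(1): E X^p = Gamma(p+1) = p!
    return math.gamma(p + 1) ** (1 / p)

# sub-gaussian: ||X||_p / sqrt(p) stays bounded; sub-exponential: ||X||_p / p does
for p in range(2, 41):
    assert norm_gaussian(p) <= math.sqrt(p)   # K = 1 already suffices here
    assert norm_exponential(p) <= p           # since (p!)^(1/p) <= p
# the exponential moments genuinely outgrow sqrt(p): they are not sub-gaussian
assert norm_exponential(40) / math.sqrt(40) > 2
```

The exact formulas for the absolute moments are classical; the point of the check is that a sub-gaussian variable has p-norms of order the square root of p, while a sub-exponential one has p-norms of order p, matching the two tail regimes in (1.7).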

Lemma 1.2 (Bounded differences/McDiarmid’s inequality).

Consider a function of independent random variables taking values in . Suppose that it satisfies the bounded differences property, namely, for any and any it holds that

(1.8)

Then, we have for any ,

Next, we use the following version of the classical Marcinkiewicz-Zygmund inequality (we also refer to Chapter 15 of [1], which contains similar inequalities).

Lemma 1.3 (The Marcinkiewicz-Zygmund inequality [11]).

Let be independent centered random variables with finite -th moment for . Then,

Notice that it is easy to apply the above lemma in the case when a.s. and . Since , we have

(1.9)

We will refer to it as the moment version of Hoeffding’s inequality.
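As an illustration of (1.9), here is a self-contained check of our own, using the exact binomial distribution of a Rademacher sum: the p-norm of the sum of n independent signs indeed stays below a constant times the square root of p times n.

```python
import math

def rademacher_sum_norm(n, p):
    # exact ||eps_1 + ... + eps_n||_p: the sum equals n - 2k with prob C(n,k)/2^n
    total = 0
    for k in range(n + 1):
        total += math.comb(n, k) * abs(n - 2 * k) ** p
    return (total / 2 ** n) ** (1 / p)

n = 100
for p in (2, 4, 6, 8, 10):
    # moment Hoeffding: ||S||_p <= C * sqrt(p * n); C = sqrt(2) is enough here
    assert rademacher_sum_norm(n, p) <= math.sqrt(2) * math.sqrt(p * n)

assert abs(rademacher_sum_norm(n, 2) - math.sqrt(n)) < 1e-9  # ||S||_2 = sqrt(n)
```

The computation is exact (integer arithmetic up to the final root), so the assertions verify the moment version of Hoeffding's inequality for this particular case rather than estimating it by simulation.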

2 From generalization to concentration of the sum of dependent random variables

In this section, we modify the generalization bound in order to get an equivalent statement about the concentration of the sum of non-independent random variables. Slightly abusing the notation, we denote

where is an independent copy of . Using the uniform stability, we can write down the following leave-one-out decomposition (see e.g., [2])

Denote

(2.1)

Our computations lead to the following simple lemma.

Lemma 2.1.

Under the uniform stability condition with parameter (1.1) and uniform boundedness of the loss function, we have for defined by (2.1) that

Moreover, we have almost surely for , and .

Finally, as a deterministic function of the sample, it satisfies the bounded difference condition (1.8) with respect to all variables except one.

Proof.

Similarly to the computations above we have,

The remaining properties can be immediately verified. ∎

Therefore, up to a constant term bounded by , which corresponds to in the original generalization bound, obtaining high probability bounds for the generalization error is equivalent to obtaining high probability upper bounds for the sum.

In the next example, we provide some intuition on why the naive application of the bounded differences inequality fails to prove sharp generalization bounds. Surprisingly, it appears that the proof of the bound (1.2) is essentially equivalent to applying the triangle inequality to the sum of weakly dependent random variables.

On the sub-optimality of the bound (1.2)

First, we prove an exact moment analog of (1.2) for the functions defined by (2.1). We have in mind an illustrative regime of the parameters, which is exactly where the bound (1.4) balances its two terms. By the triangle inequality we have

where we used that conditionally on the random variable is centered and combined this fact with Lemma 1.2. Since is a sum of independent centered bounded random variables, Hoeffding’s inequality (1.9) is applicable to .

Observe that in the proof above we lose a lot by replacing with . Indeed, it is easy to see that the random variables , as well as , are weakly correlated. In order to see this, set

where is an independent copy of . For using and together with the bounded difference property, we have

(2.2)

This suggests that for , the random variables and have small correlation. However, the original argument in [2] does not take this into account and would give the same bound even if all the summands were replaced by the same random variable .

We also note that an argument analogous to (2.2) was first used in [4] to prove the following sharp variance bound

(2.3)

3 The general moment bound

Here we present an upper bound that relies solely on the properties of the functions (2.1) described in Lemma 2.1. In this section, we slightly abuse the notation: the random variables do not have to be related to any learning algorithm. For the sake of brevity we sometimes denote by . First, we prove our strongest moment bound, which is the main contribution of the paper.

Theorem 3.1.

Let be a vector of independent random variables, each taking values in , and let be some functions such that the following holds for any :

  • a.s.,

  • a.s.,

  • satisfies the bounded differences condition (1.8) with respect to all variables except the -th variable.

Then, for any ,
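In words, and up to absolute constants, the moment bound established below can be summarized as follows, writing M for the almost sure bound on each function, β for the bounded differences parameter, and p for the moment order (these symbol names are our choice for this sketch, not necessarily the paper's, so treat the display as indicative rather than verbatim):

```latex
\Bigl\| \sum_{i=1}^{n} g_i(X) \Bigr\|_p \;\lesssim\; p\,n\beta \log n \;+\; M\sqrt{p\,n}.
```

The first term is sub-exponential in p and the second is sub-gaussian, which matches the two regimes discussed after (1.4) and the lower bounds of Section 4.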

Proof.

Without loss of generality, we suppose that . Otherwise, we can add extra functions equal to zero, increasing the number of terms by at most a factor of two.

Consider a sequence of partitions with , , where to pass from each partition to the next we split each of its subsets into two equal parts. We have

By construction, we have and for each . For each and , denote by the only set from that contains . In particular, and .

For each and every consider the random variables

i.e. conditioned on and all the variables that are not in the same set as in the partition . In particular, and . We can write a telescopic sum for each ,

and the total sum of interest satisfies by the triangle inequality

(3.1)

Since and , by applying (1.9) we have

(3.2)

The only non-trivial part is the second term of the r.h.s. of (3.1). Observe that

that is, the expectation is taken w.r.t. the variables . It is also not hard to see that the function preserves the bounded differences property, just like the function . Therefore, if we apply Lemma 1.2 conditionally on , we obtain a uniform bound

as there are indices in . We have as well by (1.6).

Let us take a look at the sum for . Since for depends only on , the terms are independent and have zero mean conditionally on . Applying Lemma 1.3, we have for any ,

Integrating with respect to and using , we have

It is left to use the triangle inequality over all sets . We have,

Recall that due to the possible extension of the sample. Then,

Plugging the above bound together with (3.2) into (3.1), we get the required bound. ∎
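To illustrate the combinatorial skeleton of the argument, here is a small numerical sketch (with a toy function g of our own choosing, not tied to any learning algorithm) of the dyadic partition sequence and the telescoping identity over conditional expectations:

```python
import itertools

n, K = 4, 2  # n = 2^K Rademacher variables, K levels of nested partitions

# P[0] is the trivial partition; each level splits every block into two halves
P = [[list(range(n))]]
for _ in range(K):
    P.append([half for B in P[-1] for half in (B[:len(B)//2], B[len(B)//2:])])

def block_of(i, k):
    return next(B for B in P[k] if i in B)

def g(i, x):
    # toy function: bounded and centered conditionally on x_i (hypothetical)
    return x[i] * sum(x[j] for j in range(n) if j != i) / n

def Z(i, k, x):
    # Z_i^k = E[ g_i | x_i and all x_j outside the level-k block containing i ]
    inside = [j for j in block_of(i, k) if j != i]  # coordinates averaged out
    vals = []
    for signs in itertools.product((-1, 1), repeat=len(inside)):
        y = list(x)
        for j, s in zip(inside, signs):
            y[j] = s
        vals.append(g(i, y))
    return sum(vals) / len(vals)

x = [1, -1, 1, 1]
for i in range(n):
    assert abs(Z(i, 0, x)) < 1e-12  # conditional centering: E[g_i | x_i] = 0
    # the telescoping sum over refinement levels recovers g_i exactly
    tele = Z(i, 0, x) + sum(Z(i, k, x) - Z(i, k - 1, x) for k in range(1, K + 1))
    assert abs(tele - g(i, x)) < 1e-12
```

At the finest level the blocks are singletons, so the last conditional expectation coincides with the function itself and the telescoping identity holds exactly; the proof above controls each telescoping increment separately.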

Before discussing the details of the proof, let us obtain the following simple corollary.

Corollary 3.2.

Under the uniform stability condition with parameter (1.1) and the uniform boundedness of the loss function , we have that for any with probability at least ,

The last bound is an improvement of the recent upper bound for uniformly stable algorithms by Feldman and Vondrák (1.4). To be precise, we removed the parasitic term.

Proof.

Combining Lemma 2.1 and Theorem 3.1 with defined in (2.1), , and , we have for any ,

The deviation bound now follows immediately from Lemma 1.1. ∎

Remark 3.3.

The strategy of the proof of Theorem 3.1 is inspired by the original approach of Feldman and Vondrák [5]. Their clamping can be related to the analysis of the terms . It is important to notice that the truncation part of their analysis creates some technical difficulties since it introduces some bias and changes the stability parameter. In particular, the truncation brings an unnecessary logarithmic factor. We entirely avoid these steps by a simple application of the Marcinkiewicz-Zygmund inequality. The analog of the dataset reduction step of Feldman and Vondrák is our nested partition scheme. However, the recursive structure of their approach is replaced by an application of telescopic sums, whereas the union bound, which also brings a logarithmic factor, is replaced by the triangle inequality for norms. Apart from an elementary proof, our analysis leads to a better result: we eliminate the parasitic term.

Another interesting direction is the analysis of the first bound (1.3) of Feldman and Vondrák, which was originally proved by techniques rooted in differential privacy (see the discussion of three different ways to prove this bound in [5]). As already noticed in [5], the bound (1.3) should not be discarded due to the fact that it does not contain additional logarithmic factors and, from our point of view more importantly, it has the sub-gaussian form, since it depends only on . Recall that the second bound (1.4) is a mixture of sub-gaussian and sub-exponential tails. Although we can adapt the moment technique to prove (1.3), we instead arrive at the following more general observation:

The bound of Theorem 3.1 is strong enough to almost recover the sub-gaussian bound (1.3).

In order to show this, we have by Theorem 3.1, provided that almost surely

(3.3)

Since and for , (which is rather crude) we have for ,

Similarly to the proof of Corollary 3.2, it immediately implies

(3.4)

which is (1.3) up to an unnecessary factor. The latter is clearly an artifact of the proof in our case.

4 Lower bounds

Since the bound of Theorem 3.1 implies the best known risk bound, it is natural to ask whether it can be improved in general. By Lemma 2.1, we know that the analysis of generalization bounds is closely related to the analysis of functions satisfying the assumptions of Theorem 3.1. Therefore, it is interesting to know how sharp the general bound (3.3) is. Recall that

where, as before, is a uniform bound on . In this section, we prove that one cannot improve the bound of Theorem 3.1, apart from the logarithmic factor, and that the bound is tight with respect to the parameters in some regimes. We notice, however, that this does not completely answer the question of the optimality of the risk bound of Corollary 3.2 for uniformly stable algorithms, but shows that this is the best we can hope for as long as our upper bound is based only on the parameters . In particular, Theorem 3.1 disregards the condition . We discuss this in more detail in what follows.

As before, we need two well-known facts from probability theory. The first lemma is a moment version of the Montgomery-Smith bound [10], which is due to Hitczenko [7]. It characterizes the moments of Rademacher sums up to a multiplicative constant factor.

Lemma 4.1 (Moments of weighted Rademacher sums [7]).

Let be a non-increasing sequence and let be i.i.d. Rademacher signs. Then,
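To make the statement concrete, recall that Hitczenko's characterization says (we restate it here in a generic reconstruction, so treat the exact form as indicative): for a non-increasing non-negative sequence a, the p-norm of the weighted sum of signs is equivalent, up to universal constants, to the sum of the p largest weights plus the square root of p times the l2 norm of the remaining tail. A small exact-enumeration check of this equivalence:

```python
import math
from itertools import product

def rademacher_norm(a, p):
    # exact ||sum_i a_i * eps_i||_p by enumerating all 2^n sign patterns
    n = len(a)
    total = sum(abs(sum(s * w for s, w in zip(signs, a))) ** p
                for signs in product((-1, 1), repeat=n))
    return (total / 2 ** n) ** (1 / p)

def hitczenko_rhs(a, p):
    # sum of the p largest weights + sqrt(p) * l2 norm of the tail
    head = sum(a[:p])
    tail = math.sqrt(sum(w * w for w in a[p:]))
    return head + math.sqrt(p) * tail

# non-increasing weight sequences; the two sides should agree up to constants
for a in ([1.0] * 10,
          [1.0 / (i + 1) for i in range(10)],
          [2.0 ** -i for i in range(10)]):
    for p in (2, 4, 6):
        ratio = rademacher_norm(a, p) / hitczenko_rhs(a, p)
        assert 0.15 < ratio < 4.0
```

The ratios stay within a fixed constant window for flat, polynomially decaying, and geometrically decaying weights alike, as the two-sided equivalence predicts.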

The next lemma is Chebyshev’s association inequality, see e.g., Theorem 2.14 in [1].

Lemma 4.2.

Let and be non-decreasing real-valued functions defined on the real line. If is a real-valued random variable, then

Proposition 4.3 (The lower bound, ).

Let be i.i.d. Rademacher signs. There is an absolute constant and functions that satisfy the conditions of Theorem 3.1 with the parameters , , such that we have for any ,

(4.1)
Proof.

Consider the functions,

(4.2)

It is easy to check that and a.s. Moreover, each satisfies the bounded difference property with parameter w.r.t. all except the -th variable. Denoting we have,

(4.3)

By the triangle inequality we have,

For we obviously have . By Lemma 4.2 and since both functions and are non-decreasing for non-negative , we have

Finally, due to the symmetry of and Lemma 4.1, we have for ,

Combining the above, for an appropriate absolute constant, our construction implies the desired lower bound. ∎

Remark 4.4.

The fact that the lower bound (4.1) contains the sub-exponential term may be alternatively understood as follows. In the case when , the sum (4.2), which is

(4.4)

corresponds to a special case of Rademacher chaos. The behaviour of (4.4) is well understood and the desired lower bound of order for will immediately follow from Corollary 1, Example 2 by Latała [8]. We present the corresponding bound in the proof of the inequality (4.5) below. This approach, in the case , removes the assumption from Proposition 4.3.

The lower bound of Proposition 4.3 matches the result of Theorem 3.1 up to the logarithmic factor in the corresponding regime. In particular, it means that in this regime, the bound has to be sub-exponential unless we use some properties of the functions other than those mentioned in Theorem 3.1. We additionally note that our moment lower bounds imply deviation lower bounds. We can show that there are absolute constants such that the functions defined in (4.2) satisfy for every ,

(4.5)

This bound can be derived through the Paley-Zygmund inequality, e.g., using the standard arguments as in [6]. For the sake of completeness, we derive this inequality in Section B.
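The Paley-Zygmund inequality states that for a non-negative random variable Z and θ in (0, 1), the probability that Z exceeds θ times its mean is at least (1 − θ)² (EZ)² / E[Z²]. As a sanity check (a toy example of our own, not the construction from the paper), one can verify it exactly for Z = S² with S a Rademacher sum:

```python
import math

# exact distribution of Z = S^2, S a sum of n Rademacher signs
n = 100
pmf = {}
for k in range(n + 1):
    s = n - 2 * k
    pmf[s * s] = pmf.get(s * s, 0.0) + math.comb(n, k) / 2 ** n

EZ = sum(z * q for z, q in pmf.items())       # equals n exactly
EZ2 = sum(z * z * q for z, q in pmf.items())  # equals 3n^2 - 2n exactly

theta = 0.5
lhs = sum(q for z, q in pmf.items() if z >= theta * EZ)
rhs = (1 - theta) ** 2 * EZ ** 2 / EZ2        # Paley-Zygmund lower bound

assert lhs >= rhs
```

Here the actual probability is substantially larger than the Paley-Zygmund guarantee; the inequality is only a one-sided anti-concentration tool, which is all that is needed to turn moment lower bounds into deviation lower bounds.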

Besides, in the case , a trivial bound is the best one can have. As in (4.3), consider the functions , where are i.i.d. Rademacher signs (this corresponds to a learning algorithm that always outputs the same classifier). By Lemma 4.1 we have for ,

Some concluding remarks and remaining open questions

The question of the last remaining logarithmic factor is still open. Using the line (2.2) and the bounded differences inequality of Lemma 1.2, it is easy to prove that in some regimes the bound of Theorem 3.1 may be improved. We have