## 1 Introduction

Most generalization bounds in learning theory hold for a fixed hypothesis set, selected before receiving a sample. This includes learning bounds based on covering numbers, VC-dimension, pseudo-dimension, Rademacher complexity, local Rademacher complexity, and other complexity measures (Pollard, 1984; Zhang, 2002; Vapnik, 1998; Koltchinskii and Panchenko, 2002; Bartlett et al., 2002). Some alternative guarantees have also been derived for specific algorithms. Among them, the most general family is that of uniform stability bounds given by Bousquet and Elisseeff (2002). These bounds were recently significantly improved by Feldman and Vondrak (2018), who proved guarantees that are informative even when the stability parameter is only in $o(1)$, as opposed to $o(1/\sqrt{m})$. New bounds for a restricted class of algorithms were also recently presented by Maurer (2017), under a number of assumptions on the smoothness of the loss function. Appendix A gives more background on stability.

In practice, machine learning engineers commonly resort to hypothesis sets depending on the *same sample* as the one used for training. This includes instances where a regularization, a feature transformation, or a data normalization is selected using the training sample, or other instances where the family of predictors is restricted to a smaller class based on the sample received. In other instances, as is common in deep learning, the data representation and the predictor are learned using the same sample. In ensemble learning, the sample used to train models sometimes coincides with the one used to determine their aggregation weights. However, standard generalization bounds cannot be used to provide guarantees for these scenarios, since they assume a fixed hypothesis set.

This paper studies generalization in a broad setting that admits as
special cases both that of standard learning bounds for fixed
hypothesis sets based on some complexity measure, and that of
algorithm-dependent uniform stability bounds. We present an extensive
study of generalization for *sample-dependent* hypothesis sets,
that is, for learning with a hypothesis set $\mathcal{H}_S$ selected after
receiving the training sample $S$. This defines two stages for the
learning algorithm: a first stage where $\mathcal{H}_S$ is chosen after
receiving $S$, and a second stage where a hypothesis $h$ is selected
from $\mathcal{H}_S$. Standard generalization bounds correspond to the case
where $\mathcal{H}_S$ is equal to some fixed $\mathcal{H}$ independent of
$S$. Algorithm-dependent analyses, such as uniform stability bounds,
coincide with the case where $\mathcal{H}_S$ is chosen to be a singleton
$\{h_S\}$. Thus, the scenario we study covers both existing
settings and, additionally, includes many other intermediate
scenarios. Figure 1 illustrates our general scenario.

We present a series of results for generalization with data-dependent hypothesis sets. We first present general learning bounds for data-dependent hypothesis sets using a notion of transductive Rademacher complexity (Section 3). These bounds hold for arbitrary bounded losses and improve upon previous guarantees given by Gat (2001) and Cannon et al. (2002) for the binary loss, which were expressed in terms of a notion of shattering coefficient adapted to the data-dependent case; they are also more explicit than the guarantees presented by Philips (2005, Corollary 4.6 and Theorem 4.7). Nevertheless, such bounds may often not be sufficiently informative, since they ignore the relationship between hypothesis sets based on similar samples.

To derive a finer analysis, we introduce a key notion of
*hypothesis set stability*, which admits algorithmic stability as
a special case, when the hypotheses sets are reduced to singletons. We
also introduce a new notion of Rademacher complexity for
data-dependent hypothesis sets. Our main results are two
generalization bounds for stable data-dependent hypothesis sets, both
expressed in terms of the hypothesis set stability parameter, our
notion of Rademacher complexity, and a notion of cross-validation
stability that, in turn, can be upper-bounded by the diameter of the
family of hypothesis sets. Our first learning bound
(Section 4) is expressed in terms of a finer notion of
diameter but admits a dependency in terms of the stability parameter
similar to that of uniform stability bounds of
Bousquet and Elisseeff (2002). In Section 5, we use
proof techniques from the differential privacy literature
(Steinke and Ullman, 2017; Bassily et al., 2016; Feldman and Vondrak, 2018)
to derive a learning bound expressed in terms of a somewhat coarser
definition of diameter but with a more favorable dependency on the
stability parameter, matching the dependency of the recent bounds
of Feldman and Vondrak (2018).
Our learning bounds admit as special
cases both standard Rademacher complexity bounds and
algorithm-dependent uniform stability bounds.

Shawe-Taylor et al. (1998) presented an analysis of structural risk minimization over data-dependent hierarchies based on a concept of *luckiness*, which generalizes the notion of margin of linear classifiers. Their analysis can be viewed as an alternative study of data-dependent hypothesis sets, using luckiness functions and $\omega$-smallness (or $\omega$-smoothness) conditions. A luckiness function helps decompose a hypothesis set into *lucky sets*, that is, sets of functions *luckier* than a given function. The $\omega$-smallness condition requires that the size of the family of loss functions corresponding to the lucky set of any function with respect to a double sample, measured by packing or covering numbers, be bounded with high probability by a function of the luckiness of that function on the sample. The luckiness framework is attractive and the notion of luckiness, for example margin, can in fact be combined with our results. However, finding pairs of truly data-dependent luckiness and $\omega$-smallness functions, other than those based on the margin and the empirical VC-dimension, is quite difficult, in particular because of the very technical $\omega$-smallness condition (see Philips, 2005, p. 70). In contrast, our hypothesis set stability is simpler and often easier to bound. The notions of luckiness and $\omega$-smallness have also been used by Herbrich and Williamson (2002) to derive algorithm-specific guarantees. The authors show a connection with algorithmic stability (not hypothesis set stability), at the price of a guarantee imposing a strong condition on the rate of decrease of the stability parameter with the sample size (see Herbrich and Williamson, 2002, pp. 189-190).

In Section 6, we illustrate the generality and the benefits of our hypothesis set stability learning bounds by applying them to the analysis of several scenarios (see also Appendix K). In Appendix J, we briefly discuss several extensions of our framework and results, including the extension to almost-everywhere hypothesis set stability, as in (Kutin and Niyogi, 2002). The next section introduces the definitions and properties used in our analysis.

## 2 Definitions and Properties

Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ the output space. We denote by $\mathcal{D}$ the unknown distribution over $\mathcal{X} \times \mathcal{Y}$ according to which samples are drawn.

The hypotheses we consider map $\mathcal{X}$ to a set $\mathcal{Y}'$ sometimes different from $\mathcal{Y}$. For example, in binary classification, we may have $\mathcal{Y} = \{-1, +1\}$ and $\mathcal{Y}' = \mathbb{R}$. Thus, we denote by $\ell \colon \mathcal{Y}' \times \mathcal{Y} \to [0, 1]$ a loss function defined on $\mathcal{Y}' \times \mathcal{Y}$ and taking non-negative real values bounded by one. We denote the loss of a hypothesis $h$ at a labeled point $z = (x, y)$ by $L(h, z) = \ell(h(x), y)$. We denote by $\mathcal{L}(h)$ the generalization error or expected loss of a hypothesis $h$ and by $\widehat{\mathcal{L}}_S(h)$ its empirical loss over a sample $S = (z_1, \ldots, z_m)$:

$$\mathcal{L}(h) = \operatorname*{\mathbb{E}}_{z \sim \mathcal{D}}[L(h, z)], \qquad \widehat{\mathcal{L}}_S(h) = \frac{1}{m} \sum_{i=1}^m L(h, z_i).$$

In the general framework we consider, a hypothesis set depends on the sample received. We will denote by $\mathcal{H}_S$ the hypothesis set depending on the labeled sample $S = (z_1, \ldots, z_m)$ of size $m$.

[Hypothesis set uniform stability]
Fix $m \ge 1$. We will say that a family of data-dependent
hypothesis sets $\mathcal{H} = (\mathcal{H}_S)_S$ is *$\beta$-uniformly
stable* (or simply $\beta$-stable) for some $\beta \ge 0$, if for any two
samples $S$ and $S'$ of size $m$ differing only by one point, the
following holds:

$$\forall h \in \mathcal{H}_S, \ \exists h' \in \mathcal{H}_{S'} \colon \ \forall z \in \mathcal{X} \times \mathcal{Y}, \ |L(h, z) - L(h', z)| \le \beta. \tag{1}$$

Thus, two hypothesis sets derived from samples differing by one element are close in the sense that any hypothesis in one admits a counterpart in the other set with $\beta$-similar losses.
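For intuition, hypothesis set stability can be probed numerically in the simplest, singleton case, where it reduces to classical algorithmic uniform stability. The sketch below is our own construction, not from the paper: ridge regression with a 1-Lipschitz absolute loss, with all constants chosen arbitrarily for illustration.

```python
import numpy as np

# Illustrative sketch (ours, not from the paper): when each hypothesis set is
# the singleton returned by ridge regression, hypothesis set stability reduces
# to classical uniform stability. We estimate the loss gap between hypotheses
# trained on two samples S and S' differing in one point, under the
# 1-Lipschitz absolute loss |h(x) - y|.
rng = np.random.default_rng(0)
m, d, lam = 200, 5, 1.0

w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=m)

def ridge(X, y, lam):
    # closed-form minimizer of (1/n) ||Xw - y||^2 + lam ||w||^2
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n + lam * np.eye(X.shape[1]), X.T @ y / n)

# S' replaces the last training point
X2, y2 = X.copy(), y.copy()
X2[-1], y2[-1] = rng.normal(size=d), 0.0

w1, w2 = ridge(X, y, lam), ridge(X2, y2, lam)

# empirical analogue of beta: max loss difference over fresh test points
Xt = rng.normal(size=(1000, d))
yt = Xt @ w_true
beta_hat = np.max(np.abs(np.abs(Xt @ w1 - yt) - np.abs(Xt @ w2 - yt)))
print(beta_hat)
```

The observed gap shrinks as $m$ grows or the regularization strength increases, in line with standard stability bounds for strongly convex objectives.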

Next, we define a notion of cross-validation stability for data-dependent hypothesis sets. The notion measures the maximal difference between the loss of a hypothesis on a training example and the loss of a hypothesis on that same example, when the latter hypothesis is chosen from the hypothesis set corresponding to a sample in which the training example in question is replaced by a newly drawn example.
[Hypothesis set Cross-Validation (CV) stability]
Fix $m \ge 1$. We will say that a family of data-dependent
hypothesis sets $\mathcal{H}$ has
*$\chi$-CV-stability* for some $\chi > 0$, if
the following holds (here, $S^i$ denotes the sample
obtained by replacing $z_i$ by $z'_i$):

(2) |

We say that $\mathcal{H}$ has *$\bar{\chi}$ average CV-stability* for some $\bar{\chi} > 0$ if the following holds:

(3) |

We also define a notion of diameter of data-dependent hypothesis sets,
which is useful in bounding CV-stability. In applications, we will
typically bound the diameter, and thereby the CV-stability.
[Diameter of data-dependent hypothesis sets]
Fix $m \ge 1$. We define the *diameter $\Delta$ and average
diameter $\bar{\Delta}$ of a family of data-dependent hypothesis
sets* $\mathcal{H}$ by

(4) |

Notice that, for consistent hypothesis sets, the diameter is reduced to zero, since $L(h, z_i) = 0$ for any $h \in \mathcal{H}_S$ and $z_i \in S$. As mentioned earlier, the CV-stability of hypothesis sets can be bounded in terms of their stability and diameter: a family of data-dependent hypothesis sets $\mathcal{H}$ with $\beta$-uniform stability, diameter $\Delta$, and average diameter $\bar{\Delta}$ has $(\beta + \Delta)$-CV-stability and $(\beta + \bar{\Delta})$-average CV-stability. Indeed, let $S$ and $S^i$ be two samples differing by the point $z_i$, replaced in $S^i$ by $z'_i$. For any $h' \in \mathcal{H}_{S^i}$, by the $\beta$-uniform stability of $\mathcal{H}$, there exists $h'' \in \mathcal{H}_S$ such that $|L(h', z_i) - L(h'', z_i)| \le \beta$. Thus,

This implies the inequality

and the lemma follows.

We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. To introduce its definition, for any two samples $S, S' \in (\mathcal{X} \times \mathcal{Y})^m$ and a vector of Rademacher variables $\sigma = (\sigma_1, \ldots, \sigma_m) \in \{-1, +1\}^m$, denote by $S_\sigma$ the sample derived from $S$ by replacing its $i$th element with the $i$th element of $S'$, for all $i$ with $\sigma_i = -1$. We will use $\mathcal{H}_\sigma$ to denote the hypothesis set $\mathcal{H}_{S_\sigma}$.

[Rademacher complexity of data-dependent hypothesis sets]
Fix $m \ge 1$. The *empirical Rademacher complexity
$\widehat{\mathfrak{R}}_{S, S'}(\mathcal{H})$ and the Rademacher complexity $\mathfrak{R}_m(\mathcal{H})$ of a
family of data-dependent hypothesis sets
$\mathcal{H}$* for two samples $S$
and $S'$ in $(\mathcal{X} \times \mathcal{Y})^m$
are defined by

(5) |

When the family of data-dependent hypothesis sets $\mathcal{H}$ is $\beta$-stable with a sufficiently small $\beta$, the empirical Rademacher complexity $\widehat{\mathfrak{R}}_{S, S'}(\mathcal{H})$ is sharply concentrated around its expectation $\mathfrak{R}_m(\mathcal{H})$, as with the standard empirical Rademacher complexity (see Lemma B.2).

Let $\mathcal{H}_{S \cup S'}$ denote the union of all hypothesis sets based on subsamples of $S \cup S'$ of size $m$: $\mathcal{H}_{S \cup S'} = \bigcup_{\sigma} \mathcal{H}_{S_\sigma}$. Since, for any $\sigma$, we have $\mathcal{H}_{S_\sigma} \subseteq \mathcal{H}_{S \cup S'}$, the following simpler upper bound in terms of the standard empirical Rademacher complexity of $\mathcal{H}_{S \cup S'}$ can be used for our notion of empirical Rademacher complexity:

$$\widehat{\mathfrak{R}}_{S, S'}(\mathcal{H}) \le \widehat{\mathfrak{R}}_S(\mathcal{H}_{S \cup S'}),$$

where $\widehat{\mathfrak{R}}_S(\mathcal{H}_{S \cup S'})$ is the standard empirical Rademacher complexity of $\mathcal{H}_{S \cup S'}$ for the sample $S$.

The Rademacher complexity of data-dependent hypothesis sets can thus be bounded in terms of a standard Rademacher complexity, as indicated previously. It can also be bounded directly, as illustrated by the following example of data-dependent hypothesis sets of linear predictors. For any sample $S$, define the hypothesis set $\mathcal{H}_S$ as follows:

with the quantities involved defined as in Lemma B.1. Then, it can be shown that the empirical Rademacher complexity of the family of data-dependent hypothesis sets can be upper-bounded as follows (Lemma B.1):

Notice that this bound on the Rademacher complexity is non-trivial, since it depends on the samples $S$ and $S'$, while a standard Rademacher complexity bound for a non-data-dependent hypothesis set containing these predictors would require taking a maximum over all samples of size $m$. Other upper bounds are given in Appendix B.
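For a fixed, data-independent class, the quantity our definition generalizes can be estimated by straightforward Monte Carlo sampling of the Rademacher variables. The sketch below is our own illustration (synthetic data, arbitrary constants): it estimates the standard empirical Rademacher complexity of a norm-bounded linear class and compares it with the classical closed-form upper bound.

```python
import numpy as np

# Monte Carlo estimate of a standard empirical Rademacher complexity, the
# quantity that the data-dependent notion above generalizes. The class is the
# fixed set of linear predictors {x -> <w, x> : ||w||_2 <= Lam}.
rng = np.random.default_rng(1)
m, d, Lam = 100, 5, 1.0
X = rng.normal(size=(m, d))

n_draws = 2000
sups = np.empty(n_draws)
for t in range(n_draws):
    sigma = rng.choice([-1.0, 1.0], size=m)
    # sup_{||w|| <= Lam} (1/m) sum_i sigma_i <w, x_i>
    #   = Lam * ||sum_i sigma_i x_i||_2 / m
    sups[t] = Lam * np.linalg.norm(sigma @ X) / m
estimate = sups.mean()

# classical closed-form upper bound: Lam * sqrt(sum_i ||x_i||^2) / m
bound = Lam * np.sqrt((X ** 2).sum()) / m
print(estimate, bound)
```

The Monte Carlo estimate sits slightly below the closed-form bound, which follows from Jensen's inequality applied to the norm of the Rademacher sum.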

Let $\mathcal{G}_S$ denote the family of loss functions associated to $\mathcal{H}_S$:

$$\mathcal{G}_S = \{z = (x, y) \mapsto \ell(h(x), y) \colon h \in \mathcal{H}_S\}, \tag{6}$$

and let $\mathcal{G}$ denote the family of hypothesis sets $(\mathcal{G}_S)_S$. Our main results will be expressed in terms of $\mathfrak{R}_m(\mathcal{G})$. When the loss function is $\mu$-Lipschitz, by Talagrand's contraction lemma (Ledoux and Talagrand, 1991), in all our results, $\mathfrak{R}_m(\mathcal{G})$ can be replaced by $\mu \, \mathfrak{R}_m(\mathcal{H})$.

## 3 General learning bound for data-dependent hypothesis sets

In this section, we present general learning bounds for data-dependent hypothesis sets that do not make use of the notion of hypothesis set stability.

One straightforward idea to derive such guarantees for data-dependent
hypothesis sets is to replace the hypothesis set $\mathcal{H}_S$ depending on
the observed sample by the union of all such hypothesis sets over
all samples of size $m$. However, in general, that union can be very rich, which can
lead to uninformative learning bounds. A somewhat better alternative
consists of considering the union of all such hypothesis sets for
samples of size $m$ included in some supersample $U$ of size $m + u$,
with $u \ge 1$:
$\mathcal{H}_U = \bigcup_{S' \subseteq U, |S'| = m} \mathcal{H}_{S'}$. We will derive learning guarantees based
on the maximum *transductive Rademacher complexity* of
$\mathcal{H}_U$. There is a trade-off in the choice of $u$: smaller
values lead to less complex sets $\mathcal{H}_U$, but they also lead
to weaker dependencies on sample sizes. Our bounds are more refined
guarantees than the shattering-coefficient bounds originally given for
this problem by Gat (2001) in the case $u = m$, and later by
Cannon et al. (2002) for any $u$. They also
apply to arbitrary bounded loss functions and not just the binary
loss.
They are expressed in terms of the following notion of
*transductive Rademacher complexity for data-dependent
hypothesis sets*:

where $U$ is the supersample and $\sigma$ is a vector of $m + u$ independent centered random variables, each taking one of two values with probabilities depending on $m$ and $u$. Our notion of transductive Rademacher complexity is simpler than that of El-Yaniv and Pechyony (2007) (in the data-independent case) and leads to simpler proofs and guarantees. A by-product of our analysis is learning guarantees for standard transductive learning in terms of this notion of transductive Rademacher complexity, which can be of independent interest.

Let $\mathcal{H}$ be a family of data-dependent hypothesis sets. Then, for any $u$ with $u \ge m$ and any $\delta > 0$, the following inequality holds:

For $u = m$, the inequality becomes:

We use the following symmetrization result, which holds for any $u \ge m$ for data-dependent hypothesis sets (Lemma D, Appendix D):

To bound the right-hand side, we use an extension of McDiarmid's inequality to sampling without replacement (Cortes et al., 2008). Lemma E (Appendix E) is then used to bound the resulting quantity in terms of our notion of transductive Rademacher complexity. The full proof is given in Appendix C.

## 4 Learning bound for stable data-dependent hypothesis sets

In this section, we present generalization bounds for data-dependent hypothesis sets using the notion of Rademacher complexity defined in the previous section, as well as that of hypothesis set stability.

Let $\mathcal{H}$ be a $\beta$-stable family of data-dependent hypothesis sets with $\bar{\chi}$ average CV-stability. Let $\mathcal{G}$ be defined as in (6). Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim \mathcal{D}^m$, the following inequality holds for all $h \in \mathcal{H}_S$:

(7) |

For any two samples $S, S'$ of size $m$, define $\Phi(S, S')$ as follows:

The proof consists of applying McDiarmid's inequality to $\Phi$. The first stage consists of proving the bounded sensitivity of $\Phi$, using the $\beta$-stability of the family of hypothesis sets. The main part of the proof then consists of upper-bounding the expectation of $\Phi$ in terms of both our notion of Rademacher complexity and our notion of cross-validation stability. The full proof is given in Appendix F. The generalization bound of the theorem admits as a special case the standard Rademacher complexity bound for fixed hypothesis sets (Koltchinskii and Panchenko, 2002; Bartlett and Mendelson, 2002): in that case, we have $\mathcal{H}_S = \mathcal{H}$ for some fixed $\mathcal{H}$, thus $\mathfrak{R}_m(\mathcal{G})$ coincides with the standard Rademacher complexity; furthermore, the family of hypothesis sets is $0$-stable, thus the bound holds with $\beta = 0$. It also admits as a special case the standard uniform stability bound (Bousquet and Elisseeff, 2002): in that case, $\mathcal{H}_S$ is reduced to a singleton, $\mathcal{H}_S = \{h_S\}$, and our notion of hypothesis set stability coincides with that of uniform stability of single hypotheses; furthermore, we have $\bar{\chi} \le \beta$, since $\bar{\Delta} = 0$ for singletons. Thus, using $\bar{\chi} \le \beta$ in the right-hand side, the expression of the learning bound matches that of a uniform stability bound for single hypotheses.

## 5 Differential privacy-based bound for stable data-dependent hypothesis sets

In this section, we use recent techniques introduced in the differential privacy literature to derive improved generalization guarantees for stable data-dependent hypothesis sets (Steinke and Ullman, 2017; Bassily et al., 2016) (see also (McSherry and Talwar, 2007)). Our proofs also benefit from the recent improved stability results of Feldman and Vondrak (2018). We will make use of the following lemma due to Steinke and Ullman (2017, Lemma 1.2), which reduces the task of deriving a concentration inequality to that of upper bounding an expectation of a maximum.

Fix $k \ge 1$. Let $X$ be a random variable with probability distribution $P$ and $X_1, \ldots, X_k$ independent copies of $X$. Then, the following inequality holds:

We will also use the following result which, under a sensitivity
assumption, further reduces the task of upper-bounding the expectation
of the maximum to that of bounding a more favorable expression. The *sensitivity* of a function of a sample is the maximal absolute difference of its values on two samples differing by one point.

[(McSherry and Talwar, 2007; Bassily et al., 2016; Feldman and Vondrak, 2018)] Let $q_1, \ldots, q_K$ be scoring functions with sensitivity $\Delta$. Let $\mathcal{A}$ be the algorithm that, given a dataset $S$ and a parameter $\varepsilon > 0$, returns the index $k \in [K]$ with probability proportional to $e^{\frac{\varepsilon q_k(S)}{2 \Delta}}$. Then, $\mathcal{A}$ is $\varepsilon$-differentially private and, for any $S$, the following inequality holds:

$$\operatorname*{\mathbb{E}}\big[q_{\mathcal{A}(S)}(S)\big] \ge \max_{k \in [K]} q_k(S) - \frac{2 \Delta}{\varepsilon} \log K.$$

Notice that, if we define $q'_k = -q_k$ for all $k \in [K]$, then, by the same result, the algorithm returning the index $k$ with probability proportional to $e^{-\frac{\varepsilon q_k(S)}{2 \Delta}}$ is $\varepsilon$-differentially private and the following inequality holds for any $S$:

$$\operatorname*{\mathbb{E}}\big[q_{\mathcal{A}(S)}(S)\big] \le \min_{k \in [K]} q_k(S) + \frac{2 \Delta}{\varepsilon} \log K. \tag{8}$$
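The selection rule in the lemma above is the exponential mechanism. The following is a minimal sketch of it (the scores, sensitivity, and constants here are our own, purely illustrative choices):

```python
import numpy as np

# Sketch of the exponential mechanism: given scores q_1, ..., q_K with
# sensitivity Delta, return index k with probability proportional to
# exp(eps * q_k / (2 * Delta)).
def exponential_mechanism(scores, sensitivity, eps, rng):
    logits = eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()          # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(scores), p=p)

rng = np.random.default_rng(2)
scores = [0.1, 0.5, 3.0, 2.9]
picks = [exponential_mechanism(scores, sensitivity=0.1, eps=1.0, rng=rng)
         for _ in range(1000)]
# with eps / (2 * Delta) = 5, nearly all mass falls on the two top scores
frac_top = sum(scores[k] >= 2.9 for k in picks) / len(picks)
print(frac_top)
```

The utility guarantee of the lemma is visible here: the selected score is close to the maximum with high probability, while the randomization provides differential privacy.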

Let $\mathcal{H}$ be a $\beta$-stable family of data-dependent hypothesis sets with $\chi$ CV-stability. Let $\mathcal{G}$ be defined as in (6). Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim \mathcal{D}^m$, the following inequality holds for all $h \in \mathcal{H}_S$:

For any two samples $S, S'$ of size $m$, define $\Psi(S, S')$ as follows:

The proof consists of deriving a high-probability bound for
$\Psi$. To do so, by Lemma 5 applied to the random
variable $\Psi(S, S')$, it suffices to bound the expectation of the
maximum over $k$ independent copies, based on independent
samples of size $m$ drawn from $\mathcal{D}^m$.
To bound that expectation, we use Lemma 5 and
instead bound the expectation of the score selected by an $\varepsilon$-differentially private algorithm.
To apply Lemma 5, we first show that, for any
pair of samples, the relevant scoring function is
$\Delta$-sensitive with $\Delta = O(\beta + \frac{1}{m})$.
Lemma G helps us express our upper
bound in terms of the CV-stability coefficient $\chi$.
The full proof is given in Appendix G.
The hypothesis set stability bound of this theorem admits the same
favorable dependency on the stability parameter $\beta$ as the best
existing uniform-stability bounds, recently presented by
Feldman and Vondrak (2018). As with Theorem 4, the bound of
Theorem 5 admits as special cases both standard
Rademacher complexity bounds ($\mathcal{H}_S = \mathcal{H}$ for some fixed $\mathcal{H}$ and
$\beta = 0$) and uniform-stability bounds ($\mathcal{H}_S = \{h_S\}$). In
the latter case, our bound coincides with that of
Feldman and Vondrak (2018), modulo constants that could be chosen to be
the same for both results.¹

¹The differences in constant terms are due to slightly different choices of the parameters and a slightly different upper bound in our case, where an additional factor multiplies the stability and the diameter, while the paper of Feldman and Vondrak (2018) does not seem to have that factor.
Notice that the current bounds for standard uniform stability may not
be optimal, since no matching lower bound is known yet
(Feldman and Vondrak, 2018). It is very likely, however, that improved
techniques used for deriving more refined algorithmic stability bounds
could also be used to improve our hypothesis set stability guarantees.
In Appendix H, we give an alternative version of
Theorem 5 with a proof technique only making use of
recent methods from the differential privacy literature, including
for deriving a Rademacher complexity bound.
It might be possible to achieve a better dependency for the
term in the bound containing the Rademacher complexity. In
Appendix I, we initiate such an analysis by deriving a
finer bound on the relevant expectation.

## 6 Applications

In this section, we discuss several applications of the learning guarantees presented in the previous sections. We discuss other applications in Appendix K. As already mentioned, both the standard setting of a fixed hypothesis set not varying with $S$, that is, that of standard generalization bounds, and the uniform stability setting where $\mathcal{H}_S = \{h_S\}$, are special cases benefiting from our learning guarantees.

### 6.1 Stochastic convex optimization

Here, we consider data-dependent hypothesis sets based on stochastic convex optimization algorithms. As shown by Shalev-Shwartz et al. (2010), uniform convergence bounds do not hold for the stochastic convex optimization problem in general. As a result, the data-dependent hypothesis sets we will define cannot be analyzed using standard tools for deriving generalization bounds. However, using arguments based on our notion of hypothesis set stability, we can provide learning guarantees here.

Consider $K$ stochastic optimization algorithms $A_1, \ldots, A_K$, each returning a vector $w_k(S)$ after receiving sample $S$, $k \in [K]$. We assume that the algorithms are all $\nu$-sensitive in $\ell_2$ norm, that is, for all $k \in [K]$, we have $\|w_k(S) - w_k(S')\|_2 \le \nu$ if $S$ and $S'$ differ by one point. We will also assume that these vectors are bounded by some $R > 0$, that is, $\|w_k(S)\|_2 \le R$ for all $k \in [K]$. This can be shown to be the case, for example, for algorithms based on empirical risk minimization with a strongly convex regularization term, with $\nu = O(\frac{1}{m})$ (Shalev-Shwartz et al., 2010).

Assume that the loss $\ell$ is $\mu$-Lipschitz with respect to its first argument. Let the data-dependent hypothesis set $\mathcal{H}_S$ be defined as follows:

where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ is in the simplex of distributions and $B(w, r)$ denotes the ball of radius $r$ around $w$. We choose the radius $r$ of the order of the sensitivity $\nu$. A natural choice for $\boldsymbol{\alpha}$ would be the uniform mixture.

Since the loss function is $\mu$-Lipschitz, the family of hypothesis sets is $\mu \nu$-stable. Additionally, for any $\boldsymbol{\alpha}, \boldsymbol{\alpha}'$ in the simplex and any $x$, we have

where $\|W\|$ is the subordinate norm of the matrix $W = [w_1(S), \ldots, w_K(S)]$. Thus, the average diameter admits a corresponding upper bound. In view of that, by Theorem 5, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathcal{H}_S$:

The second stage of an algorithm in this context consists of choosing $\boldsymbol{\alpha}$, potentially using a non-stable algorithm. This application illustrates both the use of our diameter-based learning bounds and their applicability even in the absence of uniform convergence bounds.
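As a concrete illustration of why the diameter can be small here, the sketch below (our own hypothetical setup, not from the paper: $K$ instances of strongly convex regularized ERM, each trained on its own chunk of the sample) computes the maximal pairwise distance between the returned weight vectors, which controls the diameter term for a Lipschitz loss.

```python
import numpy as np

# Illustrative sketch (setup and constants are ours): K regularized ERM
# algorithms, each trained on its own chunk of S, return weight vectors
# w_1, ..., w_K. For a mu-Lipschitz loss, the diameter of the resulting
# mixture hypothesis set is controlled by max_{j,k} ||w_j - w_k||.
rng = np.random.default_rng(3)
m, d, K, lam = 300, 4, 3, 1.0
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ w_true

def reg_erm(X, y, lam):
    # minimizer of (1/n) ||Xw - y||^2 + lam ||w||^2 (lam-strongly convex)
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n + lam * np.eye(X.shape[1]), X.T @ y / n)

chunks = np.array_split(np.arange(m), K)
W = [reg_erm(X[idx], y[idx], lam) for idx in chunks]

diam = max(np.linalg.norm(W[j] - W[k]) for j in range(K) for k in range(K))
print(diam)
```

Because all $K$ solutions converge to the same regularized population minimizer, the computed pairwise diameter shrinks as each chunk grows, even though nothing is assumed about the stability of the second-stage choice of the mixture weights.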

### 6.2 $\nu$-sensitive feature mappings

Consider the scenario where the training sample $S$ is used to learn a non-linear feature mapping $\Phi_S$ that is $\nu$-sensitive for some $\nu > 0$, that is, such that $\|\Phi_S(x) - \Phi_{S'}(x)\|_2 \le \nu$ for all $x$, for any sample $S'$ differing from $S$ by one point. $\Phi_S$ may be the feature mapping corresponding to some positive definite symmetric kernel or a mapping defined by the top layer of an artificial neural network trained on $S$, with a stability property. The second stage may consist of selecting a hypothesis out of the family of linear hypotheses based on $\Phi_S$:

Assume that the loss function is $\mu$-Lipschitz with respect to its first argument. Then, for any hypothesis $h = w \cdot \Phi_S$ and any sample $S'$ differing from $S$ by one element, the hypothesis $h' = w \cdot \Phi_{S'}$ admits losses that are $\beta$-close to those of $h$, with $\beta = \mu \Lambda \nu$, since, for all $x$, by the Cauchy-Schwarz inequality, the following inequality holds:

Thus, the family of hypothesis sets is uniformly $\beta$-stable with $\beta = \mu \Lambda \nu$. In view of that, by Theorem 4, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim \mathcal{D}^m$, the following inequality holds for any $h \in \mathcal{H}_S$:

(9) |

Notice that this bound applies even when the second stage of an algorithm, which consists of selecting a hypothesis in $\mathcal{H}_S$, is not stable. A standard uniform stability guarantee cannot be used in that case. The setting described here can be straightforwardly extended to other norms, both for the definition of sensitivity and for the norm used in the definition of $\mathcal{H}_S$.
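The Cauchy-Schwarz step above can be checked numerically. In the sketch below, the feature maps are synthetic stand-ins and the dimensions and constants are our own, but the inequality it verifies is exactly the one used to obtain $\beta = \mu \Lambda \nu$.

```python
import numpy as np

# Numerical check: if ||Phi_S(x) - Phi_S'(x)|| <= nu and ||w|| <= Lam, then
# the predictions differ by at most Lam * nu (Cauchy-Schwarz), hence the
# losses of a mu-Lipschitz loss differ by at most mu * Lam * nu.
rng = np.random.default_rng(4)
d, Lam, nu, mu = 8, 2.0, 0.05, 1.0

phi = rng.normal(size=d)                          # Phi_S(x) for some fixed x
delta = rng.normal(size=d)
phi2 = phi + nu * delta / np.linalg.norm(delta)   # Phi_S'(x), at distance nu

gap = 0.0
for _ in range(200):
    w = rng.normal(size=d)
    w = Lam * w / np.linalg.norm(w)               # on the sphere of radius Lam
    gap = max(gap, abs(w @ phi - w @ phi2))
print(gap, mu * Lam * nu)
```

No choice of $w$ on the radius-$\Lambda$ sphere can make the prediction gap exceed $\Lambda \nu$, which is the content of the stability claim.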

### 6.3 Distillation

Here, we consider *distillation algorithms* which, in the first
stage, train a very complex model on the labeled sample. Let
$f_S$ denote the resulting predictor for a
training sample $S$ of size $m$. We will assume that the training
algorithm is $\nu$-sensitive, that is,
$\|f_S - f_{S'}\|_\infty \le \nu$ for $S$ and $S'$
differing by one point.

In the second stage, a distillation algorithm selects a hypothesis that is close to $f_S$ from a less complex family of predictors $\mathcal{H}$. This defines the following sample-dependent hypothesis set:

Assume that the loss is $\mu$-Lipschitz with respect to its first argument and that $\mathcal{H}$ is a subset of a vector space. Let $S$ and $S'$ be two samples differing by one point. Note that $f_{S'}$ may not be in $\mathcal{H}_S$, but we will assume that $h + f_{S'} - f_S$ is in $\mathcal{H}$ for any $h \in \mathcal{H}_S$. Let $h$ be in $\mathcal{H}_S$; then the hypothesis $h' = h + f_{S'} - f_S$ is in $\mathcal{H}_{S'}$, since $\|h' - f_{S'}\|_\infty = \|h - f_S\|_\infty$. Figure 2 illustrates the hypothesis sets. By the $\mu$-Lipschitzness of the loss, for any $z = (x, y)$, $|L(h, z) - L(h', z)| \le \mu \|f_S - f_{S'}\|_\infty \le \mu \nu$. Thus, the family of hypothesis sets is $\mu \nu$-stable.

In view of that, by Theorem 4, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim \mathcal{D}^m$, the following inequality holds for any $h \in \mathcal{H}_S$:

Notice that a standard uniform-stability argument would not necessarily apply here, since $\mathcal{H}$ could be relatively complex and the second stage is not necessarily stable.

### 6.4 Bagging

*Bagging* (Breiman, 1996) is a prominent ensemble method used
to improve the stability of learning algorithms.
It consists of generating $n$ new samples $S_1, \ldots, S_n$,
each of size $k$, by sampling uniformly with replacement from the
original sample $S$ of size $m$. An algorithm $A$ is then trained on
each of these samples to generate predictors $h_{S_1}, \ldots, h_{S_n}$.
In regression, the predictors are combined by taking a
convex combination $\sum_{i=1}^n \alpha_i h_{S_i}$. Here, we analyze a
common instance of bagging to illustrate the application of our
learning guarantees: we will assume a regression setting and a uniform
sampling from $S$ *without replacement*.² We will also assume that the loss function is
$\mu$-Lipschitz in the predictions, that the predictions are in a bounded
range, and that all the mixing weights are bounded by
$c/n$ for some constant $c \ge 1$, in order to ensure that no
subsample is overly influential in the final regressor (in
practice, a uniform mixture is typically used in bagging).

²Sampling without replacement is only adopted to make the analysis more concise; its extension to sampling with replacement is straightforward.

To analyze bagging in this setup, we cast it in our framework. First,
to deal with the randomness in choosing the subsamples, we can
equivalently imagine the process as choosing *indices* in $[m]$
to form the subsamples, rather than points in $S$; then, once
$S$ is drawn, the subsamples are generated by filling in the points at
the corresponding indices. Thus, for any index $i \in [m]$, the chance
that it is picked in any given subsample is $k/m$. Thus, by
Chernoff's bound, with probability at least $1 - \delta$, no index in
$[m]$ appears in more than $O\big(\frac{nk}{m} + \log \frac{m}{\delta}\big)$
subsamples. In the following, we condition on the random seed of the
bagging algorithm so that this is indeed the case, and later use a
union bound to control the chance that the chosen random seed does not
satisfy this property, as elucidated in
Section J.2.
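The counting argument above is easy to simulate. The sketch below (the sizes are our own illustrative choices) draws $n$ subsamples of size $k$ without replacement and records the maximal multiplicity of any index.

```python
import numpy as np

# Simulation of the counting argument: n subsamples of size k drawn uniformly
# without replacement from [m]; each index is picked in a given subsample with
# probability k/m, so its expected multiplicity across subsamples is n*k/m.
rng = np.random.default_rng(5)
m, k, n = 1000, 100, 200

counts = np.zeros(m, dtype=int)
for _ in range(n):
    idx = rng.choice(m, size=k, replace=False)
    counts[idx] += 1

print(counts.mean(), counts.max())   # mean is exactly n * k / m
```

The maximal multiplicity stays within a small constant factor of the mean $nk/m$, which is the event we condition on in the analysis.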

Define the data-dependent family of hypothesis sets as $\mathcal{H}_S = \{\sum_{i=1}^n \alpha_i h_{S_i} \colon \boldsymbol{\alpha} \in \Delta_{n, c}\}$, where $\Delta_{n, c}$ denotes the simplex of distributions over $n$ items with all weights bounded by $c/n$. Next, we give upper bounds on the hypothesis set stability and the Rademacher complexity of $\mathcal{H}$. Assume that algorithm $A$ admits uniform stability $\beta$ (Bousquet and Elisseeff, 2002), i.e., for any two samples $T$ and $T'$ of size $k$ that differ in exactly one data point and for all $z$, we have $|L(h_T, z) - L(h_{T'}, z)| \le \beta$. Now, let $S$ and $S'$ be two samples of size $m$ differing by one point at the same index: $z_i$ in $S$, replaced by $z'_i$ in $S'$. Then, consider the subsamples $S'_1, \ldots, S'_n$ of $S'$ which are obtained from the $S_j$'s by copying over all the elements except $z_i$, and replacing all instances of $z_i$ by $z'_i$. For any $j \in [n]$, if $z_i \notin S_j$, then $S'_j = S_j$ and, if $z_i \in S_j$, then $|L(h_{S_j}, z) - L(h_{S'_j}, z)| \le \beta$ for any $z$. We can now bound the hypothesis set uniform stability as follows: since $\ell$ is $\mu$-Lipschitz in the prediction, for any $\boldsymbol{\alpha} \in \Delta_{n, c}$ and any $z$, we have

Bounding the Rademacher complexity for $\mathcal{H}$ is non-trivial. Instead, we can derive a reasonable upper bound by analyzing the Rademacher complexity of a larger function class. Specifically, for any $x$, define the $n$-dimensional vector $\mathbf{h}(x) = (h_{S_1}(x), \ldots, h_{S_n}(x))$. Then consider the class of functions $\mathcal{F} = \{x \mapsto \boldsymbol{\alpha} \cdot \mathbf{h}(x) \colon \boldsymbol{\alpha} \in \Delta_{n, c}\}$. Clearly, $\mathcal{H}_S \subseteq \mathcal{F}$. Since $\|\boldsymbol{\alpha}\|_1 = 1$, a standard Rademacher complexity bound (see Theorem 11.15 in (Mohri et al., 2018)) implies that the empirical Rademacher complexity of $\mathcal{F}$ is in $O\big(\sqrt{\frac{\log n}{m}}\big)$. Thus, by Talagrand's inequality, we conclude that the corresponding loss class has Rademacher complexity in $O\big(\mu \sqrt{\frac{\log n}{m}}\big)$. In view of that, by Theorem 5, for any $\delta > 0$, with probability at least $1 - \delta$ over the draws of a sample $S \sim \mathcal{D}^m$ and the randomness in the bagging algorithm, the following inequality holds for any $h \in \mathcal{H}_S$:

For appropriate choices of $n$ and $k$, the generalization gap goes to $0$ as
$m \to \infty$, *regardless* of the stability of $A$. This
gives a new generalization guarantee for bagging, similar (but incomparable) to the one derived by Elisseeff et al. (2005). Note, however, that unlike their bound, our bound allows for non-uniform averaging schemes.

As an aside, we note that the same analysis can be carried over to the stochastic convex optimization setting of Section 6.1, by setting $A$ to be a stochastic convex optimization algorithm which outputs a weight vector. This yields generalization bounds for aggregating over a larger set of mixing weights, albeit with the restriction that each algorithm uses only a small part of $S$.

## 7 Conclusion

We presented a broad study of generalization with data-dependent hypothesis sets.
