Hypothesis Set Stability and Generalization

04/09/2019 ∙ by Dylan J. Foster, et al. ∙ MIT University of Southern California NYU college cornell university 0

We present an extensive study of generalization for data-dependent hypothesis sets. We give a general learning guarantee for data-dependent hypothesis sets based on a notion of transductive Rademacher complexity. Our main results are two generalization bounds for data-dependent hypothesis sets expressed in terms of a notion of hypothesis set stability and a notion of Rademacher complexity for data-dependent hypothesis sets that we introduce. These bounds admit as special cases both standard Rademacher complexity bounds and algorithm-dependent uniform stability bounds. We also illustrate the use of these learning bounds in the analysis of several scenarios.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Most generalization bounds in learning theory hold for a fixed hypothesis set, selected before receiving a sample. This includes learning bounds based on covering numbers, VC-dimension, pseudo-dimension, Rademacher complexity, local Rademacher complexity, and other complexity measures (Pollard, 1984; Zhang, 2002; Vapnik, 1998; Koltchinskii and Panchenko, 2002; Bartlett et al., 2002). Some alternative guarantees have also been derived for specific algorithms. Among them, the most general family is that of uniform stability bounds given by Bousquet and Elisseeff (2002). These bounds were recently significantly improved by Feldman and Vondrak (2018), who proved guarantees that are informative, even when the stability parameter is only in , as opposed to . New bounds for a restricted class of algorithms were also recently presented by Maurer (2017)

, under a number of assumptions on the smoothness of the loss function. Appendix 

A gives more background on stability.

In practice, machine learning engineers commonly resort to hypothesis sets depending on the

same sample

as the one used for training. This includes instances where a regularization, a feature transformation, or a data normalization is selected using the training sample, or other instances where the family of predictors is restricted to a smaller class based on the sample received. In other instances, as is common in deep learning, the data representation and the predictor are learned using the same sample. In ensemble learning, the sample used to train models sometimes coincides with the one used to determine their aggregation weights. However, standard generalization bounds cannot be used to provide guarantees for these scenarios since they assume a fixed hypothesis set.

This paper studies generalization in a broad setting that admits as special cases both that of standard learning bounds for fixed hypothesis sets based on some complexity measure, and that of algorithm-dependent uniform stability bounds. We present an extensive study of generalization for sample-dependent hypothesis sets, that is for learning with a hypothesis set selected after receiving the training sample . This defines two stages for the learning algorithm: a first stage where is chosen after receiving , and a second stage where a hypothesis is selected from . Standard generalization bounds correspond to the case where is equal to some fixed independent of . Algorithm-dependent analyses, such as uniform stability bounds, coincide with the case where is chosen to be a singleton . Thus, the scenario we study covers both existing settings and, additionally, includes many other intermediate scenarios. Figure 1 illustrates our general scenario.

We present a series of results for generalization with data-dependent hypothesis sets. We first present general learning bounds for data-dependent hypothesis sets using a notion of transductive Rademacher complexity (Section 3). These bounds hold for arbitrary bounded losses and improve upon previous guarantees given by Gat (2001) and Cannon et al. (2002) for the binary loss, which were expressed in terms of a notion of shattering coefficient adapted to the data-dependent case, and are more explicit than the guarantees presented by Philips (2005)[corollary 4.6 or theorem 4.7]. Nevertheless, such bounds may often not be sufficiently informative, since they ignore the relationship between hypothesis sets based on similar samples.

Figure 1: Decomposition of the learning algorithm’s hypothesis selection into two stages. In the first stage, the algorithm determines a hypothesis associated to the training sample which may be a small subset of the set of all hypotheses that could be considered, say . The second stage then consists of selecting a hypothesis out of .

To derive a finer analysis, we introduce a key notion of hypothesis set stability, which admits algorithmic stability as a special case, when the hypotheses sets are reduced to singletons. We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. Our main results are two generalization bounds for stable data-dependent hypothesis sets, both expressed in terms of the hypothesis set stability parameter, our notion of Rademacher complexity, and a notion of cross-validation stability that, in turn, can be upper-bounded by the diameter of the family of hypothesis sets. Our first learning bound (Section 4) is expressed in terms of a finer notion of diameter but admits a dependency in terms of the stability parameter similar to that of uniform stability bounds of Bousquet and Elisseeff (2002). In Section 5, we use proof techniques from the differential privacy literature (Steinke and Ullman, 2017; Bassily et al., 2016; Feldman and Vondrak, 2018) to derive a learning bound expressed in terms of a somewhat coarser definition of diameter but with a more favorable dependency on , matching the dependency of the recent more favorable bounds of Feldman and Vondrak (2018). Our learning bounds admit as special cases both standard Rademacher complexity bounds and algorithm-dependent uniform stability bounds.

Shawe-Taylor et al. (1998) presented an analysis of structural risk minimization over data-dependent hierarchies based on a concept of luckiness

, which generalizes the notion of margin of linear classifiers. Their analysis can be viewed as an alternative study of data-dependent hypothesis sets, using luckiness functions and

-smallness (or -smoothness) conditions. A luckiness function helps decompose a hypothesis set into lucky sets, that is sets of functions luckier than a given function. The -smallness condition requires that the size of the family of loss functions corresponding to the lucky set of any function

with respect to a double-sample, measured by packing or covering numbers, be bounded with high probability by a function

of the luckiness of on the sample. The luckiness framework is attractive and the notion of luckiness, for example margin, can in fact be combined with our results. However, finding pairs of truly data-dependent luckiness and -smallness functions, other than those based on the margin and the empirical VC-dimension, is quite difficult, in particular because of the very technical -smallness condition (see Philips, 2005, p. 70). In contrast, our hypothesis set stability is simpler and often easier to bound. The notions of luckiness and -smallness have also been used by Herbrich and Williamson (2002) to derive algorithm-specific guarantees. The authors show a connection with algorithmic stability (not hypothesis set stability), at the price of a guarantee requiring the strong condition that the stability parameter be in , where is the sample size (see Herbrich and Williamson, 2002, pp. 189-190).

In section 6, we illustrate the generality and the benefits of our hypothesis set stability learning bounds by applying them to the analysis of several scenarios (see also Appendix K). In Appendix J, we briefly discuss several extensions of our framework and results, including the extension to almost everywhere hypothesis set stability as in (Kutin and Niyogi, 2002). The next section introduces the definitions and properties used in our analysis.

2 Definitions and Properties

Let be the input space and the output space. We denote by the unknown distribution over according to which samples are drawn.

The hypotheses we consider map to a set sometimes different from . For example, in binary classification, we may have and . Thus, we denote by a loss function defined on and taking non-negative real values bounded by one. We denote the loss of a hypothesis at point by . We denote by the generalization error or expected loss of a hypothesis and by its empirical loss over a sample :

In the general framework we consider, a hypothesis set depends on the sample received. We will denote by the hypothesis set depending on the labeled sample of size .

[Hypothesis set uniform stability] Fix . We will say that a family of data-dependent hypothesis sets is -uniformly stable (or simply -stable) for some , if for any two samples and of size differing only by one point, the following holds:

(1)

Thus, two hypothesis sets derived from samples differing by one element are close in the sense that any hypothesis in one admits a counterpart in the other set with -similar losses.

Next, we define a notion of cross-validation stability for data-dependent hypothesis sets. The notion measures the maximal change in loss of a hypothesis on a training example and the loss of a hypothesis on the same training example, when the hypothesis is chosen from the hypothesis set corresponding to the a sample where the training example in question is replaced by a newly sampled example. [Hypothesis set Cross-Validation (CV) stability] Fix . We will say that a family of data-dependent hypothesis sets has CV-stability for some , if the following holds (here, denotes the sample obtained by replacing by ):

(2)

We say that has average CV-stability for some if the following holds:

(3)

We also define a notion of diameter of data-dependent hypothesis sets, which is useful in bounding CV-stability. In applications, we will typically bound the diameter, thereby the CV-stability. [Diameter of data-dependent hypothesis sets] Fix . We define the diameter and average diameter of a family of data-dependent hypothesis sets by

(4)

Notice that, for consistent hypothesis sets, the diameter is reduced to zero since for any and . As mentioned earlier, CV-stability of hypothesis sets can be bounded in terms of their stability and diameter: A family of data-dependent hypothesis sets with -uniform stability, diameter , and average diameter has -CV-stability and -average CV-stability. Let , , and . For any and , by the -uniform stability of , there exists such that . Thus,

This implies the inequality

and the lemma follows.

We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. To introduce its definition, for any two samples

and a vector of Rademacher variables

, denote by the sample derived from by replacing its th element with the th element of , for all with . We will use to denote the hypothesis set .

[Rademacher complexity of data-dependent hypothesis sets] Fix . The empirical Rademacher complexity and the Rademacher complexity of a family of data-dependent hypothesis sets for two samples and in are defined by

(5)

When the family of data-dependent hypothesis sets is -stable with , the empirical Rademacher complexity is sharply concentrated around its expectation , as with the standard empirical Rademacher complexity (see Lemma B.2).

Let denote the union of all hypothesis sets based on subsamples of of size : . Since for any , we have , the following simpler upper bound in terms of the standard empirical Rademacher complexity of can be used for our notion of empirical Rademacher complexity:

where is the standard empirical Rademacher complexity of for the sample .

The Rademacher complexity of data-dependent hypothesis sets can be bounded by , as indicated previously. It can also be bounded directly, as illustrated by the following example of data-dependent hypothesis sets of linear predictors. For any sample , define the hypothesis set as follows:

where . Define and as follows: and . Then, it can be shown that the empirical Rademacher complexity of the family of data-dependent hypothesis sets can be upper-bounded as follows (Lemma B.1):

Notice that the bound on the Rademacher complexity is non-trivial since it depends on the samples and , while a standard Rademacher complexity for non-data-dependent hypothesis set containing would require taking a maximum over all samples of size . Other upper bounds are given in Appendix B.

Let denote the family of loss functions associated to :

(6)

and let denote the family of hypothesis sets . Our main results will be expressed in terms of . When the loss function is -Lipschitz, by Talagrand’s contraction lemma (Ledoux and Talagrand, 1991), in all our results, can be replaced by .

3 General learning bound for data-dependent hypothesis sets

In this section, we present general learning bounds for data-dependent hypothesis sets that do not make use of the notion of hypothesis set stability.

One straightforward idea to derive such guarantees for data-dependent hypothesis sets is to replace the hypothesis set depending on the observed sample by the union of all such hypothesis sets over all samples of size , . However, in general, can be very rich, which can lead to uninformative learning bounds. A somewhat better alternative consists of considering the union of all such hypothesis sets for samples of size included in some supersample of size , with , . We will derive learning guarantees based on the maximum transductive Rademacher complexity of . There is a trade-off in the choice of : smaller values lead to less complex sets , but they also lead to weaker dependencies on sample sizes. Our bounds are more refined guarantees than the shattering-coefficient bounds originally given for this problem by Gat (2001) in the case , and later by Cannon et al. (2002) for any . They also apply to arbitrary bounded loss functions and not just the binary loss. They are expressed in terms of the following notion of transductive Rademacher complexity for data-dependent hypothesis sets:

where and where is a vector of

independent random variables taking value

with probability , and with probability . Our notion of transductive Rademacher complexity is simpler than that of El-Yaniv and Pechyony (2007) (in the data-independent case) and leads to simpler proofs and guarantees. A by-product of our analysis is learning guarantees for standard transductive learning in terms of this notion of transductive Rademacher complexity, which can be of independent interest.

Let be a family of data-dependent hypothesis sets. Then, for any with and any , the following inequality holds:

where . For , the inequality becomes:

We use the following symmetrization result, which holds for any with for data-dependent hypothesis sets (Lemma D, Appendix D):

To bound the right-hand side, we use an extension of McDiarmid’s inequality to sampling without replacement (Cortes et al., 2008) applied to . Lemma E (Appendix E) is then used to bound in terms of our notion of transductive Rademacher complexity. The full proof is given in Appendix C.

4 Learning bound for stable data-dependent hypothesis sets

In this section, we present generalization bounds for data-dependent hypothesis sets using the notion of Rademacher complexity defined in the previous section, as well as that of hypothesis set stability.

Let be a -stable family of data-dependent hypothesis sets with average CV-stability. Let be defined as in (6). Then, for any , with probability at least over the draw of a sample , the following inequality holds for all :

(7)

For any two samples , define as follows:

The proof consists of applying McDiarmid’s inequality to . The first stage consists of proving the -sensitivity of , with . The main part of the proof then consists of upper bounding the expectation in terms of both our notion of Rademacher complexity, and in terms of our notion of cross-validation stability. The full proof is given in Appendix F. The generalization bound of the theorem admits as a special case the standard Rademacher complexity bound for fixed hypothesis sets (Koltchinskii and Panchenko, 2002; Bartlett and Mendelson, 2002): in that case, we have for some , thus coincides with the standard Rademacher complexity ; furthermore, the family of hypothesis sets is -stable, thus the bound holds with . It also admits as a special case the standard uniform stability bound (Bousquet and Elisseeff, 2002): in that case, is reduced to a singleton, , and our notion of hypothesis set stability coincides with that of uniform stability of single hypotheses; furthermore, we have , since . Thus, using in the right-hand side inequality, the expression of the learning bound matches that of a uniform stability bound for single hypotheses.

5 Differential privacy-based bound for stable data-dependent hypothesis sets

In this section, we use recent techniques introduced in the differential privacy literature to derive improved generalization guarantees for stable data-dependent hypothesis sets (Steinke and Ullman, 2017; Bassily et al., 2016) (see also (McSherry and Talwar, 2007)). Our proofs also benefit from the recent improved stability results of Feldman and Vondrak (2018). We will make use of the following lemma due to Steinke and Ullman (2017, Lemma 1.2), which reduces the task of deriving a concentration inequality to that of upper bounding an expectation of a maximum.

Fix . Let

be a random variable with probability distribution

and independent copies of . Then, the following inequality holds:

We will also use the following result which, under a sensitivity assumption, further reduces the task of upper bounding the expectation of the maximum to that of bounding a more favorable expression. The sensitivity of a function is .

[(McSherry and Talwar, 2007; Bassily et al., 2016; Feldman and Vondrak, 2018)] Let be scoring functions with sensitivity . Let be the algorithm that, given a dataset and a parameter , returns the index with probability proportional to . Then, is -differentially private and, for any , the following inequality holds:

Notice that, if we define , then, by the same result, the algorithm returning the index with probability proportional to is -differentially private and the following inequality holds for any :

(8)

Let be a -stable family of data-dependent hypothesis sets with CV-stability. Let be defined as in (6). Then, for any , with probability at least over the draw of a sample , the following inequality holds for all :

For any two samples of size , define as follows:

The proof consists of deriving a high-probability bound for . To do so, by Lemma 5 applied to the random variable , it suffices to bound , where with , , independent samples of size drawn from . To bound that expectation, we use Lemma 5 and instead bound , where is an -differentially private algorithm. To apply Lemma 5, we first show that, for any , the function is -sensitive with . Lemma G helps us express our upper bound in terms of the CV stability coefficient . The full proof is given in Appendix G. The hypothesis set-stability bound of this theorem admits the same favorable dependency on the stability parameter as the best existing bounds for uniform-stability recently presented by Feldman and Vondrak (2018). As with Theorem 4, the bound of Theorem 5 admits as special cases both standard Rademacher complexity bounds ( for some fixed and ) and uniform-stability bounds (). In the latter case, our bound coincides with that of Feldman and Vondrak (2018) modulo constants that could be chosen to be the same for both results.111The differences in constant terms are due to slightly difference choices of the parameters and a slightly different upper bound in our case where multiplies the stability and the diameter, while the paper of Feldman and Vondrak (2018) does not seem to have that factor. Notice that the current bounds for standard uniform stability may not be optimal since no matching lower bound is known yet (Feldman and Vondrak, 2018). It is very likely, however, that improved techniques used for deriving more refined algorithmic stability bounds could also be used to improve our hypothesis set stability guarantees. In Appendix H, we give an alternative version of Theorem 5 with a proof technique only making use of recent methods from the differential privacy literature, including to derive a Rademacher complexity bound. It might be possible to achieve a better dependency on for the term in the bound containing the Rademacher complexity. In Appendix I, we initiate such an analysis by deriving a finer analysis on the expectation .

6 Applications

In this section, we discuss several applications of the learning guarantees presented in the previous sections. We discuss other applications in Appendix K. As already mentioned, both the standard setting of a fixed hypothesis set not varying with , that is that of standard generalization bounds, and the uniform stability setting where , are special cases benefitting from our learning guarantees.

6.1 Stochastic convex optimization

Here, we consider data-dependent hypothesis sets based on stochastic convex optimization algorithms. As shown by Shalev-Shwartz et al. (2010), uniform convergence bounds do not hold for the stochastic convex optimization problem in general. As a result, the data-dependent hypothesis sets we will define cannot be analyzed using standard tools for deriving generalization bounds. However, using arguments based on our notion of hypothesis set stability, we can provide learning guarantees here.

Consider stochastic optimization algorithms , each returning vector , after receiving sample , . We assume that the algorithms are all -sensitive in norm, that is, for all , we have if and differ by one point. We will also assume that these vectors are bounded by some that is , for all . This can be shown to be the case, for example, for algorithms based on empirical risk minimization with a strongly convex regularization term with (Shalev-Shwartz et al., 2010).

Assume that the loss is -Lipschitz with respect to its first argument . Let the data-dependent hypothesis set be defined as follows:

where is in the simplex of distributions and is the ball of radius around . We choose . A natural choice for would be the uniform mixture.

Since the loss function is -Lipschitz, the family of hypotheses is -stable. Additionally, for any and any , we have

where is the subordinate norm of matrix defined by . Thus, the average diameter admits the following upper bound: . In view of that, by Theorem 5, for any , with probability at least , the following holds for all :

The second stage of an algorithm in this context consists of choosing , potentially using a non-stable algorithm. This application both illustrates the use of our learning bounds using the diameter and its application even in the absence of uniform convergence bounds.

6.2 -sensitive feature mappings

Consider the scenario where the training sample is used to learn a non-linear feature mapping that is -sensitive for some .

may be the feature mapping corresponding to some positive definite symmetric kernel or a mapping defined by the top layer of an artificial neural network trained on

, with a stability property.

The second stage may consist of selecting a hypothesis out of the family of linear hypotheses based on :

Assume that the loss function is -Lipschitz with respect to its first argument. Then, for any hypothesis and any sample differing from by one element, the hypothesis admits losses that are -close to those of , with , since, for all , by the Cauchy-Schwarz inequality, the following inequality holds:

Thus, the family of hypothesis set is uniformly -stable with . In view that, by Theorem 4, for any , with probability at least over the draw of a sample , the following inequality holds for any :

(9)

Notice that this bound applies even when the second stage of an algorithm, which consists of selecting a hypothesis in , is not stable. A standard uniform stability guarantee cannot be used in that case. The setting described here can be straightforwardly extended to the case of other norms for the definition of sensitivity and that of the norm used in the definition of .

6.3 Distillation

Here, we consider distillation algorithms which, in the first stage, train a very complex model on the labeled sample. Let denote the resulting predictor for a training sample of size . We will assume that the training algorithm is -sensitive, that is for and differing by one point.

Figure 2: Illustration of the distillation hypothesis sets. Notice that the diameter of a hypothesis set may be large here.

In the second stage, a distillation algorithm selects a hypothesis that is -close to from a less complex family of predictors . This defines the following sample-dependent hypothesis set:

Assume that the loss is -Lipschitz with respect to its first argument and that is a subset of a vector space. Let and be two samples differing by one point. Note, may not be in , but we will assume that is in . Let be in , then the hypothesis is in since . Figure 2 illustrates the hypothesis sets. By the -Lipschitzness of the loss, for any , . Thus, the family of hypothesis sets is -stable.

In view that, by Theorem 4, for any , with probability at least over the draw of a sample , the following inequality holds for any :

Notice that a standard uniform-stability argument would not necessarily apply here since could be relatively complex and the second stage not necessarily stable.

6.4 Bagging

Bagging (Breiman, 1996) is a prominent ensemble method used to improve the stability of learning algorithms. It consists of generating new samples , each of size , by sampling uniformly with replacement from the original sample of size . An algorithm is then trained on each of these samples to generate predictors , . In regression, the predictors are combined by taking a convex combination . Here, we analyze a common instance of bagging to illustrate the application of our learning guarantees: we will assume a regression setting and a uniform sampling from without replacement.222Sampling without replacement is only adopted to make the analysis more concise; its extension to sampling with replacement is straightforward. We will also assume that the loss function is -Lipschitz in the predictions and that the predictions are in the range , and all the mixing weights are bounded by for some constant , in order to ensure that no subsample is overly influential in the final regressor (in practice, a uniform mixture is typically used in bagging).

To analyze bagging in this setup, we cast it in our framework. First, to deal with the randomness in choosing the subsamples, we can equivalently imagine the process as choosing indices in to form the subsamples rather than samples in , and then once is drawn, the subsamples are generated by filling in the samples at the corresponding indexes. Thus, for any index , the chance that it is picked in any subsample is . Thus, by Chernoff’s bound, with probability at least , no index in appears in more than subsamples. In the following, we condition on the random seed of the bagging algorithm so that this is indeed the case, and later use a union bound to control the chance that the chosen random seed does not satisfy this property, as elucidated in section J.2.

Define the data-dependent family of hypothesis sets as , where denotes the simplex of distributions over items with all weights . Next, we give upper bounds on the hypothesis set stability and the Rademacher complexity of . Assume that algorithm admits uniform stability (Bousquet and Elisseeff, 2002), i.e. for any two samples and of size that differ in exactly one data point and for all , we have . Now, let and be two samples of size differing by one point at the same index, and . Then, consider the subsets of which are obtained from the ’s by copying over all the elements except , and replacing all instances of by . For any , if , then and, if , then for any . We can bound now the hypothesis set uniform stability as follows: since is -Lipschitz in the prediction, for any , and any we have

Bounding the Rademacher complexity for is non-trivial. Instead, we can derive a reasonable upper bound by analyzing the Rademacher complexity of a larger function class. Specifically, for any , define the dimensional vector . Then the class of functions is . Clearly . Since , a standard Rademacher complexity bound (see Theorem 11.15 in (Mohri et al., 2018)) implies . Thus, by Talagrand’s inequality, we conclude that . In view of that, by Theorem 5, for any , with probability at least over the draws of a sample and the randomness in the bagging algorithm, the following inequality holds for any :

For and , the generalization gap goes to as , regardless of the stability of . This gives a new generalization guarantee for bagging, similar (but incomparable) to the one derived by Elisseeff et al. (2005). Note however that unlike their bound, our bound allows for non-uniform averaging schemes.

As an aside, we note that the same analysis can be carried over to the stochastic convex optimization setting of section 6.1, by setting to be a stochastic convex optimization algorithm which outputs a weight vector . This yields generalization bounds for aggregating over a larger set of mixing weights, albeit with the restriction that each algorithm uses only a small part of .

7 Conclusion

We presented a broad study of generalization with