## 1 Introduction

Generalization in classical machine learning models is often understood in terms of the bias-variance trade-off: over-parameterized models fit the data better but are more likely to overfit. However, the push for ever larger deep network models, driven by their remarkable empirical generalization, has spurred a race for the development of new theories of generalization with two main objectives: (1) guiding the practitioners in designing over-parameterized architectures and training schemes that would generalize better and (2) providing provable bounds on the performance of the model after deployment.

In this context, information theoretic bounds on generalization [22, 25, 3, 17, 24, 8, 12] are particularly appealing. They capture the intuitive idea that models that memorize less about the training set will generalize better. In particular, since they deal with the amount of information stored in the weights rather than the number of weights, they are especially well suited to over-parameterized models [13, 2, 1]. However, the bounds are often vacuous in realistic settings, require modified training algorithms, or are black-box bounds that are difficult to compute or to relate to relevant properties of the network and the training algorithm. To remedy this problem, we ask whether it is possible to develop a bound that is: (a) realistically tight for standard training algorithms used in deep learning; (b) interpretable, meaning that it can be rewritten in terms of quantities that practitioners can control during training and that may guide the design of better training algorithms; and (c) applicable out-of-the-box to standard training algorithms.

#### Contributions.

As a step in this direction, we introduce a new information theoretic bound based on leave-one-out conditional mutual information (loo-CMI). The main intuition behind the bound is to ask the counterfactual question: “If we were to remove one sample at random from the training set and retrain, how well would we be able to infer which sample was removed?” As we shall see, bounding this quantity is enough to sharply bound the amount of memorization in a deep network, and hence its generalization (Theorems 3.1 and 3.2). At the same time, the bound is easy to interpret and connects to several empirical quantities and other generalization theories, such as (1) the stability of the optimization algorithm (Section 4.4) and (2) the flatness of the loss landscape at convergence (Section 4.5). Moreover, we show that our bound can also be interpreted in function space, that is, by looking at the activations rather than the weights (Theorem 3.2), leading to tighter bounds for over-parameterized models through the data processing inequality.

Our loo-CMI bound falls in the general class of conditional mutual information bounds introduced by [24]. However, while standard CMI bounds iterate over all the subsets of $n$ samples selected from a dataset of $2n$ samples in order to compute the bound, our loo-CMI only needs to iterate over the possible samples to remove, whose number is linear in the size of the dataset, leading to an exponential reduction in the cost of estimating the bound. Moreover, our strategy of removing a single sample is easy to relate to the stability of the training path to slight perturbations of the training set, thus giving a clear connection between generalization and the geometry of the loss landscape (Figure 1). Empirically, we show that our bound can be computed and is non-vacuous for state-of-the-art deep networks fine-tuned on standard image classification tasks (Table 1). We also study the dependency of the bound on the size of the dataset (Figure 2) and on the hyper-parameters (Figure 3).

## 2 Preliminaries

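We denote with $[n]$ the set $\{1, \dots, n\}$, and with $\mathbb{R}^d$ and $\mathbb{C}^d$ the sets of real-valued and complex-valued $d$-dimensional vectors. $A^\top$ and $A^{-1}$ denote the transpose and inverse of a square matrix $A$. If $v$ is a vector with $d$ elements, then $\operatorname{diag}(v)$ denotes the square matrix with the elements of $v$ on its main diagonal; $\|\cdot\|$ is the standard Euclidean norm on vectors. We use uppercase letters to denote random variables, and lowercase letters to denote a specific value the random variable can take. If $X$ is a random variable, then $P_X$ is the probability distribution of $X$, and $\mathbb{E}[X]$ denotes the expected value of $X$. For a pair of random variables $X$ and $Y$, $P_{X,Y}$ and $P_X \otimes P_Y$ are the joint and product distributions, respectively. If $P$ and $Q$ are two probability densities over $\mathcal{X}$ such that $P$ is absolutely continuous with respect to $Q$ ($P \ll Q$), the Kullback-Leibler Divergence from $Q$ to $P$ is defined as:

$$\mathrm{KL}(P \,\|\, Q) = \int_{\mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \, dx. \tag{1}$$

The Mutual Information between two random variables $X$ and $Y$, and the conditional mutual information of $X$ and $Y$ given a third random variable $Z$ are respectively given by:

$$I(X; Y) = \mathrm{KL}\left(P_{X,Y} \,\|\, P_X \otimes P_Y\right), \tag{2}$$

$$I(X; Y \mid Z) = \mathbb{E}_{Z}\left[\mathrm{KL}\left(P_{X,Y \mid Z} \,\|\, P_{X \mid Z} \otimes P_{Y \mid Z}\right)\right]. \tag{3}$$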

### 2.1 Problem Formulation

Let be a collection of independent and identically distributed samples drawn from some unknown distribution . We consider the setting of supervised learning where ; i.e., each sample is composed of a feature vector and a label. Let be a possibly stochastic training algorithm, where is a set of weights (e.g., weights in a neural network). Given a loss function , the empirical risk for weights on is given by , and the "true" loss is given by ; i.e., the average loss incurred on random samples drawn from . Our goal is to find a such that . Since we do not have access to the data-generating distribution , computing the true loss is infeasible. However, we do have access to a set of samples , and one approach is empirical risk minimization, where we find the that minimizes . If the difference is small, then the true loss of an empirical risk minimizer would be nearly optimal within the considered hypothesis class. Therefore, it is of great interest to study the generalization gap of (given samples for ease of notation):

(4) |
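As a rough numerical illustration of the quantities defined above, the following sketch estimates the empirical risk, the true risk, and their difference for an empirical risk minimizer. Everything specific here is an assumption made for illustration and does not come from the paper: the linear-Gaussian data model, the squared loss, the helper names, the Monte Carlo estimate of the true risk from fresh samples, and the sign convention (true risk minus empirical risk).

```python
import numpy as np

def squared_loss(w, X, y):
    """Per-sample squared loss of the linear predictor X @ w."""
    return (X @ w - y) ** 2

rng = np.random.default_rng(0)
d, n = 10, 20
w_true = rng.normal(size=d)  # hypothetical data-generating weights

def sample(m):
    """Draw m i.i.d. samples from the assumed linear-Gaussian distribution."""
    X = rng.normal(size=(m, d))
    y = X @ w_true + rng.normal(size=m)
    return X, y

X_train, y_train = sample(n)
X_fresh, y_fresh = sample(100_000)  # stands in for the unknown distribution

# Empirical risk minimizer for the squared loss (ordinary least squares).
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

empirical_risk = squared_loss(w_hat, X_train, y_train).mean()
true_risk_estimate = squared_loss(w_hat, X_fresh, y_fresh).mean()
gap = true_risk_estimate - empirical_risk  # assumed sign convention

print(f"empirical risk      : {empirical_risk:.3f}")
print(f"estimated true risk : {true_risk_estimate:.3f}")
print(f"estimated gap       : {gap:.3f}")
```

In practice the fresh sample is not available, which is why bounds on the gap that depend only on the training procedure are of interest.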

### 2.2 Conditional Mutual Information Bounds

We start by recalling the main results on Conditional Mutual Information (CMI) bounds that we return to later. Let be a dataset with samples drawn from a distribution , grouped into pairs. Let be a uniform random binary vector of size . selects one sample from each pair in to form given by . In this setting, the bound of Steinke and Zakynthinou [24] arises from asking the following question: “if a model is trained with a subset of samples chosen through the random binary indexing vector , how much information does the output of the algorithm provide about ?” Here, the Shannon Mutual Information is useful because it is a measure of the amount of ‘information’ obtained about one random variable after observing another random variable.

###### Theorem 2.1 (Theorem 2, [24]).

Let be a bounded loss function. Let be a possibly stochastic training algorithm. Let be the output of given the dataset , then

(5) |

Haghifam et al. [8] improved Theorem 2.1 by moving the expectation over outside the square root, and by measuring the information the output of the algorithm provides on random subsets of the indexing vector . Harutyunyan et al. [12] further improved the bound by moving the expectation over the random subsets outside the square root.

###### Theorem 2.2 (Theorem 2.6, [12]).

Let be a bounded loss function. Let and let be a random subset of of size . Let be the output of given the dataset , then

(6) |

One appealing aspect of the CMI bounds is that they work for any training algorithm and any model, including over-parameterized networks trained with Stochastic Gradient Descent (SGD). However, computing the bounds requires iterating over all possible values of . Moreover, the bound is difficult to interpret since the value of for different values of can vary significantly and in unpredictable ways for large non-convex models. Taking inspiration from leave-one-out cross validation, we propose a different CMI bound that avoids both problems by removing a single sample from the dataset. As we shall see, it allows for interpretable expressions, faster computation, and a connection to notions of stability.

## 3 Leave-One-Out Conditional Mutual Information

Let be a dataset of i.i.d. samples drawn from . Let be a uniform random variable taking values over the indices of the samples in . removes a single sample from to form , the dataset without the sample. Let be the output of the algorithm, then we define the pointwise leave-one-out conditional mutual information as:

(7) |

In this setting, measures the amount of information that the output of the algorithm reveals about , the index (identity) of the left-out sample. As is the case for the conditional mutual information terms in Theorems 2.1 and 2.2, is bounded. Specifically, is upper bounded by the entropy of , . Hence, it does not suffer from the same issues as information stability bounds [22, 25]. Moreover, the output of an algorithm is not significantly affected by the inclusion or exclusion of one sample when the input consists of thousands of other samples, so we expect to be small for large values of . We use (7) to derive a bound on the generalization error in expectation (Theorem 3.1).
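To make the boundedness claim concrete, here is a minimal sketch for a toy case in which, for a fixed dataset, the output of the algorithm takes finitely many values. The function name, the table `cond`, and the three toy algorithms are hypothetical; the only facts used are the standard identity that the mutual information with a uniform index is the average Kullback-Leibler divergence from each conditional to their mixture, and that it cannot exceed the entropy of the index.

```python
import numpy as np

def mutual_information_uniform_index(cond):
    """I(W; U) in nats for U uniform over the rows of `cond`.

    cond[u, w] = P(W = w | U = u) for one fixed dataset; rows sum to 1.
    """
    cond = np.asarray(cond, dtype=float)
    mixture = cond.mean(axis=0)  # marginal of W under uniform U
    # Ratio P(w|u) / P(w); entries with P(w|u) = 0 contribute 0 to the KL.
    ratio = np.divide(cond, mixture, out=np.ones_like(cond), where=cond > 0)
    kl_rows = np.sum(cond * np.log(ratio), axis=1)
    return float(kl_rows.mean())

k = 4  # number of indices that can be removed (toy value)

# Output ignores the data entirely: reveals nothing about the index.
constant = np.tile([1.0, 0.0, 0.0, 0.0], (k, 1))
# Output identifies the removed index exactly: full memorization.
memorizing = np.eye(k)
# Output is a noisy function of the removed index.
noisy = 0.7 * np.eye(k) + 0.1 * (1.0 - np.eye(k))

i_const = mutual_information_uniform_index(constant)
i_mem = mutual_information_uniform_index(memorizing)
i_noisy = mutual_information_uniform_index(noisy)
print(i_const, i_noisy, i_mem, np.log(k))
```

The constant algorithm attains zero and the memorizing one attains the entropy of the uniform index, which are the two extremes of the quantity.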

### 3.1 Generalization Bounds Based on

is deeply intertwined with leave-one-out cross validation, so before we derive bounds on the generalization through , it is helpful to first obtain a probabilistic upper bound on the leave-one-out cross validation error (). measures the difference in loss between the samples which the algorithm trained on, and the sample that was left out. We begin with a precise definition of and a probabilistic bound on it.

(8) |
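The verbal description above (loss on the left-out sample compared with the loss on the samples used for training) can be sketched as follows. This is one plausible reading and not necessarily the exact form of (8): the ridge-regression stand-in for the training algorithm, the squared loss, the function names, and the choice "held-out loss minus mean training loss" are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Hypothetical stand-in for the training algorithm: ridge regression."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def leave_one_out_gaps(X, y, fit=fit_ridge):
    """For each index u: loss on the left-out sample minus mean loss on the rest."""
    m = len(y)
    gaps = np.empty(m)
    for u in range(m):
        keep = np.arange(m) != u
        w = fit(X[keep], y[keep])  # retrain without sample u
        train_loss = np.mean((X[keep] @ w - y[keep]) ** 2)
        held_out_loss = (X[u] @ w - y[u]) ** 2
        gaps[u] = held_out_loss - train_loss
    return gaps

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=30)

gaps = leave_one_out_gaps(X, y)
print("mean leave-one-out gap:", gaps.mean())
```

Each entry requires one retraining, so the loop runs once per sample rather than once per subset.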

###### Lemma 3.1.

Let be a bounded loss function, then for all , , and ,

(9) |

To derive the generalization bounds included in this paper, we make heavy use of Lemma 3.1. We begin with a generalization bound in expectation.

###### Theorem 3.1.

Let be a training algorithm. Let be a bounded loss function, then

(10) |

where is given in (9). The bound in (10) guarantees a good generalization error when is small; i.e., when the output of the algorithm tells us little about the identity of the sample that was removed from the training dataset. If a parametric learning algorithm ends up memorizing the samples in the training set, the generalization gap could be large, and so would , since we could identify the removed sample by checking which sample was not memorized by the algorithm. At the other extreme, a learning algorithm that outputs a constant parameter regardless of the training set does not have a generalization gap, and in this case would equal as provides no information about . We note that the bound of Theorem 3.1 most closely resembles the bound of Theorem 2.2 when . The latter bound is given by:

(11) |

Recalling the setting of Section 2.2, (11) is computed by first going through every possible value of . For every value of and for each , one fixes excluding which varies uniformly over . In other words, the component of the training set is equally likely to be either or . The right-hand side of (11) then bounds the generalization using the information that the weights contain about . In other words, the question asked is: “can the output of the algorithm help us determine which one of or was present in the training set?” On the other hand, we ask a different question: “can the output of the algorithm help us determine the index of the sample that was removed?” This subtle change leads to an exponential reduction in the cost of computing the bound, since we only need to iterate over the values of , as opposed to which can take possible values.

### 3.2 Extension To The Function Space

The previous bounds used the output of the algorithm, the weights, to bound the generalization error. A different approach studied in [12] is to assume that given a training dataset and a test sample, the algorithm outputs a prediction on the test sample. In other words, we assume the algorithm is a possibly stochastic function which takes a training set , a new unlabeled sample , and outputs a possibly stochastic prediction on the new sample. The set of predictions may differ from the set of labels (e.g., class probabilities in multi-class classification tasks). Unlike bounds with respect to the weights, this approach is applicable to both parametric algorithms (e.g., neural networks) and non-parametric algorithms (e.g., $k$-nearest neighbors). In the former case, the output of the algorithm specifies a prediction function from a class of functions parameterized by the weights i.e. .

We redefine the loss to be a non-negative function that measures the distance between the predictions and the ground-truth labels. Given and a training set , the true loss of an algorithm is given by , and the empirical loss is given by . The generalization gap of given is then defined as:

(12) |

Recalling the setup of Section 3, we define the pointwise functional leave-one-out conditional mutual information as:

(13) |

While measures the reduction in uncertainty about after having known the weights, measures the reduction in uncertainty after having known the predictions made on the whole dataset (including the prediction on the removed sample). We derive a bound with respect to using a proof that closely resembles the proof of Theorem 3.1.

###### Theorem 3.2.

Let be a bounded loss function. Let be a uniform random variable over , then

(14) |

## 4 Computing The Bounds

### 4.1 Computable Upper Bounds to and

To evaluate and interpret the bounds for a parametric algorithm with prediction function , we require computable closed-form expressions for and . However, obtaining a closed-form expression is a difficult task in most cases. To see the reason behind this difficulty, note that can be written as:

(15) |

where is a mixture distribution. Since one of the terms in the Kullback-Leibler divergence of (15) is a mixture distribution, a closed-form expression for the conditional mutual information is unlikely to exist for most commonly encountered continuous distributions (e.g., Gaussian distributions [6]). To find a computable and interpretable bound on the generalization error, we propose upper bounds on the conditional mutual information which can be interpreted and evaluated.

###### Theorem 4.1.

Let be a parametric algorithm with prediction function . Let be defined as before. Let be an identical independent copy of , then

(16) | ||||

(17) |

One can alternatively upper bound through the convexity of the Kullback-Leibler divergence (similarly for ). However, since the function is strictly convex, it is easy to show that the bound of Theorem 4.1 is strictly tighter through Jensen’s inequality. Moreover, we opt to use this bound as it allows for more interpretable expressions (Corollary 4.1).
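The comparison between the two upper bounds can be checked numerically on a toy example. The sketch below assumes that the bound of Theorem 4.1 has the pairwise form $-\mathbb{E}_{U} \log \mathbb{E}_{U'} \exp(-\mathrm{KL}(P_{W \mid U} \,\|\, P_{W \mid U'}))$; this is our reading of the statement, since the displayed equations are not reproduced in this text, and it is a form known to upper bound the mutual information with a mixture. The discrete conditionals drawn from a Dirichlet distribution and all function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def exact_mi(cond):
    """I(W; U) for U uniform over rows; cond[u] = P(W | U = u)."""
    mixture = cond.mean(axis=0)
    return float(np.mean([kl(p, mixture) for p in cond]))

def pairwise_kl_matrix(cond):
    """Matrix K with K[i, j] = KL(cond[i] || cond[j])."""
    k = len(cond)
    return np.array([[kl(cond[i], cond[j]) for j in range(k)] for i in range(k)])

def pairwise_bound(cond):
    """-E_U log E_U' exp(-KL(P_U || P_U')), with U' an independent copy of U."""
    K = pairwise_kl_matrix(cond)
    return float(np.mean(-np.log(np.mean(np.exp(-K), axis=1))))

def convexity_bound(cond):
    """E_{U,U'} KL(P_U || P_U'), from convexity of KL in its second argument."""
    return float(np.mean(pairwise_kl_matrix(cond)))

rng = np.random.default_rng(0)
cond = rng.dirichlet(np.ones(6), size=5)  # 5 indices, 6 possible outputs

mi = exact_mi(cond)
pw = pairwise_bound(cond)
cv = convexity_bound(cond)
print(f"exact {mi:.4f} <= pairwise {pw:.4f} <= convexity {cv:.4f}")
```

The pairwise form also never exceeds the logarithm of the number of indices, because the term with $U' = U$ contributes $\exp(0) = 1$ to the inner average, whereas the convexity bound can grow without limit.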

### 4.2 Geometry Aware Synthetic Randomization

Theorems 3.1 and 3.2 are valid in theory, but the outputs of algorithms used in practice are deterministic given a fixed input, and so we do not have a distribution of weights or predictions. This means that (7) and (13) are degenerate. Even if we vary the random seed for neural networks trained with SGD, and run the algorithm for each dataset and random seed, we obtain a set of discrete distributions with unequal supports. We alleviate this issue by taking measures similar to the ones made in [8, 18, 11], and also previously in related contexts in [2, 7, 1]. In particular, we add noise to the output of a deterministic algorithm, and use the now stochastic algorithm to bound the information in the weights and predictions. We avoid adding isotropic Gaussian noise for the weight-based bounds, as changing the values of some weights may have little effect on the loss compared to others. Therefore, we add noise while taking into account the geometry of the loss landscape.

Specifically, let be a deterministic algorithm, and let , where . A choice of that would incorporate the notion that weights have a varying effect on the loss is with for all and not necessarily equal to for . Similarly, we can add noise to predictions made by a deterministic algorithm as , where . The generalization bounds derived for or are not directly applicable to and , but if certain Lipschitz continuity or convexity assumptions hold for the loss function , it is possible to derive bounds for deterministic algorithms from the bounds of their noisy versions (Sections 7.6 and 4.3). We begin by applying Theorem 4.1 to get interpretable expressions for the conditional mutual information terms.

###### Corollary 4.1.

Let be a deterministic algorithm with prediction function . Let and be the output and predictions respectively when sample is removed from the training set. Using Theorem 4.1, we obtain

(18) | ||||

(19) |
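With Gaussian noise of a shared covariance, the Kullback-Leibler divergence between two noisy outputs depends only on the difference of the underlying deterministic outputs: $\mathrm{KL}(\mathcal{N}(w_i, \Sigma) \,\|\, \mathcal{N}(w_j, \Sigma)) = \tfrac{1}{2}(w_i - w_j)^\top \Sigma^{-1} (w_i - w_j)$. The sketch below plugs this standard identity into a pairwise bound of the form $-\mathbb{E}_{U} \log \mathbb{E}_{U'} \exp(-\mathrm{KL})$, which is assumed here and may differ in detail from (18). The array `loo_weights` is a synthetic placeholder for weights obtained by retraining with one sample removed, and the diagonal covariance values are arbitrary.

```python
import numpy as np

def gaussian_pairwise_kl(weights, noise_var):
    """KL(N(w_i, Sigma) || N(w_j, Sigma)) for Sigma = diag(noise_var).

    With a shared covariance this equals
    0.5 * (w_i - w_j)^T Sigma^{-1} (w_i - w_j).
    """
    diff = weights[:, None, :] - weights[None, :, :]     # shape (k, k, d)
    return 0.5 * np.sum(diff ** 2 / noise_var, axis=-1)  # shape (k, k)

def pairwise_information_bound(weights, noise_var):
    """Assumed pairwise form: mean over i of -log mean over j of exp(-KL_ij)."""
    K = gaussian_pairwise_kl(weights, noise_var)
    return float(np.mean(-np.log(np.mean(np.exp(-K), axis=1))))

rng = np.random.default_rng(0)
k, d = 8, 50
w_full = rng.normal(size=d)
# Synthetic leave-one-out solutions: small perturbations of a common point.
loo_weights = w_full + 0.01 * rng.normal(size=(k, d))

# Anisotropic noise: the first 10 coordinates are treated as directions
# along which the loss is assumed to be less sensitive.
noise_var = np.full(d, 0.05 ** 2)
noise_var[:10] = 0.2 ** 2

bound = pairwise_information_bound(loo_weights, noise_var)
bound_more_noise = pairwise_information_bound(loo_weights, 4 * noise_var)
print(bound, bound_more_noise)
```

Scaling the noise variance up shrinks every pairwise divergence and therefore the information term, at the price of perturbing the weights more strongly.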

### 4.3 Connections to Stability and Bounds For Deterministic Algorithms

It is easy to see the connection between classical definitions of stability and the right-hand side of (18), (19). After all, to compute the right-hand side of (18) and (19), one only needs to know how the weights and predictions of the algorithm change when two datasets differ by one sample. Based on the observation that components of the weight might have differing degrees of effect on the loss, and the right-hand side of (18), we introduce the idea of "relative" weight stability. Moreover, we also use classical definitions of functional stability, and here we find it useful and intuitive to define two notions of stability: one with respect to the samples that the two datasets share, and one with respect to any other sample. The definitions are as follows.

###### Definition 4.1.

Let be datasets such that and differ by at most one sample. Letting , then we say that a deterministic algorithm has weight stability relative to a positive semi-definite matrix if

###### Definition 4.2.

Let be datasets such that and differ by at most one sample, and without loss of generality, assume they differ at the first sample, i.e., and for all . Let be a prediction function ; then we say that has -train stability if for all . Moreover, we say that has -test stability if for all .

Given bounds on the noisy version of a deterministic algorithm, one can derive bounds on the deterministic algorithm itself by making certain Lipschitz continuity assumptions on the loss function. Moreover, we show that given relative weight stability and functional stability of a deterministic algorithm, we can add an optimal amount of noise to find a bound on the deterministic algorithm. This technique is also used by [12, 18] for the case of isotropic noise. We generalize this result to arbitrary values of the noise covariance.

###### Theorem 4.2.

Let be a deterministic algorithm, and let be the loss that a set of weights incur on a sample. If is -Lipschitz in the weights and has -weight stability relative to a positive semi-definite , then .

###### Theorem 4.3.

Let be a deterministic prediction function. Assume that has -train stability and -test stability. Assume the loss function is -Lipschitz continuous in the first coordinate, then .

### 4.4 Bounds For Stochastic Gradient Descent

As we have seen, the quality of our information bound depends on the stability of the optimization algorithm. Stochastic Gradient Descent is the most commonly used method in machine learning, and it is therefore interesting to study the stability of SGD and how it translates into generalization from an information theoretic point of view. Assuming the gradient updates are -bounded for some (see appendix for definition), one can bound with respect to the number of iterations of SGD [10], and obtain a bound on the generalization through the right-hand side of (18). In particular, we derive the following bound, which shows how training for a shorter time and having more bounded updates both contribute to generalization.

###### Lemma 4.1.

Let the gradient update rule be -bounded. Suppose we run SGD for steps, and we use the same starting value for the weights, then .

### 4.5 Local interpretation of the bound

The bound in Corollary 4.1 depends on non-local quantities as it requires retraining the model a linear number of times on subsets of the data. Ideally, one wants to bound the generalization without having to retrain. We now provide a qualitative approximation of the bound using local quantities, in order to connect the bound with the geometry of the loss function. Assume that the training algorithm minimizes the loss , and let be the minimum to which the algorithm converges, and similarly let . Using influence functions [15], we can approximate:

(20) |

where is the Hessian of the loss, is the gradient of the loss on for , and we have introduced the per-sample gradient at convergence rescaled by the Hessian .
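The influence-function approximation can be checked on a problem where exact leave-one-out retraining is cheap. The sketch below uses ridge regression with an unnormalized (summed) squared loss; in that setting, removing sample $i$ is approximated by one Newton step from the full-data minimizer, $w_{-i} \approx w^* + H^{-1} g_i$. The model, the regularization strength, and the sum-versus-average scaling are assumptions made for illustration and may differ from the convention used in (20).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def fit(X, y):
    """Minimizer of 0.5 * sum_i (x_i^T w - y_i)^2 + 0.5 * lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_star = fit(X, y)
H = X.T @ X + lam * np.eye(d)  # Hessian of the full objective

i = 0
g_i = X[i] * (X[i] @ w_star - y[i])             # gradient of sample i's loss at w_star
w_influence = w_star + np.linalg.solve(H, g_i)  # one Newton step, no retraining

keep = np.arange(n) != i
w_exact = fit(X[keep], y[keep])                 # actual leave-one-out retraining

rel_err = np.linalg.norm(w_influence - w_exact) / np.linalg.norm(w_exact - w_star)
print(f"relative error of the influence approximation: {rel_err:.4f}")
```

For this quadratic objective the relative error equals the leverage $x_i^\top H^{-1} x_i$ of the removed sample, so the approximation improves as individual samples carry less weight in the Hessian.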

Using this notation we can approximate:

(21) | ||||

(22) | ||||

(23) |

where in we have assumed , which is the optimal variance of the noise (Section 4.2).

From (23) we see that converging to a flat minimum, that is, having a small norm of the Hessian , is expected to correlate with better generalization. This is in accordance with several empirical results connecting convergence to flat minima to better generalization [14, 7, 4]. It should be noted that flatness by itself cannot explain generalization, since we can reparameterize the network to have identical predictions (and hence generalization) but an arbitrarily large Hessian [5]. Indeed, in (23) we see that the role of the Hessian is mediated by the similarity of the (re-scaled) per-sample gradients at convergence, which can change under reparameterization.

## 5 Related Work

Over the past several years, there has been a significant line of research using information theory both to interpret the behavior of DNNs [2, 23, 1] and to derive generalization bounds [22, 25, 3, 17, 24, 8, 12], with the earliest works in this line being [22, 25]. Many of them develop information stability bounds, which involve the mutual information between the output of the algorithm and the samples [22, 25, 3, 17] and which can degenerate, *e.g.,* if the data is continuous. This was observed in [24], which then proposed a new bound based on the conditional mutual information (see Theorem 2.1) which is non-degenerate. This work was further extended by [8, 12]. In particular, [12] developed bounds based on prediction outputs rather than the algorithm outputs, and its combination with the conditional mutual information framework made it the state of the art in information-theoretic generalization bounds. We identify two issues with this line of work: computability (as the bound of [12] might in principle require computation exponential in the size of the dataset) and interpretability. Addressing them is the main focus of our work. In particular, our loo-CMI framework can not only be computed more efficiently, but also connects to classical leave-one-out cross validation measures used extensively in practice (see Theorem 3.1). Another aspect introduced in [12, 18] is the application of CMI bounds to deterministic algorithm outputs (as one can only perform a finite number of runs of even a randomized algorithm on a dataset). We adopt this viewpoint in our work, but with a geometry-aware method (see Theorem 4.2) that takes into account the loss landscape in the algorithmic output space. There has also been a line of work connecting SGD to generalization, including earlier works in [9] and, more recently, its connection with information theoretic bounds [20, 18]. We apply these ideas to the framework of loo-CMI (see Lemma 7.7). The connection to classical notions of algorithmic stability can also be made using the loo-CMI framework (see Theorem 7.6).

## 6 Experiments

#### Model and datasets.

We now study the behavior of our and bounds on real-world image classification tasks. In particular, we fine-tune an off-the-shelf ResNet-18 model pretrained on ImageNet on a set of standard image-classification tasks (see also Table 1): MIT-67 [21] and Oxford Pets [19]. On both datasets we fine-tune for 10 epochs using stochastic gradient descent with learning rate , momentum , batch size , and weight decay . We compute by removing sample from the training set and re-training from scratch. For all the experiments we remove 10 samples one at a time across 3 random seeds and use Corollary 4.1 to compute the information bounds. We used 2 NVIDIA 1080Ti GPUs, and the experiments take 1-2 days.

#### Non-vacuous bounds.

In Table 1, we compute the and bounds using (42) and (61). We show that while the former bound is vacuous, the bound provides non-vacuous generalization bounds on all datasets. This is significant, as obtaining non-vacuous bounds for large-scale models remains a challenging problem. The failure of the bound based on is expected here, since looks directly at the information contained in the weights of the network without considering how this information is used. In large models with millions of parameters, most of the information in the weights does not significantly affect the predictions and should ideally be discarded. This is indeed done by the bound.

#### Effect of the dataset size.

To show how the quality of the bound changes for different sizes of the dataset, we subsample randomly and without replacement samples from each dataset. In Figure 2, we plot the resulting generalization gap alongside our generalization bounds as the size of the subsample varies (in terms of fractions of the dataset). We observe that the generalization bound becomes tighter as the size of the training set grows. This is not surprising as we expect the model to become more stable for larger datasets.

#### Synthetic randomization.

In Section 4.2, we introduced artificial Gaussian noise in order to obtain better bounds. Increasing the noise variance in the bound, or in , improves the bound on the CMI, but also increases the test error of the model. Hence, we need to select a value of and that provides a good trade-off between the two. In Figure 3, we show how changing the noise variance affects each term.

## 7 Proofs

### 7.1 Proof of Lemma 3.1

We begin by restating the result of Lemma 3.1.

###### Lemma.

Let be a bounded loss function, then for all , , and ,

(24) |

Proof: To show this, let and . For , we define:

(25) |

If we obtain a bound on for every , then we also obtain a bound on for all . Now,

(26) |

where . Since the maximum of must be at a stationary point, then

(27) |

Setting (27) to , then the right-hand side expression of (27) must be . This implies that is either at the extremities or we must have a value of , where

(28) |

Let be the number of that are , and the number of that are . Solving for using (28), we get that

(29) |

We find that depends on only through the number of s and s in . Writing with respect to and , we obtain

(30) |

Hence, finding the maximum of is equivalent to finding the solution for the following optimization problem.

(31) | ||||

s.t. |

Since and are bounded non-negative integers, we can find the solution of the above maximization through a brute-force search. However, we can relax the constraints of (31), and allow to take non-integer values over . Using the method of Lagrange multipliers, we have that:

(32) |

The partial derivative of (32) with respect to and are given by:

(33) | ||||

(34) |

Setting the partial derivatives of (33) and (34) to yields

(35) | ||||

(36) |

Substituting these expressions in (32), we see that we now need to find

s.t. |

Since for all , this expression is maximized by the largest value can take. Hence, we obtain our maximum for

Plugging into our expression, we have

(37) |

Recalling that , we obtain

(38) |

Using the Taylor expansion of the exponent in the right-hand side of (38), we have that:

(39) |

Finally, we obtain

(40) | ||||

(41) |

Recalling that , we obtain the desired result.

### 7.2 Proof of Theorem 3.1

We begin by restating the theorem.

###### Theorem.

Let be a training algorithm. Let be a bounded loss function, then

(42) |

We state a series of lemmas and definitions that are essential for the proof.

###### Lemma 7.1.

Let , and let , then

(43) |

###### Proof.

We have that:

(44) | ||||

(45) | ||||

(46) | ||||

(47) |

∎

###### Lemma 7.2.

Let be a training algorithm, then

(48) |

###### Proof.

(49) | ||||

(50) | ||||

(51) | ||||