
Identifying Generalization Properties in Neural Networks

by   Huan Wang, et al.

While it has not yet been proven, empirical evidence suggests that model generalization is related to local properties of the optima which can be described via the Hessian. We connect model generalization with the local property of a solution under the PAC-Bayes paradigm. In particular, we prove that model generalization ability is related to the Hessian, the higher-order "smoothness" terms characterized by the Lipschitz constant of the Hessian, and the scales of the parameters. Guided by the proof, we propose a metric to score the generalization capability of the model, as well as an algorithm that optimizes the perturbed model accordingly.





1 Introduction

Deep models have proven to work well in applications such as computer vision (Krizhevsky et al., 2012) (He et al., 2014) (Karpathy et al., 2014), speech recognition (Mohamed et al., 2012) (Hinton et al., 2012), and natural language processing (Socher et al., 2013) (Graves, 2013) (McCann et al., 2018). Many deep models have millions of parameters, which is more than the number of training samples, but the models still generalize well (Huang et al., 2017).

On the other hand, classical learning theory suggests that model generalization capability is closely related to the “complexity” of the hypothesis space. This seems to contradict the empirical observation that over-parameterized models generalize well on the test data. Indeed, even if the hypothesis space is complex, the final solution learned from a given training set may still be simple. For example, suppose the hypothesis space is the union of linear classifiers and some complex function spaces. As a union, the hypothesis space is complex in the worst case, but for some training sets the best solution may be a linear classifier. This suggests that the generalization capability of the model is also related to the properties of the solution.

Keskar et al. (2016) and Chaudhari et al. (2016) empirically observe that the generalization ability of a model is related to the spectrum of the Hessian matrix evaluated at the solution, and that large eigenvalues of the Hessian often lead to poor model generalization. Also, Keskar et al. (2016), Chaudhari et al. (2016) and Novak et al. (2018b) introduce several different metrics to measure the “sharpness” of the solution, and demonstrate the connection between the sharpness metric and generalization empirically. Dinh et al. (2017) later point out that most of the Hessian-based sharpness measures are problematic and cannot be applied directly to explain generalization. In particular, they show that the geometry of the parameters in RELU-MLP can be modified drastically by re-parameterization.

Another line of work originates from the theorists. (Langford and Caruana, 2001) and more recently (Harvey et al., 2017) (Neyshabur et al., 2017a) (Neyshabur et al., 2017b) use the PAC-Bayes bound to analyze the generalization behavior of deep models. Since the PAC-Bayes bound holds uniformly for all “posteriors”, it also holds for particular “posteriors”, for example, the solution parameter perturbed with noise. This provides a natural way to incorporate the local properties of the solution into the generalization analysis. In particular, Neyshabur et al. (2017a) suggest using the difference between the perturbed loss and the empirical loss as the sharpness metric. Dziugaite and Roy (2017) instead try to optimize the PAC-Bayes bound for better model generalization. Still, some fundamental questions remain unanswered. In particular, we are interested in the following question:

How is model generalization related to local “smoothness” of a solution?

In this paper we try to answer the question from the PAC-Bayes perspective. Under mild assumptions on the Hessian of the loss function, we prove that the generalization error of the model is related to this Hessian, the Lipschitz constant of the Hessian, the scales of the parameters, and the number of training samples. The analysis also gives rise to a new metric for generalization. Based on this, we can approximately select an optimal perturbation level to aid generalization, which interestingly turns out to be related to the Hessian as well. Inspired by this observation, we propose a perturbation-based algorithm that makes use of an estimate of the Hessian to improve model generalization.

(a) Loss landscape. The color on the loss surface shows the pacGen scores. The color on the bottom plane shows an approximated generalization bound.
(b) Sample distribution
(c) Predicted labels by the sharp minimum
(d) Predicted labels by the flat minimum
Figure 1: Loss landscape and predicted labels of a multi-layer MLP with two free parameters.

2 Sharp Minimum vs. Flat Minimum - A Toy Example

Let us start with a toy example to demonstrate different behaviors of local optima. For training, we construct a small 2-dimensional sample set from a mixture of Gaussians, and then binarize the labels by thresholding them at their median value. The sample distribution is shown in Figure 1(b). Then we use a multi-layer MLP model with sigmoid as the activation and cross entropy as the loss for training and prediction. The variables from different layers are shared so that the model only has two free parameters.

The model is trained on this sample set. Fixing the samples, we plot the loss function with respect to the two model variables, as shown in Figure 1(a). Many local optima are observed even in this simple two-dimensional toy example, in particular a sharp one, marked by the vertical green line, and a flat one, marked by the vertical red line. The colors on the loss surface display the values of the generalization metric scores (pacGen), which we will define in Section 7. A smaller metric value indicates better generalization power.

As displayed in the figure, the metric score around the global optimum, indicated by the vertical green bar, is high, suggesting possibly poor generalization capability compared to the local optimum indicated by the red bar. We also plot a plane on the bottom of the figure. The color projected on the bottom plane indicates an approximated generalization bound, which considers both the loss and the generalization metric. (The bound was approximated using inequality (13).) The local optimum indicated by the red bar, though it has a slightly higher loss, has a similar overall bound compared to the “sharp” global optimum.

On the other hand, fixing the two parameters, we may also plot the labels predicted by the model given the samples. Here we plot the predictions from both the sharp minimum (Figure 1(c)) and the flat minimum (Figure 1(d)). The sharp minimum, even though it approximates the true labels better, has some complex structures in its predicted labels, while the flat minimum seems to produce a simpler classification boundary.

While it is easy to make observations on toy examples, it is less straightforward to make a quantitative statement when the number of model parameters and the number of training samples grow. In the following sections we try to connect the local smoothness of the solution and the model generalization capability. Section 3 briefly introduces some preliminaries on learning theory. Section 4 discusses the assumptions and intuitions on how the model perturbation is related to generalization as well as to the Hessian of the solution. Section 5 dives into two specific types of perturbations: uniform and truncated Gaussian. Section 6 discusses the effect of re-parameterization on the proposed bound. Some empirical approximations and experiments are shown in Sections 7 and 8.

3 Model Generalization Theory

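We consider the general machine learning scenario. Suppose we have a labeled data set $S = \{s_i = (x_i, y_i) : i = 1, \dots, n\}$, where the $s_i$ are sampled i.i.d. from a distribution $\mathcal{D}$. We try to learn a function $f$ such that the expected loss

$$L(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[l(f, x, y)\right]$$

is small, where $l$ is the loss function.

Since we do not know the distribution $\mathcal{D}$, the expected loss is hard to calculate directly. Instead, the empirical loss

$$\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$$

is usually evaluated during the training procedure.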

3.1 Rademacher Complexity

Minimizing the empirical loss may lead to issues such as overfitting. In general, by the law of large numbers, for a fixed function $f$, the empirical loss converges almost surely to the expected loss. However, when $f$ is not fixed, i.e., it depends on the samples, and the number of samples is finite, classical learning theory suggests that the gap between the expected loss and the empirical loss is bounded by the sum of the Rademacher complexity and a concentration tail (Shalev-Shwartz and Ben-David, 2014). The Rademacher complexity of a function class $\mathcal{F}$ is defined as

$$R_n(\mathcal{F}) = \mathbb{E}_{S, \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, l(f, x_i, y_i)\right],$$

where the $\sigma_i$s are i.i.d. Rademacher random variables.
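The supremum-over-signs definition above can be made concrete with a small Monte Carlo sketch. It estimates the empirical Rademacher complexity (conditioned on one fixed sample, rather than averaged over samples) of a finite class whose losses are stored in a matrix. The function name `empirical_rademacher`, the random loss matrices, and all sizes are assumptions chosen for illustration only.

```python
import numpy as np

def empirical_rademacher(loss_matrix, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    loss_matrix[k, i] is the loss of the k-th candidate function on sample i.
    """
    rng = np.random.default_rng(seed)
    num_funcs, n = loss_matrix.shape
    # Each row of sigma is one draw of n i.i.d. Rademacher signs (+1 or -1).
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # correlations[d, k] = (1/n) * sum_i sigma[d, i] * loss_matrix[k, i]
    correlations = sigma @ loss_matrix.T / n
    # Supremum over the finite class, then average over the sign draws.
    return correlations.max(axis=1).mean()

rng = np.random.default_rng(1)
n = 50
small_class = rng.uniform(size=(2, n))                               # 2 functions
large_class = np.vstack([small_class, rng.uniform(size=(200, n))])   # superset
r_small = empirical_rademacher(small_class)
r_large = empirical_rademacher(large_class)
print(r_small, r_large)
```

Because the larger class contains the smaller one and both calls reuse the same sign draws, the estimate for the larger class can never be smaller, which mirrors the intuition that richer classes have higher complexity.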

Note that the Rademacher complexity is only related to the function space, the sample distribution, and the number of samples. This seems to suggest that when the function class is very complex, the gap between the empirical loss and the expected loss will be large. Though the learning theory based on Rademacher complexity can explain the overfitting effect to some extent (for example, when the hypothesis space is overly complex, the generalization tends to be worse), it does not easily explain some well-known empirical observations in today’s deep learning experiments, including:

  • Over-parameterization.

    The hypothesis space of a deep learning network can easily get rich enough to represent any function on a finite sample set (Zhang et al., 2017). According to the bound based on the Rademacher complexity, the network may tend to overfit. However empirically those deep models generalize well.

  • Different generalization behaviors for different local optima.

    The generalization bound based on Rademacher complexity holds uniformly for all hypotheses in the function class. On the other hand, it does not distinguish the generalization capabilities among different solutions. Obviously, there are “simple” solutions even if the whole function space is complex.

In this draft we focus on the second empirical observation and give, to the best of our knowledge, a first explanation of the behaviors of different local optima.

3.2 PAC-Bayes

Another line of theory discussing model generalization is PAC-Bayes (Mcallester, 2003) (McAllester, 1998) (McAllester, 1999) (Langford and Shawe-Taylor, 2002). The PAC-Bayes paradigm further assumes probability measures over the function class. In particular, it assumes a “posterior” distribution as well as a “prior” distribution over the function class. In this way the function is assumed to be sampled from the “posterior” distribution. As a consequence, the expected loss is taken over both the random draw of samples and the random draw of functions:

Correspondingly, the empirical loss in the PAC-Bayes paradigm is the expected loss over the draw of functions from the posterior:

PAC-Bayes theory suggests that the gap between the expected loss and the empirical loss is bounded by a term related to the KL divergence between the posterior and the prior (McAllester, 1999) (Langford and Shawe-Taylor, 2002). In particular, if the function $f$ is parameterized by $w \in \mathcal{W}$, and $w$ is perturbed by a random vector $u$, we have the following PAC-Bayes bound (Seldin et al., 2012) (Seldin et al., 2011) (Neyshabur et al., 2017a) (Neyshabur et al., 2017b):

[PAC-Bayes-Hoeffding Perturbation] Let $l(f, x, y) \in [0, 1]$, and let $\pi$ be any fixed distribution over the parameters $\mathcal{W}$. For any $\delta > 0$ and $\eta > 0$, with probability at least $1 - \delta$ over the draw of $n$ samples, for any $w$ and any random perturbation $u$,

$$\mathbb{E}_u[L(w+u)] \le \mathbb{E}_u[\hat{L}(w+u)] + \frac{KL(w+u \,\|\, \pi) + \log\frac{1}{\delta}}{\eta} + \frac{\eta}{2n}. \quad (1)$$

One may further optimize $\eta$ to get a bound in which the gap scales approximately as $\sqrt{(KL(w+u \,\|\, \pi) + \log\frac{1}{\delta})/n}$ (Seldin et al., 2011). (Since $\eta$ cannot depend on the data, one has to build a grid and use the union bound.) A nice property of the perturbation bound (1) is that it connects generalization with the local properties around the solution $w$ through the perturbation $u$. In particular, suppose $w^*$ is a local optimum. When the perturbation level of $u$ is small, $\mathbb{E}_u[\hat{L}(w^*+u)]$ tends to be small, but $KL(w^*+u \,\|\, \pi)$ may be large since the posterior is too “focused” on a small neighborhood around $w^*$, and vice versa. As a consequence, we may need to search for an “optimal” perturbation level for $u$ so that the bound is minimized.

4 Local Smoothness Assumptions

Keskar et al. (2016) investigate the local structures of the converged points for deep learning networks, and find empirically that the “sharpness” of the minima is closely related to the generalization property of the classifier. The sharp minimizers, which lead to a lack of generalization ability, are characterized by a significant number of large positive eigenvalues in the Hessian $\nabla^2 \hat{L}(w)$. In particular, they propose a local sharpness metric: [Sharpness Metric] (Keskar et al., 2016) Given $w$, $\epsilon > 0$ and a matrix $A$, the $(C_\epsilon, A)$-sharpness of $\hat{L}$ at $w$ is defined as:

$$\phi_{w, \hat{L}}(\epsilon, A) = \frac{\max_{y \in C_\epsilon} \hat{L}(w + Ay) - \hat{L}(w)}{1 + \hat{L}(w)} \times 100, \quad (2)$$

where $C_\epsilon = \{z : -\epsilon(|(A^+ w)_i| + 1) \le z_i \le \epsilon(|(A^+ w)_i| + 1) \ \forall i\}$, and $A^+$ is the pseudo inverse of $A$. Other variants of the model generalization metrics are also proposed by Chaudhari et al. (2016) and Novak et al. (2018b).

Neyshabur et al. (2017a) suggest an “expected sharpness” based on the PAC-Bayes bound:

$$\mathbb{E}_{u \sim N(0, \sigma^2)^m}\left[\hat{L}(w + u)\right] - \hat{L}(w). \quad (3)$$

They also point out that the sharpness itself may not be enough to determine the generalization capability, but that combining scales with sharpness may give control over the generalization. Similar connections are also found by Dziugaite and Roy (2017).

4.1 Smoothness Assumption over Hessian

While some researchers have discovered empirically the generalization ability of the models is related to the second order information around the local optima, to the best of our knowledge there is no work on how to connect the Hessian matrix with the model generalization. In this section we introduce the assumption about the second-order smoothness, which is later used in our generalization bound.

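[Hessian Lipschitz]

A twice differentiable function $f(\cdot)$ is $\rho$-Hessian Lipschitz if:

$$\|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le \rho\,\|w_1 - w_2\|, \quad \forall w_1, w_2, \quad (4)$$

where the norm on the left hand side is the operator norm.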

The Hessian Lipschitz condition has been used in the numeric optimization community to model the second-order smoothness (Nesterov and Polyak, 2006) (Allen-Zhu and Orecchia, 2014). For the deep models it could be unrealistic to assume the Hessian Lipschitz condition holds for all . Instead we make a local Hessian Lipschitz assumption:

[Local Hessian Lipschitz] Function is -Hessian Lipschitz in , where

is a neighborhood around defined by two positive constants and .

To simplify the notation in the draft we denote .
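To make the local Hessian Lipschitz condition more tangible, the following sketch estimates the constant numerically for a toy function by sampling pairs of points in a box around a reference point. The toy function, the box-shaped neighborhood with a single `radius`, and the helper names are all assumptions made for illustration; they are not the neighborhood or estimator used in this paper. Sampling only gives a lower estimate of the true local constant, which is compared here against an analytic upper bound for the toy function.

```python
import numpy as np

def hessian(w):
    # Hessian of the toy function f(w) = sum(w**4) / 12, which is diag(w**2).
    return np.diag(w ** 2)

def estimate_local_hessian_lipschitz(w_star, radius, num_pairs=500, seed=0):
    """Largest observed ratio ||H(w1) - H(w2)||_op / ||w1 - w2|| in a box."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(num_pairs):
        w1 = w_star + rng.uniform(-radius, radius, size=w_star.shape)
        w2 = w_star + rng.uniform(-radius, radius, size=w_star.shape)
        gap = np.linalg.norm(hessian(w1) - hessian(w2), ord=2)  # operator norm
        dist = np.linalg.norm(w1 - w2)
        if dist > 0:
            best = max(best, gap / dist)
    return best

w_star = np.array([1.0, -2.0, 0.5])
radius = 0.1
rho_hat = estimate_local_hessian_lipschitz(w_star, radius)
# Analytic bound for this toy function: |a**2 - b**2| <= 2 * max(|a|, |b|) * |a - b|.
rho_upper = 2.0 * (np.abs(w_star).max() + radius)
print(rho_hat, rho_upper)
```

The constant depends on the reference point and on the size of the neighborhood, which is why a local rather than global assumption is the natural one for deep models.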

4.2 Connecting Generalization and Hessian

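Suppose the empirical loss function $\hat{L}(w)$ satisfies the local Hessian Lipschitz condition with constant $\rho$. Then by Lemma 1 in (Nesterov and Polyak, 2006), the perturbation of the function around a fixed point $w$ can be bounded by terms up to the third order: for any $u$ such that $w + u$ stays in the neighborhood,

$$\hat{L}(w+u) \le \hat{L}(w) + \nabla\hat{L}(w)^T u + \frac{1}{2} u^T \nabla^2\hat{L}(w)\, u + \frac{\rho}{6}\|u\|^3. \quad (5)$$

For perturbations with zero expectation, i.e., $\mathbb{E}[u] = 0$, the linear term in (5) vanishes in expectation, $\mathbb{E}_u[\nabla\hat{L}(w)^T u] = 0$. Because the perturbations of different parameters are independent, the second-order term can also be simplified:

$$\mathbb{E}_u\left[\frac{1}{2} u^T \nabla^2\hat{L}(w)\, u\right] = \frac{1}{2}\sum_i \nabla^2_{i,i}\hat{L}(w)\,\mathbb{E}[u_i^2], \quad (6)$$

where $\nabla^2_{i,i}$ is simply the $i$-th diagonal element of the Hessian. The following lemma is straightforward given (1), (5), and (6).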

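Suppose the loss function $l(f, x, y) \in [0, 1]$. Let $\pi$ be any distribution on the parameters that is independent of the data. For any $\delta > 0$ and $\eta > 0$, with probability at least $1 - \delta$ over the draw of $n$ samples, for any $w$ such that $\hat{L}$ satisfies the local $\rho$-Hessian Lipschitz condition in a neighborhood of $w$, and any random perturbation $u$ such that $w + u$ stays in that neighborhood, $\mathbb{E}[u] = 0$, and $u_i$ and $u_j$ are independent for any $i \ne j$, we have

$$\mathbb{E}_u[L(w+u)] \le \hat{L}(w) + \frac{1}{2}\sum_i \nabla^2_{i,i}\hat{L}(w)\,\mathbb{E}[u_i^2] + \frac{\rho}{6}\mathbb{E}[\|u\|^3] + \frac{KL(w+u \,\|\, \pi) + \log\frac{1}{\delta}}{\eta} + \frac{\eta}{2n}, \quad (7)$$

where $\nabla^2_{i,i}\hat{L}(w)$ is the $i$-th diagonal element of $\nabla^2\hat{L}(w)$.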

Note that by the extrema of the Rayleigh quotient, the quadratic term on the right hand side of inequality (5) is further bounded by

$$u^T \nabla^2\hat{L}(w)\, u \le \lambda_{\max}\left(\nabla^2\hat{L}(w)\right)\|u\|^2. \quad (8)$$

This is consistent with the empirical observations of Keskar et al. (2016) that the generalization ability of the model is related to the eigenvalues of $\nabla^2\hat{L}(w)$. The inequality (8) still holds even if the perturbations $u_i$ and $u_j$ are correlated. We add another lemma about correlated perturbations in the Appendix (Lemma D).
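The two simplifications of the quadratic term discussed in this subsection can be checked numerically. The sketch below uses a random symmetric positive semi-definite matrix as a stand-in for the Hessian and independent zero-mean uniform perturbations; the matrix, the half-widths `sigma`, and the sample count are assumptions for illustration. It compares a Monte Carlo estimate of the expected quadratic form with the diagonal formula, and with the looser largest-eigenvalue bound.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.normal(size=(m, m))
H = A @ A.T                                  # symmetric PSD stand-in for the Hessian
sigma = np.array([0.1, 0.2, 0.05, 0.3])      # per-coordinate half-widths

# Independent, zero-mean uniform perturbations u_i ~ U(-sigma_i, sigma_i).
u = rng.uniform(-sigma, sigma, size=(200_000, m))
quad_mc = np.einsum("ni,ij,nj->n", u, H, u).mean()

second_moment = sigma ** 2 / 3.0             # E[u_i^2] for U(-sigma_i, sigma_i)
quad_diag = np.sum(np.diag(H) * second_moment)
quad_rayleigh = np.linalg.eigvalsh(H).max() * second_moment.sum()
print(quad_mc, quad_diag, quad_rayleigh)
```

Only the diagonal of the Hessian enters the expectation when the coordinates are independent, which is what makes a per-parameter perturbation level tractable; the eigenvalue bound is never tighter than the diagonal formula.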

4.3 Tradeoff between Sharpness Metric and Generalization Power

If we look at the right hand side of the inequality (7), and compare it with (3) (Neyshabur et al., 2017a), we see


can be interpreted as the sharpness metric of the empirical loss. It is closely related to the Hessian , but it is also related to the perturbation distributions. Figure (2) shows when the perturbation is fixed how can affect the term .

The other term


is related to the model generalization power in the original PAC-Bayes bound.

Ideally we would like both and to be small for better generalization capability. However, generally the perturbation distribution that leads to small tends to have large for a given prior. As we will see in the following sections, in the end we have to make trade-offs between the two terms.

5 Bounded Perturbations

Adding noise to the model for better generalization has proven successful both empirically and theoretically (Zhu et al., 2018) (Hoffer et al., 2017) (Jastrzȩbski et al., 2017) (Dziugaite and Roy, 2017) (Novak et al., 2018a). Instead of only minimizing the empirical loss, (Langford and Caruana, 2001) and (Dziugaite and Roy, 2017) assume different perturbation levels on different parameters, and minimize the generalization bound given by PAC-Bayes for better model generalization. However, how to connect the noise distribution with the local structure of the optima, for example the Hessian, and how that is related to the generalization power, have not been examined.

Since the assumptions in Lemma (4.2) are local, the distributions of interest for the perturbation are necessarily bounded. In this section we investigate two special forms of perturbations, the uniform perturbation and truncated Gaussian, and provide closed-form scale estimation for the perturbation levels.

5.1 Uniform Distribution

Suppose , and

. That is, the “posterior” distribution of the model parameters are uniform distribution, and the distribution supports vary for different parameters. We also assume the perturbed parameters are bounded, i.e.,

.333One may also assume the same for all parameters for a simpler argument. The proof procedure goes through in a similar way. If we choose the priors to be , and then


Note . Also we simplify the third order term in (7) by

where we use the inequality and is the number of parameters. By Lemma (4.2), we get


If we assume is locally convex around so that for all . Solve for that minimizes the right hand side, and we have the following lemma: Suppose the loss function , and model weights are bounded . For any and , with probability at least over the draw of samples, for any such that is locally convex in and satisfies the local -Hessian Lipschitz condition in ,


where are i.i.d. uniformly perturbed random variables, and


In our experiments, we simply treat $\eta$ as a hyper-parameter. On the other hand, one may further build a weighted grid over $\eta$ and optimize for the best $\eta$ (Seldin et al., 2011). In this way we reach the following theorem: Under the conditions of Lemma 5.1, for any , with probability at least over the draw of samples, for any such that in , is locally convex and satisfies the local -Hessian Lipschitz condition,

where are i.i.d. uniformly perturbed random variables, and


Please see the appendix for the details of the proof.
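The trade-off that is solved in closed form in this subsection can be illustrated in one dimension. In the sketch below, a curvature cost `a * sigma**2` competes with the KL divergence between two centered uniform distributions, which equals `log(tau / sigma)` when `sigma <= tau`, divided by `eta`. The coefficient `a` (which lumps together the second- and third-order constants), the values of `eta`, `tau` and `kappa`, and the function names are assumptions for illustration and do not reproduce the exact constants of the lemma.

```python
import numpy as np

def objective(sigma, a, eta, tau):
    # a * sigma**2: curvature ("sharpness") cost of a perturbation of scale sigma.
    # log(tau / sigma) / eta: KL(U(-sigma, sigma) || U(-tau, tau)) divided by eta.
    return a * sigma ** 2 + np.log(tau / sigma) / eta

def best_sigma(a, eta, kappa):
    # Stationary point of the objective, clipped to the allowed range (0, kappa].
    return min(np.sqrt(1.0 / (2.0 * a * eta)), kappa)

a, eta, tau, kappa = 3.0, 50.0, 1.0, 0.5
grid = np.linspace(1e-4, kappa, 200_001)
sigma_grid = grid[np.argmin(objective(grid, a, eta, tau))]
sigma_closed = best_sigma(a, eta, kappa)
print(sigma_grid, sigma_closed)
```

The closed form shows the qualitative behavior used throughout the paper: the preferred perturbation level shrinks like the inverse square root of the curvature coefficient, and it is capped by the size of the neighborhood when the curvature is very small.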

5.2 Truncated Gaussian

Because the Gaussian distribution is not bounded but Lemma (4.2) requires bounded perturbation, we first truncate the distribution. The procedure of truncation is similar to the proof in (Neyshabur et al., 2017b) and (Mcallester, 2003).

Let , where is a diagonal covariance matrix. Denote the truncated Gaussian as . If then


If , by union bound . Here is the inverse Gaussian error function defined as , and is the number of parameters. Following a similar procedure as in the proof of Lemma 1 in (Neyshabur et al., 2017b),


Suppose the coefficients are bounded such that , where is a constant. Choose the prior as , and we have


Notice that after the truncation the variance only becomes smaller, so the bound of (7) for the truncated Gaussian becomes


Again when is convex around such that , solve for the best and we get the following lemma:

Suppose the loss function , and model weights are bounded . For any and , with probability at least over the draw of samples, for any such that in , is convex and satisfies the local -Hessian Lipschitz condition,


where are random variables distributed as truncated Gaussian,


and is the -th diagonal element in .

Again we have an extra term $\eta$, which may be further optimized over a grid to get a tighter bound. In our algorithm we treat $\eta$ as a hyper-parameter instead.
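The claim that truncation only shrinks the variance can be checked with a short simulation. The sketch below draws from a one-dimensional Gaussian restricted to a symmetric interval by rejection sampling; the function name, the choice of `sigma` and `bound`, and the rejection scheme itself are assumptions for illustration, not the construction used in the proof.

```python
import numpy as np

def sample_truncated_gaussian(sigma, bound, size, seed=0):
    """Rejection-sample N(0, sigma^2) restricted to [-bound, bound]."""
    rng = np.random.default_rng(seed)
    out = np.empty(0)
    while out.size < size:
        draw = rng.normal(0.0, sigma, size=size)
        # Keep only the draws that fall inside the truncation interval.
        out = np.concatenate([out, draw[np.abs(draw) <= bound]])
    return out[:size]

sigma, bound = 1.0, 1.5
samples = sample_truncated_gaussian(sigma, bound, size=100_000)
var_truncated = samples.var()
print(var_truncated, sigma ** 2)
```

Since the truncated variance is below the untruncated one, replacing the second moment in the bound by the untruncated variance keeps the inequality valid.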

6 On the Re-parameterization of RELU-MLP

Dinh et al. (2017) point out that the spectrum of the Hessian itself is not enough to determine the generalization power. One particular example is the multi-layer perceptron with RELU as the activation (RELU-MLP). For a two-layer RELU-MLP, denote $w_1$ and $w_2$ as the linear coefficients for the first and second layer. Clearly, for any $\alpha > 0$,

$$f_{w_1, w_2}(x) = f_{\alpha w_1,\, \alpha^{-1} w_2}(x).$$

If cross entropy (negative log likelihood) is used as the loss function, under certain regularization conditions, if , i.e., is the “true” parameter of the sample distribution, the change in Hessian to re-parameterization can be calculated as the outer product of the gradients, in this case


In general our bound does not assume the loss function to be the cross entropy loss. Also we do not assume the model is a RELU-MLP. As a result we would not expect our bound to stay exactly the same under re-parameterization.

On the other hand, the optimal perturbation levels in our bound scales inversely during the scaling of parameters, so the bound only changes approximately with a speed of logarithmic factor. According to Lemma (5.1) and (5.2), if we use the optimal on the right hand side of the bound, , , and are all behind the logarithmic terms. As a consequence, for RELU-MLP, if we do the re-parameterization trick as in Dinh et al. (2017), the change of the bound is small.
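The re-parameterization discussed in this section can be verified directly: scaling the first layer of a two-layer ReLU network by a positive factor and the second layer by its inverse leaves the output unchanged, even though the parameter scales (and hence any curvature measured in parameter space) change. The bias-free network, the layer shapes, and the value of `alpha` below are assumptions for illustration.

```python
import numpy as np

def relu_mlp(x, w1, w2):
    # Two-layer ReLU network without biases: w2 @ relu(w1 @ x).
    return w2 @ np.maximum(w1 @ x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
w1 = rng.normal(size=(8, 5))
w2 = rng.normal(size=(3, 8))
alpha = 10.0

y_original = relu_mlp(x, w1, w2)
# ReLU is positively homogeneous, so the two scalings cancel exactly.
y_rescaled = relu_mlp(x, alpha * w1, w2 / alpha)
print(np.max(np.abs(y_original - y_rescaled)))
```

This is why a sharpness measure that looks only at the Hessian spectrum can be changed arbitrarily without changing the function, and why a bound that also accounts for parameter scales is less sensitive to the rescaling.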

Disclaimer: Sections 7 and 8 are heuristic-based experiments and approximations. They are not rigorous.

Figure 2: Sharpness Metric for , -dimensional case. Fixing the perturbation level, larger leads to larger .

7 An Approximate Generalization Metric

Assuming is locally convex around , so that for all . If we look at Lemma 5.1, for fixed and , the only relevant term is . Replacing the optimal , and using to approximate , we come up with PAC-Bayes based Generalization metric, called pacGen,444Even though we assume the local convexity in our metric, in application we may calculate the metric on every points. When we simply treat it as .

(a) Test Loss - Train Loss (MNIST)
(b) pacGen metric (MNIST)
Figure 3: Generalization gap and the pacGen metric as a function of epochs on MNIST for different batch sizes. SGD is used as the optimizer, and the same learning rate is used for all configurations. As the batch size grows, the metric gets larger. The trend is consistent with the true gap of losses.
(a) Test Loss - Train Loss (CIFAR-10)
(b) pacGen metric (CIFAR-10)
Figure 4: Generalization gap and the pacGen metric as a function of epochs on CIFAR-10 for different batch sizes. SGD is used as the optimizer, and the same learning rate is used for all configurations.

To calculate the metric on real-world data we need to estimate the diagonal elements of the Hessian as well as the Lipschitz constant of the Hessian. For efficiency we follow Adam (Kingma and Ba, 2014) and approximate the diagonal elements of the Hessian by a second-moment estimate of the gradients, smoothed with the same exponential smoothing technique as in (Kingma and Ba, 2014).
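The Adam-style smoothing referred to above can be sketched as an exponential moving average of squared gradients with bias correction. This is only an illustration of the smoothing technique: the decay `beta2 = 0.999` is Adam's commonly used default and is an assumption here, as are the synthetic Gaussian gradients and the helper names.

```python
import numpy as np

def update_second_moment(v, grad, beta2=0.999):
    # Exponential moving average of squared gradients, as in Adam.
    return beta2 * v + (1.0 - beta2) * grad ** 2

def debiased(v, step, beta2=0.999):
    # Adam's bias correction for an average initialised at zero.
    return v / (1.0 - beta2 ** step)

rng = np.random.default_rng(0)
true_scale = np.array([0.5, 2.0])      # per-parameter gradient standard deviation
v = np.zeros(2)
steps = 5000
for t in range(1, steps + 1):
    grad = rng.normal(scale=true_scale)  # synthetic stochastic gradient
    v = update_second_moment(v, grad)
v_hat = debiased(v, steps)
print(v_hat, true_scale ** 2)
```

The running estimate costs one extra vector of memory and no additional backward passes, which is the efficiency argument for using it as a cheap per-parameter curvature proxy.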

To estimate , we first estimate the Hessian of a randomly perturbed model 555In the experiment the gradients are taken w.r.t. instead of , and we ignore the difference between and ., and then approximate by .

We used the same model without dropout from the PyTorch example. We fix the learning rate and vary the batch size for training. The gap between the test loss and the training loss, and the metric, are plotted in Figure 3. We had the same observation as in (Keskar et al., 2016): as the batch size grows, the gap between the test loss and the training loss tends to get larger. Our proposed metric shows the same trend. Note that we do not use LR annealing heuristics as in (Goyal et al., 2017), which enable large batch training.

Similarly, we also carry out an experiment by fixing the training batch size and varying the learning rate. Figure 5 shows the generalization gap and the metric as a function of epochs. We observe that as the learning rate decreases, the gap between the test loss and the training loss increases, and the proposed metric shows a similar trend to the actual generalization gap.

We also run the same model and experiment on CIFAR-10 (Krizhevsky et al., ) just to demonstrate the effectiveness of the metric. We observed similar trends on CIFAR-10 as shown in Figure 4 and Figure 6.

(a) Test Loss - Train Loss (MNIST)
(b) pacGen metric (MNIST)
Figure 5: Generalization gap and the pacGen metric as a function of epochs on MNIST for different learning rates. SGD is used as the optimizer, and the same batch size is used for all configurations. As the learning rate shrinks, the metric gets larger. The trend is consistent with the true gap of losses.
(a) Test Loss - Train Loss (CIFAR-10)
(b) pacGen metric (CIFAR-10)
Figure 6: Generalization gap and the pacGen metric as a function of epochs on CIFAR-10 for different learning rates. SGD is used as the optimizer, and the same batch size is used for all configurations.

8 A Perturbed Optimization Algorithm

The right hand side of (1) contains the perturbed empirical loss. This suggests that rather than minimizing the empirical loss, we should optimize the perturbed empirical loss instead for better model generalization. Adding perturbation to the model is not a new trick. Most of the perturbation-based methods (Zhu et al., 2018) (Hoffer et al., 2017) (Jastrzȩbski et al., 2017) (Novak et al., 2018a) (Khan et al., 2018) are based on heuristic techniques, and improvements in applications have already been observed empirically. Dziugaite and Roy (2017) first propose to optimize for a better perturbation level from the PAC-Bayes bound, but their bound does not make use of second-order information. Also, the best perturbation in (Dziugaite and Roy, 2017) is not in closed form.

In this section we introduce a systematic way to perturb the model weights based on the PAC-Bayes bound. Again we use the same exponential smoothing technique as in Adam (Kingma and Ba, 2014) to estimate the Hessian. To make the algorithm efficient, we ignore the third-order part in the bound (7) so that we do not have to estimate the Lipschitz constant of the Hessian. The details of the algorithm are presented in Algorithm 1, where we treat $\eta$ as a hyper-parameter to be tuned on the validation set.

1:, , , , =1e-5.
2:Initialization: for all . ,
3:for epoch in  do
4:     for minibatch in one epoch do
5:         for all  do
6:              if  then
10:              (sample perturbation)          
11:          (get stochastic gradients w.r.t. perturbed loss)

(update second moment estimate)

13:          (update using off-the-shelf algorithms)
Algorithm 1 Perturbed OPT

Even though in theoretical analysis , in applications, won’t be zero especially when we only implement trial of perturbation. On the other hand, if the gradient is close to zero, then the first order term can be ignored. As a consequence, in (Algorithm 1) we only perturb the parameters that have small gradients whose absolute value is below . For efficiency issues we used a per-parameter capturing the variation of the diagonal element of Hessian. Also we decrease the perturbation level with a log factor as the epoch increases.
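The ingredients described in this section (perturb only parameters with small gradients, use a per-parameter scale driven by a curvature proxy, decay the perturbation with a log factor, then apply a standard update) can be put together in a rough sketch. This is not Algorithm 1: the exact scale formula, the clipping value `max_scale`, the threshold `grad_threshold`, the form of the log decay, the use of plain SGD as the base optimizer, and the quadratic toy loss are all hypothetical choices made so that the example is self-contained.

```python
import numpy as np

def perturbed_step(w, v, grad_fn, epoch, rng, lr=0.05, beta2=0.999,
                   eta=100.0, max_scale=0.05, grad_threshold=0.1, eps=1e-5):
    """One hypothetical perturbed gradient step; every constant is an assumption."""
    g_plain = grad_fn(w)
    # Perturbation half-width: inverse square root of the curvature proxy v,
    # clipped to max_scale and shrunk by a log factor as epochs increase.
    scale = np.minimum(1.0 / np.sqrt(eta * (v + eps)), max_scale)
    scale = scale / np.log(np.e + epoch)
    # Perturb only the coordinates whose gradient magnitude is small.
    small = np.abs(g_plain) < grad_threshold
    u = np.where(small, rng.uniform(-scale, scale), 0.0)
    g = grad_fn(w + u)                        # gradient of the perturbed loss
    v = beta2 * v + (1.0 - beta2) * g ** 2    # second-moment (curvature proxy) update
    return w - lr * g, v                      # plain SGD as the base update

curvature = np.array([1.0, 10.0])
grad_fn = lambda w: curvature * w             # gradient of 0.5 * sum(curvature * w**2)
rng = np.random.default_rng(0)
w, v = np.array([2.0, -1.5]), np.zeros(2)
for epoch in range(300):
    w, v = perturbed_step(w, v, grad_fn, epoch, rng)
final_norm = np.linalg.norm(w)
print(final_norm)
```

On this toy quadratic the iterates still settle near the minimum, with a residual spread set by the bounded noise, which is consistent with the regularization-like effect reported in the experiments.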

We compare the perturbed algorithm against the original optimization method on CIFAR-10, CIFAR-100 (Krizhevsky et al., ), and Tiny ImageNet. The results are shown in Figure 7. We use the Wide-ResNet (Zagoruyko and Komodakis, 2016) as the prediction model. (The depth of the chosen model is 58, and the widen-factor is set as 3. The dropout layers are turned off.) For CIFAR-10 and CIFAR-100, we use Adam with a learning rate of , and the batch size is 128. For the perturbation parameters we use , , and =1e-5. For Tiny ImageNet, we use SGD with learning rate , and the batch size is 156. For the perturbed SGD we set , , and =1e-5. Also we use the validation set as the test set for Tiny ImageNet. We observe that the effect of the perturbation appears similar to regularization. With the perturbation, the accuracy on the training set tends to decrease, but the accuracy on the test or validation set increases.

(a) CIFAR-10
(b) CIFAR-100
(c) Tiny ImageNet
Figure 7: Training and testing accuracy as a function of epochs on CIFAR-10, CIFAR-100 and Tiny ImageNet. For CIFAR, Adam is used as the optimizer, and the learning rate is set as . For the Tiny ImageNet, SGD is used as the optimizer, and the learning rate is set as .

9 Conclusion

We connect the smoothness of the solution with model generalization in the PAC-Bayes framework. We prove that the generalization power of a model is related to the Hessian and the smoothness of the solution, the scales of the parameters, and the number of training samples. In particular, we prove that the best perturbation level scales roughly as the inverse of the square root of the Hessian, which mostly cancels out the scaling effect in the re-parameterization suggested by (Dinh et al., 2017). To the best of our knowledge, this is the first work that integrates the Hessian with model generalization rigorously, and is also the first work explaining the effect of re-parameterization on generalization rigorously. Based on our generalization bound, we propose a new metric to test model generalization and a new perturbation algorithm that adjusts the perturbation levels according to the Hessian. Finally, we empirically demonstrate that the effect of our algorithm is similar to that of a regularizer in its ability to attain better performance on unseen data.

10 Acknowledgement

The authors are grateful to Tengyu Ma, James Bradbury, Yingbo Zhou, and Bryan McCann for their helpful comments and suggestions on the manuscript.


Appendix A Proof of Lemma 5.1

We rewrite the inequality (12) below


The terms related to on the right hand side of (25) are


Since the assumption is for all , . Solving for that minimizes the right hand side of (25), and we have


The term on the right hand side of (12) is monotonically increasing w.r.t. , so


Combine the inequality (28), and the equation (27) with (25), and we complete the proof.

Appendix B Proof of Theorem 5.1

Combining (15) and (12), we get

The following proof is similar to the proof of Theorem 6 in (Seldin et al., 2011). Note the in Lemma (5.1) cannot depend on the data. In order to optimize we need to build a grid of the form

for .

For a given value of , we pick , such that

where is the largest integer value smaller than . Set , and take a weighted union bound over -s with weights , and we have with probability at least ,

Simplify the right hand side and we complete the proof.

Appendix C Proof of Lemma 5.2

We first rewrite the inequality (19) below: