Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

05/30/2018 ∙ by Behnam Neyshabur et al. ∙ Princeton University ∙ Institute for Advanced Study ∙ New York University ∙ Toyota Technological Institute at Chicago

Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.


1 Introduction

Deep neural networks have enjoyed great success in learning across a wide variety of tasks. They played a crucial role in the seminal work of Krizhevsky et al. [12], starting an arms race of training larger networks with more hidden units in pursuit of better test performance [10]. In fact, the networks used in practice are over-parametrized to the extent that they can easily fit random labels to the data [27]. Even though they have such high capacity, when trained with real labels they achieve small generalization error.

Traditional wisdom in learning suggests that using models of increasing capacity will result in overfitting to the training data. Hence the capacity of a model is generally controlled either by limiting its size (number of parameters) or by adding explicit regularization, to prevent overfitting to the training data. Surprisingly, in the case of neural networks we notice that increasing the model size only improves the generalization error, even when the networks are trained without any explicit regularization such as weight decay or early stopping [13, 26, 21]. In particular, Neyshabur et al. [21] observed that training models with an increasing number of hidden units leads to a decrease in the test error for image classification on MNIST and CIFAR-10. Similar empirical observations have been made over a wide range of architectural and hyper-parameter choices [15, 24, 14]. What explains this improvement in generalization with over-parametrization? What is the right measure of complexity of neural networks that captures this generalization phenomenon?

Complexity measures that depend on the total number of parameters of the network, such as VC bounds, do not capture this behavior as they increase with the size of the network. Neyshabur et al. [20], Keskar et al. [11], Neyshabur et al. [22], Bartlett et al. [4], Neyshabur et al. [23], Golowich et al. [7] and Arora et al. [1] suggested different norm, margin and sharpness based measures of the capacity of neural networks, in an attempt to explain the generalization behavior observed in practice. In particular, Bartlett et al. [4] showed a margin based generalization bound that depends on the spectral norm and the (2,1) norm of the layers of a network. However, as shown in Neyshabur et al. [22] and in Figure 5, these complexity measures fail to explain why over-parametrization helps, and in fact increase with the size of the network. Dziugaite and Roy [6] numerically evaluated a generalization bound based on PAC-Bayes; their reported numerical bounds also increase with the network size. These existing complexity measures increase with the size of the network because they either depend on the number of hidden units explicitly, or their norm terms implicitly grow with the number of hidden units for the networks used in practice [22] (see Figures 3 and 5).

Figure 1: Over-parametrization phenomenon. Left panel: Training the pre-activation ResNet18 architecture of different sizes on the CIFAR-10 dataset. We observe that even after the network is large enough to completely fit the training data (reference line), the test error continues to decrease for larger networks. Middle panel: Training a fully connected feedforward network with a single hidden layer on CIFAR-10. We observe the same phenomenon as in the ResNet18 architecture. Right panel: Unit capacity captures the complexity of a hidden unit and unit impact captures the impact of a hidden unit on the output of the network; both are important factors in our capacity bound (Theorem 1). We observe empirically that both unit capacity and unit impact shrink at a rate faster than 1/√h, where h is the number of hidden units. Please see Supplementary Section A for the experiment settings.

To study and analyze this phenomenon more carefully, we need to simplify the architecture while making sure that the property of interest is preserved after the simplification. We therefore choose two layer ReLU networks, since as shown in the left and middle panels of Figure 1, they exhibit the same behavior with over-parametrization as the more complex pre-activation ResNet18 architecture. In this paper we prove a tighter generalization bound (Theorem 2) for two layer ReLU networks. Our capacity bound, unlike existing bounds, correlates with the test error and decreases with the increasing number of hidden units. Our key insight is to characterize complexity at the unit level, and as we see in the right panel of Figure 1, these unit level measures shrink at a rate faster than 1/√h for each hidden unit, decreasing the overall measure as the network size increases. When measured in terms of layer norms, our generalization bound depends on the Frobenius norm of the top layer and the Frobenius norm of the difference of the hidden layer weights from the initialization, which decreases with increasing network size (see Figure 2).

The closeness of the learned weights to the initialization in the over-parametrized setting can be understood by considering the limiting case as the number of hidden units goes to infinity, as considered in Bengio et al. [5] and Bach [2]. In this extreme setting, just training the top layer of the network, which is a convex optimization problem for convex losses, will minimize the training error, as the randomly initialized hidden layer already contains all possible features. Intuitively, the large number of hidden units represents all possible features, and the optimization problem amounts to picking the right features that minimize the training loss. This suggests that as we over-parametrize the network, the optimization algorithm needs to do less work in tuning the weights of the hidden units to find the right solution. Dziugaite and Roy [6] have indeed numerically evaluated a PAC-Bayes measure from the initialization used by the algorithms and observe that the Euclidean distance to the initialization is smaller than the Frobenius norm of the parameters. Nagarajan and Kolter [18] make a similar empirical observation on the significant role of initialization, and in fact prove an initialization dependent generalization bound for linear networks; however, they do not prove a similar bound for neural networks. Alternatively, Liang et al. [15] suggested a Fisher-Rao metric based complexity measure that correlates with generalization behavior in larger networks, but they also prove their capacity bound only for linear networks.
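To make this intuition concrete, the following toy sketch (our own illustration, using a small random regression problem rather than the paper's classification setup, with arbitrarily chosen dimensions) shows that once a frozen, randomly initialized hidden layer provides enough features, fitting only the top layer, which is just a convex least squares problem here, already drives the training error to essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 20, 5000                          # many more random features than samples
X = rng.standard_normal((n, d))                  # toy inputs
y = rng.standard_normal(n)                       # toy regression targets
U0 = rng.standard_normal((h, d)) / np.sqrt(d)    # frozen, randomly initialized hidden layer
features = np.maximum(X @ U0.T, 0.0)             # random ReLU features, shape (n, h)

# Training only the top layer is a convex problem; here plain least squares.
v, *_ = np.linalg.lstsq(features, y, rcond=None)
print(np.linalg.norm(features @ v - y))          # training error is numerically near zero
```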

Contributions: Our contributions in this paper are as follows.

  • We empirically investigate the role of over-parametrization in the generalization of neural networks on three different datasets (MNIST, CIFAR-10 and SVHN), and show that the existing complexity measures increase with the number of hidden units and hence do not explain the generalization behavior with over-parametrization.

  • We prove tighter generalization bounds (Theorems 2 and 5) for two layer ReLU networks. Our proposed complexity measure actually decreases with the increasing number of hidden units, and can potentially explain the effect of over-parametrization on generalization of neural networks.

  • We provide a matching lower bound for the Rademacher complexity of two layer ReLU networks. Our lower bound considerably improves over the best known bound given in Bartlett et al. [4], and to our knowledge is the first such lower bound that is larger than the Lipschitz constant of the network class.

1.1 Preliminaries

We consider two layer fully connected ReLU networks with input dimension d, output dimension c, and h hidden units. The output of such a network is f(x) = V[Ux]_+, where U ∈ ℝ^{h×d}, V ∈ ℝ^{c×h} and [z]_+ = max(z, 0) is applied elementwise. (Since the number of bias parameters is negligible compared to the size of the network, we drop the bias parameters to simplify the analysis; moreover, one can model the bias parameters in the first layer by adding an extra input dimension with value 1.) We denote the incoming weights to hidden unit i by u_i and the outgoing weights from hidden unit i by v_i. Therefore u_i corresponds to row i of the matrix U and v_i corresponds to column i of the matrix V.

We consider the c-class classification task where the label with the maximum output score is selected as the prediction. Following Bartlett et al. [4], we define the margin operator μ as a function that, given the scores f(x) ∈ ℝ^c for each label and the correct label y, returns the difference between the score of the correct label and the maximum score among the other labels, i.e. μ(f(x), y) = f(x)[y] − max_{j ≠ y} f(x)[j]. We now define the ramp loss as follows:

ℓ_γ(f(x), y) = 1 if μ(f(x), y) < 0;  1 − μ(f(x), y)/γ if 0 ≤ μ(f(x), y) ≤ γ;  0 if μ(f(x), y) > γ.   (1)

For any distribution D and margin γ > 0, we define the expected margin loss of a predictor f as L_γ(f) = E_{(x,y)∼D}[ℓ_γ(f(x), y)]. The loss defined this way is bounded between 0 and 1. We use L̂_γ(f) to denote the empirical estimate of the above expected margin loss on the training set. As setting γ = 0 reduces the above to the classification loss, we will use L_0(f) and L̂_0(f) to refer to the expected risk and the training error respectively.
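As a quick reference for the objects defined above, here is a minimal NumPy sketch of the network output, the margin operator and the ramp loss; the function names and array layout (U of shape (h, d), V of shape (c, h)) are our own choices matching the notation in this section.

```python
import numpy as np

def forward(U, V, x):
    """Two layer ReLU network f(x) = V [Ux]_+ with U of shape (h, d), V of shape (c, h)."""
    return V @ np.maximum(U @ x, 0.0)

def margin(scores, y):
    """Score of the correct label minus the largest score among the other labels."""
    others = np.delete(scores, y)
    return scores[y] - others.max()

def ramp_loss(scores, y, gamma):
    """Ramp loss: 1 for negative margin, 0 for margin above gamma, linear in between."""
    m = margin(scores, y)
    if m < 0:
        return 1.0
    if m > gamma:
        return 0.0
    return 1.0 - m / gamma
```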

2 Generalization of Two Layer ReLU Networks

Figure 2: Properties of two layer ReLU networks trained on CIFAR-10. We report different measures on the trained network. From left to right: measures on the second (output) layer, measures on the first (hidden) layer, distribution of angles of the trained weights to the initial weights in the first layer, and the distribution of unit capacities of the first layer. "Distance" in the first two plots is the distance from initialization in Frobenius norm.

Let ℓ_γ ∘ F = { ℓ_γ(f(x), y) : f ∈ F } denote the function class corresponding to the composition of the loss function ℓ_γ and functions from a class F. With probability 1 − δ over the choice of a training set S of size m, the following generalization bound holds for any function f ∈ F [17, Theorem 3.1]:

L_0(f) ≤ L̂_γ(f) + 2 R_S(ℓ_γ ∘ F) + 3 √( ln(2/δ) / (2m) ),   (2)

where R_S(H) is the Rademacher complexity of a class H of functions with respect to the training set S, defined as:

R_S(H) = E_{ξ ∈ {±1}^m} [ sup_{h ∈ H} (1/m) Σ_{i=1}^m ξ_i h(x_i) ].   (3)

The Rademacher complexity is a capacity measure that captures the ability of functions in a class to fit random labels; it increases with the complexity of the class.
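For intuition, definition (3) can be approximated by Monte Carlo sampling of the sign vector when the function class is finite; the sketch below is a generic illustration of the definition (the function and parameter names are ours) and is not the bounding technique used in this paper.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=1000, seed=0):
    """Monte Carlo estimate of definition (3) for a finite function class, given
    an array `outputs` of shape (num_functions, m) with outputs[k, i] = f_k(x_i)."""
    rng = np.random.default_rng(seed)
    m = outputs.shape[1]
    total = 0.0
    for _ in range(n_draws):
        xi = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.max(outputs @ xi) / m      # sup over the (finite) class
    return total / n_draws
```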

2.1 An Empirical Investigation

We will bound the Rademacher complexity of neural networks to get a bound on the generalization error. Since the Rademacher complexity depends on the function class considered, we need to choose the right function class, one that captures only the networks actually obtained by training, which is potentially much smaller than the class of networks with all possible weights, to get a complexity measure that explains the decrease in generalization error with increasing width. Choosing a bigger function class can result in weaker bounds that do not capture this phenomenon. Towards that, we first investigate the behavior of different measures of the network layers with an increasing number of hidden units. The experiments discussed below are done on the CIFAR-10 dataset; please see Section A for similar observations on the SVHN and MNIST datasets.

First layer: As we see in the second panel of Figure 2, even though the spectral and Frobenius norms of the learned layer decrease initially, they eventually increase with h, with the Frobenius norm increasing at a faster rate. However, the Frobenius distance to initialization ‖U − U^0‖_F decreases. This suggests that the increase in the Frobenius norm of the weights in larger networks is due to the increase in the Frobenius norm of the random initialization. To understand this behavior in more detail, we also plot the distance to initialization per unit and the distribution of angles between the learned weights and the initial weights in the last two panels of Figure 2. We indeed observe that the per unit distance to initialization decreases with increasing h, and a significant shift in the distribution of angles to the initial weights, from being almost orthogonal in small networks to almost aligned in large networks. This per unit distance to initialization is a key quantity that appears in our capacity bounds, and we refer to it as unit capacity in the remainder of the paper.
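The claim that the growth of the Frobenius norm is driven by the random initialization itself is easy to see numerically; the sketch below uses a generic 1/√d-scaled Gaussian initialization, which is an assumption on our part rather than the exact initialization used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3072                                            # CIFAR-10 input dimension
for h in [64, 1024, 16384]:
    U0 = rng.standard_normal((h, d)) / np.sqrt(d)   # generic scaled random initialization
    print(h, np.linalg.norm(U0))                    # Frobenius norm grows roughly like sqrt(h)
```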

Unit capacity. We define β_i = ‖u_i − u_i^0‖_2 as the unit capacity of hidden unit i, where u_i^0 is the initialization of the incoming weights of that unit.

Second layer: Similar to the first layer, we look at the behavior of different measures of the second layer of the trained networks with increasing h in the first panel of Figure 2. Here, unlike the first layer, we notice that the Frobenius norm and the distance to initialization both decrease and are quite close, suggesting a limited role of the initialization for this layer. Moreover, as the size grows, since the Frobenius norm of the second layer slightly decreases, we can argue that the norm of the outgoing weights of a hidden unit decreases at a rate faster than 1/√h. If we think of each hidden unit as a linear separator and the top layer as an ensemble over these classifiers, this means the impact of each classifier on the final decision shrinks at a rate faster than 1/√h. This per unit measure again plays an important role, and we define it as the unit impact for the remainder of this paper.

Unit impact. We define α_i = ‖v_i‖_2 as the unit impact, the magnitude of the outgoing weights of hidden unit i.
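Both per-unit quantities are straightforward to read off from the trained and initial weight matrices; a minimal sketch, with array shapes following the notation above:

```python
import numpy as np

def unit_capacities(U, U0):
    """beta_i = ||u_i - u_i^0||_2: distance of each unit's incoming weights (rows of U)
    from their value at initialization."""
    return np.linalg.norm(U - U0, axis=1)

def unit_impacts(V):
    """alpha_i = ||v_i||_2: norm of each unit's outgoing weights (columns of V)."""
    return np.linalg.norm(V, axis=0)
```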

Motivated by our empirical observations, we consider the following class of two layer neural networks that depends on the capacity and impact of the hidden units of a network. Let W be the following restricted set of parameters:

W = { (V, U) : V ∈ ℝ^{c×h}, U ∈ ℝ^{h×d}, ‖v_i‖_2 ≤ α_i, ‖u_i − u_i^0‖_2 ≤ β_i for every i ∈ [h] }.   (4)

We now consider the hypothesis class of neural networks represented using parameters in the set W:

F_W = { f(x) = V[Ux]_+ : (V, U) ∈ W }.   (5)

Our empirical observations indicate that the networks we learn from real data have bounded unit capacity and unit impact, and therefore studying the generalization behavior of the above function class can potentially provide a better understanding of these networks. Given the above function class, we will now study its generalization properties.
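Under our reconstruction of the set in (4), membership amounts to checking the two per-unit norm constraints; the helper below is a hypothetical illustration, with alpha and beta passed as arrays of the per-unit bounds α_i and β_i.

```python
import numpy as np

def in_restricted_class(U, V, U0, alpha, beta):
    """Check that every hidden unit's outgoing norm is at most alpha_i and its
    distance to initialization at most beta_i, i.e. that (V, U) lies in the set W."""
    impacts = np.linalg.norm(V, axis=0)          # alpha-type quantities
    capacities = np.linalg.norm(U - U0, axis=1)  # beta-type quantities
    return bool(np.all(impacts <= alpha) and np.all(capacities <= beta))
```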

2.2 Generalization Bound

In this section we prove a generalization bound for two layer ReLU networks. We first bound the Rademacher complexity of the class F_W in terms of the sum over the hidden units of the product of unit capacity and unit impact. Combining this with equation (2) gives a generalization bound.

Theorem 1.

Given a training set S = {x_i}_{i=1}^m and margin γ > 0, the Rademacher complexity of the composition of the loss function ℓ_γ over the class F_W defined in equations (4) and (5) is bounded as follows:

(6)
(7)

The proof is given in the supplementary Section B. The main idea behind the proof is a new technique to decompose the complexity of the network into the complexity of its hidden units. To our knowledge, all previous works decompose the complexity into that of the layers and use the Lipschitz property of the network to bound the generalization error. However, Lipschitzness of a layer is a rather weak property that ignores the linear structure of each individual layer. Instead, by decomposing the complexity across the hidden units, we get the above tighter bound on the Rademacher complexity of the networks.
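As a rough illustration of the quantity the bound is organized around, the sum over hidden units of unit capacity times unit impact, one can tabulate it directly from the weights; the sketch below deliberately omits the data-dependent factors, margin and constants of the actual theorem and is only meant to show the unit-wise structure.

```python
import numpy as np

def unitwise_complexity(U, U0, V):
    """Sum over hidden units of (unit impact) * (unit capacity): the unit-wise quantity
    the bound is organized around; data-dependent factors and constants are omitted."""
    capacities = np.linalg.norm(U - U0, axis=1)   # beta_i
    impacts = np.linalg.norm(V, axis=0)           # alpha_i
    return float(np.sum(impacts * capacities))
```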

The generalization bound in Theorem 1 holds for any function in the function class defined by a specific choice of α and β fixed before the training procedure. To get a generalization bound that holds for all networks, we need to cover the space of possible values for α and β and take a union bound over it. The following theorem states the generalization bound for any two layer ReLU network (for the statement with exact constants see Lemma 13 in Supplementary Section B).

Theorem 2.

For any h, γ > 0 and δ ∈ (0, 1), with probability 1 − δ over the choice of the training set of size m, for any function f(x) = V[Ux]_+, the generalization error is bounded as follows:

The extra additive factor is the result of taking the union bound over the cover of α and β. As we see in Figure 5, in the regimes of interest this new additive term is small and does not dominate the first term. While we show via an explicit lower bound (Theorem 3) that the dependence in the first term cannot be avoided, the additive term with its dependence on h might be just an artifact of our proof. In Section 4 we present a tighter bound based on a different norm of the weights which removes the extra additive term for large h, at the price of a weaker first term.

2.3 Comparison with Existing Results

# Reference Measure
(1) Harvey et al. [9]
(2) Bartlett and Mendelson [3]
(3) Neyshabur et al. [20], Golowich et al. [7]
(4) Bartlett et al. [4], Golowich et al. [7]
(5) Neyshabur et al. [23]
(6) Theorem 2
Table 1: Comparison with the existing generalization measures presented for the case of two layer ReLU networks with constant number of outputs and constant margin.
Figure 3: Behavior of terms presented in Table 1 with respect to the size of the network trained on CIFAR-10.

In Table 1 we compare our result with the existing generalization bounds, presented for the simpler setting of two layer networks. In comparison with the bound of [4, 7] (row (4) of Table 1): the first term in their bound is of smaller magnitude and behaves roughly similarly to the first term in our bound (see the last two panels of Figure 3). The key complexity term in their bound is the sum of the unit capacities Σ_i ‖u_i − u_i^0‖_2, while in our bound it is the Frobenius distance ‖U − U^0‖_F, for the range of h considered. The spectral norm of the top layer in their bound and the Frobenius norm in ours differ by at most a factor depending on the number of classes, a small constant, and hence behave similarly. However, Σ_i ‖u_i − u_i^0‖_2 can be as big as √h ‖U − U^0‖_F when most hidden units have similar capacity, and the two are only similar for really sparse networks. In fact their bound increases with h mainly because of this term. As we see in the first and second panels of Figure 3, the sums of norms over hidden units appearing in Bartlett and Mendelson [3], Bartlett et al. [4] and Golowich et al. [7] increase with the number of units, as the hidden layers learned in practice are usually dense.
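The √h gap mentioned above, between summing per-unit norms and taking the corresponding Frobenius norm, can be seen with a toy dense matrix standing in for a learned layer; this is our own numeric illustration, not a computation of the cited bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
for h in [64, 1024, 16384]:
    W = rng.standard_normal((h, 100))                      # h units with similar row norms
    sum_of_unit_norms = np.linalg.norm(W, axis=1).sum()    # ell_1 aggregation over units
    frobenius = np.linalg.norm(W)                          # ell_2 aggregation over units
    print(h, sum_of_unit_norms / frobenius)                # ratio grows roughly like sqrt(h)
```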

Figure 4: First panel: Training and test errors of fully connected networks trained on SVHN. Second panel: Unit-wise properties measured on a two layer network trained on the SVHN dataset. Third panel: Number of epochs required to reach 0.01 cross-entropy loss. Fourth panel: Comparison of the distribution of margins of data points, normalized by our capacity measure, for networks trained on true labels and a network trained on random labels.

Figure 5: Left panel: Comparing capacity bounds on CIFAR10 (unnormalized). Middle panel: Comparing capacity bounds on CIFAR10 (normalized). Right panel: Comparing capacity bounds on SVHN (normalized).

Neyshabur et al. [20] and Golowich et al. [7] showed a bound depending on the product of the Frobenius norms of the layers, which increases with h, showing the important role of initialization in our bounds. In fact, the proof technique of Neyshabur et al. [20] does not allow for a bound with norms measured from the initialization, and our new decomposition approach is the key to the tighter bound.

Experimental comparison. We train two layer ReLU networks on the CIFAR-10 and SVHN datasets, with the number of hidden units h doubling from one architecture to the next over a wide range of sizes (see Section A.1). The training and test errors for CIFAR-10 are shown in the first panel of Figure 1, and for SVHN in the left panel of Figure 4. We observe for both datasets that even though a network of size 128 is enough to reach zero training error, networks with sizes well beyond 128 still generalize better, even when trained without any regularization. We further measure the unit-wise properties introduced in the paper, namely unit capacity and unit impact. These quantities decrease with increasing h, and are reported in the right panel of Figure 1 and the second panel of Figure 4. Also notice that the number of epochs required for each network size to reach 0.01 cross-entropy loss decreases for larger networks, as shown in the third panel of Figure 4.

For the same experimental setup, Figure 5 compares the behavior of different capacity bounds over networks of increasing size. Generalization bounds typically scale as √(C/m), where C is the effective capacity of the function class. The left panel reports the effective capacity based on different measures, calculated with all the terms and constants. We can see that our bound is the only one that decreases with h and is consistently lower than the other norm-based, data-independent bounds. Our bound even improves over the VC-dimension bound for networks with size larger than 1024. While the actual numerical values are very loose, we believe they are useful tools to understand the relative generalization behavior with respect to different complexity measures, and in many cases applying a set of data-dependent techniques can improve the numerical values of these bounds significantly [6, 1]. In the middle and right panels we present each capacity bound normalized by its maximum over the range of the study, for networks trained on CIFAR-10 and SVHN respectively. For both datasets, our capacity bound is the only one that decreases with the size, even for networks with millions of parameters. All other existing norm-based bounds initially decrease for smaller networks but then increase significantly for larger networks. Our capacity bound therefore could potentially point to the right properties that allow over-parametrized networks to generalize.

Finally, we check the behavior of our complexity measure in a different setting, where we compare the measure between networks trained on real and random labels [22, 4]. We plot the distribution of margins normalized by our measure, computed on networks trained with true and random labels, in the last panel of Figure 4, and as expected they correlate well with the generalization behavior.

3 Lower Bound

In this section we prove a Rademacher complexity lower bound for neural networks matching the dominant term in the upper bound of Theorem 1. We show our lower bound on a smaller function class than F_W, with an additional constraint on the spectral norm of the hidden layer, as this allows for comparison with the existing results, and the bound extends to the bigger class F_W as well.

Theorem 3.

Define the parameter set

and let be the function class defined on by equation (5). Then, for any , and , there exists , such that

The proof is given in the supplementary Section B.3. Clearly this class is contained in F_W since it has an extra constraint. The above lower bound matches the first term in the upper bound of Theorem 1, up to a factor which comes from the Lipschitz constant of the ramp loss. Moreover, in the special case identified in the proof, the function class is equivalent to a linear function class, and therefore the above lower bound also holds there. This shows that the upper bound provided in Theorem 1 is tight. It also indicates that even if we have more information, such as the spectral norm with respect to the reference matrix being small (which effectively bounds the Lipschitz constant of the network), we still cannot improve our upper bound.

To our knowledge, all previous capacity lower bounds for spectral norm bounded classes of neural networks correspond to the Lipschitz constant of the network. Our lower bound strictly improves over this, and shows a gap between the Lipschitz constant of the network (which can be achieved even by linear models) and the capacity of neural networks. This lower bound is non-trivial in the sense that the smaller function class excludes the neural networks with all rank-1 matrices as weights, and thus shows a gap between neural networks with and without ReLU; the lower bound therefore does not hold for linear networks. Finally, one can extend the construction in this bound to more layers by setting all weight matrices in the intermediate layers to be the identity matrix.

Comparison with existing results.

In particular, Bartlett et al. [4] proved a lower bound for the function class defined by the parameter set

(8)

Note that this quantity is a Lipschitz bound for the function class.
Choosing the parameters in Theorem 3 according to these bounds, we get the following result.

Corollary 4.

, , such that .

Hence our result improves the lower bound in Bartlett et al. [4] by a factor of . Theorem 7 in Golowich et al. [7] also gives a lower bound, where c denotes the number of outputs of the network, for the composition of a 1-Lipschitz loss function and neural networks with bounded spectral norm or Schatten norm. Our result above improves on this lower bound as well.

4 Generalization for Extremely Large Values of h

In this section we present a tighter bound that reduces the influence of the additive term in Theorem 2 and decreases even for larger values of h. The main new ingredient in the proof is Lemma 10, in which we construct a cover of the norm ball with entrywise dominance.

Theorem 5.

For any h, γ > 0 and δ ∈ (0, 1), with probability 1 − δ over the choice of the training set of size m, for any function f(x) = V[Ux]_+, the generalization error is bounded as follows:

where the norm in the bound above is a norm taken over the row norms.

In contrast to Theorem 2, the additive term is replaced by one that improves on the additive term of Theorem 2 for large h. However, the norms in the first term are replaced by larger norms of the weights; these upper bound the corresponding norms in Theorem 2 and are of the same order if all rows of the weight matrix have the same norm, hence giving a tighter bound that continues to decrease with h for larger values. In particular, we get the following bound.

Corollary 6.

Under the settings of Theorem 5, with probability 1 − δ over the choice of the training set of size m, for any function f(x) = V[Ux]_+, the generalization error is bounded as follows:

5 Discussion

In this paper we present a new capacity bound for neural networks that decreases with an increasing number of hidden units, and could potentially explain the better generalization performance of larger networks. However, our results are currently limited to two layer networks, and it is of interest to understand and extend them to deeper networks. Also, while these bounds are useful for relative comparison between networks of different sizes, their absolute values are still much larger than the number of training samples, and it is of interest to obtain smaller bounds. Finally, we provide a matching lower bound for the capacity, improving on the existing lower bounds for neural networks.

In this paper we do not address the question of whether optimization algorithms converge to low complexity networks in the function class considered here, or more generally how different hyper-parameter choices affect the complexity of the recovered solutions. It is interesting to understand the implicit regularization effects of the optimization algorithms [19, 8, 25] for neural networks, which we leave for future work.

Acknowledgements

The authors thank Sanjeev Arora for many fruitful discussions on generalization of neural networks and David McAllester for discussion on the distance to random initialization. This research was supported in part by NSF IIS-RI award 1302662 and Schmidt Foundation.

References

Appendix A Experiments

A.1 Experiment Settings

Below we describe the setting for each reported experiment.

ResNet18

In this experiment, we trained a pre-activation ResNet18 architecture on the CIFAR-10 dataset. The architecture consists of a convolution layer followed by 8 residual blocks (each of which consists of two convolutions) and a linear layer on top. Let k be the number of channels in the first convolution layer; the number of output channels and the strides of the residual blocks are then set accordingly. We use kernel size 3 in all convolutional layers. We train 11 architectures corresponding to different choices of k. In each experiment we train using SGD with mini-batch size 64, momentum 0.9 and initial learning rate 0.1, where we reduce the learning rate to 0.01 when the cross-entropy loss reaches 0.01, and stop when the loss reaches 0.001 or the number of epochs reaches 1000. We use the reference line in the plots to differentiate the architectures that achieved 0.001 loss. We do not use weight decay or dropout, but perform data augmentation by random horizontal flips of the image and random crops followed by zero padding.

Two Layer ReLU Networks

We trained fully connected feedforward networks with a single hidden layer on the CIFAR-10, SVHN and MNIST datasets. For each dataset, we trained 13 architectures, each time increasing the number of hidden units by a factor of 2. For each experiment, we trained the network using SGD with mini-batch size 64, momentum 0.9 and a fixed step size of 0.01 for MNIST and 0.001 for CIFAR-10 and SVHN. We did not use weight decay, dropout or batch normalization. For each experiment, we stopped the training when the cross-entropy loss reached 0.01 or when the number of epochs reached 1000. We use the reference line in the plots to differentiate the architectures that achieved 0.01 loss.
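For concreteness, a minimal PyTorch sketch of this training setup is given below, assuming CIFAR-10 via torchvision and a single width value h; the data transform and other minor details are our own simplifications rather than the exact setup used in the experiments.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

h = 1024                                   # number of hidden units (one of the 13 widths)
device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, h), nn.ReLU(),
                      nn.Linear(h, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # fixed step size
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1000):                  # stop after 1000 epochs or at 0.01 loss
    running, count = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        running += loss.item() * x.size(0)
        count += x.size(0)
    if running / count <= 0.01:            # stopping criterion from the text
        break
```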

Evaluations

For each generalization bound, we calculated the exact bound including the log terms and constants. We set the margin to a fixed percentile of the margins of the data points. Since the bounds in [3] and [21] are given for binary classification, we scaled them so that the bound increases linearly with the number of classes (assuming that all output units have the same norm). Furthermore, since reference matrices can be used in the bounds given in [4] and [23], we used the random initialization as the reference matrix. When plotting distributions, we estimate the distribution using standard Gaussian kernel density estimation.
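The kernel density estimates can be reproduced, for example, with SciPy's standard Gaussian KDE; the margins array below is a random placeholder rather than values from the experiments.

```python
import numpy as np
from scipy.stats import gaussian_kde

margins = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=5000)  # placeholder data
kde = gaussian_kde(margins)            # standard Gaussian kernel density estimate
grid = np.linspace(margins.min(), margins.max(), 200)
density = kde(grid)                    # estimated density, e.g. for plotting
```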

A.2 Supplementary Figures

Figures 6 and 7 show the behavior of several measures on networks of different sizes trained on the SVHN and MNIST datasets respectively. The left panel of Figure 8 shows the over-parametrization phenomenon on the MNIST dataset, and the middle and right panels compare our generalization bound to others.

Figure 6: Different measures on fully connected networks with a single hidden layer trained on SVHN. From left to right: measures on the output layer, measures on the first layer, distribution of angles of the trained weights to the initial weights in the first layer, and singular values of the first layer.

Figure 7: Different measures on fully connected networks with a single hidden layer trained on MNIST. From left to right: measures on the output layer, measures on the first layer, distribution of angles of the trained weights to the initial weights in the first layer, and singular values of the first layer.
Figure 8: Left panel: Training and test errors of fully connected networks trained on MNIST. Middle panel: Comparing capacity bounds on MNIST (normalized). Right panel: Comparing capacity bounds on MNIST (unnormalized).

Appendix B Proofs

B.1 Proof of Theorem 1

We start by stating a simple lemma which is a vector-contraction inequality for Rademacher complexities and relates the norm of a vector to the expected magnitude of its inner product with a vector of Rademacher random variables. We use the following technical result from Maurer [16] in our proof.

Lemma 7 (Proposition 6 of Maurer [16]).

Let ξ_1, …, ξ_m be Rademacher random variables. For any vector v, the following holds:

The above lemma is useful for handling Rademacher complexities in multi-class settings. The lemma below bounds a Rademacher-like complexity term for linear operators with multiple outputs, centered around a reference matrix. The proof is simple and similar to that for linear separators; see [3] for similar arguments.

Lemma 8.

For any positive integer , positive scalar , reference matrix and set , the following inequality holds:

Proof.

follows from Jensen's inequality. ∎

We next show that the Rademacher complexity of the class of networks defined in (4) and (5) can be decomposed into that of the hidden units.

Lemma 9 (Rademacher Decomposition).

Given a training set S = {x_i}_{i=1}^m and γ > 0, the Rademacher complexity of the class F_W defined in equations (4) and (5) is bounded as follows:

Proof.

Let . We prove the lemma by showing the following statement by induction on :

where for simplicity of the notation, we let .

The above statement holds trivially for the base case of by the definition of the Rademacher complexity (3). We now assume that it is true for any and prove it is true for .

(9)

The last inequality follows from the Lipschitzness of the ramp loss. The ramp loss is Lipschitz with respect to each dimension, but since the loss at each point depends only on the score of the correct label and the maximum score among the other labels, it is -Lipschitz.

Using the triangle inequality we can bound the first term in the above bound as follows.

(10)

We will now add and subtract the initialization terms.

(11)

From equations (9), (10), (11) and Lemma 7 we get,

(12)

This completes the induction proof.

Hence the induction step at gives us:

Proof of Theorem 1.

Using Lemma 8, we can bound the right hand side of the upper bound on the Rademacher complexity given in Lemma 9:

B.2 Proof of Theorems 2 and 5

We start with the following covering lemma, which allows us to prove the generalization bound in Theorem 5 without assuming knowledge of the norms of the network parameters. The lemma shows how to cover a norm ball with a set that dominates its elements entrywise, and bounds the size of one such cover.

Lemma 10 ( covering lemma).

Given any , , consider the set . Then there exist sets of the form such that and where and

Proof.

We prove the lemma by construction. Consider the set . For any , consider such that for any ,