On Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics

10/08/2018 · by Weizhi Zhu, et al. · The Hong Kong University of Science and Technology

Margin enlargement over training data has been an important strategy since the perceptron era in machine learning, for the purpose of boosting the robustness of classifiers toward a good generalization ability. Yet Breiman (1999) shows a dilemma: a uniform improvement of the margin distribution does not necessarily reduce generalization errors. In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins. A novel perspective is provided to explain Breiman's dilemma based on phase transitions in the dynamics of normalized margin distributions, which reflect the trade-off between the expressive power of models and the complexity of data. When data complexity is comparable to the model expressiveness, in the sense that both training and test data share similar phase transitions in normalized margin dynamics, two efficient ways are derived to predict the trend of generalization or test error via classic margin-based generalization bounds with restricted Rademacher complexities. On the other hand, over-expressive models that exhibit uniform improvements on training margins, as a phase transition distinct from that of the test margin dynamics, may lose such a prediction power and fail to prevent overfitting. Experiments are conducted to show the validity of the proposed method with some basic convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets including CIFAR10/100 and mini-ImageNet.


1 Introduction

Margin, as a measurement of the robustness that allows some perturbation of a classifier without changing its decision on the training data, has a long history in characterizing the performance of classification algorithms in machine learning. As early as Novikoff (1962), it played a central role in the proof of finite-stopping or convergence of the perceptron algorithm when the training data are separable. Equipped with convex optimization techniques, a plethora of large margin classifiers were triggered by support vector machines (Cortes and Vapnik, 1995, Vapnik, 1998). For neural networks, Bartlett (1997, 1998) showed that the generalization error can be bounded by a margin-sensitive fat-shattering dimension, which is in turn bounded by the $\ell_1$-norm of the weights, shedding light on the possibly good generalization ability of over-parameterized networks with small-size weights despite a large VC dimension. The same idea was later applied to AdaBoost, an iterative algorithm to combine an ensemble of classifiers proposed by Freund and Schapire (1997), which often exhibits a phenomenon of resistance to overfitting: during the training process the generalization error does not increase even when the training error drops to zero. Toward deciphering such a resistance to overfitting, Schapire et al. (1998) proposed the explanation that the training process keeps improving a notion of classification margins in boosting, followed by later improvements (Koltchinskii et al., 2002) and works on establishing consistency of boosting via early stopping regularization (Bühlmann and Yu, 2002, Zhang and Yu, 2005, Yao et al., 2007). Lately such a resistance to overfitting was again observed in deep neural networks with over-parameterized models (Zhang et al., 2016). A renaissance of margin theory was brought by Bartlett et al. (2017) with a normalization of the network using Lipschitz constants bounded by products of operator spectral norms. It has inspired many further investigations in various settings (Miyato et al., 2018, Neyshabur et al., 2018, Liao et al., 2018).

However, margin theory has a limitation: an improvement of margin distributions does not necessarily guarantee a better generalization performance, which traces back at least to Breiman (1999) in his effort to understand AdaBoost. In that work, Breiman designed an algorithm, arc-gv, such that the margin can be maximized via a prediction game; he then demonstrated an example in which one achieves uniformly larger margin distributions on training data than AdaBoost but suffers a higher generalization error. At the end of that paper, Breiman summarized the dilemma with the following comments:

“The results above leave us in a quandary. The laboratory results for various arcing algorithms are excellent, but the theory is in disarray. The evidence is that if we try too hard to make the margins larger, then overfitting sets in. My sense of it is that we just do not understand enough about what is going on.”

In this paper, we revisit Breiman's dilemma in the scenario of deep neural networks. Both success and failure can be witnessed with normalized margin based bounds on the generalization error. First of all, let us look at the following illustrative example.

Example 1.1 (Breiman’s Dilemma with a CNN).

A basic 5-layer convolutional neural network of $c$ channels (see Section 3 for details) is trained on the CIFAR-10 dataset with 10 percent of the labels randomly permuted. With $c=50$ (about 0.09 million parameters), Figure 1 (a) shows the training error and generalization (test) error in solid curves. From the generalization error in (a) one can see that overfitting indeed happens after about 10 epochs, despite that the training error continuously drops down to zero. One can successfully predict such an overfitting phenomenon from Figure 1 (b), the evolution of normalized margin distributions defined later in this paper. In (b), while small margins are monotonically improved during training, large margins undergo a phase transition from increase to decrease around 10 epochs, so that one can predict the tendency of the generalization error in (a) using large margin dynamics. Two particular sections of large margin dynamics are highlighted in (b): one at 9.8 on the x-axis, which measures the percentage of normalized training margins no more than 9.8 (the training margin error), and the other at 0.8 on the y-axis, which measures the normalized margin at quantile $q=0.8$. Both of them meet the tendency of the generalization error in (a) and find a good early stopping time to avoid overfitting. However, as we increase the channel number to $c=400$ (about 5.8 million parameters) and retrain the model, (c) shows a similar overfitting phenomenon in the generalization error; on the other hand, (d) exhibits a monotonic improvement of normalized margin distributions without a phase transition during training, and thus fails to capture the overfitting. This demonstrates Breiman's dilemma in CNNs.

Figure 1: Demonstration of Breiman’s Dilemma in Convolutional Neural Networks.
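To make the quantities in Example 1.1 concrete, the following minimal sketch (our own code, not the authors'; `model`, `loader`, and `lip_estimate` are assumed to be a PyTorch classifier, a data loader over the training set, and a Lipschitz-norm estimate as in Section 2.3) computes the normalized margin distribution and its two highlighted sections: the training margin error at a fixed threshold (x-axis section) and the quantile margin at a fixed quantile (y-axis section).

```python
import numpy as np
import torch

@torch.no_grad()
def normalized_margins(model, loader, lip_estimate, device="cpu"):
    """Prediction margins xi = f(x)_y - max_{j != y} f(x)_j, divided by an
    estimate of the network's Lipschitz semi-norm (see Section 2.3)."""
    margins = []
    for x, y in loader:
        logits = model(x.to(device))
        y = y.to(device).view(-1, 1)
        true_score = logits.gather(1, y).squeeze(1)
        # mask the true class before taking the runner-up score
        runner_up = logits.scatter(1, y, float("-inf")).max(dim=1).values
        margins.append((true_score - runner_up).cpu())
    return torch.cat(margins).numpy() / lip_estimate

def margin_error(margins, gamma):
    """x-axis section: fraction of normalized margins no more than gamma."""
    return float(np.mean(margins <= gamma))

def quantile_margin(margins, q):
    """y-axis section: the normalized margin at quantile q."""
    return float(np.quantile(margins, q))
```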

A key insight behind this dilemma is that one needs a trade-off between the expressive power of models and the complexity of the dataset to endow margin bounds with prediction power. On one hand, when a model has limited expressive power relative to the training dataset, in the sense that the training margin distributions CAN NOT be monotonically improved during training, the generalization or test error may be predicted from the dynamics of normalized margin distributions. On the other hand, if we push too hard to improve margins by giving the model too many degrees of freedom, such that the training margins are uniformly improved during the training process, the predictability may be lost. A trade-off is thus necessary to balance the complexity of model and dataset in addition to margin improvement; otherwise one is doomed to meet Breiman's dilemma when models arbitrarily increase their expressive power.

The example above shows that the expressive power of models relative to the complexity of the dataset can be observed from the dynamics of normalized margins in training, rather than by counting the number of parameters in neural networks. In the sequel, our main contributions make this precise by revisiting Rademacher complexity bounds on the network generalization error.

  • With Lipschitz-normalized margins, a linear inequality is established between the training margin and the test margin in Theorem 1. When both training and test normalized margin distributions undergo similar increase-decrease phase transitions during the training process, one may predict the generalization error based on the training margins, as illustrated in Figure 1.

  • In a dual direction, one can define a quantile margin via the inverse of the margin distribution function, to establish another linear inequality between the inverse quantile margins and the test margins, as shown in Theorem 2. The quantile margin is far easier to tune in practice and enjoys a stronger prediction power by exploiting an adaptive selection of margins along model training.

  • In both cases, Breiman's dilemma may defeat the methods above: when the dynamics of normalized training margins undergo phase transitions different from those of the test margins during training, a uniform improvement of training margins results in overfitting.

Section 2 describes our method to derive the two linear inequalities of generalization bounds above. Extensive experimental results are shown in Section 3 with basic CNNs, AlexNet, VGG, and ResNet on various datasets including CIFAR10, CIFAR100, and mini-ImageNet. Conclusions and future directions are discussed in Section 4. More experimental figures and proofs are collected in the Appendices.

2 Methodology

Let $\mathcal{X}$ be the input space (e.g., images of size #(channel)-by-#(width)-by-#(height) in image classification) and $\mathcal{Y}=\{1,\ldots,K\}$ be the space of classes. Consider a sample set of $n$ observations $S=\{(x_i,y_i)\}_{i=1}^n$ that are drawn i.i.d. from an unknown distribution $P$ on $\mathcal{X}\times\mathcal{Y}$. For any function $g$, let $\mathbb{E}[g]:=\mathbb{E}_{(x,y)\sim P}[g(x,y)]$ be the population expectation and $\mathbb{E}_n[g]:=\frac{1}{n}\sum_{i=1}^n g(x_i,y_i)$ be the sample average.

Define $\mathcal{F}$ to be the space of functions represented by neural networks,

$$\mathcal{F} := \left\{ f(x) = \sigma_L\bigl(W_L\,\sigma_{L-1}(W_{L-1}\cdots\sigma_1(W_1 x))\bigr) \right\}, \qquad (1)$$

where $L$ is the depth of the network, $W_l$ is the weight matrix corresponding to a linear operator at layer $l$, and $\sigma_l$ stands for either an element-wise activation function (e.g. ReLU) or a pooling operator, assumed to be Lipschitz bounded with constant $L_{\sigma_l}$. For example, in a convolutional network $W_l x = w_l * x$, where $*$ stands for the convolution between the input tensor $x$ and the kernel tensor $w_l$. We equip $\mathcal{F}$ with the Lipschitz semi-norm: for each $f$,

$$\|f\|_{\mathrm{Lip}} \;\le\; \prod_{l=1}^{L} L_{\sigma_l}\,\|W_l\|_\sigma \;=:\; L_f, \qquad (2)$$

where $\|\cdot\|_\sigma$ is the spectral norm and $L_{\sigma_l}$ is the Lipschitz constant of $\sigma_l$. Without loss of generality, we assume $L_{\sigma_l}=1$ for simplicity. Moreover, we consider the following family of hypothesis mappings,

$$\mathcal{H} := \left\{ h(x) = [f(x)]_y \,:\, f\in\mathcal{F},\ y\in\mathcal{Y} \right\}, \qquad (3)$$

where $[f(x)]_y$ denotes the $y$-th coordinate of $f(x)$, and we further define the following class induced by a Lipschitz semi-norm bound on $\mathcal{F}$,

$$\mathcal{H}_{B} := \left\{ h\in\mathcal{H} \,:\, \|f\|_{\mathrm{Lip}} \le B \right\}. \qquad (4)$$

Now, rather than merely looking at whether a prediction on $(x,y)$ is correct or not, we further consider the prediction margin defined as $\xi_f(x,y) := [f(x)]_y - \max_{j\ne y}[f(x)]_j$. With that, we can define the ramp loss and the margin error depending on the confidence of predictions. Given two thresholds $\gamma_1 < \gamma_2$, define the ramp loss to be

$$\ell_{(\gamma_1,\gamma_2)}(\xi) := \begin{cases} 1, & \xi < \gamma_1, \\ \dfrac{\gamma_2-\xi}{\gamma_2-\gamma_1}, & \gamma_1 \le \xi < \gamma_2, \\ 0, & \xi \ge \gamma_2, \end{cases}$$

where $\xi = \xi_f(x,y)$. In particular, for $\gamma_1=0$ and $\gamma_2=\gamma$, we also write $\ell_\gamma$ for simplicity. Define the margin error to measure whether $f$ has margin no more than a threshold $\gamma$,

$$e_\gamma(f) := \mathbb{E}\bigl[\mathbf{1}\{\xi_f(x,y)\le\gamma\}\bigr], \qquad \hat e_\gamma(f) := \mathbb{E}_n\bigl[\mathbf{1}\{\xi_f(x_i,y_i)\le\gamma\}\bigr]. \qquad (5)$$

In particular, $e_0(f)$ is the common mis-classification error and $\hat e_0(f)$ is the training error. Note that $e_{\gamma_1}(f) \le \mathbb{E}[\ell_{(\gamma_1,\gamma_2)}(\xi_f)] \le e_{\gamma_2}(f)$ (and similarly for the empirical versions), and $\ell_{(\gamma_1,\gamma_2)}$ is Lipschitz bounded by $1/(\gamma_2-\gamma_1)$.
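A direct transcription of these definitions into code (the function names are ours) may help fix ideas; the assertion at the end checks the sandwich property between margin errors and the ramp loss used in the bounds below.

```python
import numpy as np

def ramp_loss(xi, gamma1, gamma2):
    """Ramp loss: 1 below gamma1, 0 above gamma2, linear in between;
    it is Lipschitz with constant 1 / (gamma2 - gamma1)."""
    xi = np.asarray(xi, dtype=float)
    return np.clip((gamma2 - xi) / (gamma2 - gamma1), 0.0, 1.0)

def margin_error(xi, gamma):
    """Fraction of samples whose prediction margin is at most gamma;
    gamma = 0 recovers the usual mis-classification error."""
    return float(np.mean(np.asarray(xi) <= gamma))

# Sandwich property: e_{gamma1} <= mean ramp loss <= e_{gamma2}.
xi = np.array([-0.2, 0.1, 0.5, 1.3])
assert margin_error(xi, 0.0) <= ramp_loss(xi, 0.0, 1.0).mean() <= margin_error(xi, 1.0)
```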

The central question we try to answer is: can we find a proper upper bound to predict the tendency of the generalization error along training, such that one can early stop the training near the epoch where the test error is minimized? The answer is both a yes and a no!

We begin with the following lemma, a margin-based generalization bound with network Rademacher complexity for multi-class classification, obtained via the uniform law of large numbers (Koltchinskii et al., 2002, Cortes et al., 2013, Kuznetsov et al., 2015, Bartlett et al., 2017).

Lemma 2.1.

Given $\gamma_2 > \gamma_1 \ge 0$, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds for any $f\in\mathcal{F}$ with $\|f\|_{\mathrm{Lip}}\le B$,

$$e_{\gamma_1}(f) \;\le\; \hat e_{\gamma_2}(f) \;+\; \frac{2}{\gamma_2-\gamma_1}\,\mathcal{R}_n(\mathcal{H}_B) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}, \qquad (6)$$

where

$$\mathcal{R}_n(\mathcal{H}_B) := \mathbb{E}_{x,\varepsilon}\left[\sup_{h\in\mathcal{H}_B}\frac{1}{n}\sum_{i=1}^n \varepsilon_i\, h(x_i)\right] \qquad (7)$$

is the Rademacher complexity of the function class $\mathcal{H}_B$ with respect to the $n$ samples, and the expectation is taken over $x_i\sim P$ and i.i.d. Rademacher variables $\varepsilon_i\in\{\pm 1\}$.

Unfortunately, a direct application of such a bound to neural networks with a constant $B$ suffers from the so-called scaling problem. The following proposition gives a lower bound on the Rademacher complexity term.

2.1 A Lower Bound on the Rademacher Complexity

Proposition 1.

Consider networks with activation functions $\sigma$, where we assume $\sigma$ is Lipschitz continuous and there exists $x_0$ such that $\sigma'(x_0)$ exists and $\sigma'(x_0)\ne 0$. Then for any $B>0$, there holds

$$\mathcal{R}_n(\mathcal{H}_B) \;\ge\; c\,B, \qquad (8)$$

where $c>0$ is a constant that does not depend on $B$.

This proposition extends Theorem 3.4 in Bartlett et al. (2017) to general activation functions and the multi-class scenario; the proof is presented in the Appendix. It tells us that if $B\to\infty$, the upper bound (6) becomes trivial since $\mathcal{R}_n(\mathcal{H}_B)\to\infty$. In fact, both Telgarsky (2013) and Soudry et al. (2018) show that the gradient descent method drives the weight estimates in logistic regression, and in general boosting with exponential-type losses, toward the max-margin classifier at infinity when the data is linearly separable; in particular, the latter shows that the weight estimates grow at a logarithmic rate in the number of iterations. As for a deep neural network with cross-entropy loss, the input of the last layer is usually viewed as features extracted from the original input. Training the last layer with the other layers fixed is exactly a logistic regression, and the features are linearly separable as long as the training error reaches zero. Therefore, without any normalization, the hypothesis space during training has no uniform upper bound on the Lipschitz semi-norm, and thus the upper bound (6) is useless. Besides, even for a fixed $B$, the complexity term $\mathcal{R}_n(\mathcal{H}_B)$ is computationally intractable.

In the following we are going to present two simple generalization error bounds based on normalized margins and restricted Rademacher complexity within certain Lipschitz balls.

2.2 Two Simplified Bounds with Normalized Margins and Restricted Rademacher Complexity

The first remedy is to restrict our attention to normalized networks, obtained by dividing $f$ by its Lipschitz semi-norm or an upper bound of it. Note that a normalized network $f/C$ has the same mis-classification error as $f$ for all $C>0$. For the choice of the normalization factor, it is hard in practice to directly compute the Lipschitz semi-norm of a network, but some approximate estimates of the upper bound in (2) are available, as discussed in Section 2.3. In the sequel, let $\tilde f := f/L_f$ be the normalized network and $\tilde h$ be the corresponding normalized hypothesis function. Now a simple idea is to regard the complexity of the normalized class as a constant when the model is not over-expressive against the data; then one can predict the tendency of the generalization error via the training margin error of the normalized network, which avoids the scaling problem and the computation of the Rademacher complexity. A concrete sketch of such a normalization is given below, and the following theorem makes the prediction precise.
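As a concrete illustration, the sketch below (our own code, assuming a plain feed-forward PyTorch model whose learnable layers are `nn.Conv2d` or `nn.Linear`) forms the normalization factor as the product of per-layer spectral norms; the paper's experiments instead estimate each factor with the $\ell_1$-based bound or power iteration of Section 2.3, and flattening a convolution kernel into a matrix is itself only an approximation of the true operator norm.

```python
import torch
import torch.nn as nn

def layer_spectral_norm(layer: nn.Module) -> float:
    """Largest singular value of the layer's weight flattened to a matrix
    (an approximation of the convolution operator norm, see Section 2.3)."""
    w = layer.weight.detach().reshape(layer.weight.shape[0], -1)
    return torch.linalg.matrix_norm(w, ord=2).item()

def normalization_factor(model: nn.Module) -> float:
    """Product of per-layer spectral-norm estimates: an estimate of L_f,
    assuming 1-Lipschitz activations (e.g. ReLU) and pooling."""
    factor = 1.0
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            factor *= layer_spectral_norm(m)
    return factor

# Normalized margins are then raw margins divided by this factor, keeping the
# complexity term of the normalized network bounded along training.
```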

Theorem 1.

Given $\gamma_2 > \gamma_1 \ge 0$ and a normalization such that $\|\tilde f_t\|_{\mathrm{Lip}}\le 1$ along training, for any $\delta\in(0,1)$, with probability at least $1-\delta$, along the training epochs $t=1,\ldots,T$ the following holds for each $f_t$,

$$e_{\gamma_1}(\tilde f_t) \;\le\; \hat e_{\gamma_2}(\tilde f_t) \;+\; \frac{2}{\gamma_2-\gamma_1}\,\mathcal{R}_n(\mathcal{H}_1) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}, \qquad (9)$$

where $\mathcal{H}_1$ is the class (4) restricted to the unit Lipschitz ball.

Remark.

In particular, when we take $\gamma_1=0$ and $\gamma_2=\gamma>0$, the bound above becomes

$$e_0(\tilde f_t) \;\le\; \hat e_{\gamma}(\tilde f_t) \;+\; \frac{2}{\gamma}\,\mathcal{R}_n(\mathcal{H}_1) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}. \qquad (10)$$

Theorem 1 says that we can bound the normalized test margin distribution $e_{\gamma_1}(\tilde f_t)$ by the normalized training margin distribution $\hat e_{\gamma_2}(\tilde f_t)$. Recently, Liao et al. (2018) investigated, for normalized networks, the strong linear relationship between the cross-entropy training loss and the test loss when the training epochs are large enough. In contrast, we consider the whole training process and normalized margins. In particular, we hope to predict the trend of the generalization (test) error by choosing a proper $\gamma$ such that the training margin errors enjoy a high correlation with the test error, up to a monotone transform. For this purpose, the following facts are important. First, we do not expect the bound, for example (10), to be tight for every choice of $\gamma$; instead, we hope there exists some $\gamma$ such that the training margin error changes nearly monotonically with the generalization error. Figure 5 below shows the existence of such a $\gamma$ when models are not too big, by exhibiting rank correlations between the training margin error at various $\gamma$ and the training/test error. Moreover, Figure 4 below shows that the training margin error at such a good $\gamma$ successfully recovers the tendency of the generalization error on the CIFAR10 dataset. Second, the normalizing factor does not necessarily have to be an upper bound of the Lipschitz semi-norm. The key point is to prevent the complexity term of the normalized network from going to infinity. Since normalization by $C\,L_f$ for any constant $C>0$ works in practice, where the constant can be absorbed into $\gamma$, we can ignore the Lipschitz constants introduced by general activation functions in the hidden layers.

However, as Example 1.1 and Figure 1 show above, once the training margin distribution is uniformly improved, the dynamics of the training margin error fail to detect the minimum of the generalization error in the early stage. This is because when the network structure becomes complex enough, the training margin distribution can be more easily improved. In this case, although the training margin errors are reduced, the restricted Rademacher complexity in Theorem 1 blows up, so that it is invalid to bound the generalization error using the training margins alone. The generalization error may thus overfit while the training margins cannot show it. This is exactly the observation that led Breiman (1999) to doubt margin theory in boosting-type algorithms. More detailed discussions will be given in Section 3.2.

The most serious limitation of Theorem 1 lies in the fact that we must fix a threshold $\gamma$ along the complete training process. In fact, the first and the second term in the bound (10) vary in opposite directions with respect to $\gamma$, and thus different epochs $t$ may prefer different $\gamma$ for a trade-off. As in Figure 1 (b) of the example, while choosing a $\gamma$ fixes an x-coordinate section of the margin distributions, its dual is to look for a y-section (a quantile), which leads to different margins for different epochs $t$. This motivates the quantile margin in the following theorem. Let $\hat\gamma_{q,t}$ be the $q$-quantile margin of the normalized network $\tilde f_t$ with respect to the sample $S$,

$$\hat\gamma_{q,t} := \inf\Bigl\{\gamma \,:\, \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\xi_{\tilde f_t}(x_i,y_i)\le\gamma\} \ge q \Bigr\}. \qquad (11)$$
Theorem 2.

Assume the input space is bounded, that is, $\|x\|\le R$ for all $x\in\mathcal{X}$. Given a quantile $q\in(0,1)$, for any $\delta\in(0,1)$ and $\tau>0$, the following holds with probability at least $1-\delta$ for all epochs $t$ satisfying $\hat\gamma_{q,t}>\tau$,

$$e_0(\tilde f_t) \;\le\; \hat e_{\hat\gamma_{q,t}}(\tilde f_t) \;+\; \frac{2}{\hat\gamma_{q,t}}\,\mathcal{R}_n(\mathcal{H}_1) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}} \;+\; \frac{C_\tau}{\sqrt{n}}, \qquad (12)$$

where $C_\tau$ is a constant depending on $R$ and $\tau$ but not on $t$.

Remark.

We simply write $\hat\gamma_q$ for $\hat\gamma_{q,t}$ when there is no confusion.

Compared with the bound (10), (12) makes the choice of $\gamma$ vary with $t$, and the cost is an additional constant term and the constraint $\hat\gamma_{q,t}>\tau$, which typically holds in practice. In applications, the stochastic gradient descent method often effectively improves the training margin distributions as the training error drops, so a small enough $\tau$ and a large enough number of training epochs usually meet the constraint. Moreover, even with the choice of a very small $\tau$, the constant term is still negligible, and thus very little cost is paid in the upper bound.

In practice, tuning $q$ is far easier than tuning $\gamma$ directly, and setting $q$ large enough usually provides us with plenty of information about the generalization performance. The quantile margin works effectively when the dynamics of large margins reflect the behavior of the generalization error, e.g. Figure 1. In this case, after a certain number of training epochs, the large margins have to be sacrificed to further improve small margins in order to reduce the training loss, which typically indicates a possible saturation or overfitting in test error.
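For instance, a data-dependent early-stopping heuristic suggested by this discussion is to monitor the inverse quantile margin $1/\hat\gamma_{q,t}$ along epochs and stop once it no longer improves; the sketch below is ours (function names, the default quantile, and the patience value are illustrative, not prescribed by the paper).

```python
import numpy as np

def inverse_quantile_margin(normalized_margins, q=0.95):
    """1 / gamma_{q,t}: reciprocal of the empirical q-quantile of the
    normalized training margins at the current epoch."""
    return 1.0 / float(np.quantile(np.asarray(normalized_margins), q))

def should_stop(history, patience=10):
    """Stop when the inverse quantile margin has not reached a new minimum
    for `patience` epochs, mirroring early stopping on a validation curve."""
    if len(history) <= patience:
        return False
    return min(history[-patience:]) > min(history)

# Inside a training loop: append inverse_quantile_margin(margins_of_epoch, q)
# to `history` after each epoch and break once should_stop(history) is True.
```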

2.3 Estimate of Normalization Factors

In this section we discuss how to estimate the Lipschitz constant bound in (2). Given an operator $W$ associated with a convolutional kernel $w$, i.e. $Wx = w*x$, there are two ways to estimate its operator norm. We begin with the following proposition, part (A) of which is adapted from the continuous version of Young's convolution inequality in $L^p$ spaces (see Theorem 3.9.4 in Bogachev (2007)), and part (B) of which is a generalization to the multiple-channel kernels widely used in convolutional networks nowadays. The proof is presented in Appendix B.5.

Proposition 2.

(A) For a convolution operator $W$ with kernel $w$ of $d$-dimensional kernel size, there holds

$$\|w * u\|_2 \;\le\; \|w\|_1\,\|u\|_2. \qquad (13)$$

In other words, $\|W\|_\sigma \le \|w\|_1$.

(B) Consider a multiple-channel convolutional kernel $w=(w_{ji})$ with stride $s$, which maps an input signal $u$ of $c_{\mathrm{in}}$ channels to an output of $c_{\mathrm{out}}$ channels by

$$(Wu)_j \;=\; \sum_{i=1}^{c_{\mathrm{in}}} w_{ji} *_s u_i, \qquad j=1,\ldots,c_{\mathrm{out}},$$

where $u$ and $w$ are assumed to be zero-padded outside their supports. The following upper bounds hold.

  1. Let , then

    (14)
  2. Let where , then

    (15)
Remark.

For stride 1, the upper bound (14) is tighter than (15), while for a large stride, the second bound (15) might become tighter by taking into account the effect of stride.

In all these cases, the $\ell_1$-norm of the kernel $w$ dominates the estimates, so in the following we simply call these bounds $\ell_1$-based estimates. Another method is given in Miyato et al. (2018) based on power iteration (Golub and Van der Vorst, 2001), as a fast numerical approximation of the spectral norm of the operator matrix. As a shortcoming, however, the power iteration method is not easy to apply to ResNets.
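As an illustration of the two estimates, here is a minimal sketch (our own code, not the authors'; stride 1 and zero padding are assumed for simplicity): a coarse $\ell_1$-based upper bound in the spirit of Proposition 2, and a power-iteration approximation of the spectral norm in the spirit of Miyato et al. (2018), obtained by repeatedly applying the convolution and its adjoint.

```python
import torch
import torch.nn.functional as F

def l1_based_bound(kernel: torch.Tensor) -> float:
    """Coarse upper bound on the operator norm from Young's inequality;
    here we simply use the l1-norm of the whole multi-channel kernel."""
    return kernel.abs().sum().item()

def power_iteration_bound(kernel: torch.Tensor, in_shape, n_iter=50) -> float:
    """Approximate the spectral norm of the convolution operator by power
    iteration on W^T W, using conv2d and conv_transpose2d (stride 1,
    'same'-style zero padding assumed)."""
    pad = kernel.shape[-1] // 2
    u = torch.randn(1, kernel.shape[1], *in_shape)
    for _ in range(n_iter):
        v = F.conv2d(u, kernel, padding=pad)          # v = W u
        u = F.conv_transpose2d(v, kernel, padding=pad)  # u = W^T v
        u = u / (u.norm() + 1e-12)
    # For a normalized u close to the top right singular vector, ||W u||
    # approximates the largest singular value of W.
    return F.conv2d(u, kernel, padding=pad).norm().item()

# Example: a random 64x3x3x3 kernel acting on 32x32 RGB images.
w = torch.randn(64, 3, 3, 3)
print(l1_based_bound(w), power_iteration_bound(w, (32, 32)))
```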

We compare the two estimates in Figure 10. It turns out that both of them can be used to predict the tendency of the generalization error using normalized margins, and both of them fail when the network has large enough expressive power. Although the $\ell_1$-based estimate is very efficient, the power iteration method may be tighter and have a wider range of predictability.

In the remainder of this section, we discuss in particular the treatment of ResNets. A ResNet is usually a composition of basic blocks with a short-cut structure, as shown in Figure 2. The following method is used in this paper to estimate upper bounds on the spectral norm of such a basic block of a ResNet.

Figure 2:

A basic block in the ResNets used in this paper. The shortcut consists of one block with convolutional and batch-normalization layers, while the mainstream has two such blocks. ResNets are constructed by cascading several basic blocks of various sizes.

  • Convolution layer: its operator norm can be bounded either by the $\ell_1$-based estimate or by power iteration as above.

  • Batch Normalization (BN): in the training process, BN normalizes samples by $\hat x = (x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}$, where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the batch samples, while keeping online running averages $\mu$ and $\sigma^2$ of these statistics. Then BN rescales $\hat x$ by estimated parameters $\gamma_{\mathrm{BN}}$, $\beta_{\mathrm{BN}}$ and outputs $y=\gamma_{\mathrm{BN}}\hat x+\beta_{\mathrm{BN}}$. Therefore, the whole rescaling of BN on the kernel tensor $w$ of the preceding convolution layer is $w\mapsto \gamma_{\mathrm{BN}} w/\sqrt{\sigma^2+\epsilon}$, and its corresponding rescaled operator is $(\gamma_{\mathrm{BN}}/\sqrt{\sigma^2+\epsilon})\,W$.

  • Activation and pooling: their Lipschitz constants are known a priori, e.g. 1 for ReLU, and hence can be ignored. In general, they cannot be ignored if they lie in the shortcut, as discussed below.

  • Shortcut: in a residual network with the basic block of Figure 2, one has to treat the mainstream and the shortcut separately. Since the Lipschitz semi-norm is sub-additive over the two paths, $\|f_{\mathrm{main}}+f_{\mathrm{shortcut}}\|_{\mathrm{Lip}} \le \|f_{\mathrm{main}}\|_{\mathrm{Lip}}+\|f_{\mathrm{shortcut}}\|_{\mathrm{Lip}}$, in this paper we take the Lipschitz upper bound of the block to be the product of the spectral-norm estimates of the BN-rescaled convolutional operators in the mainstream plus that of the shortcut (see the sketch after this list). In particular, a Lipschitz constant common to all paths can be ignored, since all paths are normalized by the same constant, while a constant appearing only in the shortcut cannot be ignored due to its asymmetry.
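The rules above combine for the basic block of Figure 2 as in the following sketch (our own code; `conv*` and `bn*` denote the PyTorch layers of the block, and the flattened-kernel spectral norm below may be replaced by either estimate from Section 2.3).

```python
import torch
import torch.nn as nn

def bn_rescaled_spectral_norm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> float:
    """Spectral-norm estimate of the BN-rescaled convolution: each output
    channel of the kernel is scaled by gamma_c / sqrt(running_var_c + eps)."""
    scale = bn.weight.detach() / torch.sqrt(bn.running_var + bn.eps)
    w = conv.weight.detach() * scale.view(-1, 1, 1, 1)
    return torch.linalg.matrix_norm(w.reshape(w.shape[0], -1), ord=2).item()

def basic_block_bound(conv1, bn1, conv2, bn2, conv_sc, bn_sc) -> float:
    """Lipschitz upper bound of the block in Figure 2: product of the two
    mainstream estimates plus the shortcut estimate, since
    ||f_main + f_shortcut||_Lip <= ||f_main||_Lip + ||f_shortcut||_Lip,
    with 1-Lipschitz ReLUs ignored."""
    main = bn_rescaled_spectral_norm(conv1, bn1) * bn_rescaled_spectral_norm(conv2, bn2)
    return main + bn_rescaled_spectral_norm(conv_sc, bn_sc)
```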

3 Experimental Results

We briefly introduce the networks and datasets used in the experiments. For the networks, our illustration Example 1.1 is based on a simple convolutional neural network whose architecture is shown in Figure 3 (more details in Appendix Figure 11), called basic CNN($c$) here, with $c$ channels to be specified in the different experiments below. Basically, it has five convolutional layers with $c$ channels each, followed by batch normalization and ReLU, as well as a fully connected layer in the end. Furthermore, we consider various popular networks in applications, including AlexNet (Krizhevsky et al., 2012), VGG-16 (Simonyan and Zisserman, 2014) and ResNet-18 (He et al., 2016). For the datasets, we consider CIFAR10, CIFAR100 (Krizhevsky and Hinton, 2009) and Mini-ImageNet (Vinyals et al., 2016).

Figure 3: Illustration of the architecture of basic CNN.

The spirit of the following experiments is to show when, and how, the margin bounds can be used to numerically predict the tendency of the generalization or test error along the training path.

3.1 Success: Training Margin Error and Quantile Margin

In this experiment, we are going to explore when there is a nearly monotone relationship between training margin error and test margin error such that Theorem 1 and Theorem 2 can be applied to predict the tendency of generalization (test) error.

First, let us consider training a basic CNN(50) on the CIFAR10 dataset with and without random label noise. The relations between the test error, the training margin error at a fixed threshold $\gamma$, and the inverse quantile margin at a fixed quantile $q$ are shown in Figure 4. In this simple example, where the network is small and the dataset is simple, the bounds (9) and (12) show a good prediction power: they stop either near the epoch of sufficient training without noise (Left, original data) or before overfitting occurs with noise (Right, 10 percent of labels corrupted).

Figure 4: Success examples. Net structure: basic CNN(50). Dataset: original CIFAR10 (Left) and CIFAR10 with 10 percent of labels corrupted (Right). In each figure, we show the training error (red solid), training margin error (red dashed), and inverse quantile margin (red dotted), together with the generalization error (blue solid). The marker "x" in each curve indicates the global minimum along the training epochs. Both the training margin error and the inverse quantile margin successfully predict the tendency of the generalization error.
Figure 5: Spearman's and Kendall's rank correlations between training (or quantile) margins and training errors, as well as training (or quantile) margins and test errors, at different $\gamma$ (or $q$, respectively). Net structure: basic CNN(50). Dataset: CIFAR10. Top: Spearman's rank correlation. Bottom: Kendall's rank correlation. Left: blue curves show rank correlations between the training margin error and the test (generalization) error, while red curves show those between the training margin error and the training error, at different $\gamma$. Right: blue curves show rank correlations between the inverse quantile margin and the test error, and red curves show those between the inverse quantile margin and the training error, at different $q$. Both Spearman's and Kendall's correlations show qualitatively the same phenomenon: the dynamics of large margins are closely related to the test errors, in the sense that they have similar trends marked by large rank correlations, while small margins are close to training errors in trend.

Why does it work in this case? Here are some detailed explanations of its mechanism. The training margin error and the inverse quantile margin are both closely related to the dynamics of training margin distributions. Figure 1 (b) actually shows that the dynamics of training margin distributions undergo a phase transition: while the low margins improve monotonically, the large margins first increase and then decrease, as indicated by the red arrows. Therefore different choices of $\gamma$ in the linear bound (9) (a parallel argument holds for $q$ in (12)) have different effects. In fact, the training margin error with a small $\gamma$ is close to the training error, while that with a large $\gamma$ is close to the test error. Figure 5 shows such a relation using rank correlations (in terms of Spearman-ρ and Kendall-τ; these rank correlation coefficients measure how well two variables are correlated up to a monotone transform, and a larger correlation means a closer tendency) between training margin errors (or inverse quantile margins) and training errors, as well as training margin errors (or inverse quantile margins) and test errors, for each $\gamma$ (or $q$, respectively). In these plots one sees that the dynamics of large margins have a similar trend to the test errors, while small margins are close to the training errors in rank correlations. Therefore, for a good prediction, one can choose a large enough $\gamma$ (or $q$, respectively) at the peak of the rank correlation curve between training margins and test errors. Under such choices, the epoch when the phase transition above happens is featured with a cross-over in the dynamics of training margin distributions in Figure 1 (b), and lies near the optimum of the training margin error curve.
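The rank-correlation analysis of Figure 5 can be reproduced schematically as follows (a sketch using scipy; `margins_per_epoch` is assumed to be a list of arrays of normalized training margins, one per epoch, collected during training).

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def margin_error_curve(margins_per_epoch, gamma):
    """Training margin error at threshold gamma, one value per epoch."""
    return np.array([np.mean(m <= gamma) for m in margins_per_epoch])

def rank_correlations(margins_per_epoch, test_error_per_epoch, gammas):
    """Spearman-rho and Kendall-tau between the margin-error dynamics at each
    threshold and the test-error dynamics; a large gamma whose curve has high
    rank correlation with the test error is a good choice for prediction."""
    rows = []
    for gamma in gammas:
        curve = margin_error_curve(margins_per_epoch, gamma)
        rho, _ = spearmanr(curve, test_error_per_epoch)
        tau, _ = kendalltau(curve, test_error_per_epoch)
        rows.append((gamma, rho, tau))
    return rows
```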

Although both the training margin error and the inverse quantile margin can be used here to successfully predict the trend of the test (generalization) error, the latter is more powerful in our studies. In fact, the dynamics of the inverse quantile margin adaptively select a margin threshold for each epoch $t$ without access to the complexity term. Unlike merely looking at the training margin error with a fixed $\gamma$, the quantile margin bound (12) in Theorem 2 shows a stronger prediction power than (10) and is even able to capture more local optima. In Figure 6, the test error curve has two valleys, corresponding to a local optimum and a global optimum, and the quantile margin curve successfully identifies both. However, if we consider the dynamics of training margin errors, it is rarely possible to recover the two valleys at the same time, since their critical thresholds are different. Another example with ResNet-18 is given in Figure 12 in the Appendix.

Figure 6: Inverse quantile margin. Net structure: CNN(400). Dataset: CIFAR10 with 10 percent of labels corrupted. Left: the dynamics of the test error (blue) and the inverse quantile margin (red). Two local minima are marked by "x" in each curve. Right: dynamics of training margin distributions, where the two distributions in red correspond to the epochs at which the two local minima occur. The inverse quantile margin successfully captures the two local minima of the test error.

In summary, when training and test margin dynamics share similar phase transitions, both theorems we developed can be used to predict the test (generalization) error via normalized training margins, even providing a data-dependent early stopping rule to avoid overfitting when the data is noisy. However, below we shall see a different scenario: when training and test margin dynamics have distinct phase transitions, such a prediction fails, which is Breiman's dilemma.

3.2 Failure: Breiman’s Dilemma and Phase Transitions in Margin Dynamics

In this section, we show that when the expressive power of a model is comparable to the data complexity, the dynamics of training margin distributions and those of test margin distributions share similar phase transitions, which enables us to predict the generalization (test) error using the theorems in this paper. However, when the model complexity increases arbitrarily to become over-expressive relative to the dataset, the training margins can be monotonically improved and undergo phase transitions different from those of the test margin dynamics, and the prediction power is lost. This exhibits Breiman's dilemma in neural networks.

We conduct three sets of experiments in the following.

3.2.1 Experiment I: Basic CNNs on CIFAR10

Figure 7: Breiman's Dilemma I: comparisons between dynamics of test margin distributions and training margin distributions. Net structure: basic CNN(50) (Left), basic CNN(100) (Middle), basic CNN(400) (Right). Dataset: CIFAR10 with 10 percent of labels corrupted. First row: evolutions of training margin distributions. Second row: evolutions of test margin distributions. Third row: heatmaps of Spearman-ρ rank correlation coefficients between the dynamics of the training margin error at threshold $\gamma_1$ and the dynamics of the test margin error at threshold $\gamma_2$, drawn on the $(\gamma_1,\gamma_2)$ plane. CNN(50) and CNN(100) share similar phase transitions in training and test margin dynamics, while CNN(400) does not. When the model becomes over-expressive relative to the dataset, training margins can be monotonically improved while test margins cannot, losing the predictability.

In the first experiment, shown in Figure 7, we fix the dataset to be CIFAR10 with 10 percent of labels randomly permuted, and gradually increase the channels from basic CNN(50) to CNN(400). For CNN(50) (92,610 parameters) and CNN(100) (365,210 parameters), both training margin dynamics and test margin dynamics share a similar phase transition during training: small margins are monotonically improved while large margins are first improved and then dropped. The last row in Figure 7 shows heatmaps of Spearman-ρ rank correlations between these two dynamics drawn in the $(\gamma_1,\gamma_2)$ plane. The block-diagonal structures in the rank correlation heatmaps illustrate such a similarity in phase transitions. To be specific, small (or large) margins in both training margins and test margins share high rank correlations, marked by diagonal blocks in light color, while the difference between small and large margins is marked by off-diagonal blocks in dark color. In particular, the test (generalization) error dynamics can be predicted using large training margins, as their rank correlations are high.

However, as the channel number increases to CNN(400) (5,780,810 parameters), the dynamics of the training margin distributions become a monotone improvement without the phase transition above. This phenomenon is not a surprise, since with a strong representation power the whole training margin distribution can be monotonically improved without sacrificing the large margins. On the other hand, the generalization or test error cannot be monotonically improved. The heatmap of rank correlations between training and test margin dynamics thus exhibits this distinction in phase transitions, changing from the block-diagonal structure above to double column blocks for CNN(400). In particular, the test error dynamics have low rank correlations with all of the training margin dynamics, as they follow different phase transitions. As a result, one CANNOT predict the test error using the training margin dynamics.

3.2.2 Experiment II: CNN(400) and ResNet-18 on CIFAR100 and Mini-ImageNet

In the second experiment, shown in Figure 8, we compare the normalized margin dynamics of training CNN(400) and ResNet-18 on two different datasets, CIFAR100 and Mini-ImageNet. CIFAR100 is more complex than CIFAR10, but less complex than Mini-ImageNet. It shows that: (a) CNN(400) does not have an over-expressive power on CIFAR100, and its normalized training margin dynamics exhibit a phase transition, a sacrifice of large margins to improve small margins during training; (b) ResNet-18 does have an over-expressive power on CIFAR100, exhibiting a monotone improvement of training margins, but loses such a power on Mini-ImageNet, where phase transitions of training margin dynamics reappear.

Figure 8: Breiman's Dilemma II. Net structure: basic CNN(400) (Left), ResNet-18 (Middle, Right). Dataset: CIFAR100 (Left, Middle), Mini-ImageNet (Right), with 10 percent of labels corrupted. With a fixed network structure, we further explore how the complexity of the dataset influences the margin dynamics. Taking ResNet-18 as an example, the margin dynamics on CIFAR100 do not show any cross-over (phase transition), but on Mini-ImageNet a cross-over occurs.

From this experiment, one can see that simply counting the numbers of parameters and samples cannot tell us whether the model and data complexities are comparable or the model is over-expressive. Instead, phase transitions of margin dynamics provide a tool to investigate their relationship. CNN(400) (5.8M parameters) has too much expressive power for the simplest dataset, CIFAR10, so that the training margins can be monotonically improved during training; but CNN(400)'s expressive power seems comparable to the more complex CIFAR100. Similarly, the more complex model ResNet-18 (11M parameters) has too much expressive power for CIFAR100, but seems comparable to Mini-ImageNet.

3.2.3 Comparisons of Basic CNNs, AlexNet, VGG16, and ResNet-18 in CIFAR10/100 and Mini-ImageNet

In this part, we collect comparisons of various networks on the CIFAR10/100 and Mini-ImageNet datasets. Figure 9 shows both success and failure cases with different networks and datasets. In particular, the predictability of the generalization error based on Theorem 1 and Theorem 2 can be rapidly assessed from the third column of Figure 9, the heatmaps of rank correlations between training margin dynamics and test margin dynamics. On one hand, one can use the training margins to predict the test error, as shown in the first column of Figure 9, when the model complexity is comparable to the data complexity, so that the training margin dynamics share similar phase transitions with the test margin dynamics, indicated by block-diagonal structures in the rank correlations (e.g. CNN(100)-CIFAR10, AlexNet-CIFAR100, AlexNet-MiniImageNet, VGG16-MiniImageNet, and ResNet-18-MiniImageNet). On the other hand, such a prediction fails when models become over-expressive relative to the datasets, so that the training margin dynamics undergo phase transitions different from the test margin dynamics, indicated by the loss of block-diagonal structures in the rank correlations (e.g. CNN(400)-CIFAR10, ResNet-18-CIFAR100, VGG16-CIFAR100).

As we have shown, phase transitions of margin dynamics play a central role in characterizing the trade-off between model expressive power and data complexity, and hence the predictability of the generalization error by our theorems. If one tries hard to improve training margins by arbitrarily increasing the model complexity, the training margin distributions can be monotonically enlarged, but this may lead to overfitting. This phenomenon is not unfamiliar: Breiman pointed out that the improvement of training margins is not enough to guarantee a small generalization or test error in boosting-type algorithms (Breiman, 1999). Now we find the same phenomenon ubiquitous in deep neural networks. In this paper, the inspection of the trade-off between the expressive power of models and the complexity of data via phase transitions of margin dynamics provides a new perspective to study Breiman's dilemma in applications.

Figure 9: Comparisons of basic CNNs, AlexNet, VGG16, and ResNet-18 on CIFAR10/100 and Mini-ImageNet. The dataset and network in use are marked in the title of the middle picture in each row. Left: curves of training error, generalization error, training margin error, and inverse quantile margin. Middle: dynamics of training margin distributions. Right: heatmaps of Spearman-ρ rank correlation coefficients between the dynamics of the training margin error at threshold $\gamma_1$ and the dynamics of the test margin error at threshold $\gamma_2$, drawn on the $(\gamma_1,\gamma_2)$ plane.

3.3 Discussion: Influence of Normalization Factor Estimates

Figure 10: Comparisons of normalization factor estimates by power iteration and by the $\ell_1$-based estimate. Dataset: CIFAR10 with 10 percent of labels corrupted. Net structure: basic CNN with channels 50 (Top, Left), 100 (Top, Middle), 400 (Top, Right), 200 (Middle, Left), 600 (Middle, Middle), 900 (Middle, Right). In the top row, the spectral norms are estimated via the $\ell_1$-based method, and in the middle row, the spectral norms are estimated by power iteration. The bottom pictures show the estimates of the normalization factor by power iteration (in green) and by the $\ell_1$-based method (in blue) for 100, 400, and 900 channels, respectively. The curves of the estimates are rescaled for visualization, since a fixed scaling factor along training does not influence the occurrence of cross-overs or phase transitions; the original $\ell_1$-based estimates are of a larger order of magnitude than the corresponding power iteration estimates. As shown above, a more accurate estimation of the spectral norm may extend the range of predictability, but eventually faces Breiman's dilemma if the model representation power grows too much against the dataset complexity.

In the end, it is worth mentioning that different choices of the normalization factor estimate may affect the range of predictability, but Breiman's dilemma still shows up. In all the experiments above, the normalization factor is estimated via the $\ell_1$-based estimate of Proposition 2 in Section 2.3. One could also use power iteration (Miyato et al., 2018) to obtain a more precise estimate of the spectral norm. Usually the $\ell_1$-based estimates lead to a coarser upper bound than power iteration; see Figure 10. In training margin dynamics, large margins are typically improved at a slower speed than small margins. Therefore, a more accurate estimate of the spectral norm, which increases faster during training, may restore cross-overs (or phase transitions) in the large training margins and extend the range of predictability. However, Breiman's dilemma still persists when the balance between model representation power and dataset complexity is broken as the model complexity grows arbitrarily.

4 Conclusion and Future Directions

In this paper, we show that Breiman's dilemma is ubiquitous in deep learning, in addition to previous studies on boosting algorithms. We exhibit that Breiman's dilemma is closely related to the trade-off between the expressive power of models and the complexity of data. Large margins on training data do not guarantee a good control of model complexity; instead, phase transitions in the dynamics of normalized margin distributions are shown to reflect the trade-off between model expressiveness and data complexity. In other words, such phase transitions of margin evolutions measure the degrees of freedom of models with respect to data. A data-driven early stopping rule that monitors the margin dynamics is possible, whose detailed study is left as a future direction. The Lipschitz semi-norm plays an important role in normalizing or regularizing neural networks, e.g. in GANs (Kodali et al., 2017, Miyato et al., 2018); therefore a more careful treatment deserves further pursuit.

Acknowledgement

We thank Tommy Poggio, Peter Bartlett, and Xiuyuan Cheng for helpful discussions.

Appendix A Appendix: More Experimental Figures

A.1 Architecture Details about Basic CNNs

Figure 11: Detailed information about CNN(50), CNN(100), CNN(200), and CNN(400).

A.2 Two Local Minima in ResNet-18

Figure 12: The inverse quantile margin captures local optima, though it may fail to predict their relative order when the model complexity is over-expressive. Network: ResNet-18. Dataset: CIFAR10 with 10 percent of labels corrupted. Normalization factor: spectral complexity estimated by power iteration. Left: the dynamics of the test error and the inverse quantile margin. Overfitting occurs, and two local minima are marked with "x" in each curve. The dashed line highlights the epochs during which the training margins are monotonically improved. Right: dynamics of training margin distributions. The two distributions corresponding to the local minima of the test error are highlighted in red. Since after the first (better) local minimum the training margin distribution is uniformly improved by the time of the second (worse) local minimum, the inverse quantile margin assigns a smaller value to the second local minimum, whereas the true order of the two local minima of the test error is the opposite. Nevertheless, the inverse quantile margin still captures the optima locally, where the training margin distributions have cross-overs (phase transitions) near the local minima of the test error.

Appendix B Appendix: Proofs

B.1 Auxiliary Lemmas

Lemma B.1.

For any $\delta\in(0,1)$ and any class $\mathcal{G}$ of functions with values bounded in $[0,1]$, the following holds with probability at least $1-\delta$ for all $g\in\mathcal{G}$,

$$\mathbb{E}[g] \;\le\; \mathbb{E}_n[g] \;+\; 2\,\mathcal{R}_n(\mathcal{G}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}, \qquad (16)$$

where

$$\mathcal{R}_n(\mathcal{G}) := \mathbb{E}_{z,\varepsilon}\left[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n \varepsilon_i\, g(z_i)\right] \qquad (17)$$

is the Rademacher complexity of the function class $\mathcal{G}$.

For completeness, we include its proof, which also relies on the following well-known McDiarmid inequality (see, e.g., Wainwright (2019)).

Lemma B.2 (McDiarmid’s Bounded Difference Inequality).

For a function $g:\mathcal{Z}^n\to\mathbb{R}$ with bounded differences $c_i$, i.e. $\sup_{z_1,\ldots,z_n,\,z_i'}\bigl|g(z_1,\ldots,z_i,\ldots,z_n)-g(z_1,\ldots,z_i',\ldots,z_n)\bigr|\le c_i$ for each $i$, it holds for every $t>0$ that

$$\mathbb{P}\bigl[g-\mathbb{E}[g]\ge t\bigr] \;\le\; \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

Proof of Lemma B.1.

It suffices to show that, for $\Phi(S):=\sup_{g\in\mathcal{G}}\bigl(\mathbb{E}[g]-\mathbb{E}_n[g]\bigr)$,

$$\mathbb{E}[g]-\mathbb{E}_n[g] \;\le\; \Phi(S) \quad \text{for all } g\in\mathcal{G}, \qquad (18)$$

where, with probability at least $1-\delta$,

$$\Phi(S) \;\le\; \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log(1/\delta)}{2n}} \qquad (19)$$

by McDiarmid's bounded difference inequality, and

$$\mathbb{E}_S[\Phi(S)] \;\le\; 2\,\mathcal{R}_n(\mathcal{G}) \qquad (20)$$

using the Rademacher complexity.

To see (19), we show that $\Phi$ is a bounded difference function. Consider $S=(z_1,\ldots,z_n)$ and let $S'$ be obtained from $S$ by changing the $i$-th argument $z_i$ to $z_i'$. Then, since every $g\in\mathcal{G}$ takes values in $[0,1]$,

$$\bigl|\Phi(S)-\Phi(S')\bigr| \;\le\; \sup_{g\in\mathcal{G}}\frac{|g(z_i)-g(z_i')|}{n} \;\le\; \frac{1}{n}.$$

Hence $\Phi$ is a $(1/n)$-bounded difference function, and (19) follows from McDiarmid's inequality (Lemma B.2) with $c_i=1/n$ and $t=\sqrt{\log(1/\delta)/(2n)}$.

As to (20), a standard symmetrization argument with an i.i.d. ghost sample $S'=(z_1',\ldots,z_n')$ and Rademacher variables $\varepsilon_i$ gives

$$\mathbb{E}_S[\Phi(S)] \;=\; \mathbb{E}_S\sup_{g\in\mathcal{G}}\mathbb{E}_{S'}\Bigl[\frac{1}{n}\sum_{i=1}^n\bigl(g(z_i')-g(z_i)\bigr)\Bigr] \;\le\; \mathbb{E}_{S,S',\varepsilon}\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n\varepsilon_i\bigl(g(z_i')-g(z_i)\bigr) \;\le\; 2\,\mathcal{R}_n(\mathcal{G}),$$

which ends the proof. ∎

We also need the following contraction inequality of Rademacher Complexity (Ledoux and Talagrand, 1991, Meir and Zhang, 2003).

Lemma B.3 (Rademacher Contraction Inequality).

For any $\ell$-Lipschitz function $\phi:\mathbb{R}\to\mathbb{R}$, i.e. such that $|\phi(a)-\phi(b)|\le\ell\,|a-b|$ for all $a,b$,

$$\mathcal{R}_n(\phi\circ\mathcal{G}) \;\le\; \ell\,\mathcal{R}_n(\mathcal{G}).$$

Ledoux and Talagrand (1991) has an additional factor 2 in the contraction inequality which is dropped in Meir and Zhang (2003). Its current form is stated in Mohri et al. (2012) as Talagrand’s Lemma (Lemma 4.2).

The last lemma bounds the Rademacher complexity of the hypothesis space formed by taking a maximum over functions from different hypothesis spaces (Ledoux and Talagrand, 1991).

Lemma B.4.

Let $\mathcal{G}_1,\ldots,\mathcal{G}_K$ be hypothesis spaces and define

$$\mathcal{G} := \bigl\{\max(g_1,\ldots,g_K) \,:\, g_j\in\mathcal{G}_j,\ j=1,\ldots,K\bigr\}.$$

Then,

$$\mathcal{R}_n(\mathcal{G}) \;\le\; \sum_{j=1}^{K}\mathcal{R}_n(\mathcal{G}_j).$$

B.2 Proof of Proposition 1

Proof of Proposition 1.

The key idea is to approximate linear functions restricted to the Lipschitz ball by neural networks, where the local linearity of the activation functions plays an important role. In this way, we exhibit a subset of the hypothesis class whose Rademacher complexity is larger than that of the (restricted) class of linear functions.

We consider the Taylor expansion of around , , and thus,

(21)

and there exists a , ,

(22)

Without loss of generality, we assume $\sigma(x_0)=0$ and $\sigma'(x_0)=1$, since we can always apply a linear transformation before and after each activation function, and the additional Lipschitz factor can be bounded by a constant. We further assume the Lipschitz constant $L_\sigma=1$ for simplicity.

Let be the class of linear function with Lipschitz semi-norm less than and we show that given a , for each , there exists with and such that satisfying .

To see this, define with , which satisfies . Next we construct a particular -layer network as follows

With such a construction , define . Then since , and

(23)

where and stands for the composite of functions. The second inequality is implied from (21) and (22) since . Moreover, given and , we define a subclass by,

We firstly consider the empirical Rademacher complexity for a given sample set of size . Let and for any given ,