Collegial Ensembles

06/13/2020
by Etai Littwin, et al.
Apple Inc.

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models, yet scale much more favorably. We use recent theoretical results on the finite width corrections of the NTK to perform efficient architecture search in a space of finite width CE that aims to either minimize capacity, or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.


1 Introduction


Figure 1: Collegial Ensemble

Neural networks exhibit strong generalization behavior in the over-parameterized regime, a phenomenon that has been well observed in practice rethinking; prove; modern; explore. Recent theoretical advances aim to understand the trainability and generalization properties of over-parameterized neural networks by observing their nearly convex behaviour at large width wide; me. For a wide neural network $f_\theta$ with parameters $\theta$ and a convex loss $\mathcal{L}$, the parameter updates can be represented in the space of functions as kernel gradient descent (GD) updates, with the Neural Tangent Kernel (NTK) function operating on pairs of inputs:

$$\Theta_\theta(x, x') \;=\; \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x').$$
In the infinite width limit, the NTK remains constant during training, and GD reduces to kernel GD, rendering the optimization a convex problem. Hence, over-parameterized models in the “large width” sense both generalize and are simpler to optimize. In this work, we consider a different type of over-parameterization achieved through ensembling. We denote by collegial ensembles (CE) models whose output, either intermediate or final, is constructed from the aggregation of multiple identical pathways (see the illustration in Fig. 1). We show that the training dynamics of CE simplify when the ensemble multiplicity is large, in a similar sense to wide models, yet at a much cheaper cost in terms of parameter count. Our results indicate the existence of a “sweet spot” for the ratio between width and multiplicity, where “close to convex” behaviour does not come at the expense of size. To find this sweet spot, we rely on recent findings on the finite width corrections of the NTK boris1; rtk, though we use them in a more general manner than their original presentation. Specifically, we use the following proposition, stated informally:

Proposition 1.

(Informal) Denote by $\Theta_\theta$ the NTK at initialization of a fully connected ANN with hidden layer widths $n_1, \dots, n_d$ and depth $d$. There exist positive constants $c_1, c_2$ such that:

$$e^{c_1 \sum_{l=1}^{d} \frac{1}{n_l}} - 1 \;\le\; \frac{\mathrm{Var}\big[\Theta_\theta\big]}{\mathbb{E}\big[\Theta_\theta\big]^2} \;\le\; e^{c_2 \sum_{l=1}^{d} \frac{1}{n_l}} - 1,$$

where the variance is measured on the individual entries of $\Theta_\theta$,

with respect to the random sampling of the weights.

In boris1 and rtk, Proposition 1 is proven for the on-diagonal entries of $\Theta_\theta$ in fully connected architectures. In this work, we assume, and empirically verify (see Appendix Fig. 8), that it holds in a more general sense for practical architectures, with different constants $c_1, c_2$. Since the normalized variance diminishes with width, we hypothesize that a small width neural network behaves closer to its large width counterpart as the variance decreases. Notably, similar observations using activation and gradient variances as predictors of successful initializations were presented in boris2; boris3. Motivated by this hypothesis, we formulate a primal-dual constrained optimization problem that aims to find an optimal CE with respect to the following objectives:

  1. Primal (optimally smooth): minimize $\mathrm{Var}\big[\Theta\big]$ for a fixed number of parameters.

  2. Dual (optimally compact): minimize the number of parameters for a fixed $\mathrm{Var}\big[\Theta\big]$.

The primal formulation seeks to find a CE which mimics the simplified dynamics of a wide model using a fixed budget of parameters, while the dual formulation seeks to minimize the number of parameters without suffering the optimization and performance consequences typically found in the ”narrow regime”. Using both formulations, we find excellent agreement between our theoretical predictions and empirical results, on both small and large scale models.

Our main contributions are the following:

  1. We adapt the popular over-parameterization analysis to collegial ensembles (CE), in which the output units of multiple architecturally identical models are aggregated, scaled, and trained as a single model. For ensembles of $m$ models, each of width $n$, we show that under gradient flow and the L2 loss, the NTK remains close to its initial value up to a $\mathcal{O}\!\left(\frac{1}{mn}\right)$ correction.

  2. We formulate two optimization problems that seek to find optimal ensembles given a baseline architecture, in the primal and dual form, respectively. The optimally smooth ensemble achieves higher accuracy than the baseline, using the same budget of parameters. The optimally compact ensemble achieves a similar performance as the baseline, with significantly fewer trainable parameters.

  3. We show how optimal grouping in ResNeXt resnext architectures can be derived and improved upon using our framework, without the need for an expensive search over hyper-parameters.

The remainder of the paper is structured as follows. In Sec. 2 we formally define collegial ensembles and present our results for their training dynamics in the large $m$ regime. In Sec. 3 we present our framework for architecture search in the space of collegial ensembles, and in Sec. 4 we present further experimental results on the CIFAR-10/CIFAR-100 cifar and downsampled ImageNet small-imagenet datasets using large scale models.

2 Collegial Ensembles

We dedicate this section to formally defining collegial ensembles and analyzing their properties from the NTK perspective. Specifically, we would like to formalize the notion of the “large ensemble” regime, where the dynamic behaviour is reminiscent of wide single models. In the following analysis we consider a simple feedforward fully connected neural network $f_\theta$, where the widths of the hidden layers are given by $n_1, \dots, n_d$, adopting the common NTK parameterization:

$$x^{(1)} = \frac{1}{\sqrt{n_0}} W^{(1)} x, \qquad x^{(l)} = \frac{1}{\sqrt{n_{l-1}}} W^{(l)} \phi\big(x^{(l-1)}\big) \ \ (l = 2, \dots, d), \qquad f_\theta(x) = \frac{1}{\sqrt{n_d}} W^{(d+1)} \phi\big(x^{(d)}\big),$$

where $x \in \mathbb{R}^{n_0}$ is the input, $\phi$ is the ReLU activation function, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix associated with layer $l$, and $\theta$ denotes the concatenation of the weights of all layers, which are initialized iid from a standard normal distribution.

Given a dataset $\mathcal{X} = \{x_i\}_{i=1}^{N}$, the empirical NTK, denoted by $\Theta_\theta$, is given by $\Theta_\theta(x_i, x_j) = \nabla_\theta f_\theta(x_i)^\top \nabla_\theta f_\theta(x_j)$, where $x_i, x_j \in \mathcal{X}$. Given the network $f_\theta$, we parameterize a space of ensemble members by a multiplicity parameter $m$ and a width vector $n = (n_1, \dots, n_d)$, such that:

$$F_m(x) \;=\; \frac{1}{\sqrt{m}} \sum_{k=1}^{m} f_{\theta_k}(x), \qquad \Theta^{ens}_m \;=\; \frac{1}{m} \sum_{k=1}^{m} \Theta_{\theta_k},$$

where $(\theta_1, \dots, \theta_m)$ is the concatenation of the weights of all the ensemble members, and $\Theta^{ens}_m$ is the NTK of the ensemble. Plainly speaking, the network $f_\theta$ defines a space of ensembles given by the scaled sum of $m$ neural networks of the same architecture, with weights initialized independently from a normal distribution. Since each model in the ensemble is architecturally equivalent to $f_\theta$, it is easy to show that the infinite width kernel is equal for both models: $\lim_{n \to \infty} \Theta^{ens}_m = \lim_{n \to \infty} \Theta_\theta = \Theta^{\infty}$. We define the Neural Mean Kernel (NMK), denoted $\bar\Theta$, as the mean of the empirical NTK:

$$\bar\Theta(x, x') \;=\; \mathbb{E}_{\theta \sim \mathcal{N}(0, I)}\big[\Theta_\theta(x, x')\big].$$

The NMK is defined by an expectation over the normally distributed weights, and does not immediately equal the infinite width limit of the NTK, $\Theta^{\infty}$. The following lemma stems from the application of the strong law of large numbers (LLN):

Lemma 1 (Infinite ensemble).

For any fixed widths, the following holds almost surely:

$$\lim_{m \to \infty} \Theta^{ens}_m(x, x') \;=\; \bar\Theta(x, x').$$

We refer the reader to Sec. H in the appendix for the full proofs of Lemma 1 and Theorem 1. While neither $\bar\Theta$ nor $\Theta^{\infty}$ depends on the weights, they are defined differently: $\bar\Theta$ is defined by an expectation over the weights and depends on the widths of the architecture, whereas $\Theta^{\infty}$ is defined by an infinite width limit. However, empirical observations using Monte Carlo sampling, as presented in Fig. 2, show little to no dependence of the NMK on the widths. Moreover, we empirically observe that $\bar\Theta \approx \Theta^{\infty}$ (note that similar observations have been reported in NTK).
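To make the statement of Lemma 1 concrete, the following minimal sketch estimates one entry of the ensemble NTK by Monte Carlo over random initializations. It is an illustration under simplifying assumptions (one-hidden-layer ReLU members in NTK parameterization, a single output); the function names (member_grad, ensemble_ntk, ...) are ours and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n = 16, 64                     # input dimension and member width

def member_grad(params, x):
    """Gradient of a one-hidden-layer member f_theta(x) w.r.t. all its weights."""
    W1, w2 = params
    h = W1 @ x / np.sqrt(d_in)       # NTK-parameterized pre-activations
    a = np.maximum(h, 0.0)           # ReLU
    g_w2 = a / np.sqrt(n)            # d f / d w2
    g_W1 = np.outer((w2 * (h > 0)) / np.sqrt(n), x / np.sqrt(d_in))  # d f / d W1
    return np.concatenate([g_W1.ravel(), g_w2])

def ensemble_ntk(members, x1, x2):
    # The CE output is (1/sqrt(m)) * sum_k f_k, so its NTK is the mean of member NTKs.
    return np.mean([member_grad(p, x1) @ member_grad(p, x2) for p in members])

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)

def sample_ensemble_ntk(m, trials=200):
    """Monte Carlo samples of one ensemble NTK entry over random initializations."""
    vals = []
    for _ in range(trials):
        members = [(rng.standard_normal((n, d_in)), rng.standard_normal(n))
                   for _ in range(m)]
        vals.append(ensemble_ntk(members, x1, x2))
    return np.array(vals)

for m in (1, 8, 64):
    v = sample_ensemble_ntk(m)
    print(f"m={m:3d}  mean={v.mean():.3f}  std={v.std():.3f}")
```

As $m$ grows at a fixed width, the spread of the sampled entries shrinks roughly as $1/\sqrt{m}$ while their mean stays put, which is the convergence to the NMK that Fig. 2 visualizes.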

(a)
(b)
Figure 2: Convergence of the ensemble NTK to the NMK when increasing the number of models in the ensemble. The ensemble NTK is computed for both diagonal and off-diagonal elements for increasing $m$. The smoother surface as $m$ increases in (a) demonstrates the convergence of the NTK. The black line in (b), computed with a wider model, shows that the convergence is indeed to the NMK, and that the NTK mean does not depend on the width of the model. Each model in the ensemble is a fully connected network.

We next show that under gradient flow, $\Theta^{ens}_m$ remains close to its initial value for the duration of the optimization process when $m$ is large. Given the label vector $\mathcal{Y}$ and the cost function at time $t$, $\mathcal{L}_t$, under gradient flow with learning rate $\eta$ the weights evolve over continuous time according to $\dot\theta_t = -\eta \nabla_\theta \mathcal{L}_t$. The following theorem gives an asymptotic bound on the leading correction of the ensemble NTK over time. For simplicity, we state our result for constant width networks.

Theorem 1 (NTK evolution over time).

For analytic activation functions and the cost $\mathcal{L} = \tfrac{1}{2}\|F_m(\mathcal{X}) - \mathcal{Y}\|^2$, it holds for any finite $t$:

$$\Theta^{ens}_m(t) - \Theta^{ens}_m(0) \;=\; \mathcal{O}_p\!\left(\frac{1}{mn}\right),$$

where the notation $\mathcal{O}_p$ states that the quantity is stochastically bounded.

We verified the result of Theorem 1 for fully connected networks trained on MNIST; the results are summarized in Fig. 9 in the appendix. Large collegial ensembles therefore refer to a regime where the multiplicity $m$ is large. In the case of infinite multiplicity, the optimization dynamics reduce to kernel gradient descent with $\bar\Theta$, rather than $\Theta^{\infty}$, as the relevant kernel function. A striking implication arises from Theorem 1: the total parameter count in the ensemble is linear in $m$ and quadratic in $n$, hence it is much more parameter efficient to increase $m$ rather than $n$. Since the “large” regime depends on both $m$ and $n$, CE possess an inherent degree of flexibility in their practical design. As we show in the following section, this increased flexibility allows the design of both parameter efficient ensembles and better performing ensembles, compared with the baseline model.
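As a back-of-the-envelope illustration of this trade-off (our own count, assuming constant width members of depth $d$ with input dimension $n_0$ and output dimension $n_{out}$, biases ignored):

$$\#\mathrm{params}(m, n) \;=\; m\Big(n_0\, n + (d-1)\, n^2 + n\, n_{out}\Big) \;\approx\; m\, d\, n^2 \quad \text{when the hidden widths dominate},$$

so doubling $m$ doubles the parameter count while doubling $n$ roughly quadruples it, even though both halve the $\mathcal{O}\!\left(\frac{1}{mn}\right)$ correction of Theorem 1.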

3 Efficient Ensembles

In this section, we show how Proposition 1 can be used to derive optimally smooth and compact ensembles out of a predefined set of ensembles, parameterized by the multiplicity $m$ and width vector $n$, relative to some baseline model $f_{\hat\theta}$, where $\hat n$ is the width vector of the baseline model. We define the parameter efficiency $\rho$ by the ratio between the parameter count in the baseline model, denoted by $\hat P$, and the parameter count in a single model in the ensemble, denoted by $P(n)$:

$$\rho \;=\; \frac{\hat P}{m\, P(n)}.$$

Using Proposition 1, we model the behaviour of the variance of the NTK as a function of the widths $\{n_l\}$ and depth $d$ as:

$$\frac{\mathrm{Var}\big[\Theta_\theta\big]}{\mathbb{E}\big[\Theta_\theta\big]^2} \;\approx\; e^{\alpha \sum_{l=1}^{d} \frac{1}{n_l}} - 1,$$

where $\alpha$ is a constant that depends on the architecture and is fitted empirically (Sec. 4.2).

Primal formulation:

We cast the primal objective as an optimization problem, where we would like to find the parameters $(m^\star, n^\star)$ that correspond to the smoothest ensemble:

$$(m^\star, n^\star) \;=\; \arg\min_{m,\, n}\; \mathrm{Var}\big[\Theta^{ens}_m\big] \quad \text{s.t.} \quad m\, P(n) = \hat P.$$

Since the weights of each model are sampled independently, it holds that:

$$\mathrm{Var}\big[\Theta^{ens}_m\big] \;=\; \frac{\mathrm{Var}\big[\Theta_\theta\big]}{m}.$$

Equating the parameter count of the ensemble and the baseline to maintain a fixed efficiency, we can derive for each width $n$ the number of models in the primal formulation:

$$m(n) \;=\; \frac{\hat P}{P(n)}.$$

The optimal parameters $(m^\star, n^\star)$ can then be found using a grid search over $n$.

Dual formulation:

The dual formulation can be cast as an optimization problem with the following objective:

$$(m^\star, n^\star) \;=\; \arg\min_{m,\, n}\; m\, P(n) \quad \text{s.t.} \quad \mathrm{Var}\big[\Theta^{ens}_m\big] = \mathrm{Var}\big[\Theta_{\hat\theta}\big].$$

Matching the smoothness criterion using Eq. 3, we can derive for each width $n$ the number of models in the dual formulation:

$$m(n) \;=\; \frac{\mathrm{Var}\big[\Theta_{\theta(n)}\big]}{\mathrm{Var}\big[\Theta_{\hat\theta}\big]} \;=\; \frac{e^{\alpha \sum_l 1/n_l} - 1}{e^{\alpha \sum_l 1/\hat n_l} - 1}.$$

Ideally, we can find $n$ such that the total parameter count in the ensemble is considerably reduced. Equating the solutions of the primal and dual problems in Eq. 3 and Eq. 3, it is straightforward to verify that $n^\star_{\mathrm{primal}} = n^\star_{\mathrm{dual}}$, implying strong duality. Therefore, the primal and dual solutions differ only in the optimal multiplicity of the ensemble. Both objectives are plotted in Fig. 3 using a feedforward fully connected network baseline with constant width.
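The grid search implied by the two formulations is short enough to sketch end to end. The toy Python snippet below is our own illustration (not the paper's code); it assumes constant width fully connected members, an exact parameter counter P(n), and a fitted $\alpha$ as in Sec. 4.2.

```python
import numpy as np

d, n_hat, alpha = 3, 512, 1.0        # assumed baseline: depth, width, fitted alpha
d_in, d_out = 784, 10

def P(n):
    # Weight count of one constant-width member (biases ignored).
    return d_in * n + (d - 1) * n * n + n * d_out

def var_member(n):
    # Normalized NTK variance of a single member under the fitted exponential model.
    return np.exp(alpha * d / n) - 1.0

P_base, var_base = P(n_hat), var_member(n_hat)
widths = np.arange(8, n_hat + 1)

# Primal: match the baseline parameter budget, minimize the ensemble variance Var/m.
m_primal = P_base / P(widths)
n_p = widths[np.argmin(var_member(widths) / m_primal)]
print("primal: n* =", n_p, " m* =", round(float(P_base / P(n_p))))

# Dual: match the baseline variance, minimize the total parameter count m * P(n).
m_dual = var_member(widths) / var_base
n_d = widths[np.argmin(m_dual * P(widths))]
eff = P_base / (var_member(n_d) / var_base * P(n_d))
print("dual:   n* =", n_d, " m* =", round(float(var_member(n_d) / var_base), 1),
      " efficiency =", round(float(eff), 2))
```

Both searches minimize the same product of the variance term and $P(n)$ up to constants, so they return the same $n^\star$ and differ only in $m^\star$, which is exactly the strong duality noted above.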

Note that the efficient ensembles framework outlined in this section can readily be applied with different efficiency metrics. For instance, instead of using the parameter efficiency, one could consider the FLOPs efficiency (see Appendix E).

(a) Primal formulation
(b) Dual formulation
Figure 3: Primal and dual objective curves for a baseline feedforward fully connected network with constant width. (a) The minimizer of the primal objective (red) is achieved at the optimal width and multiplicity. (b) The maximizer of the dual objective (red) is achieved at the optimal width and multiplicity, attaining the corresponding parameter efficiency.

4 Experiments

In the following section we conduct experiments both to validate our assumptions and to evaluate our framework for efficient ensemble search. Starting with a toy example, we evaluate the effect of $m$ and $n$ on test performance using fully connected models trained on the MNIST dataset. We then move to larger scale models trained on CIFAR-10/100 and the downsampled ImageNet datasets.

4.1 Ablation Study – MNIST

An effective way to improve the performance of a neural network is to increase its size. Recent slim architectures, such as ResNeXt, demonstrate that it is possible to maintain accuracy while significantly reducing the parameter count. In Fig. 4 we provide further empirical evidence that the capacity of a network by itself is not a good predictor of performance, when decoupled from other factors.

Specifically, we demonstrate a strong correlation between the empirical test error and the variance $\mathrm{Var}\big[\Theta^{ens}\big]$, while the number of parameters is kept constant (primal). On the other hand, increasing the number of parameters while keeping the variance constant (dual) does not improve the performance. For both experiments we use as a baseline a fully connected model with constant width per layer. The per-layer width of each model in the ensemble is determined by the corresponding constraint. Each ensemble was trained on MNIST for 70 epochs with the Adam optimizer, and the accuracy was averaged over 100 repetitions.

(a) Primal
(b) Dual
(c) Primal
(d) Dual
Figure 4: Decoupling capacity and variance. The error (blue) is highly correlated with the variance, and less sensitive to the parameter count. (a) and (b) show that the theoretical variance of the model correlates well with accuracy. (c) and (d) show the corresponding number of parameters. Decreasing the variance (a) improves performance when the parameter count is fixed (c). Increasing the parameter count significantly (d) without reducing the variance (b) can cause a degradation in performance due to overfitting.

4.2 Aggregated Residual Transformations

ResNeXt resnext introduces the concept of aggregated residual transformations utilizing group convolutions, which achieves a better parameter/capacity trade-off than its ResNet counterpart. In this paper, we view ResNeXt as a special case of CE, where the ensembling happens at the block level. We hypothesize that the optimal blocks identified with CE will lead to an overall better model when stacked, and in doing so we gain the benefit of factorizing the design space of a network into modular levels. See Algorithm 2 for the detailed recipe.
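The practical vehicle for this block-level ensembling is the group convolution: a convolution with $g$ groups is exactly $g$ independent narrow convolutions applied to disjoint channel slices, which the following PyTorch check illustrates (our own snippet, not from the paper; in the actual ResNeXt block the subsequent 1×1 convolution is what merges the branch outputs).

```python
import torch
import torch.nn.functional as F

g, cin, cout = 4, 16, 32
x = torch.randn(1, cin, 8, 8)
conv = torch.nn.Conv2d(cin, cout, kernel_size=3, padding=1, groups=g, bias=False)
y_grouped = conv(x)

# Equivalent computation: g independent narrow convolutions, one per channel slice.
w_chunks = conv.weight.chunk(g, dim=0)   # each (cout//g, cin//g, 3, 3)
x_chunks = x.chunk(g, dim=1)             # each (1, cin//g, 8, 8)
y_manual = torch.cat([F.conv2d(xc, wc, padding=1)
                      for xc, wc in zip(x_chunks, w_chunks)], dim=1)

print(torch.allclose(y_grouped, y_manual, atol=1e-6))  # True
```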


Figure 5: The baseline ResNet block used for fitting $\alpha$ (see Sec. 4.2).

For these experiments, we use both the CIFAR-10 and CIFAR-100 datasets, following the implementation details described in resnext. We also report results for downsampled ImageNet, a dataset introduced in small-imagenet that preserves the number of samples and classes of the original ImageNet-1K imagenet but downsamples the image resolution. In addition, we report results on regular ImageNet in Appendix A and on another downsampled ImageNet resolution small-imagenet in Appendix B.

Fitting $\alpha$ to a ResNet block. The first step in the optimization required for both the primal and dual objectives is to approximate the parameter $\alpha$ in Eq. 3. For convolutional layers, the term $1/n_l$ in the sum is replaced by the inverse fan-in of layer $l$. Following Algorithm 1, we approximate the $\alpha$ corresponding to the ResNet block depicted in Fig. 5. We compute a Monte Carlo estimate of the second moment of one diagonal entry of the NTK matrix for increasing width and a fixed block topology. For simplicity, we fit the second moment normalized by the squared first moment, $\mathbb{E}[\Theta^2]/\mathbb{E}[\Theta]^2$, which can easily be fitted with a first degree polynomial when considering its natural logarithm. We show the fitted second moment in Appendix Fig. 8.
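A minimal sketch of the fitting step of Algorithm 1, assuming Monte Carlo samples of one diagonal NTK entry are already available for several candidate widths (they could be produced by a routine like the NTK sketch in Sec. 2); fit_alpha and the synthetic data below are illustrative names and values, not the paper's code.

```python
import numpy as np

def fit_alpha(ntk_samples_per_width, inv_fanin_sums):
    """Fit alpha in  E[Theta^2] / E[Theta]^2 ~ exp(alpha * sum_l 1/fan_l)."""
    log_ratios = []
    for samples in ntk_samples_per_width:
        samples = np.asarray(samples, dtype=float)
        ratio = np.mean(samples ** 2) / np.mean(samples) ** 2   # normalized 2nd moment
        log_ratios.append(np.log(ratio))
    # First degree polynomial of the log-ratio in the inverse-fan-in sum; the slope is alpha.
    alpha, intercept = np.polyfit(np.asarray(inv_fanin_sums), np.asarray(log_ratios), 1)
    return alpha, intercept

# Synthetic usage example: fake NTK samples whose spread shrinks with "width".
rng = np.random.default_rng(0)
inv_sums = np.array([0.30, 0.15, 0.075, 0.0375])
fake = [1.0 + np.sqrt(np.exp(1.7 * s) - 1.0) * rng.standard_normal(2000) for s in inv_sums]
print(fit_alpha(fake, inv_sums))   # the fitted slope should come out close to 1.7
```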

4.2.1 Experiments on CIFAR-10/100

Primal formulation. As a baseline architecture, we use a ResNet, following the notation of resnext Section 5.1. Following Algorithm 2, we compute the primal objective over a grid of widths and find the optimal $(m^\star, n^\star)$, after adjusting the multiplicity to match the number of parameters of the baseline and to account for rounding errors and different block topology approximations. As can be seen in Table 1(a), the model achieving the primal optimum attains a better test error on CIFAR-10/100 than the ResNeXt baseline at a similar parameter budget. We also report results for a wider baseline from resnext and observe similar trends. The test error for multiple models sitting on the primal curve is depicted in Fig. 6(a) for CIFAR-100 and in Appendix Fig. 7(a) for CIFAR-10. Test errors are averaged over the last 10 epochs and over 10 runs.

Input: Baseline module with widths $\hat n$, a set of width ratios $\{r_i\}$, number of random initializations $S$, input sample $x$.
Output: Fitted $\alpha$.
for each width ratio $r_i$ do
      Construct the module with widths $r_i \hat n$.
      for $s = 1, \dots, S$ do
            Sample the weights of the module. Compute the diagonal NTK entry $\Theta(x, x)$.
      Estimate $\mathbb{E}[\Theta^2]/\mathbb{E}[\Theta]^2$ from the $S$ samples.
Fit $\alpha$ using Eq. 3.
Algorithm 1: Fitting $\alpha$ per architecture
Input: Baseline module with widths $\hat n$, a set of width ratios $\{r_i\}$, number of random initializations $S$, input sample $x$, objective (Primal or Dual).
Output: Optimal $(m^\star, n^\star)$.
Fit $\alpha$ using Algorithm 1.
if Primal then
      find $(m^\star, n^\star)$ using Eq. 3.
else if Dual then
      find $(m^\star, n^\star)$ using Eq. 3.
Round $m^\star$ and $n^\star$ to the nearest integer values.
Algorithm 2: Find optimal CE module
(a) Primal
(b) Dual
Figure 6: Test errors (in %) on CIFAR-100. A clear correlation is observed between the test error and the primal curve. (a) Different models sitting on the primal curve; the model achieving the primal minimum attains the best error. (b) Different models sitting on the dual curve; the model achieving the dual maximum attains a test error comparable to the baseline with a fraction of the parameters. Results are reported over 10 runs and shown with standard error bars.

Dual formulation. Using the same ResNet base block as for the primal experiment, and thus the same fitted $\alpha$, we compute the optimal $(m^\star, n^\star)$ maximizing the parameter efficiency curve and find the same optimal width as in the primal case, with a different multiplicity. The resulting ResNeXt network has a fraction of the parameters of the baseline and achieves similar or slightly degraded performance on CIFAR-10/100, as shown in Table 1(b). The efficiency curve depicted in red in Fig. 6(b) is constructed using a single ResNet block topology and with non-integer values of $m$, as described above. Thus it only approximates the real parameter efficiency, which explains why some models in the close vicinity of the optimum have a slightly higher real efficiency, as can be seen in Table 1(b). The test error for multiple models sitting on the dual curve is depicted in Fig. 6(b) for CIFAR-100 and in Appendix Fig. 7(b) for CIFAR-10.

4.2.2 Experiments on Downsampled ImageNet

Primal formulation. Using the same baseline architecture as for the CIFAR experiments, we train the model achieving the primal optimum and report results in Table 2. Our optimal model achieves lower top-1 and top-5 errors than the baseline ResNeXt architecture derived in resnext at a similar parameter budget. We use the same augmentations and learning rate schedule as small-imagenet.

Dual formulation. Using the same baseline and the optimally compact architecture derived in Section 4.2.1, we observe a similar trend: our optimal model suffers a lighter top-1 and top-5 degradation than the Wide ResNet variant with a reduced parameter budget, while using a fraction of the parameters of the baseline. Sampling models on the dual curve with a lower efficiency, we find models that suffer less than a one percent drop in top-1 and top-5 error with a significantly lower parameter count.

Model Params C10 C100
 resnext 13.8M 4.08 19.39
 resnext 13.3M 3.96 18.57
(Ours) 13.7M 3.82 17.94
(Ours) 12.9M 3.74 18.05
Wide ResNet wideresnet 36.5M 4.17 20.50
36.3M 3.88 18.36
 resnext 34.4M 3.65 17.77
(Ours) 36.3M 3.65 17.44
(a) Primal
Table 1: Primal and dual results for ResNeXt-29 baselines on CIFAR-10/100. Test errors (in %) for CIFAR-10 (C10) and CIFAR-100 (C100) along with model sizes are reported. All models are variants of ResNeXt-29 except for Wide ResNet. (a) The optimally smooth models surpass the baselines with the same number of parameters. (b) The optimally compact models use only a fraction of the parameters, yet attain similar or slightly degraded test errors. We indicate the parameter efficiency for each model; baseline results were reproduced from the cited papers, and some of our entries correspond to models in the close vicinity of the optima. Results are averaged over 10 runs. Models were trained on 8 GPUs unless indicated by §, in which case they were trained on a single GPU.
Model Efficiency Params Top-1 error Top-5 error
Wide ResNet 28-10 small-imagenet - 37.1M 40.96 18.87
ResNet-29, 1 14.8M 40.61 17.82
ResNeXt-29, 1 14.4M 39.58 17.09
ResNeXt-29, (Ours) 1 14.8M 38.41 16.13
Wide ResNet 28-5 small-imagenet 1.6 9.5M 45.36 21.36
ResNeXt-29, (Ours) 2.8 5.2M 43.36 19.65
ResNeXt-29, (Ours) 1.8 8.0M 41.54 18.58
Table 2: Primal and dual results for ResNet baselines on downsampled ImageNet. Top-1 and top-5 errors (in %) and model sizes are reported. The optimally smooth model surpasses the baseline architectures from resnext with the same number of parameters. The optimally compact models achieve slightly degraded results with significantly fewer parameters. Results are averaged over 5 runs.

5 Related Work

Various forms of multi-pathway neural architectures have surfaced over the years. In the seminal AlexNet architecture alexnet, group convolutions were used as a method to distribute training over multiple GPUs. More recently, group convolutions were popularized by ResNeXt resnext, which empirically demonstrated the benefit of aggregating multiple residual branches. In shufflenet, a channel shuffle unit was introduced in order to promote information transfer between different groups. In condensenet and other, the connections between pre-defined sets of groups are learned in a differentiable manner, and in prune, grouping is achieved through pruning of full convolutional blocks. On a seemingly unrelated front, the theoretical study of wide neural networks has seen considerable progress recently. A number of papers wide; yang; exact; guy; enhanced have followed in the footsteps of the original NTK paper NTK. In wide, it is shown that wide models of any architecture evolve as linear models, and in yang, a general framework for computing the NTK of a broad family of architectures is proposed. Finite width corrections to the NTK are derived in boris1; me; guy. In this work, we extend the “wide” regime to the multiplicity dimension, and show two distinct regimes where different kernels reign. We then use finite width corrections of the NTK to formulate two optimality criteria, and demonstrate their usefulness in predicting efficient and performant ensembles.

6 Conclusion

Understanding the effects of model architecture on training and test performance is a longstanding goal of the deep learning community. In this work we analyzed collegial ensembling, a general technique used in practice where multiple, functionally identical pathways are aggregated. We showed that collegial ensembles exhibit two distinct regimes of over-parameterization, defined by large width and large multiplicity, with a distinct kernel governing the dynamics of each. Between these two regimes, we introduced a framework for deriving optimal ensembles in the sense of minimum capacity or maximum trainability. Empirical results on practical models demonstrate the predictive power of our framework, paving the way for more principled architecture search algorithms.

References

Appendix A Results and Implementation Details: ImageNet

Primal formulation. Following [resnext], we use ResNet-50 and ResNet-101 as baselines and report results in Table 3(a). Our ResNet-50 based optimal model obtains slightly better top-1 and top-5 errors than the baseline reported in [resnext]. This is quite remarkable given that [resnext] converged to this architecture via an expensive grid search. Our ResNet-101 based optimal model achieves a significantly better top-1 and top-5 error than the ResNet-101 baseline, and a slightly higher top-1 and top-5 error than the ResNeXt-101 baseline.

Dual formulation. Using ResNet-50 and ResNet-101 as baselines, we find models that achieve similar top-1 and top-5 errors with significantly fewer parameters. Detailed results can be found in Table 3(b).

Model Params Top-1 error Top-5 error
ResNet-50,  [resnext] 25.6M 23.93 7.11
ResNeXt-50,  [resnext] 25.0M 22.42 6.36
ResNeXt-50, (Ours) 25.8M 22.37 6.30
ResNeXt-50, (Ours) 25.1M 22.39 6.38
ResNet-101,  [resnext] 44.5M 22.32 6.25
ResNeXt-101,  [resnext] 44.2M 21.01 5.72
ResNeXt-101, (Ours) 45.5M 21.16 5.74
ResNeXt-101, (Ours) 44.2M 21.20 5.76
(a) Primal
Model Efficiency Params Top-1 error Top-5 error
ResNet-50,  [resnext] 1 25.6M 23.93 7.11
ResNeXt-50, (Ours) 1.3 19.4M 23.80 7.11
ResNeXt-50, (Ours) 1.5 17.1M 24.00 7.11
ResNet-101,  [resnext] 1 44.5M 22.32 6.25
ResNeXt-101, (Ours) 1.4 32.9M 22.06 6.11
ResNeXt-101, (Ours) 1.7 25.8M 22.30 6.27
(b) Dual
Table 3: Primal and dual results for ResNet baselines on ImageNet. Top-1 and top-5 errors (in %) and model sizes are reported. We indicate the parameter efficiency for each model in the dual table; baseline results were reproduced from the cited paper, and some of our entries correspond to models in the close vicinity of the optimum. Results are averaged over 5 runs.

Implementation details. We follow [resnext] for the implementation details of ResNet-50, ResNet-101 and their ResNeXt counterparts. We use SGD with momentum and a batch size of 256 on GPUs ( per GPU). The weight decay is and the initial learning rate . We train the models for epochs and divide the learning rate by a factor of at epoch , and . We use the same data normalization and augmentations as in [resnext] except for lighting that we do not use.

Appendix B Results and Implementation Details: Downsampled ImageNet

Implementation details. In order to adapt the ResNeXt-29 architectures used for CIFAR-10/100 and ImageNet to the resolution of ImageNet, we add an additional stack of three residual blocks following [small-imagenet]. Following the general parametrization of ResNeXt [resnext], we multiply the width of this additional stack of blocks by

and downsample the spatial maps by the same factor using a strided convolution in the first residual block. We use SGD with

momentum and a batch size of on 8 GPUs ( per GPU). The weight decay is and the initial learning rate . We train the models for epochs and divide the learning rate by a factor at epoch , and . We use the same data normalization and augmentations as [small-imagenet].

Model Efficiency Params Top-1 error Top-5 error
Wide ResNet 36-5 [small-imagenet] - 37.6M 32.34 12.64
ResNet-38, 1 37.5M 29.36 10.10
ResNeXt-38, 1 39.0M 28.86 9.72
ResNeXt-38, (Ours) 1 36.3M 28.34 9.38
ResNeXt-38, (Ours) 2.3 16.3M 30.93 11.02
ResNet-38, 1 57.8M 28.56 9.67
ResNeXt-38, 1 56.1M 28.01 9.28
ResNeXt-38, (Ours) 1 57.7M 27.24 8.74
ResNeXt-38, (Ours) 3.0 19.1M 30.22 10.60
Table 4: Primal and dual results for ResNet baselines on downsampled ImageNet. Top-1 and top-5 errors (in %) and model sizes are reported. The optimally smooth models surpass the baseline architectures from [resnext] with the same number of parameters. The optimally compact models achieve slightly degraded results with significantly fewer parameters. Results are averaged over 5 runs.

Appendix C Implementation Details: Downsampled ImageNet

We use the same ResNeXt-29 architectures from the CIFAR experiments. We use SGD with momentum and a batch size of on GPUs ( per GPU). The weight decay is and the initial learning rate . We train the models for epochs and divide the learning rate by a factor at epoch , and . We use the same data normalization and augmentations as in [small-imagenet].

Appendix D Results on CIFAR-10

Results are shown in Fig. 7 and implementation details can be found in the main text in Sec. 4.2.1.

(a) Primal
(b) Dual
Figure 7: Test errors (in %) on CIFAR-10. A clear correlation is observed between the test error and the primal curve. (a) Different models sitting on the primal curve; the model achieving the primal minimum attains the best error. (b) Different models sitting on the dual curve; the model achieving the dual maximum attains a slightly higher test error than the baseline with a fraction of the parameters. Results are reported over 10 runs and shown with standard error bars.

Appendix E FLOPs efficiency

In Sec. 3 and the rest of the paper, we considered the parameter efficiency, defined as the ratio between the number of parameters of the baseline model and of the ensemble (see Eq. 3). Using this definition of efficiency, models satisfying the primal objective were models with a similar number of parameters. Instead of using the parameter efficiency, we can consider the FLOPs efficiency in the same way:

$$\rho_{\mathrm{FLOPs}} \;=\; \frac{\widehat{\mathrm{FLOPs}}}{m \cdot \mathrm{FLOPs}(n)},$$

where $\widehat{\mathrm{FLOPs}}$ and $\mathrm{FLOPs}(n)$ are the number of FLOPs of the baseline model and of one model in the ensemble, respectively. We report results for the primal formulation on CIFAR-10/100 in Table 5. We see that the model achieving the primal optimum attains a better test error on CIFAR-10 and CIFAR-100 with a similar number of FLOPs.

Model GFLOPs Params C10 C100
ResNet-29  [resnext] 4.19 13.8M 4.08 19.39
ResNeXt-29  [resnext] 4.15 13.3M 3.96 18.57
ResNeXt-29 (Ours) 4.20 12.7M 3.66 17.86
ResNeXt-29 (Ours) 4.17 12.6M 3.73 18.04
Table 5: Results for ResNeXt-29 baselines on CIFAR-10/100 when keeping FLOPs constant instead of the number of parameters. Test errors (in %) for CIFAR-10 (C10) and CIFAR-100 (C100) along with model GFLOPs and sizes are reported. The optimally smooth model surpasses the baselines with the same number of FLOPs. Baseline results were reproduced from the cited papers, and some of our entries correspond to models in the close vicinity of the optimum. Results are averaged over 10 runs.

Appendix F Fitting to a ResNet Block

Figure 8: Estimating the variance of a ResNet block and fitting $\alpha$. The Monte Carlo estimate is computed over repeated trials, and $\alpha$ is fitted following Algorithm 1, as described in Sec. 4.2.

Appendix G Figure Illustrating Theorem 1

Figure 9: Dynamics of the NTK during training as a function of width and multiplicity for (a) a baseline single model, (b) an ensemble, and (c) an ensemble with constant width and multiplicity chosen to match the number of parameters in the baseline (red). The NTK was computed for a single off-diagonal entry of a fully connected network trained on MNIST. The y axis corresponds to the absolute change, in log scale, between the NTK value at initialization and after training. As predicted by Theorem 1, the baseline model and the ensemble in (b) have equal $mn$, and therefore exhibit a similar correction of the NTK. In (c), the change of the NTK becomes smaller than the baseline, as $mn$ is considerably larger, although the total parameter count is the same as the baseline.

Appendix H Proofs of Lemma 1 and Theorem 1

Lemma 1 (restated).

Proof.

Recall that the NTK of the ensemble is given by the mean:

$$\Theta^{ens}_m \;=\; \frac{1}{m} \sum_{k=1}^{m} \Theta_{\theta_k}.$$

Note that the expectation of each member in the average is identical and finite under Lebesgue integration:

$$\mathbb{E}\big[\Theta_{\theta_k}\big] \;=\; \bar\Theta \quad \text{for all } k.$$

Since each member of the ensemble is sampled independently, we have from the strong law of large numbers (LLN):

$$\frac{1}{m} \sum_{k=1}^{m} \Theta_{\theta_k} \;\xrightarrow[m \to \infty]{\text{a.s.}}\; \bar\Theta,$$

proving the claim. ∎

Theorem 1 (restated).

Proof.

In the following proof, we assume for the sake of clarity that the training data contains a single example. The results, however, hold in the general case. Throughout the proof, we use $\theta_t$ to denote the weights at time $t$, while $\theta_0$ denotes the weights at initialization.

For analytic activation functions, the time evolved kernel is analytic with respect to $t$. Therefore, at any time $t$ we may approximate the kernel using a Taylor expansion evaluated at $t = 0$:

$$\Theta^{ens}_m(t) \;=\; \sum_{j=0}^{\infty} \frac{t^j}{j!}\, \frac{d^j \Theta^{ens}_m}{dt^j}\bigg|_{t=0}.$$

Similarly to the technique used in [wide], we assume we may exchange the Taylor expansion with the large width and multiplicity limits. We now analyze each term in the Taylor expansion separately. Using Eq. 2 in the main text, the $j$-th order term of the ensemble NTK is given by:

$$\frac{t^j}{j!}\, \frac{1}{m} \sum_{k=1}^{m} \frac{d^j \Theta_{\theta_k}}{dt^j}\bigg|_{t=0}.$$

Denote the residuals of the ensemble and of each member at time $t$. Under gradient flow and the cost given by $\mathcal{L}_t = \tfrac{1}{2}\big(F_m(x) - y\big)^2$, the parameters evolve according to:

$$\dot\theta_t \;=\; -\eta\, \nabla_{\theta} \mathcal{L}_t.$$

The parameters of each model in the ensemble evolve according to:

$$\dot\theta_{k,t} \;=\; -\frac{\eta}{\sqrt{m}}\, \big(F_m(x) - y\big)\, \nabla_{\theta_k} f_{\theta_k}(x).$$

The time derivative operator at $t = 0$ can be expanded accordingly. Plugging the definition of the parameter dynamics into this expansion yields an expression in terms of a shorthand operator introduced for convenience. For each model in the ensemble, the $j$-th time derivative of its corresponding NTK at $t = 0$ is therefore given by a sum of such terms.
Using this notation, and noticing that these quantities depend only on the weights at initialization, the following hold:

(1)
(2)
(3)
(4)

Using the above equalities, the terms can be expressed as a sum over a finite set as follows:

Example:

for , the term is expanded using as follows:

Expanding the multiplication and using Eq. 3:

Using the chain rule in Eq. 4, and eliminating elements using Eq. 2:

We can now express the result in the form of the general expansion above.

The $j$-th time derivative of the full ensemble is given by:

Note that for $m = 1$, the term represents the $j$-th time derivative of a single model's NTK under gradient flow with the loss $\mathcal{L}$. Using results¹ from [guy] on wide single fully connected models, together with the independence of the weights of different ensemble members, the relevant set of terms is finite for any fixed $j$, and so we may apply the central limit theorem for large $m$ to the individual terms:

¹In [guy], the result was obtained using a conjecture and demonstrated empirically to be tight. A result on the same quantity is obtained rigorously in [NTK], which yields the stated asymptotic behaviour for the ensemble.

Plugging back into the Taylor expansion yields the desired result. ∎