Neural networks exhibit strong generalization in the over-parameterized regime, a phenomenon that has been widely observed in practice [rethinking, prove, modern, explore]. Recent theoretical advances attempt to understand the trainability and generalization properties of over-parameterized neural networks by observing their nearly convex behaviour at large width [wide, me]. For a wide neural network f(x; θ) with parameters θ and a convex loss, the parameter updates can be represented in function space as kernel gradient descent (GD) updates, with the Neural Tangent Kernel (NTK) acting as the kernel function.
In the infinite width limit, the NTK remains constant during training, and GD reduces to kernel GD, rendering the optimization a convex problem. Hence, over-parameterized models in the "large width" sense both generalize and are simpler to optimize. In this work, we consider a different type of over-parameterization, achieved through ensembling. We denote by collegial ensembles (CE) models whose output, either intermediate or final, is constructed from the aggregation of multiple identical pathways (see illustration in Fig. 1). We show that the training dynamics of CE simplify when the ensemble multiplicity is large, in a similar sense to wide models, yet at a much cheaper cost in terms of parameter count. Our results indicate the existence of a "sweet spot" for the ratio between width and multiplicity, where "close to convex" behaviour does not come at the expense of size. To find this sweet spot, we rely on recent findings on finite-width corrections of gradients [boris1, rtk], though we use them in a more general manner than their original presentation. Specifically, we use the following proposition, stated informally:
(Informal) Denote by Θ the NTK at initialization of a fully connected ANN with hidden layer widths n_1, …, n_L and depth L. There exist positive constants c_1, c_2 such that:

c_1 · Σ_{l=1}^L 1/n_l ≤ Var(Θ) ≤ c_2 · Σ_{l=1}^L 1/n_l,

where the variance is measured on the individual entries of Θ, with respect to the random sampling of the weights.
In [boris1] and [rtk], Proposition 1 is proven for the on-diagonal entries of Θ, in fully connected architectures. In this work, we assume, and empirically verify (see Appendix Fig. 8), that it holds in a more general sense for practical architectures, with different constants c_1, c_2. Since Var(Θ) diminishes with width, we hypothesize that a small-width neural network behaves closer to its large-width counterpart as Var(Θ) decreases. Notably, similar observations using activations and gradient variance as predictors of successful initializations were presented in [boris2, boris3]. Motivated by this hypothesis, we formulate a primal-dual constrained optimization problem that aims to find an optimal CE with respect to the following objectives:
Primal (optimally smooth): minimize Var(Θ) for a fixed number of parameters.
Dual (optimally compact): minimize the number of parameters for a fixed Var(Θ).
The primal formulation seeks to find a CE which mimics the simplified dynamics of a wide model using a fixed budget of parameters, while the dual formulation seeks to minimize the number of parameters without suffering the optimization and performance consequences typically found in the "narrow regime". Using both formulations, we find excellent agreement between our theoretical predictions and empirical results, on both small- and large-scale models.
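The scaling in Proposition 1 can be probed directly with a small Monte Carlo experiment. The sketch below is a toy illustration, not the paper's setup: the one-hidden-layer tanh network, input dimension, and sample counts are all assumptions. It estimates the relative variance of an on-diagonal NTK entry at initialization and checks that it shrinks with width:

```python
import numpy as np

def ntk_entry(x, n, rng):
    """On-diagonal empirical NTK entry Theta(x, x) of a one-hidden-layer
    tanh network f(x) = v . tanh(W x / sqrt(d)) / sqrt(n) (NTK
    parameterization), computed from the exact gradients w.r.t. (W, v)."""
    d = x.shape[0]
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    u = W @ x / np.sqrt(d)
    h = np.tanh(u)
    dv = h / np.sqrt(n)                                               # df/dv
    dW = ((v * (1 - h ** 2))[:, None] * x[None, :]) / np.sqrt(n * d)  # df/dW
    return dv @ dv + np.sum(dW * dW)

def ntk_rel_variance(n, d=8, samples=500, seed=0):
    """Var / mean^2 of the NTK entry over random initializations."""
    rng = np.random.default_rng(seed)
    x = np.ones(d) / np.sqrt(d)
    vals = np.array([ntk_entry(x, n, rng) for _ in range(samples)])
    return vals.var() / vals.mean() ** 2

# Proposition 1 predicts the relative variance shrinks roughly like 1/n.
```

Increasing the width by a factor of 16 should reduce the measured relative variance by roughly the same factor.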
Our main contributions are the following:
We adapt the popular over-parameterization analysis to collegial ensembles (CE), in which the output units of multiple architecturally identical models are aggregated, scaled, and trained as a single model. For ensembles of m models, each of width n, we show that under gradient flow and the L2 loss, the NTK remains close to its initial value, up to a correction that vanishes as the multiplicity m grows.
We formulate two optimization problems that seek to find optimal ensembles given a baseline architecture, in the primal and dual form, respectively. The optimally smooth ensemble achieves higher accuracy than the baseline, using the same budget of parameters. The optimally compact ensemble achieves a similar performance as the baseline, with significantly fewer trainable parameters.
We show how optimal grouping in ResNeXt [resnext] architectures can be derived, and improved upon, using our framework, without the need for an expensive search over hyper-parameters.
The remainder of the paper is structured as follows. In Sec. 2 we formally define collegial ensembles and present our results for their training dynamics in the large-multiplicity regime. In Sec. 3 we present our framework for architecture search in the space of collegial ensembles, and in Sec. 4 we present further experimental results on the CIFAR-10/CIFAR-100 [cifar] and downsampled ImageNet [small-imagenet] datasets using large-scale models.
2 Collegial Ensembles
We dedicate this section to formally defining collegial ensembles and analyzing their properties from the NTK perspective. Specifically, we would like to formalize the notion of the "large ensemble" regime, where the dynamic behaviour of an ensemble is reminiscent of wide single models. In the following analysis we consider a simple feedforward fully connected neural network f(x; θ), where the widths of the hidden layers are given by n_1, …, n_L, adopting the common NTK parameterization:
where x is the input and W^l is the weight matrix associated with layer l; θ denotes the concatenation of all the weights of all layers, initialized i.i.d. from a standard normal distribution. Given a dataset {x_i}, the empirical NTK, denoted by Θ, is given by Θ_ij = ∇_θ f(x_i)ᵀ ∇_θ f(x_j). Given the network f, we parameterize a space of ensemble members by a multiplicity parameter m and a width vector n, such that:

f_E(x) = (1/√m) · Σ_{i=1}^m f(x; θ_i),

where θ_E = (θ_1, …, θ_m) is the concatenation of the weights of all ensemble members, and Θ_E is the NTK of the ensemble. Plainly speaking, the network f defines a space of ensembles given by the scaled sum of m neural networks of the same architecture, with weights initialized independently from a normal distribution. Since each model in the ensemble is architecturally equivalent to f, the infinite-width kernel Θ∞ is the same for both models. We define the Neural Mean Kernel (NMK) K as the mean of the empirical NTK:

K = E_θ[Θ].
The NMK is defined by an expectation over the normally distributed weights, and does not immediately equal the infinite-width limit of the NTK, Θ∞. The following lemma stems from the application of the strong law of large numbers (LLN):
Lemma 1 (Infinite ensemble).
For any fixed widths, it holds almost surely that lim_{m→∞} Θ_E = K.
We defer the full proofs of Lemma 1 and Theorem 1 to Sec. H in the appendix. While neither K nor Θ∞ depends on the weights, they are defined differently: K is defined by an expectation over the weights and depends on the widths of the architecture, whereas Θ∞ is defined by an infinite-width limit. However, empirical observations using Monte Carlo sampling, as presented in Fig. 2, show little to no dependence of the NMK on the widths. Moreover, we empirically observe that K ≈ Θ∞ (note that similar observations have been reported in [NTK]).
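Because the ensemble output is the 1/√m-scaled sum of independently initialized members, the ensemble NTK is exactly the average of the member NTKs — the identity underlying Lemma 1. A minimal numerical check, assuming a one-hidden-layer tanh member network in NTK parameterization (toy choices, not the paper's architectures):

```python
import numpy as np

def member_grad(x, W, v):
    """Flattened gradient of f(x) = v . tanh(W x / sqrt(d)) / sqrt(n)
    with respect to (W, v), in NTK parameterization."""
    n, d = W.shape
    u = W @ x / np.sqrt(d)
    h = np.tanh(u)
    dv = h / np.sqrt(n)
    dW = ((v * (1 - h ** 2))[:, None] * x[None, :]) / np.sqrt(n * d)
    return np.concatenate([dW.ravel(), dv])

rng = np.random.default_rng(1)
d, n, m = 4, 16, 8
x = rng.standard_normal(d)
members = [(rng.standard_normal((n, d)), rng.standard_normal(n)) for _ in range(m)]

# Gradient of f_E = (1/sqrt(m)) * sum_i f_i w.r.t. the concatenated weights:
g_ens = np.concatenate([member_grad(x, W, v) for W, v in members]) / np.sqrt(m)
theta_ens = g_ens @ g_ens  # ensemble NTK entry Theta_E(x, x)

# Average of the member NTK entries:
theta_avg = np.mean([member_grad(x, W, v) @ member_grad(x, W, v)
                     for W, v in members])
```

The two quantities agree to floating-point precision, for any draw of the weights.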
We next show that under gradient flow, Θ_E remains close to its initial value for the duration of the optimization process when m is large. Given the labels vector Y and the cost function at time t, C_t, under gradient flow with learning rate η, the weights evolve over continuous time according to θ̇_t = −η ∇_θ C_t. The following theorem gives an asymptotic bound on the leading correction of the ensemble NTK over time. For simplicity, we state our result for constant-width networks.
Theorem 1 (NTK evolution over time).
For analytic activation functions and the cost function C_t = ½‖f_t − Y‖², the deviation Θ_E(t) − Θ_E(0) is, for any finite t, stochastically bounded by a correction that vanishes as the multiplicity m grows, where the O_P notation states that a quantity is stochastically bounded.
We verified the result of Theorem 1 for fully connected networks trained on MNIST; results are summarized in Fig. 9 in the appendix. Large collegial ensembles therefore refer to a regime where m is large. In the case of infinite multiplicity, the optimization dynamics reduce to kernel gradient descent with K, rather than Θ∞, as the relevant kernel function. A striking implication arises from Theorem 1: the total parameter count in the ensemble is linear in m and quadratic in n, hence it is much more parameter efficient to increase m rather than n. Since the "large" regime depends on both n and m, CE possess an inherent degree of flexibility in their practical design. As we show in the following section, this increased flexibility allows the design of both parameter-efficient ensembles and better-performing ensembles, when compared with the baseline model.
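The claim that the kernel moves less during training as m grows can be illustrated with a toy gradient-descent simulation. Everything below — one-hidden-layer tanh members, a single training example, the step size and step count — is an illustrative assumption, not the paper's experimental setup:

```python
import numpy as np

def forward_grad(x, W, v):
    """Output and exact parameter gradients of a one-hidden-layer
    tanh network in NTK parameterization."""
    n, d = W.shape
    u = W @ x / np.sqrt(d)
    h = np.tanh(u)
    f = v @ h / np.sqrt(n)
    dv = h / np.sqrt(n)
    dW = ((v * (1 - h ** 2))[:, None] * x[None, :]) / np.sqrt(n * d)
    return f, dW, dv

def ntk_drift(m, n=32, d=4, steps=300, lr=0.2, seed=0):
    """Relative change of the ensemble NTK after training the ensemble
    f_E = (1/sqrt(m)) * sum_i f_i on one example with the L2 loss."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    y = 1.0
    params = [(rng.standard_normal((n, d)), rng.standard_normal(n))
              for _ in range(m)]

    def ensemble_state():
        f, k, grads = 0.0, 0.0, []
        for W, v in params:
            fi, dW, dv = forward_grad(x, W, v)
            f += fi / np.sqrt(m)
            k += (np.sum(dW * dW) + dv @ dv) / m  # Theta_E = mean of member NTKs
            grads.append((dW, dv))
        return f, k, grads

    _, k0, _ = ensemble_state()
    for _ in range(steps):
        f, _, grads = ensemble_state()
        r = f - y
        for (W, v), (dW, dv) in zip(params, grads):
            W -= lr * r * dW / np.sqrt(m)  # loss gradient w.r.t. member weights
            v -= lr * r * dv / np.sqrt(m)
    _, k_end, _ = ensemble_state()
    return abs(k_end - k0) / k0
```

At the same member width, an ensemble with larger m should exhibit a smaller relative kernel drift over the course of training.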
3 Efficient Ensembles
In this section, we show how Proposition 1 can be used to derive optimally smooth and optimally compact ensembles out of a predefined set of ensembles, parameterized by (m, n) and some baseline model with width vector n̄. We define the parameter efficiency as the ratio between the parameter count of the baseline model, denoted by P_b, and the total parameter count of the ensemble, where P_e denotes the parameter count of a single model in the ensemble:

ρ = P_b / (m · P_e).
Using Proposition 1, the behaviour of the variance of the NTK of a single model, as a function of the widths n_1, …, n_L and depth L, is approximated by:

Var(Θ) ≈ α · Σ_{l=1}^L 1/n_l,

for some constant α that depends on the architecture.
We cast the primal objective as an optimization problem, where we would like to find the parameters (m, n) that correspond to the smoothest ensemble under a fixed parameter budget:

min_{m,n} Var(Θ_E)   s.t.   m · P_e(n) = P_b.
Since the weights of each model are sampled independently, it holds that:

Var(Θ_E) = Var(Θ) / m.
Equating the parameter count in both models to maintain a fixed efficiency, we can derive, for each width n, the number of models m in the primal formulation:

m(n) = P_b / P_e(n).
The optimal parameters can be found using a grid search.
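A sketch of this grid search, under the assumption that the variance proxy takes the form α · Σ_l 1/n_l, and with a hypothetical fully connected parametrization (widths, depth, and input/output sizes below are placeholders, not values from the paper):

```python
def fc_params(width, depth, d_in=784, d_out=10):
    """Weight count of a fully connected net with `depth` hidden layers."""
    dims = [d_in] + [width] * depth + [d_out]
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))

def primal_search(base_width=500, depth=3, widths=range(10, 501, 10), alpha=1.0):
    """Grid search for the primal optimum: minimize the variance proxy
    alpha * (depth / n) / m subject to m * P_e(n) = P_b."""
    p_base = fc_params(base_width, depth)
    best = None
    for n in widths:
        m = p_base / fc_params(n, depth)  # multiplicity matching the budget
        if m < 1:
            continue
        var_proxy = alpha * (depth / n) / m
        if best is None or var_proxy < best[2]:
            best = (n, m, var_proxy)
    return best

n_star, m_star, v_star = primal_search()
```

By construction, every candidate on the grid matches the baseline parameter budget, and the returned variance proxy is no worse than that of the baseline (m = 1, n = base width).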
The dual formulation can be cast as an optimization problem with the following objective:

min_{m,n} m · P_e(n)   s.t.   Var(Θ_E) = Var(Θ_b),

where Θ_b denotes the NTK of the baseline model.
Matching the smoothness criterion using Eq. 3, we can derive, for each width n, the number of models m in the dual formulation:

m(n) = (Σ_l 1/n_l) / (Σ_l 1/n̄_l).
Ideally, we can find (m, n) such that the total parameter count in the ensemble is considerably reduced. Equating the solutions of the primal and dual problems in Eq. 3, it is straightforward to verify that the optimal widths coincide, implying strong duality. Therefore, the primal and dual solutions differ only in the optimal multiplicity of the ensemble. Both objectives are plotted in Fig. 3 using a feedforward fully connected network baseline of fixed depth and constant width.
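The strong-duality observation can be checked on a toy parametrization: with a variance proxy of the form α · L/n for constant-width members, the primal and dual objectives reduce to expressions with the same minimizer over widths. All model sizes below are hypothetical:

```python
def fc_params(width, depth, d_in=784, d_out=10):
    """Weight count of a fully connected net with `depth` hidden layers."""
    dims = [d_in] + [width] * depth + [d_out]
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))

def primal_width(base_width=500, depth=3, widths=range(10, 501, 10)):
    # Primal: minimize (depth / n) / m with m = P_b / P_e(n) (fixed budget).
    p_base = fc_params(base_width, depth)
    return min(widths, key=lambda n: (depth / n) * fc_params(n, depth) / p_base)

def dual_width(base_width=500, depth=3, widths=range(10, 501, 10)):
    # Dual: minimize total params m * P_e(n) with m = base_width / n,
    # chosen so the ensemble matches the baseline variance proxy depth/base_width.
    return min(widths, key=lambda n: (base_width / n) * fc_params(n, depth))
```

For this toy family, both searches return the same optimal width; only the implied multiplicity differs between the two formulations.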
Note that the efficient ensembles framework outlined in this section can readily be applied with different efficiency metrics. For instance, instead of using the parameter efficiency, one could consider the FLOPs efficiency (see Appendix E).
In the following section we conduct experiments to both validate our assumptions and evaluate our framework for efficient ensemble search. Starting with a toy example, we evaluate the effect of m and n on test performance using fully connected models trained on the MNIST dataset. For the subsequent experiments, we move to larger-scale models trained on CIFAR-10/100 and the downsampled ImageNet datasets.
4 Experiments

4.1 Ablation Study – MNIST
An effective way to improve the performance of a neural network is to increase its size. Recent slim architectures, such as ResNeXt, demonstrate that it is possible to maintain accuracy while significantly reducing the parameter count. In Fig. 4 we provide further empirical evidence that the capacity of a network by itself is not a good predictor of performance, when decoupled from other factors.
Specifically, we demonstrate a strong correlation between the empirical test error and the variance Var(Θ_E) while the parameter count is kept constant (primal). On the other hand, increasing the parameter count while keeping Var(Θ_E) constant (dual) does not improve performance. For both experiments we use as a baseline a fully connected model of fixed depth and constant layer width. The width of each layer in each of the m ensemble members is chosen to satisfy the corresponding primal or dual constraint. Each ensemble was trained on MNIST for 70 epochs with the Adam optimizer, and the accuracy was averaged over 100 repetitions.
4.2 Aggregated Residual Transformations
ResNeXt [resnext] introduces the concept of aggregated residual transformations utilizing group convolutions, which achieves a better parameter/capacity trade-off than its ResNet counterpart. In this paper, we view ResNeXt as a special case of CE, where the ensembling happens at the block level. We hypothesize that the optimal blocks identified with CE will lead to an overall better model when stacked, and by doing so we get the benefit of factorizing the design space of a network to modular levels. See Algorithm 2 for the detailed recipe.
For these experiments, we use both the CIFAR-10 and CIFAR-100 datasets, following the implementation details described in [resnext]. We also report results for downsampled ImageNet, a dataset introduced in [small-imagenet] that preserves the number of samples and classes of the original ImageNet-1K [imagenet], but downsamples the image resolution. In addition, we report results on regular ImageNet in Appendix A and on downsampled ImageNet [small-imagenet] in Appendix B.
Fitting α to a ResNet block. The first step in the optimization required for both the primal and dual objectives is to approximate the parameter α in Eq. 3. For convolutional layers, the coefficient multiplying each width term is scaled by the fan-in of the corresponding layer. Following Algorithm 1, we approximate the α corresponding to a ResNet block parametrized by its width, as depicted in Fig. 5. For simplicity, we fit the second moment normalized by the squared first moment, given by E[Θ²]/E[Θ]², which can easily be fitted with a first-degree polynomial when considering its natural logarithm. We report the fitted α and show the fitted second moment in Appendix Fig. 8.
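The log-space fit described above can be sketched on synthetic measurements standing in for the Monte Carlo estimates (the true α, the width grid, and the noise level are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 3.0
widths = np.array([16, 32, 64, 128, 256, 512])

# Hypothetical Monte Carlo estimates of E[K^2]/E[K]^2 - 1 ~ alpha / n,
# with multiplicative noise standing in for sampling error.
ratio_minus_one = (alpha_true / widths) * np.exp(0.05 * rng.standard_normal(widths.size))

# First-degree polynomial in log space: log(ratio - 1) = log(alpha) - log(n).
slope, intercept = np.polyfit(np.log(widths), np.log(ratio_minus_one), 1)
alpha_fit = float(np.exp(intercept))
```

The fitted slope should be close to −1 (the 1/n scaling), and the exponentiated intercept recovers α up to the injected noise.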
4.2.1 Experiments on CIFAR-10/100
Primal formulation. As a baseline architecture, we use a ResNet, following the notation of [resnext], Section 5.1. Following Algorithm 2, we compute m(n) and find the optimal multiplicity and width, after adjusting m to match the number of parameters of the baseline and to account for rounding errors and different block topology approximations. As can be seen in Table 0(a), the model achieving the primal optimum attains a better test error on CIFAR-10/100 than the ResNeXt baseline at a similar parameter budget. We also report results for a wider baseline from [resnext] and show similar trends. The test error for multiple models sitting on the primal curve is depicted in Fig. 5(a) for CIFAR-100 and Appendix Fig. 6(a) for CIFAR-10. Test errors are averaged over the last 10 epochs over 10 runs.
(Table caption) Optimally compact models achieve comparable accuracy with significantly fewer parameters. Results are reported over 10 runs and shown with standard error bars.
Dual formulation. Using the same ResNet base block as for the primal experiment, and thus the same fitted α, we compute the optimal m and n maximizing the parameter efficiency curve, and find the same optimal width as in the primal formulation. The resulting ResNeXt network has significantly fewer parameters than the baseline and achieves similar or slightly degraded performance on CIFAR-10/100, as shown in Table LABEL:table:dual_resnext. The efficiency curve depicted in red in Fig. 5(b) is constructed using a single ResNet block topology and with non-integer values of m, as described above. It therefore only approximates the real parameter efficiency, explaining why some models in the close vicinity of the optimum have a slightly higher real efficiency, as can be seen in Table LABEL:table:dual_resnext. The test error for multiple models sitting on the dual curve is depicted in Fig. 5(b) for CIFAR-100 and Appendix Fig. 6(b) for CIFAR-10.
4.2.2 Experiments on Downsampled ImageNet
Primal formulation. Using the same baseline architecture as for the CIFAR experiments, we train the model achieving the primal optimum and report results in Table 3. Our optimal model achieves lower top-1 and top-5 errors than the baseline ResNeXt architecture derived in [resnext] at a similar parameter budget. We use the same augmentations and learning rate schedule as [small-imagenet].
Dual formulation. Using the same baseline and the optimally compact architecture derived in Section 4.2.1, we observe a similar trend: our optimal model suffers a lighter top-1 and top-5 degradation than the Wide ResNet variant with a reduced parameter budget, while using significantly fewer parameters than the baseline. Sampling models on the dual curve with lower multiplicity, we find models that suffer less than a one-percent drop in top-1 and top-5 error with a significantly lower parameter count.
| Model | Efficiency | Params | Top-1 error | Top-5 error |
|---|---|---|---|---|
| Wide ResNet 28-10 [small-imagenet] | - | 37.1M | 40.96 | 18.87 |
| Wide ResNet 28-5 [small-imagenet] | 1.6 | 9.5M | 45.36 | 21.36 |
5 Related Work
Various forms of multi-pathway neural architectures have surfaced over the years. In the seminal AlexNet architecture [alexnet], group convolutions were used as a method to distribute training over multiple GPUs. More recently, group convolutions were popularized by ResNeXt [resnext], empirically demonstrating the benefit of aggregating multiple residual branches. In [shufflenet], a channel shuffle unit was introduced in order to promote information transfer between different groups. In [condensenet] and [other], the connections between pre-defined sets of groups are learned in a differentiable manner, and in [prune], grouping is achieved through pruning of full convolutional blocks. On a seemingly unrelated front, the theoretical study of wide neural networks has seen considerable progress recently. A number of papers [wide, yang, exact, guy, enhanced] have followed in the footsteps of the original NTK paper [NTK]. In [wide], it is shown that wide models of any architecture evolve as linear models, and in [yang], a general framework for computing the NTK of a broad family of architectures is proposed. Finite-width corrections to the NTK are derived in [boris1, me, guy]. In this work, we extend the "wide" regime to the multiplicity dimension, and show two distinct regimes where different kernels reign. We then use finite-width corrections of the NTK to formulate two optimality criteria, and demonstrate their usefulness in predicting efficient and performant ensembles.
6 Conclusion

Understanding the effects of model architecture on training and test performance is a longstanding goal in the deep learning community. In this work we analyzed collegial ensembling, a general technique used in practice in which multiple functionally identical pathways are aggregated. We showed that collegial ensembles exhibit two distinct regimes of over-parameterization, defined by large width and large multiplicity, with two distinct kernels governing the dynamics of each. In between these two regimes, we introduced a framework for deriving optimal ensembles in the sense of minimum capacity or maximum trainability. Empirical results on practical models demonstrate the predictive power of our framework, paving the way for more principled architecture search algorithms.
Appendix A Results and Implementation Details: ImageNet
Primal formulation. Following [resnext], we use ResNet-50 and ResNet-101 as baselines and report results in Table 2(a). Our ResNet-50-based optimal model obtains slightly better top-1 and top-5 errors than the baseline reported in [resnext]. This is quite remarkable given that [resnext] converged to this architecture via an expensive grid search. Our ResNet-101-based optimal model achieves significantly better top-1 and top-5 errors than the ResNet-101 baseline, and slightly higher top-1 and top-5 errors than the ResNeXt baseline.
Dual formulation. Using ResNet-50 and ResNet-101 as baselines, we find models that achieve similar top-1 and top-5 errors with significantly fewer parameters. Detailed results can be found in Table 2(b).
Implementation details. We follow [resnext] for the implementation details of ResNet-50, ResNet-101 and their ResNeXt counterparts. We use SGD with momentum and a total batch size of 256 split across GPUs. The weight decay and initial learning rate are set as in [resnext], and the learning rate is divided by a constant factor at three scheduled epochs. We use the same data normalization and augmentations as in [resnext], except for lighting, which we do not use.
Appendix B Results and Implementation Details: Downsampled ImageNet
Implementation details. In order to adapt the ResNeXt-29 architectures used for CIFAR-10/100 to the resolution of downsampled ImageNet, we add an additional stack of three residual blocks, following [small-imagenet]. Following the general parametrization of ResNeXt [resnext], we multiply the width of this additional stack of blocks by a fixed factor and downsample the spatial maps by the same factor, using a strided convolution in the first residual block. We use SGD with momentum and a fixed batch size split across 8 GPUs. The learning rate is divided by a constant factor at three scheduled epochs. We use the same data normalization and augmentations as [small-imagenet].
| Model | Efficiency | Params | Top-1 error | Top-5 error |
|---|---|---|---|---|
| Wide ResNet 36-5 [small-imagenet] | - | 37.6M | 32.34 | 12.64 |
Appendix C Implementation Details: Downsampled ImageNet
We use the same ResNeXt-29 architectures from the CIFAR experiments. We use SGD with momentum and a fixed batch size split across GPUs. The learning rate is divided by a constant factor at three scheduled epochs. We use the same data normalization and augmentations as in [small-imagenet].
Appendix D Results on CIFAR-10
Appendix E FLOPs efficiency
In Sec. 3 and the rest of the paper, we considered the parameter efficiency, defined as the ratio between the number of parameters of the baseline model and of the ensemble (see Eq. 3). Using this definition of efficiency, models satisfying the primal objective were models with a similar number of parameters. Instead of using the parameter efficiency, we can consider the FLOPs efficiency in the same way:
ρ_FLOPs = F_b / (m · F_e),

where F_b and F_e are the number of FLOPs of the baseline model and of one model in the ensemble, respectively. We report results for the primal formulation on CIFAR-10/100 in Table 5. We see that the model achieving the primal optimum attains a better test error on CIFAR-10 and CIFAR-100 with a similar number of FLOPs.
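The FLOPs-based efficiency can be computed in the same way as the parameter efficiency. The sketch below uses a hypothetical single-convolution "block" to keep the counting transparent (the block shapes and the multiply-accumulate convention are assumptions):

```python
def conv_flops(c_in, c_out, k, h, w, groups=1):
    """Multiply-accumulate count of a k x k 2D convolution on an h x w map."""
    return (c_in // groups) * c_out * k * k * h * w

def flops_efficiency(base_blocks, ensemble_blocks, m):
    """Ratio between baseline FLOPs and total ensemble FLOPs,
    used in place of the parameter efficiency."""
    f_base = sum(conv_flops(*b) for b in base_blocks)
    f_ens = m * sum(conv_flops(*b) for b in ensemble_blocks)
    return f_base / f_ens

# Example: a 64-channel 3x3 conv vs. an ensemble of four 16-channel members.
rho = flops_efficiency([(64, 64, 3, 32, 32)], [(16, 16, 3, 32, 32)], m=4)
```

Because convolution cost is quadratic in the channel count while the ensemble cost is linear in m, splitting one wide block into several narrow members reduces total FLOPs at fixed total width.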
Appendix F Fitting α to a ResNet Block
Appendix G Figure Illustrating Theorem 1
Appendix H Proofs of Lemma 1 and Theorem 1
Recall that the NTK of the ensemble is given by the mean:

Θ_E = (1/m) · Σ_{i=1}^m Θ_i,

where Θ_i denotes the NTK of the i-th member. Note that the expectation of each member in the average is identical and finite under Lebesgue integration: E[Θ_i] = K for all i. Since each member of the ensemble is sampled independently, we have from the strong law of large numbers (LLN) that (1/m) Σ_{i=1}^m Θ_i → K almost surely as m → ∞, proving the claim. ∎
In the following proof, we assume for the sake of clarity that the training data contains a single example, so that the network output f_t and the label y are scalars. The results, however, hold in the general case. Throughout the proof, we use θ_t to denote the weights at time t, while θ_0 denotes the weights at initialization.
For analytic activation functions, the time-evolved kernel is analytic with respect to t. Therefore, at any time t we may approximate the kernel using a Taylor expansion evaluated at t = 0:
Similarly to the technique used in [wide], we assume we may exchange the Taylor expansion with the large width and multiplicity limits. We now analyze each term in the Taylor expansion separately. Using Eq. 2 in the main text, the k-th order term of the ensemble NTK is given by:
Denote the residual r_t = f_t − y. Under gradient flow and the cost given by C_t = ½ r_t², the parameters evolve according to:

θ̇_t = −η · r_t · ∇_θ f_t.

The parameters of each model in the ensemble evolve according to:

θ̇^i_t = −(η r_t / √m) · ∇_{θ^i} f^i_t.
The time-derivative operator at t = 0 can be expanded as follows:
where we have introduced the corresponding differential operator. For each model in the ensemble, the k-th time derivative of its corresponding NTK at t = 0 is therefore given by:
where the individual terms are evaluated at initialization.
Using this notation, and noticing that these quantities depend only on the weights at initialization, the following identities hold:
Using the above equalities, the terms can be expressed as a sum over a finite set as follows:
For higher orders k, the term is expanded as follows:
Expanding the multiplication and using Eq. 3:
Using the chain rule in Eq. 4, and eliminating elements using Eq. 2:
We can now express the result in the formulation of Eq. H:
The k-th time derivative of the full ensemble is given by:
Note that for k = 1, the term represents the first time derivative of the single-model NTK under gradient flow with the L2 loss. Using results from [guy] on wide single fully connected models (in [guy], the result was obtained using a conjecture and demonstrated empirically to be tight; a result of the same quantity is obtained rigorously in [NTK], which yields the stated asymptotic behaviour for the ensemble), we have the required bounds on these derivatives. Moreover, the weights of different members are independent. Therefore, for any fixed k the set
is finite, and so we may apply the central limit theorem for large m on the terms in Eq. H individually:
Plugging back into Eq. H yields the desired result. ∎