Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks

11/19/2015 ∙ by Stefan Lee, et al. ∙ Indiana University Bloomington ∙ Carnegie Mellon University ∙ Virginia Polytechnic Institute and State University

Convolutional Neural Networks have achieved state-of-the-art performance on a wide range of tasks. Most benchmarks are led by ensembles of these powerful learners, but ensembling is typically treated as a post-hoc procedure implemented by averaging independently trained models with model variation induced by bagging or random initialization. In this paper, we rigorously treat ensembling as a first-class problem to explicitly address the question: what are the best strategies to create an ensemble? We first compare a large number of ensembling strategies, and then propose and evaluate novel strategies, such as parameter sharing (through a new family of models we call TreeNets) as well as training under ensemble-aware and diversity-encouraging losses. We demonstrate that TreeNets can improve ensemble performance and that diverse ensembles can be trained end-to-end under a unified loss, achieving significantly higher "oracle" accuracies than classical ensembles.

1 Introduction

Convolutional Neural Networks (CNNs) have shown impressive performance on a wide range of computer vision tasks. An important (and perhaps under-acknowledged) fact is that the state-of-the-art models are generally ensembles of CNNs, including nearly all of the top performers on the ImageNet Large Scale Visual Recognition Challenge [31]. For example, GoogLeNet [36], one of the best-performing models submitted to the ILSVRC challenge, is an ensemble achieving a five percentage point increase in accuracy over a single base model of the same architecture.

In these ensembles, multiple classifiers are trained to perform the same task and their predictions are averaged to generate a new, typically more accurate, prediction. A number of related justifications have been given for the success of ensembles, including:

  1. Bayesian Model Averaging, that ensembles are a finite sample approximation to integration over the model class [9, 26, 27];

  2. Model Combination, that ensembles enrich the space of hypotheses considered by the base model class and are representationally richer [8]; and

  3. Reducing Estimation and Optimization Errors, that ensemble averaging reduces the variance of base models, averaging out variations due to objective function non-convexity, initialization, and stochastic learning [7, 28].

At the heart of these arguments is the idea of diversity: if we train multiple learners with decorrelated errors, their predictions can be averaged to improve performance [5]. In this work, we rigorously treat ensembling as a problem in its own right, examining multiple ensembling strategies ranging from standard bagging to parameter sharing and ensemble-aware losses. We compare these methods across multiple datasets and architectures, demonstrating that some standard techniques may not be suitable for deep ensembles and novel approaches improve performance.

Ensemble-Aware Losses. Typically, ensemble members are trained independently with no unifying loss, despite the fact that outputs are combined at test time. It is common in classical literature to view ensemble members as “experts” [18] or “specialists” [17], but in typical practice no effort is made to encourage diversity or specialization. It seems natural then to question whether an ensemble-aware loss might result in better performance. Here we study two ensemble-aware losses: (1) directly training an ensemble to minimize the loss of the ensemble mean, and (2) a generalization of Multiple Choice Learning [13] to explicitly encourage diversity.

Parameter Sharing. As a number of papers have demonstrated, initial layers of CNNs tend to learn simple, generic features which vary little between models, while deeper layers learn features specific to a particular task and input distribution [37, 12, 24]. We propose a family of tree-structured deep networks (which we call TreeNets) that exploit the generality in lower layers by sharing them across ensemble members to reduce parameters. We investigate the depth at which sharing should happen, along a spectrum from single models (full sharing) to independent ensembles (no sharing). This coupling of lower layers naturally forces any diversity between ensemble members to be concentrated in the deeper, unshared layers. Perhaps somewhat surprisingly, we find that the optimal setting is not a classical ensemble, but instead a TreeNet that shares a few (typically 1-2) initial layers. Thus tree-structured networks are a simple way to improve performance while reducing parameters.

Model-Distributed Training of Coupled Ensembles. Unfortunately, both of the above approaches to coupling ensemble members, either at the “top” of the architecture with ensemble-aware losses that operate on outputs from all ensemble members, or at the “bottom” with parameter sharing in TreeNets, create significant computational difficulties. Since networks are not independent, it is no longer possible to train them separately in parallel, and sequential training may require months of GPU time even for relatively small ensembles. Moreover, even if training time is not a concern, larger models often take up most of the available RAM on a GPU, so it is not possible to fit an ensemble on one GPU. To overcome these hurdles, we present and will release a novel MPI-based model-parallel distributed modification to the popular Caffe deep learning framework [19] that implements cross-process communication as layers of the CNN.

We thoroughly evaluate each methodology across multiple datasets and network architectures. These experiments cast new light on ensembles in deep networks, demonstrating the effects of randomization in parameter and data space, parameter sharing, and unified losses on modern scale vision problems. More concretely, we:

  1. rigorously treat CNN ensembling as its own problem,

  2. introduce a family of models called TreeNets that permit a spectrum of degrees of layer-sharing,

  3. present ensemble-aware and diversity-encouraging losses, and

  4. present a distributed model-parallel framework to train deep ensembles.

2 Related Work

Neural networks, ensembles, and techniques to improve the robustness and diversity of grouped learners have been studied for decades in machine learning research, but only recently have ensembles of CNNs been studied. Related work can be broadly divided into two categories: ensemble learning for general networks, and its more recent application to CNNs.

Ensemble Learning Theory. Neural networks have been applied in a wide variety of settings with many diverse modifications. Much of the theoretical foundation for ensemble learning with neural networks was laid in the 1990s. Krogh et al. [22] and Hansen and Salamon [16] provided theoretical and empirical evidence that diversity in error distributions across member models can boost ensemble performance. This led to ensemble methods that averaged predictions from models trained with different initializations [16] and from models trained on different bootstrapped training sets [38, 22]. These methods take an indirect approach to introducing diversity in ensembles. Other work has explicitly trained decorrelated ensembles of neural networks by penalizing positive correlation between error distributions [29, 23, 2]. While effective on shallow networks, these methods have not been applied to deeper architectures.

Although initially proposed for Structured SVMs, the work of Guzman-Rivera et al. [13, 14, 15] on Multiple Choice Learning (MCL) provides an attractive alternative that does not require computing correlation between errors. Related ideas were studied by Dey et al. [6] in the context of submodular list prediction. We generalize MCL and apply it to CNNs, incorporating it into stochastic gradient descent-based training.

CNN Ensembles. While ensembles of CNNs have been used extensively, little work has focused on improving the ensembling process. Most CNN ensembles use multiple random initializations or training data subsets to inject diversity. For example, popular ensembles of VGG [33] and AlexNet [21] simply retrain with different initializations and average the predictions. GoogLeNet [36] induces diversity with straightforward bagging, training each model with a sampled dataset. Other networks, like Sequence to Sequence RNNs [35], use both approaches simultaneously.

Parameter sharing is not a novel development in CNNs, but its effect on ensembles has not been studied. Recent related work by Bachman et al. [3] proposed a general framework called pseudo-ensembles for training robust models. They define a pseudo-ensemble as a group of child models which are instances of a parent model perturbed by some noise process. They explicitly encourage correlation in model parameters through the parent by a Pseudo-Ensemble Agreement (PEA) regularizer. Although outwardly related to parameter sharing, pseudo-ensembles are fundamentally different from the techniques presented here, as they use parameter sharing to train a single robust CNN model rather than to produce an ensemble with fewer parameters. Other recent work by Sercu et al. [32] uses parameter sharing in the context of multi-task learning to build a common representation for multilingual translation. Finally, Dropout [17] can be interpreted as a procedure that trains an exponential number of highly related networks and cheaply combines them into one network, similar to PEA.

One relevant recent work is [17], which briefly focuses on ensembles. Members of this type of ensemble are specialists which are trained on subsets of all possible labels, with each subset manually designed to include easily confused labels. These models are fine-tuned from one shared generalist and then combined to make a final prediction. In contrast, our diversity-encouraging loss requires no human hand-designing of class specialization: our loss naturally allows members to specialize according to subsets of classes or pockets of feature space, providing an end-to-end way of learning diverse ensembles.

3 Experimental Design

We first describe the datasets, architectures, and evaluation metrics that we use in our experiments to better understand ensembling in deep networks.

3.1 Datasets and Architectures

We evaluate on three popular image classification benchmarks: CIFAR10 [20], CIFAR100 [20], and the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [30]. Since our goal is not to present new designs and architectures for these tasks but rather to study the effect of different ensembling techniques, unless otherwise noted we use standard models and training routines. All models are trained via stochastic gradient descent with momentum and weight decay.

CIFAR10. For this dataset, we use the Caffe “CIFAR10 Quick” [19] network as our base model. The reference model is trained using a batch size of 350 for 5,000 iterations with a momentum of 0.9, weight decay of 0.004, and an initial learning rate of 0.001 which drops to 0.0001 after 4,000 iterations. We refer to this network and training procedure as CIFAR10-Quick.

CIFAR100. We use the Network in Network model by Lin et al. [25] as well as their reference training procedure, which runs for 300,000 iterations with a batch size of 128, momentum of 0.9, weight decay of 0.0001, and an initial learning rate of 0.1. The learning rate decays by a factor of 10 whenever the training loss fails to drop by at least 1% over 20,000 iterations; this occurs twice over the course of a typical training run. Our reference model’s accuracy is about 4% lower than reported in [25], because we do not perform their dataset normalization procedure. We refer to this network and training procedure as CIFAR100-NiN.

ILSVRC2012. For this dataset we use both the Network in Network model [25] and CaffeNet (similar to AlexNet [21]). Both networks are trained for 450,000 iterations with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 0.0005. For NiN the batch size is 128 and the learning rate is reduced by a factor of 10 every 200,000 iterations. For CaffeNet the batch size is 256 and the learning rate schedule is accelerated, reducing every 100,000 iterations. We refer to these models as ILSVRC-NiN and ILSVRC-Alex, respectively.

3.2 Evaluation Metrics

We evaluate our ensemble performance with respect to two different metrics. Ensemble-Mean Accuracy is the accuracy of the “standard” test-time procedure for ensembles – averaging the beliefs of all members and predicting the most confident class. Strong performance on this metric indicates that the ensemble members generally agree on the correct response, with errors reduced by smoothing across members. In contrast, Oracle Accuracy is the accuracy of the ensemble if an “oracle” selects the prediction of the most accurate ensemble member for each example. Oracle Accuracy demonstrates what the ensemble knows as a collection of specialists, and has been used in prior work to measure ensemble performance [13, 14, 15, 6, 4].
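The two metrics can be made concrete with a minimal NumPy sketch. The function names and the (members, examples, classes) array layout are illustrative assumptions and do not come from the released code.

```python
import numpy as np

def ensemble_mean_accuracy(probs, labels):
    """probs: (M, N, C) class probabilities from M members; labels: (N,) integer classes."""
    mean_probs = probs.mean(axis=0)            # average the beliefs of all members
    preds = mean_probs.argmax(axis=1)          # predict the most confident class
    return float((preds == labels).mean())

def oracle_accuracy(probs, labels):
    """An example counts as correct if at least one member predicts its label."""
    member_preds = probs.argmax(axis=2)        # (M, N) per-member predictions
    any_correct = (member_preds == labels[None, :]).any(axis=0)
    return float(any_correct.mean())

# Toy case (made-up numbers): 2 members, 3 examples, 2 classes.
probs = np.array([
    [[0.9, 0.1], [0.9, 0.1], [0.2, 0.8]],      # member 0
    [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]],      # member 1
])
labels = np.array([0, 1, 0])
print(ensemble_mean_accuracy(probs, labels), oracle_accuracy(probs, labels))
```

In this toy case the second example is misclassified by the averaged beliefs even though member 1 gets it right, so Oracle Accuracy (2/3) exceeds Ensemble-Mean Accuracy (1/3).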

4 Random Initialization and Bagging

We now present our analysis of different approaches to training CNN ensembles. This section focuses on standard approaches, while Sections 5 and 6 present novel ideas on parameter sharing and ensemble-aware losses.

Randomly initializing network weights and randomly re-sampling dataset subsets (bagging) are perhaps the most commonly-used methods to create model variation in members of CNN ensembles. Table 1 presents results using three different ensembling techniques: (1) Random Initialization, in which all member models see the same training data but are initialized using different random seeds, (2) Bagging, in which each member uses the same initial weights but trains on a subset of data sampled (with replacement) from the original, and (3) Combined, which uses both techniques. Numbers in the table are accuracies and standard deviations across three trials. The CIFAR ensembles were built with four members while the ILSVRC ensembles had five.

Ensemble-Mean Accuracy (the Single Model column reports plain accuracy)
Model           Members   Single Model    Random Init.    Bagging         Combined
CIFAR10-Quick   4         77.06 ± 0.27    80.72 ± 0.10    78.40 ± 0.28    78.95 ± 0.17
CIFAR100-NiN    4         60.19 ± 0.49    66.51 ± 0.27    62.11 ± 0.24    61.73 ± 0.16
ILSVRC-Alex     5         56.79 ± 0.04    59.94 ± 0.36    57.46 ± 0.12    57.39 ± 0.14
ILSVRC-NiN      5         58.90 ± 0.13    64.08 ± 0.11    55.02 ± 0.15    60.51 ± 0.12

Oracle Accuracy (the Single Model column reports plain accuracy)
Model           Members   Single Model    Random Init.    Bagging         Combined
CIFAR10-Quick   4         77.06 ± 0.27    89.89 ± 0.17    89.94 ± 0.27    89.28 ± 0.25
CIFAR100-NiN    4         60.19 ± 0.49    78.63 ± 0.31    75.47 ± 0.12    75.21 ± 0.28
ILSVRC-Alex     5         56.79 ± 0.04    70.45 ± 0.63    69.58 ± 0.17    69.61 ± 0.17
ILSVRC-NiN      5         58.90 ± 0.13    73.60 ± 0.07    67.79 ± 0.02    72.92 ± 0.00

Table 1: Comparison of standard ensembling techniques. All ensembles outperform their base models, but bagging shows smaller gains resulting from reduced training data.

As expected, all ensembles improve performance over their single base model. Somewhat surprisingly, we find that bagging reduces Ensemble-Mean Accuracy compared to random initialization alone, while Oracle Accuracy remains nearly constant. This result suggests that the bagged networks are poorly calibrated, such that confident incorrect responses are negatively impacting results. The individual member networks (not shown in table) also perform worse than those trained on the original dataset. We attribute these results to the reduction in unique training exemplars that bagging introduces. Given an initial dataset of $N$ examples from which we draw $N$ points with replacement to make a bagged set $B$, the probability of an example $x_i$ being sampled at least once is $1 - (1 - 1/N)^N$. The expected fraction of examples drawn at least once is thus $1 - (1 - 1/N)^N$, which is approximately $1 - 1/e \approx 0.632$ for large $N$; i.e. bagging costs over a third of our unique data points! Not only are we losing 37% of our data, we are also introducing that many duplicated data points. To examine whether these duplicates affect performance, we reran the CIFAR10 experiments with a dataset of 31,500 unique examples (approximately 63% of the original dataset) and found similar reductions in accuracy, indicating that the loss of unique data is the primary negative effect of bagging.
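The unique-data fraction is easy to check numerically. The sketch below compares the closed form with one simulated bootstrap sample; the function names are illustrative, and the dataset size of 50,000 is an assumption matching the CIFAR10 training set.

```python
import numpy as np

def expected_unique_fraction(n):
    """Probability that a given example appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_unique_fraction(n, seed=0):
    """Draw one bagged set of n indices and measure the fraction of distinct examples."""
    rng = np.random.default_rng(seed)
    sample = rng.integers(0, n, size=n)        # n indices sampled with replacement
    return np.unique(sample).size / n

n = 50_000                                     # assumed training-set size
print(expected_unique_fraction(n))             # close to 1 - 1/e
print(simulated_unique_fraction(n))
```

Both values land near 0.632, which is where the 31,500-example control experiment above comes from.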

Note that for convex or shallow models, the loss of unique exemplars in bagging is typically acceptable as random parameter initialization is simply insufficient to produce diversity. To the best of our knowledge, this is the first finding to establish that random initialization may not only be sufficient but preferred over bagging for deep networks given their large parameter space and the necessity of large training data.

5 Parameter Sharing with TreeNets

Ensembles and single models can be seen as two endpoints on a spectrum of approaches: single models require a careful allocation of parameters to perform well, while ensembles extract as much performance as possible from multiple instances of a base model. Ensemble approaches likely introduce wasteful duplication of parameters in generic lower layers, increasing training time and model size. The hierarchical nature of CNNs makes them well-suited to alternative ensembling approaches where member models benefit from shared information at the lower layers while retaining the advantages of classical ensembling methods.

Motivated by this observation, in this section we present and evaluate a family of tree-structured CNN ensembles called TreeNets, as shown in Figure 1. A TreeNet is an ensemble consisting of zero or more shared initial layers, followed by a branching point and zero or more independent layers. During training, the shared layers above a branch receive gradient information from each child network, which is accumulated according to back-propagation. At test time, each path from root to leaf can be considered an independent network, except that redundant computations at the shared layers need not be performed.

Figure 1: TreeNets exist on a spectrum between single models and fully independent ensembles.
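To illustrate the structure, here is a minimal sketch of a TreeNet forward pass. It uses small fully-connected layers in NumPy as a stand-in for convolutional layers; the helper names, layer sizes, and initialization are all hypothetical, and this is not the Caffe implementation used in our experiments. For simplicity the sketch requires at least the final layer to be unshared.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layers(sizes, rng):
    """Return a list of (W, b) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def make_treenet(sizes, split, num_branches, seed=0):
    """The first `split` layers are shared; the remaining layers are copied per branch."""
    n_layers = len(sizes) - 1
    assert 0 <= split < n_layers, "sketch keeps at least the last layer unshared"
    rng = np.random.default_rng(seed)
    trunk = init_layers(sizes[:split + 1], rng)
    branches = [init_layers(sizes[split:], rng) for _ in range(num_branches)]
    return trunk, branches

def forward(trunk, branches, x):
    h = x
    for W, b in trunk:                         # shared layers are computed once
        h = relu(h @ W + b)
    outputs = []
    for branch in branches:                    # each branch continues independently
        z = h
        for j, (W, b) in enumerate(branch):
            z = z @ W + b
            if j < len(branch) - 1:            # no nonlinearity on the final scores
                z = relu(z)
        outputs.append(z)
    return np.stack(outputs)                   # (num_branches, batch, classes)

def num_params(trunk, branches):
    count = lambda layers: sum(W.size + b.size for W, b in layers)
    return count(trunk) + sum(count(br) for br in branches)

sizes = [8, 16, 16, 4]
for split in range(3):
    trunk, branches = make_treenet(sizes, split, num_branches=4)
    print(split, num_params(trunk, branches))
```

Setting `split=0` recovers a classical ensemble with no sharing, and larger values of `split` shrink the parameter count because the trunk is stored only once.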

We evaluated our novel TreeNet models on the two larger architectures trained on ImageNet, ILSVRC-Alex and ILSVRC-NiN, and Table 2 presents the results. The table shows the Ensemble-Mean Accuracy (again in terms of means and standard deviations across three trials) achieved by TreeNets with splits at different depths. For example, splitting at conv2 means that all layers up to and including conv2 are shared, and all branches are independent afterwards. Since layers that do not contain any parameters (e.g. pooling, nonlinearity) are unaffected by parameter sharing, we only show results for splitting on parameterized layers.

ILSVRC-Alex (5 members)
Split Point     Ensemble-Mean Accuracy
ensemble        59.47 ± 0.45
conv1           59.62 ± 0.09
conv2           59.32 ± 0.17
conv3           58.39 ± 0.10
conv4           57.73 ± 0.05
conv5           55.25 ± 0.03
single model    56.79 ± 0.04

ILSVRC-NiN (4 members)
Split Point     Ensemble-Mean Accuracy
ensemble        64.08 ± 0.00
conv1           65.50 ± 0.24
cccp1           65.69 ± 0.08
cccp2           65.64 ± 0.11
conv2           65.64 ± 0.07
cccp3           65.47 ± 0.07
cccp4           65.62 ± 0.01
single model    58.90 ± 0.13

Table 2: Results for TreeNet training at various depths of ILSVRC-Alex and ILSVRC-NiN. Ensemble performance is retained even with substantial parameter sharing.

We see that shared parameter networks not only retain the performance of full ensembles, but can outperform them. For our best ILSVRC-NiN TreeNet, we improve accuracy over standard ensembles while reducing the parameter count by 7%. It may be that lower layer representations, though simple and generic, still had room for improvement. By sharing low level weights, each weight is updated by multiple sources of supervision, one per branch. This indicates TreeNets could provide regularization which favors slightly better low level representations.

We find further evidence for this claim by looking at individual branches of the TreeNet compared to the independently trained networks of the ensemble. Regardless of split point, each TreeNet branch in our shared ensemble achieved around 2 to 3 percentage points higher accuracy than independent ensemble members. Unlike in classical ensembles where each member model performs about as well as the base architecture, TreeNets seem to boost performance of not only the ensemble but the individual networks as well. We also experimented with multiple splits leading to more complicated “balanced binary” tree structures on ILSVRC-NiN and found similar improvements.

We also tested ILSVRC-Alex TreeNet models trained for object detection on the PASCAL VOC 2007 [10] dataset. We used the Fast R-CNN [12] architecture fine-tuned from our TreeNet models. For the test-time bounding-box regression, we average the results from each member model for an ensemble. We found a statistically significant increase in mean average precision of about 0.7% across multiple runs compared to starting from a standard ensemble. We take these initial experiments to imply TreeNet models are at least as generalizable to other tasks as standard ensembles. More details are provided in the supplementary materials.

To summarize the key results in this section, we found that TreeNets that share a few (typically 1-2) initial layers outperform classical ensembles, while also having fewer parameters, which may reduce test-time computation and memory requirements.

6 Training Under Ensemble-Aware Losses

In the two previous sections, each ensemble member was trained with the same objective: independent cross-entropy of each ensemble member. What happens if the objective is aware of the ensemble? We begin by showing a surprising result: the first “natural” idea of simply optimizing the performance of the average-beliefs of the ensemble does not work, and we provide intuitions why this is the case (lack of diversity). This negative result shows that a more careful design for ensemble-aware loss functions is crucial. We then propose a diversity-encouraging loss function that shows significantly improved oracle performance.

6.1 Directly Optimizing for Model Averaging

For a standard ensemble, test-time classification is typically performed by averaging the output of the member networks, so it is natural to explicitly optimize the performance of the corresponding Ensemble-Mean loss during training. We ran all four ensemble architectures under two settings: (1) Score-Averaged, in which we average the last layer outputs (i.e. the scores that are inputs to the softmax function), and (2) Probability-Averaged, in which we average the softmax probabilities of ensemble members. Intuitively, the difference between the two settings is that the former assumes the ensemble members are “calibrated” to produce scores of similar relative magnitudes while the latter does not.
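The difference between the two settings comes down to where the average is taken relative to the softmax. A minimal NumPy sketch, with illustrative function names and the assumption that member outputs are stacked in a (members, examples, classes) array:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def score_averaged_loss(scores, labels):
    """Average pre-softmax scores over members, then apply softmax and cross-entropy."""
    p = softmax(scores.mean(axis=0))
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def prob_averaged_loss(scores, labels):
    """Apply softmax per member, average the probabilities, then apply cross-entropy."""
    p = softmax(scores).mean(axis=0)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 8, 10))           # 4 members, 8 examples, 10 classes
labels = rng.integers(0, 10, size=8)
print(score_averaged_loss(scores, labels), prob_averaged_loss(scores, labels))
```

When all members produce identical scores the two losses coincide; they differ only once members disagree.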

Ensemble-Mean Accuracy
Model           Members   Independent Losses   Score-Averaged   Prob-Averaged
CIFAR10-Quick   4         80.72 ± 0.10         79.32 ± 0.02     77.10 ± 0.16
CIFAR100-NiN    4         66.51 ± 0.27         65.77 ± 0.21     62.77 ± 0.28
ILSVRC-Alex     5         59.94 ± 0.13         56.56 ± 0.10     49.81 ± 0.18
ILSVRC-NiN      5         83.43 ± 0.10         79.24 ± 0.36     42.39 ± 0.24

Table 3: Results of training ensembles to reduce loss over member predictions averaged either over scores or probabilities.

Table 3 shows the results of these experiments, again averaged over three trials. In all cases, network averaging reduced performance, with Probability-Averaged causing greater degradation. This is counter-intuitive: explicitly optimizing for the performance of Ensemble-Mean does worse than averaging independently trained models. We attribute this to two problems, which we now discuss: lack of diversity and numerical instability.

Averaging Outputs Reduces Diversity. Unfortunately, averaging scores or probabilities during training has the unintended consequence of eliminating diversity in gradients back-propagated through the ensemble. Consider a generic averaging layer,

$$\mu(x_1, \dots, x_N) = \frac{1}{N} \sum_{i=1}^{N} x_i$$

that ultimately contributes to some loss $\ell$, and consider the derivative of $\ell$ with respect to some $x_i$,

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \mu} \frac{\partial \mu}{\partial x_i} = \frac{1}{N} \frac{\partial \ell}{\partial \mu}$$

This expression does not depend on $i$: gradients back-propagated into all ensemble members are identical! Due to the averaging layer, responsibility for mistakes is shared, which eliminates gradient diversity. This is different from the behavior of an ensemble of independently trained networks, where each member receives a different gradient depending on individual performance. (The averaging also scales the gradients, so in our experiments we compensate by increasing the learning rate by a factor of $N$; otherwise, we found learning tended to arrive at even worse solutions.)
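This claim can be verified numerically. The sketch below, an illustration under assumed names and sizes rather than part of our training code, takes the cross-entropy of the softmax of member-averaged scores and estimates its gradient with respect to each member's scores by central finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(scores, label):
    """Cross-entropy of the softmax of the member-averaged scores.
    scores: (members, classes), one row of scores per ensemble member."""
    return -np.log(softmax(scores.mean(axis=0))[label])

def numerical_grad(scores, label, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. every member's scores."""
    grad = np.zeros_like(scores)
    for idx in np.ndindex(*scores.shape):
        up, down = scores.copy(), scores.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (loss(up, label) - loss(down, label)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))              # 4 members, 10 classes
grad = numerical_grad(scores, label=3)
print(np.abs(grad - grad[0]).max())            # spread between members' gradients
```

Every row of `grad` matches the first up to finite-difference error, and each equals the single-network gradient scaled by one over the number of members, which is why the learning rate is rescaled.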

Averaging Probabilities Is Unstable. We attribute the further loss of accuracy when averaging probabilities (versus scores) to increased numerical instability. The softmax function’s derivative with respect to its input is unstable for outputs near 0 or 1. However, when paired with a cross-entropy loss, the derivative of the loss with respect to softmax input reduces to a simple subtraction. Unfortunately, there is no similar simplification for cross-entropy over an average of softmax outputs (see supplemental materials for details). Optimization under these conditions is difficult, causing loss at convergence for Probability-Averaged networks to be nearly twice that of Score-Averaged networks, and about the same as a single network.
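The contrast can be written out with the chain rule. The following is a standard derivation in notation chosen here for illustration ($z$ for softmax inputs, $p$ for softmax outputs, $y$ for the true class, $N$ members indexed by $n$); it is not copied from the supplement.

```latex
% Single network: softmax followed by cross-entropy collapses to a subtraction.
\frac{\partial}{\partial z_j}\bigl(-\log p_y\bigr) = p_j - \mathbb{1}[j = y]

% Probability averaging: \bar{p}_y = \frac{1}{N}\sum_{n=1}^{N} p^{(n)}_y,
% using \partial p_y / \partial z_j = p_y\,(\mathbb{1}[j = y] - p_j).
\frac{\partial}{\partial z^{(n)}_j}\bigl(-\log \bar{p}_y\bigr)
  = -\frac{p^{(n)}_y}{N\,\bar{p}_y}\,\bigl(\mathbb{1}[j = y] - p^{(n)}_j\bigr)
```

In the averaged case each member's gradient carries the factor $p^{(n)}_y / (N \bar{p}_y)$, so a member that assigns near-zero probability to the true class receives almost no gradient, and the softmax derivative no longer cancels against the logarithm.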

Motivated by the finding that decreased diversity from optimizing Ensemble-Mean leads to reduced performance, we next present an explicit diversity-encouraging loss.

6.2 Adding Diversity via Multiple Choice Learning

We have so far discussed the role of ensemble diversity in the context of model averaging; however, in many settings, generating multiple plausible hypotheses may be preferred to producing a single answer. Ensembles fit naturally into this space as they produce multiple answers by design. However, independently trained models typically converge to similar solutions, prompting the need to optimize for diversity directly. In this section, we develop and experiment with diversity-encouraging losses and demonstrate their effectiveness at specializing ensembles.

We build on Multiple Choice Learning (MCL) [13], which we briefly recap here. Consider a set of predictors $\Theta = \{\theta_1, \dots, \theta_M\}$ such that $\theta_m(x) = P_m$, where $P_m$ is a probability distribution over some set of labels, and a dataset $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where each feature vector $x_i$ has a ground truth label $y_i$. From the point of view of an oracle that only listens to the most correct $\theta_m$, the loss for an example $(x_i, y_i)$ is

$$L_{set}(x_i, y_i) = \min_{m \in \{1, \dots, M\}} \ell_{\theta_m}(x_i, y_i)$$

which we will call the oracle set-loss. Intuitively, given that the oracle will select the most correct predictor, the loss on any example is the minimum loss over predictors. Alternatively, the oracle loss can be interpreted as allowing a system to guess $M$ times, scoring an example as correct if any guess is correct. Thus an ensemble of $M$ predictors is directly comparable to the commonly used top-$M$ metric used in many benchmarks (e.g. top-5 in ILSVRC [30]).

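We adapt this framework to the cross-entropy loss used for training deep classification networks. Given a single predictor $\theta$, the cross-entropy loss for example $(x_i, y_i)$ is

$$\ell_{\theta}(x_i, y_i) = -\log p^{\theta}_{y_i}$$

where $p^{\theta}_{y_i}$ is the predicted probability of class $y_i$. Let $\alpha_{im}$ be a binary variable indicating whether predictor $\theta_m$ has the lowest loss on example $(x_i, y_i)$. We can then define a cross-entropy oracle set-loss over a dataset $D$ of $N$ examples,

$$L_{set}(D) = \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} -\alpha_{im} \log p^{\theta_m}_{y_i}$$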

Notice that just like cross-entropy is an upper-bound on training error, this expression is an upper-bound on the oracle training error [13]. Guzman-Rivera et al. [13] presented a coordinate descent algorithm for optimizing such an objective. Their approach alternates between two stages: first, each data point is assigned to its most accurate predictors, and then models are trained until convergence using only the assigned examples.

Even if done in parallel, training multiple CNNs to convergence for each iteration is intractable. We thus interleave the assignment step with batch updates in stochastic gradient descent. For each batch, we pass the examples through the network, producing probability distributions over the label space from each ensemble member. During the backward pass, the gradient of the loss for each example is computed with respect only to the predictor with the lowest error on that example (with ties broken randomly). Pseudo-code is available in the supplement.

So far we have assumed that the oracle can select only one answer, i.e. $\sum_{m} \alpha_{im} = 1$; however, this can easily be generalized to select the $k$ predictors with lowest loss such that $\sum_{m} \alpha_{im} = k$. Varying $k$ from one to the number of predictors trades off between diversity and the number of training examples each predictor sees, which affects both generalization and convergence.
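One batch update of this interleaved scheme can be sketched in NumPy as follows. This is an illustration under assumptions, not the pseudo-code from the supplement: the function name and array layout are hypothetical, and ties are broken by member index (via a stable sort) rather than randomly, to keep the example deterministic.

```python
import numpy as np

def mcl_loss_and_grad(scores, labels, k=1):
    """scores: (M, N, C) pre-softmax outputs of M members on a batch of N examples.
    Returns the oracle set-loss, its gradient w.r.t. scores, and the assignments."""
    M, N, C = scores.shape
    z = scores - scores.max(axis=2, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=2, keepdims=True))   # log-softmax
    losses = -log_p[:, np.arange(N), labels]                   # (M, N) cross-entropies

    # Assignment step: alpha[m, i] = 1 for the k members with lowest loss on example i.
    order = np.argsort(losses, axis=0, kind="stable")          # best member first
    alpha = np.zeros((M, N))
    alpha[order[:k], np.arange(N)] = 1.0

    loss = (alpha * losses).sum() / N

    # Backward pass: the usual softmax minus one-hot, masked so that only the
    # assigned members receive gradient for each example.
    grad = np.exp(log_p)
    grad[:, np.arange(N), labels] -= 1.0
    grad *= alpha[:, :, None] / N
    return loss, grad, alpha

# Toy batch: member 0 is confident on example 0, member 1 on example 1.
scores = np.array([
    [[5.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 5.0]],
])
labels = np.array([0, 1])
loss, grad, alpha = mcl_loss_and_grad(scores, labels, k=1)
print(alpha)
```

With `k=1` each example updates only its best member, so `grad` is zero for the other member. With `k` equal to the number of members, every entry of `alpha` is one and the loss becomes the sum of the members' independent cross-entropies.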

[Figure 2 panels: (a) k=1, (b) k=2, (c) k=3, (d) k=4, (e) guided-backprop visualizations]

Figure 2: (a)-(d): Percentage of test examples of each class assigned to each ensemble member by the oracle (i.e. those with lowest loss). The degree of specialization is very sharp at $k=1$ and softens to almost uniform at $k=4$. (e): Guided-backprop images for standard and MCL trained ensemble members. Networks that have not specialized in a given class are agnostic to the image content.

Experimental results. We begin our experiments with MCL on the CIFAR10-Quick network. Table 4 shows the individual network accuracies and the oracle accuracy for MCL trained ensembles of different values of $k$. As $k$ is increased, each member network is exposed to more of the data and we see decreased oracle accuracy in exchange for increased individual member performance. At $k=4$, the oracle-set loss reduces to independent cross-entropy for each member, producing a standard ensemble. The $k=1$ case showcases the degree of model specialization. Each individual network performs very poorly (accuracy of 19-27%); however, taken as an ensemble the oracle accuracy is over 93%! This clearly shows that the networks have specialized and diversified with each taking responsibility for a subset of examples. To the best of our knowledge, this is the first work to demonstrate such behavior.

CIFAR10-Quick (4 members)
k    Member Network Accuracy          Ensemble-Mean Accuracy   Oracle Accuracy
1    24.35   27.18   27.15   19.36    28.38                    93.10
2    50.95   27.41   55.88   34.81    75.16                    92.78
3    65.46   40.71   64.79   70.37    79.76                    92.55
4    77.12   76.76   77.29   76.80    80.72                    89.78

Table 4: Increasing the number of predictors each data point is assigned to results in reduced oracle accuracy as the diversifying effect is reduced. Note that $k=4$ is a standard ensemble.

To further characterize what the MCL member networks are learning, we tracked which test examples are assigned to each ensemble member by the oracle accuracy metric (i.e. which ensemble member has the lowest error on each example). Figure 2(a)-(d) show the distribution of classes assigned to each ensemble member, and the results are striking: at $k=1$ we see almost complete division of the label space! As $k$ increases we see increased uniformity in these distributions. Note that these divisions emerge from the loss and are not hand-designed or pre-initialized in any way.

In Figure 2(e) we visualize how the ensemble members respond to input images using guided backprop [34], which is similar to the deconv visualizations of Zeiler and Fergus [37]. These images can be interpreted as the gradient of the indicated class score with respect to the input image. Features that are clear in these images have the largest influence on the network’s output for a given input image. Each row shows these visualizations for a single input image for a standard network and for members of an MCL ensemble. Networks that have not specialized in the given class are agnostic to the image content. See supplementary material for more examples.

MCL As Label-Space Clustering. We have shown that MCL trained ensembles tend to converge to a label-space clustering where each member focuses on a subset of the labels. The set of possible label-space clusterings is vast, so to put the MCL results into perspective we train hand-designed specialist ensembles with randomized label assignments. For CIFAR10 we randomly split the labels evenly to the four ensemble members and train each with respect to those labels. Over the course of 100 trials, we found oracle-accuracy ranged from 87.62 to 94.65 with a mean of 91.83. This shows that generally the MCL optimization selects high quality label space clusterings with respect to oracle accuracy.

An alternative strategy presented by [17] is to diversify members by dividing labels into clusters of hard-to-distinguish classes; very briefly described, assignments are generated by clustering the covariance matrix of label scores computed across an input set for a generalist CNN. We trained an ensemble using this clustering method and it led to significantly decreased oracle performance versus MCL on CIFAR10-Quick and ILSVRC-Alex. This is not surprising since that method does not optimize for oracle accuracy.

Overcoming Data Fragmentation. Despite not training member networks to convergence in each iteration of coordinate descent, our method results in improved oracle accuracy over standard ensembles. However, interleaving the assignment step with stochastic gradient descent results in data fragmentation, with each network seeing only a fraction of each batch (as illustrated by the class-specialization). We find this reduced effective batch size results in noisy gradients that inhibit learning, especially on larger networks.

Deep networks are especially sensitive to the effects of data fragmentation early in training when errors (and therefore gradients) are typically larger. In Guzman-Rivera et al. [13], initial assignments for the first iteration of training were decided by clustering the data. In contrast, assignments in our approach are based on network performance, which is initially the result of random initialization. To investigate the effect of this initial phase of learning, we applied our MCL loss to fine-tune a previously trained CIFAR10-Quick ensemble. As shown in Table 5, the benefits of pretraining are most pronounced for lower values of k, where data fragmentation is most severe.

CIFAR10-Quick4: Iterations of Cross-Entropy Pretraining
k      100      500      1000     2000     4000
1      94.21    94.21    94.65    95.75    96.00
2      92.79    93.06    93.16    93.07    93.00
3      92.25    92.93    91.77    90.94    90.94
Table 5: Increasing the amount of pretraining before fine-tuning with the MCL loss results in increased oracle accuracy.

Pretraining did stabilize learning, but data fragmentation on CIFAR10 is a relatively minor problem. In contrast, training with MCL from scratch on larger networks using standard batch sizes consistently failed to outperform standard ensembles. We attribute this to a combination of data fragmentation and the difficulty of initial learning. To test this hypothesis, we experimented with fine-tuning and gradient accumulation across batches on the ILSVRC-Alex architecture. We accumulated gradients from 5 batches before updating parameters and fine-tuned from a fully-trained ensemble. Table 6 shows the result of 3000 iterations of this fine-tuning experiment for different values of k. This setup overcame the data fragmentation problem and we see the same trends as in CIFAR10.
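Gradient accumulation can be illustrated on a toy least-squares problem. This is a sketch under our own assumptions: the helper names are hypothetical, and we average the accumulated gradients (the text does not say whether they are summed or averaged). With equal batch sizes, accumulating over 5 batches then gives the same update as one batch five times larger, which is how accumulation counteracts a small effective batch size.

```python
import numpy as np

def batch_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def accumulated_step(w, batches, lr):
    """Accumulate gradients over several batches, then apply one update."""
    grad = np.zeros_like(w)
    for X, y in batches:
        grad += batch_gradient(w, X, y)
    return w - lr * grad / len(batches)  # assumption: accumulated gradients are averaged

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

# Five batches of 10 examples, accumulated before a single parameter update.
batches = [(X[i:i + 10], y[i:i + 10]) for i in range(0, 50, 10)]
w_accum = accumulated_step(w0, batches, lr=0.1)

# Reference: one update computed from a single batch of all 50 examples.
w_large = w0 - 0.1 * batch_gradient(w0, X, y)
```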

These experiments demonstrate MCL’s ability to quickly diversify an ensemble. To push this further, we reran the fine-tuning experiment for k=1, this time initializing all ensemble members with the same network. Despite starting from an ensemble of identical networks with an oracle accuracy of 56.90%, the ensemble reached an oracle accuracy of 72.67% after only 3000 iterations!

ILSVRC-Alex5
                 Single Member Accuracy    Ensemble-Mean Accuracy    Oracle Accuracy
k=1              46.50                     55.22                     74.67
k=2              52.48                     59.21                     73.40
k=3              55.38                     59.73                     71.75
k=4              56.33                     60.09                     70.84
base ensemble    57.17                     60.31                     70.50
Table 6: Fine-tuning and gradient accumulation across batches allow larger networks to specialize under the MCL loss.

We have demonstrated that the MCL loss is effective at inducing diversity; however, the member networks specialize so much that Ensemble-Mean Accuracy suffers. We tried linearly combining the MCL loss with the standard cross-entropy to balance diversity with general performance. We find training under this loss improves CIFAR10 Ensemble-Mean accuracy by 1% over a standard ensemble.
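One way such a linear combination could look is sketched below. The convex weighting with a single `weight` parameter is our assumption, since the exact form of the combination is not given here; the MCL term is taken as the per-example minimum of the member cross-entropies (the k=1 case) and the standard term as their mean.

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def member_losses(scores, labels):
    """scores: (M, N, C) member scores. Returns (M, N) cross-entropy losses."""
    probs = softmax(scores)
    n = np.arange(scores.shape[1])
    return -np.log(probs[:, n, labels])

def combined_loss(scores, labels, weight):
    """weight * MCL loss + (1 - weight) * standard cross-entropy (hypothetical form)."""
    losses = member_losses(scores, labels)
    mcl = losses.min(axis=0).mean()   # only the best member counts for each example
    ce = losses.mean()                # every member counts for every example
    return weight * mcl + (1.0 - weight) * ce

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 5, 4))   # 3 members, 5 examples, 4 classes
labels = rng.integers(0, 4, size=5)
loss = combined_loss(scores, labels, weight=0.5)
```

Setting `weight` to 1 recovers the pure oracle-style objective and setting it to 0 recovers ordinary independent training, so intermediate values trade diversity against general performance.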

In this section we have developed a novel MCL framework and shown it produces ensembles with substantially improved oracle accuracies when training from scratch and even when fine-tuning from a single network.

7 Distributed Ensemble Training

Training an ensemble on a single GPU is prohibitively expensive, so standard practice for large ensembles is to train the multiple networks either sequentially or in parallel. However, any form of model coupling requires communication between learners. To enable our experiments at scale, we have developed and will release a modification to Caffe, which we call MPI-Caffe, that uses the Message Passing Interface (MPI) [1] standard to enable cross-GPU/machine communication. These communication operations are provided as Caffe model layers, allowing network designers to quickly experiment with distributed networks, where different parts of the model reside on different GPUs and machines. Figure 3 shows how an ensemble of CIFAR10-Quick networks with parameter sharing and model averaging is defined as a single specification and distributed across multiple processes. In MPI-Caffe, each process is assigned an identifier (called a rank); by setting the ranks each network layer belongs to, we can easily design distributed ensembles.

Figure 3: MPI-Caffe models can be defined by a single network specification and distributed across multiple GPUs. Dashed lines indicate cross-process communication and not input/output.

The MPIBroadcast and MPIGather layers provide the core communication functionality. MPIBroadcast forwards its input to the other processes during a forward pass and accumulates gradients from each during back-propagation. The forward pass for MPIGather collects all of its inputs from multiple processes and outputs them to a single network, and the backward pass simply routes the gradients back to the corresponding input.
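The forward and backward semantics of the two layers can be mimicked in a single process. The sketch below uses no MPI and is not MPI-Caffe code: the function names are ours, and plain lists stand in for the per-process copies of a blob.

```python
import numpy as np

def broadcast_forward(bottom, num_procs):
    """Forward: every process in the group receives a copy of the input."""
    return [bottom.copy() for _ in range(num_procs)]

def broadcast_backward(top_grads):
    """Backward: gradients from all copies are summed into the input gradient."""
    return np.sum(top_grads, axis=0)

def gather_forward(bottoms):
    """Forward: one blob per process is collected as separate outputs on the root."""
    return [b.copy() for b in bottoms]

def gather_backward(top_grads):
    """Backward: each output gradient is routed back to the process it came from."""
    return [g.copy() for g in top_grads]

shared = np.array([1.0, 2.0])                          # output of a shared layer
copies = broadcast_forward(shared, 3)
outputs = [(r + 1) * c for r, c in enumerate(copies)]  # stand-in for each member's own layers
gathered = gather_forward(outputs)
averaged = np.mean(gathered, axis=0)                   # model averaging on the root

# Backward pass for the scalar objective sum(averaged).
grad_gathered = [np.full_like(shared, 1.0 / 3) for _ in gathered]
grad_outputs = gather_backward(grad_gathered)
grad_copies = [(r + 1) * g for r, g in enumerate(grad_outputs)]
grad_shared = broadcast_backward(grad_copies)
```

The summation in `broadcast_backward` is what lets a shared layer receive training signal from every ensemble member that consumes its output.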

We tested our MPI-Caffe framework on a large-scale cluster with one Tesla K20 GPU per node and a maximum MPI node interconnect bandwidth of 5.8 GB/sec. To characterize the communication overhead for an ensemble, we measure the time spent sharing various layers of the ILSVRC-Alex5 architecture. The largest layer we shared was pool2, which amounts to broadcasting nearly 36 million floats per batch. Despite the layer’s size, we find only 0.49% of the forward-backward pass time is used by communication. More details are available in the supplement.

8 Discussion and Conclusion

There is a running theme behind all of the ideas presented in this paper: diversity. Our experiments on bagging demonstrate that the diversity induced in ensemble members by random parameter initializations is more useful than that introduced by bags with duplicated examples. Our experiments on explicitly training for Ensemble-Mean performance show that averaging beliefs of ensemble members before computing losses has the unintended effect of removing diversity in gradients. Our novel diversity-inducing MCL loss shows that encouraging diversity in ensemble members can significantly improve performance. Finally, our novel TreeNet architecture shows that diversity is important in high-level representations while low-level filters might be better off without it. Training these large-scale architectures is made practical by our MPI-Caffe framework.

In future work, we would like to adapt the MCL framework to structured prediction problems. In a structured context where the space of “good” solutions is quite large, we feel diverse models can have an even greater benefit.

References

Appendix A TreeNet Object Detection Results on PASCAL VOC 2007

As briefly described in Section 5 of the main paper, the ILSVRC-Alex TreeNet architecture was also evaluated for object detection using the PASCAL VOC 2007 dataset, which includes labeled ground-truth bounding-box annotations for 20 object classes. For this task, we used Fast R-CNNs [11]. During training, Fast R-CNNs fine-tune a model pretrained on ImageNet for classification under two losses, one over the predicted class of an object proposal and one for bounding-box regression. For our ensembles, we average both the class predictions and the bounding-box coordinates from the ensemble member models.
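The averaging step can be sketched as below. This is an illustration with assumed array shapes and a hypothetical function name: for simplicity each member outputs one box per proposal, rather than any per-class box layout a real detector may use.

```python
import numpy as np

def average_detections(class_probs, boxes):
    """class_probs: (M, R, C) class probabilities from M members for R proposals.
    boxes: (M, R, 4) regressed box coordinates from each member.
    Returns the ensemble class probabilities (R, C) and boxes (R, 4)."""
    return class_probs.mean(axis=0), boxes.mean(axis=0)

# Two members, one proposal, two classes (hypothetical values).
class_probs = np.array([[[0.8, 0.2]],
                        [[0.4, 0.6]]])
boxes = np.array([[[10.0, 10.0, 50.0, 50.0]],
                  [[14.0, 12.0, 54.0, 48.0]]])
ens_probs, ens_boxes = average_detections(class_probs, boxes)
```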

To evaluate TreeNets and standard ensembles, we fine-tune four different instances of each under the Fast R-CNN framework and compute the mean and standard deviation of the classwise average precisions (APs) as well as the mean APs over all classes. Table 7 presents these results for various models with the averaged bounding boxes: a standard ensemble, TreeNets split after conv1, conv2, and conv3, and a single model. We remind the reader that non-parameterized layers are irrelevant with respect to splits, so we do not report results for those layers. We also evaluate without the bounding-box regression, instead using the initial selective search proposals directly. Table 8 shows these results.

In both tasks we see that TreeNets outperform the standard ensembles and single models by significant margins. We note that we see similar gains in accuracy when using the regressed bounding boxes for both single models and ensembles, implying that the bounding box averaging procedure for ensembles is reasonable.


Model                 AP per class (20 classes, one column each), followed by the mean over classes
Ensemble (mean)       67.49 68.89 52.00 38.25 16.74 68.14 70.68 67.56 26.45 63.78 61.93 61.74 73.31 67.18 56.53 23.84 50.45 54.81 69.04 59.26 55.90
Ensemble (std)        1.42 0.57 1.25 2.38 0.90 1.90 0.37 2.65 0.16 1.63 1.59 1.69 0.68 0.56 0.30 0.58 0.82 1.45 1.45 0.88 0.21
conv1 (mean)          67.83 68.47 52.66 37.90 18.01 69.53 70.83 67.93 27.30 61.59 62.96 62.06 74.89 67.97 57.43 23.80 50.94 56.42 70.57 59.79 56.44
conv1 (std)           1.34 0.85 1.41 0.98 0.81 1.85 0.51 0.49 0.84 2.31 0.65 1.16 1.14 0.48 0.33 0.50 2.06 0.55 1.53 0.91 0.23
conv2 (mean)          67.30 69.29 52.62 37.58 17.90 68.81 71.04 68.54 27.13 63.66 62.37 62.20 74.60 68.45 57.52 24.20 52.53 55.00 71.31 60.66 56.64
conv2 (std)           1.05 0.79 1.41 1.41 0.82 0.73 0.28 1.50 1.05 1.06 1.83 0.62 0.56 0.23 0.46 1.24 0.59 0.73 1.64 0.19 0.23
conv3 (mean)          66.29 67.35 50.22 36.23 16.65 67.77 70.22 67.73 25.01 60.91 61.80 61.99 73.64 68.38 56.42 21.57 49.75 55.54 70.03 58.81 55.32
conv3 (std)           1.49 1.05 1.40 2.54 0.83 2.21 0.73 0.48 1.62 3.27 1.03 0.69 0.51 0.92 1.02 0.42 1.95 1.26 0.77 0.58 0.54
Single Model (mean)   62.53 65.25 41.30 32.36 11.98 62.56 66.89 61.76 20.72 56.16 56.91 55.14 69.53 64.48 51.39 21.06 45.20 47.91 65.70 54.35 50.66
Single Model (std)    0.24 0.49 2.12 2.47 1.11 1.27 0.80 2.14 0.58 3.32 1.58 1.56 1.71 1.27 1.10 0.36 0.77 1.44 0.89 1.94 0.30
Table 7: Average Precision for Object Detection using different TreeNet models with the Fast R-CNN framework, when the coordinates of bounding boxes from each member model are averaged.

Model                 AP per class (20 classes, one column each), followed by the mean over classes
Ensemble (mean)       64.48 63.99 45.42 34.17 15.60 63.90 67.81 62.55 24.20 58.36 55.91 56.81 62.61 67.76 50.35 22.25 45.79 49.47 65.20 57.76 51.72
Ensemble (std)        0.75 0.94 0.71 1.14 0.26 1.46 0.34 0.90 0.27 2.09 1.34 1.14 1.24 0.52 0.35 0.84 1.68 1.76 1.57 0.72 0.15
conv1 (mean)          64.32 64.80 46.45 36.04 17.03 65.61 68.24 63.21 25.06 56.79 56.90 56.80 62.32 66.58 50.68 22.68 46.66 49.76 66.03 58.24 52.21
conv1 (std)           1.01 0.73 1.04 1.69 1.02 1.43 0.88 1.43 0.36 2.57 0.99 1.33 1.90 1.31 0.38 0.81 1.99 0.97 0.75 1.53 0.42
conv2 (mean)          64.22 63.68 45.85 35.31 16.43 65.89 68.78 63.21 24.57 56.37 57.10 56.57 62.28 66.69 51.04 21.72 48.19 48.86 66.24 58.71 52.08
conv2 (std)           1.04 1.51 1.24 0.63 0.70 1.63 0.25 1.60 0.86 1.40 1.50 1.28 0.77 1.49 0.56 0.96 0.80 1.31 2.19 1.06 0.33
conv3 (mean)          63.56 63.91 44.18 34.24 16.01 64.36 67.71 63.00 23.27 56.99 57.21 56.31 61.05 66.04 50.22 20.22 45.03 49.40 66.26 56.92 51.29
conv3 (std)           1.10 0.75 0.67 2.78 0.96 1.57 0.86 1.41 0.82 1.71 0.46 1.49 1.65 1.18 0.85 0.29 2.39 0.73 1.80 0.56 0.44
Single Model (mean)   59.11 60.82 35.84 28.84 10.73 58.91 63.98 55.84 19.40 50.03 51.65 47.95 58.38 61.44 44.81 18.61 41.64 42.74 61.73 51.89 46.22
Single Model (std)    1.49 1.04 1.86 1.20 0.86 2.04 0.69 1.31 0.78 2.91 2.91 2.48 2.49 1.06 0.99 0.38 2.17 1.32 1.38 0.63 0.26
Table 8: Average Precision for Object Detection using different TreeNet models with the Fast R-CNN framework, when predicted bounding boxes are not used.

Appendix B Instability of Averaged Softmax Outputs

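As discussed in Section 6.1 of the main paper, training under a cross-entropy loss over averaged softmax outputs results in reduced performance compared to both standard ensembles and score-averaged ensembles. We find that this is because averaging softmax outputs prior to the cross-entropy loss yields less stable gradients than standard cross-entropy over softmax outputs. Let us consider the standard case first and formulate the derivative of the cross-entropy loss with respect to the softmax inputs. For scores $x_i$ and labels $y_i$, the cross-entropy loss and softmax function are defined as:

$$L = -\sum_i y_i \log p_i, \qquad p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

The derivative of the softmax probability $p_i$ with respect to some score $x_k$ is

$$\frac{\partial p_i}{\partial x_k} = p_i \left( \delta_{ik} - p_k \right)$$

where $\delta_{ik}$ is 1 if $i = k$ and 0 otherwise. This derivative requires multiplying probabilities which can be quite small, leading to underflow errors. Taking the derivative of the cross-entropy loss with respect to some $x_k$ results in a more stable solution (using $\sum_i y_i = 1$):

$$\frac{\partial L}{\partial x_k} = -\sum_i \frac{y_i}{p_i} \, p_i \left( \delta_{ik} - p_k \right) = p_k - y_k$$

Let us now consider the case where $p_i$ is averaged over $M$ predictors such that

$$p_i = \frac{1}{M} \sum_{m=1}^{M} p_i^m, \qquad p_i^m = \frac{e^{x_i^m}}{\sum_j e^{x_j^m}}$$

The derivative of this new $p_i$ with respect to the score $x_k^m$ of one predictor is then

$$\frac{\partial p_i}{\partial x_k^m} = \frac{1}{M} \, p_i^m \left( \delta_{ik} - p_k^m \right)$$

Again computing the derivative of the loss with respect to a score $x_k^m$, for a one-hot label with true class $c$, we see

$$\frac{\partial L}{\partial x_k^m} = \frac{1}{M} \, \frac{p_c^m}{p_c} \left( p_k^m - y_k \right)$$

The rightmost term in this result is identical to the standard case presented above; however, the ratio $p_c^m / p_c$ acts to weight the gradient for each predictor and can be shown to range from 0 to $M$. The product of this term and the probability can be prone to underflow errors when $p_c^m$ is less than $p_c$. On the other hand, when $p_c^m$ is greater than $p_c$ the gradients are increased in magnitude, which can result in overshooting good minima.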
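The per-predictor weighting can be checked numerically. The sketch below assumes a one-hot label with true class `c` and a loss equal to the negative log of the member-averaged softmax probability of that class; under those assumptions the gradient with respect to member m's scores is (1/M) times the ratio of that member's true-class probability to the averaged one, times (member probabilities minus the one-hot label). All names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def averaged_loss(scores, c):
    """Cross-entropy of the member-averaged softmax. scores: (M, K), c: true class."""
    return -np.log(softmax(scores).mean(axis=0)[c])

def gradient_weights(scores, c):
    """Ratio of each member's true-class probability to the averaged one."""
    p = softmax(scores)
    return p[:, c] / p[:, c].mean()

def analytic_grad(scores, c):
    """(1/M) * weight_m * (p^m - y) for every member m, as an (M, K) array."""
    p = softmax(scores)
    y = np.eye(scores.shape[1])[c]
    return gradient_weights(scores, c)[:, None] * (p - y) / scores.shape[0]

def numeric_grad(scores, c, eps=1e-6):
    """Central finite differences of averaged_loss, used only as a reference."""
    grad = np.zeros_like(scores)
    for idx in np.ndindex(*scores.shape):
        step = np.zeros_like(scores)
        step[idx] = eps
        grad[idx] = (averaged_loss(scores + step, c)
                     - averaged_loss(scores - step, c)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))     # M = 4 members, K = 10 classes
max_error = np.abs(analytic_grad(scores, 3) - numeric_grad(scores, 3)).max()
weights = gradient_weights(scores, 3)
```

The weights sum to M, so a member can only receive a weight above 1 at the expense of members that assign less mass to the true class.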

This scaling of the gradients has an interesting similarity with MCL. If a predictor puts little mass into the correct class compared to the other predictors, the weighting factor and thus the gradient go to zero – meaning worse performing members are less encouraged to improve than strong performers. This is similar behavior to what a soft-assignment variant of MCL might induce. However, we do not notice improved oracle accuracy relative to base ensembles for models trained with probability-averaged losses, implying the predictors are making relatively similar predictions.

Appendix C Pseudo-code for Stochastic Gradient Descent with MCL

We describe the classical MCL algorithm and our approach to integrate MCL coordinate descent with stochastic gradient descent in Section 6.2 of the main paper. Here we provide pseudocode for both algorithms to highlight the differences and provide additional clarity.

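Data: Dataset $\mathcal{D} = \{(x_i, y_i)\}$ and loss $\ell$
Result: Predictor parameters $\theta_1, \dots, \theta_M$
Initialization:
  Partition $\mathcal{D}$ into $M$ initial subsets $\mathcal{D}_1, \dots, \mathcal{D}_M$
while not converged do
       Step 1: Train each predictor to completion using only its corresponding subset of the data
          $\theta_m \leftarrow \arg\min_{\theta} \sum_{(x_i, y_i) \in \mathcal{D}_m} \ell(y_i, f(x_i; \theta))$ for $m = 1, \dots, M$
       Step 2: Reassign each example to its least-loss predictor
          $\mathcal{D}_m \leftarrow \{(x_i, y_i) \in \mathcal{D} : m = \arg\min_{m'} \ell(y_i, f(x_i; \theta_{m'}))\}$
end while
Algorithm 1 Classical MCL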
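As a runnable illustration of the two alternating steps, the toy sketch below makes each predictor a single constant trained under squared error, so that "training to completion" has a closed form (the mean of the assigned targets). This simplification and all names are ours; with real networks Step 1 would be a full training run.

```python
import numpy as np

def classical_mcl(y, assignment, num_predictors, max_iters=100):
    """Toy MCL where predictor m outputs the constant theta[m]."""
    theta = np.zeros(num_predictors)
    for _ in range(max_iters):
        # Step 1: fit each predictor to completion on its own subset.
        for m in range(num_predictors):
            if np.any(assignment == m):
                theta[m] = y[assignment == m].mean()
        # Step 2: reassign each example to its least-loss predictor.
        losses = (y[None, :] - theta[:, None]) ** 2      # shape (M, N)
        new_assignment = losses.argmin(axis=0)
        if np.array_equal(new_assignment, assignment):
            break                                        # assignments are stable
        assignment = new_assignment
    return theta, assignment

y = np.array([0.0, 0.2, 0.1, 5.0, 5.2, 4.9])
initial = np.array([0, 1, 0, 1, 0, 1])                   # deliberately poor start
theta, assignment = classical_mcl(y, initial, num_predictors=2)
```

In this degenerate setting the procedure coincides with one-dimensional k-means, which makes the coordinate-descent structure of the algorithm easy to see.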
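Data: Dataset $\mathcal{D} = \{(x_i, y_i)\}$, SGD parameters (e.g. learning rate $\eta$), and loss $\ell$
Result: Network parameters $\theta_1, \dots, \theta_M$
Initialization:
  Randomly initialize $\theta_1, \dots, \theta_M$
while not converged do
       Sample batch $\mathcal{B} \subset \mathcal{D}$
       Step 1: Forward pass
          For $m = 1, \dots, M$, compute forward-pass and losses $\ell(y_i, f(x_i; \theta_m))$ for all $(x_i, y_i) \in \mathcal{B}$
          Partition $\mathcal{B}$ by updating indicator variables $\alpha_{i,m}$ as:
            $\alpha_{i,m} = 1$ if $m = \arg\min_{m'} \ell(y_i, f(x_i; \theta_{m'}))$, and $0$ otherwise
       Step 2: Backward pass
          For each $\theta_m$ apply gradient descent update using only the subset of examples on which it achieves the lowest loss
          $\theta_m \leftarrow \theta_m - \eta \sum_{i \in \mathcal{B}} \alpha_{i,m} \nabla_{\theta_m} \ell(y_i, f(x_i; \theta_m))$
end while
Algorithm 2 Integrating MCL coordinate descent with SGD steps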
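The interleaved version can be sketched with linear softmax classifiers standing in for the member networks. This is an illustration under our own assumptions (model, data, learning rate, and names are all hypothetical), covering the case where each example is assigned to a single predictor.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mcl_sgd_step(weights, X, labels, lr):
    """One MCL step for M linear softmax classifiers. weights: (M, D, C)."""
    n = np.arange(len(labels))
    # Step 1: forward pass and per-member, per-example losses.
    probs = softmax(np.einsum('nd,mdc->mnc', X, weights))       # (M, N, C)
    losses = -np.log(probs[:, n, labels])                       # (M, N)
    # Indicator alpha[m, i] = 1 iff member m has the lowest loss on example i.
    winners = losses.argmin(axis=0)
    alpha = (winners[None, :] == np.arange(len(weights))[:, None]).astype(float)
    # Step 2: backward pass restricted to each member's assigned examples.
    grad_scores = probs.copy()
    grad_scores[:, n, labels] -= 1.0                            # softmax CE gradient: p - y
    grad_scores *= alpha[:, :, None]
    grads = np.einsum('nd,mnc->mdc', X, grad_scores) / len(labels)
    return weights - lr * grads, losses.min(axis=0).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=90)
X = np.eye(3)[labels] + 0.1 * rng.normal(size=(90, 3))          # noisy one-hot features
weights = 0.01 * rng.normal(size=(2, 3, 3))                     # M = 2 members
history = []
for _ in range(200):
    batch = rng.choice(90, size=30, replace=False)
    weights, oracle_loss = mcl_sgd_step(weights, X[batch], labels[batch], lr=0.5)
    history.append(oracle_loss)
```

Because each member only receives gradients from the examples it wins, the number of examples contributing to any one member's update shrinks with the number of members, which is the data fragmentation effect discussed in the main text.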

Appendix D Visualizations for MCL Trained Ensembles

In this section we present additional insight into how MCL ensemble training differs from the behavior of standard ensembles. To show how the distribution of class examples changes over training for MCL we have produced a video showing the proportion of each CIFAR10 class assigned to each predictor at test time and how it changes over training iterations. The intensity of each class icon is proportional to the fraction of class examples assigned to a predictor. Figure 4 shows a sample early and later frame from the video.

Figure 4: The left image shows the class distribution early in training – notice how many classes are split between multiple predictors. The right frame shows the evolution of this distribution after 220 additional iterations. Many of the classes have stabilized.

We also present additional guided-backprop [34] visualizations described in Section 6.2 of the main paper for different layers in members of traditional and MCL ensembles. These images visualize how the ensemble members respond to input images. These images can be interpreted as the gradient of a neuron output with respect to the input image. Features that are clear in these images have the largest influence on the network’s output for a given input image. Figure 5 shows these visualizations taken for an input image with respect to its true class label. Notice that ensemble members are agnostic to classes that they are not specialized in. The input images are those that produce the highest correct response on the ensemble model. Visualizations of the same neurons in Figure 6 are generated independently for each model using the image that gives the highest activation. We note that while there is a greater response for non-specialized ensemble members, they remain largely indifferent to image content. We see similar patterns of indifference in lower convolutional layer visualizations as well, shown in Figures 7, 8, and 9.

Figure 5: Reconstructions using features from the output layer using the images that give highest activation for the single model. Column1: Model from Standard Ensemble, Column2-4: Members of ensemble trained using MCL Loss
Figure 6: Reconstructions using features from the output layer using the images that give highest activation for each model independently. Column1: Model from Standard Ensemble, Column2-4: Members of ensemble trained using MCL Loss
Figure 7: Reconstructions using the conv1 layer. Column1: Model from Standard Ensemble, Column2-4: Members of ensemble trained using MCL Loss
Figure 8: Reconstructions using the conv2 layer. Column1: Model from Standard Ensemble, Column2-4: Members of ensemble trained using MCL Loss
Figure 9: Reconstructions using the conv3 layer. Column1: Model from Standard Ensemble, Column2-4: Members of ensemble trained using MCL Loss

Appendix E MPI-Caffe

MPI-Caffe is a modification of the popular Caffe deep learning framework that enables cross-GPU/cross-machine communication on MPI-enabled systems as model layers. Providing these MPI operations as layers allows network designers the flexibility to quickly experiment with distributed networks while abstracting away much of the communication logic. This enables experimentation with extremely large (i.e. larger than can be held in a single GPU) networks as well as ensemble-aware model parallelism schemes. This document explains the function of these layers and provides example usage. The core functionality in MPI-Caffe is provided by the MPIBroadcast and MPIGather layers.

The primary file defining the interface of the MPI layers is MPILayers.hpp. There are also many supporting modifications in the source that should be noted in case anyone tries to modify or update the base Caffe version. The network initialization code in net.cpp has been substantially altered to accommodate the distributed framework. Some other changes occur in layer.hpp, solver.cpp, and caffe.cpp among others.

Appendix E.1 A Toy Example

Let’s start with a toy example to build context for the MPI layer descriptions. Suppose we want to train a TreeNet ensemble of CIFAR10-Quick and we want to train it under a score-averaged loss. Figure 10 shows how we might modify the LeNet structure using MPI-Caffe to implement this model across three processes/GPUs in an MPI-enabled cluster. We will go through this example to explain the function and parametrization of the MPIBroadcast and MPIGather layers.

Figure 10: Left) Example network specification for MPI-Caffe enabled CIFAR10-Quick ensemble with parameter sharing and model averaging. Right) Distributed model resulting from the specification, visualized for each participating process. Dashed lines indicate MPI cross process communication and not layer input/outputs.

Appendix E.1.1 MPIBroadcast

The first layer we discuss is MPIBroadcast (highlighted in red in Figure 10). The MPIBroadcast layer broadcasts a copy of its input blob to each process in its communication group during its forward pass. During the backward pass, the gradients from each copy are summed and passed back to the input blob. The communication group consists of all processes that carry a copy of a particular broadcast layer. By default a communication group contains all processes; however, adding mpi_rank:n rules in either the include or exclude layer parameters can alter this group.

Figure 11: An example MPIBroadcast layer definition and a diagram of its forward-pass behavior.

Figure 11 shows the MPIBroadcast layer definition from our example and the corresponding forward-pass behavior. Going step by step through the definition: we

  • [Line 2-5] declare this layer to be an MPIBroadcast layer named “broad” with input blob pool2 and output blob pool2_b,

  • [Line 6-8] set the mpi_param root value to 0 indicating that process 0 will be initiating the broadcast,

  • and [Line 9-13] establish a communication group consisting of processes 0, 1, and 2.

During a forward pass, the MPIBroadcast layer on process 0 will send a copy of pool2 to processes 1 and 2 as well as retain a copy for itself. For the example, we would also need to modify the ip1 layer to take pool2_b as input rather than pool2.

It is important to note the effect the choice of mpi_param{ root } has on network structure. As shown in the example in Figure 10, each process parses the entire network structure and retains only the layers that include its MPI rank. For process 0, this includes the entire network, but for processes 1 and 2 the network starts with the MPIBroadcast layer. In order to allow this behavior, non-root processes have the input blob (pool2 in our example) stripped out during network parsing. Additionally for this example we need to average these top blobs before sending the result into the softmax loss.

Appendix E.1.2 MPIGather

If the purpose of a broadcast layer is to take some data and push copies into multiple process spaces, the MPIGather layer can be thought of as the opposite. In a forward pass, it takes multiple copies of a blob from multiple process spaces and collects them in the root process. During a backward pass, the gradients for each top blob are routed back to the corresponding input blob and process. Similar to the previous section, Figure 12 shows the layer definition from our example and a diagram of the forward pass behavior.

    layer {
      name: ExampleLayer
      type: MPIGather
      bottom: ip2
      top: ip2_0
      top: ip2_1
      top: ip2_2
      mpi_param {
        root: 0
      }
      include {
        mpi_rank: 0
        mpi_rank: 1
        mpi_rank: 2
      }
    }

Figure 12: An example MPIGather layer definition and a diagram of its forward-pass behavior.

The mpi_param{root} parameter in the gather layer defines which process will be receiving the gathered blobs and producing top blobs. In analogy to the broadcast layer parsing, gather layers in non-root processes are pruned of the top blobs during network parsing (see Figure 10).

There are some restrictions on the gather layer’s use. First, the bottom blob (ip2 in our example) must be defined in all communication group processes. Second, the number of top blobs must equal the number of processes in the communication group. The source checks both conditions and reports an error if either is not satisfied.

Appendix E.2 Notes and Other Examples

It is worth noting a few other use points about MPI-Caffe:

  • the MPIBroadcast layer can be used to construct a very large single-path network spanning multiple GPUs

  • the MPIGather layer can be used to allow more sophisticated ensemble losses

  • there is no limit on the number or order of MPI layers such that complex distributed networks are possible

  • in situations where network latency is lower than reading from disk, the MPIBroadcast layer can be used to train multiple independent networks more quickly

Appendix E.3 Communication Cost Analysis

We tested our MPI-Caffe framework on a large-scale cluster with one Tesla K20 GPU per node and a maximum MPI node interconnect bandwidth of 5.8 GB/sec. To characterize the communication overhead for an ensemble, we measure the time spent sharing various layers of the ILSVRC-Alex5 architecture. Each network was run on a separate node (with one node also holding the shared layers). Figure 13 shows the communication time to share a given layer as a fraction of the forward-backward pass. The x-axis indicates the number of floats broadcast per batch for each layer. We note that for these layers overhead appears approximately linear and even the largest layer incurs very little overhead for communication.

Figure 13: Fraction of forward-backward pass time used for TreeNets sharing various layers against the size of the layers. The overhead from communication is quite small and scales approximately linearly with the size of the layer being shared.