Distilled Hierarchical Neural Ensembles with Adaptive Inference Cost

03/03/2020 · by Adria Ruiz, et al.

Deep neural networks form the basis of state-of-the-art models across a variety of application domains. Moreover, networks that are able to dynamically adapt the computational cost of inference are important in scenarios where the amount of compute or input data varies over time. In this paper, we propose Hierarchical Neural Ensembles (HNE), a novel framework to embed an ensemble of multiple networks by sharing intermediate layers using a hierarchical structure. In HNE we control the inference cost by evaluating only a subset of models, which are organized in a nested manner. Our second contribution is a novel co-distillation method to boost the performance of ensemble predictions with low inference cost. This approach leverages the nested structure of our ensembles, to optimally allocate accuracy and diversity across the ensemble members. Comprehensive experiments over the CIFAR and ImageNet datasets confirm the effectiveness of HNE in building deep networks with adaptive inference cost for image classification.



1 Introduction

Deep learning models have shown impressive performance in many domains. In particular, convolutional neural networks (CNNs) have emerged as the most successful approach to solve many visual recognition tasks (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2016). However, state-of-the-art models typically require a large amount of computation during inference, limiting their deployment in edge devices such as mobile phones or autonomous vehicles. For this reason, methods based on network pruning (Huang et al., 2018a), architecture search (Tan et al., 2019), and manual network design (Sandler et al., 2018) have been proposed to learn more efficient models.

Figure 1: HNE uses a hierarchical parameter-sharing scheme generating a binary tree, where each leaf produces a separate model output. The amount of computation during inference is controlled by determining which part of the tree is evaluated for the ensemble prediction. (Bottom) Our hierarchical distillation approach leverages the full ensemble to supervise parts of the tree that are used in small ensembles.

Despite the promising results achieved by these approaches, there exist scenarios where, instead of deploying a single efficient network, we are interested in dynamically adapting the inference cost depending on external constraints. These include situations where the stream of data to be processed or the amount of available compute varies over time. One example is autonomous driving, where both can fluctuate with the acquisition speed and the presence of other concurrent processes. Another is the scanning of incoming data streams on large online platforms such as social media networks, where the amount of traffic may surge in case of important events. Models must therefore be able to scale their number of operations on-the-fly, depending on the amount of compute available at any point in time.

The design of networks with adaptive inference cost has only recently started to receive attention, see e.g. (Huang et al., 2018b; Yu and Huang, 2019; Ruiz and Verbeek, 2019). In this paper, we address this problem by introducing a novel framework which we refer to as Hierarchical Neural Ensembles (HNE). Inspired by ensemble learning (Breiman, 1996), HNE embeds a large number of networks whose combined outputs provide a more accurate prediction than any individual model. In order to reduce the computational cost of evaluating multiple networks, HNE employs a binary-tree structure to share a subset of intermediate layers between the different models. Additionally, this scheme allows the inference cost to be controlled by deciding on-the-fly how many networks to use, i.e. how many branches of the tree to evaluate, see Figure 1 (Top). To train HNE, we also propose a novel co-distillation method adapted to the hierarchical structure of ensemble predictions, see Figure 1 (Bottom).

Our contributions are summarised as follows: (i) To the best of our knowledge, we are the first to explore hierarchical ensembles for deep models with adaptive inference cost. (ii) We propose a hierarchical co-distillation scheme to increase the accuracy of ensembles for adaptive inference cost. (iii) Focusing on image classification, we show that our framework can be used to design efficient CNN ensembles. In particular, we evaluate the different proposed components by conducting exhaustive ablation studies on CIFAR-10/100 datasets. Additionally, compared to previous methods for adaptive inference, we show that HNE provides state-of-the-art accuracy-computation trade-offs on CIFAR-10/100 and ImageNet.

2 Related Work

Efficient networks and adaptive inference cost.  Several approaches have been explored to reduce the inference cost of deep neural networks. These include the design of efficient convolutional blocks (Howard et al., 2017; Ma et al., 2018; Sandler et al., 2018), automated design using neural architecture search (NAS) (Wu et al., 2019; Tan et al., 2019; Cai et al., 2019), and network pruning techniques (Liu et al., 2017; Huang et al., 2018a; Véniat and Denoyer, 2018).

Whereas these approaches are effective in reducing the resources required by a single network, the design of models with adaptive inference cost has received less attention. To address this problem, some methods enable any-time predictions in deep networks by introducing intermediate classifiers that provide outputs at early stages of inference (Bolukbasi et al., 2017; Huang et al., 2018b; Li et al., 2019; Elbayad et al., 2020). This idea was combined with NAS by Zhang et al. (2019) to automatically find the optimal position of the classifiers. Yu and Huang (2019) and Yu et al. (2018) proposed networks that can reduce the number of feature channels in order to adapt the computational cost. Ruiz and Verbeek (2019) introduced convolutional neural mixture models, a probabilistic model implemented by a dense network that can be dynamically pruned. Finally, other approaches have proposed different mechanisms to skip intermediate layers during inference in a data-dependent manner (Wu et al., 2018; Veit and Belongie, 2018; Wang et al., 2018).

Different from previous methods relying on intermediate classifiers, we address adaptive inference by exploiting hierarchically structured network ensembles. Additionally, our framework can be used with any base model in contrast to other approaches which require specific network architectures (Huang et al., 2018b; Ruiz and Verbeek, 2019). Therefore, it is complementary to other works employing manual network design or NAS.

Network ensembles.  Ensemble learning is a classic approach to improve generalization (Hansen and Salamon, 1990; Rokach, 2010). The success of this strategy can be explained by the reduction in variance resulting from averaging the outputs of different learned predictors (Breiman, 1996). In the context of neural networks, seminal works (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995; Naftaly et al., 1997; Zhou et al., 2002) observed that a significant accuracy boost could be achieved by averaging the outputs of independently trained models. Modern neural networks have also been shown to benefit from this strategy. In particular, CNN ensembles have been employed to improve the performance of individual models (Lan et al., 2018; Lee et al., 2015), increase their robustness to label noise (Lee and Chung, 2020), and estimate output uncertainty (Malinin et al., 2020; Ilg et al., 2018).

The main limitation of network ensembles is the increase in training and inference cost. In practice, state-of-the-art deep models contain a large number of intermediate layers and thus, independently training and evaluating several networks is very costly. Whereas some strategies have been proposed to decrease the training time (Huang et al., 2017; Loshchilov and Hutter, 2017), the high inference cost remains a bottleneck in scenarios where computational resources are limited. In this context, different works have proposed to build ensembles in which the individual networks share a subset of parameters (Lee et al., 2015; Minetto et al., 2019; Lan et al., 2018). Sharing a part of the layers between networks also reduces the inference cost.

Our HNE uses a binary-tree structure in order to share intermediate layers between the individual networks. Whereas hierarchically structured networks have been explored for different purposes, such as learning mixture of experts (Liu et al., 2019; Tanno et al., 2019), incremental learning (Roy et al., 2020) or efficient ensemble implementation (Zhang et al., 2018; Lee et al., 2015), our work is the first to leverage this structure for models with adaptive inference cost.

Knowledge distillation.  In the context of neural networks, knowledge distillation (KD) has been employed to improve the accuracy of a compact low-capacity “student” network, by training it on soft-labels generated by a high-capacity “teacher” network (Romero et al., 2015; Hinton et al., 2015; Ba and Caruana, 2014). However, KD has also been shown to be beneficial when the soft-labels are provided by one or multiple networks with the same architecture as the student (Furlanello et al., 2018). Another related approach is co-distillation (Zhang et al., 2018; Lan et al., 2018), where the distinction between student and teacher networks is lost and, instead, the different models are jointly optimized and distilled online. In the context of ensembles, co-distillation has been shown effective to improve the accuracy of the individual models by transferring the knowledge from the full ensemble (Anil et al., 2018; Song and Chai, 2018).

In this paper, we introduce hierarchical co-distillation in order to improve the performance of HNE. Different from ensemble co-distillation strategies (Lan et al., 2018; Song and Chai, 2018), our method transfers the knowledge from the full model to smaller ensembles in a hierarchical manner. Previous works have also employed distillation in the context of adaptive inference (Yu and Huang, 2019; Li et al., 2019). However, our approach is specifically designed for neural ensembles, where the goal is not only to improve predictions requiring a low inference cost, but also to preserve the diversity between the individual network outputs.

3 Hierarchical Neural Ensemble

A hierarchical neural ensemble (HNE) embeds an ensemble of deep networks computing an output from input $\mathbf{x}$ as:

$$\bar{f}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} f_n(\mathbf{x}), \qquad (1)$$

where each $f_n$ is a network with parameters $\theta_n$ and $N$ is the total number of models. In classification tasks, $f_n(\mathbf{x})$ is a vector containing the logits for the $K$ classes. Furthermore, we assume that each network is modelled by a composition of $B$ functions, or “blocks”:

$$f_n(\mathbf{x}) = g_n^B \circ g_n^{B-1} \circ \cdots \circ g_n^1(\mathbf{x}), \qquad (2)$$

where each block $g_n^b$ is a set of layers with parameters $\theta_n^b$. Typically, a block contains operations such as convolutions, batch-normalization layers, or activation functions.

Hierarchical parameter sharing.  If we consider the parameters to be different for all blocks and networks, then the complete set of model parameters is given by:

$$\Theta = \{\theta_n^b : n = 1,\dots,N;\ b = 1,\dots,B\}, \qquad (3)$$

and the inference cost of computing Eq. (1) is equivalent to evaluating $N$ independent networks.

In order to reduce the computational complexity, a HNE exploits a binary tree structure to share network parameters, where each node of the tree represents a computational block, see Figure 1. Formally, the parameters of an HNE composed of $N = 2^{B-1}$ networks are collectively denoted as:

$$\Theta = \{\theta_i^b : b = 1,\dots,B;\ i = 1,\dots,2^{b-1}\}, \qquad (4)$$

where for each block $b$ we consider $2^{b-1}$ independent sets of parameters $\theta_i^b$. This scheme generates a hierarchical structure where the number of independent parameters increases with the block index. For the first block $b{=}1$, all the networks share $\theta_1^1$. For $b{=}B$ there exist $2^{B-1}$ independent parameter sets, corresponding to the number of networks in the ensemble.
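To make the sharing scheme concrete, the following sketch (hypothetical helper names, not taken from our released code) counts the parameter sets per block and maps each network to the branch it uses at a given depth:

```python
# Sketch of the HNE binary-tree parameter-sharing scheme (Eq. 4).
# Block indices b and network indices n are 1-based, as in the text.

def num_parameter_sets(b):
    """Number of independent parameter sets theta_i^b at block b."""
    return 2 ** (b - 1)

def branch_index(n, b, B):
    """Index i of the parameter set used by network n at block b,
    in a tree with B blocks and N = 2^(B-1) leaf networks."""
    N = 2 ** (B - 1)
    leaves_per_branch = N // 2 ** (b - 1)
    return (n - 1) // leaves_per_branch + 1

B = 5                      # five blocks -> ensemble of 16 networks
N = 2 ** (B - 1)
assert [num_parameter_sets(b) for b in range(1, B + 1)] == [1, 2, 4, 8, 16]
# The first block is shared by all networks, the last block by none:
assert all(branch_index(n, 1, B) == 1 for n in range(1, N + 1))
assert [branch_index(n, B, B) for n in range(1, N + 1)] == list(range(1, N + 1))
```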

Figure 2: Efficient HNE implementation using group convolutions. Feature maps generated by different branches in the tree are stored in contiguous memory. Using group convolutions, the independent branch outputs can be computed in parallel. When a branch is split in two, the feature maps are replicated along the channel dimension and the number of groups for the next convolution is doubled.
Figure 3: Comparison between standard co-distillation and our hierarchical co-distillation for ensemble learning. In the first case, the individual models are pushed to their average, reducing their variance and the overall ensemble performance. In contrast, hierarchical co-distillation preserves the model diversity by distilling smaller ensembles to the overall ensemble.

Adaptive inference cost.  As previously discussed, the success of ensemble learning stems from the reduction in variance resulting from averaging a set of outputs from different models. The expected reduction in variance and improvement in accuracy are monotonic in the number of models in the ensemble. This observation makes it easy to adapt the trade-off between computation and final performance. In particular, given that the different networks can be evaluated sequentially, the computational cost can be controlled by choosing how many models to evaluate to approximate the ensemble prediction of Eq. (1).
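The variance-reduction argument is easy to verify numerically. The toy sketch below (illustrative only, not part of the method) averages T independent noisy predictors and shows the empirical variance shrinking roughly as 1/T:

```python
# Toy demonstration: the variance of an average of T independent,
# identically distributed predictions decays roughly as 1/T.
import random

random.seed(0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# 8 "predictors", each producing zero-mean, unit-variance noise per sample.
draws = [[random.gauss(0.0, 1.0) for _ in range(10000)] for _ in range(8)]
for T in (1, 2, 4, 8):
    avg = [sum(d[i] for d in draws[:T]) / T for i in range(10000)]
    print(f"T={T}: empirical variance of the average ~ {variance(avg):.3f}")
```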

In the case of HNE, this can be achieved by evaluating only a sub-tree obtained by removing a set of computational blocks in the hierarchical structure, see Figure 1. More formally, we can choose any value $T \in \{2^0, 2^1, \dots, 2^{B-1}\}$ and compute the ensemble output using a subset of $T$ networks as:

$$\bar{f}_T(\mathbf{x}) = \frac{1}{T} \sum_{n=1}^{T} f_n(\mathbf{x}). \qquad (5)$$
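In code, adaptive inference amounts to averaging a prefix of the network outputs. A minimal sketch (plain Python with hypothetical names; real implementations operate on tensors):

```python
# Sketch of adaptive inference (Eq. 5): average the logits of the
# first T networks only. Logits are plain lists here for clarity.

def ensemble_prediction(all_logits, T):
    """Average the outputs of the first T of N networks."""
    assert 1 <= T <= len(all_logits)
    K = len(all_logits[0])           # number of classes
    return [sum(l[k] for l in all_logits[:T]) / T for k in range(K)]

# Four networks, three classes: evaluating more networks refines the output.
logits = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
print(ensemble_prediction(logits, 1))  # -> [2.0, 0.0, 1.0]
print(ensemble_prediction(logits, 2))  # -> [1.5, 0.5, 0.5]
print(ensemble_prediction(logits, 4))  # -> [1.5, 0.75, 0.5]
```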
3.1 Computational Complexity

Hierarchical vs. independent networks.  We analyse the inference cost of a HNE compared to an ensemble composed of fully-independent networks. Assuming that the block functions $g_i^b$ require the same number of operations $C$ for all blocks $b$, the complexity of evaluating all $N = 2^{B-1}$ networks in a HNE is:

$$\mathcal{C}_{\text{HNE}} = (2^B - 1)\, C, \qquad (6)$$

where $B$ is the total number of blocks and $C$ is the cost of evaluating a single block function. Note that this quantity is proportional to $2^B - 1$, which is the total number of nodes in a full binary tree of depth $B$.

On the other hand, it is easy to show that an ensemble composed of $N = 2^{B-1}$ networks with fully-independent parameters has an inference cost of:

$$\mathcal{C}_{\text{IND}} = 2^{B-1} B\, C. \qquad (7)$$
Generalizing Eqs. (6) and (7) to the evaluation of only $T$ networks in both approaches, the ratio between the two time complexities is:

$$\rho(T) \;=\; \frac{T B\, C}{\sum_{b=1}^{B} \min(2^{b-1}, T)\, C} \;=\; \frac{B}{B - \log_2 T + 1 - \frac{1}{T}}. \qquad (8)$$

When a single network is evaluated ($T{=}1$), $\rho{=}1$ and the independent ensemble has the same computational complexity as the HNE. In contrast, when the number of evaluated models is increased ($T{=}2^{B-1}$), the last term in Eq. (8) becomes negligible and the computational complexity of a HNE is approximately $B/2$ times lower than evaluating independent networks. This speed-up increases linearly with the number of blocks $B$. This is important because, as we increase the number of models, the ensemble accuracy is significantly improved and, therefore, it is precisely for large network ensembles that a higher speed-up is crucial.
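This analysis can be checked numerically. Under the assumption of a unit cost per block, the sketch below counts block evaluations for a sub-tree of T networks and confirms that the speed-up over independent networks grows from 1 (single network) towards B/2 (full ensemble):

```python
# Numerical check of the complexity analysis (Eqs. 6-8), assuming every
# block costs one unit. Evaluating T networks in a HNE touches
# min(2^(b-1), T) branches at block b; T independent networks cost T*B.

def hne_cost(T, B):
    return sum(min(2 ** (b - 1), T) for b in range(1, B + 1))

def speedup(T, B):
    return (T * B) / hne_cost(T, B)

B = 5
assert hne_cost(2 ** (B - 1), B) == 2 ** B - 1   # full tree: 2^B - 1 nodes
assert speedup(1, B) == 1.0                      # single network: no gain
# For the full ensemble the speed-up approaches B/2:
assert abs(speedup(2 ** (B - 1), B) - B / 2) < 0.1
```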

Efficient HNE implementation.  Despite the theoretical reduction in inference cost, a naïve implementation where the individual network outputs are computed sequentially fails to exploit the inherent parallelism of modern GPUs. Fortunately, the evaluation of the different networks in the HNE can be parallelized using grouped convolutions (Xie et al., 2017; Howard et al., 2017), where different sets of input channels are used to compute independent sets of outputs. See Figure 2 for an illustration of our efficient implementation of HNE using group convolutions.
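A rough sketch of the channel/group bookkeeping behind Figure 2 (illustrative only, assuming a fixed per-branch width of 32 channels): at each block the active branches are stacked along the channel axis and computed with a single grouped convolution, and when branches split the feature maps are replicated and the group count doubles.

```python
# Shape bookkeeping for the group-convolution implementation of a HNE.
# At block b the 2^(b-1) active branches are stored contiguously along
# the channel axis; a split replicates the maps and doubles the groups.

def block_layout(b, base_channels):
    """(total channels, groups) of the grouped convolution at block b."""
    groups = 2 ** (b - 1)
    return base_channels * groups, groups

for b in range(1, 6):
    channels, groups = block_layout(b, base_channels=32)
    print(f"block {b}: {groups:2d} groups x 32 channels = {channels} channels")
```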

3.2 HNE Optimization

Given a training set $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$ composed of $M$ sample and label pairs, HNE parameters are optimized by minimizing a loss function for each individual network:

$$\mathcal{L}_I(\Theta) = \sum_{n=1}^{N} \sum_{m=1}^{M} \ell\big(y_m, f_n(\mathbf{x}_m)\big), \qquad (9)$$

where $\ell$ is the cross-entropy loss comparing ground-truth labels with network outputs. Note that traditional ensemble methods such as bagging (Breiman, 1996) employ different subsets of the training data in order to increase the variance of the individual models. We do not employ this strategy here, as previous work (Neal et al., 2018; Lee et al., 2015) has shown that networks trained from different initializations already exhibit significant diversity in their predictions. Moreover, reducing the number of training samples has a negative impact on the quality of the individual models.

A drawback of the loss in Eq. (9) is that it is symmetric among the networks in the ensemble. Notably, it ignores the hierarchical structure of the sub-trees that are used for adaptive inference. To address this limitation, we can optimize a loss that measures the accuracy of the different sub-trees, corresponding to the evaluation of an increasing number of networks in the HNE:

$$\mathcal{L}_H(\Theta) = \sum_{t=0}^{B-1} \sum_{m=1}^{M} \ell\big(y_m, \bar{f}_{2^t}(\mathbf{x}_m)\big), \qquad (10)$$

where $\bar{f}_T$ is defined in Eq. (5). Despite the apparent advantages of replacing Eq. (9) by Eq. (10) during learning, we empirically show that this strategy generally produces worse results. The reason is that Eq. (10) prevents the branches from behaving as an ensemble of independent networks. Instead, the different models tend to co-adapt in order to minimize the training error and, as a consequence, averaging their outputs does not reduce the variance of test-data predictions. In the following, we present an alternative approach that effectively exploits the hierarchical structure of HNE outputs during learning.
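For concreteness, the two objectives can be written down for a single sample as follows (a plain-Python sketch with hypothetical names; the actual implementation operates on batches of tensors):

```python
# Cross-entropy versions of the independent loss (Eq. 9) and the
# hierarchical loss (Eq. 10) for a single training sample.
from math import exp, log

def cross_entropy(logits, label):
    z = sum(exp(v) for v in logits)
    return -log(exp(logits[label]) / z)

def independent_loss(all_logits, label):
    """Eq. (9): one cross-entropy term per individual network."""
    return sum(cross_entropy(l, label) for l in all_logits)

def hierarchical_loss(all_logits, label):
    """Eq. (10): one cross-entropy term per nested ensemble size 1,2,4,..."""
    loss, T = 0.0, 1
    while T <= len(all_logits):
        avg = [sum(l[k] for l in all_logits[:T]) / T
               for k in range(len(all_logits[0]))]
        loss += cross_entropy(avg, label)
        T *= 2
    return loss
```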

3.3 Hierarchical Co-distillation

Previous works on network ensembles have explored the use of co-distillation (Anil et al., 2018; Song and Chai, 2018). In particular, these methods attempt to transfer the ensemble knowledge to the individual models by introducing an auxiliary distillation loss for each network:

$$\mathcal{L}_D(\Theta) = \sum_{n=1}^{N} \sum_{m=1}^{M} \ell\big(\mathbf{s}_m, f_n(\mathbf{x}_m)\big), \qquad (11)$$

where $\mathbf{s}_m$ are the soft-labels generated by normalizing the ensemble output $\bar{f}(\mathbf{x}_m)$ for sample $\mathbf{x}_m$, and the cross-entropy loss $\ell$ compares the network outputs with these soft-labels. During training, the distillation loss is combined with the standard cross-entropy loss of Eq. (9) as $\mathcal{L}_I + \lambda \mathcal{L}_D$, where $\lambda$ is a hyper-parameter controlling the trade-off between both terms.

Whereas this distillation approach boosts the performance of individual models, it has a critical drawback in the context of ensemble learning for adaptive inference. In particular, standard co-distillation encourages all the predictions to be similar to their average. As a consequence, the variance between model predictions decreases, limiting the improvement given by combining multiple models in an ensemble.

To address this limitation, we propose a novel distillation scheme which we refer to as “hierarchical co-distillation”. The core idea is to transfer the knowledge from the full ensemble to the smaller sub-tree ensembles used for adaptive inference in HNE. In particular, we minimize:

$$\mathcal{L}_{HD}(\Theta) = \sum_{t=0}^{B-1} \sum_{m=1}^{M} \ell\big(\mathbf{s}_m, \bar{f}_{2^t}(\mathbf{x}_m)\big). \qquad (12)$$

Similar to Eq. (10), this loss exploits the hierarchical structure of HNE. However, in this case the predictions of each sub-tree are distilled towards the full-ensemble outputs. In contrast to standard co-distillation, the minimization of Eq. (12) does not force all the models to be close to their average. As a consequence, the diversity between predictions is better preserved, retaining the advantages of averaging multiple models. See Figure 3 for an illustration.
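The difference with standard co-distillation lies only in which predictions are matched to the ensemble soft-labels. The sketch below (single sample, hypothetical names, plain Python) makes this explicit:

```python
# Sketch contrasting standard co-distillation (Eq. 11) with hierarchical
# co-distillation (Eq. 12) for one sample. Soft-labels come from the full
# ensemble; only the targets of the cross-entropy terms differ.
from math import exp, log

def softmax(logits):
    z = sum(exp(v) for v in logits)
    return [exp(v) / z for v in logits]

def soft_cross_entropy(logits, soft_target):
    probs = softmax(logits)
    return -sum(t * log(p) for t, p in zip(soft_target, probs))

def mean_logits(all_logits, T):
    K = len(all_logits[0])
    return [sum(l[k] for l in all_logits[:T]) / T for k in range(K)]

def standard_codistillation(all_logits):
    """Eq. (11): every individual network matches the ensemble soft-labels."""
    s = softmax(mean_logits(all_logits, len(all_logits)))
    return sum(soft_cross_entropy(l, s) for l in all_logits)

def hierarchical_codistillation(all_logits):
    """Eq. (12): each nested sub-ensemble matches the ensemble soft-labels."""
    s = softmax(mean_logits(all_logits, len(all_logits)))
    loss, T = 0.0, 1
    while T <= len(all_logits):
        loss += soft_cross_entropy(mean_logits(all_logits, T), s)
        T *= 2
    return loss
```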

4 Experiments

Models evaluated at inference     1      2      4      8     16
CIFAR-10                        91.7   92.4   93.2   93.7   94.0
                                90.7   91.4   92.4   93.4   93.6
CIFAR-100                       67.7   70.0   71.9   73.4   74.8
                                67.9   70.4   73.1   74.3   75.0
CIFAR-10                        93.6   94.2   94.8   95.0   95.2
                                92.8   93.6   94.3   94.9   95.2
CIFAR-100                       73.5   75.4   77.5   79.0   79.7
                                71.7   73.8   76.7   78.7   79.5
Table 1: Accuracy of HNE variants embedding 16 networks on CIFAR-10/100. Columns correspond to the number of models evaluated during inference.
Figure 4: Accuracy and standard deviation in logits vs. FLOP count for HNE trained (i) without distillation, (ii) with standard co-distillation, and (iii) with our hierarchical co-distillation. Curves represent the evaluation of ensembles of size 1 to 16.

4.1 Datasets and Implementation Details

CIFAR-10/100 (Krizhevsky, 2009) contain 50k train and 10k test images from 10 and 100 classes, respectively. Following standard protocols (He et al., 2016), we pre-process the images by normalizing the mean and standard deviation of each color channel. Additionally, during training we use data augmentation, extracting random 32×32 crops after applying zero padding to the original image or its horizontal flip.

ImageNet (Russakovsky et al., 2015) comprises 1.2M training and 50k validation high-resolution images, each labelled with one of 1,000 categories. Following (Yu and Huang, 2019), we use the standard evaluation protocol, resizing the image and extracting a center crop. For training, we apply the same data augmentation as in the cited work, but remove color transformations.

Base architectures.  We implement HNE with commonly used architectures. For CIFAR-10/100, we use a variant of ResNet (He et al., 2016), composed of a sequence of residual convolutional layers with bottlenecks. We employ depth-wise instead of regular convolutions to reduce computational complexity. We generate a HNE with a total of five blocks, embedding an ensemble of 16 CNNs. For ImageNet, we implement an HNE based on MobileNet v2 (Sandler et al., 2018), which uses inverted residual layers and depth-wise convolutions as main building blocks. In this case, we use four blocks, generating eight different networks. In the supplementary material we present a detailed description of our HNE implementation using ResNet and MobileNet v2. The particular design choices for both architectures are set to provide a computational complexity similar to the previous methods to which we compare in Section 4.3. We implemented our HNE framework in PyTorch 1.0 and all experiments are carried out on NVIDIA GPUs. Our code will be released upon publication.

Inference cost metric.  Following previous work (Huang et al., 2018b; Yu and Huang, 2019; Ruiz and Verbeek, 2019; Zhang et al., 2019), we evaluate the computational cost of the models by the number of floating-point operations (FLOPs) performed during inference. The advantage of this metric is that it is independent of differences in hardware and implementation, while being strongly correlated with wall-clock inference time.

Hyper-parameter settings.  All networks are optimized using SGD with momentum and weight decay. For the CIFAR datasets, following Ruiz and Verbeek (2019), we train HNE using a cosine-shaped learning rate schedule; for faster experimentation, however, we reduce the number of training epochs. For ImageNet, our model is trained with a cyclic learning rate schedule (Mehta et al., 2019), in which the learning rate is halved at fixed epochs, with each batch split across three GPUs. For distillation methods, following Song and Chai (2018), we apply a softmax temperature to generate soft-labels from ensemble predictions. Finally, we keep the trade-off parameter $\lambda$ between the cross-entropy loss and the distillation loss fixed, unless specified otherwise.

Figure 5: Results on CIFAR-10/100 using ensembles with different parameter-sharing schemes: (i) fully-independent networks, (ii) multi-branch architecture with shared backbone, (iii) the proposed hierarchical network ensemble.

4.2 Ablation Study on CIFAR-10/100

We present a comprehensive ablation study to evaluate the different components of our approach. We report results for the base ResNet architecture, as well as a reduced version in which the number of feature channels in all layers is divided by two. The latter allows us to provide a more complete evaluation by considering an additional regime where inference is extremely efficient.

Optimizing nested network ensembles.  In contrast to “single-network” learning, in HNE we are interested in optimizing a set of ensembles generated by increasing the number of models. In order to understand the advantages of hierarchical co-distillation for this purpose, we compare four alternative objectives that can be minimized during HNE training: (1) an independent loss $\mathcal{L}_I$ for each model, given in Eq. (9); (2) a hierarchical loss $\mathcal{L}_H$ explicitly maximizing the accuracy of nested ensembles, given in Eq. (10); (3) a combination of both; and (4) a combination of $\mathcal{L}_I$ and the proposed hierarchical co-distillation loss $\mathcal{L}_{HD}$ of Eq. (12).

As shown in Table 1, learning HNE with the independent loss $\mathcal{L}_I$ provides better performance than training with the hierarchical loss $\mathcal{L}_H$. These results can be understood from the perspective of the bias-variance trade-off. Concretely, note that $\mathcal{L}_H$ attempts to reduce the bias by co-adapting the individual outputs in order to minimize the training error. As the different networks are not trained independently, the reduction in variance resulting from averaging multiple models in an ensemble is lost, causing a performance drop on test data. This effect is also observed when combining $\mathcal{L}_H$ with $\mathcal{L}_I$. Using hierarchical co-distillation, however, consistently outperforms the alternatives in all the evaluated cases. Interestingly, $\mathcal{L}_{HD}$ also encourages the co-adaptation of individual outputs, yet it has two critical advantages over $\mathcal{L}_H$. First, the co-distillation loss compares the outputs of small ensembles with soft-labels generated by averaging all the individual networks; the matched distribution therefore preserves the higher generalization ability produced by combining independent models. Second, as with other distillation strategies, $\mathcal{L}_{HD}$ transfers the information in the logits generated by all the models to the small ensembles. In conclusion, this experiment shows the benefits of our co-distillation approach for increasing the accuracy of the hierarchical outputs of HNE.

Figure 6: Comparison of HNE with the state of the art. Curves correspond to results obtained by models with adaptive inference cost. Disconnected points show results for methods requiring independent models for each FLOP count.

Comparing distillation approaches.  After demonstrating the effectiveness of hierarchical co-distillation in boosting the performance of small ensembles, we evaluate its advantages w.r.t. standard co-distillation. For this purpose, we train HNE using the distillation loss of Eq. (11), as in Song and Chai (2018). For standard co-distillation we also evaluate a range of values for $\lambda$ in order to analyze the impact on accuracy and model diversity. The latter is measured as the standard deviation of the logits of the evaluated models, averaged across all classes and test samples. In Figure 4 we report both the accuracy on test data (Acc, columns one and three) and the standard deviation (Std, columns two and four).

From the figure we make several observations. Consider first standard co-distillation with a high weight on the distillation loss (large $\lambda$). As expected, the performance of small ensembles improves w.r.t. training without distillation. For larger ensembles, however, the accuracy tends to be significantly lower than when not using distillation, due to the reduction in diversity among the models induced by the co-distillation loss. This effect can be controlled by reducing the weight of the distillation loss, but smaller values of $\lambda$ tend to compromise the accuracy of small ensembles. In contrast, our hierarchical co-distillation achieves comparable or better performance for all ensemble sizes and datasets. For small ensemble sizes, we obtain improvements over training without distillation that are comparable to or larger than those obtained with standard co-distillation. For large ensemble sizes, our approach significantly improves over standard co-distillation, and its accuracy is comparable to, if not better than, that of large ensembles trained without distillation. In the supplementary material we provide additional results that give more insight into the advantages of hierarchical co-distillation for HNE.

Hierarchical parameter sharing.  In this experiment, we compare the performance of HNE with that of an ensemble of independent networks not organized in a tree structure, using the same base architecture. We also compare to the multi-branch architecture explored by Lan et al. (2018) and Lee et al. (2015), where complexity is reduced by sharing the same “backbone” for the first blocks of all models and then bifurcating into independent branches for the subsequent blocks. For HNE we use five blocks, yielding 16 networks. To achieve architectures in a similar FLOP range, the multi-branch baseline uses the first three blocks as the backbone and implements 16 independent branches for the last two blocks. We use hierarchical distillation for all three architectures. In Figure 5 we again report both the accuracy (Acc) and the standard deviation of the logits (Std).

From the results, we observe that HNE obtains significantly better results than the independent networks. Moreover, the hierarchical structure significantly reduces the computational cost for large ensemble sizes. The multi-branch scheme achieves slightly better performance when the number of evaluated models is small, but HNE is significantly better for larger ensembles. These results can again be understood by looking at the diversity across models. In particular, we see that sharing a common backbone for all networks drastically reduces the diversity of predictions, whereas HNE attains a prediction diversity comparable to that of independent networks that share no parameters or computation. This shows the importance of using independent sets of parameters in early blocks to achieve diversity between the individual models.

4.3 Comparison with the state of the art

CIFAR-10/100.  We compare the performance of HNE trained with hierarchical co-distillation against previous approaches for learning CNNs with adaptive inference cost: Multi-scale DenseNet (Huang et al., 2018b), Graph Hyper-Networks (Zhang et al., 2019), Deep Adaptive Networks (Li et al., 2019), and Convolutional Neural Mixture Models (Ruiz and Verbeek, 2019). Following (Ruiz and Verbeek, 2019), we also show the results obtained by network pruning methods used to reduce the inference cost of CNNs: NetSlimming (Liu et al., 2017), CondenseNet (Huang et al., 2018a), and SSS (Huang and Wang, 2018). We report HNE results for ensembles of sizes up to eight, in order to match the maximum FLOP count of the compared methods (roughly 250M FLOPs).

Results in Figure 6 show that, on both datasets, HNE consistently outperforms previous approaches with adaptive computational complexity. Among the network pruning methods, only CondenseNet obtains a small improvement on CIFAR-10, and only for a single operating point. This approach, however, requires training and evaluating a separate model for each particular inference cost.

ImageNet.  We compare our method with slimmable (Yu et al., 2018) and universally slimmable networks (Yu and Huang, 2019). To the best of our knowledge, these works have reported state-of-the-art performance on ImageNet for adaptive CNNs with relatively low computational complexity. In particular, we report results in an inference cost range similar to that of our MobileNet v2-based HNE (150-600M FLOPs), again training with hierarchical co-distillation.

The results in Figure 6 show that our HNE achieves better accuracy than the compared methods across all inference cost levels. Moreover, our approach is complementary to slimmable networks as the dynamic reduction in feature channels used in slimmable networks can be applied to the blocks in our hierarchical structure. Combining HNE with this strategy provides a complementary mechanism to find better trade-offs between computation and accuracy.

5 Conclusions

We have proposed Hierarchical Neural Ensembles (HNE), a framework to design deep models with adaptive inference cost. Additionally, we have introduced a novel hierarchical co-distillation approach adapted to the structure of HNE. Compared to previous methods for adaptive deep networks, we have reported state-of-the-art compute-accuracy trade-offs on CIFAR-10/100 and ImageNet.

While we have demonstrated the effectiveness of our framework in the context of CNNs for image classification, our approach is generic and can be used to build ensembles of other types of deep networks for different tasks and domains, for example in natural language processing. Our framework is complementary to network compression and neural architecture search methods. We believe that a promising line of research for adaptive models is to explore the combination of these types of approaches with our hierarchical neural ensembles.


Acknowledgments

This work has been partially supported by the grant “Deep in France” (ANR-16-CE23-0006).


References

  • R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton (2018) Large scale distributed neural network training through online distillation. ICLR. Cited by: §2, §3.3.
  • J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In NIPS, Cited by: §2.
  • T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama (2017) Adaptive neural networks for efficient inference. In ICML, Cited by: §2.
  • L. Breiman (1996) Bagging predictors. Machine learning. Cited by: §1, §2, §3.2.
  • H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. ICLR. Cited by: §2.
  • M. Elbayad, J. Gu, E. Grave, and M. Auli (2020) Depth-adaptive transformer. In ICLR, Cited by: §2.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. In ICML, Cited by: §2.
  • L. Hansen and P. Salamon (1990) Neural network ensembles. PAMI 12 (10). Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Appendix A, Appendix A, §1, §4.1, §4.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Appendix A, §2, §3.1.
  • A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In ICCV, Cited by: Appendix A.
  • G. Huang, S. Liu, L. van der Maaten, and K. Weinberger (2018a) CondenseNet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §1, §2, §4.3.
  • G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2018b) Multi-scale dense networks for resource efficient image classification. ICLR. Cited by: §1, §2, §2, §4.1, §4.3.
  • G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get m for free. ICLR. Cited by: §2.
  • Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In ECCV, Cited by: §4.3.
  • E. Ilg, O. Cicek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox (2018) Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Master’s Thesis, University of Toronto. Cited by: §4.1.
  • A. Krizhevsky, I. Sutskever, and G. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • A. Krogh and J. Vedelsby (1995) Neural network ensembles, cross validation, and active learning. In NIPS, Cited by: §2.
  • X. Lan, X. Zhu, and S. Gong (2018) Knowledge distillation by on-the-fly native ensemble. In NIPS, Cited by: §2, §2, §2, §2, §4.2.
  • J. Lee and S. Chung (2020) Robust training with ensemble consensus. In ICLR, Cited by: §2.
  • S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra (2015) Why m heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314. Cited by: §2, §2, §2, §3.2, §4.2.
  • H. Li, H. Zhang, X. Qi, R. Yang, and G. Huang (2019) Improved techniques for training adaptive deep networks. In ICCV, Cited by: §2, §2, §4.3.
  • Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu (2019) Deep tree learning for zero-shot face anti-spoofing. In CVPR, Cited by: §2.
  • Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §2, §4.3.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §2.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV, Cited by: §2.
  • A. Malinin, B. Mlodozeniec, and M. Gales (2020) Ensemble distribution distillation. In ICLR, Cited by: §2.
  • S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi (2019) ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network. CVPR. Cited by: §4.1.
  • R. Minetto, M. Segundo, and S. Sarkar (2019) Hydra: an ensemble of convolutional neural networks for geospatial land classification. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §2.
  • U. Naftaly, N. Intrator, and D. Horn (1997) Optimal ensemble averaging of neural networks. Network: Computation in Neural Systems. Cited by: §2.
  • B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien, and I. Mitliagkas (2018) A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591. Cited by: §3.2.
  • L. Rokach (2010) Ensemble-based classifiers. Artificial Intelligence Review. Cited by: §2.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) Fitnets: hints for thin deep nets. ICLR. Cited by: §2.
  • D. Roy, P. Panda, and K. Roy (2020) Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Networks. Cited by: §2.
  • A. Ruiz and J. Verbeek (2019) Adaptative inference cost with convolutional neural mixture models. In ICCV, Cited by: §1, §2, §2, §4.1, §4.1, §4.3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. IJCV. Cited by: §4.1.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: Appendix A, §1, §2, §4.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. BMVC. Cited by: §1.
  • G. Song and W. Chai (2018) Collaborative learning for deep neural networks. In NIPS, Cited by: §2, §2, §3.3, §4.1, §4.2.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §1, §2.
  • R. Tanno, K. Arulkumaran, D. Alexander, A. Criminisi, and A. Nori (2019) Adaptive neural trees. ICML. Cited by: §2.
  • A. Veit and S. Belongie (2018) Convolutional networks with adaptive inference graphs. In ECCV, Cited by: §2.
  • T. Véniat and L. Denoyer (2018) Learning time/memory-efficient deep architectures with budgeted super networks. In CVPR, Cited by: §2.
  • X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018) SkipNet: learning dynamic routing in convolutional networks. In ECCV, Cited by: §2.
  • B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, Cited by: §2.
  • Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. Davis, K. Grauman, and R. Feris (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR, Cited by: §2.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §3.1.
  • J. Yu and T. Huang (2019) Universally slimmable networks and improved training techniques. ICCV. Cited by: §1, §2, §2, §4.1, §4.1, §4.3.
  • J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2018) Slimmable neural networks. ICLR. Cited by: §2, §4.3.
  • C. Zhang, M. Ren, and R. Urtasun (2019) Graph hypernetworks for neural architecture search. ICLR. Cited by: §2, §4.1, §4.3.
  • Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, Cited by: §2, §2.
  • Z. Zhou, J. Wu, and W. Tang (2002) Ensembling neural networks: many could be better than all. Artificial intelligence. Cited by: §2.

Appendix A HNE Architectures

In the following, we provide a detailed description of our HNE implementation using ResNet (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) architectures for the CIFAR and ImageNet datasets, respectively.

ResNet HNE.  We build our HNE using the two types of layers illustrated in Figure 7, which we refer to as ResNet and ResNet+Branching. The first implements a standard residual block with bottleneck (He et al., 2016). However, as described in Figure 2 of the main paper, we replace standard convolutional filters with group convolutions in order to compute the outputs of the different branches of the tree structure in parallel. Additionally, we replace the standard convolution with a depth-wise separable convolution (Howard et al., 2017) to reduce computational complexity. The ResNet+Branching layer follows the structure of the "shortcut" blocks used in ResNets when the spatial resolution is reduced or the number of channels is increased. In HNE we employ it when a branch is split in two, and therefore add a channel replication operation as shown in Figure 7(b). Using these two types of layers, we implement HNE with blocks embedding a total of ResNets of depth . See Table 2 for a detailed description of the full architecture configuration.

Figure 7: The two layers used in HNE with the ResNet architecture. We use in_c and out_c to refer to the number of input and output feature channels of the convolutional block, as described in Table 2. In blue, we show the number of channels after applying each block and whether strided convolution is used. BN refers to Batch Normalization.
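The grouped-convolution branching mechanism described above can be sketched in a few lines of numpy. This is an illustrative sketch only, not the paper's implementation: the function names, the use of a pointwise (1x1) convolution, and the toy shapes are our assumptions.

```python
import numpy as np

def replicate_groups(x, groups):
    """Channel replication used when a branch splits in two.

    x: (N, groups * C, H, W). Each contiguous channel group holds one branch;
    duplicating every group turns `groups` branches into `2 * groups` branches.
    """
    n, c, h, w = x.shape
    cg = c // groups
    x = x.reshape(n, groups, cg, h, w)
    x = np.repeat(x, 2, axis=1)  # [g0, g0, g1, g1, ...]
    return x.reshape(n, 2 * groups * cg, h, w)

def grouped_pointwise_conv(x, weights):
    """A 1x1 group convolution: each weight matrix processes one branch's
    channel group independently, so all branches run in a single layer.

    weights: list of per-group (C_out, C_in) matrices, one per branch.
    """
    n, c, h, w = x.shape
    g = len(weights)
    cg = c // g
    outs = []
    for i, wmat in enumerate(weights):
        xi = x[:, i * cg:(i + 1) * cg]  # channels belonging to branch i
        outs.append(np.einsum('oc,nchw->nohw', wmat, xi))
    return np.concatenate(outs, axis=1)
```

A real implementation would use a grouped convolution primitive (e.g. the `groups` argument of a framework's Conv2d) rather than an explicit loop; the loop here only makes the per-branch independence visible.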

| Block | Layer | Stride | Repetitions | Groups (in) | Groups (out) | Channels (in) | Channels (out) |
|---|---|---|---|---|---|---|---|
| 0 | Conv + BN + ReLU | 1 | 1 | 1 | 1 | 3 | 16 |
| 0 | Conv + BN + ReLU | 1 | 1 | 1 | 1 | 16 | 64 |
| 1 | ResNet+Branching | 1 | 1 | 1 | 2 | 64 | 64 |
| 1 | ResNet | 1 | 2 | 2 | 2 | 64 | 64 |
| 2 | ResNet+Branching | 2 | 1 | 2 | 4 | 64 | 128 |
| 2 | ResNet | 1 | 3 | 4 | 4 | 128 | 128 |
| 3 | ResNet+Branching | 2 | 1 | 4 | 8 | 128 | 256 |
| 3 | ResNet | 1 | 3 | 8 | 8 | 256 | 256 |
| 4 | ResNet+Branching | 2 | 1 | 8 | 16 | 256 | 256 |
| 4 | ResNet | 1 | 4 | 16 | 16 | 256 | 256 |
| Classifier | Avg. Pool + Linear | 1 | 1 | 16 | 16 | 256 | #classes |
Table 2: Full layer configuration of HNE based on the ResNet architecture. Each row shows: (1) the block index in the hierarchical structure; (2) the type of layer, the convolution stride, and the number of times the layer is stacked; (3) the number of input and output convolution groups, which corresponds to the number of active branches in the tree structure; (4) the number of input and output channels per branch. Note that the total channel count is this value multiplied by the number of groups. The final output dimension of the classifier is the number of classes in the dataset.

MobileNet HNE.  To implement an efficient ensemble of MobileNetV2 networks, we employ the same layers used in the original architecture (see Figure 8), which we refer to as MBNet and MBNet+Branching. The first implements the inverted residual blocks used in MobileNetV2, but with group convolutions to account for the different HNE branches. The second is analogous to the shortcut blocks used in this model, with an added channel group replication to implement the branch splits in the tree structure. The detailed layer configuration is shown in Table 3. Note that our MobileNet HNE embeds a total of networks with the same architecture as a MobileNetV2 model with channel width multiplier of . The only difference is that we replace the last convolutional layer by a convolution and a fully-connected layer, as suggested in (Howard et al., 2019). This modification reduces the total number of FLOPs without a significant drop in accuracy.
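Because the ensemble members are nested, inference cost can be adapted at test time simply by averaging the predictions of the first k branches of a single forward pass. A minimal numpy sketch (the function name and shapes are hypothetical, not from the paper):

```python
import numpy as np

def nested_ensemble_predict(branch_logits, k):
    """Average the predictions of the first k ensemble members.

    branch_logits: (M, N, K) array of per-branch class logits for a batch of
    N inputs, produced by one HNE forward pass with M branches. Since the
    branches are organized in a nested manner, any prefix of size k yields a
    valid (smaller) ensemble at reduced inference cost.
    """
    assert 1 <= k <= branch_logits.shape[0]
    return branch_logits[:k].mean(axis=0)
```

In practice the unused branches would not be evaluated at all; slicing precomputed logits here only illustrates the nested prediction rule.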

Figure 8: The two layers used in HNE with MobileNet architectures. We use in_c and out_c to refer to the number of input and output feature channels of the convolutional block as described in Table 3. In blue, we show the number of channels resulting after applying each block and whether strided convolution is used. BN refers to Batch Normalization.
| Block | Layer | Stride | Repetitions | Groups (in) | Groups (out) | Channels (in) | Channels (out) |
|---|---|---|---|---|---|---|---|
| 0 | Conv + BN + ReLU6 | 2 | 1 | 1 | 1 | 3 | 16 |
| 0 | Conv + BN + ReLU6 | 1 | 1 | 1 | 1 | 16 | 16 |
| 0 | MBNet | 2 | 2 | 1 | 1 | 16 | 24 |
| 1 | MBNet+Branching | 2 | 1 | 1 | 2 | 24 | 24 |
| 1 | MBNet | 1 | 2 | 2 | 2 | 24 | 24 |
| 2 | MBNet+Branching | 2 | 1 | 2 | 4 | 24 | 48 |
| 2 | MBNet | 1 | 3 | 4 | 4 | 48 | 48 |
| 2 | MBNet | 1 | 3 | 4 | 4 | 48 | 72 |
| 3 | MBNet+Branching | 2 | 1 | 4 | 8 | 72 | 120 |
| 3 | MBNet | 1 | 2 | 8 | 8 | 120 | 120 |
| 3 | Conv + BN + ReLU6 | 1 | 1 | 8 | 8 | 120 | 720 |
| 3 | Avg. Pool + Linear + ReLU6 | 1 | 1 | 8 | 8 | 720 | 1280 |
| Classifier | Linear | 1 | 1 | 8 | 8 | 1280 | 1000 |
Table 3: Full layer configuration of HNE based on the MobileNetV2 architecture. Each row shows: (1) the block index in the hierarchical structure; (2) the type of layer, the convolution stride, and the number of times the layer is stacked; (3) the number of input and output convolution groups, which corresponds to the number of active branches in the tree structure; (4) the number of input and output channels per branch. Note that the total channel count is this value multiplied by the number of groups.

Appendix B Additional Results on Hierarchical Co-distillation

We provide additional results to give more insight into the effect of the evaluated distillation approaches. Figure 9 depicts the performance of HNE for different ensemble sizes used during inference (curves) and the accuracy of the individual networks in the ensemble (bars). Comparing the results without distillation to standard co-distillation, we observe that the latter significantly increases the accuracy of the individual models. This is expected, because the knowledge of the complete ensemble is transferred to each network independently. However, the accuracy obtained as the number of evaluated models is increased tends to be lower than for HNE trained without distillation. Although both phenomena may seem counter-intuitive, they can be explained by the fact that standard co-distillation tends to decrease the diversity between the individual models. As a consequence, the gains obtained by combining a large number of networks are reduced, even though the networks are individually more accurate.

The reported results for hierarchical co-distillation clearly show its advantages over the alternative approaches. In this case, the accuracy of the first model is much higher than in HNE trained without distillation, and similar to that obtained with standard co-distillation. The reason is that the ensemble knowledge is directly transferred to the predictions of the first network in the hierarchical structure. Additionally, the performance of the remaining networks is usually lower than for HNE trained with standard co-distillation. The improvement obtained by ensembling their predictions is, however, significantly higher. As discussed in the main paper, this is because hierarchical co-distillation better preserves the diversity between the network outputs, even though their individual average performance can be worse.
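The difference between the two schemes can be made concrete with a small numpy sketch. The temperature, the KL direction, and the uniform weighting below are our assumptions for illustration; the paper's actual loss definitions are in its Section 3.3 and are not reproduced here.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Mean KL divergence KL(p || q) over a batch of distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def standard_codistill(member_logits, T=2.0):
    """Standard co-distillation: every member is pulled towards the ensemble
    mean, which raises individual accuracy but shrinks diversity.

    member_logits: (M, N, K) logits of M ensemble members.
    """
    teacher = softmax(member_logits, T).mean(axis=0)  # treated as a constant target
    return float(np.mean([kl(teacher, softmax(m, T)) for m in member_logits]))

def hierarchical_codistill(member_logits, T=2.0):
    """Hierarchical variant: only the first (always-evaluated) network receives
    the ensemble knowledge, so the remaining members keep their diversity."""
    teacher = softmax(member_logits, T).mean(axis=0)
    return float(kl(teacher, softmax(member_logits[0], T)))
```

With identical members both distillation terms vanish; with diverse members, the standard scheme penalizes every network's deviation from the mean, whereas the hierarchical scheme penalizes only the first network's.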

Figure 9: Results on CIFAR-100 for HNE (top) and HNE (bottom) trained without distillation, with standard co-distillation, and with the proposed hierarchical co-distillation. Curves indicate the performance of the ensembles as the number of evaluated models is increased. Bars depict the accuracy of each individual network.