Convolutional neural networks (CNNs) form the basis of many state-of-the-art computer vision models. Despite their outstanding performance, the computational cost of inference in these CNN-based models is typically very high. This holds back applications on mobile platforms, such as autonomous vehicles, drones, or phones, where computational resources are limited, concurrent data-streams need to be processed, and low-latency prediction is critical.
To accelerate CNNs we can reduce their complexity before training, e.g., by decreasing the number of filters or network layers. This solution, however, may lead to sub-optimal results given that over-parametrization plays a critical role in the optimization of deep networks [7, 9]. Fortunately, other studies have found a complementary phenomenon: given a trained CNN, a large number of its filters are redundant and do not have a significant impact on the final prediction. Motivated by these two findings, much research has focused on accelerating CNNs using network pruning [11, 18, 22, 35, 38, 40, 47, 56]. Pruning can be applied at multiple levels, e.g., by removing independent filters [35, 40], groups of them [11, 18], or entire layers. Despite the encouraging results of these methods, their ability to provide a wide range of operating points along the trade-off between accuracy and computation is limited. The reason is that these approaches typically require training a separate model for each specific pruning level.
In this paper, we propose Convolutional Neural Mixture Models (CNMMs), which provide a novel perspective on network pruning. A CNMM defines a distribution over a large number of CNNs. The mixture is naturally pruned by removing networks with low probabilities, see Figure 1. Despite the appealing simplicity of this approach, it presents several challenges. First, learning a large ensemble of CNNs may require a prohibitive amount of computation. Second, even if many networks in the mixture are pruned, their independent evaluation during inference is likely to be less efficient than computing the output of a single large model.
In order to ensure tractability, we design a parameter-sharing scheme between different CNNs. This enables us to (i) jointly train all the networks, and (ii) efficiently compute an approximation of the mixture output without independently evaluating all the networks.
Image classification and semantic segmentation experiments show that CNMMs achieve an excellent trade-off between prediction accuracy and computational cost. Unlike most previous network pruning approaches, a single CNMM model achieves a wide range of operating points along this trade-off without any re-training.
2 Related work
Neural network ensembles. Learning ensembles of neural networks is a long-standing research topic. Seminal works explored different strategies to combine the outputs of different networks to obtain more accurate predictions [30, 46, 61]. Recently, the success of deep models has renewed interest in ensemble methods.
For this purpose, many approaches have been explored. For instance, [31, 62] used bagging and boosting to train multiple networks. Other works have considered learning diverse models by employing different parameter initializations, or by re-training a subset of layers. While these strategies are effective for learning diverse networks, their main limitation is the required training cost. In practice, training a deep model can take multiple days, and therefore large ensembles may have a prohibitive cost. To reduce the training time, it has been suggested [20, 39] to train a single network and to use parameters from multiple iterations of the optimization process to define the ensemble. Although this method is efficient during training, it does not reduce inference cost, since multiple networks must be evaluated independently at test time.
By relying on sampling, implicit ensembles allow all the individual components of the ensemble to be trained jointly, and approximate inference to be performed at test time. Bayesian neural networks (BNNs) fall in this paradigm and use a distribution over parameters, rather than a single point estimate [11, 28, 41, 47]. A sample from the parameter distribution can be considered as an individual network. Other works have implemented the notion of implicit ensembles by using dropout mechanisms. Dropping neurons can be regarded as sampling over a large ensemble of different networks. Moreover, scaling outputs during testing according to the dropout probability can be understood as an approximate inference mechanism. Motivated by this idea, different works have applied dropout over individual weights, network activations, or connections in multi-branch architectures [12, 32]. Interestingly, it has been observed that ResNets behave like an ensemble of models, where some residual connections can be removed without significantly reducing prediction accuracy. This idea was used by ResNets with stochastic depth, where different dropout probabilities are assigned to the residual connections.
Our proposed Convolutional Neural Mixture Model is an implicit ensemble defining a mixture distribution over an exponential number of CNNs. This allows us to use the learned probabilities to prune the model by removing non-relevant networks. Using a mixture of CNNs for model pruning is a novel approach, which contrasts with previous methods employing ensembles for other purposes such as boosting performance [8, 34], improving learning dynamics, or uncertainty estimation [24, 31].
Efficient inference in deep networks. A number of strategies have been developed to reduce the inference time of CNNs, including the design of efficient convolutional operators [16, 23, 58], knowledge distillation [4, 15], neural architecture search [14, 63], weight compression [43, 54], and quantization [25, 36]. Network pruning has emerged as one of the most effective frameworks for this purpose [11, 33, 35, 56]. Pruning methods aim to remove weights which do not have a significant impact on the network output. Among these methods, we can differentiate between two main strategies: online and offline pruning. In offline pruning, a network is first optimized for a given task using standard training. Subsequently, non-relevant weights are identified using different heuristics including their norm, similarity to other weights, or second order derivatives. The main advantage of this strategy is that it can be applied to any pre-trained network. However, these approaches require a costly process involving several prune/retrain cycles in order to recover the original network performance. Online approaches, on the other hand, perform pruning during network training. For example, sparsity inducing regularization can be used over individual weights [38, 41], groups of them [11, 18, 47], or over the connections in multi-branch architectures [1, 56]. These methods typically have a hyper-parameter, to be set before training, determining the trade-off between the final performance and the pruning ratio.
In contrast to previous approaches, we prune entire CNNs by removing the networks with the smallest probabilities in the mixture. This approach offers two main advantages. First, it does not require defining a hyper-parameter before training to determine the balance between the potential compression and the final performance. Second, the number of removed networks can be controlled after optimization. Therefore, a learned CNMM can be deployed at multiple operating points to trade off computation and prediction accuracy, for example across different devices with varying computational resources, or on the same device with different computational constraints depending on the load from other processes. The recently proposed Slimmable Neural Networks have also focused on adapting the accuracy-efficiency trade-off at run time. This is achieved by embedding a small set of CNNs with varying widths into a single model. Different from this approach, our CNMMs embed a large number of networks with different depths, which allows for a finer granularity to control the computational cost during pruning.
3 Convolutional Neural Mixture Models
Without loss of generality, we consider a CNN as a function mapping an RGB image to a tensor. In particular, we assume that this function is defined as a sequence of operations, Eq. (1), in which each feature map is computed from the previous feature map. We assume that each operation can be either the identity function, or a standard CNN block composed of different operations such as batch-normalization, convolution, activation functions, or spatial pooling. In this manner, the effective depth of the network, i.e., the number of non-identity layers, is at most the number of operations in the sequence.
The output tensor of the CNN is used to make predictions for a specific task. For example, in image classification, a linear classifier over the output tensor can be used in order to estimate the class probabilities for the entire image. For semantic segmentation the same linear classifier is used for each spatial position in the output tensor.
Given these definitions, a convolutional neural mixture model (CNMM) defines a distribution over the output tensor as a mixture, Eq. (2), over a finite set of CNNs. Each mixture component is a delta function centered on the output of one network, and the mixing weights define a distribution over the CNNs in this set.
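The symbols of Eq. (1) and Eq. (2) did not survive in this copy of the text. As a reading aid, one possible way to write the two definitions is sketched below. The notation ($x$ for the input image, $h_t$ for the feature maps, $f_t$ for the operations, $T$ for their number, $\mathcal{F}$ for the set of CNNs and $p(F)$ for the mixing weights) is an assumption made for this sketch and may differ from the authors' own.

```latex
% Eq. (1), assumed notation: a chain of T operations applied to the image x.
h_0 = x, \qquad h_t = f_t(h_{t-1}), \quad t = 1, \dots, T \tag{1}

% Eq. (2), assumed notation: a mixture of delta peaks, one per network F.
p(h_T \mid x) = \sum_{F \in \mathcal{F}} p(F)\, \delta\big(h_T - F(x)\big) \tag{2}
```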
3.1 Modelling a distribution over CNNs
We now define mixtures that contain a number of CNNs that is exponential in the maximum depth, in a way that allows us to manipulate these mixtures in a tractable manner.
Each component in the mixture is a chain-structured CNN uniquely characterised by a sequence of integers, where the sequences are constrained to be non-decreasing. This sequence determines the set of functions that are used in Eq. (1). In particular, given a sequence, the output of the corresponding network is computed as given in Eq. (3).
Depending on the sequence, each function used in Eq. (3) is either a convolutional block as described above with its own parameters, or an identity function that leaves the input unchanged.
By imposing an additional constraint on the sequences, there is a one-to-one mapping between sequences and the corresponding CNNs. (In particular, this constraint ensures that, e.g., a given network is uniquely encoded by the sequence ‘01444’, ruling out the alternative sequences ‘01144’ and ‘01114’.) See Figure 2 (Left). If multiple networks use the same function, these networks share their parameters on this function, which ensures that the total number of parameters of the mixture does not grow exponentially, although there are exponentially many mixture components. For instance, for a maximum depth of four, the mixture will be composed of the eight different networks illustrated in Figure 2 (Left). From the illustration it is easy to see that, in general, the mixture contains an exponential number of components with shared parameters.
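The counting argument above can be checked with a small brute-force enumeration. The sketch below is not taken from the paper: it assumes that a sequence s_0, ..., s_T starts at 0, ends at T, and that every intermediate entry s_t is either t or a copy of s_{t+1}. This assumed constraint is chosen because it reproduces the facts stated in the text (eight networks at depth four, ‘01444’ valid, ‘01144’ and ‘01114’ ruled out); the function names are hypothetical.

```python
from itertools import product


def is_valid(seq):
    """Check the assumed validity constraints on a sequence s_0..s_T."""
    T = len(seq) - 1
    # Assumption: the sequence starts at 0 and ends at the maximum depth T.
    if seq[0] != 0 or seq[T] != T:
        return False
    # Assumption: each intermediate entry is either its own index t
    # or a copy of the next entry s_{t+1}.
    return all(seq[t] in (t, seq[t + 1]) for t in range(1, T))


def valid_sequences(T):
    """Enumerate all valid sequences of length T + 1 by brute force."""
    return [s for s in product(range(T + 1), repeat=T + 1) if is_valid(s)]


seqs = valid_sequences(4)
print(len(seqs))  # 8 networks for a maximum depth of four
print(["".join(map(str, s)) for s in seqs])
```

Under this assumed constraint the number of valid sequences doubles with every additional level of depth, which matches the exponential growth described in the text.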
In order to define the probabilities for each network in the mixture, we define a distribution over sequences as a reversed Markov chain, Eq. (4). To ensure that sequences have positive probability if and only if they are valid, i.e., satisfy the constraints defined above, we fix the starting point of the reversed chain and define the conditional probabilities as in Eq. (5). As illustrated in Figure 2 (Right), these constraints define a binary tree generating the valid non-decreasing sequences. Each conditional probability is modelled by a Bernoulli distribution, indicating which of the two admissible values the previous number in the sequence takes.
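To make the reversed Markov chain concrete, the following sketch evaluates the probability of a sequence and checks that the valid sequences carry all the probability mass. It rests on assumptions that are not confirmed by the text above: the last entry is fixed to the maximum depth T, the first entry is fixed to 0, and a hypothetical table `pi[(t, j)]` stores the Bernoulli probability that s_t equals t given that s_{t+1} equals j.

```python
from itertools import product


def sequence_prob(seq, pi):
    """Probability of a sequence under the assumed reversed Markov chain.

    pi[(t, j)] is the assumed probability that s_t == t given s_{t+1} == j;
    with probability 1 - pi[(t, j)] the entry s_t copies j instead.
    """
    T = len(seq) - 1
    # Assumption: s_0 = 0 and s_T = T hold with probability one.
    if seq[0] != 0 or seq[T] != T:
        return 0.0
    prob = 1.0
    # Walk the chain backwards, from s_{T-1} down to s_1.
    for t in range(T - 1, 0, -1):
        nxt = seq[t + 1]
        if seq[t] == t:
            prob *= pi[(t, nxt)]
        elif seq[t] == nxt:
            prob *= 1.0 - pi[(t, nxt)]
        else:
            return 0.0  # invalid sequences receive zero probability
    return prob


T = 4
# Hypothetical Bernoulli parameters, one per pair (t, j) with 1 <= t < j <= T.
pi = {(t, j): t / (t + j) for t in range(1, T) for j in range(t + 1, T + 1)}
total = sum(sequence_prob(s, pi) for s in product(range(T + 1), repeat=T + 1))
print(sequence_prob((0, 1, 4, 4, 4), pi), total)
```

Each backward step makes one binary choice, which is the binary tree of Figure 2 (Right) in this assumed parametrization.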
3.2 Sampling outputs from CNMMs
The graphical model defined in Figure 3 shows that we can sample from the output distribution in Eq. (2) by first generating a sequence from the distribution over sequences and then evaluating the associated network with Eq. (3). In the following, we formulate an alternative strategy to sample from the model. This formulation offers two advantages. (i) It is amenable to continuous relaxation, which facilitates learning. (ii) It suggests an iterative algorithm to compute feature map expectations, which can be used instead of sampling for efficient inference.
The conditional gives the distribution over across the networks with . For example, consists of two weighted delta peaks, located at and , respectively. See Figure 2 (Left). These conditional distributions can be expressed as the forwards recurrence:
where is a delta function centered on . Therefore, unbiased samples from can be obtained through sample propagation. Recall from Eq. (5) that, given , there are only two possible values of that remain, namely and . As a consequence, the sum over in Eq. (3.2) only consists of two terms. Given this observation, samples can be obtained from samples as:
where for a given value of we sample from to compute a binary indicator , which signals whether the resulting is equal to or .
Using Eq. (7) we iteratively sample from these conditional distributions, one step of the sequence at a time, and at each step we compute one sample for every admissible value of the corresponding sequence element. An illustration of the algorithm is shown in Figure 4. The computational complexity of a complete pass in this iterative process is quadratic in the maximum depth, since at each step we compute a number of samples that is at most linear in the depth, each of which is computed in constant time from the samples already computed at the previous step. This is roughly equivalent to the cost of evaluating a single network with dense layer connectivity of the same depth, whose connections are implemented by the functions of Eq. (3).
Sampling outputs from networks of bounded depth. Using the described algorithm, correspond to output tensors sampled from the mixture defined in Eq. (2). Moreover, for any , samples from are output feature maps generated by networks with depth bounded by . For instance, in Figure 2, samples are generated with one of the networks coded by the sequences and .
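The sample propagation procedure can be illustrated with a toy implementation. This is a reconstruction under stated assumptions and not the authors' code: feature maps are scalars, each block f^{ij} is replaced by a hypothetical affine map (`make_block`), the sequence starts at 0, and `pi[(t, j)]` is the assumed probability that s_t equals t given s_{t+1} equals j.

```python
import random


def make_block(i, j):
    """Toy stand-in for the convolutional block f^{ij}: a scalar affine map."""
    return lambda h: 0.5 * h + (10 * i + j)


def sample_output(x, T, pi, rng):
    """Draw one output sample by propagating samples step by step.

    H[j] holds a sample of the feature map at the current step t,
    conditioned on the sequence element s_t being equal to j.
    """
    # Step 1: s_0 is assumed fixed to 0, so each h_1 comes from a block f^{0j}.
    H = {j: make_block(0, j)(x) for j in range(1, T + 1)}
    for t in range(2, T + 1):
        new_H = {}
        for j in range(t, T + 1):
            # b == 1: s_{t-1} == t-1, apply the block f^{t-1,j};
            # b == 0: s_{t-1} == j, identity, so the previous sample is copied.
            b = 1 if rng.random() < pi[(t - 1, j)] else 0
            new_H[j] = make_block(t - 1, j)(H[t - 1]) if b else H[j]
        H = new_H
    return H[T]


T = 3
pi = {(i, j): 0.5 for i in range(1, T) for j in range(i + 1, T + 1)}
rng = random.Random(0)
print([sample_output(2.0, T, pi, rng) for _ in range(5)])
```

Each step updates at most T values and the loop runs over T steps, which is where the quadratic cost comes from. In this sketch the intermediate value `H[t]` at step t plays the role of an output of a network whose depth is bounded by t.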
3.3 Training and inference
We use to collectively denote the parameters of the convolutional blocks and the parameters defining the mixing weights via Eq. (5). Moreover, the parameters of the classifier that predict the image label(s) from the output tensor are denoted as . Given a training set composed of images and labels , we optimize the parameters by minimizing
where is the cross-entropy loss comparing the label with the class probabilities computed from . In practice, we replace the expectation over in each training iteration with samples from .
Learning from subsets of networks. As discussed in Section 3.2, samples from the intermediate distributions correspond to outputs of CNNs in the mixture with bounded depth. In order to improve the performance of models with reduced inference time, we explicitly emphasize the loss for such efficient, relatively shallow networks. Therefore, we sum the above loss function over the outputs sampled from networks of increasing depth, Eq. (9), where we use a separate classifier for each depth bound. In practice, we balance each loss with a weight that increases linearly with the depth bound.
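A minimal sketch of such a depth-weighted loss is given below. The text only states that the weights increase linearly with depth, so the specific choice of weight t for the t-th classifier, and the helper names, are assumptions made for illustration.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one prediction given the index of the true class."""
    return -math.log(probs[label])


def multi_depth_loss(probs_per_depth, label):
    """Sum the per-classifier losses, weighting the t-th classifier by t.

    probs_per_depth[t - 1] holds the class probabilities predicted by the
    classifier attached to networks of depth at most t (assumed layout).
    """
    total = 0.0
    for t, probs in enumerate(probs_per_depth, start=1):
        total += t * cross_entropy(probs, label)
    return total


# Two classifiers (shallow, deep) on a two-class toy problem, true class 1.
print(multi_depth_loss([[0.5, 0.5], [0.25, 0.75]], label=1))
```

With linearly increasing weights the deepest classifier still dominates the objective, while the shallow classifiers receive a non-zero training signal.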
Relaxed binary variables with concrete distributions. The recurrence in Eq. (7) requires sampling from the conditional distributions defined in Eq. (5). This sampling step is not differentiable with respect to the Bernoulli parameters, which prevents gradient-based optimization of them. To address this limitation, we use a continuous relaxation by modelling these conditionals with binary “concrete” distributions. In this manner, we can use the re-parametrization trick [27, 48] to back-propagate gradients w.r.t. the samples in Eq. (7), and thus compute gradients for the Bernoulli parameters.
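For reference, a sample from a binary concrete (relaxed Bernoulli) distribution can be written as a deterministic function of uniform noise, which is what makes the re-parametrization trick applicable. The sketch below uses the standard logistic-noise form of this relaxation; the function name is hypothetical, and passing the uniform draw `u` explicitly is a choice made here to keep the example deterministic.

```python
import math
import random


def relaxed_bernoulli(pi, u, temperature):
    """Binary concrete sample for success probability pi and uniform noise u.

    Returns a value in (0, 1) that is a smooth function of pi, so gradients
    can flow through it. As the temperature goes to zero the value
    approaches a hard 0/1 Bernoulli sample.
    """
    logit = math.log(pi) - math.log(1.0 - pi)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + logistic_noise) / temperature))


rng = random.Random(0)
# Temperature 2 is the value reported in the optimization details.
print([round(relaxed_bernoulli(0.3, rng.random(), 2.0), 3) for _ in range(5)])
```

Thresholding the relaxed value at 0.5 gives 1 exactly when `u > 1 - pi`, so the hard decision is still a Bernoulli variable with probability `pi`.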
Efficient inference by expectation propagation. Once the CNMM is trained, the predictive distribution over labels is given by the expectation of the classifier output under the mixture. This expectation is intractable to compute exactly, contrary to our goal of efficient inference. A naive Monte-Carlo approximation still requires multiple evaluations of the full CNMM. Instead, we propose an alternative approximation that propagates expectations instead of samples in Eq. (7), i.e., we apply the classifier to an approximate expected output tensor, which is obtained by running the sampling algorithm with the sampled binary indicators replaced by their expectations.
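The difference between sampling and expectation propagation can be seen in a toy setting. The sketch below is an illustration under assumptions, not the authors' implementation: feature maps are scalars, blocks are hypothetical affine maps, and `pi[(t, j)]` is the assumed probability that sequence element s_t equals t given that s_{t+1} equals j. Each binary indicator is replaced by its expectation, so a single deterministic pass is enough.

```python
def make_block(i, j):
    """Toy stand-in for the convolutional block f^{ij}: a scalar affine map."""
    return lambda h: 0.5 * h + (10 * i + j)


def expected_output(x, T, pi):
    """Propagate expectations: indicators b are replaced by E[b] = pi."""
    # Step 1: s_0 is assumed fixed to 0, so each h_1 comes from a block f^{0j}.
    H = {j: make_block(0, j)(x) for j in range(1, T + 1)}
    for t in range(2, T + 1):
        new_H = {}
        for j in range(t, T + 1):
            p = pi[(t - 1, j)]
            # A block with zero probability never has to be evaluated.
            block_out = make_block(t - 1, j)(H[t - 1]) if p > 0.0 else 0.0
            new_H[j] = p * block_out + (1.0 - p) * H[j]
        H = new_H
    return H[T]


# Depth 2: mixes the two-block network (weight 0.25) with the one-block one.
print(expected_output(2.0, 2, {(1, 2): 0.25}))
```

For the affine toy blocks used here the propagated value coincides with the exact mixture mean; for real non-linear blocks it is only an approximation, as the text states.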
3.4 Accelerating CNMMs
CNMMs offer two complementary mechanisms in order to accelerate inference. We describe both in the following.
Evaluating intermediate classifiers. The different classifiers learned by minimizing Eq. (9) each operate over the outputs of a mixture of networks with bounded maximum depth. Therefore, at each iteration of the inference algorithm in Eq. (7) we can already output predictions based on the corresponding classifier. This strategy is related to the one employed in multi-scale dense networks (MSDNets), where “early-exit” classifiers are used to provide predictions at various points in time during the inference process.
Network pruning. A complementary strategy to accelerate CNMMs is to remove networks from the mixture. The computational cost of the inference process is dominated by the evaluation of the CNN blocks in Eq. (7). However, a block does not need to be computed when the binary variable that selects it is zero. Therefore, a natural approach to prune CNMMs is to set certain selection probabilities to zero, removing all the CNNs from the mixture that use the corresponding function. We use the learned distribution in order to remove networks with a low probability. Note that the pairwise marginal over two consecutive sequence elements is exactly the sum of the probabilities of all the networks involving the corresponding function. Using this observation, we use an iterative pruning algorithm where, at each step, we compute all the pairwise marginals. We then set to zero the selection probability associated with the smallest marginal. Finally, the marginals are updated, and we iterate.
In this manner, we achieve different pruning levels by progressively removing convolutional blocks that will not be evaluated during inference. This process does not require any re-training of the model, which allows different pruning ratios to be set dynamically. Note that this process is complementary to the use of intermediate classifiers, as discussed above. The reason for this is that our pruning strategy may be used to remove functions for any “early” prediction step. Finally, it is interesting to observe that the proposed pruning mechanism can be regarded as a form of neural architecture search [37, 63], where the optimal network connectivity for a given pruning ratio is automatically discovered by taking into account the learned probabilities.
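The iterative pruning loop can be sketched as follows. This is an illustrative reconstruction that rests on several assumptions: `pi[(i, j)]` is the assumed probability of selecting block f^{ij} (that is, s_i equals i given s_{i+1} equals j), the chain ends at the maximum depth T with probability one, only blocks with i >= 1 are candidates for removal, and the block with the smallest pairwise marginal is removed first. All function names are hypothetical.

```python
def state_marginals(T, pi):
    """Marginals p(s_t == j), computed backwards from the fixed end s_T == T."""
    m = {T: {T: 1.0}}
    for t in range(T - 1, 0, -1):
        m[t] = {j: 0.0 for j in range(t, T + 1)}
        for j, p_j in m[t + 1].items():
            m[t][t] += pi[(t, j)] * p_j          # s_t takes its own index t
            m[t][j] += (1.0 - pi[(t, j)]) * p_j  # s_t copies s_{t+1} == j
    return m


def block_marginals(T, pi):
    """Pairwise marginal p(s_i == i, s_{i+1} == j) for every block f^{ij}."""
    m = state_marginals(T, pi)
    return {(i, j): pi[(i, j)] * m[i + 1][j] for (i, j) in pi}


def prune(T, pi, n_blocks):
    """Greedily remove n_blocks blocks, smallest pairwise marginal first."""
    pi = dict(pi)  # work on a copy so the learned probabilities are kept
    removed = []
    for _ in range(n_blocks):
        marginals = block_marginals(T, pi)
        alive = {k: v for k, v in marginals.items() if pi[k] > 0.0}
        if not alive:
            break
        weakest = min(alive, key=alive.get)
        pi[weakest] = 0.0  # networks using this block leave the mixture
        removed.append(weakest)
    return pi, removed


pi = {(1, 2): 0.9, (1, 3): 0.5, (2, 3): 0.8}
print(prune(3, pi, n_blocks=2)[1])
```

Because the marginals are recomputed after every removal, the order of removal adapts to the blocks already pruned, and no re-training step appears anywhere in the loop.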
4 Experiments
We perform experiments on two different tasks: image classification and semantic segmentation. Following previous work, we measure the computational cost in terms of the number of floating point multiply and addition operations (FLOPs) required for inference. The number of FLOPs provides a metric that correlates very well with the actual inference wall-time, while being independent of the implementation and hardware used for evaluation.
4.1 Datasets and experimental setup
CIFAR-10/100 datasets. These datasets are composed of 50k train and 10k test images with a resolution of 32×32 pixels. The goal is to classify each image across 10 or 100 classes, respectively. Images are normalized using the means and standard deviations of the RGB channels. We apply standard data augmentation operations: (i) zero padding followed by 32×32 cropping, and (ii) random horizontal flipping. Performance is evaluated in terms of the mean accuracy across classes.
CityScapes dataset. This dataset contains 1024×2048 pixel images of urban scenes with pixel-level labels across 19 classes. The dataset is split into training, validation and test sets with 2,975, 500 and 1,525 samples, respectively. The ground-truth annotations for the test set are not public, and we use the validation set instead for evaluation. To assess performance we use the standard mean intersection-over-union (mIoU) metric. Following the setup of prior work, we down-sample the images by a factor of two before processing them. As a data augmentation strategy during training, we apply random horizontal flipping and resizing with a scaling factor between 0.75 and 1.1. Finally, we use random crops of 384×768 pixels from the down-sampled images.
Base architecture. As discussed in Section 3.2, the learning and inference algorithms for CNMMs can be implemented using a network with dense layer connectivity. Based on this observation, we use an architecture similar to MSDNets. Specifically, we define a set of blocks, each composed of a set of feature maps; see Figure 5.
The initial feature map in each block has a fixed number of channels and, at each subsequent feature map in the block, the spatial resolution is reduced by a factor of two in each dimension, and the number of channels is doubled. Feature maps are connected by functions if the output feature map has the same or half the resolution of the input feature map. Finally, we consider the output tensor to have different connectivity and spatial resolution depending on the task.
Implementation for image classification. We implement the convolutional layers as the set of operations (BN-ReLU-DConv-BN-ReLU-Conv-BN), where BN refers to batch normalization, DConv is a depth-wise separable convolution, and Conv is a convolution. In order to reduce computation, the different functions applied to a given input tensor share the parameters of the initial operations (BN-ReLU-DConv). Moreover, when the resolution of the feature map is reduced, we use average pooling after these three initial operations. In all our experiments, the number of initial channels is fixed; this is achieved by applying a convolution over the input image followed by a convolutional block. Finally, all the tensors with the lowest spatial resolution are connected to the output. Concretely, the output is a vector obtained by applying the operations (BN-ReLU-GP-FC-BN) to the input tensors, where GP refers to global average pooling, and FC corresponds to a fully-connected layer. The classifier maps this vector linearly to a vector of dimension equal to the number of classes. When using Eq. (9) to train the CNMM, we connect a classifier to the end of each block.
Implementation for semantic segmentation. We use the same setup as for image classification, but replace the ReLU activations with parametric ReLUs. Moreover, we use max instead of average pooling to reduce the spatial resolution. The input tensor has a resolution four times lower than the original images. This is achieved by applying a strided convolution to the input and then using a (BN-ReLU-Conv) block followed by max pooling. The output tensor receives connections from all the previous feature maps and has the same number of channels and spatial resolution as the input tensor. Given that the input feature maps are at different scales, we apply a (BN-PReLU-Conv-BN) block over each input tensor and use bilinear up-sampling with different scaling factors in order to recover the original spatial resolution. The final classifier computing the class probabilities is defined as a block (UP-BN-PReLU-Conv-UP-BN-PReLU-DConv), where UP refers to bilinear up-sampling, which allows the original image resolution to be recovered. The numbers of output channels of the first and second convolutions in the block are set as a function of the number of classes. As in image classification, we use an intermediate classifier at each step where a full block of computation is finished.
Optimization details. In our experiments, we use SGD with momentum and weight decay, with the initial learning rate set to 0.1. On CIFAR, we use a cosine annealing schedule to accelerate convergence. On Cityscapes, on the other hand, we employ a cyclic schedule with warm restarts. The temperature of the concrete distributions modelling the conditional probabilities of Eq. (5) is set to 2. We train our models for 300 and 200 epochs, with batch sizes of 64 and 6, on CIFAR-10/100 and Cityscapes respectively. For the cyclic scheduler, the learning rate is divided by two at predefined epochs. Additionally, the models trained on Cityscapes are fine-tuned for 10 epochs using random crops of size 512×1024 instead of 384×768.
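As a reminder of what a cosine annealing schedule looks like, the sketch below computes the learning rate per epoch from the reported initial value of 0.1 and the 300 CIFAR epochs. Annealing down to zero and updating once per epoch are assumptions made here; the text does not specify these details, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, initial_lr=0.1, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine annealing cycle."""
    progress = epoch / total_epochs  # goes from 0 to 1 over training
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_annealing_lr(epoch, 300), 4))
```

The rate starts at its initial value, passes through half of it at mid-training, and decays smoothly towards the minimum at the last epoch.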
4.2 Pruning and intermediate classifiers
We evaluate the proposed pruning and intermediate classifier strategies to reduce the inference time of trained CNMMs. For both CIFAR-10/100 and Cityscapes we learn a CNMM composed of multiple blocks, each with several scales. For each dataset, we train one model that uses a single classifier, optimized using Eq. (8). In addition, we train a second model with intermediate classifiers, minimizing the loss function in Eq. (9). In the following, we will refer to the first and second variant as CNMM-single and CNMM respectively.
In Figure 6 we report prediction accuracy vs. FLOPs for inference. Each model is represented as a curve, traced by pruning the model to various degrees. Across the three datasets, the CNMM model with intermediate classifiers achieves higher accuracy in fast inference settings than the CNMM-single model. Recall that all the operating points across the different CNMM curves are obtained from a single trained model. Therefore, this single model can realize the upper envelope of the performance curves. As expected, the maximum performance of the intermediate classifiers increases with the step number. The accuracy of CNMM at the final step is comparable to the level obtained by the CNMM-single model: slightly worse on CIFAR-10, and slightly better on CIFAR-100 and CityScapes. This is because the minimized intermediate losses provide an additional supervisory signal, which is particularly useful to encourage accurate prediction for shallow, but fast, CNNs. In conclusion, the CNMM model with intermediate classifiers is to be preferred, since it provides a better trade-off between accuracy and computation over a wider range of FLOP counts.
By analysing the operating points along each curve, we can observe the effectiveness of the proposed pruning algorithm. For the CIFAR datasets we can reduce the FLOP count by a factor of two without significant loss in accuracy. For CityScapes, about 25% pruning can be achieved without a significant loss. In general, if several exit points can achieve the same FLOP count by applying varying amounts of pruning, the best performance is obtained by pruning less for an earlier classifier, rather than pruning more for a later exit.
4.3 Comparison with the state of the art
Image classification. We compare our model with different state-of-the-art CNN acceleration strategies [17, 18, 22, 38, 56]. We consider methods applying pruning at different levels, such as independent filters (Network slimming), groups of weights (CondenseNet), connections in multi-branch architectures (SuperNet), or a combination of them (SSS). We also compare our method with any-time models based on early-exit classifiers (MSDNet). Among previous state-of-the-art methods, the compared approaches have shown the best performance among efficient inference methods with 200 million FLOPs. We compare to CNMMs using 6 and 12 blocks, using three scales in both cases.
The results in Figure 7 (left, center) show that CNMMs achieve a similar or better accuracy-compute trade-off than all the compared methods across a broad range of FLOP counts on the CIFAR datasets. Only CondenseNets show somewhat better performance for medium FLOP counts. Moreover, note that the different operating points shown for the compared methods (except for MSDNets) are obtained by using different models trained independently, e.g., by different settings of a hyper-parameter controlling the pruning ratio. In contrast, CNMM embeds a large number of operating points in a single model. This feature is interesting when the available computational budget can change dynamically, based on concurrent processes, or when the model is deployed across a wide range of devices. In these scenarios, a single CNMM can be accelerated on-the-fly depending on the available resources. Note that a single MSDNet is also able to provide early predictions by using intermediate classifiers. However, our CNMM provides better performance for a given FLOP count and allows for a finer granularity to control the computational cost.
Semantic segmentation. State-of-the-art methods for real-time semantic segmentation have mainly focused on the manual design of efficient network architectures. By employing highly optimized convolutional modules, ESPNet and ESPNetv2 have achieved impressive accuracy-computation trade-offs. Other methods, such as [5, 59], offer higher accuracy but at several orders of magnitude higher inference cost, limiting their application in resource constrained scenarios.
The results originally reported for ESPNetv2 are obtained by using a model pre-trained on ImageNet. For a fair comparison with our CNMMs, we have trained ESPNetv2 from scratch by using the code provided by the authors (https://github.com/sacmehta/EdgeNets). As can be observed, CNMM provides a better trade-off compared to ESPNet. In particular, a full CNMM without pruning obtains an improvement of 0.5 points of mIoU, while reducing the FLOP count by 45%. Moreover, an accelerated CNMM achieves a similar performance compared to the most efficient ESPNet, which needs more than twice as many FLOPs. On the other hand, ESPNetv2 gives slightly better trade-offs compared to our CNMMs. However, this model relies on an efficient inception-like module that also includes group point-wise and dilated convolutions. These are orthogonal design choices that can be integrated in our model as well, and we expect them to further improve our results. Additionally, the different operating points in ESPNet and ESPNetv2 are achieved using different models trained independently. Therefore, unlike our approach, these methods do not allow for fine-grained control over the accuracy-computation trade-off, and multiple models need to be trained. Figure 8 shows qualitative results using different operating points from a single CNMM.
5 Conclusion
We proposed to address model pruning by using Convolutional Neural Mixture Models (CNMMs), a novel probabilistic framework that embeds a mixture of an exponential number of CNNs. In order to make training and inference tractable, we rely on massive parameter sharing across the models, and use concrete distributions to differentiate through the discrete sampling of mixture components. To achieve efficient inference in CNMMs we use an early-exit mechanism that allows prediction after evaluating only a subset of the networks. In addition, we use a pruning algorithm to remove CNNs that have low mixing probabilities. Our experiments on image classification and semantic segmentation tasks show that CNMMs achieve excellent trade-offs between prediction accuracy and computational cost. Unlike most previous work, a single CNMM model allows for a large number and wide range of accuracy-compute trade-offs, without any re-training.
This work is supported by ANR grants ANR-16-CE23-0006 and ANR-11-LABX-0025-01.
-  (2018) Maskconnect: connectivity learning by gradient descent. In ECCV, Cited by: §2.
-  (2013) Understanding dropout. In NeurIPS, Cited by: §2.
-  (1996) Bagging predictors. Machine learning. Cited by: §2.
-  (2017) Learning efficient object detection models with knowledge distillation. In NeurIPS, Cited by: §2.
-  (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI. Cited by: §4.3.
The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.1.
-  (2019) Gradient descent provably optimizes over-parameterized neural networks. ICLR. Cited by: §1.
-  (2018) Coupled ensembles of neural networks. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI), Cited by: §2.
-  (2018) The lottery ticket hypothesis: training pruned neural networks. ICLR. Cited by: §1.
-  (2017) Concrete dropout. In NeurIPS, Cited by: §2.
-  (2018) Structured variational learning of bayesian neural networks with horseshoe priors. ICML. Cited by: §1, §2, §2.
-  (2017) Branchout: regularization for online ensemble tracking with convolutional neural networks. In CVPR, Cited by: §2.
-  (2016) Identity mappings in deep residual networks. In ECCV, Cited by: §2.
-  (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, Cited by: §2.
Distilling the knowledge in a neural network.
NIPS Deep Learning and Representation Learning Workshop. Cited by: §2.
-  (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2, §4.1.
-  (2018) Multi-scale dense convolutional networks for efficient prediction. ICLR. Cited by: §A.1, §3.4, §4.1, §4.3.
-  (2018) Condensenet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §1, §2, §4.3.
-  (2017) Densely connected convolutional networks. In CVPR, Cited by: §3.2, §4.1.
-  (2017) Snapshot ensembles: train 1, get m for free. ICLR. Cited by: §2.
-  (2016) Deep networks with stochastic depth. In ECCV, Cited by: §2, §2.
-  (2018) Data-driven sparse structure selection for deep neural networks. In ECCV, Cited by: §1, §4.3.
-  (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.
-  (2018) Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, Cited by: §2.
-  (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, Cited by: §2.
-  (2014) Speeding up convolutional neural networks with low rank expansions. BMVC. Cited by: §1.
-  (2014) Auto-encoding variational Bayes. In ICLR, Cited by: §3.3.
-  (2015) Variational dropout and the local reparameterization trick. In NeurIPS, Cited by: §2.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
Neural network ensembles, cross validation, and active learning. In NeurIPS, Cited by: §2.
-  (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, Cited by: §2, §2.
-  (2017) Fractalnet: ultra-deep neural networks without residuals. ICLR. Cited by: §2.
-  (1990) Optimal brain damage. In NeurIPS, Cited by: §2.
-  (2015) Why m heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314. Cited by: §2, §2.
-  (2017) Pruning filters for efficient convnets. ICLR. Cited by: §1, §2.
-  (2016) Fixed point quantization of deep convolutional networks. In ICML, Cited by: §2.
-  (2019) Darts: differentiable architecture search. ICLR. Cited by: §3.4.
-  (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §1, §2, §4.3.
Sgdr: stochastic gradient descent with warm restarts. ICLR. Cited by: §2.
-  (2018) Learning sparse neural networks through regularization. ICLR. Cited by: §1.
-  (2017) Multiplicative normalizing flows for variational bayesian neural networks. In ICML, Cited by: §2, §2.
The concrete distribution: a continuous relaxation of discrete random variables. ICLR. Cited by: §3.3.
-  (2017) Domain-adaptive deep network compression. In ICCV, Cited by: §2.
-  (2018) Espnet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, Cited by: §4.1, §4.3.
-  (2019) ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network. CVPR. Cited by: §4.1, §4.1, §4.3, §4.3.
-  (1997) Optimal ensemble averaging of neural networks. Network: Computation in Neural Systems. Cited by: §2.
-  (2017) Structured bayesian pruning via log-normal multiplicative noise. In NeurIPS, Cited by: §1, §2, §2.
Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §3.3.
-  (2003) The boosting approach to machine learning: an overview. In Nonlinear estimation and classification, Cited by: §2.
-  (2016) Swapout: learning an ensemble of deep architectures. In NeurIPS, Cited by: §2.
-  (2015) Data-free parameter pruning for deep neural networks. BMVC. Cited by: §2.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §2.
-  (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §4.3.
-  (2016) Convolutional neural networks with low-rank regularization. ICLR. Cited by: §2.
-  (2016) Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, Cited by: §2.
-  (2018) Learning time/memory-efficient deep architectures with budgeted super networks. In CVPR, Cited by: §1, §2, §4.3.
-  (2019) Slimmable neural networks. ICLR. Cited by: §2.
-  (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: §2.
-  (2017) Pyramid scene parsing network. In CVPR, Cited by: §4.3.
-  (2018) Retraining: a simple way to improve the ensemble accuracy of deep neural networks for image classification. In ICPR, Cited by: §2.
-  (2002) Ensembling neural networks: many could be better than all. Artificial intelligence. Cited by: §2.
-  (2019) Binary ensemble neural network: more bits per network or more networks per bit?. CVPR. Cited by: §2.
-  (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §2, §3.4.
Appendix A Supplementary material
We provide further results on CIFAR100 in order to show the importance of all components of our proposed CNMMs. Moreover, we provide additional qualitative results of semantic segmentation on the CityScapes dataset.
A.1 Ablation study of CNMM
Using sampling during training. During learning, CNMMs generate a set of samples using Eq. (7). During inference, in contrast, we use the corresponding expectations. To evaluate the importance of sampling during learning, we optimized a CNMM using these expectations instead of samples. Figure 9 shows the results of this approach, denoted “Training with expectations”. Compared to the CNMM trained with sampling, its accuracy decreases faster as the pruning ratio increases. We attribute this to the fact that our sampling procedure can be regarded as a continuous relaxation of dropout, where a subset of functions is randomly removed when computing the output tensor. As a consequence, the learned model is more robust to the pruning process, in which some of the convolutional blocks are removed during inference. This is not the case when deterministic expectations are used in Eq. (7) rather than samples.
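The difference between sampling and using expectations can be sketched with a binary Concrete (relaxed Bernoulli) gate. This is a minimal sketch under stated assumptions: the convex combination of two feature maps in `gate_features`, the probability `pi`, and the temperature value are hypothetical and do not reproduce Eq. (7).

```python
import math
import random


def stable_sigmoid(x):
    """Logistic function that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli_sample(pi, temperature, rng):
    """Draw one binary Concrete sample in [0, 1] with success probability pi."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    # Clamp the uniform draw so both logarithms stay finite.
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    logit = math.log(pi) - math.log(1.0 - pi)
    noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + noise) / temperature)


def gate_features(h_a, h_b, weight):
    """Hypothetical gate: convex combination of two feature vectors."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(h_a, h_b)]


rng = random.Random(0)
h_a, h_b, pi = [1.0, 2.0], [3.0, 4.0], 0.7

# Training-style forward pass: a fresh relaxed sample acts as a soft dropout mask.
train_out = gate_features(h_a, h_b, relaxed_bernoulli_sample(pi, 0.5, rng))

# Inference-style forward pass: the sample is replaced by its mean, pi.
test_out = gate_features(h_a, h_b, pi)
```

At low temperatures the samples concentrate near 0 and 1, so one of the two inputs is almost switched off in each training pass, which is the sense in which the relaxation resembles dropout.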
Comparison with a deterministic model. We compare the performance of our CNMM with a deterministic variant using the same architecture. Concretely, in Eq. (7) we ignore the samples and simply sum the feature maps it combines. Note that the resulting model is analogous to an MSDNet using early-exit classifiers. We report the results in Figure 9, denoted “Deterministic with early-exits”. Our CNMM obtains better performance than its deterministic counterpart. Moreover, as with MSDNets, the deterministic model can only be accelerated through its early exits. In contrast, the complementary pruning algorithm available in CNMMs allows finer-grained control of the computational cost.
Expectation approximation during inference. To validate the expectation approximation we use during inference, we evaluate the performance obtained with a Monte-Carlo procedure for the same purpose. In particular, we generate samples from the output distribution, compute the class probabilities for each sample, and average them. Table 1 shows the results obtained by varying the number of samples. Our approach performs similarly to the Monte-Carlo approximation when the latter uses few samples. With a higher number of samples, we observe slight improvements in the results. However, a Monte-Carlo approximation is very inefficient, since it requires one independent evaluation of the model per sample. In particular, the last row in Table 1 is 30 times more costly to obtain than the first two rows. The minimal gain obtained with more samples could probably be obtained more efficiently by using a larger model.
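The cost argument can be made concrete with a small sketch that contrasts Monte-Carlo averaging of class probabilities with a single pass at the expected output. The toy model below (two fixed logit vectors selected by a Bernoulli draw with probability `PI`) and all names are assumptions for illustration only; they are not the CNMM output distribution.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def monte_carlo_probs(sample_logits, num_samples, rng):
    """Average class probabilities over num_samples stochastic evaluations."""
    acc = None
    for _ in range(num_samples):
        probs = softmax(sample_logits(rng))  # one full model evaluation
        acc = probs if acc is None else [a + p for a, p in zip(acc, probs)]
    return [a / num_samples for a in acc]


def expectation_probs(expected_logits):
    """Single-pass alternative: class probabilities at the expected output."""
    return softmax(expected_logits)


# Hypothetical stochastic model: pick one of two logit vectors.
LOGITS_A = [2.0, 0.0, -1.0]
LOGITS_B = [0.5, 1.0, 0.0]
PI = 0.7


def toy_sample_logits(rng):
    return LOGITS_A if rng.random() < PI else LOGITS_B


toy_expected_logits = [PI * a + (1.0 - PI) * b for a, b in zip(LOGITS_A, LOGITS_B)]

mc = monte_carlo_probs(toy_sample_logits, 30, random.Random(0))  # 30 evaluations
single = expectation_probs(toy_expected_logits)                  # 1 evaluation
```

In this sketch the Monte-Carlo estimate calls the model once per sample, so 30 samples cost 30 evaluations, whereas the expectation-based prediction costs one.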
A.2 Additional Qualitative Results
In Figure 10 we provide additional qualitative results for semantic segmentation obtained by a single trained CNMM, using various operating points with different numbers of FLOPs during inference.