Adaptive Inference Cost With Convolutional Neural Mixture Models

08/19/2019 · by Adria Ruiz, et al.

Despite the outstanding performance of convolutional neural networks (CNNs) on many vision tasks, their computational cost during inference is problematic when resources are limited. In this context, we propose Convolutional Neural Mixture Models (CNMMs), a probabilistic model embedding a large number of CNNs that can be jointly trained and evaluated in an efficient manner. Within the proposed framework, we present different mechanisms to prune subsets of CNNs from the mixture, making it easy to adapt the computational cost required for inference. Image classification and semantic segmentation experiments show that our method achieves excellent accuracy-compute trade-offs. Moreover, unlike most previous approaches, a single CNMM provides a large range of operating points along this trade-off, without any re-training.







1 Introduction

Convolutional neural networks (CNNs) form the basis of many state-of-the-art computer vision models. Despite their outstanding performance, the computational cost of inference in these CNN-based models is typically very high. This holds back applications on mobile platforms, such as autonomous vehicles, drones, or phones, where computational resources are limited, concurrent data-streams need to be processed, and low-latency prediction is critical.

Figure 1: A Convolutional Neural Mixture Model embeds a large number of CNNs. Weight sharing enables efficient joint training of all networks and computation of the mixture output. The learned mixing weights can be used to remove networks from the mixture, and thus reduce the computational cost of inference.

To accelerate CNNs we can reduce their complexity before training, e.g. by decreasing the number of filters or network layers. This solution, however, may lead to sub-optimal results given that over-parametrization plays a critical role in the optimization of deep networks [7, 9]. Fortunately, other studies have found a complementary phenomenon: given a trained CNN, a large number of its filters are redundant and do not have a significant impact on the final prediction [26]. Motivated by these two findings, much research has focused on accelerating CNNs using network pruning [11, 18, 22, 35, 38, 40, 47, 56]. Pruning can be applied at multiple levels, e.g. by removing individual filters [35, 40], groups of them [11, 18], or entire layers [56]. Despite the encouraging results of these methods, their ability to provide a wide range of operating points along the trade-off between accuracy and computation is limited. The reason is that these approaches typically require training a separate model for each specific pruning level.

In this paper, we propose Convolutional Neural Mixture Models (CNMMs), which provide a novel perspective on network pruning. A CNMM defines a distribution over a large number of CNNs. The mixture is naturally pruned by removing networks with low probabilities, see Figure 1. Despite the appealing simplicity of this approach, it presents several challenges. First, learning a large ensemble of CNNs may require a prohibitive amount of computation. Second, even if many networks in the mixture are pruned, their independent evaluation during inference is likely to be less efficient than computing the output of a single large model.

In order to ensure tractability, we design a parameter-sharing scheme between different CNNs. This enables us to (i) jointly train all the networks, and (ii) efficiently compute an approximation of the mixture output without independently evaluating all the networks.

Image classification and semantic segmentation experiments show that CNMMs achieve an excellent trade-off between prediction accuracy and computational cost. Unlike most previous network pruning approaches, a single CNMM model achieves a wide range of operating points along this trade-off without any re-training.

2 Related work

Neural network ensembles.  Learning ensembles of neural networks is a long-standing research topic. Seminal works explored different strategies to combine the outputs of different networks to obtain more accurate predictions [30, 46, 61]. Recently, the success of deep models has renewed interest in ensemble methods.

For this purpose, many approaches have been explored. For instance, [31, 62] used bagging [3] and boosting [49] to train multiple networks. Other works have considered learning diverse models by employing different parameter initializations [34], or re-training a subset of layers [60]. While these strategies are effective for learning diverse networks, their main limitation is the required training cost. In practice, training a deep model can take multiple days, and therefore large ensembles may have a prohibitive cost. To reduce the training time, it has been suggested [20, 39] to train a single network and to use parameters from multiple iterations of the optimization process to define the ensemble. Despite the efficiency of this method during training, this approach does not reduce inference cost, since multiple networks must be evaluated independently at test time.

An alternative strategy that allows efficient training and inference is to use implicit ensembles [11, 21, 32, 47]. By relying on sampling, these methods make it possible to jointly train all the individual components in the ensemble and to perform approximate inference during testing. Bayesian neural networks (BNNs) fall in this paradigm and use a distribution over parameters, rather than a single point estimate [11, 28, 41, 47]. A sample from the parameter distribution can be considered as an individual network. Other works have implemented the notion of implicit ensembles by using dropout [52] mechanisms. Dropping neurons can be regarded as sampling over a large ensemble of different networks [2]. Moreover, scaling outputs during testing according to the dropout probability can be understood as an approximate inference mechanism. Motivated by this idea, different works have applied dropout over individual weights [10], network activations [50], or connections in multi-branch architectures [12, 32]. Interestingly, it has been observed that ResNets [13] behave like an ensemble of models, where some residual connections can be removed without significantly reducing prediction accuracy [55]. This idea was used by ResNets with stochastic depth [21], where different dropout probabilities are assigned to the residual connections.

Figure 2: (Left) Illustration of how a large collection of CNNs is represented in a CNMM. Each network is uniquely identified by a non-decreasing sequence s = (s_0, s_1, …, s_D), containing numbers from 0 to D. Consecutive entries in the sequence determine the functions applied to compute the CNN output. In this manner, sequences with common sub-sequences share functions and their parameters in the corresponding networks. (Right) Illustration of the distribution p(s) that defines the mixing weights over the CNN models, here with D = 4. Each conditional p(s_{t-1} | s_t) is a Bernoulli distribution on whether s_{t-1} equals t-1 or s_t. This defines a binary tree generating all the valid sequences.

Our proposed Convolutional Neural Mixture Model is an implicit ensemble defining a mixture distribution over an exponential number of CNNs. This makes it possible to use the learned probabilities to prune the model by removing non-relevant networks. Using a mixture of CNNs for model pruning is a novel approach, which contrasts with previous methods employing ensembles for other purposes such as boosting performance [8, 34], improving learning dynamics [21], or uncertainty estimation [24, 31].

Efficient inference in deep networks.  A number of strategies have been developed to reduce the inference time of CNNs, including the design of efficient convolutional operators [16, 23, 58], knowledge distillation [4, 15], neural architecture search [14, 63], weight compression [43, 54], and quantization [25, 36]. Network pruning has emerged as one of the most effective frameworks for this purpose [11, 33, 35, 56]. Pruning methods aim to remove weights which do not have a significant impact on the network output. Among these methods, we can differentiate between two main strategies: online and offline pruning. In offline pruning, a network is first optimized for a given task using standard training. Subsequently, non-relevant weights are identified using different heuristics, including their norm [35], similarity to other weights [51], or second-order derivatives [33]. The main advantage of this strategy is that it can be applied to any pre-trained network. However, these approaches require a costly process involving several prune/retrain cycles in order to recover the original network performance. Online approaches, on the other hand, perform pruning during network training. For example, sparsity-inducing regularization can be used over individual weights [38, 41], groups of them [11, 18, 47], or over the connections in multi-branch architectures [1, 56]. These methods typically have a hyper-parameter, to be set before training, determining the trade-off between the final performance and the pruning ratio.

In contrast to previous approaches, we prune entire CNNs by removing the networks with the smallest probabilities in the mixture. This approach offers two main advantages. First, it does not require setting a hyper-parameter before training to determine the balance between the potential compression and the final performance. Second, the number of removed networks can be controlled after optimization. Therefore, a learned CNMM can be deployed at multiple operating points to trade off computation and prediction accuracy, for example across different devices with varying computational resources, or on the same device under different computational constraints depending on the load from concurrent processes. The recently proposed Slimmable Neural Networks [57] also focus on adapting the accuracy-efficiency trade-off at run time. This is achieved by embedding a small set of CNNs with varying widths into a single model. In contrast to this approach, our CNMMs embed a large number of networks with different depths, which allows for a finer granularity to control the computational cost during pruning.

3 Convolutional Neural Mixture Models

Without loss of generality, we consider a CNN as a function mapping an RGB image x to a tensor h_D. In particular, we assume that the network output is defined as a sequence of operations:

h_d = f_d(h_{d-1}),   d = 1, …, D,     (1)

where h_d is computed from the previous feature map h_{d-1}, and h_0 is obtained from the input image x. We assume that the functions f_d can be either the identity function, or a standard CNN block composed of different operations such as batch-normalization, convolution, activation functions, or spatial pooling. In this manner, the effective depth of the network, i.e. the number of non-identity layers f_d, is at most D.

The output tensor h_D of the CNN is used to make predictions for a specific task. For example, in image classification, a linear classifier over h_D can be used in order to estimate the class probabilities for the entire image. For semantic segmentation, the same linear classifier is used for each spatial position in h_D.

Given these definitions, a convolutional neural mixture model (CNMM) defines a distribution over the output y as:

p(y | x) = Σ_{c ∈ C} π_c δ_{y_c}(y),     (2)

where C is a finite set of CNNs, δ_{y_c} is a delta function centered on the output y_c of each network c, and π defines the mixing weights over the CNNs in C.

3.1 Modelling a distribution over CNNs

We now define mixtures that contain a number of CNNs that is exponential in the maximum depth D, in a way that allows us to manipulate these mixtures in a tractable manner.

Each component in the mixture is a chain-structured CNN uniquely characterised by a sequence s = (s_0, s_1, …, s_D) of length D+1, where the sequences are constrained to be a non-decreasing set of integers from 0 to D, i.e. with s_{t-1} ≤ s_t, s_0 = 0, and s_D = D. This sequence determines the set of functions that are used in Eq. (1). In particular, given a sequence s, the output of the corresponding network is computed as:

y_s = (f_{s_{D-1} s_D} ∘ ⋯ ∘ f_{s_0 s_1})(x).     (3)

For s_{t-1} < s_t the function f_{s_{t-1} s_t} is a convolutional block as described above with its own parameters, while the functions f_{tt} are identity functions that leave the input unchanged.

By imposing s_{t-1} ∈ {t-1, s_t}, there is a one-to-one mapping between sequences and the corresponding CNNs: this constraint ensures that, e.g., the network f_{14} ∘ f_{01} is uniquely encoded by the sequence '01444', ruling out the alternative sequences '01144' and '01114'. See Figure 2 (Left). If multiple networks use the same function f_{dt}, these networks share their parameters on this function, which ensures that the total number of parameters of the mixture does not grow exponentially, although there are exponentially many mixture components. For instance, for D = 4, the mixture is composed of the eight different networks illustrated in Figure 2 (Left). From the illustration it is easy to see that, in general, the mixture contains 2^{D-1} components with shared parameters.
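The sequence encoding can be checked with a short script. The following is an illustrative sketch, not code from the paper: it assumes the validity constraint s_{t-1} ∈ {t-1, s_t} together with s_0 = 0 and s_D = D, as described above, and enumerates all valid sequences.

```python
def valid_sequences(D):
    """Enumerate all valid sequences (s_0, ..., s_D) encoding the CNNs in
    the mixture: non-decreasing, s_0 = 0, s_D = D, and the (assumed)
    constraint s_{t-1} in {t-1, s_t} for every t."""
    seqs = [(D,)]                                  # build backwards from s_D = D
    for t in range(D, 0, -1):
        seqs = [(prev,) + s for s in seqs for prev in {t - 1, s[0]}]
    return [s for s in seqs if s[0] == 0]          # keep sequences starting at 0

seqs = valid_sequences(4)
```

For D = 4 this yields the eight networks of Figure 2, and in general 2^{D-1} components, since the backward construction makes one binary choice at each of the D-1 positions t = 2, …, D.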

In order to define the probabilities for each network in the mixture, we define a distribution over sequences s as a reversed Markov chain:

p(s) = p(s_D) ∏_{t=1}^{D} p(s_{t-1} | s_t).     (4)

To ensure that sequences have positive probability if and only if they are valid, i.e. satisfy the constraints defined above, we set p(s_D = D) = 1 and define:

p(s_{t-1} | s_t) = π_t if s_{t-1} = t-1,  1-π_t if s_{t-1} = s_t,  and 0 otherwise.     (5)

As illustrated in Figure 2 (Right), these constraints generate a binary tree over the valid non-decreasing sequences. Each conditional probability is thus modelled by a Bernoulli distribution with probability π_t, indicating whether the previous number in the sequence is t-1 or s_t.
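The reversed chain can be sketched in a few lines of Python. The parameterization below (one Bernoulli parameter pi[t] per step, with pi[1] = 1 so that s_0 = 0 almost surely) is our assumption based on the description above, not the paper's exact equation.

```python
from itertools import product

def sequence_prob(seq, pi):
    """p(s) under the (assumed) reversed Markov chain: starting from
    s_D = D, the chain either steps down to s_{t-1} = t-1 with probability
    pi[t], or repeats, s_{t-1} = s_t, with probability 1 - pi[t].
    pi[1] must be 1.0 so that s_0 = 0 with probability one."""
    D = len(seq) - 1
    if seq[D] != D:
        return 0.0                                 # p(s_D = D) = 1
    p = 1.0
    for t in range(D, 0, -1):
        if seq[t - 1] == t - 1 and seq[t - 1] != seq[t]:
            p *= pi[t]                             # step down to t - 1
        elif seq[t - 1] == seq[t]:
            p *= 1.0 - pi[t]                       # repeat the current value
        else:
            return 0.0                             # transition not allowed
    return p

D = 4
pi = [None, 1.0, 0.3, 0.6, 0.8]                    # pi[t] indexed by t = 1..D
# the chain is properly normalized: only valid sequences get positive mass
total = sum(sequence_prob(s, pi) for s in product(range(D + 1), repeat=D + 1))
```

Summing over all length-5 integer tuples gives total probability one, with mass only on the eight valid sequences.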

3.2 Sampling outputs from CNMMs

The graphical model defined in Figure 3 shows that we can sample from the output distribution in Eq. (2) by first generating a sequence from and then evaluating the associated network with Eq. (3). In the following, we formulate an alternative strategy to sample from the model. This formulation offers two advantages. (i) It is amenable to continuous relaxation, which facilitates learning. (ii) It suggests an iterative algorithm to compute feature map expectations, which can be used instead of sampling for efficient inference.

Figure 3: Graphical model representation of the CNMM. The sequence s codes for the CNN architecture. Each h_t is an intermediate feature map generated by the sampled CNN. It is computed from the previous feature map using the corresponding function f_{dt}.

The conditional p(h_t | x) gives the distribution over the feature map h_t across the networks with s_t = t. For example, p(h_2 | x) consists of two weighted delta peaks, located at f_{02}(h_0) and f_{12}(f_{01}(h_0)), respectively. See Figure 2 (Left). These conditional distributions can be expressed as the forwards recurrence:

p(h_t | x) = Σ_d p(s_{t-1} = d | s_t = t) ∫ p(h_d | x) δ_{f_{dt}(h_d)}(h_t) dh_d,     (6)

where δ_{f_{dt}(h_d)} is a delta function centered on f_{dt}(h_d). Therefore, unbiased samples from p(h_t | x) can be obtained through sample propagation. Recall from Eq. (5) that, given s_t, there are only two possible values of s_{t-1} that remain, namely t-1 and s_t. As a consequence, the sum in Eq. (6) only consists of two terms. Given this observation, samples ĥ_t can be obtained from the previously computed samples ĥ_0, …, ĥ_{t-1} as:

ĥ_t = Σ_d z_{dt} f_{dt}(ĥ_d),     (7)

where for a given value of d we sample from p(s_{t-1} | s_t) to compute a binary indicator z_{dt}, which signals whether the resulting ĥ_t is equal to f_{dt}(ĥ_d) or not.

Using Eq. (7) we iteratively sample from the distributions p(h_t | x) for t = 1, …, D, and for each t we compute samples ĥ_t. An illustration of the algorithm is shown in Figure 4. The computational complexity of a complete pass in this iterative process is quadratic in D, since each ĥ_t is computed from the samples already computed for ĥ_0, …, ĥ_{t-1}. This is roughly equivalent to the cost of evaluating a single network with dense layer connectivity of depth D [19], which has a total of D(D+1)/2 connections implemented by the functions f_{dt}.
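The sampling pass and the parameter sharing can be illustrated on toy scalar "feature maps". This is a sketch under our assumptions about the chain (Bernoulli step-down probabilities pi[t], with pi[1] = 1); the functions fs stand in for the convolutional blocks f_{dt} and are invented for illustration.

```python
import random

def sample_sequence(pi, D, rng):
    """Ancestral sampling from the (assumed) reversed chain: with
    probability pi[t] the chain steps down to t-1, otherwise it repeats
    the current value; pi[1] = 1 keeps s_0 = 0."""
    seq = [D]
    for t in range(D, 0, -1):
        seq.insert(0, t - 1 if rng.random() < pi[t] else seq[0])
    return tuple(seq)

def network_output(seq, fs, x):
    """Evaluate the CNN encoded by seq using the shared function table
    fs[(d, t)]; repeated consecutive entries are identity steps."""
    h = x
    for a, b in zip(seq, seq[1:]):
        if a != b:
            h = fs[(a, b)](h)
    return h

D = 4
# one shared toy "convolutional block" per connection (d, t) with d < t:
# every network in the mixture is evaluated from this single table,
# which is exactly the D(D+1)/2-connection dense-connectivity view.
fs = {(d, t): (lambda h, d=d, t=t: 2.0 * h + 10 * d + t)
      for d in range(D) for t in range(d + 1, D + 1)}
rng = random.Random(0)
pi = [None, 1.0, 0.5, 0.5, 0.5]
s = sample_sequence(pi, D, rng)
y = network_output(s, fs, 1.0)
```

The table fs has D(D+1)/2 = 10 entries for D = 4, matching the connection count of the dense network discussed above.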

Sampling outputs from networks of bounded depth.  Using the described algorithm, the samples ĥ_D correspond to output tensors sampled from the mixture defined in Eq. (2). Moreover, for any t < D, samples from p(h_t | x) are output feature maps generated by networks with depth bounded by t. For instance, in Figure 2, samples of h_2 are generated with one of the two networks that compute h_2, either via f_{02}, or via f_{01} followed by f_{12}.

Figure 4: Top: Illustration of the algorithm used to sample intermediate feature maps from the mixture distribution. At each iteration t, we generate ĥ_t by using: (i) samples obtained in the previous iterations, (ii) the corresponding functions f_{dt}, and (iii) samples from p(s_{t-1} | s_t). Bottom: Network with dense connectivity implementing the sampling algorithm.

3.3 Training and inference

We use θ to collectively denote the parameters of the convolutional blocks f_{dt} and the parameters π defining the mixing weights via Eq. (5). Moreover, the parameters of the classifier that predicts the image label(s) from the output tensor h_D are denoted as φ. Given a training set composed of images x_n and labels y_n, we optimize the parameters by minimizing

L(θ, φ) = Σ_n E_{p(h_D | x_n)} [ ℓ(y_n, c_φ(h_D)) ],     (8)

where ℓ is the cross-entropy loss comparing the label y_n with the class probabilities c_φ(h_D) computed from h_D. In practice, we replace the expectation over h_D in each training iteration with samples from p(h_D | x_n).

Learning from subsets of networks.  As discussed in Section 3.2, samples from the distribution p(h_t | x) correspond to outputs of CNNs in the mixture with depth at most t. In order to improve the performance of models with reduced inference time, we explicitly emphasize the loss for such efficient, relatively shallow networks. Therefore, we sum the above loss function over the outputs sampled from networks of increasing depth:

L(θ, φ_1, …, φ_D) = Σ_n Σ_t E_{p(h_t | x_n)} [ ℓ(y_n, c_{φ_t}(h_t)) ],     (9)

where we use a separate classifier c_{φ_t} for each t. In practice, we balance each loss term with a weight increasing linearly with t.

Relaxed binary variables with concrete distributions.  The recurrence in Eq. (7) requires sampling from p(s_{t-1} | s_t), defined in Eq. (5). The sampling renders the parameters π non-differentiable, which prevents gradient-based optimization for them. To address this limitation, we use a continuous relaxation by modelling p(s_{t-1} | s_t) as a binary "concrete" distribution [42]. In this manner, we can use the re-parametrization trick [27, 48] to back-propagate gradients w.r.t. the samples in Eq. (7) and thus to compute gradients for the parameters π.

Efficient inference by expectation propagation.  Once the CNMM is trained, the predictive distribution over the label is given by the expectation of the classifier output under p(h_D | x). This expectation is intractable to compute exactly, contrary to our goal of efficient inference. A naive Monte-Carlo sample approximation still requires multiple evaluations of the full CNMM. Instead, we propose an alternative approximation by propagating expectations instead of samples in Eq. (7), i.e. using the approximation p(h_t | x) ≈ δ_{h̄_t}, where h̄_t is obtained by running the sampling algorithm with the binary samples z_{dt} replaced by their expectations.
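The nature of this approximation can be seen on a minimal toy example. The sketch below is illustrative and not from the paper: fa, fb and the nonlinear head are invented stand-ins. Because each step of Eq. (7) is linear in the binary indicator, replacing the sample by its mean recovers the exact per-step expectation; pushing that mean through a nonlinear classifier head is where the approximation enters.

```python
import math

# One mixture step: h = z*fa(x) + (1 - z)*fb(x), with z ~ Bernoulli(p).
# Replacing z by its expectation p gives E[h] exactly (h is linear in z),
# but head(E[h]) only approximates E[head(h)] for a nonlinear head.
fa = lambda x: 2.0 * x + 1.0
fb = lambda x: -x
head = lambda h: math.exp(h)                   # toy nonlinear classifier head

p, x = 0.3, 1.0
h_bar = p * fa(x) + (1 - p) * fb(x)            # expectation propagation: E[h]
exact = p * head(fa(x)) + (1 - p) * head(fb(x))  # true E[head(h)], enumerating z
approx = head(h_bar)                           # head(E[h]): the approximation
```

Here E[h] = 0.2 is exact, while approx and exact differ substantially (by Jensen's inequality approx < exact for the convex exp head); in the CNMM this trade-off buys a single forward pass instead of many sampled evaluations.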

3.4 Accelerating CNMMs

CNMMs offer two complementary mechanisms in order to accelerate inference. We describe both in the following.

Evaluating intermediate classifiers.  The different classifiers c_{φ_t} learned by minimizing Eq. (9) operate over the outputs of a mixture of networks with maximum depth t. Therefore, at each iteration t of the inference algorithm in Eq. (7) we can already output predictions based on classifier c_{φ_t}. This strategy is related to the one employed in multi-scale dense networks (MSDNets) [17], where "early-exit" classifiers are used to provide predictions at various points in time during the inference process.

Network pruning.  A complementary strategy to accelerate CNMMs is to remove networks from the mixture. The computational cost of the inference process is dominated by the evaluation of the CNN blocks f_{dt} in Eq. (7). However, these functions do not need to be computed when the corresponding indicator z_{dt} is zero. Therefore, a natural approach to prune CNMMs is to set certain transition probabilities to zero, removing all the CNNs from the mixture that use the function f_{dt}. We use the learned distribution p(s) in order to remove networks with a low probability. Note that the pairwise marginal p(s_{t-1} = d, s_t = t) is exactly the sum of the probabilities of all the networks involving the function f_{dt}. Using this observation, we use an iterative pruning algorithm where, at each step, we compute the pairwise marginals for all possible values of d and t. We then set to zero the marginal that currently has the smallest value, which removes the corresponding function f_{dt} and all networks using it. Finally, the marginals are updated, and we iterate.
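A toy version of this greedy pruning loop can be written directly against an enumerated mixture. As before, the chain parameterization (pi[t] as the step-down probability, pi[1] = 1) is our assumption; connection marginals are computed here by brute-force enumeration rather than the closed-form chain marginals an efficient implementation would use.

```python
def mixture(pi, D):
    """Enumerate (sequence, probability) pairs of the mixture under the
    assumed reversed chain: pi[t] = Pr[s_{t-1} = t-1 | s_t]."""
    comps = [((D,), 1.0)]
    for t in range(D, 0, -1):
        comps = [c for seq, p in comps
                 for c in (((t - 1,) + seq, p * pi[t]),
                           ((seq[0],) + seq, p * (1.0 - pi[t])))]
        comps = [(s, p) for s, p in comps if p > 0.0]
    return comps

def connection_marginals(comps):
    """Mass of each connection f_{dt}: total probability of the networks
    that evaluate it, i.e. the pairwise marginal discussed in the text."""
    marg = {}
    for seq, p in comps:
        for a, b in zip(seq, seq[1:]):
            if a != b:
                marg[(a, b)] = marg.get((a, b), 0.0) + p
    return marg

def prune_weakest(comps):
    """One pruning step: zero out the lowest-mass connection by dropping
    every network that uses it, then renormalize the mixture."""
    marg = connection_marginals(comps)
    weakest = min(marg, key=marg.get)
    kept = [(s, p) for s, p in comps if weakest not in zip(s, s[1:])]
    z = sum(p for _, p in kept)
    return [(s, p / z) for s, p in kept], weakest

D = 4
pi = [None, 1.0, 0.4, 0.5, 0.6]
comps = mixture(pi, D)
pruned, removed = prune_weakest(comps)
```

Each call to prune_weakest removes one convolutional block and every network that depends on it, mirroring the fact that pruning ratios can be chosen after training simply by iterating this step more or fewer times.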

Figure 5: Dense network with sparse connectivity implementing the inference algorithm of our CNMMs. See text for details.
Figure 6: Prediction accuracy vs. FLOPs for accelerated CNMMs. Black curves depict the performance of a CNMM learned using a single final classifier. Colored curves correspond to intermediate classifiers at different steps of the inference algorithm. Points on one curve are obtained by progressively pruning convolutional layers.

In this manner, we achieve different pruning levels by progressively removing convolutional blocks that will not be evaluated during inference. This process does not require any re-training of the model, making it possible to dynamically set different pruning ratios. Note that this process is complementary to the use of intermediate classifiers, as discussed above, since our pruning strategy may be used to remove functions f_{dt} for any "early" prediction step t. Finally, it is interesting to observe that the proposed pruning mechanism can be regarded as a form of neural architecture search [37, 63], where the optimal network connectivity for a given pruning ratio is automatically discovered by taking into account the learned probabilities p(s).

4 Experiments

We perform experiments over two different tasks: image classification and semantic segmentation. Following previous work, we measure the computational cost in terms of the number of floating point multiply and addition operations (FLOPs) required for inference. The number of FLOPs provides a metric that correlates very well with the actual inference wall-time, while being independent of implementation and hardware used for evaluation.

4.1 Datasets and experimental setup

CIFAR-10/100 datasets.  These datasets [29] are composed of 50k train and 10k test images with a resolution of 32×32 pixels. The goal is to classify each image across 10 or 100 classes, respectively. Images are normalized using the means and standard deviations of the RGB channels. We apply standard data augmentation operations: (i) 4-pixel zero padding followed by 32×32 cropping, and (ii) random horizontal flipping with probability 0.5. Performance is evaluated in terms of the mean accuracy across classes.

CityScapes dataset.  This dataset [6] contains 1024×2048 pixel images of urban scenes with pixel-level labels across 19 classes. The dataset is split into training, validation and test sets with 2,975, 500 and 1,525 samples each. The ground-truth annotations for the test set are not public, and we use the validation set instead for evaluation. To assess performance we use the standard mean intersection-over-union (mIoU) metric. We follow the setup of [45], and down-sample the images by a factor two before processing them. As a data augmentation strategy during training, we apply random horizontal flipping and resizing with a scaling factor between 0.75 and 1.1. Finally, we use random crops of 384×768 pixels from the down-sampled images.

Figure 7: Comparison of our CNMM with state-of-the-art efficient inference approaches on the CIFAR and CityScapes datasets. Disconnected markers refer to models that are trained independently. Curves correspond to a single model that can operate at different numbers of FLOPs. CNMM curves are obtained by using the optimal combination of pruning and intermediate classifiers.

Base architecture.  As discussed in Section 3.2, the learning and inference algorithms for CNMMs can be implemented using a network with dense layer connectivity [19]. Based on this observation, we use an architecture similar to MSDNets [17]. Specifically, we define a set of blocks, each composed of a set of feature maps at different scales. See Fig. (5).

The initial feature map in each block has a given number of channels and, at each subsequent feature map in the block, the spatial resolution is reduced by a factor two in each dimension, and the number of channels is doubled. Feature maps are connected by functions f_{dt} if the output feature map h_t has the same or half the resolution of the input feature map h_d. Finally, we consider the output tensor h_D to have different connectivity and spatial resolution depending on the task.
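The channel and resolution pattern within a block can be written down directly. In the small sketch below, the initial channel count c0 and resolution r0 are placeholder values, since the excerpt does not specify them.

```python
def block_shapes(c0, r0, scales):
    """(channels, resolution) of the feature maps in one block: each
    subsequent map halves the spatial resolution per dimension and
    doubles the channels. c0 and r0 are illustrative assumptions, not
    the paper's actual configuration."""
    return [(c0 * 2 ** i, r0 // 2 ** i) for i in range(scales)]

shapes = block_shapes(32, 32, 3)   # e.g. three scales within a block
```

With three scales, a 32-channel map at 32×32 would be followed by 64 channels at 16×16 and 128 channels at 8×8, so the per-position cost stays roughly balanced across scales.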

Implementation for image classification.  We implement the convolutional layers as the set of operations (BN-ReLU-DConv-BN-ReLU-Conv-BN), where BN refers to batch normalization, DConv is a 3×3 depth-wise separable convolution [16], and Conv is a 1×1 convolution. In order to reduce computation, for a given tensor h_d, the different functions f_{dt} share the parameters of the initial operations (BN-ReLU-DConv) for all t. Moreover, when the resolution of the feature map is reduced, we use average pooling after these three initial operations. In all our experiments, the initial feature map h_0 is computed by using a convolution over the input image and then applying a convolutional block. Finally, all the tensors with the lowest spatial resolution are connected to the output h_D. Concretely, h_D is a vector obtained by applying the operations (BN-ReLU-GP-FC-BN) to the input tensors, where GP refers to global average pooling, and FC corresponds to a fully-connected layer. The classifier maps h_D linearly to a vector of dimension equal to the number of classes. When using Eq. (9) to train the CNMM, we connect a classifier with the end of each block.

Implementation for semantic segmentation.  We use the same setup as for image classification, but replace the ReLU activations with parametric ReLUs (PReLU) as in [44]. Moreover, we use max instead of average pooling to reduce the spatial resolution. The input tensor h_0 has a resolution four times lower than the original images. This is achieved by applying a convolution with stride 2 to the input and then using a (BN-ReLU-Conv) block followed by max pooling. The output tensor h_D receives connections from all the previous feature maps and has the same number of channels and spatial resolution as h_0. Given that the input feature maps are at different scales, we apply a (BN-PReLU-Conv-BN) block over each input tensor and use bi-linear up-sampling with different scaling factors in order to recover the original spatial resolution. The final classifier computing the class probabilities from h_D is defined as a block (UP-BN-PReLU-Conv-UP-BN-PReLU-DConv), where UP refers to bilinear upsampling, which allows to recover the original image resolution. The second convolution in the block has as many output channels as the number of classes. As in image classification, we use an intermediate classifier at each step where a full block of computation is finished.

Optimization details.  In our experiments, we use SGD with momentum, setting the initial learning rate to 0.1, and use weight decay. On CIFAR, we use a cosine annealing schedule to accelerate convergence. On Cityscapes, we employ a cyclic schedule with warm restarts as in [45]. The temperature of the concrete distributions modelling p(s_{t-1} | s_t) is set to 2. We train our models for 300 and 200 epochs, with batch sizes of 64 and 6, for CIFAR-10/100 and Cityscapes respectively. For the cyclic scheduler, the learning rate is divided by two at fixed epochs. Additionally, the models trained on Cityscapes are fine-tuned for 10 epochs using random crops of size 512×1024 instead of 384×768.

Figure 8: Pixel-level predictions for a single CNMM operating under different computational constraints. As discussed, our model makes it possible to dynamically set the trade-off between accuracy and inference time at no additional cost.

4.2 Pruning and intermediate classifiers

We evaluate the proposed pruning and intermediate classifier strategies to reduce the inference time of trained CNMMs. For CIFAR-10/100 and for Cityscapes we learn CNMMs with a fixed number of blocks and scales per block. For each dataset, we train one model that uses a single classifier c_φ, optimized using Eq. (8). In addition, we train a second model with intermediate classifiers c_{φ_t}, minimizing the loss function in Eq. (9). In the following, we refer to the first and second variant as CNMM-single and CNMM respectively.

In Figure 6 we report prediction accuracy vs. FLOPs for inference. Each model is represented as a curve, traced by pruning the model to various degrees. Across the three datasets, the CNMM model with intermediate classifiers achieves higher accuracy in fast inference settings than the CNMM-single model. Recall that all the operating points across the different CNMM curves are obtained from a single trained model. Therefore, this single model can realize the upper envelope of the performance curves. As expected, the maximum performance of the intermediate classifiers increases with the step number. The accuracy of CNMM at the final step is comparable to the level obtained by the CNMM-single model: slightly worse on CIFAR-10, and slightly better on CIFAR-100 and CityScapes. This is because the minimized intermediate losses provide an additional supervisory signal, which is particularly useful to encourage accurate predictions for shallow, but fast, CNNs. In conclusion, the CNMM model with intermediate classifiers is to be preferred, since it provides a better trade-off between accuracy and computation over a wider range of FLOP counts.

By analysing the operating points along each curve, we can observe the effectiveness of the proposed pruning algorithm. For the CIFAR datasets we can reduce the FLOP count by a factor two without a significant loss in accuracy. For CityScapes, about 25% pruning can be achieved without a significant loss. In general, if several exit points can achieve the same FLOP count by applying varying amounts of pruning, the best performance is obtained by pruning less and using an earlier classifier, rather than pruning more for a later exit.

4.3 Comparison with the state of the art

Image classification.  We compare our model with different state-of-the-art CNN acceleration strategies [17, 18, 22, 38, 56]. We consider methods applying pruning at different levels, such as individual filters (Network Slimming [38]), groups of weights (CondenseNet [18]), connections in multi-branch architectures (SuperNet [56]), or a combination of them (SSS [22]). We also compare our method with any-time models based on early-exit classifiers (MSDNet [17]). The compared approaches have shown the best performance among efficient inference methods using fewer than 200 million FLOPs. We compare to CNMMs using 6 and 12 blocks, with three scales in both cases.

The results in Figure 7 (left, center) show that, on the CIFAR datasets, CNMMs achieve a similar or better accuracy-compute trade-off than all the compared methods across a broad range of FLOP counts. Only CondenseNet shows somewhat better performance for medium FLOP counts. Moreover, note that the different operating points shown for the compared methods (except for MSDNets) are obtained by using different models trained independently, e.g. by different settings of a hyper-parameter controlling the pruning ratio. In contrast, CNMM embeds a large number of operating points in a single model. This feature is interesting when the available computational budget can change dynamically, based on concurrent processes, or when the model is deployed across a wide range of devices. In these scenarios, a single CNMM can be accelerated on-the-fly depending on the available resources. Note that a single MSDNet is also able to provide early predictions by using intermediate classifiers. However, our CNMM provides better performance for a given FLOP count and allows for a finer granularity to control the computational cost.

Semantic segmentation.  State-of-the-art methods for real-time semantic segmentation have mainly focused on the manual design of efficient network architectures. By employing highly optimized convolutional modules, ESPNet [44] and ESPNetv2 [45] have achieved impressive accuracy-computation trade-offs. Other methods, such as [5, 59], offer higher accuracy but at several orders of magnitude higher inference cost, limiting their application in resource-constrained scenarios.

In Figure 7 (right) we compare our CNMM results to these two approaches. Note that the original results reported in [45] are obtained with a model pre-trained on ImageNet. For a fair comparison with our CNMMs, we have trained ESPNetv2 from scratch using the code provided by the authors. As can be observed, CNMM provides a better trade-off than ESPNet. In particular, a full CNMM without pruning obtains an improvement of 0.5 points of mIoU, while reducing the FLOP count by 45%. Moreover, an accelerated CNMM achieves performance similar to the most efficient ESPNet, which needs more than twice as many FLOPs. On the other hand, ESPNetv2 gives slightly better trade-offs than our CNMMs. However, this model relies on an efficient inception-like module [53] that also includes grouped point-wise and dilated convolutions. These are orthogonal design choices that could be integrated into our model as well, and we expect them to further improve our results. Additionally, the different operating points of ESPNet and ESPNetv2 are achieved with different models trained independently. Therefore, unlike our approach, these methods do not allow for fine-grained control over the accuracy-computation trade-off, and multiple models need to be trained. Figure 8 shows qualitative results using different operating points from a single CNMM.

5 Conclusions

We proposed to address model pruning by using Convolutional Neural Mixture Models (CNMMs), a novel probabilistic framework that embeds a mixture of an exponential number of CNNs. In order to make training and inference tractable, we rely on massive parameter sharing across the models, and use concrete distributions to differentiate through the discrete sampling of mixture components. To achieve efficient inference in CNMMs we use an early-exit mechanism that allows prediction after evaluating only a subset of the networks. In addition, we use a pruning algorithm to remove CNNs that have low mixing probabilities. Our experiments on image classification and semantic segmentation tasks show that CNMMs achieve excellent trade-offs between prediction accuracy and computational cost. Unlike most previous work, a single CNMM model allows for a large number and wide range of accuracy-compute trade-offs, without any re-training.
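The pruning step summarized above can be sketched as keeping the mixture components with the largest mixing probabilities and renormalizing the rest. This is a minimal sketch under assumptions: a count-based interface and the toy probability vector are illustrative, not the paper's exact procedure.

```python
import numpy as np

def prune_mixture(mix_probs, keep):
    """Keep the `keep` components with the largest mixing probability and
    renormalize their probabilities; return (kept indices, new probabilities)."""
    order = np.argsort(mix_probs)[::-1][:keep]   # indices of the top components
    kept = np.sort(order)
    new_probs = mix_probs[kept] / mix_probs[kept].sum()
    return kept, new_probs

# Toy mixing probabilities over four networks; prune down to the two dominant ones.
probs = np.array([0.05, 0.40, 0.10, 0.45])
kept, new_probs = prune_mixture(probs, keep=2)
```

Removing low-probability components barely changes the mixture's prediction while proportionally reducing the number of networks, and hence FLOPs, evaluated at inference.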


Acknowledgments.  This work is supported by ANR grants ANR-16-CE23-0006 and ANR-11-LABX-0025-01.


  • [1] K. Ahmed and L. Torresani (2018) Maskconnect: connectivity learning by gradient descent. In ECCV, Cited by: §2.
  • [2] P. Baldi and P. J. Sadowski (2013) Understanding dropout. In NeurIPS, Cited by: §2.
  • [3] L. Breiman (1996) Bagging predictors. Machine learning. Cited by: §2.
  • [4] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In NeurIPS, Cited by: §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI. Cited by: §4.3.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.1.
  • [7] S. S. Du, X. Zhai, B. Poczos, and A. Singh (2019) Gradient descent provably optimizes over-parameterized neural networks. ICLR. Cited by: §1.
  • [8] A. Dutt, D. Pellerin, and G. Quénot (2018) Coupled ensembles of neural networks. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI), Cited by: §2.
  • [9] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: training pruned neural networks. ICLR. Cited by: §1.
  • [10] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In NeurIPS, Cited by: §2.
  • [11] S. Ghosh, J. Yao, and F. Doshi-Velez (2018) Structured variational learning of bayesian neural networks with horseshoe priors. ICML. Cited by: §1, §2, §2.
  • [12] B. Han, J. Sim, and H. Adam (2017) Branchout: regularization for online ensemble tracking with convolutional neural networks. In CVPR, Cited by: §2.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In ECCV, Cited by: §2.
  • [14] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, Cited by: §2.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop. Cited by: §2.
  • [16] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2, §4.1.
  • [17] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger (2018) Multi-scale dense convolutional networks for efficient prediction. ICLR. Cited by: §A.1, §3.4, §4.1, §4.3.
  • [18] G. Huang, S. Liu, L. van der Maaten, and K. Weinberger (2018) Condensenet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §1, §2, §4.3.
  • [19] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §3.2, §4.1.
  • [20] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get m for free. ICLR. Cited by: §2.
  • [21] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In ECCV, Cited by: §2, §2.
  • [22] Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In ECCV, Cited by: §1, §4.3.
  • [23] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.
  • [24] E. Ilg, O. Cicek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox (2018) Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, Cited by: §2.
  • [25] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, Cited by: §2.
  • [26] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. BMVC. Cited by: §1.
  • [27] D. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In ICLR, Cited by: §3.3.
  • [28] D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In NeurIPS, Cited by: §2.
  • [29] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  • [30] A. Krogh and J. Vedelsby (1995) Neural network ensembles, cross validation, and active learning. In NeurIPS, Cited by: §2.
  • [31] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, Cited by: §2, §2.
  • [32] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Fractalnet: ultra-deep neural networks without residuals. ICLR. Cited by: §2.
  • [33] Y. LeCun, J. Denker, and S. Solla (1990) Optimal brain damage. In NeurIPS, Cited by: §2.
  • [34] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra (2015) Why m heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314. Cited by: §2, §2.
  • [35] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. ICLR. Cited by: §1, §2.
  • [36] D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In ICML, Cited by: §2.
  • [37] H. Liu, K. Simonyan, and Y. Yang (2019) Darts: differentiable architecture search. ICLR. Cited by: §3.4.
  • [38] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §1, §2, §4.3.
  • [39] I. Loshchilov and F. Hutter (2017) Sgdr: stochastic gradient descent with warm restarts. ICLR. Cited by: §2.
  • [40] C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through regularization. ICLR. Cited by: §1.
  • [41] C. Louizos and M. Welling (2017) Multiplicative normalizing flows for variational bayesian neural networks. In ICML, Cited by: §2, §2.
  • [42] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. ICLR. Cited by: §3.3.
  • [43] M. Masana, J. van de Weijer, L. Herranz, A. D. Bagdanov, and J. M. Alvarez (2017) Domain-adaptive deep network compression. In ICCV, Cited by: §2.
  • [44] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi (2018) Espnet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, Cited by: §4.1, §4.3.
  • [45] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi (2019) ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network. CVPR. Cited by: §4.1, §4.1, §4.3, §4.3.
  • [46] U. Naftaly, N. Intrator, and D. Horn (1997) Optimal ensemble averaging of neural networks. Network: Computation in Neural Systems. Cited by: §2.
  • [47] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov (2017) Structured bayesian pruning via log-normal multiplicative noise. In NeurIPS, Cited by: §1, §2, §2.
  • [48] D. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §3.3.
  • [49] R. E. Schapire (2003) The boosting approach to machine learning: an overview. In Nonlinear estimation and classification, Cited by: §2.
  • [50] S. Singh, D. Hoiem, and D. Forsyth (2016) Swapout: learning an ensemble of deep architectures. In NeurIPS, Cited by: §2.
  • [51] S. Srinivas and R. V. Babu (2015) Data-free parameter pruning for deep neural networks. BMVC. Cited by: §2.
  • [52] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §2.
  • [53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §4.3.
  • [54] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. (2016) Convolutional neural networks with low-rank regularization. ICLR. Cited by: §2.
  • [55] A. Veit, M. J. Wilber, and S. Belongie (2016) Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, Cited by: §2.
  • [56] T. Véniat and L. Denoyer (2018) Learning time/memory-efficient deep architectures with budgeted super networks. In CVPR, Cited by: §1, §2, §4.3.
  • [57] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2019) Slimmable neural networks. ICLR. Cited by: §2.
  • [58] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: §2.
  • [59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: §4.3.
  • [60] K. Zhao, T. Matsukawa, and E. Suzuki (2018) Retraining: a simple way to improve the ensemble accuracy of deep neural networks for image classification. In ICPR, Cited by: §2.
  • [61] Z. Zhou, J. Wu, and W. Tang (2002) Ensembling neural networks: many could be better than all. Artificial intelligence. Cited by: §2.
  • [62] S. Zhu, X. Dong, and H. Su (2019) Binary ensemble neural network: more bits per network or more networks per bit?. CVPR. Cited by: §2.
  • [63] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §2, §3.4.

Appendix A Supplementary material

We provide further results on CIFAR100 in order to show the importance of all components of our proposed CNMMs. Moreover, we provide additional qualitative results of semantic segmentation on the CityScapes dataset.

A.1 Ablative study of CNMM

Using sampling during training.  During learning, CNMMs generate a set of samples using Eq. (7). In contrast, during inference we use the expectations instead. To evaluate the importance of sampling during learning, we have optimized a CNMM using the aforementioned expectations instead of samples. Figure 9 shows the results obtained with this approach, denoted as “Training with expectations”. We observe that, compared to the CNMM using sampling, the accuracy decreases faster when different pruning ratios are applied. We attribute this to the fact that our sampling procedure can be regarded as a continuous relaxation of dropout, where a subset of functions is randomly removed when computing the output tensor. As a consequence, the learned model is more robust to the pruning process, where some of the convolutional blocks are removed during inference. This is not the case when deterministic expectations are used in Eq. (7) rather than samples.
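The contrast between sampling and using expectations can be illustrated with a minimal sketch of a concrete (Gumbel-softmax) relaxation gating two feature maps. The temperature value and the two-way gate are simplifying assumptions for illustration, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(0)

def concrete_sample(log_probs, temperature=0.5):
    """Draw a relaxed one-hot sample from a discrete distribution
    (continuous relaxation in the style of Maddison et al.)."""
    u = rng.uniform(size=log_probs.shape)
    gumbel = -np.log(-np.log(u + 1e-20))            # Gumbel(0, 1) noise
    logits = (log_probs + gumbel) / temperature
    return np.exp(logits - np.logaddexp.reduce(logits))  # softmax

# Gate between two candidate feature maps. During training the mixing
# weights are *sampled* (a continuous relaxation of dropout over functions);
# at inference, the deterministic expectations replace the samples.
log_probs = np.log(np.array([0.7, 0.3]))
f_a, f_b = np.ones(4), np.zeros(4)

w = concrete_sample(log_probs)                      # stochastic, varies per draw
train_out = w[0] * f_a + w[1] * f_b

w_exp = np.exp(log_probs)                           # deterministic expectation
test_out = w_exp[0] * f_a + w_exp[1] * f_b
```

Because the sampled weights occasionally drive one branch's contribution close to zero, the trained model learns to tolerate missing branches, which is what makes it robust to pruning at inference time.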

Comparison with a deterministic model.  We compare the performance of our CNMM with a deterministic variant using the same architecture. Concretely, in Eq. (7) we ignore the sampled variables and simply sum the feature maps. Note that the resulting model is analogous to an MSDNet [17] using early-exit classifiers. We report the results in Figure 9, denoted as “Deterministic with early-exits”. We observe that our CNMM model obtains better performance than its deterministic counterpart. Moreover, as with MSDNets, accelerating the deterministic model is only possible via the early-exits. In contrast, the complementary pruning algorithm available in CNMM allows for a finer granularity in controlling the computational cost.

Expectation approximation during inference.  To validate our approximation of the output expectation during inference, we evaluate the performance obtained with a Monte-Carlo procedure for the same purpose. In particular, we generate N samples from the output distribution. Then, we compute the class probabilities for each sample and average them. Table 1 shows the results obtained by varying the number of samples. We observe that our approach offers performance similar to the Monte-Carlo approximation using N=5 samples. For a higher number of samples, we observe slight improvements in the results. However, note that the Monte-Carlo approximation is very inefficient, since it requires N independent evaluations of the model.

In particular, the last row in Table 1 is 30 times more costly to obtain than the first two rows. The minimal gain obtained with more samples could probably be achieved more efficiently by using a larger model.
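The two approximations compared above can be sketched with a toy two-component mixture standing in for the full model; the component logits and mixing weights below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy mixture: each component produces class logits; the mixture selects a
# component according to the mixing weights pi.
pi = np.array([0.8, 0.2])
logits = np.array([[2.0, 0.0, 0.0],   # component 0
                   [0.0, 2.0, 0.0]])  # component 1

# (a) Deterministic approximation: class probabilities of the expected output
# (a single forward pass in the full model).
expected = softmax(pi @ logits)

# (b) Monte-Carlo approximation: sample components, compute per-sample class
# probabilities, and average (N independent evaluations in the full model).
def monte_carlo(n_samples):
    idx = rng.choice(len(pi), size=n_samples, p=pi)
    return softmax(logits[idx]).mean(axis=0)

mc = monte_carlo(1000)
```

Both estimates yield proper class distributions and agree on the prediction here; the deterministic route does so at roughly 1/N of the Monte-Carlo cost, matching the FLOP comparison in Table 1.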

Figure 9: FLOPs vs. performance curves for our model (CNMM), and the variants described in Section A.1. As in Figure 7 of the main paper, the curves are obtained by using the optimal combination of early-exits and pruning where possible. In this manner, the results for CNMM represent the upper envelope of all the different curves depicted in Fig. 6 of the main paper.
Figure 10: Pixel-level predictions for a single CNMM adapting the number of FLOPs required during inference.
Approximation FLOPs Top-1 Accuracy
Expectation (used) 93M 74.4
Sampling N=1 93M 71.2
Sampling N=5 463M 74.4
Sampling N=15 1390M 74.5
Sampling N=30 2780M 74.6
Table 1: Comparison of the results obtained on CIFAR100 by approximating the CNMM output using our approach or a Monte-Carlo procedure with different numbers of samples N.

A.2 Additional Qualitative Results

In Figure 10 we provide additional qualitative results for semantic segmentation, obtained by a single trained CNMM model using various operating points with different numbers of FLOPs during inference.