Introduction
CNN models have been widely used in image-based applications, thanks to the breakthrough performance of AlexNet [Krizhevsky, Sutskever, and Hinton2012] and VGGNet [Simonyan and Zisserman2014]. However, a very deep and wide CNN model consists of many parameters, and as a result, the trained model may demand a large amount of DRAM and a large number of multiplications to perform a prediction. Such high resource and computation requirements lead to latency, heat, and power consumption problems, which make these models ill-suited to edge devices such as mobile phones and IoT devices [Sze et al.2017]. Therefore, reducing CNN model size is essential for improving resource utilization and conserving energy.
Several CNN model reduction algorithms have recently been proposed [Sze et al.2017]. These algorithms can be divided into two categories: micro-level reduction (e.g., performing reduction/quantization inside a filter) and macro-level reduction (e.g., removing redundant filters). The two categories are complementary. (More details are presented in the related work section.)
There are two macro-level reduction approaches: optimization-based and channel-scaling-based. Each approach has multiple methods and algorithms. The optimization-based approach typically estimates filter importance by formulating an optimization problem with the adopted criteria (e.g., filter weight magnitude). Removing a filter (or a channel) affects both the preceding and the following layers, so the filter pruning step of an optimization-based method must take into account the interconnected structures between CNN layers. Therefore, a CNN model with more complex interconnected structures, such as DenseNet [Huang et al.2017] or ShuffleNet [Zhang et al.2017], may prevent the optimization-based approach from being effective.
The channel-scaling-based approach uses a scalar to reduce channel width. For instance, MobileNet [Howard et al.2017] uses the same scalar to prune the widths of all channels. Applying the same scalar to all convolutional layers without considering each layer's information density is a coarse-grained method. A fine-grained method that finds the optimal scalar for each convolutional layer would be ideal. However, the increasingly complicated inter-layer connection structures of CNN models make fine-grained scaling infeasible. In addition, a layer-dependent method requires a dependable metric of information redundancy for each convolution layer in order to determine an effective layer-dependent scalar.
To address the shortcomings of the current model-compaction methods, we propose macroblock scaling (MBS). A macroblock consists of a number of convolution layers that exhibit similar characteristics, such as having the same resolution or being a segment of convolution layers with customized interconnects. Using the macroblock as a structural abstraction gives MBS the flexibility to interoperate with virtually any CNN model, whatever its structure, and also permits channel scaling to be performed in a finer-grained manner. To determine an effective macroblock-dependent scalar, MBS uses effective flops to measure each macroblock's information density. (We define effective flops to be the number of convolution flops required for the activated nonzero ReLU outputs.) Experimental results show that the reduction MBS can achieve is more significant than those achieved by all prior schemes.
In summary, the contributions of this work are as follows:

MBS employs macroblocks to address the issues that neither coarse-grained nor fine-grained scaling can deal with, and hence allows channel scaling to be performed on any CNN model.

MBS proposes using an effective and efficient measure, effective flops, to quantify information density and decide macroblock-dependent scaling factors. As shown in the algorithm section, the computational complexity of MBS is linear w.r.t. the number of training instances times the number of parameters, which is more efficient than the optimization-based methods.

Extensive empirical studies on two representative datasets and all well-known CNN models (e.g., MobileNet, ShuffleNet, ResNet, and DenseNet) demonstrate that MBS outperforms all state-of-the-art model-reduction methods in reduction size while preserving the same level of prediction accuracy. Because of its simplicity, MBS remains effective even with ultra-deep CNNs like ResNet on ImageNet ( reduction) and ResNet on CIFAR ( reduction).
The remaining parts of this paper are organized into three main sections. The Related Work section highlights previous efforts to reduce CNN models. The Method section explains our proposed MBS algorithm. The Experiment section shows the encouraging results of applying MBS to various CNN models.
Related Work
Property     Optimization-based  Scaling-based
Performance  High                Medium-High
Flexibility  Low                 High
Scalability  Low                 High
We review related work in two parts. We first review key CNN properties relevant to the inception of MBS and then review some representative model-reduction methods.
Relevant CNN Properties
Several works [Teerapittayanon, McDanel, and Kung2016, Figurnov et al.2017, Wu et al.2018] integrate an early-stop (or early-exit) mechanism into the initial CNN layers in order to speed up inference. These results demonstrate that the output of a CNN model at early stages can be adequate for predicting an image label with high confidence. This provides supporting evidence for grouping convolution layers into two types: former convolution layers (near the input image) as the base layers, and latter convolution layers (close to the label output) as the enhancement layers. The early-stop mechanism suggests that the information density in the enhancement layers should be lower than that in the base layers, and therefore that the enhancement layers offer more opportunities for compression to reduce model size.
Reduction Methods of CNN Model
As mentioned in the introduction, model reduction can be divided into micro-level and macro-level approaches. Binary approximation of a CNN filter is one important direction for micro-level model reduction [Courbariaux, Bengio, and David2015, Rastegari et al.2016, Lin, Zhao, and Pan2017, Hubara et al.2016]. Maintaining the prediction accuracy of a binary CNN is challenging [Tang, Hua, and Wang2017]. Sparse convolution modules [Liu et al.2015, Wen et al.2016, Jaderberg, Vedaldi, and Zisserman2014, Denton et al.2014, Aghasi et al.2017] and deep compression [Han, Mao, and Dally2016] usually introduce irregular structures, and micro-level model reduction methods with irregular structures often require special hardware for acceleration [Han et al.2016].
The macro-level model reduction approach removes irrelevant filters and maintains the existing structures of CNNs [Li et al.2017, He, Zhang, and Sun2017, Liu et al.2017, Hassibi and Stork1993]. The methods of this reduction approach estimate filter importance by formulating an optimization problem based on some adopted criteria (e.g., the filter weight magnitudes or the filter responses). The work of [Yu et al.2018] addresses the filter importance issue by formulating the problem as binary integer programming with the aid of feature ranking [Roffo, Melzi, and Cristani2015], which achieves the state-of-the-art result. For an layers model with parameters and training images, the computational complexity of the feature ranking preprocessing is to acquire the corresponding CNN outputs of the training images, and the ranking step would take additional complexity [Roffo, Melzi, and Cristani2015]. In addition to the preprocessing step, the binary integer programming is an NP-hard problem. The detailed complexity is not specified in [Yu et al.2018]. In general, a good approximate solution for variables still requires high computational complexity (e.g., [Karmarkar1984]).
MBS enjoys low computational complexity that is in computing information density and in computing the scaling factors (Algorithm 1 in the next section presents details). MBS enjoys a superior model-reduction ratio while preserving prediction accuracy at low computation cost.
MBS Algorithm
This section presents our proposed macroblock scaling (MBS) algorithm for reducing an already trained CNN model. We first define key terms including channel, filter, channel scaling, macroblock, and the parameters used by MBS. We then explain how MBS computes information density, and how that information is used by MBS to reduce model size. Finally, we analyze the computational complexity of MBS and compare its efficiency with competing model-compaction algorithms.
We use image applications to explain a CNN pipeline. A typical CNN pipeline accepts training images as input. These training instances are of the same height and width. To simplify our notation, we assume all input images are in the square form with the same resolution . A CNN model is composed of multiple convolution layers. The input to a convolution layer is a set of input tensors (or input activations), each of which is called a channel [Sze et al.2017]. Each layer generates a successively higher-level abstraction of the input tensors, called an output tensor or feature map. More specifically, the convolution layer , , takes input tensor and produces output tensor , where is the spatial height and width of , the input channel width (i.e., number of channels), the spatial height and width of , and the output channel width. Let denote the spatial dimension of the square-filter kernel of , the required number of parameters of can be written as
(1)
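The symbols of Eq. (1) did not survive in this copy, so the following is only a sketch of the standard weight count for a square-kernel convolution layer (kernel area times input channels times output channels, bias ignored), which is what the surrounding definitions describe. The function name `conv_params` and the example layer sizes are assumptions introduced for illustration.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a square-kernel convolution layer, ignoring bias terms."""
    return kernel_size * kernel_size * in_channels * out_channels


# Illustrative sizes (not taken from the paper): a 3x3 layer, 64 -> 128 channels.
full = conv_params(3, 64, 128)
# Halving both the input and output channel widths quarters the weight count.
half = conv_params(3, 32, 64)
print(full, half)
```

Because the count is a product of the input and output widths, scaling every channel width by the same factor shrinks the parameters roughly quadratically, which is why channel scaling is an effective lever on model size.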
MBS groups convolution layers into macroblocks. Macroblock consists of the convolution layers whose output tensors (feature maps) are of the same size. Figure 1 depicts an example CNN model with three CNN macroblocks. The size of output tensors is downsampled by the pooling layers with stride size 2. Hence, macroblock is defined as
(2)
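The grouping rule described above (layers whose output tensors share the same size belong to the same macroblock) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `group_into_macroblocks` and the list of output sizes are hypothetical, and the sketch assumes that resolution only changes at downsampling points so that equal sizes appear in consecutive runs.

```python
from itertools import groupby


def group_into_macroblocks(output_sizes):
    """Group consecutive layer indices that share the same output resolution."""
    indexed = enumerate(output_sizes)
    return [
        [index for index, _ in run]
        for _, run in groupby(indexed, key=lambda pair: pair[1])
    ]


# Hypothetical per-layer output resolutions, halved by stride-2 pooling.
sizes = [32, 32, 32, 16, 16, 8, 8, 8]
macroblocks = group_into_macroblocks(sizes)
print(macroblocks)
```

For models with customized interconnects, the text instead treats a whole segment as one macroblock, which would replace the resolution key with a hand-specified segment label.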
Operation scaling reduces channel width. Intuitively, MBS would like to prune channels that cannot provide positive contributions to accurate prediction. For instance, MobileNet [Howard et al.2017] scales down all channel widths by a constant ratio ; we call this baseline scheme scaling. MobileNet uses the same value for all convolution layers. However, an effective channel-scaling scheme should estimate the best scaling ratio for each convolutional layer based on its information density, which MBS quantifies in order to determine layer-dependent values.
Grouping Layers on Information Density
An effective CNN model requires a sufficient number of convolution layers to capture good representations from input data. However, as the number of convolution layers grows beyond a threshold, the additional benefit in improving prediction accuracy can diminish. One may argue that the former convolution layers learn low-level representations such as edges and contours, whereas the latter layers learn high-level semantics. As we will demonstrate shortly, the latter layers may cover receptive fields that are larger than the input images, and their learned information may not contribute to class prediction. The effective receptive field in a CNN is the region of the input image that affects a particular neuron of the network [Luo et al.2016]. Figure 2 shows an example, where a neuron of a former convolution layer covers a region inside the input image, whereas a neuron of a latter layer may cover a region larger than the input image.
Hence, we categorize convolution layers into two types, base layers and enhancement layers, which are defined as follows:

Base convolution layers: The former convolution layers (near the input) of a CNN model learn essential representations from the training data. Though representations captured in the base layers could be redundant, they are fundamental for accurate class prediction.

Enhancement convolution layers: The latter convolution layers may cover receptive fields larger than the input areas. (Due to data augmentation and boundary patching operations applied to raw input images, a training input image may contain substantial useless information at its boundaries.) Therefore, opportunities are available for channel pruning to remove both useless and redundant information.
Determining CNN Base Layers by Receptive Field
Revisit macroblocks in Figure 1. Convolution layer belonging to macroblock is the first enhancement layer, where its receptive field is larger than the size of the input image. We estimate information redundancy of each macroblock by measuring the information density ratio contributed by the enhancement layers.
Now, we define a function to compute the receptive field size of layer . For simplicity, assume that the receptive field region of filter is . The possible set of values of is discrete, which is determined by the configuration of the kernel size and the stride step of a CNN model. For lucid exposition, we define to characterize the minimum receptive field boundary that is greater than a given value as follows:
(3) 
We use this boundary to divide base convolution layers and enhancement convolution layers in a CNN pipeline.
We can determine the base layers of a CNN by setting the value of . As we have previously explained, the area beyond and at the boundary of an image contains less useful information. Therefore, setting is reasonable to separate those layers that can contribute more to class prediction from the other layers. A macroblock can contain base convolution layers only, enhancement layers only, or a mixture of the two.
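To make the base/enhancement split concrete, the sketch below computes the theoretical receptive field of each layer with the standard recurrence (the field grows by (kernel - 1) times the accumulated stride) and returns the first layer whose field exceeds a threshold. It is an illustration under stated assumptions rather than the paper's Eq. (3): padding and dilation are ignored, and the names `receptive_fields`, `first_enhancement_layer`, and the layer configuration are hypothetical.

```python
def receptive_fields(layers):
    """Return the receptive field size after each (kernel, stride) layer."""
    field, jump, sizes = 1, 1, []
    for kernel, stride in layers:
        # Growth is measured in input pixels, using the stride accumulated
        # by the layers *before* this one.
        field += (kernel - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


def first_enhancement_layer(layers, threshold):
    """Index of the first layer whose receptive field exceeds threshold."""
    for index, size in enumerate(receptive_fields(layers)):
        if size > threshold:
            return index
    return None  # every layer is a base layer for this threshold


# Hypothetical pipeline: two 3x3 convs, a 2x2 stride-2 pool, two 3x3 convs.
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
print(receptive_fields(layers))
print(first_enhancement_layer(layers, threshold=8))
```

Layers before the returned index would be treated as base layers and the rest as enhancement layers; a smaller threshold moves the boundary earlier and marks more layers as candidates for reduction.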
Macroblock-level Reduction
To preserve the design structure of an original CNN model, MBS performs reduction at the macroblock level instead of at the convolution-layer level. As we defined at the beginning of this section, each macroblock contains convolution layers of the same resolution. In addition, for CNN models whose convolution layers are connected into a complex structure, MBS treats each such segment as a macroblock to preserve its design. Our macroblock approach, as its name suggests, does not deal with detailed inter-layer connection structure. The macroblock abstraction thus makes model reduction simple and adaptive.
Macroblock Information Density/Redundancy Estimation
MBS uses convolution FLOPs to estimate information density. A FLOP (denoted by the lowercase “flop” in the remainder of the paper) is a multiply-and-add operation in convolution. The more frequently ReLU outputs a zero value, the less information the convolution layer contains. Therefore, only those flops that produce a nonzero ReLU output are considered to be effective. Figure 3 shows the computation of the effective flops of a convolution layer. Let denote effective flops of layer , and the nonzero probability of its ReLU output. We can define as
(4) 
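As a rough illustration of the definition above, the sketch below weights a layer's multiply-and-add count by the nonzero probability of its ReLU output. The exact form of Eq. (4) is not legible in this copy, so the flop count used here (kernel area times input channels times output channels times output positions) is the usual convolution estimate and should be read as an assumption, as should the function names and example values.

```python
def conv_flops(kernel_size, in_channels, out_channels, out_size):
    """Multiply-and-add count of a square convolution with a square output."""
    return kernel_size ** 2 * in_channels * out_channels * out_size ** 2


def effective_flops(kernel_size, in_channels, out_channels, out_size, nonzero_prob):
    """Flops weighted by the probability that the ReLU output is nonzero."""
    if not 0.0 <= nonzero_prob <= 1.0:
        raise ValueError("nonzero_prob must lie in [0, 1]")
    return conv_flops(kernel_size, in_channels, out_channels, out_size) * nonzero_prob


# Hypothetical layer: 3x3 kernel, 16 -> 16 channels, 32x32 output,
# with half of the ReLU outputs being nonzero.
print(effective_flops(3, 16, 16, 32, nonzero_prob=0.5))
```

A layer whose activations are mostly zero therefore contributes few effective flops even if its raw flop count is large.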
To evaluate information density of macroblock , we tally the total effective flops from the beginning of the CNN pipeline to the end of macroblock . We can write the sum of the effective flops as
(5) 
Next, we compute the effective flops in the base layers or those flops taking place within the receptive field as
(6) 
where the base layers have the maximum receptive field size .
Based on the total flops and base flops , we define the difference between the two as the enhancement flops, which is denoted as and can be written as . The redundancy ratio of macroblock is then defined as the total enhancement flops over the total flops, or
(7) 
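The redundancy ratio described above can be sketched in a few lines. The interpretation here is an assumption: total flops are accumulated from the first layer to the last layer of the macroblock, base flops are the part of that sum contributed by base layers, and the ratio is (total - base) / total. The names `redundancy_ratio`, `effective`, `is_base`, and `block_end` are hypothetical, as are the example numbers.

```python
def redundancy_ratio(effective, is_base, block_end):
    """Enhancement flops over total flops, up to layer block_end (inclusive).

    effective: per-layer effective flops, in pipeline order.
    is_base:   per-layer flags, True for base layers.
    """
    prefix = effective[: block_end + 1]
    total = sum(prefix)
    base = sum(flops for flops, base_flag in zip(prefix, is_base) if base_flag)
    if total == 0:
        return 0.0  # no activity at all: treat as no measurable redundancy
    return (total - base) / total


# Hypothetical four-layer pipeline; the last two layers are enhancement layers.
effective = [4.0, 4.0, 2.0, 2.0]
is_base = [True, True, False, False]
print(redundancy_ratio(effective, is_base, block_end=1))
print(redundancy_ratio(effective, is_base, block_end=3))
```

Under this reading the ratio is zero for a macroblock that ends before any enhancement layer and can never exceed one.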
We estimate the channel-scaling factor for each macroblock based on this derived redundancy , which is addressed next.
Channel-Scaling Factor Estimation
We define the relation between the original channel width of macroblock and the compact channel width after the reduction process, which is depicted as
(8)
If there is no redundancy in macroblock (i.e., ), the original channel width is equal to the compact channel width . Therefore, the channel width multiplier for the macroblock is
(9)
where this estimation makes since according to Eq. (7). The lower bound of the channel-scaling factor is in accordance with the observation made by MobileNet [Howard et al.2017] that a scaling factor that is less than can introduce noticeable distortion.
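Since Eqs. (8) and (9) are not legible in this copy, the sketch below shows only one form that is consistent with the surrounding text: the factor equals one when the redundancy is zero, decreases as redundancy grows, and is clamped from below. The specific expression 1 / (1 + redundancy), the function names, and the value passed as `lower_bound` in the example are all assumptions made for illustration; the paper's actual lower bound is not recoverable here.

```python
def scaling_factor(redundancy, lower_bound):
    """Assumed form: shrink by 1 / (1 + redundancy), clamped at lower_bound."""
    return max(1.0 / (1.0 + redundancy), lower_bound)


def compact_width(width, redundancy, lower_bound):
    """Channel width after scaling, rounded and kept to at least one channel."""
    return max(1, round(width * scaling_factor(redundancy, lower_bound)))


# Illustrative values only; 0.5 is a placeholder bound, not the paper's.
print(compact_width(100, redundancy=0.0, lower_bound=0.5))
print(compact_width(100, redundancy=0.25, lower_bound=0.5))
print(compact_width(100, redundancy=3.0, lower_bound=0.5))
```

Keeping the bound as an explicit argument makes it easy to test how aggressive a clamp a given architecture tolerates before accuracy degrades.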
Algorithm 1 presents our MBS algorithm, which estimates the scaling factor for each macroblock and derives the compact channel width . The MBS scaling algorithm takes the pretrained model with convolution layers and the training images as input. The convolution results of the pretrained model for the training images are utilized for estimating the scaling factors. The inner loop from steps to collects nonzero statistics of the ReLU outputs . The steps in the outer loop after the inner loop (steps and ) take an average over training instances, and then derive the effective flop for each convolution layer .
The macroblock process starts from step . The MBS algorithm first tallies the total flops for each macroblock (steps to ). MBS then computes the base flops and redundancy ratio for macroblock (steps to ). The scaling factor is derived from redundancy ratio in step . After has been computed, MBS estimates the compact channel width for each macroblock in step .
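The statistics-collection part of the algorithm described above (an outer loop over convolution layers, an inner loop over training images, then an average) can be sketched as follows. This is a simplified stand-in rather than Algorithm 1 itself: ReLU outputs are represented as plain flattened lists instead of tensors captured during a forward pass, and the names `nonzero_fraction`, `average_nonzero_probability`, and `relu_outputs` are hypothetical.

```python
def nonzero_fraction(values):
    """Fraction of ReLU outputs that are strictly positive."""
    return sum(1 for value in values if value > 0) / len(values)


def average_nonzero_probability(relu_outputs):
    """relu_outputs[i][n] is the flattened ReLU output of layer i for image n."""
    probabilities = []
    for per_image in relu_outputs:          # outer loop: convolution layers
        total = 0.0
        for values in per_image:            # inner loop: training images
            total += nonzero_fraction(values)
        probabilities.append(total / len(per_image))  # average over images
    return probabilities


# Two hypothetical layers observed on two hypothetical images.
relu_outputs = [
    [[0, 1, 2, 0], [1, 1, 0, 0]],
    [[0, 0, 0, 3], [0, 0, 1, 1]],
]
print(average_nonzero_probability(relu_outputs))
```

In a real pipeline these fractions would be gathered during a single inference pass over the training set, which is why the text counts this part only once per pretrained model.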
After Algorithm 1 outputs the new set of channel widths, the CNN is retrained with this set of new parameters to generate a more compact model . In the experimental section, we will evaluate the effectiveness of MBS by examining the performance (prediction accuracy and model-size reduction) achieved by over .
The pretrained model has layers with parameters and training images. The required complexity of MBS consists of two parts: the nonzero statistics collection (from steps to ) and the redundancy estimation (from steps to ). In the first part, we collect by running inference on the training images; the statement in step can be absorbed into the forward pass of the pretrained model. Hence, the computational complexity is . The second part traverses all the convolution layers of the pretrained model to compute the effective flops. The complexity of the second part is in the pretrained model . The wall-clock time of the first part is usually minutes on a PC with NVIDIA Ti for each pretrained model on ImageNet. Notice that we only have to conduct the first part once for each pretrained model. The wall-clock time of the second part is negligible: less than one second on the same PC.
Experiments
Model  Acc. [Diff.]  Saving 

ResNet    
MBS ()  []  
ResNet    
MBS ()  []  
ResNet    
MBS ()  []  
ResNet    
MBS ()  []  
ResNet    
MBS ()  []  
ResNet    
MBS ()  []  
ResNet    
[Li et al.2017] A  []  
[Li et al.2017] B  []  
[Yu et al.2018] NISP  []  
MBS ()  [] 
We applied MBS to various CNNs using PyTorch on CIFAR and ImageNet to evaluate its effectiveness in model reduction. Our experiments aim to answer three main questions:
How aggressively can one reduce the size of a CNN model without significantly degrading prediction accuracy?

Can MBS work effectively with deep and already highly compact CNN models?

Can MBS outperform competing model-reduction schemes?
Model  Top [Diff.]  Top [Diff.]  Parameters ()  Saving  Configuration 

ResNet 
  
MBS ()  []  []  
ResNet    
MBS ()  []  []  
DenseNetBC    
MBS ()  []  [] 
Results on CIFAR10 Dataset
CIFAR consists of k training images and k testing images of classes. We follow the training settings in [Huang et al.2017]: batch size is , weight decay is , and learning rate is set to initially and divided by at the and of the total training epochs, respectively.
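The step schedule described above (an initial learning rate divided by a fixed factor at two fractions of the total epochs) can be written as a small function. The numeric settings are not legible in this copy, so every value in the example is a placeholder chosen for illustration, and `step_learning_rate` is a hypothetical helper rather than a PyTorch API.

```python
def step_learning_rate(epoch, total_epochs, initial_lr, milestones, factor):
    """Divide initial_lr by factor once per milestone fraction already reached."""
    passed = sum(1 for fraction in milestones if epoch >= fraction * total_epochs)
    return initial_lr / (factor ** passed)


# Placeholder settings, not the paper's: 8 epochs, drops at 25% and 50%.
for epoch in range(8):
    print(epoch, step_learning_rate(epoch, 8, 1.0, (0.25, 0.5), 2.0))
```

Expressing the milestones as fractions keeps the schedule unchanged when the total number of epochs is varied between datasets.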
Accuracy and Reduction Tradeoff on ResNet
We evaluated the effect of setting different receptive field size threshold on prediction accuracy on CIFAR with ResNet. The threshold is set to , which ranges from to with step size from the leftmost point to the rightmost point for each model in Figure 4.
Figure 4 shows two results. The x-axis depicts model reduction ratio from low to high, and the y-axis prediction accuracy. We first observe that on all ResNet models (ResNet, , , , and ), the greater the number of enhancement layers (i.e., MBS employing a smaller value; see the five values on each line from large on the left to small on the right), the better the model reduction ratio. Second, the tradeoff between model reduction and prediction accuracy is evident in all ResNet models, as expected.
The figure gives an application designer guidance for selecting the receptive field setting that fulfills the design goal. If accuracy outweighs model size, a larger is desirable (i.e., fewer enhancement layers). If model size is the primary concern for power conservation and frame-rate improvement (e.g., a video analysis requires fps), then the designer can select a small . For instance, on ResNet, achieves model reduction with loss in prediction accuracy.
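The selection procedure described above amounts to a lookup: among the measured settings, pick the one with the largest reduction whose accuracy loss stays within budget. The sketch below illustrates that lookup only. The table rows are invented placeholders and do not reproduce any result from Figure 4, and `pick_setting` is a hypothetical helper.

```python
def pick_setting(tradeoff, max_accuracy_drop):
    """Return the row with the largest reduction within the accuracy budget.

    tradeoff: list of (threshold, reduction_ratio, accuracy_drop) tuples.
    """
    feasible = [row for row in tradeoff if row[2] <= max_accuracy_drop]
    if not feasible:
        return None  # no measured setting satisfies the budget
    return max(feasible, key=lambda row: row[1])


# Invented placeholder rows: (threshold, reduction ratio, accuracy drop).
table = [
    (1.0, 0.10, 0.0),
    (0.8, 0.25, 0.2),
    (0.6, 0.40, 0.9),
]
print(pick_setting(table, max_accuracy_drop=0.5))
```

A designer who prioritizes model size would raise the accuracy budget and land on a smaller threshold, while a tight budget selects a larger one.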
MBS vs. Other Reduction Schemes
Table 2 compares the reduction achieved by MBS and some representative methods. The top half of the table lists our evaluation on all ResNet models. For instance, MBS reduces the model size of ResNet significantly () with negligible accuracy drop (). The bottom half of the table compares MBS with recently published methods with the best reduction ratios. At the same accuracy level, MBS achieves the highest reduction ratio ().
We also compared MBS with the naive scaling method used by ResNet. The scaling multiplies the entire model with the same scaling factor , whereas MBS adaptively sets the scaling factor by the information density of each macroblock. Figure 5 plots the range of from to with step size . MBS outperforms scaling in prediction accuracy under four model sizes.
Top [Diff.]  Top [Diff.]  Parameters ()  Saving  Configuration  

ResNet (Original)    
[Li et al.2017]  []      
[Yu et al.2018] NISPA  []        
[Yu et al.2018] NISPB  []        
MBS () 
[]  [] 
Model 
Top [Diff.]  Top [Diff.]  Parameters ()  Saving  Configuration 
ShuffleNet ()    
Proposed ()  []  []  
MobileNet    
Proposed ()  []  []  
Proposed ()  []  []  
MobileNet    
Proposed ()  []  []  
Proposed ()  []  []  

ImageNet Dataset
ImageNet has million training images and k images for the class validation. We trained all models (except for DenseNet, explained next) for epochs with batch size set to . The learning rate is initially set to and divided by at epochs and , respectively. For DenseNet, we trained its model for epochs with batch size and divided the learning rate by at epochs as suggested in [Huang et al.2017]. The data augmentation follows the ImageNet script of PyTorch, which is the same as ResNet [He et al.2016]. The weight decay is for the CNNs with standard convolution (e.g., ResNet and DenseNet). The CNNs with depthwise separable convolution (e.g., ShuffleNet and MobileNet) set the weight decay to according to the training configurations suggested in ShuffleNet [Zhang et al.2017].
Results of CNNs with Standard Convolution
Table 3 shows that MBS is flexible enough to work with different CNN designs, including very deep and complex CNN models such as ResNet and DenseNet. As shown in Table 2, MBS can work with different depth configurations of ResNet on the CIFAR10 dataset. Table 3 further shows consistent results on ImageNet. MBS achieves model reduction for ResNet, while maintaining the same prediction accuracy. On the highly optimized deep model DenseNet (a version of DenseNet-BC defined in [Huang et al.2017]), whose bottleneck modules and transition layers are already highly compressed by , MBS can still achieve additional model reduction with negligible accuracy loss.
To compare with prior works more thoroughly, we also conducted experiments with ResNet. We divided the learning rate by at epoch and trained the reduced ResNet for additional epochs as a simplified fine-tuning process. Table 4 shows that MBS is slightly better than the state of the art on ResNet (by at the same accuracy level).
Results of CNNs with Depthwise Convolution
We applied MBS to two CNN models with depthwise convolution structures, ShuffleNet and MobileNet. The depthwise convolution structure already reduces CNN model size significantly. Table 5 shows that MBS can further reduce these highly compact models. On ShuffleNet, MBS reduces the model size by an additional with negligible distortion. The depthwise convolution and the unique shuffling operation of ShuffleNet would increase the difficulty of formulating the objective function for the optimization-based methods. In contrast, MBS can simply estimate the channel-scaling factor for each CNN macroblock and perform model reduction.
We also evaluated MBS with MobileNet at different input image resolutions. Table 5 shows that MBS achieves and reduction on and , respectively. Notice that when we set , the prediction accuracy of MobileNet improves slightly. This result suggests that a smaller threshold value of may be possible for MobileNet. Hence, we applied a slightly more aggressive setting of , which achieved a model-size reduction.
Conclusion
We proposed a novel method to estimate the channel-scaling factor for each CNN macroblock. Our proposed MBS algorithm reduces model size guided by an information density surrogate without significantly degrading class-prediction accuracy. MBS is flexible in that it can work with various CNN models (e.g., ResNet, DenseNet, ShuffleNet and MobileNet), and is also scalable in its ability to work with ultra-deep and highly compact CNN models (e.g., ResNet). MBS outperforms all recently proposed methods to reduce model size at low computational complexity. With an adjustable receptive field parameter, an application designer can determine a proper tradeoff between prediction accuracy and model size (implying DRAM size and power consumption) by looking up a tradeoff table similar to the one presented in Figure 4.
References
 [Aghasi et al.2017] Aghasi, A.; Abdi, A.; Nguyen, N.; and Romberg, J. 2017. Nettrim: Convex pruning of deep neural networks with performance guarantee. In NIPS. 3180–3189.
 [Courbariaux, Bengio, and David2015] Courbariaux, M.; Bengio, Y.; and David, J.P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS. 3123–3131.
 [Denton et al.2014] Denton, E.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, NIPS’14, 1269–1277. Cambridge, MA, USA: MIT Press.
 [Figurnov et al.2017] Figurnov, M.; Collins, M. D.; Zhu, Y.; Zhang, L.; Huang, J.; Vetrov, D.; and Salakhutdinov, R. 2017. Spatially adaptive computation time for residual networks. In IEEE CVPR.
 [Han et al.2016] Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M. A.; and Dally, W. J. 2016. Eie: Efficient inference engine on compressed deep neural network. ISCA.
 [Han, Mao, and Dally2016] Han, S.; Mao, H.; and Dally, W. J. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR.
 [Hassibi and Stork1993] Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 164–171. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE CVPR.
 [He, Zhang, and Sun2017] He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In IEEE ICCV.
 [Howard et al.2017] Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
 [Huang et al.2017] Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
 [Hubara et al.2016] Hubara, I.; Courbariaux, M.; Soudry, D.; ElYaniv, R.; and Bengio, Y. 2016. Binarized neural networks. In NIPS. 4107–4115.
 [Jaderberg, Vedaldi, and Zisserman2014] Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Speeding up convolutional neural networks with low rank expansions. In BMVC. BMVA Press.
 [Karmarkar1984] Karmarkar, N. 1984. A new polynomialtime algorithm for linear programming. Combinatorica 4(4):373–395.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, NIPS’12, 1097–1105.
 [Li et al.2017] Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2017. Pruning filters for efficient convnets. ICLR.
 [Lin, Zhao, and Pan2017] Lin, X.; Zhao, C.; and Pan, W. 2017. Towards accurate binary convolutional neural network. In NIPS. 344–352.
 [Liu et al.2015] Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; and Pensky, M. 2015. Sparse convolutional neural networks. In IEEE CVPR.
 [Liu et al.2017] Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; and Zhang, C. 2017. Learning efficient convolutional networks through network slimming. In IEEE ICCV.
 [Luo et al.2016] Luo, W.; Li, Y.; Urtasun, R.; and Zemel, R. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., NIPS. Curran Associates, Inc. 4898–4906.
 [Rastegari et al.2016] Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. Xnornet: Imagenet classification using binary convolutional neural networks. In ECCV 2016, 525–542.

 [Roffo, Melzi, and Cristani2015] Roffo, G.; Melzi, S.; and Cristani, M. 2015. Infinite feature selection. In IEEE ICCV 2015, 4202–4210.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
 [Sze et al.2017] Sze, V.; Chen, Y. H.; Yang, T. J.; and Emer, J. S. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105(12):2295–2329.
 [Tang, Hua, and Wang2017] Tang, W.; Hua, G.; and Wang, L. 2017. How to train a compact binary neural network with high accuracy? In AAAI.
 [Teerapittayanon, McDanel, and Kung2016] Teerapittayanon, S.; McDanel, B.; and Kung, H. T. 2016. Branchynet: Fast inference via early exiting from deep neural networks. In ICPR.
 [Wen et al.2016] Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In NIPS. 2074–2082.
 [Wu et al.2018] Wu, Z.; Nagarajan, T.; Kumar, A.; Rennie, S.; Davis, L. S.; Grauman, K.; and Feris, R. 2018. Blockdrop: Dynamic inference paths in residual networks. In IEEE CVPR.
 [Yu et al.2018] Yu, R.; Li, A.; Chen, C.F.; Lai, J.H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.Y.; and Davis, L. S. 2018. Nisp: Pruning networks using neuron importance score propagation. In IEEE CVPR.
 [Zhang et al.2017] Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2017. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR abs/1707.01083.