I Introduction
Since the breakthrough performance demonstrated by convolutional neural networks (CNNs) on ImageNet, deep architectures have been successfully applied to a number of areas such as speech recognition, object tracking, and image classification. As the width and depth of a CNN are increased to improve prediction accuracy, the model complexity and training time increase as well. Whereas model training can be sped up by employing a large number of GPUs, inference on mobile and wearable devices (e.g., mobile VR) faces the resource limitations of memory, power, and computation. In this work, we utilize information flow analysis to perform CNN model reduction while preserving prediction accuracy.
Traditionally, a complex CNN is simplified for embedded systems by using the teacher-student model [1, 2]. Such simplification demonstrates that important properties of a CNN can be preserved when its model complexity is reduced. However, almost all model-reduction approaches treat a CNN as a black box and simply compress the model parameters obtained by the training process. Much effort has recently been devoted to opening the black box to better understand and interpret CNNs. The work of [3] proposes the use of information theory to analyze the internal behaviors of a deep architecture. Inspired by this approach, our work incorporates information density to conduct model reduction. Our information-based approach works orthogonally to the traditional black-box approach (discussed further in Section II) and can achieve additional compaction while preserving prediction accuracy.
We first conduct lesion studies to probe the dynamic nature of the network robustness in CNNs with the information density consideration. We define a convolution macroblock as the set of convolution layers whose output feature maps have the same height and width. Our lesion studies provide important clues for us to construct a hypothetical information flow structure, which in turn guides the design of an effective model-reduction algorithm. The hypothetical framework formulates a CNN as an information pipeline consisting of cascaded convolution macroblocks. Our aim is to identify the fewest channel numbers (the input to a convolution layer is a set of input tensors, each of which is called a channel [13]) with sufficient information flow between macroblocks so as to reduce their associated parameters. We propose our backward reduction algorithm, named BRIEF, which incorporates the hypothetical information flow structure to achieve our twin design goals of model reduction and accuracy preservation. The contributions of this work can be summarized as follows:

Significant reduction results. With the aid of BRIEF, we are able to compress MobileNet to a model size that is smaller than that of SqueezeNet while achieving higher prediction accuracy. We also achieve a model reduction on ResNet with ImageNet that is better than the state-of-the-art approach presented in [7]. Even in the case of SqueezeNet, a highly optimized model, we can achieve additional reduction with negligible prediction degradation.

Further model compaction. Convolution-layer dominant models are inherently compact and have become the default choices for mobile devices. Our proposed framework can further shrink these highly optimized CNN models (e.g., MobileNet [4]). Most traditional compaction works regard the CNN as a black box and perform either parameter elimination or compression within a limited range. Our method utilizes the distribution trends of the information density and the dynamic nature of CNNs, which works orthogonally to the black-box approach. Therefore, we can reduce the model size of CNNs further.
To the best of our knowledge, this is the first work to achieve significant reduction on already highly compact CNN models using information flow analysis. The remainder of this paper is organized into four sections. Section II describes related work in CNN models and model-reduction techniques. Section III explains our hypothetical information flow structure and presents our proposed backward reduction algorithm. Section IV details experiments. We offer our concluding remarks in Section V.
II Related Work
The rapid increase in the number of parameters from the cascaded fully connected (FC) layers is the main reason for excessive CNN model size. Recently developed CNN models (e.g., ResNet [5], DenseNet [8], and MobileNet [4]) show that with the aid of improved structural design, the cascaded FC-layer design is not mandatory for achieving high prediction accuracy. These improved designs include, but are not limited to, batch normalization [9] and the bottleneck structure [10]. Though these new models without many cascaded FC layers are highly compact, we propose BRIEF from the information density perspective to further compact them. This section first reviews existing CNN models (Section II-A) and then traditional black-box model-reduction approaches (Section II-B). In comparison with the traditional methods, our BRIEF is a coarse-grained model compaction technique applicable to both FC-layer dominant and Convolution-layer dominant CNN models.
II-A Categories of CNNs
A CNN typically consists of convolutional layers, pooling layers, and FC layers. Based on the distribution of the model parameters in a CNN, we can categorize it as either FC-layer dominant or Convolution-layer dominant.

FC-layer dominant models. A model is considered FC-layer dominant when the parameters of its FC layers comprise the dominant share of the total parameters in the CNN. The well-known AlexNet [11] and VGGNet [12] models are examples of FC-layer dominant CNNs. CNNs in this category usually possess huge model footprints that are mainly contributed by the parameters of the FC layers; in VGG, the FC-layer parameters account for the vast majority of the total storage.

Convolution-layer dominant models. The recent trend of CNNs replaces the cascaded FC layers with a global average pooling layer. The parameters of the FC layers then comprise only a small fraction of the parameters in the whole model. CNNs in this category are compact models with high prediction accuracy. For example, MobileNet takes up only a small amount of storage but possesses VGG-level accuracy.
II-B Model Reduction Techniques
Based on the level of involved structures, model reduction techniques can be classified into two categories: fine-grained and coarse-grained [13].
Fine-grained approaches. Various reduction techniques are applied at the filter/kernel level, including sparse convolution [14] and deep compression [15]. The irregular structures introduced at the filter/kernel level often require special hardware acceleration [16]. Replacing the cascaded FC layers with an average pooling layer introduces negligible performance degradation, and FC-layer pruning [17] usually achieves significant reduction since FC layers are sparse in nature. Binary filter/kernel approximation methods are also popular approaches to model reduction [18, 19, 20, 21]. However, retaining the prediction accuracy of a binary CNN is a challenging issue, which is addressed and relaxed by using multiple binary representations [22, 23].

Coarse-grained approaches. The coarse-grained techniques conduct filter pruning [7, 24, 25], which removes irrelevant filters/kernels from the model. The filter importance is determined by the filter weight magnitude [7] or derived from the similarity among inter-layer feature maps [25]. The work of [24] integrates an additional scaling factor for each filter output during the training process, where the magnitude of the factor reflects the filter importance. Conducting pruning at the filter level frees us from designing kernel-specific hardware/software solutions.
III Method
In this section, we study the information flow of CNNs using the data processing inequality of information theory [27]. We investigate the relationship between network robustness and the information density of CNNs through our designed lesion studies, which reveal the dynamic nature of the network robustness of CNNs in dealing with the distortions caused by channel removal. We utilize the insights observed from our lesion studies to later (in Section III-D) propose our backward model-reduction algorithm.
III-A Descending Trend of Information Density
The data processing inequality (DPI) theorem [27] shows that for variables X, Y, and Z forming a Markov chain X → Y → Z, we have

I(X; Y) ≥ I(X; Z),   (1)

where I(X; Y) represents the mutual information between variables X and Y.
Let us consider a CNN as an information pipeline whose input is images and whose processing operations are convolution and filtering. Based on the DPI theorem, the information density in the CNN information pipeline must exhibit a decreasing trend. This decreasing trend can also explain the phenomenon observed by [13] that the activation outputs of the latter layers of a CNN are increasingly sparse. Figure 1 illustrates our theoretically justified conjecture that the latter macroblocks (defined in Section I) of a CNN contain lower information density than the earlier ones. This conjecture establishes the foundation of our model-reduction scheme.
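The inequality in Eq. (1) can be checked numerically on a toy discrete Markov chain. The sketch below (pure Python; the binary symmetric channels are hypothetical stand-ins for the processing stages of the pipeline) computes the mutual information terms directly from the joint distributions:

```python
import itertools
import math

def mutual_information(joint):
    """I(A;B) in bits, from a dict {(a, b): p} of joint probabilities."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def noisy_channel(p_flip):
    """Binary symmetric channel: transition probabilities {(in, out): p}."""
    return {(0, 0): 1 - p_flip, (0, 1): p_flip,
            (1, 0): p_flip, (1, 1): 1 - p_flip}

# Markov chain X -> Y -> Z: Y is a noisy copy of X, Z a noisy copy of Y.
px = {0: 0.5, 1: 0.5}
ch1 = noisy_channel(0.1)  # X -> Y
ch2 = noisy_channel(0.1)  # Y -> Z

joint_xy = {(x, y): px[x] * ch1[(x, y)]
            for x, y in itertools.product((0, 1), repeat=2)}
joint_xz = {}
for x, y, z in itertools.product((0, 1), repeat=3):
    p = px[x] * ch1[(x, y)] * ch2[(y, z)]
    joint_xz[(x, z)] = joint_xz.get((x, z), 0.0) + p

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)
assert i_xy >= i_xz  # DPI: information can only decrease along the chain
```

Each extra processing stage strictly lowers the mutual information with the input here, which is the trend the pipeline view of a CNN inherits from the DPI.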
III-B Lesion Study: Definitions and Setup
How can the trend of decreasing information density help reduce the model size of a CNN? This section conducts lesion studies to shed light on this question. Our lesion study consists of three components: the CIFAR-10 dataset, a sequential CNN, and our designed one-hot lesion:
CIFAR-10 Dataset. CIFAR-10 consists of natural 32×32 images in 10 classes, divided into 50k images for training and 10k images for testing. CIFAR-10 is not too large (convenient for us to run many experiments) and contains sufficient natural images to reflect CNN behaviors. The network on CIFAR-10 was trained using SGD with weight decay, momentum, and mini-batches. Following the work of [5], we set an initial learning rate and divided it by a fixed factor at two points of the total epochs.
[Figure caption: Each point represents a lesion experiment of a one-hot lesion vector; the dashed black lines represent the original prediction accuracies in (a) and (b).]

Sequential CNN. The sequential CNN consists of a VGG-like structure where the cascaded FC layers are replaced by a global average pooling layer and an FC layer in order to simulate a Convolution-layer dominant CNN. This model serves as our reference model because of its simplicity and representative structures. In particular, the "downsampling then increasing the channel number" design is widely adopted in modern compact CNN models (e.g., MobileNet [4]) to balance the tradeoff between model complexity (e.g., size) and prediction accuracy.
Figure 2 shows an example of sequential CNNs with N convolution layers. The notation c_i (1 ≤ i ≤ N) represents the number of output channels of convolution layer i. We define c_0 as the channel number of the input images and use the following vector to represent the channels in the sequential CNN:

C = (c_1, c_2, …, c_N).   (2)

The channels that have the same size of output feature maps are grouped into the same convolution macroblock (c_0 is not an output channel number and is excluded). The configuration of the macroblocks of a sequential CNN model is denoted by grouping the entries of C by macroblock, as for the model in Figure 2.
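The grouping of a channel vector into convolution macroblocks can be sketched as follows; the channel widths and downsampling positions below are hypothetical, chosen to mimic a VGG-like sequential CNN:

```python
def group_macroblocks(channels, downsample_after):
    """Group a channel vector (c1..cN) into convolution macroblocks.

    Layers whose output feature maps share the same height/width fall into
    the same macroblock; a downsampling step starts a new one.
    `downsample_after` holds the (1-based) layer indices that are followed
    by a stride-2 downsampling operation.
    """
    blocks, current = [], []
    for i, c in enumerate(channels, start=1):
        current.append(c)
        if i in downsample_after:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Hypothetical VGG-like widths: channels double after each downsample.
channels = [64, 64, 128, 128, 256, 256, 512, 512]
blocks = group_macroblocks(channels, downsample_after={2, 4, 6})
# -> [[64, 64], [128, 128], [256, 256], [512, 512]]
```

Each inner list is one macroblock; the lesion studies below operate on single entries (channel level) or whole inner lists (macroblock level).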
One-hot Lesion. We analyze the alteration in the prediction accuracy of our sequential CNN models under constant and proportional one-hot lesions, described as follows:

Constant one-hot lesion. For each individual lesion experiment, we set the channel number of only one selected layer to a small constant and keep the other channel numbers unchanged. We denote this constant one-hot lesion by the modified channel vector.

Proportional one-hot lesion. For each individual lesion experiment, we scale the channel number of only one selected layer by a tunable fraction γ and keep the other channel numbers unchanged. We denote this proportional one-hot lesion by the modified channel vector.
Lesion        Channel Width
Constant      (one layer kept at a small constant width; others unchanged)
Proportional  (one layer scaled by γ; others unchanged)
Table I illustrates an example of a one-hot lesion for a sequential CNN. We first conduct one-hot lesions at the channel level (between convolution layers) to investigate the prediction accuracy behaviors of a sequential CNN model. We then investigate the rate-distortion (RD) performance at the macroblock level, which is an important criterion in lossy compression from the perspective of information theory.
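The two lesion operators can be sketched as follows; the constant width of 1 and the default fraction γ = 1/8 are illustrative assumptions, not necessarily the paper's exact settings:

```python
def constant_one_hot(channels, index, delta=1):
    """Keep only `delta` channels at layer `index` (0-based); others unchanged."""
    lesioned = list(channels)
    lesioned[index] = delta
    return lesioned

def proportional_one_hot(channels, index, gamma=0.125):
    """Keep a fraction `gamma` of the channels at layer `index` (at least 1)."""
    lesioned = list(channels)
    lesioned[index] = max(1, int(round(channels[index] * gamma)))
    return lesioned

channels = [64, 128, 256, 512]
constant_one_hot(channels, 3)      # -> [64, 128, 256, 1]
proportional_one_hot(channels, 3)  # -> [64, 128, 256, 64]
```

Note how the constant lesion removes far more absolute width from a late, wide layer than from an early, narrow one, while the proportional lesion removes the same percentage everywhere; this is exactly the asymmetry the two studies are designed to separate.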
III-C Lesion Study: Analysis
Our experiments aim to find the relationship between model size and model prediction accuracy at different levels: first at the channel level and then at the macroblock level.
In the constant one-hot lesion at the channel level, we expected that keeping only a constant channel number in the latter macroblocks would cause more severe degradation in prediction accuracy. This expectation is based on the fact that the channels in the latter macroblocks are much wider, so a constant one-hot lesion takes more information away from them (e.g., keeping one channel out of hundreds) than from the earlier ones (e.g., keeping one channel out of dozens). Figure 3 plots the result of the constant one-hot lesion at the channel level. The x-axis denotes the selected channel index, and the y-axis shows the corresponding prediction accuracy of the modified CNN on CIFAR-10. Surprisingly, the last macroblock enjoys an unexpected accuracy bounce: the accuracy achieved by removing more channels from the last macroblock is higher than the accuracy achieved by removing fewer channels from the preceding one.
Figure 4 plots the experimental results of the proportional one-hot lesion. Why did we perform this study? The constant one-hot lesion keeps only a constant number of channels alive in each macroblock. One may argue that such an extreme study may have more severely penalized the later macroblocks, with a larger number of channels being removed. Thus, we conducted the proportional one-hot lesion to keep alive the same percentage of channels in all macroblocks. In the first experiment we set the alive channels to one sixteenth, and in the second experiment to one eighth. Figures 4(a) and 4(b) show that when one sixteenth of the channels are kept, the bounce behavior on the last macroblock still appears. Since we increase the channel number of the proportional one-hot lesion, the prediction accuracy gap between the last two macroblocks is reduced. When we further increase the alive channels to one eighth, Figures 4(c) and 4(d) show that the distortion of all one-hot lesions is limited (i.e., a small accuracy drop). The bounce behavior still exists, though it is less significant. Figure 4(d) illustrates that the last macroblock has the highest average prediction accuracy among all macroblocks.
These two lesion studies demonstrate that the last macroblock provides better accuracy recovery than the preceding macroblock, even though we removed many more channels from it in both the constant and proportional one-hot lesion studies. Two observations can be made. First, CNNs are robust in recovering from information loss. Second, the channels in the latter macroblocks seem to have higher degrees of information redundancy.
We next extend the proportional one-hot lesion to the macroblock level to investigate the relation between model size and prediction accuracy. A macroblock-level reduction scales all channel numbers in one macroblock by a factor γ (0 < γ < 1) and keeps the channels in the other macroblocks intact.
Figure 5 shows the RD performance of each macroblock under this reduction for several values of γ. The x-axis depicts the model size of the reduced CNN model, from small on the left-hand side to large on the right-hand side. The y-axis depicts the corresponding prediction accuracy of the modified CNN trained on CIFAR-10. Figure 5 shows that the last macroblock (in gray) enjoys the best RD performance (by maintaining high prediction accuracy) among the macroblocks. In addition, the last macroblock enjoys the best compaction, since the model size is significantly reduced by compressing it as compared with compacting the other macroblocks.
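A macroblock-level reduction, together with a parameter-count proxy for model size, might be sketched as below (a minimal sketch assuming 3×3 convolutions, ignoring biases, and using a hypothetical channel configuration):

```python
def reduce_macroblock(blocks, j, gamma):
    """Scale every channel width in macroblock j by gamma; keep the rest intact."""
    reduced = [list(b) for b in blocks]
    reduced[j] = [max(1, int(round(c * gamma))) for c in reduced[j]]
    return reduced

def conv_params(blocks, c_in=3, k=3):
    """Weight count of kxk convolutions for a channel configuration
    (biases ignored) -- a simple proxy for model size."""
    widths = [c_in] + [c for b in blocks for c in b]
    return sum(k * k * a * b for a, b in zip(widths, widths[1:]))

blocks = [[64, 64], [128, 128], [256, 256], [512, 512]]
full = conv_params(blocks)
half_last = conv_params(reduce_macroblock(blocks, 3, 0.5))
# Halving only the last macroblock already removes roughly half the weights,
# because the widest layers dominate the parameter count.
assert half_last < full
```

Sweeping γ for each macroblock and plotting (conv_params, accuracy) pairs yields per-macroblock RD curves of the kind shown in Figure 5.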
These lesion studies lead to our following hypotheses:

The latter macroblock contains lower information density. We can observe from our lesion studies that the latter macroblock has lower information density: the last macroblock has better prediction accuracy even after we have removed more channels from it than from the preceding macroblock.

Network distortion is based on information density. The actual network distortion correlates with the information density rather than simply the number of removed channels. Therefore, the network demonstrates strong capabilities to recover from the distortions when we remove the channels with low information density.
III-D Backward Reduction for Model Reduction
Figure 6 describes the key idea of our proposed backward reduction method. The downsampling layer plays the role of an information gateway through which information (some of which may be redundant) passes to achieve prediction accuracy. We can reduce the information flow by decreasing the channel number of the preceding convolution macroblock as long as sufficient information is maintained for achieving the target accuracy. While applying this idea recursively, we must ensure that the information remains sufficient for the classification task in the FC layer. Since the classification is performed at the end of a CNN pipeline, our model reduction process starts from the last convolution macroblock. Another reason to perform backward instead of forward reduction is that our lesion studies reveal that the latter macroblocks contain lower information density. We will shortly justify this backward choice by presenting experimental results.
Algorithm 1 presents our proposed backward reduction algorithm (named BRIEF), based on the information flow structure conjectured from a combination of information flow density and the dynamic nature of CNNs.
The model reduction problem of a CNN can be formulated as follows:

minimize R(C) subject to D(C) ≤ ε,   (3)

where R(C) represents the bit rate of the model for the optimized channel-width configuration C, D(C) denotes the prediction accuracy degradation as compared with the original performance, and ε is the acceptable performance-loss budget.

The model reduction formulation of Eq. (3) encapsulates an integer programming problem (i.e., the channel numbers c_i must be positive integers), which makes the optimization NP-hard. Furthermore, no closed-form distortion evaluation function exists, since the CNN is usually considered to be a black box. Fortunately, with the aid of the conjectured information flow structure, we can reduce the original NP-hard problem to a one-dimensional greedy search problem. We search for the proper scaling factor γ starting from the last convolution macroblock:

γ* = arg min_γ { R(γ) : D(γ) ≤ ε }.   (4)

That is, we optimize the value of the scaling factor γ for the last convolution macroblock within the distortion budget ε. We then iterate through the same greedy search process for the preceding convolution macroblock.
Our algorithm optimizes the channel-width multiplier γ for each convolution macroblock, from the last down to the first. The initial setting of the lower bound on γ is based on the observation by MobileNet [4] that there is a significant drop in prediction accuracy when the channel multiplier becomes too small. The default distortion budget ε in our algorithm is set to a small accuracy loss.
BRIEF retrains the CNN from scratch to investigate the prediction accuracy of the current configuration, since our lesion studies have demonstrated network robustness against channel removal. The retraining process can be relaxed by using fewer training epochs as long as the training setting can retain the original CNN prediction accuracy. Once the accuracy distortion is acquired, we adjust the upper bound and the lower bound of γ accordingly using a greedy binary search method.
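A simplified sketch of the backward greedy search might look as follows. Here `evaluate`, the budget `epsilon`, the lower bound `gamma_min`, and the search tolerance are stand-ins for the retraining-based procedure of Algorithm 1, not its exact settings:

```python
def brief_backward_reduction(blocks, evaluate, epsilon=0.01,
                             gamma_min=0.25, tol=1 / 16):
    """Greedy backward reduction (a simplified sketch of Algorithm 1).

    `evaluate(blocks)` stands in for retraining the candidate model and
    returning its accuracy drop relative to the original network. For each
    macroblock, from last to first, a binary search finds the smallest width
    multiplier gamma whose accuracy drop stays within the budget `epsilon`.
    """
    current = [list(b) for b in blocks]
    for j in range(len(current) - 1, -1, -1):
        lo, hi, best = gamma_min, 1.0, 1.0
        while hi - lo > tol:
            gamma = (lo + hi) / 2
            candidate = [list(b) for b in current]
            candidate[j] = [max(1, int(round(c * gamma))) for c in blocks[j]]
            if evaluate(candidate) <= epsilon:
                best, hi = gamma, gamma  # within budget: try smaller widths
            else:
                lo = gamma               # too much distortion: back off
        current[j] = [max(1, int(round(c * best))) for c in blocks[j]]
    return current

# Toy stand-in for retrain-and-validate: pretend only the first macroblock
# carries irreplaceable information, mimicking the lesion-study finding that
# latter macroblocks have low information density.
original = [[64, 64], [128, 128], [256, 256], [512, 512]]
def toy_evaluate(blocks):
    return 0.0 if blocks[0] == original[0] else 0.5

reduced = brief_backward_reduction(original, toy_evaluate, epsilon=0.01)
# The first macroblock stays untouched; the later ones shrink toward gamma_min.
```

In the real algorithm, each `evaluate` call is a (shortened) retraining run, which is why the search is kept one-dimensional per macroblock rather than joint over all widths.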
Earlier we mentioned that, to ensure that sufficient information is maintained for the classification task performed at the last layer of a CNN, we propose performing backward reduction. To confirm that this backward heuristic is accurate, we conducted experiments comparing the effectiveness of backward versus forward reduction. Table II shows that backward reduction outperforms forward reduction significantly in size at a similar prediction accuracy. As a result, the backward reduction algorithm acts to remove the channels with low information density, so the network can recover the information loss in the latter macroblocks when there is sufficient information provided by the earlier macroblocks. In contrast, the forward reduction approach removes information starting from the input source, which distorts the original information and eliminates the network's ability to recover from such distortion.

Model      Acc. (%) [Diff.]   Size (MB)   Saving (%)
Original
Forward
Backward
IV Experimental Results
This section reports experimental results with BRIEF on the ImageNet dataset for various CNN models. We conducted BRIEF's backward reduction algorithm on the last two convolution macroblocks and implemented BRIEF with PyTorch. Our evaluation metric is model size. (We care that model compaction does not significantly degrade prediction accuracy, but we do not compare accuracy between different CNN models.) To realistically measure model size, we use the actual required storage (including headers) of a trained model (i.e., state_dict in PyTorch).
IV-A ImageNet Dataset
ImageNet consists of 1.28M training images and 50k validation images, divided into 1,000 classes. We trained our models for a fixed number of epochs; the learning rate, set to an initial value, is divided by a fixed factor at two scheduled epochs. Simple data augmentation was adopted based on the ImageNet script by PyTorch, which is the same as [5]. The single-center-crop validation accuracy is reported for the CNN models.
To reduce the computational effort of BRIEF's greedy backward reduction process on ImageNet, we empirically adopted a shorter training setting with fewer epochs and a correspondingly compressed learning-rate schedule. We used this shorter setting in Algorithm 1 to quickly find the best γ, and eventually used this γ to train the final model under the original full-epoch training setting, on which the final performance is evaluated.
To determine the number of training epochs for the shorter setting, we designed and performed experiments in which we trained several CNN models for different numbers of training epochs while fixing the other hyperparameters. We observed that 1) there is considerable discrepancy between the shortest setting and the full-epoch setting, 2) an intermediate setting best reflects the results of the full-epoch setting, and 3) the intermediate and full-epoch settings yield basically the same results. Based on these three observations, we selected the intermediate setting as our shorter training setting.
ResNet            Top (%)   Param.   Saving
Baseline of [7]
[7]
Baseline
Proposed
Model        Top (%) [Diff.]   Param.   Size (MB)   Saving (%)   Config.
ResNet
Proposed
ResNet
Proposed
MobileNet
Proposed
MobileNet
Proposed
MobileNet
Proposed
SqueezeNet
Proposed
IV-B Evaluation of Various CNN Models
Table III reports that BRIEF significantly outperforms the state-of-the-art approach [7] on ResNet, a Convolution-layer dominant CNN model: BRIEF reduces the model more than the previous work does at a similar top accuracy. Our algorithm removes channels aggressively, even those with non-zero weighting parameters, as long as the prediction accuracy is maintained. Therefore, we can explore additional regions that are ignored by the traditional approaches. This result confirms the great potential of model reduction from the perspective of information flow.
We evaluated our method on a variety of Convolution-layer dominant CNN models (e.g., ResNet, MobileNet, and SqueezeNet), since these compact models are more challenging targets. MobileNet is a tough reduction target because it already provides a width-multiplier technique that adjusts all the channel numbers by the same scaling factor. We also evaluated the proposed backward reduction method on further scaling-reduced MobileNets. We report the scaling factors for the final two macroblocks of SqueezeNet, which has an irregular structure, and the actual channel numbers for the CNNs with regular structures.
Table IV presents the performance of BRIEF on the various Convolution-layer dominant CNN models. The configuration column of Table IV shows the architecture of each convolution macroblock before and after the proposed backward reduction algorithm. The accuracies of the CNN models in Table IV are evaluated using the full-epoch training setting. We achieved a substantial model size reduction on MobileNet with only a marginal accuracy loss.

Notice that the model reduction ratio decreases as we apply BRIEF to the further-optimized MobileNets. Even for the smallest scaling-reduced MobileNet, we still accomplished a model size reduction, resulting in a storage size of only a few megabytes. The model size of the reduced MobileNet is already smaller than that of SqueezeNet, while its prediction accuracy still outperforms SqueezeNet. For the highly optimized SqueezeNet, we achieved a model reduction while the accuracy loss remains marginal.
IV-C Rate-Distortion Behavior
We compared our backward reduction algorithm with the traditional scaling approach that multiplies the channel numbers of the whole CNN by the same scaling factor γ. This experiment uses the sequential CNN on CIFAR-10. Figure 7 shows the RD curves of the scaling-only approach and of scaling combined with our backward reduction algorithm, with the scaling factor γ swept over a range of values. The x-axis represents the required bit rate (KB), and the y-axis denotes the prediction accuracy. For the vast majority of models, scaling with BRIEF further improves upon the simple scaling approach in terms of RD performance.
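The size side of this comparison can be sketched by counting convolution parameters under the two schemes (hypothetical widths, 3×3 kernels, biases ignored); the accuracy side, of course, requires training:

```python
def param_count(widths, c_in=3, k=3):
    """Weight count of kxk convolutions for a flat list of layer widths."""
    ws = [c_in] + list(widths)
    return sum(k * k * a * b for a, b in zip(ws, ws[1:]))

base = [64, 64, 128, 128, 256, 256, 512, 512]

def uniform(widths, gamma):
    """Traditional scaling: every layer multiplied by the same factor gamma."""
    return [max(1, int(round(c * gamma))) for c in widths]

def last_block_only(widths, gamma, block=2):
    """Backward-style scaling: shrink only the final macroblock (here the
    last `block` layers), where the lesion studies suggest the information
    density is lowest."""
    head, tail = widths[:-block], widths[-block:]
    return head + [max(1, int(round(c * gamma))) for c in tail]

p_full = param_count(base)
p_uniform = param_count(uniform(base, 0.5))
p_backward = param_count(last_block_only(base, 0.5))
# Shrinking only the last macroblock already removes about half the weights,
# while touching far fewer layers than uniform scaling does.
assert p_backward < p_full
```

Combining both, i.e., uniform scaling followed by extra backward reduction of the final macroblocks, traces the BRIEF RD curve of Figure 7: at a given bit rate, the retained width budget is concentrated in the information-dense early macroblocks.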
V Conclusion
The conjectured information flow structure describes CNNs as dynamic structures rather than static ones. As long as sufficient information is provided by the prior convolution macroblocks, we can safely remove the redundant channels and still maintain the same level of accuracy. Therefore, we can explore additional regions of CNNs that were not reachable with the traditional filter-magnitude-based approaches. Using our backward reduction algorithm BRIEF, we reduced ResNet significantly, attaining better reduction than previous approaches. We also reduced MobileNet to a model smaller than SqueezeNet while achieving slightly higher prediction accuracy. The capability of backward reduction was validated on highly compact CNNs, including SqueezeNet and MobileNet.
Our method is applied at the macroblock level, where the greedy algorithm may impose a computational burden in searching for the proper channel number for each macroblock. In future work, we plan to derive the minimum required channel number directly with the aid of further information flow analysis. The proposed greedy algorithm demonstrates the potential of model reduction by leveraging information flow analysis, and we expect this work to act as a stepping stone towards opening the black box of CNNs to establish more efficient and compact CNN models. Our macroblock-level model reduction method complements the reduction approaches conducted at the fine-grained level (e.g., sparse filter connections), which will achieve a compound improvement in model compression.
References
 [1] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” International Conference on Learning Representations (ICLR), 2015.
 [2] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Deep Learning and Representation Learning Workshop, NIPS, 2014.
 [3] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” CoRR, vol. abs/1703.00810, 2017.
 [4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.

 [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [6] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016.
 [7] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” International Conference on Learning Representations (ICLR), 2017.
 [8] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, July 2017.

 [9] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), vol. 37, Lille, France: PMLR, Jul 2015, pp. 448–456.
 [10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems (NIPS), ser. NIPS’12, 2012, pp. 1097–1105.
 [12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
 [13] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec 2017.
 [14] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [15] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” International Conference on Learning Representations (ICLR), 2016.
 [16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” International Symposium on Computer Architecture (ISCA), 2016.
 [17] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, “Net-Trim: Convex pruning of deep neural networks with performance guarantee,” in International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 3180–3189.

 [18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in International Conference on Neural Information Processing Systems (NIPS), 2016, pp. 4107–4115.
 [19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2016, pp. 525–542.
 [20] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [21] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning (ICML), vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 1737–1746.

 [22] W. Tang, G. Hua, and L. Wang, “How to train a compact binary neural network with high accuracy?” in Association for the Advancement of Artificial Intelligence (AAAI), 2017.
 [23] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network,” in International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 344–352.
 [24] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [25] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [26] A. Veit, M. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in International Conference on Neural Information Processing Systems (NIPS), 2016, pp. 550–558.
 [27] T. M. Cover and J. A. Thomas, Elements of Information Theory. WileyInterscience, 2006.
 [28] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017.