1 Introduction
While convolutional neural networks (CNNs) [23] have demonstrated great success in various computer vision tasks, such as classification [13, 22], detection [39, 10] and semantic segmentation [5, 32], their large demands for computation power and memory footprint make most state-of-the-art CNNs notoriously challenging to deploy on resource-limited devices such as smartphones or wearable devices. Considerable research effort has been devoted to CNN compression and acceleration, including but not limited to parameter quantization [19, 53], filter compression [50, 34, 25] and automatic network pruning [51, 48, 8, 30, 52, 14].
Among these compression methods, automatic network pruning has recently attracted a lot of attention and has been successfully applied to state-of-the-art architectures such as EfficientNet [46], MnasNet [45] and MobileNetV3 [16]. Given a pre-trained network, a compression method aims to automatically shrink the channel numbers of the network until the resource budget is met, while maximizing accuracy. This task is also known as channel number search [8, 52, 48], automatic model compression [14] or network adaptation [51]. Based on the design of the methodology, we empirically categorize network pruning methods into two groups and discuss them below.
Metric based methods compress a network by selecting filters according to hand-crafted metrics. We further divide existing metrics into two types: local metrics [25, 34, 50] and global metrics [28, 36]. The first type identifies the importance of filters inside a layer. In other words, local metric based methods require human experts to design and decide hyper-parameters such as the channel number, and then prune filters in each layer according to the ℓ1 norm [25] or the geometric median [15]. These methods are not automatic and thus less practical for compressing various models. In addition, extensive experiments in [4, 31] suggest that filter-level pruning does not help as much as selecting a better channel number for a specific architecture. The second type designs a metric to compare the importance of filters across different layers. Such methods implicitly decide the channel numbers and thus require far less human effort than local metric based methods. However, they usually perform network pruning followed by a data-driven and/or iterative optimization to recover accuracy, both of which are time-consuming.
Search based methods. Apart from these human-designed heuristics, efforts on search based methods have been made recently. The main difference between these methods lies in the search algorithm and the architecture evaluation method. In terms of the search algorithm, AMC [14] and NetAdapt [51] leverage reinforcement learning to efficiently sample the search space, while AutoSlim [52] greedily slims the layer with the minimal accuracy drop. In terms of the evaluation method, existing methods usually adopt a weight sharing strategy to evaluate accuracy [48, 8, 30, 52, 14]. To explain, weight sharing methods maintain an over-parameterized network that covers the entire search space. Sampled architectures directly inherit the corresponding weights from the over-parameterized network, which is then used to evaluate performance and update the weights. Both kinds of methods require re-compression or re-training whenever the constraints change, which practically limits the application of network pruning, since a model often needs to be deployed on a wide range of devices.

In this paper, we propose an information theory-inspired pruning (ITPruner) strategy that does not need the aforementioned iterative training and search process, and is thus simple and straightforward. Specifically, we first introduce the normalized Hilbert-Schmidt Independence Criterion (nHSIC) on network activations as an accurate and robust layerwise importance indicator. This importance is then combined with the constraint to convert the architecture search problem into a linear programming problem with bounded variables. In this way, we obtain the optimal architecture by solving the linear programming problem, which only takes a few seconds on a single CPU and GPU.
Our method is motivated by the information bottleneck (IB) theory [47]. That is, for each network layer, we should minimize the mutual information between the layer activation and the input. We generalize this principle across layers for automatic network pruning, i.e., a layer correlated with other layers is less important. In other words, we build a connection between network redundancy and information theory. However, calculating mutual information over the intractable distribution of layer activations is impractical. We thus adopt a non-parametric kernel-based method, the Hilbert-Schmidt Independence Criterion (HSIC), to characterize statistical independence. In summary, our contributions are as follows:

Information theory on deep learning redundancy.
To the best of our knowledge, this is the first work to build a relationship between independence and redundancy of CNNs by generalizing the information bottleneck principle. We are also the first to apply this relationship to automatic network pruning.¹ (¹Previous work [7] also proposed to compress networks using a variational information bottleneck. However, their metric is only applied inside a specific layer, which is not automatic and less effective.)
Methodology. We propose a unified framework to automatically compress a network without any search process. The framework integrates HSIC importance with constraints to transform network pruning into a convex optimization problem, which makes deployment on various devices efficient.

Theoretical contributions. Beyond the theoretical guarantees inherited from HSIC, we also theoretically prove the robustness of HSIC as well as the relationship between our method and mutual information. In particular, as proved in Sec. 4, optimizing HSIC is equivalent to minimizing the layerwise mutual information under Gaussian modeling.
Experimental results demonstrate the efficiency and effectiveness of ITPruner on different network architectures and datasets. Notably, without any search process, our compressed MobileNetV1 obtains a 50.3% FLOPs reduction with 70.92% Top-1 accuracy. Meanwhile, the compressed MobileNetV2, ResNet-50 and MobileNetV1 obtained by ITPruner achieve the best performance gains over state-of-the-art AutoML methods.
2 Related Work
A large amount of work aims at compressing CNNs and employing information theory in deep learning; the main approaches are summarized as follows.
Automatic Network Pruning. Following previous works [51, 48, 8, 30, 52, 14], automatic network pruning adapts the channels of a pre-trained model automatically to a specific platform (e.g., mobile or IoT devices) under a given resource budget. Similar works that modify channels to reduce FLOPs and speed up inference have been proposed [50, 34, 25, 48, 8, 30, 52, 14] in the literature, known as network pruning. Most pruning-based methods have been discussed in Sec. 1, so here we focus on the two most related works [36, 28], which prune networks by designing a global metric. However, these methods are still far from satisfactory. On one hand, there is a significant performance and architecture gap compared to search based methods [30, 52, 14]. On the other hand, these methods also need an extra iterative optimization step to recover performance. In this case, search based methods and global metric based methods require almost the same computation resources. In contrast, our method shows superior performance compared to search based methods. Besides, our method needs no iterative training or search, and generating an optimal architecture only takes a few seconds.
Information theory [6] is a fundamental research area that measures the information of random variables associated with distributions. A lot of its principles and theories have been employed to explore learning dynamics [11, 2] or as a training objective to optimize networks without stochastic gradient descent [35, 43]. To the best of our knowledge, there exist only two prior works [7, 1] that apply information theory to network compression. Dai et al. [7] proposed to compress networks using a variational information bottleneck [47]. However, their metric measures importance inside a specific layer, which is not automatic and less effective. Ahn et al. [1] proposed a variational information distillation method, aiming to distill knowledge from a teacher to a student. Such a method is orthogonal to ours; in other words, we can employ it as a post-processing step to further improve performance.

3 Methodology
Notations. Upper case (e.g., $X$) denotes random variables. Bold characters denote vectors (e.g., $\mathbf{x}$), matrices (e.g., $\mathbf{X}$) or tensors. Calligraphic font denotes spaces (e.g., $\mathcal{X}$). We further define some common notations in CNNs. The $i$-th convolutional layer in a specific CNN converts an input tensor $\mathbf{X}_i$ to an output tensor $\mathbf{X}_{i+1}$ using a weight tensor $\mathbf{W}_i \in \mathbb{R}^{n_i \times c_i \times k_i \times k_i}$, where $c_i$ and $n_i$ denote the numbers of input channels and filters (output channels), respectively, and $k_i \times k_i$ is the spatial size of the filters.

Problem formulation. Given a pre-trained CNN model that contains $L$ convolutional layers, we refer to $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_L)$ as the desired architecture, where $\alpha_i$ is the compression ratio of the $i$-th layer. Formally, we address the following optimization problem:
$\min_{\boldsymbol{\alpha}} \min_{\mathbf{W}} \; \mathcal{L}(\mathbf{W}, \boldsymbol{\alpha}) \quad \text{s.t.} \quad C(\boldsymbol{\alpha}) \leq B, \qquad (1)$
where $\mathcal{L}$ is the objective function, which is differentiable w.r.t. the network weights $\mathbf{W}$ but not differentiable w.r.t. the architecture indicator vector $\boldsymbol{\alpha}$. $C(\cdot)$ is a function that denotes the constraints on architectures, such as FLOPs and latency² (²For example, most mobile devices require the FLOPs to be below a fixed budget.), and the budget $B$ is given for different hardware.
We present an overview of ITPruner in Fig. 1, which aims to automatically discover the redundancy of each layer in a pretrained network. Notably, the redundancy is also characterized by the layerwise pruning ratio or sparsity. The detailed motivations, descriptions and analysis are presented in the following subsections.
3.1 Information Bottleneck in Deep Learning
The information bottleneck (IB) [47] principle expresses a trade-off between the mutual information of the hidden representation with the input and with the output, which is formally defined as
$\min_{T} \; I(X; T) - \beta I(T; Y), \qquad (2)$
where $X$ and $Y$ are the random variables of input and label, $T$ represents the hidden representation, and $\beta$ balances the two terms. Intuitively, the IB principle compresses the mutual information between the input and the hidden representation while preserving the information the hidden representation carries about the label. Considering that CNNs are hierarchical and that almost all modern CNNs adopt residual architectures, we further generalize the IB principle of Eq. 2 as
$\min \; \sum_{i=1}^{L}\sum_{j \neq i} I(T_i; T_j) - \beta \sum_{i=1}^{L} I(T_i; Y). \qquad (3)$
Note that there is always a fine-tuning process in pruning, which re-optimizes the accuracy-related term. We thus further simplify the IB principle by dropping the label term:
$\min \; \sum_{i=1}^{L}\sum_{j \neq i} I(T_i; T_j). \qquad (4)$
According to the generalized IB principle, we hope that the mutual information between different layers is close to 0. In other words, representations from different layers should ideally be independent of each other. We thus generalize this principle to network compression, i.e., a layer correlated with other layers is less important. Meanwhile, layerwise mutual information can be considered a robust and accurate indicator that enables search-free automatic network pruning.
The IB is hard to compute in practice. On one hand, the distribution of the hidden representation is intractable in a CNN. Previous work [1] defines a variational distribution that approximates the true one. However, such a method introduces a gap between the approximated and true distributions, which is inevitable as the Kullback-Leibler divergence is non-negative. On the other hand, hidden representations in CNNs usually have high dimension. In this case, many algorithms based on binning suffer from the curse of dimensionality; besides, they yield different results with different bin sizes.
3.2 Normalized HSIC Indicator
To solve the issues in Sec. 3.1, we introduce the normalized HSIC to replace the mutual information terms in Eq. 4. Formally, the Hilbert-Schmidt Independence Criterion (HSIC) [12] is defined as
$\mathrm{HSIC}(X, Y) = \mathbb{E}_{xx'yy'}\big[k(x, x')\,l(y, y')\big] + \mathbb{E}_{xx'}\big[k(x, x')\big]\,\mathbb{E}_{yy'}\big[l(y, y')\big] - 2\,\mathbb{E}_{xy}\Big[\mathbb{E}_{x'}\big[k(x, x')\big]\,\mathbb{E}_{y'}\big[l(y, y')\big]\Big]. \qquad (5)$
Here $k$ and $l$ are kernel functions, $\mathcal{F}$ and $\mathcal{G}$ are the corresponding Reproducing Kernel Hilbert Spaces (RKHSs), and $\mathbb{E}_{xx'yy'}$ denotes the expectation over independent pairs $(x, y)$ and $(x', y')$ drawn from $p_{xy}$. In order to make HSIC practically computable, Gretton et al. [12] further proposed an empirical approximation given a finite number of observations. Specifically, let $\mathbf{X}$ and $\mathbf{Y}$ contain $n$ i.i.d. samples drawn from the distribution $p_{xy}$. The estimator of HSIC is given by
$\mathrm{HSIC}(\mathbf{X}, \mathbf{Y}) = (n-1)^{-2}\,\mathrm{tr}(\mathbf{K}\mathbf{H}\mathbf{L}\mathbf{H}), \qquad (6)$
where $\mathbf{K}$ and $\mathbf{L}$ have entries $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ and $\mathbf{L}_{ij} = l(\mathbf{y}_i, \mathbf{y}_j)$, and $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix. Notably, an alternative to applying $\mathbf{H}$ is to centralize the examples, i.e., $\bar{\mathbf{x}}_i = \mathbf{x}_i - \boldsymbol{\mu}$, where $\boldsymbol{\mu} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j$. In this way, Eq. 6 is further simplified as
$\mathrm{HSIC}(\mathbf{X}, \mathbf{Y}) = (n-1)^{-2}\,\mathrm{tr}(\bar{\mathbf{K}}\bar{\mathbf{L}}), \qquad (7)$
where $\bar{\mathbf{K}}$ and $\bar{\mathbf{L}}$ are kernel matrices computed on the centralized examples. In our paper, we use the normalized HSIC (nHSIC) based on the centered kernel alignment [20], given by
$\mathrm{nHSIC}(\mathbf{X}, \mathbf{Y}) = \frac{\mathrm{HSIC}(\mathbf{X}, \mathbf{Y})}{\sqrt{\mathrm{HSIC}(\mathbf{X}, \mathbf{X})\,\mathrm{HSIC}(\mathbf{Y}, \mathbf{Y})}}. \qquad (8)$
The HSIC can be effectively computed in $O(n^2)$ time. Meanwhile, the uniform convergence bound derived in [12] with respect to $n$ is $O(n^{-1/2})$. When an appropriate kernel is selected, $\mathrm{HSIC}(X, Y) = 0$ if and only if the random variables are independent, i.e., $X \perp Y$. We also provide a proof in Sec. 4 to demonstrate the relationship between HSIC and mutual information: when the linear kernel is selected, minimizing HSIC is equivalent to minimizing the mutual information under a Gaussian assumption. Meanwhile, in this case, nHSIC is also equivalent to the RV coefficient [40] and Tucker's congruence coefficient [33].
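As a concrete illustration, Eq. 8 with a linear kernel (the kernel choice used in our experiments, see Sec. 4) reduces to a few lines of NumPy; the helper name and the toy matrices below are ours, not from the released code.

```python
# Minimal sketch of nHSIC (Eq. 8) with a linear kernel, operating on
# (n samples x d features) matrices. Helper name is illustrative.
import numpy as np

def nhsic_linear(X, Y):
    """Normalized HSIC between two feature matrices, linear kernel."""
    X = X - X.mean(axis=0, keepdims=True)  # centralize the examples
    Y = Y - Y.mean(axis=0, keepdims=True)
    # tr(K L) with K = X X^T and L = Y Y^T equals ||X^T Y||_F^2
    cross = np.linalg.norm(X.T @ Y, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (norm_x * norm_y + 1e-12)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
print(nhsic_linear(A, A))        # identical representations: ~1.0
print(nhsic_linear(A, 2.0 * A))  # scale invariant: still ~1.0
```

Identical or rescaled representations give a value of 1; lower values indicate weaker statistical dependence between the two sets of activations.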
3.3 Pruning Strategy
We now describe how to use nHSIC to compress a network under an optimal trade-off between compression rate and accuracy. Specifically, we first sample $n$ images to obtain the feature map of each layer of the network to be compressed. We then use these feature maps and Eq. 8 to obtain the independence between different layers, thereby constructing an independence matrix $\mathbf{M} \in \mathbb{R}^{L \times L}$ with entries $\mathbf{M}_{ij} = \mathrm{nHSIC}(\mathbf{X}_i, \mathbf{X}_j)$. As mentioned before, a layer correlated with other layers is less important. Therefore, the importance of a specific layer is formally defined as
$\mathbf{I}_i = \exp\Big(-\frac{\lambda}{L-1}\sum_{j \neq i} \mathbf{M}_{ij}\Big), \qquad (9)$
where $\lambda$ is an importance factor that controls the relative compression rate between different layers. To explain, the compressed network tends to have a significant compression rate difference between layers with a small $\lambda$, and vice versa. We also provide extensive experiments in Sec. 5 to verify this. With the layerwise importance $\mathbf{I}$, the problem of Eq. 1 is transformed into a linear programming problem that is differentiable w.r.t. $\boldsymbol{\alpha}$. Formally, Eq. 1 is rewritten as
$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{L} \mathbf{I}_i\,\alpha_i \quad \text{s.t.} \quad C(\boldsymbol{\alpha}) \leq B, \; 0 < \alpha_i \leq 1. \qquad (10)$
In most cases, constraints such as FLOPs, parameters and inference time can be decomposed into a sum over different layers. Therefore, Eq. 10 is further simplified to a linear program with quadratic constraints:
$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{L} \mathbf{I}_i\,\alpha_i \quad \text{s.t.} \quad \boldsymbol{\alpha}^T\mathbf{F}\boldsymbol{\alpha} \leq B, \; 0 < \alpha_i \leq 1, \qquad (11)$
where $\mathbf{F}$ is the constraint factor matrix corresponding to the layers. For example, suppose we choose the parameter size as the constraint. In this case, for a specific layer $i$, the parameter size is correlated with both the input and the output channels, i.e., the parameter size of the $i$-th layer is $\alpha_{i-1}\alpha_i\,c_i n_i k_i^2$. Therefore, the constraints in Eq. 11 are quadratic. Note that solving the problem in Eq. 11 is extremely efficient; it only takes a few seconds on a single CPU using the solver of [21]. Our proposed ITPruner is summarized in Alg. 1.
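The optimization of Eq. 11 can be sketched with an off-the-shelf sequential quadratic programming routine; SciPy's SLSQP implements the method of Kraft [21] cited above. All numbers below (importances, per-layer costs, budget) are toy values for illustration only.

```python
# Hedged sketch of Eq. 11: choose keep-ratios alpha maximizing the
# importance-weighted sum under a quadratic parameter-size budget.
import numpy as np
from scipy.optimize import minimize

importance = np.array([0.9, 0.5, 0.7, 0.3])  # toy layerwise importance I_i
cost = np.array([1.0, 4.0, 4.0, 2.0])        # toy per-layer c_i * n_i * k^2
budget = 0.5 * cost.sum()                    # keep at most half the parameters

def objective(alpha):
    return -np.dot(importance, alpha)        # maximize sum_i I_i * alpha_i

def budget_slack(alpha):
    # parameter size of layer i scales with alpha_{i-1} * alpha_i (quadratic)
    a_in = np.concatenate(([1.0], alpha[:-1]))  # layer 1's input is unpruned
    return budget - np.sum(a_in * alpha * cost)

res = minimize(objective, x0=np.full(4, 0.5), method='SLSQP',
               bounds=[(0.05, 1.0)] * 4,
               constraints=[{'type': 'ineq', 'fun': budget_slack}])
print(np.round(res.x, 3))  # optimized keep-ratios per layer
```

The solver favors layers whose importance is high relative to their cost, which is exactly the behavior the linear objective with a quadratic budget encodes.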
Discussion. Compared to previous methods [48, 8, 30, 52, 14], ITPruner is efficient and easy to use. In terms of efficiency, our method does not need any training or search process to obtain the optimal architecture. The time complexity is only correlated with the sample number $n$, i.e., $O(n)$ for feature generation and $O(n^2)$ for the HSIC calculation, both of which take only a few seconds on a single GPU and CPU, respectively. In terms of usability, there are only two hyper-parameters, $\lambda$ and $n$, in our method, and both generalize across architectures and datasets. Extensive experiments in Sec. 5.3 demonstrate that the performance gap between different $\lambda$ and $n$ is negligible. This is in stark contrast to previous search based methods, which need to exhaustively select a large number of hyper-parameters for different architectures.
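The feature-collection step of Sec. 3.3 can be sketched with PyTorch forward hooks; the two-layer toy model and the choice of n = 8 sample images below are illustrative stand-ins, not the paper's actual networks or sample size.

```python
# Sketch of collecting layerwise feature maps with forward hooks,
# flattened to (n, features) so they can be fed to the nHSIC computation.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8, 16, 3),
).eval()

features = []  # one (n, c*h*w) matrix per convolution layer

def hook(_module, _inputs, output):
    features.append(output.detach().flatten(1))

handles = [m.register_forward_hook(hook)
           for m in model.modules() if isinstance(m, torch.nn.Conv2d)]

with torch.no_grad():
    model(torch.randn(8, 3, 16, 16))  # n = 8 sampled images

for h in handles:
    h.remove()

print([f.shape for f in features])  # flattened activations, ready for nHSIC
```

Removing the hooks after one forward pass keeps the cost at a single batch of inference, consistent with the O(n) feature-generation complexity discussed above.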
4 Theoretical Analysis
In this section, we provide a theoretical analysis of the proposed method. Specifically, invariance to scale and orthogonal transformations was already noted in [20] and is a key property of feature similarity metrics [20] in deep learning. Here we provide strict proofs of the aforementioned invariances in Theorem 1 and Theorem 2, respectively. To further explain the relationship between nHSIC and mutual information, we theoretically prove that optimizing nHSIC is equivalent to minimizing the mutual information in Theorem 3.
To make the implementation easy, we use a linear kernel in all our experiments. In this case, $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ and $\mathbf{L} = \mathbf{Y}\mathbf{Y}^T$, and nHSIC is further simplified as
$\mathrm{nHSIC}(\mathbf{X}, \mathbf{Y}) = \frac{\|\bar{\mathbf{Y}}^T\bar{\mathbf{X}}\|_F^2}{\|\bar{\mathbf{X}}^T\bar{\mathbf{X}}\|_F\,\|\bar{\mathbf{Y}}^T\bar{\mathbf{Y}}\|_F}, \qquad (12)$
where $\|\cdot\|_F$ is the Frobenius (Hilbert-Schmidt) norm. Eq. 12 is easy to compute and has the useful properties of being invariant to scale and orthogonal transformations. To explain, we provide two formal theorems and the corresponding proofs as follows.
Theorem 1.
$\mathrm{nHSIC}$ is invariant to scale transformation, i.e., $\mathrm{nHSIC}(a\mathbf{X}, b\mathbf{Y}) = \mathrm{nHSIC}(\mathbf{X}, \mathbf{Y})$ for any scalars $a, b \neq 0$.
Proof.
Replacing $\mathbf{X}$ by $a\mathbf{X}$ scales the numerator $\|\bar{\mathbf{Y}}^T\bar{\mathbf{X}}\|_F^2$ of Eq. 12 by $a^2$ and the denominator term $\|\bar{\mathbf{X}}^T\bar{\mathbf{X}}\|_F$ by $a^2$ as well, so the ratio is unchanged; the same argument applies to $b\mathbf{Y}$. ∎
Theorem 2.
$\mathrm{nHSIC}$ is invariant to orthogonal transformation, i.e., $\mathrm{nHSIC}(\mathbf{X}\mathbf{Q}, \mathbf{Y}) = \mathrm{nHSIC}(\mathbf{X}, \mathbf{Y})$, where $\mathbf{Q}\mathbf{Q}^T = \mathbf{Q}^T\mathbf{Q} = \mathbf{I}$.
Proof.
Since the Frobenius norm is invariant under multiplication by an orthogonal matrix, we have $\|(\bar{\mathbf{X}}\mathbf{Q})^T\bar{\mathbf{Y}}\|_F = \|\mathbf{Q}^T\bar{\mathbf{X}}^T\bar{\mathbf{Y}}\|_F = \|\bar{\mathbf{X}}^T\bar{\mathbf{Y}}\|_F$ and $\|(\bar{\mathbf{X}}\mathbf{Q})^T(\bar{\mathbf{X}}\mathbf{Q})\|_F = \|\bar{\mathbf{X}}^T\bar{\mathbf{X}}\|_F$, which means that $\mathrm{nHSIC}$ is invariant to orthogonal transformation. ∎
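These two invariances can be checked numerically; we re-implement the linear nHSIC of Eq. 12 as a small helper (names are ours) and compare values before and after scaling and an orthogonal rotation.

```python
# Numerical check of Theorems 1 and 2 for the linear-kernel nHSIC (Eq. 12):
# the value is unchanged by scaling a feature matrix and by right-multiplying
# it with an orthogonal matrix.
import numpy as np

def nhsic_linear(X, Y):
    X = X - X.mean(0); Y = Y - Y.mean(0)
    return (np.linalg.norm(X.T @ Y, 'fro') ** 2
            / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')))

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((50, 16)), rng.standard_normal((50, 16))
base = nhsic_linear(X, Y)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal matrix
print(abs(nhsic_linear(3.7 * X, Y) - base))  # scale invariance: ~0
print(abs(nhsic_linear(X @ Q, Y) - base))    # orthogonal invariance: ~0
```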
Theorems 1 and 2 illustrate the robustness of nHSIC across different network architectures. Specifically, when the network architecture contains batch normalization [18] or orthogonal convolution layers [49], we obtain exactly the same layerwise importance $\mathbf{I}$, and thus the same compressed architecture. Considering that ITPruner achieves superior performance on different architectures and datasets, we conclude that nHSIC is a stable and accurate indicator, and that ITPruner is thus a robust compression method.

Apart from the theorem on the relationship between independence and HSIC in [12], we propose a new theorem to illustrate how the mutual information is also minimized by the linear HSIC under a Gaussian assumption.
Theorem 3.
Assume that $(X, Y)$ follows a joint multivariate Gaussian distribution with covariance $\boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy}\\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y\end{pmatrix}$. Then minimizing the linear-kernel HSIC between $X$ and $Y$ also minimizes the mutual information $I(X; Y)$.
Proof.
According to the definitions of entropy and mutual information, we have

$I(X; Y) = h(X) + h(Y) - h(X, Y).$

Meanwhile, according to the definition of the multivariate Gaussian distribution, we have

$h(X) = \tfrac{1}{2}\log\big((2\pi e)^{d_x}\det\boldsymbol{\Sigma}_x\big), \quad h(Y) = \tfrac{1}{2}\log\big((2\pi e)^{d_y}\det\boldsymbol{\Sigma}_y\big),$

and $h(X, Y) = \tfrac{1}{2}\log\big((2\pi e)^{d_x + d_y}\det\boldsymbol{\Sigma}\big)$, where $\boldsymbol{\Sigma}$ is the joint covariance matrix above. Therefore, when $(X, Y)$ follows a Gaussian distribution, the mutual information is represented as

$I(X; Y) = \frac{1}{2}\log\frac{\det\boldsymbol{\Sigma}_x \det\boldsymbol{\Sigma}_y}{\det\boldsymbol{\Sigma}}. \qquad (13)$

Everitt et al. [9] proved the inequality $\det\boldsymbol{\Sigma} \leq \det\boldsymbol{\Sigma}_x \det\boldsymbol{\Sigma}_y$, where equality holds if and only if $\boldsymbol{\Sigma}_{xy}$ is a zero matrix. Applying this inequality to Eq. 13, we have $I(X; Y) \geq 0$, and the equality holds if and only if $\boldsymbol{\Sigma}_{xy}$ is a zero matrix. According to the definition of the Frobenius norm, $\|\boldsymbol{\Sigma}_{xy}\|_F = 0$ if and only if $\boldsymbol{\Sigma}_{xy}$ is a zero matrix. Apparently, when minimizing Eq. 12, we are also minimizing the distance between $\boldsymbol{\Sigma}_{xy}$ and the zero matrix. In other words, while minimizing the linear nHSIC, we are also minimizing the mutual information between the two Gaussian variables, namely $I(X; Y)$. ∎
TABLE I: Accuracy, compression ratio and search cost of different backbones on CIFAR-10. 'Ratio' indicates the reduction of FLOPs. Metric based methods usually integrate the search with the training process, so search epochs cannot be recognized for them; '–' stands for unavailable records.

Model | Method | Type | Top-1 Acc (%) | FLOPs (M) | Ratio | Search Epochs
VGG [44] | Baseline | – | – | – | – | –
VGG [44] | L1 [25] | Local Metric | – | – | – | –
VGG [44] | FPGM [15] | Local Metric | – | – | – | –
VGG [44] | GAL [29] | Local Metric | – | – | – | –
VGG [44] | HRank [26] | Local Metric | – | – | – | –
VGG [44] | ITPruner | Automatic | 94.00 | 98.8 | 68.5% | 0
ResNet-20 [13] | Baseline | – | – | – | – | –
ResNet-20 [13] | FPGM [15] | Local Metric | – | – | – | –
ResNet-20 [13] | APS [48] | Automatic | – | – | – | 600
ResNet-20 [13] | TAS [8] | Automatic | – | – | – | 600
ResNet-20 [13] | Taylor [37] | Global Metric | – | – | – | –
ResNet-20 [13] | ITPruner | Automatic | 92.01 | 20.8 | 48.8% | 0
ResNet-56 [13] | Baseline | – | – | – | – | –
ResNet-56 [13] | GAL [29] | Local Metric | – | – | – | –
ResNet-56 [13] | APS [48] | Automatic | – | – | – | 600
ResNet-56 [13] | TAS [8] | Automatic | – | – | – | 600
ResNet-56 [13] | AMC [14] | Automatic | – | – | – | 400
ResNet-56 [13] | ITPruner | Automatic | 93.43 | 59.5 | 52.4% | 0
ResNet-56 [13] | ITPruner + FPGM | Automatic | 94.05 | 59.5 | 52.4% | 0
TABLE II: Accuracy, FLOPs, compression ratio and search cost of different backbones on ImageNet. '–' stands for unavailable records.

Model | Method | Type | Top-1 Acc (%) | FLOPs (M) | Ratio | Search Epochs
MobileNetV1 [17] | Baseline | – | – | – | – | –
MobileNetV1 [17] | Uniform | – | – | – | – | –
MobileNetV1 [17] | NetAdapt [51] | Automatic | – | – | – | –
MobileNetV1 [17] | AMC [14] | Automatic | – | – | – | –
MobileNetV1 [17] | ITPruner | Automatic | 70.92 | 283 | 50.3% | 0
MobileNetV1 [17] | Uniform | – | – | – | – | –
MobileNetV1 [17] | MetaPruning [30] | Automatic | – | – | – | –
MobileNetV1 [17] | AutoSlim [52] | Automatic | – | – | – | –
MobileNetV1 [17] | ITPruner | Automatic | 68.06 | 149 | 73.8% | 0
MobileNetV2 [42] | Baseline | – | – | – | – | –
MobileNetV2 [42] | Uniform | – | – | – | – | –
MobileNetV2 [42] | AMC [14] | Automatic | – | – | – | –
MobileNetV2 [42] | MetaPruning [30] | Automatic | – | – | – | –
MobileNetV2 [42] | AutoSlim [52] | Automatic | – | – | – | –
MobileNetV2 [42] | ITPruner | Automatic | 71.54 | 219 | 27.2% | 0
MobileNetV2 [42] | APS [48] | Automatic | – | – | – | –
MobileNetV2 [42] | ITPruner | Automatic | 69.13 | 149 | 50.5% | 0
ResNet-50 [13] | Baseline | – | – | – | – | –
ResNet-50 [13] | ABCPruner [27] | Automatic | – | – | – | –
ResNet-50 [13] | GAL [29] | Local Metric | – | – | – | –
ResNet-50 [13] | AutoSlim [52] | Automatic | – | – | – | –
ResNet-50 [13] | TAS [8] | Automatic | – | – | – | –
ResNet-50 [13] | Taylor [37] | Global Metric | – | – | – | –
ResNet-50 [13] | MetaPruning [30] | Automatic | – | – | – | –
ResNet-50 [13] | HRank [26] | Local Metric | – | – | – | –
ResNet-50 [13] | FPGM [15] | Local Metric | – | – | – | –
ResNet-50 [13] | ITPruner | Automatic | 75.75 | 2236 | 45.3% | 0
ResNet-50 [13] | ITPruner | Automatic | 75.28 | 1943 | 52.5% | 0
ResNet-50 [13] | ITPruner | Automatic | 78.05 | 1943 | 52.5% | 0
TABLE III: Inference latency under different batch sizes on a Google Pixel device (cf. Sec. 5.2). '–' stands for unavailable records.

Model | Method | Batch 1 | Batch 4 | Batch 8
MobileNetV1 | Baseline | – | – | –
MobileNetV1 | Ours (283M) | 15 ms | 70 ms | 135 ms
MobileNetV1 | Ours (149M) | 10 ms | 45 ms | 84 ms
MobileNetV2 | Baseline | – | – | –
MobileNetV2 | Ours (217M) | 17 ms | 75 ms | 144 ms
MobileNetV2 | Ours (148M) | 13 ms | 55 ms | 110 ms
5 Experiment
We quantitatively demonstrate the robustness and efficiency of ITPruner in this section. We first describe the detailed experimental settings in Sec. 5.1. Then, we apply ITPruner to automatic network pruning for image classification on the widely used CIFAR-10 [24] and ImageNet [41] datasets with different constraints in Sec. 5.2. To further understand ITPruner, we conduct a few ablation studies to show the effectiveness of our algorithm in Sec. 5.3. The experiments are run on an NVIDIA Tesla V100 GPU, and the algorithm is implemented in PyTorch [38]. We have also released all the source code.

5.1 Experimental Settings
We use the same datasets and evaluation metrics as existing compression methods [51, 48, 8, 30, 52, 14]. First, most of the experiments are conducted on CIFAR-10, which has 50K training images and 10K testing images from 10 classes with a resolution of 32×32. The color intensities of all images are normalized to [0, 1]. To further evaluate the generalization capability, we also evaluate the classification accuracy on the ImageNet dataset, which consists of 1,000 classes with 1.28M training images and 50K validation images. Here, we consider an input image size of 224×224. We compare different methods under similar baselines, training conditions and search spaces in our experiments. We elaborate on the training conditions as follows. For standard training, all networks are trained via SGD with momentum, and the numbers of training epochs for CIFAR-10 and for ResNet and MobileNet on ImageNet are kept consistent with previous compression works for a fair comparison. We set the batch size and initial learning rate accordingly and use a cosine learning rate schedule on both datasets. We also use basic data augmentation on both datasets: images are randomly cropped and flipped, then resized to the corresponding input sizes mentioned before, and finally normalized with their mean and standard deviation.
5.2 Comparison with StateoftheArt Methods
CIFAR-10. We first compare the proposed ITPruner with different types of compression methods on VGG, ResNet-20 and ResNet-56. Results are summarized in Tab. I. Without any search cost, our method achieves the best compression trade-off among all methods. Specifically, ITPruner shows a larger reduction of FLOPs together with better performance. For example, compared to the local metric method HRank [26], ITPruner achieves a higher reduction of FLOPs with higher top-1 accuracy. Meanwhile, compared to the search based method TAS [8], ITPruner yields better performance on ResNet-56 in both FLOPs reduction and top-1 accuracy. Besides, the proposed ITPruner is orthogonal to local metric based methods, which means that ITPruner is capable of integrating these methods to achieve better performance. As we can see, integrating FPGM [15] with ITPruner achieves a further improvement in accuracy, which is even better than the baseline. Another interesting observation from Tab. I is the performance rank between search and local metric based methods. In particular, search based methods show a significant performance advantage on efficient models like ResNet-20, but worse performance on larger models such as VGG and ResNet-56. This observation suggests how to select compression tools for different models.
ImageNet 2012. We further compare our ITPruner scheme with other methods on the widely used ResNet [13], MobileNetV1 [17] and MobileNetV2 [42]. As shown in Tab. II, our method still achieves the best trade-off among the different types of methods. Specifically, ITPruner yields a 52.5% compression rate with only a minor decrease in Top-1 accuracy on ResNet-50. Similar results are reported in Tab. II for other backbones. In particular, the proposed method shows a clear performance gap over other methods when compressing the compact models MobileNetV1 and MobileNetV2. For example, ITPruner achieves a similar reduction of FLOPs with a much lower accuracy drop on MobileNetV1 compared to the widely used AMC [14]. Moreover, we also demonstrate the effectiveness of networks adapted by ITPruner on a Google Pixel device. As shown in Tab. III, ITPruner achieves clear acceleration across batch sizes on MobileNetV1 and MobileNetV2, respectively.
5.3 Ablation Study
In this section, we analyze the architecture found by ITPruner and the influence of the hyper-parameters $\lambda$ and $n$. To sum up, the architecture found by ITPruner shows surprisingly similar characteristics to the optimal one. Meanwhile, ITPruner is easy to use, i.e., its only two hyper-parameters $\lambda$ and $n$ show a negligible performance gap across different values. The detailed experiments are described as follows.
Architecture analysis. We first adopt an Evolutionary Algorithm (EA) [3] to find the optimal architecture, which is employed as a baseline for a better comparison. Specifically, we directly employ the aforementioned training hyper-parameters in the architecture evaluation step. Meanwhile, the search algorithm is executed several times to determine the optimal architecture, which is illustrated by the green line in Fig. 2. Notably, such a process is extremely time-consuming, taking a few GPU days on CIFAR-10. Interestingly, the proposed ITPruner finds an architecture with similar characteristics to the optimal one. In particular, there are significant peaks in these architectures wherever there is a down-sampling operation, i.e., stride 2. In such operations, the resolution is halved and thus needs to be compensated with more channels. Similar observations are also reported in MetaPruning [30]. Nevertheless, we do not need any exhaustive search process to obtain such insight. We also notice some difference in channel choice between the architectures found by ITPruner and EA: ITPruner is prone to assign more channels to the first and last layers. Meanwhile, the performance gap between these architectures is negligible, which demonstrates the efficiency of our method.
Influence of $\lambda$ and $n$. There are only two hyper-parameters in ITPruner. One is the importance factor $\lambda$, which controls the relative compression rates between different layers. The other is the number of sampled images $n$, which determines the computation cost of ITPruner. Fig. 3 reports the accuracy and the variance of layerwise importance for different $\lambda$. As we can see, the layerwise importance tends to have a larger variance with a small $\lambda$, and vice versa. This variance finally determines the difference in layerwise compression rates through Eq. 10. However, the performance gap between different $\lambda$ is negligible. Similar results are observed for different $n$; the experiment is available in the supplementary material. In particular, the variance of top-1 accuracy on CIFAR-10 between different $\lambda$ is negligible. We thus conclude that ITPruner is easy to use.
We further conduct an experiment to demonstrate the influence of the sample size $n$, reported in Fig. 4. As we can see, the variance of the accuracy across different $n$ is 0.1%, which means the accuracy does not fluctuate greatly with the change of $n$ and thus demonstrates the robustness of our method.
6 Conclusion
In this paper, we presented an information theory-inspired strategy for automatic network pruning (ITPruner), which determines the importance of a layer by observing the independence of its feature maps and does not need any search process. To that effect, ITPruner is derived from the information bottleneck principle and uses the normalized HSIC to measure and determine the layerwise compression ratio. We also mathematically prove the robustness of the proposed method as well as its relation to mutual information. Extensive experiments on various modern CNNs demonstrate the effectiveness of ITPruner in reducing computational complexity and model size. In future work, we will extend the theoretical analysis to neural architecture search in order to find optimal architectures without any search process.
References

[1]
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai.
Variational information distillation for knowledge transfer.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages 9163–9171, 2019.  [2] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[3]
Thomas Back.
Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic algorithms
. 1996.  [4] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
 [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 2018.
 [6] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

[7]
Bin Dai, Chen Zhu, Baining Guo, and David Wipf.
Compressing neural networks using the variational information
bottleneck.
In
International Conference on Machine Learning
, pages 1135–1144. PMLR, 2018.  [8] Xuanyi Dong and Yi Yang. Network pruning via transformable architecture search. Advances in Neural Information Processing Systems, 2019.
 [9] W. N. Everitt. A note on positive definite matrices. Proceedings of the Glasgow Mathematical Association, 3(4):173–175, 1958.
 [10] Ross Girshick. Fast rcnn. In International Conference on Computer Vision (ICCV), 2015.
 [11] Z Goldfeld, E van den Berg, K Greenewald, I Melnyk, N Nguyen, B Kingsbury, and Y Polyanskiy. Estimating information flow in neural networks. arxiv eprints, page. arXiv preprint arXiv:1810.05728, 2018.
 [12] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbertschmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
 [14] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, LiJia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
 [15] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2019.
 [16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
 [17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
 [19] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Computer Vision and Pattern Recognition (CVPR), 2018.
 [20] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
 [21] Dieter Kraft et al. A software package for sequential quadratic programming. 1988.
 [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
 [23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
 [24] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [25] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. International Conference on Learning Representations (ICLR), 2016.
 [26] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. IEEE Conference on Computer Vision and Pattern Recognition, 2020.
 [27] Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, and Yonghong Tian. Channel pruning via automatic structure search.
 [28] Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Accelerating convolutional networks via global & dynamic filter pruning. In IJCAI, pages 2425–2432, 2018.
 [29] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.
 [30] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3296–3305, 2019.
 [31] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.
 [32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [33] Urbano Lorenzo-Seva and Jos MF ten Berge. Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2):57–64, 2006.
 [34] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. International Conference on Computer Vision (ICCV), 2017.
 [35] Wan-Duo Kurt Ma, JP Lewis, and W Bastiaan Kleijn. The HSIC bottleneck: Deep learning without backpropagation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5085–5092, 2020.
 [36] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 [37] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
 [38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 [39] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
 [40] Paul Robert and Yves Escoufier. A unifying tool for linear multivariate statistical methods: The RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3):257–265, 1976.
 [41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
 [42] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [43] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, 2017.
 [44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [45] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
 [46] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
 [47] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
 [48] Jiaxing Wang, Haoli Bai, Jiaxiang Wu, Xupeng Shi, Junzhou Huang, Irwin King, Michael Lyu, and Jian Cheng. Revisiting parameter sharing for automatic neural channel number search. Advances in Neural Information Processing Systems, 33, 2020.
 [49] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11505–11515, 2020.
 [50] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems (NeurIPS), 2016.
 [51] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018.
 [52] Jiahui Yu and Thomas Huang. AutoSlim: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.
 [53] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. Computer Vision and Pattern Recognition (CVPR), 2016.