1 Introduction
Deep neural networks (DNNs) have dominated the field of computer vision because of their superior performance on all kinds of tasks. The trend is for network architectures to become deeper and more complex
[27, 26, 12, 35, 16] to yield higher accuracy. However, the great computational expense of deeper networks contradicts the demands of many resource-constrained applications, which prefer lightweight networks [15, 25, 38, 21] to meet limited computation or storage budgets. An elegant solution is to make use of a dynamic inference mechanism [31, 29, 34, 17, 5, 7, 8, 28], reconfiguring the inference path adaptively according to the input sample to achieve a better accuracy-efficiency trade-off. Prevalent dynamic inference techniques are mostly layer-wise methods [31, 29, 34, 7, 28], as shown in Fig. 1(a). These methods are usually adopted to determine the execution status of a whole layer at runtime based on a specified mechanism.
All these existing dynamic inference methods only alter the depth of the network, and the drawbacks are obvious. First, it is impractical to drop a whole layer/block, since some channels of the skipped layer may be useful. Second, redundant information between different channels may still exist in the remaining layers. A recent study [37] visualizes the hidden features of CNN models and shows the performance contribution of different channels and layers: different channels and layers place different emphasis on feature extraction.
In this work, we attempt to improve the conventional dynamic inference scheme in terms of both network width and depth, and to find an effective forward mechanism for different inputs at runtime from a new perspective of block design. We propose the Dynamic Multi-path Neural Network (DMNN), a novel dynamic inference method that provides various inference path selections. Fig. 1(b) gives an overview of our approach. Different from conventional methods, each channel is expected to have its own gate that predicts whether to execute it or not. The primary technical challenge of DMNN is how to design an efficient and effective controller.
Challenge of efficiency. Since DMNN aims at channel-wise dynamic evaluation, it would be ideal to control the execution of each individual channel at runtime. However, this would lead to a significant increase in computational complexity. Moreover, since controllers are used at each layer/block of the network, they should be lightweight and generate only a small amount of computational cost.
Challenge of effectiveness. The gate control mechanism is similar to SENet [16], which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. However, SENet uses a soft weighted sum, while DMNN adopts a hard-max mechanism for faster inference while maintaining or boosting accuracy. To obtain a more reasonable inference path, it is better to take both previous state information and the object category into consideration. Besides, a resource-constrained loss is also required to make the computational complexity controllable.
To tackle these challenges, and considering that different channels have different representation characteristics, we split the original block of the network into several sub-blocks, so that the proposed method provides more optional inference paths. A gate controller is introduced to decide whether to execute or skip each sub-block for the current input, generating only minor additional computational cost during inference. Each block has its own controller that controls the status of every sub-block. We carefully design the gate controller to take both previous state information and the object category into consideration. Moreover, we introduce a resource-constrained loss that integrates a FLOPs constraint into the optimization process to make the computational complexity controllable. The proposed DMNN is easy to implement and can be incorporated into most modern network architectures.
The contributions are summarized as follows:

We propose a novel dynamic inference method, the Dynamic Multi-path Neural Network, which provides more path selection choices in terms of network width and depth during inference.

We carefully design a gate controller that takes into account both previous state and object category information. A resource-constrained loss is also introduced to control the computational complexity of the target network.

Experimental results demonstrate the superiority of our method in both efficiency and overall classification accuracy. Specifically, DMNN-101 significantly outperforms ResNet-101 with an encouraging 45.1% FLOPs reduction, and DMNN-50 performs comparably to ResNet-101 with 42.1% fewer parameters.
2 Related Work
Adaptive Computation. Adaptive computation aims to reduce overall inference time by changing the network topology for different samples while maintaining or even boosting accuracy. This idea was adopted in early cascade detectors [6, 30], which rely on extra prediction modules or hand-crafted control strategies. Learning-based layer-wise dynamic inference schemes have been widely investigated in computer vision. Early-prediction models such as BranchyNet [28] and Adaptive Computation Time [7] adopt branches or halt units to decide whether the model can stop early. Some works use a gate mechanism to determine the execution of a specific block. Wang et al. [31] propose SkipNet, which uses a gating network to selectively skip convolutional blocks based on the activations of the previous layer; a hybrid learning algorithm combining supervised learning and reinforcement learning addresses the challenge of non-differentiable skipping decisions. Wu et al. [34] propose BlockDrop, which also uses a reinforcement learning setting with a reward for utilizing a minimal number of blocks while preserving recognition accuracy. ConvNet-AIG is proposed in [29], which utilizes the Gumbel-Max trick [10] to optimize the gate module. However, block-wise methods can only alter the depth of the network, which can be too coarse, since some channels of an abandoned block may be useful. On the other hand, channel-wise methods can manually adjust the number of active channels of a specific model. However, as far as we know, only [36] is similar to such a method: the proposed Slimmable Neural Networks can adjust their width on the fly according to on-device benchmarks and resource constraints. Strictly speaking, this is not a dynamic process, since the active channels are chosen before inference; moreover, the predefined width multipliers limit the flexibility of the dynamic inference mechanism. Our work is closest to [29]. However, we attempt to combine the merits of both of the above approaches and propose a novel dynamic inference method that provides more path selection choices in terms of network width and depth.
Model Compression. The great computational expense of deeper networks contradicts the demands of many resource-constrained applications such as mobile platforms; therefore, reducing storage and inference time also plays an important role in deploying top-performing deep neural networks. Many techniques have been proposed to attack this problem, such as pruning [13, 22, 33], distillation [14, 23], quantization [11, 32, 19], low-rank factorization [18], compression with structured matrices [2] and network binarization [3]. However, these works are usually applied after training the initial networks and generally serve as post-processing, while DMNN can be trained end-to-end without well-designed training rules.

3 Methodology
In this section, we introduce the proposed Dynamic Multi-path Neural Network (DMNN) in detail, including the subdivision of blocks, the architecture of the controller, and the optimization approach.
3.1 Block Subdivision
It would be ideal to control the execution of each channel of the network at runtime; however, this would lead to a significant increase in computational complexity. In this work, we divide the original block of the network into several sub-blocks, and each sub-block has its own switch deciding whether to execute it or not, resulting in a dynamic inference path for different samples. We interpret optimizing the network structure as executing or skipping each sub-block during the inference stage.
A key issue is how to split one block into sub-blocks. The guiding principle is that the parameter count of the new block must be consistent with, or approximate to, that of the original block for a fair comparison. Fig. 2 shows the subdivision of the blocks of MobileNetV2 and ResNet.
For a MobileNetV2 block, we divide the original block into $N$ sub-blocks and set the expansion ratio of each sub-block to $1/N$ of the original one. Thus the sums of the sub-blocks' computation and parameters are the same as those of the original block, since the block consists only of point-wise ($1\times1$) convolutions and depth-wise convolutions; more details can be seen in Fig. 2(a).
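Ignoring batch-norm and bias parameters, this parameter-preserving property can be checked with a short sketch (the channel count and expansion ratio below are illustrative, not values from the paper):

```python
def inverted_residual_params(c, expansion):
    """Parameters of a MobileNetV2-style inverted residual block:
    1x1 expand (c -> c*expansion), 3x3 depth-wise, 1x1 project (-> c)."""
    hidden = c * expansion
    return c * hidden + 9 * hidden + hidden * c

def split_params(c, expansion, n):
    """N sub-blocks, each with expansion ratio expansion/n."""
    return n * inverted_residual_params(c, expansion / n)

# Example: 32 input channels, expansion ratio 6, split into 3 sub-blocks.
print(inverted_residual_params(32, 6))  # 14016
print(split_params(32, 6, 3))           # 14016.0: the total is unchanged
```

Because every term of the block's parameter count is linear in the expansion ratio, splitting into $N$ sub-blocks with the ratio reduced by $N$ leaves the total unchanged.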
For ResNet, it is not as straightforward. As shown in Fig. 2(b), suppose the shape of the input tensor is $C_0 \times H \times W$ and the output channels of the three convolutions of a bottleneck block ($1\times1$, $3\times3$ and $1\times1$ kernels) are $C_1$, $C_2$ and $C_3$ respectively. The parameter count of the original block is

$P = C_0 C_1 + 3^2 C_1 C_2 + C_2 C_3.$  (1)

The original block is then split into $N$ sub-blocks whose outputs are summed, so each sub-block maps the $C_0$ input channels to $C'_1$, $C'_2$ and $C_3$ channels in turn. The parameter count becomes

$P' = N \left( C_0 C'_1 + 3^2 C'_1 C'_2 + C'_2 C_3 \right).$  (2)

If we simply set the number of internal channels of each sub-block to $1/N$ of the original block, i.e. $C'_1 = C_1 / N$ and $C'_2 = C_2 / N$, Eqn. 2 can be rewritten as follows:

$P' = C_0 C_1 + \frac{3^2 C_1 C_2}{N} + C_2 C_3,$  (3)

which is not equal to Eqn. 1. Thus, to make the subsequent extensive studies fair, we make minor modifications to ResNet and design the corresponding DMNN version so that Eqn. 1 = Eqn. 2. To save space, the detailed architecture of DMNN-50 is given in the supplementary materials.
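Assuming the bottleneck reading of the block ($1\times1$, $3\times3$, $1\times1$ convolutions; batch norm and biases ignored), a short sketch shows why the naive $1/N$ channel split loses parameters in the $3\times3$ term; the channel numbers are illustrative:

```python
def bottleneck_params(c0, c1, c2, c3):
    """Eqn. 1: parameters of a 1x1 -> 3x3 -> 1x1 bottleneck."""
    return c0 * c1 + 9 * c1 * c2 + c2 * c3

def naive_split_params(c0, c1, c2, c3, n):
    """Eqn. 3: N summed sub-blocks whose internal channels are divided by N."""
    return n * bottleneck_params(c0, c1 // n, c2 // n, c3)

# Example: a ResNet-50-style block (256 -> 64 -> 64 -> 256) split into two.
print(bottleneck_params(256, 64, 64, 256))      # 69632
print(naive_split_params(256, 64, 64, 256, 2))  # 51200: the 3x3 term shrinks N-fold
```

Only the $3\times3$ term changes ($3^2 C_1 C_2 \to 3^2 C_1 C_2 / N$), which is why the channel numbers of the DMNN version have to be adjusted to restore equality.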
3.2 The Architecture of Controller
The controller is elaborately designed to predict the status of each sub-block (on/off) at minimal cost; it is the inference path optimizer of DMNN. An overview of the dynamic path selection framework is shown in Fig. 3. Given an input image, its forward path is determined by the gate controllers; Fig. 3(a) shows the gate mechanism of DMNN. Suppose we split the $l$-th block into $N$ sub-blocks; the output of the $l$-th block is then the combination of the outputs of an identity connection and the $N$ sub-blocks. Formally,

$x^{l+1} = x^l + \sum_{i=1}^{N} s_i^l \, F_i^l(x^l),$  (4)

where $x^{l+1}$ is the output of the $l$-th block, $s_i^l \in \{0, 1\}$ refers to the off/on status predicted by the controller, and $F_i^l(x^l)$ refers to the output of the $i$-th sub-block of the $l$-th block.
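Eqn. 4 can be sketched with plain functions standing in for the sub-blocks (the sub-block functions and gate values here are toy placeholders; a real implementation would skip the call entirely when the gate is 0):

```python
def dmnn_block_output(x, gates, sub_blocks):
    """Eqn. 4: identity connection plus the sum of the executed sub-blocks.
    gates[i] is the 0/1 status s_i predicted by the controller,
    sub_blocks[i] stands in for the i-th sub-block F_i."""
    return x + sum(s * f(x) for s, f in zip(gates, sub_blocks))

# Toy example with two "sub-blocks"; only the first one is executed.
sub_blocks = [lambda x: 2 * x, lambda x: 10 * x]
print(dmnn_block_output(3.0, [1, 0], sub_blocks))  # 3 + 2*3 = 9.0
```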
Spatial and previous state information embedding. On the one hand, the control modules make decisions based on global spatial information; we achieve this by applying global average pooling to compress the high-dimensional features to one dimension along channels. We further use a fully connected layer followed by an activation layer to map the pooled features to a low-dimensional space. Specifically, let $x^l \in \mathbb{R}^{C \times H \times W}$ represent the input features of the $l$-th block; we calculate the $c$-th channel statistic $z_c$ by

$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c^l(i, j).$  (5)

The final embedding feature is

$e^l = \delta(W_1 z),$  (6)

where $z = [z_1, \dots, z_C]$, $W_1 \in \mathbb{R}^{d \times C}$, $\delta$ is the ReLU [9] function, and $d$ is the dimension of the hidden layer. On the other hand, there are connections between the current controller and the previous controllers, so the integration of previous state information is also crucial. We first employ a fully connected layer followed by a ReLU function to map the previous state's hidden features $h^{l-1}$ into the same subspace as $e^l$. Then we perform an addition on the mapped hidden feature and $e^l$ to get the current state. Formally,

$h^l = e^l + \delta(W_2 h^{l-1}),$  (7)

where $W_2 \in \mathbb{R}^{d \times d}$ and $\delta$ represents the ReLU function. Bias terms are omitted for simplicity. The status predictions of the sub-blocks of the $l$-th block are made from $h^l$ by using a softmax trick, which we introduce in Section 3.3.
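A minimal sketch of Eqns. 5-7 in plain Python (the weights, feature values, and dimensions are random placeholders; a real controller would use learned fully connected layers):

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def controller_hidden(x, w1, w2, h_prev):
    """Eqns. 5-7: global average pooling, embedding, previous-state fusion."""
    z = [sum(ch) / len(ch) for ch in x]      # Eqn. 5: per-channel mean over H*W
    e = relu(matvec(w1, z))                  # Eqn. 6: e = ReLU(W1 z)
    prev = relu(matvec(w2, h_prev))          # map the previous hidden state
    return [a + b for a, b in zip(e, prev)]  # Eqn. 7: h = e + ReLU(W2 h_prev)

random.seed(0)
C, d = 4, 3                                                   # channels, hidden dim
x = [[random.random() for _ in range(8)] for _ in range(C)]   # C channels, H*W = 8
w1 = [[random.random() for _ in range(C)] for _ in range(d)]
w2 = [[random.random() for _ in range(d)] for _ in range(d)]
h = controller_hidden(x, w1, w2, [0.0] * d)
print(len(h))  # 3: one hidden vector of dimension d per controller
```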
Softmax Trick with Gumbel Noise. The decision to execute or omit a sub-block is inherently discrete and therefore non-differentiable. In this work, we use the softmax trick with Gumbel noise to solve this problem, which has been proved successful in [29]. Formally, let $N$ be the number of sub-blocks and $\beta = W_3 h^l + b$, where $W_3 \in \mathbb{R}^{2N \times d}$ and $b$ is the bias term. $\beta$ is then reshaped to $N \times 2$ for the final predictions. The activation can be written as follows:

$s_i^l = \arg\max_{j \in \{0, 1\}} \left( \beta_{i,j} + g_{i,j} \right),$  (8)

where $s_i^l$ refers to the status of the $i$-th sub-block of the $l$-th block, and $g_{i,j}$ is a random noise following the Gumbel distribution, which increases the stability of the training process of our network.
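The decision rule of Eqn. 8 can be sketched as follows; the Gumbel noise is sampled as $-\log(-\log u)$ with $u$ uniform, and the logits are placeholders:

```python
import math
import random

def gumbel_noise():
    """Sample from the standard Gumbel distribution."""
    return -math.log(-math.log(random.random()))

def gate_decisions(beta):
    """Eqn. 8: beta has shape (N, 2); for each sub-block take the argmax
    over the {skip, execute} pair after adding independent Gumbel noise."""
    decisions = []
    for row in beta:
        noisy = [b + gumbel_noise() for b in row]
        decisions.append(0 if noisy[0] >= noisy[1] else 1)
    return decisions

random.seed(0)
beta = [[0.2, 1.5], [2.0, -1.0], [0.0, 0.0]]  # N = 3 sub-blocks
print(gate_decisions(beta))                   # a list of three 0/1 statuses
```

Following [29], the noise is typically dropped at test time so that the decision becomes a plain argmax over the logits.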
Supervised learning of the controller. Deep CNNs compute feature hierarchies in each layer and produce feature maps with different depths and resolutions, which can be considered a coarse-to-fine feature extraction process. The proposed DMNN has a diversity of inference paths, and we hope that different classes select different paths. However, if the path selection mechanism is trained only by optimizing the classification loss at the last layer, it is difficult for the controllers to learn category information. To solve this problem, we introduce a category loss for each controller to make all of them category-aware. Since having the controllers predict each class as a separate category is computationally expensive, we cluster samples into fewer categories than the original classes. For the ImageNet dataset [4], we cluster all 1000 classes into 58 big categories with the help of the hierarchical structure of ImageNet provided in [4]. The CIFAR-100 dataset [20] groups its 100 classes into 20 superclasses, which we use as the big categories directly. A cross-entropy loss is then employed to supervise all controllers, as shown in Fig. 3(b). Formally, the category loss of the $l$-th controller can be written as follows:

$L_{cat}^l = -\sum_{c=1}^{M} y_c \log p_c,$  (9)

where $p_c$ represents the predicted probability of the $c$-th category, $y_c = 1$ if $c$ is the ground-truth category and 0 otherwise, and $M$ indicates the number of categories. It is worth noting that the loss weights of the blocks' controllers are not all equal, since the features of different layers carry different semantic information: deep layers have stronger semantics than shallow layers. In DMNN-50, there are four stages composed of 3, 4, 6 and 3 stacked blocks respectively, resulting in 16 controllers. The loss weight of the first stage is set to 0.0001, and it increases by a factor of 10 at each subsequent stage. DMNN-101 follows the same principle. The loss of the supervised controllers can be represented as follows:

$L_{ctrl} = \sum_{l=1}^{B} \lambda_l L_{cat}^l,$  (10)

where $\lambda_l$ denotes the loss weight of the $l$-th controller and $B$ denotes the number of blocks. The category supervision is removed after training, so it generates no extra computational burden during testing.
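The stage-wise weighting described above (16 controllers for DMNN-50, a first-stage weight of 0.0001 growing tenfold per stage) can be sketched as:

```python
import math

def controller_loss_weights(blocks_per_stage=(3, 4, 6, 3), base=1e-4, factor=10.0):
    """One loss weight per controller; the weight grows tenfold per stage."""
    weights = []
    for stage, n_blocks in enumerate(blocks_per_stage):
        weights += [base * factor ** stage] * n_blocks
    return weights

def controller_loss(probs_true, weights):
    """Eqn. 10: weighted sum of per-controller cross-entropy terms, where
    probs_true[l] is the probability the l-th controller assigns to the
    ground-truth category (Eqn. 9 with a one-hot target)."""
    return sum(w * -math.log(p) for w, p in zip(weights, probs_true))

w = controller_loss_weights()
print(len(w))  # 16 weights, ranging from 0.0001 (stage 1) to 0.1 (stage 4)
```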
The controller should remain lightweight during the optimization of the network structure. The dimension $d$ of the hidden layer is set to 32 in all experiments. This setting generates only a little computational cost, which is negligible compared with the whole computation of the network: taking DMNN-50 as an example, the 16 controllers together generate only about 0.02% of the FLOPs of the original ResNet-50.
3.3 Optimization
Resource-constrained Loss. The resource constraint comes from two aspects: the block execution rate and the total FLOPs. The execution rate of each block in a mini-batch is used to constrain the average block activation rate to a target rate $t$. Let $r_l$ denote the execution rate of the $l$-th block within a mini-batch; we define it as

$r_l = \frac{1}{N m} \sum_{i=1}^{N} n_i^l,$  (11)

where $m$ is the mini-batch size and $n_i^l$ is the number of executions of the $i$-th sub-block within the mini-batch. The total execution rate loss can be written as follows:

$L_{exec} = \sum_{l=1}^{B} \left( r_l - t \right)^2.$  (12)

The other constraint is the total FLOPs. To meet the desired FLOPs, we explicitly introduce the target FLOPs rate $t_F$ into the loss function. In each mini-batch, we compute the actual FLOPs via

$F_{act} = \frac{1}{m} \sum_{l=1}^{B} \sum_{i=1}^{N} n_i^l f_i^l,$  (13)

where $f_i^l$ indicates the FLOPs of the $i$-th sub-block of the $l$-th block. The FLOPs loss can be formulated as

$L_{FLOPs} = \left( \frac{F_{act}}{F_{full}} - t_F \right)^2,$  (14)

where $F_{full}$ and $F_{act}$ represent the full FLOPs and the actual execution FLOPs of the network respectively, and $t_F$ denotes the target FLOPs rate. We set $t = t_F$ in all experiments since the two rates have a strong positive correlation and similar values. Thus, the resource-constrained loss is defined as

$L_{rc} = L_{exec} + L_{FLOPs}.$  (15)

The total training loss is

$L = L_{cls} + \lambda L_{rc} + L_{ctrl},$  (16)

where $L_{cls}$ is the classification loss and the weight $\lambda$ is chosen empirically in our experiments. The joint loss is optimized by mini-batch stochastic gradient descent.
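Eqns. 11-15 can be sketched numerically; the sub-block execution counts and FLOPs below are made-up values:

```python
def execution_rate(counts, batch_size):
    """Eqn. 11: counts[i] is how often sub-block i ran within the mini-batch."""
    return sum(counts) / (len(counts) * batch_size)

def resource_loss(counts_per_block, flops_per_block, batch_size, target_rate):
    """Eqns. 12-15 with t = t_F = target_rate."""
    rates = [execution_rate(c, batch_size) for c in counts_per_block]
    l_exec = sum((r - target_rate) ** 2 for r in rates)            # Eqn. 12
    f_act = sum(n * f for c, fl in zip(counts_per_block, flops_per_block)
                for n, f in zip(c, fl)) / batch_size               # Eqn. 13
    f_full = sum(sum(fl) for fl in flops_per_block)
    l_flops = (f_act / f_full - target_rate) ** 2                  # Eqn. 14
    return l_exec + l_flops                                        # Eqn. 15

# Toy setting: 2 blocks of 2 sub-blocks, a batch of 4, target rate 0.5.
counts = [[4, 0], [2, 2]]          # executions of each sub-block in the batch
flops = [[1.0, 1.0], [1.0, 1.0]]   # FLOPs f_i^l of each sub-block
print(resource_loss(counts, flops, batch_size=4, target_rate=0.5))  # 0.0
```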
Model  Top-1 Err. (%)  Params (M)  FLOPs (G)  FLOPs Ratio (%)

ResNet-50 [12]  24.7  25.56  3.8  -
ResNet-50 (PyTorch official) [24]  23.85  25.56  3.96  100.0
ResNet-50† (ours)  23.51  25.56  3.96  100.0
ResNet-50 + pruning [22]  23.91  20.45  2.66  70.0
ResNeXt-50 [] [35]  23.0  25.4  4.16  105.1
ResNeXt-50 [] [35]  22.6  25.3  4.20  106.1
ConvNet-AIG-50 [] [29]  23.82  26.56  3.06  77.3
S-ResNet-50 0.75× [36]  25.1  19.2  2.3  58.1
DMNN-50, []  24.06  24.67  2.07  52.3
DMNN-50, []  23.50  24.67  2.28  57.6
DMNN-50, []  23.22  24.67  2.52  63.6
DMNN-50, []  22.57  24.67  3.12  78.8
DMNN-50, []  22.54  25.81  3.16  79.8
DMNN-50, []  22.32  25.81  3.17  80.1
ResNet-101 [12]  23.6  44.54  7.6  -
ResNet-101 (PyTorch official) [24]  23.63  44.55  7.67  100.0
ResNet-101† (ours)  22.02  44.55  7.67  100.0
ResNeXt-101 [] [35]  21.7  44.46  7.9  103.0
ConvNet-AIG-101 [] [29]  22.63  46.23  5.11  66.6
DMNN-101, []  22.82  43.12  2.48  32.3
DMNN-101, []  21.95  43.12  4.21  54.9
DMNN-101, []  21.43  43.12  5.57  72.6
† Our implementations of ResNet-50, ResNet-101, DMNN-50 and DMNN-101 follow the PyTorch community's variant of the conv layers [24], which is slightly different from the original paper.
4 Experiments
In this section, we evaluate the performance of the proposed DMNN on benchmark datasets, including ImageNet and CIFAR-100.
4.1 Training Setup
ImageNet. The ImageNet dataset [4] consists of 1.2 million training images and 50K validation images in 1000 classes. We train networks on the training set and report Top-1 errors on the validation set. We apply standard practice and perform data augmentation with random horizontal flipping and random-size cropping to 224×224 pixels. We use the Nesterov SGD optimizer with momentum 0.9 and a mini-batch size of 256. A cosine learning rate scheduler is employed for better convergence, and the initial learning rate is set to 0.1. We use different weight decays for models of different scales: 0.0001 for ResNet and 0.00004 for MobileNet. All models are trained for 120 epochs from scratch.
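The cosine schedule mentioned above can be sketched as follows; annealing exactly to zero at the final epoch is our assumption, not something specified here:

```python
import math

def cosine_lr(epoch, total_epochs=120, base_lr=0.1):
    """Cosine annealing from base_lr at epoch 0 towards 0 at the last epoch."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0))   # 0.1
print(cosine_lr(60))  # 0.05 at the halfway point
```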
CIFAR-100. The CIFAR-100 dataset [20] consists of 60,000 32×32 color images in 100 classes, split into a training set and a testing set at a ratio of 5:1. Considering the small image size, we follow the same settings as [12] to construct our DMNNs for a fair comparison. We augment the input images by padding 4 pixels of value 0 on each side, followed by random cropping back to 32×32 and random horizontal flipping. We train the networks using SGD with momentum 0.9 and weight decay 0.0001. The mini-batch size is set to 256, and the initial learning rate to 0.1. We train for 200 epochs and divide the learning rate by 10 twice, at the 100th and 150th epochs respectively.

4.2 Performance Analysis
We compare our method with ResNet [12], ResNeXt [35], MobileNetV2 [25], a pruning method [22] and other dynamic inference methods [36, 29]. We denote by $N$ the number of sub-blocks of each block and by $t_F$ the FLOPs target rate.
Model  Top-1 Err. (%)  Params (M)  FLOPs (G)  FLOPs Ratio (%)

MobileNet V2 [25]  28.0  3.47  -  -
MobileNet V2 (ours)  28.09  3.50†  0.30  100.0
S-MobileNet V2 0.75× [36]  31.1  2.7  0.23  76.7
DMNN-MobileNetV2, []  28.30  3.63  0.22  73.3
DMNN-MobileNetV2, []  28.15  3.63  0.24  80.0
DMNN-MobileNetV2, []  27.74  3.63  0.27  90.0
MobileNetV2 (1.4) [25]  25.3  6.06  -  -
MobileNetV2 (1.4) (ours)  25.30  6.09†  0.57  100.0
DMNN-MobileNetV2 (1.4), []  26.03  6.29  0.42  73.7
DMNN-MobileNetV2 (1.4), []  25.53  6.29  0.47  82.5
DMNN-MobileNetV2 (1.4), []  25.26  6.29  0.52  91.2
† Our implementation of MobileNet V2 is based on PyTorch, and its parameter counts are obtained with PyTorch Summary [1].
Performance on heavy networks. Tab. 1 shows that our DMNN achieves remarkable results compared with other heavy models on ImageNet. First, we compare DMNN with ResNet. At one setting of $N$ and $t_F$, our DMNN-50 achieves performance similar to ResNet-50 while saving more than 42.4% FLOPs; at another, it further reduces the Top-1 error by 1.19% while still saving 19.9% FLOPs. Our DMNN-101 outperforms ResNet-101 while at the same time saving 45.1% FLOPs. These comparisons demonstrate that DMNN can greatly reduce FLOPs and improve accuracy compared with models of similar parameter counts. On the other hand, DMNN-50 achieves even better performance than the original ResNet-101 (and comes close to ResNet-101 in our implementation) with a 42.0% reduction in parameters, which indicates that DMNN saves parameters substantially and is feasible for practical model deployment.
We then compare DMNN with the stronger ResNeXt baselines. At appropriate settings of $N$ and $t_F$, our method is superior to ResNeXt-50 in both FLOPs and accuracy: DMNN-50 reduces the Top-1 error by 0.28% while still saving 24.5% FLOPs. A similar result holds when comparing with ResNeXt-101. DMNN shows clear superiority over ResNeXt, mainly because of its finer control over the different convolution groups.
Our method outperforms ConvNet-AIG [29] in both accuracy and computational complexity, demonstrating that the multi-path design is more elaborate and superior to roughly skipping whole blocks. In particular, our DMNN-50 achieves performance comparable to ConvNet-AIG while reducing the FLOPs by approximately 33.3%. Fig. 4 shows the trade-off between computational cost and accuracy of DMNN compared with other dynamic inference methods, including the slimmable neural network S-ResNet [36]. Meanwhile, as an end-to-end method, DMNN shows clear advantages over post-processing methods such as pruning.
We also conduct experiments on the CIFAR-100 dataset, as shown in Tab. 3. DMNN-50 can even outperform ResNet-50 by 1.4% on CIFAR-100 with only 78.7% of the FLOPs.
Performance on lightweight networks. We apply DMNN to the lightweight MobileNetV2, as shown in Tab. 2. Similar conclusions hold, although, as expected, the improvements are not as large as for heavy models because of the compact structure. In particular, our DMNN saves 10.0% FLOPs while achieving a lower Top-1 error than MobileNetV2. The proposed method is also better than other dynamic inference methods.
In summary, our method delivers a strong accuracy-efficiency trade-off for both heavy and lightweight networks, demonstrating its applicability to different architectures and its robustness across datasets.
Model  FLOPs (G)  Top-1 Err. (%)

ResNet-50 [12]  0.33  27.55
DMNN-50, []  0.18  28.24
DMNN-50, []  0.22  27.34
DMNN-50, []  0.26  26.15

Method  PREV  CAT  Top-1 Err. (%)

ResNet-50 [12]  23.51
DMNN-50  23.25
DMNN-50  23.09
DMNN-50  23.20
DMNN-50  22.57
4.3 Ablation Study
Effectiveness of the gate controller. To show the effectiveness of the controllers, we conduct four groups of experiments on the ImageNet dataset with different configurations. Tab. 4 compares the resulting models. Employing the previous controllers' features and the supervised learning of controllers separately each yields an additional improvement; aggregating the two boosts performance by 0.68%, demonstrating the benefits of previous state information embedding and supervised learning of controllers. It is worth noting that using the previous controller's features only introduces one fully connected layer with 32 hidden neurons, whose computational cost is negligible. The supervised learning of the controllers generates minor additional computational cost during training, yet it is removed at the testing stage.
The impact of $N$ and $t_F$. We adopt different values of $N$ and $t_F$ to explore their impact on performance. As shown in Tab. 1, we vary $N$ while keeping $t_F$ fixed on DMNN-50; the model with the largest $N$ obtains the lowest test error rate, indicating that a bigger $N$ leads to more path selection choices and consequently better performance. We further keep $N$ fixed and change $t_F$ to 0.4, 0.5 and 0.6 respectively. A larger $t_F$ leads to more computational cost, which verifies the effectiveness of our resource-constrained mechanism, and the model with a larger FLOPs rate gains higher performance since more computation units are involved. DMNN can thus achieve a better accuracy-efficiency trade-off under a given computational budget. We have not conducted experiments with larger $N$ due to resource limitations, but we will explore the behavior of DMNN with larger $N$ in future work.
4.4 Visualization
Visualization of dynamic inference paths. The inference paths vary across images, which leads to different computational costs. Fig. 5 shows the distribution of FLOPs on the ImageNet validation set using our DMNN-50 model. The proportion of images with intermediate FLOPs is the highest, and images indeed occupy different amounts of computing resources, guided by the computational constraint. We further visualize the execution rates of each sub-block within the categories of animals, artifacts, natural objects and geological formations, as shown in Fig. 7. Some sub-blocks, especially in the first two blocks of the network, are executed all the time, while the execution rates of the other sub-blocks vary across categories. One reason could be that different categories share the same shallow features, which are important for classification; as the layers go deeper, the semantic information of the features becomes stronger and more category-dependent.
Visualization of “easy” and “hard” samples. We find that even samples of the same category can have different inference paths. A reasonable explanation is that hard samples need more computation than easy ones. Fig. 6 shows examples of easy and hard samples with different actual FLOPs. Although for some classes, such as malamute and lifeboat, the “hard” samples are visibly more difficult than the “easy” ones, for most classes the quality gap is not noticeable. We infer that this is because the distinction between easy and hard samples depends mainly on the representation properties of the neural network rather than on human intuition.
5 Conclusion
In this paper, we present a novel dynamic inference method called the Dynamic Multi-path Neural Network (DMNN). The proposed method splits the original block into multiple sub-blocks, making the network more flexible in handling different samples adaptively. We also carefully design the structure of the gate controller to obtain reasonable inference paths, and introduce a resource-constrained loss to make full use of the representation capacity of the sub-blocks. Experimental results demonstrate the superiority of our method.
References
 [1] S. Chandel. sksq96/pytorchsummary, Sep 2018.
 [2] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 2857–2865, 2015.
 [3] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

 [5] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5840–5848, 2017.
 [6] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2241–2248. IEEE, 2010.
 [7] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1039–1048, 2017.
 [8] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Perforatedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016.

 [9] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
 [10] E. J. Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1954.
 [11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [13] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, 2017.
 [14] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [16] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [17] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multiscale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
 [18] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
 [19] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integerarithmeticonly inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.
 [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [21] N. Ma, X. Zhang, H.T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
 [22] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
 [23] A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
 [24] PyTorch. torchvision.models.
 [25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [28] S. Teerapittayanon, B. McDanel, and H. Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469. IEEE, 2016.
 [29] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In European Conference on Computer Vision, pages 3–18. Springer, 2018.

 [30] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
 [31] X. Wang, F. Yu, Z.-Y. Dou, and J. E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. arXiv preprint arXiv:1711.09485, 2017.
 [32] Y. Wei, X. Pan, H. Qin, W. Ouyang, and J. Yan. Quantization mimic: Towards very tiny cnn for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.
 [33] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [34] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817–8826, 2018.
 [35] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [36] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018.
 [37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

 [38] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.