Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference

01/03/2020 ∙ by Jianghao Shen, et al. ∙ Texas A&M University, Rice University

While increasingly deep networks are still in general desired for achieving state-of-the-art performance, for many specific inputs a simpler network might already suffice. Existing works exploited this observation by learning to skip convolutional layers in an input-dependent manner. However, we argue their binary decision scheme, i.e., either fully executing or completely bypassing one layer for a specific input, can be enhanced by introducing finer-grained, "softer" decisions. We therefore propose a Dynamic Fractional Skipping (DFS) framework. The core idea of DFS is to hypothesize layer-wise quantization (to different bitwidths) as intermediate "soft" choices to be made between fully utilizing and skipping a layer. For each input, DFS dynamically assigns a bitwidth to both weights and activations of each layer, where fully executing and skipping could be viewed as two "extremes" (i.e., full bitwidth and zero bitwidth). In this way, DFS can "fractionally" exploit a layer's expressive power during input-adaptive inference, enabling finer-grained accuracy-computational cost trade-offs. It presents a unified view to link input-adaptive layer skipping and input-adaptive hybrid quantization. Extensive experimental results demonstrate the superior tradeoff between computational cost and model expressive power (accuracy) achieved by DFS. More visualizations also indicate a smooth and consistent transition in the DFS behaviors, especially the learned choices between layer skipping and different quantizations when the total computational budgets vary, validating our hypothesis that layer quantization could be viewed as intermediate variants of layer skipping. Our source code and supplementary material are available at




1 Introduction

Although convolutional neural networks (CNNs) have shown state-of-the-art performance in many visual perception tasks [13, 19], their high computational cost has limited their application on resource-constrained platforms such as drones, self-driving cars, and wearables. The growing demand for bringing the intelligent power of CNNs to these devices has posed unique challenges in developing algorithms that enable more computationally efficient CNN inference. Earlier resource-efficient implementations assumed that CNNs are compressed before being deployed, thus being “static” and unable to adjust their own complexity at inference. Later on, a series of works [4, 23]

pointed out that the continuous improvements in accuracy, while significant, are marginal compared to the growth in model complexity. This implies that computationally intensive models may only be necessary to classify a handful of difficult inputs correctly, and they might become “wasteful” for many simple inputs.

Motivated by this observation, several works have tackled the problem of input-dependent adaptive inference by dynamically bypassing unnecessary computations at the layer level, i.e., selectively executing a subset of layers [4, 27]. However, the binary decision scheme of either executing a layer fully or skipping it completely leaves no room for intermediate options. We conjecture that finer-grained dynamic execution options can better calibrate the inference accuracy of CNNs w.r.t. the complexity consumed.

On a separate note, CNN quantization exploits model redundancy at the finest level, by reducing the bitwidth of the element-level numerical representations of weights and activations. Earlier works [6, 32] proposed quantizing all layer-wise weights and activations to the same low bitwidth, ignoring the fact that different layers can have different importance. The latest work [22] learned to assign a different bitwidth to each layer. However, no work has yet discussed input-adaptive, layer-wise bitwidth allocation at inference time, let alone linked quantization with dynamic inference.

In an effort to enable finer-grained dynamic inference beyond “binary” layer skipping, we propose a Dynamic Fractional Skipping (DFS) framework that treats layer quantization (to different bitwidths) as softer, intermediate versions of layer skipping. Below are our contributions:

  • We propose to link two efficient CNN inference mindsets: dynamic layer skipping and static quantization, and show that they can be unified into one framework. Specifically, DFS considers a quantized layer to be a “fractionally executed” layer, in contrast to either a fully executed (selected) or non-executed (bypassed) layer in the existing layer skipping regime. In this way, DFS can more flexibly calibrate the trade-off between the inference accuracy and the total computational costs.

  • We introduce input-adaptive quantization at inference for the first time. Based on each input, DFS learns to dynamically assign different bitwidths to both weights and activations of different layers, using a two-step training procedure. That is in contrast to [22] that learns layer-wise bit allocation during training, which is then fixed for inference regardless of inputs. The existing layer skipping could be viewed as DFS’s coarse-grained version, i.e., allowing only to select between full bits (executing without quantization) and zero bit (bypassing).

  • We conduct extensive experiments to illustrate that DFS strikes a better computational cost and inference accuracy balance, compared to dynamic layer skipping and other relevant competitors. Moreover, we visualize the skipping behaviors of DFS when varying the total inference computations in a controlled way, and observe a smooth transition from selecting, to quantizing, and to bypassing layers. The observation empirically supports our conjecture that layer quantization can be viewed as soft and intermediate variants of layer skipping.

2 Related Works

Model Compression. Model compression has been widely studied to speed up CNN inference by reducing model size [26]. Existing works focus on pruning unimportant model weights or quantizing models to low bitwidths.

Pruning: There have been extensive studies on model pruning at different granularities. [6, 7] reduce redundant parameters by performing element-wise weight pruning. Coarser-grained channel-level pruning has been explored in [30, 16, 9] by enforcing group sparsity. [25] exploits parameter redundancy in a multi-grained manner by grouping weights into structured groups during pruning, each with a Lasso regularization. [28] proposes hybrid pruning by performing element-wise pruning on top of a filter-wise pruned model. [12] performs multi-grained model pruning by adding explicit objectives for different levels. [3] presents a comprehensive review of pruning techniques. These methods are applied to well-trained networks and do not dynamically adjust the model complexity conditioned on the input.

Network Quantization: Quantizing network weights and activations has been proven an effective approach to reduce memory and computational budgets. Most existing works quantize the model to various bitwidths with marginal accuracy loss. [17] binarized each convolution filter into {-w, +w}. [31] used one bit for network weights and two bits for activations. [11] made use of 8-bit integers for both weights and activations. With recent developments in hardware design, it has become possible to use flexible bitwidths for different layers [22]. [6] determines the layer-wise bit allocation policy based on domain expertise; [22] further enhanced the idea by automating the decision process with reinforcement learning. These works either empirically find fixed bitwidths or automatically learn a fixed layer-wise bit allocation regardless of input, ignoring that the importance of each layer may vary across inputs. Our proposed DFS models are orthogonal to existing static quantization methods.
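To make concrete what the fixed-bitwidth schemes surveyed above do at the element level, here is a minimal sketch of symmetric uniform quantize-dequantize ("fake quantization") of a single value. The function name and the fixed clipping range are our own illustrative choices, not taken from any cited work.

```python
def fake_quantize(x, bits, x_max=1.0):
    """Quantize-dequantize a value onto a symmetric uniform b-bit grid.

    The value is scaled into units of one quantization step, rounded to
    the nearest integer level, clipped to the representable range, then
    mapped back to a float. bits >= 32 is treated as full precision.
    """
    if bits >= 32:
        return x
    levels = 2 ** (bits - 1) - 1      # e.g. 127 levels for 8 bits
    scale = x_max / levels            # size of one quantization step
    q = round(x / scale)              # nearest integer level
    q = max(-levels, min(levels, q))  # clip to the representable range
    return q * scale
```

With 8 bits, the rounding error on a weight in [-1, 1] is at most half a step (about 0.004); with 2 bits the same weight can move by up to 0.5, which is why per-layer bitwidth allocation matters.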

Dynamic Inference. While model compression presents “static” solutions for improving inference efficiency, i.e., the compressed models cannot adaptively adjust their complexity at inference, recently developed dynamic inference methods offer a different option: executing partial inference conditioned on input complexity or resource constraints.

Dynamic Layer Skipping: Many works [27, 23, 24] have formulated dynamic inference as a sequential decision problem, selectively bypassing subsets of layers conditioned on the input. The common approach of these works is to use gating networks to determine the layer-wise skipping policy in ResNet-style models [8], which are inherently suitable for skipping due to their residual structure. SkipNet uses a hybrid learning algorithm that sequentially performs supervised pretraining and reinforcement fine-tuning, achieving a better resource-saving and accuracy tradeoff than existing static model compression methods [23]. BlockDrop [27] uses a decision network to train a global skipping policy for residual blocks, and [15] trains separate control units for the execution policy of sub-parts of the network. In these works, a layer is either executed fully or skipped completely, leaving no space for any intermediate options. We show that by adding “softer” intermediate quantization options between the two extremes, the DFS framework exploits the layer’s expressive power at a finer granularity, achieving better accuracy than layer skipping methods under the same computational cost.

Dynamic Channel Selection/Pruning: Since layer skipping only works well in network architectures with residual connections, channel pruning methods have been developed to exploit the redundancy in CNNs at a finer level. One line of work formulates the channel pruning problem as a Markov decision process and applies an RNN gating network to determine which channels to prune conditioned on the input. GaterNet [2] uses a separate network to calculate the channel activation strategy. The slimmable neural network [29] trains the network with varied layer widths and adjusts the channel number during inference to meet resource budgets. [21] selectively executes branches of the network based on the input. Compared to quantization, channel selection exploits fine-grained model redundancy at the channel level; it is orthogonal to our method and can potentially be combined with our framework to yield further resource savings.

Early Exiting: In many real-world applications, there are strict resource constraints: the networks should hence allow for “anytime prediction” and be able to halt the inference whenever a specified budget is met. A few prior works equip CNNs with “early exit” functions. [20] adds additional branch classifiers to the backbone CNNs, forcing a large portion of inputs to exit at the branches in order to meet resource demands. [10] further boosts the performance of early exiting by aggregating features from different scales for early prediction. The early exiting works have been developed for resource-dependent inference, which is orthogonal to our input-dependent inference, and the two can be combined to yield further resource savings.

3 The Proposed Framework

On resource-constrained platforms, ideal efficient CNN inference should save as much resource as possible without non-negligible accuracy degradation. This requires the algorithm to take maximal advantage of the model’s expressive power while dropping any redundant parts. Existing works like SkipNet exploit model redundancy at the layer level; however, their binary decision of either executing a layer fully or skipping it completely makes it impossible to use the layer’s representational power at any finer level. In contrast, CNN quantization exploits model redundancy at the finest level, by reducing the bitwidth of the numerical representations of weights and activations. A natural thought is thus to use bitwidth options to fill the gap between the binary options of layer skipping, striking a better tradeoff between computational cost and accuracy.

We hereby propose a Dynamic Fractional Skipping (DFS) framework that combines the following two schemes into one continuous fine-grained decision spectrum:

  • Input-Dependent Layer Skipping. On the coarse-grained level, the “executed” option of the layer skip decision is equivalent to the full bitwidth option of layer quantization in the DFS framework, and the “skip” option is equivalent to a zero-bit option of layer quantization.

  • Input-Dependent Network Quantization. On the fine-grained level, any lower than full bitwidth execution can be viewed as “fractionally” executing a layer, enabling the model to take advantage of the expressive power of the layer in its low bitwidth version.

To the best of our knowledge, DFS is the first attempt to unify the binary layer skipping design and one of its intermediate “soft” variants, i.e., quantization, into one dynamic inference framework. Together they achieve favorable tradeoffs between accuracy and computational usage by skipping layers where possible or executing varied “fractions” of the layers. Meanwhile, state-of-the-art hardware designs for CNNs have shown that such DFS schemes are hardware friendly. For example, [18] proposed a bit-flexible CNN accelerator that constitutes an array of bit-level processing elements to dynamically match the bitwidth of each individual layer. With such dedicated accelerators, the proposed DFS’s energy savings would be maximized.

DFS Overview. We here introduce how the DFS framework is implemented in ResNet-style models, which have been the most popular backbone CNNs for dynamic inference [23, 27]. Figure 1 illustrates the operation of our DFS framework. Specifically, for the $i$-th layer, we let $x_{i+1} \in \mathbb{R}^{C \times W \times H}$ denote its output feature maps and therefore $x_i$ its input ones, where $C$ denotes the total number of channels and $W \times H$ denotes the feature map size. Also, we employ $F_i^{b}$ to denote the convolutional operation of the $i$-th layer executed in $b$ bits (e.g., $b = 32$ corresponds to the full bitwidth) and design a gating network $G_i$ for determining the fractional skipping of the $i$-th layer. Suppose there are a total of $N$ decision options, including the two binary skipping options (i.e., SkipNet) and a set of varied bitwidth options for quantization; then $G_i$ outputs a gating probability vector of length $N$. The operation of a convolutional layer under the DFS framework can then be formulated as:

$$x_{i+1} = \sum_{k=0}^{N-1} g_i^k \, F_i^{b_k}(x_i), \qquad F_i^{b_0}(x_i) := x_i \qquad (1)$$

where $g_i$ is the gating probability vector of the $i$-th layer, $g_i^k$ denotes the value of its $k$-th entry, and $b_k$ represents the bitwidth option corresponding to the $k$-th entry. When $k = 0$, we let $g_i^0$ represent the probability of a complete skip (i.e., the identity mapping).
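The layer operation above can be sketched in a few lines; `conv_at_bits` stands in for the layer's $b$-bit convolution operator (a placeholder we supply), and the soft branch mixing at training time reflects the softmax approximation described in the training subsection. All names here are illustrative, not from the paper's implementation.

```python
def dfs_layer(x, conv_at_bits, gate_probs, bit_options, training):
    """One DFS layer step (illustrative sketch).

    bit_options[0] == 0 encodes a complete skip (identity on x); the
    remaining entries are bitwidths at which the layer may be executed.
    At inference only the arg-max branch runs; at training every branch
    is executed and weighted by its gating probability.
    """
    if not training:
        k = max(range(len(gate_probs)), key=lambda j: gate_probs[j])
        return x if bit_options[k] == 0 else conv_at_bits(x, bit_options[k])
    out = 0.0
    for p, b in zip(gate_probs, bit_options):
        out += p * (x if b == 0 else conv_at_bits(x, b))
    return out

# Toy check with a scalar "feature map" and a fake conv that scales by bitwidth:
toy_conv = lambda x, b: x * b
y = dfs_layer(2.0, toy_conv, [0.1, 0.7, 0.2], [0, 8, 16], training=False)
```

At inference the arg-max entry (here the 8-bit branch) is the only computation performed, which is where the savings come from.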

Figure 1: An illustration of the DFS framework where C1, C2, C3 denote three consecutive convolution layers, each of which consists of a column of filters as represented using cuboids. In this example, the first conv layer is executed fractionally with a low bitwidth, the second layer is fully executed using the full bitwidth, while the third one is skipped.

Gating Design of DFS. In the DFS framework, the execution decision of a layer is calculated based on the output of the previous layer. Therefore, the gating network should be able to capture the relevance between consecutive layers in order to make informative decision. As discussed in [23]

, recurrent neural networks (RNNs) have the advantages of both light weight (due to its parameter sharing design, which accounts for only 0.04% of the computational cost of a residual block) and being able to learn sequential tasks (due to its recurrent structure), thus, we adopt this convention and implement the gating function

as a Long Short Term Memory (LSTM) network, as depicted in Figure

2. Specifically, suppose there are options including the binary skipping options and the intermediate bitwidth options, then the LSTM output will be projected into a skipping probability vector of length

via softmax function. During inference, the largest element of the vector will be quantized to 1 and selected for execution; during training, the skipping probability will be used for backpropagation, which will be introduced in more detail in the subsection of DFS training.
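The gate's final projection can be sketched as a plain softmax followed by the inference-time hardening described above (the LSTM body itself is omitted; the helper names are ours):

```python
import math

def gate_probs(logits):
    """Softmax over the gate's N option logits (skip, bitwidths..., keep)."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def harden(probs):
    """Inference-time decision: quantize the largest entry to 1, rest to 0."""
    k = max(range(len(probs)), key=lambda j: probs[j])
    return [1.0 if j == k else 0.0 for j in range(len(probs))]
```

During training the soft `gate_probs` output feeds the weighted sum in the layer operation, keeping the decision differentiable; `harden` recovers the discrete execution choice at inference.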

Figure 2: An illustration of the RNN gate used in DFS. The output is a skipping probability vector, where the green arrows denote the layer skip options (skip/keep), and the blue arrows represent the quantization options. During inference, the skip/keep/quantization options corresponding to the largest vector element will be selected to be executed.

Training of DFS. Objective Function: The learning objective of DFS is to boost prediction accuracy while minimizing computational cost, and is defined as follows:

$$\min_{W,\,W_G}\; \mathcal{L}_{acc}(W, W_G) + \alpha\, \mathcal{L}_{res}(W, W_G) \qquad (2)$$

where $\mathcal{L}_{acc}$ represents the accuracy loss, $\mathcal{L}_{res}$ is the resource-aware loss, $\alpha$ is a weighting factor that trades off accuracy against the resource budget, and $W$ and $W_G$ denote the parameters of the backbone model and the gating network, respectively. The resource-aware loss is calculated as the total computational cost of the executed layers (measured in FLOPs in this paper).
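A sketch of the combined objective follows. The quadratic bitwidth cost model is our own assumption (consistent with the roughly 6.25% cost of all-8-bit execution reported in Section 4), and all names are illustrative:

```python
def dfs_loss(acc_loss, layer_flops, layer_gate_probs, bit_options, alpha):
    """Accuracy loss plus alpha-weighted resource-aware loss (cf. Eq. (2)).

    Each layer's expected cost sums its gating probabilities times the
    layer's full-precision FLOPs scaled by (b/32)**2; a skip (b == 0)
    costs nothing.
    """
    res = 0.0
    for flops, probs in zip(layer_flops, layer_gate_probs):
        for p, b in zip(probs, bit_options):
            res += p * flops * (b / 32.0) ** 2
    return acc_loss + alpha * res
```

Because the gating probabilities enter the resource term linearly, the budget pressure backpropagates through the softmax gates just like the accuracy term does.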

Skipping Policy Learning with Softmax Approximation: During inference, the execution decision is made automatically based on the skipping probability vector of each gate: the layer is either skipped or executed at one of the chosen bitwidths. This discrete and therefore non-differentiable decision process makes it difficult to train DFS with stochastic gradient descent. One alternative is to use a softmax approximation for backpropagation and quantize to discrete decisions for inference [5, 23]. In this paper, we adopt the same technique for training the gating network.

Two-Step Training Procedure: Given a pre-trained CNN model $M$, our goal is to jointly train $M$ and its gating network for a targeted computational budget. However, directly training with randomly initialized gating networks results in much lower accuracy than that of $M$. One possible reason is that with random gate initialization, the model may start at a point where a majority of its layers are skipped or executed at low bitwidths, causing a large deviation in the feature maps' statistics that cannot be captured by the batch normalization parameters of the originally trained $M$. To tackle this problem, we conjecture that if training starts from the fully executed model and then gradually reduces the computational cost towards the target budget, the batch normalization parameters will adapt to the new feature statistics. Motivated by this idea, we use a two-step training procedure:

1) Fix the parameters of $M$ and train the gating network to reach the state of executing all layers at full bitwidth.

2) With the initialization obtained from the first step, jointly train $M$ and the gating network to achieve the targeted computational budget.

The computational cost is controlled via the computation percentage ($cp$), defined as the ratio between the FLOPs of the executed layers and the FLOPs of the full bitwidth model. During training, we dynamically change the sign of $\alpha$ in Equation (2) to stabilize the model's $cp$: for each iteration, if the $cp$ of the current batch of samples is above the targeted $cp$, we set $\alpha$ to be positive, enforcing the model to reduce its $cp$ by suppressing the resource loss in Equation (2); if the $cp$ is below the targeted $cp$, we set $\alpha$ to be negative, encouraging the model to increase its $cp$ by reinforcing $\mathcal{L}_{res}$. In the end, the model's $cp$ stabilizes around the target. The absolute value of $\alpha$ is the step size for adjusting the $cp$; since we empirically found that the performance of our model is robust to a wide range of step sizes, we fix the absolute value of $\alpha$. More detailed experiments on the choice of $\alpha$ are presented in Section 4.
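The sign-flipping rule above amounts to a one-line controller. A minimal sketch, with our own naming and assuming the computation percentage is measured per batch:

```python
def alpha_for_batch(batch_cp, target_cp, alpha_abs=5e-6):
    """Dynamic sign of alpha: penalize resource use when the batch's
    computation percentage overshoots the target (positive alpha), and
    reward it when it undershoots (negative alpha), so the model's cp
    stabilizes around the target."""
    return alpha_abs if batch_cp > target_cp else -alpha_abs
```

The fixed magnitude `alpha_abs` acts as the step size: larger values make the cp oscillate around the target, which is exactly the instability studied in the "Choice of Parameter" experiments.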

4 Experimental Results

Experiment Setup. Models and Datasets: We evaluate the DFS method using ResNet38 and ResNet74 as the backbone models on two datasets: CIFAR-10 and CIFAR-100. In particular, the structure of the ResNet models follows the design in [8]. For layer quantization, we consider 4 dynamic execution choices: skip, 8 bits, 16 bits, and keep (full bitwidth), and follow the suggestion in [23] to always execute the first residual block at full bitwidth. Metrics: We compare DFS with relevant state-of-the-art techniques in terms of the tradeoff between prediction accuracy and computation percentage. Note that a layer executed at a bitwidth of 8 bits has a computation percentage of $(8/32)^2 = 6.25\%$ of the full bitwidth layer.
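The per-layer computation percentages used throughout the evaluation follow from a quadratic bitwidth cost model, i.e., multiply cost scaling with the bitwidth of both operands. This is our reading, but it is consistent with the 6.25% all-8-bit operating point seen in the decision visualizations later in this section:

```python
def layer_cp(bits, full_bits=32):
    """Computation percentage of a layer executed at `bits` bits, assuming
    multiply cost scales with the bitwidth of both operands (weights and
    activations quantized to the same width). bits == 0 means a skip."""
    return (bits / full_bits) ** 2

# 8-bit execution costs 6.25% of full bitwidth; 16-bit costs 25%.
```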

Training Details: The training of DFS follows the two-step procedure described in Section 3. For the first step, we set the initial learning rate to 0.1 and train the gating network for a total of 64000 iterations; the learning rate is reduced after the 32000-th iteration and further reduced after the 48000-th iteration. The specified computation budget is set to 100%. The momentum, weight decay factor, and batch size are set to 0.9, 1e-4, and 128, respectively, and the absolute value of $\alpha$ in Equation (2) is set to 5e-6.

After the first step is finished, we use the resulting LSTM gating network as the initialization for the second step, where we jointly train the backbone model and the gating network to reach the specified computation budget. Here we use an initial learning rate of 0.01 with the pre-specified target $cp$; all other settings are the same as in the first step.

DFS Performance Evaluation. We evaluate the proposed DFS against the competitive dynamic inference technique SkipNet [23] and two state-of-the-art static CNN quantization techniques proposed in [1] and [22].

Comparison with SkipNet (Dynamic Inference Method): In this subsection, we compare the performance of DFS with that of the SkipNet method. Specifically, we compare the performance of DFS on ResNet38 and ResNet74 with that of SkipNet38 and SkipNet74, respectively, on both the CIFAR-10 and CIFAR-100 datasets. We denote by DFS-ResNetxx the models with DFS applied on top of the ResNetxx backbone.

Experimental results on CIFAR-10 are shown in Figure 3 (vs. ResNet38) and Figure 4 (vs. ResNet74). Specifically, Figures 3-4 show that (1) given the same computation budget, DFS-ResNet38/74 consistently achieves higher prediction accuracy than SkipNet38/74 over a wide range of computation percentages (20%-80%), with the largest margin being about 4% (93.61% vs. 89.26%) at a computation percentage of 20%; (2) at the same or even higher accuracy, DFS-ResNet38/74 achieves more than 60% computational savings compared to SkipNet38/74; (3) interestingly, DFS-ResNet38/74 even achieves better accuracy than the original full bitwidth ResNet38/74. We conjecture that this is because DFS helps relieve model overfitting, thanks to its finer-grained dynamic execution.

Figure 5 (vs. ResNet38) and the corresponding ResNet74 figure show the results on CIFAR-100. We can see that the accuracy improvement (or computational savings) achieved by DFS-ResNet38/74 over SkipNet38/74 is even more pronounced given the same computation percentage (or the same/higher accuracy). For example, as shown in Figure 5, DFS-ResNet38 achieves 8% (68.91% vs. 60.38%) better prediction accuracy than SkipNet38 at a computation percentage of 20%; DFS-ResNet74 similarly outperforms SkipNet74 by 6% (70.94% vs. 65.09%) at a computation percentage of 20%.

The four sets of experimental results above show that (1) CNNs with DFS outperform the corresponding SkipNets even at a high computation percentage of 80% (i.e., a small computational saving of 20% over the original ResNet backbones); (2) as the computation percentage decreases from 80% to 20% (corresponding to computational savings from 20% to 80%), the prediction accuracy of CNNs with DFS stays relatively stable (fluctuating slightly within a range of 0.5% while remaining consistently higher than that of SkipNet at the same computation percentage), whereas the accuracy of SkipNet decreases drastically. These observations validate our conjecture that DFS's finer-grained dynamic execution options can better calibrate the inference accuracy of CNNs w.r.t. the complexity consumed.

Figure 3: Comparing the accuracy vs. computation percentage of DFS-ResNet38 and SkipNet38 on CIFAR10.
Figure 4: Comparing the accuracy vs. computation percentage of DFS-ResNet74 and SkipNet74 on CIFAR10.
Figure 5: Comparing the accuracy vs. computation percentage of DFS-ResNet38 and SkipNet38 on CIFAR-100.

Comparison with Statically Quantized CNNs: In this section, we compare DFS with two state-of-the-art static CNN quantization methods, the scalable network [1] and HAQ [22], with ResNet38 as the backbone on CIFAR-10. Specifically, we train the scalable network [1] under a set of bitwidths (8, 10, 12, 14, 16, 18, 20, and 22 bits). Following HAQ's official implementation, only the weights are quantized, and we control the HAQ-quantized models' computation percentage via the compression ratio (the ratio between the size of the quantized weights and that of the full bitwidth weights). The bitwidth allocation of HAQ is shown in the supplementary material; it can be seen that HAQ learns fine-grained quantization options, with the smallest difference between two options being 1 bit. Note that (1) DFS is orthogonal to static CNN quantization methods, and thus can be applied on top of quantized models to further reduce CNNs' computational cost; and (2) this set of experiments is not meant to show that DFS is better than static CNN quantization methods. Instead, the motivation is that it is insightful to observe the behaviors of static and dynamic quantization methods under the same settings.

Figure 6 shows the results. It can be seen that DFS-ResNet38 achieves similar or slightly better accuracy (up to 1.2% over the scalable method and 0.2% over HAQ) than both methods, even with much coarser-grained quantization options (keep, skip, 8 bits, and 16 bits). Furthermore, among the three methods, the prediction accuracy of the scalable method fluctuates the most as the computation percentage changes, showing that CNNs with layer-wise adaptive bitwidths achieve better tradeoffs between accuracy and computational cost.

Figure 6: Accuracy vs. computation percentage of DFS-ResNet38, the scalable quantized ResNet38 and HAQ quantized ResNet38.

Choice of Parameter $\alpha$: We conduct two sets of experiments to demonstrate how dynamically changing the sign of $\alpha$ as described in Section 3 is necessary for reaching the targeted $cp$, and how the absolute value of $\alpha$ affects the model's performance. Table 1 compares two training scenarios of DFS-ResNet74 on CIFAR-10: DFS-ResNet74-D denotes the case where the sign of $\alpha$ is changed dynamically, and DFS-ResNet74-C the case where $\alpha$ is a positive constant; in both cases we set the absolute value of $\alpha$ to 1e-5. It can be seen that when $\alpha$ is constant, the resulting actual $cp$ deviates significantly from the target, since a positive $\alpha$ keeps enforcing the model to reduce its $cp$ without constraint, while the dynamic case achieves the desired $cp$. Table 2 shows how the performance of the DFS-ResNet74 model varies with the absolute value of $\alpha$ on CIFAR-10; during training, the dynamic sign change of $\alpha$ is applied. There is an obvious accuracy drop (about 0.2%) under both targeted $cp$ values when the absolute value of $\alpha$ increases to 1e-4 or 1e-3, where the actual $cp$ deviates from the target by around 3%. This is because a larger step size causes the model's $cp$ to fluctuate, and the resulting unstable training degrades accuracy; the stable performance in the range (1e-6, 1e-5) shows that the model is robust to smaller step sizes under different targeted $cp$ values.

Model | Target cp | Actual cp | Acc
DFS-ResNet74-D | 40% | 40.20% | 93.53%
DFS-ResNet74-C | 40% | 8.20% | 93.12%
Base-ResNet74 | - | - | 93.55%
Table 1: DFS performance under dynamically changing vs. constant $\alpha$.
abs of $\alpha$ | Actual cp | Acc | Actual cp | Acc
1e-6 | 40.10% | 93.53% | 50.08% | 93.72%
1e-5 | 40.93% | 93.54% | 50.20% | 93.74%
1e-4 | 40.80% | 93.31% | 51.01% | 93.52%
1e-3 | 43.75% | 93.27% | 53.40% | 93.42%
Table 2: Performance of DFS models under different absolute values of $\alpha$; the first Actual cp/Acc pair is for target cp = 40%, the second for target cp = 50%.

Decision Behavior Analysis and Visualization

We then visualize and study the learned layer-wise decision behaviors of DFS, and how they evolve as $cp$ increases. We demonstrate that quantization options are indeed natural candidates for intermediate "fractional" skipping choices. Specifically, we investigate how these decisions gradually change to layer quantization at different bitwidths. In general, the (full) layer skip options are likely to be taken only when a very low $cp$ is enforced. When the computational saving requirement is mild, the model shows a strong tendency to "fractionally" skip all its layers.

Figures 7-9 show the layer-wise "decision distributions" (e.g., the skip option taken per layer) of DFS-ResNet74 trained on CIFAR-10, as the computation percentage increases from 4% to 6.25%. In this specific case and (quite low) percentage range, the model chooses only between "skip" and "8bit" for the vast majority of inputs; we therefore plot only the "skip" and "8bit" columns for compact display. We can observe a smooth transition of decision behaviors as the computation percentage varies: from a mixture of layer skipping and quantization, gradually to all-layer quantization. Specifically, from Figure 7 to Figure 8, within the first residual group, the percentage of skipping options for blocks 2, 4, and 9 remains roughly unchanged, while we observe an obvious drop in skipping percentage at block 5 (from about 55% to 0%) and block 8 (from about 100% to about 10%). Similarly, for the second and third residual groups, the skipping percentage of most residual blocks gradually reduces to about 0%, while that of the remaining blocks (20, 22, 23, 24) stays roughly unchanged. From Figure 8 to Figure 9, the decisions of all layers shift to 8bit. This smooth transition empirically endorses our hypothesis from Section 3 that the layer quantization options can serve as a "fractional" intermediate stage between the binary layer skipping options.

Figure 7: Visualization of layerwise decision distribution of DFS-ResNet74 on CIFAR10: computation percentage = 4 %.
Figure 8: Visualization of layerwise decision distribution of DFS-ResNet74 on CIFAR10: computation percentage = 5 %.
Figure 9: Visualization of layerwise decision distribution of DFS-ResNet74 on CIFAR10: computation percentage = 6.25 %.

As $cp$ increases, DFS apparently favors the finer-grained layer quantization options over the coarser-grained layer skipping. Figure 10 shows the accuracy of DFS-ResNet74 as the computation percentage increases from 4% to 30%. From 4% to 6.25% (as the layer skipping options gradually change to all-quantization options), there is a notable accuracy increase from 92.91% to 93.54%. The performance then reaches a plateau beyond the computation percentage of 6.25%, where we observe that DFS tends to choose quantization for all layers (see supplementary material). This phenomenon demonstrates that, as "fractional" layer skipping candidates, low bitwidth options can better preserve the model's accuracy over a wider range of $cp$.

Additionally, from Figures 7-9, we observe that the first residual block within each residual group is learned never to be skipped, regardless of the value of $cp$. This aligns with the findings in [5], which show that for ResNet-style models, only the first residual block within each group extracts a completely new representation (and is therefore most important), while the remaining residual blocks within the same group only iteratively refine this feature.

Figure 10: Accuracy of DFS-ResNet74 under lower computation percentages.

5 Conclusion

We proposed a novel DFS framework, which extends the binary layer skipping options with the ability to “fractionally” skip a layer, by quantizing the layer’s weights and activations into different bitwidths. The DFS framework exploits model redundancy at a much finer-grained level, enabling more flexible and effective calibration between inference accuracy and complexity. We evaluated DFS on the CIFAR-10 and CIFAR-100 benchmarks, where it compared favorably against both a state-of-the-art dynamic inference method and static quantization techniques.

While we demonstrate that quantization can indeed be viewed as a “fractional” intermediate state between the binary layer skipping options (through both the achieved results and the visualizations of skipping decision transitions), we recognize that more alternatives for “fractionally” executing a layer could be explored, such as channel slimming [29]. We leave this as future work.


  • [1] R. Banner, I. Hubara, E. Hoffer, and D. Soudry (2018) Scalable methods for 8-bit training of neural networks. In NIPS.
  • [2] Z. Chen, Y. Li, S. Bengio, and S. Si (2018) GaterNet: dynamic filter selection in convolutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205.
  • [3] Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
  • [4] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks. In CVPR.
  • [5] K. Greff, R. K. Srivastava, and J. Schmidhuber (2016) Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771.
  • [6] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  • [7] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NIPS.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [9] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV.
  • [10] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844.
  • [11] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
  • [12] E. Kim, C. Ahn, and S. Oh (2018) NestedNet: learning nested sparse structures in deep neural networks. In CVPR.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
  • [14] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In NIPS.
  • [15] L. Liu and J. Deng (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI.
  • [16] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV.
  • [17] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV.
  • [18] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh (2018) Bit Fusion: bit-level dynamically composable architecture for accelerating deep neural networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
  • [19] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) DeepFace: closing the gap to human-level performance in face verification. In CVPR.
  • [20] S. Teerapittayanon, B. McDanel, and H. Kung (2016) BranchyNet: fast inference via early exiting from deep neural networks. In 2016 23rd ICPR.
  • [21] R. Teja Mullapudi, W. R. Mark, N. Shazeer, and K. Fatahalian (2018) HydraNets: specialized dynamic architectures for efficient inference. In CVPR.
  • [22] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In CVPR.
  • [23] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018) SkipNet: learning dynamic routing in convolutional networks. In ECCV.
  • [24] Y. Wang, T. Nguyen, Y. Zhao, Z. Wang, Y. Lin, and R. Baraniuk (2018) EnergyNet: energy-efficient dynamic inference. In NIPS 2018 Workshop.
  • [25] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In NIPS.
  • [26] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin (2018) Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In ICML, pp. 5359–5368.
  • [27] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR.
  • [28] X. Xu, M. S. Park, and C. Brick (2018) Hybrid pruning: thinner sparse networks for fast inference on edge devices. arXiv preprint arXiv:1811.00482.
  • [29] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928.
  • [30] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke (2017) Scalpel: customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News.
  • [31] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
  • [32] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064.