1 Introduction
Although convolutional neural networks (CNNs) have shown state-of-the-art performance in many visual perception tasks
[13, 19], their high computational cost has limited their application in resource-constrained platforms such as drones, self-driving cars, and wearables. The growing demand for unleashing the intelligent power of CNNs on these devices has posed unique challenges in developing algorithms that enable more computationally efficient CNN inference. Earlier resource-efficient implementations assumed that CNNs are first compressed before being deployed, thus being “static” and unable to adjust their own complexity at inference. Later on, a series of works [4, 23] pointed out that the continuous improvements in accuracy, while significant, are marginal compared to the growth in model complexity. This implies that computationally intensive models may only be necessary to classify a handful of difficult inputs correctly, and they might become “wasteful” for many simple inputs.
Motivated by this observation, several works have tackled the problem of input-dependent adaptive inference by dynamically bypassing unnecessary computations at the layer level, i.e., selectively executing a subset of layers [4, 27]. However, the binary decision scheme of either executing a layer fully or skipping it completely leaves no room for intermediate options. We conjecture that finer-grained dynamic execution options can contribute to better calibrating the inference accuracy of CNNs w.r.t. the complexity consumed.
On a separate note, CNN quantization exploits model redundancy at the finest level, by reducing the bitwidth of the element-level numerical representations of weights and activations. Earlier works [6, 32] proposed to quantize all layer-wise weights and activations to the same low bitwidth, ignoring the fact that different layers can have different importance. The latest work [22] learned to assign a different bitwidth to each layer. However, no work has yet discussed input-adaptive, layer-wise bitwidth allocation at inference time, not to mention linking quantization with dynamic inference.
In an effort to enable finer-grained dynamic inference beyond “binary” layer skipping, we propose a Dynamic Fractional Skipping (DFS) framework that treats layer quantization (to different bitwidths) as softer, intermediate versions of layer-wise skipping. Below are our contributions:

We propose to link two efficient CNN inference mindsets, dynamic layer skipping and static quantization, and show that they can be unified into one framework. Specifically, DFS considers a quantized layer to be a “fractionally executed” layer, in contrast to either a fully executed (selected) or non-executed (bypassed) layer in the existing layer skipping regime. In this way, DFS can more flexibly calibrate the trade-off between inference accuracy and total computational cost.

We introduce input-adaptive quantization at inference for the first time. Based on each input, DFS learns to dynamically assign different bitwidths to both the weights and activations of different layers, using a two-step training procedure. This is in contrast to [22], which learns a layer-wise bit allocation during training that is then fixed for inference regardless of the input. Existing layer skipping can be viewed as a coarse-grained version of DFS, i.e., allowing a selection only between full bits (executing without quantization) and zero bits (bypassing).

We conduct extensive experiments to illustrate that DFS strikes a better balance between computational cost and inference accuracy, compared to dynamic layer skipping and other relevant competitors. Moreover, we visualize the skipping behaviors of DFS when varying the total inference computation in a controlled way, and observe a smooth transition from selecting, to quantizing, to bypassing layers. This observation empirically supports our conjecture that layer quantization can be viewed as a soft, intermediate variant of layer skipping.
2 Related Works
Model Compression. Model compression has been widely studied to speed up CNN inference by reducing model size [26]. Existing works focus on pruning unimportant model weights or quantizing the model to low bitwidths.
Pruning: There have been extensive studies on model pruning at different granularities. [6, 7] reduce redundant parameters by performing element-wise weight pruning. Coarser-grained channel-level pruning has been explored in [30, 16, 9] by enforcing group sparsity. [25] exploits parameter redundancy in a multi-grained manner by grouping weights into structured groups during pruning, each with a Lasso regularization. [28] proposes hybrid pruning by performing element-wise pruning on top of a filter-wise pruned model. [12] performs multi-grained model pruning by adding explicit objectives for different levels. [3] presents a comprehensive review of pruning techniques. These methods are applied to well-trained networks and do not dynamically adjust the model complexity conditioned on the input.
Network Quantization: Quantizing network weights and activations has been proven to be an effective approach to reduce memory and computational budgets. Most existing works quantize the model to various bitwidths with a marginal accuracy loss. [17] binarized each convolution filter into {−w, +w}. [31] used one bit for network weights and two bits for activations. [11] made use of 8-bit integers for both weights and activations. With recent developments in hardware design, it has become possible to use flexible bitwidths for different layers [22]. [6] determines the layer-wise bit allocation policy based on domain experts; [22] further enhanced the idea by automating the decision process with a reinforcement learning method. These works either empirically find fixed bitwidths or automatically learn a fixed layer-wise bit allocation regardless of the input, ignoring that the importance of each layer may vary with different inputs. Our proposed DFS models are orthogonal to existing static quantization methods.
Dynamic Inference. While model compression presents “static” solutions for improving inference efficiency, i.e., the compressed models cannot adaptively adjust their complexity at inference, the recently developed dynamic inference methods offer a different option: executing partial inference, conditioned on input complexity or resource constraints.
Dynamic Layer Skipping: Many works [27, 23, 24] have formulated dynamic inference as a sequential decision problem and selectively bypass subsets of layers in the network conditioned on the input. The common approach of these works is to use gating networks to determine the layer-wise skipping policy in ResNet-style models [8], which are inherently suitable for skipping designs due to their residual structure. SkipNet uses a hybrid learning algorithm that sequentially performs supervised pretraining and reinforcement fine-tuning, achieving a better resource-saving and accuracy trade-off than existing static model compression methods [23]. BlockDrop [27] uses a decision network to train a global skipping policy for residual blocks, and [15] trains separate control units for the execution policy of sub-parts of the network. In these works, a layer will either be executed fully or skipped completely, leaving no space for any intermediate options. We show that by adding “softer” intermediate quantization options between the two extremes, the DFS framework exploits each layer’s expressive power at a finer granularity, achieving better accuracy than layer skipping methods under the same computational cost.
Dynamic Channel Selection/Pruning: Since layer skipping only works well in network architectures with residual connections, channel pruning methods have been developed to exploit the redundancy in CNNs at a finer level. [14] formulates the channel pruning problem as a Markov decision process and applies an RNN gating network to determine which channels to prune conditioned on the input. GaterNet [2] uses a separate network to calculate the channel activation strategy. The slimmable neural network [29] trains the network with varied layer widths and adjusts the channel number during inference to meet resource budgets. [21] selectively executes branches of the network based on the input. Compared to quantization, the idea of channel selection exploits fine-grained model redundancy at the channel level, which is orthogonal to our method and can potentially be combined with our framework to yield further resource savings.
Early Exiting: In many real-world applications there are strict resource constraints: the networks should hence allow for “anytime prediction” and be able to halt inference whenever a specified budget is met. A few prior works equip CNNs with “early exit” functions. [20] adds additional branch classifiers to the backbone CNNs, forcing a large portion of inputs to exit at the branches in order to meet resource demands. [10] further boosts the performance of early exiting by aggregating features from different scales for early prediction. The early exiting works have been developed for resource-dependent inference, which is orthogonal to our input-dependent inference, and the two can be combined to yield further resource savings.
3 The Proposed Framework
In resource-constrained platforms, ideal efficient CNN inference should save as much resource as possible without non-negligible accuracy degradation. This requires the algorithm to maximally take advantage of the model’s expressive power while dropping any redundant parts. Existing works like SkipNet exploit model redundancy at the layer level, but the binary decision of either executing a layer fully or skipping it completely makes it impossible to use a layer’s representational power at any finer level. In contrast, CNN quantization exploits model redundancy at the finest level, by reducing the bitwidth of the numerical representation of weights and activations. A natural thought is thus to use bitwidth options to fill the gap between the binary options of layer skipping, striking an optimal trade-off between computational cost and accuracy.
We hereby propose a Dynamic Fractional Skipping (DFS) framework that combines the following two schemes into one continuous, fine-grained decision spectrum:

Input-Dependent Layer Skipping. On the coarse-grained level, the “executed” option of the layer-skip decision is equivalent to the full-bitwidth option of layer quantization in the DFS framework, and the “skip” option is equivalent to a zero-bit option of layer quantization.

Input-Dependent Network Quantization. On the fine-grained level, any lower-than-full-bitwidth execution can be viewed as “fractionally” executing a layer, enabling the model to take advantage of the expressive power of the layer in its low-bitwidth version.
To the best of our knowledge, DFS is the first attempt to unify the binary layer skipping design and one alternative of its intermediate “soft” variants, i.e., quantization, into one dynamic inference framework. Together they achieve optimal trade-offs between accuracy and computation by skipping layers where possible or executing varied “fractions” of the layers. Meanwhile, state-of-the-art CNN hardware designs have shown that such DFS schemes are hardware friendly. For example, [18] proposed a bit-flexible CNN accelerator that constitutes an array of bit-level processing elements to dynamically match the bitwidth of each individual layer. With such dedicated accelerators, the proposed DFS’s energy savings would be maximized.
DFS Overview. We here introduce how the DFS framework is implemented in ResNet-style models, which have been the most popular backbone CNNs for dynamic inference [23, 27]. Figure 1 illustrates the operation of our DFS framework. Specifically, for the l-th layer, we let x_l ∈ R^{C×W×H} denote its output feature maps and therefore x_{l-1} its input ones, where C denotes the total number of channels and W×H denotes the feature map size. Also, we employ F^b_l to denote the convolutional operation of the l-th layer executed in b bits (e.g., b = 32 corresponds to the full bitwidth), and design a gating network G_l for determining the fractional skipping of the l-th layer. Suppose there are a total of N decision options, including the two binary skipping options (i.e., SkipNet) and a set of varied bitwidth options for quantization; then G_l outputs a gating probability vector of length N. The operation of a convolutional layer under the DFS framework can then be formulated as:

x_l = g^0_l · x_{l-1} + Σ_{k=1}^{N-1} g^k_l · F^{b_k}_l(x_{l-1}),    (1)

where g_l = G_l(x_{l-1}) is the gating probability vector of the l-th layer, g^k_l denotes the value of its k-th entry, and b_k represents the bitwidth option corresponding to the k-th entry. For k = 0, we let g^0_l represent the probability of a complete skip.
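To make the inference-time behavior of Eq. (1) concrete, below is a minimal Python sketch of the hard (argmax) decision for a single layer. `quantized_layer` is a hypothetical toy stand-in for the quantized convolution (it only scales its input so the example stays self-contained), and the bitwidth menu (8, 16, 32) is an assumption mirroring the options used in the experiments.

```python
def quantized_layer(x, bitwidth):
    """Hypothetical stand-in for F_l executed at `bitwidth` bits.
    A real implementation would run a quantized convolution; here we
    only scale the input so the sketch stays self-contained."""
    return [v * bitwidth / 32.0 for v in x]

def dfs_forward(x_prev, gate_probs, bitwidths=(8, 16, 32)):
    """Hard (inference-time) version of Eq. (1) for one layer.

    gate_probs: length-N gating vector g_l from the gating network G_l;
    entry 0 is the complete-skip option, entry k > 0 selects
    bitwidths[k - 1]. Only the argmax entry is executed.
    """
    k = max(range(len(gate_probs)), key=lambda i: gate_probs[i])
    if k == 0:
        return x_prev  # complete skip: pass the input through the shortcut
    return quantized_layer(x_prev, bitwidths[k - 1])  # "fractional" execution
```

For example, a gating vector of [0.1, 0.6, 0.2, 0.1] selects the 8-bit option, while [0.9, 0.05, 0.03, 0.02] skips the layer entirely and returns the shortcut input.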
Gating Design of DFS. In the DFS framework, the execution decision for a layer is calculated based on the output of the previous layer. Therefore, the gating network should be able to capture the relevance between consecutive layers in order to make informed decisions. As discussed in [23], recurrent neural networks (RNNs) have the advantage of being both lightweight (due to their parameter-sharing design, accounting for only 0.04% of the computational cost of a residual block) and able to learn sequential tasks (due to their recurrent structure). We thus adopt this convention and implement the gating function G_l as a Long Short-Term Memory (LSTM) network, as depicted in Figure 2. Specifically, suppose there are N options, including the binary skipping options and the intermediate bitwidth options; then the LSTM output is projected into a skipping probability vector of length N via a softmax function. During inference, the largest element of the vector is quantized to 1 and selected for execution; during training, the skipping probability is used for backpropagation, which is introduced in more detail in the subsection on DFS training.
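A minimal sketch of the gate's decision head, assuming the LSTM hidden state has already been computed: a linear projection (with hypothetical illustrative weights, not the trained parameters) followed by a softmax, plus the `harden` step that quantizes the largest entry to 1 at inference.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def gate_head(hidden, weight, bias):
    """Project an LSTM hidden state onto the N skipping options.
    `weight` (N rows of length len(hidden)) and `bias` (length N) are
    hypothetical parameters for illustration only."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)

def harden(probs):
    """Inference-time decision: one-hot vector at the argmax entry."""
    k = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == k else 0.0 for i in range(len(probs))]
```

During training, the soft `probs` vector is kept for backpropagation; during inference, `harden(probs)` picks exactly one execution option per layer.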
Training of DFS. Objective Function: The learning objective of DFS is to boost the prediction accuracy while minimizing the computational cost, and is defined as follows:

min_{W, θ} L_acc(W, θ) + α · L_res(W, θ),    (2)

where L_acc represents the accuracy loss, L_res is the resource-aware loss, α is a weighting factor that trades off the importance of accuracy and resource budget, and W and θ denote the parameters of the backbone model and the gating network, respectively. The resource-aware loss is calculated as the total computational cost associated with the executed layers (measured in FLOPs in this paper).
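As a sketch of Eq. (2), assuming the resource-aware loss is the bit-scaled FLOP count of the executed layers, with a b-bit layer costing (b/32)^2 of its full-precision FLOPs (the cost model implied by the computation-percentage figures in Section 4; the function names are ours):

```python
def resource_loss(bit_decisions, layer_flops, full_bits=32):
    """L_res: total cost of executed layers. A bitwidth of 0 means the
    layer was skipped. Assumes cost scales as (bits / full_bits)^2."""
    return sum(flops * (bits / full_bits) ** 2
               for bits, flops in zip(bit_decisions, layer_flops))

def dfs_objective(acc_loss, bit_decisions, layer_flops, alpha):
    """Eq. (2): accuracy loss plus the alpha-weighted resource loss."""
    return acc_loss + alpha * resource_loss(bit_decisions, layer_flops)
```

For three equal-cost layers decided as (skip, 8-bit, full), the resource loss is 0 + 6.25% + 100% of one layer's FLOPs, so the gate is rewarded for choosing cheaper options whenever α is positive.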
Skipping Policy Learning with Softmax Approximation: During inference, the execution decision is made automatically based on the skipping probability vector of each gate: the layer is either skipped or executed at one of the chosen bitwidths. This discrete and therefore non-differentiable decision process makes it difficult to train DFS with stochastic gradient descent methods. One alternative is to use a softmax approximation for backpropagation and quantize to discrete decisions for inference [5, 23]. In this paper, we adopt the same technique for training the gating network.
Two-Step Training Procedure: Given a pretrained CNN model, our goal is to jointly train it and its gating networks for a targeted computational budget. However, direct training with randomly initialized gating networks results in much lower accuracy than that of the pretrained model. One possible reason is that with random gate initialization, the model may start at a point where a majority of its layers are skipped or executed at low bitwidths, causing a large deviation in the feature maps’ statistics that cannot be captured by the batch normalization parameters of the originally trained model. To tackle this problem, we conjecture that if training starts with the fully executed model and then gradually reduces the computational cost towards the target budget, the batch normalization parameters will adapt to the new feature statistics. Motivated by this idea, we use a two-step training procedure:
1) Fix the parameters of the backbone model and train the gating networks to reach the state of executing all layers at full bitwidth.
2) With the initialization obtained from the first step, jointly train the backbone model and the gating networks to achieve the targeted computational budget.
The computational cost is controlled via the computation percentage (cp), defined as the ratio between the FLOPs of the executed layers and the FLOPs of the full-bitwidth model. During training, we dynamically change the sign of α in Equation (2) to stabilize the cp of the model: for each iteration, if the cp of the current batch of samples is above the targeted cp, we set α to be positive, enforcing the model to reduce its cp by suppressing the resource loss in Equation (2); if the cp is below the targeted cp, we set α to be negative, encouraging the model to increase its cp by reinforcing the resource loss. In the end, the cp of the model stabilizes around the targeted cp. The absolute value of α is the step size for adjusting the cp; since we empirically found that the performance of our model is robust to a wide range of step sizes, we fix the absolute value of α. More detailed experiments on the choice of α are presented in Section 4.
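The per-batch sign rule can be sketched as follows. This is a simplification of the training loop, not the actual implementation: `alpha_abs` plays the role of the fixed step size, and the toy feedback loop only illustrates why flipping the sign makes cp oscillate toward the target.

```python
def alpha_for_batch(current_cp, target_cp, alpha_abs):
    """Sign rule for alpha in Eq. (2): positive when the current batch's
    computation percentage is above the target (penalize resource usage,
    pushing cp down), negative when below (let cp grow back)."""
    return alpha_abs if current_cp > target_cp else -alpha_abs

# Toy feedback illustration: assume cp drifts opposite to the sign of alpha
# by a fixed amount per iteration (a stand-in for real training dynamics).
cp, trace = 0.80, []
for _ in range(50):
    alpha = alpha_for_batch(cp, 0.40, 0.05)
    cp -= alpha  # over budget -> cp decreases; under budget -> cp increases
    trace.append(cp)
```

In this toy run, cp descends from 80% and then oscillates within one step size of the 40% target, mirroring the stabilization behavior described above.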
4 Experimental Results
Experiment Setup. Models and Datasets: We evaluate the DFS method using ResNet-38 and ResNet-74 as the backbone models on two datasets: CIFAR-10 and CIFAR-100. In particular, the structure of the ResNet models follows the design in [8]. For layer quantization, we consider 4 dynamic execution choices: skip, 8 bits, 16 bits, and keep (full bitwidth), and follow the suggestion in [23] to keep the first residual block always executed at full bitwidth. Metrics: We compare DFS with relevant state-of-the-art techniques in terms of the trade-off between prediction accuracy and computation percentage. Note that for a layer executed with a bitwidth of 8 bits, its corresponding computation percentage is 6.25% of the full-bitwidth layer.
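The computation-percentage figures in this section follow from a bit-serial cost model in which a b-bit layer costs (b/32)^2 of its full-precision FLOPs; this is our stated assumption for reproducibility, consistent with the 6.25% all-8-bit operating point reported in the visualization experiments. A quick sketch:

```python
def layer_cp(bitwidth, full_bits=32):
    """Computation percentage of a single layer executed at `bitwidth` bits,
    assuming cost scales with the product of weight and activation
    bitwidths, i.e., (bitwidth / full_bits) ** 2."""
    return (bitwidth / full_bits) ** 2

def model_cp(bit_decisions, layer_flops):
    """Overall cp: bit-scaled FLOPs of executed layers over full-model FLOPs.
    A decision of 0 bits means the layer is skipped."""
    used = sum(f * layer_cp(b) for b, f in zip(bit_decisions, layer_flops))
    return used / sum(layer_flops)
```

Under this model, an 8-bit layer costs (8/32)^2 = 6.25% of its full-bitwidth counterpart, and skipping half of an (equal-cost) model while keeping the rest at full bitwidth yields cp = 50%.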
Training Details: The training of DFS follows the two-step procedure described in Section 3. For the first step, we set the initial learning rate to 0.1 and train the gating network for a total of 64000 iterations; the learning rate is reduced after the 32000th iteration and further reduced after the 48000th iteration. The specified computation budget is set to 100%. The hyperparameters, including the momentum, weight decay factor, and batch size, are set to 0.9, 1e-4, and 128, respectively, and the absolute value of α in Equation (2) is set to 5e-6. After the first step is finished, we use the resulting LSTM gating network as the initialization for the second step, where we jointly train the backbone model and the gating network to reach the specified computation budget. Here we use an initial learning rate of 0.01 with the pre-specified target cp; all other settings are the same as in the first step.
DFS Performance Evaluation. We evaluate the proposed DFS against the competitive dynamic inference technique SkipNet [23] and two state-of-the-art static CNN quantization techniques proposed in [1] and [22].
Comparison with SkipNet (Dynamic Inference Method): In this subsection, we compare the performance of DFS with that of SkipNet. Specifically, we compare DFS on ResNet-38 and ResNet-74 with SkipNet-38 and SkipNet-74, respectively, on both the CIFAR-10 and CIFAR-100 datasets. We denote by DFS-ResNet-xx the model with DFS applied on top of a ResNet-xx backbone.
Experimental results on CIFAR-10 are shown in Figure 3 (vs. ResNet-38) and Figure 4 (vs. ResNet-74). Specifically, Figures 3-4 show that (1) given the same computation budget, DFS-ResNet-38/74 consistently achieves a higher prediction accuracy than SkipNet-38/74 over a wide range of computation percentages (20%-80%), with the largest margin being about 4% (93.61% vs. 89.26%) at a computation percentage of 20%; (2) given the same or even a higher accuracy, DFS-ResNet-38/74 achieves more than 60% computational savings compared to SkipNet-38/74; (3) interestingly, DFS-ResNet-38/74 even achieves better accuracies than the original full-bitwidth ResNet-38/74. We conjecture that this is because DFS can help relieve model overfitting thanks to its finer-grained dynamic feature.
Figure 5 (vs. ResNet-38) and its counterpart for ResNet-74 show the results on CIFAR-100. We can see that the accuracy improvement (or computational savings) achieved by DFS-ResNet-38/74 over SkipNet-38/74 is even more pronounced given the same computation percentage (or the same/higher accuracy). For example, as shown in Figure 5, DFS-ResNet-38 achieves 8% (68.91% vs. 60.38%) better prediction accuracy than SkipNet-38 at a computation percentage of 20%; and DFS-ResNet-74 outperforms SkipNet-74 by 6% (70.94% vs. 65.09%) accuracy at a computation percentage of 20%.
The four sets of experimental results above show that (1) CNNs with DFS outperform the corresponding SkipNets even at a high computation percentage of 80% (i.e., small computational savings of 20% over the original ResNet backbones); (2) as the computation percentage decreases from 80% to 20% (corresponding to computational savings from 20% to 80%), the prediction accuracy of CNNs with DFS stays relatively stable (fluctuating slightly within a range of 0.5% while remaining consistently higher than that of SkipNet under the same computation percentage), whereas the accuracy of SkipNet decreases drastically. These observations validate our conjecture that DFS’s finer-grained dynamic execution options can better calibrate the inference accuracy of CNNs w.r.t. the complexity consumed.
Comparison with Statically Quantized CNNs: In this section, we compare DFS with two state-of-the-art static CNN quantization methods, the scalable network [1] and HAQ [22], using ResNet-38 as the backbone on CIFAR-10. Specifically, for the scalable network [1], we train it under a set of bitwidths (8-bit, 10-bit, 12-bit, 14-bit, 16-bit, 18-bit, 20-bit, 22-bit). According to HAQ’s official implementation (https://github.com/mit-han-lab/haq-release), only the weights are quantized; we therefore control the HAQ-quantized models’ computation percentage via the compression ratio (the ratio between the size of the quantized weights and the full-bitwidth weights). The bitwidth allocation of HAQ is shown in the supplementary material; it can be seen that HAQ learns fine-grained quantization options, with the smallest difference between two options being 1 bit. Note that (1) DFS is orthogonal to static CNN quantization methods, and thus can be applied on top of quantized models to further reduce CNNs’ computational cost; and (2) this set of experiments is not meant to show that DFS is better than static CNN quantization methods. Instead, the motivation is that it can be insightful to observe the behaviors of static and dynamic quantization methods under the same settings.
Figure 6 shows the results. It can be seen that DFS-ResNet-38 achieves similar or slightly better accuracy (up to 1.2% over the scalable method and 0.2% over HAQ) than both the scalable and HAQ methods, even with much coarser-grained quantization options (keep, skip, 8 bits, and 16 bits). Furthermore, among the three methods, the prediction accuracy of the scalable method fluctuates the most as the computation percentage changes, showing that CNNs with layer-wise adaptive bitwidths can achieve better trade-offs between accuracy and computational cost.
Choice of Parameter α: We conduct two sets of experiments to demonstrate how dynamically changing the sign of α, as described in Section 3, is necessary for reaching the targeted cp, and how the absolute value of α affects the model’s performance. Table 1 compares two training scenarios of DFS-ResNet-74 on CIFAR-10: DFS-ResNet-74-D denotes the case where the dynamic sign changing of α is applied, and DFS-ResNet-74-C denotes the case where α is a positive constant; in both cases we set the absolute value of α to 1e-5. It can be seen that when α is constant, the resulting actual cp significantly deviates from the targeted cp, since a positive α keeps enforcing the model to reduce cp without constraint, while the dynamic case achieves the desired cp. Table 2 shows how the performance of the DFS-ResNet-74 model varies with the absolute value of α on CIFAR-10. During training, the dynamic sign changing of α is applied. It can be seen that there is an obvious accuracy drop (about 0.2%) under both targeted cp values when the absolute value of α increases to (1e-4, 1e-3), where the actual cp deviates from the target by around 3%. This is because a larger step size causes the cp of the model to fluctuate, and the resulting unstable training degrades accuracy, while the stable performance in the range (1e-6, 1e-5) shows that the model is robust to smaller step sizes under different targeted cp.
Table 1: Effect of dynamically changing the sign of α (DFS-ResNet-74 on CIFAR-10).

Model            | Target cp | Actual cp | Accuracy
DFS-ResNet-74-D  | 40%       | 40.20%    | 93.53%
DFS-ResNet-74-C  | 40%       | 8.20%     | 93.12%
Base-ResNet-74   | N/A       | N/A       | 93.55%
Table 2: Performance of DFS-ResNet-74 on CIFAR-10 under different absolute values of α.

         | Target cp = 40%      | Target cp = 50%
abs of α | Actual cp | Accuracy | Actual cp | Accuracy
1e-6     | 40.10%    | 93.53%   | 50.08%    | 93.72%
1e-5     | 40.93%    | 93.54%   | 50.20%    | 93.74%
1e-4     | 40.80%    | 93.31%   | 51.01%    | 93.52%
1e-3     | 43.75%    | 93.27%   | 53.40%    | 93.42%
Decision Behavior Analysis and Visualization
We then visualize and study the learned layer-wise decision behaviors of DFS and how they evolve as cp increases. We demonstrate that quantization options are indeed natural candidates for intermediate “fractional” skipping choices. Specifically, we investigate how these decisions gradually change to layer quantization at different bitwidths. In general, the (full) layer skip options are likely to be taken only when a very low cp is enforced. When the computational saving requirement is mild, the model shows a strong tendency to “fractionally” skip all its layers.
Figures 7-9 show the layer-wise “decision distributions” (e.g., the skip option taken per layer) of DFS-ResNet-74 trained on CIFAR-10, as the computation percentage increases from 4% to 6.25%. In this specific case and (quite low) percentage range, the model is observed to choose only between “skip” and “8-bit” for a vast majority of inputs; therefore, we only plot the “skip” and “8-bit” columns for a compact display. We can observe a smooth transition of decision behaviors as the computation percentage varies: from a mixture of layer skipping and quantization, gradually to all-layer quantization. Specifically, from Figure 7 to Figure 8, within the first residual group, the percentage of skipping options for blocks 2, 4, and 9 remains roughly unchanged, while we observe an obvious drop in skipping percentages at block 5 (from 55% to 0%) and block 8 (from 100% to 10%). Similarly, for the second and third residual groups, the skipping percentage of most residual blocks gradually reduces to 0%, with that of the remaining blocks (20, 22, 23, 24) staying roughly unchanged. From Figure 8 to Figure 9, the decisions of all the layers shift to 8-bit. This smooth transition empirically endorses our hypothesis from Section 3 that the layer quantization options can serve as a “fractional” intermediate stage between the binary layer skipping options.
As cp increases, DFS apparently favors the finer-grained layer quantization options over the coarser-grained layer skipping. Figure 10 shows the accuracy of DFS-ResNet-74 as the computation percentage increases from 4% to 30%. From 4% to 6.25% (when the layer skipping options gradually change to all-quantization options), there is a notable accuracy increase from 92.91% to 93.54%. The performance then reaches a plateau after the computation percentage of 6.25%, where we observe that DFS now tends to choose quantization for all layers (see supplementary material). This phenomenon demonstrates that, as “fractional” layer skipping candidates, low-bitwidth options can better restore the model’s accuracy over a wider range of cp.
Additionally, from Figures 7-9, we observe that the first residual block within each residual group is learned not to be skipped, regardless of the value of cp. This aligns with the findings in [5], which shows that for ResNet-style models, only the first residual block within each group extracts a completely new representation (and is therefore the most important), while the remaining residual blocks within the same group only refine this feature iteratively.
5 Conclusion
We proposed a novel DFS framework, which extends the binary layer skipping options with a “fractional skipping” ability, by quantizing layer weights and activations into different bitwidths. The DFS framework exploits model redundancy at a much finer-grained level, leading to a more flexible and effective calibration between inference accuracy and complexity. We evaluated DFS on the CIFAR-10 and CIFAR-100 benchmarks, where it was shown to compare favorably against both a state-of-the-art dynamic inference method and static quantization techniques.
While we demonstrated that quantization can indeed be viewed as a “fractional” intermediate state between the binary layer skip options (by both the achieved results and the visualizations of skipping decision transitions), we recognize that more alternatives to “fractionally” execute a layer could be explored, such as channel slimming [29]. We leave this as future work.
References
[1] (2018) Scalable methods for 8-bit training of neural networks. In NIPS.
[2] (2018) GaterNet: dynamic filter selection in convolutional neural networks via a dedicated global gating network. arXiv preprint arXiv:1811.11205.
[3] (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
[4] (2017) Spatially adaptive computation time for residual networks. In CVPR.
[5] (2016) Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771.
[6] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[7] (2015) Learning both weights and connections for efficient neural networks. In NIPS.
[8] (2016) Deep residual learning for image recognition. In CVPR.
[9] (2017) Channel pruning for accelerating very deep neural networks. In ICCV.
[10] (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844.
[11] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
[12] (2018) NestedNet: learning nested sparse structures in deep neural networks. In CVPR.
[13] (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
[14] (2017) Runtime neural pruning. In NIPS.
[15] (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI.
[16] (2017) Learning efficient convolutional networks through network slimming. In ICCV.
[17] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV.
[18] (2018) Bit Fusion: bit-level dynamically composable architecture for accelerating deep neural networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[19] (2014) DeepFace: closing the gap to human-level performance in face verification. In CVPR.
[20] (2016) BranchyNet: fast inference via early exiting from deep neural networks. In 2016 23rd ICPR.
[21] (2018) HydraNets: specialized dynamic architectures for efficient inference. In CVPR.
[22] (2019) HAQ: hardware-aware automated quantization with mixed precision. In CVPR.
[23] (2018) SkipNet: learning dynamic routing in convolutional networks. In ECCV.
[24] (2018) EnergyNet: energy-efficient dynamic inference. In Thirty-second Conference on Neural Information Processing Systems (NIPS 2018) Workshop.
[25] (2016) Learning structured sparsity in deep neural networks. In NIPS.
[26] (2018) Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In International Conference on Machine Learning, pp. 5359–5368.
[27] (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR.
[28] (2018) Hybrid pruning: thinner sparse networks for fast inference on edge devices. arXiv preprint arXiv:1811.00482.
[29] (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928.
[30] (2017) Scalpel: customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News.
[31] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
[32] (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064.