|latency||12.2 ms||12.4 ms||16.6 ms||7.9 ms||7.2 ms|
As deep neural networks grow deeper and wider to achieve higher performance, there is an urgent need for efficient models on common mobile platforms, such as self-driving cars, smartphones, drones and robots. In recent years, many approaches have been proposed to improve the inference efficiency of neural networks, including network pruning [Li2016PruningFF, liu2017NetworkSlimming, SoftFilterPruning, He2017ChannelPF, Liu2019MetaPruningML, luo2017thinet], weight quantization [jacob2018quantization], knowledge distillation [Ba2013DoDN, Romero2014FitNetsHF, Hinton2015DistillingTK], manual and automatic design of efficient networks [Tan2019EfficientNetRM, Sandler2018MobileNetV2IR, ZhangLPCZGS21, RenXCHLCW2020, bender2018understanding, ZhangLPCGS20, ChengZHDCLDG20, Guo2019SinglePO, li2019blockwisely, ZhangLPCS20] and dynamic inference [Bolukbasi2017AdaptiveNN, Huang2018MultiScaleDN, wang2018skipnet, veit2018AIG, Li2019ImprovedTF, hua2019channel, gao2018dynamic].
Among the above approaches, dynamic inference methods, including networks with dynamic depth [Bolukbasi2017AdaptiveNN, Huang2018MultiScaleDN, wang2018skipnet, veit2018AIG, Li2019ImprovedTF] and dynamic width [gao2018dynamic, hua2019channel, Chen2019YouLT], have attracted increasing attention because of their promising capability to reduce computational redundancy by automatically adjusting their architecture for different inputs. As illustrated in Fig. 2, the dynamic network learns to configure a different architecture routing adaptively for each input, instead of optimizing a single architecture over the whole dataset as Neural Architecture Search (NAS) or pruning does. Fig. 2 also shows a performance-complexity trade-off simulated with exponential functions, in which the optimal solution of dynamic networks is superior to the static NAS or pruning solution. Ideally, dynamic network routing can significantly improve model performance under a given complexity constraint.
However, networks with dynamic width, i.e., dynamic pruning methods [gao2018dynamic, hua2019channel, Chen2019YouLT], unlike their orthogonal counterparts with dynamic depth, have never achieved actual acceleration in a real-world implementation. As natural extensions of network pruning, dynamic pruning methods predictively prune the convolution filters with regard to different inputs at runtime. The resulting varying sparsity patterns are incompatible with efficient computation on hardware. In practice, many of these methods are implemented as zero masking or inefficient path indexing, resulting in a massive gap between the theoretical analysis and the practical acceleration. As shown in Tab. 1, both masking and indexing waste computation.
To address the aforementioned issues in dynamic networks, we propose the Dynamic Slimmable Network (DS-Net), which achieves good hardware efficiency by dynamically adjusting the filter numbers of networks at test time with respect to different inputs. To avoid the extra burden on hardware caused by dynamic sparsity, we adopt a scheme named dynamic slicing to keep filters static and contiguous when adjusting the network width. Specifically, we propose a double-headed dynamic gate with an attention head and a slimming head upon slimmable networks to predictively adjust the network width with negligible extra computation cost. The training of dynamic networks is a highly entangled bilevel optimization problem. To ensure the generality of each candidate architecture and the fairness of the gate, a disentangled two-stage training scheme inspired by one-shot NAS is proposed to optimize the supernet and the gates separately. In the first stage, the slimmable supernet is optimized with a novel training method for weight-sharing networks, named In-place Ensemble Bootstrapping (IEB). IEB trains the smaller sub-networks in the online network to fit the output logits of an ensemble of larger sub-networks in the momentum target network. Learning from the ensemble of different sub-networks reduces the conflict among sub-networks and increases their generality. Using the exponential moving average of the online network as the momentum target network provides a stable and accurate historical representation, and bootstraps both the online network and the target network itself to achieve higher overall performance. In the second stage, to prevent dynamic gates from collapsing into static ones in the multi-objective optimization problem, a technique named Sandwich Gate Sparsification (SGS) is proposed to assist the gate training. During training, SGS identifies easy and hard samples online and further generates the ground truth label for the dynamic gates.
Overall, our contributions are three-fold as follows:
We propose a new dynamic network routing regime, achieving good hardware-efficiency by predictively adjusting filter numbers of networks at test time with respect to different inputs. Unlike dynamic pruning methods, we dynamically slice the network parameters while keeping them stored statically and contiguously in hardware to prevent the extra burden of masking, indexing, and weight-copying. The dynamic routing is achieved by our proposed double-headed dynamic gate with negligible extra computation cost.
We propose a two-stage training scheme with IEB and SGS techniques for DS-Net. We show experimentally that IEB stabilizes the training of slimmable networks and boosts their accuracy by 1.8% and 0.6% in the slimmest and widest sub-networks, respectively. Moreover, we empirically show that the SGS technique effectively sparsifies the dynamic gate and improves the final performance of DS-Net by 2%.
Extensive experiments demonstrate that our DS-Net outperforms its static counterparts [yu2019autoslim, Yu2019UniversallySN] as well as state-of-the-art static and dynamic model compression methods by a large margin (up to 5.9%, Fig. 1). Typically, DS-Net achieves 2-4× computation reduction and 1.62× real-world acceleration over ResNet-50 and MobileNet with minimal accuracy drops on ImageNet. Gate visualization confirms the high dynamic diversity of DS-Net.
2 Related works
Anytime neural networks [larsson2016fractalnet, Huang2018MultiScaleDN, hu2019learning, Li2019ImprovedTF, Lee2018AnytimeNP, Yu2019SlimmableNN, Yu2019UniversallySN, hou2020dynabert] are single networks that can execute with their sub-networks under different budget constraints, and thus can be deployed instantly and adaptively in different application scenarios. Anytime neural networks have been studied in two orthogonal directions: networks with variable depth and variable width. Networks with variable depth [larsson2016fractalnet, Huang2018MultiScaleDN, hu2019learning, Li2019ImprovedTF] were the first to be studied widely, benefiting from the naturally nested structure in the depth dimension and the residual connections in ResNet [he2016deep] and DenseNet [huang2017densely]. Networks with variable width were first studied in [Lee2018AnytimeNP]. Recently, slimmable networks [Yu2019SlimmableNN, Yu2019UniversallySN] using switchable batch normalization and in-place distillation achieve higher performance than their stand-alone counterparts at any width. Some recent works [cai2019once, Yu2020BigNASSU, hou2020dynabert] also explore anytime neural networks in multiple dimensions, e.g. depth, width, kernel size, etc.
Dynamic neural networks [veit2018AIG, wang2018skipnet, li2020DynamicRouting, Yang2019CondConvCP] change their architectures based on the input data. Dynamic networks for efficient inference aim to reduce average inference cost by using different sub-networks adaptively for inputs with diverse difficulty levels. Networks with dynamic depth [Bolukbasi2017AdaptiveNN, Huang2018MultiScaleDN, wang2018skipnet, veit2018AIG, Li2019ImprovedTF] achieve efficient inference in two ways: early exiting when shallower sub-networks have high classification confidence [Bolukbasi2017AdaptiveNN, Huang2018MultiScaleDN, Li2019ImprovedTF], or skipping residual blocks adaptively [wang2018skipnet, veit2018AIG]. Recently, dynamic pruning methods [hua2019channel, gao2018dynamic, Chen2019YouLT] using a variable subset of convolution filters have been studied. Channel Gating Neural Network [hua2019channel] and FBS [gao2018dynamic] identify and skip the unimportant input channels at run-time. In GaterNet [Chen2019YouLT], a separate gater network is used to predictively select the filters of the main network. Please refer to [han2021dynamic] for a more comprehensive review of dynamic neural networks.
Weight sharing NAS [brock2017smash, akimoto2019adaptive, bender2018understanding, Liu2018DARTSDA, Cai2018ProxylessNASDN, Wu2018FBNetHE, Guo2019SinglePO, li2019blockwisely, cai2019once, li2021bossnas], aiming at designing neural network architectures automatically and efficiently, has developed rapidly in the past two years. These methods integrate the whole search space of NAS into a weight sharing supernet and optimize the network architecture by pursuing the best-performing sub-networks. They can be roughly divided into two categories: jointly optimized methods [Liu2018DARTSDA, Cai2018ProxylessNASDN, Wu2018FBNetHE], in which the weights of the supernet are jointly trained with the architecture routing agent (typically a simple learnable factor for each candidate route); and one-shot methods [brock2017smash, akimoto2019adaptive, bender2018understanding, Guo2019SinglePO, li2019blockwisely, cai2019once, li2021bossnas], in which the training of the supernet parameters and the architecture routing agent are disentangled: after fair and sufficient training of the supernet, the agent is optimized with the supernet weights frozen.
3 Dynamic Slimmable Network
Our dynamic slimmable network achieves dynamic routing for different samples by learning a slimmable supernet and a dynamic gating mechanism. As illustrated in Fig. 3, the supernet in DS-Net refers to the whole module undertaking the main task. In contrast, the dynamic gates are a series of predictive modules that route the input to use sub-networks with different widths in each stage of the supernet.
In previous dynamic networks [veit2018AIG, wang2018skipnet, li2020DynamicRouting, Yang2019CondConvCP, Bolukbasi2017AdaptiveNN, Huang2018MultiScaleDN, Li2019ImprovedTF, hua2019channel, gao2018dynamic, Chen2019YouLT], the dynamic routing agent and the main network are jointly trained, analogous to jointly optimized NAS methods [Liu2018DARTSDA, Cai2018ProxylessNASDN, Wu2018FBNetHE]. Inspired by one-shot NAS methods [brock2017smash, akimoto2019adaptive, bender2018understanding, Guo2019SinglePO, li2019blockwisely], we propose a disentangled two-stage training scheme to ensure the generality of every path in our DS-Net. In Stage I, we disable the slimming gate and train the supernet with the IEB technique, then in Stage II, we fix the weights of the supernet and train the slimming gate with the SGS technique.
3.1 Dynamic Supernet
In this section, we first introduce the hardware efficient channel slicing scheme and our designed supernet, then present the IEB technique and details of training Stage I.
Supernet and Dynamic Channel Slicing. In some dynamic networks, such as dynamic pruning [hua2019channel, gao2018dynamic] and conditional convolution [Yang2019CondConvCP, li2021revisiting], the convolution filters $\mathcal{W}$ are conditionally parameterized by a function $\mathcal{A}(\theta, \mathcal{X})$ of the input $\mathcal{X}$. Generally, the dynamic convolution has the form:

$$\mathcal{Y} = \mathcal{W}_{\mathcal{A}(\theta, \mathcal{X})} * \mathcal{X}, \tag{1}$$

where $\mathcal{W}_{\mathcal{A}(\theta, \mathcal{X})}$ represents the selected or generated input-dependent convolution filters. Here $*$ is used to denote a matrix multiplication. Previous dynamic pruning methods [hua2019channel, gao2018dynamic] reduce theoretical computation cost by varying the channel sparsity pattern according to the input. However, they fail to achieve real-world acceleration because their hardware-incompatible channel sparsity results in repeatedly indexing and copying selected filters to a new contiguous memory for multiplication. To achieve practical acceleration, filters should remain contiguous and relatively static during dynamic weight selection. Based on this analysis, we design an architecture routing agent $\mathcal{A}(\theta)$ with the inductive bias of always outputting a dense architecture, e.g. a slice-able architecture. Specifically, we consider a convolutional layer with at most $N$ output filters and $M$ input channels. Omitting the spatial dimension, its filters can be denoted as $\mathcal{W} \in \mathbb{R}^{N \times M}$. The output of the architecture routing agent for this convolution is a slimming ratio $\rho \in (0, 1]$ indicating that the first $\rho \times N$ output filters are selected. Then, a dynamic slice-able convolution is defined as follows:

$$\mathcal{Y} = \mathcal{W}[\,:\rho \times N\,] * \mathcal{X}, \tag{2}$$

where $[\,:\,]$ is a slice operation denoted in a Python-like style. Remarkably, the slice operation and the dense matrix multiplication are much more efficient than an indexing operation or a sparse matrix multiplication in real-world implementations, which guarantees a practical acceleration when using our slice-able convolution.
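The contrast between slicing, zero masking and path indexing can be illustrated with a small NumPy sketch. This is not the authors' implementation: the sizes, the function names (`sliced_conv`, `masked_conv`, `indexed_conv`) and the use of a plain matrix-vector product in place of a convolution are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                      # max output filters, input channels (illustrative sizes)
W = rng.standard_normal((N, M))  # filters with the spatial dimension omitted
x = rng.standard_normal(M)       # one input position with M channels


def sliced_conv(W, x, rho):
    """Dense product with the first rho*N filters; W[:k] is a contiguous view."""
    k = int(round(rho * W.shape[0]))
    return W[:k] @ x


def masked_conv(W, x, keep):
    """Zero masking: all N filters are multiplied, then pruned outputs are zeroed."""
    return (W @ x) * keep


def indexed_conv(W, x, keep):
    """Path indexing: advanced indexing copies the kept filters to new memory."""
    return W[np.flatnonzero(keep)] @ x


rho = 0.5
keep = np.arange(N) < int(round(rho * N))  # mask selecting the same leading filters

y_slice = sliced_conv(W, x, rho)
y_mask = masked_conv(W, x, keep)
y_index = indexed_conv(W, x, keep)
```

All three variants produce the same active outputs here, but only the slice avoids both the wasted multiplications of masking and the memory copy of indexing (`np.shares_memory(W[:4], W)` is true, while the indexed filters live in a fresh buffer).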
After aggregating the slice-able convolutions sequentially, a supernet executable at different widths is formed. Paths with different widths can be seen as sub-networks. By disabling the routing agent, the supernet is analogous to a slimmable network [Yu2019SlimmableNN, Yu2019UniversallySN], and can be trained similarly.
In-place Ensemble Bootstrapping. The sandwich rule and in-place distillation techniques [Yu2019UniversallySN] proposed for Universally Slimmable Networks enhance their overall performance. In in-place distillation, the widest sub-network is used as the target network generating soft labels for the other sub-networks. However, acute fluctuations in the weights of the widest sub-network can hinder convergence, especially in the early stage of training. As observed in BigNAS [Yu2020BigNASSU], training a more complex model with in-place distillation can be highly unstable: without residual connections and special weight initialization tricks, the loss explodes at the early stage and never converges. To overcome this convergence difficulty in slimmable networks and improve the overall performance of our supernet, we propose a training scheme named In-place Ensemble Bootstrapping (IEB).
In recent years, a growing number of self-supervised methods with bootstrapping [BYOL, guo2020bootstrap, deepcluster] and semi-supervised methods based on consistency regularization [laine2016temporal, tarvainen2017mean] use their historical representations to produce targets for the online network. Inspired by this, we propose to bootstrap on previous representations in our supervised in-place distillation training. We use the exponential moving average (EMA) of the model as the target network that generates soft labels. Let $\theta$ and $\theta'$ denote the parameters of the online network and the target network, respectively. We have:

$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t, \tag{3}$$

where $\alpha$ is a momentum factor controlling the ratio of the historical parameters and $t$ is a training timestamp, usually measured in training iterations. During training, the EMA of the model is more stable and more precise than the online network, and thus can provide high-quality targets for the slimmer sub-networks.
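The EMA target update described above can be sketched in a few lines. The dictionary-of-arrays parameter layout, the function name `ema_update` and the momentum value 0.9 are assumptions for illustration; the text does not state the momentum used in the experiments.

```python
import numpy as np


def ema_update(target, online, alpha):
    """Return alpha * target + (1 - alpha) * online for every parameter tensor."""
    return {name: alpha * target[name] + (1.0 - alpha) * online[name]
            for name in target}


online = {"w": np.array([1.0, 2.0])}   # online-network parameters after a step
target = {"w": np.array([0.0, 0.0])}   # target-network parameters before the update
alpha = 0.9                            # hypothetical momentum factor

target = ema_update(target, online, alpha)
```

In a training loop this update would typically run once per iteration, right after the optimizer step on the online network, so that the target network trails the online weights smoothly.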
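As pointed out in [Meal, Mealv2], an ensemble of teacher networks can generate more diverse, more accurate and more general soft labels for distillation training of the student network. Our supernet contains a large number of sub-models with different architectures, which can generate different soft labels. Motivated by this, we use different sub-networks as a teacher ensemble when performing in-place distillation. The overall training process is shown in Fig. 4. Following the sandwich rule [Yu2019UniversallySN], the widest (denoted with $L$), the slimmest (denoted with $S$) and $n$ random-width sub-networks (denoted with $R$) are sampled in each training step. The sub-network at the largest width is trained to predict the ground truth label $\mathcal{Y}_{gt}$; the $n$ sub-networks with random width are trained to predict the soft label generated by the widest sub-network of the target network, $\mathcal{Y}'_{L}(\mathcal{X})$; the slimmest sub-network is trained to predict the probability ensemble of all the aforementioned sub-networks in the target network:

$$\hat{\mathcal{Y}}'_{L,R}(\mathcal{X}) = \frac{1}{n+1}\Big(\mathcal{Y}'_{L}(\mathcal{X}) + \sum_{i=1}^{n} \mathcal{Y}'_{R_i}(\mathcal{X})\Big). \tag{4}$$

To sum up, the IEB losses for the supernet training are:

$$\mathcal{L}^{L}_{IEB} = \mathcal{L}_{CE}\big(\mathcal{Y}_{L}(\mathcal{X}),\, \mathcal{Y}_{gt}\big), \qquad \mathcal{L}^{R}_{IEB} = \mathcal{L}_{CE}\big(\mathcal{Y}_{R}(\mathcal{X}),\, \mathcal{Y}'_{L}(\mathcal{X})\big), \qquad \mathcal{L}^{S}_{IEB} = \mathcal{L}_{CE}\big(\mathcal{Y}_{S}(\mathcal{X}),\, \hat{\mathcal{Y}}'_{L,R}(\mathcal{X})\big), \tag{5}$$

where unprimed outputs come from the online network and primed outputs from the target network.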
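The three IEB targets can be made concrete with a small NumPy sketch that works on logits. It is only an illustration under stated assumptions: the helper names (`softmax`, `cross_entropy`, `ieb_losses`), the random logits and the equal-weight average over the widest and the random-width target sub-networks are choices made here, not details confirmed by the text.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, target_probs):
    """Mean cross-entropy between softmax(logits) and a soft or one-hot target."""
    log_p = np.log(softmax(logits))
    return float(-(target_probs * log_p).sum(axis=-1).mean())


def ieb_losses(online_L, online_R, online_S, target_L, target_R, labels_onehot):
    """online_R and target_R are lists with one logit array per random width."""
    soft_L = softmax(target_L)                      # soft label from widest target net
    soft_R = [softmax(t) for t in target_R]         # soft labels from random widths
    ensemble = (soft_L + sum(soft_R)) / (len(soft_R) + 1)
    loss_L = cross_entropy(online_L, labels_onehot)            # widest: ground truth
    loss_R = [cross_entropy(o, soft_L) for o in online_R]      # random: widest target
    loss_S = cross_entropy(online_S, ensemble)                 # slimmest: ensemble
    return loss_L, loss_R, loss_S, ensemble


rng = np.random.default_rng(0)
B, K, n = 2, 3, 2                                   # batch, classes, random widths
labels = np.eye(K)[[0, 2]]
logits = lambda: rng.standard_normal((B, K))
loss_L, loss_R, loss_S, ensemble = ieb_losses(
    logits(), [logits() for _ in range(n)], logits(),
    logits(), [logits() for _ in range(n)], labels)
```

Only the online network receives gradients from these losses; the target outputs act as constants, since the target network is updated through the EMA rule rather than by backpropagation.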
3.2 Dynamic Slimming Gate
In this section, we design the channel gate function that generates the slimming ratio $\rho$ in Eqn. (2) and present the double-headed design of the dynamic gate. Then, we introduce the details of training Stage II with a technique named Sandwich Gate Sparsification (SGS).
Double-headed Design. There are two possible ways to transform a feature map into a slimming ratio $\rho$ in Eqn. (2): (i) the scalar design directly outputs a sigmoid-activated scalar ranging from 0 to 1 as the slimming ratio; (ii) the one-hot design uses an argmax/softmax-activated one-hot vector to choose the respective slimming ratio $\rho$ from a discrete candidate list vector $L_\rho$. Both implementations are evaluated and compared in Sec. 4.4. Here, we thoroughly describe our dynamic slimming gate with the better-performing one-hot design. To reduce the input feature map $\mathcal{X}$ to a one-hot vector, we divide the gate function $\mathcal{A}(\theta, \mathcal{X})$ into two functions:

$$\mathcal{A}(\theta, \mathcal{X}) = \mathcal{F}(\mathcal{E}(\mathcal{X})), \tag{6}$$

where $\mathcal{E}$ is an encoder that reduces feature maps to a vector and the function $\mathcal{F}$ maps the reduced feature to a one-hot vector used for the subsequent channel slicing. Considering the $n$-th gate in Fig. 3, given an input feature $\mathcal{X}$ with dimension $\rho_{n-1}C_n \times H_n \times W_n$, $\mathcal{E}(\mathcal{X})$ reduces it to a vector $\mathcal{X}_{\mathcal{E}} \in \mathbb{R}^{\rho_{n-1}C_n}$, which can be further mapped to a one-hot vector. By computing the dot product of this one-hot vector and $L_\rho$, we have the newly predicted slimming ratio:

$$\rho_n = \mathcal{A}(\theta, \mathcal{X}) \cdot L_\rho. \tag{7}$$

Similar to prior works [hu2019squeeze, yang2019gated] on channel attention and gating, we simply utilize average pooling as a light-weight encoder $\mathcal{E}$ to integrate spatial information. As for the feature mapping function $\mathcal{F}$, we adopt two fully connected layers with weights $W_1 \in \mathbb{R}^{d \times C_n}$ and $W_2 \in \mathbb{R}^{g \times d}$ (where $d$ represents the hidden dimension and $g$ represents the number of candidate slimming ratios) and a ReLU non-linearity layer $\sigma$ in between to predict scores for each slimming ratio choice. An argmax function is subsequently applied to generate a one-hot vector indicating the predicted choice:

$$\mathcal{F}(\mathcal{X}_{\mathcal{E}}) = \operatorname{argmax}\Big(W_2\,\sigma\big(W_1[\,:, :\rho_{n-1}C_n\,]\;\mathcal{X}_{\mathcal{E}}\big)\Big). \tag{8}$$

Note that the input $\mathcal{X}$ with dynamic channel number $\rho_{n-1}C_n$ is projected to a vector with fixed length by the dynamically sliced weight $W_1[\,:, :\rho_{n-1}C_n\,]$.
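The slimming head described above (average pooling, a sliced first fully connected layer, ReLU, a second fully connected layer, argmax, then a dot product with the candidate list) can be sketched as follows. The function name `slimming_head`, the layer sizes and the random weights are hypothetical; the candidate list is assumed to be sorted from slimmest to widest.

```python
import numpy as np


def slimming_head(x, W1, W2, ratios):
    """x: feature map of shape (c, H, W), where c <= C_n channels are active."""
    c = x.shape[0]
    x_e = x.mean(axis=(1, 2))                    # average-pooling encoder -> (c,)
    hidden = np.maximum(W1[:, :c] @ x_e, 0.0)    # sliced FC layer + ReLU -> (d,)
    scores = W2 @ hidden                         # one score per candidate ratio -> (g,)
    one_hot = np.eye(len(ratios))[np.argmax(scores)]
    rho = float(one_hot @ ratios)                # dot product with the candidate list
    return rho, one_hot, hidden


rng = np.random.default_rng(0)
C_n, d = 16, 4                                   # illustrative channel and hidden sizes
ratios = np.array([0.25, 0.5, 0.75, 1.0])        # assumed candidate list
W1 = rng.standard_normal((d, C_n))
W2 = rng.standard_normal((len(ratios), d))

x = rng.standard_normal((8, 7, 7))               # previous gate chose half of C_n
rho, one_hot, hidden = slimming_head(x, W1, W2, ratios)
```

Because only the leading columns of `W1` are used, the same gate weights serve every incoming width, and the hidden vector always has length `d` regardless of how many channels the previous stage kept.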
Our proposed channel gating function has a similar form to recent channel attention methods [hu2019squeeze, yang2019gated]. The attention mechanism can be integrated into our gate at nearly zero cost, by adding another fully connected layer with weights $W_3 \in \mathbb{R}^{C_n \times d}$ that projects the hidden vector back to the original channel number $\rho_{n-1}C_n$. Based on this idea, we propose a double-headed dynamic gate with a soft channel attention head and a hard channel slimming head. The channel attention head can be defined as follows:

$$\hat{\mathcal{X}} = \mathcal{X} * \delta\Big(W_3[\,:\rho_{n-1}C_n, :\,]\;\sigma\big(W_1[\,:, :\rho_{n-1}C_n\,]\;\mathcal{X}_{\mathcal{E}}\big)\Big), \tag{9}$$

where $\mathcal{X}_{\mathcal{E}}$ is the average-pooled input feature, $\sigma$ is the ReLU non-linearity and $\delta(\cdot)$ is the activation function adopted for the attention head. Unlike the slimming head, the channel attention head is activated in training Stage I.
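A minimal sketch of the attention head follows, reusing the first fully connected layer that the slimming head would also use. The sigmoid activation is an assumption borrowed from squeeze-and-excitation style attention, since the activation is not specified in the text above; `attention_head`, the sizes and the random weights are likewise hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_head(x, W1, W3, act=sigmoid):
    """Rescale each active channel of x (shape (c, H, W)) by a learned factor."""
    c = x.shape[0]
    x_e = x.mean(axis=(1, 2))                    # shared average-pooling encoder
    hidden = np.maximum(W1[:, :c] @ x_e, 0.0)    # first FC layer shared with slimming head
    scale = act(W3[:c, :] @ hidden)              # sliced projection back to c channels
    return x * scale[:, None, None]


rng = np.random.default_rng(0)
C_n, d = 16, 4                                   # illustrative sizes
W1 = rng.standard_normal((d, C_n))
W3 = rng.standard_normal((C_n, d))

x = rng.standard_normal((8, 7, 7))               # 8 of the 16 channels are active
x_hat = attention_head(x, W1, W3)
```

The extra cost over the slimming head is one small matrix-vector product and a channel-wise multiplication, which is why the attention head can be added at nearly zero cost.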
Sandwich Gate Sparsification. In training Stage II, we propose to use the end-to-end classification cross-entropy loss and a complexity penalty loss to train the gate, aiming to choose the most efficient and effective sub-network for each instance. To optimize the non-differentiable slimming head of the dynamic gate, we use Gumbel-softmax [jang2016categorical], a classical way to optimize neural networks containing argmax by relaxing it to a differentiable softmax in gradient computation.
However, we empirically found that the gate easily collapses into a static one even if we add Gumbel noise [jang2016categorical] to help the optimization of Gumbel-softmax. Apparently, the Gumbel-softmax technique alone is not enough for this multi-objective dynamic gate training. To further overcome the convergence difficulty and increase the dynamic diversity of the gate, we propose a technique named Sandwich Gate Sparsification (SGS). We use the slimmest sub-network and the whole network to identify easy and hard samples online and further generate the ground truth slimming factors for the slimming heads of all the dynamic gates.
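For readers unfamiliar with the relaxation, the forward pass of a Gumbel-softmax sample can be sketched in NumPy. This shows sampling only: the straight-through gradient (hard one-hot in the forward pass, softmax gradient in the backward pass) needs an autodiff framework and is not reproduced here. The function name, the temperature `tau` and the example scores are assumptions.

```python
import numpy as np


def gumbel_softmax_sample(scores, tau, rng):
    """Return (hard one-hot choice, soft relaxed probabilities) for 1-D gate scores."""
    u = rng.uniform(low=1e-10, high=1.0, size=scores.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (scores + gumbel) / tau                  # temperature-scaled noisy scores
    z = z - z.max()                              # stabilize the exponentials
    soft = np.exp(z) / np.exp(z).sum()
    hard = np.eye(scores.size)[np.argmax(soft)]  # discrete choice used for slicing
    return hard, soft


rng = np.random.default_rng(0)
scores = np.array([2.0, 0.5, 0.1, -1.0])         # hypothetical gate scores
hard, soft = gumbel_softmax_sample(scores, tau=1.0, rng=rng)
```

If the scores for one choice dominate for every input, the sampled one-hot vector is almost always the same, which is the collapse into a static gate that motivates SGS.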
As analysed in [Yu2019UniversallySN], wider sub-networks should always be more accurate, because the accuracy of slimmer ones can always be achieved by learning new connections to zeros. Thus, given a well-trained supernet, input samples can be roughly classified into three difficulty levels: a) Easy samples $\mathcal{X}_{easy}$ that can be correctly classified by the slimmest sub-network; b) Hard samples $\mathcal{X}_{hard}$ that cannot be correctly classified by the widest sub-network; c) Dependent samples $\mathcal{X}_{dependent}$: other samples in between. In order to minimize the computation cost, easy samples should always be routed to the slimmest sub-network (i.e. gate target $\mathcal{T}(\mathcal{X}_{easy}) = [1, 0, \dots, 0]$). For dependent samples and hard samples, we always encourage them to pass through the widest sub-network, even if the hard samples cannot be correctly classified (i.e. $\mathcal{T}(\mathcal{X}_{hard}) = \mathcal{T}(\mathcal{X}_{dependent}) = [0, \dots, 0, 1]$). Another gate target strategy is also discussed in Sec. 4.4.
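Based on the generated gate target, we define the SGS loss that facilitates the gate training:

$$\mathcal{L}_{SGS} = \mathbb{T}_{sim}(\mathcal{X}) \cdot \mathcal{L}_{CE}\big(\mathcal{X}, \mathcal{T}(\mathcal{X}_{easy})\big) + \big(1 - \mathbb{T}_{sim}(\mathcal{X})\big) \cdot \mathcal{L}_{CE}\big(\mathcal{X}, \mathcal{T}(\mathcal{X}_{hard})\big), \tag{10}$$

where $\mathbb{T}_{sim}(\mathcal{X}) \in \{0, 1\}$ represents whether $\mathcal{X}$ is correctly predicted by the slimmest sub-network and $\mathcal{L}_{CE}$ is the cross-entropy loss between the softmax-activated gate scores and the generated gate target.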
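The target generation and loss can be sketched as follows, assuming the candidate ratios are ordered from slimmest (index 0) to widest (last index). `sgs_targets` and `sgs_loss` are hypothetical helper names, and the batch below is made up for illustration.

```python
import numpy as np


def sgs_targets(slimmest_correct, num_choices):
    """Easy samples target the slimmest choice; all other samples target the widest."""
    idx = np.where(slimmest_correct, 0, num_choices - 1)
    return np.eye(num_choices)[idx]


def sgs_loss(gate_scores, targets):
    """Cross-entropy between softmax(gate_scores) and one-hot gate targets."""
    z = gate_scores - gate_scores.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(targets * log_p).sum(axis=1).mean())


# Whether the slimmest sub-network classified each of three samples correctly.
slimmest_correct = np.array([True, False, False])
targets = sgs_targets(slimmest_correct, num_choices=4)

gate_scores = np.zeros((3, 4))                   # an untrained, uniform gate
loss = sgs_loss(gate_scores, targets)            # equals log(4) for uniform scores
```

The targets depend only on the frozen supernet's predictions, so they can be computed online for every batch without extra labels.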
Dataset. We evaluate our method on two classification datasets (i.e., ImageNet [deng2009imagenet] and CIFAR-10 [Krizhevsky09cifar]) and a standard object detection dataset (i.e., PASCAL VOC [everingham2010pascal]). The ImageNet dataset is a large-scale dataset containing 1.2 M training images and 50 K validation images in 1000 classes. We use all the training data in both training stages. Our results are obtained on the validation set with an image size of 224 × 224. We also test the transferability of our DS-Net on CIFAR-10, which comprises 10 classes with 50,000 training and 10,000 test images. Note that few previous works on dynamic networks and network pruning reported results on object detection. We take PASCAL VOC, one of the standard datasets for evaluating object detection performance, as an example to further test the generality of our dynamic networks on object detection. All the detection models are trained with the combined dataset from VOC 2007 and 2012 and tested on the VOC 2007 test set.
Architecture details. Following previous works on static and dynamic network pruning, we use two representative networks, i.e., the heavy residual network ResNet 50 [he2016deep] and the lightweight non-residual network MobileNetV1 [howard2017mobilenets], to evaluate our method.
In Dynamic Slimmable ResNet 50 (DS-ResNet), we insert our double-headed gate at the beginning of each residual block. The slimming head is only used in the first block of each stage. Each of those blocks contains a skip connection with a projection layer, i.e. a 1×1 convolution. The filter number of this projection convolution is also controlled by the gate to avoid channel inconsistency when adding skip features to the residual output. In the other residual blocks, the slimming heads of the gates are disabled and all the layers in those blocks inherit the width of the first block of their stage. To sum up, there are 4 gates (one for each stage) with both heads enabled. Every gate has 4 equispaced candidate slimming ratios, i.e. $\rho \in \{0.25, 0.5, 0.75, 1\}$. The total routing space contains $4^4 = 256$ possible paths with different computation complexities. All batch normalization (BN) layers in DS-ResNet are replaced with group normalization to avoid the test-time representation shift caused by inaccurate BN statistics in weight-sharing networks [Yu2019SlimmableNN, Yu2019UniversallySN].
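The size of the DS-ResNet routing space can be checked by enumeration. The four ratios below are assumed to be 0.25, 0.5, 0.75 and 1.0 (four equispaced values starting at the 0.25 slimmest ratio mentioned in Sec. 4.5), and the per-stage channel counts are the standard ResNet-50 stage output widths, used here only to illustrate what a path means.

```python
from itertools import product

ratios = (0.25, 0.5, 0.75, 1.0)          # assumed candidate slimming ratios per gate
num_gates = 4                            # one gate with slimming head per stage
stage_channels = (256, 512, 1024, 2048)  # standard ResNet-50 stage widths (illustrative)

# Every path assigns one ratio to each of the four stages.
paths = list(product(ratios, repeat=num_gates))


def active_channels(path):
    """Number of leading output channels kept in each stage for a given path."""
    return tuple(int(r * c) for r, c in zip(path, stage_channels))


num_paths = len(paths)                   # 4 ** 4 = 256
example = active_channels((0.25, 0.5, 0.75, 1.0))
```

Here `num_paths` is 256 and `example` is `(64, 256, 768, 2048)`; each path is a dense, contiguous slice of the same supernet weights rather than a separately stored model.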
Unlike DS-ResNet, we use only a single slimming gate, placed after the fifth depthwise separable convolution block of Dynamic Slimmable MobileNetV1 (DS-MBNet). Specifically, a fixed slimming ratio is used in the first 5 blocks, while the width of the remaining 8 blocks is controlled by the gate, which selects among 18 candidate slimming ratios. This architecture, with only 18 paths in its routing space, is similar to a uniform slimmable network [Yu2019SlimmableNN, Yu2019UniversallySN], which makes it practical to use batch normalization. Following [Yu2019UniversallySN], we perform BN recalibration for all 18 paths in DS-MBNet after the supernet training stage.
We train our supernet with a total batch size of 512 on ImageNet, using the SGD optimizer with an initial learning rate of 0.2 for DS-ResNet and 0.08 for DS-MBNet. We use a cosine learning rate scheduler to reduce the learning rate to 1% of its initial value in 150 epochs. Other settings follow previous works on slimmable networks [Yu2019SlimmableNN, Yu2019UniversallySN, yu2019autoslim]. For gate training, we use the SGD optimizer with an initial learning rate of 0.05 for a total batch size of 512. The learning rate decays to 0.9× of its value every epoch. It takes 10 epochs for the gate to converge. For transfer learning experiments on CIFAR-10, we follow similar settings to [Kornblith2018DoBI] and [Huang2019GPipeET]. We transfer our supernet for 70 epochs including 15 warm-up epochs and use a cosine learning rate scheduler with an initial learning rate of 0.7 for a total batch size of 1024. For the object detection task, we train all the networks following [liu2016ssd] and [li2017fssd] with a total batch size of 128 for 300 epochs. The learning rate is set to 0.004 at first, then divided by 10 at epochs 200 and 250.
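The two ImageNet learning rate schedules can be written down explicitly. This is a sketch under assumptions: the cosine schedule is taken to anneal smoothly from the initial rate to a floor of 1% of that rate, and both schedules are stepped per epoch; the function names are hypothetical.

```python
import math


def supernet_lr(epoch, base_lr=0.2, total_epochs=150, final_ratio=0.01):
    """Cosine schedule from base_lr down to final_ratio * base_lr (Stage I)."""
    min_lr = base_lr * final_ratio
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos


def gate_lr(epoch, base_lr=0.05, gamma=0.9):
    """Exponential decay by a factor of gamma per epoch (Stage II gate training)."""
    return base_lr * gamma ** epoch


lr_start = supernet_lr(0)        # 0.2 for DS-ResNet
lr_end = supernet_lr(150)        # 1% of the initial rate
lr_gate_end = gate_lr(10)        # after the 10 gate-training epochs
```

With these settings the gate learning rate falls to roughly a third of its initial value over the 10 epochs, since 0.9 to the tenth power is about 0.35.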
|Group||Method||MAdds||Top-1 Acc. (%)|
|3B MAdds||SFP [SoftFilterPruning]||2.9B||75.1|
| ||ThiNet-70 [luo2017thinet, liu2018RethinkingPruning]||2.9B||75.8|
| ||MetaPruning 0.85 [Liu2019MetaPruningML]||3.0B||76.2|
|2B MAdds||ResNet-50 0.75 [he2016deep]||2.3B||74.9|
| ||ThiNet-50 [luo2017thinet, liu2018RethinkingPruning]||2.1B||74.7|
| ||MetaPruning 0.75 [Liu2019MetaPruningML]||2.0B||75.4|
|1B MAdds||ResNet-50 0.5 [he2016deep]||1.1B||72.1|
| ||ThiNet-30 [luo2017thinet, liu2018RethinkingPruning]||1.2B||72.1|
| ||MetaPruning 0.5 [Liu2019MetaPruningML]||1.0B||73.4|
|Group||Method||MAdds||Latency||Top-1 Acc. (%)|
|500M MAdds||MBNetV1 1.0 [howard2017mobilenets]||569M||63 ms||70.9|
| ||US-MBNetV1 1.0 [Yu2019UniversallySN]||569M||-||71.8|
|300M MAdds||MBNetV1 0.75 [howard2017mobilenets]||317M||48 ms||68.4|
| ||US-MBNetV1 0.75 [Yu2019UniversallySN]||317M||-||69.5|
|150M MAdds||MBNetV1 0.5 [howard2017mobilenets]||150M||33 ms||63.3|
| ||US-MBNetV1 0.5 [Yu2019UniversallySN]||150M||-||64.2|
4.1 Main Results on ImageNet
We first validate the effectiveness of our method on ImageNet. As shown in Tab. 2 and Fig. 5, DS-Net consistently outperforms recent static pruning methods, dynamic inference methods and NAS methods across different computation complexities. First, our DS-ResNet and DS-MBNet models achieve 2-4× computation reduction over ResNet-50 (76.1% [he2016deep]) and MobileNetV1 (70.9% [howard2017mobilenets]) with minimal accuracy drops (0% to -1.5% for ResNet and +0.9% to -0.8% for MobileNet). We also tested the real-world latency of efficient networks. Compared to the ideal acceleration measured on channel-scaled MobileNetV1, which is 1.31× and 1.91×, our DS-MBNet achieves comparable 1.17× and 1.62× acceleration with much higher performance. In particular, our DS-MBNet surpasses the original and the channel-scaled MobileNetV1 [howard2017mobilenets] by 3.6%, 4.4% and 6.8% with similar MAdds and a minor increase in latency. Second, our method outperforms classic and state-of-the-art static pruning methods by a large margin. Remarkably, DS-MBNet outperforms the SOTA pruning methods EagleEye [li2020eagleeye] and Meta-Pruning [Liu2019MetaPruningML] by 1.9% and 2.2%. Third, our DS-Net maintains its superiority over powerful dynamic inference methods with varying depth, width or input resolution. For example, our DS-MBNet-M surpasses the dynamic pruning method CG-Net [hua2019channel] by 2.5%. Fourth, our DS-Net also consistently outperforms its static counterparts. Our DS-MBNet-S surpasses AutoSlim [yu2019autoslim] and US-Net [Yu2019UniversallySN] by 2.2% and 5.9%.
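The ideal acceleration figures quoted above follow directly from the MobileNetV1 latencies reported in Tab. 2, as the short calculation below shows. The dictionary layout and variable names are illustrative; the numbers are the table's MAdds and latency entries for widths 1.0, 0.75 and 0.5.

```python
# Latency (ms) and MAdds (millions) of channel-scaled MobileNetV1 from Tab. 2.
latency_ms = {"1.0": 63.0, "0.75": 48.0, "0.5": 33.0}
madds_m = {"1.0": 569.0, "0.75": 317.0, "0.5": 150.0}

# Measured speedup and theoretical computation reduction relative to width 1.0.
speedup = {w: latency_ms["1.0"] / t for w, t in latency_ms.items()}
reduction = {w: madds_m["1.0"] / m for w, m in madds_m.items()}

ideal_075 = round(speedup["0.75"], 2)   # 1.31
ideal_05 = round(speedup["0.5"], 2)     # 1.91
```

The same table gives theoretical reductions of about 1.8× and 3.8× in MAdds for these widths, so even a statically scaled network realizes only part of its computation reduction as wall-clock speedup; this is the ceiling against which the 1.17× and 1.62× of DS-MBNet are compared.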
|Model||MAdds||Top-1 Acc. (%)|
|ResNet-50 [he2016deep, Kornblith2018DoBI]||4.1B||96.8|
|ResNet-101 [he2016deep, Kornblith2018DoBI]||7.8B||97.6|
|DS-ResNet w/o GT||1.7B||97.4|
|DS-ResNet w/ GT||1.6B||97.8|
|Model||MAdds||mAP|
|MBNetV1 + FSSD [howard2017mobilenets, li2017fssd]||4.3B||71.9|
|DS-MBNet-S + FSSD||2.3B||70.7|
|DS-MBNet-M + FSSD||2.7B||72.8|
|DS-MBNet-L + FSSD||3.2B||73.7|
To evaluate the transferability of DS-Net and its dynamic gate, we perform transfer learning in two settings: (i) DS-Net w/o gate transfer: we transfer the supernet without the slimming gate to CIFAR-10 and retrain the dynamic gate. (ii) DS-Net w/ gate transfer: we first transfer the supernet, then load the ImageNet-trained gate and perform transfer learning for the gate. The results, along with the transfer learning results of the original ResNets, are shown in Tab. 3. Gate transfer boosts the performance of DS-ResNet by 0.4% on CIFAR-10, demonstrating the transferability of the dynamic gate. Remarkably, both of our transferred DS-ResNets outperform the original ResNet-50 by a large margin (0.6% and 1.0%) with about 2.5× computation reduction. Among them, DS-ResNet with gate transfer even outperforms the larger ResNet-101 with 4.9× lower computation complexity, demonstrating the superiority of DS-Net in transfer learning.
4.3 Object Detection
In this section, we evaluate and compare the performance of the original MobileNet and DS-MBNet used as the feature extractor in object detection with the Feature Fusion Single Shot Multibox Detector (FSSD) [li2017fssd]. We use the features from the 5-th, 11-th and 13-th depthwise convolution blocks (with output strides of 8, 16 and 32) of MobileNet for the detector. When using DS-MBNet as the backbone, all the features from dynamic source layers are projected to a fixed channel dimension by the feature transform module in FSSD [li2017fssd].
Results on the VOC 2007 test set are given in Tab. 4. Compared to MobileNetV1, DS-MBNet-M and DS-MBNet-L with FSSD achieve 0.9 and 1.8 mAP improvement with 1.59× and 1.34× computation reduction, respectively, which demonstrates that our DS-Net retains its superiority when deployed as the backbone network in an object detection task.
4.4 Ablation study
In-place Ensemble Bootstrapping. We analyze the effect of the IEB technique with MobileNetV1. We train a Slimmable MobileNetV1 supernet with three settings: original in-place distillation, in-place distillation with an EMA target, and our complete IEB technique. As shown in Tab. 5, the slimmest and widest sub-networks trained with the EMA target surpass the baseline by 1.6% and 0.3%, respectively. With IEB, the supernet improves by 1.8% and 0.6% on its slimmest and widest sub-networks compared with in-place distillation. The evaluation accuracy progression curves of the slimmest sub-networks trained with these three settings are illustrated in Fig. 6. The beginning stage of in-place distillation is unstable. Adopting the EMA target improves the performance. However, there are a few sudden drops in accuracy in the middle of training with the EMA target. Although the model recovers within several epochs, it may still be harmed by those fluctuations. After fully adopting IEB, the model converges to a higher final accuracy without any conspicuous fluctuations in the training process, demonstrating the effectiveness of our IEB technique in stabilizing the training and boosting the overall performance of slimmable networks.
Effect of losses. To examine the impact of the three losses used in our gate training, i.e. the target loss, the complexity loss and the SGS loss, we conduct extensive experiments with DS-ResNet on ImageNet, and summarize the results in Tab. 6 and Fig. 7 left. Firstly, as illustrated in Fig. 7 left, models trained with SGS (red line) are more efficient than models trained without it (purple line). Secondly, as shown in Tab. 6, with the target loss alone, the model pursues better performance while ignoring computation cost; the complexity loss pushes the model to be lightweight while ignoring the performance; the SGS loss by itself can achieve a balanced complexity-accuracy trade-off by encouraging easy and hard samples to use slim and wide sub-networks, respectively.
SGS strategy. Though we always want the easy samples to be routed to the slimmest sub-network, there are two possible target definitions for hard samples in the SGS loss: (i) Try Best: encourage the hard samples to pass through the widest sub-network, even if they cannot be correctly classified (i.e. gate target $\mathcal{T}(\mathcal{X}_{hard}) = [0, \dots, 0, 1]$). (ii) Give Up: push the hard samples to use the slimmest path to save computation cost (i.e. $\mathcal{T}(\mathcal{X}_{hard}) = [1, 0, \dots, 0]$). In both strategies, dependent samples are encouraged to use the widest sub-network (i.e. $\mathcal{T}(\mathcal{X}_{dependent}) = [0, \dots, 0, 1]$). The results for both strategies are shown in Tab. 6 and Fig. 7 left. As shown in the third and fourth lines of Tab. 6, the Give Up strategy lowers the computation complexity of DS-ResNet but greatly harms the model performance. The models trained with the Try Best strategy (red line in Fig. 7 left) outperform the one trained with the Give Up strategy (blue dot in Fig. 7 left) in terms of efficiency. This can be attributed to the optimization difficulty of the Give Up strategy and its lack of samples that target the widest path (dependent samples only account for about 10% of the total training samples). These results show that our Try Best strategy is easier to optimize and generalizes better to the validation set or new data.
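The difference between the two strategies comes down to which index the hard samples are assigned. The sketch below assumes the candidate ratios are ordered from slimmest (index 0) to widest (last index); `gate_target_index` and the three-sample batch are hypothetical.

```python
import numpy as np


def gate_target_index(slim_ok, wide_ok, num_choices, strategy="try_best"):
    """Target choice per sample from the slimmest/widest sub-network correctness."""
    if strategy not in ("try_best", "give_up"):
        raise ValueError(f"unknown strategy: {strategy}")
    easy = slim_ok                               # solved by the slimmest sub-network
    hard = ~slim_ok & ~wide_ok                   # not solved even by the widest one
    idx = np.full(slim_ok.shape, num_choices - 1)  # dependent samples -> widest
    idx[easy] = 0                                # easy samples -> slimmest
    if strategy == "give_up":
        idx[hard] = 0                            # Give Up: hard samples -> slimmest
    return idx


# One easy, one dependent and one hard sample.
slim_ok = np.array([True, False, False])
wide_ok = np.array([True, True, False])

try_best = gate_target_index(slim_ok, wide_ok, 4, "try_best")
give_up = gate_target_index(slim_ok, wide_ok, 4, "give_up")
```

Under Give Up only the dependent samples still point at the widest path, which matches the observation above that very few samples (about 10%) remain to train the gate toward wide sub-networks.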
|✓ Give Up||1.5B||73.7|
|✓ Try Best||3.1B||76.6|
|✓||✓ Try Best||1.2B||74.6|
|✓||✓||✓ Try Best||2.2B||76.1|
|weight sharing||slimming head||MAdds||Top-1 Acc.|
Gate design. First, to evaluate the effect of our weight-sharing double-headed gate design, we train a DS-ResNet with SGS loss only and without sharing the first fully-connected layer for comparison. As shown in Tab. 7 and Fig. 7 left, the performance of DS-ResNet increases substantially (3.9%) when the weight-sharing design is applied (green dot vs. red line in Fig. 7 left). This might be attributed to overfitting of the slimming head without weight sharing. As observed in our experiments, sharing the first fully-connected layer with the attention head can greatly improve generalization. Second, we also train a DS-ResNet with the scalar design (refer to Sec 3.2) of the slimming head to compare with the one-hot design. Both networks are trained with SGS loss only. The results are presented in Tab. 7 and Fig. 7 left. The performance of the scalar design (orange dot in Fig. 7 left) is much lower than that of the one-hot design (red line in Fig. 7 left), indicating that the scalar gate could not route the input to the correct paths.
4.5 Gate visualization
To demonstrate the dynamic diversity of our DS-Net, we visualize the gate distribution of DS-ResNet over the ImageNet validation set in Fig. 7 right. In blocks 1 and 2, about half of the inputs are routed to the slimmest sub-network with a slimming ratio of 0.25, while in higher-level blocks, about half of the inputs are routed to the widest sub-network. For all the gates, the slimming ratio choices are highly input-dependent, demonstrating the high dynamic diversity of our DS-Net.
In this paper, we have proposed the Dynamic Slimmable Network (DS-Net), a novel dynamic network for efficient inference that achieves good hardware efficiency by predictively adjusting the number of filters at test time with respect to different inputs. We propose a two-stage training scheme with the In-place Ensemble Bootstrapping (IEB) and Sandwich Gate Sparsification (SGS) techniques to optimize DS-Net. We demonstrate that DS-Net can achieve 2-4× computation reduction and 1.62× real-world acceleration over ResNet-50 and MobileNet with minimal accuracy drops on ImageNet. Empirically, DS-Net surpasses its static counterparts as well as state-of-the-art static and dynamic model compression methods on ImageNet by a large margin (2%) and generalizes well to the CIFAR-10 classification task and the VOC object detection task.
This work was supported in part by National Key R&D Program of China under Grant No. 2020AAA0109700, National Natural Science Foundation of China (NSFC) under Grant No.U19A2073, No.61976233 and No.61906109, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No.2019B1515120039, Shenzhen Outstanding Youth Research Project (Project No. RCYX20200714114642083), Shenzhen Basic Research Project (Project No. JCYJ20190807154211365), Leading Innovation Team of the Zhejiang Province (2018R01017) and CSIG Young Fellow Support Fund. Dr Xiaojun Chang is partially supported by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) (DE190100626).
A. Implementation Details
Losses in Stage II. The complexity penalty loss is used to increase the model efficiency in training stage II. To provide a stable and fair constraint, we use the number of multiply-adds on the fly, , as the metric of model complexity. Specifically, the complexity penalty is given by:
where is a normalization factor set to the total MAdds of the supernet in our implementation. Note that this loss term always pushes the gate to route towards a faster architecture, rather than towards an architecture with target MAdds, which can effectively prevent routing easy and hard instances to the same architecture.
Overall, the slimming gate can be optimized with a joint loss function:
The three balancing factors are set to , , in our experiments. Different target MAdds are reached by adjusting the routing space during gate training. For instance, when training the gate of DS-MBNet-S, we set to prevent routing to heavier sub-networks.
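As a rough sketch of how these pieces combine, the snippet below assumes a squared MAdds penalty normalized by the supernet total and a weighted sum of the three losses. Both functional forms, the function names, and every numeric value are assumptions for illustration; they are not the formulas or the balancing factors used in our experiments.

```python
def complexity_penalty(madds, supernet_madds):
    """Assumed form: squared MAdds, normalized by the supernet's total MAdds."""
    return (madds / supernet_madds) ** 2


def joint_gate_loss(sgs_loss, target_loss, cplx_loss, weights):
    """Assumed form: weighted sum of the SGS, target and complexity losses."""
    w_sgs, w_target, w_cplx = weights
    return w_sgs * sgs_loss + w_target * target_loss + w_cplx * cplx_loss


# Hypothetical numbers: a routed path costing 2.0B MAdds in a 4.0B supernet.
penalty = complexity_penalty(2.0e9, 4.0e9)
loss = joint_gate_loss(sgs_loss=0.2, target_loss=1.0, cplx_loss=penalty,
                       weights=(1.0, 1.0, 0.5))  # placeholder weights
```

Under this assumed form the penalty decreases monotonically with the MAdds of the routed path, so it always favors a faster architecture and has no preferred target complexity, which is consistent with the note above.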
Equispaced channel group. Following previous works [Yu2019UniversallySN, yu2019autoslim], we set the smallest division of channel number to 8. When using as the interval of , rounding channels by 8 may result in different intervals, which could lead to training failure when using Group Normalization [Wu2018GroupN]. To prevent this problem, we always adopt a consistent interval (e.g. 8, 16, 32) within a single layer, instead of multiplying and rounding the channel number. This results in a difference in slimming ratio between our implemented architecture and our design.
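The difference between the two ways of choosing channel numbers can be seen in a small sketch. The layer width of 72 channels, the four slimming ratios, and the helper names are hypothetical and chosen only to make the effect visible.

```python
def rounded_widths(full_width, ratios, divisor=8):
    """Multiply-and-round: scale the full width, then round to a multiple of divisor."""
    return [max(divisor, int(full_width * r / divisor + 0.5) * divisor)
            for r in ratios]


def equispaced_widths(full_width, interval, num_choices):
    """Consistent interval: step down from the full width by a fixed amount."""
    return [full_width - interval * k for k in reversed(range(num_choices))]


def intervals(widths):
    """Differences between consecutive width choices."""
    return [b - a for a, b in zip(widths, widths[1:])]


ratios = [0.25, 0.5, 0.75, 1.0]
rounded = rounded_widths(72, ratios)        # multiply each ratio and round by 8
equispaced = equispaced_widths(72, 16, 4)   # fixed interval of 16 channels
print(rounded, intervals(rounded))
print(equispaced, intervals(equispaced))
```

In this toy layer, multiplying and rounding gives intervals of 24, 16 and 16 channels, whereas the equispaced choice keeps every interval at 16. The price is that the slimmest equispaced width (24 of 72 channels) no longer matches the designed ratio of 0.25, which is the discrepancy mentioned above.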
Additional details. Weight decay is set to in all of our experiments on ImageNet. To stabilize the optimization, weight decay is removed from all layers in the dynamic gate. The weight of the last normalization layer of each residual block is initialized to zero following [Yu2020BigNASSU]. The weight of the fully-connected layer in the channel attention head, in Eqn. 8, is also zero-initialized to ease the optimization following [yang2019gated]. Additional training techniques include [goyal2017accurate, cubuk2019autoaugment]. We do not use label smoothing [pereyra2017regularizing], DropPath [larsson2016fractalnet] or RMSProp [tieleman2012lecture], which are commonly used in previous works [Tan2019EfficientNetRM, Howard2019SearchingFM, yu2019autoslim, Yu2019UniversallySN].
B. Experiments on EfficientNet
We also apply our method to EfficientNet [Tan2019EfficientNetRM], a state-of-the-art network family with high efficiency. Similar to our DS-MBNet, Dynamic Slimmable EfficientNet-B0 (DS-EffNet) has only one slimming gate, placed after its 8-th inverted residual block and controlling the remaining 8 blocks. The fixed slimming ratio for the first 8 blocks is 0.5, while a uniform dynamic slimming ratio is used for the last 8 blocks. This supernet, with 20 paths in total, is trained with a configuration similar to that of the DS-ResNet and DS-MBNet supernets.
We train the supernet with a total batch size of 512 and a learning rate of 0.2 that decays with a cosine scheduler over 150 epochs. To enable direct comparison, we reproduce the EfficientNet results using our training setup, with a 150-epoch schedule and no extra enhancements such as DropPath [larsson2016fractalnet], RMSProp [tieleman2012lecture], etc.
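For reference, a cosine schedule with these two numbers can be sketched as below. The per-epoch granularity, the decay to exactly zero, and the absence of warmup are assumptions of this sketch; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, base_lr=0.2, total_epochs=150):
    """Cosine-decayed learning rate at a given epoch (assumed to decay to zero)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Learning rate at the start, the midpoint and the end of training.
schedule = [cosine_lr(e) for e in (0, 75, 150)]
```

With these assumptions the rate starts at 0.2, passes through half of that value at epoch 75, and reaches zero at epoch 150.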
The results are shown in Tab. 8. DS-EffNet outperforms the original EfficientNet-B0 by 0.7% and 0.8%, demonstrating its efficacy on recent architectures with inverted bottleneck blocks [Sandler2018MobileNetV2IR] and Squeeze-and-Excitation modules [hu2019squeeze].
|400M MAdds||EffNet-B0 [Tan2019EfficientNetRM] (repro.)||399M||76.0|
|200M MAdds||EffNet-B0 0.75 [Tan2019EfficientNetRM]||267M||74.6|
C. Additional Ablations
Slimming gate. We analyze the improvement brought by the slimming gate by comparing the performance of DS-Net with that of its supernet. As shown in Tab. 9, the slimming gate boosts the performance of DS-MBNet-S and DS-ResNet-S by 0.8% and 1.2%, respectively, compared with sub-networks of similar size in their supernets.
Distillation temperature. Temperature in the distillation loss was first introduced in [Hinton2015DistillingTK] to control the smoothness of the target. Using a suitably larger temperature usually yields better performance of the student. Surprisingly, we find a huge performance degradation in the slimmest sub-network when using a larger temperature in in-place distillation. We test this with DS-MBNet for 40 epochs and compare the result with the performance of the default setting (). As shown in Tab. 10, the performance of the slimmest sub-network decreases by 10.2% after applying the larger temperature.
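How the temperature controls the smoothness of the distillation target can be seen in a minimal sketch. The logits and the two temperature values below are hypothetical and are not the settings compared in Tab. 10.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; larger values give smoother outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.0]             # hypothetical teacher outputs
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
smooth = softmax_with_temperature(teacher_logits, temperature=4.0)
```

With the larger temperature, probability mass moves from the top class to the other classes, so the student receives a flatter target distribution.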