With the development of deep learning, many works have revealed that convolutional neural networks (CNNs) contain many redundant structures, and many components can be removed without compromising performance. For example, neural network pruning techniques [6, 23] and the lottery ticket hypothesis both demonstrate this redundancy. Accordingly, how to effectively exploit the inference capacity of networks to gain better performance is an essential problem in deep learning.
Recently, dynamic neural networks have started to emerge, which focus on new inference mechanisms. In order to better exploit networks, these methods try to establish a dynamic routing mechanism, in which only part of the whole network is activated during a single inference. These methods can reduce the computational cost effectively, but such dynamic routing networks can hardly achieve the accuracy of the original full networks. It is puzzling why dynamic networks can hardly achieve the same performance, since similar techniques like pruning have proven that a network can be compressed without compromising performance.
The reason why present dynamic methods have inferior accuracy is that they only focus on the design of the routing mechanism while ignoring the premise of utilizing more capacity of the network. On the one hand, the paths generated by the routing mechanism should be diverse enough to fully tap the potential of the network: the more routing paths we use, the more network capacity we exploit. On the other hand, it is widely accepted in deep learning that similar inputs have similar feature activations. Accordingly, the routing paths of similar inputs should also be similar. If similar inputs took distinct routes, the capacity of the network would intuitively be poorly utilized. Moreover, similar routing paths for similar inputs benefit parameter sharing and ease the learning of the network: the shared parameters on similar routing paths are consistently stimulated by similar inputs, thus forming compact and specific routing paths for specific groups of similar inputs. In this way, the properties of compact learning and consistency of the routing mechanism should be considered. In summary, we can take full advantage of a network through a diverse and consistent routing mechanism. From the perspective of computational cost, we should also be able to explicitly control the cost of dynamic routing networks to better utilize them. Moreover, flexible computational cost control gives us the opportunity to balance the trade-off between cost and accuracy.
With the above motivations, we propose a novel dynamic neural network aiming at a diverse and consistent routing mechanism with customizable control of the computational cost, as shown in Fig. 1. Specifically, we propose a framework that maximizes the difference of routes between dissimilar inputs for diversity and minimizes the difference of routes between similar inputs for consistency. If we regard the routing paths as samples in the routing space, then the routing paths of similar inputs can be regarded as intra-class samples and those of dissimilar inputs can be regarded as inter-class samples. In this way, the above problems of diversity and consistency can be naturally solved in a classical linear discriminant analysis manner, which minimizes the intra-class distance and maximizes the inter-class distance. In order to control the computational cost, we also propose a differentiable computational cost controlling loss.
The main contributions of our work can be summarized into three parts:
We propose a novel consistency- and diversity-based dynamic routing network that takes full advantage of networks and outperforms the original full network by a large margin.
We propose a customizable computational cost controlling method, which can be flexibly used to adjust the computational cost of the network and balance cost and performance.
Extensive experiments and ablation studies show that our method achieves state-of-the-art (SOTA) results compared with baseline plain networks, dynamic neural networks, and model compression networks.
2 Related Works
Dynamic Prediction Network. A series of methods have been proposed to drop layers at test time. Veit et al. discovered that only short paths of deep residual networks are needed during training. After that, SkipNet proposed a routing method for selecting a subset of residual blocks to be executed. Besides, BlockDrop designs an extra module that samples paths from the whole space to speed up ResNet inference. Recently, Slimmable Nets train a single model to support multiple widths to adaptively fit different computational constraints. Once-for-all takes a progressive shrinking algorithm to support diverse architectural settings. Most of these methods are applied to pre-trained models and require finetuning to achieve competitive results. Their training processes are more akin to optimizing attention over different dimensions such as blocks, layers, and channels.
Other methods apply heuristic and greedy strategies to handle feature and predictor costs. Odena et al. employ Markov decision processes for the computation of predictors and features, and then learn policies. Recently, the multi-scale dense network introduces early-exit branches on DenseNet to achieve anytime inference with deep networks. Figurnov et al. study early termination in several groups of blocks of ResNet. An RNN architecture for dynamically determining the number of layers according to the allowed time budget has also been proposed. Similarly, ACT uses an RNN model with a halting unit determining whether computation should continue or not. Li et al. propose a self-distillation mechanism to optimize different exit branches.
Model Compression. The need to deploy high-performance deep neural network models on computationally limited platforms like mobile devices motivates techniques that can effectively reduce computational costs. Knowledge distillation [13, 4, 3], low-rank factorization [19, 31, 20], filter pruning [23, 12, 25], and quantization [10, 37, 29] are widely used to compress the structure and prune the parameters of neural networks. In addition, neural architecture search provides alternative routes to low-cost models, including MnasNet and ProxylessNAS.
3 Method

In this section, we illustrate the details of our method. First, we introduce the dynamic routing mechanism used in this work as the base of the proposed method. Then we show how to take the optimization of diversity and consistency into account within the routing mechanism. Meanwhile, the explicit control of computational cost is also introduced. In the end, we summarize the training and testing of our method as a whole.
3.1 Dynamic routing mechanism
Revisiting static and dynamic inference. Present CNNs usually predict in a static and sequential way. Denote $x_l$ as the features of the $l$-th block of the network and $\mathcal{F}_l$ as the operation of the $l$-th block. The inference of present CNNs can be simply written as:

$$x_l = \mathcal{F}_l(x_{l-1}), \quad l = 1, \dots, L.$$
In this way, a natural idea of dynamic routing is to skip certain blocks as shown in Fig. 2(a), which can be written as:

$$x_l = r_l \cdot \mathcal{F}_l(x_{l-1}) + (1 - r_l) \cdot x_{l-1}, \quad r_l \in \{0, 1\},$$

where $r_l = 1$ means the $l$-th block is executed and $r_l = 0$ means it is skipped.
In order to realize dynamic routing, there are two major problems to solve. The first is how to design the routers used for routing. The second is how to optimize the routers, because the routing decisions are discrete and hard to train. Both problems are essential to dynamic routing networks.
How to design routers. The router is the key component in dynamic routing networks. It has to be strong enough to select the correct route while maintaining a low computational cost.
Two kinds of routers have been proposed: CNN-based routers and LSTM-based routers. In this work, we propose to use a fully connected layer to achieve the dynamic routing mechanism, whose computational cost is far lower than that of a convolutional block. The structure of our router is shown in Fig. 2(b). The corresponding studies on different routers can be seen in Sec. 4.4.
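As an illustration of how lightweight such a router can be, the following is a minimal NumPy sketch. The class name `FCRouter`, the two-logit design, and the random initialization are our illustrative assumptions, not the paper's exact implementation: the router global-average-pools the incoming feature map and applies a single fully connected layer to produce an execute probability.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class FCRouter:
    """Lightweight router: global average pooling + one fully connected layer."""
    def __init__(self, channels, rng=None):
        rng = rng or np.random.default_rng(0)
        # two logits: (skip, execute)
        self.W = rng.standard_normal((channels, 2)) * 0.01
        self.b = np.zeros(2)

    def __call__(self, feat):
        # feat: (C, H, W) feature map produced by the previous block
        pooled = feat.mean(axis=(1, 2))            # (C,) global average pooling
        probs = softmax(pooled @ self.W + self.b)  # (2,) routing distribution
        return probs[1]                            # probability of executing the block
```

Since the router only sees a pooled vector, its cost is a single `C x 2` matrix multiply, which is negligible next to a convolutional block.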
How to optimize. As mentioned above, we use a continuous approximation of discrete routing, which achieves higher performance in both accuracy and routing cost. Denote $z_l$ as the logits produced by the router of the $l$-th block. We use a softmax to approximate the discrete routing variable:

$$\hat{r}_l = \operatorname{softmax}(z_l / \tau),$$

in which $\tau$ is the temperature parameter and $\hat{r}_l$ is the approximated continuous routing variable.
With the continuous routing variable, the original discrete routing in Eq. 2 can be rewritten as:

$$x_l = \hat{r}_l \cdot \mathcal{F}_l(x_{l-1}) + (1 - \hat{r}_l) \cdot x_{l-1}.$$
In this way, the dynamic routing can be realized and optimized in an end-to-end fashion.
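The soft-routed forward pass described above can be sketched as follows. This is a simplified NumPy illustration; the function name and the callable `blocks`/`routers` interfaces are our assumptions:

```python
import numpy as np

def dynamic_forward(x, blocks, routers):
    """Soft-routed inference: each block's output is blended with its input
    according to the router's continuous routing variable r in [0, 1]."""
    for block, router in zip(blocks, routers):
        r = router(x)                       # continuous routing variable for this block
        x = r * block(x) + (1.0 - r) * x    # blend block output with identity path
    return x
```

Because the blend is a differentiable function of the routing variable, gradients flow through both the blocks and the routers, which is what makes end-to-end training possible.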
3.2 Optimization with consistency and diversity
We find that the learned routers lack diversity and consistency, which should be considered in the optimization of dynamic routing networks. Based on this observation, we propose two hypotheses:
H1: For similar inputs, the routing paths should be the same or similar.
H2: The available paths should be diverse to take full advantage of the network.
In order to achieve H1, the differences between the routes of similar inputs should be small. In order to achieve H2, the differences between the routes of dissimilar inputs should be large.
During training, we treat the copies of the same sample with different augmentations as a group of similar inputs. Different groups generated by different samples are regarded as dissimilar inputs. The whole optimization is shown in Fig. 3.
Consistency of routing. In order to represent the route, we use the joint distribution of the routing variables of all blocks. Denote $R = (\hat{r}_1, \dots, \hat{r}_L)$ as the route, in which $L$ is the number of blocks. Suppose $s_i$ is the $i$-th sample in a training batch and $a_j$ is the $j$-th augmentation method. Then $R_{i,j}$ is the route of sample $s_i$ with the augmentation $a_j$. In this way, the optimization of consistency can be written as:

$$\mathcal{L}_{con} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \max\left( \left\| R_{i,j} - \operatorname{mean}(\mathbf{R}_i) \right\|_2 - m_{con},\; 0 \right),$$

in which $N$ is the batch size, $M$ is the number of augmentations, $\operatorname{mean}(\cdot)$ is the mean function, $\|\cdot\|_2$ is the L2 norm, $\max(\cdot, 0)$ denotes the hinge, $\mathbf{R}_i = \{R_{i,1}, \dots, R_{i,M}\}$ is the set of routes of sample $s_i$ under all augmentations, and $m_{con}$ is the margin for consistency.
The optimization of $\mathcal{L}_{con}$ minimizes the differences between every route and the mean route, which can be regarded as enforcing the consistency of a group of similar inputs.
Diversity of routing. Similarly, the optimization of diversity can be written as:

$$\mathcal{L}_{div} = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{k=i+1}^{N} \max\left( m_{div} - \left\| \operatorname{mean}(\mathbf{R}_i) - \operatorname{mean}(\mathbf{R}_k) \right\|_2,\; 0 \right),$$

in which $m_{div}$ is the margin for diversity.
For every pair of groups, the mean route is optimized to maximize the differences. In this way, the route of different groups can be dissimilar and the diversity of routing can be guaranteed.
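The two hinge losses can be sketched in NumPy as follows, under the assumption that each route is flattened into a vector of routing variables; the exact reduction over pairs may differ from the paper's implementation:

```python
import numpy as np

def consistency_loss(routes, margin=0.2):
    """routes: (N, M, D) routing vectors for N samples, M augmentations each.
    Hinge on the distance of each route to its group's mean route."""
    mean = routes.mean(axis=1, keepdims=True)        # (N, 1, D) per-group mean route
    dist = np.linalg.norm(routes - mean, axis=-1)    # (N, M) distances to the mean
    return float(np.maximum(dist - margin, 0.0).mean())

def diversity_loss(routes, margin=0.5):
    """Hinge that pushes the mean routes of different groups apart."""
    mean = routes.mean(axis=1)                       # (N, D) group centers
    n = len(mean)
    losses = []
    for i in range(n):
        for k in range(i + 1, n):
            d = np.linalg.norm(mean[i] - mean[k])
            losses.append(max(margin - d, 0.0))      # zero once centers are far enough
    return float(np.mean(losses))
```

Note that both losses vanish once the margin is satisfied, so neither term collapses all routes to a single point nor forces them infinitely apart.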
Analysis of the effects of consistency and diversity. As shown in Fig. 4, the optimization of $\mathcal{L}_{con}$ pulls the routes of similar inputs together, while the optimization of $\mathcal{L}_{div}$ pushes the centers of groups of dissimilar inputs apart. The effects of consistency and diversity can be summarized into two parts. First, the diversity of routing paths enhances the potential capacity of the network, because the inference paths and the representation space of the network are enlarged with more available routing paths, compared with a standard network with a single inference path. Second, the consistency of routing paths encourages similar paths for similar inputs, so that each consistent path is only in charge of a specific group of similar inputs. In this way, the learning of consistent routing paths actually reduces the difficulty of optimization. The corresponding ablation studies can be seen in Sec. 4.2.
3.3 Customizable computational cost controlling
As described in the introduction, present dynamic networks lack the ability to control inference time: the routers are learned without any constraint except accuracy. With present dynamic routing networks, it is hard to obtain, on demand, a light-weight dynamic routing network with only a slight performance drop or a heavier network with better performance. In this way, customizable and flexible control provides the opportunity to balance the trade-off between cost and accuracy.
In order to control the computational cost, its optimization should be differentiable. Fortunately, we can directly optimize the computational cost through the proposed routing mechanism. Suppose the computational cost of the $l$-th block is $C_l$ and the cost of the skip connection is 0. Then the total computational cost of the network can be written as:

$$\mathcal{L}_{cost} = \sum_{l=1}^{L} \hat{r}_l \cdot C_l,$$

in which $\hat{r}_l$ is the continuous routing variable. Since $C_l$ is the computational cost of the $l$-th block, it is constant and can be obtained via a look-up table. Because the cost of every block is constant, $\mathcal{L}_{cost}$ is differentiable and can be directly optimized.
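Since the per-block costs are constants, the cost loss reduces to a weighted sum, which can be sketched as follows (the `block_costs` values would come from a FLOPs look-up table; names are illustrative):

```python
def cost_loss(routing_vars, block_costs):
    """Expected computational cost of one inference: each block's constant cost,
    weighted by its continuous routing variable (skip connections cost nothing)."""
    return sum(r * c for r, c in zip(routing_vars, block_costs))
```

Because each term is linear in the routing variable, the gradient with respect to $\hat{r}_l$ is simply the block cost $C_l$, so minimizing this loss directly discourages the routers from executing expensive blocks.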
The advantages of the customizable computational cost control are twofold. First, it is differentiable and can be directly optimized in an end-to-end fashion together with the other losses. Second, by adjusting the weight of $\mathcal{L}_{cost}$, we can customize the learned network with different computational costs and performances. Moreover, the adjusting scale is continuous and order-preserving, i.e., a larger weight of $\mathcal{L}_{cost}$ corresponds to a lighter model and a smaller weight corresponds to a larger model.
3.4 Training and Testing
With the proposed modules, we show the training and testing settings in this section. In the training phase, the whole optimization target of our method is:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_{con} \mathcal{L}_{con} + \lambda_{div} \mathcal{L}_{div} + \lambda_{cost} \mathcal{L}_{cost} + \lambda_{reg} \mathcal{L}_{reg},$$

in which $\mathcal{L}_{ce}$ is the cross-entropy loss used for classification, $\lambda_{con}$, $\lambda_{div}$, $\lambda_{cost}$, and $\lambda_{reg}$ are the hyper-parameters, and $\mathcal{L}_{reg}$ is the regularization term.
The gradient of the router is continuous in the training stage, but the routing is discrete at test time, so there is a gap between training and testing. In order to reduce this gap, we can optionally finetune the network with the weights of the routers frozen.
In the testing phase, the routing is no longer continuous. The inference of our network can be written as:

$$x_l = \begin{cases} \mathcal{F}_l(x_{l-1}), & \hat{r}_l > 0.5, \\ x_{l-1}, & \text{otherwise}. \end{cases}$$
In this way, the inference of the network is dynamic and the computational cost and energy can be reduced.
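Test-time inference with binarized routing decisions can be sketched as follows. The 0.5 threshold corresponds to choosing the higher of the two router probabilities; function and variable names are illustrative:

```python
def dynamic_forward_test(x, blocks, routers, threshold=0.5):
    """Test-time inference: the continuous routing variable is binarized, so each
    block is either fully executed or skipped via the identity connection."""
    executed = 0
    for block, router in zip(blocks, routers):
        if router(x) > threshold:
            x = block(x)
            executed += 1
        # else: skip connection, x passes through unchanged at zero cost
    return x, executed
```

The returned `executed` count is the length of the routing path, which is the quantity reported in the cost ablations.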
4 Experiments

To construct a dynamic network, we apply the routing mechanism and the instance-specific path optimization method to a series of network architectures, and evaluate the results on four widely used classification datasets. In Sec. 4.1, we first describe the experimental details and optimization settings. Ablation studies on the effectiveness of the proposed diversity-based optimization, consistency-based optimization, and computational cost controlling method are shown in Sec. 4.2. In Sec. 4.3, comparisons with state-of-the-art methods are presented. In Sec. 4.4, we show extensive quantitative studies to elaborate on the advantages of our router mechanisms.
4.1 Experimental Details
Datasets. We evaluate our method on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet. The CIFAR-10/100 datasets consist of 60,000 colored images with a size of 32×32, in which 50,000 images are used for training and the other 10,000 images are used for testing. Tiny-ImageNet contains 100,000 training images and 10,000 testing images annotated with 200 classes. For the SVHN dataset, 73,257 training images and 26,032 testing images are used. We use classification accuracy (Top-1) as our evaluation metric. For data augmentation, we follow the same setting as [11, 5].
Models. For CIFAR-10 and CIFAR-100, we use ResNet as the base model, including ResNet-32, ResNet-44, ResNet-56, ResNet-74, and ResNet-110. As for Tiny-ImageNet and SVHN, ResNet-50 and ResNet-110 are adopted as the base models, respectively.
Optimization settings. During training, we use SGD as our optimizer with a momentum of 0.9 and a weight decay of 1e-4. The initial learning rate is set to 0.1, and a multi-step learning rate scheduler decays it at epochs 150 and 200. As for the loss setting, we set the weights of the consistency and diversity losses to 0.2 on CIFAR and SVHN and to 0.4 on Tiny-ImageNet; the remaining two loss weights are set to 0.05 and 0.5, respectively. To control the computational cost precisely, the weight of the cost controlling loss is adjusted adaptively. Besides, the margins for consistency and diversity are set to 0.2 and 0.5, respectively.
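The multi-step schedule described above can be expressed as a simple function. This is a sketch assuming a decay factor of 0.1 at each milestone, which is the common convention; the text does not state the decay factor explicitly:

```python
def learning_rate(epoch, base_lr=0.1, milestones=(150, 200), gamma=0.1):
    """Multi-step schedule: start at base_lr and multiply by gamma
    at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For example, under these assumptions the learning rate stays at 0.1 for the first 150 epochs, then drops by an order of magnitude at each milestone.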
4.2 Ablation Study
In this subsection, we present quantitative results and qualitative visualizations to prove the effectiveness of the proposed diversity and consistency losses. To show that our method can control the computational cost flexibly, we also report the trade-off between accuracy and computational cost.
Research on diversity. In Sec. 3.2, we assume that the available routing paths should be diverse to take full advantage of the network and that the paths for dissimilar images should be different. In this section, we show the statistics of the routing paths to prove this hypothesis. Fig. 7 shows the number of paths during training under different margin settings. From Fig. 7 we can see that the number of routing paths increases significantly with a larger margin. The baseline network without diversity optimization only has a very limited number of routing paths, which proves that our method increases the diversity of routing paths significantly.
Fig. 5 visualizes images of different categories that share the same paths when using the diversity loss. We can see that even within the same class, the generated routing paths are still different, which further proves the effectiveness of the proposed diversity optimization.
Research on consistency. In this part, we design an experiment to illustrate the effectiveness of consistency optimization and prove the hypothesis qualitatively. We use several data augmentation methods, including random crop, horizontal flip, vertical flip, and rotation, to augment the 10,000 images of the CIFAR-10 test set. With the augmented images, the routing paths of two networks are recorded: our method with the consistency loss and a baseline without it. In this experiment, 5,342 out of 10,000 images have the same path with our method, while the variant without the consistency loss has only 2,265 images with the same routing path. Interestingly, the accuracy of our method is much higher than the baseline on the augmented images, while the accuracies of the two methods are comparable on the test set without augmentation. These findings suggest that the optimization of consistency can enhance the robustness of the network.
To better understand the effectiveness of the consistency loss, Fig. 6 shows images whose paths are consistent with the consistency loss but inconsistent without it. As shown in Fig. 6, these images exhibit three common attributes: the images in the third row are blurry, while the other images either focus on a small part of the object or contain only a small part of the scene, all of which are hard to learn. With our method, the inconsistent routing paths of these hard images become consistent, thus improving classification performance.
In order to verify the effectiveness of the proposed consistency- and diversity-based optimization as a whole, we present the numerical improvement on the Tiny-ImageNet and SVHN datasets. Table 1(a) presents the results on Tiny-ImageNet. When optimized with the diversity and consistency losses, the performance is consistently improved. As for SVHN, similar results also prove the effectiveness of the proposed consistency and diversity losses.
Research on cost controlling. Another benefit of our method is that we can control the computational cost explicitly. As discussed in Sec. 3.3, the proposed cost controlling loss allows us to balance the trade-off between accuracy and cost, and we only need to adjust the weight of the controlling loss to obtain the desired model. Tab. 2 shows the trade-off between classification accuracy, FLOPs, and average routing length on the CIFAR-10 dataset. As the routing length increases, the accuracy shows a rising trend. This proves the effectiveness of the proposed computational cost controlling method.
| Length of routing paths | 45 | 38 | 19 | 17 | 16 | 13 |
4.3 Performance Evaluation
In this subsection, we first provide the results of our method and the original full ResNet on two datasets to show that our method achieves better performance while reducing the computational cost. Then, we compare our method with state-of-the-art dynamic networks and anytime inference models. In the end, the comparison of our method with related compression methods is shown.
Comparison with the original full ResNet. We show the comparison between our method and the original ResNet in terms of accuracy and routing path length in Tab. 3. As we can see in Table 3, our method achieves higher accuracy with less computational cost in nearly all experimental settings. Moreover, at comparable accuracy, our method reduces the computational cost relative to the original network. In particular, when compared with the ResNet-110 network on the CIFAR-10 dataset, our method needs only a fraction of the blocks, i.e., a much shorter routing path, to achieve comparable results.
On the CIFAR-100 dataset, our method achieves similar results compared with the original network. It should be noted that the reduction in the number of blocks is relatively small for shallower networks. The reason for this phenomenon might be that there exists a minimal routing path length for a given dataset. The number of blocks in shallower networks is close to this minimal length, so the reduction in routing path length is limited. With a deeper network, our method can effectively reduce the computational cost.
Comparison with state-of-the-arts. In this section, we compare our method with other state-of-the-art dynamic routing methods and anytime inference networks. (The results of ACT and SACT are not reported in their original papers and are taken from prior work.) As discussed in Sec. 3.3, we can obtain a series of networks with different accuracies and computational costs by adjusting the hyperparameter of the cost controlling loss. For a fair comparison, we use FLOPs as the metric of computational cost, which is hardware-independent.
To illustrate the comparison intuitively, we divide them into three groups as shown in Fig. 8. The first group is a comparison between our method and the full ResNets. In Fig. 8(b), we compare our method with state-of-the-art dynamic routing networks, which are ACT, SACT, SkipNet, and BlockDrop. ACT and SACT are anytime inference methods. SkipNet and BlockDrop apply different mechanisms to reduce the number of blocks used for computing. From Fig. 8(b) we can observe that our model outperforms other methods in most cases.
Comparison with compression methods. Fig. 8(c) presents the results of our method and other model compression methods. Following prior work, PFEC and LCCL are used for comparison. From Fig. 8(c) we can see that our method achieves higher performance than PFEC and LCCL in most settings. Furthermore, our method does not conflict with the mechanisms of compression methods, so it can be combined with them to gain even better performance.
4.4 Analysis of Routing Module
In order to analyze the design of our routing module, we compare our method with several selection strategies and routing modules in this section. Two aspects of the routing module are discussed: the design of the router and the approximation of dynamic routing. To analyze the router design, the router in our method is compared with heuristic alternatives. For the analysis of the optimization of dynamic routing, the proposed approximation method is compared with a sampling method based on the Bernoulli distribution.
The proposed router vs. heuristics. We compare our routing strategy with different heuristic algorithms: FirstK, RandomK, and DistributedK. FirstK selects the first K blocks as the inference path, RandomK selects K blocks randomly, and DistributedK uses predefined routing paths based on classes. The results of the comparison are shown in Tab. 4(a). It is evident that our method is much better than FirstK and RandomK on different ResNet models. Compared with DistributedK, which uses predefined routing paths, our method still achieves better performance.
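The FirstK and RandomK baselines can be sketched as follows (illustrative helper names; DistributedK is omitted since its class-conditional predefined paths are not fully specified here):

```python
import random

def first_k(num_blocks, k):
    """FirstK: always execute the first K blocks."""
    return list(range(k))

def random_k(num_blocks, k, seed=0):
    """RandomK: execute K blocks chosen uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_blocks), k))
```

Both heuristics fix the path independently of the input, which is exactly what a learned router improves upon: the learned path can adapt to each image at the same budget of K executed blocks.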
Analysis of the router during training and testing. To bridge the gap between training and testing, we use a continuous approximation method to simulate discrete routing, as shown in Fig. 2(b). In this part, we first analyze the effects of router design. Router designs can be summarized into two categories: CNN routers and RNN routers. The CNN router is commonly used in other methods, but it is too expensive to obtain much speed-up. Although the RNN router is much cheaper than the CNN router, its performance is unsatisfactory. In our method, we use a fully connected layer in the routing modules, whose extra routing cost is less than the cost of a convolutional block. The comparison of different routers is shown in Table 5.
| CNN Router | RNN Router | Ours |
Second, we compare different sampling methods. Given the probability distribution vector output by each router, we investigate two different ways to route the path: sampling from the distribution or choosing the highest probability. For the first, we build a Bernoulli distribution according to the probability and sample a decision. As Tab. 4(b) shows, choosing the highest probability performs better than sampling. Therefore, our design of the router is reasonable in both the training and testing stages.
5 Conclusion

In this paper, we have proposed a novel dynamic routing method with diversity and consistency that better exploits network capacity. The inherent properties of dynamic routing, i.e., the consistency and diversity of routing mechanisms, are explored and examined. Moreover, the average computational cost can be controlled explicitly by the proposed cost controlling method. Extensive experiments and the corresponding results show that our method achieves state-of-the-art results compared with the original full network, dynamic networks, and model compression methods.
-  (2019) Once for all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791. Cited by: §2.
-  (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.
-  (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, Cited by: §2.
-  (2018) Distilling the knowledge from handcrafted features for human activity recognition. IEEE Transactions on Industrial Informatics. Cited by: §2.
-  (2019) Autoaugment: learning augmentation strategies from data. In , Cited by: §4.1.
-  (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.3.
-  (2017) Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.3.
-  (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1.
Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983. Cited by: §2, §4.3.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §4.1, §4.1.
-  (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
-  (1997) Long short-term memory. Neural computation. Cited by: §3.1.
-  (2014) Efficient feature group sequencing for anytime linear prediction. arXiv preprint arXiv:1409.5495. Cited by: §2.
-  (2017) Multi-scale dense networks for resource efficient image classification. In ICLR, Cited by: §2.
-  (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. Cited by: §1.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (2015) Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744. Cited by: §2.
-  (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §2.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
-  (2015) Tiny imagenet visual recognition challenge. CS 231N. Cited by: §4.1.
-  (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §2, §4.3.
-  (2019) Improved techniques for training adaptive deep networks. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2.
-  (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.
-  (2018) Recurrent segmentation for variable computational budgets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2.
-  (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1.
Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780. Cited by: §2.
-  (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.
-  (2011) Boosting on a budget: sampling for feature-efficient prediction. Cited by: §2.
-  (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §2.
-  (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (2016) Residual networks behave like ensembles of relatively shallow networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Cited by: §2.
-  (2018) SkipNet: learning dynamic routing in convolutional networks. In The European Conference on Computer Vision (ECCV), Cited by: §2, §4.3.
-  (2018) Skipnet: learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §4.4, Table 5.
-  (2002) Bernoulli distribution. sigma 19, pp. 20. Cited by: Table 4.
-  (2016) Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR, Cited by: §2, §4.3, §4.3, footnote 1.
-  (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, Table 4.
-  (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: §2.