Diversifying Inference Path Selection: Moving-Mobile-Network for Landmark Recognition

12/01/2019 ∙ by Biao Qian, et al.

Deep convolutional neural networks have largely benefited computer vision tasks. However, their high computational complexity limits their real-world applications. To this end, many methods have been proposed for efficient network learning and for applications on portable mobile devices. In this paper, we propose a novel Moving-Mobile-Network, named M^2Net, for landmark recognition, in which each landmark image is equipped with its geographic location. We intuitively find that M^2Net can essentially promote the diversity of the inference path (selected blocks subset) selection, so as to enhance the recognition accuracy. This intuition is realized by our proposed reward function, which takes the geo-location and landmarks as input. We also find that the performance of other portable networks can be improved via our architecture. We construct two landmark image datasets, with each landmark associated with geographic information, over which we conduct extensive experiments to demonstrate that M^2Net achieves improved recognition accuracy with comparable complexity.




1 Introduction

Deep Convolutional Neural Networks (CNNs) have largely benefited the field of pattern recognition. However, the high complexity of training CNNs limits their real-world applications. To this end, many works [5, 12, 13, 4, 3] have been proposed on network compression, which has naturally led to a surge in research [10, 26] on applying them to real-world portable mobile devices. Among these, the BlockDrop [23] module is a typical model, which dynamically selects the useful inference path for efficiency.

Figure 1: (a)(b) Examples of similar landmarks with different geographic information; the existing compression networks are likely to detect them as the same landmark. (c) Traditional portable networks, such as ShuffleNet [26] and HetConv [18], only generate a single inference path. (d) M^2Net combines geographic information with visual information to generate diverse inference paths for each input.

However, current models fail in the presence of visually similar landmarks with different geographic information, as illustrated in Fig.1, where these landmarks belong to different small classes within the same large category, e.g., towers from different cities. The visual differences between them are therefore challenging to distinguish even when using powerful deep network representations [10, 26, 18, 23]. In other words, these mobile networks fail to consider the mobility of mobile devices, whose geo-locations change as they move.

Figure 2: Overview of M^2Net, where the light blue blocks are kept, while the red ones are dropped. For the policy network, geographic information in combination with the landmark images is exploited to generate diverse policies that determine which blocks are kept, which dynamically promotes the diversity of the inference path selection in the recognition network. Specifically, the information from the geographic location and the landmark image of the mobile device is fused. Then, an N-dimension Bernoulli distribution is utilized to make the diverse decision that selects a block subset as the inference path. The reward function encourages the selection of an inference path with a minimal number of blocks and diverse output policies under correct predictions. p: Policy network. a: Output policy. b: The prediction accuracy of the recognition network. c: The reward function. The circle module reflects the mutual promotion among policy diversity (p, a), recognition accuracy (b), and the reward function (c), which is discussed in detail later in the paper.


To address this problem, we propose a novel moving-mobile-network, named M^2Net, which exploits geographic information in combination with visual information to improve the landmark recognition accuracy. Our basic idea is to learn a policy network based on reinforcement learning [19], which dynamically selects the layers or blocks in the network to construct the inference path. We choose Residual Networks (ResNet) [6] as our backbone networks, due to their robustness to layer removal [20].

We remark that M^2Net also belongs to the BlockDrop family, yet with non-trivial extensions to tackle the mobility issues of portable networks. The following observations are drawn:

  1. Diverse policies, i.e., more diverse inference paths, can improve the accuracy of M^2Net.

  2. When M^2Net makes a correct prediction, the policy network is positively rewarded. Thus the improved accuracy provides more rewards for the policy network.

  3. During the training process, to get the largest reward value, the policy network aims to make a sparse and unique policy, resulting in diverse inference paths for M^2Net.

For ease of understanding, we illustrate the major framework of M^2Net in Fig.2, which can promote diverse inference paths in the network, improving the recognition accuracy. M^2Net also obtains comparable computational efficiency, because the policy network produces a dynamic output for each landmark image. The policy network is trained on both the landmark images and the geographic locations, to encourage policy selection with less performance loss. The policy network and the recognition network are jointly learned to improve the accuracy while achieving feasible computational efficiency.

Figure 3: (a) Encoding the geographic location as an 8-dimension vector helps acquire more information from the location, and a Multi-Layer Perceptron (MLP) is adopted to extract the location features. (b) The policy network fuses geographic information and landmark images, using the weights of the MLP and of ResNet-10, a fusion factor, and a sigmoid function. B represents an N(N=3)-dimension Bernoulli distribution. Combining geographic information and landmark images can lead to more diverse policies.

Our major contributions are summarized as follows:

  • We propose M^2Net, which exploits geographic information in combination with the landmark image to learn a policy network that enables diverse inference path selection, improving the recognition accuracy on mobile devices. In addition, we introduce a novel reward function for the policy network to promote the generation of diverse policies.

  • For each landmark, we dynamically choose blocks, i.e., a subset of the recognition network that forms the inference path, to make the inference efficient.

  • As no existing research on compression or mobile networks investigates image recognition with geo-location information, we create two such datasets, named Landmark-420 and Landmark-732, for evaluation. The experimental results show that M^2Net achieves an accuracy gain and a speedup on average compared to the original ResNet-47 on Landmark-420, and outperforms state-of-the-art methods [23, 10, 26, 18] in terms of accuracy, while achieving a feasible computational complexity.

2 Related work

To put our findings in context, we discuss related work on deep network compression.

2.1 Dynamic Layer Selection

Several methods have been proposed to dynamically drop the residual layers in residual networks. Specifically, [20] found that residual networks can be regarded as a collection of short paths, and that removing a single block has negligible effects on network performance. [11] introduced a scaling factor to scale the residual blocks and added a sparsity constraint on these factors during the training process, removing the blocks with small factors. SkipNet [22] proposes gating networks that decide whether to skip the corresponding layer during inference, where the outputs of previous layers are used as the inputs of the gating networks. Unlike SkipNet, which assigns a gating module to each residual layer, the BlockDrop [23] module uses a policy network based on reinforcement learning to output a set of binary decisions, one for each block in a pre-trained ResNet, where the policy network takes images as its inputs.

Orthogonal to the above, M^2Net exploits both geographic information and visual information to learn a policy network that achieves diverse policies, which can improve the performance.

2.2 Model Compression

Many methods aim to accelerate the network computation and reduce the model size. On one hand, some of the techniques aim to simplify the existing networks, such as network pruning [5, 13, 7, 8, 2], quantization [4, 1, 21, 16, 24], low-rank factorization [3, 15] and knowledge distillation [9, 25]. On the other hand, efficient network architectures are designed to train compact neural networks, such as MobileNet [10], ShuffleNet [26], and HetConv [18]. These methods obtain compact networks after training; however, during inference, the architectures are kept unchanged for all images.

Unlike the above, M^2Net focuses on dynamically adjusting the network architecture for different input images during inference, which helps allocate computing resources reasonably.

3 M^2Net: Moving-Mobile-Network

Formally, as shown in Fig.2, the policy network of M^2Net is composed of two subnetworks, one for landmark images and one for geographic locations. Given the inputs of the two subnetworks, i.e., M visual images and their M geographic locations, where M is the number of landmark samples, the policy network aims at generating M binary policy vectors, each of length N, where N is the number of residual blocks in the recognition network.

Before detailing the architecture of M^2Net, we first discuss how to encode its mobility, that is, geographic information, in the next section.

Figure 4: An example of unique policy. The merge operation combines identical elements: it maps the output policies of the policy network to the corresponding set of unique policies. The diversity of (a) is 3.

3.1 Encoding Geographic Location

Geographic location can be encoded via the following formats: Degrees Minutes Seconds, Decimal Minutes, and Decimal Degrees. N, S, E or W represent North, South, East or West, respectively.

We opt for the first format due to its higher dimension, while replacing N and S, or E and W, with numeric flags. This can effectively extract richer information from the input. As shown in Fig.3(a), we construct an 8-dimensional vector. We observe that Degrees, Minutes and Seconds have different scales and differ greatly in their sensitivity to change. To address this, we adopt a Multi-Layer Perceptron (MLP) to extract the location features, with different layers for the various scales.
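As a rough illustration of the encoding described above, the following sketch converts a latitude/longitude pair given in decimal degrees into an 8-dimensional Degrees-Minutes-Seconds vector. The ordering of the eight entries and the flag values (1.0 for N/E, 0.0 for S/W) are assumptions made for illustration; the names `to_dms` and `encode_location` are hypothetical.

```python
def to_dms(decimal_degrees):
    """Split the absolute value of a decimal-degree angle into (deg, min, sec)."""
    value = abs(decimal_degrees)
    degrees = int(value)
    minutes_full = (value - degrees) * 60.0
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60.0
    return float(degrees), float(minutes), seconds


def encode_location(latitude, longitude):
    """Return an assumed 8-dimensional encoding of a geographic location.

    Assumed layout (not confirmed by the paper):
    [lat_deg, lat_min, lat_sec, lat_flag, lon_deg, lon_min, lon_sec, lon_flag]
    with flag 1.0 for North/East and 0.0 for South/West.
    """
    lat_flag = 1.0 if latitude >= 0 else 0.0
    lon_flag = 1.0 if longitude >= 0 else 0.0
    return [*to_dms(latitude), lat_flag, *to_dms(longitude), lon_flag]


# Example: a location in the northern and eastern hemispheres.
vector = encode_location(48.8584, 2.2945)
```

Because degrees, minutes and seconds live on different scales, a vector like this would normally be passed to a learned feature extractor (the MLP in the text) rather than used directly.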

3.2 Diversity for MNet

3.2.1 Unique Policy and Diversity

As M^2Net selects the inference path via the policy network, we introduce the set of unique policies, which is obtained by merging identical elements in the set of output policies, as shown in Fig.4.
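The merge operation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `unique_policies` and the toy policy values are made up for the example.

```python
def unique_policies(policies):
    """Merge identical binary policies, keeping the first-seen order."""
    seen = set()
    unique = []
    for policy in policies:
        key = tuple(policy)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(list(policy))
    return unique


# Toy example: M = 5 output policies over N = 3 blocks (illustrative values).
policies = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
unique = unique_policies(policies)
diversity = len(unique)  # number of unique policies
```

Here five output policies collapse to three unique ones, so the number of unique policies for this toy set is 3.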

Diversifying inference path selection via geo-information. We measure the diversity as the number of unique policies, i.e., the size of the unique policy set, for M samples. In the toy example (M=5) of Fig.4(a), the policy network outputs five policies, of which three are unique, so the policy diversity is 3. As shown in Fig.3(b), consider any two landmarks with large visual similarity but highly distinct locations (or, vice versa, similar locations yet small visual similarity). Even when the outputs of one subnetwork are close for the two landmarks, the outputs of the other subnetwork are usually highly different, which makes the fusion results, i.e., the outputs of the sigmoid function, diverse.

To illustrate these intuitions, we show some visualization results in Fig.5 on the diverse inference path selection of M^2Net. The diverse policies of M^2Net, shown in Fig.5(a) and (b), offer diverse inference paths for landmark recognition with the help of both geographic information and landmarks, which improves the performance. In contrast, Fig.5(c) shows that the BlockDrop module generates only one inference path.

We define the uniqueness, which describes the difference between the policies output by the policy network for the M samples, as a vector with one entry per policy, computed from the normalized Hamming distance between two binary vectors. The larger the uniqueness of a policy, the larger the difference between it and the other M-1 policies, which indicates that it is more likely to become a unique policy. In addition, during the training process, we aim at promoting the diversity of output policies by increasing the uniqueness.
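To make the uniqueness measure concrete, the sketch below computes one score per policy from normalized Hamming distances. How the distances to the other M-1 policies are aggregated is an assumption here (a plain average); the function names are hypothetical.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("policies must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def uniqueness(policies):
    """Assumed aggregation: mean normalized Hamming distance to the other M-1 policies."""
    m = len(policies)
    if m < 2:
        return [0.0] * m
    scores = []
    for i, policy in enumerate(policies):
        total = sum(
            normalized_hamming(policy, other)
            for j, other in enumerate(policies)
            if j != i
        )
        scores.append(total / (m - 1))
    return scores


# Two identical policies and one that differs in every position.
scores = uniqueness([[1, 0, 1], [1, 0, 1], [0, 1, 0]])
```

Under this assumption a policy that duplicates others gets a low score, while a policy far from all others gets a score close to 1, which matches the role the text assigns to uniqueness.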

To this end, we propose our reward function for the policy network, which will be discussed in Section 3.3.

Figure 5: An overview of policy diversity. (a) Similar geographic information with different landmarks. (b) Similar landmarks with different geographic information. (c) Similar landmarks without geographic information. (a)(b) and (c) come from M^2Net and BlockDrop, respectively. Compared with (c), our method (a)(b) produces more diverse policies (2 vs 1), which can improve the accuracy. In addition, (a)(b) show the difference between the inference paths, where colored boxes mark the positions at which the paths differ.

3.3 Policy Network based on Reward Function

Motivated by [22] and [23], we apply reinforcement learning to train our policy network. As shown in Fig.2 and Fig.3(b), given a geographic location and a landmark image, together with a pre-trained recognition network with N residual blocks, a policy can be seen as an N-dimensional Bernoulli distribution parameterized by s, where s is the fusion result of the policy network. The fusion result is obtained by applying the sigmoid function to the outputs of the two subnetworks (with their respective weights), combined through the fusion factor. s is the output of the policy network, and the k-th entry of s indicates the probability that the k-th residual block is selected. The policy is obtained based on s, where 1 and 0 indicate selecting and skipping the corresponding block, respectively.
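The following sketch shows one way the fusion and the Bernoulli policy could be wired together. The exact fusion form is an assumption: a convex combination of the two subnetwork outputs weighted by a factor `alpha` (0.7 is the value reported in the experiments), followed by a sigmoid. Which subnetwork receives `alpha` is also assumed, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(image_logits, location_logits, alpha=0.7):
    """Assumed fusion: s_k = sigmoid(alpha * image_k + (1 - alpha) * location_k)."""
    return [
        sigmoid(alpha * v + (1.0 - alpha) * g)
        for v, g in zip(image_logits, location_logits)
    ]


def sample_policy(s, rng):
    """Draw a binary policy: block k is kept (1) with probability s_k."""
    return [1 if rng.random() < p else 0 for p in s]


def policy_probability(policy, s):
    """Likelihood of a policy under the N-dimensional Bernoulli distribution."""
    prob = 1.0
    for u, p in zip(policy, s):
        prob *= p if u == 1 else (1.0 - p)
    return prob


s = fuse([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])   # every block kept with probability 0.5
policy = sample_policy(s, random.Random(0))  # seeded generator for reproducibility
```

Two inputs with near-identical image features but different location features produce different values of s under this form, which is the mechanism the text credits for more diverse policies.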

Based on the above, we define the reward function to encourage more unique policies and minimal block usage, along with correct predictions. Our reward function, Eqn.(5), depends on the percentage of reserved blocks, which reflects the sparsity of the pre-trained ResNet, and on the uniqueness, which reflects the difference from other policies and promotes diverse policies. When a prediction is correct, a large positive reward is offered to the policy network, while a penalty is applied to incorrect predictions. Two weights balance the terms of the reward.
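The ingredients of the reward can be illustrated with a small sketch. The functional form below is an assumption modelled on BlockDrop-style rewards (a sparsity term that grows as fewer blocks are kept, plus a weighted uniqueness term, and a fixed penalty for wrong predictions); it is not the paper's Eqn.(5), and the parameter names `w_sparsity`, `w_unique` and `penalty` are hypothetical.

```python
def reward(policy, correct, uniqueness_score,
           w_sparsity=1.0, w_unique=1.0, penalty=1.0):
    """Illustrative reward: favour correct, sparse and unique policies.

    policy           -- binary list, 1 = block kept, 0 = block dropped
    correct          -- whether the recognition network predicted correctly
    uniqueness_score -- how different this policy is from the others, in [0, 1]
    """
    if not correct:
        return -penalty  # incorrect predictions are penalised
    kept_ratio = sum(policy) / len(policy)  # percentage of reserved blocks
    sparsity_term = 1.0 - kept_ratio ** 2   # assumed BlockDrop-style form
    return w_sparsity * sparsity_term + w_unique * uniqueness_score


r_sparse = reward([1, 0, 0, 0], correct=True, uniqueness_score=0.5)
r_dense = reward([1, 1, 1, 0], correct=True, uniqueness_score=0.5)
r_wrong = reward([1, 1, 1, 1], correct=False, uniqueness_score=0.9, penalty=5.0)
```

With this form a correct prediction that keeps fewer blocks, or that differs more from the other policies, receives a larger reward, while any incorrect prediction receives the same negative value regardless of the policy.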

To learn the policy network, we maximize the expected reward over policies drawn from the policy distribution, following [23] (Eqn.(6)).


3.3.1 Sparsity of the Inference Path

The sparsity term in Eqn.(5) enlarges the selection space of the inference path, so that the selected paths can be diversified. It is also closely related to the complexity of the selected inference path. To study this further, we define a measure of the sparsity of the selected inference path as the proportion of residual blocks that are kept.

For example, the proportion is 15/21 when 15 of 21 blocks are kept. This proportion also reflects the computational complexity for each image during inference, where a smaller value means lower complexity. For M^2Net, we adopt the average over the inference paths to measure the efficiency.

We also observe that the sparsity has a non-negligible influence on the maximum number of unique policies, which is the number of ways to choose the kept blocks among the N blocks. Fig.6 describes the relationship between them. The maximum number of unique policies differs greatly across sparsity values, and it peaks when roughly half of the blocks are kept.

Figure 6: The maximum number of unique policies under different sparsity values. Within a certain range, the upper bound of the available policies increases as the sparsity value decreases. Besides, the upper bound is much higher around the peak than elsewhere.
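The upper bound on the number of unique policies can be checked numerically. The sketch assumes the bound counts binary policies of length N with exactly k kept blocks, i.e., the binomial coefficient C(N, k); the helper names are hypothetical.

```python
from math import comb


def max_unique_policies(num_blocks, num_kept):
    """Number of distinct binary policies with exactly num_kept ones."""
    return comb(num_blocks, num_kept)


def peak_kept_count(num_blocks):
    """Smallest number of kept blocks that maximises the bound."""
    return max(range(num_blocks + 1), key=lambda k: comb(num_blocks, k))


bound_15_of_21 = max_unique_policies(21, 15)  # 15 of 21 blocks kept
bound_peak = max_unique_policies(21, peak_kept_count(21))
```

For 21 blocks the bound at 15 kept blocks is 54264, while the peak, reached when about half of the blocks are kept, is 352716. This is why reducing the kept ratio from near 1 toward the middle of the range enlarges the space of possible policies.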

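3.4 Learning Parameters of M^2Net

To learn the weights of the two subnetworks of the policy network, we compute the gradients of the objective, as motivated by [23]. The gradients weight the log-likelihood gradient of the sampled policy by the difference between its reward and the reward of the maximally probable configuration, i.e., the policy whose entry is 1 if the corresponding selection probability exceeds 0.5, and 0 otherwise [17]. The weights of the policy network comprise the weights of the MLP and of ResNet-10.

We can further derive the gradients of the objective with respect to the weights of the MLP and of ResNet-10 by the chain rule.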
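The self-critical policy gradient can be sketched with respect to the fusion result s; propagating it further into the MLP and ResNet-10 weights would be handled by automatic differentiation in practice. This is a simplified illustration under the Bernoulli policy, with hypothetical function names and made-up reward values.

```python
def greedy_policy(s):
    """Maximally probable configuration: keep block k iff s_k > 0.5."""
    return [1 if p > 0.5 else 0 for p in s]


def log_prob_grad(policy, s):
    """Gradient of the Bernoulli log-likelihood with respect to each s_k.

    d/ds_k [u_k log s_k + (1 - u_k) log(1 - s_k)] = u_k / s_k - (1 - u_k) / (1 - s_k)
    """
    return [u / p - (1 - u) / (1.0 - p) for u, p in zip(policy, s)]


def policy_gradient(policy, s, sampled_reward, greedy_reward):
    """Scale the log-likelihood gradient by the advantage over the greedy baseline."""
    advantage = sampled_reward - greedy_reward
    return [advantage * g for g in log_prob_grad(policy, s)]


s = [0.5, 0.5]
sampled = [1, 0]
# Illustrative rewards: the sampled policy did better than the greedy baseline.
grad = policy_gradient(sampled, s, sampled_reward=1.0, greedy_reward=0.5)
```

A positive advantage pushes s toward the sampled policy (raising the probability of kept blocks and lowering that of dropped ones), and a negative advantage pushes it away. Using the greedy policy's reward as the baseline reduces the variance of the estimate without introducing a learned critic.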


3.5 More Insights among Reward Function, Policy Diversity and Recognition Accuracy

As shown in Fig.2, the reward function, policy diversity and recognition accuracy of M^2Net inherently promote each other. To better understand this, we discuss the relationship between each pair of them below:

  • Policy Diversity and Recognition Accuracy:

    For M^2Net, the multiple output policies of the policy network select diverse inference paths in the recognition network. Besides, geographic information together with visual information is exploited to increase policy diversity. Combining Fig.7 with Fig.10(d) shows that M^2Net achieves larger policy diversity and higher recognition accuracy than the model without geographic information, implying that diverse policies can improve the recognition accuracy.

  • Recognition Accuracy and Reward Function:

    In Eqn.(5), when a prediction is correct, the reward function generates a large reward value. The higher the recognition accuracy, the more correct predictions there are, resulting in more diverse reward values. This indicates that higher recognition accuracy provides diverse reward values for the policy network.

  • Reward Function and Policy Diversity:

    To maximize Eqn.(6), the reward function in Eqn.(5) encourages a smaller sparsity value and a diverse policy set. We also find that, within a certain range, the smaller the sparsity value, the greater the maximum number of unique policies, as shown in Fig.6. Thus the selection set of the output policy is expanded, leading to larger possibilities for diverse policies. This naturally indicates that the reward function can encourage larger policy diversity.

Figure 7: Policy diversity. (a)(b) show the policy diversity of the policy network during the finetuning process. M^2Net obtains more diverse policies than Remove-MLP and Remove-U as the number of retained blocks decreases.

4 Experiment

In this section, we experimentally validate the effectiveness of M^2Net, including comparisons with state-of-the-art methods and comprehensive ablation studies.

4.1 Experiment Setup

4.1.1 Datasets

We construct two landmark classification datasets, Landmark-420 and Landmark-732, where each image is picked from Google-Landmark-v2 [14], which contains 5M images labeled for 200k unique landmarks ranging from artificial buildings to natural landscapes. We show examples in Fig.8. Each landmark image comes with geographic information represented by latitude and longitude. The geographic location is encoded as an 8-dimension vector, as stated in Section 3.1.

In summary:

  • The Landmark-420 dataset consists of 165,000 color images, with 150,000 labeled landmark images from 420 classes for training and 15,000 for testing.

  • The Landmark-732 dataset consists of 380,000 labeled training landmark images across 732 categories and 40,000 images for testing.

Figure 8: Examples of landmark images with geographic information from our datasets, which contain artificial buildings and natural landscapes.

Layer | Input | Output
FC1   | 8     | 128
FC2   | 128   | 256
FC3   | 256   | 256
FC4   | 256   | 128
FC5   | 128   | K

Table 1: Multi-Layer Perceptron (MLP) architecture dealing with geographic information. K is set to 9, 15 and 21 for the different pretrained ResNets, respectively.

4.1.2 Policy Network Architecture

Our policy network combines geographic locations with landmark images through two subnetworks, as shown in Fig.3(b). For geographic locations, a 5-layer Multi-Layer Perceptron, as shown in Table 1, is adopted to extract the location features. For landmark images, we utilize a ResNet variant named ResNet-10, as shown in Table 2, to extract the visual features, and obtain the feature vectors via a fully connected layer [23]. To exploit the image features and the corresponding geographic location, we fuse the output vectors of the two subnetworks with a scaling factor that balances them. In our experiments, we empirically set this fusion factor to 0.7.
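As a quick sanity check on the size of the location branch, the sketch below lists the layer shapes from Table 1 and counts the parameters. Including a bias term per output unit is an assumption, and the helper names are hypothetical.

```python
def mlp_layer_shapes(num_blocks):
    """(input, output) sizes of FC1..FC5 from Table 1, with K = num_blocks."""
    sizes = [8, 128, 256, 256, 128, num_blocks]
    return list(zip(sizes[:-1], sizes[1:]))


def mlp_parameter_count(num_blocks, with_bias=True):
    """Total weights (and, by assumption, biases) of the 5-layer MLP."""
    total = 0
    for fan_in, fan_out in mlp_layer_shapes(num_blocks):
        total += fan_in * fan_out
        if with_bias:
            total += fan_out
    return total


shapes = mlp_layer_shapes(15)        # K = 15
params = mlp_parameter_count(15)     # weights plus biases
```

For K = 15 this gives 134,799 parameters under the bias assumption, so the location branch is a small addition relative to a convolutional image branch.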

Block          | Layer      | Filter | Stride
               | Conv2d     |        | 2
               | MaxPool2d  |        | 2
Residual block | Conv2d     |        | 1
               | Conv2d     |        | 1
Residual block | Conv2d     |        | 2
               | Conv2d     |        | 1
               | Downsample |        | 2
Residual block | Conv2d     |        | 2
               | Conv2d     |        | 1
               | Downsample |        | 2
Residual block | Conv2d     |        | 2
               | Conv2d     |        | 1
               | Downsample |        | 2
               | AvgPool2d  |        | 4
               | Linear     | K      | -

Table 2: Residual network architecture extracting landmark image features. K is set to 9, 15 and 21 for the three pretrained ResNets, respectively.

4.1.3 Recognition Network

We adopt three variants of ResNet [6], named ResNet-29, ResNet-47 and ResNet-65, which start with a convolutional layer followed by 9, 15 and 21 residual blocks, respectively, organized into three groups. Down-sampling layers are evenly inserted into them. To match our datasets, we append a fully connected layer with 420/732 neurons at the end of the network. Each of the residual blocks adopts the bottleneck design [6], which includes three convolutional layers. To obtain a pretrained ResNet, we train the three ResNets from scratch on Landmark-420 and Landmark-732.
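The network names are consistent with counting one initial convolution, three convolutions per bottleneck block, and one fully connected layer. The small check below makes that arithmetic explicit; this reading of the naming is an inference from the description, and `resnet_depth` is a hypothetical helper.

```python
def resnet_depth(num_blocks, convs_per_block=3):
    """Layer count: initial conv + bottleneck convs + final fully connected layer."""
    return 1 + convs_per_block * num_blocks + 1


depths = [resnet_depth(n) for n in (9, 15, 21)]
```

With 9, 15 and 21 residual blocks this yields depths of 29, 47 and 65, matching ResNet-29, ResNet-47 and ResNet-65.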

Method              | Landmark-420 (ResNet-29) | Landmark-420 (ResNet-47) | Landmark-732 (ResNet-47)
MobileNet [10]      |                          |                          |
ShuffleNet(G2) [26] |                          |                          |
ShuffleNet(G4) [26] |                          |                          |
HetConv(P2) [18]    |                          |                          |
HetConv(P4) [18]    |                          |                          |
BlockDrop [23]      |                          |                          |
M^2Net              | 78.7%                    | 83.6%                    | 83.8%

Table 3: Accuracy comparison with the state-of-the-art methods.

4.2 Comparison with the State-of-the-Arts

First of all, we compare the proposed M^2Net with several compressed network models:

  • Baseline: it includes two variants of ResNet [6], named ResNet-29 and ResNet-47.

  • MobileNet [10]: it introduces depth-wise separable convolutions, i.e., a depth-wise convolution followed by a 1×1 point-wise convolution, to speed up convolutional computation.

  • ShuffleNet [26]: based on MobileNet, it proposes a group convolution and a channel shuffle operation to replace the 1×1 convolution.

  • HetConv [18]: it proposes to replace some of the 3×3 kernels with 1×1 kernels to speed up the convolution operation.

  • BlockDrop [23]: it proposes a policy network based on reinforcement learning to output a set of binary decisions for dynamically dropping blocks in a pre-trained ResNet, where the policy network takes images as its inputs.

We implement the methods with the backbone network ResNet on Landmark-420 and Landmark-732. In particular, for ShuffleNet, G2 and G4 denote that the number of convolution groups is 2 and 4, respectively. For HetConv, P2 and P4 indicate that the proportion of 3×3 kernels in a convolutional filter is 1/2 and 1/4, respectively.

Figure 9: Comparison of FLOPs with the state-of-the-art methods on Landmark-420. Our M^2Net (cyan) outperforms other techniques with comparable computational complexity.

Table 3 presents the results on Landmark-420 and Landmark-732. The results show that M^2Net outperforms other methods by large margins. We observe that our best model achieves an accuracy gain over the baseline. When the recognition network changes from ResNet-29 to ResNet-47, our advantage is more obvious. This is because the higher number of blocks in ResNet-47 offers more possible options for the inference path. Thus, the policy network can generate more unique policies, and more diverse inference paths can be constructed.

Fig.9 presents the comparison of FLOPs with the state-of-the-art methods on Landmark-420. With ResNet-47, M^2Net obtains computational complexity comparable to MobileNet at the same level of accuracy. For some samples, M^2Net can even achieve faster inference than MobileNet. With ResNet-47, M^2Net achieves better accuracy than HetConv at the same level of computational efficiency. Compared to BlockDrop, M^2Net achieves higher computational efficiency with a larger variance, for example with ResNet-29. Different from the static methods (MobileNet, ShuffleNet, HetConv), the computational cost of M^2Net changes dynamically for each input, as shown in Fig.9, which indicates that M^2Net can allocate computing resources reasonably.

4.3 Ablation Study

To better validate each module of M^2Net, we conduct a series of ablation studies in the next sections.

4.3.1 With MLP vs Without MLP

We explore the effect of geographic information in the policy network by studying the effect of the MLP. Without the MLP, the input of the policy network only contains images; we denote this variant as Remove-MLP for simplicity.

The orange curves in Fig.7 and Fig.10 present the results of Remove-MLP. From Fig.7, we observe that the policy diversity of M^2Net is much higher than that of Remove-MLP, implying that combining geographic information with visual information can increase the number of unique policies. Besides, in Fig.10(c)(d), M^2Net achieves a large accuracy gain over Remove-MLP and keeps a stable accuracy (83% and 85%). We therefore conclude that diverse policies contribute to improving the recognition accuracy.

In addition, Fig.10(c)(d) show that the accuracy of Remove-MLP drops sharply as the number of retained blocks decreases, while that of M^2Net can even improve, which means that M^2Net keeps a good balance between accuracy and computational efficiency. In particular, Fig.10(c)(d) show that the accuracy curves peak at intermediate numbers of retained blocks, where the selection space of the output policy is largest, as shown in Fig.6.

Figure 10: Comparison of test accuracy. (a)(b) show the accuracy comparison when training the policy network. (c)(d) show the relationship between the number of blocks retained on average and the test accuracy during finetuning. A smaller number of retained blocks means less computational cost. We observe that M^2Net (blue) achieves clear accuracy gains compared to Remove-MLP and Remove-U.

4.3.2 Reward Function

We also explore the effect of uniqueness in the reward function. In the experiments, we remove the second term of Eqn.(5) and denote this variant as Remove-U for simplicity.

The green curves in Fig.7 and Fig.10 present the results of Remove-U. Fig.7 shows that M^2Net keeps higher diversity than Remove-U as the number of retained blocks decreases, which indicates that the reward function with the uniqueness term contributes to promoting the difference between output policies and increasing policy diversity. In addition, the accuracy of Remove-U decreases by a large margin compared to M^2Net, as shown in Fig.10.

Figure 11: Comparison of accuracy and FLOPs on Landmark-420. Orange denotes the results after adopting the backbone networks of other models as the recognition network of M^2Net.

4.4 Compatibility with State-of-the-Arts

In the above experiments, we adopt the traditional pretrained ResNet as the recognition network. In fact, M^2Net is complementary to other methods, whose backbone networks can also be used as our recognition network. We conduct the experiments on Landmark-420, where ResNet-47 is adopted as the baseline. Fig.11 presents the results of M^2Net after replacing the recognition network. We observe that not only is the recognition accuracy greatly improved, but the computational complexity is also further reduced after applying our architecture to other methods. This holds for MobileNet, ShuffleNet(G4) and HetConv(P4): in each case the recognition accuracy is raised and the FLOPs are reduced.

5 Conclusion

In this paper, we propose a novel moving-mobile-network, named M^2Net, which dynamically selects the inference path for effective landmark recognition. Unlike existing methods, the geographic information of the landmarks is exploited to train a policy network, whose output serves as the input to our proposed reward function. This further promotes the diverse selection of the inference path of M^2Net, so as to improve the performance. To validate the advantages of M^2Net, we create two landmark datasets with geo-information, over which extensive experiments are conducted. The results validate the superiority of our method in terms of recognition accuracy and efficiency.


References

  • [1] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
  • [2] W. Chen, Y. Zhang, D. Xie, and S. Pu (2019) A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In AAAI, Vol. 33, pp. 3355–3362.
  • [3] T. Cheng, X. Tong, X. Wang, and E. Weinan (2016) Convolutional neural networks with low-rank regularization. Computer Science.
  • [4] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
  • [5] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  • [7] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI'18, pp. 2234–2240. ISBN 978-0-9992411-2-7.
  • [8] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
  • [9] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • [11] Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320.
  • [12] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
  • [13] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
  • [14] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han (2017) Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3456–3465.
  • [15] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu (2018) Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316.
  • [16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
  • [17] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024.
  • [18] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri (2019) HetConv: heterogeneous kernel-based convolutions for deep cnns. In CVPR.
  • [19] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press.
  • [20] A. Veit, M. J. Wilber, and S. Belongie (2016) Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems, pp. 550–558.
  • [21] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng (2018) Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4376–4384.
  • [22] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018) Skipnet: learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 409–424.
  • [23] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826.
  • [24] Y. Xu, X. Dong, Y. Li, and H. Su (2019) A main/subsidiary network framework for simplifying binary neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7154–7162.
  • [25] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
  • [26] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.