1 Introduction
Deep Convolutional Neural Networks (CNNs) have largely benefited the field of pattern recognition. However, the high complexity of training CNNs limits their real-world applications. To this end, many works [5, 12, 13, 4, 3] have been proposed on network compression, which has naturally led to a surge in research [10, 26] on applying them to real-world portable mobile devices. Among these, the BlockDrop [23] module is a typical model, which dynamically selects the useful inference path for efficiency.
However, current models fail in the presence of visually similar landmarks with different geographic information, as illustrated in Fig.1, where these landmarks belong to different small classes in the same large categories, e.g., towers from different cities. The visual differences between them are therefore challenging to distinguish even when using powerful deep network representations [10, 26, 18, 23]. In other words, these mobile nets fail to consider the mobility of mobile networks, with moving geolocations.
To address this problem, we propose a novel moving-mobile network, named MNet, which exploits geographic information in combination with visual information to improve the landmark recognition accuracy. Our basic idea is to learn a policy network based on reinforcement learning [19], which dynamically selects the layers or blocks in the network to construct the inference path. We choose Residual Networks (ResNet) [6] as our backbone networks, due to their robustness to layer removal [20]. We remark that MNet also belongs to the BlockDrop module, yet with nontrivial extensions to tackle the mobility issues of the portable network. The following observations are drawn:

Diverse policies, i.e., more diverse inference paths, can improve the accuracy of MNet.

When MNet makes a correct prediction, the policy network is positively rewarded. Thus the improved accuracy will provide more rewards for the policy network.

During the training process, to get the largest reward value, the policy network aims to make a sparse and unique policy, resulting in diverse inference paths for MNet.
For ease of understanding, we illustrate the major framework of MNet in Fig.2, which can promote diverse inference paths in the network, improving the recognition accuracy. MNet can also obtain comparable computational efficiency, because the policy network produces a dynamic output for each landmark image. The policy network is trained on both the landmark images and the geographic locations, to encourage policy selection with less performance loss. The policy network and the recognition network are jointly learned to improve the accuracy while achieving feasible computational efficiency.
Our major contributions are summarized as follows:

We propose MNet to exploit the geographic information in combination with the landmark image to learn a policy network to enable the diverse inference paths selection, which improves the recognition accuracy on mobile devices. In addition, we introduce a novel reward function for the policy network to promote the generation of diverse policies.

For each landmark, we dynamically choose blocks, i.e., a subset of the inference path for the recognition network, to make the inference efficient.

As no existing research on compression or mobile networks investigates image recognition with geolocation information, we create two such datasets, named Landmark420 and Landmark732, for evaluation. The experimental results show that MNet achieves an accuracy gain and a speedup on average compared to the original ResNet47 on Landmark420, and outperforms state-of-the-art methods [23, 10, 26, 18] in terms of accuracy, while achieving a feasible computational complexity.
2 Related work
To better appreciate our research findings, we discuss related work for deep network compressions.
2.1 Dynamic Layer Selection
Several methods have been proposed to dynamically drop the residual layers in residual networks. Specifically, [20] found that residual networks can be regarded as a collection of short paths, and that removing a single block has negligible effects on network performance. [11] introduced a scaling factor to scale the residual blocks and added a sparsity constraint on these factors during training, removing the blocks with small factors. SkipNet [22] proposed gating networks that decide whether to skip the corresponding layer during inference, where the outputs of the previous layers are used as the inputs of the gating networks. Unlike SkipNet, which assigns a gating module to each residual layer, the BlockDrop [23] module uses a policy network based on reinforcement learning to output a set of binary decisions, one for each block in a pretrained ResNet, where the policy network takes images as its inputs.
Orthogonal to the above, MNet exploits both the geographic information and visual information to learn a policy network to achieve diverse policies, which can improve the performance.
2.2 Model Compression
Many methods aim to accelerate the network computation and reduce the model size. On one hand, some techniques aim to simplify existing networks, such as network pruning [5, 13, 7, 8, 2], quantization [4, 1, 21, 16, 24], low-rank factorization [3, 15] and knowledge distillation [9, 25]. On the other hand, efficient network architectures are designed to train compact neural networks, such as MobileNet [10], ShuffleNet [26], and HetConv [18]. These methods obtain compact networks after training; however, during inference, the architectures are kept unchanged for all images.
Unlike the above, MNet focuses on dynamically adjusting network architecture for different input images during inference, which contributes to allocating computing resources reasonably.
3 MNet: Moving Mobile Network
Formally, as shown in Fig.2, the policy network of MNet is composed of two subnetworks, one taking the visual images and the other taking the geographic locations as input. Given M landmark samples, the policy network aims at generating binary policy vectors, one per sample, each with N entries, where N is the number of residual blocks in the recognition network.
Before shedding the light on the architecture of MNet, we first discuss how to encode the mobility of MNet, that is, geographic information, in the next section.
3.1 Encoding Geographic Location
Geographic location can be encoded in one of the following formats: Degrees Minutes Seconds, Decimal Minutes, or Decimal Degrees, where N, S, E and W represent North, South, East and West, respectively.
We opt for the first format due to its higher dimension, while replacing N and S, or E and W, with numeric flags. This can effectively extract richer information from the input. As shown in Fig.3(a), we construct an 8-dimensional vector. We observe that Degrees, Minutes and Seconds have different scales, and that their sensitivity to change differs greatly. To address this, we adopt a Multi-Layer Perceptron (MLP) to extract the location features, with different layers for the various scales.
With the help of geographic information, we find that MNet can intuitively promote diversity for inference path selection, which will be discussed in the next section.
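The encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `to_dms` and `encode_location`, the ordering of the eight entries, and the flag values (+1 for N/E, -1 for S/W) are all hypothetical, since the text only states that the hemisphere letters are replaced with flags.

```python
def to_dms(value):
    """Split the absolute value of a decimal-degree coordinate
    into (degrees, minutes, seconds)."""
    value = abs(value)
    degrees = int(value)
    minutes_full = (value - degrees) * 60.0
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60.0
    return float(degrees), float(minutes), seconds


def encode_location(lat, lon):
    """Encode a (latitude, longitude) pair as an 8-dimensional vector.

    Assumed layout (hypothetical):
    [lat_deg, lat_min, lat_sec, lat_flag, lon_deg, lon_min, lon_sec, lon_flag]
    Assumed flags: +1.0 for North/East, -1.0 for South/West.
    """
    lat_flag = 1.0 if lat >= 0 else -1.0
    lon_flag = 1.0 if lon >= 0 else -1.0
    return [*to_dms(lat), lat_flag, *to_dms(lon), lon_flag]


if __name__ == "__main__":
    # 48 deg 30' 0" N, 2 deg 15' 0" W
    print(encode_location(48.5, -2.25))
```

Keeping degrees, minutes and seconds as separate entries preserves the different scales that the MLP is meant to handle, rather than collapsing them into a single decimal value.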
3.2 Diversity for MNet
3.2.1 Unique Policy and Diversity
As MNet selects the inference path via the policy network, we introduce the set of unique policies, which is obtained by
(1) 
where the operator merges identical elements of the set of output policies, as shown in Fig.4.
Diversifying inference path selection via geo-information. We measure the diversity as the number of unique policies, i.e., the size of the unique policy set, for M samples. One toy example (M=5) is shown in Fig.4, where (a) shows the policies made by the policy network. As aforementioned, the number of unique policies of (a), i.e., the policy diversity, is 3. As shown in Fig.3(b), consider any two landmarks with large visual similarity but highly distinct locations, or vice versa (similar locations yet small visual similarity). The outputs of one subnetwork are then close, while the outputs of the other are usually highly different, making the fusion result, i.e., the output of the sigmoid function, diverse.
To exhibit the intuitions, we show some visualization results in Fig.5 on the diverse inference path selection of MNet. The diverse policies of MNet, as shown in Fig.5(a) and (b), offer diverse inference paths for landmark recognition, with the help of both geographic information and landmark images, to improve the performance. In contrast, Fig.5(c) shows that the BlockDrop module can generate only one inference path.
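The notion of policy diversity can be illustrated with a short sketch. The toy batch below is hypothetical (it is not the content of Fig.4); it is only chosen so that M=5 policies collapse to 3 unique ones, and the helper names are assumptions made for illustration.

```python
def unique_policies(policies):
    """Merge identical binary policy vectors, keeping first-seen order."""
    seen = set()
    unique = []
    for policy in policies:
        key = tuple(policy)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def policy_diversity(policies):
    """Policy diversity = number of unique policies among the M samples."""
    return len(unique_policies(policies))


if __name__ == "__main__":
    # Hypothetical toy batch: M = 5 samples, N = 4 residual blocks.
    toy = [
        [1, 0, 1, 1],
        [1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 1, 1, 0],
    ]
    print(policy_diversity(toy))
```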
We define the uniqueness, which describes the difference between the policies that the policy network outputs for the M samples, as a vector,
(2) 
where the distance used is the normalized Hamming distance between two binary vectors. The larger the uniqueness of a policy is, the larger the difference between that policy and the other M-1 policies, which indicates that the policy is more likely to become a unique policy. In addition, during the training process, we aim at promoting the diversity of the output policies by increasing the uniqueness.
To this end, we propose our reward function for the policy network, which will be discussed in the Section 3.3.
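A minimal sketch of the uniqueness measure follows. It assumes that the uniqueness of a policy is the mean normalized Hamming distance to the other M-1 policies; the exact aggregation in Eqn. (2) is not recoverable from the text, so this averaging is an assumption, as are the function names.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("policies must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def uniqueness(policies):
    """Uniqueness vector with one entry per policy.

    Assumption: entry i is the average normalized Hamming distance
    between policy i and the other M-1 policies.
    """
    m = len(policies)
    if m < 2:
        return [0.0] * m
    scores = []
    for i, p in enumerate(policies):
        total = sum(
            normalized_hamming(p, q) for j, q in enumerate(policies) if j != i
        )
        scores.append(total / (m - 1))
    return scores


if __name__ == "__main__":
    batch = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
    print(uniqueness(batch))
```

Under this assumption, a policy that duplicates others scores low, while a policy far from all others scores close to 1, which matches the intuition stated above.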
3.3 Policy Network based on Reward Function
Motivated by [22] and [23], we apply reinforcement learning to train our policy network. As shown in Fig.2 and Fig.3(b), given a geographic location and a landmark image, together with a pretrained recognition network with N residual blocks, a policy can be seen as an N-dimensional Bernoulli distribution:
(3) 
where s is the fusion result of the policy network, which can be formulated as
(4)  
where the policy network is parameterized by the weights of its two subnetworks, and a fusion factor balances their outputs. s is the output of the policy network, and each entry of s indicates the probability that the corresponding residual block is selected. The policy is obtained based on s, where 1 and 0 indicate selecting and skipping the corresponding block, respectively.
Based on the above, we define the reward function to encourage more unique policies and minimal block usage, along with correct predictions. Our reward function is formulated as follows:
(5) 
where the sparsity term is the percentage of reserved blocks, implying the sparsity of the pretrained ResNet, and the uniqueness term reflects the difference from the other policies, promoting diverse policies. When a prediction is correct, a large positive reward is offered to the policy network, while a penalty is applied to incorrect predictions. Two weights balance the terms.
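The pipeline from fusion to reward can be sketched as below. Every functional form here is an assumption made for illustration: the fusion is taken to be a sigmoid over a convex combination of the two subnetwork outputs, and the reward is taken to be a weighted sum of (1 - sparsity) and uniqueness for correct predictions and a constant penalty otherwise. The exact forms of Eqns. (4) and (5) are not recoverable from the text, and the names `fuse`, `sample_policy`, `log_prob`, `reward`, `w_sparse`, `w_unique` and `penalty` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse(visual_out, geo_out, alpha=0.7):
    """Assumed fusion: sigmoid(alpha * visual + (1 - alpha) * geographic)."""
    return [sigmoid(alpha * v + (1.0 - alpha) * g)
            for v, g in zip(visual_out, geo_out)]


def sample_policy(s, rng):
    """Draw one binary decision per block from an N-dimensional Bernoulli."""
    return [1 if rng.random() < p else 0 for p in s]


def log_prob(policy, s):
    """Log-likelihood of a binary policy under the Bernoulli distribution.
    Assumes every probability lies strictly between 0 and 1."""
    return sum(math.log(p) if u == 1 else math.log(1.0 - p)
               for u, p in zip(policy, s))


def reward(policy, uniq, correct, w_sparse=1.0, w_unique=1.0, penalty=1.0):
    """Assumed reward: favors few reserved blocks and high uniqueness
    when the prediction is correct, and returns -penalty otherwise."""
    if not correct:
        return -penalty
    sparsity = sum(policy) / len(policy)  # percentage of reserved blocks
    return w_sparse * (1.0 - sparsity) + w_unique * uniq


if __name__ == "__main__":
    rng = random.Random(0)
    s = fuse([2.0, -1.0, 0.5], [0.0, 1.0, -0.5])
    policy = sample_policy(s, rng)
    print(s, policy, reward(policy, uniq=0.5, correct=True))
```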
To learn the policy network, we maximize the following objective function [23]:
(6) 
3.3.1 Sparsity of the Inference Path
We note that the sparsity term in Eqn. (5) enlarges the selection space of the inference path, and thus helps to diversify it. It is also closely related to the complexity of the selected inference path. To study this further, we define a measure of the sparsity of the selected inference path, as follows:
(7) 
We measure the proportion of reserved blocks, such as 15/21 when 15 of 21 blocks are reserved. It also implies the computational complexity for each image during inference, where a smaller value means lower complexity. For MNet, we adopt the average over the inference paths to measure the efficiency.
We also observe that the sparsity has a non-negligible influence on the maximum number of unique policies, which is determined by the number of ways to choose the reserved blocks among the N residual blocks. Fig.6 describes the relationship between them. We observe that the maximum number of unique policies varies greatly with the sparsity, and that it peaks at an intermediate sparsity value.
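The relationship between sparsity and the size of the policy space can be checked with a few lines of arithmetic. The sketch assumes that the maximum number of unique policies at a given sparsity is the binomial coefficient "N choose k", where k is the number of reserved blocks; this is the natural combinatorial reading of the text rather than a formula quoted from it, and the function names are hypothetical.

```python
import math


def max_unique_policies(n_blocks, n_reserved):
    """Number of distinct binary policies that reserve exactly
    n_reserved of n_blocks residual blocks (binomial coefficient)."""
    return math.comb(n_blocks, n_reserved)


def peak_block_counts(n_blocks):
    """Numbers of reserved blocks at which the policy space is largest."""
    counts = [max_unique_policies(n_blocks, k) for k in range(n_blocks + 1)]
    peak = max(counts)
    return [k for k, c in enumerate(counts) if c == peak]


if __name__ == "__main__":
    n = 21  # residual blocks in ResNet65
    print(max_unique_policies(n, 15))  # sparsity 15/21
    print(peak_block_counts(n))
```

For N=21, reserving 15 blocks allows 54264 distinct policies, and the count is largest when about half of the blocks are reserved.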
3.4 Learning Parameters of MNet
To learn the weights of the two subnetworks of the policy network, we compute the gradients of the objective in Eqn. (6), as motivated by [23]. The gradients can be represented as:
(8) 
where and is the maximally probable configuration, i.e., if , and otherwise [17]. denotes the weights of the policy network, which contains and .
We can further derive the gradients of the objective with respect to the weights of each of the two subnetworks as follows:
(9) 
(10) 
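The gradient computation can be sketched numerically for a single sample. The sketch follows the self-critical formulation used by [23] and [17]: the advantage is the reward of the sampled policy minus the reward of the maximally probable configuration. The 0.5 threshold, the sigmoid fusion of the two subnetwork outputs with factor `alpha`, and all function names are assumptions; the resulting per-subnetwork scaling is shown only to indicate how the chain rule would split the gradient under that assumed fusion.

```python
def greedy_policy(s, threshold=0.5):
    """Maximally probable configuration: reserve a block when its
    probability exceeds the (assumed) threshold of 0.5."""
    return [1 if p > threshold else 0 for p in s]


def grad_wrt_probs(s, sampled, advantage):
    """d/ds_k of advantage * log pi(sampled | s) for a Bernoulli policy.
    Assumes every s_k lies strictly between 0 and 1."""
    return [advantage * (u / p - (1 - u) / (1.0 - p))
            for u, p in zip(sampled, s)]


def grad_wrt_logits(s, sampled, advantage):
    """Same gradient taken w.r.t. the pre-sigmoid values:
    multiplying by ds/dz = s * (1 - s) simplifies to advantage * (u - s)."""
    return [advantage * (u - p) for u, p in zip(sampled, s)]


def grad_wrt_subnetwork_outputs(s, sampled, advantage, alpha=0.7):
    """Under the assumed fusion z = alpha * visual + (1 - alpha) * geographic,
    the logit gradient is scaled by alpha and (1 - alpha), respectively."""
    g = grad_wrt_logits(s, sampled, advantage)
    return [alpha * x for x in g], [(1.0 - alpha) * x for x in g]


if __name__ == "__main__":
    s = [0.8, 0.2]
    sampled = [1, 0]
    advantage = 0.5  # R(sampled) - R(greedy_policy(s)), value made up here
    print(grad_wrt_probs(s, sampled, advantage))
    print(grad_wrt_subnetwork_outputs(s, sampled, advantage))
```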
3.5 More Insights among Reward Function, Policy Diversity and Recognition Accuracy
As shown in Fig.2, the reward function, policy diversity and recognition accuracy of MNet can inherently promote each other. To better understand this, we discuss the relationship between each pair of them, respectively, which is shown below:

Policy Diversity and Recognition Accuracy:
For MNet, the multiple output policies of the policy network select the diverse inference path in the recognition network. Besides, geographic information together with visual information are exploited to increase policy diversity. Combining Fig.7 with Fig.10(d) shows that MNet achieves larger policy diversity and higher recognition accuracy, than the model that removed geographic information, implying that diverse policy can improve the recognition accuracy.

Recognition Accuracy and Reward Function:
For Eqn.(5), when a prediction is correct, the reward function will generate a large reward value. The higher the recognition accuracy, the more correct predictions, resulting in more diverse reward values. That indicates that the larger recognition accuracy provides diverse reward values for the policy network.

Reward Function and Policy Diversity:
To maximize Eqn.(6), the reward function in Eqn.(5) encourages a smaller sparsity and a more diverse policy set. We also find that, within a certain range, the smaller the sparsity value, the greater the maximum number of unique policies, as shown in Fig.6. Thus the selection set of the output policy is expanded, leading to larger possibilities for diverse policies. This naturally indicates that the reward function can encourage larger policy diversity.
4 Experiment
In this section, we experimentally validate the effectiveness of MNet, including comparisons with state-of-the-art methods and comprehensive ablation studies.
4.1 Experiment Setup
4.1.1 Datasets
We construct two landmark classification datasets, Landmark420 and Landmark732, where each image is picked from Google-Landmark-v2 [14], which contains 5M images labeled for 200k unique landmarks ranging from artificial buildings to natural landscapes. We show examples in Fig.8. Each landmark image carries geographic information represented by latitude and longitude. The geographic location is encoded as an 8-dimensional vector, as stated in Section 3.1.
In summary:

The Landmark420 dataset consists of 165,000 colored images, with 150,000 labeled landmark images from 420 classes for training and 15,000 for testing.

The Landmark732 dataset consists of 380,000 labeled training landmark images across 732 categories and 40,000 images for test.
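Table 1: Architecture of the 5-layer MLP that extracts the geographic location features.

Layer | Input | Output
FC1 | 8 | 128
FC2 | 128 | 256
FC3 | 256 | 256
FC4 | 256 | 128
FC5 | 128 | K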
4.1.2 Policy Network Architecture
Our policy network combines geographic locations with landmark images through two subnetworks, as shown in Fig.3(b). For geographic locations, a 5-layer Multi-Layer Perceptron, as shown in Table 1, is adopted to extract the geographic location features. For landmark images, we utilize a ResNet variant named ResNet10, as shown in Table 2, to extract the visual features, capturing the feature vectors via a fully connected layer [23]. To exploit the image features and the corresponding geographic location, we fuse the output vectors of the two subnetworks with a scaling factor that balances them. In our experiments, we empirically set the scaling factor to 0.7.
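Table 2: Architecture of ResNet10, which extracts the visual features.

Block | Layer | Filter | Stride
 | Conv2d |  | 2
 | MaxPool2d |  | 2
Residual block | Conv2d |  | 1
 | Conv2d |  | 1
Residual block | Conv2d |  | 2
 | Conv2d |  | 1
 | Downsample |  | 2
Residual block | Conv2d |  | 2
 | Conv2d |  | 1
 | Downsample |  | 2
Residual block | Conv2d |  | 2
 | Conv2d |  | 1
 | Downsample |  | 2
 | AvgPool2d |  | 4
 | Linear | K |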
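The geographic subnetwork of Table 1 can be sketched in plain Python to make the layer dimensions concrete. The layer sizes come from Table 1; everything else is an assumption for illustration: the ReLU activations between layers, the random initialization, the choice K=15 in the usage example (one output per residual block of ResNet47), and the function names.

```python
import random

# Input widths of FC1..FC5 from Table 1; the final output width is K.
LAYER_INPUTS = [8, 128, 256, 256, 128]


def build_mlp(k, seed=0):
    """Create (weights, bias) pairs for FC1..FC5 with random weights."""
    rng = random.Random(seed)
    sizes = LAYER_INPUTS + [k]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = (1.0 / n_in) ** 0.5
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        bias = [0.0] * n_out
        layers.append((weights, bias))
    return layers


def forward(layers, x):
    """Forward pass; ReLU after every layer except the last (assumed)."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


if __name__ == "__main__":
    mlp = build_mlp(k=15)
    location = [48.0, 30.0, 0.0, 1.0, 2.0, 15.0, 0.0, -1.0]  # 8-dim input
    print(len(forward(mlp, location)), count_parameters(mlp))
```

With K=15 the subnetwork has 134799 parameters, which is small next to the recognition network.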
4.1.3 Recognition Network
We adopt three variants of ResNet [6], named ResNet29, ResNet47 and ResNet65, which start with a convolutional layer followed by 9, 15 and 21 residual blocks, respectively, organized into three groups. Downsampling layers are evenly inserted into them. To match our datasets, we append a fully connected layer with 420/732 neurons at the end of the network. Each of the residual blocks adopts the bottleneck design [6], which includes three convolutional layers. To obtain a pretrained ResNet, we train the three ResNets from scratch on Landmark420 and Landmark732.
4.2 Comparison with the State-of-the-Arts
First of all, we compare the proposed MNet with several compressed network models:

Baseline: it includes two variants of ResNet [6], named ResNet29, ResNet47.

MobileNet [10]: it introduces depthwise separable convolution and 1×1 convolution to speed up convolutional computation.

ShuffleNet [26]: based on MobileNet, it proposes a group convolution and a channel shuffle operation to replace the 1×1 convolution.

HetConv [18]: it proposes to replace part of the 3×3 kernels with 1×1 kernels to speed up the convolution operation.

BlockDrop [23]: it proposes a policy network based on reinforcement learning that outputs a set of binary decisions for dynamically dropping blocks in a pretrained ResNet, where the policy network takes images as its inputs.
We implement the methods with the backbone network ResNet on Landmark420 and Landmark732. In particular, for ShuffleNet, G2 and G4 denote that the group number of the convolution is 2 and 4, respectively. For HetConv, P2 and P4 indicate that the proportion of 3×3 kernels in a convolutional filter is 1/2 and 1/4, respectively.
Table 3 presents the results on Landmark420 and Landmark732. The results show that MNet outperforms the other methods by large margins. We observe that our best model achieves an accuracy gain over the baseline. When the recognition network changes from ResNet29 to ResNet47, our advantage becomes more obvious. This is because the larger number of blocks in ResNet47 offers more possible options for the inference path. Thus, the policy network can generate more unique policies, and more diverse inference paths can be constructed.
Fig.9 presents the comparison of FLOPs with the state-of-the-art methods on Landmark420. With ResNet47, MNet obtains a computational complexity comparable to that of MobileNet at the same level of accuracy. For some samples, MNet can even achieve faster inference than MobileNet. With ResNet47, MNet achieves better accuracy than HetConv at the same level of computational efficiency. Compared to BlockDrop, MNet achieves higher computational efficiency with a larger variance, for example with ResNet29. Different from the static methods (MobileNet, ShuffleNet, HetConv), the computational efficiency of MNet changes dynamically for each input, as shown in Fig.9, which indicates that MNet can allocate computing resources reasonably.
4.3 Ablation Study
To better validate each module of MNet, we conduct a series of ablation studies in the next sections.
4.3.1 With MLP vs Without MLP
We explore the effect of geographic information in the policy network. We study the effect of the MLP. Without the MLP, the input of the policy network only contains images; we denote it as RemoveMLP for simplicity.
The orange curves in Fig.7 and Fig.10 present the results of RemoveMLP. From Fig.7, we observe that the policy diversity of MNet is much higher than that of RemoveMLP, implying that combining geographic information with visual information can increase the number of unique policies. Besides, in Fig.10(c)(d), MNet achieves a large accuracy gain over RemoveMLP and keeps a stable accuracy (83% and 85%). We therefore conclude that diverse policies contribute to improving the recognition accuracy.
In addition, Fig.10(c)(d) show that the accuracy of RemoveMLP drops sharply as the sparsity decreases, while that of MNet can even improve, which means that MNet keeps a good balance between accuracy and computational efficiency. In particular, Fig.10(c)(d) report that the accuracy curves reach a peak at the sparsity values where the selection space of the output policy is largest, as shown in Fig.6.
4.3.2 Reward Function
We also explore the effect of uniqueness in the reward function. In the experiments, we remove the second term of Eqn.(5) and denote the resulting model as RemoveU for simplicity.
The green curves in Fig.7 and Fig.10 present the results of RemoveU. Fig.7 shows that MNet keeps a higher diversity than RemoveU when the sparsity decreases, which indicates that the reward function with the uniqueness term contributes to promoting the difference between output policies and increasing policy diversity. In addition, the accuracy of RemoveU decreases by a large margin compared to MNet, as shown in Fig.10.
4.4 Compatibility with State-of-the-Arts
In the above experiments, we adopt the traditional pretrained ResNet as the recognition network. In fact, MNet is complementary to other methods, whose backbone networks can also be used as our recognition network. We conduct the experiments on Landmark420, where ResNet47 is adopted as the baseline. Fig.11 presents the results of MNet after replacing the recognition network. We observe that not only is the recognition accuracy greatly improved, but the computational complexity is also further reduced after applying our architecture to other methods. For MobileNet, ShuffleNet(G4) and HetConv(P4), the recognition accuracy is raised and the FLOPs are reduced in each case.
5 Conclusion
In this paper, we propose a novel moving-mobile network, named MNet, which dynamically selects the inference path for efficient landmark recognition. Unlike existing methods, the geographic information of the landmarks is exploited to train a policy network, whose output serves as the input to our proposed reward function, which can further promote the diverse selection of the inference path of MNet, so as to improve the performance. To validate the advantages of MNet, we create two landmark datasets with geo-information, over which extensive experiments are conducted. The results validate the superiority of our method in terms of recognition accuracy and efficiency.
References
[1] (2017) Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
[2] (2019) A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In AAAI, Vol. 33, pp. 3355–3362.
[3] (2016) Convolutional neural networks with low-rank regularization. Computer Science.
[4] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
[5] (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143.
[6] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[7] (2018) Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI'18, pp. 2234–2240. ISBN 9780999241127.
[8] (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
[9] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[10] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[11] (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320.
[12] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
[13] (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
[14] (2017) Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3456–3465.
[15] (2018) Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316.
[16] (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
[17] (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024.
[18] (2019) HetConv: heterogeneous kernel-based convolutions for deep cnns. In CVPR.
[19] (2018) Reinforcement learning: an introduction. MIT press.
[20] (2016) Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems, pp. 550–558.
[21] (2018) Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4376–4384.
[22] (2018) Skipnet: learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 409–424.
[23] (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826.
[24] (2019) A main/subsidiary network framework for simplifying binary neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7154–7162.
[25] (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
[26] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.