Several concurrent works design CNN architectures that attempt to interpret how relevant patches in the input image contribute to the final prediction. BagNet [1] constructs a ResNet-like [7] architecture to extract the feature map and applies a spatially linear aggregation (i.e., a simple average) before the softmax layer. Each spatial location can be mapped back to a small patch in the input image, so the contribution of each patch can be determined from its activation value. Saccader [3] follows BagNet and further introduces a hard attention module to localize the most salient locations, and then estimates the relevance of various image patches. However, a common limitation of these methods is that they accumulate only one receptive field (RF) size throughout the CNN, so they can interpret image patches at only a single scale. Real objects usually appear at various scales with coarse- or fine-grained texture features, which a single-scale RF cannot model effectively. Moreover, the most relevant patches cannot be reused in downstream classification tasks for practical inference acceleration, because the upstream localizer is itself extremely expensive.
Inspired by these works, which use linear feature mapping to retain interpretability, and to address the limitations above, we build a light-weight, multi-scale localizer called AnchorNet. Its tail consists of three localization branches with different accumulated RFs (i.e. , and in this paper) to capture multi-scale patches. Each localization branch is also equipped with an attention branch, which assists the localization of relevant patches by providing a spatial attention map that can be visualized to display the semantically salient locations. To decide the predicted class and which branch is most suitable for localizing the semantic patches of a given image, we introduce a simple yet effective mechanism that leverages the softmax distributions generated by the three branches. We then capture the semantic patches for the chosen class and localization branch with a simple algorithm that we call LSP (Localizing Semantic Patches). Extensive experiments on downstream classification demonstrate that AnchorNet (parameters: 1.6M, FLOPs: 0.5G) combined with our algorithms localizes more semantic patches and obtains better performance than the state-of-the-art (SOTA) Saccader (parameters: 33.58M, FLOPs: 21.6G) at over an order of magnitude lower complexity.
By visualizing the localized multi-scale patches across the ImageNet 2012 validation set, we observe that the localization branch with a wider RF is more prone to localize larger objects and coarse-grained global features, while the branch with a narrower RF tends to localize smaller objects and fine-grained local features, which matches intuition. For more practical application, we further use the multi-scale patches in downstream classification tasks to validate their semantics and their effectiveness for inference acceleration. Experimental results show that replacing the original images with multiple semantic patches consistently yields a clear inference speed-up with only a small drop in accuracy across widely used networks, e.g., about 50% FLOPs reduction for ResNet-50 with only a 0.7% top-1 accuracy drop and without any modification of the original model.
In brief, our contributions are threefold:
We construct a light-weight AnchorNet that, combined with our proposed localization algorithms, adaptively localizes multi-scale semantic patches in a linearly interpretable manner.
We analyze the feature-extraction characteristics of localization branches with different RFs based on visual explanation, and further interpret the intriguing cases of confusion they cause.
We show that using multi-scale semantic patches for downstream classification yields a clear inference acceleration with only a small performance degradation compared with using the original images, and that this approach is orthogonal and complementary to popular model-based acceleration methods.
2 Related Work
2.1 Localizing semantic features
Some previous works aim to interpret the decisions of a CNN by visualizing a semantic feature heatmap, mainly in a response-based or gradient-based [17, 15, 16] manner. However, visual explanation only displays a semantic region, which cannot be used directly for downstream classification because of its irregular shape. Attention mechanisms are also commonly introduced to highlight semantically salient spatial locations by providing a spatial attention map [18, 4, 3]. AnchorNet differs from these prior practices in that it applies a multi-branch spatial attention mechanism to perform multi-scale semantic localization.
Some seemingly similar but essentially different approaches are region proposal models for object detection [5, 14, 6], which typically combine contextual information to infer relevant object regions rather than using local information in fixed regions, and which use ground-truth bounding boxes for training. Unlike these works, AnchorNet is supervised only by image-level labels and extracts local features in fixed regions that are strictly spatially aligned with the input image. Zhou et al. [zhou2016learning] implement object localization without supervision from any bounding box annotations, and thus share with us the use of image-level labels for training. However, their information is still gathered from the whole image instead of local regions, so the contributions of individual patches to the final prediction become entangled. In addition, using a single patch to localize the full object is difficult for downstream classification because object scale varies dramatically. In contrast, AnchorNet uses one or more patches of the same size to cover the object, which is advantageous for classification while still achieving good performance.
2.2 Inference acceleration
Modern acceleration methods mainly concentrate on channel pruning [12, 8, 13], which removes redundant convolutional filters from the model, or dynamic inference [10, 19], which uses only part of the model's structure conditioned on the input image at inference time. In this work, we localize informative local image patches over the whole image under the guidance of the light-weight AnchorNet, so that the downstream networks only need to process semantic patches that are much smaller than the original images, producing a clear acceleration. Moreover, data-based localization is CNN-agnostic and can therefore be regarded as orthogonal and complementary to model-based pruning or dynamic inference.
3.1 Review of Feature Mapping
Modern CNNs gradually decrease the spatial resolution of the input image through several convolutional blocks until the global average pooling (GAP) layer. Several hyperparameters of the convolutional layers, e.g., kernel size, padding, and stride, affect the resolution of the output. We set the padding to 0 in all convolutional layers of AnchorNet, so each spatial location of the final feature map before the GAP layer can be mapped exactly to a region of the input image that never extends beyond the image bounds. As an example, assume that a CNN model receives an image with pixels as the input and has accumulated RF size and stride before GAP; we then obtain spatial locations, where each location can be mapped back to a region with the size of . Figure 1 illustrates the mapping rule.
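The mapping rule can be written down in a few lines. The sketch below is a minimal illustration, assuming square inputs, zero padding in every layer, and patches expressed as (top, left, bottom, right) with exclusive bottom and right edges. The function names are hypothetical, and the numbers in the usage note (a 224-pixel input, RF 32, stride 8) are illustrative assumptions, not values taken from this paper.

```python
def grid_size(image_size: int, rf: int, stride: int) -> int:
    """Number of feature-map locations along one axis of an unpadded CNN.

    With padding 0 everywhere, an output location exists only if its whole
    receptive field fits inside the image, hence floor((N - rf) / s) + 1.
    """
    if image_size < rf:
        raise ValueError("image is smaller than the accumulated receptive field")
    return (image_size - rf) // stride + 1


def location_to_patch(i: int, j: int, rf: int, stride: int):
    """Map feature-map location (i, j) back to its input patch.

    Returns (top, left, bottom, right); bottom and right are exclusive.
    """
    top, left = i * stride, j * stride
    return top, left, top + rf, left + rf
```

Under these illustrative values, `grid_size(224, 32, 8)` gives 25 locations per axis, and the last location (24, 24) maps to the patch (192, 192, 224, 224), which ends exactly at the image border.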
We develop a CNN called AnchorNet which can automatically localize the most suitable semantic features with various patch sizes conditioned on the input image. Figure 2 illustrates the overall architecture of AnchorNet schematically, which contains the following three components:
Head. The input image is first processed by a head that extracts low-level features; its details are shown in Table 1. We adopt the efficient bottleneck unit, the SE block [9], and the hyperparameter settings of Howard et al. [howard2019searching], except that we replace most convolutions with convolutions to restrict the accumulated RF throughout the head, and perform less down-sampling than popular networks on the ImageNet dataset so as to retain a higher-resolution feature map that provides more patch mappings to the input image.
Localization Branch. We construct three branches after the head to localize semantic regions at multiple scales along the spatial dimension. To this end, bottlenecks with various kernel sizes are used to adjust the accumulated RF size of each branch individually. Table 2 details the accumulated RFs: each spatial location of the three feature maps produced by the three localization branches is mapped to a patch of size , or in the original image. Because the accumulated stride of all three branches is , they map to , , possible semantic locations in the original image, respectively. Before classification, we utilize a linear which is combined with the spatial attention map by broadcast element-wise multiplication, where denotes the given branch, and denote the spatial height and width, respectively. We then apply global average pooling for and a softmax layer to obtain the class probability distribution. In contrast to the popular fully-connected (FC) layer, we only perform a linear average aggregation along the spatial dimension and then attach a softmax function, which allows us to pinpoint exactly how the various patches contribute to the final prediction; this is what we refer to as linear in this paper. An FC layer, by contrast, would introduce interaction between patch-wise evidence and thus destroy the interpretability of the mapping. The outputs of all branches after the softmax layer are supervised by the cross-entropy loss with image-level labels.
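To make the notion of linear aggregation concrete, the following NumPy sketch shows why a spatial average followed by softmax keeps per-patch contributions separable. It is only an illustration under assumptions: the tensor layout (H, W, K), the function names, and the returned `contributions` array are hypothetical and not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def aggregate_branch(logits, attention):
    """Attention-weighted linear aggregation for one localization branch.

    logits:    (H, W, K) per-location class evidence (assumed layout).
    attention: (H, W) spatial attention map.
    Returns the class probabilities (K,) and the per-location contributions
    (H, W, K), whose sum over locations equals the pre-softmax class scores.
    """
    h, w, _ = logits.shape
    weighted = logits * attention[..., None]   # broadcast element-wise product
    contributions = weighted / (h * w)         # each location's share of the average
    pooled = contributions.sum(axis=(0, 1))    # identical to global average pooling
    return softmax(pooled), contributions
```

Because the pooled score of a class is a plain sum of `contributions[:, :, k]`, the evidence from each mapped patch can be read off directly; an FC layer over the flattened map would mix locations and lose this property.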
Attention Branch. To assist feature learning in the localization branch, we further construct an attention branch that emphasizes semantic locations by generating a spatial attention map. A bottleneck is applied to produce the feature map for attention localization, where denotes the number of channels. A convolutional filter then compacts along the channel dimension to , followed by a softmax function to generate the spatial weights :
According to the normalized spatial weights , we apply global weighted average pooling to and produce a channel attention map ; the -th channel of is given by (2), where denotes broadcast element-wise multiplication.
A softmax function then normalizes to generate the final channel attention map :
According to the normalized channel weights , we apply a weighted shrinking of along the channel dimension to generate a spatial attention map :
After applying a softmax function to , we obtain the final spatial attention map :
Note that we attach an FC layer and a softmax function to the attention features to provide additional direct supervision by the cross-entropy loss with image-level labels alongside the main localization branches, which makes it easier to learn discriminative features and facilitates attention localization. The connection from the attention branch to GAP is removed at the inference stage.
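The sequence of steps in the attention branch can be sketched in NumPy as follows. This is a rough illustration rather than the authors' implementation: all names are hypothetical, the compacting convolution is assumed to be a bias-free per-location projection (modeled as a dot product with one weight vector), and each spatial softmax is assumed to normalize over all locations jointly.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def spatial_attention(features, conv_weight):
    """Sketch of the attention branch for features of shape (H, W, C).

    conv_weight: (C,) weights of the assumed per-location projection.
    Returns a spatial attention map of shape (H, W) that sums to 1.
    """
    h, w, _ = features.shape
    # 1) Compact the channels to one map, then normalize over locations.
    spatial_weights = softmax((features @ conv_weight).ravel()).reshape(h, w)
    # 2) Global weighted average pooling: one score per channel.
    channel_scores = (features * spatial_weights[..., None]).sum(axis=(0, 1))
    # 3) Normalize the channel scores into channel weights.
    channel_weights = softmax(channel_scores)
    # 4) Weighted shrinking along the channel dimension.
    shrunk = features @ channel_weights
    # 5) Final softmax over locations gives the spatial attention map.
    return softmax(shrunk.ravel()).reshape(h, w)
```

The auxiliary FC classifier used for direct supervision during training is omitted here, since it is not part of the inference path.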
3.3 Localizing Multi-scale Semantic Patches
Given an input image to AnchorNet, localization branch predicts the class probability distribution , where denotes the probability of class , and , . We then make a simple decision for each individual branch as follows:
where and denote the class predicted by branch and its probability, respectively. We then make the overall decisions on the final class and on the branch used for patch localization according to (7) and (8) as follows:
Given the branch , each channel of the logits tensor corresponds to a class-specific activation map, which emphasizes class-specific semantic regions. Given the predicted class label , the heatmap is obtained as , which represents the interpretable contribution of each mapped patch to the predicted class . Instead of simply selecting the top patches with maximum activations, we perform LSP (Algorithm 1) to ensure that the localized patches are not only semantic but also partly separated, so that they cover more information. First, we flatten into a list of 2-dimensional coordinates and sort them in descending order of their activation values. Then, we directly map the first coordinate, which has the maximum activation, to its corresponding patch using the mapping rule of Section 3.1, and put the patch in the collection . Next, we visit the remaining points sequentially from front to back; the patch of size mapped from a point is added to only if the point meets the following condition: the between this patch and each patch already in is less than the threshold . Here , and is a practical indicator that quantifies the intersection between two patches A and B:
where calculates the number of pixels in a region. Introducing this mechanism thus allows the localized patches to be both separated and semantic. When the number of patches in reaches the upper limit , the final collection of patches is obtained.
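The greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: the overlap indicator is assumed here to be intersection over union (the extracted text does not preserve its exact definition), patches are (top, left, bottom, right) boxes with exclusive bottom and right edges, and all function and parameter names are hypothetical.

```python
import numpy as np


def overlap(a, b):
    """Assumed overlap indicator: intersection over union of two boxes."""
    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def localize_semantic_patches(heatmap, rf, stride, threshold, max_patches):
    """Greedily pick high-activation patches that overlap little with each other.

    heatmap: (H, W) activation map of the predicted class.
    Returns a list of (top, left, bottom, right) patches in input coordinates.
    """
    # Visit the locations from highest to lowest activation.
    order = np.argsort(heatmap, axis=None)[::-1]
    rows, cols = np.unravel_index(order, heatmap.shape)
    patches = []
    for i, j in zip(rows, cols):
        top, left = int(i) * stride, int(j) * stride
        box = (top, left, top + rf, left + rf)
        # The first box is always kept because there is nothing to compare against.
        if all(overlap(box, kept) < threshold for kept in patches):
            patches.append(box)
            if len(patches) == max_patches:
                break
    return patches
```

A lower threshold spreads the patches further apart, while a higher one lets them concentrate around the strongest activations.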
4.1 Dataset and Settings
We evaluate AnchorNet on the ImageNet 2012 dataset [2] to validate the effectiveness of localizing multi-scale semantic patches. ImageNet is a large-scale image recognition dataset containing 1.2 million training images and 50k validation images in 1000 classes; each validation image is resized to pixels at test time. For training, we employ the standard data augmentation of He et al. [7] and use synchronous SGD with a momentum of 0.9, a batch size of 256, and weight decay for 100 epochs. The learning rate starts at 0.1 and is decayed by a factor of 10 every 30 epochs.
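For reference, the step schedule stated above can be expressed as a small function. This is a sketch only; the function and parameter names are illustrative and do not come from the authors' training code.

```python
def step_learning_rate(epoch: int, base_lr: float = 0.1,
                       decay_factor: float = 10.0, step: int = 30) -> float:
    """Learning rate at a given (0-indexed) epoch for a step-decay schedule."""
    return base_lr / decay_factor ** (epoch // step)
```

Over the 100 training epochs this gives 0.1 for epochs 0 to 29, 0.01 for 30 to 59, 0.001 for 60 to 89, and 0.0001 for the last ten epochs.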
For the LSP algorithm, we set , , , , , , for visualizing the localized semantic patches and analyzing the characteristics of branches with various RFs. We further use looser settings, , , , , , to produce more semantic patches for fine-tuning the downstream models and enhancing their robustness on recognition tasks.
4.2 Localized Multi-scale Semantic Patches
As shown in Figure 3, the branch with a wider RF is more prone to localize larger objects and coarse-grained global features: objects such as the scuba diver, remote control and airliner, which occupy most of the image, are captured by . In contrast, the branch with a relatively narrower RF tends to localize smaller objects and fine-grained local features, e.g., not only captures miniature objects such as the ladybug, fish and violin, but also identifies the local texture features of large objects such as corn and crocodile. Here, whether an object is considered large is relative to the size of the corresponding image.
Another intriguing case is the misclassification that may occur in both and because of their feature-extraction characteristics, as illustrated in Figure 4. Following the discussion above, a narrow RF can capture local features but may ignore more informative global features, e.g., concentrates on the local keyboard yet misses the global typewriter, leading to confusion with computer keyboard. In contrast, a wide RF prefers coarse-grained features but ignores fine-grained local features, e.g., the outline and color of a custard apple are interpreted by as evidence for head cabbage, ignoring the difference in texture between them.
| Model | Scale | FLOPs (G) | Top-1 (%) | Top-5 (%) |
4.3 Using Semantic Patches for Classification
We further conduct downstream classification on the localized patches to verify how well they represent the semantics of the original images. As shown in Table 3, we use pre-trained ResNet-50, ResNeXt-50 [20], DenseNet-169 [11] and HCGNet-C [21], fine-tuned on training patches, for classification. The results are evaluated on the ImageNet validation set, where each image is localized by one of the three branches. Of the 50K validation images, 15050, 18047 and 16903 are localized by , and , with 5.6, 2.9 and 2.1 patches per image on average, respectively. Since one image may generate multiple relevant patches, we make the final decision by summing their softmax distributions and choosing the class with the maximum probability. Table 3 shows that, without any change to the SOTA models, using multiple semantic patches instead of the original images achieves about acceleration with only a small drop in top-1 accuracy, varying from % on ResNet-50 as the minimum to % on HCGNet-C as the maximum.
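The patch-level decision rule (sum the softmax distributions, then take the most probable class) is simple enough to sketch directly. The function name and the array layout (one row per patch) are illustrative assumptions.

```python
import numpy as np


def classify_from_patches(patch_probs):
    """Combine per-patch softmax outputs into one image-level prediction.

    patch_probs: array-like of shape (num_patches, num_classes).
    Returns the predicted class index and the summed distribution.
    """
    summed = np.asarray(patch_probs, dtype=float).sum(axis=0)
    return int(summed.argmax()), summed
```

For example, three patches with distributions (0.6, 0.4, 0.0), (0.1, 0.2, 0.7) and (0.2, 0.5, 0.3) sum to (0.9, 1.1, 1.0), so class 1 is chosen even though no single patch is most confident about it alone in the first two cases.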
To further demonstrate that the remarkable acceleration and good accuracy are attributable to localization and cannot be obtained by simply down-scaling the original images, we compare the two approaches on the same networks without any extra training. For each image, we localize only the single semantic patch with the maximum activation at the scale decided by AnchorNet, and, as the counterpart, simply rescale the image. Table 4 shows that rescaling consistently incurs a significant accuracy drop compared with localizing, indicating that the semantic patches are indeed effective.
4.4 Analysis of AnchorNet
4.4.1 What have attention branches learned?
Attention branches are introduced to assist semantic feature localization. We visualize the spatial attention maps generated by the various branches in Figure 5. The heatmaps show that all attention branches not only attend to informative locations, but also adaptively capture objects of various scales according to their individual RFs.
4.4.2 AnchorNet localizes relevant patches for classification.
From Figure 6, it can be observed that all localizers generally achieve better accuracy as the covered area increases. Moreover, the relevant patches localized by AnchorNet achieve the best performance among the localizers under the same coverage, which indicates that the superiority of AnchorNet for downstream classification is not simply attributable to wider image coverage. We further investigate the importance of the localized patches by using them to mask the original images (i.e., setting the pixels to 0) and then classifying the resulting images. Figure 6 shows that masking by AnchorNet leads to a more significant drop in performance than masking by other localizers. Based on this analysis, we believe that AnchorNet outperforms other SOTA localizers largely because of its multi-scale localization capability.
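The masking experiment amounts to zeroing the pixels inside each localized patch before classification. A minimal sketch, assuming images are NumPy arrays of shape (H, W, C) and patches are (top, left, bottom, right) boxes with exclusive bottom and right edges (both conventions are assumptions made for illustration):

```python
import numpy as np


def mask_patches(image, boxes):
    """Return a copy of the image with every listed patch set to 0."""
    masked = image.copy()
    for top, left, bottom, right in boxes:
        masked[top:bottom, left:right] = 0
    return masked
```

The original array is left untouched, so the same image can be masked with the patches of several localizers for comparison.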
We construct AnchorNet, which, combined with our LSP algorithm, adaptively localizes multi-scale semantic patches while retaining interpretability through linear aggregation of spatial information. Compared with previous SOTA localizers, AnchorNet is more practical for downstream classification and obtains better performance thanks to its light-weight, multi-scale feature extraction. We hope that AnchorNet will inspire future studies on interpretable semantic feature localization and its applications.
[1] (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations.
[2] (2009) ImageNet: a large-scale hierarchical image database. In , pp. 248–255.
[3] (2019) Saccader: improving accuracy of hard attention models for vision. In Advances in Neural Information Processing Systems, pp. 700–712.
[4] (2019) Attention branch network: learning of attention mechanism for visual explanation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10705–10714.
[5] (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[6] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
[9] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[10] (2018) Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations.
[11] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[12] (2017) Pruning filters for efficient convnets. In International Conference on Learning Representations.
[13] (2019) MetaPruning: meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258.
[14] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[15] (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
[16] (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
[17] (2015) Striving for simplicity: the all convolutional net. In International Conference on Learning Representations.
[18] (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
[19] (2018) BlockDrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826.
[20] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
[21] (2019) Gated convolutional networks with hybrid connectivity for image classification. arXiv preprint arXiv:1908.09699.