The idea of grouping features is long-standing. In early computer vision research, many hand-crafted image features were designed in groups, such as SIFT lowe2004distinctive and HOG dalal2005histograms . For example, a HOG vector comes from several spatial cells, where each cell is represented by a normalized orientation histogram. With the rapid development of CNNs lecun1990handwritten ; krizhevsky2012imagenet ; simonyan2014very ; szegedy2015going ; he2016deep ; huang2017densely ; wang2018mixed , widely used module designs have adopted the grouping methodology, such as group convolution xie2017aggregated and group normalization wu2018group . These techniques typically split the channels of a convolutional feature map into multiple sub-features and apply ordinary convolution or normalization to transform the sub-features in each group. In CapsuleNet sabour2017dynamic , the grouped sub-features are modeled as capsules, which represent the instantiation parameters of a specific type of entity, such as an object or an object part.
In addition to grouping the channel dimension into multiple sub-features that represent different semantics, we also need to consider another important dimension of the convolutional feature map: space. For a particular semantic group, it is reasonable and beneficial to generate the corresponding semantic features at the correct spatial positions of the original image. However, due to the lack of supervision over specific region details and possible noise in the image, the spatial distribution of the semantic features is often disordered, which considerably weakens representation learning and hinders the construction of hierarchical understanding (see Figure 1).
In order to make each set of features robust and well-distributed over space, we model a spatial enhancement mechanism inside each feature group by scaling the feature vectors at all locations with an attention mask. This attention mask is intentionally designed to suppress possible noise and highlight the correct semantic feature regions. Different from other popular attention methods wang2017residual ; hu2018squeeze ; li2019selective ; park2018bam ; woo2018cbam , we use the similarity between the global statistical feature and the local feature at each location to generate the attention masks. This simple yet effective mechanism is our proposed Spatial Group-wise Enhance (SGE) module, which is extremely lightweight and by its nature requires almost no additional parameters or calculations.
We examine the changes in the spatial distribution of the feature map and the statistics of the variance of the activation values in each group after introducing the SGE module. The results show that SGE significantly improves the spatial distribution of the different semantic sub-features within each group and statistically produces a larger variance, which strengthens feature learning in semantic regions while suppressing noise and interference.
We show on the ImageNet russakovsky2015imagenet benchmark that the SGE module performs better than or comparably to a series of recently proposed state-of-the-art attention modules, while being more favorable in both parameter count and computational complexity. Meanwhile, on the most advanced detectors (Faster/Mask/Cascade RCNN ren2015faster ; he2017mask ; cai2018cascade ), SGE consistently brings more than 1% AP gain on the COCO lin2014microsoft benchmark. Notably, on RetinaNet lin2017focal , SGE outperforms the widely used SE hu2018squeeze module on detecting small objects by 1% AP, which demonstrates its remarkable advantages in accurate spatial modeling.
2 Related Work
Grouped Features. Learning and distributing features into groups in convolutional networks has been widely studied recently. AlexNet krizhevsky2012imagenet first introduced group convolution, dividing features into two groups on different GPUs to save computational resources. ResNeXt xie2017aggregated examines the importance of grouping in feature transformation and suggests that the number of groups should be increased to obtain higher accuracy under similar model complexity. The MobileNet series howard2017mobilenets ; sandler2018mobilenetv2 ; howard2019searching and Xception carreira1998xception treat each channel as a group and model only the spatial relationships inside these groups. The ShuffleNet zhang1707shufflenet ; ma2018shufflenet family rearranges the grouped features to produce efficient feature representations. Res2Net gao2019res2net uses a hierarchical mode to transform grouped sub-features, enabling the network to incorporate multi-scale features in a single bottleneck. CapsuleNet sabour2017dynamic
models each of the grouped neurons as a capsule, where the activities of the neurons within an active capsule represent the various properties of a particular entity present in the image. The overall length of the vector of instantiation parameters represents the existence of the entity, and the orientation of the vector is forced to represent the properties of the entity. In SGE, all enhancements operate inside groups, which saves computational overhead similarly to group convolution. Conceptually, the SGE module adopts the basic modeling assumption of CapsuleNet: the features of each group are able to actively learn various semantic entity representations. At the same time, in the visualizations in this paper, we also use the length of each sub-feature as its activation value, analogous to the probability of the existence of an entity in CapsuleNet.
Attention Mechanisms. The attention mechanism is first proposed for neural machine translation bahdanau2014neural and is later extended to more natural language processing tasks such as text summarization rush2015neural and reading comprehension seo2016bidirectional . Since then, it has also achieved very promising results in the field of computer vision with emerging applications, such as person re-ID chen2018person , image super-resolution zhang2018image , lip reading xu2018lcanet , image classification wang2017residual , and object detection cao2019GCNet . SENet hu2018squeeze brings an effective, lightweight gating mechanism to self-recalibrate the feature map via channel-wise importance. Beyond the channel dimension, BAM park2018bam and CBAM woo2018cbam introduce spatial attention in a similar way. SKNet li2019selective further introduces a dynamic kernel selection mechanism guided by multi-scale group convolutions, improving classification performance with a small number of additional parameters and calculations. GCNet cao2019GCNet fully explores the advantages and disadvantages of the Non-Local wang2018non and SE hu2018squeeze modules, and combines the advantages of both to design a more effective global context module, obtaining compelling results on object detection tasks. SGE differs from all existing attention mechanisms in that it aims at improving the learning of different semantic sub-features in each group by intentionally self-enhancing their spatial distribution within the group. Compared to other attention modules, SGE has fewer parameters, less computational complexity (Table 1), and a more interpretable mechanism (Figure 2).
3.1 Spatial Group-wise Enhance
We consider a convolutional feature map with C channels and spatial size H × W, and divide it into G groups along the channel dimension. Without loss of generality, we first examine one group separately (see the bottom black box in Figure 1). The group then has a vector representation x_i at every position i in space, namely X = {x_1, ..., x_m}, m = H × W. Conceptually inspired by the capsules sabour2017dynamic , we further assume that this group gradually captures a specific semantic response (such as the dog's eyes) during the course of network learning. In this group space, ideally we can obtain features with strong responses at the eye positions (i.e., features with a larger vector length and similar vector directions among multiple eye regions), whilst the other positions have almost no activation and become zero vectors. However, due to the unavoidable noise and the existence of similar patterns, it is usually difficult for CNNs to obtain such well-distributed feature responses. To address this issue, we propose to utilize the overall information of the entire group space to further enhance the learning of semantic features in critical regions, given the fact that the features of the entire space are not dominated by noise (otherwise the model learns nothing from this group). We can therefore use the global statistical feature, obtained through the spatial averaging function, to approximate the semantic vector that this group learns to represent:

g = (1/m) Σ_{i=1}^{m} x_i .
Next, using this global feature, we can generate the corresponding importance coefficient for each feature, obtained by a simple dot product that to some extent measures the similarity between the global semantic feature g and the local feature x_i. Thereby for each position, we have:

c_i = g · x_i .
Note that c_i can also be expanded as ||g|| ||x_i|| cos(θ_i), where θ_i is the angle between g and x_i. It indicates that features with a larger vector length (i.e., larger ||x_i||) and a direction closer to g (i.e., smaller θ_i) are more likely to obtain a larger initial coefficient, which is in line with our assumptions. In order to prevent biased magnitudes of the coefficients between various samples, we normalize c over the space, as is widely practiced in ioffe2015batch ; wu2018group ; weightstandardization :

ĉ_i = (c_i − μ_c) / (σ_c + ε),  μ_c = (1/m) Σ_{j=1}^{m} c_j,  σ_c² = (1/m) Σ_{j=1}^{m} (c_j − μ_c)² ,
where ε (e.g., 1e-5) is a constant added for numerical stability. To make sure that the normalization inserted in the network can represent the identity transform, we introduce a pair of parameters γ, β for each coefficient ĉ_i, which scale and shift the normalized value:

a_i = γ ĉ_i + β .
Note that γ and β here are the only parameters introduced in our module. In a single SGE unit, their number matches the number of groups G, on the order of tens (typically 32 or 64), which is basically negligible compared to the millions of parameters of the entire network. Finally, to obtain the enhanced feature vector x̂_i, the original x_i is scaled by the generated importance coefficient a_i via a sigmoid gate σ(·) over the space:

x̂_i = x_i · σ(a_i),

and all the enhanced features form the resulting feature group X̂ = {x̂_1, ..., x̂_m}.
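To make the pipeline above concrete, here is a minimal NumPy sketch of one SGE forward pass. The function name, the single-image (C, H, W) layout, and the per-group gamma/beta vectors are our illustrative choices; the actual module is implemented in PyTorch and operates on batches.

```python
import numpy as np

def sge(x, groups=64, gamma=None, beta=None, eps=1e-5):
    """Spatial Group-wise Enhance (sketch). x: feature map of shape (C, H, W)."""
    C, H, W = x.shape
    g_feats = x.reshape(groups, C // groups, H * W)       # sub-features per group
    # global semantic vector g of each group via spatial average pooling
    g = g_feats.mean(axis=2, keepdims=True)               # (G, C/G, 1)
    # initial coefficients: dot product between g and each local feature x_i
    c = (g_feats * g).sum(axis=1)                         # (G, HW)
    # normalize coefficients over the spatial positions of each group
    c = (c - c.mean(axis=1, keepdims=True)) / np.sqrt(c.var(axis=1, keepdims=True) + eps)
    if gamma is None:
        gamma = np.ones((groups, 1))                      # identity-capable scale
    if beta is None:
        beta = np.zeros((groups, 1))                      # identity-capable shift
    a = gamma * c + beta
    # sigmoid gate, then scale every sub-feature at every position
    out = g_feats * (1.0 / (1.0 + np.exp(-a)))[:, None, :]
    return out.reshape(C, H, W)
```

Because the gate is a sigmoid in (0, 1), every output vector has a length no larger than its input vector, which is exactly the suppress-or-keep behavior described above.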
3.2 Visualization and Interpretation
Visualization of Semantic Activation. In order to verify whether our approach achieves the goal of improving the semantic feature representation, we train a ResNet50-based network on ImageNet russakovsky2015imagenet and place the SGE module after the last BatchNorm ioffe2015batch layer of each bottleneck with reference to SENet hu2018squeeze , setting G = 64. To better reflect the semantic information while preserving a large spatial resolution as much as possible, we choose to examine the feature maps of the 4th stage with output size 14 × 14. For each feature vector of each group, we use its length (i.e., ||x_i||) as its activation value and linearly normalize it to the interval [0, 1] for a better view. Figure 2 shows three representative groups with semantic responses. As listed in the three large columns, they are the 18th, 22nd, and 41st groups, which are empirically found to correspond to the concepts of the nose, tongue, and eyes. Each large column contains three small columns: the first is the original image, the second is the feature map response of the original ResNet50, and the third is the feature map response enhanced by the SGE module. We select images of dogs of different angles and types to test the robustness of SGE for feature enhancement. Despite its simplicity, the SGE module is very effective in improving the feature representation of specific semantics at the corresponding locations while suppressing a large amount of noise. It is worth noting that in the 4th and 7th rows, SGE can strongly emphasize the activation of the eye areas, although the eyes are almost closed. In contrast, the original ResNet fails to capture such patterns.
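The visualization above only needs the per-location vector lengths of one group, min-max rescaled to [0, 1]. A small NumPy sketch (the function name and single-image layout are ours):

```python
import numpy as np

def group_activation_map(x, group, groups=64):
    """Activation map of one group: length of its sub-feature at each
    position, linearly rescaled to [0, 1] for visualization."""
    C, H, W = x.shape
    sub = x.reshape(groups, C // groups, H, W)[group]   # (C/G, H, W)
    act = np.linalg.norm(sub, axis=0)                   # vector length per position
    lo, hi = act.min(), act.max()
    return (act - lo) / (hi - lo + 1e-12)               # min-max normalize
```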
The Statistical Change of Activation. We note that if the ideal feature map is obtained, the spatial activation values of the network will show a more pronounced contrast: large or sharp numerical activations in the semantically relevant regions and nearly no response in the non-correlated regions. Such a contrast should correspond, to some extent, to a large variance or sparsity. To validate this, we take the length of each sub-feature (i.e., ||x_i||) as its activation value and calculate the distribution of the variance within each group of the last (highest) residual layer, before and after using the SGE module. These statistics are based on the pretrained SGE-ResNet50, using all samples of the ImageNet validation set (i.e., 50k samples). As shown in Figure 4, the statistical results are in line with our expectations: the response variance of the feature map enhanced by the SGE module is indeed statistically increased, which confirms that SGE strengthens feature learning in semantic regions while suppressing noise. Furthermore, we plot the detailed histogram of the activation values of the first group over each position and all validation samples in Figure 3. We observe that the smaller activation values are biased towards zero while the larger activation values remain nearly unchanged, which statistically implies noise suppression and critical-region enhancement.
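The variance statistic used here can be computed directly from a feature map; a NumPy sketch (shapes and the function name are ours), returning one variance value per group:

```python
import numpy as np

def group_activation_variance(x, groups=64):
    """Spatial variance of sub-feature lengths ||x_i|| within each group.
    x: feature map of shape (C, H, W). Returns an array of G variances."""
    C, H, W = x.shape
    sub = x.reshape(groups, C // groups, H * W)
    lengths = np.linalg.norm(sub, axis=1)   # (G, HW): activation at each position
    return lengths.var(axis=1)              # per-group spatial variance
```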
4 Experiments on Image Classification
We first compare SGE with a set of state-of-the-art attention modules on the ImageNet benchmark. The ImageNet 2012 dataset russakovsky2015imagenet comprises 1.28 million training images and 50k validation images from 1k classes. We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with a single 224 × 224 central crop. For data augmentation, we follow the standard practice szegedy2015going and perform random-size cropping to 224 × 224 and random horizontal flipping. The practical mean channel subtraction is adopted to normalize the input images. All networks are trained with naive softmax cross entropy without label-smoothing regularization szegedy2016rethinking
. We train all the architectures from scratch by synchronous SGD with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from a learning rate of 0.1 and decreasing it by a factor of 10 every 30 epochs. The total batch size is set to 256, and 8 GPUs (32 images per GPU) are utilized for training, using the weight initialization strategy in he2015delving . Our code is implemented in the PyTorch paszke2017pytorch framework. Note that in the following tables, Param. denotes the number of parameters and the definition of FLOPs follows zhang1707shufflenet , i.e., the number of multiply-adds.
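The step schedule described above amounts to the following (a sketch; `lr_at_epoch` is our own helper name, not part of the released code):

```python
def lr_at_epoch(epoch, base_lr=0.1, step=30, factor=0.1):
    """Step decay: divide the learning rate by 10 every 30 epochs."""
    return base_lr * factor ** (epoch // step)
```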
4.1 Comparisons with state-of-the-art Attention Modules
We select a series of state-of-the-art attention modules that are considered to be relatively lightweight, and demonstrate their performance based on ResNet50 and ResNet101 he2016deep ; he2016identity . They include SE hu2018squeeze , SK li2019selective , BAM park2018bam , CBAM woo2018cbam , and GC cao2019GCNet . For a fair comparison, we implement all the attention modules (partially referring to the official code at https://github.com/Jongchan/attention-module and https://github.com/xvjiarui/GCNet) with their respective best settings in a unified pytorch framework. Following hu2018squeeze ; woo2018cbam , these attention modules are placed after the last BatchNorm ioffe2015batch layer inside each bottleneck, except for BAM and SK. BAM park2018bam is naturally designed to sit between stages. SK li2019selective is originally designed on ResNeXt-like bottlenecks with multiple large-kernel group convolutions; to transfer it to the ResNet backbones, we make a slight modification and only append one additional 3 × 3 group (G = 32) convolution to each original 3 × 3 convolution of ResNet, to prevent the parameters and calculations of the corresponding SKNets from becoming too large or too small. From the results in Table 1, we observe that based on ResNet50, SGE is on par with the best entries from CBAM (Top-1) and SK/SE (Top-5) but has much fewer parameters and slightly fewer calculations. On ResNet101, it outperforms most other competing modules by a non-negligible margin. Please note that in our experiments, we find the GC cao2019GCNet module difficult to train from scratch: it is stuck at a high loss for a long time before the training loss begins to decline normally, and therefore does not eventually reach a high accuracy. In the original GCNet paper, the authors do not adopt the commonly used train-from-scratch setting, but instead finetune the GC module on well-pretrained ResNets to report the results.
[Table 1: Backbone | Param. | GFLOPs | Top-1 Acc (%) | Top-5 Acc (%)]
4.2 Ablation Study
In this section, we report the ablation studies on the ImageNet dataset based on SGE-ResNet50, to thoroughly investigate the components of the SGE modules.
Group number G. In the SGE module, the number of groups G controls the number of different semantic sub-features. Since the total number of channels is fixed, too many groups will reduce the sub-feature dimension within each group, leading to a weaker feature representation for each semantic response; on the contrary, too few groups will limit the diversity of semantics. It is natural to speculate that a moderate value of this hyperparameter balances semantic diversity against the representational capacity of each semantic, optimizing network performance. From Figure 5, we can see that as G increases, the performance of the network first rises and then falls (especially in terms of Top-1 accuracy), which is highly consistent with our deduction. Based on the experimental results, we usually recommend 32 or 64 groups. In subsequent experiments, we use G = 64 by default.
Initialization of γ and β. During the experiments, we found that the initialization of the parameters γ and β has a small but non-negligible effect on the results. To investigate this, we grid-search over the initial values {0, 1}. From Table 5 we find that initializing γ to 0 tends to give better results. We speculate that while the ordinary patterns of semantic learning have not yet fully formed in the convolutional feature maps during the initial stage of network training, it may be appropriate to temporarily disable the attention mechanism and let the network learn a basic semantic representation first. After the initial training period, the attention modules then gradually take effect. Therefore, in the early moments of network learning, the attention mechanism of SGE is not suggested to participate heavily in training, which is achieved by setting γ to 0. Such a setting almost simulates the learning process of a network without attention modules during the very early training stage, since each sub-feature at each location is linearly multiplied by the same constant (i.e., σ(β)), whose effect can be cancelled by the following BatchNorm layer.
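The claim that γ = 0 reduces SGE to a constant per-group scaling at initialization can be checked directly; in the sketch below (names are ours), the gate collapses to the same value σ(β) at every location:

```python
import numpy as np

def sge_gate(c, gamma, beta):
    """Attention gate of SGE: sigmoid(gamma * c + beta)."""
    return 1.0 / (1.0 + np.exp(-(gamma * c + beta)))

# With gamma = 0 the gate ignores the normalized coefficients entirely and
# equals the constant sigmoid(beta) everywhere, so at initialization SGE
# scales all sub-features uniformly.
c = np.linspace(-3.0, 3.0, 196)   # coefficients of a 14x14 map
gate = sge_gate(c, gamma=0.0, beta=1.0)
```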
Normalization. To investigate the importance of normalization in SGE modules, we conduct experiments by eliminating the normalization part from SGE (as shown in Table 5) and find that performance is considerably reduced. This confirms our previous conjecture: because the distribution of features generated by different samples for the same semantic group is inconsistent, it is difficult to learn robust importance coefficients without normalization. This is also partially validated in Figure 4, where the variance statistic usually has a relatively large standard deviation. It demonstrates that the variance of the activation values of different samples in the same group can be statistically very different, indicating that normalization is essential for SGE to work.
5 Experiments on Object Detection
We further evaluate the SGE module on object detection on COCO 2017 lin2014microsoft , whose training set comprises 118k images and whose validation set comprises 5k images. We follow the standard setting he2017mask of evaluating object detection via the standard mean Average Precision (AP) scores at different box IoUs and object scales.
The input images are resized so that their shorter side is 800 pixels lin2017feature . We train on 8 GPUs with 2 images per GPU. The backbones of all models are pretrained on ImageNet russakovsky2015imagenet (directly borrowed from the models listed in Table 1); then all layers except for the first two stages are jointly finetuned with an FPN lin2017feature neck and a set of detector heads. Following the conventional finetuning setting he2017mask , the BatchNorm layers are frozen during finetuning. All models are trained for 24 epochs using synchronized SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is initialized to 0.02 and decays by a factor of 10 at the 18th and 22nd epochs. The choice of hyperparameters follows the latest release of the detection benchmark mmdetection2018 .
5.1 Experiments on state-of-the-art Detectors
We embed the SGE module into popular detector frameworks separately to check whether the enhanced feature map helps to detect objects. We select three popular two-stage detection frameworks, Faster RCNN ren2015faster , Mask RCNN he2017mask , and Cascade RCNN cai2018cascade , and choose the widely used FPN lin2017feature as the detection neck. For a fair comparison, we only replace the ImageNet-pretrained backbone model while keeping all other components of the detector intact. Table 4 shows the performance of embedding the SGE module into the backbones of these state-of-the-art detectors. We find that although SGE introduces almost no additional parameters or calculations, the gain in detection performance is still very noticeable, basically more than 1% AP. It is worth noting that SGE brings more prominent gains on stronger detectors (+1.5% AP on ResNet50 and +1.8% on ResNet101 with Cascade RCNN).
| Backbone | Param. | GFLOPs | Detector | AP | AP_50 | AP_75 |
| ResNet50 he2016deep | 23.51M | 88.032 | Faster RCNN ren2015faster | 37.5 | 59.1 | 40.6 |
| SGE-ResNet50 | 23.51M | 88.149 | Faster RCNN ren2015faster | 38.7 (+1.2) | 60.8 | 41.7 |
| ResNet50 he2016deep | 23.51M | 88.032 | Mask RCNN he2017mask | 38.6 | 60.0 | 41.9 |
| SGE-ResNet50 | 23.51M | 88.149 | Mask RCNN he2017mask | 39.6 (+1.0) | 61.5 | 42.9 |
| ResNet50 he2016deep | 23.51M | 88.032 | Cascade RCNN cai2018cascade | 41.1 | 59.3 | 44.8 |
| SGE-ResNet50 | 23.51M | 88.149 | Cascade RCNN cai2018cascade | 42.6 (+1.5) | 61.4 | 46.2 |
| ResNet101 he2016deep | 42.50M | 167.908 | Faster RCNN ren2015faster | 39.4 | 60.7 | 43.0 |
| SGE-ResNet101 | 42.50M | 168.099 | Faster RCNN ren2015faster | 41.0 (+1.6) | 63.0 | 44.3 |
| ResNet101 he2016deep | 42.50M | 167.908 | Mask RCNN he2017mask | 40.4 | 61.6 | 44.2 |
| SGE-ResNet101 | 42.50M | 168.099 | Mask RCNN he2017mask | 42.1 (+1.7) | 63.7 | 46.1 |
| ResNet101 he2016deep | 42.50M | 167.908 | Cascade RCNN cai2018cascade | 42.6 | 60.9 | 46.4 |
| SGE-ResNet101 | 42.50M | 168.099 | Cascade RCNN cai2018cascade | 44.4 (+1.8) | 63.2 | 48.4 |
5.2 Comparisons with state-of-the-art Attention Modules
Next, we choose a representative one-stage detection framework, RetinaNet lin2017focal , to compare SGE with several competitive state-of-the-art attention modules, especially on objects of three different scales. For a fair comparison, the original backbones are replaced with the corresponding attention-embedded ResNets pretrained on ImageNet. In Table 5, SGE greatly improves the detection accuracy for small objects, while its performance on medium and large objects is close to the optimal ones (41.2 vs. 41.3 from SE and 49.9 vs. 50.4 from SK), indicating that the SGE module is able to retain the feature representation of precise spatial areas well and is very robust to various object scales. Conversely, the SE/SK modules bring only a small increase on small objects: in each channel, they allocate the same importance coefficient to every location in space, probably resulting in a loss of the ability to express the details of micro-regions.
| Backbone | Param. | GFLOPs | AP_S | AP_M | AP_L |
| SE-ResNet50 hu2018squeeze | 26.04M | 88.152 | 20.7 (+0.8) | 41.3 (+1.7) | 50.0 (+1.7) |
| SK-ResNet50 li2019selective | 24.11M | 89.414 | 20.2 (+0.3) | 40.9 (+1.3) | 50.4 (+2.1) |
| BAM-ResNet50 park2018bam | 23.87M | 89.804 | 19.6 (-0.3) | 40.1 (+0.5) | 49.9 (+1.6) |
| CBAM-ResNet50 woo2018cbam | 26.04M | 88.302 | 21.8 (+1.9) | 40.8 (+1.2) | 49.5 (+1.2) |
| SGE-ResNet50 | 23.51M | 88.149 | 21.8 (+1.9) | 41.2 (+1.6) | 49.9 (+1.6) |
6 Conclusion

We propose a Spatial Group-wise Enhance (SGE) module that enables each of its feature groups to autonomously enhance its learnt semantic representation and suppress possible noise, while introducing almost no additional parameters or computational complexity. We show visually that feature groups have the ability to express different semantics, and that the SGE module significantly enhances this ability. Despite its simplicity, SGE achieves steady improvements on both image classification and detection tasks, demonstrating its compelling effectiveness in practice.
- (1) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- (2) Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
- (3) Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.
- (4) João Carreira, Henrique Madeira, and João Gabriel Silva. Xception: A technique for the experimental evaluation of dependability in modern computers. Transactions on Software Engineering, 1998.
- (5) Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Ying Tai. Person search via a mask-guided two-stream cnn model. arXiv preprint arXiv:1807.08107, 2018.
- (6) Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.
- (7) Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
- (8) Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. arXiv preprint arXiv:1904.01169, 2019.
- (9) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
- (10) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- (11) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- (12) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
- (13) Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. arXiv preprint arXiv:1905.02244, 2019.
- (14) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- (15) Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
- (16) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- (17) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (18) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
- (19) Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In NeurIPS, 1990.
- (20) Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In CVPR, 2019.
- (21) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- (22) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
- (23) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
- (24) David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- (25) Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
- (26) Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Bam: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
- (27) Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 2017.
- (28) Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
- (29) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- (30) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
- (31) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
- (32) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In NeurIPS, 2017.
- (33) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
- (34) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
- (35) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (36) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- (37) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- (38) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- (39) Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
- (40) Wenhai Wang, Xiang Li, Jian Yang, and Tong Lu. Mixed link networks. arXiv preprint arXiv:1802.01808, 2018.
- (41) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
- (42) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. arXiv preprint arXiv:1807.06521, 2018.
- (43) Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
- (44) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
- (45) Kai Xu, Dawei Li, Nick Cassimatis, and Xiaolong Wang. Lcanet: End-to-end lipreading with cascaded attention-ctc. In International Conference on Automatic Face & Gesture Recognition, 2018.
- (46) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
- (47) Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. arXiv preprint arXiv:1807.02758, 2018.