Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks

05/23/2019 ∙ by Xiang Li, et al. ∙ Nanjing University

Convolutional Neural Networks (CNNs) generate the feature representation of complex objects by collecting hierarchical and different parts of semantic sub-features. These sub-features can usually be distributed in grouped form in the feature vector of each layer, representing various semantic entities. However, the activation of these sub-features is often spatially affected by similar patterns and noisy backgrounds, resulting in erroneous localization and identification. We propose a Spatial Group-wise Enhance (SGE) module that can adjust the importance of each sub-feature by generating an attention factor for each spatial location in each semantic group, so that every individual group can autonomously enhance its learnt expression and suppress possible noise. The attention factors are guided only by the similarities between the global and local feature descriptors inside each group, so the SGE module is extremely lightweight, with almost no extra parameters or calculations. Despite being trained with only category-level supervision, the SGE component is extremely effective in highlighting multiple active areas with various high-order semantics (such as the dog's eyes, nose, etc.). When integrated with popular CNN backbones, SGE can significantly boost the performance of image recognition tasks. Specifically, based on ResNet50 backbones, SGE achieves 1.2% Top-1 accuracy improvement on the ImageNet benchmark and 1.0∼2.0% AP gain on the COCO benchmark across a wide range of detectors (Faster/Mask/Cascade RCNN and RetinaNet). Code and pretrained models are available at




1 Introduction

The idea of grouping features is long-standing. In early computer vision research, many hand-designed image features were presented in groups, such as SIFT lowe2004distinctive and HOG dalal2005histograms . For example, a HOG vector comes from several spatial cells where each cell is represented by a normalized orientation histogram. With the rapid development of CNNs lecun1990handwritten ; krizhevsky2012imagenet ; simonyan2014very ; szegedy2015going ; he2016deep ; huang2017densely ; wang2018mixed , there are widely used module designs that introduce the grouping methodology, such as group convolution xie2017aggregated and group normalization wu2018group . These techniques typically group features along the channel dimension in a convolutional feature map into multiple sub-features, and use general convolution or normalization for the transformations of these sub-features in each group. In CapsuleNet sabour2017dynamic , the grouped sub-features are modeled as capsules, which represent the instantiation parameters of a specific type of entity, such as an object or an object part.

In addition to grouping the channel dimension into multiple sub-features to represent different semantics, we also need to consider another important dimension in the convolutional feature map: the space. For a particular semantic group, it is reasonable and beneficial to generate the corresponding semantic features at the correct spatial positions of the original image. However, due to the lack of supervision of specific region details and possible noise in the image, the spatial distribution of the semantic features can become disordered, which considerably weakens the learned representation and makes it difficult to build a hierarchical understanding (see Figure 1).

In order to make each set of features robust and well-distributed over the space, we model a spatial enhance mechanism inside each feature group, by scaling the feature vectors over all the locations with an attention mask. Such an attention mask is designed intentionally to suppress the possible noise and highlight the correct semantic feature regions. Different from other popular attention methods wang2017residual ; hu2018squeeze ; li2019selective ; park2018bam ; woo2018cbam , we use the similarity between the global statistical feature and the local ones of each location as the source of generation for the attention masks. This simple yet effective mechanism described above is our proposed Spatial Group-wise Enhance (SGE) module, which is extremely lightweight and requires almost no additional parameters and calculations by its nature.

We examine the changes in the distribution of the feature map and the statistics of the variance of the activation values for each group after the introduction of the SGE module. The results show that SGE significantly improves the spatial distribution of different semantic sub-features within its group, and produces a large variance statistically, which strengthens the feature learning in semantic regions and compresses the noise and interference.

We show on the ImageNet russakovsky2015imagenet benchmark that the SGE module performs better than or comparably to a series of recently proposed state-of-the-art attention modules, while requiring fewer parameters and less computation. Meanwhile, for the most advanced detectors (Faster/Mask/Cascade RCNN ren2015faster ; he2017mask ; cai2018cascade ), SGE brings gains of 1% AP or more on the COCO lin2014microsoft benchmark. Notably, on RetinaNet lin2017focal , SGE outperforms the widely used SE hu2018squeeze module on detecting small objects by about 1% AP, which demonstrates its advantage in accurate spatial modeling.

Figure 1: Illustration of the proposed lightweight SGE module. It processes the sub-features of each group in parallel, and uses the similarity between global statistical feature and local positional features in each group as the attention guidance to enhance the features, thus obtaining well-distributed semantic feature representations in space.

2 Related Work

Grouped Features. Learning and distributing features into groups in convolutional networks has been widely studied recently. AlexNet krizhevsky2012imagenet initially presents the group convolution and divides features into two groups on different GPUs to save computing budgets. ResNeXt xie2017aggregated examines the importance of grouping in feature transfer and suggests that the number of groups should be increased to obtain higher accuracy under similar model complexity. The MobileNet series howard2017mobilenets ; sandler2018mobilenetv2 ; howard2019searching and Xception carreira1998xception treat each channel as a group and model only spatial relationships inside these groups. The ShuffleNet zhang1707shufflenet ; ma2018shufflenet family rearranges the grouped features to produce efficient feature representation. Res2Net gao2019res2net uses a hierarchical mode to transfer grouped sub-features, enabling the network to incorporate multi-scale features in a single bottleneck. CapsuleNet sabour2017dynamic models each of the grouped neurons as a capsule, where the activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. The overall length of the vector of instantiation parameters is used to represent the existence of the entity and the orientation of the vector is forced to represent the properties of the entity. In SGE, all enhancements are operated inside groups, which saves computational overhead as in group convolution. Conceptually, the SGE module adopts the basic modeling assumptions of CapsuleNet, and assumes that the features of each group are able to actively learn various semantic entity representations. Likewise, in the visualizations of this paper, we use the length of each sub-feature as its activation value, analogous to the probability of the existence of entities in CapsuleNet.

Attention Models. Attention models have recently become very popular. They first attracted widespread attention in the field of machine translation bahdanau2014neural ; vaswani2017attention and were later extended to more natural language processing tasks such as text summarization rush2015neural and reading comprehension seo2016bidirectional . Since then, they have also achieved very promising results in the field of computer vision with emerging applications, such as person re-ID chen2018person , image recovery zhang2018image , lip reading xu2018lcanet , image classification wang2017residual , and object detection cao2019GCNet . SENet hu2018squeeze brings an effective, lightweight gating mechanism to self-recalibrate the feature map via channel-wise importances. Beyond channel, BAM park2018bam and CBAM woo2018cbam introduce spatial attention in a similar way. SKNet li2019selective further introduces a dynamic kernel selection mechanism which is guided by the multi-scale group convolutions, with a small number of additional parameters and calculations to improve the classification performance. GCNet cao2019GCNet fully explores the advantages and disadvantages of Non-Local wang2018non and SE hu2018squeeze modules, and combines the advantages of both to design a more effective global context module, obtaining compelling results on object detection tasks. SGE differs from all existing attention mechanisms in that it aims at improving the learning of different semantic sub-features of each group, intentionally self-enhancing its spatial distribution within the group. Compared to other attention modules, SGE has fewer parameters, less computational complexity (Table 1), and a more interpretable mechanism (Figure 2).

Figure 2: We select several feature groups with representative semantics to display before and after using SGE on ResNet50. The semantics of the activated regions are found to be the nose from the 18th group, the tongue from the 22nd group, and the eyes from the 41st group, respectively. We sample images of different shapes, categories, and angles to verify the robustness of the SGE module.

3 Method

3.1 Spatial Group-wise Enhance

We consider a $C$-channel, $H \times W$ convolutional feature map and divide it into $G$ groups along the channel dimension. Without loss of generality, we first examine a certain group separately (see the bottom black box in Figure 1). Then the group has a vector representation at every position in space, namely $\mathcal{X} = \{x_1, \dots, x_m\}$, $x_i \in \mathbb{R}^{C/G}$, $m = H \times W$. Conceptually inspired by the capsules sabour2017dynamic , we further assume that this group gradually captures a specific semantic response (such as the dog's eyes) during the course of network learning. In this group space, ideally we can get features with strong responses at the eye positions (i.e., features with a larger vector length and similar vector directions among multiple eye regions), whilst other positions almost have no activation and become zero vectors. However, due to the unavoidable noise and the existence of similar patterns, it is usually difficult for CNNs to obtain the well-distributed feature responses. To address this issue, we propose to utilize the overall information of the entire group space to further enhance the learning of semantic features in critical regions, given the fact that the features of the entire space are not dominated by noise (otherwise the model learns nothing from this group). Therefore we can use the global statistical feature through the spatial averaging function $\mathcal{F}_{gp}(\cdot)$ to approximate the semantic vector that this group learns to represent:

$$g = \mathcal{F}_{gp}(\mathcal{X}) = \frac{1}{m} \sum_{i=1}^{m} x_i$$

Next, using this global feature, we can generate the corresponding importance coefficient for each feature, which is obtained by a simple dot product that measures the similarity between the global semantic feature $g$ and the local feature $x_i$ to some extent. Thereby for each position, we have:

$$c_i = g \cdot x_i$$

Note that $c_i$ can also be expanded as $\|g\| \, \|x_i\| \cos(\theta_i)$, where $\theta_i$ is the angle between $g$ and $x_i$. It indicates that features that have a larger vector length (i.e., $\|x_i\|$) and a direction (i.e., $\theta_i$) closer to $g$ are more likely to obtain a larger initial coefficient, which is in line with our assumptions. In order to prevent the biased magnitude of coefficients between various samples, we normalize $c$ over the space, as is widely practiced in ioffe2015batch ; wu2018group ; weightstandardization :

$$\hat{c}_i = \frac{c_i - \mu_c}{\sigma_c + \epsilon}, \qquad \mu_c = \frac{1}{m} \sum_{j=1}^{m} c_j, \qquad \sigma_c^2 = \frac{1}{m} \sum_{j=1}^{m} (c_j - \mu_c)^2$$

where $\epsilon$ (e.g., 1e-5) is a constant added for numerical stability. To make sure that the normalization inserted in the network can represent the identity transform, we introduce a pair of parameters $\gamma, \beta$ for each coefficient $\hat{c}_i$, which scale and shift the normalized value:

$$a_i = \gamma \hat{c}_i + \beta$$

Note that $\gamma, \beta$ here are the only parameters introduced in our module. In a single SGE unit, the number of $\gamma, \beta$ is the same as the number of groups $G$, and the order of their magnitude is about tens (typically, 32 or 64), which is basically negligible compared to the millions of parameters of the entire network. Finally, to obtain the enhanced feature vector $\hat{x}_i$, the original $x_i$ is scaled by the generated importance coefficients $a_i$ via a sigmoid function gate $\sigma(\cdot)$ over the space:

$$\hat{x}_i = x_i \cdot \sigma(a_i)$$

and all the enhanced features form the resulting feature group $\hat{\mathcal{X}} = \{\hat{x}_1, \dots, \hat{x}_m\}$, $m = H \times W$.
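The steps of Section 3.1 (spatial averaging, dot-product similarity, normalization over space, per-group scale and shift, sigmoid gating) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' released code: the function name `sge_forward`, the `(N, C, H, W)` layout, and the use of the population standard deviation are assumptions made here.

```python
import numpy as np


def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def sge_forward(x, groups, gamma, beta, eps=1e-5):
    """Sketch of one SGE unit (hypothetical helper, not the official API).

    x      : array of shape (N, C, H, W)
    groups : number of groups G (must divide C)
    gamma  : array of shape (G,), per-group scale
    beta   : array of shape (G,), per-group shift
    """
    n, c, h, w = x.shape
    assert c % groups == 0, "C must be divisible by G"
    # Split channels into G groups; each position holds a (C/G)-dim vector.
    xg = x.reshape(n, groups, c // groups, h * w)
    # Step 1: global descriptor g = spatial average of the local vectors.
    g = xg.mean(axis=3, keepdims=True)               # (N, G, C/G, 1)
    # Step 2: coefficient = dot product between g and each local vector.
    coef = (g * xg).sum(axis=2)                      # (N, G, m)
    # Step 3: normalize the coefficients over the m spatial positions.
    mu = coef.mean(axis=2, keepdims=True)
    sigma = coef.std(axis=2, keepdims=True)          # population std (assumption)
    coef_hat = (coef - mu) / (sigma + eps)
    # Step 4: per-group scale and shift.
    a = gamma.reshape(1, groups, 1) * coef_hat + beta.reshape(1, groups, 1)
    # Step 5: gate every local vector with sigmoid(a).
    out = xg * sigmoid(a)[:, :, None, :]
    return out.reshape(n, c, h, w)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 7, 7))
G = 8
y = sge_forward(x, G, gamma=np.ones(G), beta=np.zeros(G))
print(y.shape)
```

Because the gate lies strictly between 0 and 1, the output never exceeds the input in magnitude; the module can only attenuate positions relative to one another, and it relies on the following layers to rescale.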

3.2 Visualization and Interpretation

Visualization of Semantic Activation. In order to verify whether our approach achieves the goal of improving the semantic feature representation, we train a network based on ResNet50 on ImageNet russakovsky2015imagenet and place the SGE module after the last BatchNorm ioffe2015batch layer of each bottleneck with reference to SENet hu2018squeeze , setting $G = 64$. To better reflect the semantic information while preserving as much spatial resolution as possible, we choose to examine the feature maps of the 4th stage, whose output size is 14 × 14. For each feature vector of each group, we use its length (i.e., $\|x_i\|$) as its activation value and linearly normalize it to the interval [0, 1] for a better view. Figure 2 shows three representative groups with semantic responses. As listed in three large columns, they are the 18th, 22nd, and 41st groups, which are empirically found to correspond to the concepts of the nose, tongue, and eyes. Each large column contains three small columns, where the first small column is the original image, the second small column is the feature map response from the original ResNet50, and the third one is the feature map response enhanced by the SGE module. We select images of dogs of different angles and types to test the robustness of SGE for feature enhancement. Despite its simplicity, the SGE module is very effective in improving the feature representation of specific semantics at corresponding locations while suppressing a large amount of noise. It is worth noting that in the 4th and 7th rows, SGE strongly emphasizes the activation of the eye areas, although the eyes are almost closed. In contrast, the original ResNet fails to capture such patterns.
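The visualization procedure described above (take the vector length at each position of one group, then rescale linearly to [0, 1]) can be sketched as follows. The helper name `group_activation_map` is hypothetical, the random tensor stands in for a real stage-4 feature map, and treating "the 18th group" as zero-based index 17 is an assumption.

```python
import numpy as np


def group_activation_map(feat, groups, group_index):
    """Return an (H, W) map in [0, 1] for one group of a (C, H, W) feature map."""
    c, h, w = feat.shape
    assert c % groups == 0, "C must be divisible by G"
    sub = feat.reshape(groups, c // groups, h, w)[group_index]   # (C/G, H, W)
    length = np.linalg.norm(sub, axis=0)       # vector length at each position
    lo, hi = length.min(), length.max()
    if hi == lo:                               # flat map: avoid division by zero
        return np.zeros_like(length)
    return (length - lo) / (hi - lo)           # linear rescale to [0, 1]


# Placeholder for a ResNet50 stage-4 feature map: 1024 channels at 14 x 14.
feat = np.random.default_rng(0).standard_normal((1024, 14, 14))
heatmap = group_activation_map(feat, groups=64, group_index=17)
print(heatmap.shape, heatmap.min(), heatmap.max())
```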

Figure 3: Histogram of activations.

The Statistical Change of Activation. If the ideal feature map is obtained, the spatial activation values should show a more pronounced contrast: large, sharp activations in the semantically relevant regions and nearly no response elsewhere. Such contrast should correspond, to some extent, to a larger variance or greater sparsity. To validate this, we take the length of each sub-feature (i.e., $\|x_i\|$) as the activation value, and calculate the distribution of its variance in each group of the last (highest) residual layer before and after using the SGE module. These statistics are based on the pretrained SGE-ResNet50, using all the samples of the ImageNet validation set (i.e., 50k samples). As shown in Figure 4, the statistical results are in line with our expectations. The response variance of the feature map enhanced by the SGE module is indeed statistically increased, which is consistent with SGE capturing semantic features more accurately. Furthermore, we plot the detailed histogram of the activation values of the first group over each position and all validation samples in Figure 3. It is observed that the smaller activation values are pushed towards zero while the larger activation values remain nearly unchanged, which statistically implies noise suppression and critical-region enhancement.
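One plausible reading of this statistic is sketched below: for every sample and group, compute the variance of the activation lengths over spatial positions, then summarize across samples by mean and standard deviation. The exact aggregation used for Figure 4 is not spelled out in the text, so the function `group_variance_stats` and its reduction order are assumptions.

```python
import numpy as np


def group_variance_stats(feats, groups):
    """feats: (N, C, H, W). Returns (mean, std) across samples of the spatial
    variance of activation lengths; each result has shape (G,)."""
    n, c, h, w = feats.shape
    assert c % groups == 0, "C must be divisible by G"
    sub = feats.reshape(n, groups, c // groups, h * w)
    length = np.linalg.norm(sub, axis=2)     # (N, G, m) activation values
    var = length.var(axis=2)                 # (N, G) spatial variance per sample
    return var.mean(axis=0), var.std(axis=0)


# A flat map has zero spatial variance; a single sharp peak has a large one.
flat = np.ones((4, 8, 3, 3))
peaked = np.zeros((4, 8, 3, 3))
peaked[:, :, 1, 1] = 10.0
print(group_variance_stats(flat, 2)[0], group_variance_stats(peaked, 2)[0])
```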

Figure 4: The distribution of variance of activation values of each group, from the feature maps before and after SGE module in the last bottleneck of SGE-ResNet50. Standard deviation is also plotted.

4 Experiments on Image Classification

We first compare SGE with a set of state-of-the-art attention modules on the ImageNet benchmark. The ImageNet 2012 dataset russakovsky2015imagenet comprises 1.28 million training images and 50k validation images from 1k classes. We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with a single 224 × 224 central crop. For data augmentation, we follow the standard practice szegedy2015going and perform random-size cropping to 224 × 224 and random horizontal flipping. The practical mean channel subtraction is adopted to normalize the input images. All networks are trained with naive softmax cross entropy without label-smoothing regularization szegedy2016rethinking . We train all the architectures from scratch by synchronous SGD with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training, using the weight initialization strategy in . Our code is implemented in the PyTorch paszke2017pytorch framework. Note that in the following tables, Param. denotes the number of parameters and the definition of FLOPs follows zhang1707shufflenet , i.e., the number of multiply-adds.
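As a quick check of the training recipe, the step schedule and per-GPU batch size can be written out in plain Python. This is only a sketch: the helper `step_lr` is hypothetical, and it assumes zero-based epoch indices with the decay applied at the start of epochs 30, 60, and 90.

```python
def step_lr(epoch, base_lr=0.1, step=30, factor=0.1):
    """Learning rate for a zero-based epoch index under step decay."""
    return base_lr * factor ** (epoch // step)


TOTAL_BATCH, NUM_GPUS, EPOCHS = 256, 8, 100
per_gpu_batch = TOTAL_BATCH // NUM_GPUS            # images per GPU
schedule = [step_lr(e) for e in range(EPOCHS)]     # one value per epoch

for e in (0, 29, 30, 60, 90, 99):
    print(e, f"{schedule[e]:.4f}")
```

Under these assumptions the run spends 30 epochs at each of the first three rates and the final 10 epochs at the lowest one.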

4.1 Comparisons with state-of-the-art Attention Modules

We select a series of state-of-the-art attention modules that are considered relatively lightweight, and compare their performance on ResNet50 and ResNet101 he2016deep ; he2016identity . They include SE hu2018squeeze , SK li2019selective , BAM park2018bam , CBAM woo2018cbam , and GC cao2019GCNet . For a fair comparison, we implement all the attention modules (partially referring to the official code) with their respective best settings in a unified PyTorch framework. Following hu2018squeeze ; woo2018cbam , these attention modules are placed after the last BatchNorm ioffe2015batch layer inside each bottleneck, except for BAM and SK. BAM park2018bam is naturally designed to sit between stages. SK li2019selective is originally designed on ResNeXt-like bottlenecks with multiple large-kernel group convolutions. To transfer it to the ResNet backbones, we make a slight modification and only append one additional 3 × 3 group ($G = 32$) convolution to each original 3 × 3 convolution of ResNet, to prevent the parameters and calculations of the corresponding SKNets from being too large or too small. From the results of Table 1, we observe that based on ResNet50, SGE is on par with the best entries from CBAM (Top-1) and SK/SE (Top-5) but has far fewer parameters and slightly less computation. With ResNet101, it outperforms most other competing modules by a non-negligible margin. Note that in our experiments, the GC cao2019GCNet module is difficult to train from scratch: it stays at a high loss for a long time before the training loss begins to decline normally, and therefore does not eventually reach a high accuracy. In the original GCNet paper, the authors do not adopt the commonly used training-from-scratch setting, but finetune the GC module on well-pretrained ResNets to report the results.

Backbone Param. GFLOPs Top-1 Acc (%) Top-5 Acc (%)
ResNet50 he2016deep 25.56M 4.122 76.3840 92.9080
SE-ResNet50 hu2018squeeze 28.09M 4.130 77.1840 93.6720
SK-ResNet50 li2019selective 26.15M 4.185 77.5380 93.7000
BAM-ResNet50 park2018bam 25.92M 4.205 76.8980 93.4020
CBAM-ResNet50 woo2018cbam 28.09M 4.139 77.6260 93.6600
GC-ResNet50 cao2019GCNet 28.11M 4.130 73.8880 91.6800
SGE-ResNet50 25.56M 4.127 77.5840 93.6640
ResNet101 he2016deep 44.55M 7.849 78.2000 93.9060
SE-ResNet101 hu2018squeeze 49.33M 7.863 78.4680 94.1020
SK-ResNet101 li2019selective 45.68M 7.978 78.7920 94.2680
BAM-ResNet101 park2018bam 44.91M 7.933 78.2180 94.0180
CBAM-ResNet101 woo2018cbam 49.33M 7.879 78.3540 94.0640
GC-ResNet101 cao2019GCNet 49.36M 7.863 74.6420 92.0720
SGE-ResNet101 44.55M 7.858 78.7980 94.3680
Table 1: Comparisons to the state-of-the-art attention modules on ImageNet validation set. Single 224 × 224 central crop is adopted for evaluation. All results are reproduced in the PyTorch framework. The SK entries are the modified versions based on ResNet backbones. The best and the second best records are marked as bold and blue, respectively.
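To make the cost/accuracy trade-off in Table 1 easier to read, the ResNet50 rows can be turned into differences against the plain backbone. The snippet only re-uses numbers copied from the table; the dictionary layout and variable names are illustrative.

```python
# (params in millions, GFLOPs, top-1 %) copied from the ResNet50 rows of Table 1.
table1 = {
    "ResNet50": (25.56, 4.122, 76.384),
    "SE": (28.09, 4.130, 77.184),
    "SK": (26.15, 4.185, 77.538),
    "BAM": (25.92, 4.205, 76.898),
    "CBAM": (28.09, 4.139, 77.626),
    "GC": (28.11, 4.130, 73.888),
    "SGE": (25.56, 4.127, 77.584),
}
base = table1["ResNet50"]
# Difference of every module against the baseline, rounded to table precision.
deltas = {
    name: tuple(round(v - b, 3) for v, b in zip(row, base))
    for name, row in table1.items()
    if name != "ResNet50"
}
for name, (d_param, d_flops, d_top1) in deltas.items():
    print(f"{name:>4}: {d_param:+.2f}M params, {d_flops:+.3f} GFLOPs, {d_top1:+.3f} top-1")
```

At the two-decimal precision of the table, SGE is the only module whose parameter count is indistinguishable from the baseline, while its Top-1 gain matches the 1.2% quoted in the abstract.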

4.2 Ablation Study

In this section, we report the ablation studies on the ImageNet dataset based on SGE-ResNet50, to thoroughly investigate the components of the SGE modules.

Figure 5: Performance of SGE-ResNet50 as a function of group number $G$.

Table 2: Performance of SGE-ResNet50 as a function of initializations of $\gamma$ and $\beta$.
$\gamma$ $\beta$ Top-1 Acc (%) Top-5 Acc (%)
0 0 77.3780 93.7140
0 1 77.5840 93.6640
1 0 77.2200 93.5820
1 1 77.0820 93.7040

Table 3: Performance of SGE-ResNet50 with and without the normalization part.
Norm Top-1 Acc (%) Top-5 Acc (%)
w/ 77.5840 93.6640
w/o 76.4980 93.1580

Group number $G$. In the SGE module, the number of groups $G$ controls the number of different semantic sub-features. Since the total number of channels is fixed, too many groups will result in a reduction in the sub-feature dimension within each group, leading to weaker feature representation for each semantic response; on the contrary, too few groups will limit the diversity of semantics. It is natural to speculate that there is a moderate hyperparameter $G$ that balances semantic diversity and the ability to represent each semantic, so as to optimize network performance. From Figure 5, we can see that with the increase of $G$, the performance of the network first increases and then decreases (especially in terms of Top-1 accuracy), which is highly consistent with our deduction. Based on the experimental results, we usually recommend setting the number of groups to 32 or 64. In subsequent experiments, we use $G = 64$ by default.
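The trade-off can be made concrete by computing the sub-feature dimension C/G at each ResNet50 stage, together with the number of scale and shift parameters SGE adds (two per group per bottleneck). The channel widths and block counts below are the standard ResNet50 values; the list of group numbers is illustrative and not taken from Figure 5.

```python
STAGE_CHANNELS = [256, 512, 1024, 2048]   # ResNet50 bottleneck output widths
BLOCKS_PER_STAGE = [3, 4, 6, 3]           # ResNet50 bottlenecks per stage


def dims_per_group(groups):
    """Sub-feature dimension C/G at each stage for a given group number G."""
    return [c // groups for c in STAGE_CHANNELS]


def sge_param_count(groups):
    """Two parameters (scale, shift) per group in every bottleneck."""
    return 2 * groups * sum(BLOCKS_PER_STAGE)


for g in (1, 8, 32, 64, 256):
    print(g, dims_per_group(g), sge_param_count(g))
```

With 64 groups, each sub-feature in the first stage has only 4 dimensions, which illustrates why pushing the group number much higher leaves little capacity per semantic group.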

Initialization of $\gamma$ and $\beta$. During the experiments, we found that the initialization of the parameters $\gamma$ and $\beta$ has a small but not negligible effect on the result. To investigate this, we run a grid search over the values 0 and 1 for both parameters. From Table 2 we find that initializing $\gamma$ to 0 tends to give better results. We speculate that while the ordinary patterns of semantic learning have not yet formed in the convolutional feature maps during the initial stage of network training, it may be appropriate to temporarily bypass the attention mechanism and let the network learn a basic semantic representation first. After the initial training period, the attention modules can then gradually take effect. Therefore, in the early moments of network learning, we keep the attention mechanism of SGE from participating heavily in training by setting $\gamma$ to 0. Such an operation is almost equivalent to simulating the learning process of a network without attention modules during the very early training stage, since each sub-feature at each location is linearly multiplied by the same constant (i.e., $\sigma(\beta)$), whose effect can be cancelled by the following BatchNorm layer.
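A tiny numeric check illustrates the argument. Writing the scale and shift parameters as `gamma` and `beta`, a zero scale makes the sigmoid gate independent of the normalized coefficient, so every position is multiplied by the same constant. The coefficient values below are arbitrary placeholders.

```python
import math


def gate(c_hat, gamma, beta):
    """Sigmoid gate applied to a normalized coefficient c_hat."""
    return 1.0 / (1.0 + math.exp(-(gamma * c_hat + beta)))


c_hats = [-2.0, -0.5, 0.0, 0.5, 2.0]                      # arbitrary positions
flat = [gate(c, gamma=0.0, beta=1.0) for c in c_hats]     # same value everywhere
varied = [gate(c, gamma=1.0, beta=1.0) for c in c_hats]   # depends on position

print([round(v, 3) for v in flat])
print([round(v, 3) for v in varied])
```

With `gamma = 0` and `beta = 1` every gate equals sigmoid(1), roughly 0.731, which is a uniform rescaling that a subsequent normalization layer can absorb.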

Normalization. To investigate the importance of normalization in SGE modules, we conduct experiments by eliminating the normalization part from SGE (as shown in Table 3) and find that performance is considerably reduced. This confirms our previous conjecture: because the distribution of features generated by different samples for the same semantic group is inconsistent, it is difficult to learn robust importance coefficients without normalization. This is also partially validated in Figure 4, where the variance statistic usually has a relatively large standard deviation. It demonstrates that the variance of the activation values of different samples in the same group can be statistically very different, indicating that normalization is essential for SGE to work.

5 Experiments on Object Detection

We further evaluate the SGE module on object detection on COCO 2017 lin2014microsoft , whose training set comprises 118k images and whose validation set comprises 5k images. We follow the standard setting he2017mask of evaluating object detection via the standard mean Average-Precision (AP) scores at different box IoUs or object scales, respectively.

The input images are resized with their shorter side being 800 pixels lin2017feature . We train on 8 GPUs with 2 images per GPU. The backbones of all models are pretrained on ImageNet russakovsky2015imagenet (directly borrowed from the models listed in Table 1), then all layers except for the first two stages are jointly finetuned with FPN lin2017feature neck and a set of detector heads. Following the conventional finetuning setting he2017mask , the BatchNorm layers are frozen during finetuning. All models are trained for 24 epochs using synchronized SGD with a weight decay of 0.0001 and momentum of 0.9. The learning rate is initialized to 0.02, and decays by a factor of 10 at the 18th and 22nd epochs. The choice of hyper-parameters follows the latest release of the detection benchmark mmdetection2018 .

5.1 Experiments on state-of-the-art Detectors

We embed the SGE modules into each popular detector framework separately to check whether the enhanced feature map helps to detect objects. We select three popular two-stage detection frameworks, including Faster RCNN ren2015faster , Mask RCNN he2017mask , and Cascade RCNN cai2018cascade , and choose the widely used FPN lin2017feature as the detection neck. For a fair comparison, we only replace the ImageNet-pretrained backbone model while keeping the other components of the detector intact. Table 4 shows the performance of embedding the backbone with the SGE module on these state-of-the-art detectors. We find that although SGE introduces almost no additional parameters and calculations, the gain in detection performance is still very noticeable, at 1% AP or more in every case. It is worth noting that the gain from SGE is larger on stronger detectors (+1.5% AP on ResNet50 and +1.8% on ResNet101 in Cascade RCNN).

Backbone Param. GFLOPs Detector AP (%) AP_50 (%) AP_75 (%)
ResNet50 he2016deep 23.51M 88.032 Faster RCNN ren2015faster 37.5 59.1 40.6
SGE-ResNet50 23.51M 88.149 Faster RCNN ren2015faster 38.7 (+1.2) 60.8 41.7
ResNet50 he2016deep 23.51M 88.032 Mask RCNN he2017mask 38.6 60.0 41.9
SGE-ResNet50 23.51M 88.149 Mask RCNN he2017mask 39.6 (+1.0) 61.5 42.9
ResNet50 he2016deep 23.51M 88.032 Cascade RCNN cai2018cascade 41.1 59.3 44.8
SGE-ResNet50 23.51M 88.149 Cascade RCNN cai2018cascade 42.6 (+1.5) 61.4 46.2
ResNet101 he2016deep 42.50M 167.908 Faster RCNN ren2015faster 39.4 60.7 43.0
SGE-ResNet101 42.50M 168.099 Faster RCNN ren2015faster 41.0 (+1.6) 63.0 44.3
ResNet101 he2016deep 42.50M 167.908 Mask RCNN he2017mask 40.4 61.6 44.2
SGE-ResNet101 42.50M 168.099 Mask RCNN he2017mask 42.1 (+1.7) 63.7 46.1
ResNet101 he2016deep 42.50M 167.908 Cascade RCNN cai2018cascade 42.6 60.9 46.4
SGE-ResNet101 42.50M 168.099 Cascade RCNN cai2018cascade 44.4 (+1.8) 63.2 48.4
Table 4: Comparisons based on the state-of-the-art detectors. The Param. and GFLOPs columns cover only the backbone parts, given that all the remaining structures are kept the same for a specific detector. The numbers in brackets denote the improvements over the baseline backbones. The SGE modules tend to obtain a larger gain on the stronger baseline detection models.

5.2 Comparisons with state-of-the-art Attention Modules

Next, we choose a representative one-stage detection framework, RetinaNet lin2017focal , to compare SGE with several competitive state-of-the-art attention modules, especially for objects at three different scales. For a fair comparison, the original backbones are replaced with the corresponding attention-embedded ResNets, which are pretrained on ImageNet. In Table 5, SGE greatly improves the accuracy of detection for small objects while its performance on medium and large objects is close to the best ones (41.2 vs 41.3 from SE and 49.9 vs 50.4 from SK), indicating that the SGE module is able to retain the feature representation of the precise spatial area well and is very robust to various object scales. Conversely, the SE/SK module has only a small increase in the recognition of small objects. For SE/SK, in each channel, the same importance coefficient is allocated to each location of the space, probably resulting in the loss of the ability to express the details of micro-regions.

Backbone Param. GFLOPs AP_S (%) AP_M (%) AP_L (%)
ResNet50 he2016deep 23.51M 88.032 19.9 39.6 48.3
SE-ResNet50 hu2018squeeze 26.04M 88.152 20.7 (+0.8) 41.3 (+1.7) 50.0 (+1.7)
SK-ResNet50 li2019selective 24.11M 89.414 20.2 (+0.3) 40.9 (+1.3) 50.4 (+2.1)
BAM-ResNet50 park2018bam 23.87M 89.804 19.6 (-0.3) 40.1 (+0.5) 49.9 (+1.6)
CBAM-ResNet50 woo2018cbam 26.04M 88.302 21.8 (+1.9) 40.8 (+1.2) 49.5 (+1.2)
SGE-ResNet50 23.51M 88.149 21.8 (+1.9) 41.2 (+1.6) 49.9 (+1.6)
Table 5: Performance on RetinaNet for objects of three scales. The notations are the same as in Table 4. The best and the second best records are marked as bold and blue, respectively. Compared to the SE/SK module, the detection of small objects from SGE has been significantly improved.
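The contrast drawn above between channel-wise gating (SE/SK) and spatial group-wise gating (SGE) comes down to the shape of the gate that multiplies the feature map. The sketch below uses random placeholder gates purely to show the broadcasting pattern; it does not reproduce how either module actually computes its gate.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, G = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))

# Channel-wise gate (SE-style): one scalar per channel, shared by all positions.
channel_gate = rng.uniform(0.1, 1.0, size=(C, 1, 1))
y_channel = x * channel_gate

# Spatial group-wise gate (SGE-style): one scalar per group and per position.
spatial_gate = rng.uniform(0.1, 1.0, size=(G, 1, H, W))
y_spatial = (x.reshape(G, C // G, H, W) * spatial_gate).reshape(C, H, W)

# Ratio of output to input at every position of channel 0.
ratio_channel = y_channel[0] / x[0]
ratio_spatial = y_spatial[0] / x[0]
print(np.ptp(ratio_channel), np.ptp(ratio_spatial))
```

The channel-wise ratio is constant over the map, so it cannot favor one location over another, whereas the spatial gate varies with position; this is the property the text associates with better small-object detection.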

6 Conclusion

We propose a Spatial Group-wise Enhance (SGE) module that enables each of its feature groups to autonomously enhance its learnt semantic representation and suppress possible noise, nearly without introducing additional parameters and computational complexity. We visually show that the feature groups have the ability to express different semantics, while the SGE module can significantly enhance this ability. Despite its simplicity, SGE has achieved a steady improvement in both image classification and detection tasks, which demonstrates its compelling effectiveness in practice.