
Attentional Feature Fusion

by Yimian Dai, et al.

Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online.



1 Introduction

Convolutional neural networks (CNNs) have seen a significant improvement of the representation power by going deeper [13], going wider [39, 50], increasing cardinality [48], and refining features dynamically [16], corresponding to advances in many computer vision tasks.

Apart from these strategies, in this paper, we investigate a different component of the network, feature fusion, to further boost the representation power of CNNs. Whether explicit or implicit, intentional or unintentional, feature fusion is omnipresent in modern network architectures and has been studied extensively in the previous literature [39, 37, 13, 31, 23]. For instance, in the InceptionNet family [39, 40, 38], the outputs of filters with multiple sizes on the same level are fused to handle the large variation of object size. In Residual Networks (ResNet) [13, 14] and its follow-ups [50, 48], the identity mapping features and residual learning features are fused as the output via short skip connections, enabling the training of very deep networks. In Feature Pyramid Networks (FPN) [23] and U-Net [31], low-level features and high-level features are fused via long skip connections to obtain high-resolution and semantically strong features, which are vital for semantic segmentation and object detection. However, despite its prevalence in modern networks, most works on feature fusion focus on constructing sophisticated pathways to combine features in different kernels, groups, or layers. The feature fusion method itself has rarely been addressed and is usually implemented via simple operations such as addition or concatenation, which merely offer a fixed linear aggregation of feature maps and are entirely unaware of whether this combination is suitable for specific objects.

Recently, Selective Kernel Networks (SKNet) [21] and ResNeSt [51] have been proposed to render dynamic weighted averaging of features from multiple kernels or groups in the same layer based on the global channel attention mechanism [16]. Although such attention-based methods present nonlinear approaches for feature fusion, they still suffer from the following shortcomings:

  1. Limited scenarios: SKNet and ResNeSt only focus on the soft feature selection in the same layer, whereas the fusion in skip connections has not been addressed, leaving their schemes quite heuristic. Despite having different scenarios, all kinds of feature fusion implementations face, in essence, the same challenge: how to integrate features of different scales for better performance. A module that can overcome the semantic inconsistency and effectively integrate features of different scales should be able to consistently improve the quality of fused features in various network scenarios. However, so far, there is still a lack of a generalized approach that can unify different feature fusion scenarios in a consistent manner.

  2. Unsophisticated initial integration: To feed the received features into the attention module, SKNet introduces another phase of feature fusion in an involuntary but inevitable way, which we call initial integration and is implemented by addition. Therefore, besides the design of the attention module, as its input, the initial integration approach also has a large impact on the quality of fusion weights. Considering the features may have a large inconsistency on the scale and semantic level, an unsophisticated initial integration strategy ignoring this issue can be a bottleneck.

  3. Biased context aggregation scale: The fusion weights in SKNet and ResNeSt are generated via the global channel attention mechanism [16], which is preferred for information that distributes more globally. However, objects in the image can have an extremely large variation in size. Numerous studies have emphasized this issue that arises when designing CNNs, i.e., that the receptive fields of predictors should match the object scale range [52, 34, 35, 22]. Therefore, merely aggregating contextual information on a global scale is too biased and weakens the features of small objects. This gives rise to the question of whether a network can dynamically and adaptively fuse the received features in a contextual scale-aware way.

Motivated by the above observations, we present the attentional feature fusion (AFF) module, trying to answer the question of what a unified approach for all kinds of feature fusion scenarios should look like, and to address the problems of contextual aggregation and initial integration. The AFF framework generalizes the attention-based feature fusion from the same-layer scenario to cross-layer scenarios including short and long skip connections, and even the initial integration inside AFF itself. It provides a universal and consistent way to improve the performance of various networks, e.g., InceptionNet, ResNet, ResNeXt [48], and FPN, by simply replacing existing feature fusion operators with the proposed AFF module. Moreover, the AFF framework supports gradual refinement of the initial integration, namely the input of the fusion weight generator, by iteratively integrating the received features with another AFF module, which we refer to as iterative attentional feature fusion (iAFF).

To alleviate the problems arising from scale variation and small objects, we advocate the idea that attention modules should also aggregate contextual information from different receptive fields for objects of different scales. More specifically, we propose the Multi-Scale Channel Attention Module (MS-CAM), a simple yet effective scheme to remedy the feature inconsistency across different scales for attentional feature fusion. Our key observation is that scale is not an issue exclusive to the spatial attention, and the channel attention can also have scales other than the global by varying the spatial pooling size. By aggregating the multi-scale context information along the channel dimension, MS-CAM can simultaneously emphasize large objects that distribute more globally and highlight small objects that distribute more locally, facilitating the network to recognize and detect objects under extreme scale variation.

2 Related Work

2.1 Multi-scale Attention Mechanism

The scale variation of objects is one of the key challenges in computer vision. To remedy this issue, an intuitive way is to leverage multi-scale image pyramids [30, 2], in which objects are recognized at multiple scales and the predictions are combined using non-maximum suppression. The other line of effort aims to exploit the inherent multi-scale, hierarchical feature pyramid of CNNs to approximate image pyramids, in which features from multiple layers are fused to obtain semantic features with high resolutions [12, 31, 23].

The attention mechanism in deep learning, which mimics the human visual attention mechanism [5, 8], was originally developed on a global scale. For example, the matrix multiplication in self-attention draws global dependencies of each word in a sentence [42] or each pixel in an image [7, 45, 1]. The Squeeze-and-Excitation Networks (SENet) squeeze global spatial information into a channel descriptor to capture channel-wise dependencies [16]. Recently, researchers have started to take into account the scale issue of attention mechanisms. Similar to the above-mentioned approaches handling scale variation in CNNs, multi-scale attention mechanisms are achieved by either feeding multi-scale features into an attention module or combining feature contexts of multiple scales inside an attention module. In the first type, the features at multiple scales or their concatenated result are fed into the attention module to generate multi-scale attention maps, while the scale of feature context aggregation inside the attention module remains single [2, 3, 46, 6, 36, 41]. The second type, which is also referred to as multi-scale spatial attention, aggregates feature contexts by convolutional kernels of different sizes [20] or from a pyramid [20, 44] inside the attention module.

The proposed MS-CAM follows the idea of ParseNet [25] of combining local and global features in CNNs and the idea of spatial attention of aggregating multi-scale feature contexts inside the attention module, but differs in at least two important aspects: 1) MS-CAM puts forward the scale issue in channel attention and is achieved by point-wise convolution rather than kernels of different sizes. 2) Instead of doing so in the backbone network, MS-CAM aggregates local and global feature contexts inside the channel attention module. To the best of our knowledge, multi-scale channel attention has never been discussed before.

2.2 Skip Connections in Deep Learning

Skip connections have been an essential component in modern convolutional networks. Short skip connections, namely the identity mapping shortcuts added inside Residual blocks, provide an alternative path for the gradient to flow without interruption during backpropagation [13, 48, 50]. Long skip connections help the network to obtain semantic features with high resolutions by bridging features of finer details from lower layers and high-level semantic features of coarse resolutions [17, 23, 31, 26]. Despite being used to combine features in various pathways [9], the fusion of connected features is usually implemented via addition or concatenation, which allocate the features with fixed weights regardless of the variance of contents. Recently, a few attention-based methods, e.g., Global Attention Upsample (GAU) [20] and Skip Attention (SA) [49], have been proposed to use high-level features as guidance to modulate the low-level features in long skip connections. However, the fusion weights for the modulated features are still fixed.

To the best of our knowledge, it is the Highway Networks that first introduced a selection mechanism in short skip connections [37]. To some extent, the attentional skip connections proposed in this paper can be viewed as its follow-up, but they differ in three points: 1) Highway Networks employ a simple fully connected layer that can only generate a scalar fusion weight, while our proposed MS-CAM generates fusion weights of the same size as the feature maps, enabling dynamic soft selections in an element-wise way. 2) Highway Networks only use one input feature to generate the weight, while our AFF module is aware of both features. 3) We point out the importance of initial feature integration and the iAFF module is proposed as a solution.

3 Multi-scale Channel Attention

3.1 Revisiting Channel Attention in SENet

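Given an intermediate feature $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ with $C$ channels and feature maps of size $H \times W$, the channel attention weights $\mathbf{w} \in \mathbb{R}^{C}$ in SENet can be computed as

$$\mathbf{w} = \sigma\big(\mathbf{g}(\mathbf{X})\big) = \sigma\Big(\mathcal{B}\big(\mathbf{W}_2\, \delta\big(\mathcal{B}\big(\mathbf{W}_1\, g(\mathbf{X})\big)\big)\big)\Big), \tag{1}$$

where $\mathbf{g}(\mathbf{X}) \in \mathbb{R}^{C}$ denotes the global feature context and $g(\mathbf{X}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{X}_{[:, i, j]}$ is the global average pooling (GAP). $\delta$ denotes the Rectified Linear Unit (ReLU) [28], and $\mathcal{B}$ denotes the Batch Normalization (BN). $\sigma$ is the Sigmoid function. This is achieved by a bottleneck with two fully connected (FC) layers, where $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is a dimension reduction layer, and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ is a dimension increasing layer. $r$ is the channel reduction ratio.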
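The squeeze-and-excite computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `se_channel_attention`, the tensor sizes, and the random weights are assumptions made for the example, and Batch Normalization is deliberately left out to keep the code short.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def se_channel_attention(x, w1, w2):
    """Global channel attention weights for a feature x of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    Batch Normalization is left out of this sketch for brevity.
    """
    g = x.mean(axis=(1, 2))             # global average pooling -> (C,)
    hidden = np.maximum(w1 @ g, 0.0)    # dimension reduction + ReLU -> (C // r,)
    return sigmoid(w2 @ hidden)         # dimension increase + Sigmoid -> (C,)

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4                 # illustrative sizes, not from the paper
x = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))

w = se_channel_attention(x, w1, w2)
refined = x * w[:, None, None]          # one scalar weight per channel
print(w.shape, refined.shape)
```

Each channel receives a single scalar weight, so every spatial position in that channel is rescaled identically.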

We can see that the channel attention squeezes each feature map of size $H \times W$ into a scalar. This extremely coarse descriptor prefers to emphasize large objects that distribute globally and can potentially wipe out most of the image signal present in a small object. However, detecting very small objects stands out as the key performance bottleneck of state-of-the-art networks [35]. For example, the difficulty of COCO is largely due to the fact that most object instances are smaller than 1% of the image area [24, 34]. Therefore, global channel attention might not be the best choice. Multi-scale feature contexts should be aggregated inside the attention module to alleviate the problems arising from scale variation and small object instances.
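A small numeric illustration of this effect, under assumed toy values: a 32 × 32 feature map that is zero everywhere except for a 3 × 3 response of strength 1 (covering just under 1% of the area). The map size and the response value are assumptions chosen only for this example.

```python
import numpy as np

H = W = 32
fmap = np.zeros((H, W))
fmap[10:13, 10:13] = 1.0        # a 3x3 "small object": 9 of 1024 positions

gap = fmap.mean()               # what global average pooling keeps
peak = fmap.max()               # the local response that was actually there
area_fraction = 9 / (H * W)

print(gap, peak, area_fraction)
```

The pooled descriptor equals 9/1024 (about 0.0088), while the local response is 1.0, so a purely global statistic retains less than 1% of the object's peak activation.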

3.2 Aggregating Local and Global Contexts

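In this part, we describe the proposed multi-scale channel attention module (MS-CAM) in detail. The key idea is that the channel attention can be implemented in multiple scales by varying the spatial pooling size. To keep it as lightweight as possible, we merely add the local context to the global context inside the attention module. We choose the point-wise convolution (PWConv) as the local channel context aggregator, which only exploits point-wise channel interactions for each spatial position. To save parameters, the local channel context $\mathbf{L}(\mathbf{X}) \in \mathbb{R}^{C \times H \times W}$ is computed via a bottleneck structure as follows:

$$\mathbf{L}(\mathbf{X}) = \mathcal{B}\big(\mathrm{PWConv}_2\big(\delta\big(\mathcal{B}\big(\mathrm{PWConv}_1(\mathbf{X})\big)\big)\big)\big). \tag{2}$$

The kernel sizes of $\mathrm{PWConv}_1$ and $\mathrm{PWConv}_2$ are $\frac{C}{r} \times C \times 1 \times 1$ and $C \times \frac{C}{r} \times 1 \times 1$, respectively. It is noteworthy that $\mathbf{L}(\mathbf{X})$ has the same shape as the input feature, which can preserve and highlight the subtle details in the low-level features. Given the global channel context $\mathbf{g}(\mathbf{X})$ and local channel context $\mathbf{L}(\mathbf{X})$, the refined feature $\mathbf{X}' \in \mathbb{R}^{C \times H \times W}$ by MS-CAM can be obtained as follows:

$$\mathbf{X}' = \mathbf{X} \otimes \mathbf{M}(\mathbf{X}) = \mathbf{X} \otimes \sigma\big(\mathbf{L}(\mathbf{X}) \oplus \mathbf{g}(\mathbf{X})\big), \tag{3}$$

where $\mathbf{M}(\mathbf{X}) \in \mathbb{R}^{C \times H \times W}$ denotes the attentional weights generated by MS-CAM. $\oplus$ denotes the broadcasting addition and $\otimes$ denotes the element-wise multiplication.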
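As a rough sketch of how the local and global branches combine, the following NumPy code implements a simplified MS-CAM forward pass. It is not the released implementation: the helper names (`pwconv`, `bottleneck`, `ms_cam`), the sizes, and the random weights are assumptions, and Batch Normalization is omitted from both branches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pwconv(x, weight):
    """Point-wise (1x1) convolution: mixes channels at each position.

    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def bottleneck(x, w_reduce, w_expand):
    """Reduce channels, apply ReLU, expand back (BN omitted in this sketch)."""
    return pwconv(np.maximum(pwconv(x, w_reduce), 0.0), w_expand)

def ms_cam(x, local_w, global_w):
    """Return (refined feature, attention weights), both of shape (C, H, W)."""
    local_ctx = bottleneck(x, *local_w)            # (C, H, W): per-position context
    pooled = x.mean(axis=(1, 2), keepdims=True)    # (C, 1, 1): global average pooling
    global_ctx = bottleneck(pooled, *global_w)     # (C, 1, 1): global context
    weights = sigmoid(local_ctx + global_ctx)      # broadcasting addition, then Sigmoid
    return x * weights, weights                    # element-wise multiplication

rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4                            # illustrative sizes only
x = rng.standard_normal((C, H, W))

def make_weights():
    return (0.1 * rng.standard_normal((C // r, C)),
            0.1 * rng.standard_normal((C, C // r)))

refined, weights = ms_cam(x, make_weights(), make_weights())
print(refined.shape, weights.min(), weights.max())
```

Unlike the purely global variant, the weights here vary across spatial positions, because the local branch never pools over height and width.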

Figure 1: Illustration of the proposed MS-CAM

4 Attentional Feature Fusion

4.1 Unification of Feature Fusion Scenarios

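Given two feature maps $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{C \times H \times W}$, by default, we assume $\mathbf{Y}$ is the feature map with a larger receptive field. More specifically,

  1. same-layer scenario: $\mathbf{X}$ is the output of a $3 \times 3$ kernel and $\mathbf{Y}$ is the output of a $5 \times 5$ kernel in InceptionNet;

  2. short skip connection scenario: $\mathbf{X}$ is the identity mapping, and $\mathbf{Y}$ is the learned residual in a ResNet block;

  3. long skip connection scenario: $\mathbf{X}$ is the low-level feature map, and $\mathbf{Y}$ is the high-level semantic feature map in a feature pyramid.

Based on the multi-scale channel attention module $\mathbf{M}$, Attentional Feature Fusion (AFF) can be expressed as

$$\mathbf{Z} = \mathbf{M}(\mathbf{X} \uplus \mathbf{Y}) \otimes \mathbf{X} + \big(1 - \mathbf{M}(\mathbf{X} \uplus \mathbf{Y})\big) \otimes \mathbf{Y}, \tag{4}$$

where $\mathbf{Z} \in \mathbb{R}^{C \times H \times W}$ is the fused feature, and $\uplus$ denotes the initial feature integration. In this subsection, for the sake of simplicity, we choose the element-wise summation as initial integration. The AFF is illustrated in Fig. 4(a), where the dashed line denotes $1 - \mathbf{M}(\mathbf{X} \uplus \mathbf{Y})$. It should be noted that the fusion weights $\mathbf{M}(\mathbf{X} \uplus \mathbf{Y})$ consist of real numbers between 0 and 1, and so do the weights $1 - \mathbf{M}(\mathbf{X} \uplus \mathbf{Y})$, which enables the network to conduct a soft selection or weighted averaging between $\mathbf{X}$ and $\mathbf{Y}$.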
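The soft-selection behaviour can be illustrated with a minimal NumPy sketch. The attention module is passed in as a callable; `toy_attention` below is a hypothetical stand-in (a plain Sigmoid) used only so the example runs on its own, and it is not MS-CAM. Element-wise summation serves as the initial integration, as chosen in this subsection.

```python
import numpy as np

def aff(x, y, attention):
    """Attentional feature fusion of two same-shaped feature maps.

    `attention` maps a (C, H, W) array to weights in (0, 1) of the same shape.
    """
    m = attention(x + y)            # initial integration: element-wise summation
    return m * x + (1.0 - m) * y    # soft selection between x and y

def toy_attention(u):
    """Hypothetical stand-in for MS-CAM: an element-wise Sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))  # e.g. identity-mapping features
y = rng.standard_normal((4, 3, 3))  # e.g. residual features

z = aff(x, y, toy_attention)

# With constant weights of 0.5 the fusion reduces to plain averaging.
z_half = aff(x, y, lambda u: np.full_like(u, 0.5))
print(z.shape, np.allclose(z_half, (x + y) / 2))
```

Because the two weights are complementary and lie in (0, 1), every fused value lies between the corresponding values of the two inputs.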

Figure 4: Illustration of the proposed AFF and iAFF. Panels: (a) AFF; (b) iAFF.

We summarize different formulations of feature fusion in deep networks in Table 1. $\mathcal{G}$ denotes the global attention mechanism. Although there are many implementation differences among the approaches for various feature fusion scenarios, once abstracted into mathematical forms, these differences in details disappear. Therefore, it is possible to unify these feature fusion scenarios with a carefully designed approach, thereby improving the performance of all networks by replacing original fusion operations with this unified approach.

| Context-aware | Type | Formulation | Scenario & Reference | Example |
|---|---|---|---|---|
| None | Addition | | Short Skip [13, 14], Long Skip [26, 23] | ResNet, FPN |
| None | Concatenation | | Same Layer [39], Long Skip [31, 17] | InceptionNet, U-Net |
| Partially | Refinement | | Short Skip [16, 15, 47, 29] | SENet |
| Partially | Modulation | | Long Skip [20] | GAU |
| Partially | Soft Selection | | Short Skip [37] | Highway Networks |
| Fully | Modulation | | Long Skip [49] | SA |
| Fully | Soft Selection | | Same Layer [21, 51] | SKNet |
| Fully | Soft Selection | | Same Layer, Short Skip, Long Skip | ours |

Table 1: A brief overview of different feature fusion strategies in deep networks.

From Table 1, it can be further seen that apart from the implementation of the weight generation module $\mathcal{G}$, the state-of-the-art fusion schemes mainly differ in two crucial points: (a) The context-awareness level. Linear approaches like addition and concatenation are entirely context-unaware. Feature refinement and modulation are non-linear, but only partially aware of the input feature maps. In most cases, they only exploit the high-level feature map. Fully context-aware approaches utilize both input feature maps for guidance at the cost of raising the initial integration issue. (b) Refinement vs. modulation vs. selection. The sum of the weights applied to the two feature maps in soft selection approaches is bound to 1, while this is not the case for refinement and modulation.

4.2 Iterative Attentional Feature Fusion

Unlike partially context-aware approaches [20], fully context-aware methods have an inevitable issue, namely how to initially integrate input features. As the input of the attention module, the initial integration quality may profoundly affect final fusion weights. Since it is still a feature fusion problem, an intuitive way is to have another attention module to fuse input features. We call this two-stage approach iterative Attentional Feature Fusion (iAFF), which is illustrated in Fig. 4(b). Then, the initial integration in eq. 4 can be reformulated as

$$\mathbf{X} \uplus \mathbf{Y} = \mathbf{M}(\mathbf{X} + \mathbf{Y}) \otimes \mathbf{X} + \big(1 - \mathbf{M}(\mathbf{X} + \mathbf{Y})\big) \otimes \mathbf{Y}. \tag{5}$$
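The two-stage structure can be sketched as follows, again with hypothetical stand-in attention callables rather than MS-CAM. The inner stage produces the refined initial integration; the outer stage uses it to compute the final fusion weights. Using two separate attention callables is an assumption of this sketch.

```python
import numpy as np

def iaff(x, y, inner_attention, outer_attention):
    """Iterative attentional feature fusion (two-stage sketch)."""
    m_inner = inner_attention(x + y)                  # first stage starts from summation
    initial = m_inner * x + (1.0 - m_inner) * y       # refined initial integration
    m_outer = outer_attention(initial)                # weights from the refined input
    return m_outer * x + (1.0 - m_outer) * y          # final soft selection

def toy_attention(scale):
    """Hypothetical stand-in for MS-CAM: Sigmoid of a scaled input."""
    return lambda u: 1.0 / (1.0 + np.exp(-scale * u))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))
y = rng.standard_normal((4, 3, 3))

z = iaff(x, y, toy_attention(1.0), toy_attention(0.5))
print(z.shape)
```

The output is still a convex combination of the two inputs at every element; only the way the weights are derived changes.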


4.3 Examples: InceptionNet, ResNet, and FPN

To validate the proposed AFF/iAFF as a uniform and general scheme, we choose ResNet, FPN, and InceptionNet as examples for the most common scenarios: short and long skip connections as well as the same layer fusion. It is straightforward to apply AFF/iAFF to existing networks by replacing the original addition or concatenation. Specifically, we replace the concatenation in the InceptionNet module as well as the addition in ResNet block (ResBlock) and FPN to obtain the attentional networks, which we call AFF-Inception module, AFF-ResBlock, and AFF-FPN, respectively. This replacement and the schemes of our proposed architectures are shown in Fig. 8. The iAFF is a particular case of AFF, so it does not need another illustration.

Figure 8: The schema of the proposed AFF-Inception module, AFF-ResBlock, and AFF-FPN. Panels: (a) AFF-Inception module; (b) AFF-ResBlock.

5 Experiments

The goal of the following experimental evaluation is to show that, with the proposed AFF/iAFF and MS-CAM, convolutional networks can gain a performance boost, even with fewer layers or parameters per network.

5.1 Experimental settings

For experimental evaluation, we resort to the following benchmark datasets: CIFAR-100 [19] and ImageNet [32] for image classification in the same-layer InceptionNet and short-skip connection ResNet scenarios as well as StopSign (a subset of COCO dataset [24]) for semantic segmentation in the long-skip connection FPN scenario. The detailed experimental settings are listed in Table 2. $b$ is the ResBlock number in each stage, which is used to scale the network by depth. For more implementation details, please see the supplementary material as well as our code.

| Task | Dataset | Host Network | Fusing Scenario | $r$ | Epochs | Batch Size | Optimizer | Learning Rate | Learning Rate Mode | Initialization |
|---|---|---|---|---|---|---|---|---|---|---|
| Image Classification | CIFAR-100 | Inception-ResNet-20-$b$ | Same Layer | 4 | 400 | 128 | Nesterov | 0.2 | Step, | Kaiming |
| Image Classification | CIFAR-100 | ResNet-20-$b$ | Short Skip | 4 | 400 | 128 | Nesterov | 0.2 | Step, | Kaiming |
| Image Classification | CIFAR-100 | ResNeXt-38-32x4d | Short Skip | 16 | 400 | 128 | Nesterov | 0.2 | Step, | Xavier |
| Image Classification | ImageNet | ResNet-50 | Short Skip | 16 | 160 | 128 | Nesterov | 0.075 | Cosine | Kaiming |
| Semantic Segmentation | StopSign | ResNet-20-$b$ + FPN | Long Skip | 4 | 300 | 32 | AdaGrad | 0.01 | Poly | Kaiming |

Table 2: Experimental settings for the networks integrated with the proposed AFF/iAFF.

5.2 Ablation study

5.2.1 Impact of Multi-Scale Context Aggregation

To study the impact of multi-scale context aggregation, in Fig. 11, we construct two ablation modules “Global + Global” and “Local + Local”, in which the scales of the two contextual aggregation branches are set to be the same, either global or local. The proposed AFF is dubbed “Global + Local” here. All of them have the same number of parameters. The only difference is their context aggregation scale.

Figure 11: Architectures for the ablation study on the impact of contextual aggregation scale.

Table 3 presents their comparison on CIFAR-100, ImageNet, and StopSign on various host networks. It can be seen that the multi-scale contextual aggregation (Global + Local) outperforms single-scale ones in all settings. The results suggest that the multi-scale feature context is vital for the attentional feature fusion.
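The "all settings" claim can be checked mechanically against the values reported in Table 3. The short script below transcribes the three rows of that table in their printed order (four InceptionNet entries, four ResNet entries, four FPN entries, one ImageNet entry) and compares them column by column; the variable names are introduced only for this check.

```python
# Values transcribed row by row from Table 3.
table3 = {
    "Global + Global": [0.735, 0.766, 0.775, 0.789, 0.754, 0.796, 0.811, 0.821,
                        0.911, 0.923, 0.936, 0.939, 0.777],
    "Local + Local":   [0.746, 0.771, 0.785, 0.787, 0.754, 0.794, 0.808, 0.814,
                        0.895, 0.919, 0.921, 0.924, 0.780],
    "Global + Local":  [0.756, 0.784, 0.794, 0.801, 0.763, 0.804, 0.816, 0.826,
                        0.924, 0.935, 0.939, 0.944, 0.784],
}

proposed = table3["Global + Local"]

# Margin of the multi-scale variant over each single-scale variant, per column.
margins = [
    p - v
    for name, values in table3.items()
    if name != "Global + Local"
    for p, v in zip(proposed, values)
]

always_best = all(m > 0 for m in margins)
smallest_margin = min(margins)
print(always_best, round(smallest_margin, 3))
```

The multi-scale row is strictly higher in every column, with the narrowest margin being 0.003.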

| Aggregation Scale | InceptionNet on CIFAR-100 ($b$ = 1, 2, 3, 4) | ResNet on CIFAR-100 ($b$ = 1, 2, 3, 4) | ResNet + FPN on StopSign ($b$ = 1, 2, 3, 4) | ResNet on ImageNet |
|---|---|---|---|---|
| Global + Global | 0.735, 0.766, 0.775, 0.789 | 0.754, 0.796, 0.811, 0.821 | 0.911, 0.923, 0.936, 0.939 | 0.777 |
| Local + Local | 0.746, 0.771, 0.785, 0.787 | 0.754, 0.794, 0.808, 0.814 | 0.895, 0.919, 0.921, 0.924 | 0.780 |
| Global + Local | 0.756, 0.784, 0.794, 0.801 | 0.763, 0.804, 0.816, 0.826 | 0.924, 0.935, 0.939, 0.944 | 0.784 |

Table 3: Comparison of contextual aggregation scales in attentional feature fusion given the same parameter budget. The results suggest that a mix of scales should always be preferred inside the channel attention module.

| Fusion Type | Context | Strategy | InceptionNet (Same Layer) ($b$ = 1, 2, 3, 4) | ResNet (Short Skip) ($b$ = 1, 2, 3, 4) | ResNet + FPN (Long Skip) ($b$ = 1, 2, 3, 4) |
|---|---|---|---|---|---|
| Add | None | \ | 0.720, 0.753, 0.771, 0.782 | 0.740, 0.786, 0.797, 0.808 | 0.895, 0.920, 0.925, 0.928 |
| Concatenation | None | \ | 0.725, 0.749, 0.772, 0.779 | 0.742, 0.782, 0.793, 0.798 | 0.897, 0.909, 0.925, 0.939 |
| MS-GAU | Partially | Modulation | 0.751, 0.774, 0.788, 0.795 | 0.766, 0.803, 0.815, 0.819 | 0.917, 0.926, 0.937, 0.941 |
| MS-SE | Partially | Refinement | 0.752, 0.780, 0.790, 0.798 | 0.765, 0.799, 0.814, 0.820 | 0.915, 0.929, 0.940, 0.940 |
| MS-SA | Fully | Modulation | 0.756, 0.779, 0.790, 0.798 | 0.761, 0.801, 0.814, 0.822 | 0.920, 0.932, 0.938, 0.941 |
| AFF (ours) | Fully | Selection | 0.756, 0.784, 0.794, 0.801 | 0.763, 0.804, 0.816, 0.826 | 0.924, 0.935, 0.939, 0.944 |
| iAFF (ours) | Fully | Selection | 0.774, 0.801, 0.808, 0.814 | 0.772, 0.807, 0.822, / | 0.927, 0.938, 0.945, 0.953 |

Table 4: Comparison of context-aware level and feature integration strategy in feature fusion given the same parameter budget. The results suggest that a fully context-aware and selective strategy should always be preferred for feature fusion. If optimization poses no problem, the iterative attentional feature fusion should be adopted for better performance.

5.2.2 Impact of Feature Integration Type

Further, we investigate which of the feature fusion strategies in Table 1 is the best. For fairness, we re-implement these approaches based on the proposed MS-CAM for attention weights. Since MS-CAM is different from their original attention modules, we add a prefix of “MS-” to these newly implemented schemes. To keep the parameter budget the same, the channel reduction ratio $r$ used in MS-GAU, MS-SE, MS-SA, and AFF differs from the one used in iAFF.

Figure 15: Architectures for ablation study on the impact of feature integration strategies. Panels: (a) MS-GAU; (b) MS-SE; (c) MS-SA.

Table 4 provides the comparison results in three scenarios, from which it can be seen that: 1) compared to the linear approaches, namely addition and concatenation, the non-linear fusion strategy with attention mechanism always offers better performance; 2) our fully context-aware and selective strategy is slightly but consistently better than the others, suggesting that it should be preferred for multiple feature integration; 3) the proposed iAFF approach is significantly better than the rest in most cases. The results strongly support our hypothesis that the early integration quality has a large impact on the attentional feature fusion, and another level of attentional feature fusion can further improve the performance. However, this improvement may be obtained at the cost of increasing the difficulty in optimization. We notice that when the network depth increases (larger $b$), the performance of iAFF-ResNet did not improve but degraded.

5.2.3 Impact on Localization and Small Object

To study the impact of the proposed MS-CAM on object localization and small object recognition, we apply Grad-CAM [33] to ResNet-50, SENet-50, and AFF-ResNet-50 for the visualization results of images from the ImageNet dataset, which are illustrated in Fig. 18. Given a specific class, Grad-CAM results show the network’s attended regions clearly. Here, we show the heatmaps of the predicted class, and wrongly predicted images are marked with a symbol. The predicted class names and their softmax scores are also shown at the bottom of the heatmaps.

From the upper part of Fig. 18, it can be seen clearly that the attended regions of the AFF-ResNet-50 highly overlap with the labeled objects, which shows that it learns well to localize objects and exploit the features in object regions. On the contrary, the localization capacity of the baseline ResNet-50 is relatively poor, misplacing the center of attended regions in many cases. Although SENet-50 is able to locate the true objects, the attended regions are overly large and include many background components. This is because SENet-50 only utilizes the global channel attention, which is biased to the context of a global scale, whereas the proposed MS-CAM also aggregates the local channel context, which helps the network to attend to the objects with less background clutter and is also beneficial to small object recognition. In the bottom half of Fig. 18, we can clearly see that AFF-ResNet-50 can predict correctly on the small-scale objects, while ResNet-50 fails in most cases.

Figure 18: Network visualization with Grad-CAM. The comparison results suggest that the proposed MS-CAM is beneficial to the object localization and small object recognition.

5.3 Comparison with the state-of-the-art networks

Figure 22: Comparison with baseline and other state-of-the-art networks with a gradual increase of network depth. Panels: (a) InceptionNet (Same layer); (b) ResNet (Short skip connection); (c) FPN (Long skip connection).

To show that the network performance can be improved by replacing original fusion operations with the proposed attentional feature fusion, we compare the AFF and iAFF modules with other attention modules based on the same host networks in different feature fusion scenarios. Fig. 22 illustrates the comparison results with a gradual increase in network depth for all networks. It can be seen that: 1) Comparing SKNet / SENet / GAU-FPN with AFF-InceptionNet / AFF-ResNet / AFF-FPN, our AFF or iAFF integrated networks are better in all scenarios, which shows that our (iterative) attentional feature fusion approach not only has superior performance, but also good generality. We believe the improved performance comes from the proposed multi-scale channel contextual aggregation inside the attention module. 2) Comparing the performance of iAFF-based networks with AFF-based networks, it should be noted that the proposed iterative attentional feature fusion scheme can further improve the performance. 3) By replacing the simple addition or concatenation with the proposed AFF or iAFF module, we can get a more efficient network. For example, in Fig. 22(b), a shallower iAFF-ResNet achieves performance similar to that of a deeper baseline ResNet, while requiring only 54% of the parameters.

Last, we compare the performance of AFF/iAFF based networks with state-of-the-art networks on CIFAR-100 and ImageNet. The results are listed in Table 5 and Table 6. The results show that the proposed AFF/iAFF based networks can improve performance over the state-of-the-art networks under much smaller parameter budgets. Remarkably, on CIFAR-100, AFF-ResNeXt-38-32x4d outperforms NAT-M4 [27] by about 2% absolute, although NAT-M4 has 15% more parameters. AFF-ResNet-32 achieves the same accuracy as AutoAugment+PyramidNet+ShakeDrop [4] by merely utilizing 19% of its parameters. On ImageNet, the proposed iAFF-ResNet-50 outperforms Gather-Excite--ResNet-101 [15] by 0.3% with only 60% of the parameters. These results indicate that the feature fusion in short skip connections matters a lot for ResNet and ResNeXt. Instead of blindly increasing the depth of the network, we should pay more attention to the quality of feature fusion.

| Architecture | Acc. (%) | Params |
|---|---|---|
| Attention-Augmented-Wide-ResNet-28-10 [1] | 81.6 | 36.2 M |
| SENet-29 [16] | 82.2 | 35.0 M |
| SKNet-29 [21] | 82.7 | 27.7 M |
| PyramidNet-272-200 [11] | 83.6 | 26.0 M |
| Neural Architecture Transfer (NAT-M4) [27] | 88.3 | 9.0 M |
| AutoAugment+PyramidNet+ShakeDrop [4] | 89.3 | 26.0 M |
| AFF-ResNet-32 (ours) | 89.3 | 5.0 M |
| AFF-ResNeXt-38-32x4d (ours) | 90.3 | 7.8 M |

Table 5: Comparison on CIFAR-100 with state of the art

| Architecture | top-1 err. (%) | Params |
|---|---|---|
| ResNet-101 [13] | 23.2 | 42.5 M |
| Efficient-Channel-Attention-Net-101 [43] | 21.4 | 42.5 M |
| Attention-Augmented-ResNet-101 [1] | 21.3 | 45.4 M |
| SENet-101 [16] | 20.9 | 49.4 M |
| Gather-Excite--ResNet-101 [15] | 20.7 | 58.4 M |
| Local-Importance-Pooling-ResNet-101 [10] | 20.7 | 42.9 M |
| AFF-ResNet-50 (ours) | 20.9 | 30.3 M |
| AFF-ResNeXt-50-32x4d (ours) | 20.8 | 29.9 M |
| iAFF-ResNet-50 (ours) | 20.4 | 35.1 M |
| iAFF-ResNeXt-50-32x4d (ours) | 20.2 | 34.7 M |

Table 6: Comparison on ImageNet with state of the art
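The parameter and accuracy ratios quoted in the text can be recomputed from the entries of Tables 5 and 6. The following check uses only numbers printed in those tables; the variable names are introduced here for illustration.

```python
# (accuracy %, parameters in millions) from Table 5
nat_m4 = (88.3, 9.0)
autoaugment_pyramidnet = (89.3, 26.0)
aff_resnet_32 = (89.3, 5.0)
aff_resnext_38 = (90.3, 7.8)

# (top-1 error %, parameters in millions) from Table 6
gather_excite_101 = (20.7, 58.4)
iaff_resnet_50 = (20.4, 35.1)

accuracy_gain = aff_resnext_38[0] - nat_m4[0]                        # percentage points
nat_extra_params = nat_m4[1] / aff_resnext_38[1] - 1.0               # relative surplus
aff_param_share = aff_resnet_32[1] / autoaugment_pyramidnet[1]       # fraction used
error_reduction = gather_excite_101[0] - iaff_resnet_50[0]           # percentage points
iaff_param_share = iaff_resnet_50[1] / gather_excite_101[1]          # fraction used

print(round(accuracy_gain, 1), round(nat_extra_params * 100),
      round(aff_param_share * 100), round(error_reduction, 1),
      round(iaff_param_share * 100))
```

Rounded, these give a 2.0-point accuracy gap, 15% more parameters for NAT-M4, a 19% parameter share for AFF-ResNet-32, a 0.3-point error gap, and a 60% parameter share for iAFF-ResNet-50.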

6 Conclusion

We generalize the concept of attention mechanisms as a selective and dynamic type of feature fusion to most scenarios, namely the same layer, short skip, and long skip connections as well as information integration inside the attention mechanism. To overcome the semantic and scale inconsistency issue among input features, we propose the multi-scale channel attention module, which adds local channel contexts to the global channel-wise statistics. Further, we point out that the initial integration of received features is a bottleneck in attention-based feature fusion, and it can be alleviated by adding another level of attention that we call iterative attentional feature fusion. We conducted detailed ablation studies to empirically verify the individual impact of the context-aware level, the feature integration type, and the contextual aggregation scales of our proposed attention mechanism. Experimental results on both the CIFAR-100 and the ImageNet dataset show that our models outperform state-of-the-art networks with fewer layers or parameters per network, which suggests that one should pay attention to the feature fusion in deep neural networks and that more sophisticated attention mechanisms for feature fusion hold the potential to consistently yield better results.


References

  • [1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 3286–3295, October 2019.
  • [2] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3640–3649, 2016.
  • [3] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5669–5678. IEEE Computer Society, 2017.
  • [4] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 113–123. Computer Vision Foundation / IEEE, 2019.
  • [5] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting More Attention to Video Salient Object Detection. IEEE.
  • [6] Yang Feng, Deqian Kong, Ping Wei, Hongbin Sun, and Nanning Zheng. A benchmark dataset and multi-scale attention network for semantic traffic light detection. In 2019 IEEE Intelligent Transportation Systems Conference, ITSC 2019, Auckland, New Zealand, October 27-30, 2019, pages 1–8. IEEE, 2019.
  • [7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3146–3154, 2019.
  • [8] Keren Fu, Deng-Ping Fan, Ge-Peng Ji, and Qijun Zhao. JL-DCF: joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 3049–3059. IEEE, 2020.
  • [9] Keren Fu, Qijun Zhao, Irene Yu-Hua Gu, and Jie Yang. Deepside: A general deep framework for salient object detection. Neurocomputing, 356:69–82, Sep 2019.
  • [10] Ziteng Gao, Limin Wang, and Gangshan Wu. LIP: local importance-based pooling. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 3354–3363. IEEE, 2019.
  • [11] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6307–6315. IEEE Computer Society, 2017.
  • [12] Bharath Hariharan, Pablo Andrés Arbeláez, Ross B. Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 447–456. IEEE Computer Society, 2015.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630–645, 2016.
  • [15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 9423–9433, 2018.
  • [16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141, 2018.
  • [17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
  • [18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
  • [19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [20] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. In British Machine Vision Conference (BMVC), pages 1–13, 2018.
  • [21] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 510–519, 2019.
  • [22] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhao-Xiang Zhang. Scale-aware trident networks for object detection. In IEEE International Conference on Computer Vision (ICCV), pages 6053–6062, 2019.
  • [23] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
  • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.
  • [25] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. CoRR, abs/1506.04579, 2015.
  • [26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
  • [27] Zhichao Lu, Gautam Sreekumar, Erik Goodman, Wolfgang Banzhaf, Kalyanmoy Deb, and Vishnu Naresh Boddeti. Neural Architecture Transfer. arXiv e-prints, page arXiv:2005.05859, May 2020.
  • [28] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), ICML’10, pages 807–814, USA, 2010.
  • [29] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: bottleneck attention module. In British Machine Vision Conference (BMVC), pages 1–14, 2018.
  • [30] Shaoqing Ren, Kaiming He, Ross B. Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1476–1481, 2017.
  • [31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, 2015.
  • [32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [33] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis., 128(2):336–359, 2020.
  • [34] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection - SNIP. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3578–3587, June 2018.
  • [35] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: efficient multi-scale training. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 9333–9343, 2018.
  • [36] Ashish Sinha and Jose Dolz. Multi-scale self-guided attention for medical image segmentation. IEEE Journal of Biomedical and Health Informatics, pages 1–14, Apr 2020.
  • [37] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 2377–2385, 2015.
  • [38] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, pages 4278–4284, 2017.
  • [39] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • [40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
  • [41] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.
  • [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
  • [43] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11534–11542, 2020.
  • [44] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven C. H. Hoi, and Ali Borji. Salient object detection with pyramid attention and salient edges. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1448–1457. Computer Vision Foundation / IEEE, 2019.
  • [45] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7794–7803, 2018.
  • [46] Yi Wang, Haoran Dou, Xiaowei Hu, Lei Zhu, Xin Yang, Ming Xu, Jing Qin, Pheng-Ann Heng, Tianfu Wang, and Dong Ni. Deep Attentive Features for Prostate Segmentation in 3D Transrectal Ultrasound. IEEE Transactions on Medical Imaging, 38(12):2768–2778, Apr 2019.
  • [47] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: convolutional block attention module. In European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [48] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
  • [49] Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, and Wenwu Wang. A skip attention mechanism for monaural singing voice separation. IEEE Signal Processing Letters, 26(10):1481–1485, 2019.
  • [50] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016.
  • [51] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-Attention Networks. arXiv e-prints, page arXiv:2004.08955, Apr. 2020.
  • [52] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S3FD: Single shot scale-invariant face detector. In IEEE International Conference on Computer Vision (ICCV), Oct 2017.