code and trained models for "Attentional Feature Fusion"
Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable to most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both the CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our code and trained models are available online.
Convolutional neural networks (CNNs) have seen a significant improvement of their representation power by going deeper, going wider [39, 50], increasing cardinality, and refining features dynamically, corresponding to advances in many computer vision tasks.
Apart from these strategies, in this paper, we investigate a different component of the network, feature fusion, to further boost the representation power of CNNs. Whether explicit or implicit, intentional or unintentional, feature fusion is omnipresent in modern network architectures and has been studied extensively in the previous literature [39, 37, 13, 31, 23]. For instance, in the InceptionNet family [39, 40, 38], the outputs of filters with multiple sizes on the same level are fused to handle the large variation of object size. In Residual Networks (ResNet) [13, 14] and their follow-ups [50, 48], the identity mapping features and residual learning features are fused as the output via short skip connections, enabling the training of very deep networks. In Feature Pyramid Networks (FPN) and U-Net, low-level features and high-level features are fused via long skip connections to obtain high-resolution and semantically strong features, which are vital for semantic segmentation and object detection. However, despite its prevalence in modern networks, most work on feature fusion focuses on constructing sophisticated pathways to combine features in different kernels, groups, or layers. The fusion method itself has rarely been addressed and is usually implemented via simple operations such as addition or concatenation, which merely offer a fixed linear aggregation of feature maps and are entirely unaware of whether this combination is suitable for specific objects.
Recently, Selective Kernel Networks (SKNet) and ResNeSt have been proposed to render dynamic weighted averaging of features from multiple kernels or groups in the same layer based on the global channel attention mechanism. Although such attention-based methods present nonlinear approaches for feature fusion, they still suffer from the following shortcomings:
SKNet and ResNeSt only focus on the soft feature selection within the same layer, whereas the cross-layer fusion in skip connections has not been addressed, leaving their schemes quite heuristic. Despite having different scenarios, all kinds of feature fusion implementations face the same challenge in essence, that is, how to integrate features of different scales for better performance. A module that can overcome the semantic inconsistency and effectively integrate features of different scales should be able to consistently improve the quality of fused features in various network scenarios. However, so far, there is still a lack of a generalized approach that unifies different feature fusion scenarios in a consistent manner.
Unsophisticated initial integration: To feed the received features into the attention module, SKNet introduces another phase of feature fusion in an involuntary but inevitable way, which we call initial integration and which is implemented by addition. Therefore, besides the design of the attention module itself, the initial integration approach, as the module's input, also has a large impact on the quality of the fusion weights. Considering that the features may have large inconsistencies in scale and semantic level, an unsophisticated initial integration strategy that ignores this issue can become a bottleneck.
Biased context aggregation scale: The fusion weights in SKNet and ResNeSt are generated via the global channel attention mechanism, which is preferred for information that is distributed more globally. However, objects in an image can have an extremely large variation in size. Numerous studies have emphasized this issue when designing CNNs, i.e., that the receptive fields of predictors should match the object scale range [52, 34, 35, 22]. Therefore, merely aggregating contextual information on a global scale is too biased and weakens the features of small objects. This gives rise to the question of whether a network can dynamically and adaptively fuse the received features in a contextual scale-aware way.
Motivated by the above observations, we present the attentional feature fusion (AFF) module, trying to answer the question of what a unified approach to all kinds of feature fusion scenarios should look like, and to address the problems of contextual aggregation and initial integration. The AFF framework generalizes attention-based feature fusion from the same-layer scenario to cross-layer scenarios, including short and long skip connections, and even the initial integration inside AFF itself. It provides a universal and consistent way to improve the performance of various networks, e.g., InceptionNet, ResNet, ResNeXt, and FPN, by simply replacing the existing feature fusion operators with the proposed AFF module. Moreover, the AFF framework supports gradually refining the initial integration, namely the input of the fusion weight generator, by iteratively integrating the received features with another AFF module, which we refer to as iterative attentional feature fusion (iAFF).
To alleviate the problems arising from scale variation and small objects, we advocate the idea that attention modules should also aggregate contextual information from different receptive fields for objects of different scales. More specifically, we propose the Multi-Scale Channel Attention Module (MS-CAM), a simple yet effective scheme to remedy the feature inconsistency across different scales for attentional feature fusion. Our key observation is that scale is not an issue exclusive to the spatial attention, and the channel attention can also have scales other than the global by varying the spatial pooling size. By aggregating the multi-scale context information along the channel dimension, MS-CAM can simultaneously emphasize large objects that distribute more globally and highlight small objects that distribute more locally, facilitating the network to recognize and detect objects under extreme scale variation.
The scale variation of objects is one of the key challenges in computer vision. To remedy this issue, an intuitive way is to leverage multi-scale image pyramids [30, 2], in which objects are recognized at multiple scales and the predictions are combined using non-maximum suppression. The other line of effort aims to exploit the inherent multi-scale, hierarchical feature pyramid of CNNs to approximate image pyramids, in which features from multiple layers are fused to obtain semantic features with high resolutions [12, 31, 23].
The attention mechanism in deep learning, which mimics the human visual attention mechanism [5, 8], was originally developed on a global scale. For example, the matrix multiplication in self-attention draws global dependencies of each word in a sentence or each pixel in an image [7, 45, 1]. The Squeeze-and-Excitation Networks (SENet) squeeze global spatial information into a channel descriptor to capture channel-wise dependencies. Recently, researchers have started to take into account the scale issue of attention mechanisms. Similar to the above-mentioned approaches handling scale variation in CNNs, multi-scale attention mechanisms are achieved by either feeding multi-scale features into an attention module or combining feature contexts of multiple scales inside an attention module. In the first type, the features at multiple scales or their concatenated result are fed into the attention module to generate multi-scale attention maps, while the scale of feature context aggregation inside the attention module remains single [2, 3, 46, 6, 36, 41]. The second type, which is also referred to as multi-scale spatial attention, aggregates feature contexts by convolutional kernels of different sizes or from a pyramid [20, 44] inside the attention module.
The proposed MS-CAM follows the idea of ParseNet of combining local and global features in CNNs and the idea of multi-scale spatial attention of aggregating multi-scale feature contexts inside the attention module, but differs in at least two important aspects: 1) MS-CAM raises the scale issue in channel attention and addresses it via point-wise convolution rather than kernels of different sizes; 2) MS-CAM aggregates local and global feature contexts inside the channel attention module rather than in the backbone network. To the best of our knowledge, multi-scale channel attention has never been discussed before.
Skip connections have been an essential component of modern convolutional networks. Short skip connections, namely the identity mapping shortcuts added inside Residual blocks, provide an alternative path for the gradient to flow without interruption during backpropagation [13, 48, 50]. Long skip connections help the network to obtain semantic features with high resolutions by bridging features of finer details from lower layers and high-level semantic features of coarse resolutions [17, 23, 31, 26]. Despite being used to combine features in various pathways, the fusion of connected features is usually implemented via addition or concatenation, which allocates fixed weights to the features regardless of their contents. Recently, a few attention-based methods, e.g., Global Attention Upsample (GAU) and Skip Attention (SA), have been proposed to use high-level features as guidance to modulate the low-level features in long skip connections. However, the fusion weights for the modulated features are still fixed.
To the best of our knowledge, it is the Highway Networks that first introduced a selection mechanism in short skip connections. To some extent, the attentional skip connections proposed in this paper can be viewed as their follow-up, but differ in three points: 1) Highway Networks employ a simple fully connected layer that can only generate a scalar fusion weight, while our proposed MS-CAM generates fusion weights of the same size as the feature maps, enabling dynamic soft selections in an element-wise way. 2) Highway Networks only use one input feature to generate the weight, while our AFF module is aware of both features. 3) We point out the importance of initial feature integration, and the iAFF module is proposed as a solution.
Given an intermediate feature $X \in \mathbb{R}^{C \times H \times W}$ with $C$ channels and feature maps of size $H \times W$, the channel attention weights in SENet can be computed as

$$w = \sigma\big(\mathcal{B}(W_2\,\delta(\mathcal{B}(W_1\,g(X))))\big),$$

where $g(X) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{[:,i,j]}$ denotes the global feature context obtained by global average pooling (GAP), $\delta$ denotes the Rectified Linear Unit (ReLU) [28], $\mathcal{B}$ denotes the Batch Normalization (BN), and $\sigma$ is the Sigmoid function. This is achieved by a bottleneck with two fully connected (FC) layers, where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is a dimension reduction layer, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ is a dimension increasing layer. $r$ is the channel reduction ratio.
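For concreteness, this global channel attention can be sketched in NumPy as follows. This is a minimal sketch: Batch Normalization is omitted, the weights are random toy values, and all variable names are ours rather than from any released code.

```python
import numpy as np

def senet_channel_weights(X, W1, W2):
    """Global channel attention weights via GAP + bottleneck + sigmoid
    (Batch Normalization omitted for clarity)."""
    C = X.shape[0]
    z = X.reshape(C, -1).mean(axis=1)          # g(X): global average pooling -> (C,)
    h = np.maximum(W1 @ z, 0.0)                # delta: ReLU after dimension reduction
    return 1.0 / (1.0 + np.exp(-(W2 @ h)))     # sigma: Sigmoid -> weights in (0, 1)

# Toy example: C = 8 channels, channel reduction ratio r = 4
rng = np.random.default_rng(0)
C, r = 8, 4
X = rng.standard_normal((C, 16, 16))
W1 = 0.1 * rng.standard_normal((C // r, C))    # dimension reduction layer
W2 = 0.1 * rng.standard_normal((C, C // r))    # dimension increasing layer
w = senet_channel_weights(X, W1, W2)
print(w.shape)   # (8,) -- one scalar weight per channel
```

Note how each $H \times W$ feature map collapses to a single scalar before any channel interaction happens; this coarseness is precisely what is discussed next.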
We can see that the channel attention squeezes each feature map of size $H \times W$ into a scalar. This extremely coarse descriptor prefers to emphasize large objects that are distributed globally and can potentially wipe out most of the image signal present in a small object. However, detecting very small objects stands out as a key performance bottleneck of state-of-the-art networks. For example, the difficulty of COCO is largely due to the fact that most object instances are smaller than 1% of the image area [24, 34]. Therefore, global channel attention might not be the best choice. Multi-scale feature contexts should be aggregated inside the attention module to alleviate the problems arising from scale variation and small object instances.
In this part, we depict the proposed multi-scale channel attention module (MS-CAM) in detail. The key idea is that channel attention can be implemented at multiple scales by varying the spatial pooling size. To keep it as lightweight as possible, we merely add the local context to the global context inside the attention module. We choose the point-wise convolution (PWConv) as the local channel context aggregator, which only exploits point-wise channel interactions at each spatial position. To save parameters, the local channel context $L(X)$ is computed via a bottleneck structure as follows:

$$L(X) = \mathcal{B}\big(\mathrm{PWConv}_2(\delta(\mathcal{B}(\mathrm{PWConv}_1(X))))\big).$$

The kernel sizes of $\mathrm{PWConv}_1$ and $\mathrm{PWConv}_2$ are $\frac{C}{r} \times C \times 1 \times 1$ and $C \times \frac{C}{r} \times 1 \times 1$, respectively. It is noteworthy that $L(X)$ has the same shape as the input feature, which can preserve and highlight the subtle details in the low-level features. Given the global channel context $g(X)$ and the local channel context $L(X)$, the refined feature $X'$ produced by MS-CAM can be obtained as follows:

$$X' = X \otimes M(X) = X \otimes \sigma\big(L(X) \oplus g(X)\big),$$

where $M(X) \in \mathbb{R}^{C \times H \times W}$ denotes the attentional weights generated by MS-CAM, $\oplus$ denotes the broadcasting addition, and $\otimes$ denotes the element-wise multiplication.
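The two branches can be sketched in NumPy as follows. Batch Normalization is again omitted; a point-wise convolution reduces to mixing channels independently at every spatial position, written here with `einsum`; all shapes and weight names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ms_cam(X, W1, W2, V1, V2):
    """Multi-scale channel attention sketch: global context (GAP + bottleneck)
    plus local context (point-wise conv bottleneck), combined by
    broadcasting addition; BN omitted for clarity."""
    C = X.shape[0]
    # Global channel context: (C,) -> bottleneck -> broadcastable (C, 1, 1)
    z = X.reshape(C, -1).mean(axis=1)
    g = (W2 @ np.maximum(W1 @ z, 0.0))[:, None, None]
    # Local channel context: point-wise convs mix channels per pixel
    h = np.maximum(np.einsum('kc,chw->khw', V1, X), 0.0)
    L = np.einsum('ck,khw->chw', V2, h)        # same (C, H, W) shape as X
    M = sigmoid(L + g)                         # broadcasting addition
    return X * M                               # element-wise refinement

rng = np.random.default_rng(1)
C, r, H, W = 8, 4, 6, 6
X = rng.standard_normal((C, H, W))
W1 = 0.1 * rng.standard_normal((C // r, C))   # global-branch bottleneck
W2 = 0.1 * rng.standard_normal((C, C // r))
V1 = 0.1 * rng.standard_normal((C // r, C))   # local-branch PWConv weights
V2 = 0.1 * rng.standard_normal((C, C // r))
X_refined = ms_cam(X, W1, W2, V1, V2)
assert X_refined.shape == X.shape             # attention weights are full-resolution
```

Because $L(X)$ keeps the full spatial resolution, the resulting weights can differ per position, unlike the single per-channel scalar of the global branch alone.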
Given two feature maps $X, Y \in \mathbb{R}^{C \times H \times W}$, by default, we assume $Y$ is the feature map with the larger receptive field. More specifically,
same-layer scenario: $X$ is the output of a $3 \times 3$ kernel and $Y$ is the output of a $5 \times 5$ kernel in InceptionNet;
short skip connection scenario: $X$ is the identity mapping, and $Y$ is the learned residual in a ResNet block;
long skip connection scenario: $X$ is the low-level feature map, and $Y$ is the high-level semantic feature map in a feature pyramid.
Based on the multi-scale channel attention module $M$, the Attentional Feature Fusion (AFF) can be expressed as

$$Z = M(X \uplus Y) \otimes X + \big(1 - M(X \uplus Y)\big) \otimes Y,$$

where $Z \in \mathbb{R}^{C \times H \times W}$ is the fused feature, and $\uplus$ denotes the initial feature integration. In this subsection, for the sake of simplicity, we choose the element-wise summation as the initial integration. The AFF is illustrated in Fig. LABEL:sub@subfig:aff, where the dashed line denotes $1 - M(X \uplus Y)$. It should be noted that the fusion weights $M(X \uplus Y)$ consist of real numbers between 0 and 1, and so do the weights $1 - M(X \uplus Y)$, which enables the network to conduct a soft selection or weighted averaging between $X$ and $Y$.
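Abstracting the attention module as any map that produces weights in (0, 1), the fusion rule can be sketched as follows; the element-wise sigmoid below is only a stand-in for MS-CAM, and the shapes are illustrative.

```python
import numpy as np

def aff(X, Y, attention):
    """Attentional feature fusion sketch: fusion weights are generated
    from the initial integration X + Y and used for a soft selection."""
    M = attention(X + Y)              # initial integration by addition
    return M * X + (1.0 - M) * Y      # per-element weights sum to 1

# Stand-in attention: any map to (0, 1) of the input's shape (MS-CAM in the paper)
toy_attention = lambda F: 1.0 / (1.0 + np.exp(-F))

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5, 5))
Y = rng.standard_normal((4, 5, 5))
Z = aff(X, Y, toy_attention)
# Z is an element-wise convex combination: it always lies between X and Y
assert np.all(Z >= np.minimum(X, Y) - 1e-12)
assert np.all(Z <= np.maximum(X, Y) + 1e-12)
```

The convex-combination property checked at the end is exactly the "soft selection" behavior: each output element is a weighted average of the two inputs.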
We summarize different formulations of feature fusion in deep networks in table 1, where $G$ denotes the global attention mechanism. Although there are many implementation differences among the approaches for various feature fusion scenarios, once abstracted into mathematical form, these differences in detail disappear. Therefore, it is possible to unify these feature fusion scenarios with a carefully designed approach, thereby improving the performance of all these networks by replacing the original fusion operations with this unified approach.
| Context-aware | Type | Formulation | Scenario & Reference | Example |
|---|---|---|---|---|
| None | Addition | $X + Y$ | Short Skip [13, 14], Long Skip [26, 23] | ResNet, FPN |
| None | Concatenation | $\mathrm{Concat}(X, Y)$ | Same Layer, Long Skip [31, 17] | InceptionNet, U-Net |
| Partially | Refinement | $X + G(Y) \otimes Y$ | Short Skip [16, 15, 47, 29] | SENet |
| Partially | Modulation | $G(Y) \otimes X + Y$ | Long Skip | GAU |
| Partially | Soft Selection | $G(X) \otimes X + (1 - G(X)) \otimes Y$ | Short Skip | Highway Networks |
| Fully | Modulation | $G(X \uplus Y) \otimes X + Y$ | Long Skip | SA |
| Fully | Soft Selection | $G(X \uplus Y) \otimes X + (1 - G(X \uplus Y)) \otimes Y$ | Same Layer [21, 51] | SKNet |
| Fully | Soft Selection | $M(X \uplus Y) \otimes X + (1 - M(X \uplus Y)) \otimes Y$ | Same Layer, Short Skip, Long Skip | ours |
From table 1, it can further be seen that, apart from the implementation of the weight generation module, the state-of-the-art fusion schemes mainly differ in two crucial points: (a) Context-awareness level. Linear approaches like addition and concatenation are entirely context-unaware. Feature refinement and modulation are non-linear but only partially aware of the input feature maps; in most cases, they only exploit the high-level feature map. Fully context-aware approaches utilize both input feature maps for guidance, at the cost of raising the initial integration issue. (b) Refinement vs. modulation vs. selection. The weights applied to the two feature maps in soft selection approaches sum to 1, while this is not the case for refinement and modulation.
Unlike partially context-aware approaches, fully context-aware methods have an inevitable issue, namely how to initially integrate the input features. As the input of the attention module, the initial integration quality may profoundly affect the final fusion weights. Since it is still a feature fusion problem, an intuitive way is to have another attention module fuse the input features. We call this two-stage approach iterative Attentional Feature Fusion (iAFF), which is illustrated in Fig. LABEL:sub@subfig:iaff. Then, the initial integration in eq. 4 can be reformulated as

$$X \uplus Y = M_1(X + Y) \otimes X + \big(1 - M_1(X + Y)\big) \otimes Y,$$

where $M_1$ denotes the inner attention module, whose own initial integration is the element-wise summation.
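The two-stage scheme can be sketched by nesting one fusion inside another; the element-wise sigmoid again stands in for MS-CAM, and the inner fusion falls back to plain addition for its own initial integration. This is an illustrative sketch, not the released implementation.

```python
import numpy as np

sigmoid = lambda F: 1.0 / (1.0 + np.exp(-F))   # stand-in for MS-CAM

def aff(X, Y, attention, integrate):
    """AFF with a pluggable initial-integration step."""
    M = attention(integrate(X, Y))
    return M * X + (1.0 - M) * Y

def iaff(X, Y, attention):
    """Iterative AFF sketch: the initial integration is itself an AFF
    whose own initial integration is element-wise summation."""
    inner = lambda A, B: aff(A, B, attention, lambda a, b: a + b)
    return aff(X, Y, attention, inner)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 5, 5))
Y = rng.standard_normal((4, 5, 5))
Z = iaff(X, Y, sigmoid)
assert Z.shape == X.shape
```

The design choice here is simply recursion depth two: the inner attention refines what the outer attention sees, at the cost of a second weight-generation pass.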
To validate the proposed AFF/iAFF as a uniform and general scheme, we choose ResNet, FPN, and InceptionNet as examples for the most common scenarios: short and long skip connections as well as same-layer fusion. It is straightforward to apply AFF/iAFF to existing networks by replacing the original addition or concatenation. Specifically, we replace the concatenation in the InceptionNet module as well as the addition in the ResNet block (ResBlock) and FPN to obtain the attentional networks, which we call the AFF-Inception module, AFF-ResBlock, and AFF-FPN, respectively. This replacement and the schemes of our proposed architectures are shown in fig. 8. The iAFF modules are applied in exactly the same way, so they do not need a separate illustration.
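The replacement is mechanical. For instance, a ResBlock's addition becomes an attentional fusion of the identity and the residual, as in this toy sketch (a sigmoid stands in for MS-CAM, and the residual branch is a dummy function; names are ours):

```python
import numpy as np

sigmoid = lambda F: 1.0 / (1.0 + np.exp(-F))   # stand-in for MS-CAM

def aff(X, Y):
    M = sigmoid(X + Y)                 # fusion weights from initial integration
    return M * X + (1.0 - M) * Y

def res_block(x, residual_fn):
    """Standard ResBlock: identity and residual fused by addition."""
    return x + residual_fn(x)

def aff_res_block(x, residual_fn):
    """AFF-ResBlock sketch: the addition is replaced by attentional fusion."""
    return aff(x, residual_fn(x))

rng = np.random.default_rng(4)
x = rng.standard_normal((4, 5, 5))
residual_fn = lambda t: 0.1 * t        # dummy residual branch
y_plain = res_block(x, residual_fn)
y_aff = aff_res_block(x, residual_fn)
assert y_plain.shape == y_aff.shape == x.shape
```

The same substitution pattern applies to the concatenation in an Inception module and the addition in an FPN merge step.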
The goal of the following experimental evaluation is to show that, with the proposed AFF/iAFF and MS-CAM, convolutional networks can gain a performance boost, even with fewer layers or parameters per network.
For experimental evaluation, we resort to the following benchmark datasets: CIFAR-100 and ImageNet for image classification in the same-layer InceptionNet and short-skip connection ResNet scenarios, as well as StopSign (a subset of the COCO dataset) for semantic segmentation in the long-skip connection FPN scenario. The detailed experimental settings are listed in table 2. The ResBlock number in each stage is used to scale the networks by depth. For more implementation details, please see the supplementary material as well as our code.
| Task | Dataset | Host Network | Fusing Scenario | Epochs | Batch Size | Optimizer | Learning Rate | Learning Rate Mode | Initialization |
|---|---|---|---|---|---|---|---|---|---|
| Image Classification | CIFAR-100 | Inception-ResNet-20 | Same Layer | 400 | 128 | Nesterov | 0.2 | Step | Kaiming |
| Semantic Segmentation | StopSign | ResNet-20 + FPN | Long Skip | 300 | 32 | AdaGrad | 0.01 | Poly | Kaiming |
To study the impact of multi-scale context aggregation, in fig. 11, we construct two ablation modules, "Global + Global" and "Local + Local", in which the scales of the two contextual aggregation branches are set to be the same, either global or local. The proposed AFF is dubbed "Global + Local" here. All of them have the same number of parameters; the only difference is their context aggregation scale.
Table 3 presents their comparison on CIFAR-100, ImageNet, and StopSign on various host networks. It can be seen that the multi-scale contextual aggregation (Global + Local) outperforms single-scale ones in all settings. The results suggest that the multi-scale feature context is vital for the attentional feature fusion.
| Aggregation Scale | InceptionNet on CIFAR-100 | ResNet on CIFAR-100 | ResNet + FPN on StopSign | ResNet on ImageNet |
|---|---|---|---|---|
| Global + Global | 0.735 / 0.766 / 0.775 / 0.789 | 0.754 / 0.796 / 0.811 / 0.821 | 0.911 / 0.923 / 0.936 / 0.939 | 0.777 |
| Local + Local | 0.746 / 0.771 / 0.785 / 0.787 | 0.754 / 0.794 / 0.808 / 0.814 | 0.895 / 0.919 / 0.921 / 0.924 | 0.780 |
| Global + Local | 0.756 / 0.784 / 0.794 / 0.801 | 0.763 / 0.804 / 0.816 / 0.826 | 0.924 / 0.935 / 0.939 / 0.944 | 0.784 |
| Fusion Type | Context | Strategy | InceptionNet (Same Layer) | ResNet (Short Skip) | ResNet + FPN (Long Skip) |
Further, we investigate which feature fusion strategy in table 1 is the best. For fairness, we re-implement these approaches based on the proposed MS-CAM for attention weights. Since MS-CAM is different from their original attention modules, we add the prefix "MS-" to these newly implemented schemes. To keep the parameter budget the same, the channel reduction ratio in iAFF is set larger than the one shared by MS-GAU, MS-SE, MS-SA, and AFF.
table 4 provides the comparison results in three scenarios, from which it can be seen that: 1) compared to the linear approaches, namely addition and concatenation, non-linear fusion strategies with attention mechanisms always offer better performance; 2) our fully context-aware and selective strategy is slightly but consistently better than the others, suggesting that it should be preferred for multiple feature integration; 3) the proposed iAFF approach is significantly better than the rest in most cases. The results strongly support our hypothesis that the initial integration quality has a large impact on attentional feature fusion, and that another level of attentional feature fusion can further improve performance. However, this improvement may come at the cost of increased difficulty in optimization: we notice that as the network depth increases, the performance of iAFF-ResNet degrades rather than improves.
To study the impact of the proposed MS-CAM on object localization and small object recognition, we apply Grad-CAM  to ResNet-50, SENet-50, and AFF-ResNet-50 for the visualization results of images from the ImageNet dataset, which are illustrated in fig. 18. Given a specific class, Grad-CAM results show the network’s attended regions clearly. Here, we show the heatmaps of the predicted class, and the wrongly predicted image is denoted with the symbol ✖. The predicted class names and their softmax scores are also shown at the bottom of heatmaps.
From the upper part of fig. 18, it can be seen clearly that the attended regions of AFF-ResNet-50 highly overlap with the labeled objects, which shows that it learns well to localize objects and exploit the features in object regions. On the contrary, the localization capacity of the baseline ResNet-50 is relatively poor, misplacing the center of the attended regions in many cases. Although SENet-50 is able to locate the true objects, the attended regions are overly large, including many background components. This is because SENet-50 only utilizes the global channel attention, which is biased toward context of a global scale, whereas the proposed MS-CAM also aggregates the local channel context, which helps the network attend to objects with less background clutter and also benefits small object recognition. In the bottom half of fig. 18, we can clearly see that AFF-ResNet-50 predicts correctly on small-scale objects, while ResNet-50 fails in most cases.
To show that the network performance can be improved by replacing original fusion operations with the proposed attentional feature fusion, we compare the AFF and iAFF modules with other attention modules based on the same host networks in different feature fusion scenarios. fig. 22 illustrates the comparison results with a gradual increase in network depth for all networks. It can be seen that: 1) comparing SKNet / SENet / GAU-FPN with AFF-InceptionNet / AFF-ResNet / AFF-FPN, our AFF- or iAFF-integrated networks are better in all scenarios, which shows that our (iterative) attentional feature fusion approach not only has superior performance but also good generality. We believe the improved performance comes from the proposed multi-scale channel contextual aggregation inside the attention module. 2) Comparing the performance of iAFF-based networks with AFF-based networks, the proposed iterative attentional feature fusion scheme can further improve the performance. 3) By replacing the simple addition or concatenation with the proposed AFF or iAFF module, we obtain more efficient networks. For example, in Fig. LABEL:sub@subfig:sotashortskip, iAFF-ResNet achieves performance similar to the deeper baseline ResNet while requiring only 54% of its parameters.
Last, we compare the performance of AFF/iAFF-based networks with state-of-the-art networks on CIFAR-100 and ImageNet. The results are listed in table 5 and table 6. They show that the proposed AFF/iAFF-based networks can improve performance over the state-of-the-art networks under much smaller parameter budgets. Remarkably, on CIFAR-100, AFF-ResNeXt-38-32x4d outperforms NAT-M4 by more than 2 percentage points of absolute accuracy, although NAT-M4 has 15% more parameters. AFF-ResNet-32 achieves the same accuracy as AutoAugment+PyramidNet+ShakeDrop by merely utilizing 19% of its parameters. On ImageNet, the proposed iAFF-ResNet-50 outperforms Gather-Excite--ResNet-101 by 0.3% with only 60% of its parameters. These results indicate that the feature fusion in short skip connections matters a lot for ResNet and ResNeXt. Instead of blindly increasing the depth of the network, we should pay more attention to the quality of feature fusion.
| Network | Top-1 Accuracy (%) | Parameters |
|---|---|---|
| Attention-Augmented-Wide-ResNet-28-10 | 81.6 | 36.2 M |
| SENet-29 | 82.2 | 35.0 M |
| PyramidNet-272-200 | 83.6 | 26.0 M |
| Neural Architecture Transfer (NAT-M4) | 88.3 | 9.0 M |
| AutoAugment+PyramidNet+ShakeDrop | 89.3 | 26.0 M |
| AFF-ResNet-32 (ours) | 89.3 | 5.0 M |
| AFF-ResNeXt-38-32x4d (ours) | 90.3 | 7.8 M |
| Network | Top-1 Error (%) | Parameters |
|---|---|---|
| ResNet-101 | 23.2 | 42.5 M |
| Efficient-Channel-Attention-Net-101 | 21.4 | 42.5 M |
| Attention-Augmented-ResNet-101 | 21.3 | 45.4 M |
| SENet-101 | 20.9 | 49.4 M |
| Gather-Excite--ResNet-101 | 20.7 | 58.4 M |
| Local-Importance-Pooling-ResNet-101 | 20.7 | 42.9 M |
| AFF-ResNet-50 (ours) | 20.9 | 30.3 M |
| AFF-ResNeXt-50-32x4d (ours) | 20.8 | 29.9 M |
| iAFF-ResNet-50 (ours) | 20.4 | 35.1 M |
| iAFF-ResNeXt-50-32x4d (ours) | 20.2 | 34.7 M |
We generalize the concept of attention mechanisms as a selective and dynamic type of feature fusion to most scenarios, namely the same layer, short skip, and long skip connections as well as information integration inside the attention mechanism. To overcome the semantic and scale inconsistency issue among input features, we propose the multi-scale channel attention module, which adds local channel contexts to the global channel-wise statistics. Further, we point out that the initial integration of received features is a bottleneck in attention-based feature fusion, and it can be alleviated by adding another level of attention that we call iterative attentional feature fusion. We conducted detailed ablation studies to empirically verify the individual impact of the context-aware level, the feature integration type, and the contextual aggregation scales of our proposed attention mechanism. Experimental results on both the CIFAR-100 and the ImageNet dataset show that our models outperform state-of-the-art networks with fewer layers or parameters per network, which suggests that one should pay attention to the feature fusion in deep neural networks and that more sophisticated attention mechanisms for feature fusion hold the potential to consistently yield better results.
Multi-context attention for human pose estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 5669–5678. IEEE Computer Society, 2017.
Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, pages 4278–4284, 2017.