FcaNet: Frequency Channel Attention Networks
The attention mechanism, especially channel attention, has achieved great success in the computer vision field. Many works focus on how to design efficient channel attention mechanisms while ignoring a fundamental problem, i.e., using global average pooling (GAP) as the unquestioned pre-processing method. In this work, we start from a different view and rethink channel attention using frequency analysis. Based on the frequency analysis, we mathematically prove that the conventional GAP is a special case of the feature decomposition in the frequency domain. With the proof, we naturally generalize the pre-processing of the channel attention mechanism in the frequency domain and propose FcaNet with novel multi-spectral channel attention. The proposed method is simple but effective. We can change only one line of code in the calculation to implement our method within existing channel attention methods. Moreover, the proposed method achieves state-of-the-art results compared with other channel attention methods on image classification, object detection, and instance segmentation tasks. Our method improves Top-1 accuracy on ImageNet by 1.8% compared with the baseline SENet-50, with the same number of parameters and the same computational cost. Our code and models will be made publicly available.
As an important and challenging problem in feature modeling, attention mechanisms for convolutional neural networks (CNNs) have recently attracted considerable attention and are widely used in many fields like computer vision [xu2015show, vaswani2017attention]. In principle, they aim at selectively concentrating on some important information and have many types of variants (e.g., spatial attention, channel attention, and self-attention) corresponding to different feature dimensions. Due to its simplicity and effectiveness in feature modeling, channel attention, which directly learns to attach importance weights to different channels, has become a popular and powerful tool for the deep learning community.
In the literature, conventional channel attention approaches devote great efforts to constructing various channel importance weight functions (e.g., SENet [hu2018squeeze] using fully connected layers and ECANet [wang2020eca] based on local one-dimensional convolutions). Typically, such weight functions require a scalar for each channel to conduct the calculation due to the constrained computational overhead, and global average pooling (GAP) has become the de-facto standard choice in the deep learning community because of its simplicity and efficiency. Despite this simplicity and efficiency, GAP cannot capture the rich input pattern information well, and thus lacks feature diversity when processing different inputs. Consequently, a natural question arises: is the mean value alone adequate for representing various channels in channel attention?
With the above motivation, we carry out a theoretical analysis of the GAP for channel attention in the following aspects. First, different channels can have the same mean value while their corresponding semantic content is distinct. Second, from the perspective of frequency analysis, we prove that the GAP is equivalent to the lowest frequency of the discrete cosine transform (DCT), and only using GAP is equivalent to discarding the other frequency components, which contain much useful information on feature channels. Third, CBAM [woo2018cbam] also shows that only using GAP is not enough and adopts both GAP and global max pooling to enhance feature diversity.
In this paper, we propose a simple, novel, but effective multi-spectral channel attention framework. First, we mathematically prove that GAP is a special case of DCT frequency analysis, and it is equivalent to the lowest frequency component of DCT. Second, we naturally generalize the existing GAP channel attention mechanism in the frequency domain. We propose to use multiple but limited frequency components instead of one single GAP in the attention mechanism. By incorporating more frequency components into the attention processing, the information from these different frequency components can be exploited, leading to a multi-spectral description. As a result, the problem of insufficient information used in channel attention from the single frequency (i.e., GAP) can be addressed. Third, based on performance evaluations, we propose a two-step feature selection criterion for choosing different frequency components in the attention mechanism. Using the feature selection criterion, the proposed multi-spectral channel attention framework achieves state-of-the-art performance against the other channel attention methods. In a word, the main contributions of this work can be summarized as follows.
We prove that GAP is a special case of DCT. Based on this proof, we generalize the channel attention in the frequency domain and propose FcaNet with the multi-spectral channel attention framework.
We propose a two-step criterion for choosing frequency components by exploring the effects of using different numbers of frequency components as well as their different combinations.
Extensive experiments demonstrate that the proposed method achieves state-of-the-art results on both the ImageNet and COCO datasets. Based on the ResNet-50 backbone, it outperforms SENet by 1.8% in terms of Top-1 accuracy on ImageNet, with the same number of parameters and computational cost. The results are shown in Fig. 1.
Our method is simple yet effective and can be implemented with only one line change of code within existing channel attention implementations.
In [xu2015show], a visual attention method is first proposed to model the importance of features in the image caption task. Then many methods start to focus on the attention mechanism. A residual attention network [wang2017residual] is proposed with a spatial attention mechanism using downsampling and upsampling. Besides, SENet [hu2018squeeze] proposes the channel attention mechanism. It performs GAP on the channels and then calculates the weights of each channel using fully connected layers. Moreover, GE [hu2018gather] uses spatial attention to better exploit the feature context, and A²-Net [chen20182] builds a relation function for image or video recognition. Inspired by these works, a series of works like BAM [park2018bam], DAN [fu2019dual], CBAM [woo2018cbam], and scSE [roy2018recalibrating] are proposed to fuse spatial attention [zhu2019empirical] and channel attention. Among them, CBAM claims that GAP could only get a sub-optimal feature because of the loss of information. To address this problem, it uses both the GAP and the global max pooling and gains significant performance improvement. Motivated by CBAM, GSoP [gao2019global] introduces a second-order pooling method for downsampling. NonLocal [wang2018non] proposes to build a dense spatial feature map. AANet [bello2019attention] proposes to embed the attention map with position information into the feature. SKNet [li2019selective] introduces a selective channel aggregation and attention mechanism, and ResNeSt [zhang2020resnest] proposes a similar split attention method. Due to the complicated attention operation, these methods are relatively large. To improve efficiency, GCNet [cao2019gcnet] proposes to use a simple spatial attention module and replace the original spatial downsampling process. ECANet [wang2020eca] introduces one-dimensional convolution layers to reduce the redundancy of fully connected layers and obtains more efficient results.
Besides these works, many methods try to extend the attention mechanism to specific tasks, like multi-label classification [guo2019visual], saliency detection [zhao2019pyramid], visual explanation [fukui2019attention], and super-resolution [zhang2018image].
Frequency analysis has always been a powerful tool in the signal processing field. In recent years, some applications of frequency analysis have emerged in the deep learning field. In [ehrlich2019deep, gueguen2018faster], frequency analysis is introduced into CNNs through JPEG encoding. Then, DCT is incorporated in [xu2020learning] to reduce communication bandwidth. There are also some applications in model compression and pruning tasks, like [chen2016compressing, liu2018frequency, wang2016cnnpack].
In this section, we first revisit the formulation of channel attention and DCT frequency analysis. Then, based on these works, we elaborate on the derivation of our multi-spectral channel attention framework. Meanwhile, a two-step criterion for choosing frequency components in the framework is also proposed. At last, we give discussions about effectiveness, complexity, and code implementation.
We first elaborate on the definitions of channel attention mechanism and discrete cosine transform. Then, we briefly summarize the properties of channel attention and DCT.
The channel attention mechanism is widely used in CNNs. It uses a learnable network to weight the importance of each channel and generates more informative outputs. Suppose $X \in \mathbb{R}^{C \times H \times W}$ is the image feature tensor in networks, $C$ is the number of channels, $H$ is the height of the feature, and $W$ is the width of the feature. Then the attention mechanism can be written as [hu2018squeeze, wang2020eca]:
$$att = \mathrm{sigmoid}(fc(\mathrm{gap}(X))), \tag{1}$$
in which $att \in \mathbb{R}^{C}$ is the attention vector, $\mathrm{sigmoid}$ is the Sigmoid function, $fc$ represents the mapping functions like fully connected layer or one-dimensional convolution, and $\mathrm{gap}$ is the global average pooling. After obtaining the attention vector of all $C$ channels, each channel of input $X$ is scaled by the corresponding attention value:
$$\tilde{X}_{i,:,:} = att_i \, X_{i,:,:}, \quad \text{s.t. } i \in \{0, 1, \dots, C-1\}, \tag{2}$$
in which $\tilde{X}$ is the output of the attention mechanism, $att_i$ is the $i$-th element of the attention vector, and $X_{i,:,:}$ is the $i$-th channel of the input.
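The two steps above (pooling each channel to a scalar, then rescaling the channels) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the SENet implementation: the mapping function is assumed to be a single linear layer, nested lists stand in for tensors, and the names `gap`, `sigmoid`, and `channel_attention` are hypothetical.

```python
import math

def gap(x):
    """Global average pooling: one scalar per channel of a C x H x W nested list."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(x, weight, bias):
    """Attention vector and rescaled output; `fc` is assumed to be one linear layer."""
    z = gap(x)                                        # C scalars, one per channel
    att = [sigmoid(sum(w * s for w, s in zip(row, z)) + b)
           for row, b in zip(weight, bias)]           # attention vector of length C
    out = [[[a * v for v in r] for r in ch] for a, ch in zip(att, x)]
    return att, out

# Two channels with different content but the same mean.
x = [[[1.0, 3.0]], [[2.0, 2.0]]]
print(gap(x))
```

In this toy input both channels pool to the same scalar although their contents differ, which is the loss of information that motivates looking beyond GAP.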
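Typically, the definition of DCT can be written as [ahmed1974discrete]:
$$f_k = \sum_{i=0}^{L-1} x_i \cos\left(\frac{\pi k}{L}\left(i + \frac{1}{2}\right)\right), \quad \text{s.t. } k \in \{0, 1, \dots, L-1\}, \tag{3}$$
in which $f \in \mathbb{R}^{L}$ is the frequency spectrum of DCT, $x \in \mathbb{R}^{L}$ is the input, and $L$ is the length of the input $x$. Moreover, two-dimensional (2D) DCT can be written as:
$$f^{2d}_{h,w} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x^{2d}_{i,j} \cos\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right), \quad \text{s.t. } h \in \{0, 1, \dots, H-1\},\ w \in \{0, 1, \dots, W-1\}, \tag{4}$$
in which $f^{2d} \in \mathbb{R}^{H \times W}$ is the 2D DCT frequency spectrum, $x^{2d} \in \mathbb{R}^{H \times W}$ is the input, $H$ is the height of $x^{2d}$, and $W$ is the width of $x^{2d}$. Correspondingly, the inverse 2D DCT can be written as:
$$x^{2d}_{i,j} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} f^{2d}_{h,w} \cos\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right), \quad \text{s.t. } i \in \{0, 1, \dots, H-1\},\ j \in \{0, 1, \dots, W-1\}. \tag{5}$$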
Please note that in Eqs. 4 and 5, some constant normalization factors are removed for simplicity, which will not affect the results in this work. With the definitions of channel attention and DCT, we can summarize two key properties: a) existing methods use GAP as their pre-processing before channel attention. b) DCT can be viewed as a weighted sum of inputs with the cosine parts in Eqs. 3 and 4 being the weights. GAP is an operation of mean value due to constrained computational overhead, and it can be viewed as the simplest spectrum of input. As described in the introduction section, it is inadequate to use single GAP information in channel attention. Motivated by these properties, we can proceed to introduce our multi-spectral channel attention method.
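Property (b) can be made concrete with a minimal Python sketch of the one-dimensional case in Eq. 3. The cosine weights depend only on the input length and the frequency index, so each DCT coefficient is a dot product between the input and a fixed weight vector. The function names `dct_weights` and `dct_component` are hypothetical, and normalization factors are omitted as in the text.

```python
import math

def dct_weights(length, k):
    """Cosine weights of frequency k for an input of the given length."""
    return [math.cos(math.pi * k / length * (i + 0.5)) for i in range(length)]

def dct_component(x, k):
    """k-th unnormalized DCT coefficient as a weighted sum of the input."""
    return sum(w * v for w, v in zip(dct_weights(len(x), k), x))

x = [1.0, 2.0, 3.0, 4.0]
print(dct_weights(len(x), 0))   # all weights are cos(0) = 1
print(dct_component(x, 0))      # plain sum of the input, i.e. mean * length
```

For frequency 0 every weight equals 1, so the coefficient is the sum of the input; higher frequencies weight the same input with oscillating cosine values.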
In this section, we first theoretically discuss the problem of existing channel attention mechanisms. Based on the theoretical analysis, we then elaborate on the network design of the proposed method.
As discussed in Sec. 3.1, DCT can be viewed as a weighted sum of inputs. We further propose that GAP is actually a special case of 2D DCT.
Theorem 1. GAP is a special case of 2D DCT, and its result is proportional to the lowest frequency component of 2D DCT.
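A short sketch of why this statement holds, using the unnormalized 2D DCT of Eq. 4 and writing $\mathrm{gap}(x^{2d})$ for the mean of the $H \times W$ input. The step-by-step layout below is an illustrative reconstruction under those assumptions. Setting $h = w = 0$ makes both cosine factors equal to $\cos(0) = 1$:

```latex
\begin{aligned}
f^{2d}_{0,0}
  &= \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x^{2d}_{i,j}
     \cos\left(\frac{0}{H}\left(i + \frac{1}{2}\right)\right)
     \cos\left(\frac{0}{W}\left(j + \frac{1}{2}\right)\right) \\
  &= \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x^{2d}_{i,j} \\
  &= HW \cdot \mathrm{gap}(x^{2d}).
\end{aligned}
```

The lowest frequency component therefore equals GAP up to the constant factor $HW$, which is the proportionality claimed in the theorem.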
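Based on Theorem 1, which states that GAP is a special case of 2D DCT, it is natural to also incorporate other frequency components in the channel attention mechanism. Moreover, we can discuss the reason why we need to incorporate other frequency components using 2D DCT. For simplicity, we use $B$ to represent the basis functions of 2D DCT:
$$B^{i,j}_{h,w} = \cos\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right).$$
Then, the 2D DCT in Eq. 5 can be rewritten as:
$$x^{2d}_{i,j} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} f^{2d}_{h,w} B^{i,j}_{h,w} = f^{2d}_{0,0} B^{i,j}_{0,0} + f^{2d}_{0,1} B^{i,j}_{0,1} + \dots + f^{2d}_{H-1,W-1} B^{i,j}_{H-1,W-1} = \mathrm{gap}(x^{2d})\, HW\, B^{i,j}_{0,0} + f^{2d}_{0,1} B^{i,j}_{0,1} + \dots + f^{2d}_{H-1,W-1} B^{i,j}_{H-1,W-1}.$$
It is natural to see that an image or feature can be represented as a combination of different frequency components. According to Eq. 1, we have:
$$att = \mathrm{sigmoid}(fc(\mathrm{gap}(X))).$$
Channel attention is only based on the results of the GAP. However, combined with the decomposition above, we can see that the information of the input $X$ is not only composed of GAP (a more rigorous derivation can be found in the appendix):
$$X = \underbrace{\mathrm{gap}(X)\, HW\, B^{i,j}_{0,0}}_{\text{utilized}} + \underbrace{f^{2d}_{0,1} B^{i,j}_{0,1} + \dots + f^{2d}_{H-1,W-1} B^{i,j}_{H-1,W-1}}_{\text{discarded}}.$$
The term $HW B^{i,j}_{0,0}$ is a constant scale factor and can be ignored in the attention mechanism. In this way, only a small part of the information is used by the channel attention mechanism. The other frequency components and information are discarded in the existing channel attention methods.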
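The split into a utilized and a discarded part can be checked numerically with a small Python sketch. Because Eqs. 4 and 5 drop the normalization constants, the sketch restores them in the inverse transform (a factor of 1/H for frequency 0 and 2/H otherwise along the height axis, and likewise for the width) so that the round trip is exact. The function names and the `keep` argument are hypothetical.

```python
import math

def dct2d(x):
    """Unnormalized 2D DCT of an H x W nested list; returns f[h][w]."""
    H, W = len(x), len(x[0])
    return [[sum(x[i][j]
                 * math.cos(math.pi * h / H * (i + 0.5))
                 * math.cos(math.pi * w / W * (j + 0.5))
                 for i in range(H) for j in range(W))
             for w in range(W)]
            for h in range(H)]

def idct2d(f, keep=None):
    """Inverse 2D DCT with normalization restored.

    keep: optional set of (h, w) frequency indices to use; None keeps all.
    """
    H, W = len(f), len(f[0])

    def scale(k, n):
        return (1.0 if k == 0 else 2.0) / n

    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for h in range(H):
                for w in range(W):
                    if keep is not None and (h, w) not in keep:
                        continue
                    out[i][j] += (scale(h, H) * scale(w, W) * f[h][w]
                                  * math.cos(math.pi * h / H * (i + 0.5))
                                  * math.cos(math.pi * w / W * (j + 0.5)))
    return out

x = [[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]]
f = dct2d(x)
print(idct2d(f, keep={(0, 0)}))  # constant map: only the mean survives
print(idct2d(f))                 # all components: the input is recovered
```

Keeping only the lowest frequency reproduces a constant map equal to the channel mean, while the remaining components carry everything else about the input.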
Based on the theoretical analysis and theorem 1, we find the information used in existing channel attention is inadequate, and the pre-processing method GAP of channel attention is a special case of 2D DCT. In this way, we could naturally generalize GAP to more frequency components of 2D DCT and introduce more information to solve the problem of inadequate information in channel attention. To introduce more information, we propose to use multiple frequency components of 2D DCT, including the lowest frequency component, i.e., GAP.
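First, the input $X$ is split into many parts along the channel dimension. Denote $[X^0, X^1, \dots, X^{n-1}]$ as the parts, in which $X^i \in \mathbb{R}^{C' \times H \times W}$, $i \in \{0, 1, \dots, n-1\}$, $C' = \frac{C}{n}$, and $C$ should be divisible by $n$. For each part, a corresponding 2D DCT frequency component is assigned, and the 2D DCT results can be used as the pre-processing results of channel attention. In this way, we have:
$$Freq^i = \mathrm{2DDCT}^{u_i,v_i}(X^i) = \sum_{a=0}^{H-1} \sum_{b=0}^{W-1} X^i_{:,a,b}\, B^{a,b}_{u_i,v_i}, \quad \text{s.t. } i \in \{0, 1, \dots, n-1\}, \tag{11}$$
in which $[u_i, v_i]$ are the frequency component 2D indices corresponding to $X^i$, $a$ and $b$ index the spatial positions, $B^{a,b}_{u,v} = \cos\left(\frac{\pi u}{H}\left(a + \frac{1}{2}\right)\right) \cos\left(\frac{\pi v}{W}\left(b + \frac{1}{2}\right)\right)$ is the 2D DCT basis function, and $Freq^i \in \mathbb{R}^{C'}$ is the $C'$-dimensional vector after the pre-processing. The whole pre-processing vector can be obtained by concatenation:
$$Freq = \mathrm{cat}([Freq^0, Freq^1, \dots, Freq^{n-1}]), \tag{12}$$
in which $Freq \in \mathbb{R}^{C}$ is the obtained multi-spectral vector. The whole multi-spectral channel attention framework can be written as:
$$ms\_att = \mathrm{sigmoid}(fc(Freq)). \tag{13}$$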
From Eqs. 12 and 13, we can see that our method generalizes the original method that only uses the GAP, i.e., the lowest frequency component to a framework with multiple frequency sources. By doing so, the inadequate problem of original methods is addressed. The overall illustration of our method is shown in Fig. 2.
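The pre-processing step described above can be sketched as follows. This is a minimal plain-Python illustration, not the released implementation: nested lists stand in for tensors, the names `dct_filter` and `multi_spectral_pool` are hypothetical, and the frequency indices passed in are arbitrary examples.

```python
import math

def dct_filter(H, W, u, v):
    """H x W weights of the 2D DCT basis for the frequency indices (u, v)."""
    return [[math.cos(math.pi * u / H * (a + 0.5))
             * math.cos(math.pi * v / W * (b + 0.5))
             for b in range(W)] for a in range(H)]

def multi_spectral_pool(x, freqs):
    """Split the C channels of x into len(freqs) equal groups and reduce each
    channel with the DCT filter assigned to its group. Returns C scalars."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    n = len(freqs)
    if C % n:
        raise ValueError("C must be divisible by the number of frequencies")
    filters = [dct_filter(H, W, u, v) for u, v in freqs]  # pre-computed constants
    per_group = C // n
    out = []
    for c, ch in enumerate(x):
        filt = filters[c // per_group]
        out.append(sum(ch[a][b] * filt[a][b]
                       for a in range(H) for b in range(W)))
    return out

x = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.0]]]
print(multi_spectral_pool(x, [(0, 0), (0, 0)]))  # both groups act like GAP * H * W
print(multi_spectral_pool(x, [(0, 0), (0, 1)]))  # second group sees a different frequency
```

With every group assigned the indices (0, 0), the function returns the channel sums, i.e. GAP scaled by HW, so the conventional pre-processing is recovered as one particular choice of frequencies. The two identical channels in the example yield different scalars once they are assigned different frequencies.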
There exists an important problem of how to choose the frequency component indices for each part of the input. For each channel with a spatial size of $H \times W$, we can get $HW$ frequency components after 2D DCT. In this case, the total number of combinations of these frequency components is $CHW$, where $C$ is the number of channels. For example, $C$ could be equal to 2048 for the ResNet-50 backbone. It is expensive to test all combinations. Therefore, we propose a heuristic two-step criterion to choose the frequency components in the multi-spectral attention module. The main idea is to first determine the importance of each frequency component and then determine the effects of using different numbers of frequency components together. First, we examine the results of each frequency component in channel attention individually. Then, we choose the Top-k highest performance frequency components based on the results. In this way, the multi-spectral channel attention can be fulfilled. The ablation studies about this two-step criterion can be seen in Sec. 4.2.
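The second step of the criterion amounts to ranking the individually evaluated components and keeping the best k. A minimal sketch in Python, assuming the first step has produced a mapping from frequency indices to validation accuracy; the function name `top_k_frequencies` is hypothetical, and the scores in the example are made-up placeholders, not results from the experiments.

```python
def top_k_frequencies(accuracy_by_freq, k):
    """Keep the k frequency indices whose single-component runs scored highest."""
    ranked = sorted(accuracy_by_freq, key=accuracy_by_freq.get, reverse=True)
    return ranked[:k]

# Made-up placeholder scores for illustration only.
scores = {(0, 0): 0.90, (0, 1): 0.80, (1, 0): 0.85, (1, 1): 0.50}
print(top_k_frequencies(scores, 2))
```

The selected indices would then be assigned to the channel groups of the multi-spectral pre-processing, one index pair per group.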
In Sec. 3.2, we show that only using GAP in channel attention is actually discarding information of all other frequency components except the lowest one, i.e., GAP. In this way, generalizing channel attention in the frequency domain and using the multi-spectral framework could naturally embed more information in the channel attention mechanism. Besides the above derivation, we also give a thought experiment to show that more information could be embedded. As we all know, deep networks are redundant [he2017channel, zhuang2018discrimination]. If two channels are redundant for each other, we can only get the same information using GAP. However, in our multi-spectral framework, it is possible to extract more information from redundant channels because different frequency components contain different information. In this way, the proposed multi-spectral framework could embed more information in the channel attention mechanism.
We analyze the complexity of our method from two aspects: the number of parameters and the computational cost. For the number of parameters, our method has no extra parameters compared with the baseline SENet because the weights of 2D DCT are pre-computed constant. For the computational cost, our method has a negligible extra cost and can be viewed as having the same computational cost as SENet. With ResNet-34, ResNet-50, ResNet-101, and ResNet-152 backbone, the relative computational cost increases of our method are , , , and compared with SENet, respectively. More results can be found in Table 2.
Another important property of the proposed multi-spectral framework is that it can be easily realized with existing channel attention implementations. As described in Sec. 3.1 and Eq. 11, 2D DCT can be viewed as a weighted sum of inputs. In this way, the implementation of our method can be simply achieved by element-wise multiplication and summation. The implementation is illustrated in Fig. 3. As we can see, the only difference between the calculation of SENet and our method is the pre-processing part. For SENet, GAP is used while we use multi-spectral 2D DCT. In this way, our method could be easily integrated into arbitrary channel attention methods.
In this section, we first elaborate on the details of our experiments. Then, we show ablation studies about FcaNet. Last, we investigate the effectiveness of our method on the task of image classification, object detection, and instance segmentation.
To evaluate the results of the proposed FcaNet on ImageNet [ILSVRC15], we employ four widely used CNNs as backbone models, including ResNet-34, ResNet-50, ResNet-101, and ResNet-152. We follow the data augmentation and hyper-parameter settings in [he2016deep] and [he2019bag]. Concretely, the input images are cropped randomly to 224×224 with random horizontal flipping. We use an SGD optimizer with a momentum of 0.9, a weight decay of 1e-4, and a batch size of 128 per GPU at training time. For large models like ResNet-101 and ResNet-152, the batch size is set to 64. The learning rate is set to 0.1 for a batch size of 256 with the linear scaling rule [goyal2017accurate]. All models are trained within 100 epochs with cosine learning rate decay. Notably, for training efficiency, we use the Nvidia APEX mixed precision training toolkit.
To evaluate our method on MS COCO [lin2014microsoft], we use Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask]. We use the implementation of detectors from the MMDetection [mmdetection] toolkit and employ its default settings. During training, the shorter side of the input image is resized to 800. All models are optimized using SGD with a weight decay of 1e-4, a momentum of 0.9, and a batch size of 2 per GPU within 12 epochs. The learning rate is initialized to 0.01 and is decreased by a factor of 10 at the 8th and 11th epochs, respectively. All models are implemented in the PyTorch [paszke2019pytorch] framework and trained with eight Nvidia RTX 2080Ti GPUs.
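The learning-rate settings above can be illustrated with a small Python sketch. Several details are assumptions: all eight GPUs are taken to contribute to each step (so the total ImageNet batch size is 8 × 128 = 1024), the cosine decay uses a common warm-up-free form, epochs are counted from 0, and the function names are hypothetical.

```python
import math

def scaled_lr(total_batch, base_lr=0.1, base_batch=256):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * total_batch / base_batch

def cosine_lr(epoch, total_epochs, peak_lr):
    """A common cosine decay from peak_lr to 0 (the exact form is an assumption)."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

def step_lr(epoch, base_lr=0.01, milestones=(8, 11), factor=10.0):
    """Detection schedule: divide by `factor` once per milestone already reached."""
    return base_lr / factor ** sum(epoch >= m for m in milestones)

peak = scaled_lr(8 * 128)            # 0.1 * 1024 / 256
print(peak, cosine_lr(50, 100, peak))
print([step_lr(e) for e in (0, 8, 11)])
```

Under these assumptions the ImageNet runs with a per-GPU batch of 128 would start from a learning rate of about 0.4, and the detection runs would use 0.01, 0.001, and 0.0001 in the three phases of the 12-epoch schedule.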
As discussed in Sec. 3.2, it is expensive to verify all combinations of frequency components in our method. In this way, we propose the two-step criterion to select frequency components. In this section, we first show the results of using different components in channel attention individually. Then, we show the results of combinations with different numbers of Top-k settings.
To investigate the effects of different frequency components individually in channel attention, we only use one frequency component at a time. We divide the whole 2D DCT frequency space into 7×7 parts since the smallest feature map size is 7×7 on ImageNet. In this way, there are 49 experiments in total. To speed up the experiments, we first train a standard ResNet-50 network for 100 epochs as the base model. Then we add channel attention to the base model with different frequency components to verify the effects. All added models are trained within 20 epochs with an optimization setting similar to that in Sec. 4.1, while the learning rate is set to 0.02.
As shown in Fig. 4, we can see that using lower frequency could have better performance, which is intuitive and verifies the success of SENet. This also verifies the conclusion [xu2020learning] that deep networks prefer low-frequency information. Nevertheless, interestingly, we can see that nearly all frequency components (except the highest component) have very small gaps ( Top-1 accuracy) between the lowest one, i.e., vanilla channel attention with GAP. This shows that other frequency components can also cope well with the channel attention mechanism, and it is effective to generalize the channel attention in the frequency domain.
After obtaining the performance of each frequency component, the second step is to determine the number of components that should be used in multi-spectral channel attention. For simplicity, we select Top-k highest performance frequency components, where k could be 1, 2, 4, 8, 16, or 32.
|Number||Top-1 acc||Top-5 acc|
As shown in Table 1, we can see two phenomena. 1) All experiments with multi-spectral attention have a significant performance gap compared with the one only using the GAP in channel attention. This verifies our idea of using multiple frequency components in channel attention. 2) The setting with 16 frequency components gains the best performance. In this way, we use the Top-16 highest performance frequency components in our method and all other experiments (some other kinds of combinations can be found in the appendix).
We compare our FcaNet with the state-of-the-art methods using ResNet-34, ResNet-50, ResNet-101, and ResNet-152 backbones on ImageNet, including SENet [hu2018squeeze], CBAM [woo2018cbam], GSoP-Net1 [gao2019global], GCNet [cao2019gcnet], AANet [bello2019attention], and ECANet [wang2020eca]. The evaluation metrics include both efficiency (i.e., network parameters and floating point operations (FLOPs)) and effectiveness (i.e., Top-1/Top-5 accuracy).
|Method||Years||Backbone||Parameters||FLOPs||Top-1 acc (%)||Top-5 acc (%)|
|ResNet [he2016deep]||CVPR16||ResNet-34||21.80 M||3.68 G||73.31||91.40|
|SENet [hu2018squeeze]||CVPR18||ResNet-34||21.95 M||3.68 G||73.87||91.65|
|ECANet [wang2020eca]||CVPR20||ResNet-34||21.80 M||3.68 G||74.21||91.83|
|FcaNet (ours)||-||ResNet-34||21.95 M||3.68 G||75.07||92.16|
|ResNet [he2016deep]||CVPR16||ResNet-50||25.56 M||4.12 G||75.20||92.52|
|SENet [hu2018squeeze]||CVPR18||ResNet-50||28.07 M||4.13 G||76.71||93.38|
|CBAM [woo2018cbam]||ECCV18||ResNet-50||28.07 M||4.14 G||77.34||93.69|
|GSoPNet1 [gao2019global]||CVPR19||ResNet-50||28.29 M||6.41 G||77.98||94.12|
|GCNet [cao2019gcnet]||ICCVW19||ResNet-50||28.11 M||4.13 G||77.70||93.66|
|AANet [bello2019attention]||ICCV19||ResNet-50||25.80 M||4.15 G||77.70||93.80|
|ECANet [wang2020eca]||CVPR20||ResNet-50||25.56 M||4.13 G||77.48||93.68|
|FcaNet (ours)||-||ResNet-50||28.07 M||4.13 G||78.52||94.14|
|ResNet [he2016deep]||CVPR16||ResNet-101||44.55 M||7.85 G||76.83||93.48|
|SENet [hu2018squeeze]||CVPR18||ResNet-101||49.29 M||7.86 G||77.62||93.93|
|CBAM [woo2018cbam]||ECCV18||ResNet-101||49.30 M||7.88 G||78.49||94.31|
|AANet [bello2019attention]||ICCV19||ResNet-101||45.40 M||8.05 G||78.70||94.40|
|ECANet [wang2020eca]||CVPR20||ResNet-101||44.55 M||7.86 G||78.65||94.34|
|FcaNet (ours)||-||ResNet-101||49.29 M||7.86 G||79.64||94.63|
|ResNet [he2016deep]||CVPR16||ResNet-152||60.19 M||11.58 G||77.58||93.66|
|SENet [hu2018squeeze]||CVPR18||ResNet-152||66.77 M||11.60 G||78.43||94.27|
|AANet [bello2019attention]||ICCV19||ResNet-152||61.60 M||11.90 G||79.10||94.60|
|ECANet [wang2020eca]||CVPR20||ResNet-152||60.19 M||11.59 G||78.92||94.55|
|FcaNet (ours)||-||ResNet-152||66.77 M||11.60 G||80.08||94.88|
As shown in Table 2, our method achieves the best performance in all experimental settings. Specifically, with the same number of parameters and computational cost, our method outperforms SENet by a large margin. FcaNet outperforms SENet by 1.20%, 1.81%, 2.02%, and 1.65% in terms of Top-1 accuracy under different backbones. Note that FcaNet could also outperform GSoPNet, which has a significantly higher computational cost than our method. This shows the effectiveness of our method.
Besides the classification task on ImageNet, we also evaluate our method on object detection task to verify its effectiveness and generalization ability. We use our FcaNet with FPN [lin2017feature] as the backbone (ResNet-50 and ResNet-101) of Faster R-CNN and Mask R-CNN and test their performance on the MS COCO dataset. SENet, CBAM, GCNet, and ECANet are used for comparison.
|Method||Detector||Parameters||FLOPs||AP||AP_50||AP_75||AP_S||AP_M||AP_L|
|ResNet-50||Faster-RCNN||41.53 M||215.51 G||36.4||58.2||39.2||21.8||40.0||46.2|
|SENet||Faster-RCNN||44.02 M||215.63 G||37.7||60.1||40.9||22.9||41.9||48.2|
|ECANet||Faster-RCNN||41.53 M||215.63 G||38.0||60.6||40.9||23.4||42.1||48.0|
|FcaNet (Ours)||Faster-RCNN||44.02 M||215.63 G||39.0||61.1||42.3||23.7||42.8||49.6|
|ResNet-101||Faster-RCNN||60.52 M||295.39 G||38.7||60.6||41.9||22.7||43.2||50.4|
|SENet||Faster-RCNN||65.24 M||295.58 G||39.6||62.0||43.1||23.7||44.0||51.4|
|ECANet||Faster-RCNN||60.52 M||295.58 G||40.3||62.9||44.0||24.5||44.7||51.3|
|FcaNet (Ours)||Faster-RCNN||65.24 M||295.58 G||41.2||63.3||44.6||23.8||45.2||53.1|
|ResNet-50||Mask-RCNN||44.17 M||261.81 G||37.2||58.9||40.3||22.2||40.7||48.0|
|SENet||Mask-RCNN||46.66 M||261.93 G||38.7||60.9||42.1||23.4||42.7||50.0|
|GCNet||Mask-RCNN||46.69 M||261.94 G||39.4||61.6||42.4||N/A||N/A||N/A|
|ECANet||Mask-RCNN||44.17 M||261.93 G||39.0||61.3||42.1||24.2||42.8||49.9|
|FcaNet (Ours)||Mask-RCNN||46.66 M||261.93 G||40.3||62.0||44.1||25.2||43.9||52.0|
As shown in Table 3, our method could also achieve the best performance with both Faster-RCNN and Mask-RCNN framework. Identical to the classification task on ImageNet, FcaNet could also outperform SENet by a large margin with the same number of parameters and computational cost. Compared with the SOTA method ECANet, FcaNet could outperform it by 0.9-1.3% in terms of AP.
Besides the object detection, we then test our method on the instance segmentation task. As shown in Table 4, our method outperforms other methods by a more considerable margin. Specifically, FcaNet outperforms GCNet by 0.5% AP, while the gaps between other methods are roughly 0.1-0.2%. These results verify the effectiveness of our method.
In this paper, we have proven that GAP is a special case of DCT and proposed the FcaNet with the multi-spectral attention module, which generalizes the existing channel attention mechanism in the frequency domain. Meanwhile, we have explored different combinations of frequency components in our multi-spectral framework and proposed a two-step criterion for frequency components selection. With the same number of parameters and computational cost, our method could consistently outperform SENet by a large margin. We also have achieved state-of-the-art performance on image classification, object detection, and instance segmentation compared with other channel attention methods. Moreover, FcaNet is simple yet effective. Our method could be implemented with only one line change of code based on existing channel attention methods.
This section shows more results about using different frequency combinations in the proposed multi-spectral channel attention module. In Sec. 4.2, we present a two-step method to select the best frequency component combinations in the proposed multi-spectral channel attention mechanism. Besides the proposed combinations, we also try some other possibilities of combinations, as shown in Fig. 5. The first one is an intuitive method, termed as Low-k (Lowest-k), as shown in Fig. 5(b). Low-k selects the lowest k frequency components (left upper triangle part of the 2D frequency spectrum) as the combinations. It only considers the frequency of the components and has no relation to the performance of the individual component in Fig. 4. The second one is the counterpart of our two-step Top-k method, termed as Bot-k (Bottom-k), as shown in Fig. 5(c). Bot-k selects the k frequency components with the lowest performance, which is exactly the opposite of the Top-k method.
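The Low-k selection can be sketched in Python as follows. The text only says that the lowest k components lie in the upper-left triangle of the 2D frequency spectrum, so ordering the indices by the sum h + w and breaking ties by h are assumptions, as is the function name `low_k`; the default grid size of 7 follows the 7×7 frequency space used in the ablation.

```python
def low_k(k, size=7):
    """Pick k indices from the upper-left of a size x size 2D DCT grid,
    ordered by h + w (ordering and tie-breaking are assumptions)."""
    grid = [(h, w) for h in range(size) for w in range(size)]
    return sorted(grid, key=lambda hw: (hw[0] + hw[1], hw[0]))[:k]

print(low_k(3))
```

Unlike Top-k and Bot-k, this selection never consults the per-component accuracies, which is the distinction drawn in the paragraph above.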
|Number||Top-1 acc||Top-5 acc|
|Number||Top-1 acc||Top-5 acc|
The highest performance of our Top-k method is 78.52%. Compared with Bot-k in Table 6, the results show that low-frequency components are important. Compared with Low-k in Table 6, the Top-k method also performs better. This shows that we should take the performance of individual frequency component into consideration and demonstrates the effectiveness of our two-step criterion.
In this section, we show some visualization results related to the discrete cosine transform (DCT).
In Fig. 6(a), we show the image of the basis functions of 2D DCT. We can see that 2D DCT basis functions are composed of regular horizontal and vertical cosine waves. These basis functions are orthogonal and data-independent. In Fig. 6(b), we show the selected frequency components using our two-step criterion. We can see that the selected frequency components are usually low-frequency.
Subsequently, we can give a more detailed derivation.
in which is the i-th channel of feature, , , , and We can see that the conventional channel attention is actually discarding information from all other frequency components except the lowest one. Note that this derivation is in the matrix form.