FcaNet: Frequency Channel Attention Networks

12/22/2020 · by Zequn Qin, et al. · Zhejiang University

Attention mechanisms, especially channel attention, have achieved great success in the computer vision field. Many works focus on designing efficient channel attention mechanisms while ignoring a fundamental problem: global average pooling (GAP) is taken for granted as the pre-processing method. In this work, we start from a different view and rethink channel attention using frequency analysis. Based on the frequency analysis, we mathematically prove that conventional GAP is a special case of feature decomposition in the frequency domain. With this proof, we naturally generalize the pre-processing of the channel attention mechanism to the frequency domain and propose FcaNet with a novel multi-spectral channel attention. The proposed method is simple but effective, and can be implemented by changing a single line of code within existing channel attention methods. Moreover, it achieves state-of-the-art results compared with other channel attention methods on image classification, object detection, and instance segmentation tasks. Our method improves Top-1 accuracy by 1.8% over the baseline SENet-50, with the same number of parameters and the same computational cost. Our code and models will be made publicly available.

1 Introduction

As an important and challenging topic in feature modeling, attention mechanisms for convolutional neural networks (CNNs) have recently attracted considerable interest and are widely used in fields like computer vision [xu2015show] and natural language processing [vaswani2017attention]. In principle, they aim to selectively concentrate on important information and come in many variants (e.g., spatial attention, channel attention, and self-attention) corresponding to different feature dimensions. Owing to its simplicity and effectiveness in feature modeling, channel attention, which directly learns to attach importance weights to different channels, has become a popular and powerful tool for the deep learning community.

Figure 1:

Classification accuracy comparison on ImageNet. With the same number of parameters and computational cost, our method consistently outperforms the baseline SENet by a large margin. Our method with the ResNet-50 backbone could even outperform SENet with the ResNet-152 backbone.

In the literature, conventional channel attention approaches devote great effort to constructing various channel importance weight functions (e.g., SENet [hu2018squeeze] using fully connected layers and ECANet [wang2020eca] based on local one-dimensional convolutions). Due to the constrained computational overhead, such weight functions typically require a scalar per channel to conduct the calculation, and global average pooling (GAP) has become the de-facto standard choice in the deep learning community because of its simplicity and efficiency. Despite this simplicity and efficiency, there is a potential problem: GAP cannot capture rich input pattern information well and thus lacks feature diversity when processing different inputs. This raises a natural question: is the mean value alone adequate for representing the various channels in channel attention?

With the above motivation, we carry out a theoretical analysis of GAP for channel attention in the following aspects. First, different channels may well have the same mean value while their semantic content is distinct. Second, from the perspective of frequency analysis, we prove that GAP is equivalent to the lowest frequency of the discrete cosine transform (DCT), so using only GAP is equivalent to discarding all other frequency components, which carry much useful information about the feature channels. Third, CBAM [woo2018cbam] also shows that using only GAP is not enough and adopts both GAP and global max pooling to enhance feature diversity.

In this paper, we propose a simple, novel, yet effective multi-spectral channel attention framework. First, we mathematically prove that GAP is a special case of DCT frequency analysis: it is equivalent to the lowest frequency component of the DCT. Second, we naturally generalize the existing GAP-based channel attention mechanism to the frequency domain and propose to use multiple, but limited, frequency components instead of the single GAP. By incorporating more frequency components into the attention processing, the information they carry can be exploited, leading to a multi-spectral description. As a result, the problem of insufficient information in single-frequency (i.e., GAP) channel attention is addressed. Third, based on performance evaluations, we propose a two-step feature selection criterion for choosing the frequency components used in the attention mechanism. With this criterion, the proposed multi-spectral channel attention framework achieves state-of-the-art performance against the other channel attention methods.

In a word, the main contributions of this work can be summarized as follows.

  • We prove that GAP is a special case of DCT. Based on this proof, we generalize the channel attention in the frequency domain and propose FcaNet with the multi-spectral channel attention framework.

  • We propose a two-step criterion for choosing frequency components by exploring the effects of using different numbers of frequency components as well as their different combinations.

  • Extensive experiments demonstrate that the proposed method achieves state-of-the-art results on both the ImageNet and COCO datasets. Based on the ResNet-50 backbone, it outperforms SENet by 1.8% in terms of Top-1 accuracy on ImageNet, with the same number of parameters and computational cost. The results are shown in Fig. 1.

  • Our method is simple yet effective and can be implemented with only one line change of code within existing channel attention implementations.

2 Related Work

Attention Mechanism in CNNs

In [xu2015show], a visual attention method is first proposed to model the importance of features in the image captioning task. Many methods have since focused on attention mechanisms. A residual attention network [wang2017residual] is proposed with a spatial attention mechanism using downsampling and upsampling. SENet [hu2018squeeze] proposes the channel attention mechanism: it performs GAP on the channels and then calculates the weight of each channel using fully connected layers. Furthermore, GE [hu2018gather] uses spatial attention to better exploit the feature context, and $A^2$-Net [chen20182] builds a relation function for image or video recognition. Inspired by these works, a series of works like BAM [park2018bam], DAN [fu2019dual], CBAM [woo2018cbam], and scSE [roy2018recalibrating] are proposed to fuse spatial attention [zhu2019empirical] and channel attention. Among them, CBAM claims that GAP only obtains a sub-optimal feature because of the loss of information; to address this problem, it uses both GAP and global max pooling and gains significant performance improvements. Motivated by CBAM, GSoP [gao2019global] introduces a second-order pooling method for downsampling. NonLocal [wang2018non] proposes to build a dense spatial feature map. AANet [bello2019attention] proposes to embed the attention map with position information into the feature. SKNet [li2019selective] introduces a selective channel aggregation and attention mechanism, and ResNeSt [zhang2020resnest] proposes a similar split-attention method. Due to their complicated attention operations, these methods are relatively heavy. To improve efficiency, GCNet [cao2019gcnet] proposes a simple spatial attention module as a replacement for the original spatial downsampling process, and ECANet [wang2020eca] introduces one-dimensional convolution layers to reduce the redundancy of fully connected layers, obtaining more efficient results. Besides these works, many methods extend the attention mechanism to specific tasks, like multi-label classification [guo2019visual], saliency detection [zhao2019pyramid], visual explanation [fukui2019attention], and super-resolution [zhang2018image].

Frequency Domain Learning

Frequency analysis has long been a powerful tool in the signal processing field, and in recent years several applications introducing it into deep learning have emerged. In [ehrlich2019deep, gueguen2018faster], frequency analysis is introduced into CNNs via JPEG encoding. DCT is incorporated in [xu2020learning] to reduce communication bandwidth. There are also applications in model compression and pruning tasks [chen2016compressing, liu2018frequency, wang2016cnnpack].

3 Method

In this section, we first revisit the formulation of channel attention and DCT frequency analysis. Then, based on these works, we elaborate on the derivation of our multi-spectral channel attention framework. Meanwhile, a two-step criterion for choosing frequency components in the framework is also proposed. At last, we give discussions about effectiveness, complexity, and code implementation.

3.1 Revisiting Channel Attention and DCT

We first elaborate on the definitions of channel attention mechanism and discrete cosine transform. Then, we briefly summarize the properties of channel attention and DCT.

Channel Attention

The channel attention mechanism is widely used in CNNs. It uses a learnable network to weight the importance of each channel and generates more informative outputs. Suppose $X \in \mathbb{R}^{C \times H \times W}$ is the image feature tensor in networks, where $C$ is the number of channels, $H$ is the height of the feature, and $W$ is the width of the feature. Then the attention mechanism can be written as [hu2018squeeze, wang2020eca]:

$$att = \mathrm{sigmoid}(fc(\mathrm{gap}(X))), \tag{1}$$

where $att \in \mathbb{R}^{C}$ is the attention vector, $\mathrm{sigmoid}$ is the Sigmoid function, $fc$ represents a mapping function such as a fully connected layer or a one-dimensional convolution, and $\mathrm{gap}$ is global average pooling. After obtaining the attention vector of all channels, each channel of the input $X$ is scaled by the corresponding attention value:

$$\widetilde{X}_{:,i,:,:} = att_i X_{:,i,:,:}, \quad \text{s.t. } i \in \{0, 1, \dots, C-1\}, \tag{2}$$

in which $\widetilde{X}$ is the output of the attention mechanism, $att_i$ is the $i$-th element of the attention vector, and $X_{:,i,:,:}$ is the $i$-th channel of the input.
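As a concrete reference, the following is a minimal PyTorch sketch of an SE-style channel attention block implementing Eqs. 1 and 2. The class name and the reduction ratio of 16 are our illustrative assumptions, not the exact code of [hu2018squeeze].

```python
import torch
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """Minimal SE-style channel attention (Eqs. 1 and 2).

    Hypothetical sketch: the class name and the reduction ratio are
    illustrative assumptions, not the authors' exact code.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(                 # the `fc` mapping in Eq. 1
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        y = x.mean(dim=(2, 3))                   # gap(X): one scalar per channel
        att = torch.sigmoid(self.fc(y))          # Eq. 1
        return x * att.view(n, c, 1, 1)          # Eq. 2: per-channel rescaling
```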

Discrete Cosine Transform (DCT)

Typically, the definition of the (one-dimensional) DCT can be written as [ahmed1974discrete]:

$$f_k = \sum_{i=0}^{L-1} x_i \cos\!\left(\frac{\pi k}{L}\left(i + \frac{1}{2}\right)\right), \quad \text{s.t. } k \in \{0, 1, \dots, L-1\}, \tag{3}$$

in which $f \in \mathbb{R}^{L}$ is the frequency spectrum of the DCT, $x \in \mathbb{R}^{L}$ is the input, and $L$ is the length of the input $x$. Moreover, the two-dimensional (2D) DCT can be written as:

$$f^{2d}_{h,w} = \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} x^{2d}_{i,j} \cos\!\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\!\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right), \quad \text{s.t. } h \in \{0, \dots, H-1\},\ w \in \{0, \dots, W-1\}, \tag{4}$$

in which $f^{2d} \in \mathbb{R}^{H \times W}$ is the 2D DCT frequency spectrum, $x^{2d} \in \mathbb{R}^{H \times W}$ is the input, $H$ is the height of $x^{2d}$, and $W$ is the width of $x^{2d}$. Correspondingly, the inverse 2D DCT can be written as:

$$x^{2d}_{i,j} = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} f^{2d}_{h,w} \cos\!\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\!\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right). \tag{5}$$

Please note that in Eqs. 4 and 5, some constant normalization factors are removed for simplicity, which does not affect the results in this work. With the definitions of channel attention and DCT, we can summarize two key properties: a) existing methods use GAP as the pre-processing step before channel attention; b) DCT can be viewed as a weighted sum of inputs, with the cosine parts in Eqs. 3 and 4 being the weights. GAP, a mean operation adopted because of the constrained computational overhead, can be viewed as the simplest spectrum of the input. As described in the introduction, using only the single GAP value in channel attention is inadequate. Motivated by these properties, we can proceed to introduce our multi-spectral channel attention method.
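To make the weighted-sum view concrete, here is a small PyTorch sketch (our own illustration, with a hypothetical function name) that computes one 2D DCT frequency component of Eq. 4 as an element-wise multiplication followed by a sum:

```python
import math
import torch

def dct2d_component(x2d: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """The (h, w) frequency component of the unnormalized 2D DCT of Eq. 4,
    computed as an element-wise weighted sum (hypothetical helper)."""
    H, W = x2d.shape
    i = torch.arange(H, dtype=x2d.dtype).view(H, 1)
    j = torch.arange(W, dtype=x2d.dtype).view(1, W)
    weight = torch.cos(math.pi * h * (i + 0.5) / H) * \
             torch.cos(math.pi * w * (j + 0.5) / W)
    return (x2d * weight).sum()

x = torch.rand(7, 7)
spectrum = torch.stack([torch.stack([dct2d_component(x, h, w)
                                     for w in range(7)]) for h in range(7)])
print(spectrum.shape)  # torch.Size([7, 7]): the full frequency spectrum f^{2d}
```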

3.2 Multi-Spectral Channel Attention

In this section, we first theoretically discuss the problem of existing channel attention mechanisms. Based on the theoretical analysis, we then elaborate on the network design of the proposed method.

Theoretical Analysis of Channel Attention

As discussed in Sec. 3.1, DCT can be viewed as a weighted sum of inputs. We further propose that GAP is actually a special case of 2D DCT.

Theorem 1.

GAP is a special case of 2D DCT, and its result is proportional to the lowest frequency component of 2D DCT.

(a) Original SENet
(b) Multi-spectral channel attention
Figure 2: Illustration of existing channel attention and multi-spectral channel attention. For simplicity, the 2D DCT indices are represented in the one-dimensional format. We can see that our method uses multiple frequency components with the selected DCT bases, while SENet only uses GAP in channel attention. Best viewed in color.
Proof.

Suppose $h$ and $w$ in Eq. 4 are both 0. We have:

$$f^{2d}_{0,0} = \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} x^{2d}_{i,j} \cos\!\left(\frac{0}{H}\left(i + \frac{1}{2}\right)\right) \cos\!\left(\frac{0}{W}\left(j + \frac{1}{2}\right)\right) = \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} x^{2d}_{i,j} = \mathrm{gap}(x^{2d})HW. \tag{6}$$

In Eq. 6, $f^{2d}_{0,0}$ represents the lowest frequency component of the 2D DCT, and it is proportional to GAP. In this way, Theorem 1 is proved. ∎
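A quick numerical check of Theorem 1 (our own illustration, not from the paper's code):

```python
import torch

# With h = w = 0, every cosine weight in Eq. 4 equals cos(0) = 1, so the
# lowest 2D DCT component reduces to the plain sum of the input, i.e.
# gap(x) * HW, exactly as in Eq. 6.
H, W = 7, 7
x = torch.rand(H, W)
f00 = x.sum()                                 # Eq. 6
print(torch.allclose(f00, x.mean() * H * W))  # True
```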

Based on Theorem 1, i.e., that GAP is a special case of the 2D DCT, it is natural to also incorporate other frequency components into the channel attention mechanism. Moreover, we can use the 2D DCT to explain why other frequency components are needed. For simplicity, we use $B^{i,j}_{h,w}$ to represent the basis functions of the 2D DCT:

$$B^{i,j}_{h,w} = \cos\!\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\!\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right). \tag{7}$$

Then, the inverse 2D DCT in Eq. 5 can be rewritten as:

$$x^{2d}_{i,j} = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} f^{2d}_{h,w} B^{i,j}_{h,w}. \tag{8}$$

It is natural to see that an image or a feature can be represented as a combination of different frequency components. According to Eq. 1, we have:

$$att = \mathrm{sigmoid}(fc(\mathrm{gap}(X))). \tag{9}$$

Channel attention is thus based only on the result of GAP. However, combined with Eq. 8, we can see that the information of the input is not composed of GAP alone (a more rigorous derivation can be found in the appendix):

$$x^{2d}_{i,j} = \mathrm{gap}(x^{2d})HW B^{i,j}_{0,0} + f^{2d}_{0,1}B^{i,j}_{0,1} + \cdots + f^{2d}_{H-1,W-1}B^{i,j}_{H-1,W-1}. \tag{10}$$

The term $HW B^{i,j}_{0,0}$ is a constant scale factor (note that $B^{i,j}_{0,0} = 1$) and can be ignored in the attention mechanism. In this way, only a small part of the information is used by the channel attention mechanism; all other frequency components and their information are discarded in the existing channel attention methods.
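The following small PyTorch sketch (our own illustration, with hypothetical helper names) makes Eq. 10 concrete: it decomposes a feature map into its 2D DCT components and shows that the GAP-based term alone only recovers a constant map, while all components together recover the input.

```python
import math
import torch

def dct_basis(H: int, W: int) -> torch.Tensor:
    """All HW basis functions B^{i,j}_{h,w} of Eq. 7, returned with shape
    (H, W, H, W) and indexed as basis[h, w, i, j] (hypothetical helper)."""
    freq_h = torch.arange(H).float()
    freq_w = torch.arange(W).float()
    pos_i = torch.arange(H).float() + 0.5
    pos_j = torch.arange(W).float() + 0.5
    ch = torch.cos(math.pi * torch.outer(freq_h, pos_i) / H)  # (h, i)
    cw = torch.cos(math.pi * torch.outer(freq_w, pos_j) / W)  # (w, j)
    return torch.einsum('hi,wj->hwij', ch, cw)

H = W = 7
x = torch.rand(H, W)
B = dct_basis(H, W)
f = torch.einsum('ij,hwij->hw', x, B)        # all frequency components (Eq. 4)

# The GAP term of Eq. 10 alone only recovers a constant map ...
x_gap_only = (f[0, 0] / (H * W)) * B[0, 0]   # B_{0,0} is all ones
print(float(x_gap_only.std()))               # 0.0: every spatial detail is lost
# ... while all components together reconstruct x (basis orthogonality).
norm = torch.einsum('hwij,hwij->hw', B, B)   # squared norm of each basis
x_full = torch.einsum('hw,hwij->ij', f / norm, B)
print(torch.allclose(x_full, x, atol=1e-5))  # True
```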

Multi-Spectral Attention Module

Based on the theoretical analysis and Theorem 1, we find that the information used in existing channel attention is inadequate and that GAP, the pre-processing step of channel attention, is a special case of the 2D DCT. We can therefore naturally generalize GAP to more frequency components of the 2D DCT, introducing more information to address this deficiency. Concretely, we propose to use multiple frequency components of the 2D DCT, including the lowest one, i.e., GAP.

(a) Original SENet
(b) Our implementation
Figure 3: Implementation of our method and SENet. In the calculation, we only need to change one line of code to implement our method based on the existing code. The lines in red and green indicate the difference between SENet and our work. The get_dct_weights function is to implement Eq. 7 and the details can be found in the appendix.

First, the input $X$ is split into many parts along the channel dimension. Denote the parts as $[X^0, X^1, \dots, X^{n-1}]$, in which $X^i \in \mathbb{R}^{C' \times H \times W}$, $i \in \{0, 1, \dots, n-1\}$, $C' = \frac{C}{n}$, and $C$ should be divisible by $n$. For each part, a corresponding 2D DCT frequency component is assigned, and the 2D DCT results can be used as the pre-processing results of channel attention. In this way, we have:

$$Freq^i = \mathrm{2DDCT}^{u_i, v_i}(X^i) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} X^{i}_{:,h,w} B^{h,w}_{u_i,v_i}, \quad \text{s.t. } i \in \{0, 1, \dots, n-1\}, \tag{11}$$

in which $[u_i, v_i]$ are the frequency component 2D indices corresponding to $X^i$, and $Freq^i \in \mathbb{R}^{C'}$ is the $C'$-dimensional vector after the pre-processing. The whole pre-processing vector can be obtained by concatenation:

$$Freq = \mathrm{cat}([Freq^0, Freq^1, \dots, Freq^{n-1}]), \tag{12}$$

in which $Freq \in \mathbb{R}^{C}$ is the obtained multi-spectral vector. The whole multi-spectral channel attention framework can then be written as:

$$ms\text{-}att = \mathrm{sigmoid}(fc(Freq)). \tag{13}$$

From Eqs. 12 and 13, we can see that our method generalizes the original method, which uses only GAP (i.e., the lowest frequency component), to a framework with multiple frequency sources. By doing so, the problem of inadequate information in the original methods is addressed. The overall illustration of our method is shown in Fig. 2.
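Below is a minimal PyTorch sketch of the multi-spectral attention of Eqs. 11-13, written by us for illustration; the class name, the filter-building helper, and the assumption that the input spatial size matches the filter size are ours, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

def multispectral_filters(freq_idx, channels, H, W):
    """One DCT basis (Eq. 7) per frequency in freq_idx, each tiled over a
    group of channels // n channels (our helper; see also the appendix)."""
    n = len(freq_idx)
    filt = torch.zeros(channels, H, W)
    c_part = channels // n                        # C' = C / n (Eq. 11)
    for k, (u, v) in enumerate(freq_idx):
        ci = torch.cos(math.pi * u * (torch.arange(H) + 0.5) / H).view(H, 1)
        cj = torch.cos(math.pi * v * (torch.arange(W) + 0.5) / W).view(1, W)
        filt[k * c_part:(k + 1) * c_part] = ci * cj
    return filt

class MultiSpectralAttention(nn.Module):
    """Sketch of multi-spectral channel attention (Eqs. 11-13); assumes the
    input spatial size equals (H, W). Hypothetical names throughout."""
    def __init__(self, channels, H, W, freq_idx, reduction=16):
        super().__init__()
        # Pre-computed constant DCT weights: a buffer, not parameters.
        self.register_buffer('dct_weights',
                             multispectral_filters(freq_idx, channels, H, W))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        freq = (x * self.dct_weights).sum(dim=(2, 3))  # Eqs. 11 and 12
        att = torch.sigmoid(self.fc(freq))             # Eq. 13
        return x * att.view(n, c, 1, 1)
```

For instance, `MultiSpectralAttention(256, 56, 56, [(0, 0), (0, 1), (1, 0), (1, 1)])` would split 256 channels into four 64-channel parts, each pooled with its own frequency; with `freq_idx = [(0, 0)]` the module reduces to GAP-based attention up to the constant factor $HW$.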

Criterion for choosing frequency components

There exists an important problem of how to choose the frequency component indices $[u_i, v_i]$ for each part $X^i$. For each channel with a spatial size of $H \times W$, we can get $HW$ frequency components after the 2D DCT, so the total number of possible combinations of these frequency components is $2^{HW}$; for the ResNet-50 backbone, whose smallest feature map is $7 \times 7$, this amounts to $2^{49}$ combinations. It is too expensive to test them all. We therefore propose a heuristic two-step criterion to choose the frequency components in the multi-spectral attention module. The main idea is to first determine the importance of each frequency component individually and then determine the effects of using different numbers of frequency components together. First, we examine the results of each frequency component in channel attention individually. Then, we choose the Top-k highest-performing frequency components based on these results. In this way, the multi-spectral channel attention can be fulfilled. The ablation studies about this two-step criterion can be seen in Sec. 4.2.
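The two-step criterion itself is easy to express in code. The sketch below is ours, with a hypothetical evaluation callback standing in for a full single-component training run:

```python
from typing import Callable, List, Tuple

def choose_topk_frequencies(
    evaluate: Callable[[Tuple[int, int]], float],  # accuracy of one component
    H: int = 7, W: int = 7, k: int = 16,
) -> List[Tuple[int, int]]:
    """Two-step criterion (hypothetical sketch): score each of the H*W
    frequency components individually, then keep the k best-scoring ones."""
    # Step 1: train/evaluate channel attention with a single component (u, v).
    scores = {(u, v): evaluate((u, v)) for u in range(H) for v in range(W)}
    # Step 2: pick the Top-k highest-performing components.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```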

3.3 Discussion

How the multi-spectral framework embeds more information

In Sec. 3.2, we show that using only GAP in channel attention discards the information of all frequency components except the lowest one, i.e., GAP itself. Generalizing channel attention in the frequency domain with the multi-spectral framework thus naturally embeds more information in the channel attention mechanism. Besides the above derivation, we also give a thought experiment showing that more information can be embedded. It is well known that deep networks are redundant [he2017channel, zhuang2018discrimination]. If two channels are redundant for each other, GAP extracts the same information from both. In our multi-spectral framework, however, different frequency components contain different information, so more information can be extracted from redundant channels. In this way, the proposed multi-spectral framework embeds more information in the channel attention mechanism.
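The thought experiment can be made concrete with a tiny PyTorch snippet (ours, purely illustrative):

```python
import math
import torch

x = torch.rand(7, 7)
ch1, ch2 = x, x.clone()                 # two mutually redundant channels
print(bool(ch1.mean() == ch2.mean()))   # True: GAP extracts identical scalars

# With different DCT components assigned per channel (Eq. 11), the two
# redundant channels yield two distinct descriptors:
j = torch.arange(7).float()
b01 = torch.cos(math.pi * (j + 0.5) / 7).view(1, 7).expand(7, 7)  # B_{0,1}
f00 = ch1.sum()                         # component (0, 0) from channel 1
f01 = (ch2 * b01).sum()                 # component (0, 1) from channel 2
print(float(f00), float(f01))           # two different values, more information
```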

Complexity analysis

We analyze the complexity of our method from two aspects: the number of parameters and the computational cost. For the number of parameters, our method has no extra parameters compared with the baseline SENet because the 2D DCT weights are pre-computed constants. For the computational cost, our method introduces only a negligible extra cost with the ResNet-34, ResNet-50, ResNet-101, and ResNet-152 backbones and can be viewed as having the same computational cost as SENet. More results can be found in Table 2.
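As a quick sanity check of the parameter claim, the standalone snippet below (ours, with a hypothetical skeleton module) shows that a buffer of constant DCT weights contributes zero parameters, so only the fc layers count, exactly as in an SE block of the same width:

```python
import torch
import torch.nn as nn

class AttnSkeleton(nn.Module):
    """Skeleton used only to count parameters (hypothetical)."""
    def __init__(self, channels=64, H=7, W=7, reduction=16):
        super().__init__()
        # Constant DCT weights: a non-trainable buffer, hence no parameters.
        self.register_buffer('dct_weights', torch.ones(channels, H, W))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.Linear(channels // reduction, channels, bias=False),
        )

m = AttnSkeleton()
print(sum(p.numel() for p in m.parameters()))
# 512 = 64*4 + 4*64: only the fc layers count, as in an SE block.
```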

One line change of code

Another important property of the proposed multi-spectral framework is that it can be easily realized with existing channel attention implementations. As described in Sec. 3.1 and Eq. 11, 2D DCT can be viewed as a weighted sum of inputs. In this way, the implementation of our method can be simply achieved by element-wise multiplication and summation. The implementation is illustrated in Fig. 3. As we can see, the only difference between the calculation of SENet and our method is the pre-processing part. For SENet, GAP is used while we use multi-spectral 2D DCT. In this way, our method could be easily integrated into arbitrary channel attention methods.
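Concretely, the change amounts to replacing the GAP pre-processing with a weighted sum, in the spirit of Fig. 3. The snippet below is our paraphrase of that figure, with the weight tensor name as an assumption; the all-ones filters also illustrate the GAP special case of Theorem 1:

```python
import torch

x = torch.rand(2, 64, 7, 7)              # an N x C x H x W feature
dct_weights = torch.ones(64, 7, 7)       # pre-computed filters (Eq. 7);
                                         # all ones = the GAP special case

y_se = x.mean(dim=(2, 3))                # SENet pre-processing: gap(X)
y_ms = (x * dct_weights).sum(dim=(2, 3)) # ours: the one-line change (Eq. 11)
print(torch.allclose(y_ms, y_se * 49))   # True for the all-ones (GAP) filters
```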

4 Experiments

In this section, we first elaborate on the details of our experiments. Then, we show ablation studies about FcaNet. Last, we investigate the effectiveness of our method on the task of image classification, object detection, and instance segmentation.

4.1 Implementation Details

To evaluate the results of the proposed FcaNet on ImageNet [ILSVRC15], we employ four widely used CNNs as backbone models, including ResNet-34, ResNet-50, ResNet-101, and ResNet-152. We follow the data augmentation and hyper-parameter settings in [he2016deep] and [he2019bag]. Concretely, the input images are randomly cropped to 224×224 with random horizontal flipping. We use an SGD optimizer with a momentum of 0.9, a weight decay of 1e-4, and a batch size of 128 per GPU at training time. For large models like ResNet-101 and ResNet-152, the batch size is set to 64. The learning rate is set to 0.1 for a batch size of 256, following the linear scaling rule [goyal2017accurate]. All models are trained for 100 epochs with cosine learning rate decay. Notably, for training efficiency, we use the Nvidia APEX mixed precision training toolkit.

To evaluate our method on MS COCO [lin2014microsoft], we use Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask] as detectors, with the implementation and default settings of the MMDetection [mmdetection] toolkit. During training, the shorter side of the input image is resized to 800. All models are optimized using SGD with a weight decay of 1e-4, a momentum of 0.9, and a batch size of 2 per GPU for 12 epochs. The learning rate is initialized to 0.01 and decreased by a factor of 10 at the 8th and 11th epochs, respectively. All models are implemented in the PyTorch framework [paszke2019pytorch] and trained with eight Nvidia RTX 2080Ti GPUs.
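For reference, a minimal PyTorch sketch of the ImageNet optimization recipe described above, as we read it (the placeholder model, the interpretation of the batch size in the scaling rule, and the omitted data pipeline are our assumptions):

```python
import torch

model = torch.nn.Linear(10, 10)        # placeholder for an FcaNet backbone
total_batch = 128 * 8                  # 128 per GPU on eight GPUs (assumed)
lr = 0.1 * total_batch / 256           # linear scaling rule [goyal2017accurate]

optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):               # 100 epochs with cosine decay
    # ... one ImageNet epoch: 224x224 random crops, horizontal flips, AMP ...
    scheduler.step()
```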

4.2 Ablation Study

As discussed in Sec. 3.2, it is expensive to verify all combinations of frequency components in our method. In this way, we propose the two-step criterion to select frequency components. In this section, we first show the results of using different components in channel attention individually. Then, we show the results of combinations with different numbers of Top-k settings.

The effects of individual frequency components

To investigate the effects of different frequency components in channel attention individually, we use only one frequency component at a time. We divide the whole 2D DCT frequency space into 7×7 parts since the smallest feature map size is 7×7 on ImageNet. In this way, there are in total 49 experiments. To speed up the experiments, we first train a standard ResNet-50 network for 100 epochs as the base model. Then we add channel attention with different frequency components to the base model to verify their effects. All added models are trained for 20 epochs with a similar optimization setting as in Sec. 4.1, while the learning rate is set to 0.02.

Figure 4: Top-1 accuracies on ImageNet using different frequency components in channel attention individually.

As shown in Fig. 4, using lower frequencies yields better performance, which is intuitive and verifies the success of SENet. This also agrees with the conclusion of [xu2020learning] that deep networks prefer low-frequency information. Nevertheless, and interestingly, nearly all frequency components (except the highest one) show only very small gaps in Top-1 accuracy relative to the lowest one, i.e., vanilla channel attention with GAP. This shows that the other frequency components can also cope well with the channel attention mechanism and that generalizing channel attention in the frequency domain is effective.

The effects of different numbers of frequency components

After obtaining the performance of each frequency component, the second step is to determine the number of components that should be used in multi-spectral channel attention. For simplicity, we select the Top-k highest-performing frequency components, where k can be 1, 2, 4, 8, 16, or 32.

Number Top-1 acc Top-5 acc
1 77.26 93.55
2 78.40 94.07
4 78.38 94.03
8 78.33 94.06
16 78.52 94.14
32 78.42 94.10
Table 1: The results of using different numbers of frequency components on ImageNet with ResNet-50 backbone.

As shown in Table 1, we can see two phenomena. 1) All experiments with multi-spectral attention (k > 1) clearly outperform the one using only GAP in channel attention, which verifies our idea of using multiple frequency components. 2) The setting with 16 frequency components achieves the best performance. We therefore use the Top-16 highest-performing frequency components in our method and all other experiments (some other kinds of combinations can be found in the appendix).

4.3 Image Classification on ImageNet

We compare our FcaNet with state-of-the-art methods using the ResNet-34, ResNet-50, ResNet-101, and ResNet-152 backbones on ImageNet, including SENet [hu2018squeeze], CBAM [woo2018cbam], GSoP-Net1 [gao2019global], GCNet [cao2019gcnet], AANet [bello2019attention], and ECANet [wang2020eca]. The evaluation metrics include both efficiency (i.e., network parameters and floating point operations (FLOPs)) and effectiveness (i.e., Top-1/Top-5 accuracy).

Method Years Backbone Parameters FLOPs Top-1 acc Top-5 acc
ResNet [he2016deep] CVPR16 ResNet-34 21.80 M 3.68 G 73.31 91.40
SENet [hu2018squeeze] CVPR18 21.95 M 3.68 G 73.87 91.65
ECANet [wang2020eca] CVPR20 21.80 M 3.68 G 74.21 91.83
FcaNet (ours) 21.95 M 3.68 G 75.07 92.16
ResNet [he2016deep] CVPR16 ResNet-50 25.56 M 4.12 G 75.20 92.52
SENet [hu2018squeeze] CVPR18 28.07 M 4.13 G 76.71 93.38
CBAM [woo2018cbam] ECCV18 28.07 M 4.14 G 77.34 93.69
GSoP-Net1 [gao2019global] CVPR19 28.29 M 6.41 G 77.98 94.12
GCNet [cao2019gcnet] ICCVW19 28.11 M 4.13 G 77.70 93.66
AANet [bello2019attention] ICCV19 25.80 M 4.15 G 77.70 93.80
ECANet [wang2020eca] CVPR20 25.56 M 4.13 G 77.48 93.68
FcaNet (ours) 28.07 M 4.13 G 78.52 94.14
ResNet [he2016deep] CVPR16 ResNet-101 44.55 M 7.85 G 76.83 93.48
SENet [hu2018squeeze] CVPR18 49.29 M 7.86 G 77.62 93.93
CBAM [woo2018cbam] ECCV18 49.30 M 7.88 G 78.49 94.31
AANet [bello2019attention] ICCV19 45.40 M 8.05 G 78.70 94.40
ECANet [wang2020eca] CVPR20 44.55 M 7.86 G 78.65 94.34
FcaNet (ours) 49.29 M 7.86 G 79.64 94.63
ResNet [he2016deep] CVPR16 ResNet-152 60.19 M 11.58 G 77.58 93.66
SENet [hu2018squeeze] CVPR18 66.77 M 11.60 G 78.43 94.27
AANet [bello2019attention] ICCV19 61.60 M 11.90 G 79.10 94.60
ECANet [wang2020eca] CVPR20 60.19 M 11.59 G 78.92 94.55
FcaNet (ours) 66.77 M 11.60 G 80.08 94.88
Table 2: Comparison of different attention methods on ImageNet. All other results are quoted from their original papers if available.

As shown in Table 2, our method achieves the best performance in all experimental settings. Specifically, with the same number of parameters and computational cost, our method outperforms SENet by a large margin. FcaNet outperforms SENet by 1.20%, 1.81%, 2.02%, and 1.65% in terms of Top-1 accuracy under different backbones. Note that FcaNet could also outperform GSoPNet, which has a significantly higher computational cost than our method. This shows the effectiveness of our method.

4.4 Object Detection on MS COCO

Besides the classification task on ImageNet, we also evaluate our method on the object detection task to verify its effectiveness and generalization ability. We use FcaNet with FPN [lin2017feature] as the backbone (ResNet-50 and ResNet-101) of Faster R-CNN and Mask R-CNN and test the performance on the MS COCO dataset. SENet, CBAM, GCNet, and ECANet are used for comparison.

Method Detector Parameters FLOPs AP AP50 AP75 APS APM APL
ResNet-50 Faster-RCNN 41.53 M 215.51 G 36.4 58.2 39.2 21.8 40.0 46.2
SENet 44.02 M 215.63 G 37.7 60.1 40.9 22.9 41.9 48.2
ECANet 41.53 M 215.63 G 38.0 60.6 40.9 23.4 42.1 48.0
FcaNet (Ours) 44.02 M 215.63 G 39.0 61.1 42.3 23.7 42.8 49.6
ResNet-101 60.52 M 295.39 G 38.7 60.6 41.9 22.7 43.2 50.4
SENet 65.24 M 295.58 G 39.6 62.0 43.1 23.7 44.0 51.4
ECANet 60.52 M 295.58 G 40.3 62.9 44.0 24.5 44.7 51.3
FcaNet (Ours) 65.24 M 295.58 G 41.2 63.3 44.6 23.8 45.2 53.1
ResNet-50 Mask-RCNN 44.17 M 261.81 G 37.2 58.9 40.3 22.2 40.7 48.0
SENet 46.66 M 261.93 G 38.7 60.9 42.1 23.4 42.7 50.0
GCNet 46.69 M 261.94 G 39.4 61.6 42.4 N/A N/A N/A
ECANet 44.17 M 261.93 G 39.0 61.3 42.1 24.2 42.8 49.9
FcaNet (Ours) 46.66 M 261.93 G 40.3 62.0 44.1 25.2 43.9 52.0
Table 3: Object detection results of different methods on COCO val 2017.

As shown in Table 3, our method achieves the best performance with both the Faster R-CNN and Mask R-CNN frameworks. As on the classification task, FcaNet outperforms SENet by a large margin with the same number of parameters and computational cost. Compared with the state-of-the-art ECANet, FcaNet outperforms it by 0.9-1.3% in terms of AP.

4.5 Instance Segmentation on MS COCO

Besides object detection, we also test our method on the instance segmentation task. As shown in Table 4, our method outperforms the other methods by a considerable margin. Specifically, FcaNet outperforms GCNet by 0.5% AP, while the gaps among the other methods are roughly 0.1-0.2%. These results verify the effectiveness of our method.

Method AP AP50 AP75
ResNet-50 34.1 55.5 36.2
SENet 35.4 57.4 37.8
GCNet 35.7 58.4 37.6
ECANet 35.6 58.1 37.7
FcaNet (Ours) 36.2 58.6 38.1
Table 4: Instance segmentation results of different methods using Mask R-CNN on COCO val 2017.

5 Conclusion

In this paper, we have proven that GAP is a special case of DCT and proposed FcaNet with the multi-spectral attention module, which generalizes the existing channel attention mechanism in the frequency domain. Meanwhile, we have explored different combinations of frequency components in our multi-spectral framework and proposed a two-step criterion for frequency component selection. With the same number of parameters and computational cost, our method consistently outperforms SENet by a large margin, and it achieves state-of-the-art performance on image classification, object detection, and instance segmentation compared with other channel attention methods. Moreover, FcaNet is simple yet effective and can be implemented with only a one-line change of code based on existing channel attention methods.

6 Appendix

6.1 Investigation of More Frequency Combinations

This section shows more results of using different frequency combinations in the proposed multi-spectral channel attention module. In Sec. 4.2, we present a two-step method to select the best frequency component combinations. Besides the proposed combinations, we also try some other possibilities, as shown in Fig. 5. The first one is an intuitive method, termed Low-k (Lowest-k), as shown in Fig. 5(b). Low-k selects the lowest k frequency components (the upper-left triangle part of the 2D frequency spectrum) as the combination. It only considers the frequency of the components and has no relation to the performance of the individual components in Fig. 4. The second one is the counterpart of our two-step Top-k method, termed Bot-k (Bottom-k), as shown in Fig. 5(c). Bot-k selects the k frequency components with the lowest individual performance, which is exactly the opposite of the Top-k method.

(a) Performance rank
(b) Low-k
(c) Bot-k
Figure 5: Illustration of different frequency combinations. (a) shows the performance rank as Fig. 4. (b) shows the selection method of Low-k, which only considers the frequency itself. (c) shows the Top-k and Bot-k, which consider the performance rank of each frequency component.
Number Top-1 acc Top-5 acc
1 77.26 93.55
2 78.40 94.07
4 78.27 94.06
8 78.25 94.01
16 78.26 94.12
32 78.37 94.08
Table 5: The results of Low-k combinations.

Number Top-1 acc Top-5 acc
1 77.09 93.50
2 77.30 93.51
4 77.27 93.57
8 77.51 93.63
16 77.17 93.53
32 77.60 93.78
Table 6: The results of Bot-k combinations.

The highest performance of our Top-k method is 78.52%. Compared with Bot-k in Table 6, the results show that low-frequency components are important. Compared with Low-k in Table 5, the Top-k method also performs better. This shows that we should take the performance of individual frequency components into consideration and demonstrates the effectiveness of our two-step criterion.
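The three selection schemes are easy to contrast in code. The sketch below is ours; in particular, ordering by u + v as the meaning of "lowest frequencies" is our reading of the Low-k definition:

```python
def low_k(freqs, k):
    """Low-k: the k lowest frequencies, i.e. the upper-left triangle of the
    2D spectrum (ordering by u + v is our reading of 'lowest')."""
    return sorted(freqs, key=lambda uv: (uv[0] + uv[1], uv))[:k]

def top_k(scores, k):
    """Top-k: the k individually best-performing components (Sec. 4.2)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def bot_k(scores, k):
    """Bot-k: the k individually worst-performing components."""
    return sorted(scores, key=scores.get)[:k]
```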

6.2 Visualization of Discrete Cosine Transform

In this section, we show some visualization results related to the discrete cosine transform (DCT).

(a) All frequency components
(b) Selected frequency components
Figure 6: Visualization of the DCT basis functions. (a) shows all frequency components. (b) shows the frequency components selected by our two-step Top-k criterion. The selected components are also highlighted in (a) with red dashed boxes.

In Fig. 6(a), we show the basis functions of the 2D DCT. We can see that the 2D DCT basis functions are composed of regular horizontal and vertical cosine waves, and that they are orthogonal and data-independent. In Fig. 6(b), we show the frequency components selected by our two-step criterion. We can see that the selected components are mostly low-frequency.

6.3 Analysis of Channel Attention

In this section, we give a more detailed mathematical analysis of channel attention based on Sec. 3.2. According to Sec. 3.2, for the $i$-th channel $X_i \in \mathbb{R}^{H \times W}$ of the feature we have:

$$X_i = f^{i}_{0,0} B_{0,0} + f^{i}_{0,1} B_{0,1} + \cdots + f^{i}_{H-1,W-1} B_{H-1,W-1}, \tag{14}$$

and

$$\mathrm{gap}(X_i) = \frac{1}{HW} f^{i}_{0,0}. \tag{15}$$

Subsequently, we can give a more detailed derivation:

$$att = \mathrm{sigmoid}\!\left(fc(\mathrm{gap}(X))\right) = \mathrm{sigmoid}\!\left(fc\!\left(\frac{1}{HW}\left[f^{0}_{0,0}, f^{1}_{0,0}, \dots, f^{C-1}_{0,0}\right]\right)\right), \tag{16}$$

in which $X_i$ is the $i$-th channel of the feature, $f^{i}_{h,w} = \mathrm{2DDCT}^{h,w}(X_i)$ is the corresponding 2D DCT frequency component, $B_{h,w} \in \mathbb{R}^{H \times W}$ is the basis matrix whose elements are defined in Eq. 7, and $i \in \{0, 1, \dots, C-1\}$. We can see that conventional channel attention actually discards the information from all frequency components except the lowest one. Note that this derivation is in matrix form.

6.4 Initialization of DCT weights

In this section, we give the details of the initialization, i.e., the get_dct_weights function in Fig. 3. The get_dct_weights function is shown in Fig. 7. It should be noted that the get_dct_weights function is only for the initialization, so the one-line change holds for training and inference.

Figure 7: The details of the get_dct_weights function. This code is only for the initialization of the DCT weights; it runs once at the very beginning and does not participate in training or testing.
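Since the code image of Fig. 7 is not reproduced here, the following is a hedged reconstruction of what such an initialization function could look like, based on Eq. 7 and Sec. 3.2; the exact argument names and the channel-tiling order are our assumptions, not the authors' released code.

```python
import math
import torch

def get_dct_weights(height, width, channels, freq_u, freq_v):
    """Pre-compute constant multi-spectral DCT filters (a reconstruction
    based on Eq. 7, not the authors' exact code). freq_u / freq_v list the
    chosen 2D frequency indices [u_i, v_i]; each of the n index pairs is
    assigned to a contiguous group of channels // n channels."""
    assert channels % len(freq_u) == 0
    weights = torch.zeros(channels, height, width)
    c_part = channels // len(freq_u)
    for k, (u, v) in enumerate(zip(freq_u, freq_v)):
        for i in range(height):
            for j in range(width):
                weights[k * c_part:(k + 1) * c_part, i, j] = (
                    math.cos(math.pi * u * (i + 0.5) / height) *
                    math.cos(math.pi * v * (j + 0.5) / width))
    return weights  # registered as a buffer; computed once at initialization
```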

References