Fully Attentional Network for Semantic Segmentation

Recent non-local self-attention methods have proven effective in capturing long-range dependencies for semantic segmentation. These methods usually form a similarity map of R^{C×C} (by compressing spatial dimensions) or R^{HW×HW} (by compressing channels) to describe the feature relations along either the channel or the spatial dimensions, where C is the number of channels and H and W are the spatial dimensions of the input feature map. However, such practices tend to condense feature dependencies along the other dimensions, hence causing attention missing, which might lead to inferior results for small/thin categories or inconsistent segmentation inside large objects. To address this problem, we propose a new approach, namely the Fully Attentional Network (FLANet), to encode both spatial and channel attentions in a single similarity map while maintaining high computational efficiency. Specifically, for each channel map, our FLANet can harvest feature responses from all other channel maps, as well as the associated spatial positions, through a novel fully attentional module. Our new method achieves state-of-the-art performance on three challenging semantic segmentation datasets, i.e., 83.6% mIoU on the Cityscapes test set, 46.99% on the ADE20K validation set, and 88.5% on the PASCAL VOC test set.


Introduction

Recently, semantic segmentation models have achieved great progress by capturing long-range dependencies Zhao et al. (2017); Yang et al. (2018); Yuan et al. (2020); Sun et al. (2019), among which Non-Local (NL) based methods are the mainstream Zhao et al. (2018); Fu et al. (2019a); Huang et al. (2019); Zhang et al. (2019a); Zhu et al. (2019); Song et al. (2021); Ramachandran et al. (2019). To generate dense and well-rounded contextual information, NL based models utilize a self-attention mechanism to explore the interdependencies along the channel Cao et al. (2019); Zhao et al. (2018) or spatial Huang et al. (2019); Yin et al. (2020) dimensions. We denote these two variants of the NL block as "Channel NL" and "Spatial NL", respectively; their architectures are illustrated in Fig. 1 (a) and (b). Although these explorations have made impressive contributions to semantic segmentation, one acute issue, i.e., attention missing, has mostly been ignored. Take the Channel NL for example: the C×C channel attention map is generated by the matrix multiplication of two inputs with dimensions of C×HW and HW×C. Each channel map is thus connected with all other channel maps, but the spatial information is integrated away during the matrix multiplication, so each spatial position fails to perceive feature responses from other positions. Similarly, interactions along the channel dimension are missing in the Spatial NL.
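To make the dimensional argument concrete, the following PyTorch sketch (our own illustration with hypothetical tensor sizes, not the implementation of any cited method) shows which axis each NL variant sums away when forming its similarity map:

    import torch

    C, H, W = 64, 32, 32                    # hypothetical feature-map size
    x = torch.randn(C, H * W)               # features flattened to C x HW

    # Channel NL: similarity map is C x C; the HW axis is summed out,
    # so spatial positions never attend to each other.
    channel_attn = torch.softmax(x @ x.t(), dim=-1)        # (C, C)
    channel_out = channel_attn @ x                         # (C, HW)

    # Spatial NL: similarity map is HW x HW; the C axis is summed out,
    # so channel maps never attend to each other.
    spatial_attn = torch.softmax(x.t() @ x, dim=-1)        # (HW, HW)
    spatial_out = (spatial_attn @ x.t()).t()               # (C, HW)

    print(channel_attn.shape, spatial_attn.shape)          # torch.Size([64, 64]) torch.Size([1024, 1024])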

We argue that the attention missing issue damages the integrity of the 3D context information (C×H×W), and thus each NL variant can only benefit partially, in a complementary way. To verify this hypothesis, we present per-class comparison results on the Cityscapes val set in Fig. 2. As shown in the figure, the Channel NL obtains better segmentation results on large objects, such as truck, bus and train, while the Spatial NL performs much better on small/thin categories, e.g., pole, rider and mbike. Both lose precision on some categories due to the mentioned attention missing issue. Besides, we are also curious about whether this issue can be solved by stacking the two blocks. We denote the parallel connection mode in DANet Fu et al. (2019a) and the sequential Channel-Spatial NL as "Dual NL" and "CS NL", respectively. (Since there are no convolutional layers in the Channel NL, if we employ the Spatial NL before the Channel NL (SC NL), the feature weights tend to become either extremely large or extremely small after two consecutive enhancements and the training loss does not converge, so the performance of this connection mode is not reported.) Intuitively, when two NLs are employed at the same time, the accuracy gain of each class should be no less than that of a single NL. However, we observe that the performance of Dual NL drops considerably on large objects such as truck and train, and CS NL obtains poor IoU results on some thin categories like pole and mbike. Both Dual NL and CS NL preserve only part of the benefits brought by either the Channel NL or the Spatial NL. We therefore conclude that the attention missing issue hurts the feature representation ability and cannot be solved by simply stacking different NL blocks.
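The two stacking strategies discussed above can be summarized schematically as follows; this is only a shape-level sketch of the connection patterns (DANet additionally applies convolutions and other layers around the fusion), with the toy Channel/Spatial NL formulation re-declared so the snippet is self-contained:

    import torch

    C, HW = 64, 1024
    x = torch.randn(C, HW)

    def channel_nl(feat):       # C x C attention, as in the previous sketch
        return torch.softmax(feat @ feat.t(), dim=-1) @ feat

    def spatial_nl(feat):       # HW x HW attention
        return (torch.softmax(feat.t() @ feat, dim=-1) @ feat.t()).t()

    # Dual NL (parallel, DANet-style): run both on the same input, fuse by summation.
    dual_out = channel_nl(x) + spatial_nl(x)

    # CS NL (sequential): feed the Channel NL output into the Spatial NL.
    cs_out = spatial_nl(channel_nl(x))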

Motivated by this, we propose a novel non-local block, namely the Fully Attentional block (FLA), to efficiently retain attentions in all dimensions; the workflow is shown in Fig. 1 (c). The basic idea is to utilize global context information to receive spatial responses when computing the channel attention map, which enables full attentions in a single attention unit with high computational efficiency. Specifically, we first enable each spatial position to harvest feature responses from the global contexts with the same horizontal and vertical coordinates. Second, we use the self-attention mechanism to capture the fully attentional similarities between any two channel maps and the associated spatial positions. Finally, the generated fully attentional similarities are used to re-weight each channel map by integrating features from all channel maps and the associated global clues.

It should be noted that our method is more effective and efficient than previous works Fu et al. (2019a); Babiloni et al. (2020) in modeling interdependencies along all dimensions. Since we encode spatial interactions into the traditional Channel NL and capture full attentions in a single attention map, our FLA achieves high computational efficiency. Specifically, our FLA reduces FLOPs by about 83% and requires only 34% of the GPU memory of DANet when computing both spatial and channel dependencies.

We have carried out extensive experiments on three challenging semantic segmentation datasets, and our approach achieves state-of-the-art performance on all of them. Moreover, our model outperforms other non-local based methods by a large margin with the same backbone network. Our contributions mainly lie in three aspects:

  • Through theoretical and experimental analysis, we identify the attention missing issue in existing non-local self-attention methods, which hurts the integrity of feature representations.

  • We reformulate the self-attention mechanism into a fully attentional manner to generate dense and well-rounded feature dependencies, which addresses the attention missing issue effectively and efficiently. To the best of our knowledge, this paper is the first to achieve full attentions in a single non-local block.

  • We conducted extensive experiments on three challenging semantic segmentation datasets, including Cityscapes, ADE20K, and PASCAL VOC, which demonstrate the superiority of our approach over other state-of-the-art methods.

Figure 3: The details of the Fully Attentional block. Since H equals W in our implementation, we use a single letter S to represent the spatial dimension after the merge operation for a clear illustration.

Related Work

Semantic Segmentation.

Semantic segmentation is a vital task in computer vision, which predicts a semantic label for every pixel in an image. Traditional classification networks based on CNNs can only identify the class of the whole image, not the label of each pixel. Replacing the fully connected layer of the CNN with a convolutional layer, the FCN Long et al. (2015) directly produces dense segmentation results. UNet Ronneberger et al. (2015) adopts an encoder-decoder structure to recover the detailed information lost in the step-by-step downsampling operations. To model interdependencies between different channel maps, SENet Hu et al. (2018) produces an embedding of the global distribution of channel-wise feature responses. To enhance the global connections between spatial positions, self-attention based methods were proposed to weigh the importance of each spatial position, whilst sacrificing channel-wise attention. Different from these approaches, we argue that the attention missing issue might lead to inconsistent segmentation inside large objects or inferior results for small categories. Thus, in this paper, we consider channel and spatial dependencies to be of equal importance and capture both of them in a single attention unit.

Self-Attention Mechanism.

Self-attention was initially used for machine translation Chorowski et al. (2015); Vaswani et al. (2017) to capture long-range features. Since then, self-attention modules have been widely applied in semantic segmentation, with the Non-local network Wang et al. (2018) as the pioneering work. CCNet Huang et al. (2019) harvests the contextual information for each pixel on its criss-cross path. AttaNet Song et al. (2021) utilizes a striping operation to encode the global context in the vertical direction and then harvests long-range relations along the horizontal axis. OCNet Yuan et al. (2021) utilizes an interlaced self-attention scheme to model both global and local relations. However, these approaches construct a similarity map that leverages relationships along a single dimension, and the dependencies along the other dimensions are discarded during the matrix multiplication. To generate both spatial-wise and channel-wise attentions, several approaches have been proposed. DANet Fu et al. (2019a) proposes a position attention module and a channel attention module to model dependencies along the spatial and channel dimensions, respectively. TESA Babiloni et al. (2020) views the input tensor as a combination of its three mode matricizations and then captures similarities for each dimension. Although these methods capture relations in all dimensions, they consider different dimensions separately, and the attention missing issue still exists in each individual attention map. To mitigate this issue, we propose a Fully Attentional block to encode both spatial and channel attentions in a single similarity map with high computational efficiency.

Method

Network Architecture

In this paper, we employ ResNet-101 He et al. (2016) and HRNetV2-W48 Sun et al. (2019) as the backbone networks. For ResNet-101, dilated convolutions are applied in the last two stages to retain more detailed information, so the output feature map is enlarged to 1/8 of the input image. The input image is first processed by the backbone network to produce high-level feature maps. We then apply a convolution layer to reduce the channel dimension, obtaining the feature maps X. Next, X is fed into the Fully Attentional block (FLA) to generate new feature maps that aggregate non-local contextual information along all dimensions. Finally, the dense contextual feature is sent to the prediction layer to generate the final segmentation map.
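A minimal sketch of this pipeline is given below; the module and parameter names (e.g., FLANetHead, the 3x3 reduction convolution, 512 reduced channels) are our own assumptions for illustration, and the FLA block itself is left as a placeholder that the next section's sketches fill in:

    import torch
    import torch.nn as nn

    class FLANetHead(nn.Module):
        """Sketch of the segmentation head: channel reduction -> FLA -> classifier."""

        def __init__(self, in_channels: int = 2048, channels: int = 512, num_classes: int = 19):
            super().__init__()
            self.reduce = nn.Sequential(            # reduce the backbone channel dimension
                nn.Conv2d(in_channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            self.fla = nn.Identity()                # stand-in for the FLA block sketched below
            self.classifier = nn.Conv2d(channels, num_classes, 1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            x = self.reduce(feats)                  # (B, channels, H/8, W/8)
            x = self.fla(x)                         # aggregate non-local context in all dimensions
            return self.classifier(x)               # per-pixel class logits at 1/8 resolution

    head = FLANetHead()
    logits = head(torch.randn(1, 2048, 96, 96))     # e.g. ResNet-101 output for a 768x768 crop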

Fully Attentional Block

Previous works try to generate full attentions by applying the attention operation along each dimension in turn, which incurs high computational complexity, while each single-dimensional attention still overlooks correlations along the other dimensions. To capture full attentions in a single attention map with high computational efficiency, we propose a novel non-local block named the Fully Attentional block. Specifically, to avoid adding extra computational burden, we introduce spatial interactions into the Channel NL mechanism by utilizing global average pooling results as global contextual priors.

The pipeline of our method is shown in Fig. 3. Given an input feature map X ∈ R^{C×H×W}, where C is the number of channels and H and W are the spatial dimensions of the input tensor, we first feed X into two parallel pathways at the bottom (i.e., the Construction), each of which contains a global average pooling layer followed by a Linear layer. When choosing the size of the pooling windows, we considered the following two aspects. Firstly, to obtain richer global contextual priors, we use unequal global pooling sizes in the height and width directions rather than square pooling windows. Secondly, to make sure that each spatial position is connected with the corresponding global prior sharing the same horizontal or vertical coordinate, i.e., to maintain spatial consistency when computing channel relations, we keep the length of one dimension constant. Therefore, we employ pooling windows of size H×1 and 1×W in the two pathways respectively, which gives Q_h ∈ R^{C×W} and Q_w ∈ R^{C×H}. After that, we repeat Q_h and Q_w along the pooled dimension to form global features of sizes H×C×W and W×C×H. These two global features represent the global priors in the horizontal and vertical directions respectively and are used to achieve spatial interactions in the corresponding dimension. Furthermore, we cut the former along the H dimension, from which we generate a group of H slices of size C×W; similarly, we cut the latter along the W dimension, obtaining W slices of size C×H. We then merge these two groups to form the final global contexts Q ∈ R^{(H+W)×C×S}, where S = H = W in our implementation. The cut and merge operations are illustrated in detail in Fig. 3.

Meanwhile, we cut the input feature X along the H dimension, yielding a group of H slices of size C×W; similarly, we cut it along the W dimension, yielding W slices of size C×H. Like the merge process of Q, these two groups are integrated to form the key features K ∈ R^{(H+W)×C×S}. In the same way, we generate the value feature maps V ∈ R^{(H+W)×C×S}.
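The Construction step can be sketched at the tensor level as follows, assuming H = W = S as in our implementation; the function name, the axis ordering, and the omission of the Linear layers after pooling are simplifications of ours, not the released code:

    import torch

    def construct_query_key_value(x: torch.Tensor):
        """x: (B, C, H, W) with H == W == S. Returns query, key, value of shape (B, H+W, S, C)."""
        b, c, h, w = x.shape
        # Global priors: pool over height (window H x 1) and over width (window 1 x W).
        prior_h = x.mean(dim=2)                                           # (B, C, W): per-column context
        prior_w = x.mean(dim=3)                                           # (B, C, H): per-row context
        # Repeat each prior once per row / per column, i.e. form the two slice groups.
        q_h = prior_h.transpose(1, 2).unsqueeze(1).expand(b, h, w, c)     # (B, H, W, C)
        q_w = prior_w.transpose(1, 2).unsqueeze(1).expand(b, w, h, c)     # (B, W, H, C)
        query = torch.cat([q_h, q_w], dim=1)                              # merge: (B, H+W, S, C)
        # Cut the input feature along H and along W, then merge the two groups the same way.
        k_h = x.permute(0, 2, 3, 1)                                       # (B, H, W, C): row slices, size C x W
        k_w = x.permute(0, 3, 2, 1)                                       # (B, W, H, C): column slices, size C x H
        key = torch.cat([k_h, k_w], dim=1)                                # (B, H+W, S, C)
        value = key.clone()                                               # values share the same grouping
        return query, key, value

    q, k, v = construct_query_key_value(torch.randn(2, 64, 32, 32))       # toy check of the shapes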

After that, each spatial position can receive feature responses from the global priors in the same row and the same column, i.e., we capture the full attentions A via the Affinity operation. The Affinity operation is defined as follows:

(1)   A_{i,j} = exp(Q_i · K_j) / Σ_{m=1}^{C} exp(Q_i · K_m),

where A_{i,j} denotes the degree of correlation between the i-th and j-th channels at a specific spatial position.

Then we perform a matrix multiplication between A and V to update each channel map with the generated full attentions. After that, we reshape the result back into two groups, each of size C×H×W (i.e., the inverse operation of the merge). We sum these two groups to form the long-range contextual information. Finally, we multiply the contextual information by a scale parameter γ and perform an element-wise sum with the input feature map X to obtain the final output X' as follows:

(2)   X'_i = γ Σ_{j=1}^{C} (A_{i,j} · V_j) + X_i,

where X'_i is the feature vector of the output feature map X' at the i-th channel map.
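Continuing the sketch above, Eq. (1) and Eq. (2) then amount to one batched matrix multiplication per slice group, a softmax over the key channels, and a residual connection; again, this is our own illustrative code under the same assumptions:

    import torch

    def fully_attentional_update(x, query, key, value, gamma):
        """x: (B, C, H, W); query/key/value: (B, H+W, S, C) from the construction step."""
        b, c, h, w = x.shape
        # Eq. (1): channel-to-channel affinity per slice, softmax-normalized over key channels.
        energy = torch.einsum('bnsc,bnsd->bncd', query, key)     # (B, H+W, C, C)
        attn = torch.softmax(energy, dim=-1)
        # Eq. (2): re-weight every channel map with the full attentions.
        out = torch.einsum('bncd,bnsd->bnsc', attn, value)       # (B, H+W, S, C)
        # Inverse of the merge: split the two groups, map back to (B, C, H, W), and sum.
        out_h = out[:, :h].permute(0, 3, 1, 2)
        out_w = out[:, h:].permute(0, 3, 2, 1)
        return gamma * (out_h + out_w) + x

    # Usage with the previous sketch:
    #   q, k, v = construct_query_key_value(x)
    #   y = fully_attentional_update(x, q, k, v, gamma=0.5)   # gamma is the learnable scale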

It is noted that different from the traditional Channel NL method which explores channel correlations by multiplying the spatial information from the same position, our FLA enables spatial connections between different spatial positions, i.e., we exploit full attentions in both spatial and channel dimensions with a single attention map. In this way, our FLA has a more holistic contextual view and is more robust to different scenarios. Moreover, the constructed prior representation brings a global receptive field and helps to boost the feature discrimination ability.

Complexity Analysis

Given a feature map of size C×H×W, the typical Spatial NL has a computational complexity of O((H×W)²×C), and the Channel NL has a complexity of O(C²×H×W). Both of them can only capture similarities along a single dimension. To model feature dependencies along all dimensions, previous works such as DANet apply both a Spatial NL and a Channel NL to calculate spatial and channel relations separately, which yields much higher computational complexity and occupies much more GPU memory. Different from previous works, we achieve full attentions in a single NL block and in a more efficient way. Specifically, we utilize the newly constructed global representations to achieve interactions between different spatial positions and to collect contextual similarities along all dimensions. The complexity of our FLA block, both in time and space, is much lower: the dominant computational term is O((H+W)×C²×S), where S = H = W in our implementation, i.e., O(2×C²×H×W). Hence our complexity is of the same order as the Channel NL and only differs by a small constant factor.
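As a rough sanity check of these orders of magnitude, the snippet below evaluates only the dominant matrix-multiplication terms for an assumed feature size (C = 512 and H = W = 96, i.e., a 768x768 crop at 1/8 resolution); the constants in front of each term are approximate and for illustration only:

    C, H, W = 512, 96, 96             # assumed post-backbone feature size (768 / 8 = 96)

    spatial_nl = (H * W) ** 2 * C     # HW x HW similarity map: O((HW)^2 * C)
    channel_nl = C * C * H * W        # C x C similarity map:   O(C^2 * HW)
    fla        = (H + W) * C * C * H  # (H+W) slice groups, each a C x C affinity over S = H = W

    print(f"Spatial NL ~ {spatial_nl / 1e9:.1f} GMACs per similarity map")
    print(f"Channel NL ~ {channel_nl / 1e9:.2f} GMACs per similarity map")
    print(f"FLA        ~ {fla / 1e9:.2f} GMACs per similarity map")
    # Prints roughly 43.5, 2.42, and 4.83: FLA stays within a small constant factor (~2x)
    # of the Channel NL, while the Spatial NL is about HW/C times larger.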

Experiments

Method Backbone mIoU (%)
Simple Backbone
PSPNet Zhao et al. (2017) ResNet-101 78.4
AAF Ke et al. (2018) ResNet-101 79.1
CFNet Zhang et al. (2019b) ResNet-101 79.6
PSANet Zhao et al. (2018) ResNet-101 80.1
ANNet Zhu et al. (2019) ResNet-101 81.3
CCNet Huang et al. (2019) ResNet-101 81.4
OCNet Yuan et al. (2021) ResNet-101 81.9
DGCNet Zhang et al. (2019c) ResNet-101 82.0
HANet Choi et al. (2020) ResNet-101 82.1
ACNet Fu et al. (2019b) ResNet-101 82.3
RecoNet Chen et al. (2020) ResNet-101 82.3
FLANet (Ours) ResNet-101 83.0
Advanced Backbone
SPGNet Cheng et al. (2019) ResNet-50 81.1
DANet Fu et al. (2019a) ResNet-101+MG 81.5
ACFNet Zhang et al. (2019a) ResNet-101+ASPP 81.8
GALD Li et al. (2019b) ResNet-101+ASPP 81.8
GFF Li et al. (2020) ResNet-101+PPM 82.3
HRNet Sun et al. (2019) HRNetV2-W48 81.6
OCNet Yuan et al. (2021) HRNetV2-W48 82.5
FLANet (Ours) HRNetV2-W48 83.6
Table 1: Comparison with state-of-the-art models on the Cityscapes test set. For fair comparison, all these methods use only Cityscapes fine-data for training.
Method road swalk build wall fence pole tlight sign veg. terrain sky person rider car truck bus train mbike bike mIoU (%)
PSPNet Zhao et al. (2017) 98.6 86.2 92.9 50.8 58.8 64.0 75.6 79.0 93.4 72.3 95.4 86.5 71.3 95.9 68.2 79.5 73.8 69.5 77.2 78.4
AttaNetSong et al. (2021) 98.7 87.0 93.5 55.9 62.6 70.2 78.4 81.4 93.9 72.8 95.4 87.9 74.7 96.3 71.2 84.4 78.0 68.6 78.2 80.5
DANet Fu et al. (2019a) 98.6 87.1 93.5 56.1 63.3 69.7 77.3 81.3 93.9 72.9 95.7 87.3 72.9 96.2 76.8 89.4 86.5 72.2 78.2 81.5
ACFNet Zhang et al. (2019a) 98.7 87.1 93.9 60.2 63.9 71.1 78.6 81.5 94.0 72.9 95.9 88.1 74.1 96.5 76.6 89.3 81.5 72.1 79.2 81.8
GFF Li et al. (2020) 98.7 87.2 93.9 59.6 64.3 71.5 78.3 82.2 94.0 72.6 95.9 88.2 73.9 96.5 79.8 92.2 84.7 71.5 78.8 82.3
FLANet (Ours) 98.8 87.7 94.3 64.1 64.9 72.4 78.9 82.6 94.2 73.5 96.2 88.7 76.0 96.6 80.2 93.8 91.6 74.3 79.5 83.6
Table 2: Per-class results on the Cityscapes test set. Our method outperforms existing methods and achieves 83.6% in mIoU.

To evaluate the proposed FLANet, we conduct extensive experiments on the Cityscapes Cordts et al. (2016), the ADE20K Zhou et al. (2017), and the PASCAL VOC Everingham et al. (2009).

Datasets

Cityscapes

Cityscapes is a dataset for urban scene segmentation, which contains 5K images with fine pixel-level annotations and 20K images with coarse annotations. The dataset covers 19 classes and each image has a resolution of 2048×1024. The 5K finely annotated images are further divided into 2975, 500, and 1525 images for training, validation, and testing, respectively.

Ade20k

ADE20K is a challenging scene parsing benchmark. The dataset contains 20K/2K images for training and validation, densely labeled with 150 stuff/object categories. Images in this dataset come from diverse scenes and exhibit larger scale variations.

Pascal Voc

PASCAL VOC is a gold-standard benchmark for semantic segmentation, which includes 20 object categories and one background class. The dataset contains 10582, 1449, and 1456 images for training, validation, and testing, respectively.

Implementation Details

Our implementation is based on PyTorch Paszke et al. (2017), and uses ResNet-101 and HRNetV2-W48 pre-trained on ImageNet Russakovsky et al. (2015) as the backbone networks. Following prior works Yu et al. (2018), we apply the poly learning rate policy, where the initial learning rate is multiplied by (1 − iter/max_iter)^0.9 after each iteration. Momentum and weight decay coefficients are set to 0.9 and 5e-4, respectively. All models are trained for 240 epochs with an initial learning rate of 1e-2 and a batch size of 8. We set the crop size to 768 × 768 for Cityscapes and 520 × 520 for the other datasets. For data augmentation, we apply the common random scaling (0.5 to 2.0), cropping, and flipping to augment the training data. Besides, synchronized batch normalization is used to synchronize the mean and standard deviation of batch normalization across multiple GPUs. For evaluation, the commonly used mean IoU metric is adopted.
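For reference, the poly policy described above can be written in PyTorch as follows; this is a minimal sketch using LambdaLR, where the exponent 0.9 and the derived iteration count are assumptions based on common practice:

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Conv2d(3, 19, 1)                 # stand-in for the segmentation network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)

    max_iter = 240 * (2975 // 8)                      # epochs * iters per epoch, assuming the Cityscapes fine train split
    power = 0.9                                       # assumed poly exponent

    # lr = base_lr * (1 - iter / max_iter) ** power, updated after every iteration
    scheduler = LambdaLR(optimizer, lambda it: (1 - it / max_iter) ** power)

    for it in range(max_iter):
        # ... forward / backward would go here ...
        optimizer.step()
        scheduler.step()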

Experiments on the Cityscapes Dataset

Figure 4: Visualization results of FLANet on the Cityscapes validation set.

Comparisons to the State of the Art

We first compare our proposed method with the state-of-the-art approaches on the Cityscapes test set. Specifically, all models are trained with only fine annotated data, and the comparison results are summarized in Tab.1. Among these approaches, the self-attention based models are most related to our method, and more detailed analyses and comparisons will be illustrated in the following subsection.

From Tab.1, it can be observed that our approach substantially outperforms all previous techniques based on ResNet-101 and achieves a new state-of-the-art performance of 83.6% mIoU. Moreover, it achieves performance comparable to methods based on larger backbones. Detailed per-category comparisons are reported in Tab.2, where our method achieves the highest IoU score on all categories, with especially large improvements on categories such as rider, bus, train, and mbike. This also demonstrates the benefits of the proposed FLANet in predicting distant objects and maintaining segmentation consistency inside large objects.

Method SS (%) MS+F (%)
ShuffleNetV2 69.2 70.8
+FLA 74.7 76.3
ResNet-18 71.3 72.5
+FLA 76.5 78.1
ResNet-50 72.8 74.1
+FLA 78.9 79.7
ResNet-101 75.6 76.9
+FLA 82.1 83.0
Table 3: Ablation study comparing the baselines and FLANet on the Cityscapes validation set with various backbone networks. SS: single-scale input during evaluation. MS: multi-scale input. F: adding left-right flipped input.
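For clarity, the MS+F protocol referred to in the table amounts to the following inference-time procedure; the scale set and the probability averaging are assumptions based on common practice rather than the exact evaluation script:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def ms_flip_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
        """Average per-class probabilities over rescaled and left-right flipped inputs."""
        _, _, h, w = image.shape
        prob_sum = 0.0
        for s in scales:
            x = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
            for flip in (False, True):
                inp = torch.flip(x, dims=[3]) if flip else x
                out = model(inp)                                             # (B, num_classes, h', w')
                if flip:
                    out = torch.flip(out, dims=[3])
                out = F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
                prob_sum = prob_sum + torch.softmax(out, dim=1)
        return prob_sum.argmax(dim=1)                                        # final per-pixel prediction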
Figure 5: Qualitative comparisons against the NL methods on the Cityscapes validation set. Due to the limited space, we remove the input images and only show the segmentation results and ground truth (GT).

Ablation Studies

To demonstrate the wide applicability of FLANet, we conduct ablation studies on various backbone networks, including ShuffleNetV2 Ma et al. (2018) and ResNet series. As listed in Tab.3, models with FLA consistently outperform baseline models with significant increases no matter what backbone network we use.

In addition, we provide qualitative comparisons between FLANet and the baseline (ResNet-50) in Fig.4, where white squares mark the challenging regions. One can observe that the baseline easily mislabels those regions, e.g., the building in row 1, the sidewalk and distant train in row 2, and the large truck in row 4, whereas our proposed network is able to correct them, which clearly shows the effectiveness of FLANet for semantic segmentation.

Comparison with NL Methods

Method SS(%) MS+F(%) GFLOPs Memory(M)
Baseline 72.8 74.1 - -
+EMA 76.4 77.1 13.91 98
+Channel NL 75.6 76.9 9.66 40
+RCCA 76.4 77.5 16.18 174
+Spatial NL 76.2 77.4 103.90 1320
+Dual NL 77.1 78.0 113.56 1378
+CS NL 77.4 78.2 113.56 1378
+FLA (Ours) 78.9 79.7 19.37 436
Table 4: Detailed comparisons with existing NL models on the Cityscapes validation set. The GFLOPs and memory are computed with an input size of 768 × 768. Adding FLA to the baseline largely increases the mIoU with much less computation than other NL blocks that model both dimensions.

We compare our FLANet with several existing non-local models on the Cityscapes validation set. We measure the additional computational complexity (in GFLOPs) and GPU memory usage introduced by the NL blocks, excluding the complexity of the baselines. Besides, to speed up the training procedure, we carry out these comparison experiments with ResNet-50 and a batch size of 8.
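The paper does not specify the exact profiling tooling; one simple way to probe the extra peak GPU memory of a single attention block under the same setting (batch 8, 1/8-resolution features of a 768x768 crop) is sketched below, with the channel width of 512 being our assumption:

    import torch

    def extra_peak_memory_mb(block: torch.nn.Module, channels: int = 512,
                             size=(96, 96), batch: int = 8) -> float:
        """Rough probe of the extra peak GPU memory used by one attention block (requires CUDA)."""
        block = block.cuda()
        x = torch.randn(batch, channels, *size, device='cuda', requires_grad=True)
        torch.cuda.reset_peak_memory_stats()
        before = torch.cuda.max_memory_allocated()
        block(x).sum().backward()              # forward + backward through the block only
        return (torch.cuda.max_memory_allocated() - before) / 2**20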

Specifically, the NL models compared in Tab.4 include 1) the Expectation-Maximization Attention in EMANet Li et al. (2019a), denoted as "+EMA"; 2) the Recurrent Criss-Cross Attention (R=2) in CCNet Huang et al. (2019), denoted as "+RCCA"; 3) the two typical NL blocks introduced in Sec.1, denoted as "+Channel NL" and "+Spatial NL" respectively; and 4) the two connection modes introduced in Sec.1, denoted as "+Dual NL" and "+CS NL" respectively. According to whether they compute channel-only attention, spatial-only attention, or both channel and spatial attention, Tab.4 is divided into three groups.

As illustrated in Tab.4, FLA outperforms these NL methods by a large margin, and the complexity comparison indicates that the cost of adding FLA is practically negligible even compared with lightweight designs such as EMA and RCCA. Moreover, the additional computational cost of FLA for capturing spatial attentions is the lowest (about 9.71 GFLOPs) among all the spatial-modeling NLs. Even compared with the Channel NL, which requires the lowest computational cost, our FLA outperforms it by 2.8% with the minimum computational increment. The measured computational complexity of FLA is also consistent with our theoretical analysis in Sec.3.3. It is noted that our FLA significantly reduces GFLOPs by about 83% and requires only 34% of the GPU memory of DANet (Dual NL) and CS NL when modeling both channel-wise and spatial-wise relationships. Therefore, FLA is not only an effective way to improve segmentation accuracy but also a lightweight design suitable for practical use.

The Efficacy of FLA

To further prove that our method successfully addresses the attention missing issue, we also present several qualitative comparison results in Fig.5. As shown in Fig.5, Dual NL and CS NL can combine the advantages of the Channel NL and the Spatial NL to some extent and generate better segmentation results. However, they sometimes produce wrong predictions even for regions that are correctly classified by the Channel NL or the Spatial NL, such as the examples in the second row. This coincides with our claim that the attention missing issue distorts interactions between dimensions and cannot be solved by stacking different NL blocks. Compared with those NL methods, the proposed FLA significantly improves both the prediction accuracy for distant/thin categories (e.g., the poles in the first row) and the segmentation consistency inside large objects (e.g., the train in the third row and the car in the last row). The quantitative per-class comparisons can be seen in Fig.2. This phenomenon also demonstrates that FLA can model both channel-wise and spatial-wise relations with only a single non-local block.

Visualization of Attention Module

Figure 6: Visualization of attended feature maps of Channel NL and our FLA on the Cityscapes validation set, where feature maps are visualized by averaging along the channel dimension.

To get a deeper understanding of how our FLA encodes spatial attentions into the channel affinity map, we visualize the attended feature maps and analyze how FLA improves the final result. We also visualize the attended feature maps of the traditional Channel NL for comparison. As shown in Fig.6, both the Channel NL and our FLA highlight semantic areas and guarantee consistent representations inside large objects such as roads and buildings. Furthermore, the attended feature maps of FLA are more structured and detailed than those of the Channel NL; for example, distant poles and object boundaries are highlighted in all images. In particular, FLA can also distinguish different classes, e.g., the bus and the car in the third row. These visualization results further demonstrate that our proposed module captures and encodes spatial similarities into the channel attention map to achieve full attentions.
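The channel-averaged visualization used in Fig.6 can be reproduced with a few lines; the min-max normalization and colormap below are our own choices:

    import torch
    import matplotlib.pyplot as plt

    def show_attended_map(features: torch.Tensor, index: int = 0):
        """features: (B, C, H, W) attended feature map; average over channels and plot."""
        fmap = features[index].mean(dim=0)                      # (H, W)
        fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-6)
        plt.imshow(fmap.detach().cpu().numpy(), cmap='viridis')
        plt.axis('off')
        plt.show()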

Method Backbone mIoU (%)
Simple Backbone
CCNet Huang et al. (2019) ResNet-101 45.22
GFF Li et al. (2020) ResNet-101 45.33
OCNet Yuan et al. (2021) ResNet-101 45.40
DMNet He et al. (2019) ResNet-101 45.50
RecoNet Chen et al. (2020) ResNet-101 45.54
ACNet Fu et al. (2019b) ResNet-101 45.90
DNL Yin et al. (2020) ResNet-101 45.97
CPNet Yu et al. (2020) ResNet-101 46.27
FLANet (Ours) ResNet-101 46.68
Advanced Backbone
HRNetV2 Sun et al. (2019) HRNetV2-W48 42.99
DANet Fu et al. (2019a) ResNet-101+MG 45.22
OCNet Yuan et al. (2021) HRNetV2-W48 45.50
DNL Yin et al. (2020) HRNetV2-W48 45.82
FLANet (Ours) HRNetV2-W48 46.99
Table 5: Comparisons on the ADE20K val set.
Method Backbone mIoU (%)
Simple Backbone
DeepLabv3 Chen et al. (2017) ResNet-101 85.7
EncNet Zhang et al. (2018) ResNet-101 85.9
DFN Yu et al. (2018) ResNet-101 86.2
CFNet Zhang et al. (2019b) ResNet-101 87.2
EMANet Li et al. (2019a) ResNet-101 87.7
DeeplabV3+ Chen et al. (2018) Xception+JFT 89.0
RecoNet Chen et al. (2020) ResNet-101 88.5
FLANet (Ours) ResNet-101 87.9
Advanced Backbone
EMANet Li et al. (2019a) ResNet-150 88.2
RecoNet Chen et al. (2020) ResNet-150 89.0
FLANet (Ours) HRNetV2-W48 88.5
Table 6: Comparisons on the PASCAL VOC test set. Note that our FLANet is trained without using a COCO-pretrained model.

Experiments on the ADE20K Dataset

To further validate the effectiveness of our FLANet, we conduct experiments on the ADE20K dataset, a challenging scene parsing dataset with both indoor and outdoor images. Tab.5 reports performance comparisons between FLANet and state-of-the-art models on the ADE20K validation set. Our approach achieves a 46.99% mIoU score, outperforming the previous state-of-the-art method by 0.72%, which is significant given how competitive this benchmark is. CPNet, the previous best among these methods, utilizes a learned context prior supervised by an affinity loss to capture intra-class and inter-class contextual dependencies. In contrast, our FLANet captures both spatial-wise and channel-wise dependencies in a single attention map and achieves better performance.

Experiments on the PASCAL VOC Dataset

To verify the generalization ability of our proposed FLANet, we conduct experiments on the PASCAL VOC dataset. The comparison results are shown in Tab.6. FLANet based on ResNet-101 and HRNetV2-W48 achieves comparable performance on the PASCAL VOC test set, even though several competing methods are pretrained on additional data.

Conclusions and Future Work

In this paper, we find that traditional self-attention methods suffer from the attention missing problem caused by matrix multiplication. To mitigate this issue, we reformulate the self-attention mechanism into a fully attentional manner, which captures both channel and spatial attentions within a single attention map at a much lower computational cost. Specifically, we construct global contexts to introduce spatial interactions into the channel attention map. Our FLANet achieves outstanding performance on three semantic segmentation datasets. We have also considered introducing channel interactions into the traditional Spatial NL; however, its extremely high computational load limits practical application, and we leave a more efficient realization to future work.

Acknowledgments

This work was supported in part by Shenzhen Natural Science Foundation under Grant JCYJ20190813170601651, and in part by Shenzhen Institute of Artificial Intelligence and Robotics for Society under Grant AC01202101006 and Grant AC01202101010.

References

  • F. Babiloni, I. Marras, G. Slabaugh, and S. Zafeiriou (2020) TESA: tensor element self-attention via matricization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13942–13951. Cited by: Introduction, Self-Attention Mechanism..
  • Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1971–1980. Cited by: Introduction.
  • L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: Table 6.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: Table 6.
  • W. Chen, X. Zhu, R. Sun, J. He, R. Li, X. Shen, and B. Yu (2020) Tensor low-rank reconstruction for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 1, Table 5, Table 6.
  • B. Cheng, L. Chen, Y. Wei, Y. Zhu, Z. Huang, J. Xiong, T. Huang, W. Hwu, and H. Shi (2019) SPGNet: semantic prediction guidance for scene parsing. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5217–5227. Cited by: Table 1.
  • S. Choi, J. Kim, and J. Choo (2020) Cars can’t fly up in the sky: improving urban-scene segmentation via height-driven attention networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9370–9380. Cited by: Table 1.
  • J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Self-Attention Mechanism..
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: Experiments.
  • M. Everingham, L. Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2009) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, pp. 303–338. Cited by: Experiments.
  • J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019a) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: Introduction, Introduction, Introduction, Self-Attention Mechanism., Table 1, Table 2, Table 5.
  • J. Fu, J. Liu, Y. Wang, Y. Li, Y. Bao, J. Tang, and H. Lu (2019b) Adaptive context network for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6748–6757. Cited by: Table 1, Table 5.
  • J. He, Z. Deng, and Y. Qiao (2019) Dynamic multi-scale filters for semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3561–3571. Cited by: Table 5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pp. 630–645. Cited by: Network Architecture.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: Semantic Segmentation..
  • Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–612. Cited by: Introduction, Self-Attention Mechanism., Comparison with NL Methods, Table 1, Table 5.
  • T. Ke, J. Hwang, Z. Liu, and S. Yu (2018) Adaptive affinity fields for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 1.
  • X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu (2019a) Expectation-maximization attention networks for semantic segmentation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9167–9176. Cited by: Comparison with NL Methods, Table 6.
  • X. Li, L. Zhang, A. You, M. Yang, K. Yang, and Y. Tong (2019b) Global aggregation then local distribution in fully convolutional networks. In British Machine Vision Conference (BMVC), Cited by: Table 1.
  • X. Li, H. Zhao, L. Han, Y. Tong, S. Tan, and K. Yang (2020) Gated fully fusion for semantic segmentation. In Association for the Advancement of Artificial Intelligence (AAAI), Cited by: Table 1, Table 2, Table 5.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: Semantic Segmentation..
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: Ablation Studies.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. Devito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: Implementation Details.
  • P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, Cited by: Introduction.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: Semantic Segmentation..
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: Implementation Details.
  • Q. Song, K. Mei, and R. Huang (2021) AttaNet: attention-augmented network for fast and accurate scene parsing. In AAAI, Cited by: Introduction, Self-Attention Mechanism., Table 2.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5686–5696. Cited by: Introduction, Network Architecture, Table 1, Table 5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Cited by: Self-Attention Mechanism..
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: Self-Attention Mechanism..
  • M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) DenseASPP for semantic segmentation in street scenes. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3684–3692. Cited by: Introduction.
  • M. Yin, Z. Yao, Y. Cao, X. Li, Z. Zhang, S. Lin, and H. Hu (2020) Disentangled non-local neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Introduction, Table 5.
  • C. Yu, J. Wang, C. Gao, G. Yu, C. Shen, and N. Sang (2020) Context prior for scene segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12413–12422. Cited by: Table 5.
  • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 1857–1866. Cited by: Implementation Details, Table 6.
  • Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Introduction.
  • Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang (2021) OCNet: object context for semantic segmentation. Int. J. Comput. Vis. 129, pp. 2375–2398. Cited by: Self-Attention Mechanism., Table 1, Table 5.
  • F. Zhang, Y. Chen, Z. Li, Z. Hong, J. Liu, F. Ma, J. Han, and E. Ding (2019a) Acfnet: attentional class feature network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6798–6807. Cited by: Introduction, Table 1, Table 2.
  • H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7151–7160. Cited by: Table 6.
  • H. Zhang, C. Wang, and J. Xie (2019b) Co-occurrent features in semantic segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 548–557. Cited by: Table 1, Table 6.
  • L. Zhang, X. Li, A. Arnab, K. Yang, Y. Tong, and P. Torr (2019c) Dual graph convolutional network for semantic segmentation. In British Machine Vision Conference (BMVC), Cited by: Table 1.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: Introduction, Table 1, Table 2.
  • H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia (2018) PSANet: point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–286. Cited by: Introduction, Table 1.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Experiments.
  • Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 593–602. Cited by: Introduction, Table 1.