CCNet: Criss-Cross Attention for Semantic Segmentation

11/28/2018 ∙ by Zilong Huang, et al.

Long-range dependencies can capture useful contextual information to benefit visual understanding problems. In this work, we propose a Criss-Cross Network (CCNet) for obtaining such important information in a more effective and efficient way. Concretely, for each pixel, CCNet harvests the contextual information of its surrounding pixels on the criss-cross path through a novel criss-cross attention module. By taking a further recurrent operation, each pixel can finally capture long-range dependencies from all pixels. Overall, CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the recurrent criss-cross attention module requires 11× less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85% in computing long-range dependencies. 3) State-of-the-art performance. We conduct extensive experiments on popular semantic segmentation benchmarks, including Cityscapes and ADE20K, and the instance segmentation benchmark COCO. In particular, CCNet achieves mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, which are the new state-of-the-art results. We make the code publicly available at https://github.com/speedinghzl/CCNet.


1 Introduction

Semantic segmentation is a fundamental topic in computer vision, whose goal is to assign semantic class labels to every pixel in the image. It has been actively studied in many recent papers and is also critical for various challenging applications such as autonomous driving, virtual reality, and image editing.

Recently, state-of-the-art semantic segmentation frameworks based on the fully convolutional network (FCN) [26] have made remarkable progress. Due to their fixed geometric structures, however, they are inherently limited to local receptive fields and short-range contextual information, and this lack of contextual information adversely affects FCN-based methods.

To capture long-range dependencies, Chen et al[6] proposed the atrous spatial pyramid pooling (ASPP) module, which aggregates contextual information with multi-scale dilated convolutions. Zhao et al[42] further introduced PSPNet with a pyramid pooling module to capture contextual information. However, the dilated-convolution-based methods [7, 6, 13] collect information from only a few surrounding pixels and cannot actually generate dense contextual information. Meanwhile, the pooling-based methods [42, 40] aggregate contextual information in a non-adaptive manner, and the same homogeneous context is adopted by all image pixels, which ignores the fact that different pixels need different contextual dependencies.

To generate dense, pixel-wise contextual information, PSANet [43] learns to aggregate contextual information for each position via a predicted attention map. Non-local Networks [32] utilize a self-attention mechanism [10, 29], which enables a single feature at any position to perceive the features of all other positions, leading to more powerful pixel-wise representations. Here, each position in the feature map is connected with all other ones through self-adaptively predicted attention maps, thus harvesting contextual information at various ranges, see Fig. 1 (a). However, these attention-based methods need to generate huge attention maps to measure the relationship for each pixel pair, whose complexity in time and space is O((H×W)×(H×W)), where H×W denotes the spatial dimension of the input feature map. Since input feature maps usually have high resolution in the semantic segmentation task, self-attention based methods have high computational complexity and occupy a large amount of GPU memory. This raises the question: is there an alternative solution to achieve such a target in a more efficient way?

We found that the non-local operation adopted by [32] can be replaced by two consecutive criss-cross operations, each of which has only sparse connections (H+W−1) for each position in the feature maps. This motivates us to propose the criss-cross attention module to aggregate long-range pixel-wise contextual information in the horizontal and vertical directions. By serially stacking two criss-cross attention modules, each pixel can collect contextual information from all pixels. The decomposition greatly reduces the complexity in time and space from O((H×W)×(H×W)) to O((H×W)×(H+W−1)).
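To make the savings concrete, the following minimal Python sketch counts the attention weights stored per image for a hypothetical 1/8-resolution feature map of size 96×96 (the size is an assumption chosen only for illustration, not a value taken from the paper):

```python
# Hypothetical example: number of attention weights per image for an H x W feature map.
H, W = 96, 96  # assumed 1/8-resolution feature map size, for illustration only

non_local = (H * W) ** 2                 # every position attends to every position
criss_cross = (H * W) * (H + W - 1)      # every position attends to its row and column
two_loops = 2 * criss_cross              # recurrent criss-cross attention with R = 2

print(non_local)   # 84934656
print(two_loops)   # 3520512, roughly 4% of the non-local attention map
```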

Concretely, our criss-cross attention module is able to harvest contextual information both nearby and far away on the criss-cross path. As shown in Fig. 1, both the non-local module and the criss-cross attention module take input feature maps with spatial size H×W to generate attention maps (upper branch) and adapted feature maps (lower branch), respectively, and aggregate them by weighted sum. In the criss-cross attention module, each position (e.g., in blue) in the feature map is connected with the other positions in the same row and the same column through a sparsely predicted attention map. The predicted attention map has only H+W−1 weights per position rather than H×W as in the non-local module. Furthermore, we propose the recurrent criss-cross attention module to capture long-range dependencies from all pixels. The local features are first passed into the criss-cross attention module, which collects contextual information in the horizontal and vertical directions. The output feature map of this criss-cross attention module is then fed into the next criss-cross attention module; each position (e.g., in red) in the second feature map collects information from all other positions to augment the pixel-wise representations. All criss-cross attention modules share parameters to avoid adding extra parameters. Our criss-cross attention module can be plugged into any fully convolutional neural network; the resulting network, named CCNet, learns to segment in an end-to-end manner.

We have carried out extensive experiments on large-scale datasets. Our proposed CCNet achieves top performance on two of the most competitive semantic segmentation datasets, i.e., Cityscapes [11] and ADE20K [45]. Beyond semantic segmentation, the proposed criss-cross attention even improves the state-of-the-art instance segmentation method, i.e., Mask R-CNN with ResNet-101 [16]. The results show that criss-cross attention is broadly beneficial to dense prediction tasks. In summary, our main contributions are two-fold:

  • We propose a novel criss-cross attention module in this work, which can be leveraged to capture contextual information from long-range dependencies in a more efficient and effective way.

  • We propose CCNet, which takes advantage of the recurrent criss-cross attention module and achieves leading performance on segmentation benchmarks, including Cityscapes, ADE20K and COCO.

Figure 2: Overview of the proposed CCNet for semantic segmentation. The proposed recurrent criss-cross attention module takes feature maps as input and outputs feature maps that aggregate rich and dense contextual information from all pixels. The recurrent criss-cross attention module can be unrolled into R loops, in which all criss-cross attention modules share parameters.

2 Related work

Semantic segmentation Recent years have seen a renewal of interest in semantic segmentation. FCN [26] is the first approach to adopt fully convolutional networks for semantic segmentation, and later FCN-based methods have made great progress in image semantic segmentation. Chen et al[5] and Yu et al[38] removed the last two downsampling layers to obtain dense predictions and utilized dilated convolutions to enlarge the receptive field. Unet [28], Deeplabv3+ [9], RefineNet [21] and DFN [37] adopted encoder-decoder structures that fuse information from low-level and high-level layers to predict segmentation masks. SAC [41] and Deformable Convolutional Networks [12] improved the standard convolution operator to handle the deformation and various scales of objects. CRF-RNN [38] and DPN [25] used graphical models, i.e., CRF and MRF, for semantic segmentation. AAF [19] used adversarial learning to capture and match the semantic relations between neighboring pixels in the label space. BiSeNet [36] was designed for real-time semantic segmentation.

In addition, some works aggregate contextual information to augment the feature representation. Deeplabv2 [6] proposed the ASPP module, which uses convolutions with different dilation rates to capture contextual information. DenseASPP [35] brought dense connections into ASPP to generate features at various scales. DPC [4] utilized architecture search techniques to build multi-scale architectures for semantic segmentation. PSPNet [42] utilized pyramid pooling to aggregate contextual information. GCN [27] utilized a global convolution module and ParseNet [24] utilized global pooling to harvest context information for global representations. Recently, Zhao et al[43] proposed the point-wise spatial attention network, which uses a predicted attention map to guide contextual information collection. Liu et al[23] and Visin et al[30] utilized RNNs to capture long-range contextual dependencies. Conditional random fields (CRF) [2, 3, 5, 44], Markov random fields (MRF) [25] and recurrent neural networks (RNN) [23] have also been utilized to capture long-range dependencies.

Attention model Attention models are widely used for various tasks. Squeeze-and-Excitation Networks [17] enhanced the representational power of the network by modeling channel-wise relationships in an attention mechanism. Chen et al[8] made use of several attention masks to fuse feature maps or predictions from different branches. Vaswani et al[29] applied a self-attention model to machine translation. Wang et al[32] proposed the non-local module, which generates a huge attention map by calculating the correlation matrix between each pair of spatial points in the feature map; the attention map then guides dense contextual information aggregation. OCNet [39] and DANet [14] utilized the self-attention mechanism to harvest contextual information. PSA [43] learned an attention map to aggregate contextual information for each individual point adaptively and specifically. Different from the aforementioned studies, which generate huge attention maps recording the relationship for each pixel pair in the feature map, our CCNet aggregates contextual information on the criss-cross path via the criss-cross attention module. Besides, CCNet can also obtain dense contextual information in a recurrent fashion, which is more effective and efficient.

3 Approach

In this section, we give the details of the proposed Criss-Cross Network (CCNet) for semantic segmentation. We first present the general framework of our network. Then, we introduce the criss-cross attention module, which captures long-range contextual information in the horizontal and vertical directions. At last, to capture dense and global contextual information, we propose the recurrent criss-cross attention module.

3.1 Overall

The network architecture is given in Fig. 2. An input image is passed through a deep convolutional neural network (DCNN), which is designed in a fully convolutional fashion [6] and produces feature maps X. We denote the spatial size of X as H×W. In order to retain more details and efficiently produce dense feature maps, we remove the last two down-sampling operations and employ dilated convolutions in the subsequent convolutional layers, thus enlarging the width/height of the output feature maps to 1/8 of the input image.

After obtaining the feature maps X, we first apply a convolutional layer to obtain feature maps H of reduced dimension. The feature maps H are then fed into the criss-cross attention (CCA) module, which generates new feature maps H' that aggregate long-range contextual information for each pixel in a criss-cross way. The feature maps H' only aggregate contextual information in the horizontal and vertical directions, which is not powerful enough for semantic segmentation. To obtain richer and denser contextual information, we feed the feature maps H' into the criss-cross attention module again and obtain the output feature maps H''. Thus, each position in the feature maps H'' actually gathers information from all pixels. The two criss-cross attention modules share the same parameters to avoid adding too many extra parameters. We name this recurrent structure the recurrent criss-cross attention (RCCA) module.

Then we concatenate the dense contextual features H'' with the local representation features X. The concatenation is followed by one or several convolutional layers with batch normalization and activation for feature fusion. Finally, the fused features are fed into the segmentation layer to generate the final segmentation map.

Figure 3: The details of criss-cross attention module.

3.2 Criss-Cross Attention

In order to model long-range contextual dependencies over local feature representations using lightweight computation and memory, we introduce a criss-cross attention module. The criss-cross attention module collects contextual information in horizontal and vertical directions to enhance pixel-wise representative capability.

As shown in Fig. 3, given a local feature H ∈ R^{C×W×H}, the criss-cross attention module first applies two convolutional layers with 1×1 filters on H to generate two feature maps Q and K, respectively, where Q, K ∈ R^{C'×W×H}. C' is the number of channels of the feature maps, which is less than C for dimension reduction.

After obtaining the feature maps Q and K, we further generate an attention map A ∈ R^{(H+W−1)×W×H} via an Affinity operation. At each position u in the spatial dimension of the feature maps Q, we can obtain a vector Q_u ∈ R^{C'}. Meanwhile, we can obtain the set Ω_u ∈ R^{(H+W−1)×C'} by extracting the feature vectors from K which are in the same row or column as position u. Ω_{i,u} ∈ R^{C'} is the i-th element of Ω_u. The Affinity operation is defined as follows:

d_{i,u} = Q_u Ω_{i,u}^T,    (1)

in which d_{i,u} ∈ D denotes the degree of correlation between features Q_u and Ω_{i,u}, i = [1, ..., H+W−1], and D ∈ R^{(H+W−1)×W×H}. Then, we apply a softmax layer on D along the channel dimension to calculate the attention map A.

Then another convolutional layer with 1×1 filters is applied on H to generate V ∈ R^{C×W×H} for feature adaptation. At each position u in the spatial dimension of the feature maps V, we can obtain a vector V_u ∈ R^{C} and a set Φ_u ∈ R^{(H+W−1)×C}. The set Φ_u is the collection of feature vectors in V which are in the same row or column as position u. The long-range contextual information is collected by the Aggregation operation:

H'_u = Σ_{i=1}^{H+W−1} A_{i,u} Φ_{i,u} + H_u,    (2)

in which H'_u denotes a feature vector in the output feature maps H' ∈ R^{C×W×H} at position u, and A_{i,u} is a scalar value at channel i and position u in A. The contextual information is added to the local feature H to enhance the local features and augment the pixel-wise representation. Therefore, each position has a wide contextual view and selectively aggregates context according to the spatial attention map. These feature representations achieve mutual gains and are more robust for semantic segmentation.

The proposed criss-cross attention module is a self-contained module that can be dropped into a CNN architecture at any point, and in any number, to obtain rich contextual information. The module is computationally cheap, adds few parameters, and consumes very little GPU memory.
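For concreteness, here is a minimal pure-PyTorch sketch of a single criss-cross attention step following Eqs. (1)–(2). It is written for readability rather than speed and is not the released implementation; the channel-reduction factor of 8 and the learnable residual scale gamma are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrissCrossAttention(nn.Module):
    """Naive sketch of one criss-cross attention step (Eqs. 1-2)."""

    def __init__(self, in_channels: int, reduction: int = 8):
        super().__init__()
        # 1x1 convolutions produce Q, K (reduced channels) and V (full channels).
        self.query = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # Assumed learnable residual scale; Eq. (2) adds the context directly.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Affinity along columns: energy_col[b, y, x, i] = <Q at (y, x), K at (i, x)>.
        energy_col = torch.einsum("bcyx,bcix->byxi", q, k)
        # Mask the position itself in the column branch so it is counted only
        # once (it appears again in the row branch below).
        self_mask = torch.eye(h, device=x.device, dtype=torch.bool).view(1, h, 1, h)
        energy_col = energy_col.masked_fill(self_mask, float("-inf"))

        # Affinity along rows: energy_row[b, y, x, j] = <Q at (y, x), K at (y, j)>.
        energy_row = torch.einsum("bcyx,bcyj->byxj", q, k)

        # Joint softmax over the H + W criss-cross candidates of each position.
        attn = F.softmax(torch.cat([energy_col, energy_row], dim=-1), dim=-1)
        attn_col, attn_row = attn[..., :h], attn[..., h:]

        # Aggregation (Eq. 2): attention-weighted sum of V on the criss-cross
        # path, added back to the input as a residual.
        out_col = torch.einsum("byxi,bcix->bcyx", attn_col, v)
        out_row = torch.einsum("byxj,bcyj->bcyx", attn_row, v)
        return self.gamma * (out_col + out_row) + x
```

As a quick shape check, for an input x of shape (2, 64, 17, 33), CrissCrossAttention(64)(x) returns a tensor of the same shape.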

Figure 4: An example of information propagation when the loop number is 2.

3.3 Recurrent Criss-Cross Attention

Although a criss-cross attention module can capture long-range contextual information in the horizontal and vertical directions, the connections between a pixel and the pixels around it are still sparse, whereas dense contextual information is helpful for semantic segmentation. To achieve this, we introduce recurrent criss-cross attention based on the criss-cross attention module described above. The recurrent criss-cross attention module can be unrolled into R loops. In the first loop, the criss-cross attention module takes as input the feature maps H extracted from a CNN model and outputs feature maps H', where H and H' have the same shape. In the second loop, the criss-cross attention module takes as input the feature maps H' and outputs feature maps H''. As shown in Fig. 2, the recurrent criss-cross attention module with two loops (R=2) is enough to harvest long-range dependencies from all pixels and generate new feature maps with dense and rich contextual information.
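The recurrence is simply the same parameter-shared module applied R times; a minimal sketch, reusing the CrissCrossAttention class sketched above:

```python
class RecurrentCrissCrossAttention(nn.Module):
    """Applies one shared criss-cross attention module R times (R = 2 in the paper)."""

    def __init__(self, in_channels: int, loops: int = 2):
        super().__init__()
        self.cca = CrissCrossAttention(in_channels)  # a single module, so all loops share parameters
        self.loops = loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.loops):
            # Each pass widens the context: criss-cross path after one pass,
            # the full image after two passes.
            x = self.cca(x)
        return x
```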

We denote by A and A' the attention maps in loop 1 and loop 2, respectively. Since we are interested only in contextual information that spreads in the spatial dimension rather than in the channel dimension, the convolutional layers with 1×1 filters can be viewed as identity connections. In addition, the mapping function from a position (x', y') to the weight A_{i,x,y} is written as f(A, x', y', x, y). For any position u in feature map H'' and any position θ in feature map H, there is a connection when R = 2. One case is that u and θ are in the same row or column:

H''_u ← [f(A, θ_x, θ_y, u_x, u_y) + 1] · f(A', θ_x, θ_y, u_x, u_y) · H_θ,    (3)

in which ← denotes the add-to operation. The other case is that u and θ are in neither the same row nor the same column. Fig. 4 shows the propagation path of contextual information in the spatial dimension: the information at θ first flows to the crossing points (u_x, θ_y) and (θ_x, u_y) in loop 1 and then reaches u in loop 2,

H''_u ← [f(A, θ_x, θ_y, u_x, θ_y) · f(A', u_x, θ_y, u_x, u_y) + f(A, θ_x, θ_y, θ_x, u_y) · f(A', θ_x, u_y, u_x, u_y)] · H_θ.    (4)

In general, our recurrent criss-cross attention module makes up for the deficiency of the criss-cross attention module, which cannot obtain dense contextual information from all pixels. Compared with the criss-cross attention module, the recurrent criss-cross attention module (R = 2) does not bring extra parameters and achieves better performance at the cost of a minor increase in computation. The recurrent criss-cross attention module is also a self-contained module that can be plugged into any CNN architecture at any stage and optimized in an end-to-end manner.
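Putting Sections 3.1–3.3 together, a possible sketch of the CCNet head (dimension reduction, RCCA, concatenation with X, fusion, classification) could look as follows; the channel widths, 3×3 kernels and single fusion layer are assumptions for illustration, not the released configuration.

```python
class CCHead(nn.Module):
    """Sketch of the CCNet segmentation head on top of backbone features X."""

    def __init__(self, in_channels: int, inner_channels: int, num_classes: int, loops: int = 2):
        super().__init__()
        # Dimension reduction of the backbone features: X -> H.
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, inner_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(inner_channels),
            nn.ReLU(inplace=True),
        )
        self.rcca = RecurrentCrissCrossAttention(inner_channels, loops)
        # Fusion of the dense contextual features H'' with the local features X.
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + inner_channels, inner_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(inner_channels),
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(inner_channels, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.reduce(x)                       # X -> H
        h = self.rcca(h)                         # H -> H'' (R loops of criss-cross attention)
        h = self.fuse(torch.cat([x, h], dim=1))  # concatenate with X and fuse
        return self.classify(h)                  # per-pixel class logits at 1/8 resolution
```

For instance, CCHead(2048, 512, 19) would map ResNet-101 stage-5 features of shape (1, 2048, H/8, W/8) to 19-way Cityscapes logits; the inner width of 512 is again an assumption.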

4 Experiments

To evaluate the proposed method, we carry out comprehensive experiments on the Cityscapes, ADE20K, and COCO datasets. Experimental results demonstrate that CCNet achieves state-of-the-art performance on Cityscapes and ADE20K. Meanwhile, CCNet brings consistent gains on COCO for instance segmentation. In the following subsections, we first introduce the datasets and implementation details, then we perform a series of ablation experiments on the Cityscapes dataset. Finally, we report our results on the ADE20K and COCO datasets.

4.1 Datasets and Evaluation Metrics

We adopt Mean IoU (mean of class-wise intersection over union) for Cityscapes and ADE20K, and the standard COCO metric, Average Precision (AP), for COCO.

  • Cityscapes targets urban scene segmentation. It contains 5,000 high-quality, finely annotated images and 20,000 coarsely annotated images captured from 50 different cities. Each image has 1024×2048 resolution and is annotated with 19 classes for semantic segmentation evaluation. Only the 5,000 finely annotated images are used in our experiments; they are divided into 2,975/500/1,525 images for training, validation, and testing.

  • ADE20K is a recent scene parsing benchmark containing dense labels of 150 stuff/object categories. The dataset includes 20K/2K/3K images for training, validation and test.

  • COCO is a very challenging dataset that contains 115K training images over 80 categories, 5K images for validation, and 20K for testing.

4.2 Implementation Details

Network Structure We implement our method based on an open-source PyTorch segmentation toolbox [18]. For semantic segmentation, we choose ImageNet pre-trained ResNet-101 [16] as our backbone, remove the last two down-sampling operations, and employ dilated convolutions in the subsequent convolutional layers following previous works [5], so that the output stride becomes 8. Meanwhile, we replace the standard BatchNorm with InPlace-ABN [1] to synchronize the mean and standard deviation of BatchNorm across multiple GPUs. For instance segmentation, we choose Mask R-CNN [15] as our baseline.

Training settings SGD with mini-batches is used for training. For semantic segmentation, the initial learning rate is 1e-2 for Cityscapes and ADE20K. Following prior works [6, 40], we employ a poly learning rate policy in which the initial learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9. We use a momentum of 0.9 and a weight decay of 0.0001. For Cityscapes, the training images are augmented by random scaling (from 0.75 to 2.0) and then randomly cropping high-resolution patches from the resulting images. Since the images from ADE20K have various sizes, we adopt an augmentation strategy of resizing the short side of the input image to a length randomly chosen from the set {300, 375, 450, 525, 600}. In addition, we also apply random horizontal flipping for data augmentation. We employ 4 TITAN XP GPUs for training with a batch size of 8. For instance segmentation, we use the same training settings as Mask R-CNN [15].
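As a small illustration of the schedule described above, a minimal sketch of the poly learning-rate policy (the iteration counts below are arbitrary and nothing here is specific to the released code):

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly policy: base_lr * (1 - iter / max_iter) ** power, decaying to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. with base_lr = 1e-2, halfway through a hypothetical 60k-iteration run:
print(poly_lr(1e-2, 30000, 60000))  # ~5.36e-3
```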

4.3 Experiments on Cityscapes

4.3.1 Comparisons with state-of-the-arts

Results of other state-of-the-art semantic segmentation solutions on the Cityscapes validation set are summarized in Tab. 1. We provide these results for reference, noting that they are not strictly comparable because of differences in backbones and training data. Among the approaches, Deeplabv3 [7] and CCNet use the same backbone and multi-scale testing strategy, while Deeplabv3+ [9] and DPC [4] use stronger backbones. In particular, DPC [4] makes use of the extra COCO dataset for training in addition to the Cityscapes training set. The results show that the proposed CCNet with multi-scale testing achieves a new state-of-the-art performance.

In addition, we also train the best-performing CCNet with ResNet-101 [16] as the backbone using both the training and validation sets, and evaluate on the test set by submitting our results to the official evaluation server. Most methods [6, 21, 41, 27, 31, 42, 36, 19, 43, 37] adopt the same backbone as ours, and the others [33, 35] utilize stronger backbones. From Tab. 2, it can be observed that our CCNet substantially outperforms all previous techniques. Among the approaches, PSANet [43] is the most related to our method, since it also generates a sub attention map for each pixel. One of the differences is that the sub attention map has H×W weights in PSANet but only H+W−1 weights in CCNet. Our method can achieve better performance with lower computation cost and memory usage.

Method Backbone multi-scale mIOU(%)
DeepLabv3 [7] ResNet-101 Yes 79.3
DeepLabv3+ [9] Xception-65 No 79.1
DPC [4] † Xception-71 No 80.8
CCNet ResNet-101 Yes 81.3
  • † use extra COCO dataset for training.

Table 1: Comparison with state-of-the-arts on Cityscapes validation set.
Method Backbone mIOU(%)
DeepLab-v2 [6] ResNet-101 70.4
RefineNet [21] ‡ ResNet-101 73.6
SAC [41] ‡ ResNet-101 78.1
GCN [27] ‡ ResNet-101 76.9
DUC [31] ‡ ResNet-101 77.6
ResNet-38 [33] WiderResnet-38 78.4
PSPNet [42] ResNet-101 78.4
BiSeNet [36] ‡ ResNet-101 78.9
AAF [19] ‡ ResNet-101 79.1
PSANet [43] ‡ ResNet-101 80.1
DFN [37] ‡ ResNet-101 79.3
DenseASPP [35] ‡ DenseNet-161 80.6
CCNet ‡ ResNet-101 81.4
  • ‡ train with both the train-fine and val-fine datasets.

Table 2: Cityscapes test set performance across leading competitive models.
Figure 5: Visualization results of RCCA with different loops on Cityscapes validation set.

4.3.2 Ablation studies

To further prove the effectiveness of the CCNet, we conduct extensive ablation experiments on the validation set of Cityscapes with different settings for CCNet.

The effect of attention module Tab. 3 reports the performance on the Cityscapes validation set when adopting different numbers of loops in the recurrent criss-cross attention (RCCA) module. All experiments are conducted using ResNet-101 as the backbone. Our baseline network is a ResNet-based FCN with dilated convolutions incorporated at stages 4 and 5, i.e., dilations are set to 2 and 4 for these two stages, respectively. The increments of FLOPs and memory usage are estimated relative to this baseline. We observe that adding one criss-cross attention module to the baseline, denoted as R=1, improves the performance by 2.9% over the baseline, which effectively demonstrates the significance of the criss-cross attention module. Furthermore, increasing the number of loops from 1 to 2 improves the performance by another 1.8%, demonstrating the effectiveness of dense contextual information. Finally, increasing the number of loops from 2 to 3 slightly improves the performance by 0.4%. Meanwhile, as the number of loops increases, the FLOPs and GPU memory usage also increase. These results show that the proposed criss-cross attention module significantly improves the performance by capturing long-range contextual information in the horizontal and vertical directions, and that the recurrent criss-cross attention is effective in capturing the dense and global contextual information that benefits semantic segmentation. To balance performance and resource usage, we choose R=2 as the default setting in all the following experiments.

We provide qualitative comparisons in Fig. 5 to further validate the effectiveness of the criss-cross module. We use white circles to indicate challenging regions that are easily misclassified. We observe that these challenging regions are progressively corrected as the number of loops increases, which demonstrates the effectiveness of dense contextual information aggregation for semantic segmentation.

Comparison of context aggregation approaches We compare the performance of several context aggregation approaches on the Cityscapes validation set with ResNet-50 and ResNet-101 as backbones. Note that we do not provide the result of “ResNet-101 + NL”, because the experiment integrating a non-local block into the ResNet-101 backbone does not fit within the 12 GB GPU memory limit.

Specifically, the context aggregation baselines are: 1) Zhao et al[42] proposed pyramid pooling, a simple and effective way to capture global contextual information, denoted as “+PSP”; 2) Chen et al[7] used convolutions with different dilation rates to harvest pixel-wise contextual information at different ranges, denoted as “+ASPP”; 3) Wang et al[32] introduced the non-local network, whose attention mask for each position is generated by calculating the feature correlation between each pair of pixels to guide context aggregation, denoted as “+NL”.

In Tab. 4, both “+NL” and “+RCCA” achieve better performance than the other context aggregation approaches, which demonstrates the importance of capturing dense long-range contextual information. More interestingly, our method achieves better performance than “+NL”, which can also form dense long-range contextual information. One cause may be that the attention map plays a key role in contextual information aggregation: “+NL” generates its attention map from features that have a limited receptive field and only short-range contextual information, whereas our “+RCCA” takes two steps to form dense contextual information, so the second step can learn a better attention map, benefiting from the feature map produced by the first step in which some long-range contextual information has already been embedded.

We further explore the computation and memory footprint of RCCA. As shown in Tab. 5, compared with the “+NL” method, the proposed “+RCCA” requires about 11× less GPU memory and reduces FLOPs by about 85% when computing long-range dependencies, which shows that CCNet is an efficient way to capture long-range contextual information with a much smaller computation and memory footprint.

Visualization of Attention Map To get a deeper understanding of our RCCA, we visualize the learned attention masks as shown in Fig. 6. For each input image, we select one point (green cross) and show its corresponding attention maps for R=1 and R=2 in columns 2 and 3, respectively. From Fig. 6, only the contextual information on the criss-cross path of the target point is captured when R=1. By adopting one more criss-cross module, i.e., R=2, the RCCA module can finally aggregate denser and richer contextual information than with R=1. Besides, we observe that the attention module captures semantic similarity and long-range dependencies.

Figure 6: Visualization results of the attention module on the Cityscapes validation set. The left column shows images from the Cityscapes validation set, the second and third columns show pixel-wise attention maps for R=1 and R=2 in RCCA, and the last column is the ground truth.
Loops GFLOPs (Δ) Memory (MB, Δ) mIOU(%)
baseline 0 0 75.1
R=1 8.3 53 78.0
R=2 16.5 127 79.8
R=3 24.7 208 80.2
Table 3: Performance on the Cityscapes validation set for different numbers of loops in RCCA. FLOPs and memory usage are reported as increments over the baseline.
Method mIOU(%)
ResNet50-Baseline 73.3
ResNet50+PSP 76.4
ResNet50+ASPP 77.1
ResNet50+NL 77.3
ResNet50+RCCA(R=2) 78.5
ResNet101-Baseline 75.1
ResNet101+PSP 78.5
ResNet101+ASPP 78.9
ResNet101+RCCA(R=2) 79.8
Table 4: Comparison of context aggregation approaches on Cityscapes validation set.
Method GFLOPs (Δ) Memory (MB, Δ) mIOU(%)
baseline 0 0 73.3
+NL 108 1411 77.3
+RCCA(R=2) 16.5 127 78.5
Table 5: Comparison of the non-local module and RCCA. FLOPs and memory usage are reported as increments over the baseline.

4.4 Experiments on ADE20K

In this subsection, we conduct experiments on the ADE20K dataset, a very challenging scene parsing dataset covering both indoor and outdoor scenes, to validate the effectiveness of our method. As shown in Tab. 6, CCNet achieves a state-of-the-art performance of 45.22%, outperforming the previous state-of-the-art methods by about 0.6%. Among the approaches, most methods [41, 42, 43, 20, 34, 40] adopt ResNet-101 as the backbone, while RefineNet [21] adopts a more powerful network, i.e., ResNet-152. EncNet [40], the previous best-performing method, utilizes global pooling with image-level supervision to collect image-level context information. In contrast, our CCNet adopts an alternative way to integrate contextual information by capturing pixel-wise long-range dependencies, and achieves better performance.

4.5 Experiments on COCO

To further demonstrate the generality of our CCNet, we conduct instance segmentation experiments on COCO [22] using the competitive Mask R-CNN model [15] as the baseline. Following [32], we modify the Mask R-CNN backbone by adding the RCCA module right before the last convolutional residual block of res4. We evaluate standard baselines with ResNet-50 and ResNet-101; all models are fine-tuned from ImageNet pre-training. We use an open-source implementation (https://github.com/facebookresearch/maskrcnn-benchmark) with end-to-end joint training, whose performance is almost the same as the baseline reported in [32]. We report comparisons in terms of box AP and mask AP on COCO in Tab. 7. The results demonstrate that our method substantially outperforms the baseline in all metrics. Meanwhile, the network with “+RCCA” also achieves better performance than the network with one non-local block (“+NL”).

Method Backbone mIOU(%)
RefineNet [21] ResNet-152 40.70
SAC [41] ResNet-101 44.30
PSPNet [42] ResNet-101 43.29
PSANet [43] ResNet-101 43.77
DSSPN [20] ResNet-101 43.68
UperNet [34] ResNet-101 42.66
EncNet [40] ResNet-101 44.65
CCNet ResNet-101 45.22
Table 6: State-of-the-art Comparison experiments on ADE20K validation set.
Method box AP(%) mask AP(%)
R50 baseline 38.2 34.8
+NL 39.0 35.5
+RCCA 39.3 36.1
R101 baseline 40.1 36.2
+NL 40.8 37.1
+RCCA 41.0 37.3
Table 7: Results of object detection and instance segmentation on COCO.

5 Conclusion and future work

In this paper, we have presented a Criss-Cross Network (CCNet) for semantic segmentation, which adaptively captures long-range contextual information on the criss-cross path. To obtain dense contextual information, we introduced the recurrent criss-cross attention module, which aggregates contextual information from all pixels. The ablation experiments demonstrate that recurrent criss-cross attention captures dense long-range contextual information with lower computation and memory cost. Our CCNet achieves consistently outstanding performance on two semantic segmentation datasets, i.e., Cityscapes and ADE20K, and the instance segmentation dataset COCO.

References