1 Introduction
Semantic segmentation, the task of labelling every pixel of an image with a class based on its properties, is a fundamental topic in computer vision. In recent years, deep networks, especially convolutional neural networks (CNNs), have elevated the performance of semantic segmentation algorithms to new heights
[1, 2, 3, 4, 5]. It has been shown that encoding rich contextual information helps achieve good results in semantic segmentation [5, 6, 4, 7]. However, due to the local connectivity of CNN layers, the receptive field is often too small to exploit sufficient features. Even in recent deep models, the effective receptive field is still not large enough to cover the entire image [8]. To address this problem, many methods have been proposed to aggregate global contextual information from local features. Some [6, 5, 9, 10] utilize large filters, atrous spatial pyramid pooling (ASPP) or the pyramid pooling module (PPM) to enlarge receptive fields and extract multi-scale features at the end of the network. Others [11, 7, 12] adopt spatial attention mechanisms to capture long-range dependencies within the feature maps. The non-local block proposed in [11] has been widely applied to semantic segmentation as an instance of the spatial attention mechanism that aggregates global contextual information densely. For each query element in a non-local block, all key elements are used to compute pairwise relations with the query element and generate a dense affinity matrix. The contextual information at all locations is then aggregated by a weighted sum, with the weights defined by the affinity matrix. Although the non-local block improves performance significantly, it incurs high computational costs even on high-end GPU-based platforms [13, 14]. Moreover, it is worth noting that some features may contain irrelevant information, so aggregating these features is useless or even harmful to the performance.
Based on these observations, we propose a sparse non-local (SNL) block that aggregates contextual information from a sparse set of features, reducing the computational cost without sacrificing performance. For each query element in the proposed SNL block, only a few key and value elements are sampled to compute a small sparse affinity matrix, and the sampling locations are learnable. To demonstrate the effectiveness of the proposed SNL block, we build a sparse spatial attention network (SSANet) based on a ResNet-FCN backbone with the SNL block integrated. The proposed network has been evaluated on several semantic segmentation benchmarks, including Cityscapes [15], PASCAL Context [16] and ADE20K [17], and achieves state-of-the-art performance.
2 Related Work
Semantic Segmentation. Long et al. proposed the fully convolutional network [1], where the fully connected layers are replaced with convolutional layers to convert semantic segmentation into a pixel-level classification task. Chen et al. proposed a family of segmentation networks called DeepLab [6, 18, 3], which adopt atrous (dilated) convolutions to increase receptive fields while preserving the spatial resolution of feature maps. Zhao et al. [5] utilized a group of average pooling operations to encode multi-scale features. Another popular design is the encoder-decoder structure [19, 2], composed of an encoder and a symmetric decoder, where features of different levels are extracted at each stage of the encoder and the image resolution is recovered in the decoder step by step.
Contextual Information. Exploring contextual information helps improve performance in semantic segmentation because objects appear at multiple scales and long-range dependencies exist between different locations. In [6], an atrous spatial pyramid pooling (ASPP) module was proposed to encode multi-scale contextual features with several parallel atrous convolutions of different rates. In [5], by contrast, a pyramid pooling module was adopted at the end of the model to exploit multi-level contextual information via pooling operations.
Spatial Attention. The spatial attention mechanism can be considered a contextual feature aggregation method: it models the long-range dependencies between each pair of elements in the feature maps and then gathers contextual information at each location. Wang et al. [11] proposed the non-local block and integrated it into deep networks as a self-attention module to capture long-range dependencies for video classification. In [7], a spatial and a channel attention module were adopted to capture relations and re-weight each location and channel, respectively. Zhao et al. [12] proposed a two-branch attention block to collect and distribute information separately.
3 Methods
3.1 Non-local Block and Spatial Attention
The structure of the non-local block [11] is depicted in Fig. 1(a). Given an input feature map $X \in \mathbb{R}^{C \times N}$, where $C$ and $N = H \times W$ are the channel number and spatial size of the feature map, respectively, three $1\times1$ convolutions $\theta$, $\phi$ and $g$ are used to transform $X$ into the query, key and value embeddings, i.e. $Q = \theta(X)$, $K = \phi(X)$ and $V = g(X)$. Then a dense affinity matrix, $A \in \mathbb{R}^{N \times N}$, is computed by a matrix multiplication and softmax normalization as
$$A_{i,j} = \frac{\exp\left(Q_i^{\top} K_j\right)}{\sum_{j'=1}^{N}\exp\left(Q_i^{\top} K_{j'}\right)} \quad (1)$$
where $A_{i,j}$ is the normalized similarity between $Q_i$ and $K_j$, and $i$ and $j$ are the spatial location indexes of the query and key, respectively. After generating the affinity matrix, another matrix multiplication is used to aggregate the contextual information with the value contents as follows,
$$Y_i = \sum_{j=1}^{N} A_{i,j} V_j \quad (2)$$
The final output can be computed as
$$Z = X + W_z(Y) \quad (3)$$
where $W_z$ refers to a $1\times1$ convolution. In this way, long-range dependencies are captured and used to provide rich contextual information.
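As a concrete reference, Eqns. 1–3 map directly onto a few matrix operations. The sketch below is a minimal PyTorch implementation of the standard non-local block; module and variable names are illustrative, not the paper's reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal sketch of the non-local block (Eqns. 1-3)."""
    def __init__(self, in_channels, embed_channels):
        super().__init__()
        # theta/phi/g: 1x1 convolutions producing query, key and value embeddings
        self.theta = nn.Conv2d(in_channels, embed_channels, 1)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1)
        self.g = nn.Conv2d(in_channels, embed_channels, 1)
        # W_z: 1x1 convolution mapping the aggregated features back to in_channels
        self.w_z = nn.Conv2d(embed_channels, in_channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x N x C'
        k = self.phi(x).flatten(2)                     # B x C' x N
        v = self.g(x).flatten(2).transpose(1, 2)       # B x N x C'
        affinity = F.softmax(q @ k, dim=-1)            # Eqn. 1: dense B x N x N matrix
        y = affinity @ v                               # Eqn. 2: weighted sum of values
        y = y.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.w_z(y)                         # Eqn. 3: residual fusion
```

Note that the `B x N x N` affinity matrix is what makes this block expensive; the SNL block below replaces it with a `B x N x K` matrix.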
3.2 Sparse Non-local Block
Although the non-local block has proven effective for semantic segmentation [7, 20], it is very time-consuming. The review of the non-local block in Sec. 3.1 indicates that the matrix multiplications in Eqn. 1 and Eqn. 2 dominate the computational cost, with complexity $\mathcal{O}(CN^2)$. If the calculation of the dense affinity matrix could be replaced with a small sparse affinity matrix, the computational complexity would therefore be reduced.
In the proposed sparse non-local (SNL) block, we sample a subset of key elements to multiply with each query element and generate a sparse affinity matrix denoted as $A^s \in \mathbb{R}^{N \times K}$, where $K$ is the number of sampled key elements. The sampled key elements should be representative and the sampling region should cover a large area to encode contextual information, while the number of sampled elements should remain small to reduce the computational complexity.
Motivated by the deformable convolution [21], we restrict the initial sampling locations to the neighbouring locations of the corresponding query pixel, and then shift the locations with offsets $\Delta p$, which are learned by a convolution performed on $X$ whose output channels correspond to the 2D offsets. Hence, the final sampling coordinates of each key element are calculated as
$$\hat{p}_{i,n} = p_{i,n} + \Delta p_{i,n}, \quad n = 1, \dots, K \quad (4)$$
where $p_{i,n}$ and $\Delta p_{i,n}$ denote the initial coordinates and the corresponding offsets in $\Delta p$, respectively. As the offsets are typically fractional, bilinear interpolation is used to sample key values and make the offset computation differentiable. The sampled key element $\hat{K}_{i,n}$ is calculated as
$$\hat{K}_{i,n} = \mathcal{B}\left(K, \hat{p}_{i,n}\right) \quad (5)$$
where $\mathcal{B}$ is the bilinear interpolation function. For $Q_i$ and $\hat{K}_{i,n}$, the pairwise similarity can be computed as
$$A^s_{i,n} = \frac{\exp\left(Q_i^{\top} \hat{K}_{i,n}\right)}{\sum_{n'=1}^{K}\exp\left(Q_i^{\top} \hat{K}_{i,n'}\right)} \quad (6)$$
Afterwards, the value contents $\hat{V}_{i,n}$ are sampled at the same locations $\hat{p}_{i,n}$, and the contextual information is aggregated as
$$Y_i = \sum_{n=1}^{K} A^s_{i,n} \hat{V}_{i,n} \quad (7)$$
By replacing the dense matrix multiplications in the non-local block with the sparse matrix multiplications defined by Eqn. 6 and Eqn. 7, the computational complexity is reduced to $\mathcal{O}(CNK)$, which is significantly lower than $\mathcal{O}(CN^2)$ since $K \ll N$.¹
¹$N$ is 2401 for the input feature map when evaluating on the Cityscapes dataset, and $K$ is 81 in our model.
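The figures quoted in the footnote make the saving easy to check: with N = 2401 query positions, dense attention computes N² pairwise similarities, while the SNL block computes only N·K of them, roughly a 30× reduction:

```python
# Dense vs. sparse attention cost for the figures quoted in the text
# (N = 2401 key/query positions, K = 81 sampled keys per query).
N, K = 2401, 81
dense_pairs = N * N     # pairwise similarities computed by the non-local block
sparse_pairs = N * K    # similarities computed by the SNL block
speedup = dense_pairs / sparse_pairs   # equals N / K, approx. 29.6x
print(dense_pairs, sparse_pairs, speedup)
```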
The structure of the SNL block is depicted in Fig. 1(b). First, four convolutions are performed on the input features to generate the query, key, value and offsets, respectively. Next, $K$ key elements are sampled for each query element based on the offsets, multiplied by the corresponding query element, and normalized to generate the sparse affinity matrix. The generated matrix is then multiplied with the value features to obtain $Y$. Finally, a $1\times1$ convolution is applied for feature fusion.
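Under the structure just described, a minimal PyTorch sketch of the SNL block might look as follows. It assumes a 3×3 initial sampling grid (K = 9) around each query pixel and uses `F.grid_sample` for the bilinear interpolation of Eqn. 5; all module and variable names are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseNonLocalBlock(nn.Module):
    """Illustrative sketch of the SNL block: K sampled keys/values per query."""
    def __init__(self, in_channels, embed_channels, grid_size=3):
        super().__init__()
        self.k = grid_size * grid_size                 # sampled elements per query
        self.theta = nn.Conv2d(in_channels, embed_channels, 1)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1)
        self.g = nn.Conv2d(in_channels, embed_channels, 1)
        self.offset = nn.Conv2d(in_channels, 2 * self.k, 1)  # one 2-D offset per sample
        self.w_z = nn.Conv2d(embed_channels, in_channels, 1)
        # initial sampling locations: a grid_size x grid_size neighbourhood
        r = (grid_size - 1) // 2
        init = torch.stack(torch.meshgrid(
            torch.arange(-r, r + 1.), torch.arange(-r, r + 1.), indexing='ij'), -1)
        self.register_buffer('init_offsets', init.reshape(1, self.k, 1, 1, 2))

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w
        q = self.theta(x).flatten(2).transpose(1, 2)          # B x N x C'
        key, val = self.phi(x), self.g(x)                     # B x C' x H x W
        # Eqn. 4: sampling coords = query position + initial grid + learned offsets
        yy, xx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                                torch.arange(w, dtype=torch.float32), indexing='ij')
        base = torch.stack((yy, xx), -1).to(x).reshape(1, 1, h, w, 2)
        offs = self.offset(x).reshape(b, self.k, 2, h, w).permute(0, 1, 3, 4, 2)
        coords = base + self.init_offsets + offs              # B x K x H x W x 2
        # normalize (y, x) coordinates to [-1, 1] in (x, y) order for grid_sample
        gy = coords[..., 0] / max(h - 1, 1) * 2 - 1
        gx = coords[..., 1] / max(w - 1, 1) * 2 - 1
        grid = torch.stack((gx, gy), -1).reshape(b, self.k, n, 2).transpose(1, 2)
        # Eqn. 5: bilinear sampling of keys and values -> B x C' x N x K
        k_s = F.grid_sample(key, grid, align_corners=True)
        v_s = F.grid_sample(val, grid, align_corners=True)
        # Eqn. 6: sparse affinity, softmax over the K samples of each query
        aff = F.softmax(torch.einsum('bnc,bcnk->bnk', q, k_s), dim=-1)
        # Eqn. 7: weighted sum of the sampled values
        y = torch.einsum('bnk,bcnk->bcn', aff, v_s).reshape(b, -1, h, w)
        return x + self.w_z(y)                                # 1x1 fusion convolution
```

Compared with the non-local sketch above, the affinity tensor is `B x N x K` rather than `B x N x N`, which is where the complexity reduction of Sec. 3.2 comes from.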
3.3 Network Architecture
The entire network architecture is shown in Fig. 2. We use MobileNetV2 [22] and ResNet101 [23] as our backbones. Following previous studies [3], we remove the last downsampling operation in the backbones and utilize dilated convolutions in the last stage to maintain the size of the receptive field. We reduce the number of channels to 512 in the ResNet101-based network and to 256 in the MobileNetV2-based network with a $1\times1$ convolution, and then apply the SNL block on the reduced features. In addition, a global average pooling branch is introduced and concatenated with the reduced features to provide image-level information and improve performance. Finally, we employ a simple decoder: the output features of the encoder are first upsampled and then concatenated with the reduced low-level features from stage 2 of the backbone. After the concatenation, two convolutions are used to fuse the features before the final classifier.
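For illustration, the head described above (channel reduction, SNL block, global pooling branch and the simple decoder) can be sketched as below. The SNL block is replaced by an `nn.Identity()` placeholder, and the 48-channel low-level reduction and 3×3 fusion convolutions are assumed values in the spirit of common encoder-decoder heads, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Sketch of the head; default channel counts follow the ResNet101 variant."""
    def __init__(self, in_channels=2048, low_channels=256, reduced=512, num_classes=19):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, 1)      # channel reduction
        self.context_block = nn.Identity()                    # placeholder for the SNL block
        self.image_pool = nn.Sequential(                      # global average pooling branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(reduced, reduced, 1))
        self.low_reduce = nn.Conv2d(low_channels, 48, 1)      # reduced low-level features
        self.fuse = nn.Sequential(                            # two convs before the classifier
            nn.Conv2d(2 * reduced + 48, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, high, low):
        x = self.reduce(high)
        ctx = self.context_block(x)
        gp = self.image_pool(x)                               # image-level information
        gp = F.interpolate(gp, size=x.shape[-2:], mode='bilinear', align_corners=False)
        x = torch.cat((ctx, gp), 1)                           # concat with pooled branch
        # decoder: upsample, then concat with stage-2 low-level features
        x = F.interpolate(x, size=low.shape[-2:], mode='bilinear', align_corners=False)
        x = torch.cat((x, self.low_reduce(low)), 1)
        return self.classifier(self.fuse(x))
```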
4 Experiments
4.1 Implementations and Datasets
We conducted all the experiments in PyTorch [24] with CUDA 10.0. We employed ImageNet-pretrained [25] MobileNetV2 and ResNet101 as the backbones, and the stochastic gradient descent (SGD) algorithm with momentum 0.9 and weight decay 0.0001 was used to train the networks on all the datasets. The "poly" learning rate policy [4] (the learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$ with $power = 0.9$) was used, and the initial learning rate was set to 0.005. We employed random horizontal flipping, random scaling and random cropping for data augmentation. The mean Intersection-over-Union (IoU) metric was used for evaluation.
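The training schedule above can be reproduced with a `LambdaLR` scheduler; a minimal sketch, where `power = 0.9` follows common practice for the "poly" policy and `max_iter` is an illustrative value not stated in the text:

```python
import torch

# "Poly" learning-rate policy: lr = base_lr * (1 - iter / max_iter) ** power
model = torch.nn.Conv2d(3, 19, 1)                 # stand-in for the segmentation network
opt = torch.optim.SGD(model.parameters(), lr=0.005,
                      momentum=0.9, weight_decay=0.0001)
max_iter, power = 40000, 0.9                      # max_iter is an assumed value
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda it: (1 - it / max_iter) ** power)
# call sched.step() once per training iteration; the lr decays smoothly to 0
```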
Cityscapes. The dataset contains 5000 finely annotated images, divided into 2975, 500 and 1525 images for training, validation and testing, respectively. All images are of size 2048 × 1024, and 19 semantic classes are used for segmentation.
PASCAL Context. The dataset provides dense semantic annotations for the PASCAL VOC 2010 images and contains 4998 training and 5105 validation images. Following the previous work [26], we used the 59 classes annotations to train the networks and to measure the performances.
ADE20K. This challenging segmentation dataset contains 150 semantic classes. The training and validation sets consist of 20K and 2K images, respectively.
4.2 Ablation Studies
To evaluate the performance of the proposed method, we first conducted ablation studies on the Cityscapes validation set with single-scale testing.
Ablation on sampling number. We adopted a simple MobileNetV2-based FCN as the baseline, which achieved 70.39% mean IoU. We then integrated the sparse non-local block at the end of the backbone and evaluated the effect of the SNL block by varying the number of sampled key and value elements per query element. The results in Table 1 show that the sparse non-local block gave better results. Specifically, the improvement over the baseline increased as more key elements were sampled, indicating that encoding more contextual information is beneficial for the segmentation task. When the sampling number exceeded 81, the improvement began to drop, possibly because some key elements contain redundant or unhelpful information, leading to distortion. Hence we selected 81 as the sampling number in the final architecture.
Table 1: Ablation results on the Cityscapes validation set (NoS: number of sampled elements per query; GP: global pooling branch).

Method  NoS  Mean IoU (%)
Baseline  —  70.39
Baseline + SNL  9  73.54
Baseline + SNL  49  74.26
Baseline + SNL  81  74.51
Baseline + SNL  99  73.98
Baseline + SNL  121  73.96
Baseline + SNL + GP  81  74.94
Baseline + SNL + GP + Decoder  81  75.85
Table 2: Comparison with other context aggregation modules on the Cityscapes validation set (MobileNetV2 backbone).

Method  Inf. time (ms)  Mean IoU (%)
Baseline  27  70.39
+ PPM [5]  41  72.88
+ ASPP [18]  55  74.48
+ NL [11]  42  73.75
+ SNL  39  74.51
Ablation on global pooling. As shown in Table 1, adopting the global average pooling path improved the mean IoU from 74.51% to 74.94%, which indicates the effectiveness of this branch.
Ablation on decoder. A simple decoder was employed at the end of the network to replace the naive bilinear upsampling operation. Accuracy was improved by around 0.9% (74.94% → 75.85%), as shown in Table 1. This improvement mainly came from the low-level features, which provided spatial information that helps refine boundaries.
Ablation on SNL block. To compare our approach with other context aggregation approaches, we replaced the SNL block with the PPM [5], the ASPP module [18] and the standard NL block [11], respectively, on the MobileNetV2 backbone, and then evaluated these models and measured the inference time on a Titan V100 GPU. The results are reported in Table 2. Our approach outperformed all the other methods with faster inference. This indicates that spatial attention is an effective way to encode multi-scale contextual features, and that using a subset of key elements instead of all of them can sift out irrelevant contextual information.
4.3 Comparisons with State-of-the-Art Methods
Based on the ablation studies, we built the sparse spatial attention network with the ResNet101 backbone and the SNL block, and evaluated it on the Cityscapes, PASCAL Context and ADE20K datasets using the multi-scale testing strategy.
The comparisons with state-of-the-art methods on the Cityscapes test set are shown in Table 3. SSANet outperformed these previous methods with 81.8% mean IoU. For training on the PASCAL Context and ADE20K datasets, we changed the sampling number in the SNL block to 49 to fit the smaller cropping sizes used for these datasets. As shown in Table 4, SSANet achieved state-of-the-art performance on both datasets, showing the robustness of our approach.
5 Conclusions
In this paper, we present the sparse spatial attention network (SSANet) for semantic segmentation. A sparse non-local (SNL) block is proposed and integrated into the network; it utilizes the spatial attention mechanism to aggregate multi-scale contextual information and adaptively capture long-range dependencies. Different from the standard non-local block, the dense affinity matrix is replaced with a sparse affinity matrix in the proposed SNL block to improve efficiency and accuracy. The ablation experiments show that the SNL block significantly improves segmentation accuracy and achieves the best performance compared with other context aggregation approaches. SSANet achieves state-of-the-art results on the Cityscapes, PASCAL Context and ADE20K datasets, demonstrating the effectiveness of the proposed method.
References
 [1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, 2015, pp. 3431–3440.
 [2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
 [3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, 2018, pp. 801–818.
 [4] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider to see better,” arXiv:1506.04579, 2015.
 [5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. CVPR, 2017, pp. 2881–2890.
 [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” TPAMI, vol. 40, pp. 834–848, 2018.
 [7] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proc. CVPR, 2019, pp. 3146–3154.
 [8] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” arXiv:1412.6856, 2014.
 [9] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in Proc. CVPR, 2017, pp. 4353–4361.
 [10] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “DenseASPP for semantic segmentation in street scenes,” in Proc. CVPR, 2018, pp. 3684–3692.
 [11] X. Wang, R. Girshick, A. Gupta, and K. He, “Nonlocal neural networks,” in Proc. CVPR, 2018, pp. 7794–7803.
 [12] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia, “PSANet: Point-wise spatial attention network for scene parsing,” in Proc. ECCV, 2018, pp. 267–283.
 [13] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai, “Asymmetric non-local neural networks for semantic segmentation,” in Proc. ICCV, 2019, pp. 593–602.
 [14] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-cross attention for semantic segmentation,” in Proc. ICCV, 2019, pp. 603–612.

 [15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. CVPR, 2016, pp. 3213–3223.
 [16] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proc. CVPR, 2014, pp. 891–898.
 [17] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” IJCV, vol. 127, pp. 302–321, 2019.
 [18] L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv:1706.05587, 2017.
 [19] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
 [20] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, “An empirical study of spatial attention mechanisms in deep networks,” in Proc. ICCV, 2019, pp. 6688–6697.
 [21] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proc. ICCV, 2017, pp. 764–773.
 [22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. CVPR, 2018, pp. 4510–4520.
 [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
 [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.

 [25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. CVPR, 2009, pp. 248–255.
 [26] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” arXiv:1909.11065, 2019.
 [27] T. Ke, J. Hwang, Z. Liu, and S. X. Yu, “Adaptive affinity fields for semantic segmentation,” in Proc. ECCV, 2018, pp. 587–602.
 [28] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proc. CVPR, 2018, pp. 7151–7160.